Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-15 Thread Linus Torvalds
On Sat, Feb 15, 2014 at 9:30 AM, Torvald Riegel trie...@redhat.com wrote:
>
> I think the example is easy to misunderstand, because the context isn't
> clear.  Therefore, let me first try to clarify the background.
>
> (1) The abstract machine does not write speculatively.
> (2) Emitting a branch instruction and executing a branch at runtime is
> not part of the specified behavior of the abstract machine.  Of course,
> the abstract machine performs conditional execution, but that just
> specifies the output / side effects that it must produce (e.g., volatile
> stores) -- not with which hardware instructions it is producing this.
> (3) A compiled program must produce the same output as if executed by
> the abstract machine.

Ok, I'm fine with that.

> Thus, we need to be careful what "speculative store" is meant to refer
> to.  A few examples:
>
> if (atomic_load(&x, mo_relaxed) == 1)
>   atomic_store(&y, 3, mo_relaxed);

No, please don't use this idiotic example. It is wrong.

The fact is, if a compiler generates anything but the obvious sequence
(read/cmp/branch/store - where branch/store might obviously be done
with some other machine conditional like a predicate), the compiler is
wrong.

Anybody who argues anything else is wrong, or confused, or confusing.

Instead, argue about *other* sequences where the compiler can do something.

For example, this sequence:

   atomic_store(&x, a, mo_relaxed);
   b = atomic_load(&x, mo_relaxed);

can validly be transformed to

   atomic_store(&x, a, mo_relaxed);
   b = (typeof(x)) a;

and I think everybody agrees about that. In fact, that optimization
can be done even for mo_strict.

But even that "obvious" optimization has subtle cases. What if the
store is relaxed, but the load is strict? You can't do the
optimization without a lot of thought, because dropping the strict load
would drop an ordering point. So even the "store followed by exact
same load" case has subtle issues.

With similar caveats, it is perfectly valid to merge two consecutive
loads, and to merge two consecutive stores.
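
For instance, two back-to-back relaxed loads of the same object,
sketched in standard C11 syntax:

   int a = atomic_load_explicit(&x, memory_order_relaxed);
   int b = atomic_load_explicit(&x, memory_order_relaxed);

may be merged into:

   int a = atomic_load_explicit(&x, memory_order_relaxed);
   int b = a;

provided no ordering point sits between the two accesses.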

Now that means that the sequence

atomic_store(&x, 1, mo_relaxed);
if (atomic_load(&x, mo_relaxed) == 1)
    atomic_store(&y, 3, mo_relaxed);

can first be optimized to

atomic_store(&x, 1, mo_relaxed);
if (1 == 1)
    atomic_store(&y, 3, mo_relaxed);

and then you get the end result that you wanted in the first place
(including the ability to re-order the two stores due to the relaxed
ordering, assuming they can be proven to not alias - and please don't
use the idiotic type-based aliasing rules).

Bringing up your first example is pure and utter confusion. Don't do
it. Instead, show what are obvious and valid transformations, and then
you can bring up these kinds of combinations as "look, this is
obviously also correct".

Now, the reason I want to make it clear that the code example you
point to is a crap example is that because it doesn't talk about the
*real* issue, it also misses a lot of really important details.

For example, when you say "if the compiler can prove that the
conditional is always true" then YOU ARE TOTALLY WRONG.

So why do I say you are wrong, after I just gave you an example of how
it happens? Because my example went back to the *real* issue, and
there are actual real semantically meaningful details with doing
things like load merging.

To give an example, let's rewrite things a bit more to use an extra variable:

atomic_store(&x, 1, mo_relaxed);
a = atomic_load(&x, mo_relaxed);
if (a == 1)
    atomic_store(&y, 3, mo_relaxed);

which looks exactly the same. If you now say "if you can prove the
conditional is always true, you can make the store unconditional", YOU
ARE WRONG.

Why?

This sequence:

atomic_store(&x, 1, mo_relaxed);
a = atomic_load(&x, mo_relaxed);
atomic_store(&y, 3, mo_relaxed);

is actually - and very seriously - buggy.

Why? Because you have effectively split the atomic_load into two loads
- one for the value of 'a', and one for your 'proof' that the store is
unconditional.
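
To make that concrete, here is a minimal threaded sketch in standard
C11 syntax (illustrative names, not kernel code):

   #include <stdatomic.h>
   #include <pthread.h>

   atomic_int x, y;

   void *thread1(void *arg)
   {
       atomic_store_explicit(&x, 1, memory_order_relaxed);
       int a = atomic_load_explicit(&x, memory_order_relaxed);
       if (a == 1)
           atomic_store_explicit(&y, 3, memory_order_relaxed);
       return (void *)(long)a;
   }

   void *thread2(void *arg)
   {
       atomic_store_explicit(&x, 2, memory_order_relaxed);
       return NULL;
   }

   int main(void)
   {
       pthread_t t1, t2;
       pthread_create(&t1, NULL, thread1, NULL);
       pthread_create(&t2, NULL, thread2, NULL);
       pthread_join(t1, NULL);
       pthread_join(t2, NULL);
       return 0;
   }

If the compiler makes thread1's store to y unconditional while keeping
a separate load to produce 'a', thread2's store can land between the
two reads of x, and the program can observe a == 2 together with
y == 3 -- an outcome the abstract machine never produces.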

Maybe you go "Nobody sane would do that", and you'd be right. But
compilers aren't "sane" (and compiler _writers_ I have some serious
doubts about), and how you describe the problem really does affect the
solution.

My description ("you can combine two subsequent atomic accesses, if
you are careful about still having the same ordering points") doesn't
have the confusion. It makes it clear that no, you can't speculate
writes, but yes, obviously you can combine certain things (think
"write buffers" and "memory caches"), and that may make you able to
remove the conditional. Which is very different from speculating
writes.

But my description also makes it clear that the transformation above
was buggy, but that rewriting it as

a = 1;
atomic_store(&y, 3, mo_relaxed);
atomic_store(&x, a, mo_relaxed);

is fine (doing the re-ordering here just to make a point).

So I suspect we actually agree, I just think that the "example" that
has been floating around, and the 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-15 Thread Linus Torvalds
On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel trie...@redhat.com wrote:
>
> I think a major benefit of C11's memory model is that it gives a
> *precise* specification for how a compiler is allowed to optimize.

Clearly it does *not*. This whole discussion is proof of that. It's
not at all clear, and the standard apparently is at least debatably
allowing things that shouldn't be allowed. It's also a whole lot more
complicated than "volatile", so the likelihood of a compiler writer
actually getting it right - even if the standard does - is lower.
They've gotten "volatile" wrong too, after all (particularly in C++).

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-15 Thread Torvald Riegel
On Fri, 2014-02-14 at 18:44 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney
> paul...@linux.vnet.ibm.com wrote:
> >
> > One way of looking at the discussion between Torvald and myself would be
> > as a seller (Torvald) and a buyer (me) haggling over the fine print in
> > a proposed contract (the standard).  Whether that makes you feel better
> > or worse about the situation I cannot say.  ;-)
> 
> Oh, I'm perfectly fine with that. But we're not desperate to buy, and
> I actually think the C11 people are - or at least should be - *way*
> more desperate to sell their model to us than we are to take it.
> 
> Which means that as a buyer you should say "this is what we want, if
> you don't give us this, we'll just walk away". Not try to see how much
> we can pay for it. Because there is very little upside for us, and
> _unless_ the C11 standard gets things right it's just extra complexity
> for us, coupled with new compiler fragility and years of code
> generation bugs.

I think there is an upside to you, mainly in that it allows compiler
testing tools, potentially verification tools for atomics, and tools
like cppmem that show allowed executions of code.

I agree that the Linux community has been working well without this, and
it's big enough to make running its own show viable.  This will be
different for smaller projects, though.

> Why would we want that extra complexity and inevitable compiler bugs?
> If we then have to fight compiler writers that point to the standard
> and say "..but look, the standard says we can do this", then at that
> point it went from "extra complexity and compiler bugs" to a whole
> 'nother level of frustration and pain.

I see your point, but the flip side of the coin is that if you get the
standard to say what you want, then you can tell the compiler writers to
look at the standard.  Or show them bugs revealed by fuzz testing and
such.

> So just walk away unless the C11 standard gives us exactly what we
> want. Not "something kind of like what we'd use". EXACTLY. Because I'm
> not in the least interested in fighting compiler people that have a
> crappy standard they can point to. Been there, done that, got the
> T-shirt and learnt my lesson.
> 
> And the thing is, I suspect that the Linux kernel is the most complete
> - and most serious - user of true atomics that the C11 people can sell
> their solution to.

I agree, but there are likely also other big projects that could make
use of C11 atomics on the userspace side (e.g., certain databases, ...).

> If we don't buy it, they have no serious user.

I disagree with that.  That obviously depends on one's definition of
"serious", but if you combine all C/C++ programs that use low-level
atomics, then this is serious use as well.  There's lots of
shared-memory synchronization in userspace as well.

> Sure, they'll have lots
> of random other one-off users for their atomics, where each user wants
> one particular thing, but I suspect that we'll have the only really
> unified portable code base

glibc is a counterexample that comes to mind, although it's a smaller
code base.  (It's currently not using C11 atomics, but transitioning
there makes sense, and is something I want to get to eventually.)

> that handles pretty much *all* the serious
> odd cases that the C11 atomics can actually talk about to each other.

You certainly have lots of odd cases, but I would disagree with the
assumption that only the Linux kernel will do full "testing" of the
implementations.  If you have plenty of userspace programs using the
atomics, that's a pretty big test suite, and one that should help bring
the compilers up to speed.  So that might be a benefit even to the Linux
kernel if it would use the C11 atomics.

> Oh, they'll push things through with or without us, and it will be a
> collection of random stuff, where they tried to please everybody, with
> particularly compiler/architecture people who have no f*cking clue
> about how their stuff is used pushing to make it easy/efficient for
> their particular compiler/architecture.

I'll ignore this... :)

> But we have real optimized uses of pretty much all relevant cases that
> people actually care about.

You certainly cover a lot of cases.  Finding out whether you cover all
that "people care about" would require you to actually ask all people,
which I'm sure you've done ;)

> We can walk away from them, and not really lose anything but a small
> convenience (and it's a convenience *only* if the standard gets things
> right).
> 
> And conversely, the C11 people can walk away from us too. But if they
> can't make us happy (and by "make us happy", I really mean no stupid
> games on our part) I personally think they'll have a stronger
> standard, and a real use case, and real arguments. I'm assuming they
> want that.

I agree.

> That's why I complain when you talk about things like marking control
> dependencies explicitly. That's *us* bending over backwards. And as a
> buyer, we have absolutely zero reason to do that.

As I understood the 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-15 Thread Torvald Riegel
On Fri, 2014-02-14 at 12:02 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
> torva...@linux-foundation.org wrote:
> >
> > Why are we still discussing this idiocy? It's irrelevant. If the
> > standard really allows random store speculation, the standard doesn't
> > matter, and sane people shouldn't waste their time arguing about it.
> 
> Btw, the other part of this coin is that our manual types (using
> volatile and various architecture-specific stuff) and our manual
> barriers and inline asm accesses are generally *fine*.

AFAICT, it does work for you, but hasn't been exactly pain-free.

I think a major benefit of C11's memory model is that it gives a
*precise* specification for how a compiler is allowed to optimize.
There is a formalization of the model, which allows things like the
cppmem tool by the Cambridge group.  It also allows meaningful fuzz
testing: http://www.di.ens.fr/~zappa/projects/cmmtest/ ; this did reveal
several GCC compiler bugs.

I also think that reasoning about this model is easier than reasoning
about how lots of different, concrete compiler optimizations would
interact.

> The C11 stuff doesn't buy us anything. The argument that "new
> architectures might want to use it" is prue and utter bollocks, since
> unless the standard gets the thing *right*, nobody sane would ever use
> it for some new architecture, when the sane thing to do is to just
> fill in the normal barriers and inline asms.
> 
> So I'm very very serious: either the compiler and the standard gets
> things right, or we don't use it. There is no middle ground where "we
> might use it for one or two architectures and add random hints".
> That's just stupid.
> 
> The only "middle ground" is about which compiler version we end up
> trusting _if_ it turns out that the compiler and standard do get
> things right. From Torvald's explanations (once I don't mis-read them
> ;), my take-away so far has actually been that the standard *does* get
> things right, but I do know from over-long personal experience that
> compiler people sometimes want to be legalistic and twist the
> documentation to the breaking point, at which point we just go "we'd
> be crazy to use that".

I agree that compilers want to optimize, and sometimes there's probably
a little too much emphasis on applying an optimization vs. not
surprising users.  But we have to draw a line (e.g., what is undefined
behavior and what is not), because we need such a line to be able to
optimize at all.

Therefore, we need to get the rules into a shape that both allows
optimizations and isn't full of surprising corner cases.  The rules are
the standard, so it's the standard we have to get right.  According to
my experience, a lot of thought goes into how to design the standard's
language and library so that they are intuitive yet efficient.

If you see issues in the standard, please bring them up.  Either report
the defects directly and get involved yourself, or reach out to somebody
that is participating in the standards process.

The standard certainly isn't perfect, so there is room to contribute.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-15 Thread Torvald Riegel
On Fri, 2014-02-14 at 11:50 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney
> paul...@linux.vnet.ibm.com wrote:
> >
> > Linus, Peter, any objections to marking places where we are relying on
> > ordering from control dependencies against later stores?  This approach
> > seems to me to have significant documentation benefits.
> 
> Quite frankly, I think it's stupid, and the "documentation" is not a
> benefit, it's just wrong.

I think the example is easy to misunderstand, because the context isn't
clear.  Therefore, let me first try to clarify the background.

(1) The abstract machine does not write speculatively.
(2) Emitting a branch instruction and executing a branch at runtime is
not part of the specified behavior of the abstract machine.  Of course,
the abstract machine performs conditional execution, but that just
specifies the output / side effects that it must produce (e.g., volatile
stores) -- not with which hardware instructions it is producing this.
(3) A compiled program must produce the same output as if executed by
the abstract machine.

Thus, we need to be careful what "speculative store" is meant to refer
to.  A few examples:

if (atomic_load(&x, mo_relaxed) == 1)
  atomic_store(&y, 3, mo_relaxed);

Here, the requirement is that in terms of program logic, y is assigned 3
if x equals 1.  It's not specified how an implementation does that.
* If the compiler can prove that x is always 1, then it can remove the
branch.  This is because of (2).  Because of the proof, (1) is not
violated.
* If the compiler can prove that the store to y is never observed or
does not change the program's output, the store can be removed.

if (atomic_load(&x, mo_relaxed) == 1)
  { atomic_store(&y, 3, mo_relaxed); other_a(); }
else
  { atomic_store(&y, 3, mo_relaxed); other_b(); }

Here, y will be assigned 3 regardless of the value of x.
* The compiler can hoist the store out of the two branches.  This is
because the store and the branch instruction aren't observable outcomes
of the abstract machine.
* The compiler can even move the store to y before the load from x
(unless this affects the logical program order of this thread in some way).
This is because the load/store are ordered by sequenced-before
(intra-thread), but mo_relaxed allows the hardware to reorder, so the
compiler can do it as well (IOW, other threads can't expect a particular
order).
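
A sketch of the transformed code, in standard C11 syntax (other_a and
other_b as in the example above):

   #include <stdatomic.h>

   extern atomic_int x, y;
   void other_a(void);
   void other_b(void);

   void transformed(void)
   {
       /* The store to y happens on both paths, so it is done once, up
        * front; mo_relaxed also permits moving it ahead of the load. */
       atomic_store_explicit(&y, 3, memory_order_relaxed);
       if (atomic_load_explicit(&x, memory_order_relaxed) == 1)
           other_a();
       else
           other_b();
   }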

if (atomic_load(&x, mo_acquire) == 1)
  atomic_store(&y, 3, mo_relaxed);

This is similar to the first case, but with stronger memory order.
* If the compiler proves that x is always 1, then it does so by showing
that the load will always be able to read from a particular store (or
several of them) that (all) assigned 1 to x -- as specified by the
abstract machine and taking the forward progress guarantees into
account.  In general, it still has to establish the synchronized-with
edge if any of those stores used mo_release (or other fences resulting
in the same situation), so it can't just get rid of the acquire "fence"
in this case.  (There are probably situations in which this can be done,
but I can't characterize them easily at the moment.)
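
A sketch of why the acquire cannot simply be dropped even when the
loaded value is provable (standard C11 syntax, illustrative names):

   #include <stdatomic.h>

   atomic_int x;
   int data;

   void writer(void)
   {
       data = 42;
       atomic_store_explicit(&x, 1, memory_order_release);
   }

   void reader(int *out)
   {
       /* Even if the compiler can prove this load always reads 1, the
        * acquire still establishes the synchronizes-with edge to
        * writer()'s release store; removing it would make the plain
        * read of data below a data race. */
       if (atomic_load_explicit(&x, memory_order_acquire) == 1)
           *out = data;
   }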


These examples all rely on the abstract machine as specified in the
current standard.  In contrast, the example that Paul (and Peter, I
assume) were looking at is not currently modeled by the standard.
AFAIU, they want to exploit that control dependencies, when encountered
in binary code, can result in the hardware giving certain ordering
guarantees.

This is vaguely similar to mo_consume which is about data dependencies.
mo_consume is, partially due to how it's specified, pretty hard to
implement for compilers in a way that actually exploits and preserves
data dependencies and not just substitutes mo_consume for a stronger
memory order.

Part of this problem is that the standard takes an opt-out approach
regarding the code that should track dependencies (e.g., certain
operators are specified as not preserving them), instead of cleanly
carving out meaningful operators where one can track dependencies
without obstructing generally useful compiler optimizations (i.e.,
"opt-in").  This leads to cases such as that in "*(p + f - f)", the
compiler either has to keep f - f or emit a stronger fence if f is
originating from a mo_consume load.  Furthermore, dependencies are
supposed to be tracked across any load and store, so the compiler needs
to do points-to if it wants to optimize this as much as possible.
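
A sketch of that corner case in standard C11 syntax (illustrative
names):

   #include <stdatomic.h>

   extern _Atomic int flag;
   extern int *p;

   int consume_example(void)
   {
       /* f originates from a consume load, so it carries a dependency. */
       int f = atomic_load_explicit(&flag, memory_order_consume);
       /* f - f is always zero, yet it still carries f's dependency,
        * so *(p + f - f) must remain dependency-ordered after the
        * load above: the compiler has to keep the dead arithmetic or
        * fall back to a stronger fence when optimizing it away. */
       return *(p + f - f);
   }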

Paul and I have been thinking about alternatives, and one of them was
doing the opt-in by demarcating code that needs explicit dependency
tracking because it wants to exploit mo_consume.

Back to HW control dependencies, this led to the idea of marking the
"control dependencies" in the source code (i.e., on the abstract machine
level), that need to be preserved in the generated binary code, even if
they have no semantic meaning on the abstract machine level.  So, this
is something extra that isn't modeled in the standard currently, 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Paul E. McKenney
On Fri, Feb 14, 2014 at 10:35:44PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote:
> > On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds
> > torva...@linux-foundation.org wrote:
> > >
> > > And conversely, the C11 people can walk away from us too. But if they
> > > can't make us happy (and by "make us happy", I really mean no stupid
> > > games on our part) I personally think they'll have a stronger
> > > standard, and a real use case, and real arguments. I'm assuming they
> > > want that.
> > 
> > I should have somebody who proof-reads my emails before I send them out.
> > 
> > I obviously meant "if they *can* make us happy" (not "can't").
> 
> Understood.  My next step is to take a more detailed look at the piece
> of the standard that should support RCU.  Depending on how that turns
> out, I might look at other parts of the standard vs. Linux's atomics
> and memory-ordering needs.  Should be interesting.  ;-)

And perhaps a better way to represent the roles is that I am not the
buyer, but rather the purchasing agent for the -potential- buyer.  -You-
are of course the potential buyer.

If I were to see myself as the buyer, then I must confess that the
concerns you implicitly expressed in your prior email would be all too
well-founded!

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Paul E. McKenney
On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds
> torva...@linux-foundation.org wrote:
> >
> > And conversely, the C11 people can walk away from us too. But if they
> > can't make us happy (and by "make us happy", I really mean no stupid
> > games on our part) I personally think they'll have a stronger
> > standard, and a real use case, and real arguments. I'm assuming they
> > want that.
> 
> I should have somebody who proof-reads my emails before I send them out.
> 
> I obviously meant "if they *can* make us happy" (not "can't").

Understood.  My next step is to take a more detailed look at the piece
of the standard that should support RCU.  Depending on how that turns
out, I might look at other parts of the standard vs. Linux's atomics
and memory-ordering needs.  Should be interesting.  ;-)

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Linus Torvalds
On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
>
> And conversely, the C11 people can walk away from us too. But if they
> can't make us happy (and by "make us happy", I really mean no stupid
> games on our part) I personally think they'll have a stronger
> standard, and a real use case, and real arguments. I'm assuming they
> want that.

I should have somebody who proof-reads my emails before I send them out.

I obviously meant "if they *can* make us happy" (not "can't").

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Linus Torvalds
On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:
>
> One way of looking at the discussion between Torvald and myself would be
> as a seller (Torvald) and a buyer (me) haggling over the fine print in
> a proposed contract (the standard).  Whether that makes you feel better
> or worse about the situation I cannot say.  ;-)

Oh, I'm perfectly fine with that. But we're not desperate to buy, and
I actually think the C11 people are - or at least should be - *way*
more desperate to sell their model to us than we are to take it.

Which means that as a buyer you should say "this is what we want, if
you don't give us this, we'll just walk away". Not try to see how much
we can pay for it. Because there is very little upside for us, and
_unless_ the C11 standard gets things right it's just extra complexity
for us, coupled with new compiler fragility and years of code
generation bugs.

Why would we want that extra complexity and inevitable compiler bugs?
If we then have to fight compiler writers that point to the standard
and say "..but look, the standard says we can do this", then at that
point it went from "extra complexity and compiler bugs" to a whole
'nother level of frustration and pain.

So just walk away unless the C11 standard gives us exactly what we
want. Not "something kind of like what we'd use". EXACTLY. Because I'm
not in the least interested in fighting compiler people that have a
crappy standard they can point to. Been there, done that, got the
T-shirt and learnt my lesson.

And the thing is, I suspect that the Linux kernel is the most complete
- and most serious - user of true atomics that the C11 people can sell
their solution to.

If we don't buy it, they have no serious user. Sure, they'll have lots
of random other one-off users for their atomics, where each user wants
one particular thing, but I suspect that we'll have the only really
unified portable code base that handles pretty much *all* the serious
odd cases that the C11 atomics can actually talk about to each other.

Oh, they'll push things through with or without us, and it will be a
collection of random stuff, where they tried to please everybody, with
particularly compiler/architecture people who have no f*cking clue
about how their stuff is used pushing to make it easy/efficient for
their particular compiler/architecture.

But we have real optimized uses of pretty much all relevant cases that
people actually care about.

We can walk away from them, and not really lose anything but a small
convenience (and it's a convenience *only* if the standard gets things
right).

And conversely, the C11 people can walk away from us too. But if they
can't make us happy (and by "make us happy", I really mean no stupid
games on our part) I personally think they'll have a stronger
standard, and a real use case, and real arguments. I'm assuming they
want that.

That's why I complain when you talk about things like marking control
dependencies explicitly. That's *us* bending over backwards. And as a
buyer, we have absolutely zero reason to do that.

Tell the C11 people: "no speculative writes". Full stop. End of story.
Because we're not buying anything else.

Similarly, if we need to mark atomics "volatile", then now the C11
atomics are no longer even a "small convenience", now they are just
extra complexity wrt what we already have. So just make it clear that
if the C11 standard needs to mark atomics volatile in order to get
non-speculative and non-reloading behavior, then the C11 atomics are
useless to us, and we're not buying.

Remember: a compiler can *always* do "as if" optimizations - if a
compiler writer can prove that the end result acts 100% the same using
an optimized sequence, then they can do whatever the hell they want.
That's not the issue. But if we can *ever* see semantic impact of
speculative writes, the compiler is buggy, and the compiler writers
need to be aware that it is buggy. No ifs, buts, maybes about it.

So I'm perfectly fine with you seeing yourself as a buyer. But I want
you to be a really *picky* and anal buyer - one that knows he has the
upper hand, and can walk away with no downside.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Paul E. McKenney
On Fri, Feb 14, 2014 at 12:02:23PM -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
> torva...@linux-foundation.org wrote:
> >
> > Why are we still discussing this idiocy? It's irrelevant. If the
> > standard really allows random store speculation, the standard doesn't
> > matter, and sane people shouldn't waste their time arguing about it.
> 
> Btw, the other part of this coin is that our manual types (using
> volatile and various architecture-specific stuff) and our manual
> barriers and inline asm accesses are generally *fine*.
> 
> The C11 stuff doesn't buy us anything. The argument that "new
> architectures might want to use it" is prue and utter bollocks, since
> unless the standard gets the thing *right*, nobody sane would ever use
> it for some new architecture, when the sane thing to do is to just
> fill in the normal barriers and inline asms.
> 
> So I'm very very serious: either the compiler and the standard gets
> things right, or we don't use it. There is no middle ground where "we
> might use it for one or two architectures and add random hints".
> That's just stupid.
> 
> The only "middle ground" is about which compiler version we end up
> trusting _if_ it turns out that the compiler and standard do get
> things right. From Torvald's explanations (once I don't mis-read them
> ;), my take-away so far has actually been that the standard *does* get
> things right, but I do know from over-long personal experience that
> compiler people sometimes want to be legalistic and twist the
> documentation to the breaking point, at which point we just go "we'd
> be crazy to use that".
> 
> See our use of "-fno-strict-aliasing", for example. The C standard
> aliasing rules are a mistake, stupid, and wrong, and gcc uses those
> stupid type-based alias rules even when statically *proving* the
> aliasing gives the opposite result. End result: we turn the shit off.
> 
> Exact same deal wrt atomics. We are *not* going to add crazy "this
> here is a control dependency" crap. There's a test, the compiler
> *sees* the control dependency for chrissake, and it still generates
> crap, we turn that broken "optimization" off. It really is that
> simple.

From what I can see at the moment, the standard -generally- avoids
speculative stores, but there are a few corner cases where it might
allow them.  I will be working with the committee to see exactly what the
situation is.  Might be that I am confused and that everything really
is OK, might be that I am right but the corner cases are things that
no sane kernel developer would do anyway, it might be that the standard
needs a bit of repair, or it might be that the corner cases are somehow
inherent and problematic (but I hope not!).  I will let you know what
I find, but it will probably be a few months.

In the meantime, agreed, we keep doing what we have been doing.  And
maybe in the long term as well, for that matter.

One way of looking at the discussion between Torvald and myself would be
as a seller (Torvald) and a buyer (me) haggling over the fine print in
a proposed contract (the standard).  Whether that makes you feel better
or worse about the situation I cannot say.  ;-)

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Linus Torvalds
On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
>
> Why are we still discussing this idiocy? It's irrelevant. If the
> standard really allows random store speculation, the standard doesn't
> matter, and sane people shouldn't waste their time arguing about it.

Btw, the other part of this coin is that our manual types (using
volatile and various architecture-specific stuff) and our manual
barriers and inline asm accesses are generally *fine*.
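
For reference, the kernel-side idiom being defended is roughly the
following (a sketch built on the ACCESS_ONCE() macro of that era, not
a complete picture of the kernel's model):

   /* ACCESS_ONCE() forces a volatile access: the compiler must emit
    * it exactly as written and may not re-read, cache, or invent it. */
   #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

   int x, y;

   void kernel_style(void)
   {
       /* Control dependency: load, branch, store -- the ordering
        * against the later store comes from the hardware honoring
        * the branch, plus explicit barriers/inline asm elsewhere. */
       if (ACCESS_ONCE(x) == 1)
           ACCESS_ONCE(y) = 3;
   }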

The C11 stuff doesn't buy us anything. The argument that "new
architectures might want to use it" is prue and utter bollocks, since
unless the standard gets the thing *right*, nobody sane would ever use
it for some new architecture, when the sane thing to do is to just
fill in the normal barriers and inline asms.

So I'm very very serious: either the compiler and the standard gets
things right, or we don't use it. There is no middle ground where "we
might use it for one or two architectures and add random hints".
That's just stupid.

The only "middle ground" is about which compiler version we end up
trusting _if_ it turns out that the compiler and standard do get
things right. From Torvald's explanations (once I don't mis-read them
;), my take-away so far has actually been that the standard *does* get
things right, but I do know from over-long personal experience that
compiler people sometimes want to be legalistic and twist the
documentation to the breaking point, at which point we just go "we'd
be crazy to use that".

See our use of "-fno-strict-aliasing", for example. The C standard
aliasing rules are a mistake, stupid, and wrong, and gcc uses those
stupid type-based alias rules even when statically *proving* the
aliasing gives the opposite result. End result: we turn the shit off.

Exact same deal wrt atomics. We are *not* going to add crazy "this
here is a control dependency" crap. There's a test, the compiler
*sees* the control dependency for chrissake, and it still generates
crap, we turn that broken "optimization" off. It really is that
simple.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Linus Torvalds
On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:
>
> Linus, Peter, any objections to marking places where we are relying on
> ordering from control dependencies against later stores?  This approach
> seems to me to have significant documentation benefits.

Quite frankly, I think it's stupid, and the "documentation" is not a
benefit, it's just wrong.

How would you figure out whether your added "documentation" holds true
for particular branches but not others?

How could you *ever* trust a compiler that makes the dependency meaningful?

Again, let's keep this simple and sane:

 - if a compiler ever generates code where an atomic store movement is
"visible" in any way, then that compiler is broken shit.

I don't understand why you even argue this. Seriously, Paul, you seem
to *want* to think that "broken shit" is acceptable, and that we
should then add magic markers to say "now you need to *not* be broken
shit".

Here's a magic marker for you: DON'T USE THAT BROKEN COMPILER.

And if a compiler can *prove* that whatever code movement it does
cannot make a difference, then let it do so. No amount of
"documentation" should matter.

Seriously, this whole discussion has been completely moronic. I don't
understand why you even bring shit like this up:

> > r1 = atomic_load(x, memory_order_control);
> > if (control_dependency(r1))
> > atomic_store(y, memory_order_relaxed);

I mean, really? Anybody who writes code like that, or any compiler
where that "control_dependency()" marker makes any difference
what-so-ever for code generation should just be retroactively aborted.

There is absolutely *zero* reason for that "control_dependency()"
crap. If you ever find a reason for it, it is either because the
compiler is buggy, or because the standard is so shit that we should
never *ever* use the atomics.

Seriously. This thread has devolved into some kind of "just what kind
of idiotic compiler cesspool crap could we accept". Get away from that
f*cking mindset. We don't accept *any* crap.

Why are we still discussing this idiocy? It's irrelevant. If the
standard really allows random store speculation, the standard doesn't
matter, and sane people shouldn't waste their time arguing about it.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Torvald Riegel
On Fri, 2014-02-14 at 09:29 -0800, Paul E. McKenney wrote:
> On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote:
> > On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > Another option would be to flag the conditional expression, prohibiting
> > > the compiler from optimizing out any conditional branches.  Perhaps
> > > something like this:
> > > 
> > >   r1 = atomic_load(x, memory_order_control);
> > >   if (control_dependency(r1))
> > >   atomic_store(y, memory_order_relaxed);
> > 
> > That's the one I had in mind and talked to you about earlier today.  My
> > gut feeling is that this is preferable over the other because it "marks"
> > the if-statement, so the compiler knows exactly which branches matter.
> > I'm not sure one would need the other memory order for that, if indeed
> > all you want is relaxed -> branch -> relaxed.  But maybe there are
> > corner cases (see the weaker-than-relaxed discussion in SG1 today).
> 
> Linus, Peter, any objections to marking places where we are relying on
> ordering from control dependencies against later stores?  This approach
> seems to me to have significant documentation benefits.

Let me note that at least as I'm concerned, that's just a quick idea.
At least I haven't looked at (1) how to properly specify the semantics
of this, (2) whether it has any bad effects on unrelated code, (3) and
whether there are pitfalls for compiler implementations.  It doesn't look
too bad at first glance, though.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Torvald Riegel
On Fri, 2014-02-14 at 10:50 +0100, Peter Zijlstra wrote:
> On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote:
> > That depends on what your goal is.

First, I don't know why you quoted that, but without the context,
quoting it doesn't make sense.  Let me repeat the point.  The standard
is the rule set for the compiler.  Period.  The compiler does not just
serve the semantics that you might have in your head.  It does have to
do something meaningful for all of its users.  Thus, the goal for the
compiler is to properly compile programs in the language as specified.

If there is a deficiency in the standard (a bug or missing feature) --
and thus in the specification -- we need a patch for the standard that
fixes this deficiency.  If you think that this is the case, that's where
you fix it.

If your goal is to do wishful thinking, imagine some kind of semantics
in your head, and then assume that magically, implementations will do
just that, then that's bound to fail.

> A compiler that we don't need to fight in order to generate sane code
> would be nice. But as Linus said; we can continue to ignore you lot and
> go on as we've done.

I don't see why it's so hard to understand that you need to specify
semantics, and the place (or at least the base) for that is the
standard.  Aren't you guys the ones replying "send a patch"? :)

This isn't any different.  If you're uncomfortable working with the
standard, then say so, and reach out to people that aren't.

You can surely ignore the specification of the language(s) that you are
depending on.  But that won't help you.  If you want a change, get
involved.  (Oh, and claiming that the other side doesn't get it doesn't
count as getting involved.)

There's no fight between people here.  It's just a technical problem
that we have to solve in the right way.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Paul E. McKenney
On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote:
> On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote:

[ . . . ]

> > Another option would be to flag the conditional expression, prohibiting
> > the compiler from optimizing out any conditional branches.  Perhaps
> > something like this:
> > 
> > r1 = atomic_load(x, memory_order_control);
> > if (control_dependency(r1))
> > atomic_store(y, memory_order_relaxed);
> 
> That's the one I had in mind and talked to you about earlier today.  My
> gut feeling is that this is preferable over the other because it "marks"
> the if-statement, so the compiler knows exactly which branches matter.
> I'm not sure one would need the other memory order for that, if indeed
> all you want is relaxed -> branch -> relaxed.  But maybe there are
> corner cases (see the weaker-than-relaxed discussion in SG1 today).

Linus, Peter, any objections to marking places where we are relying on
ordering from control dependencies against later stores?  This approach
seems to me to have significant documentation benefits.
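
For reference, a sketch of the unmarked kernel idiom this would annotate
(using the current ACCESS_ONCE() definition, which relies on GCC typeof;
x and y are hypothetical):

    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

    int x, y;

    void t(void)
    {
            /* On most weakly ordered CPUs the store cannot become
             * visible before the branch guarding it resolves, so the
             * load-branch-store sequence itself provides the ordering.
             * The proposed marker would merely document that we depend
             * on the branch surviving compilation. */
            if (ACCESS_ONCE(x))
                    ACCESS_ONCE(y) = 1;
    }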

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Peter Zijlstra
On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote:
> That depends on what your goal is.

A compiler that we don't need to fight in order to generate sane code
would be nice. But as Linus said; we can continue to ignore you lot and
go on as we've done.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Linus Torvalds
On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

> Why are we still discussing this idiocy? It's irrelevant. If the
> standard really allows random store speculation, the standard doesn't
> matter, and sane people shouldn't waste their time arguing about it.

Btw, the other part of this coin is that our manual types (using
volatile and various architecture-specific stuff) and our manual
barriers and inline asm accesses are generally *fine*.

The C11 stuff doesn't buy us anything. The argument that "new
architectures might want to use it" is pure and utter bollocks, since
unless the standard gets the thing *right*, nobody sane would ever use
it for some new architecture, when the sane thing to do is to just
fill in the normal barriers and inline asms.

So I'm very very serious: either the compiler and the standard gets
things right, or we don't use it. There is no middle ground where we
might use it for one or two architectures and add random hints.
That's just stupid.

The only middle ground is about which compiler version we end up
trusting _if_ it turns out that the compiler and standard do get
things right. From Torvald's explanations (once I don't mis-read them
;), my take-away so far has actually been that the standard *does* get
things right, but I do know from over-long personal experience that
compiler people sometimes want to be legalistic and twist the
documentation to the breaking point, at which point we just go "we'd
be crazy to use that".

See our use of -fno-strict-aliasing, for example. The C standard
aliasing rules are a mistake, stupid, and wrong, and gcc uses those
stupid type-based alias rules even when statically *proving* the
aliasing gives the opposite result. End result: we turn the shit off.
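
For anyone who has not seen it, the canonical illustration of what those
type-based rules license (a sketch, not kernel code):

    int f(int *i, float *g)
    {
            *i = 1;
            *g = 0.0f;
            /* Under strict aliasing, the compiler may assume an int *
             * and a float * never refer to the same object, so it can
             * return the constant 1 here -- even in a call where the
             * two pointers actually alias.  Hence -fno-strict-aliasing
             * in the kernel build. */
            return *i;
    }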

Exact same deal wrt atomics. We are *not* going to add crazy "this
here is a control dependency" crap. If there's a test, the compiler
*sees* the control dependency for chrissake, and it still generates
crap, we turn that broken "optimization" off. It really is that
simple.
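
To spell out the kind of breakage meant here, a minimal C11 sketch
(hypothetical names):

    #include <stdatomic.h>

    atomic_int x, y;

    void t(void)
    {
            if (atomic_load_explicit(&x, memory_order_relaxed))
                    atomic_store_explicit(&y, 3, memory_order_relaxed);
    }

    /* The broken transformation would hoist the store above the branch
     * (store y = 3 unconditionally and patch things up afterwards).
     * Another thread could then observe y == 3 in an execution where
     * the abstract machine never stores to y at all -- a visible
     * speculative store, i.e. a compiler bug. */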

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Paul E. McKenney
On Fri, Feb 14, 2014 at 12:02:23PM -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
> torva...@linux-foundation.org wrote:
> >
> > Why are we still discussing this idiocy? It's irrelevant. If the
> > standard really allows random store speculation, the standard doesn't
> > matter, and sane people shouldn't waste their time arguing about it.
> 
> Btw, the other part of this coin is that our manual types (using
> volatile and various architecture-specific stuff) and our manual
> barriers and inline asm accesses are generally *fine*.
> 
> The C11 stuff doesn't buy us anything. The argument that "new
> architectures might want to use it" is pure and utter bollocks, since
> unless the standard gets the thing *right*, nobody sane would ever use
> it for some new architecture, when the sane thing to do is to just
> fill in the normal barriers and inline asms.
> 
> So I'm very very serious: either the compiler and the standard gets
> things right, or we don't use it. There is no middle ground where we
> might use it for one or two architectures and add random hints.
> That's just stupid.
> 
> The only middle ground is about which compiler version we end up
> trusting _if_ it turns out that the compiler and standard do get
> things right. From Torvald's explanations (once I don't mis-read them
> ;), my take-away so far has actually been that the standard *does* get
> things right, but I do know from over-long personal experience that
> compiler people sometimes want to be legalistic and twist the
> documentation to the breaking point, at which point we just go "we'd
> be crazy to use that".
> 
> See our use of -fno-strict-aliasing, for example. The C standard
> aliasing rules are a mistake, stupid, and wrong, and gcc uses those
> stupid type-based alias rules even when statically *proving* the
> aliasing gives the opposite result. End result: we turn the shit off.
> 
> Exact same deal wrt atomics. We are *not* going to add crazy "this
> here is a control dependency" crap. If there's a test, the compiler
> *sees* the control dependency for chrissake, and it still generates
> crap, we turn that broken "optimization" off. It really is that
> simple.

From what I can see at the moment, the standard -generally- avoids
speculative stores, but there are a few corner cases where it might
allow them.  I will be working with the committee to see exactly what the
situation is.  Might be that I am confused and that everything really
is OK, might be that I am right but the corner cases are things that
no sane kernel developer would do anyway, it might be that the standard
needs a bit of repair, or it might be that the corner cases are somehow
inherent and problematic (but I hope not!).  I will let you know what
I find, but it will probably be a few months.

In the meantime, agreed, we keep doing what we have been doing.  And
maybe in the long term as well, for that matter.

One way of looking at the discussion between Torvald and myself would be
as a seller (Torvald) and a buyer (me) haggling over the fine print in
a proposed contract (the standard).  Whether that makes you feel better
or worse about the situation I cannot say.  ;-)

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Linus Torvalds
On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:

> One way of looking at the discussion between Torvald and myself would be
> as a seller (Torvald) and a buyer (me) haggling over the fine print in
> a proposed contract (the standard).  Whether that makes you feel better
> or worse about the situation I cannot say.  ;-)

Oh, I'm perfectly fine with that. But we're not desperate to buy, and
I actually think the C11 people are - or at least should be - *way*
more desperate to sell their model to us than we are to take it.

Which means that as a buyer you should say "this is what we want, if
you don't give us this, we'll just walk away". Not try to see how much
we can pay for it. Because there is very little upside for us, and
_unless_ the C11 standard gets things right it's just extra complexity
for us, coupled with new compiler fragility and years of code
generation bugs.

Why would we want that extra complexity and inevitable compiler bugs?
If we then have to fight compiler writers that point to the standard
and say "..but look, the standard says we can do this", then at that
point it went from extra complexity and compiler bugs to a whole
'nother level of frustration and pain.

So just walk away unless the C11 standard gives us exactly what we
want. Not something "kind of like" what we'd use. EXACTLY. Because I'm
not in the least interested in fighting compiler people that have a
crappy standard they can point to. Been there, done that, got the
T-shirt and learnt my lesson.

And the thing is, I suspect that the Linux kernel is the most complete
- and most serious - user of true atomics that the C11 people can sell
their solution to.

If we don't buy it, they have no serious user. Sure, they'll have lots
of random other one-off users for their atomics, where each user wants
one particular thing, but I suspect that we'll have the only really
unified portable code base that handles pretty much *all* the serious
odd cases that the C11 atomics can actually talk about to each other.

Oh, they'll push things through with or without us, and it will be a
collection of random stuff, where they tried to please everybody, with
particularly compiler/architecture people who have no f*cking clue
about how their stuff is used pushing to make it easy/efficient for
their particular compiler/architecture.

But we have real optimized uses of pretty much all relevant cases that
people actually care about.

We can walk away from them, and not really lose anything but a small
convenience (and it's a convenience *only* if the standard gets things
right).

And conversely, the C11 people can walk away from us too. But if they
can't make us happy (and by "make us happy", I really mean "no stupid
games on our part") I personally think they'll have a stronger
standard, and a real use case, and real arguments. I'm assuming they
want that.

That's why I complain when you talk about things like marking control
dependencies explicitly. That's *us* bending over backwards. And as a
buyer, we have absolutely zero reason to do that.

Tell the C11 people: no speculative writes. Full stop. End of story.
Because we're not buying anything else.

Similarly, if we need to mark atomics volatile, then now the C11
atomics are no longer even a small convenience, now they are just
extra complexity wrt what we already have. So just make it clear that
if the C11 standard needs to mark atomics volatile in order to get
non-speculative and non-reloading behavior, then the C11 atomics are
useless to us, and we're not buying.

Remember: a compiler can *always* do "as if" optimizations - if a
compiler writer can prove that the end result acts 100% the same using
an optimized sequence, then they can do whatever the hell they want.
That's not the issue. But if we can *ever* see semantic impact of
speculative writes, the compiler is buggy, and the compiler writers
need to be aware that it is buggy. No ifs, buts, maybes about it.
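
Two small sketches (hypothetical names) of where that line falls:

    #include <stdatomic.h>

    atomic_int x, y;

    void ok(void)
    {
            /* A valid "as if" rewrite may keep only the second store:
             * no execution guarantees a reader sees x == 1, so dropping
             * the first store narrows, never changes, the outcomes. */
            atomic_store_explicit(&x, 1, memory_order_relaxed);
            atomic_store_explicit(&x, 2, memory_order_relaxed);
    }

    void broken(int c)
    {
            /* Performing this store when c is zero would let other
             * threads see a value the abstract machine never wrote --
             * semantic impact, hence a compiler bug. */
            if (c)
                    atomic_store_explicit(&y, 3, memory_order_relaxed);
    }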

So I'm perfectly fine with you seeing yourself as a buyer. But I want
you to be a really *picky* and anal buyer - one that knows he has the
upper hand, and can walk away with no downside.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Linus Torvalds
On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

> And conversely, the C11 people can walk away from us too. But if they
> can't make us happy (and by "make us happy", I really mean "no stupid
> games on our part") I personally think they'll have a stronger
> standard, and a real use case, and real arguments. I'm assuming they
> want that.

I should have somebody who proof-reads my emails before I send them out.

I obviously meant "if they *can* make us happy" (not "can't").

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Paul E. McKenney
On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds
> torva...@linux-foundation.org wrote:
> >
> > And conversely, the C11 people can walk away from us too. But if they
> > can't make us happy (and by "make us happy", I really mean "no stupid
> > games on our part") I personally think they'll have a stronger
> > standard, and a real use case, and real arguments. I'm assuming they
> > want that.
> 
> I should have somebody who proof-reads my emails before I send them out.
> 
> I obviously meant "if they *can* make us happy" (not "can't").

Understood.  My next step is to take a more detailed look at the piece
of the standard that should support RCU.  Depending on how that turns
out, I might look at other parts of the standard vs. Linux's atomics
and memory-ordering needs.  Should be interesting.  ;-)

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Paul E. McKenney
On Fri, Feb 14, 2014 at 10:35:44PM -0800, Paul E. McKenney wrote:
> On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote:
> > On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds
> > torva...@linux-foundation.org wrote:
> > >
> > > And conversely, the C11 people can walk away from us too. But if they
> > > can't make us happy (and by "make us happy", I really mean "no stupid
> > > games on our part") I personally think they'll have a stronger
> > > standard, and a real use case, and real arguments. I'm assuming they
> > > want that.
> > 
> > I should have somebody who proof-reads my emails before I send them out.
> > 
> > I obviously meant "if they *can* make us happy" (not "can't").
> 
> Understood.  My next step is to take a more detailed look at the piece
> of the standard that should support RCU.  Depending on how that turns
> out, I might look at other parts of the standard vs. Linux's atomics
> and memory-ordering needs.  Should be interesting.  ;-)

And perhaps a better way to represent the roles is that I am not the
buyer, but rather the purchasing agent for the -potential- buyer.  -You-
are of course the potential buyer.

If I were to see myself as the buyer, then I must confess that the
concerns you implicitly expressed in your prior email would be all too
well-founded!

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-13 Thread Torvald Riegel
On Wed, 2014-02-12 at 10:19 +0100, Peter Zijlstra wrote:
> > I don't know the specifics of your example, but from how I understand
> > it, I don't see a problem if the compiler can prove that the store will
> > always happen.
> > 
> > To be more specific, if the compiler can prove that the store will
> > happen anyway, and the region of code can be assumed to always run
> > atomically (e.g., there's no loop or such in there), then it is known
> > that we have one atomic region of code that will always perform the
> > store, so we might as well do the stuff in the region in some order.
> > 
> > Now, if any of the memory accesses are atomic, then the whole region of
> > code containing those accesses is often not atomic because other threads
> > might observe intermediate results in a data-race-free way.
> > 
> > (I know that this isn't a very precise formulation, but I hope it brings
> > my line of reasoning across.)
> 
> So given something like:
> 
>   if (x)
>   y = 3;
> 
> > assuming both x and y are atomic (so don't gimme crap for not knowing
> the C11 atomic incantations); and you can prove x is always true; you
> don't see a problem with not emitting the conditional?

That depends on what your goal is.  It would be correct as far as the
standard is specified; this makes sense if all you want is indeed a
program that does what the abstract machine might do, and produces the
same output / side effects.

If you're trying to preserve the branch in the code emitted / executed
by the implementation, then it would not be correct.  But those branches
aren't specified as being part of the observable side effects.  In the
common case, this makes sense because it enables optimizations that are
useful; this line of reasoning also allows the compiler to merge some
atomic accesses in the way that Linus would like to see it.
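
As a sketch of the transformation in question (hypothetical names,
assuming the compiler can somehow prove the load always yields nonzero):

    #include <stdatomic.h>

    atomic_int x, y;

    void t(void)
    {
            if (atomic_load_explicit(&x, memory_order_relaxed))
                    atomic_store_explicit(&y, 3, memory_order_relaxed);
    }

    /* If the implementation can prove the condition always holds, the
     * as-if rule lets it emit an unconditional store (and possibly drop
     * the load), because the branch is not an observable side effect.
     * Whether that should be allowed to erase the ordering the branch
     * provided is exactly the disagreement in this thread. */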

> Avoiding the conditional changes the result; see that control dependency
> email from earlier.

It does not, with regard to how the standard defines "result".

> In the above example the load of X and the store to
> Y are strictly ordered, due to control dependencies. Not emitting the
> condition and maybe not even emitting the load completely wrecks this.

I think you're trying to solve this backwards.  You are looking at this
with an implicit wishlist of what the compiler should do (or how you
want to use the hardware), but this is not a viable specification that
one can write a compiler against.

We do need clear rules for what the compiler is allowed to do or not
(e.g., a memory model that models multi-threaded executions).  Otherwise
it's all hand-waving, and we're getting nowhere.  Thus, the way to
approach this is to propose a feature or change to the standard, make
sure that this is consistent and has no unintended side effects for
other aspects of compilation or other code, and then ask the compiler to
implement it.  IOW, we need a patch for where this all starts: in the
rules and requirements for compilation.

Paul and I are at the C++ meeting currently, and we had sessions in
which the concurrency study group talked about memory model issues like
dependency tracking and memory_order_consume.  Paul shared uses of
atomics (or likewise) in the kernel, and we discussed how the memory
model currently handles various cases and why, how one could express
other requirements consistently, and what is actually implementable in
practice.  I can't speak for Paul, but I thought those discussions were
productive.

> It's therefore an invalid optimization to take out the conditional or
> speculate the store, since it takes out the dependency.





Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-13 Thread Torvald Riegel
On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote:
> On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote:
> > On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
> > > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
> > > >  wrote:
> > > > >
> > > > > Us Linux-kernel hackers will often need to use volatile semantics in
> > > > > combination with C11 atomics in most cases.  The C11 atomics do cover
> > > > > some of the reasons we currently use ACCESS_ONCE(), but not all of 
> > > > > them --
> > > > > in particular, it allows load/store merging.
> > > > 
> > > > I really disagree with the "will need to use volatile".
> > > > 
> > > > We should never need to use volatile (outside of whatever MMIO we do
> > > > using C) if C11 defines atomics correctly.
> > > > 
> > > > Allowing load/store merging is *fine*. All sane CPU's do that anyway -
> > > > it's called a cache - and there's no actual reason to think that
> > > > "ACCESS_ONCE()" has to mean our current "volatile".
> > > > 
> > > > Now, it's possible that the C standards simply get atomics _wrong_, so
> > > > that they create visible semantics that are different from what a CPU
> > > > cache already does, but that's a plain bug in the standard if so.
> > > > 
> > > > But merging loads and stores is fine. And I *guarantee* it is fine,
> > > > exactly because CPU's already do it, so claiming that the compiler
> > > > couldn't do it is just insanity.
> > > 
> > > Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
> > > normally get their stores pushed through the store buffer in reasonable
> > > time, and CPUs also use things like invalidations to ensure that a
> > > store is seen in reasonable time by readers.  Compilers don't always
> > > have these two properties, so we do need to be more careful of load
> > > and store merging by compilers.
> > 
> > The standard's _wording_ is a little vague about forward-progress
> > guarantees, but I believe the vast majority of the people involved do
> > want compilers to not prevent forward progress.  There is of course a
> > difference whether a compiler establishes _eventual_ forward progress in
> > the sense of after 10 years or forward progress in a small bounded
> > interval of time, but this is a QoI issue, and good compilers won't want
> > to introduce unnecessary latencies.  I believe that it is fine if the
> > standard merely talks about eventual forward progress.
> 
> The compiler will need to earn my trust on this one.  ;-)
> 
> > > > Now, there are things that are *not* fine, like speculative stores
> > > > that could be visible to other threads. Those are *bugs* (either in
> > > > the compiler or in the standard), and anybody who claims otherwise is
> > > > not worth discussing with.
> > > 
> > > And as near as I can tell, volatile semantics are required in C11 to
> > > avoid speculative stores.  I might be wrong about this, and hope that
> > > I am wrong.  But I am currently not seeing it in the current standard.
> > > (Though I expect that most compilers would avoid speculating stores,
> > > especially in the near term.)
> > 
> > This really depends on how we define speculative stores.  The memory
> > model is absolutely clear that programs have to behave as if executed by
> > the virtual machine, and that rules out speculative stores to volatiles
> > and other locations.  Under certain circumstances, there will be
> > "speculative" stores in the sense that they will happen at different
> > times than if you had a trivial implementation of the abstract machine.
> > But to be allowed to do that, the compiler has to prove that such a
> > transformation still fulfills the as-if rule.
> 
> Agreed, although the as-if rule would ignore control dependencies, since
> these are not yet part of the standard (as you in fact note below).
> I nevertheless consider myself at least somewhat reassured that current
> C11 won't speculate stores.  My remaining concerns involve the compiler
> proving to itself that a given branch is always taken, thus motivating
> it to optimize the branch away -- though this is more properly a
> control-dependency concern.
> 
> > IOW, the abstract machine is what currently defines disallowed
> > speculative stores.  If you want to put *further* constraints on what
> > implementations are allowed to do, I suppose it is best to talk about
> > those and see how we can add rules that allow programmers to express
> > those constraints.  For example, control dependencies might be such a
> > case.  I don't have a specific suggestion -- maybe the control
> > dependencies are best tackled similar to consume dependencies (even
> > though we don't have a good solution for those yet).  But using
> > volatile accesses for that seems to be a big hammer, or even the wrong
> > one.
> 
> In current compilers, the two hammers we have are volatile and barrier().
> But yes, it would be good to have something more focused.  One option
> would be to propose memory_order_control loads to see how loudly the
> committee screams.  One use case might be as follows:
> 
> 	if (atomic_load(x, memory_order_control))
> 		atomic_store(y, memory_order_relaxed);

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-13 Thread Paul E. McKenney
On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote:
> On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote:
> > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
> > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
> > >  wrote:
> > > >
> > > > Us Linux-kernel hackers will often need to use volatile semantics in
> > > > combination with C11 atomics in most cases.  The C11 atomics do cover
> > > > some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> > > > in particular, it allows load/store merging.
> > > 
> > > I really disagree with the "will need to use volatile".
> > > 
> > > We should never need to use volatile (outside of whatever MMIO we do
> > > using C) if C11 defines atomics correctly.
> > > 
> > > Allowing load/store merging is *fine*. All sane CPU's do that anyway -
> > > it's called a cache - and there's no actual reason to think that
> > > "ACCESS_ONCE()" has to mean our current "volatile".
> > > 
> > > Now, it's possible that the C standards simply get atomics _wrong_, so
> > > that they create visible semantics that are different from what a CPU
> > > cache already does, but that's a plain bug in the standard if so.
> > > 
> > > But merging loads and stores is fine. And I *guarantee* it is fine,
> > > exactly because CPU's already do it, so claiming that the compiler
> > > couldn't do it is just insanity.
> > 
> > Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
> > normally get their stores pushed through the store buffer in reasonable
> > time, and CPUs also use things like invalidations to ensure that a
> > store is seen in reasonable time by readers.  Compilers don't always
> > have these two properties, so we do need to be more careful of load
> > and store merging by compilers.
> 
> The standard's _wording_ is a little vague about forward-progress
> guarantees, but I believe the vast majority of the people involved do
> want compilers to not prevent forward progress.  There is of course a
> difference whether a compiler establishes _eventual_ forward progress in
> the sense of after 10 years or forward progress in a small bounded
> interval of time, but this is a QoI issue, and good compilers won't want
> to introduce unnecessary latencies.  I believe that it is fine if the
> standard merely talks about eventual forward progress.

The compiler will need to earn my trust on this one.  ;-)

> > > Now, there are things that are *not* fine, like speculative stores
> > > that could be visible to other threads. Those are *bugs* (either in
> > > the compiler or in the standard), and anybody who claims otherwise is
> > > not worth discussing with.
> > 
> > And as near as I can tell, volatile semantics are required in C11 to
> > avoid speculative stores.  I might be wrong about this, and hope that
> > I am wrong.  But I am currently not seeing it in the current standard.
> > (Though I expect that most compilers would avoid speculating stores,
> > especially in the near term.)
> 
> This really depends on how we define speculative stores.  The memory
> model is absolutely clear that programs have to behave as if executed by
> the virtual machine, and that rules out speculative stores to volatiles
> and other locations.  Under certain circumstances, there will be
> "speculative" stores in the sense that they will happen at different
> times than if you had a trivial implementation of the abstract machine.
> But to be allowed to do that, the compiler has to prove that such a
> transformation still fulfills the as-if rule.

Agreed, although the as-if rule would ignore control dependencies, since
these are not yet part of the standard (as you in fact note below).
I nevertheless consider myself at least somewhat reassured that current
C11 won't speculate stores.  My remaining concerns involve the compiler
proving to itself that a given branch is always taken, thus motivating
it to optimize the branch away -- though this is more properly a
control-dependency concern.

> IOW, the abstract machine is what currently defines disallowed
> speculative stores.  If you want to put *further* constraints on what
> implementations are allowed to do, I suppose it is best to talk about
> those and see how we can add rules that allow programmers to express
> those constraints.  For example, control dependencies might be such a
> case.  I don't have a specific suggestion -- maybe the control
> dependencies are best tackled similar to consume dependencies (even
> though we don't have a good solution for those yet).  But using
> volatile accesses for that seems to be a big hammer, or even the wrong
> one.

In current compilers, the two hammers we have are volatile and barrier().
But yes, it would be good to have something more focused.  One option
would be to propose memory_order_control loads to see how loudly the
committee screams.  One use case might be as follows:

	if (atomic_load(x, memory_order_control))
		atomic_store(y, memory_order_relaxed);

This could also be written:

	r1 = atomic_load(x, memory_order_control);
	if (r1)
		atomic_store(y, memory_order_relaxed);

Another option would be to flag the conditional expression, prohibiting
the compiler from optimizing out any conditional branches.  Perhaps
something like this:

	r1 = atomic_load(x, memory_order_control);
	if (control_dependency(r1))
		atomic_store(y, memory_order_relaxed);

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-13 Thread Torvald Riegel
On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote:
> On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
> > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
> >  wrote:
> > >
> > > Us Linux-kernel hackers will often need to use volatile semantics in
> > > combination with C11 atomics in most cases.  The C11 atomics do cover
> > > some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> > > in particular, it allows load/store merging.
> > 
> > I really disagree with the "will need to use volatile".
> > 
> > We should never need to use volatile (outside of whatever MMIO we do
> > using C) if C11 defines atomics correctly.
> > 
> > Allowing load/store merging is *fine*. All sane CPU's do that anyway -
> > it's called a cache - and there's no actual reason to think that
> > "ACCESS_ONCE()" has to mean our current "volatile".
> > 
> > Now, it's possible that the C standards simply get atomics _wrong_, so
> > that they create visible semantics that are different from what a CPU
> > cache already does, but that's a plain bug in the standard if so.
> > 
> > But merging loads and stores is fine. And I *guarantee* it is fine,
> > exactly because CPU's already do it, so claiming that the compiler
> > couldn't do it is just insanity.
> 
> Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
> normally get their stores pushed through the store buffer in reasonable
> time, and CPUs also use things like invalidations to ensure that a
> store is seen in reasonable time by readers.  Compilers don't always
> have these two properties, so we do need to be more careful of load
> and store merging by compilers.

The standard's _wording_ is a little vague about forward-progress
guarantees, but I believe the vast majority of the people involved do
want compilers to not prevent forward progress.  There is of course a
difference whether a compiler establishes _eventual_ forward progress in
the sense of after 10 years or forward progress in a small bounded
interval of time, but this is a QoI issue, and good compilers won't want
to introduce unnecessary latencies.  I believe that it is fine if the
standard merely talks about eventual forward progress.
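
As an illustration of the latency/progress concern (a hypothetical
sketch):

    #include <stdatomic.h>
    #include <stdbool.h>

    atomic_bool flag;

    void wait_for_flag(void)
    {
            /* If a compiler "merged" the per-iteration loads into one
             * load hoisted out of the loop, a thread that read false
             * once would spin forever -- eventual forward progress
             * requires each iteration to really reload flag. */
            while (!atomic_load_explicit(&flag, memory_order_relaxed))
                    ;
    }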

> > Now, there are things that are *not* fine, like speculative stores
> > that could be visible to other threads. Those are *bugs* (either in
> > the compiler or in the standard), and anybody who claims otherwise is
> > not worth discussing with.
> 
> And as near as I can tell, volatile semantics are required in C11 to
> avoid speculative stores.  I might be wrong about this, and hope that
> I am wrong.  But I am currently not seeing it in the current standard.
> (Though I expect that most compilers would avoid speculating stores,
> especially in the near term.)

This really depends on how we define speculative stores.  The memory
model is absolutely clear that programs have to behave as if executed by
the virtual machine, and that rules out speculative stores to volatiles
and other locations.  Under certain circumstances, there will be
"speculative" stores in the sense that they will happen at different
times than if you had a trivial implementation of the abstract machine.
But to be allowed to do that, the compiler has to prove that such a
transformation still fulfills the as-if rule.

IOW, the abstract machine is what currently defines disallowed
speculative stores.  If you want to put *further* constraints on what
implementations are allowed to do, I suppose it is best to talk about
those and see how we can add rules that allow programmers to express
those constraints.  For example, control dependencies might be such a
case.  I don't have a specific suggestion -- maybe the control
dependencies are best tackled similar to consume dependencies (even
though we don't have a good solution for those yet).  But using
volatile accesses for that seems to be a big hammer, or even the wrong
one.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-13 Thread Torvald Riegel
On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote:
 On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
  On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
  paul...@linux.vnet.ibm.com wrote:
  
   Us Linux-kernel hackers will often need to use volatile semantics in
   combination with C11 atomics in most cases.  The C11 atomics do cover
   some of the reasons we currently use ACCESS_ONCE(), but not all of them --
   in particular, it allows load/store merging.
  
  I really disagree with the will need to use volatile.
  
  We should never need to use volatile (outside of whatever MMIO we do
  using C) if C11 defines atomics correctly.
  
  Allowing load/store merging is *fine*. All sane CPU's do that anyway -
  it's called a cache - and there's no actual reason to think that
  ACCESS_ONCE() has to mean our current volatile.
  
  Now, it's possible that the C standards simply get atomics _wrong_, so
  that they create visible semantics that are different from what a CPU
  cache already does, but that's a plain bug in the standard if so.
  
  But merging loads and stores is fine. And I *guarantee* it is fine,
  exactly because CPU's already do it, so claiming that the compiler
  couldn't do it is just insanity.
 
 Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
 normally get their stores pushed through the store buffer in reasonable
 time, and CPUs also use things like invalidations to ensure that a
 store is seen in reasonable time by readers.  Compilers don't always
 have these two properties, so we do need to be more careful of load
 and store merging by compilers.

The standard's _wording_ is a little vague about forward-progress
guarantees, but I believe the vast majority of the people involved do
want compilers to not prevent forward progress.  There is of course a
difference whether a compiler establishes _eventual_ forward progress in
the sense of after 10 years or forward progress in a small bounded
interval of time, but this is a QoI issue, and good compilers won't want
to introduce unnecessary latencies.  I believe that it is fine if the
standard merely talks about eventual forward progress.

  Now, there are things that are *not* fine, like speculative stores
  that could be visible to other threads. Those are *bugs* (either in
  the compiler or in the standard), and anybody who claims otherwise is
  not worth discussing with.
 
 And as near as I can tell, volatile semantics are required in C11 to
 avoid speculative stores.  I might be wrong about this, and hope that
 I am wrong.  But I am currently not seeing it in the current standard.
 (Though I expect that most compilers would avoid speculating stores,
 especially in the near term.

This really depends on how we define speculative stores.  The memory
model is absolutely clear that programs have to behave as if executed by
the virtual machine, and that rules out speculative stores to volatiles
and other locations.  Under certain circumstances, there will be
speculative stores in the sense that they will happen at different
times as if you had a trivial implementation of the abstract machine.
But to be allowed to do that, the compiler has to prove that such a
transformation still fulfills the as-if rule.

IOW, the abstract machine is what currently defines disallowed
speculative stores.  If you want to put *further* constraints on what
implementations are allowed to do, I suppose it is best to talk about
those and see how we can add rules that allow programmers to express
those constraints.  For example, control dependencies might be such a
case.  I don't have a specific suggestion -- maybe the control
dependencies are best tackled similar to consume dependencies (even
though we don't have a good solution for those yets).  But using
volatile accesses for that seems to be a big hammer, or even the wrong
one.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-13 Thread Paul E. McKenney
On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote:
 On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote:
  On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
   On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
   paul...@linux.vnet.ibm.com wrote:
   
Us Linux-kernel hackers will often need to use volatile semantics in
combination with C11 atomics in most cases.  The C11 atomics do cover
some of the reasons we currently use ACCESS_ONCE(), but not all of them 
--
in particular, it allows load/store merging.
   
   I really disagree with the will need to use volatile.
   
   We should never need to use volatile (outside of whatever MMIO we do
   using C) if C11 defines atomics correctly.
   
   Allowing load/store merging is *fine*. All sane CPU's do that anyway -
   it's called a cache - and there's no actual reason to think that
   ACCESS_ONCE() has to mean our current volatile.
   
   Now, it's possible that the C standards simply get atomics _wrong_, so
   that they create visible semantics that are different from what a CPU
   cache already does, but that's a plain bug in the standard if so.
   
   But merging loads and stores is fine. And I *guarantee* it is fine,
   exactly because CPU's already do it, so claiming that the compiler
   couldn't do it is just insanity.
  
  Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
  normally get their stores pushed through the store buffer in reasonable
  time, and CPUs also use things like invalidations to ensure that a
  store is seen in reasonable time by readers.  Compilers don't always
  have these two properties, so we do need to be more careful of load
  and store merging by compilers.
 
 The standard's _wording_ is a little vague about forward-progress
 guarantees, but I believe the vast majority of the people involved do
 want compilers to not prevent forward progress.  There is of course a
 difference whether a compiler establishes _eventual_ forward progress in
 the sense of after 10 years or forward progress in a small bounded
 interval of time, but this is a QoI issue, and good compilers won't want
 to introduce unnecessary latencies.  I believe that it is fine if the
 standard merely talks about eventual forward progress.

The compiler will need to earn my trust on this one.  ;-)

   Now, there are things that are *not* fine, like speculative stores
   that could be visible to other threads. Those are *bugs* (either in
   the compiler or in the standard), and anybody who claims otherwise is
   not worth discussing with.
  
  And as near as I can tell, volatile semantics are required in C11 to
  avoid speculative stores.  I might be wrong about this, and hope that
  I am wrong.  But I am currently not seeing it in the current standard.
  (Though I expect that most compilers would avoid speculating stores,
  especially in the near term.
 
 This really depends on how we define speculative stores.  The memory
 model is absolutely clear that programs have to behave as if executed by
 the virtual machine, and that rules out speculative stores to volatiles
 and other locations.  Under certain circumstances, there will be
 speculative stores in the sense that they will happen at different
 times as if you had a trivial implementation of the abstract machine.
 But to be allowed to do that, the compiler has to prove that such a
 transformation still fulfills the as-if rule.

Agreed, although the as-if rule would ignore control dependencies, since
these are not yet part of the standard (as you in fact note below).
I nevertheless consider myself at least somewhat reassured that current
C11 won't speculate stores.  My remaining concerns involve the compiler
proving to itself that a given branch is always taken, thus motivating
it to optimize the branch away -- though this is more properly a
control-dependency concern.

 IOW, the abstract machine is what currently defines disallowed
 speculative stores.  If you want to put *further* constraints on what
 implementations are allowed to do, I suppose it is best to talk about
 those and see how we can add rules that allow programmers to express
 those constraints.  For example, control dependencies might be such a
 case.  I don't have a specific suggestion -- maybe the control
 dependencies are best tackled similar to consume dependencies (even
 though we don't have a good solution for those yets).  But using
 volatile accesses for that seems to be a big hammer, or even the wrong
 one.

In current compilers, the two hammers we have are volatile and barrier().
But yes, it would be good to have something more focused.  One option
would be to propose memory_order_control loads to see how loudly the
committee screams.  One use case might be as follows:

if (atomic_load(x, memory_order_control))
atomic_store(y, memory_order_relaxed);

This could also be written:

r1 = atomic_load(x, memory_order_control);
if 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-13 Thread Torvald Riegel
On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote:
 On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote:
  On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote:
   On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
paul...@linux.vnet.ibm.com wrote:

 Us Linux-kernel hackers will often need to use volatile semantics in
 combination with C11 atomics in most cases.  The C11 atomics do cover
 some of the reasons we currently use ACCESS_ONCE(), but not all of 
 them --
 in particular, it allows load/store merging.

I really disagree with the will need to use volatile.

We should never need to use volatile (outside of whatever MMIO we do
using C) if C11 defines atomics correctly.

Allowing load/store merging is *fine*. All sane CPU's do that anyway -
it's called a cache - and there's no actual reason to think that
ACCESS_ONCE() has to mean our current volatile.

Now, it's possible that the C standards simply get atomics _wrong_, so
that they create visible semantics that are different from what a CPU
cache already does, but that's a plain bug in the standard if so.

But merging loads and stores is fine. And I *guarantee* it is fine,
exactly because CPU's already do it, so claiming that the compiler
couldn't do it is just insanity.
   
   Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
   normally get their stores pushed through the store buffer in reasonable
   time, and CPUs also use things like invalidations to ensure that a
   store is seen in reasonable time by readers.  Compilers don't always
   have these two properties, so we do need to be more careful of load
   and store merging by compilers.
  
  The standard's _wording_ is a little vague about forward-progress
  guarantees, but I believe the vast majority of the people involved do
  want compilers to not prevent forward progress.  There is of course a
  difference whether a compiler establishes _eventual_ forward progress in
  the sense of after 10 years or forward progress in a small bounded
  interval of time, but this is a QoI issue, and good compilers won't want
  to introduce unnecessary latencies.  I believe that it is fine if the
  standard merely talks about eventual forward progress.
 
 The compiler will need to earn my trust on this one.  ;-)
 
Now, there are things that are *not* fine, like speculative stores
that could be visible to other threads. Those are *bugs* (either in
the compiler or in the standard), and anybody who claims otherwise is
not worth discussing with.
   
   And as near as I can tell, volatile semantics are required in C11 to
   avoid speculative stores.  I might be wrong about this, and hope that
   I am wrong.  But I am currently not seeing it in the current standard.
   (Though I expect that most compilers would avoid speculating stores,
    especially in the near term.)
  
  This really depends on how we define speculative stores.  The memory
  model is absolutely clear that programs have to behave as if executed by
  the virtual machine, and that rules out speculative stores to volatiles
  and other locations.  Under certain circumstances, there will be
  speculative stores in the sense that they will happen at different
  times than if you had a trivial implementation of the abstract machine.
  But to be allowed to do that, the compiler has to prove that such a
  transformation still fulfills the as-if rule.
 
 Agreed, although the as-if rule would ignore control dependencies, since
 these are not yet part of the standard (as you in fact note below).
 I nevertheless consider myself at least somewhat reassured that current
 C11 won't speculate stores.  My remaining concerns involve the compiler
 proving to itself that a given branch is always taken, thus motivating
 it to optimize the branch away -- though this is more properly a
 control-dependency concern.
 
  IOW, the abstract machine is what currently defines disallowed
  speculative stores.  If you want to put *further* constraints on what
  implementations are allowed to do, I suppose it is best to talk about
  those and see how we can add rules that allow programmers to express
  those constraints.  For example, control dependencies might be such a
  case.  I don't have a specific suggestion -- maybe the control
  dependencies are best tackled similar to consume dependencies (even
  though we don't have a good solution for those yet).  But using
  volatile accesses for that seems to be a big hammer, or even the wrong
  one.
 
 In current compilers, the two hammers we have are volatile and barrier().
 But yes, it would be good to have something more focused.  One option
 would be to propose memory_order_control loads to see how loudly the
 committee screams.  One use case might be as follows:
 
   if (atomic_load(&x, memory_order_control))
     atomic_store(&y, 1, memory_order_relaxed);

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-13 Thread Torvald Riegel
On Wed, 2014-02-12 at 10:19 +0100, Peter Zijlstra wrote:
  I don't know the specifics of your example, but from how I understand
  it, I don't see a problem if the compiler can prove that the store will
  always happen.
  
  To be more specific, if the compiler can prove that the store will
  happen anyway, and the region of code can be assumed to always run
  atomically (e.g., there's no loop or such in there), then it is known
  that we have one atomic region of code that will always perform the
  store, so we might as well do the stuff in the region in some order.
  
  Now, if any of the memory accesses are atomic, then the whole region of
  code containing those accesses is often not atomic because other threads
  might observe intermediate results in a data-race-free way.
  
  (I know that this isn't a very precise formulation, but I hope it brings
  my line of reasoning across.)
 
 So given something like:
 
   if (x)
       y = 3;
 
 assuming both x and y are atomic (so don't gimme crap for now knowing
 the C11 atomic incantations); and you can prove x is always true; you
 don't see a problem with not emitting the conditional?

That depends on what your goal is.  It would be correct as far as the
standard is specified; this makes sense if all you want is indeed a
program that does what the abstract machine might do, and produces the
same output / side effects.

If you're trying to preserve the branch in the code emitted / executed
by the implementation, then it would not be correct.  But those branches
aren't specified as being part of the observable side effects.  In the
common case, this makes sense because it enables optimizations that are
useful; this line of reasoning also allows the compiler to merge some
atomic accesses in the way that Linus would like to see it.

 Avoiding the conditional changes the result; see that control dependency
 email from earlier.

It does not regarding how the standard defines "result".

 In the above example the load of X and the store to
 Y are strictly ordered, due to control dependencies. Not emitting the
 condition and maybe not even emitting the load completely wrecks this.

I think you're trying to solve this backwards.  You are looking at this
with an implicit wishlist of what the compiler should do (or how you
want to use the hardware), but this is not a viable specification that
one can write a compiler against.

We do need clear rules for what the compiler is allowed to do or not
(e.g., a memory model that models multi-threaded executions).  Otherwise
it's all hand-waving, and we're getting nowhere.  Thus, the way to
approach this is to propose a feature or change to the standard, make
sure that this is consistent and has no unintended side effects for
other aspects of compilation or other code, and then ask the compiler to
implement it.  IOW, we need a patch for where this all starts: in the
rules and requirements for compilation.

Paul and I are at the C++ meeting currently, and we had sessions in
which the concurrency study group talked about memory model issues like
dependency tracking and memory_order_consume.  Paul shared uses of
atomics (or likewise) in the kernel, and we discussed how the memory
model currently handles various cases and why, how one could express
other requirements consistently, and what is actually implementable in
practice.  I can't speak for Paul, but I thought those discussions were
productive.

 It's therefore an invalid optimization to take out the conditional or
 speculate the store, since it takes out the dependency.





Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-12 Thread Paul E. McKenney
On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote:
> On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
>  wrote:
> >
> > Us Linux-kernel hackers will often need to use volatile semantics in
> > combination with C11 atomics in most cases.  The C11 atomics do cover
> > some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> > in particular, it allows load/store merging.
> 
> I really disagree with the "will need to use volatile".
> 
> We should never need to use volatile (outside of whatever MMIO we do
> using C) if C11 defines atomics correctly.
> 
> Allowing load/store merging is *fine*. All sane CPU's do that anyway -
> it's called a cache - and there's no actual reason to think that
> "ACCESS_ONCE()" has to mean our current "volatile".
> 
> Now, it's possible that the C standards simply get atomics _wrong_, so
> that they create visible semantics that are different from what a CPU
> cache already does, but that's a plain bug in the standard if so.
> 
> But merging loads and stores is fine. And I *guarantee* it is fine,
> exactly because CPU's already do it, so claiming that the compiler
> couldn't do it is just insanity.

Agreed, both CPUs and compilers can merge loads and stores.  But CPUs
normally get their stores pushed through the store buffer in reasonable
time, and CPUs also use things like invalidations to ensure that a
store is seen in reasonable time by readers.  Compilers don't always
have these two properties, so we do need to be more careful of load
and store merging by compilers.
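To make the timeliness point concrete, a minimal sketch (names
invented) of the classic failure mode that volatile semantics guard
against in current kernels:

    int flag;                       /* set to 1 by some other thread */

    void wait_for_flag(void)
    {
            while (!flag)           /* plain load: compiler may hoist it */
                    ;               /* into a single load before the loop,
                                     * turning this into an infinite loop */
    }

A CPU re-reads its cache on every iteration, so the hardware makes
progress; a compiler that merges all the loads into one does not.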

> Now, there are things that are *not* fine, like speculative stores
> that could be visible to other threads. Those are *bugs* (either in
> the compiler or in the standard), and anybody who claims otherwise is
> not worth discussing with.

And as near as I can tell, volatile semantics are required in C11 to
avoid speculative stores.  I might be wrong about this, and hope that
I am wrong.  But I am currently not seeing it in the current standard.
(Though I expect that most compilers would avoid speculating stores,
especially in the near term.)

> But I really really disagree with the "we might have to use
> 'volatile'". Because if we *ever* have to use 'volatile' with the
> standard C atomic types, then we're just better off ignoring the
> atomic types entirely, because they are obviously broken shit - and
> we're better off doing it ourselves the way we have forever.
> 
> Seriously. This is not even hyperbole. It really is as simple as that.

Agreed, if we are talking about replacing ACCESS_ONCE() with C11
relaxed atomics any time soon.  But someone porting Linux to a
new CPU architecture might use a carefully chosen subset of C11
atomics to implement some of the Linux atomic operations, especially
non-value-returning atomics such as atomic_inc().

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-12 Thread Linus Torvalds
On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
 wrote:
>
> Us Linux-kernel hackers will often need to use volatile semantics in
> combination with C11 atomics in most cases.  The C11 atomics do cover
> some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> in particular, it allows load/store merging.

I really disagree with the "will need to use volatile".

We should never need to use volatile (outside of whatever MMIO we do
using C) if C11 defines atomics correctly.

Allowing load/store merging is *fine*. All sane CPU's do that anyway -
it's called a cache - and there's no actual reason to think that
"ACCESS_ONCE()" has to mean our current "volatile".

Now, it's possible that the C standards simply get atomics _wrong_, so
that they create visible semantics that are different from what a CPU
cache already does, but that's a plain bug in the standard if so.

But merging loads and stores is fine. And I *guarantee* it is fine,
exactly because CPU's already do it, so claiming that the compiler
couldn't do it is just insanity.

Now, there are things that are *not* fine, like speculative stores
that could be visible to other threads. Those are *bugs* (either in
the compiler or in the standard), and anybody who claims otherwise is
not worth discussing with.

But I really really disagree with the "we might have to use
'volatile'". Because if we *ever* have to use 'volatile' with the
standard C atomic types, then we're just better off ignoring the
atomic types entirely, because they are obviously broken shit - and
we're better off doing it ourselves the way we have forever.

Seriously. This is not even hyperbole. It really is as simple as that.

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-12 Thread Paul E. McKenney
On Tue, Feb 11, 2014 at 09:13:34PM -0800, Torvald Riegel wrote:
> On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote:
> > On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote:
> > > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > > > > > > > There are also so many ways to blow your head off it's 
> > > > > > > > > untrue. For example,
> > > > > > > > > cmpxchg takes a separate memory model parameter for failure 
> > > > > > > > > and success, but
> > > > > > > > > then there are restrictions on the sets you can use for each. 
> > > > > > > > > It's not hard
> > > > > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > > > > memory_model_seq_cst for everything, it's too hard 
> > > > > > > > > otherwise". Then there's
> > > > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely 
> > > > > > > > > ignores consume
> > > > > > > > > atm and optimises all of the data dependencies away) as well 
> > > > > > > > > as the definition
> > > > > > > > > of "data races", which seem to be used as an excuse to 
> > > > > > > > > miscompile a program
> > > > > > > > > at the earliest opportunity.
> > > > > > > > 
> > > > > > > > Trust me, rcu_dereference() is not going to be defined in terms 
> > > > > > > > of
> > > > > > > > memory_order_consume until the compilers implement it both 
> > > > > > > > correctly and
> > > > > > > > efficiently.  They are not there yet, and there is currently no 
> > > > > > > > shortage
> > > > > > > > of compiler writers who would prefer to ignore 
> > > > > > > > memory_order_consume.
> > > > > > > 
> > > > > > > Do you have any input on
> > > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In 
> > > > > > > particular, the
> > > > > > > language standard's definition of dependencies?
> > > > > > 
> > > > > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > > > > 
> > > > > > — B is an invocation of any specialization of std::kill_dependency 
> > > > > > (29.3), or
> > > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or 
> > > > > > logical OR (||, see 5.15) operator,
> > > > > > or
> > > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > > > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > > > > 
> > > > > > So the use of "flag" before the "?" is ignored.  But the "flag - 
> > > > > > flag"
> > > > > > after the "?" will carry a dependency, so the code fragment in 59448
> > > > > > needs to do the ordering rather than just optimizing "flag - flag" 
> > > > > > out
> > > > > > of existence.  One way to do that on both ARM and Power is to 
> > > > > > actually
> > > > > > emit code for "flag - flag", but there are a number of other ways to
> > > > > > make that work.
> > > > > 
> > > > > And that's what would concern me, considering that these requirements
> > > > > seem to be able to creep out easily.  Also, whereas the other atomics
> > > > > just constrain compilers wrt. reordering across atomic accesses or
> > > > > changes to the atomic accesses themselves, the dependencies are new
> > > > > requirements on pieces of otherwise non-synchronizing code.  The 
> > > > > latter
> > > > > seems far more involved to me.
> > > > 
> > > > Well, the wording of 1.10p9 is pretty explicit on this point.
> > > > There are only a few exceptions to the rule that dependencies from
> > > > memory_order_consume loads must be tracked.  And to your point about
> > > > requirements being placed on pieces of otherwise non-synchronizing code,
> > > > we already have that with plain old load acquire and store release --
> > > > both of these put ordering constraints that affect the surrounding
> > > > non-synchronizing code.
> > > 
> > > I think there's a significant difference.  With acquire/release or more
> > > general memory orders, it's true that we can't order _across_ the atomic
> > > access.  However, we can reorder and optimize without additional
> > > constraints if we do not reorder.  This is not the case with consume
> > > memory order, as the (p + flag - flag) example shows.
> > 
> > Agreed, memory_order_consume does introduce additional restrictions.
> > 
> > > > This issue got a lot of discussion, and the compromise is that
> > > > dependencies cannot leak into or out of functions unless the relevant
> > > > parameters or return values are annotated with [[carries_dependency]].
> > > > This means that the compiler can see all the places where dependencies
> > > > must be tracked.  This is described in 7.6.4.
> > > 
> > > I wasn't aware of 7.6.4 (but it isn't referred to as an additional
> > > constraint--what it is--in 1.10, so I guess at least that should be
> > > fixed).
> > > Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
> > > difference, at least if one is assuming that normal optimization of
> > > sequential code is the default, and that maintaining things such as
> > > (flag-flag) is not; if optimizing away (flag-flag) would require the
> > > insertion of fences unless there is the carries_dependency attribute,
> > > then this would be bad I think.
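For reference, a minimal sketch of the (p + flag - flag) idiom from
bug 59448 that this sub-thread keeps returning to, in C11 spellings
with invented declarations:

    #include <stdatomic.h>

    extern _Atomic int flag;
    extern int *p;

    int reader(void)
    {
            int f = atomic_load_explicit(&flag, memory_order_consume);
            return *(p + f - f);    /* f - f is always 0, but 1.10p9 says
                                     * the dependency on f must be carried */
    }

Optimizing "f - f" down to 0 without emitting a fence would lose the
ordering the consume load is supposed to provide on ARM and Power.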

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-12 Thread Paul E. McKenney
On Tue, Feb 11, 2014 at 09:39:24PM -0800, Torvald Riegel wrote:
> On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote:
> > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> > >
> > > Intuitively, this is wrong because this lets the program take a step
> > > the abstract machine wouldn't do.  This is different to the sequential
> > > code that Peter posted because it uses atomics, and thus one can't
> > > easily assume that the difference is not observable.
> > 
> > Btw, what is the definition of "observable" for the atomics?
> > 
> > Because I'm hoping that it's not the same as for volatiles, where
> > "observable" is about the virtual machine itself, and as such volatile
> > accesses cannot be combined or optimized at all.
> 
> No, atomics aren't an observable behavior of the abstract machine
> (unless they are volatile).  See 1.8.p8 (citing the C++ standard).

Us Linux-kernel hackers will often need to use volatile semantics in
combination with C11 atomics in most cases.  The C11 atomics do cover
some of the reasons we currently use ACCESS_ONCE(), but not all of them --
in particular, it allows load/store merging.

> > Now, I claim that atomic accesses cannot be done speculatively for
> > writes, and not re-done for reads (because the value could change),
> 
> Agreed, unless the compiler can prove that this doesn't make a
> difference in the program at hand and it's not volatile atomics.  In
> general, that will be hard and thus won't happen often I suppose, but if
> correctly proved it would fall under the as-if rule I think.
> 
> > but *combining* them would be possible and good.
> 
> Agreed.

In some cases, agreed.  But many uses in the Linux kernel will need
volatile semantics in combination with C11 atomics.  Which is OK, for
the foreseeable future, anyway.

> > For example, we often have multiple independent atomic accesses that
> > could certainly be combined: testing the individual bits of an atomic
> > value with helper functions, causing things like "load atomic, test
> > bit, load same atomic, test another bit". The two atomic loads could
> > be done as a single load without possibly changing semantics on a real
> > machine, but if "visibility" is defined in the same way it is for
> > "volatile", that wouldn't be a valid transformation. Right now we use
> > "volatile" semantics for these kinds of things, and they really can
> > hurt.
> 
> Agreed.  In your example, the compiler would have to prove that the
> abstract machine would always be able to run the two loads atomically
> (ie, as one load) without running into impossible/disallowed behavior of
> the program.  But if there's no loop or branch or such in-between, this
> should be straight-forward because any hardware oddity or similar could
> merge those loads and it wouldn't be disallowed by the standard
> (considering that we're talking about a finite number of loads), so the
> compiler would be allowed to do it as well.

As long as they are not marked volatile, agreed.

Thanx, Paul

> > Same goes for multiple writes (possibly due to setting bits):
> > combining multiple accesses into a single one is generally fine, it's
> > *adding* write accesses speculatively that is broken by design..
> 
> Agreed.  As Paul points out, this being correct assumes that there are
> no other ordering guarantees or memory accesses "interfering", but if
> the stores are to the same memory location and adjacent to each other in
> the program, then I don't see a reason why they wouldn't be combinable.
> 
> > At the same time, you can't combine atomic loads or stores infinitely
> > - "visibility" on a real machine definitely is about timeliness.
> > Removing all but the last write when there are multiple consecutive
> > writes is generally fine, even if you unroll a loop to generate those
> > writes. But if what remains is a loop, it might be a busy-loop
> > basically waiting for something, so it would be wrong ("untimely") to
> > hoist a store in a loop entirely past the end of the loop, or hoist a
> > load in a loop to before the loop.
> 
> Agreed.  That's what 1.10p24 and 1.10p25 are meant to specify for loads,
> although those might not be bullet-proof as Paul points out.  Forward
> progress is rather vaguely specified in the standard, but at least parts
> of the committee (and people in ISO C++ SG1, in particular) are working
> on trying to improve this.
> 
> > Does the standard allow for that kind of behavior?
> 
> I think the standard requires (or intends to require) the behavior that
> you (and I) seem to prefer in these examples.
> 
> 
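As a minimal sketch of the bit-testing example above, in C11 spellings
with invented names:

    #include <stdatomic.h>

    extern _Atomic unsigned int state;

    int both_bits_set(void)
    {
            /* As written: load, test one bit, load again, test another. */
            unsigned int a = atomic_load_explicit(&state, memory_order_relaxed);
            unsigned int b = atomic_load_explicit(&state, memory_order_relaxed);

            return (a & 0x1) && (b & 0x2);
    }

A compiler that merges the two relaxed loads into one behaves exactly
like a CPU satisfying both loads from the same cache line; no execution
on real hardware could tell the difference.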



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-12 Thread Peter Zijlstra
On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
> You need volatile semantics to force the compiler to ignore any proofs
> it might otherwise attempt to construct.  Hence all the ACCESS_ONCE()
> calls in my email to Torvald.  (Hopefully I translated your example
> reasonably.)

My brain gave out for today; but it did appear to have the right
structure.

I would prefer it if C11 would not require the volatile casts. It should
simply _never_ speculate with atomic writes, volatile or not.
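A minimal sketch (names invented) of the sort of transformation this
would outlaw: speculating the store and undoing it afterwards is
invisible to single-threaded code, but not to a concurrent reader.

    #include <stdatomic.h>

    _Atomic int x, y;

    void broken(void)
    {
            /* invalid "optimization" of: if (x) y = 3; */
            int old = atomic_load_explicit(&y, memory_order_relaxed);
            atomic_store_explicit(&y, 3, memory_order_relaxed);      /* speculative */
            if (!atomic_load_explicit(&x, memory_order_relaxed))
                    atomic_store_explicit(&y, old, memory_order_relaxed); /* undo */
    }

Another thread can observe y == 3 even though x was never set.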





Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-12 Thread Paul E. McKenney
On Tue, Feb 11, 2014 at 10:06:34PM -0800, Torvald Riegel wrote:
> On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote:
> > On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> > > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> > > >
> > > > Intuitively, this is wrong because this lets the program take a step
> > > > the abstract machine wouldn't do.  This is different to the sequential
> > > > code that Peter posted because it uses atomics, and thus one can't
> > > > easily assume that the difference is not observable.
> > > 
> > > Btw, what is the definition of "observable" for the atomics?
> > > 
> > > Because I'm hoping that it's not the same as for volatiles, where
> > > "observable" is about the virtual machine itself, and as such volatile
> > > accesses cannot be combined or optimized at all.
> > > 
> > > Now, I claim that atomic accesses cannot be done speculatively for
> > > writes, and not re-done for reads (because the value could change),
> > > but *combining* them would be possible and good.
> > > 
> > > For example, we often have multiple independent atomic accesses that
> > > could certainly be combined: testing the individual bits of an atomic
> > > value with helper functions, causing things like "load atomic, test
> > > bit, load same atomic, test another bit". The two atomic loads could
> > > be done as a single load without possibly changing semantics on a real
> > > machine, but if "visibility" is defined in the same way it is for
> > > "volatile", that wouldn't be a valid transformation. Right now we use
> > > "volatile" semantics for these kinds of things, and they really can
> > > hurt.
> > > 
> > > Same goes for multiple writes (possibly due to setting bits):
> > > combining multiple accesses into a single one is generally fine, it's
> > > *adding* write accesses speculatively that is broken by design..
> > > 
> > > At the same time, you can't combine atomic loads or stores infinitely
> > > - "visibility" on a real machine definitely is about timeliness.
> > > Removing all but the last write when there are multiple consecutive
> > > writes is generally fine, even if you unroll a loop to generate those
> > > writes. But if what remains is a loop, it might be a busy-loop
> > > basically waiting for something, so it would be wrong ("untimely") to
> > > hoist a store in a loop entirely past the end of the loop, or hoist a
> > > load in a loop to before the loop.
> > > 
> > > Does the standard allow for that kind of behavior?
> > 
> > You asked!  ;-)
> > 
> > So the current standard allows merging of both loads and stores, unless of
> course ordering constraints prevent the merging.  Volatile semantics may be
> > used to prevent this merging, if desired, for example, for real-time code.
> 
> Agreed.
> 
> > Infinite merging is intended to be prohibited, but I am not certain that
> > the current wording is bullet-proof (1.10p24 and 1.10p25).
> 
> Yeah, maybe not.  But it at least seems to rather clearly indicate the
> intent ;)

That is my hope.  ;-)

> > The only prohibition against speculative stores that I can see is in a
> > non-normative note, and it can be argued to apply only to things that are
> > not atomics (1.10p22).
> 
> I think this one is specifically about speculative stores that would
> affect memory locations that the abstract machine would not write to,
> and that might be observable or create data races.  While a compiler
> could potentially prove that such stores aren't leading to a difference
> in the behavior of the program (e.g., by proving that there are no
> observers anywhere and this isn't overlapping with any volatile
> locations), I think that this is hard in general and most compilers will
> just not do such things.  In GCC, bugs in that category were fixed after
> researchers doing fuzz-testing found them (IIRC, speculative stores by
> loops).

And that is my fear.  ;-)

> > I don't see any prohibition against reordering
> > a store to precede a load preceding a conditional branch -- which would
> not be speculative if the branch was known to be taken and the load
> > hit in the store buffer.  In a system where stores could be reordered,
> > some other CPU might perceive the store as happening before the load
> > that controlled the conditional branch.  This needs to be addressed.
> 
> I don't know the specifics of your example, but from how I understand
> it, I don't see a problem if the compiler can prove that the store will
> always happen.

The current Documentation/memory-barriers.txt formulation requires
that both the load and the store have volatile semantics.  Does
that help?

> To be more specific, if the compiler can prove that the store will
> happen anyway, and the region of code can be assumed to always run
> atomically (e.g., there's no loop or such in there), then it is known
> that we have one atomic region of code that will always perform the
> store, so we might as well do the stuff in the region in some order.

And it would be very hard to write a program that proved that the
store had been reordered prior to the load in this case.

> Now, if any of the memory accesses are atomic, then the whole region of
> code containing those accesses is often not atomic because other threads
> might observe intermediate results in a data-race-free way.
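The loop case mentioned earlier (the category of GCC bugs that
fuzz-testing turned up) can be sketched minimally as follows, with
invented names:

    void sum_into(const int *a, int n, int *out)
    {
            int sum = 0;
            int i;

            for (i = 0; i < n; i++)
                    sum += a[i];
            if (n > 0)
                    *out = sum;     /* abstract machine stores only if n > 0 */
    }

A transformation that writes *out unconditionally, for example
"*out = (n > 0) ? sum : *out;", invents a load and a store that the
program never had, and can create a data race when n == 0.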

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-12 Thread Paul E. McKenney
On Wed, Feb 12, 2014 at 10:19:07AM +0100, Peter Zijlstra wrote:
> > I don't know the specifics of your example, but from how I understand
> > it, I don't see a problem if the compiler can prove that the store will
> > always happen.
> > 
> > To be more specific, if the compiler can prove that the store will
> > happen anyway, and the region of code can be assumed to always run
> > atomically (e.g., there's no loop or such in there), then it is known
> > that we have one atomic region of code that will always perform the
> > store, so we might as well do the stuff in the region in some order.
> > 
> > Now, if any of the memory accesses are atomic, then the whole region of
> > code containing those accesses is often not atomic because other threads
> > might observe intermediate results in a data-race-free way.
> > 
> > (I know that this isn't a very precise formulation, but I hope it brings
> > my line of reasoning across.)
> 
> So given something like:
> 
>   if (x)
>       y = 3;
> 
> assuming both x and y are atomic (so don't gimme crap for now knowing
> the C11 atomic incantations); and you can prove x is always true; you
> don't see a problem with not emitting the conditional?

You need volatile semantics to force the compiler to ignore any proofs
it might otherwise attempt to construct.  Hence all the ACCESS_ONCE()
calls in my email to Torvald.  (Hopefully I translated your example
reasonably.)

Thanx, Paul

> Avoiding the conditional changes the result; see that control dependency
> email from earlier. In the above example the load of X and the store to
> Y are strictly ordered, due to control dependencies. Not emitting the
> condition and maybe not even emitting the load completely wrecks this.
> 
> It's therefore an invalid optimization to take out the conditional or
> speculate the store, since it takes out the dependency.
> 
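Presumably the ACCESS_ONCE() translation Paul refers to above looks
something like the following sketch (hypothetical, since the email
with the actual translation is not reproduced in this excerpt; it
assumes ACCESS_ONCE() as defined earlier in the thread):

    if (ACCESS_ONCE(x))
            ACCESS_ONCE(y) = 3;

The volatile accesses deny the compiler any proof that x is always
true, so both the load and the conditional store must be emitted.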



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-12 Thread Peter Zijlstra
> I don't know the specifics of your example, but from how I understand
> it, I don't see a problem if the compiler can prove that the store will
> always happen.
> 
> To be more specific, if the compiler can prove that the store will
> happen anyway, and the region of code can be assumed to always run
> atomically (e.g., there's no loop or such in there), then it is known
> that we have one atomic region of code that will always perform the
> store, so we might as well do the stuff in the region in some order.
> 
> Now, if any of the memory accesses are atomic, then the whole region of
> code containing those accesses is often not atomic because other threads
> might observe intermediate results in a data-race-free way.
> 
> (I know that this isn't a very precise formulation, but I hope it brings
> my line of reasoning across.)

So given something like:

if (x)
        y = 3;

assuming both x and y are atomic (so don't gimme crap for now knowing
the C11 atomic incantations); and you can prove x is always true; you
don't see a problem with not emitting the conditional?

Avoiding the conditional changes the result; see that control dependency
email from earlier. In the above example the load of X and the store to
Y are strictly ordered, due to control dependencies. Not emitting the
condition and maybe not even emitting the load completely wrecks this.

It's therefore an invalid optimization to take out the conditional or
speculate the store, since it takes out the dependency.
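A minimal sketch of the disputed transformation in C11 spellings, with
invented names:

    #include <stdatomic.h>

    extern _Atomic int x, y;

    void as_written(void)
    {
            if (atomic_load_explicit(&x, memory_order_relaxed))
                    atomic_store_explicit(&y, 3, memory_order_relaxed);
    }

    void as_transformed(void)       /* when the compiler can prove x != 0 */
    {
            atomic_store_explicit(&y, 3, memory_order_relaxed);
    }

The standard's as-if rule permits as_transformed(); the objection here
is that it discards the load-to-store ordering the kernel relies on.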


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-11 Thread Torvald Riegel
On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> > >
> > > Intuitively, this is wrong because this lets the program take a step
> > > the abstract machine wouldn't do.  This is different to the sequential
> > > code that Peter posted because it uses atomics, and thus one can't
> > > easily assume that the difference is not observable.
> > 
> > Btw, what is the definition of "observable" for the atomics?
> > 
> > Because I'm hoping that it's not the same as for volatiles, where
> > "observable" is about the virtual machine itself, and as such volatile
> > accesses cannot be combined or optimized at all.
> > 
> > Now, I claim that atomic accesses cannot be done speculatively for
> > writes, and not re-done for reads (because the value could change),
> > but *combining* them would be possible and good.
> > 
> > For example, we often have multiple independent atomic accesses that
> > could certainly be combined: testing the individual bits of an atomic
> > value with helper functions, causing things like "load atomic, test
> > bit, load same atomic, test another bit". The two atomic loads could
> > be done as a single load without possibly changing semantics on a real
> > machine, but if "visibility" is defined in the same way it is for
> > "volatile", that wouldn't be a valid transformation. Right now we use
> > "volatile" semantics for these kinds of things, and they really can
> > hurt.
> > 
> > Same goes for multiple writes (possibly due to setting bits):
> > combining multiple accesses into a single one is generally fine, it's
> > *adding* write accesses speculatively that is broken by design..
> > 
> > At the same time, you can't combine atomic loads or stores infinitely
> > - "visibility" on a real machine definitely is about timeliness.
> > Removing all but the last write when there are multiple consecutive
> > writes is generally fine, even if you unroll a loop to generate those
> > writes. But if what remains is a loop, it might be a busy-loop
> > basically waiting for something, so it would be wrong ("untimely") to
> > hoist a store in a loop entirely past the end of the loop, or hoist a
> > load in a loop to before the loop.
> > 
> > Does the standard allow for that kind of behavior?
> 
> You asked!  ;-)
> 
> So the current standard allows merging of both loads and stores, unless of
> course ordering constraints prevent the merging.  Volatile semantics may be
> used to prevent this merging, if desired, for example, for real-time code.

Agreed.

> Infinite merging is intended to be prohibited, but I am not certain that
> the current wording is bullet-proof (1.10p24 and 1.10p25).

Yeah, maybe not.  But it at least seems to rather clearly indicate the
intent ;)

> The only prohibition against speculative stores that I can see is in a
> non-normative note, and it can be argued to apply only to things that are
> not atomics (1.10p22).

I think this one is specifically about speculative stores that would
affect memory locations that the abstract machine would not write to,
and that might be observable or create data races.  While a compiler
could potentially prove that such stores aren't leading to a difference
in the behavior of the program (e.g., by proving that there are no
observers anywhere and this isn't overlapping with any volatile
locations), I think that this is hard in general and most compilers will
just not do such things.  In GCC, bugs in that category were fixed after
researchers doing fuzz-testing found them (IIRC, speculative stores by
loops).

> I don't see any prohibition against reordering
> a store to precede a load preceding a conditional branch -- which would
> not be speculative if the branch was know to be taken and the load
> hit in the store buffer.  In a system where stores could be reordered,
> some other CPU might perceive the store as happening before the load
> that controlled the conditional branch.  This needs to be addressed.

I don't know the specifics of your example, but from how I understand
it, I don't see a problem if the compiler can prove that the store will
always happen.

To be more specific, if the compiler can prove that the store will
happen anyway, and the region of code can be assumed to always run
atomically (e.g., there's no loop or such in there), then it is known
that we have one atomic region of code that will always perform the
store, so we might as well do the stuff in the region in some order.

Now, if any of the memory accesses are atomic, then the whole region of
code containing those accesses is often not atomic because other threads
might observe intermediate results in a data-race-free way.

(I know that this isn't a very precise formulation, but I hope it brings
my line of reasoning across.)
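
A concrete shape for that argument might be the following (my sketch,
not an example from the standard):

    #include <stdatomic.h>

    atomic_int x, y;

    void writer(void)
    {
            /* The store to y occurs on both paths, so the compiler
             * can prove it always executes... */
            if (atomic_load_explicit(&x, memory_order_relaxed))
                    atomic_store_explicit(&y, 1, memory_order_relaxed);
            else
                    atomic_store_explicit(&y, 1, memory_order_relaxed);

            /* ...and may hoist it above the load.  No store has been
             * invented; whether other threads may then observe the
             * store before the load is exactly the open question. */
    }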

> Why this hole?  At the time, the current formalizations of popular
> CPU architectures did not exist, and it was not clear that all popular
> hardware avoided speculative stores.

I'm not quite sure which hole you see 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-11 Thread Torvald Riegel
On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> >
> > Intuitively, this is wrong because this lets the program take a step
> > the abstract machine wouldn't do.  This is different to the sequential
> > code that Peter posted because it uses atomics, and thus one can't
> > easily assume that the difference is not observable.
> 
> Btw, what is the definition of "observable" for the atomics?
> 
> Because I'm hoping that it's not the same as for volatiles, where
> "observable" is about the virtual machine itself, and as such volatile
> accesses cannot be combined or optimized at all.

No, atomics aren't an observable behavior of the abstract machine
(unless they are volatile).  See 1.8p8 (citing the C++ standard).

> Now, I claim that atomic accesses cannot be done speculatively for
> writes, and not re-done for reads (because the value could change),

Agreed, unless the compiler can prove that this doesn't make a
difference in the program at hand and it's not volatile atomics.  In
general, that will be hard and thus won't happen often I suppose, but if
correctly proved it would fall under the as-if rule I think.

> but *combining* them would be possible and good.

Agreed.

> For example, we often have multiple independent atomic accesses that
> could certainly be combined: testing the individual bits of an atomic
> value with helper functions, causing things like "load atomic, test
> bit, load same atomic, test another bit". The two atomic loads could
> be done as a single load without possibly changing semantics on a real
> machine, but if "visibility" is defined in the same way it is for
> "volatile", that wouldn't be a valid transformation. Right now we use
> "volatile" semantics for these kinds of things, and they really can
> hurt.

Agreed.  In your example, the compiler would have to prove that the
abstract machine would always be able to run the two loads atomically
(ie, as one load) without running into impossible/disallowed behavior of
the program.  But if there's no loop or branch or such in-between, this
should be straight-forward because any hardware oddity or similar could
merge those loads and it wouldn't be disallowed by the standard
(considering that we're talking about a finite number of loads), so the
compiler would be allowed to do it as well.

> Same goes for multiple writes (possibly due to setting bits):
> combining multiple accesses into a single one is generally fine, it's
> *adding* write accesses speculatively that is broken by design..

Agreed.  As Paul points out, this being correct assumes that there are
no other ordering guarantees or memory accesses "interfering", but if
the stores are to the same memory location and adjacent to each other in
the program, then I don't see a reason why they wouldn't be combinable.
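
For example (again an illustrative sketch):

    #include <stdatomic.h>

    atomic_int x;

    void setter(void)
    {
            /* Two adjacent relaxed stores to the same location... */
            atomic_store_explicit(&x, 1, memory_order_relaxed);
            atomic_store_explicit(&x, 2, memory_order_relaxed);

            /* ...can be combined so that only the final value is ever
             * written; a write-combining store buffer could produce
             * the same result without the compiler's help. */
    }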

> At the same time, you can't combine atomic loads or stores infinitely
> - "visibility" on a real machine definitely is about timeliness.
> Removing all but the last write when there are multiple consecutive
> writes is generally fine, even if you unroll a loop to generate those
> writes. But if what remains is a loop, it might be a busy-loop
> basically waiting for something, so it would be wrong ("untimely") to
> hoist a store in a loop entirely past the end of the loop, or hoist a
> load in a loop to before the loop.

Agreed.  That's what 1.10p24 and 1.10p25 are meant to specify for loads,
although those might not be bullet-proof as Paul points out.  Forward
progress is rather vaguely specified in the standard, but at least parts
of the committee (and people in ISO C++ SG1, in particular) are working
on trying to improve this.
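
A spin-wait is the canonical case where unbounded merging would be
visible (my sketch):

    #include <stdatomic.h>

    atomic_int stop;

    void wait_for_stop(void)
    {
            /* Each iteration must perform a fresh load.  Merging the
             * unbounded sequence of loads into one load hoisted before
             * the loop would spin forever whenever stop starts out 0 --
             * the "untimely" behavior 1.10p24/p25 intend to rule out. */
            while (!atomic_load_explicit(&stop, memory_order_relaxed))
                    ;
    }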

> Does the standard allow for that kind of behavior?

I think the standard requires (or intends to require) the behavior that
you (and I) seem to prefer in these examples.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-11 Thread Torvald Riegel
On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > > > > > > There are also so many ways to blow your head off it's untrue. 
> > > > > > > > For example,
> > > > > > > > cmpxchg takes a separate memory model parameter for failure and 
> > > > > > > > success, but
> > > > > > > > then there are restrictions on the sets you can use for each. 
> > > > > > > > It's not hard
> > > > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". 
> > > > > > > > Then there's
> > > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely 
> > > > > > > > ignores consume
> > > > > > > > atm and optimises all of the data dependencies away) as well as 
> > > > > > > > the definition
> > > > > > > > of "data races", which seem to be used as an excuse to 
> > > > > > > > miscompile a program
> > > > > > > > at the earliest opportunity.
> > > > > > > 
> > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > > > memory_order_consume until the compilers implement it both 
> > > > > > > correctly and
> > > > > > > efficiently.  They are not there yet, and there is currently no 
> > > > > > > shortage
> > > > > > > of compiler writers who would prefer to ignore 
> > > > > > > memory_order_consume.
> > > > > > 
> > > > > > Do you have any input on
> > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, 
> > > > > > the
> > > > > > language standard's definition of dependencies?
> > > > > 
> > > > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > > > 
> > > > > — B is an invocation of any specialization of std::kill_dependency 
> > > > > (29.3), or
> > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or 
> > > > > logical OR (||, see 5.15) operator,
> > > > > or
> > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > > > 
> > > > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > > > after the "?" will carry a dependency, so the code fragment in 59448
> > > > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > > > of existence.  One way to do that on both ARM and Power is to actually
> > > > > emit code for "flag - flag", but there are a number of other ways to
> > > > > make that work.
> > > > 
> > > > And that's what would concern me, considering that these requirements
> > > > seem to be able to creep out easily.  Also, whereas the other atomics
> > > > just constrain compilers wrt. reordering across atomic accesses or
> > > > changes to the atomic accesses themselves, the dependencies are new
> > > > requirements on pieces of otherwise non-synchronizing code.  The latter
> > > > seems far more involved to me.
> > > 
> > > Well, the wording of 1.10p9 is pretty explicit on this point.
> > > There are only a few exceptions to the rule that dependencies from
> > > memory_order_consume loads must be tracked.  And to your point about
> > > requirements being placed on pieces of otherwise non-synchronizing code,
> > > we already have that with plain old load acquire and store release --
> > > both of these put ordering constraints that affect the surrounding
> > > non-synchronizing code.
> > 
> > I think there's a significant difference.  With acquire/release or more
> > general memory orders, it's true that we can't order _across_ the atomic
> > access.  However, we can reorder and optimize without additional
> > constraints if we do not reorder.  This is not the case with consume
> > memory order, as the (p + flag - flag) example shows.
> 
> Agreed, memory_order_consume does introduce additional restrictions.
> 
> > > This issue got a lot of discussion, and the compromise is that
> > > dependencies cannot leak into or out of functions unless the relevant
> > > parameters or return values are annotated with [[carries_dependency]].
> > > This means that the compiler can see all the places where dependencies
> > > must be tracked.  This is described in 7.6.4.
> > 
> > I wasn't aware of 7.6.4 (but it isn't referred to as an additional
> > constraint--what it is--in 1.10, so I guess at least that should be
> > fixed).
> > Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
> > difference, at least if one is assuming that normal optimization of
> > sequential code is the default, and that maintaining things such as
> > (flag-flag) is not; if optimizing away (flag-flag) would require the
> > insertion of fences unless there is the carries_dependency attribute,
> > then this would be bad I think.
> 
> No, the attribute does not make a semantic difference.  If a dependency
> flows into a function without [[carries_dependency]], the implementation
> is within its right to emit an acquire barrier or 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-11 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> >
> > Intuitively, this is wrong because this lets the program take a step
> > the abstract machine wouldn't do.  This is different to the sequential
> > code that Peter posted because it uses atomics, and thus one can't
> > easily assume that the difference is not observable.
> 
> Btw, what is the definition of "observable" for the atomics?
> 
> Because I'm hoping that it's not the same as for volatiles, where
> "observable" is about the virtual machine itself, and as such volatile
> accesses cannot be combined or optimized at all.
> 
> Now, I claim that atomic accesses cannot be done speculatively for
> writes, and not re-done for reads (because the value could change),
> but *combining* them would be possible and good.
> 
> For example, we often have multiple independent atomic accesses that
> could certainly be combined: testing the individual bits of an atomic
> value with helper functions, causing things like "load atomic, test
> bit, load same atomic, test another bit". The two atomic loads could
> be done as a single load without possibly changing semantics on a real
> machine, but if "visibility" is defined in the same way it is for
> "volatile", that wouldn't be a valid transformation. Right now we use
> "volatile" semantics for these kinds of things, and they really can
> hurt.
> 
> Same goes for multiple writes (possibly due to setting bits):
> combining multiple accesses into a single one is generally fine, it's
> *adding* write accesses speculatively that is broken by design..
> 
> At the same time, you can't combine atomic loads or stores infinitely
> - "visibility" on a real machine definitely is about timeliness.
> Removing all but the last write when there are multiple consecutive
> writes is generally fine, even if you unroll a loop to generate those
> writes. But if what remains is a loop, it might be a busy-loop
> basically waiting for something, so it would be wrong ("untimely") to
> hoist a store in a loop entirely past the end of the loop, or hoist a
> load in a loop to before the loop.
> 
> Does the standard allow for that kind of behavior?

You asked!  ;-)

So the current standard allows merging of both loads and stores, unless of
course ordering constraints prevent the merging.  Volatile semantics may be
used to prevent this merging, if desired, for example, for real-time code.
Infinite merging is intended to be prohibited, but I am not certain that
the current wording is bullet-proof (1.10p24 and 1.10p25).

The only prohibition against speculative stores that I can see is in a
non-normative note, and it can be argued to apply only to things that are
not atomics (1.10p22).  I don't see any prohibition against reordering
a store to precede a load preceding a conditional branch -- which would
not be speculative if the branch was know to be taken and the load
hit in the store buffer.  In a system where stores could be reordered,
some other CPU might perceive the store as happening before the load
that controlled the conditional branch.  This needs to be addressed.
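
As a litmus test, my reading of that scenario is roughly the following
(hypothetical example, both variables initially zero):

    #include <stdatomic.h>

    atomic_int x, y;

    void t1(void)
    {
            atomic_store_explicit(&x, 1, memory_order_relaxed);
            if (atomic_load_explicit(&x, memory_order_relaxed) == 1)
                    atomic_store_explicit(&y, 1, memory_order_relaxed);

            /* The load is satisfied from t1's own store (buffer), so
             * the branch is always taken and the store to y is not
             * speculative.  Nothing quoted above clearly forbids making
             * the store to y visible before the load of x completes,
             * which another CPU could then observe. */
    }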

Why this hole?  At the time, the current formalizations of popular
CPU architectures did not exist, and it was not clear that all popular
hardware avoided speculative stores.  

There is also fun with "out of thin air" values, which everyone agrees
should be prohibited, but where there is not agreement on how to prohibit
them in a mathematically constructive manner.  The current draft contains
a clause simply stating that out-of-thin-air values are prohibited,
which doesn't help someone constructing tools to analyze C++ code.
One proposal requires that subsequent atomic stores never be reordered
before prior atomic loads, which requires useless ordering code to be
emitted on ARM and PowerPC (you may have seen Will Deacon's and Peter
Zijlstra's reaction to this proposal a few days ago).  Note that Itanium
already pays this price in order to provide full single-variable cache
coherence.  This out-of-thin-air discussion is also happening in the
Java community in preparation for a new rev of the Java memory model.

There will also be some discussions on memory_order_consume, which is
intended to (eventually) implement rcu_dereference().  The compiler
writers don't like tracking dependencies, but there may be some ways
of constraining optimizations to preserve the common dependencies,
while providing some syntax to force preservation of dependencies that
would normally be optimized out.  One example of this is where you have an
RCU-protected array that might sometimes contain only a single element.
In the single-element case, the compiler knows a priori which element
will be used, and will therefore optimize the dependency away, so that
the reader might see pre-initialization state.  But this is rare, so
if syntax needs to be added in this case, I believe we should be
OK with it.  (If syntax is needed for plain old dereferences, it is 
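
A sketch of that single-element trap, with hypothetical names (my
illustration of the description above, not code from the thread):

    #include <stdatomic.h>

    struct elem { int val; };
    struct elem arr[1];     /* build where only one element exists */
    atomic_int idx;         /* index published by the updater */

    int reader(void)
    {
            int i = atomic_load_explicit(&idx, memory_order_consume);

            /* With one element the compiler can prove i == 0 and
             * rewrite arr[i] as arr[0], destroying the dependency that
             * consume ordering rides on; the reader may then see
             * pre-initialization contents of arr[0]. */
            return arr[i].val;
    }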

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Mon, Feb 10, 2014 at 04:08:11PM -0500, Chris Metcalf wrote:
> (+LKML again)
> 
> On 2/10/2014 3:57 PM, Peter Zijlstra wrote:
> > On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote:
> >> On 2/6/2014 8:52 AM, Peter Zijlstra wrote:
> >>> Its been compiled on everything I have a compiler for, however frv and
> >>> tile are missing because they're special and I was tired.
> >> So what's the specialness on tile?
> > It's not doing the atomic work in ASM but uses magic builtins or such.
> >
> > I got the list of magic funcs for tilegx, but didn't look into the 32bit
> > chips.
> 
> Oh, I see.  The  files on tile are already reasonably 
> well-factored.
> 
> It's possible you could do better, but I think not by too much, other than 
> possibly
> by using  for some of the common idioms like 
> "subtraction
> is addition with a negative second argument", etc., which hasn't been done 
> elsewhere.

The last patch 5/5 adds a few atomic ops; I could of course use
cmpxchg() loops for everything, but I found tilegx actually has fetch_or
and fetch_and to implement atomic_or() / atomic_and().

It doesn't have fetch_xor() from what I've been told so atomic_xor()
will have to become a cmpxchg() loop.
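
The fallback would presumably take the usual shape -- a sketch in
kernel style, not the actual tile code:

    /* Hypothetical fallback: atomic_xor() built from a cmpxchg() loop
     * for an architecture without a fetch_xor primitive. */
    static inline void atomic_xor(int i, atomic_t *v)
    {
            int old, new;

            do {
                    old = atomic_read(v);
                    new = old ^ i;
            } while (atomic_cmpxchg(v, old, new) != old);
    }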


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Chris Metcalf
(+LKML again)

On 2/10/2014 3:57 PM, Peter Zijlstra wrote:
> On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote:
>> On 2/6/2014 8:52 AM, Peter Zijlstra wrote:
> >>> It's been compiled on everything I have a compiler for, however frv and
>>> tile are missing because they're special and I was tired.
>> So what's the specialness on tile?
> It's not doing the atomic work in ASM but uses magic builtins or such.
>
> I got the list of magic funcs for tilegx, but didn't look into the 32bit
> chips.

Oh, I see.  The  files on tile are already reasonably 
well-factored.

It's possible you could do better, but I think not by too much, other than 
possibly
by using  for some of the common idioms like "subtraction
is addition with a negative second argument", etc., which hasn't been done 
elsewhere.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
>
> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do.  This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.

Btw, what is the definition of "observable" for the atomics?

Because I'm hoping that it's not the same as for volatiles, where
"observable" is about the virtual machine itself, and as such volatile
accesses cannot be combined or optimized at all.

Now, I claim that atomic accesses cannot be done speculatively for
writes, and not re-done for reads (because the value could change),
but *combining* them would be possible and good.

For example, we often have multiple independent atomic accesses that
could certainly be combined: testing the individual bits of an atomic
value with helper functions, causing things like "load atomic, test
bit, load same atomic, test another bit". The two atomic loads could
be done as a single load without possibly changing semantics on a real
machine, but if "visibility" is defined in the same way it is for
"volatile", that wouldn't be a valid transformation. Right now we use
"volatile" semantics for these kinds of things, and they really can
hurt.
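
Spelled out, the pattern looks something like this (names invented for
illustration):

    #include <stdatomic.h>

    atomic_uint flags;

    static int bit0_set(void)
    {
            return atomic_load_explicit(&flags, memory_order_relaxed) & 1u;
    }

    static int bit1_set(void)
    {
            return atomic_load_explicit(&flags, memory_order_relaxed) & 2u;
    }

    int both_set(void)
    {
            /* "load atomic, test bit, load same atomic, test another
             * bit": folding the two loads into one is indistinguishable
             * on real hardware, but volatile-style visibility rules
             * would forbid it. */
            return bit0_set() && bit1_set();
    }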

Same goes for multiple writes (possibly due to setting bits):
combining multiple accesses into a single one is generally fine, it's
*adding* write accesses speculatively that is broken by design..

At the same time, you can't combine atomic loads or stores infinitely
- "visibility" on a real machine definitely is about timeliness.
Removing all but the last write when there are multiple consecutive
writes is generally fine, even if you unroll a loop to generate those
writes. But if what remains is a loop, it might be a busy-loop
basically waiting for something, so it would be wrong ("untimely") to
hoist a store in a loop entirely past the end of the loop, or hoist a
load in a loop to before the loop.

Does the standard allow for that kind of behavior?

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Will Deacon
On Mon, Feb 10, 2014 at 03:04:43PM +, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 11:49:29AM +, Will Deacon wrote:
> > On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote:
> > > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > > speculative-store optimizations because it requires them to introduce
> > > > both control and data dependency tracking into their compilers.  Many of
> > > > them seem to hate dependency tracking with a purple passion.  At least,
> > > > such a hatred would go a long way towards explaining the incomplete
> > > > and high-overhead implementations of memory_order_consume, the long
> > > > and successful use of idioms based on the memory_order_consume pattern
> > > > notwithstanding [*].  ;-)
> > > 
> > > Just tell them that because the hardware provides control dependencies
> > > we actually use and rely on them.
> > 
> > s/control/address/ ?
> 
> Both are important, but as Peter's reply noted, it was control
> dependencies under discussion.  Data dependencies (which include the
> ARM/PowerPC notion of address dependencies) are called out by the standard
> already, but control dependencies are not.  I am not all that satisfied
> by current implementations of data dependencies, admittedly.  Should
> be an interesting discussion.  ;-)

Ok, but since you can't use control dependencies to order LOAD -> LOAD, it's
a pretty big ask of the compiler to make use of them for things like
consume, where a data dependency will suffice for any combination of
accesses.
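
In code, the asymmetry looks like this (my sketch):

    #include <stdatomic.h>

    struct foo { int val; };
    struct foo *_Atomic gp;
    atomic_int ready, ack;

    int addr_dep(void)
    {
            /* Address/data dependency: the second load's address comes
             * from the first load's result, so ARM and PowerPC order
             * the two loads (LOAD -> LOAD) for free. */
            struct foo *p = atomic_load_explicit(&gp, memory_order_consume);
            return p->val;
    }

    void ctrl_dep(void)
    {
            /* Control dependency: orders the load against the dependent
             * store (LOAD -> STORE), but gives no LOAD -> LOAD ordering. */
            if (atomic_load_explicit(&ready, memory_order_relaxed))
                    atomic_store_explicit(&ack, 1, memory_order_relaxed);
    }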

Will


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 11:49:29AM +, Will Deacon wrote:
> On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > speculative-store optimizations because it requires them to introduce
> > > both control and data dependency tracking into their compilers.  Many of
> > > them seem to hate dependency tracking with a purple passion.  At least,
> > > such a hatred would go a long way towards explaining the incomplete
> > > and high-overhead implementations of memory_order_consume, the long
> > > and successful use of idioms based on the memory_order_consume pattern
> > > notwithstanding [*].  ;-)
> > 
> > Just tell them that because the hardware provides control dependencies
> > we actually use and rely on them.
> 
> s/control/address/ ?

Both are important, but as Peter's reply noted, it was control
dependencies under discussion.  Data dependencies (which include the
ARM/PowerPC notion of address dependencies) are called out by the standard
already, but control dependencies are not.  I am not all that satisfied
by current implementations of data dependencies, admittedly.  Should
be an interesting discussion.  ;-)

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Mon, Feb 10, 2014 at 11:49:29AM +, Will Deacon wrote:
> On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > speculative-store optimizations because it requires them to introduce
> > > both control and data dependency tracking into their compilers.  Many of
> > > them seem to hate dependency tracking with a purple passion.  At least,
> > > such a hatred would go a long way towards explaining the incomplete
> > > and high-overhead implementations of memory_order_consume, the long
> > > and successful use of idioms based on the memory_order_consume pattern
> > > notwithstanding [*].  ;-)
> > 
> > Just tell them that because the hardware provides control dependencies
> > we actually use and rely on them.
> 
> s/control/address/ ?

Nope, control.

Since stores cannot be speculated and thus require linear control flow
history we can use it to order LOAD -> STORE when the LOAD is required
for the control flow decision and the STORE depends on the control flow
path.

Also see commit 18c03c61444a211237f3d4782353cb38dba795df to
Documentation/memory-barriers.txt

---
commit c7f2e3cd6c1f4932ccc4135d050eae3f7c7aef63
Author: Peter Zijlstra 
Date:   Mon Nov 25 11:49:10 2013 +0100

perf: Optimize ring-buffer write by depending on control dependencies

Remove a full barrier from the ring-buffer write path by relying on
a control dependency to order a LOAD -> STORE scenario.

Cc: "Paul E. McKenney" 
Signed-off-by: Peter Zijlstra 
Link: http://lkml.kernel.org/n/tip-8alv40z6ikk57jzbaobnx...@git.kernel.org
Signed-off-by: Ingo Molnar 

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index e8b168af135b..146a5792b1d2 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -61,19 +61,20 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 *
 *   kernel user
 *
-*   READ ->data_tail   READ ->data_head
-*   smp_mb()   (A) smp_rmb()   (C)
-*   WRITE $dataREAD $data
-*   smp_wmb()  (B) smp_mb()(D)
-*   STORE ->data_head  WRITE ->data_tail
+*   if (LOAD ->data_tail) {LOAD ->data_head
+*  (A) smp_rmb()   (C)
+*  STORE $data LOAD $data
+*  smp_wmb()   (B) smp_mb()(D)
+*  STORE ->data_head   STORE ->data_tail
+*   }
 *
 * Where A pairs with D, and B pairs with C.
 *
-* I don't think A needs to be a full barrier because we won't in fact
-* write data until we see the store from userspace. So we simply don't
-* issue the data WRITE until we observe it. Be conservative for now.
+* In our case (A) is a control dependency that separates the load of
+* the ->data_tail and the stores of $data. In case ->data_tail
+* indicates there is no room in the buffer to store $data we do not.
 *
-* OTOH, D needs to be a full barrier since it separates the data READ
+* D needs to be a full barrier since it separates the data READ
 * from the tail WRITE.
 *
 * For B a WMB is sufficient since it separates two WRITEs, and for C
@@ -81,7 +82,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 *
 * See perf_output_begin().
 */
-   smp_wmb();
+   smp_wmb(); /* B, matches C */
rb->user_page->data_head = head;
 
/*
@@ -144,17 +145,26 @@ int perf_output_begin(struct perf_output_handle *handle,
if (!rb->overwrite &&
unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
goto fail;
+
+   /*
+* The above forms a control dependency barrier separating the
+* @tail load above from the data stores below. Since the @tail
+* load is required to compute the branch to fail below.
+*
+* A, matches D; the full memory barrier userspace SHOULD issue
+* after reading the data and before storing the new tail
+* position.
+*
+* See perf_output_put_handle().
+*/
+
head += size;
} while (local_cmpxchg(&rb->head, offset, head) != offset);
 
/*
-* Separate the userpage->tail read from the data stores below.
-* Matches the MB userspace SHOULD issue after reading the data
-* and before storing the new tail position.
-*
-* See perf_output_put_handle().
+* We rely on 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Will Deacon
On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote:
> On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > As near as I can tell, compiler writers hate the idea of prohibiting
> > speculative-store optimizations because it requires them to introduce
> > both control and data dependency tracking into their compilers.  Many of
> > them seem to hate dependency tracking with a purple passion.  At least,
> > such a hatred would go a long way towards explaining the incomplete
> > and high-overhead implementations of memory_order_consume, the long
> > and successful use of idioms based on the memory_order_consume pattern
> > notwithstanding [*].  ;-)
> 
> Just tell them that because the hardware provides control dependencies
> we actually use and rely on them.

s/control/address/ ?

Will


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> As near as I can tell, compiler writers hate the idea of prohibiting
> speculative-store optimizations because it requires them to introduce
> both control and data dependency tracking into their compilers.  Many of
> them seem to hate dependency tracking with a purple passion.  At least,
> such a hatred would go a long way towards explaining the incomplete
> and high-overhead implementations of memory_order_consume, the long
> and successful use of idioms based on the memory_order_consume pattern
> notwithstanding [*].  ;-)

Just tell them that because the hardware provides control dependencies
we actually use and rely on them.

Not that I expect they care too much what we do, given the current state
of things.


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
> > Initial state: x == y == 0
> > 
> > T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);
> >     atomic_store_explicit(&y, 42, memory_order_relaxed);
> >     if (r1 != 42)
> >             atomic_store_explicit(&y, r1, memory_order_relaxed);
> > 
> > T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
> >     atomic_store_explicit(&x, r2, memory_order_relaxed);
> 
> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do.  This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.

Yeah, my bad for not being familiar with the atrocious crap C11 made of
atomics :/

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
  Initial state: x == y == 0
  
  T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
  atomic_store_explicit(42, y, memory_order_relaxed);
  if (r1 != 42)
  atomic_store_explicit(r1, y, memory_order_relaxed);
  
  T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
  atomic_store_explicit(r2, x, memory_order_relaxed);
 
 Intuitively, this is wrong because this let's the program take a step
 the abstract machine wouldn't do.  This is different to the sequential
 code that Peter posted because it uses atomics, and thus one can't
 easily assume that the difference is not observable.

Yeah, my bad for not being familiar with the atrocious crap C11 made of
atomics :/

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
 As near as I can tell, compiler writers hate the idea of prohibiting
 speculative-store optimizations because it requires them to introduce
 both control and data dependency tracking into their compilers.  Many of
 them seem to hate dependency tracking with a purple passion.  At least,
 such a hatred would go a long way towards explaining the incomplete
 and high-overhead implementations of memory_order_consume, the long
 and successful use of idioms based on the memory_order_consume pattern
 notwithstanding [*].  ;-)

Just tell them that because the hardware provides control dependencies
we actually use and rely on them.

Not that I expect they care too much what we do, given the current state
of things.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Will Deacon
On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote:
 On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
  As near as I can tell, compiler writers hate the idea of prohibiting
  speculative-store optimizations because it requires them to introduce
  both control and data dependency tracking into their compilers.  Many of
  them seem to hate dependency tracking with a purple passion.  At least,
  such a hatred would go a long way towards explaining the incomplete
  and high-overhead implementations of memory_order_consume, the long
  and successful use of idioms based on the memory_order_consume pattern
  notwithstanding [*].  ;-)
 
 Just tell them that because the hardware provides control dependencies
 we actually use and rely on them.

s/control/address/ ?

Will
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote:
> On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > speculative-store optimizations because it requires them to introduce
> > > both control and data dependency tracking into their compilers.  Many of
> > > them seem to hate dependency tracking with a purple passion.  At least,
> > > such a hatred would go a long way towards explaining the incomplete
> > > and high-overhead implementations of memory_order_consume, the long
> > > and successful use of idioms based on the memory_order_consume pattern
> > > notwithstanding [*].  ;-)
> >
> > Just tell them that because the hardware provides control dependencies
> > we actually use and rely on them.
>
> s/control/address/ ?

Nope, control.

Since stores cannot be speculated and thus require linear control flow
history we can use it to order LOAD -> STORE when the LOAD is required
for the control flow decision and the STORE depends on the control flow
path.

Also see commit 18c03c61444a211237f3d4782353cb38dba795df to
Documentation/memory-barriers.txt
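
The shape of the pattern, as a minimal sketch (variable names made up):

    /*
     * The store is conditional on the loaded value, so the CPU cannot
     * commit the STORE before the LOAD has resolved: the control
     * dependency orders LOAD -> STORE without any explicit barrier.
     */
    if (ACCESS_ONCE(tail) != head)      /* LOAD */
        buf[head] = data;               /* STORE, ordered after the LOAD */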

---
commit c7f2e3cd6c1f4932ccc4135d050eae3f7c7aef63
Author: Peter Zijlstra <pet...@infradead.org>
Date:   Mon Nov 25 11:49:10 2013 +0100

    perf: Optimize ring-buffer write by depending on control dependencies

    Remove a full barrier from the ring-buffer write path by relying on
    a control dependency to order a LOAD -> STORE scenario.

    Cc: Paul E. McKenney <paul...@us.ibm.com>
    Signed-off-by: Peter Zijlstra <pet...@infradead.org>
    Link: http://lkml.kernel.org/n/tip-8alv40z6ikk57jzbaobnx...@git.kernel.org
    Signed-off-by: Ingo Molnar <mi...@kernel.org>

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index e8b168af135b..146a5792b1d2 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -61,19 +61,20 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
  *
  *   kernel                             user
  *
- *   READ ->data_tail                   READ ->data_head
- *   smp_mb()   (A)                     smp_rmb()   (C)
- *   WRITE $data                        READ $data
- *   smp_wmb()  (B)                     smp_mb()    (D)
- *   STORE ->data_head                  WRITE ->data_tail
+ *   if (LOAD ->data_tail) {            LOAD ->data_head
+ *              (A)                     smp_rmb()   (C)
+ *      STORE $data                     LOAD $data
+ *      smp_wmb()   (B)                 smp_mb()    (D)
+ *      STORE ->data_head               STORE ->data_tail
+ *   }
  *
  * Where A pairs with D, and B pairs with C.
  *
- * I don't think A needs to be a full barrier because we won't in fact
- * write data until we see the store from userspace. So we simply don't
- * issue the data WRITE until we observe it. Be conservative for now.
+ * In our case (A) is a control dependency that separates the load of
+ * the ->data_tail and the stores of $data. In case ->data_tail
+ * indicates there is no room in the buffer to store $data we do not.
  *
- * OTOH, D needs to be a full barrier since it separates the data READ
+ * D needs to be a full barrier since it separates the data READ
  * from the tail WRITE.
  *
  * For B a WMB is sufficient since it separates two WRITEs, and for C
@@ -81,7 +82,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
  *
  * See perf_output_begin().
  */
- smp_wmb();
+ smp_wmb(); /* B, matches C */
  rb->user_page->data_head = head;

 /*
@@ -144,17 +145,26 @@ int perf_output_begin(struct perf_output_handle *handle,
 if (!rb->overwrite &&
     unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
         goto fail;
+
+ /*
+  * The above forms a control dependency barrier separating the
+  * @tail load above from the data stores below. Since the @tail
+  * load is required to compute the branch to fail below.
+  *
+  * A, matches D; the full memory barrier userspace SHOULD issue
+  * after reading the data and before storing the new tail
+  * position.
+  *
+  * See perf_output_put_handle().
+  */
+
 head += size;
 } while (local_cmpxchg(&rb->head, offset, head) != offset);

 /*
- * Separate the userpage->tail read from the data stores below.
- * Matches the MB userspace SHOULD issue after reading the data
- * and before storing the new tail position.
- *
- * See perf_output_put_handle().

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote:
> On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > speculative-store optimizations because it requires them to introduce
> > > both control and data dependency tracking into their compilers.  Many of
> > > them seem to hate dependency tracking with a purple passion.  At least,
> > > such a hatred would go a long way towards explaining the incomplete
> > > and high-overhead implementations of memory_order_consume, the long
> > > and successful use of idioms based on the memory_order_consume pattern
> > > notwithstanding [*].  ;-)
> >
> > Just tell them that because the hardware provides control dependencies
> > we actually use and rely on them.
>
> s/control/address/ ?

Both are important, but as Peter's reply noted, it was control
dependencies under discussion.  Data dependencies (which include the
ARM/PowerPC notion of address dependencies) are called out by the standard
already, but control dependencies are not.  I am not all that satisfied
by current implementations of data dependencies, admittedly.  Should
be an interesting discussion.  ;-)

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Will Deacon
On Mon, Feb 10, 2014 at 03:04:43PM +0000, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote:
> > On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote:
> > > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > > speculative-store optimizations because it requires them to introduce
> > > > both control and data dependency tracking into their compilers.  Many of
> > > > them seem to hate dependency tracking with a purple passion.  At least,
> > > > such a hatred would go a long way towards explaining the incomplete
> > > > and high-overhead implementations of memory_order_consume, the long
> > > > and successful use of idioms based on the memory_order_consume pattern
> > > > notwithstanding [*].  ;-)
> > >
> > > Just tell them that because the hardware provides control dependencies
> > > we actually use and rely on them.
> >
> > s/control/address/ ?
>
> Both are important, but as Peter's reply noted, it was control
> dependencies under discussion.  Data dependencies (which include the
> ARM/PowerPC notion of address dependencies) are called out by the standard
> already, but control dependencies are not.  I am not all that satisfied
> by current implementations of data dependencies, admittedly.  Should
> be an interesting discussion.  ;-)

Ok, but since you can't use control dependencies to order LOAD -> LOAD,
it's a pretty big ask of the compiler to make use of them for things like
consume, where a data dependency will suffice for any combination of
accesses.
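
That is, roughly (a sketch with hypothetical variables):

    /* Address dependency: the second load's address comes from the
     * first load's value, so LOAD -> LOAD order is preserved. */
    p = rcu_dereference(gp);
    r = p->a;

    /* Control dependency: orders LOAD -> STORE only; the load below
     * may still be speculated before the branch resolves. */
    if (ACCESS_ONCE(flag))
        r = ACCESS_ONCE(data);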

Will


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <trie...@redhat.com> wrote:

> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do.  This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.

Btw, what is the definition of "observable" for the atomics?

Because I'm hoping that it's not the same as for volatiles, where
"observable" is about the virtual machine itself, and as such volatile
accesses cannot be combined or optimized at all.

Now, I claim that atomic accesses cannot be done speculatively for
writes, and not re-done for reads (because the value could change),
but *combining* them would be possible and good.

For example, we often have multiple independent atomic accesses that
could certainly be combined: testing the individual bits of an atomic
value with helper functions, causing things like "load atomic, test
bit, load same atomic, test another bit". The two atomic loads could
be done as a single load without possibly changing semantics on a real
machine, but if visibility is defined in the same way it is for
volatile, that wouldn't be a valid transformation. Right now we use
volatile semantics for these kinds of things, and they really can
hurt.
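
For instance (a sketch, with made-up helpers):

    /* Two bit tests, each doing its own atomic load ... */
    if (atomic_read(&v) & BIT_A)
        do_a();
    if (atomic_read(&v) & BIT_B)
        do_b();

    /* ... which could validly be combined into a single load: */
    tmp = atomic_read(&v);
    if (tmp & BIT_A)
        do_a();
    if (tmp & BIT_B)
        do_b();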

Same goes for multiple writes (possibly due to setting bits):
combining multiple accesses into a single one is generally fine, it's
*adding* write accesses speculatively that is broken by design.

At the same time, you can't combine atomic loads or stores infinitely
- visibility on a real machine definitely is about timeliness.
Removing all but the last write when there are multiple consecutive
writes is generally fine, even if you unroll a loop to generate those
writes. But if what remains is a loop, it might be a busy-loop
basically waiting for something, so it would be wrong (untimely) to
hoist a store in a loop entirely past the end of the loop, or hoist a
load in a loop to before the loop.
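
For example (a sketch):

    /* The load must stay inside the loop ... */
    while (!atomic_read(&ready))
        cpu_relax();

    /* ... because hoisting it out, as below, would be untimely and
     * wrong: the loop could never observe the other thread's store. */
    tmp = atomic_read(&ready);
    while (!tmp)
        cpu_relax();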

Does the standard allow for that kind of behavior?

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Chris Metcalf
(+LKML again)

On 2/10/2014 3:57 PM, Peter Zijlstra wrote:
> On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote:
> > On 2/6/2014 8:52 AM, Peter Zijlstra wrote:
> > > Its been compiled on everything I have a compiler for, however frv and
> > > tile are missing because they're special and I was tired.
> > So what's the specialness on tile?
> Its not doing the atomic work in ASM but uses magic builtins or such.
>
> I got the list of magic funcs for tilegx, but didn't look into the 32bit
> chips.

Oh, I see.  The asm/atomic.h files on tile are already reasonably
well-factored.

It's possible you could do better, but I think not by too much, other
than possibly by using asm-generic/atomic.h for some of the common
idioms like "subtraction is addition with a negative second argument",
etc., which hasn't been done elsewhere.
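
That idiom looks roughly like this (a sketch of the generic pattern):

    /* Subtraction expressed as addition of the negated argument. */
    static inline void atomic_sub(int i, atomic_t *v)
    {
        atomic_add(-i, v);
    }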

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Peter Zijlstra
On Mon, Feb 10, 2014 at 04:08:11PM -0500, Chris Metcalf wrote:
> (+LKML again)
>
> On 2/10/2014 3:57 PM, Peter Zijlstra wrote:
> > On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote:
> > > On 2/6/2014 8:52 AM, Peter Zijlstra wrote:
> > > > Its been compiled on everything I have a compiler for, however frv and
> > > > tile are missing because they're special and I was tired.
> > > So what's the specialness on tile?
> > Its not doing the atomic work in ASM but uses magic builtins or such.
> >
> > I got the list of magic funcs for tilegx, but didn't look into the 32bit
> > chips.
>
> Oh, I see.  The asm/atomic.h files on tile are already reasonably
> well-factored.
>
> It's possible you could do better, but I think not by too much, other
> than possibly by using asm-generic/atomic.h for some of the common
> idioms like "subtraction is addition with a negative second argument",
> etc., which hasn't been done elsewhere.

The last patch 5/5 adds a few atomic ops; I could of course use
cmpxchg() loops for everything, but I found tilegx actually has fetch_or
and fetch_and to implement atomic_or() / atomic_and().

It doesn't have fetch_xor() from what I've been told so atomic_xor()
will have to become a cmpxchg() loop.
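
Roughly like this (a sketch of the usual fallback, not the actual patch):

    static inline void atomic_xor(int i, atomic_t *v)
    {
        int old, new;

        /* Retry until no other CPU modified v between the read
         * and the compare-and-exchange. */
        do {
            old = atomic_read(v);
            new = old ^ i;
        } while (atomic_cmpxchg(v, old, new) != old);
    }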


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote:
> > > > > > > There are also so many ways to blow your head off it's untrue. 
> > > > > > > For example,
> > > > > > > cmpxchg takes a separate memory model parameter for failure and 
> > > > > > > success, but
> > > > > > > then there are restrictions on the sets you can use for each. 
> > > > > > > It's not hard
> > > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". 
> > > > > > > Then there's
> > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely 
> > > > > > > ignores consume
> > > > > > > atm and optimises all of the data dependencies away) as well as 
> > > > > > > the definition
> > > > > > > of "data races", which seem to be used as an excuse to miscompile 
> > > > > > > a program
> > > > > > > at the earliest opportunity.
> > > > > > 
> > > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > > memory_order_consume until the compilers implement it both 
> > > > > > correctly and
> > > > > > efficiently.  They are not there yet, and there is currently no 
> > > > > > shortage
> > > > > > of compiler writers who would prefer to ignore memory_order_consume.
> > > > > 
> > > > > Do you have any input on
> > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > > > > language standard's definition of dependencies?
> > > > 
> > > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > > 
> > > > — B is an invocation of any specialization of std::kill_dependency 
> > > > (29.3), or
> > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or 
> > > > logical OR (||, see 5.15) operator,
> > > > or
> > > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > > 
> > > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > > after the "?" will carry a dependency, so the code fragment in 59448
> > > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > > of existence.  One way to do that on both ARM and Power is to actually
> > > > emit code for "flag - flag", but there are a number of other ways to
> > > > make that work.
> > > 
> > > And that's what would concern me, considering that these requirements
> > > seem to be able to creep out easily.  Also, whereas the other atomics
> > > just constrain compilers wrt. reordering across atomic accesses or
> > > changes to the atomic accesses themselves, the dependencies are new
> > > requirements on pieces of otherwise non-synchronizing code.  The latter
> > > seems far more involved to me.
> > 
> > Well, the wording of 1.10p9 is pretty explicit on this point.
> > There are only a few exceptions to the rule that dependencies from
> > memory_order_consume loads must be tracked.  And to your point about
> > requirements being placed on pieces of otherwise non-synchronizing code,
> > we already have that with plain old load acquire and store release --
> > both of these put ordering constraints that affect the surrounding
> > non-synchronizing code.
> 
> I think there's a significant difference.  With acquire/release or more
> general memory orders, it's true that we can't order _across_ the atomic
> access.  However, we can reorder and optimize without additional
> constraints if we do not reorder.  This is not the case with consume
> memory order, as the (p + flag - flag) example shows.

Agreed, memory_order_consume does introduce additional restrictions.
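
To make that concrete, the kind of expression at issue looks roughly
like this (a sketch, not the exact testcase from bug 59448; variable
names are made up):

    int flag = atomic_load_explicit(&f, memory_order_consume);
    /* Per 1.10p9, "flag" as the left operand of "?:" carries no
     * dependency, but "flag - flag" in the selected arm does, even
     * though its value is always zero, so the compiler may not simply
     * fold it away without otherwise providing the ordering. */
    r = *(flag ? p : p + flag - flag);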

> > This issue got a lot of discussion, and the compromise is that
> > dependencies cannot leak into or out of functions unless the relevant
> > parameters or return values are annotated with [[carries_dependency]].
> > This means that the compiler can see all the places where dependencies
> > must be tracked.  This is described in 7.6.4.
> 
> I wasn't aware of 7.6.4 (but it isn't referred to as an additional
> constraint--what it is--in 1.10, so I guess at least that should be
> fixed).
> Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
> difference, at least if one is assuming that normal optimization of
> sequential code is the default, and that maintaining things such as
> (flag-flag) is not; if optimizing away (flag-flag) would require the
> insertion of fences unless there is the carries_dependency attribute,
> then this would be bad I think.

No, the attribute does not 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
> On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:

[ . . . ]

> > And then it is a short and uncontroversial step to the following:
> > 
> > Initial state: x == y == 0
> > 
> > T1: atomic_store_explicit(42, y, memory_order_relaxed);
> > r1 = atomic_load_explicit(x, memory_order_relaxed);
> > if (r1 != 42)
> > atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> > atomic_store_explicit(r2, x, memory_order_relaxed);
> > 
> > This can of course result in r1 == r2 == 42, even though the constant
> > 42 never appeared in the original code.  This is one way to generate
> > an out-of-thin-air value.
> > 
> > As near as I can tell, compiler writers hate the idea of prohibiting
> > speculative-store optimizations because it requires them to introduce
> > both control and data dependency tracking into their compilers.
> 
> I wouldn't characterize the situation like this (although I can't speak
> for others, obviously).  IMHO, it's perfectly fine on sequential /
> non-synchronizing code, because we know the difference isn't observable
> by a correct program.  For synchronizing code, compilers just shouldn't
> do it, or they would have to truly prove that speculation is harmless.
> That will be hard, so I think it should just be avoided.
> 
> Synchronization code will likely have been tuned anyway (especially if
> it uses relaxed MO), so I don't see a large need for trying to optimize
> using speculative atomic stores.
> 
> Thus, I think there's an easy and practical solution.

I like this approach, but there has been resistance to it in the past.
Definitely worth a good try, though!

> > Many of
> > them seem to hate dependency tracking with a purple passion.  At least,
> > such a hatred would go a long way towards explaining the incomplete
> > and high-overhead implementations of memory_order_consume, the long
> > and successful use of idioms based on the memory_order_consume pattern
> > notwithstanding [*].  ;-)
> 
> I still think that's different because it blurs the difference between
> sequential code and synchronizing code (ie, atomic accesses).  With
> consume MO, the simple solution above doesn't work anymore, because
> suddenly synchronizing code does affect optimizations in sequential
> code, even if that wouldn't reorder across the synchronizing code (which
> would be clearly "visible" to the implementation of the optimization).

I understand that memory_order_consume is a bit harder on compiler
writers than the other memory orders, but it is also pretty valuable.
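
The pattern referred to, in its usual RCU form (a sketch):

    /* Publisher: initialize, then publish with rcu_assign_pointer(). */
    p->data = 42;
    rcu_assign_pointer(gp, p);

    /* Reader: rcu_dereference() is the memory_order_consume-like load;
     * the dereference is ordered by the address dependency. */
    rcu_read_lock();
    q = rcu_dereference(gp);
    if (q)
        val = q->data;
    rcu_read_unlock();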

Thanx, Paul



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
> On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:
> > > Hi Paul,
> > > 
> > > > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > > 
> > > > > Yes, absolutely shoot store speculation in the head already. Then 
> > > > > drive
> > > > > a wooden stake through its heart.
> > > > > 
> > > > > C11/C++11 should not be allowed to claim itself a memory model until 
> > > > > that
> > > > > is sorted.
> > > > 
> > > > There actually is a proposal being put forward, but it might not make 
> > > > ARM
> > > > and Power people happy because it involves adding a compare, a branch,
> > > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > > much preferring the no-store-speculation approach.
> > > 
> > > Can you elaborate a bit on this please? We don't permit speculative stores
> > > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > > emit any additional instructions to prevent that from happening.
> > 
> > Requiring a compare/branch/ISB after each relaxed load enables a simple(r)
> > proof that out-of-thin-air values cannot be observed in the face of any
> > compiler optimization that refrains from reordering a prior relaxed load
> > with a subsequent relaxed store.
> > 
> > > Stores can, of course, be observed out-of-order but that's a lot more
> > > reasonable :)
> > 
> > So let me try an example.  I am sure that Torvald Riegel will jump in
> > with any needed corrections or amplifications:
> > 
> > Initial state: x == y == 0
> > 
> > T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
> > atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> > atomic_store_explicit(r2, x, memory_order_relaxed);
> > 
> > One would intuitively expect r1 == r2 == 0 as the only possible outcome.
> > But suppose that the compiler used specialization optimizations, as it
> > would if there was a function that has a very lightweight implementation
> > for some values and a very heavyweight one for other.  In particular,
> > suppose that the lightweight implementation was for the value 42.
> > Then the compiler might do something like the following:
> > 
> > Initial state: x == y == 0
> > 
> > T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
> > if (r1 == 42)
> > atomic_store_explicit(42, y, memory_order_relaxed);
> > else
> > atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> > atomic_store_explicit(r2, x, memory_order_relaxed);
> > 
> > Suddenly we have an explicit constant 42 showing up.  Of course, if
> > the compiler carefully avoided speculative stores (as both Peter and
> > I believe that it should if its code generation is to be regarded as
> > anything other than an act of vandalism, the words in the standard
> > notwithstanding), there would be no problem.  But currently, a number
> > of compiler writers see absolutely nothing wrong with transforming
> > the optimized-for-42 version above with something like this:
> > 
> > Initial state: x == y == 0
> > 
> > T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
> > atomic_store_explicit(42, y, memory_order_relaxed);
> > if (r1 != 42)
> > atomic_store_explicit(r1, y, memory_order_relaxed);
> > 
> > T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> > atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do.  This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.
> 
> For this to be correct, the compiler would actually have to prove that
> the speculative store is "as-if correct", which in turn would mean that
> it needs to be aware of all potential observers, and check whether those
> observers aren't actually affected by the speculative store.
> 
> I would guess that the compilers you have in mind don't really do that.
> If they do, then I don't see why this should be okay, unless you think
> out-of-thin-air values are something good (which I wouldn't agree with).

OK, we agree that pulling the atomic store to y out of its "if" statement
is a bad thing.  Very good!  Now we just have to convince others on
the committee.  ;-)

Thanx, Paul


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 5:46 PM, Torvald Riegel  wrote:
>
> IOW, I wrote that such a compiler transformation would be wrong in my
> opinion.  Thus, it should *not* return 42.

Ahh, I am happy to have misunderstood. The "intuitively" threw me,
because I thought that was building up to a "but", and misread the
rest.

I then react strongly, because I've seen so much total crap (the
type-based C aliasing rules topping my list) etc coming out of
standards groups because it allows them to generate wrong code that
goes faster, that I just assume compiler people are out to do stupid
things in the name of "..but the standard allows it".

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Sun, 2014-02-09 at 17:24 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel  wrote:
> >
> > (a) seems to say that you don't like requiring programmers to mark
> > atomic accesses specially.  Is that the case?
> 
> In Paul's example, they were marked specially.
> 
> And you seemed to argue that Paul's example could possibly return
> anything but 0/0.

Just read my reply to Paul again.  Here's an excerpt:

> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   atomic_store_explicit(42, y, memory_order_relaxed);
>   if (r1 != 42)
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);

Intuitively, this is wrong because this lets the program take a step
the abstract machine wouldn't do.  This is different to the sequential
code that Peter posted because it uses atomics, and thus one can't
easily assume that the difference is not observable.


IOW, I wrote that such a compiler transformation would be wrong in my
opinion.  Thus, it should *not* return 42.
If you still see how what I wrote could be misunderstood, please let me
know because I really don't see it.

Yes, I don't do so by swearing or calling others insane, and I try to see
the reasoning that those compilers' authors might have had, even if I
don't agree with it.  In my personal opinion, that's a good thing.

> If you really think that, I hope to God that you have nothing to do
> with the C standard or any actual compiler I ever use.
> 
> Because such a standard or compiler would be shit. It's sadly not too 
> uncommon.

Thanks for the kind words.  Perhaps writing that was quicker than
reading what I actually wrote.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel  wrote:
>
> (a) seems to say that you don't like requiring programmers to mark
> atomic accesses specially.  Is that the case?

In Paul's example, they were marked specially.

And you seemed to argue that Paul's example could possibly return
anything but 0/0.

If you really think that, I hope to God that you have nothing to do
with the C standard or any actual compiler I ever use.

Because such a standard or compiler would be shit. It's sadly not too uncommon.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Sun, 2014-02-09 at 16:56 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
> >
> > I wouldn't characterize the situation like this (although I can't speak
> > for others, obviously).  IMHO, it's perfectly fine on sequential /
> > non-synchronizing code, because we know the difference isn't observable
> > by a correct program.
> 
> What BS is that? If you use an "atomic_store_explicit()", by
> definition you're either
> 
>  (a) f*cking insane
>  (b) not doing sequential non-synchronizing code
> 
> and a compiler that assumes that the programmer is insane may actually
> be correct more often than not, but it's still a shit compiler.
> Agreed?

Due to all the expletives, I can't really understand what you are
saying.  Nor does what I guess it might mean seem to relate to what I
said.

(a) seems to say that you don't like requiring programmers to mark
atomic accesses specially.  Is that the case?

If so, I have to disagree.  If you're writing concurrent code, marking
the atomic accesses is the least of your problem.  Instead, for the
concurrent code I've written, it rather improved readability; it made
code more verbose, but in turn it made the synchronization easier to
see.

Beside this question of taste (and I don't care what the Kernel style
guides are), there is a real technical argument here, though:  Requiring
all synchronizing memory accesses to be annotated makes the compiler
aware what is sequential code, and what is not.  Without it, one can't
exploit data-race-freedom.  So unless we want to slow down optimization
of sequential code, we need to tell the compiler what is what.  If every
access is potentially synchronizing, then this doesn't just prevent
speculative stores.
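
As a sketch of the difference (hypothetical code):

    /* Unannotated, sequential code: assuming data-race-freedom, the
     * compiler may keep x in a register across the whole loop. */
    for (i = 0; i < n; i++)
        sum += x;

    /* Annotated, synchronizing code: the compiler can no longer
     * assume that x is not changed concurrently. */
    for (i = 0; i < n; i++)
        sum += atomic_load_explicit(&x, memory_order_relaxed);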

(b) seems as if you are saying that if there is a specially annotated
atomic access in the code, that this isn't sequential/non-synchronizing
code anymore.  I don't agree with that, obviously.

> So I don't see how any sane person can say that speculative writes are
> ok. They are clearly not ok.

We are discussing programs written against the C11/C++11 memory model
here.  At least that's the part that got forwarded to g...@gcc.gnu.org,
and was subject of the nearest emails in the thread.

This memory model requires programs to be data-race-free.  Thus, we can
optimize accordingly.  If you don't like that, then this isn't C11/C++11
anymore.  Which would be fine, but then complain about that
specifically.

> Speculative stores are a bad idea in general.

I disagree in the context of C11/C++11 programs.  At least from a
correctness point of view.

> They are completely
> invalid for anything that says "atomic".

I agree, as you will see when you read the other emails I posted as part
of this thread.  But those two things are separate.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
>
> I wouldn't characterize the situation like this (although I can't speak
> for others, obviously).  IMHO, it's perfectly fine on sequential /
> non-synchronizing code, because we know the difference isn't observable
> by a correct program.

What BS is that? If you use an "atomic_store_explicit()", by
definition you're either

 (a) f*cking insane
 (b) not doing sequential non-synchronizing code

and a compiler that assumes that the programmer is insane may actually
be correct more often than not, but it's still a shit compiler.
Agreed?

So I don't see how any sane person can say that speculative writes are
ok. They are clearly not ok.

Speculative stores are a bad idea in general. They are completely
invalid for anything that says "atomic". This is not even worth
discussing.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote:
> > Hi Paul,
> > 
> > > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > > Hopefully some discussion of out-of-thin-air values as well.
> > > > 
> > > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > > a wooden stake through its heart.
> > > > 
> > > > C11/C++11 should not be allowed to claim itself a memory model until 
> > > > that
> > > > is sorted.
> > > 
> > > There actually is a proposal being put forward, but it might not make ARM
> > > and Power people happy because it involves adding a compare, a branch,
> > > and an ISB/isync after every relaxed load...  Me, I agree with you,
> > > much preferring the no-store-speculation approach.
> > 
> > Can you elaborate a bit on this please? We don't permit speculative stores
> > in the ARM architecture, so it seems counter-intuitive that GCC needs to
> > emit any additional instructions to prevent that from happening.
> 
> Requiring a compare/branch/ISB after each relaxed load enables a simple(r)
> proof that out-of-thin-air values cannot be observed in the face of any
> compiler optimization that refrains from reordering a prior relaxed load
> with a subsequent relaxed store.
> 
> > Stores can, of course, be observed out-of-order but that's a lot more
> > reasonable :)
> 
> So let me try an example.  I am sure that Torvald Riegel will jump in
> with any needed corrections or amplifications:
> 
> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> One would intuitively expect r1 == r2 == 0 as the only possible outcome.
> But suppose that the compiler used specialization optimizations, as it
> would if there was a function that has a very lightweight implementation
> for some values and a very heavyweight one for other.  In particular,
> suppose that the lightweight implementation was for the value 42.
> Then the compiler might do something like the following:
> 
> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   if (r1 == 42)
>   atomic_store_explicit(42, y, memory_order_relaxed);
>   else
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> Suddenly we have an explicit constant 42 showing up.  Of course, if
> the compiler carefully avoided speculative stores (as both Peter and
> I believe that it should if its code generation is to be regarded as
> anything other than an act of vandalism, the words in the standard
> notwithstanding), there would be no problem.  But currently, a number
> of compiler writers see absolutely nothing wrong with transforming
> the optimized-for-42 version above with something like this:
> 
> Initial state: x == y == 0
> 
> T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   atomic_store_explicit(42, y, memory_order_relaxed);
>   if (r1 != 42)
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);

Intuitively, this is wrong because this lets the program take a step
the abstract machine wouldn't do.  This is different to the sequential
code that Peter posted because it uses atomics, and thus one can't
easily assume that the difference is not observable.

For this to be correct, the compiler would actually have to prove that
the speculative store is "as-if correct", which in turn would mean that
it needs to be aware of all potential observers, and check whether those
observers aren't actually affected by the speculative store.

I would guess that the compilers you have in mind don't really do that.
If they do, then I don't see why this should be okay, unless you think
out-of-thin-air values are something good (which I wouldn't agree with).

> And then it is a short and uncontroversial step to the following:
> 
> Initial state: x == y == 0
> 
> T1:   atomic_store_explicit(42, y, memory_order_relaxed);
>   r1 = atomic_load_explicit(x, memory_order_relaxed);
>   if (r1 != 42)
>   atomic_store_explicit(r1, y, memory_order_relaxed);
> 
> T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
>   atomic_store_explicit(r2, x, memory_order_relaxed);
> 
> This can of course result in r1 == r2 == 42, even though the constant
> 42 never appeared in the original code.  This is one way to generate
> an out-of-thin-air value.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote:
> > > > > > There are also so many ways to blow your head off it's untrue. For 
> > > > > > example,
> > > > > > cmpxchg takes a separate memory model parameter for failure and 
> > > > > > success, but
> > > > > > then there are restrictions on the sets you can use for each. It's 
> > > > > > not hard
> > > > > > to find well-known memory-ordering experts shouting "Just use
> > > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then 
> > > > > > there's
> > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely 
> > > > > > ignores consume
> > > > > > atm and optimises all of the data dependencies away) as well as the 
> > > > > > definition
> > > > > > of "data races", which seem to be used as an excuse to miscompile a 
> > > > > > program
> > > > > > at the earliest opportunity.
> > > > > 
> > > > > Trust me, rcu_dereference() is not going to be defined in terms of
> > > > > memory_order_consume until the compilers implement it both correctly 
> > > > > and
> > > > > efficiently.  They are not there yet, and there is currently no 
> > > > > shortage
> > > > > of compiler writers who would prefer to ignore memory_order_consume.
> > > > 
> > > > Do you have any input on
> > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
> > > > language standard's definition of dependencies?
> > > 
> > > Let's see...  1.10p9 says that a dependency must be carried unless:
> > > 
> > > — B is an invocation of any specialization of std::kill_dependency 
> > > (29.3), or
> > > — A is the left operand of a built-in logical AND (&&, see 5.14) or 
> > > logical OR (||, see 5.15) operator,
> > > or
> > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > 
> > > So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> > > after the "?" will carry a dependency, so the code fragment in 59448
> > > needs to do the ordering rather than just optimizing "flag - flag" out
> > > of existence.  One way to do that on both ARM and Power is to actually
> > > emit code for "flag - flag", but there are a number of other ways to
> > > make that work.
> > 
> > And that's what would concern me, considering that these requirements
> > seem to be able to creep out easily.  Also, whereas the other atomics
> > just constrain compilers wrt. reordering across atomic accesses or
> > changes to the atomic accesses themselves, the dependencies are new
> > requirements on pieces of otherwise non-synchronizing code.  The latter
> > seems far more involved to me.
> 
> Well, the wording of 1.10p9 is pretty explicit on this point.
> There are only a few exceptions to the rule that dependencies from
> memory_order_consume loads must be tracked.  And to your point about
> requirements being placed on pieces of otherwise non-synchronizing code,
> we already have that with plain old load acquire and store release --
> both of these put ordering constraints that affect the surrounding
> non-synchronizing code.

I think there's a significant difference.  With acquire/release or more
general memory orders, it's true that we can't order _across_ the atomic
access.  However, we can reorder and optimize without additional
constraints if we do not reorder.  This is not the case with consume
memory order, as the (p + flag - flag) example shows.

> This issue got a lot of discussion, and the compromise is that
> dependencies cannot leak into or out of functions unless the relevant
> parameters or return values are annotated with [[carries_dependency]].
> This means that the compiler can see all the places where dependencies
> must be tracked.  This is described in 7.6.4.

I wasn't aware of 7.6.4 (but it isn't referred to as an additional
constraint--what it is--in 1.10, so I guess at least that should be
fixed).
Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
difference, at least if one is assuming that normal optimization of
sequential code is the default, and that maintaining things such as
(flag-flag) is not; if optimizing away (flag-flag) would require the
insertion of fences unless there is the carries_dependency attribute,
then this would be bad I think.

IMHO, the dependencies construct (carries_dependency, kill_dependency)
seem to be backwards to me.  They assume that the compiler preserves all
those dependencies including (flag-flag) by default, which prohibits
meaningful optimizations.  Instead, I guess there should be a construct
for explicitly exploiting the dependencies (e.g., a
preserve_dependencies call, whose argument will not be optimized
fully).  Exploiting dependencies will be special code and isn't the
common case, so it can require additional annotations.
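
What such a construct might look like (purely illustrative; the name
is only the suggestion above and exists in no compiler):

    int *p = atomic_load_explicit(&gp, memory_order_consume);
    /* Everything else is optimized as normal sequential code; only
     * the expression we explicitly mark keeps its dependency: */
    r = *preserve_dependencies(p + flag - flag);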

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
 On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
  On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
   On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
 On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
  There are also so many ways to blow your head off it's untrue. For 
  example,
  cmpxchg takes a separate memory model parameter for failure and 
  success, but
  then there are restrictions on the sets you can use for each. It's 
  not hard
  to find well-known memory-ordering experts shouting Just use
  memory_model_seq_cst for everything, it's too hard otherwise. Then 
  there's
  the fun of load-consume vs load-acquire (arm64 GCC completely 
  ignores consume
  atm and optimises all of the data dependencies away) as well as the 
  definition
  of data races, which seem to be used as an excuse to miscompile a 
  program
  at the earliest opportunity.
 
 Trust me, rcu_dereference() is not going to be defined in terms of
 memory_order_consume until the compilers implement it both correctly 
 and
 efficiently.  They are not there yet, and there is currently no 
 shortage
 of compiler writers who would prefer to ignore memory_order_consume.

Do you have any input on
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448?  In particular, the
language standard's definition of dependencies?
   
   Let's see...  1.10p9 says that a dependency must be carried unless:
   
   — B is an invocation of any specialization of std::kill_dependency 
   (29.3), or
   — A is the left operand of a built-in logical AND (, see 5.14) or 
   logical OR (||, see 5.15) operator,
   or
   — A is the left operand of a conditional (?:, see 5.16) operator, or
   — A is the left operand of the built-in comma (,) operator (5.18);
   
   So the use of flag before the ? is ignored.  But the flag - flag
   after the ? will carry a dependency, so the code fragment in 59448
   needs to do the ordering rather than just optimizing flag - flag out
   of existence.  One way to do that on both ARM and Power is to actually
   emit code for flag - flag, but there are a number of other ways to
   make that work.
  
  And that's what would concern me, considering that these requirements
  seem to be able to creep out easily.  Also, whereas the other atomics
  just constrain compilers wrt. reordering across atomic accesses or
  changes to the atomic accesses themselves, the dependencies are new
  requirements on pieces of otherwise non-synchronizing code.  The latter
  seems far more involved to me.
 
 Well, the wording of 1.10p9 is pretty explicit on this point.
 There are only a few exceptions to the rule that dependencies from
 memory_order_consume loads must be tracked.  And to your point about
 requirements being placed on pieces of otherwise non-synchronizing code,
 we already have that with plain old load acquire and store release --
 both of these put ordering constraints that affect the surrounding
 non-synchronizing code.

I think there's a significant difference.  With acquire/release or more
general memory orders, it's true that we can't order _across_ the atomic
access.  However, we can reorder and optimize without additional
constraints if we do not reorder.  This is not the case with consume
memory order, as the (p + flag - flag) example shows.

 This issue got a lot of discussion, and the compromise is that
 dependencies cannot leak into or out of functions unless the relevant
 parameters or return values are annotated with [[carries_dependency]].
 This means that the compiler can see all the places where dependencies
 must be tracked.  This is described in 7.6.4.

I wasn't aware of 7.6.4 (but it isn't referred to as an additional
constraint--what it is--in 1.10, so I guess at least that should be
fixed).
Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic
difference, at least if one is assuming that normal optimization of
sequential code is the default, and that maintaining things such as
(flag-flag) is not; if optimizing away (flag-flag) would require the
insertion of fences unless there is the carries_dependency attribute,
then this would be bad I think.

IMHO, the dependencies construct (carries_dependency, kill_dependency)
seem to be backwards to me.  They assume that the compiler preserves all
those dependencies including (flag-flag) by default, which prohibits
meaningful optimizations.  Instead, I guess there should be a construct
for explicitly exploiting the dependencies (e.g., a
preserve_dependencies call, whose argument will not be optimized fully).
Exploiting dependencies will be special code and isn't the common case,
so it can be require additional annotations.

 If a dependency chain
 headed by a memory_order_consume load 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
 On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
  Hi Paul,
  
  On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
   On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
 Hopefully some discussion of out-of-thin-air values as well.

Yes, absolutely shoot store speculation in the head already. Then drive
a wooden stake through its hart.

C11/C++11 should not be allowed to claim itself a memory model until 
that
is sorted.
   
   There actually is a proposal being put forward, but it might not make ARM
   and Power people happy because it involves adding a compare, a branch,
   and an ISB/isync after every relaxed load...  Me, I agree with you,
   much preferring the no-store-speculation approach.
  
  Can you elaborate a bit on this please? We don't permit speculative stores
  in the ARM architecture, so it seems counter-intuitive that GCC needs to
  emit any additional instructions to prevent that from happening.
 
 Requiring a compare/branch/ISB after each relaxed load enables a simple(r)
 proof that out-of-thin-air values cannot be observed in the face of any
 compiler optimization that refrains from reordering a prior relaxed load
 with a subsequent relaxed store.
 
  Stores can, of course, be observed out-of-order but that's a lot more
  reasonable :)
 
 So let me try an example.  I am sure that Torvald Riegel will jump in
 with any needed corrections or amplifications:
 
 Initial state: x == y == 0
 
 T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
   atomic_store_explicit(r1, y, memory_order_relaxed);
 
 T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
   atomic_store_explicit(r2, x, memory_order_relaxed);
 
 One would intuitively expect r1 == r2 == 0 as the only possible outcome.
 But suppose that the compiler used specialization optimizations, as it
 would if there was a function that has a very lightweight implementation
 for some values and a very heavyweight one for other.  In particular,
 suppose that the lightweight implementation was for the value 42.
 Then the compiler might do something like the following:
 
 Initial state: x == y == 0
 
 T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
   if (r1 == 42)
   atomic_store_explicit(42, y, memory_order_relaxed);
   else
   atomic_store_explicit(r1, y, memory_order_relaxed);
 
 T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
   atomic_store_explicit(r2, x, memory_order_relaxed);
 
 Suddenly we have an explicit constant 42 showing up.  Of course, if
 the compiler carefully avoided speculative stores (as both Peter and
 I believe that it should if its code generation is to be regarded as
 anything other than an act of vandalism, the words in the standard
 notwithstanding), there would be no problem.  But currently, a number
 of compiler writers see absolutely nothing wrong with transforming
 the optimized-for-42 version above with something like this:
 
 Initial state: x == y == 0
 
 T1:   r1 = atomic_load_explicit(x, memory_order_relaxed);
   atomic_store_explicit(42, y, memory_order_relaxed);
   if (r1 != 42)
   atomic_store_explicit(r1, y, memory_order_relaxed);
 
 T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
   atomic_store_explicit(r2, x, memory_order_relaxed);

Intuitively, this is wrong because this let's the program take a step
the abstract machine wouldn't do.  This is different to the sequential
code that Peter posted because it uses atomics, and thus one can't
easily assume that the difference is not observable.

For this to be correct, the compiler would actually have to prove that
the speculative store is as-if correct, which in turn would mean that
it needs to be aware of all potential observers, and check whether those
observers aren't actually affected by the speculative store.

I would guess that the compilers you have in mind don't really do that.
If they do, then I don't see why this should be okay, unless you think
out-of-thin-air values are something good (which I wouldn't agree with).

 And then it is a short and uncontroversial step to the following:
 
 Initial state: x == y == 0
 
 T1:   atomic_store_explicit(42, y, memory_order_relaxed);
   r1 = atomic_load_explicit(x, memory_order_relaxed);
   if (r1 != 42)
   atomic_store_explicit(r1, y, memory_order_relaxed);
 
 T2:   r2 = atomic_load_explicit(y, memory_order_relaxed);
   atomic_store_explicit(r2, x, memory_order_relaxed);
 
 This can of course result in r1 == r2 == 42, even though the constant
 42 never appeared in the original code.  This is one way to generate
 an out-of-thin-air value.
 
 As near as I can tell, compiler writers hate the idea of prohibiting
 speculative-store optimizations because it requires them to introduce
 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel trie...@redhat.com wrote:

 I wouldn't characterize the situation like this (although I can't speak
 for others, obviously).  IMHO, it's perfectly fine on sequential /
 non-synchronizing code, because we know the difference isn't observable
 by a correct program.

What BS is that? If you use an atomic_store_explicit(), by
definition you're either

 (a) f*cking insane
 (b) not doing sequential non-synchronizing code

and a compiler that assumes that the programmer is insane may actually
be correct more often than not, but it's still a shit compiler.
Agreed?

So I don't see how any sane person can say that speculative writes are
ok. They are clearly not ok.

Speculative stores are a bad idea in general. They are completely
invalid for anything that says atomic. This is not even worth
discussing.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Sun, 2014-02-09 at 16:56 -0800, Linus Torvalds wrote:
 On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel trie...@redhat.com wrote:
 
  I wouldn't characterize the situation like this (although I can't speak
  for others, obviously).  IMHO, it's perfectly fine on sequential /
  non-synchronizing code, because we know the difference isn't observable
  by a correct program.
 
 What BS is that? If you use an atomic_store_explicit(), by
 definition you're either
 
  (a) f*cking insane
  (b) not doing sequential non-synchronizing code
 
 and a compiler that assumes that the programmer is insane may actually
 be correct more often than not, but it's still a shit compiler.
 Agreed?

Due to all the expletives, I can't really understand what you are
saying.  Nor does what I guess it might mean seem to relate to what I
said.

(a) seems to say that you don't like requiring programmers to mark
atomic accesses specially.  Is that the case?

If so, I have to disagree.  If you're writing concurrent code, marking
the atomic accesses is the least of your problem.  Instead, for the
concurrent code I've written, it rather improved readability; it made
code more verbose, but in turn it made the synchronization easier to
see.

Besides this question of taste (and I don't care what the Kernel style
guides are), there is a real technical argument here, though: requiring
all synchronizing memory accesses to be annotated makes the compiler
aware of what is sequential code and what is not.  Without that, one
can't exploit data-race-freedom.  So unless we want to slow down
optimization of sequential code, we need to tell the compiler what is
what.  If every access were potentially synchronizing, we would lose
far more than just speculative stores.
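
To sketch that cost (again my example, made-up function names): an
unannotated load can be hoisted out of a loop precisely because a racing
write would be a data race, whereas an annotated load is a potential
synchronization point:

    #include <stdatomic.h>

    /* A concurrent write to *p would be a data race (undefined
       behavior), so the compiler may load *p once and keep it in a
       register for the whole loop. */
    long sum_plain(const int *p, long n) {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += *p;
        return s;
    }

    /* Each annotated load is a potential synchronization point; in
       practice compilers refrain from merging or hoisting these, so
       every iteration reloads.  If *every* access had to be treated
       this way, all sequential code would pay this price. */
    long sum_atomic(atomic_int *p, long n) {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += atomic_load_explicit(p, memory_order_relaxed);
        return s;
    }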

(b) seems as if you are saying that if there is a specially annotated
atomic access in the code, that this isn't sequential/non-synchronizing
code anymore.  I don't agree with that, obviously.

 So I don't see how any sane person can say that speculative writes are
 ok. They are clearly not ok.

We are discussing programs written against the C11/C++11 memory model
here.  At least that's the part that got forwarded to g...@gcc.gnu.org,
and was the subject of the nearest emails in the thread.

This memory model requires programs to be data-race-free.  Thus, we can
optimize accordingly.  If you don't like that, then this isn't C11/C++11
anymore.  Which would be fine, but then complain about that
specifically.
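
To illustrate what optimizing accordingly means (my sketch):

    #include <stdatomic.h>

    int stop;            /* unannotated */
    atomic_int stop_a;   /* annotated */

    /* If another thread sets `stop` concurrently, that is a data race
       and hence undefined behavior; the compiler may legally hoist the
       load, effectively producing: if (!stop) for (;;);  */
    void wait_plain(void) {
        while (!stop)
            ;
    }

    /* The annotated spin is required to eventually observe the store. */
    void wait_atomic(void) {
        while (!atomic_load_explicit(&stop_a, memory_order_acquire))
            ;
    }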

 Speculative stores are a bad idea in general.

I disagree in the context of C11/C++11 programs.  At least from a
correctness point of view.

 They are completely
 invalid for anything that says "atomic".

I agree, as you will see when you read the other emails I posted as part
of this thread.  But those two things are separate.



Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel trie...@redhat.com wrote:

 (a) seems to say that you don't like requiring programmers to mark
 atomic accesses specially.  Is that the case?

In Paul's example, they were marked specially.

And you seemed to argue that Paul's example could possibly return
anything but 0/0.

If you really think that, I hope to God that you have nothing to do
with the C standard or any actual compiler I ever use.

Because such a standard or compiler would be shit. It's sadly not too uncommon.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Torvald Riegel
On Sun, 2014-02-09 at 17:24 -0800, Linus Torvalds wrote:
 On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel trie...@redhat.com wrote:
 
  (a) seems to say that you don't like requiring programmers to mark
  atomic accesses specially.  Is that the case?
 
 In Paul's example, they were marked specially.
 
 And you seemed to argue that Paul's example could possibly return
 anything but 0/0.

Just read my reply to Paul again.  Here's an excerpt:

 Initial state: x == y == 0
 
 T1:   r1 = atomic_load_explicit(&x, memory_order_relaxed);
       atomic_store_explicit(&y, 42, memory_order_relaxed);
       if (r1 != 42)
           atomic_store_explicit(&y, r1, memory_order_relaxed);
 
 T2:   r2 = atomic_load_explicit(&y, memory_order_relaxed);
       atomic_store_explicit(&x, r2, memory_order_relaxed);

Intuitively, this is wrong because this lets the program take a step
the abstract machine wouldn't take.  This is different from the
sequential code that Peter posted because it uses atomics, and thus one
can't easily assume that the difference is not observable.


IOW, I wrote that such a compiler transformation would be wrong in my
opinion.  Thus, it should *not* return 42.
If you still see how what I wrote could be misunderstood, please let me
know because I really don't see it.

Yes, I don't do so by swearing or calling others insane, and I try to
see the reasoning that those compilers' authors might have had, even if
I don't agree with it.  In my personal opinion, that's a good thing.

 If you really think that, I hope to God that you have nothing to do
 with the C standard or any actual compiler I ever use.
 
 Because such a standard or compiler would be shit. It's sadly not too 
 uncommon.

Thanks for the kind words.  Perhaps writing that was quicker than
reading what I actually wrote.




Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 5:46 PM, Torvald Riegel trie...@redhat.com wrote:

 IOW, I wrote that such a compiler transformation would be wrong in my
 opinion.  Thus, it should *not* return 42.

Ahh, I am happy to have misunderstood. The "intuitively" threw me,
because I thought that was building up to a "but", and I misread the
rest.

I then react strongly, because I've seen so much total crap (the
type-based C aliasing rules topping my list) coming out of standards
groups because it allows them to generate wrong code that goes faster,
that I just assume compiler people are out to do stupid things in the
name of "..but the standard allows it".

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Paul E. McKenney
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
 On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote:
  On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
   Hi Paul,
   
   On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
 On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
  Hopefully some discussion of out-of-thin-air values as well.
 
 Yes, absolutely shoot store speculation in the head already. Then drive
 a wooden stake through its heart.
 
 C11/C++11 should not be allowed to claim itself a memory model until
 that is sorted.

There actually is a proposal being put forward, but it might not make
ARM and Power people happy because it involves adding a compare, a branch,
and an ISB/isync after every relaxed load...  Me, I agree with you,
much preferring the no-store-speculation approach.
   
   Can you elaborate a bit on this please? We don't permit speculative stores
   in the ARM architecture, so it seems counter-intuitive that GCC needs to
   emit any additional instructions to prevent that from happening.
  
  Requiring a compare/branch/ISB after each relaxed load enables a simple(r)
  proof that out-of-thin-air values cannot be observed in the face of any
  compiler optimization that refrains from reordering a prior relaxed load
  with a subsequent relaxed store.
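  
  Purely to illustrate the shape of that sequence (my sketch,
  AArch64-flavored; the macro name is made up and this is not text from
  the proposal):
  
      #include <stdatomic.h>
  
      /* After the relaxed load, compare the loaded value and branch
         (both paths reach the same label), then execute an ISB.  The
         control dependency plus the instruction barrier keep the
         hardware from making any later store visible before the load
         has resolved, which is what the simpler no-OOTA proof relies on. */
      #define LOAD_RELAXED_NO_OOTA(p) ({                               \
          unsigned long _v =                                           \
              atomic_load_explicit((p), memory_order_relaxed);         \
          __asm__ __volatile__("cmp  %0, #0\n\t"                       \
                               "b.ne 1f\n"                             \
                               "1:  isb"                               \
                               : : "r" (_v) : "cc", "memory");         \
          _v;                                                          \
      })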
  
   Stores can, of course, be observed out-of-order but that's a lot more
   reasonable :)
  
  So let me try an example.  I am sure that Torvald Riegel will jump in
  with any needed corrections or amplifications:
  
  Initial state: x == y == 0
  
  T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);
      atomic_store_explicit(&y, r1, memory_order_relaxed);
  
  T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
      atomic_store_explicit(&x, r2, memory_order_relaxed);
  
  One would intuitively expect r1 == r2 == 0 as the only possible outcome.
  But suppose that the compiler used specialization optimizations, as it
  would if there were a function that has a very lightweight implementation
  for some values and a very heavyweight one for others.  In particular,
  suppose that the lightweight implementation was for the value 42.
  Then the compiler might do something like the following:
  
  Initial state: x == y == 0
  
  T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);
      if (r1 == 42)
          atomic_store_explicit(&y, 42, memory_order_relaxed);
      else
          atomic_store_explicit(&y, r1, memory_order_relaxed);
  
  T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
      atomic_store_explicit(&x, r2, memory_order_relaxed);
  
  Suddenly we have an explicit constant 42 showing up.  Of course, if
  the compiler carefully avoided speculative stores (as both Peter and
  I believe it should, if its code generation is to be regarded as
  anything other than an act of vandalism, the words in the standard
  notwithstanding), there would be no problem.  But currently, a number
  of compiler writers see absolutely nothing wrong with transforming
  the optimized-for-42 version above into something like this:
  
  Initial state: x == y == 0
  
  T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);
      atomic_store_explicit(&y, 42, memory_order_relaxed);
      if (r1 != 42)
          atomic_store_explicit(&y, r1, memory_order_relaxed);
  
  T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
      atomic_store_explicit(&x, r2, memory_order_relaxed);
 
 Intuitively, this is wrong because this lets the program take a step
 the abstract machine wouldn't take.  This is different from the
 sequential code that Peter posted because it uses atomics, and thus one
 can't easily assume that the difference is not observable.
 
 For this to be correct, the compiler would actually have to prove that
 the speculative store is as-if correct, which in turn would mean that
 it needs to be aware of all potential observers, and check whether those
 observers aren't actually affected by the speculative store.
 
 I would guess that the compilers you have in mind don't really do that.
 If they do, then I don't see why this should be okay, unless you think
 out-of-thin-air values are something good (which I wouldn't agree with).
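 
 For contrast, here is a sketch (mine) of a speculative store that a
 compiler *can* prove as-if correct, because the object is local and
 has no potential observers at all:
 
     /* The address of `t` never escapes, so no other thread can observe
        it.  Storing 3 unconditionally and selecting the result with a
        conditional move changes nothing any observer can see.  No such
        proof is available for the atomic `y` above. */
     int select3(int cond) {
         int t = 0;
         if (cond)
             t = 3;
         return t;
     }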

OK, we agree that pulling the atomic store to y out of its if statement
is a bad thing.  Very good!  Now we just have to convince others on
the committee.  ;-)

Thanx, Paul


