Re: Re: typeof and operands in named address spaces

2020-11-17 Thread Linus Torvalds
On Tue, Nov 17, 2020 at 11:13 AM Linus Torvalds
 wrote:
>
> > +#define __unqual_typeof(type)  typeof( (typeof(type))type )
>
> that's certainly a much nicer version than the existing pre-processor
> expansion from hell.

Oh, and sparse doesn't handle this, and doesn't remove any qualifiers
for the above. And so this horror-test-case fails:

#define __unqual_typeof(type)  typeof( (typeof(type))(type) )

int *function(volatile int x)
{
extern __unqual_typeof(x) y;
return &y;
}

with

  t.c:6:17: warning: incorrect type in return expression (different modifiers)
  t.c:6:17:    expected int *
  t.c:6:17:    got int volatile *
  t.c:3:5: warning: symbol 'function' was not declared. Should it be static?

adding Luc and the sparse mailing list to the participants list.

But it does work with both gcc and clang for me.

For Luc, quoting some earlier emails:

> > lvalue conversion drops qualifiers in C.  In GCC, this is not
> > implemented correctly as it is unobservable in standard C
> > (but it is observable using typeof).
> >
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97702
> >
> > I have a working patch in preparation to change this. Then you
> > could use
> >
> > typeof( ((void)0, x) )

on another thing that fails with sparse. But since gcc gets that wrong
too and it only works with clang, that's not so relevant for the
kernel.

Luc - same test-case as above, just

#define __unqual_typeof(x) typeof( ((void)0, (x)) )

instead.

   Linus


Re: Re: typeof and operands in named address spaces

2020-11-17 Thread Linus Torvalds
On Tue, Nov 17, 2020 at 11:25 AM Jakub Jelinek  wrote:
>
> It would need to be typeof( (typeof(type)) (type) ) to not be that
> constrained on what kind of expressions it accepts as arguments.

Yup.

> Anyway, it won't work with array types at least,
>   int a[10];
>   typeof ((typeof (a)) (a)) b;
> is an error (in both gcc and clang), while typeof (a) b; will work
> (but not drop the qualifiers).  Don't know if the kernel cares or not.

Well, the kernel already doesn't allow that, because our existing
horror only handles simple integer scalar types.

So that macro is a clear improvement - if it actually works (local
testing says it does, but who knows about random compiler versions
etc.).

  Linus


Re: Re: typeof and operands in named address spaces

2020-11-17 Thread Linus Torvalds
On Mon, Nov 16, 2020 at 3:11 AM Peter Zijlstra  wrote:
>
> XXX: I've only verified the below actually compiles, I've not verified
>  the generated code is actually 'correct'.

Well, it was mainly the arm64 code generation for load-acquire and
store-release that wanted this - so it's really the generated code
that matters.

Will, can you check?

Because:

> +#define __unqual_typeof(type)  typeof( (typeof(type))type )

that's certainly a much nicer version than the existing pre-processor
expansion from hell.

 Linus


Re: [isocpp-parallel] Proposal for new memory_order_consume definition

2016-02-29 Thread Linus Torvalds
On Mon, Feb 29, 2016 at 9:37 AM, Michael Matz  wrote:
>
>The important part is with induction variables controlling
> loops:
>
>   short i;  for (i = start; i < end; i++)
> vs.
>   unsigned short u; for (u = start; u < end; u++)
>
> For the former you're allowed to assume that the loop will terminate, and
> that its iteration count is easily computable.  For the latter you get
> modulo arithmetic and (if start/end are of larger type than u, say 'int')
> it might not even terminate at all.  That has direct consequences for
> the vectorizability of such loops (or the profitability of such a
> transformation) and hence quite important performance implications in practice.

Stop bullshitting me.

It would generally force the compiler to add a few extra checks when
you do vectorize (or, more generally, do any kind of loop unrolling),
and yes, it would make things slightly more painful. You might, for
example, need to add code to handle the wraparound and have a more
complex non-unrolled head/tail version for that case.

In theory you could do a whole "restart the unrolled loop around the
index wraparound" if you actually cared about the performance of such
a case - but since nobody would ever care about that, it's more likely
that you'd just do it with a non-unrolled fallback (which would likely
be identical to the tail fixup).

It would be painful, yes.

But it wouldn't be fundamentally hard, or hurt actual performance fundamentally.

It would be _inconvenient_ for compiler writers, and the bad ones
would argue vehemently against it.

.. and it's how a "go fast" mode would be implemented by a compiler
writer initially as a compiler option, for those HPC people. Then you
have a use case and implementation example, and can go to the
standards body and say "look, we have people who use this already, it
breaks almost no code, and it makes our compiler able to generate much
faster code".

Which is why the standard was written to be good for compiler writers,
not actual users.

Of course, in real life HPC performance is often more about doing the
cache blocking etc, and I've seen people move to more parameterized
languages rather than C to get best performance. Generate the code
from a much higher-level description, and be able to do a much better
job, and leave C to do the low-level job, and let people do the
important part.

But no. Instead the C compiler people still argue for bad features
that were a misdesign and a wart on the language.

At the very least it should have been left as a "go unsafe, go fast"
option, and standardize *that*, instead of screwing everybody else
over.

The HPC people end up often using those anyway, because it turns out
that they'll happily get rid of proper rounding etc if it buys them a
couple of percent on their workload.  Things like "I really want you
to generate multiply-accumulate instructions because I don't mind
having intermediates with higher precision" etc.

 Linus


Re: [isocpp-parallel] Proposal for new memory_order_consume definition

2016-02-28 Thread Linus Torvalds
On Sun, Feb 28, 2016 at 12:27 AM, Markus Trippelsdorf
 wrote:
>> >
>> >   -fno-strict-overflow
>>
>>   -fno-strict-aliasing.
>
> Do not forget -fno-delete-null-pointer-checks.
>
> So the kernel obviously is already using its own C dialect, that is
> pretty far from standard C.
> All these options also have a negative impact on the performance of the
> generated code.

They really don't.

Have you ever seen code that cared about signed integer overflow?
Yeah, getting it right can make the compiler generate an extra ALU
instruction once in a blue moon, but trust me - you'll never notice.
You *will* notice when you suddenly have a crash or a security issue
due to bad code generation, though.

The idiotic C alias rules aren't even worth discussing. They were a
mistake. The kernel doesn't use some "C dialect pretty far from
standard C". Yeah, let's just say that the original C designers were
better at their job than a gaggle of standards people who were making
bad crap up to make some Fortran-style programs go faster.

They don't speed up normal code either, they just introduce undefined
behavior in a lot of code.

And deleting NULL pointer checks because somebody made a mistake, and
then turning that small mistake into a real and exploitable security
hole? Not so smart either.

The fact is, undefined compiler behavior is never a good idea. Not for
serious projects.

Performance doesn't come from occasional small and odd
micro-optimizations. I care about performance a lot, and I actually
look at generated code and do profiling etc. None of those three
options have *ever* shown up as issues. But the incorrect code they
generate? It has.

 Linus


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-21 Thread Linus Torvalds
On Thu, May 21, 2015 at 1:02 PM, Paul E. McKenney
 wrote:
>
> The compiler can (and does) speculate non-atomic non-volatile writes
> in some cases, but I do not believe that it is permitted to speculate
> either volatile or atomic writes.

I do *not* believe that a compiler is ever allowed to speculate *any*
writes - volatile or not - unless the compiler can prove that the end
result is either single-threaded, or the write in question is
guaranteed to only be visible in that thread (ie local stack variable
etc).

Quite frankly, I'd be much happier if the C standard just said so outright.

Also, I do think that the whole "consume" read should be explained
better to compiler writers. Right now the language (including very
much in the "restricted dependency" model) is described in very
abstract terms. Yet those abstract terms are actually very subtle and
complex, and very opaque to a compiler writer.

If I was a compiler writer, I'd absolutely detest that definition.
It's very far removed from my problem space as a compiler writer, and
nothing in the language *explains* the odd and subtle abstract rules.
It smells ad-hoc to me.

Now, I actually understand the point of those odd and abstract rules,
but to a compiler writer that doesn't understand the background, the
whole section reads as "this is really painful for me to track all
those dependencies and what kills them".

So I would very much suggest that there would be language that
*explains* this. Basically, tell the compiler writer:

 (a) the "official" rules are completely pointless, and make sense
only because the standard is written for some random "abstract
machine" that doesn't actually exist.

 (b) in *real life*, the whole and only point of the rules is to make
sure that the compiler doesn't turn a data depenency into a control
dependency, which on ARM and POWERPC does not honor causal memory
ordering

 (c) on x86, since *all* memory accesses are causal, all the magical
dependency rules are just pointless anyway, and what it really means
is that you cannot re-order accesses with value speculation.

 (d) the *actual* relevant rule for a compiler writer is very simple:
the compiler must not do value speculation on a "consume" load, and
the abstract machine rules are written so that any other sane
optimization is legal.

 (e) if the compiler writer really thinks they want to do value
speculation, they have to turn the "consume" load into an "acquire"
load. And you have to do that anyway on architectures like alpha that
aren't causal even for data dependencies.

I personally think the whole "abstract machine" model of the C
language is a mistake. It would be much better to talk about things in
terms of actual code generation and actual issues. Make all the
problems much more concrete, with actual examples of how memory
ordering matters on different architectures.

99% of all the problems with the whole "consume" memory ordering come
not from anything relevant to a compiler writer. All of it comes from
trying to "define" the issue in the wrong terms.

 Linus


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-19 Thread Linus Torvalds
On Tue, May 19, 2015 at 6:57 PM, Linus Torvalds
 wrote:
>
>  - the "you can add/subtract integral values" still opens you up to
> language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
> dependency, which it clearly doesn't. But language-lawyering it does,
> since all those operations (cast to pointer, cast to integer,
> subtracting an integer) claim to be dependency-preserving operations.
>
> So I think you want to limit the logical operators to things that
> don't mask off too many bits, and you should probably limit the
> add/subtract operations some way (maybe specify that the integer value
> you add/subtract cannot be related to the pointer).

Actually, "not related" doesn't work. For some buddy allocator thing,
you very much might want some of the bits to be related.

So I think you're better off just saying that operations designed to
drop significant bits break the dependency chain, and give things like
"& 1" and "(char *)ptr-(uintptr_t)ptr" as examples of such.

Making that just an extension of your existing "& 0" language would
seem to be natural.

Humans will understand, and compiler writers won't care. They will
either depend on hardware semantics anyway (and argue that your
language is tight enough that they don't need to do anything special)
or they will turn the consume into an acquire (on platforms that have
too weak hardware).

 Linus


Re: Compilers and RCU readers: Once more unto the breach!

2015-05-19 Thread Linus Torvalds
On Tue, May 19, 2015 at 5:55 PM, Paul E. McKenney
 wrote:
>
> http://www.rdrop.com/users/paulmck/RCU/consume.2015.05.18a.pdf

From a very quick read-through, the restricted dependency chain in 7.9
seems to be reasonable, and essentially covers "that's what hardware
gives us anyway", making compiler writers happy.

I would clarify the language somewhat:

 - it says that the result of a cast of a pointer is a dependency. You
need to make an exception for casting to bool, methinks (because
that's effectively just a test-against-NULL, which you later describe
as terminating the dependency).

   Maybe get rid of the "any type", and just limit it to casts to
types of size intptr_t, ie ones that don't drop significant bits. All
the other rules talk about [u]intptr_t anyway.

 - you clarify that the trivial "& 0" and "| ~0" kill the dependency
chain, but if you really want to be a stickler, you might want to
extend it to a few more cases. Things like "& 1" (to extract a tag
from the low bit of a tagged pointer) should likely also drop the
dependency, since a compiler would commonly end up using the end
result as a conditional even if the code was written to then use
casting etc to look like a dereference.

 - the "you can add/subtract integral values" still opens you up to
language lawyers claiming "(char *)ptr - (intptr_t)ptr" preserving the
dependency, which it clearly doesn't. But language-lawyering it does,
since all those operations (cast to pointer, cast to integer,
subtracting an integer) claim to be dependency-preserving operations.

So I think you want to limit the logical operators to things that
don't mask off too many bits, and you should probably limit the
add/subtract operations some way (maybe specify that the integer value
you add/subtract cannot be related to the pointer). But I think
limiting it to mostly pointer ops (and a _few_ integer operations to
do offsets and remove tag bits) is otherwise a good approach.

   Linus


Re: [RFC] Design for flag bit outputs from asms

2015-05-05 Thread Linus Torvalds
On Tue, May 5, 2015 at 6:50 AM, Segher Boessenkool
 wrote:
>
> Since it is pre-processed, there is no real reason to overlap this with
> the constraints namespace; we could have e.g. "=@[xy]" (and "@[xy]" for
> inputs) mean the target needs to do some "xy" transform here.

In fact, standing out visually would be just a good thing, since it's
pretty special even from a usage standpoint.

And are you actually planning to have flags as inputs? Because *that*
sounds like a bad idea. It's pretty hard to turn a boolean into a flag
value, while pretty much any architecture has an operation like "setcc"
to go the other way. And I don't think your machine descriptions have
anything to "generate flags". You'd have to add fragile and complex
machinery for something it is unlikely anybody ever wants.

Flag *outputs* people definitely want. Flag inputs? Yeah, I can
absolutely see the carry flag being useful for multi-precision
arithmetic, but it's *so* hard to guarantee that it still is live,
that in practice the compiler would likely have to re-generate it from
a value anyway, so ...

So I'd go for output-only, and make the syntax be something very
visually unambiguous. That "=@[xy]" format looks fine, where "xy"
would be very architecture-dependent.

Or make it even *more* specific by using "CC" for condition codes, and
make the syntax "=@CC[xy]", in case you ever want to use the "@"
marker for any other kind of magic constraint.

 Linus


Re: [RFC] Design for flag bit outputs from asms

2015-05-04 Thread Linus Torvalds
On Mon, May 4, 2015 at 1:33 PM, Richard Henderson  wrote:
>
> A fair point.  Though honestly, I was hoping that this feature would mostly be
> used for conditions that are "weird" -- that is, not normally describable by
> arithmetic at all.  Otherwise, why are you using inline asm for it?

I could easily imagine using some of the combinations for atomic operations.

For example, doing a "lock decl", and wanting to see if the result is
negative or zero. Sure, it would be possible to set *two* booleans (ZF
and SF), but there's a conditional for "BE"..

Linus


Re: [RFC] Design for flag bit outputs from asms

2015-05-04 Thread Linus Torvalds
On Mon, May 4, 2015 at 1:14 PM, H. Peter Anvin  wrote:
>
> I would argue that for x86 what you actually want is to model the
> *conditions* that are available on the flags, not the flags themselves.

Yes. Otherwise it would be a nightmare to try to describe simple
conditions like "le", which is a rather complicated combination of three
of the actual flag bits:

((SF ^^ OF) || ZF) = 1

which would just be ridiculously painful for (a) the user to describe
and (b) for the compiler to recognize once described.

Now, I do admit that most of the cases where you'd use inline asm with
condition codes would probably fall into just simple "test ZF or CF".
But I could certainly imagine other cases.

   Linus


Re: [PATCH] tell gcc optimizer to never introduce new data races

2014-06-10 Thread Linus Torvalds
On Tue, Jun 10, 2014 at 6:23 AM, Jiri Kosina  wrote:
> We have been chasing a memory corruption bug, which turned out to be
> caused by very old gcc (4.3.4), which happily turned conditional load into
> a non-conditional one, and that broke correctness (the condition was met
> only if lock was held) and corrupted memory.

Just out of interest, can you point to the particular kernel code that
caused this? I think that's more interesting than the example program
you show - which I'm sure is really nice for gcc developers as an
example, but from a kernel standpoint I think it's more important to
show the particular problems this caused for the kernel?

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Linus Torvalds
On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
 wrote:
>
> 3.  The comparison was against another RCU-protected pointer,
> where that other pointer was properly fetched using one
> of the RCU primitives.  Here it doesn't matter which pointer
> you use.  At least as long as the rcu_assign_pointer() for
> that other pointer happened after the last update to the
> pointed-to structure.
>
> I am a bit nervous about #3.  Any thoughts on it?

I think that it might be worth pointing out as an example, and saying
that code like

   p = atomic_read(consume);
   X;
   q = atomic_read(consume);
   Y;
   if (p == q)
        data = p->val;

then the access of "p->val" is constrained to be data-dependent on
*either* p or q, but you can't really tell which, since the compiler
can decide that the values are interchangeable.

I cannot for the life of me come up with a situation where this would
matter, though. If "X" contains a fence, then that fence will be a
stronger ordering than anything the consume through "p" would
guarantee anyway. And if "X" does *not* contain a fence, then the
atomic reads of p and q are unordered *anyway*, so then whether the
ordering to the access through "p" is through p or q is kind of
irrelevant. No?

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-27 Thread Linus Torvalds
On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel  wrote:
>
> I agree that just considering syntactic properties of the program seems
> to be insufficient.  Making it instead depend on whether there is a
> "semantic" dependency due to a value being "necessary" to compute a
> result seems better.  However, whether a value is "necessary" might not
> be obvious, and I understand Paul's argument that he does not want to
> have to reason about all potential compiler optimizations.  Thus, I
> believe we need to specify when a value is "necessary".

I suspect it's hard to really strictly define, but at the same time I
actually think that compiler writers (and users, for that matter) have
little problem understanding the concept and intent.

I do think that listing operations might be useful to give good
examples of what is a "necessary" value, and - perhaps more
importantly - what can break the value from being "necessary".
Especially the gotchas.

> I have a suggestion for a somewhat different formulation of the feature
> that you seem to have in mind, which I'll discuss below.  Excuse the
> verbosity of the following, but I'd rather like to avoid
> misunderstandings than save a few words.

Ok, I'm going to cut most of the verbiage since it's long and I'm not
commenting on most of it.

But

> Based on these thoughts, we could specify the new mo_consume guarantees
> roughly as follows:
>
> An evaluation E (in an execution) has a value dependency to an
> atomic and mo_consume load L (in an execution) iff:
> * L's type holds more than one value (ruling out constants
> etc.),
> * L is sequenced-before E,
> * L's result is used by the abstract machine to compute E,
> * E is value-dependency-preserving code (defined below), and
> * at the time of execution of E, L can possibly have returned at
> least two different values under the assumption that L itself
> could have returned any value allowed by L's type.
>
> If a memory access A's targeted memory location has a value
> dependency on a mo_consume load L, and an action X
> inter-thread-happens-before L, then X happens-before A.

I think this mostly works.

> Regarding the latter, we make a fresh start at each mo_consume load (ie,
> we assume we know nothing -- L could have returned any possible value);
> I believe this is easier to reason about than other scopes like function
> granularities (what happens on inlining?), or translation units.  It
> should also be simple to implement for compilers, and would hopefully
> not constrain optimization too much.
>
> [...]
>
> Paul's litmus test would work, because we guarantee to the programmer
> that it can assume that the mo_consume load would return any value
> allowed by the type; effectively, this forbids the compiler analysis
> Paul thought about:

So realistically, since with the new wording we can ignore the silly
cases (ie "p-p") and we can ignore the trivial-to-optimize compiler
cases ("if (p == &variable) .. use p"), and you would forbid the
"global value range optimization case" that Paul brought up, what
remains would seem to be just really subtle compiler transformations
of data dependencies to control dependencies.

And the only such thing I can think of is basically compiler-initiated
value-prediction, presumably directed by PGO (since now if the value
prediction is in the source code, it's considered to break the value
chain).

The good thing is that, afaik, value-prediction is largely not used in
real life. There are lots of papers on it, but I don't think
anybody actually does it (although I can easily see some
specint-specific optimization pattern that is built up around it).

And even value prediction is actually fine, as long as the compiler
can see the memory *source* of the value prediction (and it isn't a
mo_consume). So it really ends up limiting your value prediction in
very simple ways: you cannot do it to function arguments if they are
registers. But you can still do value prediction on values you loaded
from memory, if you can actually *see* that memory op.

Of course, on more strongly ordered CPU's, even that "register
argument" limitation goes away.

So I agree that there is basically no real optimization constraint.
Value-prediction is of dubious value to begin with, and the actual
constraint on its use if some compiler writer really wants to is not
onerous.

> What I have in mind is roughly the following (totally made-up syntax --
> suggestions for how to do this properly are very welcome):
> * Have a type modifier (eg, like restrict), that specifies that
> operations on data of this type are preserving value dependencies:

So I'm not violently opposed, but I think the upsides are not great.
Note that my earlier suggestion to use "restrict" wasn't because I
believed the annotation itself would be visible, but basically just as
a legalistic promise to the compiler that *if* it found an 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-25 Thread Linus Torvalds
On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney
 wrote:
>
> So let me see if I understand your reasoning.  My best guess is that it
> goes something like this:
>
> 1.  The Linux kernel contains code that passes pointers from
> rcu_dereference() through external functions.

No, actually, it's not so much Linux-specific at all.

I'm actually thinking about what I'd do as a compiler writer, and as a
defender of the "C is a high-level assembler" concept.

I love C. I'm a huge fan. I think it's a great language, and I think
it's a great language not because of some theoretical issues, but
because it is the only language around that actually maps fairly well
to what machines really do.

And it's a *simple* language. Sure, it's not quite as simple as it
used to be, but look at how thin the "K&R book" is. Which pretty much
describes it - still.

That's the real strength of C, and why it's the only language serious
people use for system programming.  Ignore C++ for a while (Jesus
Xavier Christ, I've had to do C++ programming for subsurface), and
just think about what makes _C_ a good language.

I can look at C code, and I can understand what the code generation
is, and what it will really *do*. And I think that's important.
Abstractions that hide what the compiler will actually generate are
bad abstractions.

And ok, so this is obviously Linux-specific in that it's generally
only Linux where I really care about the code generation, but I do
think it's a bigger issue too.

So I want C features to *map* to the hardware features they implement.
The abstractions should match each other, not fight each other.

> Actually, the fact that there are more potential optimizations than I can
> think of is a big reason for my insistence on the carries-a-dependency
> crap.  My lack of optimization omniscience makes me very nervous about
> relying on there never ever being a reasonable way of computing a given
> result without preserving the ordering.

But if I can give two clear examples that are basically identical from
a syntactic standpoint, and one clearly can be trivially optimized to
the point where the ordering guarantee goes away, and the other
cannot, and you cannot describe the difference, then I think your
description is seriously lacking.

And I do *not* think the C language should be defined by how it can be
described. Leave that to things like Haskell or LISP, where the goal
is some kind of completeness of the language that is about the
language, not about the machines it will run on.

>> So the code sequence I already mentioned is *not* ordered:
>>
>> Litmus test 1:
>>
>> p = atomic_read(pp, consume);
>> if (p == &variable)
>> return p->val;
>>
>>is *NOT* ordered, because the compiler can trivially turn this into
>> "return variable.val", and break the data dependency.
>
> Right, given your model, the compiler is free to produce code that
> doesn't order the load from pp against the load from p->val.

Yes. Note also that that is what existing compilers would actually do.

And they'd do it "by mistake": they'd load the address of the variable
into a register, and then compare the two registers, and then end up
using _one_ of the registers as the base pointer for the "p->val"
access, but I can almost *guarantee* that there are going to be
sequences where some compiler will choose one register over the other
based on some random detail.

So my model isn't just a "model", it also happens to describe reality.

> Indeed, it won't work across different compilation units unless
> the compiler is told about it, which is of course the whole point of
> [[carries_dependency]].  Understood, though, the Linux kernel currently
> does not have anything that could reasonably automatically generate those
> [[carries_dependency]] attributes.  (Or are there other reasons why you
> believe [[carries_dependency]] is problematic?)

So I think carries_dependency is problematic because:

 - it's not actually in C11 afaik

 - it requires the programmer to solve the problem of the standard not
matching the hardware.

 - I think it's just insanely ugly, *especially* if it's actually
meant to work so that the current carries-a-dependency works even for
insane expressions like "a-a".

In practice, it's one of those things where I guess nobody actually
would ever use it.

> Of course, I cannot resist putting forward a third litmus test:
>
> static struct foo variable1;
> static struct foo variable2;
> static struct foo *pp = &variable1;
>
> T1: initialize_foo(&variable2);
> atomic_store_explicit(&pp, &variable2, memory_order_release);
> /* The above is the only store to pp in this translation unit,
>  * and the address of pp is not exported in any way.
>  */
>
> T2: if (p == &variable1)
> return p->val1; /* Must be variable1.val1. */
> else
> return p->val2; /* Must be variable2.val2. */
>
> My guess is that you

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Linus Torvalds
On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds
 wrote:
>
> Litmus test 1:
>
> p = atomic_read(pp, consume);
> if (p == &variable)
> return p->val;
>
>is *NOT* ordered

Btw, don't get me wrong. I don't _like_ it not being ordered, and I
actually did spend some time thinking about my earlier proposal on
strengthening the 'consume' ordering.

I have for the last several years been 100% convinced that the Intel
memory ordering is the right thing, and that people who like weak
memory ordering are wrong and should try to avoid reproducing if at
all possible. But given that we have memory orderings like power and
ARM, I don't actually see a sane way to get a good strong ordering.
You can teach compilers about cases like the above when they actually
see all the code and they could poison the value chain etc. But it
would be fairly painful, and once you cross object files (or even just
functions in the same compilation unit, for that matter), it goes from
painful to just "ridiculously not worth it".

So I think the C semantics should mirror what the hardware gives us -
and do so even in the face of reasonable optimizations - not try to do
something else that requires compilers to treat "consume" very
differently.

If people made me king of the world, I'd outlaw weak memory ordering.
You can re-order as much as you want in hardware with speculation etc,
but you should always *check* your speculation and make it *look* like
you did everything in order. Which is pretty much the intel memory
ordering (ignoring the write buffering).

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Linus Torvalds
On Mon, Feb 24, 2014 at 2:37 PM, Paul E. McKenney
 wrote:
>>
>> What if the "nothing modifies 'p'" part looks like this:
>>
>> if (p != &myvariable)
>> return;
>>
>> and now any sane compiler will happily optimize "q = *p" into "q =
>> myvariable", and we're all done - nothing invalid was ever
>
> Yes, the compiler could do that.  But it would still be required to
> carry a dependency from the memory_order_consume read to the "*p",

But that's *BS*. You didn't actually listen to the main issue.

Paul, why do you insist on this carries-a-dependency crap?

It's broken. If you don't believe me, then believe the compiler person
who already piped up and told you so.

The "carries a dependency" model is broken. Get over it.

No sane compiler will ever distinguish two different registers that
have the same value from each other. No sane compiler will ever say
"ok, register r1 has the exact same value as register r2, but r2
carries the dependency, so I need to make sure to pass r2 to that
function or use it as a base pointer".

And nobody sane should *expect* a compiler to distinguish two
registers with the same value that way.

So the whole model is broken.

I gave an alternate model (the "restrict"), and you didn't seem to
understand the really fundamental difference. It's not a language
difference, it's a conceptual difference.

In the broken "carries a dependency" model, you have fight all those
aliases that can have the same value, and it is not a fight you can
win. We've had the "p-p" examples, we've had the "p&0" examples, but
the fact is, that "p==&myvariable" example IS EXACTLY THE SAME THING.

All three of those things: "p-p", "p&0", and "p==&myvariable" mean
that any compiler worth its salt now knows that "p" carries no
information, and will optimize it away.

So please stop arguing against that. Whenever you argue against that
simple fact, you are arguing against sane compilers.

So *accept* the fact that some operations (and I guarantee that there
are more of those than you can think of, and you can create them with
various tricks using pretty much *any* feature in the C language)
essentially take the data information away. And just accept the fact
that then the ordering goes away too.

So give up on "carries a dependency". Because there will be cases
where that dependency *isn't* carried.

The language of the standard needs to get *away* from the broken
model, because otherwise the standard is broken.

I suggest we instead talk about "litmus tests" and why certain code
sequences are ordered, and others are not.

So the code sequence I already mentioned is *not* ordered:

Litmus test 1:

p = atomic_read(pp, consume);
if (p == &variable)
return p->val;

   is *NOT* ordered, because the compiler can trivially turn this into
"return variable.val", and break the data dependency.

   This is true *regardless* of any "carries a dependency" language,
because that language is insane, and doesn't work when the different
pieces here may be in different compilation units.

BUT:

Litmus test 2:

p = atomic_read(pp, consume);
if (p != &variable)
return p->val;

  *IS* ordered, because while it looks syntactically a lot like
"Litmus test 1", there is no sane way a compiler can use the knowledge
that "p is not a pointer to a particular location" to break the data
dependency.

There is no way in hell that any sane "carries a dependency" model can
get the simple tests above right.
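As a minimal single-threaded C sketch of the two litmus tests above (the names `obj`, `variable`, and `other` are illustrative; a plain function parameter stands in for the consume load, since the point is what the compiler may do with the value, not the runtime ordering):

```c
#include <assert.h>

struct obj { int val; };

struct obj variable = { .val = 42 };
struct obj other    = { .val = 7 };

/* Litmus test 1: on the success path the compiler knows p == &variable,
 * so it may rewrite the load as "return variable.val" - the address
 * dependency on p is gone and only a control dependency remains. */
int litmus1(struct obj *p)
{
    if (p == &variable)
        return p->val;   /* compiler may emit: return variable.val; */
    return -1;
}

/* Litmus test 2: knowing what p is NOT gives the compiler no
 * replacement address, so the load must still go through p and the
 * data dependency survives. */
int litmus2(struct obj *p)
{
    if (p != &variable)
        return p->val;   /* must dereference p itself */
    return -1;
}
```

Functionally the two are near-identical; the difference is entirely in what address knowledge the comparison hands the optimizer.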

So give up on it already. "Carries a dependency" cannot work. It's a
bad model. You're trying to describe the problem using the wrong
tools.

Note that my "restrict+pointer to object" language actually got this
*right*. The "restrict" part made Litmus test 1 not ordered, because
that "p == &variable" success case means that the pointer wasn't
restricted, so the pre-requisite for ordering didn't exist.

See? The "carries a dependency" is a broken model for this, but there
are _other_ models that can work.

You tried to rewrite my model into "carries a dependency". That
*CANNOT* work. It's like trying to rewrite quantum physics into the
Greek model of the four elements. They are not compatible models, and
one of them can be shown to not work.

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Linus Torvalds
On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
 wrote:
>
> Good points.  How about the following replacements?
>
> 3.  Adding or subtracting an integer to/from a chained pointer
> results in another chained pointer in that same pointer chain.
> The results of addition and subtraction operations that cancel
> the chained pointer's value (for example, "p-(long)p" where "p"
> is a pointer to char) are implementation defined.
>
> 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> applied to a chained pointer and an integer for the purposes
> of alignment and pointer translation results in another
> chained pointer in that same pointer chain.  Other uses
> of bitwise operators on chained pointers (for example,
> "p|~0") are implementation defined.

Quite frankly, I think all of this language that is about the actual
operations is irrelevant and wrong.

It's not going to help compiler writers, and it sure isn't going to
help users that read this.

Why not just talk about "value chains" and that any operations that
restrict the value range severely end up breaking the chain. There is
no point in listing the operations individually, because every single
operation *can* restrict things. Listing individual operations and
dependencies is just fundamentally wrong.

For example, let's look at this obvious case:

   int q,*p = atomic_read(&pp, consume);
   .. nothing modifies 'p' ..
   q = *p;

and there are literally *zero* operations that modify the value
chain, so obviously the two operations are ordered, right?

Wrong.

What if the "nothing modifies 'p'" part looks like this:

if (p != &myvariable)
return;

and now any sane compiler will happily optimize "q = *p" into "q =
myvariable", and we're all done - nothing invalid was ever

So my earlier suggestion tried to avoid this by having the restrict
thing, so the above wouldn't work.

But your (and the current C standards) attempt to define this with
some kind of syntactic dependency carrying chain will _inevitably_ get
this wrong, and/or be too horribly complex to actually be useful.

Seriously, don't do it. I claim that all your attempts to do this
crazy syntactic "these operations maintain the chained pointers" is
broken. The fact that you have changed "carries a dependency" to
"chained pointer" changes NOTHING.

So just give it up. It's a fundamentally broken model. It's *wrong*,
but even more importantly, it's not even *useful*, since it ends up
being too complicated for a compiler writer or a programmer to
understand.

I really really really think you need to do this at a higher
conceptual level, get away from all these idiotic "these operations
maintain the chain" crap. Because there *is* no such list.

Quite frankly, any standards text that has that
"[[carries_dependency]]" or "[[kill_dependency]]" or whatever
attribute is broken. It's broken because the whole concept is TOTALLY
ALIEN to the compiler writer or the programmer. It makes no sense.
It's purely legalistic language that has zero reason to exist. It's
non-intuitive for everybody.

And *any* language that talks about the individual operations only
encourages people to play legalistic games that actually defeat the
whole purpose (namely that all relevant CPU's are going to implement
that consume ordering guarantee natively, with no extra code
generation rules AT ALL). So any time you talk about some random
detail of some operation, somebody is going to come up with a "trick"
that defeats things. So don't do it. There is absolutely ZERO
difference between any of the arithmetic operations, be they bitwise,
additive, multiplicative, shifts, whatever.

The *only* thing that matters for all of them is whether they are
"value-preserving", or whether they drop so much information that the
compiler might decide to use a control dependency instead. That's true
for every single one of them.

Similarly, actual true control dependencies that limit the problem
space sufficiently that the actual pointer value no longer has
significant information in it (see the above example) are also things
that remove information to the point that only a control dependency
remains. Even when the value itself is not modified in any way at all.

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Linus Torvalds
On Mon, Feb 24, 2014 at 9:21 AM, Paul E. McKenney
 wrote:
>
> 4.  Bitwise operators ("&", "|", "^", and I suppose also "~")
> applied to a chained pointer and an integer results in another
> chained pointer in that same pointer chain.

No. You cannot define it this way. Taking the value of a pointer and
doing a bitwise operation that throws away all the bits (or even
*most* of the bits) results in the compiler easily being able to turn
the "chain" into a non-chain.

The obvious example being "val & 0", but things like "val & 1" are in
practice also something that compilers easily turn into control
dependencies instead of data dependencies.

So you can talk about things like "aligning the pointer value to
object boundaries" etc, but it really cannot and *must* not be about
the syntactic operations.

The same goes for "adding and subtracting an integer". The *syntax*
doesn't matter. It's about remaining information. Doing "p-(int)p" or
"p+(-(int)p)" doesn't leave any information despite being "subtracting
and adding an integer" at a syntactic level.

Syntax is meaningless. Really.
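A small sketch of the point: each expression below is a syntactic "use" of the pointer, yet carries zero bits of its value, so a compiler can fold the whole thing to a constant and any ordering that was supposed to ride on the pointer evaporates. The helper name is hypothetical, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

/* Syntactically these "subtract an integer" and "apply bitwise and",
 * but the results are compile-time constants: no information from p
 * survives into them, so the value chain through p is broken. */
int carries_no_information(char *p)
{
    uintptr_t v = (uintptr_t)p;
    uintptr_t a = v - v;   /* "p - (long)p": always 0 */
    uintptr_t b = v & 0;   /* "p & 0": always 0 */
    return (int)(a | b);   /* constant-foldable to 0, whatever p was */
}
```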

> 8.  Applying any of the following operators to a chained pointer
> results in something that is not a chained pointer:
> "()", sizeof, "!", "*", "/", "%", ">>", "<<", "<", ">", "<=",
> ">=", "==", "!=", "&&", and "||".

Parenthesis? I'm assuming that you mean calling through the chained pointer.

Also, I think all of /, * and % are perfectly fine, and might be used
for that "aligning the pointer" operation that is fine.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Linus Torvalds
On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz  wrote:
>
> So, let me try to poke holes into your definition or increase my
> understanding :) .  You said "chain of pointers"(dereferences I assume),
> e.g. if p is result of consume load, then access to
> p->here->there->next->prev->stuff is supposed to be ordered with that load
> (or only when that last load/store itself is also an atomic load or
> store?).

It's supposed to be ordered wrt the first load (the consuming one), yes.

> So, what happens if the pointer deref chain is partly hidden in some
> functions:

No problem.

The thing is, the ordering is actually handled by the CPU in all
relevant cases.  So the compiler doesn't actually need to *do*
anything. All this legalistic stuff is just to describe the semantics
and the guarantees.

The problem is two cases:

 (a) alpha (which doesn't really order any accesses at all, not even
dependent loads), but for a compiler alpha is actually trivial: just
add a "rmb" instruction after the load, and you can't really do
anything else (there's a few optimizations you can do wrt the rmb, but
they are very specific and simple).

So (a) is a "problem", but the solution is actually really simple, and
gives very *strong* guarantees: on alpha, a "consume" ends up being
basically the same as a read barrier after the load, with only very
minimal room for optimization.

 (b) ARM and powerpc and similar architectures, that guarantee the
data dependency as long as it is an *actual* data dependency, and
never becomes a control dependency.

On ARM and powerpc, control dependencies do *not* order accesses (the
reasons boil down to essentially: branch prediction breaks the
dependency, and instructions that come after the branch can be happily
executed before the branch). But it's almost impossible to describe
that in the standard, since compilers can (and very much do) turn a
control dependency into a data dependency and vice versa.

So the current standard tries to describe that "control vs data"
dependency, and tries to limit it to a data dependency. It fails. It
fails for multiple reasons - it doesn't allow for trivial
optimizations that just remove the data dependency, and it also
doesn't allow for various trivial cases where the compiler *does* turn
the data dependency into a control dependency.

So I really really think that the current C standard language is
broken. Unfixably broken.

I'm trying to change the "syntactic data dependency" that the current
standard uses into something that is clearer and correct.

The "chain of pointers" thing is still obviously a data dependency,
but by limiting it to pointers, it simplifies the language, clarifies
the meaning, avoids all syntactic tricks (ie "p-p" is clearly a
syntactic dependency on "p", but does *not* involve in any way
following the pointer) and makes it basically impossible for the
compiler to break the dependency without doing value prediction, and
since value prediction has to be disallowed anyway, that's a feature,
not a bug.

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Linus Torvalds
On Mon, Feb 24, 2014 at 8:37 AM, Linus Torvalds
 wrote:
>
> So yes, the atomic_read() would be ordered wrt '*ptr' (getting 'q')
> _and_ '**ptr' (getting 'i'), but nothing else - including just the
> aliasing access of dereferencing 'i' directly.

Btw, what CPU architects and memory ordering guys tend to do in
documentation is give a number of "litmus test" pseudo-code sequences
to show the effects and intent of the language.

I think giving those kinds of litmus tests for both "this is ordered"
and "this is not ordered" cases like the above would be a great
clarification. Partly because the language is going to be somewhat
legalistic and thus hard to wrap your mind around, and partly to
really hit home the *intent* of the language, which I think is
actually fairly clear to both compiler writers and to programmers.

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Linus Torvalds
On Mon, Feb 24, 2014 at 8:27 AM, Richard Biener
 wrote:
>
> To me that reads like
>
>   int i;
>   int *q = &i;
>   int **p = &q;
>
>   atomic_XXX (p, CONSUME);
>
> orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
> want to say that it orders against aliased storage - but then go further
> and include "indirectly through a chain of pointers"?!  Thus an
> atomic read of a int * orders against any 'int' memory operation but
> not against 'float' memory operations?

No, it's not about type at all, and the "chain of pointers" can be
much more complex than that, since the "int *" can point to within an
object that contains other things than just that "int" (the "int" can
be part of a structure that then has pointers to other structures
etc).

So in your example,

ptr = atomic_read(p, CONSUME);

would indeed order against the subsequent access of the chain through
*that* pointer (the whole "restrict" thing that I left out as a
separate thing, which was probably a mistake), but certainly not
against any integer pointer, and certainly not against any aliasing
pointer chains.

So yes, the atomic_read() would be ordered wrt '*ptr' (getting 'q')
_and_ '**ptr' (getting 'i'), but nothing else - including just the
aliasing access of dereferencing 'i' directly.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-24 Thread Linus Torvalds
On Sun, Feb 23, 2014 at 11:31 AM, Linus Torvalds
 wrote:
>
> Let me think about it some more, but my gut feel is that just tweaking
> the definition of what "ordered" means is sufficient.
>
> So to go back to the suggested ordering rules (ignoring the "restrict"
> part, which is just to clarify that ordering through other means to
> get to the object doesn't matter), I suggested:
>
>  "the consume ordering guarantees the ordering between that
>   atomic read and the accesses to the object that the pointer
>   points to"
>
> and I think the solution is to just say that this ordering acts as a
> fence. It doesn't say exactly *where* the fence is, but it says that
> there is *some* fence between the load of the pointer and any/all
> accesses to the object through that pointer.

I'm wrong. That doesn't work. At all. There is no ordering except
through the pointer chain.

So I think saying just that, and nothing else (no magic fences, no
nothing) is the right thing:

 "the consume ordering guarantees the ordering between that
  atomic read and the accesses to the object that the pointer
  points to directly or indirectly through a chain of pointers"

The thing is, anything but a chain of pointers (and maybe relaxing it
to "indexes in tables" in addition to pointers) doesn't really work.

The current standard tries to break it at "obvious" points that can
lose the data dependency (either by turning it into a control
dependency, or by just dropping the value, like the left-hand side of
a comma-expression), but the fact is, it's broken.

It's broken not just because the value can be lost other ways (ie the
"p-p" example), it's broken because the value can be turned into a
control dependency so many other ways too.

Compilers regularly turn arithmetic ops with logical comparisons into
branches. So an expression like "a = !!ptr" carries a dependency in
the current C standard, but it's entirely possible that a compiler
ends up turning it into a compare-and-branch rather than a
compare-and-set-conditional, depending on just exactly how "a" ends up
being used. That's true even on an architecture like ARM that has a
lot of conditional instructions (there are way less if you compile for
Thumb, for example, but compilers also do things like "if there are
more than N predicated instructions I'll just turn it into a
branch-over instead").
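A one-function illustration of that point: the source expresses a pure data dependency, but the generated code may not.

```c
#include <assert.h>

/* "a = !!ptr" syntactically carries a dependency on ptr, but a
 * compiler is free to lower it to a compare-and-branch rather than a
 * compare-and-set-conditional, turning the data dependency into a
 * control dependency - which ARM and powerpc do not order. */
int normalize(void *ptr)
{
    return !!ptr;   /* may compile to: cmp; branch; mov 0/1 */
}
```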

So I think the C standard needs to just explicitly say that you can
walk a chain of pointers (with that possible "indexes in arrays"
extension), and nothing more.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-23 Thread Linus Torvalds
On Sun, Feb 23, 2014 at 8:59 PM, Paul E. McKenney
 wrote:
> On Sun, Feb 23, 2014 at 05:35:28PM -0800, Linus Torvalds wrote:
>>
>> But "q = p->next" is ordered by how something can alias "p->next", not by 
>> 'q'!
>>
>> There is no need to restrict anything but 'p' for all of this to work.
>
> I cannot say I understand this last sentence right now from the viewpoint
> of the standard, but suspending disbelief for the moment...

So 'p' is what comes from that consuming load that returns a
'restrict' pointer. That doesn't affect 'q' in any way.

But the act of initializing 'q' by dereferencing p (in "p->next") is -
by virtue of the restrict - something that the compiler can see cannot
alias with anything else, so the compiler could re-order other memory
accesses freely around it, if you see what I mean.

Modulo all the *other* ordering guarantees, of course. So other
atomics and volatiles etc may have their own rules, quite apart from
any aliasing issues.

> Understood -- in this variant, you are taking the marking from the
> fact that there was an assignment from a memory_order_consume load
> rather than from a keyword on the assigned-to variable's declaration.

Yes, and to me, it's really just a legalistic trick to make it clear
that any *other* pointer that happens to point to the same object
cannot be dereferenced within scope of the result of the
atomic_read(mo_consume), at least not if you expect to get the memory
ordering semantics. You can do it, but then you violate the guarantee
of the restrict, and you get what you get - a potential non-ordering.

So if somebody just immediately assigns the value to a normal
(non-restrict) pointer nothing *really* changes. It's just there to
describe the guarantees.

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-23 Thread Linus Torvalds
On Sun, Feb 23, 2014 at 5:16 PM, Paul E. McKenney
 wrote:
>>
>>  (a) we've said 'q' is restricted, so there is no aliasing between q
>> and the pointers b/c. So the compiler is free to move those accesses
>> around the "q = p->next" access.
>
> Ah, if I understand you, very good!
>
> My example intentionally left "q" -not- restricted.

No, I 100% agree with that. "q" is *not* restricted. But "p" is, since
it came from that consuming load.

But "q = p->next" is ordered by how something can alias "p->next", not by 'q'!

There is no need to restrict anything but 'p' for all of this to work.

Btw, it's also worth pointing out that I do *not* in any way expect
people to actually write the "restrict" keyword anywhere. So no need
to change source code.

What you have is a situation where the pointer coming out of the
memory_order_consume is restricted. But if you assign it to a
non-restricted pointer, that's *fine*. That's perfectly normal C
behavior. The "restrict" concept is not something that the programmer
needs to worry about or ever even notice, it's basically just a
promise to the compiler that "if somebody has another pointer lying
around, accesses though that other pointer do not require ordering".

So it sounds like you believe that the programmer would mark things
"restrict", and I did not mean that at all.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-23 Thread Linus Torvalds
On Sat, Feb 22, 2014 at 10:34 PM, Paul E. McKenney
 wrote:
>
> Adding and subtracting integers to/from a RCU-protected pointer makes
> sense to me.

Ack. And that's normal "access to an object" behavior anyway.

> Adding and subtracting integers to/from an RCU-protected integer makes
> sense in many practical cases, but I worry about the compiler figuring
> out that the RCU-protected integer cancelled with some other integer.

I suspect that in practice, all the same normal rules apply - assuming
that there aren't any "aliasing" integers, and that the compiler
doesn't do any value prediction, the end result *should* be exactly
the same as for the pointer arithmetic. So even if the standard were
to talk about just pointers, I suspect that in practice, there really
isn't anything that can go wrong.

> I don't feel all that good about subtractions involving an RCU-protected
> pointer and another pointer, again due to the possibility of arithmetic
> optimizations cancelling everything.

Actually, pointer subtraction is a pretty useful operation, even
without the "gotcha" case of "p-p" just to force a fake dependency.
Getting an index, or indeed just getting an offset within a structure,
is valid and common, and people will and should do it. It doesn't
really matter for my suggested language: it should be perfectly fine
to do something like

   ptr = atomic_read(pp, mo_consume);
   index = ptr - array_base;
   .. pass off 'index' to some function, which then re-generates the
ptr using it ..

and the compiler will have no trouble generating code, and the
suggested "ordered wrt that object" guarantee is that the eventual
ordering is between the pointer load and the use in the function,
regardless of how it got there (ie turning it into an index and back
is perfectly fine).

So both from a legalistic language wording standpoint, and from a
compiler code generation standpoint, this is a non-issue.
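A compilable sketch of the index round-trip described above, with the atomic machinery omitted and illustrative names: the pointer is reduced to an array index, handed off, and regenerated, and the value information is preserved end to end.

```c
#include <assert.h>
#include <stddef.h>

struct item { int val; };
struct item array_base[8] = { [3] = { .val = 9 } };

/* The callee regenerates the pointer from the index: the eventual
 * access still derives its value from the original pointer load. */
int use_index(ptrdiff_t index)
{
    struct item *regenerated = array_base + index;  /* same object */
    return regenerated->val;
}

int lookup(struct item *ptr)
{
    ptrdiff_t index = ptr - array_base;  /* value-preserving */
    return use_index(index);
}
```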

Now, if you actually lose information (ie some chain there drops
enough data from the pointer that it cannot be recovered, partially or
fully), and then "regenerate" the object value by faking it, and still
end up accessing the right data, but without actually going through
any of the the pointer that you loaded, that falls under the
"restricted" heading, and you must clearly at least partially have
used other information. In which case the standard wording wouldn't
guarantee anything at all.

>> I don't see that you would need to protect anything but the first
>> read, so I think you need rcu_dereference() only on the initial
>> pointer access.
>
> Let me try an example:
>
> struct bar {
> int a;
> unsigned b;
> };
>
> struct foo {
> struct bar *next;
> char c;
> };
>
> struct bar bar1;
> struct foo foo1;
>
> struct foo *foop;
>
> T1: bar1.a = -1;
> bar1.b = 2;
> foo1.next = &bar1;
> foo1.c = '*';
> atomic_store_explicit(&foop, &foo1, memory_order_release);

So here, the standard requires that the store with release is an
ordering to all preceding writes. So *all* writes to bar and foo are
ordered, despite the fact that the pointer just points to foo.

> T2: struct foo restrict *p;
> struct bar *q;
>
> p = atomic_load_explicit(&foop, memory_order_consume);
> if (p == NULL)
> return -EDEALWITHIT;
> q = p->next;  /* Ordered with above load. */
> do_something(q->b); /* Non-decorated pointer, ordered? */

So the theory is that a compiler *could* do some value speculation now
on "q", and with value speculation move the actual load of "q->b" up
to before "foop" was even loaded.

So in practice, I think we agree that this doesn't really affect
compiler writers (because they'd have to do a whole lot extra code to
break it intentionally, considering that they can't do
value-prediction on 'p'), and we just need to make sure to close the
hole in the language to make this safe, right?

Let me think about it some more, but my gut feel is that just tweaking
the definition of what "ordered" means is sufficient.

So to go back to the suggested ordering rules (ignoring the "restrict"
part, which is just to clarify that ordering through other means to
get to the object doesn't matter), I suggested:

 "the consume ordering guarantees the ordering between that
  atomic read and the accesses to the object that the pointer
  points to"

and I think the solution is to just say that this ordering acts as a
fence. It doesn't say exactly *where* the fence is, but it says that
there is *some* fence between the load of the pointer and any/all
accesses to the object through that pointer.

So with that definition, the memory accesses that are dependent on 'q'
will obviously be ordered. Now, they will *not* be ordered wrt the
load of q itself, but they will be ordered wrt the load from 'foop' -
because we've made it clear that there is a fence in between.

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-22 Thread Linus Torvalds
On Sat, Feb 22, 2014 at 4:39 PM, Paul E. McKenney
 wrote:
>
> Agreed, by far the most frequent use is "->" to dereference and assignment
> to store into a local variable.  The other operations where the kernel
> expects ordering to be maintained are:
>
> o   Bitwise "&" to strip off low-order bits.  The FIB tree does
> this, for example in fib_table_lookup() in net/ipv4/fib_trie.c.
> The low-order bit is used to distinguish internal nodes from
> leaves -- nodes and leaves are different types of structures.
> (There are a few others.)

Note that this is very much outside the scope of the C standard,
regardless of 'consume' or not.

We'll always do things outside the standard, so I wouldn't worry.
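For concreteness, a sketch of the low-bit tagging pattern the FIB trie uses (the names here are illustrative, not the kernel's): the bottom bit distinguishes leaves from internal nodes and is stripped with "&" before the pointer is dereferenced, so ordering must survive that masking.

```c
#include <assert.h>
#include <stdint.h>

struct node { int key; };
struct node leaf_node = { .key = 4 };   /* int member => low bit free */

int is_leaf(uintptr_t tagged)
{
    return tagged & 1;                   /* tag bit */
}

struct node *untag(uintptr_t tagged)
{
    /* Strip the tag bit; all the pointer's information remains. */
    return (struct node *)(tagged & ~(uintptr_t)1);
}

uintptr_t tagged_leaf(void)
{
    return (uintptr_t)&leaf_node | 1;    /* relies on alignment */
}
```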

> o   Uses "?:" to substitute defaults in case of NULL pointers,
> but ordering must be maintained in the non-default case.
> Most, perhaps all, of these could be converted to "if" should
> "?:" prove problematic.

Note that this doesn't actually affect the restrict/ordering rule in
theory: "?:" isn't special according to those rules. The rules are
fairly simple: we guarantee ordering only to the object that the
pointer points to, and even that guarantee goes out the window if
there is some *other* way to reach the object.

?: is not really relevant, except in the sense that *any* expression
that ends up pointing to outside the object will lose the ordering
guarantee. ?: can be one such expression, but so can "p-p" or anything
like that.

And in *practice*, the only thing that needs to be sure to generate
special code is alpha, and there you'd just add the "rmb" after the
load. That is sufficient to fulfill the guarantees.

On ARM and powerpc, the compiler obviously has to guarantee that it
doesn't do value-speculation on the result, but again, that never
really had anything to do with the whole "carries a dependency", it is
really all about the fact that in order to guarantee the ordering, the
compiler mustn't generate that magical aliased pointer value. But if
the aliased pointer value comes from the *source* code, all bets are
off.

Now, even on alpha, the compiler can obviously move that "rmb" around.
For example, if there is a conditional after the
"atomic_read(mo_consume)", and the compiler can tell that the pointer
that got read by mo_consume is dead along one branch, then the
compiler can move the "rmb" to only exist in the other branch. Why?
Because we inherently guarantee only the order to any accesses to the
object the pointer pointed to, and that the pointer that got loaded is
the *only* way to get to that object (in this context), so if the
value is dead, then so is the ordering.

In fact, even if the value is *not* dead, but it is NULL, the compiler
can validly say "the NULL pointer cannot point to any object, so I
don't have to guarantee any serialization". So code like this (writing
alpha assembly, since in practice only alpha will ever care):

ptr = atomic_read(pp, mo_consume);
if (ptr) {
   ... do something with ptr ..
}
return ptr;

can validly be translated to:

ldq $1,0($2)
beq $1,branch-over
rmb
.. the do-something code using register $1 ..

because the compiler knows that a NULL pointer cannot be dereferenced,
so it can decide to put the rmb in the non-NULL path - even though the
pointer value is still *live* in the other branch (well, the liveness
of a constant value is somewhat debatable, but you get the idea), and
may be used by the caller (but since it is NULL, the "use" cannot
include accessing any object, only really testing).

So note how this is actually very different from the "carries
dependency" rule. It's simpler, and it allows much more natural
optimizations.

> o   Addition and subtraction to adjust both pointers to and indexes
> into RCU-protected arrays.  There are not that many indexes,
> and they could be converted to pointers, but the addition and
> subtraction looks necessary in a some cases.

Addition and subtraction is fine, as long as they stay within the same
object/array.

And realistically, people violate the whole C pointer "same object"
rule all the time. Any time you implement a raw memory allocator,
you'll violate the C standard and you *will* basically be depending on
architecture-specific behavior. So everybody knows that the C "pointer
arithmetic has to stay within the object" is really a fairly made-up
but convenient shorthand for "sure, we know you'll do tricks on
pointer values, but they won't be portable and you may have to take
particular machine representations into account".

> o   Array indexing.  The value from rcu_dereference() is used both
> before and inside the "[]", interestingly enough.

Well, in the C sense, or in the actual "integer index" sense? Because
technically, a[b] is nothing but *(a+b), so "inside" the "[]" is
strictly speaking meaningless. Inside and outside are just syntactic
sugar.
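To make the syntactic-sugar point concrete, all three spellings below load the same element (the array name is illustrative):

```c
#include <assert.h>

static int demo[] = { 10, 20, 30 };

int pick_outside(int b) { return demo[b]; }     /* index "inside" []  */
int pick_star(int b)    { return *(demo + b); } /* the definition     */
int pick_swapped(int b) { return b[demo]; }     /* legal: a[b]==*(a+b)
                                                   is commutative     */
```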

That said, 

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-22 Thread Linus Torvalds
On Sat, Feb 22, 2014 at 10:53 AM, Torvald Riegel  wrote:
>
> Stating that (1) "the standard is wrong" and (2) that you think that
> mo_consume semantics are not good is two different things.

I do agree. They are two independent things.

I think the standard is wrong, because it's overly complex, hard to
understand, and nigh unimplementable. As shown by the bugzilla
example, "carries a dependency" encompasses things that are *not* just
synchronizing things just through a pointer, and as a result it's
actually very complicated, since they could have been optimized away,
or done in non-local code that wasn't even aware of the dependency
carrying.

That said, I'm reconsidering my suggested stricter semantics, because
for RCU we actually do want to test the resulting pointer against NULL
_without_ any implied serialization.

So I still feel that the standard as written is fragile and confusing
(and the bugzilla entry pretty much proves that it is also practically
unimplementable as written), but strengthening the serialization may
be the wrong thing.

Within the kernel, the RCU use for this is literally purely about
loading a pointer, and doing either:

 - testing its value against NULL (without any implied synchronization at all)

 - using it as a pointer to an object, and expecting that any accesses
to that object are ordered wrt the consuming load.

So I actually have a suggested *very* different model that people
might find more acceptable.

How about saying that the result of a "atomic_read(&a, mo_consume)" is
required to be a _restricted_ pointer type, and that the consume
ordering guarantees the ordering between that atomic read and the
accesses to the object that the pointer points to.

No "carries a dependency", no nothing.

Now, there's two things to note in there:

 - the "restricted pointer" part means that the compiler does not need
to worry about serialization to that object through other possible
pointers - we have basically promised that the *only* pointer to that
object comes from the mo_consume. So that part makes it clear that the
"consume" ordering really only is valid wrt that particular pointer
load.

 - the "to the object that the pointer points to" makes it clear that
you can't use the pointer to generate arbitrary other values and claim
to serialize that way.

IOW, with those alternate semantics, that gcc bugzilla example is
utterly bogus, and a compiler can ignore it, because while it tries to
synchronize through the "dependency chain" created with that "p-i+i"
expression, that is completely irrelevant when you use the above rules
instead.

In the bugzilla example, the object that "*(p-i+i)" accesses isn't
actually the object pointed to by the pointer, so no serialization is
implied. And if it actually *were* to be the same object, because "p"
happens to have the same value as "i", then the "restrict" part of the
rule pops up and the compiler can again say that there is no ordering
guarantee, since the programmer lied to it and used a restricted
pointer that aliased with another one.

So the above suggestion basically tightens the semantics of "consume"
in a totally different way - it doesn't make it serialize more, in
fact it weakens the serialization guarantees a lot, but it weakens
them in a way that makes the semantics a lot simpler and clearer.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Linus Torvalds
On Fri, Feb 21, 2014 at 11:43 AM, Peter Sewell
 wrote:
>
> You have to track dependencies through other assignments, e.g. simple x=y

That is all visible in the SSA form. Variable assignment has been
converted to some use of the SSA node that generated the value. The
use might be a phi node or a cast op, or maybe it's just a memory
store op, but the whole point of SSA is that there is one single node
that creates the data (in this case that would be the "load" op with
the associated consume barrier - that barrier might be part of the
load op itself, or it might be implemented as a separate SSA node that
consumes the result of the load that generates a new pseudo), and the
uses of the result are all visible from that.

And yes, there might be a lot of users. But any complex case you just
punt on - and the difference here is that since "punt" means "leave
the barrier in place", it's never a correctness issue.

So yeah, it could be somewhat expensive, although you can always bound
that expense by just punting. But the dependencies in SSA form are no
more complex than the dependencies the C standard talks about now, and
in SSA form they are at least really easy to follow.  So if they are
complex and expensive in SSA form, I'd expect them to be *worse* in
the current "depends-on" syntax form.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Linus Torvalds
On Fri, Feb 21, 2014 at 11:16 AM, Linus Torvalds
 wrote:
>
> Why would this be any different, especially since it's easy to
> understand both for a human and a compiler?

Btw, the actual data path may actually be semantically meaningful even
at a processor level.

For example, let's look at that gcc bugzilla that got mentioned
earlier, and let's assume that gcc is fixed to follow the "arithmetic
is always meaningful, even if it is only syntactic" rule to the letter.
So we have that gcc bugzilla use-case:

   flag ? *(q + flag - flag) : 0;

and let's say that the fixed compiler now generates the code with the
data dependency that is actually suggested in that bugzilla entry:

and w2, w2, #0
ldr w0, [x1, w2]

ie the CPU actually sees that address data dependency. Now everything
is fine, right?

Wrong.

It is actually quite possible that the CPU sees the "and with zero"
and *breaks the dependencies on the incoming value*.

Modern CPU's literally do things like that. Seriously. Maybe not that
particular one, but you'll sometimes find that the CPU - in the
instruction decoding phase (ie very early in the pipeline) - notices
certain patterns that generate constants, and actually drops the data
dependency on the "incoming" registers.

On x86, generating zero using "xor" on the register with itself is one
such known sequence.

Can you guarantee that powerpc doesn't do the same for "and r,r,#0"?
Or what if the compiler generated the much more obvious

sub w2,w2,w2

for that "+flag-flag"? Are you really 100% sure that the CPU won't
notice that that is just a way to generate a zero, and doesn't depend
on the incoming values?

Because I'm not. I know CPU designers that do exactly this.

So I would actually and seriously argue that the whole C standard
attempt to use a syntactic data dependency as a determination of
whether two things are serialized is wrong, and that you actually
*want* to have the compiler optimize away false data dependencies.

Because people playing tricks with "+flag-flag" and thinking that that
somehow generates a data dependency - that's *wrong*. It's not just
the compiler that decides "that's obviously nonsense, I'll optimize it
away". The CPU itself can do it.

So my "actual semantic dependency" model is seriously more likely to
be *correct*. Not just at a compiler level.

Btw, any tricks like that, I would also take a second look at the
assembler and the linker. Many assemblers do some trivial
optimizations too. Are you sure that "and w2, w2, #0" really ends
up being encoded as an "and"? Maybe the assembler says "I can do that
as a 'mov w2,#0' instead"? Who knows? Even power and ARM have their
variable-sized encodings (there are some "compressed executable"
embedded power processors, and there is obviously Thumb2, and many
assemblers end up trying to use equivalent "small" instructions).

So the whole "fake data dependency" thing is just dangerous on so many levels.

MUCH more dangerous than my "actual real dependency" model.

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-21 Thread Linus Torvalds
On Fri, Feb 21, 2014 at 10:25 AM, Peter Sewell
 wrote:
>
> If one thinks this is too fragile, then simply using memory_order_acquire
> and paying the resulting barrier cost (and perhaps hoping that compilers
> will eventually be able to optimise some cases of those barriers to
> hardware-level dependencies) is the obvious alternative.

No, the obvious alternative is to do what we do already, and just do
it by hand. Using acquire is *worse* than what we have now.

Maybe for some other users, the thing falls out differently.

> Many concurrent things will "accidentally" work on x86 - consume is not
> special in that respect.

No. But if you have something that is mis-designed, easy to get wrong,
and untestable, any sane programmer will go "that's bad".

> There are two difficulties with this, if I understand correctly what you're
> proposing.
>
> The first is knowing where to stop.

No.

Read my suggestion. Knowing where to stop is *trivial*.

Either the dependency is immediate and obvious, or you treat it like an acquire.

Seriously. Any compiler that doesn't turn the dependency chain into
SSA or something pretty much equivalent is pretty much a joke. Agreed?

So we can pretty much assume that the compiler will have some
intermediate representation as part of optimization that is basically
SSA.

So what you do is,

 - build the SSA by doing all the normal parsing and possible
tree-level optimizations you already do even before getting to the SSA
stage

 - do all the normal optimizations/simplifications/cse/etc that you do
normally on SSA

 - add *one* new rule to your SSA simplification that goes something like this:

   * when you see a load op that is marked with a "consume" barrier,
just follow the usage chain that comes from that.
   * if you hit a normal arithmetic op, just follow the result chain of that
   * if you hit a memory operation address use, stop and say "looks good"
   * if you hit anything else (including a copy/phi/whatever), abort
   * if nothing aborted as part of the walk, you can now just remove
the "consume" barrier.

You can fancy it up and try to follow more cases, but realistically
the only case that really matters is the "consume" being fed directly
into one or more loads, with possibly an offset calculation in
between. There are certainly more cases you could *try* to remove the
barrier, but the thing is, it's never incorrect to not remove it, so
any time you get bored or hit any complication at all, just do the
"abort" part.
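
That walk can be sketched as a tiny recursive pass over a toy SSA graph. Everything here is invented for illustration (node kinds, field names, the fixed-size use array) - it is a sketch of the rule, not a real compiler pass:

```c
#include <stdbool.h>

/* Toy SSA node: kinds and layout are made up for illustration. */
enum kind { OP_ARITH, OP_MEM_ADDR_USE, OP_OTHER };

struct node {
    enum kind kind;
    struct node *uses[4];   /* ops that consume this node's result */
    int nuses;
};

/* Walk the uses of a consume-marked load. Returns true if every
 * chain ends in a memory address use (barrier can be dropped);
 * any other op - copy, phi, compare - aborts the walk. */
static bool barrier_removable(const struct node *n)
{
    for (int i = 0; i < n->nuses; i++) {
        const struct node *u = n->uses[i];
        switch (u->kind) {
        case OP_MEM_ADDR_USE:           /* stop and say "looks good" */
            break;
        case OP_ARITH:                  /* follow the result chain */
            if (!barrier_removable(u))
                return false;
            break;
        default:                        /* anything else: abort */
            return false;
        }
    }
    return true;    /* nothing aborted: the consume barrier can go */
}
```

The "punt" property is what keeps it simple: returning false just means the barrier stays, which is always correct.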

I *guarantee* that if you describe this to a compiler writer, he will
tell you that my scheme is about a billion times simpler than the
current standard wording. Especially after you've pointed him to that
gcc bugzilla entry and explained to him about how the current standard
cares about those kinds of made-up syntactic chains that he likely
removed quite early, possibly even as he was generating the semantic
tree.

Try it. I dare you. So if you want to talk about "difficulties", the
current C standard loses.

> The second is the proposal in later mails to use some notion of  "semantic"
> dependency instead of this syntactic one.

Bah.

The C standard does that all over. It's called "as-is". The C standard
talks about how the compiler can do pretty much whatever it likes, as
long as the end result acts the same in the virtual C machine.

So claiming that "semantics" being meaningful is somehow complex is
bogus. People do that all the time. If you make it clear that the
dependency chain is through the *value*, not syntax, and that the
value can be optimized all the usual ways, it's quite clear what the
end result is. Any operation that actually meaningfully uses the value
is serialized with the load, and if there is no meaningful use that
would affect the end result in the virtual machine, then there is no
barrier.

Why would this be any different, especially since it's easy to
understand both for a human and a compiler?

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Linus Torvalds
On Thu, Feb 20, 2014 at 2:10 PM, Paul E. McKenney
 wrote:
>
> Linus, given that you are calling me out for pushing "legalistic and bad"
> things, "syntactic bullshit", and playing "little games", I am forced
> to conclude that you have never attended any sort of standards-committee
> meeting.  ;-)

Heh. I have heard people wax poetic about the pleasures of standards
committee meetings.

Enough that I haven't really ever had the slightest urge to participate ;)

> FWIW, the last time I tried excluding things like "f-f", "x%1", "y*0" and
> so on, I got a lot of pushback.  The reason I didn't argue too much back
> (2007 or some such) then was that my view at the time was that I figured
> the kernel code wouldn't do things like that anyway, so it didn't matter.

Well.. I'd really hope we would never do that. That said, we have
certainly used disgusting things that we knew would disable certain
optimizations in the compiler before, so I wouldn't put it *entirely*
past us to do things like that, but I'd argue that we'd do so in order
to confuse the compiler to do what we want, not in order to argue that
it's a good thing.

I mean, right now we have at least *one* active ugly work-around for a
compiler bug (the magic empty inline asm that works around the "asm
goto" bug in gcc). So it's not like I would claim that we don't do
disgusting things when we need to.

But I'm pretty sure that any compiler guy must *hate* that current odd
dependency-generation part, and if I was a gcc person, seeing that
bugzilla entry Torvald pointed at, I would personally want to
dismember somebody with a rusty spoon..

So I suspect there are a number of people who would be *more* than
happy with a change to those odd dependency rules.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Linus Torvalds
On Thu, Feb 20, 2014 at 10:56 AM, Paul E. McKenney
 wrote:
>
> The example gcc breakage was something like this:
>
> i = atomic_load(idx, memory_order_consume);
> x = array[0 + i - i];
>
> Then gcc optimized this to:
>
> i = atomic_load(idx, memory_order_consume);
> x = array[0];
>
> This same issue would hit control dependencies.  You are free to argue
> that this is the fault of ARM and PowerPC memory ordering, but the fact
> remains that your suggested change has -exactly- the same vulnerability
> as memory_order_consume currently has.

No it does not, for two reasons, first the legalistic (and bad) reason:

As I actually described it, the "consume" becomes an "acquire" by
default. If it's not used as an address to the dependent load, then
it's an acquire. The use "going away" in no way makes the acquire go
away in my simplistic model.

So the compiler would actually translate that to a load-with-acquire,
not be able to remove the acquire, and we have end of story. The
actual code generation would be that "ld + sync + ld" on powerpc, or
"ld.acq" on ARM.

Now, the reason I claim that reason was "legalistic and bad" is that
it's actually a cop-out, and if you had made the example be something
like this:

   p = atomic_load(&ptr, memory_order_consume);
   x = array[0 + p - p];
   y = p->val;

then yes, I actually think that the order of loads of 'x' and 'p' are
not enforced by the "consume". The only case that is clear is the
order of 'y' and 'p', because that is the only one that really *USES*
the value.

The "use" of "+p-p" is syntactic bullshit. It's very obvious to even a
slightly developmentally challenged hedgehog that "+p-p" doesn't have
any actual *semantic* meaning, it's purely syntactic.

And the syntactic meaning is meaningless and doesn't matter. Because I
would just get rid of the whole "dependency chain" language
ALTOGETHER.

So in fact, in my world, I would consider your example to be a
non-issue. In my world, there _is_ no "dependency chain" at a
syntactic level. In my SANE world, none of that insane crap language
exists. That language is made-up and tied to syntax exactly because it
*cannot* be tied to semantics.

In my sane world, "consume" has a much simpler meaning, and has no
legalistic syntactic meaning: only real use matters. If the value can
be optimized away, the so can the barrier, and so can the whole load.
The value isn't "consumed", so it has no meaning.

So if you write

   i = atomic_load(idx, memory_order_consume);
   x = array[0+i-i];

then in my world that "+i-i" is meaningless. It's semantic fluff, and
while my naive explanation would have left it as an acquire (because
it cannot be peep-holed away), I actually do believe that the compiler
should be obviously allowed to optimize the load away entirely since
it's meaningless, and if no use of 'i' remains, then it has no
consumer, and so there is no dependency.

Put another way: "consume" is not about getting a lock, it's about
getting a *value*. Only the dependency on the *value* matters, and if
the value is optimized away, there is no dependency.

And the value itself does not have any semantics. There's nothing
"volatile" about the use of the value that would mean that the
compiler cannot re-order it or remove it entirely. There's no barrier
"carried around" by the value per se. The barrier is between the load
and use. That's the *point* of "consume" after all.

The whole "chain of dependency" language is pointless. It's wrong.
It's complicated, it is illogical, and it causes subtle problems
exactly because it got tied to the language *syntax* rather than to
any logical use.

Don't try to re-introduce the whole issue. It was a mistake for the C
standard to talk about dependencies in the first place, exactly
because it results in these idiotic legalistic practices.

You do realize that that whole "*(q+flag-flag)" example in the
bugzilla comes from the fact that the programmer tried to *fight* the
fact that the C standard got the control dependency wrong?

In other words, the *deepest* reason for that bugzilla is that the
programmer tried to force the logical dependency by rewriting it as a
(fake, and easily optimizable) data dependency.

In *my* world, the stupid data-vs-control dependency thing goes away,
the test of the value itself is a use of it, and "*p ? *q : 0" just
does the right thing, there's no reason to do that "q+flag-flag" thing
in the first place, and if you do, the compiler *should* just ignore
your little games.
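
The contrast can be shown side by side. This single-threaded sketch only illustrates the two source forms; 'flag' stands in for a consume-ordered load and 'q' for the dependent data:

```c
/* the natural form: a control dependency on the loaded flag */
int via_control_dep(int flag, const int *q)
{
    return flag ? *q : 0;
}

/* the bugzilla trick: "+ flag - flag" is a purely syntactic
 * data dependency that compilers (and CPUs) may cancel */
int via_fake_data_dep(int flag, const int *q)
{
    return flag ? *(q + flag - flag) : 0;
}
```

Both compute the same value; the second exists only to smuggle a data dependency past a standard that got the control dependency wrong.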

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Linus Torvalds
On Thu, Feb 20, 2014 at 10:53 AM, Torvald Riegel  wrote:
> On Thu, 2014-02-20 at 10:32 -0800, Linus Torvalds wrote:
>> On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
>>  wrote:
>> >
>> > You really need that "consume" to be "acquire".
>>
>> So I think we now all agree that that is what the standard is saying.
>
> Huh?
>
> The standard says that there are two separate things (among many more):
> mo_acquire and mo_consume.  They both influence happens-before in
> different (and independent!) ways.
>
> What Paul is saying is that *you* should have used *acquire* in that
> example.

I understand.

And I disagree. I think the standard is wrong, and what I *should* be
doing is point out the fact very loudly, and just tell people to NEVER
EVER use "consume" as long as it's not reliable and has insane
semantics.

So what I "should do" is to not accept any C11 atomics use in the
kernel. Because with the "acquire", it generates worse code than what
we already have, and with the "consume" it's shit.

See?

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Linus Torvalds
On Thu, Feb 20, 2014 at 11:02 AM, Linus Torvalds
 wrote:
>
> Again, the way I'd expect a compiler writer to actually *do* this is
> to just default to "ac

Oops, pressed send by mistake too early.

I was almost done:

I'd expect a compiler to just default to "acquire" semantics, but then
have a few "obvious peephole" optimizations for cases that it
encounters and where it is easy to replace the synchronization point
with just an address dependency.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Linus Torvalds
On Thu, Feb 20, 2014 at 10:25 AM, Linus Torvalds
 wrote:
>
> While in my *sane* model, where you can consume things even if they
> then result in control dependencies, there will still eventually be a
> "sync" instruction on powerpc (because you really need one between the
> load of 'initialized' and the load of 'calculated'), but the compiler
> would be free to schedule the load of 'magic_system_multiplier'
> earlier.

Actually, "consume" is more interesting than that. Looking at the
bugzilla entry Torvald pointed at, it has the trick to always turn any
"consume" dependency into an address data dependency.

So another reason why you *want* to allow "consume" + "control
dependency" is that it opens up the window for many more interesting
and relevant optimizations than "acquire" does.

Again, let's take that "trivial" expression:

 return atomic_read(&initialized, consume) ? value : -1;

and the compiler can actually turn this into an interesting address
data dependency and optimize it to basically use address arithmetic
and turn it into something like

return *(&value + (&((int)-1)-value)*!atomic_read(&initialized, consume));

Of course, the *programmer* could have done that himself, but the
above is actually a *pessimization* on x86 or other strongly ordered
machines, so doing it at a source code level is actually a bad idea
(not to mention that it's horribly unreadable).

I could easily have gotten the address generation trick above wrong
(see my comment about "horribly unreadable" and no sane person doing
this at a source level), but that "complex" expression is not
necessarily at all complex for a compiler. If "value" is a static
variable, the compiler could create another read-only static variable
that contains that "-1" value, and the difference in addresses would
be a link-time constant, so it would not necessarily be all that ugly
from a code generation standpoint.

There are other ways to turn it into an address dependency, so the
whole "consume as an input to conditionals" really does seem to have
several optimization advantages (over "acquire").
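
One simple way to express the conditional as an address data dependency is an indexing trick. This is a single-threaded sketch with a plain int standing in for the consume load, just to show the shape of the transformation:

```c
/* select between -1 and 'value' by address arithmetic rather than
 * a branch: the address of the load depends on the flag's value */
int select_by_address(int initialized, int value)
{
    int pair[2] = { -1, value };
    return pair[!!initialized];   /* address depends on the flag */
}
```

The loaded address is computed from the flag, so on a weakly ordered CPU the dependent load is ordered without any barrier or branch.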

Again, the way I'd expect a compiler writer to actually *do* this is
to just default to "ac

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Linus Torvalds
On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
 wrote:
>
> You really need that "consume" to be "acquire".

So I think we now all agree that that is what the standard is saying.

And I'm saying that that is wrong, that the standard is badly written,
and should be fixed.

Because before the standard is fixed, I claim that "consume" is
unusable. We cannot trust it. End of story.

The fact that apparently gcc is currently buggy because it got the
dependency calculations *wrong* just reinforces my point.

The gcc bug Torvald pointed at is exactly because the current C
standard is illogical unreadable CRAP. I can guarantee that what
happened is:

 - the compiler saw that the result of the read was used as the left
hand expression of the ternary "? :" operator

 - as a result, the compiler decided that there's no dependency

 - the compiler didn't think about the dependency that comes from the
result of the load *also* being used as the middle part of the ternary
expression, because it had optimized it away, despite the standard not
talking about that at all.

 - so the compiler never saw the dependency that the standard talks about

BECAUSE THE STANDARD LANGUAGE IS PURE AND UTTER SHIT.

My suggested language never had any of these problems, because *my*
suggested semantics are clear, logical, and don't have these kinds of
idiotic pit-falls.

Solution: Fix the f*cking C standard. No excuses, no explanations.
Just get it fixed.

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Linus Torvalds
On Thu, Feb 20, 2014 at 9:49 AM, Torvald Riegel  wrote:
>
> Yes, mo_consume is more tricky than mo_acquire.
>
> However, that has an advantage because you can avoid getting stronger
> barriers if you don't need them (ie, you can avoid the "auto-update to
> acquire" you seem to have in mind).

Oh, I agree about that part - I very much understand the reason for
"consume", and I can see how it is more relaxed than "acquire" under
many circumstances.

I just think that you actually *do* want to have "consume" even for
flag values, exactly *because* it is potentially cheaper than acquire.

In fact, I'd argue that making consume reliable in the face of control
dependencies is actually a *good* thing. It may not matter for
something like x86, where consume and acquire end up with the same
simple load, but even there it might relax instruction scheduling a
bit, since a 'consume' would have a barrier just to the *users* of the
value loaded, while 'acquire' would still have a scheduling barrier to
any subsequent operations.

So I claim that for a sequence like my example, where the reader
basically does something like

   load_atomic(&initialized, consume) ? value : -1;

the "consume" version can actually generate better code than "acquire"
- if "consume" is specified the way *I* specified it.

The way the C standard specifies it, the above code is *buggy*.
Agreed? It's really really subtly buggy, and I think that bug is not
only a real danger, I think it is logically hard to understand why.
The bug only makes sense to people who understand how memory ordering
and branch prediction interacts.

The way *I* suggested "consume" be implemented, the above not only
works and is sensible, it actually generates possibly better code than
forcing the programmer to use the (illogical) "acquire" operation.

Why? Let me give you another - completely realistic, even if obviously
a bit made up - example:

   int return_expensive_system_value(void)
   {
       static atomic_t initialized;
       static int calculated;

       if (atomic_read(&initialized, mo_consume))
           return calculated;

       // let's say that this code opens /proc/cpuinfo and counts
       // the number of CPU's or whatever ...
       calculated = read_value_from_system_files();
       atomic_write(&initialized, 1, mo_release);
       return calculated;
   }

and let's all agree that this is a somewhat realistic example, and we
can imagine why/how somebody would write code like this. It's
basically a very common lazy initialization pattern, you'll find this
in libraries, in kernels, in application code yadda yadda. No
argument?

Now, let's assume that it turns out that this value ends up being
really performance-critical, so the programmer makes the fast-path an
inline function, tells the compiler that "initialized" read is likely,
and generally wants the compiler to optimize it to hell and back.
Still sounds reasonable and realistic?

In other words, the *expected* code sequence for this is (on x86,
which doesn't need any barriers):

cmpl $0, initialized
je unlikely_out_of_line_case
movl calculated, eax

and on ARM/power you'd see a 'sync' instruction or whatever.

So far 'acquire' and 'consume' have exactly the same code generation on
power or x86, so your argument can be: "Ok, so let's just use the
inconvenient and hard-to-understand 'consume' semantics that the
current standard has, and tell the programmer that he should use
'acquire' and not worry his little head about the difference because
he will never understand it anyway".

Sure, that would be an inconvenience for programmers, but hey,
they're programming in C or C++, so they are *expected* to be manly
men or womanly women, and a little illogical inconvenience never hurt
anybody. After all, compared to the aliasing rules, that "use acquire,
not consume" rule is positively *simple*, no?

Are we all in agreement so far?

But no, the "consume" thing can actually generate better code. Trivial example:

int my_threads_value;
extern int magic_system_multiplier;

my_thread_value = return_expensive_system_value();
my_thread_value *= magic_system_multiplier;

and in the "acquire" model, the "acquire" itself means that the load
from magic_system_multiplier is now constrained by the acquire memory
ordering on "initialized".

While in my *sane* model, where you can consume things even if they
then result in control dependencies, there will still eventually be a
"sync" instruction on powerpc (because you really need one between the
load of 'initialized' and the load of 'calculated'), but the compiler
would be free to schedule the load of 'magic_system_multiplier'
earlier.

So as far as I can tell, we want the 'consume' memory ordering to
honor *all* dependencies, because
 - it's simpler
 - it's more logical
 - it's less error-prone
 - and it allows better code generation

Hmm?

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Linus Torvalds
On Thu, Feb 20, 2014 at 9:14 AM, Torvald Riegel  wrote:
>>
>> So the clarification is basically to the statement that the "if
>> (consume(p)) a" version *would* have an ordering guarantee between the
>> read of "p" and "a", but the "consume(p) ? a : b" would *not* have
>> such an ordering guarantee. Yes?
>
> Not as I understand it.  If my reply above wasn't clear, let me know and
> I'll try to rephrase it into something that is.

Yeah, so you and Paul agree. And as I mentioned in the email that
crossed with yours, I think that means that the standard is overly
complex, hard to understand, fragile, and all *pointlessly* so.

Btw, there are many ways that "use a consume as an input to a
conditional" can happen. In particular, even if the result is actually
*used* like a pointer as far as the programmer is concerned, tricks
like pointer compression etc can well mean that the "pointer" is
actually at least partly implemented using conditionals, so that some
paths end up being only dependent through a comparison of the pointer
value.

So I very much did *not* want to complicate the "litmus test" code
snippet when Paul tried to make it more complex, but I do think that
there are cases where code that "looks" like pure pointer chasing
actually is not for some cases, and then can become basically that
litmus test for some path.

Just to give you an example: in simple list handling it is not at all
unusual to have a special node that is discovered by comparing the
address, not by just loading from the pointer and following the list
itself. Examples of that would be a HEAD node in a doubly linked list
(Linux uses this concept quite widely, it's our default list
implementation), or it could be a place-marker ("cursor" entry) in the
list for safe traversal in the presence of concurrent deletes etc.

And obviously there is the already much earlier mentioned
compiler-induced compare, due to value speculation, that can basically
create such sequences even where they did not originally exist in the
source code itself.

So even if you work with "pointer dereferences", and thus match that
particular consume pattern, I really don't see why anybody would think
that "hey, we can ignore any control dependencies" is a good idea.
It's a *TERRIBLE* idea.

And as mentioned, it's a terrible idea with no upsides. It doesn't
help compiler optimizations for the case it's *intended* to help,
since those optimizations can still be done without the horribly
broken semantics. It doesn't help compiler writers, it just confuses
them.

And it sure as hell doesn't help users.

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-20 Thread Linus Torvalds
On Thu, Feb 20, 2014 at 12:30 AM, Paul E. McKenney
 wrote:
>>
>> So lets make this really simple: if you have a consume->cmp->read, is
>> the ordering of the two reads guaranteed?
>
> Not as far as I know.  Also, as far as I know, there is no difference
> between consume and relaxed in the consume->cmp->read case.

Ok, quite frankly, I think that means that "consume" is misdesigned.

> The above example can have a return value of 0 if translated
> straightforwardly into either ARM or Power, right?

Correct. And I think that is too subtle. It's dangerous, it makes code
that *looks* correct work incorrectly, and it actually happens to work
on x86 since x86 doesn't have crap-for-brains memory ordering
semantics.

> So, if you make one of two changes to your example, then I will agree
> with you.

No. We're not playing games here. I'm fed up with complex examples
that make no sense.

Nobody sane writes code that does that pointer comparison, and it is
entirely immaterial what the compiler can do behind our backs. The C
standard semantics need to make sense to the *user* (ie programmer),
not to a CPU and not to a compiler. The CPU and compiler are "tools".
They don't matter. Their only job is to make the code *work*, dammit.

So no idiotic made-up examples that involve code that nobody will ever
write and that have subtle issues.

So the starting point is that (same example as before, but with even
clearer naming):

Initialization state:
  initialized = 0;
  value = 0;

Consumer:

return atomic_read(&initialized, consume) ? value : -1;

Writer:
value = 42;
atomic_write(&initialized, 1, release);

and because the C memory ordering standard is written in such a way
that this is subtly buggy (and can return 0, which is *not* logically
a valid value), then I think the C memory ordering standard is broken.

That "consume" memory ordering is dangerous as hell, and it is
dangerous FOR NO GOOD REASON.
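
For reference, the same litmus test restated in standard C11 syntax (shown single-threaded here, just to make the shapes concrete; memory_order_consume is spelled as in <stdatomic.h>):

```c
#include <stdatomic.h>

static atomic_int initialized;   /* starts 0 */
static int value;                /* starts 0 */

int consumer(void)
{
    /* read the flag with consume ordering, then the data */
    return atomic_load_explicit(&initialized, memory_order_consume)
               ? value : -1;
}

void writer(void)
{
    value = 42;
    atomic_store_explicit(&initialized, 1, memory_order_release);
}
```

The argument here is that a concurrent consumer should only ever see -1 (before the release store) or 42 (after it); under the standard's consume rules as written, 0 is a permitted outcome because the use is a control dependency rather than a "carries a dependency" chain.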

The trivial "fix" to the standard would be to get rid of all the
"carries a dependency" crap, and just say that *anything* that depends
on it is ordered wrt it.

That just means that on alpha, "consume" implies an unconditional read
barrier (well, unless the value is never used and is loaded just
because it is also volatile), on x86, "consume" is the same as
"acquire" which is just a plain load with ordering guarantees, and on
ARM or power you can still avoid the extra synchronization *if* the
value is used just for computation and for following pointers, but if
the value is used for a comparison, there needs to be a
synchronization barrier.

Notice? Getting rid of the stupid "carries-dependency" crap from the
standard actually
 (a) simplifies the standard
 (b) means that the above obvious example *works*
 (c) does not in *any* way make for any less efficient code generation
for the cases that "consume" works correctly for in the current
mis-designed standard.
 (d) is actually a hell of a lot easier to explain to a compiler
writer, and I can guarantee that it is simpler to implement too.

Why do I claim (d) "it is simpler to implement" - because on ARM/power
you can implement it *exactly* as a special "acquire", with just a
trivial peep-hole special case that follows the use chain of the
acquire op to the consume, and then just drop the acquire bit if the
only use is that compute-to-load chain.

In fact, realistically, the *only* thing you need to actually care
about for the intended use case of "consume" is the question "is the
consuming load immediately consumed as an address (with offset) of a
memory operation. So you don't even need to follow any complicated
computation chain in a compiler - the only case that matters for the
barrier removal optimization is the "oh, I can see that it is only
used as an address to a dereference".
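As a hedged sketch (not from the email itself): the "good" consume case described above — the loaded pointer feeding straight into a dereference — looks like this in C11. Note that current gcc and clang simply promote memory_order_consume to memory_order_acquire, so the address-dependency optimization discussed here is not actually performed by today's compilers.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <assert.h>

struct node { int val; };

static _Atomic(struct node *) head;

/* The case the peephole would care about: the consumed pointer is
 * immediately used as the address of a dereference, so ARM/POWER get
 * the ordering for free from the address dependency and would need no
 * extra barrier instruction. */
static int read_published_val(void)
{
    struct node *p = atomic_load_explicit(&head, memory_order_consume);
    if (p == NULL)
        return -1;
    return p->val;   /* this load is ordered after the load of 'head' */
}
```

A publisher would store the fully initialized node into `head` with a release store; the address dependency then orders the `p->val` load on the reader side.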

Seriously. The current standard is broken. It's broken because it
mis-compiles code that on the face of it looks logical and works, it's
broken because it's overly complex, and it's broken because the
complexity doesn't even *buy* you anything. All this complexity for no
reason. When much simpler wording and implementation actually WORKS
BETTER.

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Linus Torvalds
On Wed, Feb 19, 2014 at 8:01 PM, Paul E. McKenney
 wrote:
>
> The control dependency should order subsequent stores, at least assuming
> that "a" and "b" don't start off with identical stores that the compiler
> could pull out of the "if" and merge.  The same might also be true for ?:
> for all I know.  (But see below)

Stores I don't worry about so much because

 (a) you can't sanely move stores up in a compiler anyway
 (b) no sane CPU moves stores up, since they aren't on the critical path

so a read->cmp->store is actually really hard to make anything sane
re-order. I'm sure it can be done, and I'm sure it's stupid as hell.

But that "it's hard to screw up" is *not* true for a load->cmp->load.

So let's make this really simple: if you have a consume->cmp->read, is
the ordering of the two reads guaranteed?

> As long as there is an unbroken chain of -data- dependencies from the
> consume to the later access in question, and as long as that chain
> doesn't go through the excluded operations, yes.

So let's make it *really* specific, and make it real code doing a real
operation, that is actually realistic and reasonable in a threaded
environment, and may even be in some critical code.

The issue is the read-side ordering guarantee for 'a' and 'b', for this case:

 - Initial state:

   a = b = 0;

 - Thread 1 ("consumer"):

if (atomic_read(&a, consume))
 return b;
/* not yet initialized */
return -1;

 - Thread 2 ("initializer"):

 b = some_value_lets_say_42;
 /* We are now ready to party */
 atomic_write(&a, 1, release);

and quite frankly, if there is no ordering guarantee between the read
of "a" and the read of "b" in the consumer thread, then the C atomics
standard is broken.

Put another way: I claim that if "thread 1" ever sees a return value
other than -1 or 42, then the whole definition of atomics is broken.

Question 2: and what changes if the atomic_read() is turned into an
acquire, and why? Does it start working?

> Neither has a data-dependency guarantee, because there is no data
> dependency from the load to either "a" or "b".  After all, the value
> loaded got absorbed into the "if" condition.  However, according to
> discussions earlier in this thread, the "if" variant would have a
> control-dependency ordering guarantee for any stores in "a" and "b"
> (but not loads!).

So exactly what part of the standard allows the loads to be
re-ordered, and why? Quite frankly, I'd think that any sane person
will agree that the above code snippet is realistic, and that my
requirement that thread 1 sees either -1 or 42 is valid.

And if the C standards body has said that control dependencies break
the read ordering, then I really think that the C standards committee
has screwed up.

If the consumer of an atomic load isn't a pointer chasing operation,
then the consume should be defined to be the same as acquire. None of
this "conditionals break consumers". No, conditionals on the
dependency path should turn consumers into acquire, because otherwise
the "consume" load is dangerous as hell.

And if the definition of acquire doesn't include the control
dependency either, then the C atomic memory model is just completely
and utterly broken, since the above *trivial* and clearly useful
example is broken.

I really think the above example is pretty damn black-and-white.
Either it works, or the standard isn't worth wiping your ass with.

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel  wrote:
> On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
>>
>> Can you point to it? Because I can find a draft standard, and it sure
>> as hell does *not* contain any clarity of the model. It has a *lot* of
>> verbiage, but it's pretty much impossible to actually understand, even
>> for somebody who really understands memory ordering.
>
> http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> This has an explanation of the model up front, and then the detailed
> formulae in Section 6.  This is from 2010, and there might have been
> smaller changes since then, but I'm not aware of any bigger ones.

Ahh, this is different from what others pointed at. Same people,
similar name, but not the same paper.

I will read this version too, but from reading the other one and the
standard in parallel and trying to make sense of it, it seems that I
may have originally misunderstood part of the whole control dependency
chain.

The fact that the left side of "? :", "&&" and "||" breaks data
dependencies made me originally think that the standard tried very
hard to break any control dependencies. Which I felt was insane, when
then some of the examples literally were about the testing of the
value of an atomic read. The data dependency matters quite a bit. The
fact that the other "Mathematical" paper then very much talked about
consume only in the sense of following a pointer made me think so even
more.

But reading it some more, I now think that the whole "data dependency"
logic (which is where the special left-hand side rule of the ternary
and logical operators come in) are basically an exception to the rule
that sequence points end up being also meaningful for ordering (ok, so
C11 seems to have renamed "sequence points" to "sequenced before").

So while an expression like

atomic_read(p, consume) ? a : b;

doesn't have a data dependency from the atomic read that forces
serialization, writing

   if (atomic_read(p, consume))
  a;
   else
  b;

the standard *does* imply that the atomic read is "happens-before" wrt
"a", and I'm hoping that there is no question that the control
dependency still acts as an ordering point.

THAT was one of my big confusions, the discussion about control
dependencies and the fact that the logical ops broke the data
dependency made me believe that the standard tried to actively avoid
the whole issue with "control dependencies can break ordering
dependencies on some CPU's due to branch prediction and memory
re-ordering by the CPU".

But after all the reading, I'm starting to think that that was never
actually the implication at all, and the "logical ops breaks the data
dependency rule" is simply an exception to the sequence point rule.
All other sequence points still do exist, and do imply an ordering
that matters for "consume".

Am I now reading it right?

So the clarification is basically to the statement that the "if
(consume(p)) a" version *would* have an ordering guarantee between the
read of "p" and "a", but the "consume(p) ? a : b" would *not* have
such an ordering guarantee. Yes?

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-19 Thread Linus Torvalds
On Wed, Feb 19, 2014 at 6:40 AM, Torvald Riegel  wrote:
>
> If all those other threads written in whichever way use the same memory
> model and ABI for synchronization (e.g., choice of HW barriers for a
> certain memory_order), it doesn't matter whether it's a hardware thread,
> microcode, whatever.  In this case, C11 atomics should be fine.
> (We have this in userspace already, because correct compilers will have
> to assume that the code generated by them has to properly synchronize
> with other code generated by different compilers.)
>
> If the other threads use a different model, access memory entirely
> differently, etc, then we might be back to "volatile" because we don't
> know anything, and the very strict rules about execution steps of the
> abstract machine (ie, no as-if rule) are probably the safest thing to
> do.

Oh, I don't even care about architectures that don't have real hardware atomics.

So if there's a software protocol for atomics, all bets are off. The
compiler almost certainly has to do atomics with function calls
anyway, and we'll just plug in our own. And frankly, nobody will ever
care, because those architectures aren't relevant, and never will be.

Sure, there are some ancient Sparc platforms that only support a
single-byte "ldstub" and there are some embedded chips that don't
really do SMP, but have some pseudo-smp with special separate locking.
Really, nobody cares. The C standard has those crazy lock-free atomic
tests, and talks about address-free, but generally we require both
lock-free and address-free in the kernel, because otherwise it's just
too painful to do interrupt-safe locking, or do atomics in user-space
(for futexes).
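C11 actually exposes the lock-free property the paragraph above is talking about, both as compile-time feature macros and as a runtime query. A minimal sketch of the kernel-style requirement (assuming a mainstream platform where int atomics are always lock-free):

```c
#include <stdatomic.h>
#include <assert.h>

/* The kernel-style requirement: plain int atomics must be genuinely
 * lock-free (value 2 = always lock-free), or interrupt-safe locking
 * and futex-style user-space atomics don't work.  Note C11 guarantees
 * lock-free implies address-free only as a recommendation (7.17.5). */
_Static_assert(ATOMIC_INT_LOCK_FREE == 2,
               "need always-lock-free int atomics");

static atomic_int probe;

static int int_atomics_lock_free(void)
{
    return atomic_is_lock_free(&probe);
}
```

On any architecture the kernel considers relevant, both the static and the runtime check pass; platforms that would fail them fall into the "software protocol" category dismissed above.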

So if your worry is just about software protocols for CPU's that
aren't actually designed for modern SMP, that's pretty much a complete
non-issue.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 1:21 PM, Torvald Riegel  wrote:
>>
>> So imagine that you have some clever global optimizer that sees that
>> the program never ever actually sets the dirty bit at all in any
>> thread, and then uses that kind of non-local knowledge to make
>> optimization decisions. THAT WOULD BE BAD.
>>
>> Do you see what I'm aiming for?
>
> Yes, I do.  But that seems to be "volatile" territory.  It crosses the
> boundaries of the abstract machine, and thus is input/output.  Which
> fraction of your atomic accesses can read values produced by hardware?
> I would still suppose that lots of synchronization is not affected by
> this.

The "hardware can change things" case is indeed pretty rare.

But quite frankly, even when it isn't hardware, as far as the compiler
is concerned you have the exact same issue - you have TLB faults
happening on other CPU's that do the same thing asynchronously using
software TLB fault handlers. So *semantically*, it really doesn't make
any difference what-so-ever if it's a software TLB handler on another
CPU, a microcoded TLB fault, or an actual hardware path.

So if the answer for all of the above is "use volatile", then I think
that means that the C11 atomics are badly designed.

The whole *point* of atomic accesses is that stuff like above should
"JustWork(tm)"

> Do you perhaps want a weaker form of volatile?  That is, one that, for
> example, allows combining of two adjacent loads of the dirty bits, but
> will make sure that this is treated as if there is some imaginary
> external thread that it cannot analyze and that may write?

Yes, that's basically what I would want. And it is what I would expect
an atomic to be. Right now we tend to use "ACCESS_ONCE()", which is a
bit of a misnomer, because technically we really generally want
"ACCESS_AT_MOST_ONCE()" (but "once" is what we get, because we use
volatile, and is a hell of a lot simpler to write ;^).

So we obviously use "volatile" for this currently, and generally the
semantics we really want are:

 - the load or store is done as a single access ("atomic")

 - the compiler must not try to re-materialize the value by reloading
it from memory (this is the "at most once" part)

and quite frankly, "volatile" is a big hammer for this. In practice it
tends to work pretty well, though, because in _most_ cases, there
really is just the single access, so there isn't anything that it
could be combined with, and the biggest issue is often just the
correctness of not re-materializing the value.

And I agree - memory ordering is a totally separate issue, and in fact
we largely tend to consider it entirely separate. For cases where we
have ordering constraints, we either handle those with special
accessors (ie "atomic-modify-and-test" helpers tend to have some
serialization guarantees built in), or we add explicit fencing.

But semantically, C11 atomic accessors *should* generally have the
correct behavior for our uses.

If we have to add "volatile", that makes atomics basically useless. We
already *have* the volatile semantics, if atomics need it, that just
means that atomics have zero upside for us.

>> But *local* optimizations are fine, as long as they follow the obvious
>> rule of not actually making changes that are semantically visible.
>
> If we assume that there is this imaginary thread called hardware that
> can write/read to/from such weak-volatile atomics, I believe this should
> restrict optimizations sufficiently even in the model as specified in
> the standard.

Well, what about *real* threads that do this, but that aren't
analyzable by the C compiler because they are written in another
language entirely (inline asm, asm, perl, INTERCAL, microcode,
PAL-code, whatever?)

I really don't think that "hardware" is necessary for this to happen.
What is done by hardware on x86, for example, is done by PAL-code
(loaded at boot-time) on alpha, and done by hand-tuned assembler fault
handlers on Sparc. The *effect* is the same: it's not visible to the
compiler. There is no way in hell that the compiler can understand the
hand-tuned Sparc TLB fault handler, even if it parsed it.

>> IOW, I would *revel* in the fact that different machines are
>> different, and basically just describe the "stupid" code generation.
>> You'd get the guaranteed semantic baseline, but you'd *also* be able
>> to know that whatever architecture guarantees you have would remain.
>> Without having to describe those architecture issues.
>
> Would you be okay with this preventing lots of optimizations a compiler
> otherwise could do?  Because AFAICT, this spreads into non-synchronizing
> code via the dependency-tracking, for example.

Actually, it probably wouldn't really hurt code generation.

The thing is, you really have three cases:

 - architectures that have weak memory ordering and fairly stupid
hardware (power, some day maybe ARM)

 - architectures that expose a fairly strong memory ordering, but
reorders aggressively in hardware by tracking c

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 10:23 AM, Peter Sewell
 wrote:
>
> interesting list.  So are you saying that value-range-analysis and
> such-like (I say glibly, without really knowing what "such-like"
> refers to here) are fundamentally incompatible with
> the kernel code

No, it's fine to do things like value-range-analysis, it's just that
you have to do it on a local scope, and not think that you can see
every single write to some variable.

And as Paul points out, this is actually generally true even outside
of kernels. We may be pretty unique in having some things that are
literally done by hardware (eg the page table updates), but we're
certainly not unique in having loadable modules or code that the
compiler doesn't know about by virtue of being compiled separately
(assembly files, JITted, whatever).

And we are actually perfectly fine with compiler barriers. One of our
most common barriers is simply this:

   #define barrier() __asm__ __volatile__("": : :"memory")

which basically tells the compiler "something changes memory in ways
you don't understand, so you cannot assume anything about memory
contents".

Obviously, a compiler is still perfectly fine optimizing things like
local variables etc that haven't had their address taken across such a
barrier (regardless of whether those local variables might be spilled
to the stack frame etc), but equally obviously we'd require that this
kind of thing makes sure that any atomic writes have been finalized by
the time the barrier happens (it does *not* imply a memory ordering,
just a compiler memory access barrier - we have separate things for
ordering requirements, and those obviously often generate actual
barrier instructions).
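A hedged sketch of the classic use of this compiler barrier — spinning on a flag that something outside the compiler's view (an IRQ handler, another CPU) will set:

```c
#include <assert.h>

#define barrier() __asm__ __volatile__("" : : : "memory")

static int flag;   /* set elsewhere: IRQ handler, another CPU, ... */

/* Without the barrier, the compiler could legally read 'flag' once,
 * cache it in a register, and spin forever; the barrier forces a
 * fresh load on every iteration.  This is a compiler barrier only --
 * it emits no instruction and implies no CPU memory ordering. */
static void wait_for_flag(void)
{
    while (!flag)
        barrier();
}
```

Note this is exactly the split described above: barrier() handles the compiler, and separate primitives with real barrier instructions handle CPU-level ordering when that is also needed.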

And we're likely always going to have things like this, but if C11
atomics end up working well for us, we might have *fewer* of them.

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 10:21 AM, Peter Sewell
 wrote:
>
> This is a bit more subtle, because (on ARM and POWER) removing the
> dependency and conditional branch is actually in general *not* equivalent
> in the hardware, in a concurrent context.

So I agree, but I think that's a generic issue with non-local memory
ordering, and is not at all specific to the optimization wrt that
"x?42:42" expression.

If you have a value that you loaded with a non-relaxed load, and you
pass that value off to a non-local function that you don't know what
it does, in my opinion that implies that the compiler had better add
the necessary serialization to say "whatever that other function does,
we guarantee the semantics of the load".

So on ppc, if you do a load with "consume" or "acquire" and then call
another function without having had something in the caller that
serializes the load, you'd better add the lwsync or whatever before
the call. Exactly because the function call itself otherwise basically
breaks the visibility into ordering. You've basically turned a
load-with-ordering-guarantees into just an integer that you passed off
to something that doesn't know about the ordering guarantees - and you
need that "lwsync" in order to still guarantee the ordering.

Tough titties. That's what a CPU with weak memory ordering semantics
gets in order to have sufficient memory ordering.

And I don't think it's actually a problem in practice. If you are
doing loads with ordered semantics, you're not going to pass the
result off willy-nilly to random functions (or you really *do* require
the ordering, because the load that did the "acquire" was actually for
a lock!)

So I really think that the "local optimization" is correct regardless.

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel  wrote:
>>
>>   "Consume operation: no reads in the current thread dependent on the
>> value currently loaded can be reordered before this load"
>
> I can't remember seeing that language in the standard (ie, C or C++).
> Where is this from?

That's just for googling for explanations. I do have some old standard
draft, but that doesn't have any concise definitions anywhere that I
could find.

>> and it could make a compiler writer say that value speculation is
>> still valid, if you do it like this (with "ptr" being the atomic
>> variable):
>>
>>   value = ptr->val;
>
> I assume the load from ptr has mo_consume ordering?

Yes.

>> into
>>
>>   tmp = ptr;
>>   value = speculated.value;
>>   if (unlikely(tmp != &speculated))
>> value = tmp->value;
>>
>> which is still bogus. The load of "ptr" does happen before the load of
>> "value = speculated->value" in the instruction stream, but it would
>> still result in the CPU possibly moving the value read before the
>> pointer read at least on ARM and power.
>
> And surprise, in the C/C++ model the load from ptr is sequenced-before
> the load from speculated, but there's no ordering constraint on the
> reads-from relation for the value load if you use mo_consume on the ptr
> load.  Thus, the transformed code has less ordering constraints than the
> original code, and we arrive at the same outcome.

Ok, good.

> The standard is clear on what's required.  I strongly suggest reading
> the formalization of the memory model by Batty et al.

Can you point to it? Because I can find a draft standard, and it sure
as hell does *not* contain any clarity of the model. It has a *lot* of
verbiage, but it's pretty much impossible to actually understand, even
for somebody who really understands memory ordering.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell  wrote:
>
> For example, suppose we have, in one compilation unit:
>
> void f(int ra, int*rb) {
>   if (ra==42)
> *rb=42;
>   else
> *rb=42;
> }

So this is a great example, and in general I really like your page at:

> For more context, this example is taken from a summary of the thin-air
> problem by Mark Batty and myself,
> , and the problem with
> dependencies via other compilation units was AFAIK first pointed out
> by Hans Boehm.

and the reason I like your page is that it really talks about the
problem by pointing to the "unoptimized" code, and what hardware would
do.

As mentioned, I think that's actually the *correct* way to think about
the problem space, because it allows the programmer to take hardware
characteristics into account, without having to try to "describe" them
at a source level.

As to your example of

   if (ra)
   atomic_write(rb, A);
   else
   atomic_write(rb, B);

I really think that it is ok to combine that into

atomic_write(rb, ra ? A:B);

(by virtue of "exact same behavior on actual hardware"), and then the
only remaining question is whether the "ra?A:B" can be optimized to
remove the conditional if A==B as in your example where both are "42".
Agreed?

Now, I would argue that the "naive" translation of that is
unambiguous, and since "ra" is not volatile or magic in any way, then
"ra?42:42" can obviously be optimized into just 42 - by the exact same
rule that says "the compiler can do any transformation that is
equivalent in the hardware". The compiler can *locally* decide that
that is the right thing to do, and any programmer that complains about
that decision is just crazy.

So my "local machine behavior equivalency" rule means that that
function can be optimized into a single "store 42 atomically into rb".
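Spelled out in C11 (a sketch, using relaxed stores since the example puts no ordering on the write): the two-armed version and the folded version the "same behavior on actual hardware" rule permits:

```c
#include <stdatomic.h>
#include <assert.h>

/* The two-armed version from the example above ... */
static void f(int ra, _Atomic int *rb)
{
    if (ra == 42)
        atomic_store_explicit(rb, 42, memory_order_relaxed);
    else
        atomic_store_explicit(rb, 42, memory_order_relaxed);
}

/* ... and what local machine-behavior equivalence lets the compiler
 * produce: one unconditional store, no branch or dependency on 'ra'.
 * The ARM/POWER subtlety is that dropping the branch also drops a
 * control dependency some concurrent reader might have relied on. */
static void f_folded(int ra, _Atomic int *rb)
{
    (void)ra;
    atomic_store_explicit(rb, 42, memory_order_relaxed);
}
```

Both functions store 42 regardless of ra, which is the whole point of the example: the observable single-threaded behavior is identical, and the argument is over whether a concurrent observer may rely on the difference.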

Now, if it's *not* compiled locally, and is instead implemented as a
macro (or inline function), there are obviously situations where "ra ?
A : B" ends up having to do other things. In particular, "ra" may be
volatile or an atomic read that has ordering semantics, and then that
expression doesn't become just "42", but that's a separate issue. It's
not all that dissimilar to "function calls are sequence points",
though, and obviously if the source of "ra" has semantic meaning, you
have to honor that semantic meaning.

Agreed?

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-18 Thread Linus Torvalds
On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel  wrote:
> On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote:
>> And exactly because I know enough, I would *really* like atomics to be
>> well-defined, and have very clear - and *local* - rules about how they
>> can be combined and optimized.
>
> "Local"?

Yes.

So I think that one of the big advantages of atomics over volatile is
that they *can* be optimized, and as such I'm not at all against
trying to generate much better code than for volatile accesses.

But at the same time, that can go too far. For example, one of the
things we'd want to use atomics for is page table accesses, where it
is very important that we don't generate multiple accesses to the
values, because parts of the values can be changed *by*hardware* (ie
accessed and dirty bits).

So imagine that you have some clever global optimizer that sees that
the program never ever actually sets the dirty bit at all in any
thread, and then uses that kind of non-local knowledge to make
optimization decisions. THAT WOULD BE BAD.

Do you see what I'm aiming for? Any optimization that tries to prove
anything from more than local state is by definition broken, because
it assumes that everything is described by the program.

But *local* optimizations are fine, as long as they follow the obvious
rule of not actually making changes that are semantically visible.

(In practice, I'd be impressed as hell for that particular example,
and we actually do end up setting the dirty bit by hand in some
situations, so the example is slightly made up, but there are other
cases that might be more realistic in that sometimes we do things that
are hidden from the compiler - in assembly etc - and the compiler
might *think* it knows what is going on, but it doesn't actually see
all accesses).

> Sorry, but the rules *are* very clear.  I *really* suggest to look at
> the formalization by Batty et al.  And in these rules, proving that a
> read will always return value X has a well-defined meaning, and you can
> use it.  That simply follows from how the model is built.

What "model"?

That's the thing. I have tried to figure out whether the model is some
abstract C model, or a model based on the actual hardware that the
compiler is compiling for, and whether the model is one that assumes
the compiler has complete knowledge of the system (see the example
above).

And it seems to be a mixture of it all. The definitions of the various
orderings obviously very much imply that the compiler has to insert
the proper barriers or sequence markers for that architecture, but
then there is no apparent way to depend on any *other* architecture
ordering guarantees. Like our knowledge that all architectures (with
the exception of alpha, which really doesn't end up being something we
worry about any more) end up having the load dependency ordering
guarantee.

> What you seem to want just isn't covered by the model as it is today --
> you can't infer from that that the model itself would be wrong.  The
> dependency chains aren't modeled in the way you envision it (except in
> what consume_mo tries, but that seems to be hard to implement); they are
> there on the level of the program logic as modeled by the abstract
> machine and the respective execution/output rules, but they are not
> built to represent those specific ordering guarantees the HW gives you.

So this is a problem. It basically means that we cannot do the kinds
of things we do now, which very much involve knowing what the memory
ordering of a particular machine is, and combining that knowledge with
our knowledge of code generation.

Now, *most* of what we do is protected by locking and is all fine. But
we do have a few rather subtle places in RCU and in the scheduler
where we depend on the actual dependency chain.

In *practice*, I seriously doubt any reasonable compiler can actually
make a mess of it. The kinds of optimizations that would actually
defeat the dependency chain are simply not realistic. And I suspect
that will end up being what we rely on - there being no actual sane
sequence that a compiler would ever do, even if we wouldn't have
guarantees for some of it.

And I suspect  I can live with that. We _have_ lived with that for the
longest time, after all. We very much do things that aren't covered by
any existing C standard, and just basically use tricks to coax the
compiler into never generating code that doesn't work (with our inline
asm barriers etc being a prime example).

> I would also be cautious claiming that the rules you suggested would be
> very clear and very simple.  I haven't seen a memory model spec from you
> that would be viable as the standard model for C/C++, nor have I seen
> proof that this would actually be easier to understand for programmers
> in general.

So persona

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Linus Torvalds
On Mon, Feb 17, 2014 at 7:24 PM, Linus Torvalds
 wrote:
>
> As far as I can tell, the intent is that you can't do value
> speculation (except perhaps for the "relaxed", which quite frankly
> sounds largely useless).

Hmm. The language I see for "consume" is not obvious:

  "Consume operation: no reads in the current thread dependent on the
value currently loaded can be reordered before this load"

and it could make a compiler writer say that value speculation is
still valid, if you do it like this (with "ptr" being the atomic
variable):

  value = ptr->val;

into

  tmp = ptr;
  value = speculated.value;
  if (unlikely(tmp != &speculated))
value = tmp->value;

which is still bogus. The load of "ptr" does happen before the load of
"value = speculated->value" in the instruction stream, but it would
still result in the CPU possibly moving the value read before the
pointer read at least on ARM and power.

So if you're a compiler person, you think you followed the letter of
the spec - as far as *you* were concerned, no load dependent on the
value of the atomic load moved to before the atomic load. You go home,
happy, knowing you've done your job. Never mind that you generated
code that doesn't actually work.

I dread having to explain to the compiler person that he may be right
in some theoretical virtual machine, but the code is subtly broken and
nobody will ever understand why (and likely not be able to create a
test-case showing the breakage).

But maybe the full standard makes it clear that "reordered before this
load" actually means on the real hardware, not just in the generated
instruction stream. Reading it with understanding of the *intent* and
understanding all the different memory models that requirement should
be obvious (on alpha, you need an "rmb" instruction after the load),
but ...

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Linus Torvalds
On Mon, Feb 17, 2014 at 7:00 PM, Paul E. McKenney
 wrote:
>
> One example that I learned about last week uses the branch-prediction
> hardware to validate value speculation.  And no, I am not at all a fan
> of value speculation, in case you were curious.

Heh. See the example I used in my reply to Alec Teal. It basically
broke the same dependency the same way.

Yes, value speculation of reads is simply wrong, the same way
speculative writes are simply wrong. The dependency chain matters, and
is meaningful, and breaking it is actively bad.

As far as I can tell, the intent is that you can't do value
speculation (except perhaps for the "relaxed", which quite frankly
sounds largely useless). But then I do get very very nervous when
people talk about "proving" certain values.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Linus Torvalds
On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel  wrote:
>
> There's an underlying problem here that's independent from the actual
> instance that you're worried about here: "no sense" is a ultimately a
> matter of taste/objectives/priorities as long as the respective
> specification is logically consistent.

Yes. But I don't think it's "independent".

Exactly *because* some people will read standards without applying
"does the resulting code generation actually make sense for the
programmer that wrote the code", the standard has to be pretty clear.

The standard often *isn't* pretty clear. It wasn't clear enough when
it came to "volatile", and yet that was a *much* simpler concept than
atomic accesses and memory ordering.

And most of the time it's not a big deal. But because the C standard
generally tries to be very portable, and cover different machines,
there tends to be a mindset that anything inherently unportable is
"undefined" or "implementation defined", and then the compiler writer
is basically given free rein to do anything they want (with
"implementation defined" at least requiring that it is reliably the
same thing).

And when it comes to memory ordering, *everything* is basically
non-portable, because different CPU's very much have different rules.
I worry that that means that the standard then takes the stance that
"well, compiler re-ordering is no worse than CPU re-ordering, so we
let the compiler do anything". And then we have to either add
"volatile" to make sure the compiler doesn't do that, or use an overly
strict memory model at the compiler level that makes it all pointless.

So I really really hope that the standard doesn't give compiler
writers free hands to do anything that they can prove is "equivalent"
in the virtual C machine model. That's not how you get reliable
results.

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Linus Torvalds
On Mon, Feb 17, 2014 at 3:17 PM, Torvald Riegel  wrote:
> On Mon, 2014-02-17 at 14:32 -0800,
>
>> Stop claiming it "can return 1".. It *never* returns 1 unless you do
>> the load and *verify* it, or unless the load itself can be made to go
>> away. And with the code sequence given, that just doesn't happen. END
>> OF STORY.
>
> void foo()
> {
>   atomic x = 1;
>   if (atomic_load(&x, mo_relaxed) == 1)
> atomic_store(&y, 3, mo_relaxed));
> }

This is the very example I gave, where the real issue is not that "you
prove that load returns 1", you instead say "store followed by a load
can be combined".

I (in another email I just wrote) tried to show why the "prove
something is true" is a very dangerous model.  Seriously, it's pure
crap. It's broken.

If the C standard defines atomics in terms of "provable equivalence",
it's broken. Exactly because on a *virtual* machine you can prove
things that are not actually true in a *real* machine. I have the
example of value speculation changing the memory ordering model of the
actual machine.

See?

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Linus Torvalds
On Mon, Feb 17, 2014 at 3:10 PM, Alec Teal  wrote:
>
> You mean "unambiguous" - try reading a patent (Apple have 1000s of trivial
> ones, I tried reading one once thinking "how could they have phrased it so
> this got approved", their technique was to make the reader want to start
> cutting themselves to prove they weren't numb to everything)

Oh, I agree, patent language is worse.

> I'm not going to teach you what rvalues and lvalues, but!

I know what lvalues and rvalues are. I *understand* the thinking that
goes on behind the "let's not do the access, because it's not an
rvalue, so there is no 'access' to the object".

I understand it from a technical perspective.

I don't understand the compiler writer that uses a *technicality* to
argue against generating sane code that is obviously what the user
actually asked for.

See the difference?

> So start again, what is the serious problem, have you got any code that
> would let me replicate it, what is your version of GCC?

The volatile problem is long fixed. The people who argued for the
"legalistically correct", but insane behavior lost (and as mentioned,
I think C++11 actually fixed the legalistic reading too).

I'm bringing it up because I've had too many cases where compiler
writers pointed to standard and said "that is ambiguous or undefined,
so we can do whatever the hell we want, regardless of whether that's
sensible, or regardless of whether there is a sensible way to get the
behavior you want or not".


> Oh and lastly! Optimisations are not as casual as "oh, we could do this and
> it'd work better" unlike kernel work or any other software that is being
> improved, it is very formal (and rightfully so)

Alec, I know compilers. I don't do code generation (quite frankly,
register allocation and instruction choice is when I give up), but I
did actually write my own for static analysis, including turning
things into SSA etc.

No, I'm not a "compiler person", but I actually do know enough that I
understand what goes on.

And exactly because I know enough, I would *really* like atomics to be
well-defined, and have very clear - and *local* - rules about how they
can be combined and optimized.

None of this "if you can prove that the read has value X" stuff. And
things like value speculation should simply not be allowed, because
that actually breaks the dependency chain that the CPU architects give
guarantees for. Instead, make the rules be very clear, and very
simple, like my suggestion. You can never remove a load because you
can "prove" it has some value, but you can combine two consecutive
atomic accesses.

For example, CPU people actually do tend to give guarantees for
certain things, like stores that are causally related being visible in
a particular order. If the compiler starts doing value speculation on
atomic accesses, you are quite possibly breaking things like that.
It's just not a good idea. Don't do it. Write the standard so that it
clearly is disallowed.

Because you may think that a C standard is machine-independent, but
that isn't really the case. The people who write code still write code
for a particular machine. Our code works (in the general case) on
different byte orderings, different register sizes, different memory
ordering models. But in each *instance* we still end up actually
coding for each machine.

So the rules for atomics should be simple and *specific* enough that
when you write code for a particular architecture, you can take the
architecture memory ordering *and* the C atomics orderings into
account, and do the right thing for that architecture.

And that very much means that doing things like value speculation MUST
NOT HAPPEN. See? Even if you can prove that your code is "equivalent",
it isn't.

So for example, let's say that you have a pointer, and you have some
reason to believe that the pointer has a particular value. So you
rewrite following the pointer from this:

  value = ptr->val;

into

  value = speculated->value;
  tmp = ptr;
  if (unlikely(tmp != speculated))
value = tmp->value;

and maybe you can now make the critical code-path for the speculated
case go faster (since now there is no data dependency for the
speculated case, and the actual pointer chasing load is now no longer
in the critical path), and you made things faster because your
profiling showed that the speculated case was true 99% of the time.
Wonderful, right? And clearly, the code "provably" does the same
thing.

EXCEPT THAT IS NOT TRUE AT ALL.

It very much does not do the same thing at all, and by doing value
speculation and "proving" something was true, the only thing you did
was to make incorrect code run faster. Because now the causally
related load of value from the pointer isn't actually causally related
at all, and you broke the memory ordering.

This is why I don't like it when I see Torvald talk about "proving"
things. It's bullshit. You can "prove" pretty much anything, and in
the process lose sight of the bigger issue, namely tha

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Linus Torvalds
On Mon, Feb 17, 2014 at 2:25 PM, Torvald Riegel  wrote:
> On Mon, 2014-02-17 at 14:02 -0800, Linus Torvalds wrote:
>>
>> The argument was that an lvalue doesn't actually "access" the memory
>> (an rvalue does), so this:
>>
>>volatile int *p = ...;
>>
>>*p;
>>
>> doesn't need to generate a load from memory, because "*p" is still an
>> lvalue (since you could assign things to it).
>>
>> This isn't an issue in C, because in C, expression statements are
>> always rvalues, but C++ changed that.
>
> Huhh.  I can see the problems that this creates in terms of C/C++
> compatibility.

That's not the biggest problem.

The biggest problem is that you have compiler writers that don't care
about sane *use* of the features they write a compiler for, they just
care about the standard.

So they don't care about C vs C++ compatibility. Even more
importantly, they don't care about the *user* that uses only C++ and
the fact that their reading of the standard results in *meaningless*
behavior. They point to the standard and say "that's what the standard
says, suck it", and silently generate code (or in this case, avoid
generating code) that makes no sense.

So it's not about C++ being incompatible with C, it's about C++ having
insane and bad semantics unless you just admit that "oh, ok, I need to
not just read the standard, I also need to use my brain, and admit
that a C++ statement expression needs to act as if it is an "access"
wrt volatile variables".

In other words, as a compiler person, you do need to read more than
the paper of standard. You need to also take into account what is
reasonable behavior even when the standard could possibly be read some
other way. And some compiler people don't.

The "volatile access in statement expression" did get resolved,
sanely, at least in gcc. I think gcc warns about some remaining cases.

Btw, afaik, C++11 actually clarifies the standard to require the
reads, because everybody *knew* that not requiring the read was insane
and meaningless behavior, and clearly against the intent of
"volatile".

But that didn't stop compiler writers from saying "hey, the standard
allows my insane and meaningless behavior, so I'll implement it and
not consider it a bug".

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Linus Torvalds
On Mon, Feb 17, 2014 at 2:09 PM, Torvald Riegel  wrote:
> On Sat, 2014-02-15 at 11:15 -0800, Linus Torvalds wrote:
>> >
>> > if (atomic_load(&x, mo_relaxed) == 1)
>> >   atomic_store(&y, 3, mo_relaxed));
>>
>> No, please don't use this idiotic example. It is wrong.
>
> It won't be useful in practice in a lot of cases, but that doesn't mean
> it's wrong.  It's clearly not illegal code.  It also serves a purpose: a
> simple example to reason about a few aspects of the memory model.

It's not illegal code, but if you claim that you can make that store
unconditional, it's a pointless and wrong example.

>> The fact is, if a compiler generates anything but the obvious sequence
>> (read/cmp/branch/store - where branch/store might obviously be done
>> with some other machine conditional like a predicate), the compiler is
>> wrong.
>
> Why?  I've reasoned why (1) to (3) above allow in certain cases (i.e.,
> the first load always returning 1) for the branch (or other machine
> conditional) to not be emitted.  So please either poke holes into this
> reasoning, or clarify that you don't in fact, contrary to what you wrote
> above, agree with (1) to (3).

The thing is, the first load DOES NOT RETURN 1. It returns whatever
that memory location contains. End of story.

Stop claiming it "can return 1".. It *never* returns 1 unless you do
the load and *verify* it, or unless the load itself can be made to go
away. And with the code sequence given, that just doesn't happen. END
OF STORY.

So your argument is *shit*. Why do you continue to argue it?

I told you how that load can go away, and you agreed. But IT CANNOT GO
AWAY any other way. You cannot claim "the compiler knows". The
compiler doesn't know. It's that simple.

>> So why do I say you are wrong, after I just gave you an example of how
>> it happens? Because my example went back to the *real* issue, and
>> there are actual real semantically meaningful details with doing
>> things like load merging.
>>
>> To give an example, let's rewrite things a bit more to use an extra variable:
>>
>> atomic_store(&x, 1, mo_relaxed);
>> a = atomic_load(&x, mo_relaxed);
>> if (a == 1)
>> atomic_store(&y, 3, mo_relaxed);
>>
>> which looks exactly the same.
>
> I'm confused.  Is this a new example?

That is a new example. The important part is that it has left a
"trace" for the programmer: because 'a' contains the value, the
programmer can now look at the value later and say "oh, we know we did
a store iff a was 1"

>> This sequence:
>>
>> atomic_store(&x, 1, mo_relaxed);
>> a = atomic_load(&x, mo_relaxed);
>> atomic_store(&y, 3, mo_relaxed);
>>
>> is actually - and very seriously - buggy.
>>
>> Why? Because you have effectively split the atomic_load into two loads
>> - one for the value of 'a', and one for your 'proof' that the store is
>> unconditional.
>
> I can't follow that, because it isn't clear to me which code sequences
> are meant to belong together, and which transformations the compiler is
> supposed to make.  If you would clarify that, then I can reply to this
> part.

Basically, if the compiler allows the condition of "I wrote 3 to the
y, but the programmer sees 'a' has another value than 1 later" then
the compiler is one buggy pile of shit. It fundamentally broke the
whole concept of atomic accesses. Basically the "atomic" access to 'x'
turned into two different accesses: the one that "proved" that x had
the value 1 (and caused the value 3 to be written), and the other load
that then writes that other value into 'a'.

It's really not that complicated.

And this is why descriptions like this should ABSOLUTELY NOT BE
WRITTEN as "if the compiler can prove that 'x' had the value 1, it can
remove the branch". Because that IS NOT SUFFICIENT. That was not a
valid transformation of the atomic load.

The only valid transformation was the one I stated, namely to remove
the load entirely and replace it with the value written earlier in the
same execution context.

Really, why is it so hard to understand?

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Linus Torvalds
On Mon, Feb 17, 2014 at 1:21 PM, Torvald Riegel  wrote:
> On Mon, 2014-02-17 at 12:18 -0800, Linus Torvalds wrote:
>> and then it is read by people (compiler writers) that intentionally
>> try to mis-use the words and do language-lawyering ("that depends on
>> what the meaning of 'is' is").
>
> That assumption about people working on compilers is a little too broad,
> don't you think?

Let's just say that *some* are that way, and those are the ones that I
end up butting heads with.

The sane ones I never have to argue with - point them at a bug, and
they just say "yup, bug". The insane ones say "we don't need to fix
that, because if you read this copy of the standards that have been
translated to chinese and back, it clearly says that this is
acceptable".

>> The whole "lvalue vs rvalue expression
>> vs 'what is a volatile access'" thing for C++ was/is a great example
>> of that.
>
> I'm not aware of the details of this.

The argument was that an lvalue doesn't actually "access" the memory
(an rvalue does), so this:

   volatile int *p = ...;

   *p;

doesn't need to generate a load from memory, because "*p" is still an
lvalue (since you could assign things to it).

This isn't an issue in C, because in C, expression statements are
always rvalues, but C++ changed that. The people involved with the C++
standards have generally been totally clueless about their subtle
changes.

I may have misstated something, but basically some C++ people tried
very hard to make "volatile" useless.

We had other issues too. Like C compiler people who felt that the
type-based aliasing should always override anything else, even if the
variable accessed (through different types) was statically clearly
aliasing and used the exact same pointer. That made it impossible to
do a syntactically clean model of "this aliases", since the _only_
exception to the type-based aliasing rule was to generate a union for
every possible access pairing.

We turned off type-based aliasing (as I've mentioned before, I think
it's a fundamentally broken feature to begin with, and a horrible
horrible hack that adds no value for anybody but the HPC people).

Gcc eventually ended up having some sane syntax for overriding it, but
by then I was too disgusted with the people involved to even care.

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-17 Thread Linus Torvalds
On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel  wrote:
>
> Which example do you have in mind here?  Haven't we resolved all the
> debated examples, or did I miss any?

Well, Paul seems to still think that the standard possibly allows
speculative writes or possibly value speculation in ways that break
the hardware-guaranteed orderings.

And personally, I can't read standards paperwork. It is invariably
written in some basically impossible-to-understand lawyeristic mode,
and then it is read by people (compiler writers) that intentionally
try to mis-use the words and do language-lawyering ("that depends on
what the meaning of 'is' is"). The whole "lvalue vs rvalue expression
vs 'what is a volatile access'" thing for C++ was/is a great example
of that.

So quite frankly, as a result I refuse to have anything to do with the
process directly.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-15 Thread Linus Torvalds
On Sat, Feb 15, 2014 at 9:30 AM, Torvald Riegel  wrote:
>
> I think the example is easy to misunderstand, because the context isn't
> clear.  Therefore, let me first try to clarify the background.
>
> (1) The abstract machine does not write speculatively.
> (2) Emitting a branch instruction and executing a branch at runtime is
> not part of the specified behavior of the abstract machine.  Of course,
> the abstract machine performs conditional execution, but that just
> specifies the output / side effects that it must produce (e.g., volatile
> stores) -- not with which hardware instructions it is producing this.
> (3) A compiled program must produce the same output as if executed by
> the abstract machine.

Ok, I'm fine with that.

> Thus, we need to be careful what "speculative store" is meant to refer
> to.  A few examples:
>
> if (atomic_load(&x, mo_relaxed) == 1)
>   atomic_store(&y, 3, mo_relaxed));

No, please don't use this idiotic example. It is wrong.

The fact is, if a compiler generates anything but the obvious sequence
(read/cmp/branch/store - where branch/store might obviously be done
with some other machine conditional like a predicate), the compiler is
wrong.

Anybody who argues anything else is wrong, or confused, or confusing.

Instead, argue about *other* sequences where the compiler can do something.

For example, this sequence:

   atomic_store(&x, a, mo_relaxed);
   b = atomic_load(&x, mo_relaxed);

can validly be transformed to

   atomic_store(&x, a, mo_relaxed);
   b = (typeof(x)) a;

and I think everybody agrees about that. In fact, that optimization
can be done even for mo_strict.

But even that "obvious" optimization has subtle cases. What if the
store is relaxed, but the load is strict? You can't do the
optimization without a lot of thought, because dropping the strict load
would drop an ordering point. So even the "store followed by exact
same load" case has subtle issues.

With similar caveats, it is perfectly valid to merge two consecutive
loads, and to merge two consecutive stores.

Now that means that the sequence

atomic_store(&x, 1, mo_relaxed);
if (atomic_load(&x, mo_relaxed) == 1)
atomic_store(&y, 3, mo_relaxed);

can first be optimized to

atomic_store(&x, 1, mo_relaxed);
if (1 == 1)
atomic_store(&y, 3, mo_relaxed);

and then you get the end result that you wanted in the first place
(including the ability to re-order the two stores due to the relaxed
ordering, assuming they can be proven to not alias - and please don't
use the idiotic type-based aliasing rules).

Bringing up your first example is pure and utter confusion. Don't do
it. Instead, show what are obvious and valid transformations, and then
you can bring up these kinds of combinations as "look, this is
obviously also correct".

Now, the reason I want to make it clear that the code example you
point to is a crap example is that because it doesn't talk about the
*real* issue, it also misses a lot of really important details.

For example, when you say "if the compiler can prove that the
conditional is always true" then YOU ARE TOTALLY WRONG.

So why do I say you are wrong, after I just gave you an example of how
it happens? Because my example went back to the *real* issue, and
there are actual real semantically meaningful details with doing
things like load merging.

To give an example, let's rewrite things a bit more to use an extra variable:

atomic_store(&x, 1, mo_relaxed);
a = atomic_load(&x, mo_relaxed);
if (a == 1)
atomic_store(&y, 3, mo_relaxed);

which looks exactly the same. If you now say "if you can prove the
conditional is always true, you can make the store unconditional", YOU
ARE WRONG.

Why?

This sequence:

atomic_store(&x, 1, mo_relaxed);
a = atomic_load(&x, mo_relaxed);
atomic_store(&y, 3, mo_relaxed);

is actually - and very seriously - buggy.

Why? Because you have effectively split the atomic_load into two loads
- one for the value of 'a', and one for your 'proof' that the store is
unconditional.

Maybe you go "Nobody sane would do that", and you'd be right. But
compilers aren't "sane" (and compiler _writers_ I have some serious
doubts about), and how you describe the problem really does affect the
solution.

My description ("you can combine two subsequent atomic accesses, if
you are careful about still having the same ordering points") doesn't
have the confusion. It makes it clear that no, you can't speculate
writes, but yes, obviously you can combine certain things (think
"write buffers" and "memory caches"), and that may make you able to
remove the conditional. Which is very different from speculating
writes.

But my description also makes it clear that the transformation above
was buggy, but that rewriting it as

a = 1;
atomic_store(&y, 3, mo_relaxed);
atomic_store(&x, a, mo_relaxed);

is fine (doing the re-ordering here just to make a point).

So I suspect we actually agree, I just think that the "example" that
h

Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-15 Thread Linus Torvalds
On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel  wrote:
>
> I think a major benefit of C11's memory model is that it gives a
> *precise* specification for how a compiler is allowed to optimize.

Clearly it does *not*. This whole discussion is proof of that. It's
not at all clear, and the standard apparently is at least debatably
allowing things that shouldn't be allowed. It's also a whole lot more
complicated than "volatile", so the likelihood of a compiler writer
actually getting it right - even if the standard does - is lower.
They've gotten "volatile" wrong too, after all (particularly in C++).

   Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Linus Torvalds
On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds
 wrote:
>
> And conversely, the C11 people can walk away from us too. But if they
> can't make us happy (and by "make us happy", I really mean no stupid
> games on our part) I personally think they'll have a stronger
> standard, and a real use case, and real arguments. I'm assuming they
> want that.

I should have somebody who proof-reads my emails before I send them out.

I obviously meant "if they *can* make us happy" (not "can't").

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Linus Torvalds
On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney
 wrote:
>
> One way of looking at the discussion between Torvald and myself would be
> as a seller (Torvald) and a buyer (me) haggling over the fine print in
> a proposed contract (the standard).  Whether that makes you feel better
> or worse about the situation I cannot say.  ;-)

Oh, I'm perfectly fine with that. But we're not desperate to buy, and
I actually think the C11 people are - or at least should be - *way*
more desperate to sell their model to us than we are to take it.

Which means that as a buyer you should say "this is what we want, if
you don't give us this, we'll just walk away". Not try to see how much
we can pay for it. Because there is very little upside for us, and
_unless_ the C11 standard gets things right it's just extra complexity
for us, coupled with new compiler fragility and years of code
generation bugs.

Why would we want that extra complexity and inevitable compiler bugs?
If we then have to fight compiler writers that point to the standard
and say "..but look, the standard says we can do this", then at that
point it went from "extra complexity and compiler bugs" to a whole
'nother level of frustration and pain.

So just walk away unless the C11 standard gives us exactly what we
want. Not "something kind of like what we'd use". EXACTLY. Because I'm
not in the least interested in fighting compiler people that have a
crappy standard they can point to. Been there, done that, got the
T-shirt and learnt my lesson.

And the thing is, I suspect that the Linux kernel is the most complete
- and most serious - user of true atomics that the C11 people can sell
their solution to.

If we don't buy it, they have no serious user. Sure, they'll have lots
of random other one-off users for their atomics, where each user wants
one particular thing, but I suspect that we'll have the only really
unified portable code base that handles pretty much *all* the serious
odd cases that the C11 atomics can actually talk about to each other.

Oh, they'll push things through with or without us, and it will be a
collection of random stuff, where they tried to please everybody, with
particularly compiler/architecture people who have no f*cking clue
about how their stuff is used pushing to make it easy/efficient for
their particular compiler/architecture.

But we have real optimized uses of pretty much all relevant cases that
people actually care about.

We can walk away from them, and not really lose anything but a small
convenience (and it's a convenience *only* if the standard gets things
right).

And conversely, the C11 people can walk away from us too. But if they
can't make us happy (and by "make us happy", I really mean no stupid
games on our part) I personally think they'll have a stronger
standard, and a real use case, and real arguments. I'm assuming they
want that.

That's why I complain when you talk about things like marking control
dependencies explicitly. That's *us* bending over backwards. And as a
buyer, we have absolutely zero reason to do that.

Tell the C11 people: "no speculative writes". Full stop. End of story.
Because we're not buying anything else.

Similarly, if we need to mark atomics "volatile", then now the C11
atomics are no longer even a "small convenience", now they are just
extra complexity wrt what we already have. So just make it clear that
if the C11 standard needs to mark atomics volatile in order to get
non-speculative and non-reloading behavior, then the C11 atomics are
useless to us, and we're not buying.

Remember: a compiler can *always* do "as if" optimizations - if a
compiler writer can prove that the end result acts 100% the same using
an optimized sequence, then they can do whatever the hell they want.
That's not the issue. But if we can *ever* see semantic impact of
speculative writes, the compiler is buggy, and the compiler writers
need to be aware that it is buggy. No ifs, buts, maybes about it.

So I'm perfectly fine with you seeing yourself as a buyer. But I want
you to be a really *picky* and anal buyer - one that knows he has the
upper hand, and can walk away with no downside.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Linus Torvalds
On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
 wrote:
>
> Why are we still discussing this idiocy? It's irrelevant. If the
> standard really allows random store speculation, the standard doesn't
> matter, and sane people shouldn't waste their time arguing about it.

Btw, the other part of this coin is that our manual types (using
volatile and various architecture-specific stuff) and our manual
barriers and inline asm accesses are generally *fine*.

The C11 stuff doesn't buy us anything. The argument that "new
architectures might want to use it" is pure and utter bollocks, since
unless the standard gets the thing *right*, nobody sane would ever use
it for some new architecture, when the sane thing to do is to just
fill in the normal barriers and inline asms.

So I'm very very serious: either the compiler and the standard gets
things right, or we don't use it. There is no middle ground where "we
might use it for one or two architectures and add random hints".
That's just stupid.

The only "middle ground" is about which compiler version we end up
trusting _if_ it turns out that the compiler and standard do get
things right. From Torvald's explanations (assuming I don't mis-read them
;), my take-away so far has actually been that the standard *does* get
things right, but I do know from over-long personal experience that
compiler people sometimes want to be legalistic and twist the
documentation to the breaking point, at which point we just go "we'd
be crazy to use that".

See our use of "-fno-strict-aliasing", for example. The C standard
aliasing rules are a mistake, stupid, and wrong, and gcc uses those
stupid type-based alias rules even when statically *proving* the
aliasing gives the opposite result. End result: we turn the shit off.

Exact same deal wrt atomics. We are *not* going to add crazy "this
here is a control dependency" crap. There's a test, the compiler
*sees* the control dependency for chrissake, and it still generates
crap, we turn that broken "optimization" off. It really is that
simple.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-14 Thread Linus Torvalds
On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney
 wrote:
>
> Linus, Peter, any objections to marking places where we are relying on
> ordering from control dependencies against later stores?  This approach
> seems to me to have significant documentation benefits.

Quite frankly, I think it's stupid, and the "documentation" is not a
benefit, it's just wrong.

How would you figure out whether your added "documentation" holds true
for particular branches but not others?

How could you *ever* trust a compiler that makes the dependency meaningful?

Again, let's keep this simple and sane:

 - if a compiler ever generates code where an atomic store movement is
"visible" in any way, then that compiler is broken shit.

I don't understand why you even argue this. Seriously, Paul, you seem
to *want* to think that "broken shit" is acceptable, and that we
should then add magic markers to say "now you need to *not* be broken
shit".

Here's a magic marker for you: DON'T USE THAT BROKEN COMPILER.

And if a compiler can *prove* that whatever code movement it does
cannot make a difference, then let it do so. No amount of
"documentation" should matter.

Seriously, this whole discussion has been completely moronic. I don't
understand why you even bring shit like this up:

> > r1 = atomic_load(x, memory_order_control);
> > if (control_dependency(r1))
> > atomic_store(y, memory_order_relaxed);

I mean, really? Anybody who writes code like that, or any compiler
where that "control_dependency()" marker makes any difference
what-so-ever for code generation should just be retroactively aborted.

There is absolutely *zero* reason for that "control_dependency()"
crap. If you ever find a reason for it, it is either because the
compiler is buggy, or because the standard is so shit that we should
never *ever* use the atomics.

Seriously. This thread has devolved into some kind of "just what kind
of idiotic compiler cesspool crap could we accept". Get away from that
f*cking mindset. We don't accept *any* crap.

Why are we still discussing this idiocy? It's irrelevant. If the
standard really allows random store speculation, the standard doesn't
matter, and sane people shouldn't waste their time arguing about it.

Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-12 Thread Linus Torvalds
On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney
 wrote:
>
> Us Linux-kernel hackers will often need to use volatile semantics in
> combination with C11 atomics in most cases.  The C11 atomics do cover
> some of the reasons we currently use ACCESS_ONCE(), but not all of them --
> in particular, it allows load/store merging.

I really disagree with the "will need to use volatile".

We should never need to use volatile (outside of whatever MMIO we do
using C) if C11 defines atomics correctly.

Allowing load/store merging is *fine*. All sane CPU's do that anyway -
it's called a cache - and there's no actual reason to think that
"ACCESS_ONCE()" has to mean our current "volatile".

Now, it's possible that the C standards simply get atomics _wrong_, so
that they create visible semantics that are different from what a CPU
cache already does, but that's a plain bug in the standard if so.

But merging loads and stores is fine. And I *guarantee* it is fine,
exactly because CPU's already do it, so claiming that the compiler
couldn't do it is just insanity.

Now, there are things that are *not* fine, like speculative stores
that could be visible to other threads. Those are *bugs* (either in
the compiler or in the standard), and anybody who claims otherwise is
not worth discussing with.

But I really really disagree with the "we might have to use
'volatile'". Because if we *ever* have to use 'volatile' with the
standard C atomic types, then we're just better off ignoring the
atomic types entirely, because they are obviously broken shit - and
we're better off doing it ourselves the way we have forever.

Seriously. This is not even hyperbole. It really is as simple as that.

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-10 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
>
> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do.  This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.

Btw, what is the definition of "observable" for the atomics?

Because I'm hoping that it's not the same as for volatiles, where
"observable" is about the virtual machine itself, and as such volatile
accesses cannot be combined or optimized at all.

Now, I claim that atomic accesses cannot be done speculatively for
writes, and not re-done for reads (because the value could change),
but *combining* them would be possible and good.

For example, we often have multiple independent atomic accesses that
could certainly be combined: testing the individual bits of an atomic
value with helper functions, causing things like "load atomic, test
bit, load same atomic, test another bit". The two atomic loads could
be done as a single load without possibly changing semantics on a real
machine, but if "visibility" is defined in the same way it is for
"volatile", that wouldn't be a valid transformation. Right now we use
"volatile" semantics for these kinds of things, and they really can
hurt.

Same goes for multiple writes (possibly due to setting bits):
combining multiple accesses into a single one is generally fine, it's
*adding* write accesses speculatively that is broken by design.

At the same time, you can't combine atomic loads or stores infinitely
- "visibility" on a real machine definitely is about timeliness.
Removing all but the last write when there are multiple consecutive
writes is generally fine, even if you unroll a loop to generate those
writes. But if what remains is a loop, it might be a busy-loop
basically waiting for something, so it would be wrong ("untimely") to
hoist a store in a loop entirely past the end of the loop, or hoist a
load in a loop to before the loop.

Does the standard allow for that kind of behavior?

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 5:46 PM, Torvald Riegel  wrote:
>
> IOW, I wrote that such a compiler transformation would be wrong in my
> opinion.  Thus, it should *not* return 42.

Ahh, I am happy to have misunderstood. The "intuitively" threw me,
because I thought that was building up to a "but", and misread the
rest.

I then react strongly, because I've seen so much total crap (the
type-based C aliasing rules topping my list) coming out of
standards groups because it allows them to generate wrong code that
goes faster, that I just assume compiler people are out to do stupid
things in the name of "..but the standard allows it".

  Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel  wrote:
>
> (a) seems to say that you don't like requiring programmers to mark
> atomic accesses specially.  Is that the case?

In Paul's example, they were marked specially.

And you seemed to argue that Paul's example could possibly return
anything but 0/0.

If you really think that, I hope to God that you have nothing to do
with the C standard or any actual compiler I ever use.

Because such a standard or compiler would be shit. It's sadly not too uncommon.

 Linus


Re: [RFC][PATCH 0/5] arch: atomic rework

2014-02-09 Thread Linus Torvalds
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel  wrote:
>
> I wouldn't characterize the situation like this (although I can't speak
> for others, obviously).  IMHO, it's perfectly fine on sequential /
> non-synchronizing code, because we know the difference isn't observable
> by a correct program.

What BS is that? If you use an "atomic_store_explicit()", by
definition you're either

 (a) f*cking insane
 (b) not doing sequential non-synchronizing code

and a compiler that assumes that the programmer is insane may actually
be correct more often than not, but it's still a shit compiler.
Agreed?

So I don't see how any sane person can say that speculative writes are
ok. They are clearly not ok.

Speculative stores are a bad idea in general. They are completely
invalid for anything that says "atomic". This is not even worth
discussing.

Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-06 Thread Linus Torvalds
On Tue, Aug 6, 2013 at 7:19 AM, Steven Rostedt  wrote:
>
> After playing with the patches again, I now understand why I did that.
> It wasn't just for optimization.

[explanation snipped]

> Anyway, if you feel that update_jump_label is too complex, I can go the
> "update at early boot" route and see how that goes.

Ugh. I'd love to see short jumps, but I do dislike binary rewriting,
and doing it at early boot seems really quite scary too.

So I wonder if this is a "ok, let's not bother, it's not worth the
pain" issue. 128 bytes of offset is very small, so there probably
aren't all that many cases that would use it.

 Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 12:54 PM, Mathieu Desnoyers
 wrote:
>
> I remember that choosing between 2 and 5 bytes nop in the asm goto was
> tricky: it had something to do with the fact that gcc doesn't know the
> exact size of each instructions until further down within compilation

Oh, you can't do it in the compiler, no. But you don't need to. The
assembler will pick the right version if you just do "jmp target".

 Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 12:40 PM, Marek Polacek  wrote:
>
> FWIW, we also support hot/cold attributes for labels, thus e.g.
>
>   if (bar ())
> goto A;
>   /* ... */
> A: __attribute__((cold))
>   /* ... */
>
> I don't know whether that might be useful for what you want or not though...

Steve? That does sound like it might at least re-order the basic
blocks better for your cases. Worth checking out, no?

That said, I don't know what gcc actually does for that case. It may
be that it just ends up trying to transfer that "cold" information to
the conditional itself, which wouldn't work for our asm goto use. I
hope/assume it doesn't do that, though, since the "cold" attribute
would presumably also be useful for things like computed gotos etc -
so it really isn't about the _source_ of the branch, but about that
specific target, and the basic block re-ordering.

Anyway, the exact implementation details may make it more or less
useful for our special static key things. But it does sound like the
right thing to do for static keys.

Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 12:04 PM, Andi Kleen  wrote:
> Steven Rostedt  writes:
>
> Can't you just use -freorder-blocks-and-partition?
>
> This should already partition unlikely blocks into a
> different section. Just a single one of course.

That's horrible. Not because of dwarf problems, but exactly because
unlikely code isn't necessarily *that* unlikely, and normal unlikely
code is reached with a small branch. Making it a whole different
section breaks both of those.

Maybe some "really_unlikely()" would make it ok.

  Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 11:51 AM, H. Peter Anvin  wrote:
>>
>> Also, how would you pass the parameters? Every tracepoint has its own
>> parameters to pass to it. How would a trap know what where to get "prev"
>> and "next"?
>
> How do you do that now?
>
> You have to do an IP lookup to find out what you are doing.

No, he just generates the code for the call and then uses a static_key
to jump to it. So normally it's all out-of-line, and the only thing in
the hot-path is that 5-byte nop (which gets turned into a 5-byte jump
when the tracing key is enabled).

Works fine, but the normally unused stubs end up mixing in the normal
code segment. Which I actually think is fine, but right now we don't
get the short-jump advantage from it (and there is likely some I$
disadvantage from just fragmentation of the code).

With two-byte jumps, you'd still get the I$ fragmentation (the
argument generation and the call and the branch back would all be in
the same code segment as the hot code), but that would be offset by
the fact that at least the hot code itself could use a short jump when
possible (ie a 2-byte nop rather than a 5-byte one).

Don't know which way it would go performance-wise. But it shouldn't
need gcc changes, it just needs the static key branch/nop rewriting to
be able to handle both sizes. I couldn't tell why Steven's series to
do that was so complex, though - I only glanced through the patches.

Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 11:39 AM, Steven Rostedt  wrote:
>
> I had patches that did exactly this:
>
>  https://lkml.org/lkml/2012/3/8/461
>
> But it got dropped for some reason. I don't remember why. Maybe because
> of the complexity?

Ugh. Why the crazy update_jump_label script stuff? I'd go "Eww" at
that too, it looks crazy. The assembler already knows to make short
2-byte "jmp" instructions for near jumps, and you can just look at the
opcode itself to determine size, why is all that other stuff required?

IOW, 5/7 looks sane, but 4/7 makes me go "there's something wrong with
that series".

 Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 11:24 AM, Linus Torvalds
 wrote:
>
> Ugh. I can see the attraction of your section thing for that case, I
> just get the feeling that we should be able to do better somehow.

Hmm.. Quite frankly, Steven, for your use case I think you actually
want the C goto *labels* associated with a section. Which sounds like
it might be a cleaner syntax than making it about the basic block
anyway.

 Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 11:20 AM, Linus Torvalds
 wrote:
>
> The static_key_false() approach with minimal inlining sounds like a
> much better approach overall.

Sorry, I misunderstood your thing. That's actually what you want that
section thing for, because right now you cannot generate the argument
expansion otherwise.

Ugh. I can see the attraction of your section thing for that case, I
just get the feeling that we should be able to do better somehow.

  Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 10:55 AM, Steven Rostedt  wrote:
>
> My main concern is with tracepoints. Which on 90% (or more) of systems
> running Linux, is completely off, and basically just dead code, until
> someone wants to see what's happening and enables them.

The static_key_false() approach with minimal inlining sounds like a
much better approach overall. Sure, it might add a call/ret, but it
adds it to just the unlikely tracepoint taken path.

Of course, it would be good to optimize static_key_false() itself -
right now those static key jumps are always five bytes, and while they
get nopped out, it would still be nice if there was some way to have
just a two-byte nop (turning into a short branch) *if* we can reach
another jump that way. For small functions that would be lovely. Oh
well.

Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 10:12 AM, Linus Torvalds
 wrote:
>
> Secondly, you don't want a separate section anyway for any normal
> kernel code, since you want short jumps if possible

Just to clarify: the short jump is important regardless of how
unlikely the code you're jumping is, since even if you'd be jumping to
very unlikely ("never executed") code, the branch to that code is
itself in the hot path.

And the difference between a two-byte short jump to the end of a short
function, and a five-byte long jump (to pick the x86 case) is quite
noticeable.

Other cases do long jumps by jumping to a thunk, and so the "hot case"
is unaffected, but at least one common architecture very much sees the
difference in the likely code.

  Linus


Re: [RFC] gcc feature request: Moving blocks into sections

2013-08-05 Thread Linus Torvalds
On Mon, Aug 5, 2013 at 9:55 AM, Steven Rostedt  wrote:
>
> Almost a full year ago, Mathieu suggested something like:
>
> if (unlikely(x)) __attribute__((section(".unlikely"))) {
> ...
> } else __attribute__((section(".likely"))) {
> ...
> }

It's almost certainly a horrible idea.

First off, we have very few things that are *so* unlikely that they
never get executed. Putting things in a separate section would
actually be really bad.

Secondly, you don't want a separate section anyway for any normal
kernel code, since you want short jumps if possible (pretty much every
single architecture out there has a concept of shorter jumps that are
noticeably cheaper than long ones). You want the unlikely code to be
out-of-line, but still *close*. Which is largely what gcc already does
(except if you use "-Os", which disables all the basic block movement
and thus makes "likely/unlikely" pointless to begin with).

There are some situations where you'd want extremely unlikely code to
really be elsewhere, but they are rare as hell, and mostly in user
code where you might try to avoid demand-loading such code entirely.

So give up on sections. They are a bad idea for anything except the
things we already use them for. Sure, you can try to fix the problems
with sections with link-time optimization work and a *lot* of small
individual sections (the way per-function sections work already), but
that's basically just undoing the stupidity of using sections to begin
with.

Linus


Re: Memory corruption due to word sharing

2012-02-03 Thread Linus Torvalds
On Fri, Feb 3, 2012 at 11:16 AM, Andrew MacLeod  wrote:
>>    The special cases are because older x86 cannot do the generic
>> "add_return" efficiently - it needs xadd - but can do atomic versions
>> that test the end result and give zero or sign information.
>
> Since these are older x86 only, could you use add_return() always and then
> have the compiler use new peephole optimizations to detect those usage
> patterns and change the instruction sequence for x86 when required?  would
> that be acceptable?    Or maybe you don't trust the compiler :-)   Or maybe
> I can innocently ask if the performance impact on older x86 matters enough
> any more? :-)

Since xadd is often slow even when it does exist, the peepholes would
definitely be required. A "dec_and_test" does need to be

   decl %m (or "subl $1" for x86-64)
   je

rather than

   xadd %r,%m
   cmp $1,%r
   je

simply because the latter is not just larger, but often quite a bit
slower (newer CPU's seem to be better at xadd).

But with the peepholes, by the time we could depend on gcc doing the
intrinsics, we'd almost certainly be ready to let the old pre-xadd
machines just go.

We're already almost there, and by the time we can trust the compiler
to do it for us, we're almost certainly going to be at a point where
we really don't care any more about pre-xadd (but we will still care
about older machines where xadd is slow).

>>  - atomic_add_unless() - basically an optimized cmpxchg.
>
> is this the reverse of a compare_exchange and add?  Ie, add if the value
> ISN'T expected?   or some form of compare_exchange_and_add?    This might
> require a new atomic builltin.
>
> What exactly does it do?

It's basically

   do {
 load-link %r,%m
 if (r == value)
 return 0;
 add
   } while (store-conditional %r,%m)
   return 1;

and it is used to implement two *very* common (and critical)
reference-counting use cases:

 - decrement ref-count locklessly, and if it hits free, take a lock
atomically (ie "it would be a bug for anybody to ever see it decrement
to zero while holding the lock")

 - increment ref-count locklessly, but if we race with the final free,
don't touch it, because it's gone (ie "if somebody decremented it to
zero while holding the lock, we absolutely cannot increment it again")

They may sound odd, but those two operations are absolutely critical
for most lock-less refcount management. And taking locks for reference
counting is absolutely death to performance, and is often impossible
anyway (the lock may be a local lock that is *inside* the structure
that is being reference-counted, so if the refcount is zero, then you
cannot even look at the lock!)

In the above example, the load-locked -> store-conditional would
obviously be a cmpxchg loop instead on architectures that do cmpxchg
instead of ll/sc.

Of course, it you expose some intrinsic for the whole "ll/sc" model
(and you then turn it into cmpxchg on demand), we could literally
open-code it.

That is likely the most *flexible* approach for a compiler. I think
pretty much everything the kernel needs (except for cmpxchg_double)
can be very naturally written as a "ll/sc" sequence, and if the
compiler then just does the right thing with peephole optimizations,
that's fine.

IOW, we don't really need "atomic_add()" or anything like that. If we can do

  do {
 val = __load_linked(mem);
 val++;
  } while (__store_conditional(val, mem));

and the compiler just automagically turns that into "lock add" on x86,
that's perfectly fine.

It might not be too hard, because you really don't need to recognize
very many patterns, and any pattern you don't recognize could be
turned into a cmpxchg loop.

NOTE NOTE NOTE! The "turned into a cmpxchg loop" is not the true
correct translation of load-linked/store-conditional, since it allows
the memory to be modified as long as it's modified *back* before the
store-conditional, and that actually matters for a few algorithms. But
if you document the fact that it's not a "true" ll/sc (and maybe have
some compile-time way to detect when it isn't), it would be very
flexible.

Of course, the syntax could be something completely different. Maybe
you'd want to do it as

   __builtin_ll_sc(&var, update-expression, return-expression,
failure-expression)

rather than an explicit loop.

But it doesn't sound like the internal gcc model is based on some
generic ll/sc model.

>>  - atomic bit array operations (bit set, clear, set-and-test,
>> clear-and-test). We do them on "unsigned long" exclusively, and in
>> fact we do them on arrays of unsigned long, ie we have the whole "bts
>> reg,mem" semantics. I'm not sure we really care about the atomic
>> versions for the arrays, so it's possible we only really care about a
>> single long.
>>
>>    The only complication with the bit setting is that we have a
>> concept of "set/clear bit with memory barrier before or after the bit"
>> (for locking). We don't do the whole release/acquire thing, though.
>
> Are

Re: Memory corruption due to word sharing

2012-02-03 Thread Linus Torvalds
On Fri, Feb 3, 2012 at 8:38 AM, Andrew MacLeod  wrote:
>
> The atomic intrinsics were created for c++11  memory model compliance, but I
> am certainly open to enhancements that would make them more useful.   I am
> planning some enhancements for 4.8 now, and it sounds like you may have some
> suggestions...

So we have several atomics we use in the kernel, with the more common being

 - add (and subtract) and cmpchg of both 'int' and 'long'

 - add_return (add and return new value)

 - special cases of the above:
  dec_and_test (decrement and test result for zero)
  inc_and_test (increment and test result for zero)
  add_negative (add and check if result is negative)

   The special cases are because older x86 cannot do the generic
"add_return" efficiently - it needs xadd - but can do atomic versions
that test the end result and give zero or sign information.

 - atomic_add_unless() - basically an optimized cmpxchg.

 - atomic bit array operations (bit set, clear, set-and-test,
clear-and-test). We do them on "unsigned long" exclusively, and in
fact we do them on arrays of unsigned long, ie we have the whole "bts
reg,mem" semantics. I'm not sure we really care about the atomic
versions for the arrays, so it's possible we only really care about a
single long.

   The only complication with the bit setting is that we have a
concept of "set/clear bit with memory barrier before or after the bit"
(for locking). We don't do the whole release/acquire thing, though.

 - compare_xchg_double

We also do byte/word atomic increments and decrements, but that's in
the x86 spinlock implementation, so it's not a generic need.

We also do the add version in particular as CPU-local optimizations
that do not need to be SMP-safe, but do need to be interrupt-safe. On
x86, this is just an r-m-w op, on most other architectures it ends up
being the usual load-locked/store-conditional.

I think that's pretty much it, but maybe I'm missing something.

Of course, locking itself tends to be special cases of the above with
extra memory barriers, but it's usually hidden in asm for other
reasons (the bit-op + barrier being a special case).

  Linus


Re: Memory corruption due to word sharing

2012-02-02 Thread Linus Torvalds
On Thu, Feb 2, 2012 at 10:42 AM, Paul E. McKenney
 wrote:
>>
>> SMP-atomic or percpu atomic? Or both?
>
> Only SMP-atomic.

And I assume that since the compiler does them, that would now make it
impossible for us to gather a list of all the 'lock' prefixes so that
we can undo them if it turns out that we are running on a UP machine.

When we do SMP operations, we don't just add a "lock" prefix to it. We do this:

  #define LOCK_PREFIX_HERE \
".section .smp_locks,\"a\"\n"   \
".balign 4\n"   \
".long 671f - .\n" /* offset */ \
".previous\n"   \
"671:"

  #define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "

and I'm sure you know that, but I'm not sure the gcc people realize
the kinds of games we play to make things work better.

Sure, "everything will be SMP" some day, but running UP kernels is
likely still going to remain a good idea in virtualized environments,
for example. And having to make it a compile-time option is *not* a
good idea.

So compiler intrisics are often *worse* than doing it by hand for us
for all these kinds of reasons. They aren't generally geared towards
the very specialized needs that a kernel has.

Of course, maybe even user space would want some kind of way to
automatically strip 'lock' prefixes from a binary, so maybe the
compiler would have some kind of support like this too.

(And no, disassembling the binary in order to find lock prefixes is
*not* the answer, at least not for the kernel)

>> We need both variants in the kernel. If the compiler generates one of
>> them for us, that doesn't really much help.
>
> I must admit that the non-x86 per-CPU atomics are, ummm, "interesting".

Most non-x86 cpu's would probably be better off treating them the same
as smp-atomics (load-locked + store-conditional), but right now we
have this insane generic infrastructure for having versions that are
irq-safe by disabling interrupts etc. Ugh. Mainly because nobody
really is willing to work on and fix up the 25 architectures that
really don't matter.

  Linus


Re: Memory corruption due to word sharing

2012-02-02 Thread Linus Torvalds
On Thu, Feb 2, 2012 at 8:28 AM, Michael Matz  wrote:
>
> Sure.  Simplest example: struct s {int i:24;} __attribute__((packed)).
>
> You must access only three bytes, no matter what.  The basetype (int) is
> four bytes.

Ok, so here's a really *stupid* (but also really really simple) patch attached.

What it does is to simply change the meaning of the SLOW_BYTE_ACCESS thing.

What SLOW_BYTE_ACCESS means is currently is not just that byte
accesses are slow: it means that *everything* but full-word accesses
are slow!

Which is (a) not what the name really implies, (b) not even the
historical meaning of that #define (why? the meaning of "full word"
has changed since: it used to be 32 bits, now it is 64 bits) and (c)
stupid, because that's not how hardware with slow byte accesses really
work anyway.

Finally, it's what causes gcc to do 64-bit accesses to bitfields that
aren't 64 bits in size.

So because of this, I propose gcc just change the rules for what
SLOW_BYTE_ACCESS means.

I propose to just change it to mean "accesses smaller than 32 bits are
slow". That's actually much closer to what the hardware definition
tends to be.

It doesn't fix the generic issue, it doesn't fix any C++11/C11 issues,
it doesn't really change anything, but what it *does* do is make for a
hell of a lot saner model. And it avoids bugs in practice.

NOTE! On at least some machines with SLOW_BYTE_ACCESS, accesses
smaller than a word cannot possibly be atomic anyway (well, not
without jumping through hoops), so the fact that we still extend to 32
bits and the problem of touching too much still exists with 'char' and
'short' variables that are in the same 32-bit word as a bitfield is
kind of unavoidable.

So this suggested patch doesn't really "fix" anything fundamental, but
it is (a) simple, (b) totally untested and (c) at least fixes *some*
cases.

For example, it might fix the 'sig_atomic_t' shadowing, since that is
usually 'int'. It obviously can never fix the issue with volatile
char/short, but at least it works around it for 'long'.

In other words - I'm not trying to say that this patch really "fixes"
anything (except sig_atomic_t interaction, which really does get fixed
for the normal 'int' case). But what it does do is to limit the damage
of a bad situation.

And it is small, and should hopefully be easy to accept even for
stable versions of gcc.

So can we please do something like this for maintenance releases, and
consider the much more complex C++11/C11 issues to be a separate much
bigger issue for the future?

Because the current SLOW_BYTE_ACCESS meaning really is crap. The
*only* thing that architecture define seems to do is literally the
bitfield extract semantics, and it does that horribly horribly badly
as shown by the bug on both 64-bit POWER and Sparc. For no good reason
- both of those have perfectly fine word operations afaik.

I did not look at other architectures that define SLOW_BYTE_ACCESS,
but if they have a 32-bit integer type, I'm pretty sure they support
fast loads/stores of it.

Hacky? Yes. Pretty? No. But better than the disaster that is there
now? Hell yes.

   Linus

PS. The only reason I hardcoded "32" was that I didn't know how to ask
the question "is this mode at least as wide as 'int'" in the proper gcc
way. So I'm really not suggesting you apply this patch as-is, I'm just
sending it out as a "hey, this is a pretty obvious way to work around
part of the problem, somebody who really knows what they are doing
should probably improve on it".
 gcc/stor-layout.c |    5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/gcc/stor-layout.c b/gcc/stor-layout.c
index ea0d44d64d26..1387c0e9e060 100644
--- a/gcc/stor-layout.c
+++ b/gcc/stor-layout.c
@@ -2486,7 +2486,12 @@ get_best_mode (int bitsize, int bitpos, unsigned int align,
 	  && unit <= MIN (align, BIGGEST_ALIGNMENT)
 	  && (largest_mode == VOIDmode
 		  || unit <= GET_MODE_BITSIZE (largest_mode)))
+	  {
 	wide_mode = tmode;
+	/* Wide enough? */
+	if (unit >= 32)
+	  break;
+	  }
 	}
 
   if (wide_mode != VOIDmode)


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 2:45 PM, Paul E. McKenney
 wrote:
>
> My (perhaps forlorn and naive) hope is that C++11 memory_order_relaxed
> will eventually allow ACCESS_ONCE() to be upgraded so that (for example)
> access-once increments can generate a single increment-memory instruction
> on x86.

I don't think that is a semantic issue.

gcc could do it *today* with volatile accesses. It doesn't, because
volatiles are scary and basically disables a lot of optimizations. Why
would memory ordering be substantially different just because it has a
different name?

> New architectures might eventually might define things like atomic_inc()
> in terms of C++11 atomics, but let's start with the straightforward stuff
> as and if it makes sense.

SMP-atomic or percpu atomic? Or both?

We need both variants in the kernel. If the compiler generates one of
them for us, that doesn't really much help.

  Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 1:25 PM, Boehm, Hans  wrote:
>
> Here are some more interesting ones that illustrate the issues (all 
> declarations are non-local, unless stated otherwise):
>
> struct { char a; int b:9; int c:7; char d} x;
>
> Is x.b = 1 allowed to overwrite x.a?  C11 says no, essentially requiring two 
> byte stores.  Gcc currently does so.  I'm not sure I understand Linus' 
> position here.

So I like the fact that the C11 stance seems very strict. But honestly
I also think it sounds like C11 is actually more strict than I would
necessarily be.

I really do think that bitfields imply "int", both historically and
technically. So I would not see the problem with treating the
bitfields as part of an 'int' and thus overwriting a (and d) when
writing to b. That's how bitfields work! They are fields of an int.

It would be good if it perhaps caused a *warning*, and gave a good way
to avoid it. For example, while I think using any other base type than
'int' is technically an extension of the C bitfield rules (but
whatever, I don't have any specs in front of me), I think a warning
together with allowing the user to rewrite it as

  struct { char a; char d; short b:9; short c:7; } x;

would make it clear that now a write to 'b' cannot validly overwrite
'a' or 'd'.

But forcing the compiler to do two (and sometimes three!) byte
accesses sounds excessive.

The reason I think the

   int flag:1;
   int othervariable;

overwriting of "othervariable" is so *obviously* a bug is exactly that
bitfields are historically about 'int', and no 'long' was there
anywhere, so using a 64-bit access is clearly not sane in any way,
shape or form.

I dunno.

 Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 1:24 PM, Torvald Riegel  wrote:
>> It's not the only thing we do. We have cases where it's not that you
>> can't hoist things outside of loops, it's that you have to read things
>> exactly *once*, and then use that particular value (ie the compiler
>> can't be allowed to reload the value because it may change). So any
>> *particular* value we read is valid, but we can't allow the compiler
>> to do anything but a single access.
>
> That's what an access to an atomics-typed var would give you as well.

Right. The concept of "atomic" access is required.

The problem I have with treating this as a C11 issue is that people
aren't using C11 compilers, and won't for a long while.

So quite frankly, it won't be reasonable for the kernel people to say
"use a C11 version" for years to come.

And the thing is, "atomic" doesn't actually seem to buy us anything in
the kernel afaik. We're perfectly happy to use "volatile". We don't
want the data structure annotated, we want the code annotated, so the
semantic differences between 'volatile' and 'atomic' seem to be pretty
irrelevant.

Sure, there is probably some subtle difference, but on the whole, it
all boils down to "we can't depend on new rules for several years, and
even when we can, it doesn't look like they'd buy us a lot, because we
have to play so many games around them *anyway*".

And don't get me wrong. I don't think that means that C11 is *bad*.
It's just that the kernel is very different from most other projects.
We have to have those crazy architecture-specific header files and
random synchronization macros etc anyway.

C11 is not - I think - even meant to be geared towards the Linux
kernel kind of crazy use. We really do some odd things, adding
compiler features for them is mostly a mistake. asm() takes care of a
lot of the oddities for us, it's not like all of them are about memory
accesses or concurrency either.

I do think that the threading issues in C11 are going to help us
kernel people, because the more people think about issues with
concurrency, they really *will* be hitting some of the issues we've
been having. So it clearly forces clarification of just what the word
"access" implies, for example, which is certainly not going to be bad
for the kernel.

>> It really doesn't. Loops are fairly unusual in the kernel to begin
>> with, and the compiler barriers are a total non-issue.
>
> This is interesting, because GCC is currently not optimizing across
> atomics but we were considering working on this later.
> Do you have real data about this, or is this based on analysis of
> samples of compiled code?

So the kernel literally doesn't tend to *have* loops.

Sure, there are exceptions: there are the obvious loops implied in
"memcpy()" and friends, but almost all kernel activity tends to be
about individual events. A really hot loop under normal loads would be
the loop that calculates a hash value over a pathname component. Or
the loop that loops over those components.

Loop count? Think about it. Most pathnames have one or two components.
And while "8.3" is clearly too short for a filename, it's still true
that a lot of filenames (especially directories, which is what you end
up traversing many times) are in a kind of awkward 5-8 character
range.

So the kernel loads tend to have loop counts in the single digits -
*maybe* double digits. We use hash-tables a lot, so looping to find
something is usually just an entry or two, and the cost is almost
never the loop, it's the cache miss from the pointer chasing.

Most "looping" behavior in the kernel is actually user space calling
the kernel in a loop. So the kernel may see "loops" in that sense, but
it's not loopy kernel code itself.

Even really hot kernel-only code is often not about loops. You might
get a million network packets a second, and with interrupt mitigation
you would not necessarily treat them each as individual events with
its own individual interrupt (which does happen too!), but you'd still
not see it as a loop over a million packets. You'd see them come in as
a few packets at a time, and looping until you've handled them.

And the loops we have tend to not be very amenable to nice compiler
optimizations either.

The kernel almost never works with arrays, for example. The string in
a pathname component is an array, yes, but *most* kernel data
structures loop over our doubly linked lists or over a hash-chain.

And it tends to be really hard to do much loop optimization over list
traversal like that.

>> If the source code doesn't write to it, the assembly the compiler
>> generates doesn't write to it.
>
> If it would be so easy, then answering my questions would have been easy
> too, right?  Which piece of source code?  The whole kernel?

So make the rule be: "in one single function, with all you can see
there", and I'll be happy. Do as much inlining as you want (although
the kernel does have "noinline" too).

With all the normal sequence point and memory invalidation

Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:53 PM, Torvald Riegel  wrote:
>
> For volatile, I agree.
>
> However, the original btrfs example was *without* a volatile, and that's
> why I raised the memory model point.  This triggered an error in a
> concurrent execution, so that's memory model land, at least in C
> language standard.

Sure. The thing is, if you fix the volatile problem, you'll almost
certainly fix our problem too.

The whole "compilers actually do reasonable things" approach really
does work in reality. It in fact works a lot better than reading some
spec and trying to figure out if something is "valid" or not, and
having fifteen different compiler writers and users disagree about
what the meaning of the word "is" is in some part of it.

I'm not kidding. With specs, there really *are* people who spend years
discussing what the meaning of the word "access" is or similar.
Combine that with a big spec that is 500+ pages in size and then try
to apply that all to a project that is 15 million lines of code and
sometimes *knowingly* has to do things that it simply knows are
outside the spec, and the discussion about these kinds of details is
just mental masturbation.


>> We do end up doing
>> much more aggressive threading, with models that C11 simply doesn't
>> cover.
>
> Any specific examples for that would be interesting.

Oh, one of my favorite (NOT!) pieces of code in the kernel is the
implementation of the

   smp_read_barrier_depends()

macro, which on every single architecture except for one (alpha) is a no-op.

We have basically 30 or so empty definitions for it, and I think we
have something like five uses of it. One of them, I think, is
performance critical, and the reason for that macro existing.

What does it do? The semantics is that it's a read barrier between two
different reads that we want to happen in order wrt two writes on the
writing side (the writing side also has to have a "smp_wmb()" to order
those writes). But the reason it isn't a simple read barrier is that
the reads are actually causally *dependent*, ie we have code like

   first_read = read_pointer;
   smp_read_barrier_depends();
   second_read = *first_read;

and it turns out that on pretty much all architectures (except for
alpha), the *data*dependency* will already guarantee that the CPU
reads the thing in order. And because a read barrier can actually be
quite expensive, we don't want to have a read barrier for this case.

But alpha? Its memory consistency is so broken that even the data
dependency doesn't actually guarantee cache access order. It's
strange, yes. No, it's not that alpha does some magic value prediction
and can do the second read without having even done the first read
first to get the address. What's actually going on is that the cache
itself is unordered, and without the read barrier, you may get a stale
version from the cache even if the writes were forced (by the write
barrier in the writer) to happen in the right order.
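The publish/consume pairing described above can be sketched as follows. This is an editorial illustration: the barriers are stubbed out as plain compiler barriers (which is all that non-alpha architectures need for this pattern), and the macro names mirror the kernel's but are local stand-ins, not the real kernel definitions:

```c
#include <assert.h>

/* Local stand-ins for the kernel macros: on everything but alpha,
 * smp_read_barrier_depends() is a no-op, and a compiler barrier is
 * enough to keep the compiler from reordering around the publish. */
#define smp_wmb()                  __asm__ __volatile__("" ::: "memory")
#define smp_read_barrier_depends() __asm__ __volatile__("" ::: "memory")

static int data;
static int *published;

static void writer(int val)
{
    data = val;          /* 1: initialize the object                 */
    smp_wmb();           /* 2: order the init before the publish     */
    published = &data;   /* 3: publish the pointer                   */
}

static int reader(void)
{
    int *p = published;          /* first read: the pointer           */
    smp_read_barrier_depends();  /* no-op except on alpha             */
    return p ? *p : -1;          /* second read depends on the first  */
}
```

The point of the email is exactly that the data dependency between the two reads already orders them on almost every CPU, so the middle line costs nothing except on alpha.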

You really want to try to describe issues like this in your memory
consistency model? No you don't. Nobody will ever really care, except
for crazy kernel people. And quite frankly, not even kernel people
care: we have a fairly big kernel developer community, and the people
who actually talk about memory ordering issues can be counted on one
hand. There's the "RCU guy" who writes the RCU helper functions, and
hides the proper serializing code into those helpers, so that normal
mortal kernel people don't have to care, and don't even have to *know*
how ignorant they are about the things.

And that's also why the compiler shouldn't have to care. It's a really
small esoteric detail, and it can be hidden in a header file and a set
of library routines. Teaching the compiler about crazy memory ordering
would just not be worth it. 99.99% of all programmers will never need
to understand any of it, they'll use the locking primitives and follow
the rules, and the code that makes it all work is basically invisible
to them.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel  wrote:
>
> You do rely on the compiler to do common transformations I suppose:
> hoist loads out of loops, CSE, etc.  How do you expect the compiler to
> know whether they are allowed for a particular piece of code or not?

We have barriers.

Compiler barriers are cheap. They look like this:

include/linux/compiler-gcc.h:
   #define barrier() __asm__ __volatile__("": : :"memory")

and if we don't allow hoisting, we'll often use something like that.

It's not the only thing we do. We have cases where it's not that you
can't hoist things outside of loops, it's that you have to read things
exactly *once*, and then use that particular value (ie the compiler
can't be allowed to reload the value because it may change). So any
*particular* value we read is valid, but we can't allow the compiler
to do anything but a single access.

So we have this beauty:

include/linux/compiler.h:
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

which does that for us (quite often, it's not used as-is: a more
common case is using the "rcu_access_pointer()" inline function that
not only does that ACCESS_ONCE() but also has the capability to enable
run-time dynamic verification that you are in a proper RCU read locked
section etc.).

We also have tons of (very subtle) code that actually creates locks
and memory ordering etc. They usually just use the big "barrier()"
approach to telling the compiler not to combine or move memory
accesses around it.

And I realize that compiler people tend to think that loop hoisting
etc is absolutely critical for performance, and some big hammer like
"barrier()" makes a compiler person wince. You think it results in
horrible code generation problems.

It really doesn't. Loops are fairly unusual in the kernel to begin
with, and the compiler barriers are a total non-issue. We have much
more problems with the actual CPU barriers that can be *very*
expensive on some architectures, and we work a lot at avoiding those
and avoiding cacheline ping-pong issues etc.

>> > No vague assumptions with lots of hand-waving.
>>
>> So here's basically what the kernel needs:
>>
>>  - if we don't touch a field, the compiler doesn't touch it.
>
> "we don't touch a field": what does that mean precisely?  Never touch it
> in the same thread?  Same function?  Same basic block?  Between two
> sequence points?  When crossing synchronizing code?  For example, in the
> above code, can we touch it earlier/later?

Why are you trying to make it more complex than it is?

If the source code doesn't write to it, the assembly the compiler
generates doesn't write to it.

Don't try to make it anything more complicated. This has *nothing* to
do with threads or functions or anything else.

If you do massive inlining, and you don't see any barriers or
conditionals or other reasons not to write to it, just write to it.

Don't try to appear smart and make this into something it isn't.

Look at the damn five-line example of the bug. FIX THE BUG. Don't try
to make it anything bigger than a stupid compiler bug. Don't try to
make this into a problem it simply isn't.

 Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:16 PM, Jakub Jelinek  wrote:
>>
>> So the kernel really doesn't care what you do to things *within* the 
>> bitfield.
>
> But what is *within* the bitfield?  Do you consider s4 or t2 fields
> (non-bitfield fields that just the ABI wants to pack together with
> the bitfield) above as *within* or is already modifying of those a problem?

I seriously think you should just do the obvious thing, and LOOK AT
THE BASE TYPE. That solves everything.

If the definition of the data is (assume "long" to be 64 bits)

   long mybits:32;
   int nonbitfield;

and you end up doing a 64-bit access when you read/write the "mybits"
bitfield, and it overlaps "nonbitfield", then I would also just say
"hey, whatever, the user asked for it". He really did. The
"allocation" for the bitfield comes from the base type, and so you
basically have a "long" to play with.

Let's take an even more extreme example:

   int bits1:8
   char c;
   int bits2:16;

where the "char c" is actually *embedded* in the "bitfield
allocation". But again, it's completely and totally unambiguous what
accesses you would at least start out with: sure, you *could* do a
8-bit access (and you probably should, if the ISA allows it), but
dangit, the base type is "int", so I don't think it's wrong to do a
32-bit load, mask, and a 32-bit write.

But if the user wrote

char bits1:8;
char c;
short bits2:16;

then I really think it would be a *bug* if modifying "bits1" would
possible write to 'c'. Of course, this is again a possible ISA
problem: if the architecture doesn't have byte writes, you can't do
anything about it, but that is obviously true regardless of bitfields,
so that's really a totally irrelevant argument for this case. That
non-atomicity doesn't come from bitfields, it comes from the fact that
the BASE TYPE is again non-atomic.

Notice how it always ends up being about the base type. Simple,
straightforward, and unambiguous.
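As a rough sketch of what a base-type-sized bitfield store looks like once it has been lowered to shift-and-mask (the helper name and interface here are invented for illustration, not any compiler's internals):

```c
#include <assert.h>
#include <stdint.h>

/* Emulate an "int"-sized bitfield store: write `val` into bits
 * [shift, shift+width) of the 32-bit storage unit `word`, touching
 * nothing outside that unit.  Requires shift + width <= 32. */
static uint32_t bitfield_store32(uint32_t word, unsigned shift,
                                 unsigned width, uint32_t val)
{
    uint32_t mask = ((width < 32 ? (1u << width) : 0) - 1u) << shift;
    return (word & ~mask) | ((val << shift) & mask);
}
```

The whole read-modify-write stays within one 32-bit unit, which is exactly why a plain "int" bitfield can never clobber a neighboring non-bitfield member.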

And similarly, to go to the kernel example, if you have

int mybits:2;
int nonbitfield;

then I think it's an obvious *BUG* to access the nonbitfield things
when you access "mybits". It clearly is not at all "within" the
bitfield allocation, and "int" isn't non-atomic on any sane
architecture, so you don't even have the "byte accesses aren't atomic"
issue.

So I really do think that the sane approach is to look at the base
type of the bitfield, like I suggested from the very beginning.

If the base type is an "int", you do an int access. It really solves
so many problems, and it is even entirely sane semantics that you can
*explain* to people. It makes sense, it avoids the clear bugs gcc has
now, and it also solves the performance bug I reported a long time
ago.

Seriously - is there any real argument *against* just using the base
type as a hint for access size?

I realize that it may not be simple, and may not fit the current mess
of gcc bitfields, but I really do think it's the RightThing(tm) to do.
It has sensible semantics, and avoids problems.

In contrast, the *current* gcc semantics are clearly not sensible, and
have both performance and correctness issues.

 Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:01 PM, Linus Torvalds
 wrote:
>
>  - However, while using the *smallest* possible access may generate
> correct code, it often generates really *crappy* code. Which is
> exactly the bug that I reported in
>
>   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

Btw, as I also pointed out in that (old) bugzilla entry, gcc can do
pretty much whatever it damn well pleases with loads from a bitfield.

You can narrow them, you can make them wider, you can paint them pink.

Widening the loads can still be a performance problem, and technically
it's a really bad idea for any nearby "volatile"s too, but in practice
things will *work*.

Writes are much more critical. If you overwrite a field that is no
longer in the bitfield, you can no longer depend on "nobody cares
about bitfield accesses", because by definition you are clearly now
stepping on non-bitfield things.

You do realize that even in user space, and even before C11, people
have things like "sig_atomic_t" etc. So even if you don't like the
notion of "volatile", here's *another* example of this being a real gcc
bug:

   struct mydata {
  sig_atomic_t seen_signal;
  unsigned flags:1;
   };

and now do the same test-program, realizing that "sig_atomic_t" is
normally the exact same thing as "int".

Now, think about what happens if you have a signal handler that comes
in and does

   mydata.seen_signal = 1;

and happens to interrupt the code that changes "mydata.flags" in just
the right spot.

That's right: the bitfield update will possibly *clear* that
"signal-safe" flag again, and gcc just created buggy asm code from
correct C code.
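The lost update can be simulated by hand-writing the 64-bit read-modify-write that the buggy codegen performs. This is an editorial sketch: the layout assumes little-endian x86-64, and the "interrupting signal handler" is simply a line of code placed in the window between the wide load and the wide store:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 8-byte struct: seen_signal in the low 4 bytes, the flag bit in the
 * high 4 (little-endian layout assumed). */
struct mydata { int32_t seen_signal; uint32_t flags; };

static void buggy_set_flag(struct mydata *d)
{
    uint64_t word;
    memcpy(&word, d, sizeof word);  /* 64-bit load of the whole struct */

    /* A signal handler fires here and sets seen_signal... */
    d->seen_signal = 1;

    word |= (uint64_t)1 << 32;      /* set the flag bit in the STALE copy */
    memcpy(d, &word, sizeof word);  /* 64-bit store clobbers seen_signal  */
}
```

Running this leaves the flag set but silently wipes out the handler's write to seen_signal, which is the bug: correct C code, broken generated accesses.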

Guys, if you don't admit that this is a bug, I don't know what to say.

IT IS A GCC BUG.

Don't try to make it into something bigger and related to C++11/C11.
Don't try to talk about "memory models". Just admit that it is a bug
to do a 64-bit write to a 32-bit bitfield, and fix it!

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 11:40 AM, Jakub Jelinek  wrote:
>
> Well, the C++11/C11 model doesn't allow to use the underlying type
> for accesses, consider e.g.
>
> struct S { long s1; unsigned int s2 : 5; unsigned int s3 : 19; unsigned char 
> s4; unsigned int s5; };
> struct T { long s1 : 16; unsigned int s2; };
>
> on e.g. x86_64-linux, S is 16 byte long, field s4 is packed together into
> the same 32 bits as s2 and s3.  While the memory model allows s2 to be
> changed when storing s3 (i.e. use a RMW cycle on it), it doesn't allow s4
> to be changed, as it isn't a bitfield (you'd need s4 : 8 for that).
> T is 8 bytes long, again, s2 is packed into the same 64 bits as s1,
> and the memory model doesn't allow s2 to be modified.
>
> Not sure what the kernel would expect in such a case.

So the kernel really doesn't care what you do to things *within* the bitfield.

Sure, we do have some expectations of packing, but basically we never
expect bitfields to be 'atomic'. You really don't need to worry about
that.

From a correctness standpoint, I really don't think the kernel cares
if gcc does bitfield reads and writes one bit at a time. Seriously.

The bug is that the bitfield write wrote *outside* the bitfield.
That's *WRONG*. It's completely wrong, even if you write back the same
value you read. It's *always* wrong. It's simply not acceptable.

If gcc does that kind of crap, then gcc needs to align bitfields
sufficiently that the accesses that are outside the bitfields never
touch any other fields.

So what I would suggest:

 - use the *smallest* aligned access you can use to access the
bitfield. This will be 'correct', but it may perform badly.

   This is apparently what x86 does. So on x86, if you have a one-bit
bitfield within a bitfield that is really 32 bits wide, it looks like
gcc will generally generate a byte load/store to set that bit. I don't
know exactly what the rules are, but this is fine from a *correctness*
point.

 - However, while using the *smallest* possible access may generate
correct code, it often generates really *crappy* code. Which is
exactly the bug that I reported in

   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

So quite frankly, my suggestion is to just try to use the base type.

But *whatever* you do, using a type *larger* than the whole bitfield
itself is a bug.

And dammit, the people who continue to bring up C11 as if this was a
new issue, and only needs to be worried about if you support C11 (ie
gcc-4.7 and up) are just in denial.

The 'volatile'  example makes it entirely clear that the bug has
nothing what-so-ever to do with C11.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 9:42 AM, Torvald Riegel  wrote:
>
> We need a proper memory model.

Not really.

The fact is, the kernel will happily take the memory model of the
underlying hardware. Trying to impose some compiler description of the
memory model is actually horribly bad, because it automatically also
involves compiler synchronization points - which will try to be
portable and easy to understand, and quite frankly, that will
automatically result in what is technically known as a "shitload of
crap".

Now, a strict memory model is fine for portability, and to make it
simple for programmers. I actually largely approve of the Java memory
model approach, even if it has been horribly problematic and has some
serious problems. But it's not something that is acceptable for the
kernel - we absolutely do *not* want the compiler to introduce some
memory model, because we are very much working together with whatever
the hardware memory model is, and we do our own serialization
primitives.

> No vague assumptions with lots of hand-waving.

So here's basically what the kernel needs:

 - if we don't touch a field, the compiler doesn't touch it.

   This is the rule that gcc now violates with bitfields.

   This is a gcc bug. End of story. The "volatile" example proves it -
anybody who argues otherwise is simply wrong, and is just trying to
make excuses.

 - if we need to touch something only once, or care about something
that is touched only conditionally, and we actually need to make sure
that it is touched *only* under that condition, we already go to quite
extreme lengths to tell the compiler so.

   We usually use an access with a "volatile" cast on it or tricks
like that. Or we do the whole "magic asm that says it modified memory
to let the compiler know not to try anything clever"

 - The kernel IS NOT GOING TO MARK DATA STRUCTURES.

Marking data structures is seriously stupid and wrong. It's not the
*data* that has access rules, it's the *code* that has rules of
access.

The traditional C "volatile" is misdesigned and wrong. We don't
generally mark data volatile, we really mark *code* volatile - which
is why our "volatiles" are in the casts, not on the data structures.

Stuff that is "volatile" in one context is not volatile in another. If
you hold a RCU write lock, it may well be entirely stable, and marking
it volatile is *wrong*, and generating code as if it was volatile is
pure and utter shit.

On the other hand, if you are touching *the*very*same* field while you
are only read-locked for RCU, it may well be one of those "this has to
be read by accessing it exactly once".

And we do all this correctly in the kernel.  Modulo bugs, of course,
but the fundamental rule really is: "atomicity or volatility is about
CODE, not DATA".

And no, C11 does *not* do it correctly. The whole "_Atomic" crap is
exactly the same mistake as "volatile" was. It's wrong. Marking data
_Atomic is a sure sign that whoever designed it didn't understand
things.

> The only candidate that I see is the C++11/C11 model.  Any other
> suggestions?

The C11 model addresses the wrong thing, and addresses it badly.

You might as well ignore it as far as the kernel is concerned. I'm
sure it helps some *other* problem, but it's not the kernel problem.

The rules really are very simple, and the kernel already does
everything to make it easy for the compiler.

When we do something that the compiler cannot re-order around, we do
an asm() and mark it as modifying memory so that the compiler won't
screw things up. In addition, we will do whatever that the CPU
requires for memory ordering, and quite frankly, the compiler will
never have sufficient locking primitives to satisfy us, and there is
no real point in even trying. If you try to do locking in the
compiler, you *will* do it wrong.

If you add random flags on data structures ("_Atomic" or "volatile" or
whatever), you *will* do it wrong. It's simply a fundamentally broken
model.

So the *only* memory model we want from the compiler really is: "don't
write to fields that the source code didn't write to".

It really is that simple.

> So, would it be okay to tell the compiler which part of the state is
> accessed concurrently (ie, locks, atomics, ...)?

Seriously, no.

See above: it's not the "state" that is accessed concurrently. It's
the code. If you ever try to mark state, you've already lost. The same
"state" can be atomic or not depending on context. It's not about the
state or the data structures, and it never will be.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 10:45 AM, Jeff Law  wrote:
>
> Torvald Riegel & I were told that was kernel policy when we brought up the
> upcoming bitfield semantic changes with some of the linux kernel folks last
> year.

Btw, one reason this is true is that the bitfield ordering/packing is
so unspecified that using bitfields sometimes is just way more pain
than it's worth. Some people have tried to use bitfields for various IO
data structures (think laying out bits in memory to match some
particular layout of some disk controller result structure). It's
always a total disaster, because you have another whole level of
architecture-specific bit ordering (in *addition* to all the normal
byte order issues).

That's not a compiler issue, that's just the nature of the beast.
It's just another reason why the kernel often ends up then doing bit
masking by hand.

But *all* that said, we do have a metric buttload of bitfields in the
kernel. It's not some really unusual feature. It does get used a lot,
despite all the reasons why some particular code might not use
bitfields.

We have a lot of code, there's still a lot of situations left where
bitfields are just really convenient.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 10:09 AM, David Miller  wrote:
>
> Personally I've avoided C bitfields like the plague in any code I've
> written.

I do agree with that. The kernel largely tries to avoid bitfields,
usually because we have some really strict rules about different
bitfields, but also because initialization of bitfields tends to
result in gcc generating an incredible mess of the code (while
"explicit bits" allows us to just set all the fields in one go, and
know what the result is).

So core kernel data structures tend to be things like "unsigned long",
together with our various bitop functions that have explicit atomicity
(or non-atomicity) guarantees on a bit-per-bit basis.
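A minimal sketch of that "explicit bits" style, with made-up flag names (the kernel's real bitops additionally come in atomic and non-atomic variants, which this sketch does not model):

```c
#include <assert.h>

/* Flags kept in an unsigned long and manipulated with explicit
 * masks instead of bitfields: initialization is one store, and the
 * access width of every operation is obvious from the code. */
#define FLAG_DIRTY   (1UL << 0)
#define FLAG_LOCKED  (1UL << 1)

static unsigned long set_flags(unsigned long flags, unsigned long mask)
{
    return flags | mask;
}

static int test_flag(unsigned long flags, unsigned long mask)
{
    return (flags & mask) != 0;
}
```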

Sometimes bitfields are really convenient, though, and allow for much
more natural syntax.

I'm not surprised this issue came up in a filesystem, for example.

  Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 9:41 AM, Michael Matz  wrote:
>
> One problem is that it's not a new problem, GCC emitted similar code since
> about forever, and still they turned up only now (well, probably because
> ia64 is dead, but sparc64 should have similar problems).  The bitfield
> handling code is _terribly_ complex and fixing it is quite involved.  So
> don't expect any quick fixes.

I agree that bitfields are nasty, I've had to handle them myself in
sparse. And we have actually had problems with bitfields before, to
the point where we've replaced use of bitfields with masking and
setting bits by hand.

But I also think that gcc is simply *buggy*, and has made them much
nastier than they should be. What gcc *should* have done is to turn
bitfield accesses into shift-and-masking of the underlying field as
early as possible, and then do all optimizations at that level.
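For the read side, early lowering to shift-and-mask looks roughly like this sketch of a signed extract (the helper is invented for illustration; it relies on arithmetic right shift of negative values, which is implementation-defined in C but what GCC and clang implement):

```c
#include <assert.h>
#include <stdint.h>

/* Extract bits [shift, shift+width) of a 32-bit storage word and
 * sign-extend: shift the field up against the top bit, then
 * arithmetic-shift it back down.  Requires 1 <= width and
 * shift + width <= 32. */
static int32_t bitfield_extract32(uint32_t word, unsigned shift,
                                  unsigned width)
{
    return (int32_t)(word << (32 - shift - width)) >> (32 - width);
}
```

Once accesses are in this form, ordinary optimizations can combine and narrow them, instead of the compiler carrying raw bitfield metadata around until late.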

In fact, there is another gcc bug outstanding (48696) where I complain
about absolutely horrible code generation, and that one was actually
the exact same issue except in reverse: gcc wouldn't take the
underlying size of the bitfield into account, and use the wrong
(smaller) size for the access, causing absolutely horrendous code
generation that mixes byte and word accesses, and causes slowdowns by
orders of magnitude.

And it really is the same issue: gcc has forgotten what the base type
is, and tries to "compute" some type using the actual bits. Which is
simply *wrong*. Always has been.

It's stupid, it generates bad code (both from performance *and* from a
correctness angle), and it has actually resulted in gcc having *extra*
complexity because it keeps the bitfield data around much too late.

> The other problem is specification.  While you think that the code you
> wrote is totally obvious it might not actually be so.  For instance, what
> about this struct:
>
> {long l:32; int i1:16; short s; int i2:1; char c:7; short s2:8; short s3;}
>
> What are the obviously correct accesses for various writes into this
> struct?

I really don't think that example matters. You should instead turn the
question around, and look at the real code examples, make *those*
generate good and obviously correct code, and then *after* you've done
that, you start looking at the mixed case where people do odd things.

Quite frankly, the layout that makes *sense* for something like the
above is not to pack them at all: you'd just use the plain unpacked
types and let them take their natural sizes.

If somebody actually wants packing, they'd do the *sane* thing, which is

  unsigned int l:32, i1:16, s:16, i2:1, c:7, s2:8, s3:16;

and now it's obvious what the packing rules and the access rules are.

That said, I just tried, and 'sparse' actually gets your crazy example
right (where "right" is defined as what I *think* you want): optimally
packed. And it turns the accesses into the ones that are based on the
base type.  And I'm pretty sure the sparse bitfield rules are way
simpler than gcc, exactly because it tries to convert bitfields fairly
early into mask-and-set. So if I do

  struct {long l:32; int i1:16; short s; int i2:1; char c:7; short
s2:8; short s3;} dummy;

  int main(int argc, char **argv)
  {
dummy.l = 1;
dummy.i1 = 2;
dummy.s = 3;
dummy.i2 = 4;
dummy.c = 5;
dummy.s2 = 6;
dummy.s3 = 7;
  }

and then do a test-linearize (with "-m64" to show the difference
between "long" and "int") I get

  t.c:2:48: error: dubious one-bit signed bitfield
  main:
  .L0x7f928e378010:

load.64 %r1 <- 0[dummy]
and.64  %r2 <- %r1, $0x
or.64   %r3 <- %r2, $1
store.64%r3 -> 0[dummy]
load.32 %r4 <- 4[dummy]
and.32  %r5 <- %r4, $0x
or.32   %r6 <- %r5, $2
store.32%r6 -> 4[dummy]
store.16$3 -> 6[dummy]
load.32 %r7 <- 8[dummy]
and.32  %r8 <- %r7, $-2
store.32%r8 -> 8[dummy]
load.8  %r10 <- 8[dummy]
and.8   %r12 <- %r10, $-255
or.8%r13 <- %r12, $10
store.8 %r13 -> 8[dummy]
load.16 %r14 <- 8[dummy]
and.16  %r16 <- %r14, $0x00ff
or.16   %r17 <- %r16, $0x600
store.16%r17 -> 8[dummy]
store.16$7 -> 10[dummy]
ret.32  $7

which I think is what you'd expect. Now, sparse doesn't do the smart
"combine shifts-and-masks to memory", but that's an optimization phase
that should be done long after bitfields *anyway*, so that's kind of
immaterial.

And yes, look at how you can see how sparse mixes different access
sizes for different fields. But that is damn well what the programmer
asked for.

If programmers write stupid code, the compiler might as well generate odd code.

But if a programmer writes *good* code, and the compiler messes it up
because it *thinks* the code is stupid, then the compiler is just bad.

> One rule may be to never write to a bitfield with accesses larger than
> their underlying

Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 9:11 AM, Jiri Kosina  wrote:
> On Wed, 1 Feb 2012, Linus Torvalds wrote:
>>
>> And I suspect it really is a generic bug that can be shown even with
>> the above trivial example.
>
> I have actually tried exactly this earlier today (because while looking at
> this, I had an idea that putting volatile in place could be a workaround,
> causing gcc to generate a saner code), but it doesn't work either:
>
> # cat x.c
> struct x {
>    long a;
>    volatile unsigned int lock;
>    unsigned int full:1;
> };
>
> void
> wrong(struct x *ptr)
> {
>        ptr->full = 1;
> }
>
> int main()
> {
>        wrong(0);
> }
> # gcc -O2 x.c
> # gdb -q ./a.out
> Reading symbols from /root/a.out...done.
> (gdb) disassemble wrong
> Dump of assembler code for function wrong:
>   0x45c0 <+0>:     [MMI]       adds r32=8,r32
>   0x45c1 <+1>:                 nop.m 0x0
>   0x45c2 <+2>:                 mov r15=1;;
>   0x45d0 <+16>:    [MMI]       ld8 r14=[r32];;
>   0x45d1 <+17>:                nop.m 0x0
>   0x45d2 <+18>:                dep r14=r15,r14,32,1;;
>   0x45e0 <+32>:    [MIB]       st8 [r32]=r14
>   0x45e1 <+33>:                nop.i 0x0
>   0x45e2 <+34>:                br.ret.sptk.many b0;;
>
> In my opinion, this is a clear bug in gcc (while the original problem,
> without explicit volatile, is not a C spec violation per se, it's just
> very inconvenient :) ).

Yup, gcc is clearly just buggy here. I do not believe there is any
question what-so-ever about the above test-case showing a bug.

And the thing is, if they fix this bug, they'll fix our problem too,
unless they are going to write explicit code to *try* to screw us over
while fixing that 'volatile' bug.

Because the right thing to do with bitfields is really to take the
base type into account. If the bitfield was in an "int", you use an
"int" access for it, not a 64-bit access. That's the simple fix for
the volatile problem, and it just happens to fix our issue too.

Trying to explicitly *look* for volatiles, and only doing the 32-bit
access when you see them is actually extra code, and extra effort, and
doesn't even *help* anything. It's not like the 64-bit access is
somehow "better".

I can see some vindictive programmer doing that, while thinking "I'll
show these people who pointed out this bug in my code, mhwhahahahaa!
I'll fix their test-case while still leaving the real problem
unaddressed", but I don't think compiler people are quite *that* evil.
 Yes, they are evil people who are trying to trip us up, but still..

   Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 9:08 AM, Torvald Riegel  wrote:
>
> What do the kernel folks think about the C11 memory model?  If you can
> spot any issues in there, the GCC community would certainly like to
> know.

I don't think this is about memory models except very tangentially.

Gcc currently accesses fields in a way that affects *independent*
fields, without checking their memory models at all.

Even original C already has one memory model: "volatile" data is seen
outside the virtual C machine. And Jiri reports that even that
*original* memory model is being violated. We're talking about the one
from about 40 years ago.

So no amount of memory model will ever matter, if gcc can't even get
the 40-year old one right.

Also, the C11 memory model really isn't going to make any difference
to this. Again, because this happens even if we have explicit locking
for the different fields. So they aren't "atomic" or anything like
that, because we do our own (and sufficient) locking. But once the
compiler starts to randomly touch data that just happens to be
adjacent to other data, any memory model about those individual pieces
is going to be entirely immaterial anyway.

Finally: I do not believe for a second that we can actually use the
C11 memory model in the kernel and say "that's sufficient for our
needs". We will continue to have to do things that are "outside the
specs" for the very simple reason that the kernel isn't just some
random application that happens to use pthreads. We do end up doing
much more aggressive threading, with models that C11 simply doesn't
cover.

_Atomic simply isn't going to make any difference to us. Sure, we
might use it in a couple of places (where we have sometimes tried to
use volatile casts, for example), but it would be totally irrelevant
in this case exactly because *neither* field is really atomic - they
are both protected, it's just that they are *separately* protected, so
accessing one does *not* mean that you can access the other, even if
they are next to each other.

  Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 8:37 AM, Colin Walters  wrote:
>
> 1) Use the same lock for a given bitfield

That's not the problem. All the *bitfield* fields are accessed
within the same word already.

> 2) Split up the bitfield into different words

Again, it's not the bitfield that is the problem.

The problem is that the compiler - when it writes to the word that
contains the bitfield - will also corrupt the word *NEXT* to the
bitfield (or before - it probably depends on alignment).

So we have two separate 32-bit words - one of which just happens to
contain a bitfield. We write to these *separate* words using different
locking rules (btw, they don't even need to be protected by a lock: we
may have other rules that protect the individual word contents).

But the compiler turns the access to the bitfield (in a 32-bit aligned
word) into a 64-bit access that accesses the word *next* to it.

That word next to it might *be* the lock, for example.

So we could literally have this kind of situation:

   struct {
  atomic_t counter;
  unsigned int val:4, other:4, data:24;
   };

and if we write code like this:

spin_lock(&somelock);
s->data++;
spin_unlock(&somelock);

and on another CPU we might do

   atomic_inc(&counter);

and the access to the bitfield will *corrupt* the atomic counter, even
though both of them are perfectly fine!

Quite frankly, if the bug is simply because gcc doesn't actually know
or care about the underlying size of the bitmask, it is possible that
we can find a case where gcc clearly is buggy even according to the
original C rules.

Honza - since you have access to the compiler in question, try
compiling this trivial test-program:


   struct example {
  volatile int a;
  int b:1;
   };

   ..
 s->b = 1;
   ..

and if that bitfield access actually does a 64-bit access that also
touches 's->a', then dammit, that's a clear violation of even the
*old* C standard, and the gcc people cannot just wave away their bugs
by saying "we've got standards, pttthththt".

And I suspect it really is a generic bug that can be shown even with
the above trivial example.

   Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 7:19 AM, Jan Kara  wrote:
>
>  we've spotted the following mismatch between what kernel folks expect
> from a compiler and what GCC really does, resulting in memory corruption on
> some architectures.

This is sad.

We've had something like this before due to architectural reasons
(alpha inability to do byte loads and stores leading us to not be able
to have items smaller than a byte sharing a word).

But at least then there was a *reason* for it, not "the compiler is
being difficult for no good reason".

Actually, the *sad* part is not that the compiler does something
unexpected - that's not new - but the usual "..and the gcc crowd
doesn't care, because they can point to paperwork saying it's not
defined". Even if that same paper is actually in the process of
getting updated exactly because it causes problems like ours.

That mentality is not new, of course.

> So it seems what C/GCC promises does not quite match with what kernel
> expects.

The paper C standard can *never* promise what a kernel expects. There
are tons of unspecified things that a compiler could do, including
moving memory around behind our back as long as it moves it back.
Because it's all "invisible" in the virtual C machine in the absence
of volatiles. The fact that the kernel has things like SMP coherency
requirements is simply not covered by the standard. There are tons of
other things not covered by the standard too that are just "that's
what we need".

So C/gcc has never "promised" anything in that sense, and we've always
had to make assumptions about what is reasonable code generation. Most
of the time, our assumptions are correct, simply because it would be
*stupid* for a C compiler to do anything but what we assume it does.

But sometimes compilers do stupid things. Using 8-byte accesses to a
4-byte entity is *stupid*, when it's not even faster, and when the
base type has been specified to be 4 bytes!

> I'm not really an expert in this area so I wanted to report it
> here so that more knowledgeable people can decide how to solve the issue...

If the gcc people aren't willing to agree that this is actually a flaw
in the standard (one that is being addressed, no less) and try to fix
it, we just have to extend our assumptions to something like "a
compiler would be stupid to ever access anything bigger than the
aligned register-size area". It's still just an assumption, and
compiler people could be crazy, but we could just extend the current
alpha rules to cover not just "int", but "long" too.

Sure, the compiler people could use "load/store multiple" or something
like that, but it would be obviously crazy code, so if it happens past
a register size, at least you could argue that it's a performance
issue and maybe the gcc people would care.

HOWEVER. The problem with the alpha rules (which, btw, were huge, and
led to the CPU designers literally changing the CPU instruction set
because they admitted they made a huge mistake) was never so much the
occasional memory corruption, as the fact that the places where it
could happen were basically almost impossible to find.

So we probably have tons of random places in the kernel that violate
even the alpha rules - because the corruption is so rare, and the
architecture was so rare as to making the corruption even harder to
find.

I assume this code generation idiocy only happens with bitfields? The
problem is, we really don't want to make all bitfields take up 64 bits
just because we might have a lock or something else next to them. But
we could probably re-order things (and then *occasionally* wasting
memory) if we could just find the ones that are problematic.

It's finding the problematic ones that is the issue. Which is why the
compiler should just generate code that matches what we told it to do,
not try to be "clever" in ways that doesn't even help performance! The
compiler simply doesn't know enough about the requirements. Never has,
never will, and this is not about "standards".

   Linus

