Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 10:45 AM, Jeff Law  wrote:
>
> Torvald Riegel & I were told that was kernel policy when we brought up the
> upcoming bitfield semantic changes with some of the linux kernel folks last
> year.

Btw, one reason this is true is that the bitfield ordering/packing is
so unspecified that using bitfields is sometimes just way more pain
than it's worth. Some people have tried to use bitfields for various IO
data structures (think laying out bits in memory to match some
particular layout of some disk controller result structure). It's
always a total disaster, because you have another whole level of
architecture-specific bit ordering (in *addition* to all the normal
byte order issues).

That's not a compiler issue, that's just the nature of the beast.
It's just another reason why the kernel often ends up then doing bit
masking by hand.

But *all* that said, we do have a metric buttload of bitfields in the
kernel. It's not some really unusual feature. It does get used a lot,
despite all the reasons why some particular code might not use
bitfields.

We have a lot of code, and there are still a lot of situations where
bitfields are just really convenient.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 9:42 AM, Torvald Riegel  wrote:
>
> We need a proper memory model.

Not really.

The fact is, the kernel will happily take the memory model of the
underlying hardware. Trying to impose some compiler description of the
memory model is actually horribly bad, because it automatically also
involves compiler synchronization points - which will try to be
portable and easy to understand, and quite frankly, that will
automatically result in what is technically known as a "shitload of
crap".

Now, a strict memory model is fine for portability, and to make it
simple for programmers. I actually largely approve of the Java memory
model approach, even if it has been horribly problematic and has some
serious problems. But it's not something that is acceptable for the
kernel - we absolutely do *not* want the compiler to introduce some
memory model, because we are very much working together with whatever
the hardware memory model is, and we do our own serialization
primitives.

> No vague assumptions with lots of hand-waving.

So here's basically what the kernel needs:

 - if we don't touch a field, the compiler doesn't touch it.

   This is the rule that gcc now violates with bitfields.

   This is a gcc bug. End of story. The "volatile" example proves it -
anybody who argues otherwise is simply wrong, and is just trying to
make excuses.

 - if we need to touch something only once, or care about something
that is touched only conditionally, and we actually need to make sure
that it is touched *only* under that condition, we already go to quite
extreme lengths to tell the compiler so.

   We usually use an access with a "volatile" cast on it or tricks
like that. Or we do the whole "magic asm that says it modified memory
to let the compiler know not to try anything clever"

 - The kernel IS NOT GOING TO MARK DATA STRUCTURES.

Marking data structures is seriously stupid and wrong. It's not the
*data* that has access rules, it's the *code* that has rules of
access.

The traditional C "volatile" is misdesigned and wrong. We don't
generally mark data volatile, we really mark *code* volatile - which
is why our "volatiles" are in the casts, not on the data structures.

Stuff that is "volatile" in one context is not volatile in another. If
you hold a RCU write lock, it may well be entirely stable, and marking
it volatile is *wrong*, and generating code as if it was volatile is
pure and utter shit.

On the other hand, if you are touching *the*very*same* field while you
are only read-locked for RCU, it may well be one of those "this has to
be read by accessing it exactly once".

And we do all this correctly in the kernel.  Modulo bugs, of course,
but the fundamental rule really is: "atomicity or volatility is about
CODE, not DATA".

And no, C11 does *not* do it correctly. The whole "_Atomic" crap is
exactly the same mistake as "volatile" was. It's wrong. Marking data
_Atomic is a sure sign that whoever designed it didn't understand
things.

> The only candidate that I see is the C++11/C11 model.  Any other
> suggestions?

The C11 model addresses the wrong thing, and addresses it badly.

You might as well ignore it as far as the kernel is concerned. I'm
sure it helps some *other* problem, but it's not the kernel problem.

The rules really are very simple, and the kernel already does
everything to make it easy for the compiler.

When we do something that the compiler cannot re-order around, we do
an asm() and mark it as modifying memory so that the compiler won't
screw things up. In addition, we will do whatever the CPU
requires for memory ordering, and quite frankly, the compiler will
never have sufficient locking primitives to satisfy us, and there is
no real point in even trying. If you try to do locking in the
compiler, you *will* do it wrong.

If you add random flags on data structures ("_Atomic" or "volatile" or
whatever), you *will* do it wrong. It's simply a fundamentally broken
model.

So the *only* memory model we want from the compiler really is: "don't
write to fields that the source code didn't write to".

It really is that simple.

> So, would it be okay to tell the compiler which part of the state is
> accessed concurrently (ie, locks, atomics, ...)?

Seriously, no.

See above: it's not the "state" that is accessed concurrently. It's
the code. If you ever try to mark state, you've already lost. The same
"state" can be atomic or not depending on context. It's not about the
state or the data structures, and it never will be.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 11:40 AM, Jakub Jelinek  wrote:
>
> Well, the C++11/C11 model doesn't allow to use the underlying type
> for accesses, consider e.g.
>
> struct S { long s1; unsigned int s2 : 5; unsigned int s3 : 19; unsigned char s4; unsigned int s5; };
> struct T { long s1 : 16; unsigned int s2; };
>
> on e.g. x86_64-linux, S is 16 byte long, field s4 is packed together into
> the same 32 bits as s2 and s3.  While the memory model allows s2 to be
> changed when storing s3 (i.e. use a RMW cycle on it), it doesn't allow s4
> to be changed, as it isn't a bitfield (you'd need s4 : 8 for that).
> T is 8 bytes long, again, s2 is packed into the same 64 bits as s1,
> and the memory model doesn't allow s2 to be modified.
>
> Not sure what the kernel would expect in such a case.

So the kernel really doesn't care what you do to things *within* the bitfield.

Sure, we do have some expectations of packing, but basically we never
expect bitfields to be 'atomic'. You really don't need to worry about
that.

From a correctness standpoint, I really don't think the kernel cares
if gcc does bitfield reads and writes one bit at a time. Seriously.

The bug is that the bitfield write wrote *outside* the bitfield.
That's *WRONG*. It's completely wrong, even if you write back the same
value you read. It's *always* wrong. It's simply not acceptable.
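The failure mode being described can be reproduced by hand. The sketch below is illustrative only: the struct and the `lost_update_demo` helper are made up (not the original btrfs code), and the 8-byte `memcpy` mimics the wide read-modify-write a buggy compiler would emit, assuming a typical LP64 little-endian ABI where the struct occupies one 64-bit word:

```c
#include <assert.h>
#include <string.h>

/* Illustrative layout (hypothetical, not a kernel struct): 'flag' and
 * 'counter' are logically independent, but share a 64-bit word. */
struct shared {
    unsigned int flag : 1;  /* conceptually guarded by lock A */
    int counter;            /* conceptually guarded by lock B */
};

/* Simulates the buggy code generation for "s.flag = 1": a 64-bit
 * read-modify-write that also rewrites 'counter'.  Returns the value
 * of 'counter' after an update to it raced with the wide store. */
static int lost_update_demo(void)
{
    struct shared s = { 0, 0 };
    unsigned long long word;

    assert(sizeof s == sizeof word);   /* holds on typical LP64 ABIs  */
    memcpy(&word, &s, sizeof word);    /* wide load of both fields    */
    s.counter = 42;                    /* "other CPU" updates counter */
    word |= 1ULL;                      /* set the flag bit            */
    memcpy(&s, &word, sizeof word);    /* wide store clobbers counter */
    return s.counter;                  /* 0: the update was lost      */
}
```

A correct compiler must confine the store for `s.flag = 1` to the bitfield's own storage unit, so a concurrent update to `counter` can never be lost this way.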

If gcc does that kind of crap, then gcc needs to align bitfields
sufficiently that the accesses that are outside the bitfields never
touch any other fields.

So what I would suggest:

 - use the *smallest* aligned access you can use to access the
bitfield. This will be 'correct', but it may perform badly.

   This is apparently what x86 does. So on x86, if you have a one-bit
bitfield within a bitfield that is really 32 bits wide, it looks like
gcc will generally generate a byte load/store to set that bit. I don't
know exactly what the rules are, but this is fine from a *correctness*
point.

 - However, while using the *smallest* possible access may generate
correct code, it often generates really *crappy* code. Which is
exactly the bug that I reported in

   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696
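The first of those strategies -- touch only the smallest aligned unit holding the bits -- can be written out by hand. This is a sketch: the helper name is made up, and the bit-to-byte mapping assumes a little-endian layout:

```c
#include <stdint.h>

/* Set bit 'bit' of a bitfield storage region by a read-modify-write
 * of the single byte that contains it: correct (it cannot touch any
 * neighboring field), but potentially slow. */
static void set_bit_bytewise(uint8_t *base, unsigned bit)
{
    base[bit / 8] |= (uint8_t)(1u << (bit % 8));
}
```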

So quite frankly, my suggestion is to just try to use the base type.

But *whatever* you do, using a type *larger* than the whole bitfield
itself is a bug.

And dammit, the people who continue to bring up C11 as if this was a
new issue, and only needs to be worried about if you support C11 (ie
gcc-4.7 and up) are just in denial.

The 'volatile' example makes it entirely clear that the bug has
nothing what-so-ever to do with C11.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:01 PM, Linus Torvalds
 wrote:
>
>  - However, while using the *smallest* possible access may generate
> correct code, it often generates really *crappy* code. Which is
> exactly the bug that I reported in
>
>   http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

Btw, as I also pointed out in that (old) bugzilla entry, gcc can do
pretty much whatever it damn well pleases with loads from a bitfield.

You can narrow them, you can make them wider, you can paint them pink.

Widening the loads can still be a performance problem, and technically
it's a really bad idea for any nearby "volatile"s too, but in practice
things will *work*.

Writes are much more critical. If you overwrite a field that is no
longer in the bitfield, you can no longer depend on "nobody cares
about bitfield accesses", because by definition you are clearly now
stepping on non-bitfield things.

You do realize that even in user space, and even before C11, people
have things like "sig_atomic_t" etc. So even if you don't like the
notion of "volatile", here's *another* example of this being a real gcc
bug:

   struct mydata {
  sig_atomic_t seen_signal;
  unsigned flags:1;
   };

and now do the same test-program, realizing that "sig_atomic_t" is
normally the exact same thing as "int".

Now, think about what happens if you have a signal handler that comes
in and does

   mydata.seen_signal = 1;

and happens to interrupt the code that changes "mydata.flags" in just
the right spot.

That's right: the bitfield update will possibly *clear* that
"signal-safe" flag again, and gcc just created buggy asm code from
correct C code.

Guys, if you don't admit that this is a bug, I don't know what to say.

IT IS A GCC BUG.

Don't try to make it into something bigger and related to C++11/C11.
Don't try to talk about "memory models". Just admit that it is a bug
to do a 64-bit write to a 32-bit bitfield, and fix it!

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:16 PM, Jakub Jelinek  wrote:
>>
>> So the kernel really doesn't care what you do to things *within* the bitfield.
>
> But what is *within* the bitfield?  Do you consider s4 or t2 fields
> (non-bitfield fields that just the ABI wants to pack together with
> the bitfield) above as *within* or is already modifying of those a problem?

I seriously think you should just do the obvious thing, and LOOK AT
THE BASE TYPE. That solves everything.

If the definition of the data is (assume "long" to be 64 bits)

   long mybits:32;
   int nonbitfield;

and you end up doing a 64-bit access when you read/write the "mybits"
bitfield, and it overlaps "nonbitfield", then I would also just say
"hey, whatever, the user asked for it". He really did. The
"allocation" for the bitfield comes from the base type, and so you
basically have a "long" to play with.

Let's take an even more extreme example:

   int bits1:8;
   char c;
   int bits2:16;

where the "char c" is actually *embedded* in the "bitfield
allocation". But again, it's completely and totally unambiguous what
accesses you would at least start out with: sure, you *could* do a
8-bit access (and you probably should, if the ISA allows it), but
dangit, the base type is "int", so I don't think it's wrong to do a
32-bit load, mask, and a 32-bit write.
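That 32-bit load/mask/store is the classic base-type bitfield update. As a hand-written sketch (the helper name is made up; real field positions and widths are ABI-dependent, and `width` must be between 1 and 31 here):

```c
#include <stdint.h>

/* Insert 'value' into the field occupying 'width' bits starting at bit
 * 'shift' of a 32-bit word, using one full-word read-modify-write. */
static uint32_t insert_field(uint32_t word, uint32_t value,
                             unsigned shift, unsigned width)
{
    uint32_t mask = (((uint32_t)1 << width) - 1) << shift;
    return (word & ~mask) | ((value << shift) & mask);
}
```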

But if the user wrote

char bits1:8;
char c;
short bits2:16;

then I really think it would be a *bug* if modifying "bits1" would
possible write to 'c'. Of course, this is again a possible ISA
problem: if the architecture doesn't have byte writes, you can't do
anything about it, but that is obviously true regardless of bitfields,
so that's really a totally irrelevant argument for this case. That
non-atomicity doesn't come from bitfields, it comes from the fact that
the BASE TYPE is again non-atomic.

Notice how it always ends up being about the base type. Simple,
straightforward, and unambiguous.

And similarly, to go to the kernel example, if you have

int mybits:2;
int nonbitfield;

then I think it's an obvious *BUG* to access the nonbitfield things
when you access "mybits". It clearly is not at all "within" the
bitfield allocation, and "int" isn't non-atomic on any sane
architecture (so you don't even have the "byte accesses aren't atomic"
issue).

So I really do think that the sane approach is to look at the base
type of the bitfield, like I suggested from the very beginning.

If the base type is an "int", you do an int access. It really solves
so many problems, and it is even entirely sane semantics that you can
*explain* to people. It makes sense, it avoids the clear bugs gcc has
now, and it also solves the performance bug I reported a long time
ago.

Seriously - is there any real argument *against* just using the base
type as a hint for access size?

I realize that it may not be simple, and may not fit the current mess
of gcc bitfields, but I really do think it's the RightThing(tm) to do.
It has sensible semantics, and avoids problems.

In contrast, the *current* gcc semantics are clearly not sensible, and
have both performance and correctness issues.

 Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel  wrote:
>
> You do rely on the compiler to do common transformations I suppose:
> hoist loads out of loops, CSE, etc.  How do you expect the compiler to
> know whether they are allowed for a particular piece of code or not?

We have barriers.

Compiler barriers are cheap. They look like this:

include/linux/compiler-gcc.h:
   #define barrier() __asm__ __volatile__("": : :"memory")

and if we don't allow hoisting, we'll often use something like that.
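As a (non-kernel) illustration of the pattern -- the `barrier()` definition is the one quoted above, but the wait loop and the `flag`/`wait_for_flag` names are made up for this sketch:

```c
/* Compiler barrier, as in the kernel's compiler-gcc.h (GCC-specific). */
#define barrier() __asm__ __volatile__("" : : : "memory")

static int flag;  /* in real code: set by an interrupt handler or another CPU */

/* Spin until 'flag' becomes set or we run out of patience.  The barrier
 * tells the compiler it may not cache 'flag' in a register, so the load
 * cannot be hoisted out of the loop. */
static int wait_for_flag(int max_spins)
{
    while (!flag && max_spins-- > 0)
        barrier();
    return flag;
}
```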

It's not the only thing we do. We have cases where it's not that you
can't hoist things outside of loops, it's that you have to read things
exactly *once*, and then use that particular value (ie the compiler
can't be allowed to reload the value because it may change). So any
*particular* value we read is valid, but we can't allow the compiler
to do anything but a single access.

So we have this beauty:

include/linux/compiler.h:
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

which does that for us (quite often, it's not used as-is: a more
common case is using the "rcu_access_pointer()" inline function that
not only does that ACCESS_ONCE() but also has the capability to enable
run-time dynamic verification that you are in a proper RCU read locked
section, etc.)
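A small (non-kernel) illustration of what the macro buys you; `__typeof__` is used instead of `typeof` only so the sketch compiles under stricter modes, and the function is made up:

```c
/* The kernel's read-exactly-once idiom: volatile lives in the cast,
 * not on the data. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Both uses of 'v' below refer to one single load of *p: the compiler
 * is not allowed to reload '*p' between them, even when optimizing. */
static int read_once_twice(int *p)
{
    int v = ACCESS_ONCE(*p);
    return v + v;
}
```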

We also have tons of (very subtle) code that actually creates locks
and memory ordering etc. They usually just use the big "barrier()"
approach to telling the compiler not to combine or move memory
accesses around it.

And I realize that compiler people tend to think that loop hoisting
etc is absolutely critical for performance, and some big hammer like
"barrier()" makes a compiler person wince. You think it results in
horrible code generation problems.

It really doesn't. Loops are fairly unusual in the kernel to begin
with, and the compiler barriers are a total non-issue. We have much
more problems with the actual CPU barriers that can be *very*
expensive on some architectures, and we work a lot at avoiding those
and avoiding cacheline ping-pong issues etc.

>> > No vague assumptions with lots of hand-waving.
>>
>> So here's basically what the kernel needs:
>>
>>  - if we don't touch a field, the compiler doesn't touch it.
>
> "we don't touch a field": what does that mean precisely?  Never touch it
> in the same thread?  Same function?  Same basic block?  Between two
> sequence points?  When crossing synchronizing code?  For example, in the
> above code, can we touch it earlier/later?

Why are you trying to make it more complex than it is?

If the source code doesn't write to it, the assembly the compiler
generates doesn't write to it.

Don't try to make it anything more complicated. This has *nothing* to
do with threads or functions or anything else.

If you do massive inlining, and you don't see any barriers or
conditionals or other reasons not to write to it, just write to it.

Don't try to appear smart and make this into something it isn't.

Look at the damn five-line example of the bug. FIX THE BUG. Don't try
to make it anything bigger than a stupid compiler bug. Don't try to
make this into a problem it simply isn't.

 Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 12:53 PM, Torvald Riegel  wrote:
>
> For volatile, I agree.
>
> However, the original btrfs example was *without* a volatile, and that's
> why I raised the memory model point.  This triggered an error in a
> concurrent execution, so that's memory model land, at least in C
> language standard.

Sure. The thing is, if you fix the volatile problem, you'll almost
certainly fix our problem too.

The whole "compilers actually do reasonable things" approach really
does work in reality. It in fact works a lot better than reading some
spec and trying to figure out if something is "valid" or not, and
having fifteen different compiler writers and users disagree about
what the meaning of the word "is" is in some part of it.

I'm not kidding. With specs, there really *are* people who spend years
discussing what the meaning of the word "access" is or similar.
Combine that with a big spec that is 500+ pages in size and then try
to apply that all to a project that is 15 million lines of code and
sometimes *knowingly* has to do things that it simply knows are
outside the spec, and the discussion about these kinds of details is
just mental masturbation.


>> We do end up doing
>> much more aggressive threading, with models that C11 simply doesn't
>> cover.
>
> Any specific examples for that would be interesting.

Oh, one of my favorite (NOT!) pieces of code in the kernel is the
implementation of the

   smp_read_barrier_depends()

macro, which on every single architecture except for one (alpha) is a no-op.

We have basically 30 or so empty definitions for it, and I think we
have something like five uses of it. One of them, I think, is
performance critical, and the reason for that macro existing.

What does it do? The semantics is that it's a read barrier between two
different reads that we want to happen in order wrt two writes on the
writing side (the writing side also has to have a "smp_wmb()" to order
those writes). But the reason it isn't a simple read barrier is that
the reads are actually causally *dependent*, ie we have code like

   first_read = read_pointer;
   smp_read_barrier_depends();
   second_read = *first_read;

and it turns out that on pretty much all architectures (except for
alpha), the *data*dependency* will already guarantee that the CPU
reads the thing in order. And because a read barrier can actually be
quite expensive, we don't want to have a read barrier for this case.

But alpha? Its memory consistency is so broken that even the data
dependency doesn't actually guarantee cache access order. It's
strange, yes. No, it's not that alpha does some magic value prediction
and can do the second read without having even done the first read
first to get the address. What's actually going on is that the cache
itself is unordered, and without the read barrier, you may get a stale
version from the cache even if the writes were forced (by the write
barrier in the writer) to happen in the right order.
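The publish/consume pairing described above, in schematic C. Everything here is illustrative: the no-op asm stubs stand in for the real per-architecture `smp_wmb()`/`smp_read_barrier_depends()` macros (on everything but alpha the latter genuinely is a no-op), and the struct and function names are made up:

```c
/* No-op stand-ins for the architecture barrier macros (sufficient on a
 * single thread; real code uses the per-arch definitions). */
#define smp_wmb()                  __asm__ __volatile__("" : : : "memory")
#define smp_read_barrier_depends() __asm__ __volatile__("" : : : "memory")

struct item { int data; };
static struct item *shared_ptr;

static void publish(struct item *it)
{
    it->data = 42;
    smp_wmb();          /* order the data store before the pointer store */
    shared_ptr = it;
}

static int consume(void)
{
    struct item *p = shared_ptr;
    if (!p)
        return -1;
    smp_read_barrier_depends();  /* needed only on alpha */
    return p->data;              /* the data dependency orders this load */
}
```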

You really want to try to describe issues like this in your memory
consistency model? No you don't. Nobody will ever really care, except
for crazy kernel people. And quite frankly, not even kernel people
care: we have a fairly big kernel developer community, and the people
who actually talk about memory ordering issues can be counted on one
hand. There's the "RCU guy" who writes the RCU helper functions, and
hides the proper serializing code into those helpers, so that normal
mortal kernel people don't have to care, and don't even have to *know*
how ignorant they are about the things.

And that's also why the compiler shouldn't have to care. It's a really
small esoteric detail, and it can be hidden in a header file and a set
of library routines. Teaching the compiler about crazy memory ordering
would just not be worth it. 99.99% of all programmers will never need
to understand any of it, they'll use the locking primitives and follow
the rules, and the code that makes it all work is basically invisible
to them.

Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 1:24 PM, Torvald Riegel  wrote:
>> It's not the only thing we do. We have cases where it's not that you
>> can't hoist things outside of loops, it's that you have to read things
>> exactly *once*, and then use that particular value (ie the compiler
>> can't be allowed to reload the value because it may change). So any
>> *particular* value we read is valid, but we can't allow the compiler
>> to do anything but a single access.
>
> That's what an access to an atomics-typed var would give you as well.

Right. The concept of "atomic" access is required.

The problem I have with treating this as a C11 issue is that people
aren't using C11 compilers, and won't for a long while.

So quite frankly, it won't be reasonable for the kernel people to say
"use a C11 version" for years to come.

And the thing is, "atomic" doesn't actually seem to buy us anything in
the kernel afaik. We're perfectly happy to use "volatile". We don't
want the data structure annotated, we want the code annotated, so the
semantic differences between 'volatile' and 'atomic' seem to be pretty
irrelevant.

Sure, there is probably some subtle difference, but on the whole, it
all boils down to "we can't depend on new rules for several years, and
even when we can, it doesn't look like they'd buy us a lot, because we
have to play so many games around them *anyway*".

And don't get me wrong. I don't think that means that C11 is *bad*.
It's just that the kernel is very different from most other projects.
We have to have those crazy architecture-specific header files and
random synchronization macros etc anyway.

C11 is not - I think - even meant to be geared towards the Linux
kernel kind of crazy use. We really do some odd things, adding
compiler features for them is mostly a mistake. asm() takes care of a
lot of the oddities for us, it's not like all of them are about memory
accesses or concurrency either.

I do think that the threading issues in C11 are going to help us
kernel people, because as more people think about issues with
concurrency, they really *will* be hitting some of the issues we've
been having. So it clearly forces clarification of just what the word
"access" implies, for example, which is certainly not going to be bad
for the kernel.

>> It really doesn't. Loops are fairly unusual in the kernel to begin
>> with, and the compiler barriers are a total non-issue.
>
> This is interesting, because GCC is currently not optimizing across
> atomics but we were considering working on this later.
> Do you have real data about this, or is this based on analysis of
> samples of compiled code?

So the kernel literally doesn't tend to *have* loops.

Sure, there are exceptions: there are the obvious loops implied in
"memcpy()" and friends, but almost all kernel activity tends to be
about individual events. A really hot loop under normal loads would be
the loop that calculates a hash value over a pathname component. Or
the loop that loops over those components.

Loop count? Think about it. Most pathnames have one or two components.
And while "8.3" is clearly too short for a filename, it's still true
that a lot of filenames (especially directories, which is what you end
up traversing many times) are in a kind of awkward 5-8 character
range.

So the kernel loads tend to have loop counts in the single digits -
*maybe* double digits. We use hash-tables a lot, so looping to find
something is usually just an entry or two, and the cost is almost
never the loop, it's the cache miss from the pointer chasing.

Most "looping" behavior in the kernel is actually user space calling
the kernel in a loop. So the kernel may see "loops" in that sense, but
it's not loopy kernel code itself.

Even really hot kernel-only code is often not about loops. You might
get a million network packets a second, and with interrupt mitigation
you would not necessarily treat them each as individual events with
its own individual interrupt (which does happen too!), but you'd still
not see it as a loop over a million packets. You'd see them come in as
a few packets at a time, and looping until you've handled them.

And the loops we have tend to not be very amenable to nice compiler
optimizations either.

The kernel almost never works with arrays, for example. The string in
a pathname component is an array, yes, but *most* kernel data
structures loop over our doubly linked lists or over a hash-chain.

And it tends to be really hard to do much loop optimization over list
traversal like that.

>> If the source code doesn't write to it, the assembly the compiler
>> generates doesn't write to it.
>
> If it would be so easy, then answering my questions would have been easy
> too, right?  Which piece of source code?  The whole kernel?

So make the rule be: "in one single function, with all you can see
there", and I'll be happy. Do as much inlining as you want (although
the kernel does have "noinline" too).

With all the normal sequence point and memory invalidation

Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 1:25 PM, Boehm, Hans  wrote:
>
> Here are some more interesting ones that illustrate the issues (all 
> declarations are non-local, unless stated otherwise):
>
> struct { char a; int b:9; int c:7; char d; } x;
>
> Is x.b = 1 allowed to overwrite x.a?  C11 says no, essentially requiring two 
> byte stores.  Gcc currently does so.  I'm not sure I understand Linus' 
> position here.

So I like the fact that the C11 stance seems very strict. But honestly
I also think it sounds like C11 is actually more strict than I would
necessarily be.

I really do think that bitfields imply "int", both historically and
technically. So I would not see the problem with treating the
bitfields as part of an 'int' and thus overwriting a (and d) when
writing to b. That's how bitfields work! They are fields of an int.

It would be good if it perhaps caused a *warning*, and gave a good way
to avoid it. For example, while I think using any other base type than
'int' is technically an extension of the C bitfield rules (but
whatever, I don't have any specs in front of me), I think a warning
together with allowing the user to rewrite it as

  struct { char a; char d; short b:9; short c:7; } x;

would make it clear that now a write to 'b' cannot validly overwrite
'a' or 'd'.
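That claim about the rewritten struct can be checked mechanically. A sketch, with the caveat that the exact layout is ABI-specific (the assertions below reflect a typical x86-64 ABI):

```c
#include <assert.h>
#include <stddef.h>

struct x { char a; char d; short b : 9; short c : 7; };

/* With char/short base types, 'a' and 'd' each get their own byte and
 * the two bitfields share one 16-bit unit, so a store to 'b' can only
 * touch that unit -- never 'a' or 'd'. */
static void check_layout(void)
{
    assert(offsetof(struct x, a) == 0);
    assert(offsetof(struct x, d) == 1);
    assert(sizeof(struct x) == 4);   /* 2 chars + one 16-bit bitfield unit */
}
```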

But forcing the compiler to do two (and sometimes three!) byte
accesses sounds excessive.

The reason I think the

   int flag:1;
   int othervariable;

overwriting of "othervariable" is so *obviously* a bug is exactly that
bitfields are historically about 'int', and no 'long' was there
anywhere, so using a 64-bit access is clearly not sane in any way,
shape or form.

I dunno.

 Linus


Re: Memory corruption due to word sharing

2012-02-01 Thread Linus Torvalds
On Wed, Feb 1, 2012 at 2:45 PM, Paul E. McKenney
 wrote:
>
> My (perhaps forlorn and naive) hope is that C++11 memory_order_relaxed
> will eventually allow ACCESS_ONCE() to be upgraded so that (for example)
> access-once increments can generate a single increment-memory instruction
> on x86.

I don't think that is a semantic issue.

gcc could do it *today* with volatile accesses. It doesn't, because
volatiles are scary and basically disables a lot of optimizations. Why
would memory ordering be substantially different just because it has a
different name?

> New architectures might eventually might define things like atomic_inc()
> in terms of C++11 atomics, but let's start with the straightforward stuff
> as and if it makes sense.

SMP-atomic or percpu atomic? Or both?

We need both variants in the kernel. If the compiler generates one of
them for us, that doesn't really much help.

  Linus


Re: Memory corruption due to word sharing

2012-02-02 Thread Linus Torvalds
On Thu, Feb 2, 2012 at 8:28 AM, Michael Matz  wrote:
>
> Sure.  Simplest example: struct s {int i:24;} __attribute__((packed)).
>
> You must access only three bytes, no matter what.  The basetype (int) is
> four bytes.

Ok, so here's a really *stupid* (but also really really simple) patch attached.

What it does is to simply change the meaning of the SLOW_BYTE_ACCESS thing.

What SLOW_BYTE_ACCESS currently means is not just that byte
accesses are slow: it means that *everything* but full-word accesses
is slow!

Which is (a) not what the name really implies, (b) not even the
historical meaning of that #define (why? the meaning of "full word"
has changed since: it used to be 32 bits, now it is 64 bits) and (c)
stupid, because that's not how hardware with slow byte accesses really
work anyway.

Finally, it's what causes gcc to do 64-bit accesses to bitfields that
aren't 64 bits in size.

So because of this, I propose gcc just change the rules for what
SLOW_BYTE_ACCESS means.

I propose to just change it to mean "accesses smaller than 32 bits are
slow". That's actually much closer to what the hardware definition
tends to be.

It doesn't fix the generic issue, it doesn't fix any C++11/C11 issues,
it doesn't really change anything, but what it *does* do is make for a
hell of a lot saner model. And it avoids bugs in practice.

NOTE! On at least some machines with SLOW_BYTE_ACCESS, accesses
smaller than a word cannot possibly be atomic anyway (well, not
without jumping through hoops), so the fact that we still extend to 32
bits and the problem of touching too much still exists with 'char' and
'short' variables that are in the same 32-bit word as a bitfield is
kind of unavoidable.

So this suggested patch doesn't really "fix" anything fundamental, but
it is (a) simple, (b) totally untested and (c) at least fixes *some*
cases.

For example, it might fix the 'sig_atomic_t' shadowing, since that is
usually 'int'. It obviously can never fix the issue with volatile
char/short, but at least it works around it for 'long'.

In other words - I'm not trying to say that this patch really "fixes"
anything (except sig_atomic_t interaction, which really does get fixed
for the normal 'int' case). But what it does do is to limit the damage
of a bad situation.

And it is small, and should hopefully be easy to accept even for
stable versions of gcc.

So can we please do something like this for maintenance releases, and
consider the much more complex C++11/C11 issues to be a separate much
bigger issue for the future?

Because the current SLOW_BYTE_ACCESS meaning really is crap. The
*only* thing that architecture define seems to do is literally the
bitfield extract semantics, and it does that horribly horribly badly
as shown by the bug on both 64-bit POWER and Sparc. For no good reason
- both of those have perfectly fine word operations afaik.

I did not look at other architectures that define SLOW_BYTE_ACCESS,
but if they have a 32-bit integer type, I'm pretty sure they support
fast loads/stores of it.

Hacky? Yes. Pretty? No. But better than the disaster that is there
now? Hell yes.

   Linus

PS. The only reason I hardcoded "32" was that I didn't know how to ask
the question "is this mode at least as wide as 'int'" in the proper gcc
way. So I'm really not suggesting you apply this patch as-is, I'm just
sending it out as a "hey, this is a pretty obvious way to work around
part of the problem, somebody who really knows what they are doing
should probably improve on it".
 gcc/stor-layout.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/gcc/stor-layout.c b/gcc/stor-layout.c
index ea0d44d64d26..1387c0e9e060 100644
--- a/gcc/stor-layout.c
+++ b/gcc/stor-layout.c
@@ -2486,7 +2486,12 @@ get_best_mode (int bitsize, int bitpos, unsigned int align,
 	  && unit <= MIN (align, BIGGEST_ALIGNMENT)
 	  && (largest_mode == VOIDmode
 		  || unit <= GET_MODE_BITSIZE (largest_mode)))
+	  {
 	    wide_mode = tmode;
+	    /* Wide enough? */
+	    if (unit >= 32)
+	      break;
+	  }
 	}
 
   if (wide_mode != VOIDmode)


Re: Memory corruption due to word sharing

2012-02-02 Thread Linus Torvalds
On Thu, Feb 2, 2012 at 10:42 AM, Paul E. McKenney
 wrote:
>>
>> SMP-atomic or percpu atomic? Or both?
>
> Only SMP-atomic.

And I assume that since the compiler does them, that would now make it
impossible for us to gather a list of all the 'lock' prefixes so that
we can undo them if it turns out that we are running on a UP machine.

When we do SMP operations, we don't just add a "lock" prefix to it. We do this:

  #define LOCK_PREFIX_HERE \
		".section .smp_locks,\"a\"\n"	\
		".balign 4\n"	\
		".long 671f - .\n" /* offset */	\
		".previous\n"	\
		"671:"

  #define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "

and I'm sure you know that, but I'm not sure the gcc people realize
the kinds of games we play to make things work better.

Sure, "everything will be SMP" some day, but running UP kernels is
likely still going to remain a good idea in virtualized environments,
for example. And having to make it a compile-time option is *not* a
good idea.

So compiler intrinsics are often *worse* than doing it by hand for us
for all these kinds of reasons. They aren't generally geared towards
the very specialized needs that a kernel has.

Of course, maybe even user space would want some kind of way to
automatically strip 'lock' prefixes from a binary, so maybe the
compiler would have some kind of support like this too.

(And no, disassembling the binary in order to find lock prefixes is
*not* the answer, at least not for the kernel)

>> We need both variants in the kernel. If the compiler generates one of
>> them for us, that doesn't really much help.
>
> I must admit that the non-x86 per-CPU atomics are, ummm, "interesting".

Most non-x86 cpu's would probably be better off treating them the same
as smp-atomics (load-locked + store-conditional), but right now we
have this insane generic infrastructure for having versions that are
irq-safe by disabling interrupts etc. Ugh. Mainly because nobody
really is willing to work on and fix up the 25 architectures that
really don't matter.

  Linus


Re: Memory corruption due to word sharing

2012-02-03 Thread Linus Torvalds
On Fri, Feb 3, 2012 at 8:38 AM, Andrew MacLeod  wrote:
>
> The atomic intrinsics were created for c++11  memory model compliance, but I
> am certainly open to enhancements that would make them more useful.   I am
> planning some enhancements for 4.8 now, and it sounds like you may have some
> suggestions...

So we have several atomics we use in the kernel, with the more common being

 - add (and subtract) and cmpxchg of both 'int' and 'long'

 - add_return (add and return new value)

 - special cases of the above:
  dec_and_test (decrement and test result for zero)
  inc_and_test (increment and test result for zero)
  add_negative (add and check if result is negative)

   The special cases are because older x86 cannot do the generic
"add_return" efficiently - it needs xadd - but can do atomic versions
that test the end result and give zero or sign information.

 - atomic_add_unless() - basically an optimized cmpxchg.

 - atomic bit array operations (bit set, clear, set-and-test,
clear-and-test). We do them on "unsigned long" exclusively, and in
fact we do them on arrays of unsigned long, ie we have the whole "bts
reg,mem" semantics. I'm not sure we really care about the atomic
versions for the arrays, so it's possible we only really care about a
single long.

   The only complication with the bit setting is that we have a
concept of "set/clear bit with memory barrier before or after the bit"
(for locking). We don't do the whole release/acquire thing, though.

 - compare_xchg_double

We also do byte/word atomic increments and decrements, but that's in
the x86 spinlock implementation, so it's not a generic need.

We also do the add version in particular as CPU-local optimizations
that do not need to be SMP-safe, but do need to be interrupt-safe. On
x86, this is just an r-m-w op, on most other architectures it ends up
being the usual load-locked/store-conditional.

I think that's pretty much it, but maybe I'm missing something.

Of course, locking itself tends to be special cases of the above with
extra memory barriers, but it's usually hidden in asm for other
reasons (the bit-op + barrier being a special case).

  Linus


Re: Memory corruption due to word sharing

2012-02-03 Thread Linus Torvalds
On Fri, Feb 3, 2012 at 11:16 AM, Andrew MacLeod  wrote:
>>    The special cases are because older x86 cannot do the generic
>> "add_return" efficiently - it needs xadd - but can do atomic versions
>> that test the end result and give zero or sign information.
>
> Since these are older x86 only, could you use add_return() always and then
> have the compiler use new peephole optimizations to detect those usage
> patterns and change the instruction sequence for x86 when required?  would
> that be acceptable?    Or maybe you don't trust the compiler :-)   Or maybe
> I can innocently ask if the performance impact on older x86 matters enough
> any more? :-)

Since xadd is often slow even when it does exist, the peepholes would
definitely be required. A "dec_and_test" does need to be

   decl %m (or "subl $1" for x86-64)
   je

rather than

   xadd %r,%m
   cmp $1,%r
   je

simply because the latter is not just larger, but often quite a bit
slower (newer CPU's seem to be better at xadd).

But with the peepholes, by the time we could depend on gcc doing the
intrinsics, we'd almost certainly be ready to let the old pre-xadd
machines just go.

We're already almost there, and by the time we can trust the compiler
to do it for us, we're almost certainly going to be at a point where
we really don't care any more about pre-xadd (but we will still care
about older machines where xadd is slow).

>>  - atomic_add_unless() - basically an optimized cmpxchg.
>
> is this the reverse of a compare_exchange and add?  Ie, add if the value
> ISN'T expected?   or some form of compare_exchange_and_add?    This might
> require a new atomic builtin.
>
> What exactly does it do?

It's basically

   do {
 load-link %r,%m
 if (r == value)
 return 0;
 add
   } while (store-conditional %r,%m)
   return 1;

and it is used to implement two *very* common (and critical)
reference-counting use cases:

 - decrement ref-count locklessly, and if it hits free, take a lock
atomically (ie "it would be a bug for anybody to ever see it decrement
to zero while holding the lock")

 - increment ref-count locklessly, but if we race with the final free,
don't touch it, because it's gone (ie "if somebody decremented it to
zero while holding the lock, we absolutely cannot increment it again")

They may sound odd, but those two operations are absolutely critical
for most lock-less refcount management. And taking locks for reference
counting is absolutely death to performance, and is often impossible
anyway (the lock may be a local lock that is *inside* the structure
that is being reference-counted, so if the refcount is zero, then you
cannot even look at the lock!)

In the above example, the load-locked -> store-conditional would
obviously be a cmpxchg loop instead on architectures that do cmpxchg
instead of ll/sc.

Of course, if you expose some intrinsic for the whole "ll/sc" model
(and you then turn it into cmpxchg on demand), we could literally
open-code it.

That is likely the most *flexible* approach for a compiler. I think
pretty much everything the kernel needs (except for cmpxchg_double)
can be very naturally written as a "ll/sc" sequence, and if the
compiler then just does the right thing with peephole optimizations,
that's fine.

IOW, we don't really need "atomic_add()" or anything like that. If we can do

  do {
 val = __load_linked(mem);
 val++;
  } while (__store_conditional(val, mem));

and the compiler just automagically turns that into "lock add" on x86,
that's perfectly fine.

It might not be too hard, because you really don't need to recognize
very many patterns, and any pattern you don't recognize could be
turned into a cmpxchg loop.

NOTE NOTE NOTE! The "turned into a cmpxchg loop" is not the true
correct translation of load-linked/store-conditional, since it allows
the memory to be modified as long as it's modified *back* before the
store-conditional, and that actually matters for a few algorithms. But
if you document the fact that it's not a "true" ll/sc (and maybe have
some compile-time way to detect when it isn't), it would be very
flexible.

Of course, the syntax could be something completely different. Maybe
you'd want to do it as

   __builtin_ll_sc(&var, update-expression, return-expression,
failure-expression)

rather than an explicit loop.

But it doesn't sound like the internal gcc model is based on some
generic ll/sc model.

>>  - atomic bit array operations (bit set, clear, set-and-test,
>> clear-and-test). We do them on "unsigned long" exclusively, and in
>> fact we do them on arrays of unsigned long, ie we have the whole "bts
>> reg,mem" semantics. I'm not sure we really care about the atomic
>> versions for the arrays, so it's possible we only really care about a
>> single long.
>>
>>    The only complication with the bit setting is that we have a
>> concept of "set/clear bit with memory barrier before or after the bit"
>> (for locking). We don't do the whole release/acquire thing, though.
>
> Are

Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread Linus Torvalds


On Thu, 19 Nov 2009, Thomas Gleixner wrote:
> 
> standard function start:
> 
>	push   %ebp
>	mov    %esp, %ebp
>
>	call   mcount
> 
> modified function start on a handful of functions only seen with gcc
> 4.4.x on x86 32 bit:
> 
>	push   %edi
>	lea    0x8(%esp),%edi
>	and    $0xfffffff0,%esp
>	pushl  -0x4(%edi)
>	push   %ebp
>	mov    %esp,%ebp
>	...
>	call   mcount

That's some crazy sh*t anyway, since we don't _want_ the stack to be 
16-byte aligned in the kernel. We do

KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=2)

why is that not working?

So this looks like a gcc bug, plain and simple.

> This modification leads to a hard to solve problem in the kernel
> function graph tracer which assumes that the stack looks like:
> 
>return address
>saved  ebp

Umm. But it still does, doesn't it? That

pushl  -0x4(%edi)
push   %ebp

should do it - the "-0x4(%edi)" thing seems to be trying to reload the 
return address. No? 

Maybe I misread the code - but regardless, it does look like a gcc code 
generation bug if only because we really don't want a 16-byte aligned 
stack anyway, and have asked for it to not be done.

So I agree that gcc shouldn't do that crazy prologue (and certainly _not_ 
before calling mcount anyway), but I'm not sure I agree with that detail 
of your analysis or explanation.

Linus


Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread Linus Torvalds


On Thu, 19 Nov 2009, Richard Guenther wrote:
> 
> Note that I only can reproduce the issue with
> -mincoming-stack-boundary=2, not with -mpreferred-stack-boundary=2.

Since you can reproduce it with -mincoming-stack-boundary=2, I would 
suggest just fixing mcount handling that way regardless of anything else. 
The current code generated by gcc is just insane - even for the case where 
you _want_ 16-byte stack alignment.

Instead crazy code like

>	push   %edi
>	lea    0x8(%esp),%edi
>	and    $0xfffffff0,%esp
>	pushl  -0x4(%edi)
>	push   %ebp
>	mov    %esp,%ebp
>	...
>	call   mcount

the sane thing to do would be to just do it as

	push   %ebp
	mov    %esp,%ebp
	call   mcount
	and    $0xfffffff0,%esp

since

 - no sane 'mcount' implementation can ever care about 16-byte stack 
   alignment anyway, so aligning the stack before mcount is crazy.

 - mcount is special anyway, and the only thing that cares about that 
   whole ebp/return address thing is mcount, and _all_ your games with 
   %edi are about that mcount thing.

IOW, once you as a compiler person understand that the 'mcount' call is 
special, you should have realized that all the work you did for it was 
totally pointless and stupid to begin with. 

You must already have that special mcount logic (the whole code to save a 
register early and push the fake mcount stack frame), so instead of _that_ 
special logic, change it to a different mcount special logic that 
associates the 'mcount' call with the frame pointer pushing. 

That will not only make the Linux kernel tracer happy, it will make all 
your _other_ users happier too, since you can generate smaller and more 
efficient code.

Admittedly, anybody who compiles with -pg probably doesn't care deeply 
about smaller and more efficient code, since the mcount call overhead 
tends to make the thing moot anyway, but it really looks like a win-win 
situation to just fix the mcount call sequence regardless.

> And you didn't provide us with a testcase either ... so please open a 
> bugzilla and attach preprocessed source of a file that shows the 
> problem, note the function it happens in and provide the command-line 
> options you used for building.
> 
> Otherwise it's going to be all speculation on our side.

See above - all you need to do is to just fix mcount calling.

Now, there is a separate bug that shows that you seem to over-align the 
stack when not asked for, and yes, since we noticed that I hope that 
Thomas and friends will fix that, but I think your mcount logic could (and 
should) be fixed as an independent sillyness.

Linus


Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread Linus Torvalds


On Thu, 19 Nov 2009, Andrew Haley wrote:
> 
> I've got all that off-list.  I found the cause, and replied in another
> email.  It's not a bug.

Oh Gods, are we back to gcc people saying "sure, we do stupid things, but 
it's allowed, so we don't consider it a bug because it doesn't matter that 
real people care about real life, we only care about some paper, and real 
life doesn't matter, if it's 'undefined' we can make our idiotic choices 
regardless of what people need, and regardless of whether it actually 
generates better code or not".

Linus


Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread Linus Torvalds


On Thu, 19 Nov 2009, Linus Torvalds wrote:
> 
> Oh Gods, are we back to gcc people saying "sure, we do stupid things, but 
> it's allowed, so we don't consider it a bug because it doesn't matter that 
> real people care about real life, we only care about some paper, and real 
> life doesn't matter, if it's 'undefined' we can make our idiotic choices 
> regardless of what people need, and regardless of whether it actually 
> generates better code or not".

Put another way: the stack alignment itself may not be a bug, but gcc 
generating God-awful code for the mcount handling that results in problems 
in real life sure as hell is *stupid* enough to be called a bug.

I bet other people than just the kernel use the mcount hook for subtler 
things than just doing profiles. And even if they don't, the quoted code 
generation is just crazy _crap_.

Linus


Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread Linus Torvalds


On Thu, 19 Nov 2009, Linus Torvalds wrote:
> 
> I bet other people than just the kernel use the mcount hook for subtler 
> things than just doing profiles. And even if they don't, the quoted code 
> generation is just crazy _crap_.

For the kernel, if the only case is that timer_stat.c thing that Thomas 
pointed at, I guess we can at least work around it with something like the 
appended. The kernel code is certainly ugly too, no question about that. 

It's just that we'd like to be able to depend on mcount code generation 
not being insane even in the presence of ugly code..

The alternative would be to have some warning when this happens, so that 
we can at least see it. "mcount won't work reliably"

Linus

---
 kernel/time/timer_stats.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
index ee5681f..488c7b8 100644
--- a/kernel/time/timer_stats.c
+++ b/kernel/time/timer_stats.c
@@ -76,7 +76,7 @@ struct entry {
 */
 	char			comm[TASK_COMM_LEN + 1];
 
-} ____cacheline_aligned_in_smp;
+};
 
 /*
  * Spinlock protecting the tables - not taken during lookup:
@@ -114,7 +114,7 @@ static ktime_t time_start, time_stop;
 #define MAX_ENTRIES	(1UL << MAX_ENTRIES_BITS)
 
 static unsigned long nr_entries;
-static struct entry entries[MAX_ENTRIES];
+static struct entry entries[MAX_ENTRIES] ____cacheline_aligned_in_smp;
 
 static atomic_t overflow_count;
 


Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread Linus Torvalds


On Thu, 19 Nov 2009, H. Peter Anvin wrote:
> 
> Calling the profiler immediately at the entry point is clearly the more
> sane option.  It means the ABI is well-defined, stable, and independent
> of what the actual function contents are.  It means that ABI isn't the
> normal C ABI (the __fentry__ function would have to preserve all
> registers), but that's fine...

As far as I know, that's true of _mcount already: it's not a normal ABI 
and is rather a highly architecture-specific special case to begin with. 
At least ARM has some (several?) special mcount calling conventions, 
afaik.

(And then ARM people use __attribute__((naked)) and insert the code by 
inline asm, or something. That seems to be "standard" in the embedded 
world, where they often do even stranger things than we do in the 
kernel. At least our low-level system call and interrupt handlers are 
written as assembly language - the embedded world seems to commonly 
write them as C functions with magic attributes and inline asm).

Linus


Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread Linus Torvalds


On Thu, 19 Nov 2009, Frederic Weisbecker wrote:
>
> > That way the lr would have the current function, and the parent would
> > still be at 8(%sp)
> 
> Yeah right, we need at least such very tiny prologue for
> archs that store return addresses in a reg.

Well, it will be architecture-dependent.

For example, alpha can store the return address in _any_ register if I 
recall correctly, so you can do the "call to __fentry__" by just picking 
another register than the default one as the return address.

And powerpc has two special registers: link and ctr, but iirc you can only 
load 'link' with a branch instruction. Which means that you could do 
something like 

mflr 0
bl __fentry__

in the caller (I forget if R0 is actually a call-trashed register or not), 
and then __fentry__ could do something like

mflr 12 # save _new_ link
mtlr 0  # restore original link
mtctr 12# move __fentry__ link to ctr
.. do whatever ..
bctr# return to __fentry__ caller

to return with 'link' restored (but ctr and r0/r12 trashed - I don't 
recall the ppc calling conventions any more, but I think that is ok).

Saving to stack seems unnecessary and pointless.

Linus


Re: BUG: GCC-4.4.x changes the function frame on some functions

2009-11-19 Thread Linus Torvalds


On Fri, 20 Nov 2009, Thomas Gleixner wrote:
> 
> While testing various kernel configs we found out that the problem
> comes and goes. Finally I started to compare the gcc command line
> options and after some fiddling it turned out that the following
> minimal deltas change the code generator behaviour:
> 
> Bad:  -march=pentium-mmx                  -Wa,-mtune=generic32
> Good: -march=i686        -mtune=generic   -Wa,-mtune=generic32
> Good: -march=pentium-mmx -mtune=generic   -Wa,-mtune=generic32
> 
> I'm not supposed to understand the logic behind that, right ?

Are you sure it's just the compiler flags?

There's another configuration portion: the size of the alignment itself. 
That's dependent on L1_CACHE_SHIFT, which in turn is taken from the kernel 
config CONFIG_X86_L1_CACHE_SHIFT.

Maybe that value matters too - for example maybe gcc will not try to align 
the stack if it's big?

[ Btw, looking at that, why are X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT 
  totally unrelated numbers? Very confusing. ]

The compiler flags we use are tied to some of the same choices that choose 
the cache shift, so the correlation you found while debugging this would 
still hold.

Linus



Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?

2010-11-14 Thread Linus Torvalds
On Sun, Nov 14, 2010 at 4:52 PM, James Cloos  wrote:
> Gcc 4.5.1 running on an amd64 box "cross"-compiling for a P3 i8k fails
> to compile the module since commit 6b4e81db2552bad04100e7d5ddeed7e848f53b48
> with:
>
>  CC      drivers/char/i8k.o
> drivers/char/i8k.c: In function ‘i8k_smm’:
> drivers/char/i8k.c:149:2: error: can't find a register in class 
> ‘GENERAL_REGS’ while reloading ‘asm’
> drivers/char/i8k.c:149:2: error: ‘asm’ operand has impossible constraints

At this point, I think this falls clearly under "unresolvable gcc bug".

Quite frankly, I think gcc was buggy to begin with: since we had a
memory clobber, the "+m" (*regs) should not have mattered. The fact
that "*regs" may be some local variable doesn't make any difference
what-so-ever, since we took the address of the variable. So the memory
clobber _clearly_ can change that variable.

So when Richard Guenther says "a memory clobber doesn't cover automatic
storage", to me that very clearly spells "gcc is buggy as hell".
Because automatic storage with its address taken _very_ much gets
clobbered by things like memset etc. If the compiler doesn't
understand that, the compiler is just broken.

And now, if even the (superfluous) "+m" isn't working, it sounds like
we have no sane options left. Except to say that gcc-4.5.1 is totally
broken wrt asms.

Can we just get gcc to realize that when you pass the address of
automatic storage to an asm, that means that "memory" really does
clobber it? Because clearly that is the case.

Linus


Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?

2010-11-15 Thread Linus Torvalds
On Mon, Nov 15, 2010 at 3:16 AM, Jakub Jelinek  wrote:
>
> I don't see any problems on the assembly level.  i8k_smm is
> not inlined in this case and checks all 3 conditions.

If it really is related to gcc not understanding that "*regs" has
changed due to the memory being an automatic variable, and passing in
"regs" itself as a pointer to that automatic variable together with
the "memory" clobber not being sufficient, then I think it's the lack
of inlining that will automatically hide the bug.

(Side note: and I think this does show how much of a gcc bug it is not
to consider "memory" together with passing in a pointer to an asm to
always be a clobber).

Because if it isn't inlined, then "regs" will be seen as a real pointer
to some external memory (the caller) rather than being optimized to
just be the auto structure on the stack. Because *regs is auto only
within the context of the caller.

Which actually points to a possibly simpler fix:
 - remove the "+m" since it adds too much register pressure
 - mark the i8k_smm() as "noinline" instead.

Quite frankly, I'd hate to add even more crud to that inline asm (to
save/restore the registers manually). It's already not the prettiest
thing around.

So does the attached patch work for everybody?

 Linus
 drivers/char/i8k.c |6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/char/i8k.c b/drivers/char/i8k.c
index f0863be..101011e 100644
--- a/drivers/char/i8k.c
+++ b/drivers/char/i8k.c
@@ -114,7 +114,7 @@ static inline const char *i8k_get_dmi_data(int field)
 /*
  * Call the System Management Mode BIOS. Code provided by Jonathan Buzzard.
  */
-static int i8k_smm(struct smm_regs *regs)
+static noinline int i8k_smm(struct smm_regs *regs)
 {
 	int rc;
 	int eax = regs->eax;
@@ -142,7 +142,7 @@ static int i8k_smm(struct smm_regs *regs)
 		"lahf\n\t"
 		"shrl $8,%%eax\n\t"
 		"andl $1,%%eax\n"
-		:"=a"(rc), "+m" (*regs)
+		:"=a"(rc)
 		:"a"(regs)
 		:"%ebx", "%ecx", "%edx", "%esi", "%edi", "memory");
 #else
@@ -168,7 +168,7 @@ static int i8k_smm(struct smm_regs *regs)
 	"lahf\n\t"
 	"shrl $8,%%eax\n\t"
 	"andl $1,%%eax\n"
-	:"=a"(rc), "+m" (*regs)
+	:"=a"(rc)
 	:"a"(regs)
 	:"%ebx", "%ecx", "%edx", "%esi", "%edi", "memory");
 #endif


Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?

2010-11-15 Thread Linus Torvalds
On Mon, Nov 15, 2010 at 9:40 AM, Jim Bos  wrote:
>
> Hmm, that doesn't work.
>
> [ Not sure if you read to whole thread but initial workaround was to
> change the asm(..) to asm volatile(..) which did work. ]

Since I have a different gcc than yours (and I'm not going to compile
my own), have you posted your broken .s file anywhere? In fact, with
the noinline (and the removal of the "+m" thing - iow just the patch
you tried), what does just the "i8k_smm" function assembly look like
for you after you've done a "make drivers/char/i8k.s"?

If the asm just doesn't exist AT ALL, that's just odd. Because every
single call-site of i8k_smm() clearly looks at the return value. So
the volatile really shouldn't make any difference from that
standpoint. Odd.

   Linus


Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?

2010-11-15 Thread Linus Torvalds
On Mon, Nov 15, 2010 at 10:30 AM, Jim Bos  wrote:
>
> Attached version with plain 2.6.36 source and version with the committed
> patch, i.e with the '"+m" (*regs)'

Looks 100% identical in i8k_smm() itself, and I'm not seeing anything
bad. The asm has certainly not been optimized away as implied in the
archlinux thread.

There are differences, but they are with code generation *elsewhere*.

To me it is starting to look like the real problem is that gcc has
decided that the "i8k_smm()" function is "__attribute__((const))".

Which is clearly totally bogus. If a function has an inline asm that
has a memory clobber, it is clearly *not* 'const'. But that does
explain the bug, and does explain why "+m" makes a difference and why
"noinline" does not.

So what I _think_ happens is that

 - gcc logic for the automatic 'const' attribute for functions is
broken, so it marks that function 'const'.

 - since the rule for a const function is that it only _looks_ at its
arguments and has no side effects, now the callers will decide that
'i8k_smm()' cannot change the passed-in structure, so they'll happily
optimize away all the accesses to it.

   Linus


Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?

2010-11-15 Thread Linus Torvalds
On Mon, Nov 15, 2010 at 10:45 AM, Jeff Law  wrote:
>
> A memory clobber should clobber anything in memory, including autos in
> memory; if it doesn't, then that seems like a major problem.  I'd like to
> see the rationale behind not clobbering autos in memory.

Yes. It turns out that the "asm optimized away" was entirely wrong (we
never saw that, it was just a report on another mailing list).

Looking at the asm posted, it seems to me that gcc actually compiles
the asm itself 100% correctly, and the "memory" clobber is working
fine inside that function. So the code generated for i8k_smm() itself
is all good.

But _while_ generating the good code, gcc doesn't seem to realize that
it writes to anything, so it decides to mark the function
"__attribute__((const))", which is obviously wrong (a memory clobber
definitely implies that it's not const). And as a result, the callers
will be mis-optimized, because they do things like

static int i8k_get_bios_version(void)
{
struct smm_regs regs = { .eax = I8K_SMM_BIOS_VERSION, };

return i8k_smm(&regs) ? : regs.eax;
}

and since gcc has (incorrectly) decided that "i8k_smm()" is a const
function, it thinks that "regs.eax" hasn't changed, so it doesn't
bother to reload it: it "knows" that it is still I8K_SMM_BIOS_VERSION
that it initialized it with. So it will basically have rewritten that
final return statement as

return i8k_smm(&regs) ? : I8K_SMM_BIOS_VERSION;

which obviously doesn't really work.

This also explains why adding "volatile" worked. The "asm volatile"
triggered "this is not a const function".

Similarly, the "+m" works, because it also makes clear that the asm is
writing to memory, and isn't a const function.

Now, the "memory" clobber should clearly also have done that, but I'd
be willing to bet that some version of gcc (possibly extra slackware
patches) had forgotten the trivial logic to say "a memory clobber also
makes the user function non-const".

 Linus


Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?

2010-11-15 Thread Linus Torvalds
On Mon, Nov 15, 2010 at 11:12 AM, Jakub Jelinek  wrote:
>
> Ah, the problem is that memory_identifier_string is only initialized in
> ipa-reference.c's initialization, so it can be (and is in this case) NULL in
> ipa-pure-const.c.

Ok. And I guess you can verify that all versions of gcc do this
correctly for "asm volatile"?

Because since we'll have to work around this problem in the kernel, I
suspect the simplest solution is to remove the "+m" that causes
register pressure problems, and then use "asm volatile" to work around
the const-function bug.

And add a large comment about why "asm volatile" is probably always a
good idea when you have a memory clobber and don't have any other
visible memory modifications.

I do wonder if this explains some of the problems we had with the
bitop asms too.

Hmm?

   Linus

