Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 10:45 AM, Jeff Law wrote:
>
> Torvald Riegel & I were told that was kernel policy when we brought up the
> upcoming bitfield semantic changes with some of the linux kernel folks last
> year.

Btw, one reason this is true is that the bitfield ordering/packing is so unspecified that using bitfields sometimes is just way more pain than it's worth. Some people have tried to use bitfields for various IO data structures (think laying out bits in memory to match some particular layout of some disk controller result structure). It's always a total disaster, because you have another whole level of architecture-specific bit ordering (in *addition* to all the normal byte order issues).

That's not a compiler issue, that's just the nature of the beast. It's just another reason why the kernel often ends up doing bit masking by hand.

But *all* that said, we do have a metric buttload of bitfields in the kernel. It's not some really unusual feature. It does get used a lot, despite all the reasons why some particular code might not use bitfields. We have a lot of code, and there are still a lot of situations where bitfields are just really convenient.

Linus
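[Editor's note: the "bit masking by hand" Linus mentions looks roughly like the sketch below. The register layout, field names, and mask values are hypothetical, not from any real controller; the point is that shifts and masks pin down the bit positions exactly, where bitfield ordering would be implementation-defined.]

```c
#include <stdint.h>

/* Hypothetical disk-controller status word. The layout is fixed by the
 * hardware spec, so fields are extracted with explicit shifts and masks
 * rather than C bitfields, whose bit order varies per architecture/ABI. */
#define STATUS_READY_SHIFT   0
#define STATUS_READY_MASK    0x1u   /* bit 0: device ready        */
#define STATUS_ERRCODE_SHIFT 1
#define STATUS_ERRCODE_MASK  0x7fu  /* bits 1-7: error code       */

static inline unsigned status_ready(uint32_t status)
{
    return (status >> STATUS_READY_SHIFT) & STATUS_READY_MASK;
}

static inline unsigned status_errcode(uint32_t status)
{
    return (status >> STATUS_ERRCODE_SHIFT) & STATUS_ERRCODE_MASK;
}
```

The same source compiles to the same bit positions on every architecture, which is exactly what a bitfield struct cannot promise.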
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 9:42 AM, Torvald Riegel wrote:
>
> We need a proper memory model.

Not really. The fact is, the kernel will happily take the memory model of the underlying hardware. Trying to impose some compiler description of the memory model is actually horribly bad, because it automatically also involves compiler synchronization points - which will try to be portable and easy to understand, and quite frankly, that will automatically result in what is technically known as a "shitload of crap".

Now, a strict memory model is fine for portability, and to make it simple for programmers. I actually largely approve of the Java memory model approach, even if it has been horribly problematic and has some serious problems. But it's not something that is acceptable for the kernel - we absolutely do *not* want the compiler to introduce some memory model, because we are very much working together with whatever the hardware memory model is, and we do our own serialization primitives.

> No vague assumptions with lots of hand-waving.

So here's basically what the kernel needs:

 - if we don't touch a field, the compiler doesn't touch it. This is the rule that gcc now violates with bitfields.

   This is a gcc bug. End of story. The "volatile" example proves it - anybody who argues otherwise is simply wrong, and is just trying to make excuses.

 - if we need to touch something only once, or care about something that is touched only conditionally, and we actually need to make sure that it is touched *only* under that condition, we already go to quite extreme lengths to tell the compiler so. We usually use an access with a "volatile" cast on it or tricks like that. Or we do the whole "magic asm that says it modified memory to let the compiler know not to try anything clever".

 - The kernel IS NOT GOING TO MARK DATA STRUCTURES.

   Marking data structures is seriously stupid and wrong. It's not the *data* that has access rules, it's the *code* that has rules of access.
The traditional C "volatile" is misdesigned and wrong. We don't generally mark data volatile, we really mark *code* volatile - which is why our "volatiles" are in the casts, not on the data structures.

Stuff that is "volatile" in one context is not volatile in another. If you hold a RCU write lock, it may well be entirely stable, and marking it volatile is *wrong*, and generating code as if it was volatile is pure and utter shit. On the other hand, if you are touching *the*very*same* field while you are only read-locked for RCU, it may well be one of those "this has to be read by accessing it exactly once" cases.

And we do all this correctly in the kernel. Modulo bugs, of course, but the fundamental rule really is: "atomicity or volatility is about CODE, not DATA".

And no, C11 does *not* do it correctly. The whole "_Atomic" crap is exactly the same mistake as "volatile" was. It's wrong. Marking data _Atomic is a sure sign that whoever designed it didn't understand things.

> The only candidate that I see is the C++11/C11 model. Any other
> suggestions?

The C11 model addresses the wrong thing, and addresses it badly. You might as well ignore it as far as the kernel is concerned. I'm sure it helps some *other* problem, but it's not the kernel problem.

The rules really are very simple, and the kernel already does everything to make it easy for the compiler. When we do something that the compiler cannot re-order around, we do an asm() and mark it as modifying memory so that the compiler won't screw things up. In addition, we will do whatever the CPU requires for memory ordering, and quite frankly, the compiler will never have sufficient locking primitives to satisfy us, and there is no real point in even trying. If you try to do locking in the compiler, you *will* do it wrong. If you add random flags on data structures ("_Atomic" or "volatile" or whatever), you *will* do it wrong. It's simply a fundamentally broken model.
So the *only* memory model we want from the compiler really is: "don't write to fields that the source code didn't write to".

It really is that simple.

> So, would it be okay to tell the compiler which part of the state is
> accessed concurrently (ie, locks, atomics, ...)?

Seriously, no. See above: it's not the "state" that is accessed concurrently. It's the code. If you ever try to mark state, you've already lost. The same "state" can be atomic or not depending on context.

It's not about the state or the data structures, and it never will be.

Linus
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 11:40 AM, Jakub Jelinek wrote:
>
> Well, the C++11/C11 model doesn't allow to use the underlying type
> for accesses, consider e.g.
>
> struct S { long s1; unsigned int s2 : 5; unsigned int s3 : 19;
>            unsigned char s4; unsigned int s5; };
> struct T { long s1 : 16; unsigned int s2; };
>
> on e.g. x86_64-linux, S is 16 bytes long, field s4 is packed together into
> the same 32 bits as s2 and s3. While the memory model allows s2 to be
> changed when storing s3 (i.e. use a RMW cycle on it), it doesn't allow s4
> to be changed, as it isn't a bitfield (you'd need s4 : 8 for that).
> T is 8 bytes long; again, s2 is packed into the same 64 bits as s1,
> and the memory model doesn't allow s2 to be modified.
>
> Not sure what the kernel would expect in such a case.

So the kernel really doesn't care what you do to things *within* the bitfield. Sure, we do have some expectations of packing, but basically we never expect bitfields to be 'atomic'. You really don't need to worry about that.

From a correctness standpoint, I really don't think the kernel cares if gcc does bitfield reads and writes one bit at a time. Seriously.

The bug is that the bitfield write wrote *outside* the bitfield. That's *WRONG*. It's completely wrong, even if you write back the same value you read. It's *always* wrong. It's simply not acceptable.

If gcc does that kind of crap, then gcc needs to align bitfields sufficiently that the accesses that are outside the bitfields never touch any other fields.

So what I would suggest:

 - use the *smallest* aligned access you can use to access the bitfield. This will be 'correct', but it may perform badly.

   This is apparently what x86 does. So on x86, if you have a one-bit bitfield within a bitfield that is really 32 bits wide, it looks like gcc will generally generate a byte load/store to set that bit. I don't know exactly what the rules are, but this is fine from a *correctness* point of view.
 - However, while using the *smallest* possible access may generate correct code, it often generates really *crappy* code. Which is exactly the bug that I reported in

      http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

So quite frankly, my suggestion is to just try to use the base type.

But *whatever* you do, using a type *larger* than the whole bitfield itself is a bug.

And dammit, the people who continue to bring up C11 as if this was a new issue, and only needs to be worried about if you support C11 (ie gcc-4.7 and up), are just in denial. The 'volatile' example makes it entirely clear that the bug has nothing what-so-ever to do with C11.

Linus
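[Editor's note: the conforming narrow read-modify-write being argued for can be sketched in plain C. The `memcpy` stands in for the load/store the compiler would emit; the bit-0 position of the bitfield is an ABI assumption (it holds on x86-64 with gcc), so treat this as illustrative, not portable.]

```c
#include <stdint.h>
#include <string.h>

struct s {
    unsigned int bf : 1;  /* the bitfield being written             */
    int other;            /* must never be touched by that write    */
};

/* Sketch of a correct RMW: read, modify, and write back only the 32-bit
 * unit that allocates the bitfield. A buggy compiler doing a 64-bit RMW
 * here would reload and rewrite 'other' as a side effect, silently
 * losing any concurrent update to it. */
static void set_bf(struct s *p)
{
    uint32_t unit;
    memcpy(&unit, p, sizeof(unit)); /* the first 4 bytes hold 'bf'   */
    unit |= 1u;                     /* assumes 'bf' is bit 0 (x86-64) */
    memcpy(p, &unit, sizeof(unit));
}
```

The whole complaint in this thread is about compilers that effectively do the 8-byte version of this instead of the 4-byte (or 1-byte) one.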
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 12:01 PM, Linus Torvalds wrote:
>
> - However, while using the *smallest* possible access may generate
> correct code, it often generates really *crappy* code. Which is
> exactly the bug that I reported in
>
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=48696

Btw, as I also pointed out in that (old) bugzilla entry, gcc can do pretty much whatever it damn well pleases with loads from a bitfield. You can narrow them, you can make them wider, you can paint them pink. Widening the loads can still be a performance problem, and technically it's a really bad idea for any nearby "volatile"s too, but in practice things will *work*.

Writes are much more critical. If you overwrite a field that is not in the bitfield, you can no longer depend on "nobody cares about bitfield accesses", because by definition you are clearly now stepping on non-bitfield things.

You do realize that even in user space, and even before C11, people have things like "sig_atomic_t" etc. So even if you don't like the notion of "volatile", here's *another* example of this being a real gcc bug:

	struct mydata {
		sig_atomic_t seen_signal;
		unsigned flags:1;
	};

and now do the same test-program, realizing that "sig_atomic_t" is normally the exact same thing as "int".

Now, think about what happens if you have a signal handler that comes in and does

	mydata.seen_signal = 1;

and happens to interrupt the code that changes "mydata.flags" in just the right spot. That's right: the bitfield update will possibly *clear* that "signal-safe" flag again, and gcc just created buggy asm code from correct C code.

Guys, if you don't admit that this is a bug, I don't know what to say.

IT IS A GCC BUG.

Don't try to make it into something bigger and related to C++11/C11. Don't try to talk about "memory models". Just admit that it is a bug to do a 64-bit write to a 32-bit bitfield, and fix it!

Linus
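[Editor's note: the reason sig_atomic_t is a fair pre-C11 example is the classic async-signal flag pattern, sketched below in user space. The handler performs exactly one store to a sig_atomic_t, and nothing else is allowed to clobber that object's bytes - which is precisely the guarantee a wide bitfield RMW on a neighboring field would violate.]

```c
#include <signal.h>

/* The classic pre-C11 async-signal flag: a handler may write a
 * 'volatile sig_atomic_t' and the main program will observe it.
 * This only works if no other store in the program ever touches
 * the flag's bytes as a side effect. */
static volatile sig_atomic_t seen_signal;

static void handler(int sig)
{
    (void)sig;
    seen_signal = 1;   /* the one store the handler is allowed */
}

static int deliver(void)
{
    signal(SIGUSR1, handler);
    raise(SIGUSR1);    /* runs the handler synchronously (POSIX) */
    return seen_signal;
}
```

A 64-bit RMW of `flags` in the struct Linus shows would read `seen_signal` before the handler ran and write the stale 0 back after it, erasing the signal.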
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 12:16 PM, Jakub Jelinek wrote:
>>
>> So the kernel really doesn't care what you do to things *within* the
>> bitfield.
>
> But what is *within* the bitfield? Do you consider s4 or t2 fields
> (non-bitfield fields that just the ABI wants to pack together with
> the bitfield) above as *within*, or is already modifying those a problem?

I seriously think you should just do the obvious thing, and LOOK AT THE BASE TYPE. That solves everything.

If the definition of the data is (assume "long" to be 64 bits)

	long mybits:32;
	int nonbitfield;

and you end up doing a 64-bit access when you read/write the "mybits" bitfield, and it overlaps "nonbitfield", then I would also just say "hey, whatever, the user asked for it". He really did. The "allocation" for the bitfield comes from the base type, and so you basically have a "long" to play with.

Let's take an even more extreme example:

	int bits1:8;
	char c;
	int bits2:16;

where the "char c" is actually *embedded* in the "bitfield allocation". But again, it's completely and totally unambiguous what accesses you would at least start out with: sure, you *could* do an 8-bit access (and you probably should, if the ISA allows it), but dangit, the base type is "int", so I don't think it's wrong to do a 32-bit load, mask, and a 32-bit write.

But if the user wrote

	char bits1:8;
	char c;
	short bits2:16;

then I really think it would be a *bug* if modifying "bits1" could possibly write to 'c'.

Of course, this is again a possible ISA problem: if the architecture doesn't have byte writes, you can't do anything about it, but that is obviously true regardless of bitfields, so that's really a totally irrelevant argument for this case. That non-atomicity doesn't come from bitfields, it comes from the fact that the BASE TYPE is again non-atomic.

Notice how it always ends up being about the base type. Simple, straightforward, and unambiguous.
And similarly, to go to the kernel example, if you have

	int mybits:2;
	int nonbitfield;

then I think it's an obvious *BUG* to access the nonbitfield things when you access "mybits". It clearly is not at all "within" the bitfield allocation, and "int" isn't non-atomic on any sane architecture, so you don't even have the "byte accesses aren't atomic" issue.

So I really do think that the sane approach is to look at the base type of the bitfield, like I suggested from the very beginning. If the base type is an "int", you do an int access.

It really solves so many problems, and it is even entirely sane semantics that you can *explain* to people. It makes sense, it avoids the clear bugs gcc has now, and it also solves the performance bug I reported a long time ago.

Seriously - is there any real argument *against* just using the base type as a hint for access size?

I realize that it may not be simple, and may not fit the current mess of gcc bitfields, but I really do think it's the RightThing(tm) to do. It has sensible semantics, and avoids problems. In contrast, the *current* gcc semantics are clearly not sensible, and have both performance and correctness issues.

Linus
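[Editor's note: expressed as code, the "base type defines the access" rule means a bitfield store compiles down to a masked read-modify-write of exactly one base-type word. A generic helper (not kernel code; assumes `width` < 32) makes the shape of that RMW explicit:]

```c
/* Masked read-modify-write confined to one 'unsigned int' word: this is
 * the entire memory access a store to an int-based bitfield should
 * generate under the base-type rule - one word read, one word written,
 * nothing outside it touched. */
static unsigned int set_field(unsigned int word, unsigned int shift,
                              unsigned int width, unsigned int value)
{
    unsigned int mask = ((1u << width) - 1u) << shift;
    return (word & ~mask) | ((value << shift) & mask);
}
```

Under Linus's proposal, a `char`-based bitfield would use the same pattern on an `unsigned char`, and a `long`-based one on an `unsigned long` - the base type bounds the blast radius.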
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 12:41 PM, Torvald Riegel wrote:
>
> You do rely on the compiler to do common transformations I suppose:
> hoist loads out of loops, CSE, etc. How do you expect the compiler to
> know whether they are allowed for a particular piece of code or not?

We have barriers. Compiler barriers are cheap. They look like this:

include/linux/compiler-gcc.h:
	#define barrier() __asm__ __volatile__("": : :"memory")

and if we don't allow hoisting, we'll often use something like that.

It's not the only thing we do. We have cases where it's not that you can't hoist things outside of loops, it's that you have to read things exactly *once*, and then use that particular value (ie the compiler can't be allowed to reload the value because it may change). So any *particular* value we read is valid, but we can't allow the compiler to do anything but a single access.

So we have this beauty:

include/linux/compiler.h:
	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

which does that for us (quite often, it's not used as-is: a more common case is using the "rcu_access_pointer()" inline function that not only does that ACCESS_ONCE() but also has the capability to enable run-time dynamic verification that you are in a proper RCU read-locked section etc).

We also have tons of (very subtle) code that actually creates locks and memory ordering etc. They usually just use the big "barrier()" approach to telling the compiler not to combine or move memory accesses around it.

And I realize that compiler people tend to think that loop hoisting etc is absolutely critical for performance, and some big hammer like "barrier()" makes a compiler person wince. You think it results in horrible code generation problems.

It really doesn't. Loops are fairly unusual in the kernel to begin with, and the compiler barriers are a total non-issue.
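[Editor's note: a typical use of the two primitives just quoted, sketched as user-space GNU C. The macro bodies are copied from the kernel definitions above; the wait loop itself is an illustrative example, not kernel code.]

```c
/* Kernel-style compiler barrier and once-only access, per the
 * definitions quoted above (GNU C extensions). */
#define barrier() __asm__ __volatile__("" : : : "memory")
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int flag;

/* Spin until some other context sets 'flag'. ACCESS_ONCE() forces
 * exactly one load per iteration; without it (or the barrier()), the
 * compiler could legally hoist the load out of the loop and spin
 * forever on a cached value. */
static int wait_for_flag(void)
{
    while (!ACCESS_ONCE(flag))
        barrier();
    return ACCESS_ONCE(flag);
}
```

Note that neither macro emits any instructions: both are pure compiler fences, which is why Linus calls them cheap.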
We have much more problems with the actual CPU barriers that can be *very* expensive on some architectures, and we work a lot at avoiding those and avoiding cacheline ping-pong issues etc.

>>> No vague assumptions with lots of hand-waving.
>>
>> So here's basically what the kernel needs:
>>
>> - if we don't touch a field, the compiler doesn't touch it.
>
> "we don't touch a field": what does that mean precisely? Never touch it
> in the same thread? Same function? Same basic block? Between two
> sequence points? When crossing synchronizing code? For example, in the
> above code, can we touch it earlier/later?

Why are you trying to make it more complex than it is?

If the source code doesn't write to it, the assembly the compiler generates doesn't write to it.

Don't try to make it anything more complicated. This has *nothing* to do with threads or functions or anything else.

If you do massive inlining, and you don't see any barriers or conditionals or other reasons not to write to it, just write to it.

Don't try to appear smart and make this into something it isn't. Look at the damn five-line example of the bug. FIX THE BUG. Don't try to make it anything bigger than a stupid compiler bug. Don't try to make this into a problem it simply isn't.

Linus
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 12:53 PM, Torvald Riegel wrote:
>
> For volatile, I agree.
>
> However, the original btrfs example was *without* a volatile, and that's
> why I raised the memory model point. This triggered an error in a
> concurrent execution, so that's memory model land, at least in the C
> language standard.

Sure. The thing is, if you fix the volatile problem, you'll almost certainly fix our problem too.

The whole "compilers actually do reasonable things" approach really does work in reality. It in fact works a lot better than reading some spec and trying to figure out if something is "valid" or not, and having fifteen different compiler writers and users disagree about what the meaning of the word "is" is in some part of it.

I'm not kidding. With specs, there really *are* people who spend years discussing what the meaning of the word "access" is or similar. Combine that with a big spec that is 500+ pages in size, then try to apply that all to a project that is 15 million lines of code and sometimes *knowingly* has to do things that it simply knows are outside the spec, and the discussion about these kinds of details is just mental masturbation.

>> We do end up doing
>> much more aggressive threading, with models that C11 simply doesn't
>> cover.
>
> Any specific examples for that would be interesting.

Oh, one of my favorite (NOT!) pieces of code in the kernel is the implementation of the smp_read_barrier_depends() macro, which on every single architecture except for one (alpha) is a no-op.

We have basically 30 or so empty definitions for it, and I think we have something like five uses of it. One of them, I think, is performance critical, and the reason for that macro existing.

What does it do? The semantics is that it's a read barrier between two different reads that we want to happen in order wrt two writes on the writing side (the writing side also has to have an "smp_wmb()" to order those writes).
But the reason it isn't a simple read barrier is that the reads are actually causally *dependent*, ie we have code like

	first_read = read_pointer;
	smp_read_barrier_depends();
	second_read = *first_read;

and it turns out that on pretty much all architectures (except for alpha), the *data*dependency* will already guarantee that the CPU reads the thing in order. And because a read barrier can actually be quite expensive, we don't want to have a read barrier for this case.

But alpha? Its memory consistency is so broken that even the data dependency doesn't actually guarantee cache access order. It's strange, yes. No, it's not that alpha does some magic value prediction and can do the second read without having even done the first read first to get the address. What's actually going on is that the cache itself is unordered, and without the read barrier, you may get a stale version from the cache even if the writes were forced (by the write barrier in the writer) to happen in the right order.

You really want to try to describe issues like this in your memory consistency model? No you don't. Nobody will ever really care, except for crazy kernel people. And quite frankly, not even kernel people care: we have a fairly big kernel developer community, and the people who actually talk about memory ordering issues can be counted on one hand.

There's the "RCU guy" who writes the RCU helper functions, and hides the proper serializing code into those helpers, so that normal mortal kernel people don't have to care, and don't even have to *know* how ignorant they are about the things.

And that's also why the compiler shouldn't have to care. It's a really small esoteric detail, and it can be hidden in a header file and a set of library routines. Teaching the compiler about crazy memory ordering would just not be worth it.
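[Editor's note: the publish/consume pattern being described, in user-space form. `__sync_synchronize()` stands in for the kernel's `smp_wmb()`; on the reader side the data dependency alone orders the two loads on every architecture except Alpha, where a real `smp_read_barrier_depends()` is needed. This single-threaded sketch only demonstrates the shape of the code, not the concurrency.]

```c
#include <stddef.h>

static int payload;
static int *shared_ptr;   /* NULL until the writer publishes */

static void writer(void)
{
    payload = 42;           /* 1: initialize the data                    */
    __sync_synchronize();   /* 2: stands in for smp_wmb(): order 1 vs 3  */
    shared_ptr = &payload;  /* 3: publish the pointer                    */
}

static int reader(void)
{
    int *p = shared_ptr;    /* first read                                */
    if (!p)
        return -1;          /* not yet published                         */
    /* smp_read_barrier_depends() would sit here; a no-op except Alpha   */
    return *p;              /* second read, data-dependent on the first  */
}
```

The reader pays nothing on almost every architecture, which is exactly why the kernel hides the Alpha-only barrier behind a macro instead of exposing it in a portable memory model.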
99.99% of all programmers will never need to understand any of it, they'll use the locking primitives and follow the rules, and the code that makes it all work is basically invisible to them.

Linus
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 1:24 PM, Torvald Riegel wrote:
>> It's not the only thing we do. We have cases where it's not that you
>> can't hoist things outside of loops, it's that you have to read things
>> exactly *once*, and then use that particular value (ie the compiler
>> can't be allowed to reload the value because it may change). So any
>> *particular* value we read is valid, but we can't allow the compiler
>> to do anything but a single access.
>
> That's what an access to an atomics-typed var would give you as well.

Right. The concept of "atomic" access is required.

The problem I have with treating this as a C11 issue is that people aren't using C11 compilers, and won't for a long while.

So quite frankly, it won't be reasonable for the kernel people to say "use a C11 version" for years to come. And the thing is, "atomic" doesn't actually seem to buy us anything in the kernel afaik. We're perfectly happy to use "volatile". We don't want the data structure annotated, we want the code annotated, so the semantic differences between 'volatile' and 'atomic' seem to be pretty irrelevant.

Sure, there is probably some subtle difference, but on the whole, it all boils down to "we can't depend on new rules for several years, and even when we can, it doesn't look like they'd buy us a lot, because we have to play so many games around them *anyway*".

And don't get me wrong. I don't think that means that C11 is *bad*. It's just that the kernel is very different from most other projects. We have to have those crazy architecture-specific header files and random synchronization macros etc anyway.

C11 is not - I think - even meant to be geared towards the Linux kernel kind of crazy use. We really do some odd things; adding compiler features for them is mostly a mistake. asm() takes care of a lot of the oddities for us, and it's not like all of them are about memory accesses or concurrency either.
I do think that the threading issues in C11 are going to help us kernel people, because as more people think about issues with concurrency, they really *will* be hitting some of the issues we've been having. So it clearly forces clarification of just what the word "access" implies, for example, which is certainly not going to be bad for the kernel.

>> It really doesn't. Loops are fairly unusual in the kernel to begin
>> with, and the compiler barriers are a total non-issue.
>
> This is interesting, because GCC is currently not optimizing across
> atomics but we're considering working on this later.
> Do you have real data about this, or is this based on analysis of
> samples of compiled code?

So the kernel literally doesn't tend to *have* loops.

Sure, there are exceptions: there are the obvious loops implied in "memcpy()" and friends, but almost all kernel activity tends to be about individual events.

A really hot loop under normal loads would be the loop that calculates a hash value over a pathname component. Or the loop that loops over those components.

Loop count? Think about it. Most pathnames have one or two components. And while "8.3" is clearly too short for a filename, it's still true that a lot of filenames (especially directories, which is what you end up traversing many times) are in a kind of awkward 5-8 character range.

So the kernel loads tend to have loop counts in the single digits - *maybe* double digits. We use hash-tables a lot, so looping to find something is usually just an entry or two, and the cost is almost never the loop, it's the cache miss from the pointer chasing.

Most "looping" behavior in the kernel is actually user space calling the kernel in a loop. So the kernel may see "loops" in that sense, but it's not loopy kernel code itself.

Even really hot kernel-only code is often not about loops.
You might get a million network packets a second, and with interrupt mitigation you would not necessarily treat them each as individual events with their own individual interrupts (which does happen too!), but you'd still not see it as a loop over a million packets. You'd see them come in a few packets at a time, and you'd loop until you've handled them.

And the loops we have tend to not be very amenable to nice compiler optimizations either.

The kernel almost never works with arrays, for example. The string in a pathname component is an array, yes, but *most* kernel data structures loop over our doubly linked lists or over a hash-chain. And it tends to be really hard to do much loop optimization over list traversal like that.

>> If the source code doesn't write to it, the assembly the compiler
>> generates doesn't write to it.
>
> If it would be so easy, then answering my questions would have been easy
> too, right? Which piece of source code? The whole kernel?

So make the rule be: "in one single function, with all you can see there", and I'll be happy. Do as much inlining as you want (although the kernel does have "noinline" too). With all the normal sequence point and memory invalidation rules.
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 1:25 PM, Boehm, Hans wrote:
>
> Here are some more interesting ones that illustrate the issues (all
> declarations are non-local, unless stated otherwise):
>
> struct { char a; int b:9; int c:7; char d; } x;
>
> Is x.b = 1 allowed to overwrite x.a? C11 says no, essentially requiring two
> byte stores. Gcc currently does so. I'm not sure I understand Linus'
> position here.

So I like the fact that the C11 stance seems very strict. But honestly I also think it sounds like C11 is actually more strict than I would necessarily be.

I really do think that bitfields imply "int", both historically and technically. So I would not see the problem with treating the bitfields as part of an 'int' and thus overwriting a (and d) when writing to b. That's how bitfields work! They are fields of an int.

It would be good if it perhaps caused a *warning*, and gave a good way to avoid it. For example, while I think using any other base type than 'int' is technically an extension of the C bitfield rules (but whatever, I don't have any specs in front of me), I think a warning together with allowing the user to rewrite it as

	struct { char a; char d; short b:9; short c:7; } x;

would make it clear that now a write to 'b' cannot validly overwrite 'a' or 'd'.

But forcing the compiler to do two (and sometimes three!) byte accesses sounds excessive.

The reason I think the

	int flag:1;
	int othervariable;

overwriting of "othervariable" is so *obviously* a bug is exactly that bitfields are historically about 'int', and no 'long' was there anywhere, so using a 64-bit access is clearly not sane in any way, shape or form.

I dunno.

Linus
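[Editor's note: the rewrite suggested above can be checked mechanically. With char/short base types, the two bitfields share one 16-bit allocation unit at the end of the struct, so a store to `b` has no base-type excuse to touch `a` or `d`. The `short` bitfield base type and the 4-byte size are common-implementation assumptions (gcc on x86-64), not ISO C guarantees.]

```c
/* Linus's suggested layout: 'a' and 'd' each get their own byte, and
 * the two bitfields share a single 16-bit 'short' allocation unit, so
 * a write to 'b' is confined to that unit. */
struct after { char a; char d; short b : 9; short c : 7; };

static struct after set_b(struct after x, int v)
{
    x.b = v;    /* must not clobber x.a or x.d */
    return x;
}
```

Compare with Hans's original `{ char a; int b:9; int c:7; char d; }`, where the `int` allocation unit spans all four members and the base-type rule would permit overwriting `a` and `d`.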
Re: Memory corruption due to word sharing
On Wed, Feb 1, 2012 at 2:45 PM, Paul E. McKenney wrote:
>
> My (perhaps forlorn and naive) hope is that C++11 memory_order_relaxed
> will eventually allow ACCESS_ONCE() to be upgraded so that (for example)
> access-once increments can generate a single increment-memory instruction
> on x86.

I don't think that is a semantic issue.

gcc could do it *today* with volatile accesses. It doesn't, because volatiles are scary and basically disable a lot of optimizations. Why would memory ordering be substantially different just because it has a different name?

> New architectures might eventually define things like atomic_inc()
> in terms of C++11 atomics, but let's start with the straightforward stuff
> as and if it makes sense.

SMP-atomic or percpu atomic? Or both?

We need both variants in the kernel. If the compiler generates one of them for us, that doesn't really much help.

Linus
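[Editor's note: the two "variants" being distinguished can be sketched with gcc builtins. The `__sync` builtin generates the locked, SMP-atomic form (a single `lock add`-style instruction on x86); the plain increment is what the cheap per-CPU flavor wants - a single non-locked memory-destination `incl`, which is exactly what a volatile access prevents gcc from emitting today. User-space sketch, not kernel atomics.]

```c
static int counter;

/* SMP-atomic flavor: safe against other CPUs; on x86 this is a single
 * 'lock'-prefixed add instruction via the gcc builtin. */
static int atomic_inc(void)
{
    return __sync_add_and_fetch(&counter, 1);
}

static int percpu_counter;

/* Per-CPU flavor: no lock prefix needed because only one CPU touches the
 * variable; the goal Paul describes is for an "access-once" increment
 * like this to compile to one 'incl (mem)' instruction. */
static void percpu_inc(void)
{
    percpu_counter++;
}
```

The kernel needs both, with different costs; a compiler that only offers the locked form (or only the relaxed form) replaces neither.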
Re: Memory corruption due to word sharing
On Thu, Feb 2, 2012 at 8:28 AM, Michael Matz wrote:
>
> Sure. Simplest example: struct s {int i:24;} __attribute__((packed)).
>
> You must access only three bytes, no matter what. The basetype (int) is
> four bytes.

Ok, so here's a really *stupid* (but also really really simple) patch attached.

What it does is to simply change the meaning of the SLOW_BYTE_ACCESS thing. What SLOW_BYTE_ACCESS currently means is not just that byte accesses are slow: it means that *everything* but full-word accesses is slow!

Which is (a) not what the name really implies, (b) not even the historical meaning of that #define (why? the meaning of "full word" has changed since: it used to be 32 bits, now it is 64 bits) and (c) stupid, because that's not how hardware with slow byte accesses really works anyway.

Finally, it's what causes gcc to do 64-bit accesses to bitfields that aren't 64 bits in size.

So because of this, I propose gcc just change the rules for what SLOW_BYTE_ACCESS means. I propose to just change it to mean "accesses smaller than 32 bits are slow". That's actually much closer to what the hardware definition tends to be.

It doesn't fix the generic issue, it doesn't fix any C++11/C11 issues, it doesn't really change anything, but what it *does* do is make for a hell of a lot saner model. And it avoids bugs in practice.

NOTE! On at least some machines with SLOW_BYTE_ACCESS, accesses smaller than a word cannot possibly be atomic anyway (well, not without jumping through hoops), so the fact that we still extend to 32 bits and the problem of touching too much still exists with 'char' and 'short' variables that are in the same 32-bit word as a bitfield is kind of unavoidable.

So this suggested patch doesn't really "fix" anything fundamental, but it is (a) simple, (b) totally untested and (c) at least fixes *some* cases.

For example, it might fix the 'sig_atomic_t' shadowing, since that is usually 'int'.
It obviously can never fix the issue with volatile char/short, but at least it works around it for 'long'.

In other words - I'm not trying to say that this patch really "fixes" anything (except the sig_atomic_t interaction, which really does get fixed for the normal 'int' case). But what it does do is to limit the damage of a bad situation.

And it is small, and should hopefully be easy to accept even for stable versions of gcc.

So can we please do something like this for maintenance releases, and consider the much more complex C++11/C11 issues to be a separate, much bigger issue for the future?

Because the current SLOW_BYTE_ACCESS meaning really is crap.

The *only* thing that architecture define seems to do is literally the bitfield extract semantics, and it does that horribly, horribly badly as shown by the bug on both 64-bit POWER and Sparc. For no good reason - both of those have perfectly fine word operations afaik.

I did not look at other architectures that define SLOW_BYTE_ACCESS, but if they have a 32-bit integer type, I'm pretty sure they support fast loads/stores of it.

Hacky? Yes. Pretty? No. But better than the disaster that is there now? Hell yes.

Linus

PS. The only reason I hardcoded "32" was that I didn't know how to ask the question "is this mode at least as wide as 'int'" in the proper gcc way. So I'm really not suggesting you apply this patch as-is, I'm just sending it out as a "hey, this is a pretty obvious way to work around part of the problem, somebody who really knows what they are doing should probably improve on it".
 gcc/stor-layout.c | 5 +
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/gcc/stor-layout.c b/gcc/stor-layout.c
index ea0d44d64d26..1387c0e9e060 100644
--- a/gcc/stor-layout.c
+++ b/gcc/stor-layout.c
@@ -2486,7 +2486,12 @@ get_best_mode (int bitsize, int bitpos, unsigned int align,
 	  && unit <= MIN (align, BIGGEST_ALIGNMENT)
 	  && (largest_mode == VOIDmode
 	      || unit <= GET_MODE_BITSIZE (largest_mode)))
+	{
 	  wide_mode = tmode;
+	  /* Wide enough? */
+	  if (unit >= 32)
+	    break;
+	}
     }

   if (wide_mode != VOIDmode)
Re: Memory corruption due to word sharing
On Thu, Feb 2, 2012 at 10:42 AM, Paul E. McKenney wrote:
>>
>> SMP-atomic or percpu atomic? Or both?
>
> Only SMP-atomic.

And I assume that since the compiler does them, that would now make it impossible for us to gather a list of all the 'lock' prefixes so that we can undo them if it turns out that we are running on a UP machine.

When we do SMP operations, we don't just add a "lock" prefix to it. We do this:

	#define LOCK_PREFIX_HERE \
		".section .smp_locks,\"a\"\n"	\
		".balign 4\n"			\
		".long 671f - .\n" /* offset */	\
		".previous\n"			\
		"671:"

	#define LOCK_PREFIX LOCK_PREFIX_HERE "\n\tlock; "

and I'm sure you know that, but I'm not sure the gcc people realize the kinds of games we play to make things work better.

Sure, "everything will be SMP" some day, but running UP kernels is likely still going to remain a good idea in virtualized environments, for example. And having to make it a compile-time option is *not* a good idea.

So compiler intrinsics are often *worse* than doing it by hand for us for all these kinds of reasons. They aren't generally geared towards the very specialized needs that a kernel has.

Of course, maybe even user space would want some kind of way to automatically strip 'lock' prefixes from a binary, so maybe the compiler would have some kind of support like this too.

(And no, disassembling the binary in order to find lock prefixes is *not* the answer, at least not for the kernel)

>> We need both variants in the kernel. If the compiler generates one of
>> them for us, that doesn't really much help.
>
> I must admit that the non-x86 per-CPU atomics are, ummm, "interesting".

Most non-x86 cpu's would probably be better off treating them the same as smp-atomics (load-locked + store-conditional), but right now we have this insane generic infrastructure for having versions that are irq-safe by disabling interrupts etc. Ugh. Mainly because nobody really is willing to work on and fix up the 25 architectures that really don't matter.

Linus
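The .smp_locks trick can be modeled in miniature in plain C: the build records the offset of every lock byte in a side table, and a UP boot rewrites each recorded site to a NOP instead of disassembling anything. Everything below (the buffer, the patch function) is invented for the illustration; the real kernel patches its own text section:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LOCK_BYTE 0xf0	/* x86 'lock' prefix opcode */
#define NOP_BYTE  0x90	/* x86 one-byte 'nop' */

/* Toy model of alternatives_smp_unlock(): walk a table of recorded
 * lock-prefix offsets and NOP each one out for a UP machine.  No
 * disassembly needed - the table tells us exactly where they are. */
static void up_patch(uint8_t *text, const size_t *sites, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(text[sites[i]] == LOCK_BYTE);	/* sanity check */
		text[sites[i]] = NOP_BYTE;
	}
}
```

This is why a compiler-generated `lock` that isn't recorded anywhere is a real loss: the side table is the whole mechanism.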
Re: Memory corruption due to word sharing
On Fri, Feb 3, 2012 at 8:38 AM, Andrew MacLeod wrote:
>
> The atomic intrinsics were created for c++11 memory model compliance, but I
> am certainly open to enhancements that would make them more useful. I am
> planning some enhancements for 4.8 now, and it sounds like you may have some
> suggestions...

So we have several atomics we use in the kernel, with the more common being

 - add (and subtract) and cmpxchg of both 'int' and 'long'

 - add_return (add and return new value)

 - special cases of the above:
	dec_and_test (decrement and test result for zero)
	inc_and_test (increment and test result for zero)
	add_negative (add and check if result is negative)

   The special cases are because older x86 cannot do the generic
   "add_return" efficiently - it needs xadd - but can do atomic versions
   that test the end result and give zero or sign information.

 - atomic_add_unless() - basically an optimized cmpxchg.

 - atomic bit array operations (bit set, clear, set-and-test,
   clear-and-test). We do them on "unsigned long" exclusively, and in
   fact we do them on arrays of unsigned long, ie we have the whole "bts
   reg,mem" semantics. I'm not sure we really care about the atomic
   versions for the arrays, so it's possible we only really care about a
   single long.

   The only complication with the bit setting is that we have a concept
   of "set/clear bit with memory barrier before or after the bit" (for
   locking). We don't do the whole release/acquire thing, though.

 - compare_xchg_double

We also do byte/word atomic increments and decrements, but that's in the x86 spinlock implementation, so it's not a generic need.

We also do the add version in particular as CPU-local optimizations that do not need to be SMP-safe, but do need to be interrupt-safe. On x86, this is just an r-m-w op, on most other architectures it ends up being the usual load-locked/store-conditional.

I think that's pretty much it, but maybe I'm missing something.
Of course, locking itself tends to be special cases of the above with extra memory barriers, but it's usually hidden in asm for other reasons (the bit-op + barrier being a special case). Linus
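The common ops in the list above map fairly directly onto gcc's __atomic builtins. A sketch - these are not the kernel's implementations (those are per-arch asm), and SEQ_CST is chosen here purely for simplicity:

```c
#include <assert.h>
#include <stdbool.h>

/* add_return: add and return the new value */
static inline int atomic_add_return(int i, int *v)
{
	return __atomic_add_fetch(v, i, __ATOMIC_SEQ_CST);
}

/* dec_and_test: decrement and report whether the result hit zero */
static inline bool atomic_dec_and_test(int *v)
{
	return __atomic_sub_fetch(v, 1, __ATOMIC_SEQ_CST) == 0;
}

/* add_negative: add and report whether the result went negative */
static inline bool atomic_add_negative(int i, int *v)
{
	return __atomic_add_fetch(v, i, __ATOMIC_SEQ_CST) < 0;
}
```

Whether gcc peepholes the `== 0` / `< 0` tests into `decl; je` style sequences instead of an xadd-and-compare is exactly the code-quality question raised in the next message.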
Re: Memory corruption due to word sharing
On Fri, Feb 3, 2012 at 11:16 AM, Andrew MacLeod wrote:
>> The special cases are because older x86 cannot do the generic
>> "add_return" efficiently - it needs xadd - but can do atomic versions
>> that test the end result and give zero or sign information.
>
> Since these are older x86 only, could you use add_return() always and then
> have the compiler use new peephole optimizations to detect those usage
> patterns and change the instruction sequence for x86 when required? would
> that be acceptable? Or maybe you don't trust the compiler :-) Or maybe
> I can innocently ask if the performance impact on older x86 matters enough
> any more? :-)

Since xadd is often slow even when it does exist, the peepholes would definitely be required. A "dec_and_test" does need to be

	decl %m		(or "subl $1" for x86-64)
	je

rather than

	xadd %r,%m
	cmp $1,%r
	je

simply because the latter is not just larger, but often quite a bit slower (newer CPU's seem to be better at xadd).

But with the peepholes, by the time we could depend on gcc doing the intrinsics, we'd almost certainly be ready to let the old pre-xadd machines just go. We're already almost there, and by the time we can trust the compiler to do it for us, we're almost certainly going to be at a point where we really don't care any more about pre-xadd (but we will still care about older machines where xadd is slow).

>> - atomic_add_unless() - basically an optimized cmpxchg.
>
> is this the reverse of a compare_exchange and add? Ie, add if the value
> ISN'T expected? or some form of compare_exchange_and_add? This might
> require a new atomic builtin.
>
> What exactly does it do?
It's basically

	do {
		load-link %r,%m
		if (r == value)
			return 0;
		add
	} while (store-conditional %r,%m)
	return 1;

and it is used to implement two *very* common (and critical) reference-counting use cases:

 - decrement ref-count locklessly, and if it hits free, take a lock
   atomically (ie "it would be a bug for anybody to ever see it decrement
   to zero while holding the lock")

 - increment ref-count locklessly, but if we race with the final free,
   don't touch it, because it's gone (ie "if somebody decremented it to
   zero while holding the lock, we absolutely cannot increment it again")

They may sound odd, but those two operations are absolutely critical for most lock-less refcount management. And taking locks for reference counting is absolutely death to performance, and is often impossible anyway (the lock may be a local lock that is *inside* the structure that is being reference-counted, so if the refcount is zero, then you cannot even look at the lock!)

In the above example, the load-locked -> store-conditional would obviously be a cmpxchg loop instead on architectures that do cmpxchg instead of ll/sc.

Of course, if you expose some intrinsic for the whole "ll/sc" model (and you then turn it into cmpxchg on demand), we could literally open-code it. That is likely the most *flexible* approach for a compiler. I think pretty much everything the kernel needs (except for cmpxchg_double) can be very naturally written as a "ll/sc" sequence, and if the compiler then just does the right thing with peephole optimizations, that's fine.

IOW, we don't really need "atomic_add()" or anything like that. If we can do

	do {
		val = __load_linked(mem);
		val++;
	} while (__store_conditional(val, mem));

and the compiler just automagically turns that into "lock add" on x86, that's perfectly fine.

It might not be too hard, because you really don't need to recognize very many patterns, and any pattern you don't recognize could be turned into a cmpxchg loop.

NOTE NOTE NOTE!
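The ll/sc pseudocode above translates naturally into a compare-exchange loop on cmpxchg architectures. A sketch using gcc's __atomic builtins (again not the kernel's code, which is per-arch asm): add `a` to `*v` unless `*v == u`, returning whether the add happened:

```c
#include <assert.h>
#include <stdbool.h>

static bool atomic_add_unless(int *v, int a, int u)
{
	int old = __atomic_load_n(v, __ATOMIC_RELAXED);

	do {
		if (old == u)
			return false;	/* hit the forbidden value: don't touch it */
		/* on failure, __atomic_compare_exchange_n writes the
		 * current value back into 'old', so the loop retries
		 * with a fresh snapshot automatically */
	} while (!__atomic_compare_exchange_n(v, &old, old + a,
					      /*weak=*/true,
					      __ATOMIC_SEQ_CST,
					      __ATOMIC_RELAXED));
	return true;
}
```

The "increment unless already freed" refcount case above is then just `atomic_add_unless(&refcount, 1, 0)`.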
The "turned into a cmpxchg loop" is not the true correct translation of load-linked/store-conditional, since it allows the memory to be modified as long as it's modified *back* before the store-conditional, and that actually matters for a few algorithms. But if you document the fact that it's not a "true" ll/sc (and maybe have some compile-time way to detect when it isn't), it would be very flexible.

Of course, the syntax could be something completely different. Maybe you'd want to do it as

	__builtin_ll_sc(&var, update-expression, return-expression, failure-expression)

rather than an explicit loop. But it doesn't sound like the internal gcc model is based on some generic ll/sc model.

>> - atomic bit array operations (bit set, clear, set-and-test,
>> clear-and-test). We do them on "unsigned long" exclusively, and in
>> fact we do them on arrays of unsigned long, ie we have the whole "bts
>> reg,mem" semantics. I'm not sure we really care about the atomic
>> versions for the arrays, so it's possible we only really care about a
>> single long.
>>
>> The only complication with the bit setting is that we have a
>> concept of "set/clear bit with memory barrier before or after the bit"
>> (for locking). We don't do the whole release/acquire thing, though.
>
> Are
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Thomas Gleixner wrote:
>
> standard function start:
>
>	push %ebp
>	mov  %esp, %ebp
>
>	call mcount
>
> modified function start on a handful of functions only seen with gcc
> 4.4.x on x86 32 bit:
>
>	push  %edi
>	lea   0x8(%esp),%edi
>	and   $0xfffffff0,%esp
>	pushl -0x4(%edi)
>	push  %ebp
>	mov   %esp,%ebp
>	...
>	call mcount

That's some crazy sh*t anyway, since we don't _want_ the stack to be 16-byte aligned in the kernel. We do

	KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=2)

why is that not working? So this looks like a gcc bug, plain and simple.

> This modification leads to a hard to solve problem in the kernel
> function graph tracer which assumes that the stack looks like:
>
>	return address
>	saved ebp

Umm. But it still does, doesn't it? That

	pushl -0x4(%edi)
	push  %ebp

should do it - the "-0x4(%edi)" thing seems to be trying to reload the return address. No?

Maybe I misread the code - but regardless, it does look like a gcc code generation bug if only because we really don't want a 16-byte aligned stack anyway, and have asked for it to not be done. So I agree that gcc shouldn't do that crazy prologue (and certainly _not_ before calling mcount anyway), but I'm not sure I agree with that detail of your analysis or explanation.

Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Richard Guenther wrote:
>
> Note that I only can reproduce the issue with
> -mincoming-stack-boundary=2, not with -mpreferred-stack-boundary=2.

Since you can reproduce it with -mincoming-stack-boundary=2, I would suggest just fixing mcount handling that way regardless of anything else. The current code generated by gcc is just insane - even for the case where you _want_ 16-byte stack alignment.

Instead of crazy code like

>	push  %edi
>	lea   0x8(%esp),%edi
>	and   $0xfffffff0,%esp
>	pushl -0x4(%edi)
>	push  %ebp
>	mov   %esp,%ebp
>	...
>	call mcount

the sane thing to do would be to just do it as

	push %ebp
	mov  %esp,%ebp
	call mcount
	and  $0xfffffff0,%esp

since

 - no sane 'mcount' implementation can ever care about 16-byte stack
   alignment anyway, so aligning the stack before mcount is crazy.

 - mcount is special anyway, and the only thing that cares about that
   whole ebp/return address thing is mcount, and _all_ your games with
   %edi are about that mcount thing.

IOW, once you as a compiler person understand that the 'mcount' call is special, you should have realized that all the work you did for it was totally pointless and stupid to begin with. You must already have that special mcount logic (the whole code to save a register early and push the fake mcount stack frame), so instead of _that_ special logic, change it to a different mcount special logic that associates the 'mcount' call with the frame pointer pushing.

That will not only make the Linux kernel tracer happy, it will make all your _other_ users happier too, since you can generate smaller and more efficient code.

Admittedly, anybody who compiles with -pg probably doesn't care deeply about smaller and more efficient code, since the mcount call overhead tends to make the thing moot anyway, but it really looks like a win-win situation to just fix the mcount call sequence regardless.

> And you didn't provide us with a testcase either ...
> so please open a bugzilla and attach preprocessed source of a file
> that shows the problem, note the function it happens in and provide
> the command-line options you used for building.
>
> Otherwise it's going to be all speculation on our side.

See above - all you need to do is to just fix mcount calling.

Now, there is a separate bug that shows that you seem to over-align the stack when not asked for, and yes, since we noticed that I hope that Thomas and friends will fix that, but I think your mcount logic could (and should) be fixed as an independent silliness.

Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Andrew Haley wrote: > > I've got all that off-list. I found the cause, and replied in another > email. It's not a bug. Oh Gods, are we back to gcc people saying "sure, we do stupid things, but it's allowed, so we don't consider it a bug because it doesn't matter that real people care about real life, we only care about some paper, and real life doesn't matter, if it's 'undefined' we can make our idiotic choices regardless of what people need, and regardless of whether it actually generates better code or not". Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Linus Torvalds wrote: > > Oh Gods, are we back to gcc people saying "sure, we do stupid things, but > it's allowed, so we don't consider it a bug because it doesn't matter that > real people care about real life, we only care about some paper, and real > life doesn't matter, if it's 'undefined' we can make our idiotic choices > regardless of what people need, and regardless of whether it actually > generates better code or not". Put another way: the stack alignment itself may not be a bug, but gcc generating God-awful code for the mcount handling that results in problems in real life sure as hell is *stupid* enough to be called a bug. I bet other people than just the kernel use the mcount hook for subtler things than just doing profiles. And even if they don't, the quoted code generation is just crazy _crap_. Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Linus Torvalds wrote:
>
> I bet other people than just the kernel use the mcount hook for subtler
> things than just doing profiles. And even if they don't, the quoted code
> generation is just crazy _crap_.

For the kernel, if the only case is that timer_stat.c thing that Thomas pointed at, I guess we can at least work around it with something like the appended. The kernel code is certainly ugly too, no question about that. It's just that we'd like to be able to depend on mcount code generation not being insane even in the presence of ugly code..

The alternative would be to have some warning when this happens, so that we can at least see it. "mcount won't work reliably"

Linus

---
 kernel/time/timer_stats.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/timer_stats.c b/kernel/time/timer_stats.c
index ee5681f..488c7b8 100644
--- a/kernel/time/timer_stats.c
+++ b/kernel/time/timer_stats.c
@@ -76,7 +76,7 @@ struct entry {
 	 */
 	char			comm[TASK_COMM_LEN + 1];
 
-} ____cacheline_aligned_in_smp;
+};
 
 /*
  * Spinlock protecting the tables - not taken during lookup:
@@ -114,7 +114,7 @@ static ktime_t time_start, time_stop;
 #define MAX_ENTRIES		(1UL << MAX_ENTRIES_BITS)
 
 static unsigned long nr_entries;
-static struct entry entries[MAX_ENTRIES];
+static struct entry entries[MAX_ENTRIES] ____cacheline_aligned_in_smp;
 
 static atomic_t overflow_count;
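The effect of moving the alignment from the struct type to the array can be sketched in plain C. Names and the 64-byte line size are assumptions for the example; the kernel macro expands to an `aligned` attribute sized to the configured L1 cache line on SMP:

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64	/* assumed L1 line size for this sketch */

/* Aligning the *type* pads every element out to a full cache line... */
struct entry_padded {
	char comm[17];
} __attribute__((aligned(CACHELINE)));

/* ...while aligning only the *array* keeps the elements compact and
 * just anchors the whole table on a cache line boundary. */
struct entry {
	char comm[17];
};

static struct entry table[1024] __attribute__((aligned(CACHELINE)));
```

That is the point of the patch above: per-element padding (and the over-aligned locals that were tripping the stack realignment) goes away, while the table as a whole stays cacheline-aligned.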
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, H. Peter Anvin wrote: > > Calling the profiler immediately at the entry point is clearly the more > sane option. It means the ABI is well-defined, stable, and independent > of what the actual function contents are. It means that ABI isn't the > normal C ABI (the __fentry__ function would have to preserve all > registers), but that's fine... As far as I know, that's true of _mcount already: it's not a normal ABI and is rather a highly architecture-specific special case to begin with. At least ARM has some (several?) special mcount calling conventions, afaik. (And then ARM people use __attribute__((naked)) and insert the code by inline asm, or something. That seems to be "standard" in the embedded world, where they often do even stranger things than we do in the kernel. At least our low-level system call and interrupt handlers are written as assembly language - the embedded world seems to commonly write them as C functions with magic attributes and inline asm). Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Thu, 19 Nov 2009, Frederic Weisbecker wrote:
>
> > That way the lr would have the current function, and the parent would
> > still be at 8(%sp)
>
> Yeah right, we need at least such very tiny prologue for
> archs that store return addresses in a reg.

Well, it will be architecture-dependent. For example, alpha can store the return address in _any_ register if I recall correctly, so you can do the "call to __fentry__" by just picking another register than the default one as the return address.

And powerpc has two special registers: link and ctr, but iirc you can only load 'link' with a branch instruction. Which means that you could do something like

	mflr 0
	bl __fentry__

in the caller (I forget if R0 is actually a call-trashed register or not), and then __fentry__ could do something like

	mflr 12		# save _new_ link
	mtlr 0		# restore original link
	mtctr 12	# move __fentry__ link to ctr
	.. do whatever ..
	bctr		# return to __fentry__ caller

to return with 'link' restored (but ctr and r0/r12 trashed - I don't recall the ppc calling conventions any more, but I think that is ok). Saving to stack seems unnecessary and pointless.

Linus
Re: BUG: GCC-4.4.x changes the function frame on some functions
On Fri, 20 Nov 2009, Thomas Gleixner wrote:
>
> While testing various kernel configs we found out that the problem
> comes and goes. Finally I started to compare the gcc command line
> options and after some fiddling it turned out that the following
> minimal deltas change the code generator behaviour:
>
> Bad:  -march=pentium-mmx -Wa,-mtune=generic32
> Good: -march=i686 -mtune=generic -Wa,-mtune=generic32
> Good: -march=pentium-mmx -mtune=generic -Wa,-mtune=generic32
>
> I'm not supposed to understand the logic behind that, right ?

Are you sure it's just the compiler flags?

There's another configuration portion: the size of the alignment itself. That's dependent on L1_CACHE_SHIFT, which in turn is taken from the kernel config CONFIG_X86_L1_CACHE_SHIFT. Maybe that value matters too - for example, maybe gcc will not try to align the stack if it's big?

[ Btw, looking at that, why are X86_L1_CACHE_BYTES and X86_L1_CACHE_SHIFT totally unrelated numbers? Very confusing. ]

The compiler flags we use are tied to some of the same choices that choose the cache shift, so the correlation you found while debugging this would still hold.

Linus
Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?
On Sun, Nov 14, 2010 at 4:52 PM, James Cloos wrote:
> Gcc 4.5.1 running on an amd64 box "cross"-compiling for a P3 i8k fails
> to compile the module since commit 6b4e81db2552bad04100e7d5ddeed7e848f53b48
> with:
>
>   CC      drivers/char/i8k.o
> drivers/char/i8k.c: In function ‘i8k_smm’:
> drivers/char/i8k.c:149:2: error: can't find a register in class
> ‘GENERAL_REGS’ while reloading ‘asm’
> drivers/char/i8k.c:149:2: error: ‘asm’ operand has impossible constraints

At this point, I think this falls clearly under "unresolvable gcc bug".

Quite frankly, I think gcc was buggy to begin with: since we had a memory clobber, the "+m" (*regs) should not have mattered. The fact that "*regs" may be some local variable doesn't make any difference what-so-ever, since we took the address of the variable. So the memory clobber _clearly_ can change that variable.

So when Richard Guenther says "a memory clobber doesn't cover automatic storage", to me that very clearly spells "gcc is buggy as hell". Because automatic storage with its address taken _very_ much gets clobbered by things like memset etc. If the compiler doesn't understand that, the compiler is just broken.

And now, if even the (superfluous) "+m" isn't working, it sounds like we have no sane options left. Except to say that gcc-4.5.1 is totally broken wrt asms.

Can we just get gcc to realize that when you pass the address of automatic storage to an asm, that means that "memory" really does clobber it? Because clearly that is the case.

Linus
Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?
On Mon, Nov 15, 2010 at 3:16 AM, Jakub Jelinek wrote:
>
> I don't see any problems on the assembly level. i8k_smm is
> not inlined in this case and checks all 3 conditions.

If it really is related to gcc not understanding that "*regs" has changed due to the memory being an automatic variable, and passing in "regs" itself as a pointer to that automatic variable together with the "memory" clobber not being sufficient, then I think it's the lack of inlining that will automatically hide the bug.

(Side note: and I think this does show how much of a gcc bug it is not to consider "memory" together with passing in a pointer to an asm to always be a clobber.)

Because if it isn't inlined, then "regs" will be seen as a real pointer to some external memory (the caller) rather than being optimized to just be the auto structure on the stack. Because *regs is auto only within the context of the caller.

Which actually points to a possible simpler fix:

 - remove the "+m" since it adds too much register pressure

 - mark the i8k_smm() as "noinline" instead.

Quite frankly, I'd hate to add even more crud to that inline asm (to save/restore the registers manually). It's already not the prettiest thing around.

So does the attached patch work for everybody?

Linus

 drivers/char/i8k.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/char/i8k.c b/drivers/char/i8k.c
index f0863be..101011e 100644
--- a/drivers/char/i8k.c
+++ b/drivers/char/i8k.c
@@ -114,7 +114,7 @@ static inline const char *i8k_get_dmi_data(int field)
 /*
  * Call the System Management Mode BIOS. Code provided by Jonathan Buzzard.
  */
-static int i8k_smm(struct smm_regs *regs)
+static noinline int i8k_smm(struct smm_regs *regs)
 {
 	int rc;
 	int eax = regs->eax;
@@ -142,7 +142,7 @@ static int i8k_smm(struct smm_regs *regs)
 	    "lahf\n\t"
 	    "shrl $8,%%eax\n\t"
 	    "andl $1,%%eax\n"
-	    :"=a"(rc), "+m" (*regs)
+	    :"=a"(rc)
 	    :"a"(regs)
 	    :"%ebx", "%ecx", "%edx", "%esi", "%edi", "memory");
 #else
@@ -168,7 +168,7 @@ static int i8k_smm(struct smm_regs *regs)
 	    "lahf\n\t"
 	    "shrl $8,%%eax\n\t"
 	    "andl $1,%%eax\n"
-	    :"=a"(rc), "+m" (*regs)
+	    :"=a"(rc)
 	    :"a"(regs)
 	    :"%ebx", "%ecx", "%edx", "%esi", "%edi", "memory");
 #endif
Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?
On Mon, Nov 15, 2010 at 9:40 AM, Jim Bos wrote: > > Hmm, that doesn't work. > > [ Not sure if you read to whole thread but initial workaround was to > change the asm(..) to asm volatile(..) which did work. ] Since I have a different gcc than yours (and I'm not going to compile my own), have you posted your broken .s file anywhere? In fact, with the noinline (and the removal of the "+m" thing - iow just the patch you tried), what does just the "i8k_smm" function assembly look like for you after you've done a "make drivers/char/i8k.s"? If the asm just doesn't exist AT ALL, that's just odd. Because every single call-site of i8k_smm() clearly looks at the return value. So the volatile really shouldn't make any difference from that standpoint. Odd. Linus
Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?
On Mon, Nov 15, 2010 at 10:30 AM, Jim Bos wrote:
>
> Attached version with plain 2.6.36 source and version with the committed
> patch, i.e with the '"+m" (*regs)'

Looks 100% identical in i8k_smm() itself, and I'm not seeing anything bad. The asm has certainly not been optimized away as implied in the archlinux thread. There are differences, but they are with code generation *elsewhere*.

To me it is starting to look like the real problem is that gcc has decided that the "i8k_smm()" function is "__attribute__((const))". Which is clearly totally bogus. If a function has an inline asm that has a memory clobber, it is clearly *not* 'const'.

But that does explain the bug, and does explain why "+m" makes a difference and why "noinline" does not. So what I _think_ happens is that

 - gcc logic for the automatic 'const' attribute for functions is
   broken, so it marks that function 'const'.

 - since the rule for a const function is that it only _looks_ at its
   arguments and has no side effects, now the callers will decide that
   'i8k_smm()' cannot change the passed-in structure, so they'll happily
   optimize away all the accesses to it.

Linus
Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?
On Mon, Nov 15, 2010 at 10:45 AM, Jeff Law wrote:
>
> A memory clobber should clobber anything in memory, including autos in
> memory; if it doesn't, then that seems like a major problem. I'd like to
> see the rationale behind not clobbering autos in memory.

Yes. It turns out that the "asm optimized away" was entirely wrong (we never saw that, it was just a report on another mailing list).

Looking at the asm posted, it seems to me that gcc actually compiles the asm itself 100% correctly, and the "memory" clobber is working fine inside that function. So the code generated for i8k_smm() itself is all good.

But _while_ generating the good code, gcc doesn't seem to realize that it writes to anything, so it decides to mark the function "__attribute__((const))", which is obviously wrong (a memory clobber definitely implies that it's not const).

And as a result, the callers will be mis-optimized, because they do things like

	static int i8k_get_bios_version(void)
	{
		struct smm_regs regs = { .eax = I8K_SMM_BIOS_VERSION, };

		return i8k_smm(&regs) ? : regs.eax;
	}

and since gcc has (incorrectly) decided that "i8k_smm()" is a const function, it thinks that "regs.eax" hasn't changed, so it doesn't bother to reload it: it "knows" that it is still I8K_SMM_BIOS_VERSION that it initialized it with. So it will basically have rewritten that final return statement as

	return i8k_smm(&regs) ? : I8K_SMM_BIOS_VERSION;

which obviously doesn't really work.

This also explains why adding "volatile" worked. The "asm volatile" triggered "this is not a const function". Similarly, the "+m" works, because it also makes clear that the asm is writing to memory, and isn't a const function.

Now, the "memory" clobber should clearly also have done that, but I'd be willing to bet that some version of gcc (possibly extra slackware patches) had forgotten the trivial logic to say "a memory clobber also makes the user function non-const".

Linus
Re: gcc 4.5.1 / as 2.20.51.0.11 miscompiling drivers/char/i8k.c ?
On Mon, Nov 15, 2010 at 11:12 AM, Jakub Jelinek wrote: > > Ah, the problem is that memory_identifier_string is only initialized in > ipa-reference.c's initialization, so it can be (and is in this case) NULL in > ipa-pure-const.c. Ok. And I guess you can verify that all versions of gcc do this correctly for "asm volatile"? Because since we'll have to work around this problem in the kernel, I suspect the simplest solution is to remove the "+m" that causes register pressure problems, and then use "asm volatile" to work around the const-function bug. And add a large comment about why "asm volatile" is probably always a good idea when you have a memory clobber and don't have any other visible memory modifications. I do wonder if this explains some of the problems we had with the bitop asms too. Hmm? Linus