Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, Feb 15, 2014 at 9:30 AM, Torvald Riegel wrote:
>
> I think the example is easy to misunderstand, because the context isn't
> clear. Therefore, let me first try to clarify the background.
>
> (1) The abstract machine does not write speculatively.
> (2) Emitting a branch instruction and executing a branch at runtime is
> not part of the specified behavior of the abstract machine. Of course,
> the abstract machine performs conditional execution, but that just
> specifies the output / side effects that it must produce (e.g., volatile
> stores) -- not with which hardware instructions it is producing this.
> (3) A compiled program must produce the same output as if executed by
> the abstract machine.

Ok, I'm fine with that.

> Thus, we need to be careful what "speculative store" is meant to refer
> to. A few examples:
>
>   if (atomic_load(&x, mo_relaxed) == 1)
>     atomic_store(&y, 3, mo_relaxed);

No, please don't use this idiotic example. It is wrong.

The fact is, if a compiler generates anything but the obvious sequence
(read/cmp/branch/store - where branch/store might obviously be done
with some other machine conditional like a predicate), the compiler is
wrong. Anybody who argues anything else is wrong, or confused, or
confusing.

Instead, argue about *other* sequences where the compiler can do
something. For example, this sequence:

   atomic_store(&x, a, mo_relaxed);
   b = atomic_load(&x, mo_relaxed);

can validly be transformed to

   atomic_store(&x, a, mo_relaxed);
   b = (typeof(x)) a;

and I think everybody agrees about that. In fact, that optimization
can be done even for mo_strict.

But even that "obvious" optimization has subtle cases. What if the
store is relaxed, but the load is strict? You can't do the
optimization without a lot of thought, because dropping the strict
load would drop an ordering point. So even the "store followed by
exact same load" case has subtle issues.
With similar caveats, it is perfectly valid to merge two consecutive
loads, and to merge two consecutive stores.

Now that means that the sequence

   atomic_store(&x, 1, mo_relaxed);
   if (atomic_load(&x, mo_relaxed) == 1)
     atomic_store(&y, 3, mo_relaxed);

can first be optimized to

   atomic_store(&x, 1, mo_relaxed);
   if (1 == 1)
     atomic_store(&y, 3, mo_relaxed);

and then you get the end result that you wanted in the first place
(including the ability to re-order the two stores due to the relaxed
ordering, assuming they can be proven to not alias - and please don't
use the idiotic type-based aliasing rules).

Bringing up your first example is pure and utter confusion. Don't do
it. Instead, show what are obvious and valid transformations, and then
you can bring up these kinds of combinations as "look, this is
obviously also correct".

Now, the reason I want to make it clear that the code example you
point to is a crap example is that because it doesn't talk about the
*real* issue, it also misses a lot of really important details.

For example, when you say "if the compiler can prove that the
conditional is always true" then YOU ARE TOTALLY WRONG.

So why do I say you are wrong, after I just gave you an example of how
it happens? Because my example went back to the *real* issue, and
there are actual real semantically meaningful details with doing
things like load merging.

To give an example, let's rewrite things a bit more to use an extra
variable:

   atomic_store(&x, 1, mo_relaxed);
   a = atomic_load(&x, mo_relaxed);
   if (a == 1)
     atomic_store(&y, 3, mo_relaxed);

which looks exactly the same. If you now say "if you can prove the
conditional is always true, you can make the store unconditional", YOU
ARE WRONG.

Why?

This sequence:

   atomic_store(&x, 1, mo_relaxed);
   a = atomic_load(&x, mo_relaxed);
   atomic_store(&y, 3, mo_relaxed);

is actually - and very seriously - buggy.

Why?
Because you have effectively split the atomic_load into two loads - one
for the value of 'a', and one for your "proof" that the store is
unconditional.

Maybe you go "Nobody sane would do that", and you'd be right. But
compilers aren't "sane" (and compiler _writers_ I have some serious
doubts about), and how you describe the problem really does affect the
solution.

My description ("you can combine two subsequent atomic accesses, if
you are careful about still having the same ordering points") doesn't
have the confusion. It makes it clear that no, you can't speculate
writes, but yes, obviously you can combine certain things (think
"write buffers" and "memory caches"), and that may make you able to
remove the conditional. Which is very different from speculating
writes.

But my description also makes it clear that the transformation above
was buggy, but that rewriting it as

   a = 1;
   atomic_store(&y, 3, mo_relaxed);
   atomic_store(&x, a, mo_relaxed);

is fine (doing the re-ordering here just to make a point).

So I suspect we actually agree, I just think that the "example" that
has been floating around, and the
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel wrote:
>
> I think a major benefit of C11's memory model is that it gives a
> *precise* specification for how a compiler is allowed to optimize.

Clearly it does *not*. This whole discussion is proof of that. It's
not at all clear, and the standard apparently is at least debatably
allowing things that shouldn't be allowed. It's also a whole lot more
complicated than "volatile", so the likelihood of a compiler writer
actually getting it right - even if the standard does - is lower.

They've gotten "volatile" wrong too, after all (particularly in C++).

            Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 18:44 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney wrote:
> >
> > One way of looking at the discussion between Torvald and myself would be
> > as a seller (Torvald) and a buyer (me) haggling over the fine print in
> > a proposed contract (the standard). Whether that makes you feel better
> > or worse about the situation I cannot say. ;-)
>
> Oh, I'm perfectly fine with that. But we're not desperate to buy, and
> I actually think the C11 people are - or at least should be - *way*
> more desperate to sell their model to us than we are to take it.
>
> Which means that as a buyer you should say "this is what we want, if
> you don't give us this, we'll just walk away". Not try to see how much
> we can pay for it. Because there is very little upside for us, and
> _unless_ the C11 standard gets things right it's just extra complexity
> for us, coupled with new compiler fragility and years of code
> generation bugs.

I think there is an upside for you, mainly in that it allows compiler
testing tools, potentially verification tools for atomics, and tools
like cppmem that show allowed executions of code. I agree that the
Linux community has been working well without this, and it's big
enough to make running its own show viable. This will be different
for smaller projects, though.

> Why would we want that extra complexity and inevitable compiler bugs?
> If we then have to fight compiler writers that point to the standard
> and say "..but look, the standard says we can do this", then at that
> point it went from "extra complexity and compiler bugs" to a whole
> 'nother level of frustration and pain.

I see your point, but the flip side of the coin is that if you get the
standard to say what you want, then you can tell the compiler writers
to look at the standard. Or show them bugs revealed by fuzz testing
and such.

> So just walk away unless the C11 standard gives us exactly what we
> want.
> Not "something kind of like what we'd use". EXACTLY. Because I'm
> not in the least interested in fighting compiler people that have a
> crappy standard they can point to. Been there, done that, got the
> T-shirt and learnt my lesson.
>
> And the thing is, I suspect that the Linux kernel is the most complete
> - and most serious - user of true atomics that the C11 people can sell
> their solution to.

I agree, but there are likely also other big projects that could make
use of C11 atomics on the userspace side (e.g., certain databases, ...).

> If we don't buy it, they have no serious user.

I disagree with that. That obviously depends on one's definition of
"serious", but if you combine all C/C++ programs that use low-level
atomics, then this is serious use as well. There's lots of
shared-memory synchronization in userspace as well.

> Sure, they'll have lots
> of random other one-off users for their atomics, where each user wants
> one particular thing, but I suspect that we'll have the only really
> unified portable code base

glibc is a counterexample that comes to mind, although it's a smaller
code base. (It's currently not using C11 atomics, but transitioning
there makes sense, and is something I want to get to eventually.)

> that handles pretty much *all* the serious
> odd cases that the C11 atomics can actually talk about to each other.

You certainly have lots of odd cases, but I would disagree with the
assumption that only the Linux kernel will do full "testing" of the
implementations. If you have plenty of userspace programs using the
atomics, that's a pretty big test suite, and one that should help
bring the compilers up to speed. So that might be a benefit even to
the Linux kernel if it would use the C11 atomics.
> Oh, they'll push things through with or without us, and it will be a
> collection of random stuff, where they tried to please everybody, with
> particularly compiler/architecture people who have no f*cking clue
> about how their stuff is used pushing to make it easy/efficient for
> their particular compiler/architecture.

I'll ignore this... :)

> But we have real optimized uses of pretty much all relevant cases that
> people actually care about.

You certainly cover a lot of cases. Finding out whether you cover all
that "people care about" would require you to actually ask all people,
which I'm sure you've done ;)

> We can walk away from them, and not really lose anything but a small
> convenience (and it's a convenience *only* if the standard gets things
> right).
>
> And conversely, the C11 people can walk away from us too. But if they
> can't make us happy (and by "make us happy", I really mean no stupid
> games on our part) I personally think they'll have a stronger
> standard, and a real use case, and real arguments. I'm assuming they
> want that.

I agree.

> That's why I complain when you talk about things like marking control
> dependencies explicitly. That's *us* bending over backwards. And as a
> buyer, we have absolutely
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 12:02 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds wrote:
> >
> > Why are we still discussing this idiocy? It's irrelevant. If the
> > standard really allows random store speculation, the standard doesn't
> > matter, and sane people shouldn't waste their time arguing about it.
>
> Btw, the other part of this coin is that our manual types (using
> volatile and various architecture-specific stuff) and our manual
> barriers and inline asm accesses are generally *fine*.

AFAICT, it does work for you, but hasn't been exactly pain-free. I
think a major benefit of C11's memory model is that it gives a
*precise* specification for how a compiler is allowed to optimize.
There is a formalization of the model, which allows things like the
cppmem tool by the Cambridge group. It also allows meaningful fuzz
testing: http://www.di.ens.fr/~zappa/projects/cmmtest/ ; this did
reveal several GCC compiler bugs. I also think that reasoning about
this model is easier than reasoning about how lots of different,
concrete compiler optimizations would interact.

> The C11 stuff doesn't buy us anything. The argument that "new
> architectures might want to use it" is pure and utter bollocks, since
> unless the standard gets the thing *right*, nobody sane would ever use
> it for some new architecture, when the sane thing to do is to just
> fill in the normal barriers and inline asms.
>
> So I'm very very serious: either the compiler and the standard gets
> things right, or we don't use it. There is no middle ground where "we
> might use it for one or two architectures and add random hints".
> That's just stupid.
>
> The only "middle ground" is about which compiler version we end up
> trusting _if_ it turns out that the compiler and standard do get
> things right.
> From Torvald's explanations (once I don't mis-read them
> ;), my take-away so far has actually been that the standard *does* get
> things right, but I do know from over-long personal experience that
> compiler people sometimes want to be legalistic and twist the
> documentation to the breaking point, at which point we just go "we'd
> be crazy to use that".

I agree that compilers want to optimize, and sometimes there's
probably a little too much emphasis on applying an optimization vs.
not surprising users. But we have to draw a line (e.g., what is
undefined behavior and what is not), because we need this to be
actually able to optimize. Therefore, we need to get the rules into a
shape that both allows optimizations and isn't full of surprising
corner cases.

The rules are the standard, so it's the standard we have to get right.
According to my experience, a lot of thought goes into how to design
the standard's language and library so that they are intuitive yet
efficient. If you see issues in the standard, please bring them up.
Either report the defects directly and get involved yourself, or reach
out to somebody that is participating in the standards process. The
standard certainly isn't perfect, so there is room to contribute.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 11:50 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney wrote:
> >
> > Linus, Peter, any objections to marking places where we are relying on
> > ordering from control dependencies against later stores? This approach
> > seems to me to have significant documentation benefits.
>
> Quite frankly, I think it's stupid, and the "documentation" is not a
> benefit, it's just wrong.

I think the example is easy to misunderstand, because the context isn't
clear. Therefore, let me first try to clarify the background.

(1) The abstract machine does not write speculatively.
(2) Emitting a branch instruction and executing a branch at runtime is
not part of the specified behavior of the abstract machine. Of course,
the abstract machine performs conditional execution, but that just
specifies the output / side effects that it must produce (e.g., volatile
stores) -- not with which hardware instructions it is producing this.
(3) A compiled program must produce the same output as if executed by
the abstract machine.

Thus, we need to be careful what "speculative store" is meant to refer
to. A few examples:

  if (atomic_load(&x, mo_relaxed) == 1)
    atomic_store(&y, 3, mo_relaxed);

Here, the requirement is that in terms of program logic, y is assigned 3
if x equals 1. It's not specified how an implementation does that.

* If the compiler can prove that x is always 1, then it can remove the
  branch. This is because of (2). Because of the proof, (1) is not
  violated.
* If the compiler can prove that the store to y is never observed or
  does not change the program's output, the store can be removed.

  if (atomic_load(&x, mo_relaxed) == 1) {
    atomic_store(&y, 3, mo_relaxed);
    other_a();
  } else {
    atomic_store(&y, 3, mo_relaxed);
    other_b();
  }

Here, y will be assigned to regardless of the value of x.

* The compiler can hoist the store out of the two branches.
  This is because the store and the branch instruction aren't
  observable outcomes of the abstract machine.
* The compiler can even move the store to y before the load from x
  (unless this affects logical program order of this thread in some
  way.) This is because the load/store are ordered by sequenced-before
  (intra-thread), but mo_relaxed allows the hardware to reorder, so the
  compiler can do it as well (IOW, other threads can't expect a
  particular order).

  if (atomic_load(&x, mo_acquire) == 1)
    atomic_store(&y, 3, mo_relaxed);

This is similar to the first case, but with stronger memory order.

* If the compiler proves that x is always 1, then it does so by showing
  that the load will always be able to read from a particular store (or
  several of them) that (all) assigned 1 to x -- as specified by the
  abstract machine and taking the forward progress guarantees into
  account. In general, it still has to establish the synchronized-with
  edge if any of those stores used mo_release (or other fences resulting
  in the same situation), so it can't just get rid of the acquire
  "fence" in this case. (There are probably situations in which this can
  be done, but I can't characterize them easily at the moment.)

These examples all rely on the abstract machine as specified in the
current standard. In contrast, the example that Paul (and Peter, I
assume) were looking at is not currently modeled by the standard.
AFAIU, they want to exploit that control dependencies, when encountered
in binary code, can result in the hardware giving certain ordering
guarantees.

This is vaguely similar to mo_consume, which is about data dependencies.
mo_consume is, partially due to how it's specified, pretty hard to
implement for compilers in a way that actually exploits and preserves
data dependencies and doesn't just substitute mo_consume for a stronger
memory order.
Part of this problem is that the standard takes an opt-out approach
regarding the code that should track dependencies (e.g., certain
operators are specified as not preserving them), instead of cleanly
carving out meaningful operators where one can track dependencies
without obstructing generally useful compiler optimizations (i.e.,
"opt-in"). This leads to cases such as that in "*(p + f - f)", the
compiler either has to keep f - f or emit a stronger fence if f
originates from a mo_consume load. Furthermore, dependencies are
supposed to be tracked across any load and store, so the compiler needs
to do points-to analysis if it wants to optimize this as much as
possible.

Paul and I have been thinking about alternatives, and one of them was
doing the opt-in by demarcating code that needs explicit dependency
tracking because it wants to exploit mo_consume.

Back to HW control dependencies, this led to the idea of marking the
"control dependencies" in the source code (i.e., on the abstract
machine level) that need to be preserved in the generated binary code,
even if they have no semantic meaning on the abstract machine level.

So, this is something extra that isn't modeled in the standard currently,
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 11:50 -0800, Linus Torvalds wrote: On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Linus, Peter, any objections to marking places where we are relying on ordering from control dependencies against later stores? This approach seems to me to have significant documentation benefits. Quite frankly, I think it's stupid, and the documentation is not a benefit, it's just wrong. I think the example is easy to misunderstand, because the context isn't clear. Therefore, let me first try to clarify the background. (1) The abstract machine does not write speculatively. (2) Emitting a branch instruction and executing a branch at runtime is not part of the specified behavior of the abstract machine. Of course, the abstract machine performs conditional execution, but that just specifies the output / side effects that it must produce (e.g., volatile stores) -- not with which hardware instructions it is producing this. (3) A compiled program must produce the same output as if executed by the abstract machine. Thus, we need to be careful what speculative store is meant to refer to. A few examples: if (atomic_load(x, mo_relaxed) == 1) atomic_store(y, 3, mo_relaxed)); Here, the requirement is that in terms of program logic, y is assigned 3 if x equals 1. It's not specified how an implementation does that. * If the compiler can prove that x is always 1, then it can remove the branch. This is because of (2). Because of the proof, (1) is not violated. * If the compiler can prove that the store to y is never observed or does not change the program's output, the store can be removed. if (atomic_load(x, mo_relaxed) == 1) { atomic_store(y, 3, mo_relaxed)); other_a(); } else { atomic_store(y, 3, mo_relaxed)); other_b(); } Here, y will be assigned to regardless of the value of x. * The compiler can hoist the store out of the two branches. 
This is because the store and the branch instruction aren't observable outcomes of the abstract machine. * The compiler can even move the store to y before the load from x (unless this affects logical program order of this thread in some way.) This is because the load/store are ordered by sequenced-before (intra-thread), but mo_relaxed allows the hardware to reorder, so the compiler can do it as well (IOW, other threads can't expect a particular order). if (atomic_load(x, mo_acquire) == 1) atomic_store(y, 3, mo_relaxed)); This is similar to the first case, but with stronger memory order. * If the compiler proves that x is always 1, then it does so by showing that the load will always be able to read from a particular store (or several of them) that (all) assigned 1 to x -- as specified by the abstract machine and taking the forward progress guarantees into account. In general, it still has to establish the synchronized-with edge if any of those stores used release_mo (or other fences resulting in the same situation), so it can't just get rid of the acquire fence in this case. (There are probably situations in which this can be done, but I can't characterize them easily at the moment.) These examples all rely on the abstract machine as specified in the current standard. In contrast, the example that Paul (and Peter, I assume) where looking at is not currently modeled by the standard. AFAIU, they want to exploit that control dependencies, when encountered in binary code, can result in the hardware giving certain ordering guarantees. This is vaguely similar to mo_consume which is about data dependencies. mo_consume is, partially due to how it's specified, pretty hard to implement for compilers in a way that actually exploits and preserves data dependencies and not just substitutes mo_consume for a stronger memory order. 
Part of this problem is that the standard takes an opt-out approach regarding the code that should track dependencies (e.g., certain operators are specified as not preserving them), instead of cleanly carving out meaningful operators where one can track dependencies without obstructing generally useful compiler optimizations (i.e., opt-in). This leads to cases such as that in *(p + f - f), the compiler either has to keep f - f or emit a stronger fence if f is originating from a mo_consume load. Furthermore, dependencies are supposed to be tracked across any load and store, so the compiler needs to do points-to if it wants to optimize this as much as possible. Paul and I have been thinking about alternatives, and one of them was doing the opt-in by demarcating code that needs explicit dependency tracking because it wants to exploit mo_consume. Back to HW control dependencies, this lead to the idea of marking the control dependencies in the source code (ie, on the abstract machine level), that need to be preserved in the generated binary code, even if they have no semantic meaning on the abstract machine level. So, this is something extra that isn't modeled in the standard
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 12:02 -0800, Linus Torvalds wrote: On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds torva...@linux-foundation.org wrote: Why are we still discussing this idiocy? It's irrelevant. If the standard really allows random store speculation, the standard doesn't matter, and sane people shouldn't waste their time arguing about it. Btw, the other part of this coin is that our manual types (using volatile and various architecture-specific stuff) and our manual barriers and inline asm accesses are generally *fine*. AFAICT, it does work for you, but hasn't been exactly pain-free. I think a major benefit of C11's memory model is that it gives a *precise* specification for how a compiler is allowed to optimize. There is a formalization of the model, which allows things like the cppmem tool by the Cambridge group. It also allows meaningful fuzz testing: http://www.di.ens.fr/~zappa/projects/cmmtest/ ; this did reveal several GCC compiler bugs. I also think that reasoning about this model is easier than reasoning about how lots of different, concrete compiler optimizations would interact. The C11 stuff doesn't buy us anything. The argument that new architectures might want to use it is prue and utter bollocks, since unless the standard gets the thing *right*, nobody sane would ever use it for some new architecture, when the sane thing to do is to just fill in the normal barriers and inline asms. So I'm very very serious: either the compiler and the standard gets things right, or we don't use it. There is no middle ground where we might use it for one or two architectures and add random hints. That's just stupid. The only middle ground is about which compiler version we end up trusting _if_ it turns out that the compiler and standard do get things right. 
From Torvald's explanations (once I don't mis-read them ;), my take-away so far has actually been that the standard *does* get things right, but I do know from over-long personal experience that compiler people sometimes want to be legalistic and twist the documentation to the breaking point, at which point we just go we'd be crazy do use that. I agree that compilers want to optimize, and sometimes there's probably a little too much emphasis on applying an optimization vs. not surprising users. But we have to draw a line (e.g., what is undefined behavior and what is not), because we need this to be actually able to optimize. Therefore, we need to get the rules into a shape that both allows optimizations and isn't full of surprising corner cases. The rules are the standard, so it's the standard we have to get right. According to my experience, a lot of thought goes into how to design the standard's language and library so that they are intuitive yet efficient. If you see issues in the standard, please bring them up. Either report the defects directly and get involved yourself, or reach out to somebody that is participating in the standards process. The standard certainly isn't perfect, so there is room to contribute. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 18:44 -0800, Linus Torvalds wrote: On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: One way of looking at the discussion between Torvald and myself would be as a seller (Torvald) and a buyer (me) haggling over the fine print in a proposed contract (the standard). Whether that makes you feel better or worse about the situation I cannot say. ;-) Oh, I'm perfectly fine with that. But we're not desperate to buy, and I actually think the C11 people are - or at least should be - *way* more desperate to sell their model to us than we are to take it. Which means that as a buyer you should say this is what we want, if you don't give us this, we'll just walk away. Not try to see how much we can pay for it. Because there is very little upside for us, and _unless_ the C11 standard gets things right it's just extra complexity for us, coupled with new compiler fragility and years of code generation bugs. I think there is an upside to you, mainly in that it allows compiler testing tools, potentially verification tools for atomics, and tools like cppmem that show allowed executions of code. I agree that the Linux community has been working well without this, and it's big enough to make running it's own show viable. This will be different for smaller projects, though. Why would we want that extra complexity and inevitable compiler bugs? If we then have to fight compiler writers that point to the standard and say ..but look, the standard says we can do this, then at that point it went from extra complexity and compiler bugs to a whole 'nother level of frustration and pain. I see your point, but flip side of the coin is that if you get the standard to say what you want, then you can tell the compiler writers to look at the standard. Or show them bugs revealed by fuzz testing and such. So just walk away unless the C11 standard gives us exactly what we want. Not something kind of like what we'd use. EXACTLY. 
Because I'm not in the least interested in fighting compiler people that have a crappy standard they can point to. Been there, done that, got the T-shirt and learnt my lesson. And the thing is, I suspect that the Linux kernel is the most complete - and most serious - user of true atomics that the C11 people can sell their solution to. I agree, but there are likely also other big projects that could make use of C11 atomics on the userspace side (e.g., certain databases, ...). If we don't buy it, they have no serious user. I disagree with that. That obviously depends on one's definition of serious, but if you combine all C/C++ programs that use low-level atomics, then this is serious use as well. There's lots of shared-memory synchronization in userspace as well. Sure, they'll have lots of random other one-off users for their atomics, where each user wants one particular thing, but I suspect that we'll have the only really unified portable code base glibc is a counterexample that comes to mind, although it's a smaller code base. (It's currently not using C11 atomics, but transitioning there makes sense, and some thing I want to get to eventually.) that handles pretty much *all* the serious odd cases that the C11 atomics can actually talk about to each other. You certainly have lots of odd cases, but I would disagree with the assumption that only the Linux kernel will do full testing of the implementations. If you have plenty of userspace programs using the atomics, that's a pretty big test suite, and one that should help bring the compilers up to speed. So that might be a benefit even to the Linux kernel if it would use the C11 atomics. Oh, they'll push things through with or without us, and it will be a collection of random stuff, where they tried to please everybody, with particularly compiler/architecture people who have no f*cking clue about how their stuff is used pushing to make it easy/efficient for their particular compiler/architecture. I'll ignore this... 
:) But we have real optimized uses of pretty much all relevant cases that people actually care about. You certainly cover a lot of cases. Finding out whether you cover all that people care about would require you to actually ask all people, which I'm sure you've done ;) We can walk away from them, and not really lose anything but a small convenience (and it's a convenience *only* if the standard gets things right). And conversely, the C11 people can walk away from us too. But if they can't make us happy (and by make us happy, I really mean no stupid games on our part) I personally think they'll have a stronger standard, and a real use case, and real arguments. I'm assuming they want that. I agree. That's why I complain when you talk about things like marking control dependencies explicitly. That's *us* bending over backwards. And as a buyer, we have absolutely zero reason to do that. As I understood the
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel trie...@redhat.com wrote:
>
> I think a major benefit of C11's memory model is that it gives a
> *precise* specification for how a compiler is allowed to optimize.

Clearly it does *not*. This whole discussion is proof of that. It's not at all clear, and the standard apparently is at least debatably allowing things that shouldn't be allowed. It's also a whole lot more complicated than volatile, so the likelihood of a compiler writer actually getting it right - even if the standard does - is lower. They've gotten volatile wrong too, after all (particularly in C++). Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sat, Feb 15, 2014 at 9:30 AM, Torvald Riegel trie...@redhat.com wrote:
> I think the example is easy to misunderstand, because the context isn't
> clear. Therefore, let me first try to clarify the background.
>
> (1) The abstract machine does not write speculatively.
> (2) Emitting a branch instruction and executing a branch at runtime is
> not part of the specified behavior of the abstract machine. Of course,
> the abstract machine performs conditional execution, but that just
> specifies the output / side effects that it must produce (e.g., volatile
> stores) -- not with which hardware instructions it is producing this.
> (3) A compiled program must produce the same output as if executed by
> the abstract machine.

Ok, I'm fine with that.

> Thus, we need to be careful what "speculative store" is meant to refer
> to. A few examples:
>
>     if (atomic_load(x, mo_relaxed) == 1)
>         atomic_store(y, 3, mo_relaxed);

No, please don't use this idiotic example. It is wrong. The fact is, if a compiler generates anything but the obvious sequence (read/cmp/branch/store - where branch/store might obviously be done with some other machine conditional like a predicate), the compiler is wrong. Anybody who argues anything else is wrong, or confused, or confusing. Instead, argue about *other* sequences where the compiler can do something. For example, this sequence:

    atomic_store(x, a, mo_relaxed);
    b = atomic_load(x, mo_relaxed);

can validly be transformed to

    atomic_store(x, a, mo_relaxed);
    b = (typeof(x)) a;

and I think everybody agrees about that. In fact, that optimization can be done even for mo_strict. But even that "obvious" optimization has subtle cases. What if the store is relaxed, but the load is strict? You can't do the optimization without a lot of thought, because dropping the strict load would drop an ordering point. So even the "store followed by exact same load" case has subtle issues. With similar caveats, it is perfectly valid to merge two consecutive loads, and to merge two consecutive stores.
Now that means that the sequence

    atomic_store(x, 1, mo_relaxed);
    if (atomic_load(x, mo_relaxed) == 1)
        atomic_store(y, 3, mo_relaxed);

can first be optimized to

    atomic_store(x, 1, mo_relaxed);
    if (1 == 1)
        atomic_store(y, 3, mo_relaxed);

and then you get the end result that you wanted in the first place (including the ability to re-order the two stores due to the relaxed ordering, assuming they can be proven to not alias - and please don't use the idiotic type-based aliasing rules). Bringing up your first example is pure and utter confusion. Don't do it. Instead, show what are obvious and valid transformations, and then you can bring up these kinds of combinations as "look, this is obviously also correct". Now, the reason I want to make it clear that the code example you point to is a crap example is that because it doesn't talk about the *real* issue, it also misses a lot of really important details. For example, when you say "if the compiler can prove that the conditional is always true" then YOU ARE TOTALLY WRONG. So why do I say you are wrong, after I just gave you an example of how it happens? Because my example went back to the *real* issue, and there are actual real semantically meaningful details with doing things like load merging. To give an example, let's rewrite things a bit more to use an extra variable:

    atomic_store(x, 1, mo_relaxed);
    a = atomic_load(x, mo_relaxed);
    if (a == 1)
        atomic_store(y, 3, mo_relaxed);

which looks exactly the same. If you now say "if you can prove the conditional is always true, you can make the store unconditional", YOU ARE WRONG. Why? This sequence:

    atomic_store(x, 1, mo_relaxed);
    a = atomic_load(x, mo_relaxed);
    atomic_store(y, 3, mo_relaxed);

is actually - and very seriously - buggy. Why? Because you have effectively split the atomic_load into two loads - one for the value of 'a', and one for your "proof" that the store is unconditional. Maybe you go "Nobody sane would do that", and you'd be right.
But compilers aren't sane (and compiler _writers_ I have some serious doubts about), and how you describe the problem really does affect the solution. My description ("you can combine two subsequent atomic accesses, if you are careful about still having the same ordering points") doesn't have the confusion. It makes it clear that no, you can't speculate writes, but yes, obviously you can combine certain things (think write buffers and memory caches), and that may make you able to remove the conditional. Which is very different from speculating writes. But my description also makes it clear that the transformation above was buggy, but that rewriting it as

    a = 1;
    atomic_store(y, 3, mo_relaxed);
    atomic_store(x, a, mo_relaxed);

is fine (doing the re-ordering here just to make a point). So I suspect we actually agree, I just think that the example that has been floating around, and the
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 10:35:44PM -0800, Paul E. McKenney wrote: > On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote: > > On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds > > wrote: > > > > > > And conversely, the C11 people can walk away from us too. But if they > > > can't make us happy (and by "make us happy", I really mean no stupid > > > games on our part) I personally think they'll have a stronger > > > standard, and a real use case, and real arguments. I'm assuming they > > > want that. > > > > I should have somebody who proof-reads my emails before I send them out. > > > > I obviously meant "if they *can* make us happy" (not "can't"). > > Understood. My next step is to take a more detailed look at the piece > of the standard that should support RCU. Depending on how that turns > out, I might look at other parts of the standard vs. Linux's atomics > and memory-ordering needs. Should be interesting. ;-) And perhaps a better way to represent the roles is that I am not the buyer, but rather the purchasing agent for the -potential- buyer. -You- are of course the potential buyer. If I were to see myself as the buyer, then I must confess that the concerns you implicitly expressed in your prior email would be all too well-founded! Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote: > On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds > wrote: > > > > And conversely, the C11 people can walk away from us too. But if they > > can't make us happy (and by "make us happy", I really mean no stupid > > games on our part) I personally think they'll have a stronger > > standard, and a real use case, and real arguments. I'm assuming they > > want that. > > I should have somebody who proof-reads my emails before I send them out. > > I obviously meant "if they *can* make us happy" (not "can't"). Understood. My next step is to take a more detailed look at the piece of the standard that should support RCU. Depending on how that turns out, I might look at other parts of the standard vs. Linux's atomics and memory-ordering needs. Should be interesting. ;-) Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds wrote: > > And conversely, the C11 people can walk away from us too. But if they > can't make us happy (and by "make us happy", I really mean no stupid > games on our part) I personally think they'll have a stronger > standard, and a real use case, and real arguments. I'm assuming they > want that. I should have somebody who proof-reads my emails before I send them out. I obviously meant "if they *can* make us happy" (not "can't"). Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney wrote: > > One way of looking at the discussion between Torvald and myself would be > as a seller (Torvald) and a buyer (me) haggling over the fine print in > a proposed contract (the standard). Whether that makes you feel better > or worse about the situation I cannot say. ;-) Oh, I'm perfectly fine with that. But we're not desperate to buy, and I actually think the C11 people are - or at least should be - *way* more desperate to sell their model to us than we are to take it. Which means that as a buyer you should say "this is what we want, if you don't give us this, we'll just walk away". Not try to see how much we can pay for it. Because there is very little upside for us, and _unless_ the C11 standard gets things right it's just extra complexity for us, coupled with new compiler fragility and years of code generation bugs. Why would we want that extra complexity and inevitable compiler bugs? If we then have to fight compiler writers that point to the standard and say "..but look, the standard says we can do this", then at that point it went from "extra complexity and compiler bugs" to a whole 'nother level of frustration and pain. So just walk away unless the C11 standard gives us exactly what we want. Not "something kind of like what we'd use". EXACTLY. Because I'm not in the least interested in fighting compiler people that have a crappy standard they can point to. Been there, done that, got the T-shirt and learnt my lesson. And the thing is, I suspect that the Linux kernel is the most complete - and most serious - user of true atomics that the C11 people can sell their solution to. If we don't buy it, they have no serious user. 
Sure, they'll have lots of random other one-off users for their atomics, where each user wants one particular thing, but I suspect that we'll have the only really unified portable code base that handles pretty much *all* the serious odd cases that the C11 atomics can actually talk about to each other. Oh, they'll push things through with or without us, and it will be a collection of random stuff, where they tried to please everybody, with particularly compiler/architecture people who have no f*cking clue about how their stuff is used pushing to make it easy/efficient for their particular compiler/architecture. But we have real optimized uses of pretty much all relevant cases that people actually care about. We can walk away from them, and not really lose anything but a small convenience (and it's a convenience *only* if the standard gets things right). And conversely, the C11 people can walk away from us too. But if they can't make us happy (and by "make us happy", I really mean no stupid games on our part) I personally think they'll have a stronger standard, and a real use case, and real arguments. I'm assuming they want that. That's why I complain when you talk about things like marking control dependencies explicitly. That's *us* bending over backwards. And as a buyer, we have absolutely zero reason to do that. Tell the C11 people: "no speculative writes". Full stop. End of story. Because we're not buying anything else. Similarly, if we need to mark atomics "volatile", then now the C11 atomics are no longer even a "small convenience", now they are just extra complexity wrt what we already have. So just make it clear that if the C11 standard needs to mark atomics volatile in order to get non-speculative and non-reloading behavior, then the C11 atomics are useless to us, and we're not buying. 
Remember: a compiler can *always* do "as if" optimizations - if a compiler writer can prove that the end result acts 100% the same using an optimized sequence, then they can do whatever the hell they want. That's not the issue. But if we can *ever* see semantic impact of speculative writes, the compiler is buggy, and the compiler writers need to be aware that it is buggy. No ifs, buts, maybes about it. So I'm perfectly fine with you seeing yourself as a buyer. But I want you to be a really *picky* and anal buyer - one that knows he has the upper hand, and can walk away with no downside. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 12:02:23PM -0800, Linus Torvalds wrote: > On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds > wrote: > > > > Why are we still discussing this idiocy? It's irrelevant. If the > > standard really allows random store speculation, the standard doesn't > > matter, and sane people shouldn't waste their time arguing about it. > > Btw, the other part of this coin is that our manual types (using > volatile and various architecture-specific stuff) and our manual > barriers and inline asm accesses are generally *fine*. > > The C11 stuff doesn't buy us anything. The argument that "new > architectures might want to use it" is pure and utter bollocks, since > unless the standard gets the thing *right*, nobody sane would ever use > it for some new architecture, when the sane thing to do is to just > fill in the normal barriers and inline asms. > > So I'm very very serious: either the compiler and the standard gets > things right, or we don't use it. There is no middle ground where "we > might use it for one or two architectures and add random hints". > That's just stupid. > > The only "middle ground" is about which compiler version we end up > trusting _if_ it turns out that the compiler and standard do get > things right. From Torvald's explanations (once I don't mis-read them > ;), my take-away so far has actually been that the standard *does* get > things right, but I do know from over-long personal experience that > compiler people sometimes want to be legalistic and twist the > documentation to the breaking point, at which point we just go "we'd > be crazy to use that". > > See our use of "-fno-strict-aliasing", for example. The C standard > aliasing rules are a mistake, stupid, and wrong, and gcc uses those > stupid type-based alias rules even when statically *proving* the > aliasing gives the opposite result. End result: we turn the shit off. > > Exact same deal wrt atomics. We are *not* going to add crazy "this > here is a control dependency" crap. 
> There's a test, the compiler *sees* the control dependency for
> chrissake, and it still generates crap, we turn that broken
> "optimization" off. It really is that simple.

From what I can see at the moment, the standard -generally- avoids speculative stores, but there are a few corner cases where it might allow them. I will be working with the committee to see exactly what the situation is. Might be that I am confused and that everything really is OK, might be that I am right but the corner cases are things that no sane kernel developer would do anyway, it might be that the standard needs a bit of repair, or it might be that the corner cases are somehow inherent and problematic (but I hope not!). I will let you know what I find, but it will probably be a few months. In the meantime, agreed, we keep doing what we have been doing. And maybe in the long term as well, for that matter. One way of looking at the discussion between Torvald and myself would be as a seller (Torvald) and a buyer (me) haggling over the fine print in a proposed contract (the standard). Whether that makes you feel better or worse about the situation I cannot say. ;-) Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds wrote: > > Why are we still discussing this idiocy? It's irrelevant. If the > standard really allows random store speculation, the standard doesn't > matter, and sane people shouldn't waste their time arguing about it. Btw, the other part of this coin is that our manual types (using volatile and various architecture-specific stuff) and our manual barriers and inline asm accesses are generally *fine*. The C11 stuff doesn't buy us anything. The argument that "new architectures might want to use it" is pure and utter bollocks, since unless the standard gets the thing *right*, nobody sane would ever use it for some new architecture, when the sane thing to do is to just fill in the normal barriers and inline asms. So I'm very very serious: either the compiler and the standard gets things right, or we don't use it. There is no middle ground where "we might use it for one or two architectures and add random hints". That's just stupid. The only "middle ground" is about which compiler version we end up trusting _if_ it turns out that the compiler and standard do get things right. From Torvald's explanations (once I don't mis-read them ;), my take-away so far has actually been that the standard *does* get things right, but I do know from over-long personal experience that compiler people sometimes want to be legalistic and twist the documentation to the breaking point, at which point we just go "we'd be crazy to use that". See our use of "-fno-strict-aliasing", for example. The C standard aliasing rules are a mistake, stupid, and wrong, and gcc uses those stupid type-based alias rules even when statically *proving* the aliasing gives the opposite result. End result: we turn the shit off. Exact same deal wrt atomics. We are *not* going to add crazy "this here is a control dependency" crap. 
There's a test, the compiler *sees* the control dependency for chrissake, and it still generates crap, we turn that broken "optimization" off. It really is that simple. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney wrote: > > Linus, Peter, any objections to marking places where we are relying on > ordering from control dependencies against later stores? This approach > seems to me to have significant documentation benefits. Quite frankly, I think it's stupid, and the "documentation" is not a benefit, it's just wrong. How would you figure out whether your added "documentation" holds true for particular branches but not others? How could you *ever* trust a compiler that makes the dependency meaningful? Again, let's keep this simple and sane: - if a compiler ever generates code where an atomic store movement is "visible" in any way, then that compiler is broken shit. I don't understand why you even argue this. Seriously, Paul, you seem to *want* to think that "broken shit" is acceptable, and that we should then add magic markers to say "now you need to *not* be broken shit". Here's a magic marker for you: DON'T USE THAT BROKEN COMPILER. And if a compiler can *prove* that whatever code movement it does cannot make a difference, then let it do so. No amount of "documentation" should matter. Seriously, this whole discussion has been completely moronic. I don't understand why you even bring shit like this up:

> >     r1 = atomic_load(x, memory_order_control);
> >     if (control_dependency(r1))
> >         atomic_store(y, memory_order_relaxed);

I mean, really? Anybody who writes code like that, or any compiler where that "control_dependency()" marker makes any difference what-so-ever for code generation should just be retroactively aborted. There is absolutely *zero* reason for that "control_dependency()" crap. If you ever find a reason for it, it is either because the compiler is buggy, or because the standard is so shit that we should never *ever* use the atomics. Seriously. This thread has devolved into some kind of "just what kind of idiotic compiler cesspool crap could we accept". Get away from that f*cking mindset. We don't accept *any* crap. 
Why are we still discussing this idiocy? It's irrelevant. If the standard really allows random store speculation, the standard doesn't matter, and sane people shouldn't waste their time arguing about it. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 09:29 -0800, Paul E. McKenney wrote: > On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote: > > On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote: > > [ . . . ] > > > > Another option would be to flag the conditional expression, prohibiting > > > the compiler from optimizing out any conditional branches. Perhaps > > > something like this: > > > > > > r1 = atomic_load(x, memory_order_control); > > > if (control_dependency(r1)) > > > atomic_store(y, memory_order_relaxed); > > > > That's the one I had in mind and talked to you about earlier today. My > > gut feeling is that this is preferable over the other because it "marks" > > the if-statement, so the compiler knows exactly which branches matter. > > I'm not sure one would need the other memory order for that, if indeed > > all you want is relaxed -> branch -> relaxed. But maybe there are > > corner cases (see the weaker-than-relaxed discussion in SG1 today). > > Linus, Peter, any objections to marking places where we are relying on > ordering from control dependencies against later stores? This approach > seems to me to have significant documentation benefits. Let me note that at least as I'm concerned, that's just a quick idea. At least I haven't looked at (1) how to properly specify the semantics of this, (2) whether it has any bad effects on unrelated code, (3) and whether there are pitfalls for compiler implementations. It looks not too bad at first glance, though.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 10:50 +0100, Peter Zijlstra wrote: > On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote: > > That depends on what your goal is. First, I don't know why you quoted that, but without the context, quoting it doesn't make sense. Let me repeat the point. The standard is the rule set for the compiler. Period. The compiler does not just serve the semantics that you might have in your head. It does have to do something meaningful for all of its users. Thus, the goal for the compiler is to properly compile programs in the language as specified. If there is a deficiency in the standard (bug or missing feature) -- and thus the specification, we need to have a patch for the standard that fixes this deficiency. If you think that this is the case, that's where you fix it. If your goal is to do wishful thinking, imagine some kind of semantics in your head, and then assume that magically, implementations will do just that, then that's bound to fail. > A compiler that we don't need to fight in order to generate sane code > would be nice. But as Linus said; we can continue to ignore you lot and > go on as we've done. I don't see why it's so hard to understand that you need to specify semantics, and the place (or at least the base) for that is the standard. Aren't you guys the ones replying "send a patch"? :) This isn't any different. If you're uncomfortable working with the standard, then say so, and reach out to people that aren't. You can surely ignore the specification of the language(s) that you are depending on. But that won't help you. If you want a change, get involved. (Oh, and claiming that the other side doesn't get it doesn't count as getting involved.) There's no fight between people here. It's just a technical problem that we have to solve in the right way. 
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote: > On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote: [ . . . ] > > Another option would be to flag the conditional expression, prohibiting > > the compiler from optimizing out any conditional branches. Perhaps > > something like this: > >
> >     r1 = atomic_load(x, memory_order_control);
> >     if (control_dependency(r1))
> >         atomic_store(y, memory_order_relaxed);
> > That's the one I had in mind and talked to you about earlier today. My > gut feeling is that this is preferable over the other because it "marks" > the if-statement, so the compiler knows exactly which branches matter. > I'm not sure one would need the other memory order for that, if indeed > all you want is relaxed -> branch -> relaxed. But maybe there are > corner cases (see the weaker-than-relaxed discussion in SG1 today). Linus, Peter, any objections to marking places where we are relying on ordering from control dependencies against later stores? This approach seems to me to have significant documentation benefits. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote: > That depends on what your goal is. A compiler that we don't need to fight in order to generate sane code would be nice. But as Linus said; we can continue to ignore you lot and go on as we've done.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote: That depends on what your goal is. A compiler that we don't need to fight in order to generate sane code would be nice. But as Linus said; we can continue to ignore you lot and go on as we've done. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote: On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote: [ . . . ] Another option would be to flag the conditional expression, prohibiting the compiler from optimizing out any conditional branches. Perhaps something like this: r1 = atomic_load(x, memory_order_control); if (control_dependency(r1)) atomic_store(y, memory_order_relaxed); That's the one I had in mind and talked to you about earlier today. My gut feeling is that this is preferably over the other because it marks the if-statement, so the compiler knows exactly which branches matter. I'm not sure one would need the other memory order for that, if indeed all you want is relaxed - branch - relaxed. But maybe there are corner cases (see the weaker-than-relaxed discussion in SG1 today). Linus, Peter, any objections to marking places where we are relying on ordering from control dependencies against later stores? This approach seems to me to have significant documentation benefits. Thanx, Paul -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 10:50 +0100, Peter Zijlstra wrote: On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote: That depends on what your goal is. First, I don't know why you quoted that, but without the context, quoting it doesn't make sense. Let me repeat the point. The standard is the rule set for the compiler. Period. The compiler does not just serve the semantics that you might have in your head. It does have to do something meaningful for all of its users. Thus, the goal for the compiler is to properly compile programs in the language as specified. If there is a deficiency in the standard (bug or missing feature) -- and thus the specification, we need to have a patch for the standard that fixes this deficiency. If you think that this is the case, that's where you fix it. If your goal is to do wishful thinking, imagine some kind of semantics in your head, and then assume that magically, implementations will do just that, then that's bound to fail. A compiler that we don't need to fight in order to generate sane code would be nice. But as Linus said; we can continue to ignore you lot and go on as we've done. I don't see why it's so hard to understand that you need to specify semantics, and the place (or at least the base) for that is the standard. Aren't you guys the ones replying send a patch? :) This isn't any different. If you're uncomfortable working with the standard, then say so, and reach out to people that aren't. You can surely ignore the specification of the language(s) that you are depending on. But that won't help you. If you want a change, get involved. (Oh, and claiming that the other side doesn't get it doesn't count as getting involved.) There's no fight between people here. It's just a technical problem that we have to solve in the right way. 
-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-14 at 09:29 -0800, Paul E. McKenney wrote: On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote: On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote: [ . . . ] Another option would be to flag the conditional expression, prohibiting the compiler from optimizing out any conditional branches. Perhaps something like this: r1 = atomic_load(x, memory_order_control); if (control_dependency(r1)) atomic_store(y, memory_order_relaxed); That's the one I had in mind and talked to you about earlier today. My gut feeling is that this is preferably over the other because it marks the if-statement, so the compiler knows exactly which branches matter. I'm not sure one would need the other memory order for that, if indeed all you want is relaxed - branch - relaxed. But maybe there are corner cases (see the weaker-than-relaxed discussion in SG1 today). Linus, Peter, any objections to marking places where we are relying on ordering from control dependencies against later stores? This approach seems to me to have significant documentation benefits. Let me note that at least as I'm concerned, that's just a quick idea. At least I haven't looked at (1) how to properly specify the semantics of this, (2) whether it has any bad effects on unrelated code, (3) and whether there are pitfalls for compiler implementations. It looks not too bad at first glance, though. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Linus, Peter, any objections to marking places where we are relying on ordering from control dependencies against later stores? This approach seems to me to have significant documentation benefits. Quite frankly, I think it's stupid, and the documentation is not a benefit, it's just wrong. How would you figure out whether your added documentation holds true for particular branches but not others? How could you *ever* trust a compiler that makes the dependency meaningful? Again, let's keep this simple and sane: - if a compiler ever generates code where an atomic store movement is visible in any way, then that compiler is broken shit. I don't understand why you even argue this. Seriously, Paul, you seem to *want* to think that broken shit is acceptable, and that we should then add magic markers to say "now you need to *not* be broken shit". Here's a magic marker for you: DON'T USE THAT BROKEN COMPILER. And if a compiler can *prove* that whatever code movement it does cannot make a difference, then let it do so. No amount of documentation should matter. Seriously, this whole discussion has been completely moronic. I don't understand why you even bring shit like this up: r1 = atomic_load(x, memory_order_control); if (control_dependency(r1)) atomic_store(y, memory_order_relaxed); I mean, really? Anybody who writes code like that, or any compiler where that control_dependency() marker makes any difference what-so-ever for code generation should just be retroactively aborted. There is absolutely *zero* reason for that control_dependency() crap. If you ever find a reason for it, it is either because the compiler is buggy, or because the standard is so shit that we should never *ever* use the atomics. Seriously. This thread has devolved into some kind of "just what kind of idiotic compiler cesspool crap could we accept". Get away from that f*cking mindset. We don't accept *any* crap. 
Why are we still discussing this idiocy? It's irrelevant. If the standard really allows random store speculation, the standard doesn't matter, and sane people shouldn't waste their time arguing about it. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds torva...@linux-foundation.org wrote: Why are we still discussing this idiocy? It's irrelevant. If the standard really allows random store speculation, the standard doesn't matter, and sane people shouldn't waste their time arguing about it. Btw, the other part of this coin is that our manual types (using volatile and various architecture-specific stuff) and our manual barriers and inline asm accesses are generally *fine*. The C11 stuff doesn't buy us anything. The argument that new architectures might want to use it is pure and utter bollocks, since unless the standard gets the thing *right*, nobody sane would ever use it for some new architecture, when the sane thing to do is to just fill in the normal barriers and inline asms. So I'm very very serious: either the compiler and the standard get things right, or we don't use it. There is no middle ground where we might use it for one or two architectures and add random hints. That's just stupid. The only middle ground is about which compiler version we end up trusting _if_ it turns out that the compiler and standard do get things right. From Torvald's explanations (once I don't mis-read them ;), my take-away so far has actually been that the standard *does* get things right, but I do know from over-long personal experience that compiler people sometimes want to be legalistic and twist the documentation to the breaking point, at which point we just go "we'd be crazy to use that". See our use of -fno-strict-aliasing, for example. The C standard aliasing rules are a mistake, stupid, and wrong, and gcc uses those stupid type-based alias rules even when statically *proving* the aliasing gives the opposite result. End result: we turn the shit off. Exact same deal wrt atomics. We are *not* going to add crazy "this here is a control dependency" crap. 
If there's a test, the compiler *sees* the control dependency for chrissake, and if it still generates crap, we turn that broken optimization off. It really is that simple. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 12:02:23PM -0800, Linus Torvalds wrote: On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds torva...@linux-foundation.org wrote: Why are we still discussing this idiocy? It's irrelevant. If the standard really allows random store speculation, the standard doesn't matter, and sane people shouldn't waste their time arguing about it. Btw, the other part of this coin is that our manual types (using volatile and various architecture-specific stuff) and our manual barriers and inline asm accesses are generally *fine*. The C11 stuff doesn't buy us anything. The argument that new architectures might want to use it is pure and utter bollocks, since unless the standard gets the thing *right*, nobody sane would ever use it for some new architecture, when the sane thing to do is to just fill in the normal barriers and inline asms. So I'm very very serious: either the compiler and the standard get things right, or we don't use it. There is no middle ground where we might use it for one or two architectures and add random hints. That's just stupid. The only middle ground is about which compiler version we end up trusting _if_ it turns out that the compiler and standard do get things right. From Torvald's explanations (once I don't mis-read them ;), my take-away so far has actually been that the standard *does* get things right, but I do know from over-long personal experience that compiler people sometimes want to be legalistic and twist the documentation to the breaking point, at which point we just go "we'd be crazy to use that". See our use of -fno-strict-aliasing, for example. The C standard aliasing rules are a mistake, stupid, and wrong, and gcc uses those stupid type-based alias rules even when statically *proving* the aliasing gives the opposite result. End result: we turn the shit off. Exact same deal wrt atomics. We are *not* going to add crazy "this here is a control dependency" crap. 
If there's a test, the compiler *sees* the control dependency for chrissake, and if it still generates crap, we turn that broken optimization off. It really is that simple. From what I can see at the moment, the standard -generally- avoids speculative stores, but there are a few corner cases where it might allow them. I will be working with the committee to see exactly what the situation is. Might be that I am confused and that everything really is OK, might be that I am right but the corner cases are things that no sane kernel developer would do anyway, it might be that the standard needs a bit of repair, or it might be that the corner cases are somehow inherent and problematic (but I hope not!). I will let you know what I find, but it will probably be a few months. In the meantime, agreed, we keep doing what we have been doing. And maybe in the long term as well, for that matter. One way of looking at the discussion between Torvald and myself would be as a seller (Torvald) and a buyer (me) haggling over the fine print in a proposed contract (the standard). Whether that makes you feel better or worse about the situation I cannot say. ;-) Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: One way of looking at the discussion between Torvald and myself would be as a seller (Torvald) and a buyer (me) haggling over the fine print in a proposed contract (the standard). Whether that makes you feel better or worse about the situation I cannot say. ;-) Oh, I'm perfectly fine with that. But we're not desperate to buy, and I actually think the C11 people are - or at least should be - *way* more desperate to sell their model to us than we are to take it. Which means that as a buyer you should say "this is what we want, if you don't give us this, we'll just walk away". Not try to see how much we can pay for it. Because there is very little upside for us, and _unless_ the C11 standard gets things right it's just extra complexity for us, coupled with new compiler fragility and years of code generation bugs. Why would we want that extra complexity and inevitable compiler bugs? If we then have to fight compiler writers that point to the standard and say "but look, the standard says we can do this", then at that point it went from extra complexity and compiler bugs to a whole 'nother level of frustration and pain. So just walk away unless the C11 standard gives us exactly what we want. Not something kind of like what we'd use. EXACTLY. Because I'm not in the least interested in fighting compiler people that have a crappy standard they can point to. Been there, done that, got the T-shirt and learnt my lesson. And the thing is, I suspect that the Linux kernel is the most complete - and most serious - user of true atomics that the C11 people can sell their solution to. If we don't buy it, they have no serious user. 
Sure, they'll have lots of random other one-off users for their atomics, where each user wants one particular thing, but I suspect that we'll have the only really unified portable code base that handles pretty much *all* the serious odd cases that the C11 atomics can actually talk about to each other. Oh, they'll push things through with or without us, and it will be a collection of random stuff, where they tried to please everybody, with particularly compiler/architecture people who have no f*cking clue about how their stuff is used pushing to make it easy/efficient for their particular compiler/architecture. But we have real optimized uses of pretty much all relevant cases that people actually care about. We can walk away from them, and not really lose anything but a small convenience (and it's a convenience *only* if the standard gets things right). And conversely, the C11 people can walk away from us too. But if they can't make us happy (and by make us happy, I really mean no stupid games on our part) I personally think they'll have a stronger standard, and a real use case, and real arguments. I'm assuming they want that. That's why I complain when you talk about things like marking control dependencies explicitly. That's *us* bending over backwards. And as a buyer, we have absolutely zero reason to do that. Tell the C11 people: no speculative writes. Full stop. End of story. Because we're not buying anything else. Similarly, if we need to mark atomics volatile, then now the C11 atomics are no longer even a small convenience, now they are just extra complexity wrt what we already have. So just make it clear that if the C11 standard needs to mark atomics volatile in order to get non-speculative and non-reloading behavior, then the C11 atomics are useless to us, and we're not buying. 
Remember: a compiler can *always* do "as if" optimizations - if a compiler writer can prove that the end result acts 100% the same using an optimized sequence, then they can do whatever the hell they want. That's not the issue. But if we can *ever* see semantic impact of speculative writes, the compiler is buggy, and the compiler writers need to be aware that it is buggy. No ifs, buts, maybes about it. So I'm perfectly fine with you seeing yourself as a buyer. But I want you to be a really *picky* and anal buyer - one that knows he has the upper hand, and can walk away with no downside. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds torva...@linux-foundation.org wrote: And conversely, the C11 people can walk away from us too. But if they can't make us happy (and by make us happy, I really mean no stupid games on our part) I personally think they'll have a stronger standard, and a real use case, and real arguments. I'm assuming they want that. I should have somebody who proof-reads my emails before I send them out. I obviously meant if they *can* make us happy (not can't). Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote: On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds torva...@linux-foundation.org wrote: And conversely, the C11 people can walk away from us too. But if they can't make us happy (and by make us happy, I really mean no stupid games on our part) I personally think they'll have a stronger standard, and a real use case, and real arguments. I'm assuming they want that. I should have somebody who proof-reads my emails before I send them out. I obviously meant if they *can* make us happy (not can't). Understood. My next step is to take a more detailed look at the piece of the standard that should support RCU. Depending on how that turns out, I might look at other parts of the standard vs. Linux's atomics and memory-ordering needs. Should be interesting. ;-) Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 14, 2014 at 10:35:44PM -0800, Paul E. McKenney wrote: On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote: On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds torva...@linux-foundation.org wrote: And conversely, the C11 people can walk away from us too. But if they can't make us happy (and by make us happy, I really mean no stupid games on our part) I personally think they'll have a stronger standard, and a real use case, and real arguments. I'm assuming they want that. I should have somebody who proof-reads my emails before I send them out. I obviously meant if they *can* make us happy (not can't). Understood. My next step is to take a more detailed look at the piece of the standard that should support RCU. Depending on how that turns out, I might look at other parts of the standard vs. Linux's atomics and memory-ordering needs. Should be interesting. ;-) And perhaps a better way to represent the roles is that I am not the buyer, but rather the purchasing agent for the -potential- buyer. -You- are of course the potential buyer. If I were to see myself as the buyer, then I must confess that the concerns you implicitly expressed in your prior email would be all too well-founded! Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-12 at 10:19 +0100, Peter Zijlstra wrote: > > I don't know the specifics of your example, but from how I understand > > it, I don't see a problem if the compiler can prove that the store will > > always happen. > > > > To be more specific, if the compiler can prove that the store will > > happen anyway, and the region of code can be assumed to always run > > atomically (e.g., there's no loop or such in there), then it is known > > that we have one atomic region of code that will always perform the > > store, so we might as well do the stuff in the region in some order. > > > > Now, if any of the memory accesses are atomic, then the whole region of > > code containing those accesses is often not atomic because other threads > > might observe intermediate results in a data-race-free way. > > > > (I know that this isn't a very precise formulation, but I hope it brings > > my line of reasoning across.) > > So given something like: > > if (x) > y = 3; > > assuming both x and y are atomic (so don't gimme crap for not knowing > the C11 atomic incantations); and you can prove x is always true; you > don't see a problem with not emitting the conditional? That depends on what your goal is. It would be correct as far as the standard is specified; this makes sense if all you want is indeed a program that does what the abstract machine might do, and produces the same output / side effects. If you're trying to preserve the branch in the code emitted / executed by the implementation, then it would not be correct. But those branches aren't specified as being part of the observable side effects. In the common case, this makes sense because it enables optimizations that are useful; this line of reasoning also allows the compiler to merge some atomic accesses in the way that Linus would like to see it. > Avoiding the conditional changes the result; see that control dependency > email from earlier. It does not, with respect to how the standard defines "result". 
> In the above example the load of X and the store to > Y are strictly ordered, due to control dependencies. Not emitting the > condition and maybe not even emitting the load completely wrecks this. I think you're trying to solve this backwards. You are looking at this with an implicit wishlist of what the compiler should do (or how you want to use the hardware), but this is not a viable specification that one can write a compiler against. We do need clear rules for what the compiler is allowed to do or not (e.g., a memory model that models multi-threaded executions). Otherwise it's all hand-waving, and we're getting nowhere. Thus, the way to approach this is to propose a feature or change to the standard, make sure that this is consistent and has no unintended side effects for other aspects of compilation or other code, and then ask the compiler to implement it. IOW, we need a patch for where this all starts: in the rules and requirements for compilation. Paul and I are at the C++ meeting currently, and we had sessions in which the concurrency study group talked about memory model issues like dependency tracking and memory_order_consume. Paul shared uses of atomics (or likewise) in the kernel, and we discussed how the memory model currently handles various cases and why, how one could express other requirements consistently, and what is actually implementable in practice. I can't speak for Paul, but I thought those discussions were productive. > It's therefore an invalid optimization to take out the conditional or > speculate the store, since it takes out the dependency.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote: > On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote: > > On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote: > > > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: > > > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney > > > > wrote: > > > > > > > > > > Us Linux-kernel hackers will often need to use volatile semantics in > > > > > combination with C11 atomics in most cases. The C11 atomics do cover > > > > > some of the reasons we currently use ACCESS_ONCE(), but not all of > > > > > them -- > > > > > in particular, it allows load/store merging. > > > > > > > > I really disagree with the "will need to use volatile". > > > > > > > > We should never need to use volatile (outside of whatever MMIO we do > > > > using C) if C11 defines atomics correctly. > > > > > > > > Allowing load/store merging is *fine*. All sane CPU's do that anyway - > > > > it's called a cache - and there's no actual reason to think that > > > > "ACCESS_ONCE()" has to mean our current "volatile". > > > > > > > > Now, it's possible that the C standards simply get atomics _wrong_, so > > > > that they create visible semantics that are different from what a CPU > > > > cache already does, but that's a plain bug in the standard if so. > > > > > > > > But merging loads and stores is fine. And I *guarantee* it is fine, > > > > exactly because CPU's already do it, so claiming that the compiler > > > > couldn't do it is just insanity. > > > > > > Agreed, both CPUs and compilers can merge loads and stores. But CPUs > > > normally get their stores pushed through the store buffer in reasonable > > > time, and CPUs also use things like invalidations to ensure that a > > > store is seen in reasonable time by readers. Compilers don't always > > > have these two properties, so we do need to be more careful of load > > > and store merging by compilers. 
> > > > The standard's _wording_ is a little vague about forward-progress > > guarantees, but I believe the vast majority of the people involved do > > want compilers to not prevent forward progress. There is of course a > > difference whether a compiler establishes _eventual_ forward progress in > > the sense of after 10 years or forward progress in a small bounded > > interval of time, but this is a QoI issue, and good compilers won't want > > to introduce unnecessary latencies. I believe that it is fine if the > > standard merely talks about eventual forward progress. > > The compiler will need to earn my trust on this one. ;-) > > > > > Now, there are things that are *not* fine, like speculative stores > > > > that could be visible to other threads. Those are *bugs* (either in > > > > the compiler or in the standard), and anybody who claims otherwise is > > > > not worth discussing with. > > > > > > And as near as I can tell, volatile semantics are required in C11 to > > > avoid speculative stores. I might be wrong about this, and hope that > > > I am wrong. But I am currently not seeing it in the current standard. > > > (Though I expect that most compilers would avoid speculating stores, > > > especially in the near term. > > > > This really depends on how we define speculative stores. The memory > > model is absolutely clear that programs have to behave as if executed by > > the virtual machine, and that rules out speculative stores to volatiles > > and other locations. Under certain circumstances, there will be > > "speculative" stores in the sense that they will happen at different > > times as if you had a trivial implementation of the abstract machine. > > But to be allowed to do that, the compiler has to prove that such a > > transformation still fulfills the as-if rule. > > Agreed, although the as-if rule would ignore control dependencies, since > these are not yet part of the standard (as you in fact note below). 
> I nevertheless consider myself at least somewhat reassured that current > C11 won't speculate stores. My remaining concerns involve the compiler > proving to itself that a given branch is always taken, thus motivating > it to optimize the branch away -- though this is more properly a > control-dependency concern. > > > IOW, the abstract machine is what currently defines disallowed > > speculative stores. If you want to put *further* constraints on what > > implementations are allowed to do, I suppose it is best to talk about > > those and see how we can add rules that allow programmers to express > > those constraints. For example, control dependencies might be such a > > case. I don't have a specific suggestion -- maybe the control > > dependencies are best tackled similar to consume dependencies (even > > though we don't have a good solution for those yet). But using > > volatile accesses for that seems to be a big hammer, or even the wrong > > one. > > In current compilers, the two hammers we have are volatile and barrier(). > But yes, it would be good to have something more focused.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote: > On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote: > > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: > > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney > > > wrote: > > > > > > > > Us Linux-kernel hackers will often need to use volatile semantics in > > > > combination with C11 atomics in most cases. The C11 atomics do cover > > > > some of the reasons we currently use ACCESS_ONCE(), but not all of them > > > > -- > > > > in particular, it allows load/store merging. > > > > > > I really disagree with the "will need to use volatile". > > > > > > We should never need to use volatile (outside of whatever MMIO we do > > > using C) if C11 defines atomics correctly. > > > > > > Allowing load/store merging is *fine*. All sane CPU's do that anyway - > > > it's called a cache - and there's no actual reason to think that > > > "ACCESS_ONCE()" has to mean our current "volatile". > > > > > > Now, it's possible that the C standards simply get atomics _wrong_, so > > > that they create visible semantics that are different from what a CPU > > > cache already does, but that's a plain bug in the standard if so. > > > > > > But merging loads and stores is fine. And I *guarantee* it is fine, > > > exactly because CPU's already do it, so claiming that the compiler > > > couldn't do it is just insanity. > > > > Agreed, both CPUs and compilers can merge loads and stores. But CPUs > > normally get their stores pushed through the store buffer in reasonable > > time, and CPUs also use things like invalidations to ensure that a > > store is seen in reasonable time by readers. Compilers don't always > > have these two properties, so we do need to be more careful of load > > and store merging by compilers. > > The standard's _wording_ is a little vague about forward-progress > guarantees, but I believe the vast majority of the people involved do > want compilers to not prevent forward progress. 
There is of course a > difference whether a compiler establishes _eventual_ forward progress in > the sense of after 10 years or forward progress in a small bounded > interval of time, but this is a QoI issue, and good compilers won't want > to introduce unnecessary latencies. I believe that it is fine if the > standard merely talks about eventual forward progress. The compiler will need to earn my trust on this one. ;-) > > > Now, there are things that are *not* fine, like speculative stores > > > that could be visible to other threads. Those are *bugs* (either in > > > the compiler or in the standard), and anybody who claims otherwise is > > > not worth discussing with. > > > > And as near as I can tell, volatile semantics are required in C11 to > > avoid speculative stores. I might be wrong about this, and hope that > > I am wrong. But I am currently not seeing it in the current standard. > > (Though I expect that most compilers would avoid speculating stores, > > especially in the near term. > > This really depends on how we define speculative stores. The memory > model is absolutely clear that programs have to behave as if executed by > the virtual machine, and that rules out speculative stores to volatiles > and other locations. Under certain circumstances, there will be > "speculative" stores in the sense that they will happen at different > times as if you had a trivial implementation of the abstract machine. > But to be allowed to do that, the compiler has to prove that such a > transformation still fulfills the as-if rule. Agreed, although the as-if rule would ignore control dependencies, since these are not yet part of the standard (as you in fact note below). I nevertheless consider myself at least somewhat reassured that current C11 won't speculate stores. 
My remaining concerns involve the compiler proving to itself that a given branch is always taken, thus motivating it to optimize the branch away -- though this is more properly a control-dependency concern. > IOW, the abstract machine is what currently defines disallowed > speculative stores. If you want to put *further* constraints on what > implementations are allowed to do, I suppose it is best to talk about > those and see how we can add rules that allow programmers to express > those constraints. For example, control dependencies might be such a > case. I don't have a specific suggestion -- maybe the control > dependencies are best tackled similar to consume dependencies (even > though we don't have a good solution for those yet). But using > volatile accesses for that seems to be a big hammer, or even the wrong > one. In current compilers, the two hammers we have are volatile and barrier(). But yes, it would be good to have something more focused. One option would be to propose memory_order_control loads to see how loudly the committee screams. One use case might be as follows: if (atomic_load(x, memory_order_control))
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote: > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney > > wrote: > > > > > > Us Linux-kernel hackers will often need to use volatile semantics in > > > combination with C11 atomics in most cases. The C11 atomics do cover > > > some of the reasons we currently use ACCESS_ONCE(), but not all of them -- > > > in particular, it allows load/store merging. > > > > I really disagree with the "will need to use volatile". > > > > We should never need to use volatile (outside of whatever MMIO we do > > using C) if C11 defines atomics correctly. > > > > Allowing load/store merging is *fine*. All sane CPU's do that anyway - > > it's called a cache - and there's no actual reason to think that > > "ACCESS_ONCE()" has to mean our current "volatile". > > > > Now, it's possible that the C standards simply get atomics _wrong_, so > > that they create visible semantics that are different from what a CPU > > cache already does, but that's a plain bug in the standard if so. > > > > But merging loads and stores is fine. And I *guarantee* it is fine, > > exactly because CPU's already do it, so claiming that the compiler > > couldn't do it is just insanity. > > Agreed, both CPUs and compilers can merge loads and stores. But CPUs > normally get their stores pushed through the store buffer in reasonable > time, and CPUs also use things like invalidations to ensure that a > store is seen in reasonable time by readers. Compilers don't always > have these two properties, so we do need to be more careful of load > and store merging by compilers. The standard's _wording_ is a little vague about forward-progress guarantees, but I believe the vast majority of the people involved do want compilers to not prevent forward progress. 
There is of course a difference whether a compiler establishes _eventual_ forward progress in the sense of after 10 years or forward progress in a small bounded interval of time, but this is a QoI issue, and good compilers won't want to introduce unnecessary latencies. I believe that it is fine if the standard merely talks about eventual forward progress. > > Now, there are things that are *not* fine, like speculative stores > > that could be visible to other threads. Those are *bugs* (either in > > the compiler or in the standard), and anybody who claims otherwise is > > not worth discussing with. > > And as near as I can tell, volatile semantics are required in C11 to > avoid speculative stores. I might be wrong about this, and hope that > I am wrong. But I am currently not seeing it in the current standard. > (Though I expect that most compilers would avoid speculating stores, > especially in the near term. This really depends on how we define speculative stores. The memory model is absolutely clear that programs have to behave as if executed by the virtual machine, and that rules out speculative stores to volatiles and other locations. Under certain circumstances, there will be "speculative" stores in the sense that they will happen at different times as if you had a trivial implementation of the abstract machine. But to be allowed to do that, the compiler has to prove that such a transformation still fulfills the as-if rule. IOW, the abstract machine is what currently defines disallowed speculative stores. If you want to put *further* constraints on what implementations are allowed to do, I suppose it is best to talk about those and see how we can add rules that allow programmers to express those constraints. For example, control dependencies might be such a case. I don't have a specific suggestion -- maybe the control dependencies are best tackled similar to consume dependencies (even though we don't have a good solution for those yet). 
But using volatile accesses for that seems to be a big hammer, or even the wrong one. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
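The as-if rule described above can be made concrete with a minimal single-threaded sketch (variable names invented here): the store behind the branch must not become visible unless the load actually observes 1, because hoisting it above the branch would be exactly the kind of speculative store the abstract machine never performs.

```c
#include <stdatomic.h>
#include <assert.h>

atomic_int x = 0;
atomic_int y = 0;

/* The store to y is conditional on the load of x. Hoisting it above the
 * branch would be a speculative store that another thread could observe,
 * which the as-if rule forbids. */
int maybe_store(void)
{
    if (atomic_load_explicit(&x, memory_order_relaxed) == 1)
        atomic_store_explicit(&y, 3, memory_order_relaxed);
    return atomic_load_explicit(&y, memory_order_relaxed);
}
```

Single-threaded, the compiler may transform this freely as long as the observable values are preserved; the debate above is about what "observable" must include once other threads are in the picture.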
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote: On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote: On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Us Linux-kernel hackers will often need to use volatile semantics in combination with C11 atomics in most cases. The C11 atomics do cover some of the reasons we currently use ACCESS_ONCE(), but not all of them -- in particular, it allows load/store merging. I really disagree with the will need to use volatile. We should never need to use volatile (outside of whatever MMIO we do using C) if C11 defines atomics correctly. Allowing load/store merging is *fine*. All sane CPU's do that anyway - it's called a cache - and there's no actual reason to think that ACCESS_ONCE() has to mean our current volatile. Now, it's possible that the C standards simply get atomics _wrong_, so that they create visible semantics that are different from what a CPU cache already does, but that's a plain bug in the standard if so. But merging loads and stores is fine. And I *guarantee* it is fine, exactly because CPU's already do it, so claiming that the compiler couldn't do it is just insanity. Agreed, both CPUs and compilers can merge loads and stores. But CPUs normally get their stores pushed through the store buffer in reasonable time, and CPUs also use things like invalidations to ensure that a store is seen in reasonable time by readers. Compilers don't always have these two properties, so we do need to be more careful of load and store merging by compilers. The standard's _wording_ is a little vague about forward-progress guarantees, but I believe the vast majority of the people involved do want compilers to not prevent forward progress. 
There is of course a difference between a compiler establishing _eventual_ forward progress in the sense of after 10 years, and forward progress in a small bounded interval of time, but this is a QoI issue, and good compilers won't want to introduce unnecessary latencies. I believe that it is fine if the standard merely talks about eventual forward progress. The compiler will need to earn my trust on this one. ;-) Now, there are things that are *not* fine, like speculative stores that could be visible to other threads. Those are *bugs* (either in the compiler or in the standard), and anybody who claims otherwise is not worth discussing with. And as near as I can tell, volatile semantics are required in C11 to avoid speculative stores. I might be wrong about this, and hope that I am wrong. But I am currently not seeing it in the current standard. (Though I expect that most compilers would avoid speculating stores, especially in the near term.) This really depends on how we define speculative stores. The memory model is absolutely clear that programs have to behave as if executed by the abstract machine, and that rules out speculative stores to volatiles and other locations. Under certain circumstances, there will be speculative stores in the sense that they will happen at different times than they would under a trivial implementation of the abstract machine. But to be allowed to do that, the compiler has to prove that such a transformation still fulfills the as-if rule. Agreed, although the as-if rule would ignore control dependencies, since these are not yet part of the standard (as you in fact note below). I nevertheless consider myself at least somewhat reassured that current C11 won't speculate stores. My remaining concerns involve the compiler proving to itself that a given branch is always taken, thus motivating it to optimize the branch away -- though this is more properly a control-dependency concern. 
IOW, the abstract machine is what currently defines disallowed speculative stores. If you want to put *further* constraints on what implementations are allowed to do, I suppose it is best to talk about those and see how we can add rules that allow programmers to express those constraints. For example, control dependencies might be such a case. I don't have a specific suggestion -- maybe the control dependencies are best tackled similarly to consume dependencies (even though we don't have a good solution for those yet). But using volatile accesses for that seems to be a big hammer, or even the wrong one. In current compilers, the two hammers we have are volatile and barrier(). But yes, it would be good to have something more focused. One option would be to propose memory_order_control loads to see how loudly the committee screams. One use case might be as follows: if (atomic_load(x, memory_order_control)) atomic_store(y, memory_order_relaxed); This could also be written: r1 = atomic_load(x, memory_order_control); if
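The two "hammers" mentioned in this exchange, volatile and barrier(), can be sketched as follows. This is a simplified illustration, not the kernel's real definitions (those live in the kernel headers and differ in detail):

```c
#include <assert.h>

/* Simplified sketches of the kernel idioms; ACCESS_ONCE() forces a single
 * volatile access, barrier() is a compiler-only memory barrier. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))
#define barrier() __asm__ __volatile__("" : : : "memory")

int flag, data;

int consume_flag(void)
{
    int f = ACCESS_ONCE(flag);  /* exactly one load, never re-read or merged */
    barrier();                  /* compiler may not cache or reorder accesses across this */
    if (f)
        ACCESS_ONCE(data) = 1;  /* the store stays behind the branch */
    return f;
}
```

Both are blunt instruments: they pessimize every access they touch, which is why the discussion turns to something more focused like a hypothetical memory_order_control.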
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, 2014-02-12 at 10:19 +0100, Peter Zijlstra wrote: I don't know the specifics of your example, but from how I understand it, I don't see a problem if the compiler can prove that the store will always happen. To be more specific, if the compiler can prove that the store will happen anyway, and the region of code can be assumed to always run atomically (e.g., there's no loop or such in there), then it is known that we have one atomic region of code that will always perform the store, so we might as well do the stuff in the region in some order. Now, if any of the memory accesses are atomic, then the whole region of code containing those accesses is often not atomic because other threads might observe intermediate results in a data-race-free way. (I know that this isn't a very precise formulation, but I hope it brings my line of reasoning across.) So given something like: if (x) y = 3; assuming both x and y are atomic (so don't gimme crap for not knowing the C11 atomic incantations); and you can prove x is always true; you don't see a problem with not emitting the conditional? That depends on what your goal is. It would be correct as far as the standard is specified; this makes sense if all you want is indeed a program that does what the abstract machine might do, and produces the same output / side effects. If you're trying to preserve the branch in the code emitted / executed by the implementation, then it would not be correct. But those branches aren't specified as being part of the observable side effects. In the common case, this makes sense because it enables optimizations that are useful; this line of reasoning also allows the compiler to merge some atomic accesses in the way that Linus would like to see it. Avoiding the conditional changes the result; see that control dependency email from earlier. It does not, as far as the standard's definition of the result is concerned. 
In the above example the load of X and the store to Y are strictly ordered, due to control dependencies. Not emitting the condition and maybe not even emitting the load completely wrecks this. I think you're trying to solve this backwards. You are looking at this with an implicit wishlist of what the compiler should do (or how you want to use the hardware), but this is not a viable specification that one can write a compiler against. We do need clear rules for what the compiler is allowed to do or not (e.g., a memory model that models multi-threaded executions). Otherwise it's all hand-waving, and we're getting nowhere. Thus, the way to approach this is to propose a feature or change to the standard, make sure that this is consistent and has no unintended side effects for other aspects of compilation or other code, and then ask the compiler to implement it. IOW, we need a patch for where this all starts: in the rules and requirements for compilation. Paul and I are at the C++ meeting currently, and we had sessions in which the concurrency study group talked about memory model issues like dependency tracking and memory_order_consume. Paul shared uses of atomics (or likewise) in the kernel, and we discussed how the memory model currently handles various cases and why, how one could express other requirements consistently, and what is actually implementable in practice. I can't speak for Paul, but I thought those discussions were productive. It's therefore an invalid optimization to take out the conditional or speculate the store, since it takes out the dependency.
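Peter's `if (x) y = 3;` example, with x provably always true, can be sketched as the two forms under discussion (variable names invented; a single-threaded sketch only):

```c
#include <stdatomic.h>
#include <assert.h>

atomic_int cond_x = 1;   /* assume the compiler could prove this is always 1 */
atomic_int cond_y = 0;

/* The sequence as written: load, compare, conditional store. */
int conditional_form(void)
{
    if (atomic_load_explicit(&cond_x, memory_order_relaxed))
        atomic_store_explicit(&cond_y, 3, memory_order_relaxed);
    return atomic_load_explicit(&cond_y, memory_order_relaxed);
}

/* What the standard's as-if rule permits once cond_x is proven always true:
 * the load and branch fold away. The observable values are identical here;
 * the dispute above is whether discarding the control dependency is
 * acceptable to code that relied on it for ordering. */
int folded_form(void)
{
    atomic_store_explicit(&cond_y, 3, memory_order_relaxed);
    return atomic_load_explicit(&cond_y, memory_order_relaxed);
}
```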
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney > wrote: > > > > Us Linux-kernel hackers will often need to use volatile semantics in > > combination with C11 atomics in most cases. The C11 atomics do cover > > some of the reasons we currently use ACCESS_ONCE(), but not all of them -- > > in particular, it allows load/store merging. > > I really disagree with the "will need to use volatile". > > We should never need to use volatile (outside of whatever MMIO we do > using C) if C11 defines atomics correctly. > > Allowing load/store merging is *fine*. All sane CPU's do that anyway - > it's called a cache - and there's no actual reason to think that > "ACCESS_ONCE()" has to mean our current "volatile". > > Now, it's possible that the C standards simply get atomics _wrong_, so > that they create visible semantics that are different from what a CPU > cache already does, but that's a plain bug in the standard if so. > > But merging loads and stores is fine. And I *guarantee* it is fine, > exactly because CPU's already do it, so claiming that the compiler > couldn't do it is just insanity. Agreed, both CPUs and compilers can merge loads and stores. But CPUs normally get their stores pushed through the store buffer in reasonable time, and CPUs also use things like invalidations to ensure that a store is seen in reasonable time by readers. Compilers don't always have these two properties, so we do need to be more careful of load and store merging by compilers. > Now, there are things that are *not* fine, like speculative stores > that could be visible to other threads. Those are *bugs* (either in > the compiler or in the standard), and anybody who claims otherwise is > not worth discussing with. And as near as I can tell, volatile semantics are required in C11 to avoid speculative stores. I might be wrong about this, and hope that I am wrong. But I am currently not seeing it in the current standard. 
(Though I expect that most compilers would avoid speculating stores, especially in the near term.) > But I really really disagree with the "we might have to use > 'volatile'". Because if we *ever* have to use 'volatile' with the > standard C atomic types, then we're just better off ignoring the > atomic types entirely, because they are obviously broken shit - and > we're better off doing it ourselves the way we have forever. > > Seriously. This is not even hyperbole. It really is as simple as that. Agreed, if we are talking about replacing ACCESS_ONCE() with C11 relaxed atomics any time soon. But someone porting Linux to a new CPU architecture might use a carefully chosen subset of C11 atomics to implement some of the Linux atomic operations, especially non-value-returning atomics such as atomic_inc(). Thanx, Paul
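Paul's suggestion of implementing non-value-returning kernel atomics on top of C11 might look like the following sketch; the my_atomic_* names and the choice of relaxed ordering are assumptions for illustration, not an actual port:

```c
#include <stdatomic.h>
#include <assert.h>

/* Hypothetical port sketch: a kernel-style atomic counter built on C11. */
typedef struct { atomic_int counter; } my_atomic_t;

static inline void my_atomic_inc(my_atomic_t *v)
{
    /* Non-value-returning, so no ordering needs to be provided;
     * relaxed fetch_add gives just the atomic read-modify-write. */
    atomic_fetch_add_explicit(&v->counter, 1, memory_order_relaxed);
}

static inline int my_atomic_read(my_atomic_t *v)
{
    return atomic_load_explicit(&v->counter, memory_order_relaxed);
}
```

The attraction is that non-value-returning operations need no ordering guarantees, so relaxed C11 atomics map onto them with no extra fences on any architecture.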
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney wrote: > > Us Linux-kernel hackers will often need to use volatile semantics in > combination with C11 atomics in most cases. The C11 atomics do cover > some of the reasons we currently use ACCESS_ONCE(), but not all of them -- > in particular, it allows load/store merging. I really disagree with the "will need to use volatile". We should never need to use volatile (outside of whatever MMIO we do using C) if C11 defines atomics correctly. Allowing load/store merging is *fine*. All sane CPU's do that anyway - it's called a cache - and there's no actual reason to think that "ACCESS_ONCE()" has to mean our current "volatile". Now, it's possible that the C standards simply get atomics _wrong_, so that they create visible semantics that are different from what a CPU cache already does, but that's a plain bug in the standard if so. But merging loads and stores is fine. And I *guarantee* it is fine, exactly because CPU's already do it, so claiming that the compiler couldn't do it is just insanity. Now, there are things that are *not* fine, like speculative stores that could be visible to other threads. Those are *bugs* (either in the compiler or in the standard), and anybody who claims otherwise is not worth discussing with. But I really really disagree with the "we might have to use 'volatile'". Because if we *ever* have to use 'volatile' with the standard C atomic types, then we're just better off ignoring the atomic types entirely, because they are obviously broken shit - and we're better off doing it ourselves the way we have forever. Seriously. This is not even hyperbole. It really is as simple as that. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 11, 2014 at 09:13:34PM -0800, Torvald Riegel wrote: > On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote: > > On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote: > > > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote: > > > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote: > > > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: > > > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > > > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote: > > > > > > > > > There are also so many ways to blow your head off it's > > > > > > > > > untrue. For example, > > > > > > > > > cmpxchg takes a separate memory model parameter for failure > > > > > > > > > and success, but > > > > > > > > > then there are restrictions on the sets you can use for each. > > > > > > > > > It's not hard > > > > > > > > > to find well-known memory-ordering experts shouting "Just use > > > > > > > > > memory_model_seq_cst for everything, it's too hard > > > > > > > > > otherwise". Then there's > > > > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely > > > > > > > > > ignores consume > > > > > > > > > atm and optimises all of the data dependencies away) as well > > > > > > > > > as the definition > > > > > > > > > of "data races", which seem to be used as an excuse to > > > > > > > > > miscompile a program > > > > > > > > > at the earliest opportunity. > > > > > > > > > > > > > > > > Trust me, rcu_dereference() is not going to be defined in terms > > > > > > > > of > > > > > > > > memory_order_consume until the compilers implement it both > > > > > > > > correctly and > > > > > > > > efficiently. They are not there yet, and there is currently no > > > > > > > > shortage > > > > > > > > of compiler writers who would prefer to ignore > > > > > > > > memory_order_consume. 
> > > > > > > > > > > > > > Do you have any input on > > > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In > > > > > > > particular, the > > > > > > > language standard's definition of dependencies? > > > > > > > > > > > > Let's see... 1.10p9 says that a dependency must be carried unless: > > > > > > > > > > > > — B is an invocation of any specialization of std::kill_dependency > > > > > > (29.3), or > > > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or > > > > > > logical OR (||, see 5.15) operator, > > > > > > or > > > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or > > > > > > — A is the left operand of the built-in comma (,) operator (5.18); > > > > > > > > > > > > So the use of "flag" before the "?" is ignored. But the "flag - > > > > > > flag" > > > > > > after the "?" will carry a dependency, so the code fragment in 59448 > > > > > > needs to do the ordering rather than just optimizing "flag - flag" > > > > > > out > > > > > > of existence. One way to do that on both ARM and Power is to > > > > > > actually > > > > > > emit code for "flag - flag", but there are a number of other ways to > > > > > > make that work. > > > > > > > > > > And that's what would concern me, considering that these requirements > > > > > seem to be able to creep out easily. Also, whereas the other atomics > > > > > just constrain compilers wrt. reordering across atomic accesses or > > > > > changes to the atomic accesses themselves, the dependencies are new > > > > > requirements on pieces of otherwise non-synchronizing code. The > > > > > latter > > > > > seems far more involved to me. > > > > > > > > Well, the wording of 1.10p9 is pretty explicit on this point. > > > > There are only a few exceptions to the rule that dependencies from > > > > memory_order_consume loads must be tracked. 
And to your point about > > > > requirements being placed on pieces of otherwise non-synchronizing code, > > > > we already have that with plain old load acquire and store release -- > > > > both of these put ordering constraints that affect the surrounding > > > > non-synchronizing code. > > > > > > I think there's a significant difference. With acquire/release or more > > > general memory orders, it's true that we can't order _across_ the atomic > > > access. However, we can reorder and optimize without additional > > > constraints if we do not reorder. This is not the case with consume > > > memory order, as the (p + flag - flag) example shows. > > > > Agreed, memory_order_consume does introduce additional restrictions. > > > > > > This issue got a lot of discussion, and the compromise is that > > > > dependencies cannot leak into or out of functions unless the relevant > > > > parameters or return values are annotated with [[carries_dependency]]. > > > > This means that the compiler can see all the places where dependencies > > > > must be tracked. This is described in 7.6.4. > > > > > > I wasn't aware
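The 1.10p9 dependency rules and the "flag - flag" example discussed in this exchange can be sketched in C11 terms (variable names invented; a single-threaded sketch, so only the source-level shape is illustrated, not the cross-thread ordering):

```c
#include <stdatomic.h>
#include <assert.h>

int payload = 42;
_Atomic(int *) shared_ptr;
atomic_int dep_flag;    /* stays 0 here; only its dependency chain matters */

/* The consume load of dep_flag carries a dependency through "f - f", so
 * per the rules quoted above the implementation must preserve the ordering
 * the dependency provides rather than just folding the expression to 0. */
int read_via_consume(void)
{
    int f = atomic_load_explicit(&dep_flag, memory_order_consume);
    int *p = atomic_load_explicit(&shared_ptr, memory_order_consume);
    return *(p + f - f);
}

/* kill_dependency() explicitly ends the chain: the compiler may then treat
 * the value as dependency-free and optimize accordingly. */
int read_killed(void)
{
    int *p = kill_dependency(atomic_load_explicit(&shared_ptr, memory_order_consume));
    return *p;
}
```

This is the tension in the thread: the dependency requirement reaches into otherwise non-synchronizing code like the pointer arithmetic above, which is what makes consume so much harder for compilers than acquire/release.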
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 11, 2014 at 09:39:24PM -0800, Torvald Riegel wrote: > On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote: > > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel wrote: > > > > > > Intuitively, this is wrong because this lets the program take a step > > > the abstract machine wouldn't do. This is different to the sequential > > > code that Peter posted because it uses atomics, and thus one can't > > > easily assume that the difference is not observable. > > > > Btw, what is the definition of "observable" for the atomics? > > > > Because I'm hoping that it's not the same as for volatiles, where > > "observable" is about the virtual machine itself, and as such volatile > > accesses cannot be combined or optimized at all. > > No, atomics aren't an observable behavior of the abstract machine > (unless they are volatile). See 1.8.p8 (citing the C++ standard). Us Linux-kernel hackers will often need to use volatile semantics in combination with C11 atomics in most cases. The C11 atomics do cover some of the reasons we currently use ACCESS_ONCE(), but not all of them -- in particular, it allows load/store merging. > > Now, I claim that atomic accesses cannot be done speculatively for > > writes, and not re-done for reads (because the value could change), > > Agreed, unless the compiler can prove that this doesn't make a > difference in the program at hand and it's not volatile atomics. In > general, that will be hard and thus won't happen often I suppose, but if > correctly proved it would fall under the as-if rule I think. > > > but *combining* them would be possible and good. > > Agreed. In some cases, agreed. But many uses in the Linux kernel will need volatile semantics in combination with C11 atomics. Which is OK, for the foreseeable future, anyway. 
> > For example, we often have multiple independent atomic accesses that > > could certainly be combined: testing the individual bits of an atomic > > value with helper functions, causing things like "load atomic, test > > bit, load same atomic, test another bit". The two atomic loads could > > be done as a single load without possibly changing semantics on a real > > machine, but if "visibility" is defined in the same way it is for > > "volatile", that wouldn't be a valid transformation. Right now we use > > "volatile" semantics for these kinds of things, and they really can > > hurt. > > Agreed. In your example, the compiler would have to prove that the > abstract machine would always be able to run the two loads atomically > (ie, as one load) without running into impossible/disallowed behavior of > the program. But if there's no loop or branch or such in-between, this > should be straight-forward because any hardware oddity or similar could > merge those loads and it wouldn't be disallowed by the standard > (considering that we're talking about a finite number of loads), so the > compiler would be allowed to do it as well. As long as they are not marked volatile, agreed. Thanx, Paul > > Same goes for multiple writes (possibly due to setting bits): > > combining multiple accesses into a single one is generally fine, it's > > *adding* write accesses speculatively that is broken by design.. > > Agreed. As Paul points out, this being correct assumes that there are > no other ordering guarantees or memory accesses "interfering", but if > the stores are to the same memory location and adjacent to each other in > the program, then I don't see a reason why they wouldn't be combinable. > > > At the same time, you can't combine atomic loads or stores infinitely > > - "visibility" on a real machine definitely is about timeliness. 
> > Removing all but the last write when there are multiple consecutive > > writes is generally fine, even if you unroll a loop to generate those > > writes. But if what remains is a loop, it might be a busy-loop > > basically waiting for something, so it would be wrong ("untimely") to > > hoist a store in a loop entirely past the end of the loop, or hoist a > > load in a loop to before the loop. > > Agreed. That's what 1.10p24 and 1.10p25 are meant to specify for loads, > although those might not be bullet-proof as Paul points out. Forward > progress is rather vaguely specified in the standard, but at least parts > of the committee (and people in ISO C++ SG1, in particular) are working > on trying to improve this. > > > Does the standard allow for that kind of behavior? > > I think the standard requires (or intends to require) the behavior that > you (and I) seem to prefer in these examples.
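Linus's "load atomic, test bit, load same atomic, test another bit" pattern and the merged form the thread agrees is valid can be sketched as follows (variable names invented):

```c
#include <stdatomic.h>
#include <assert.h>

atomic_uint bits;

/* Two back-to-back relaxed loads of the same variable, as produced by
 * independent bit-test helpers. */
int both_bits_naive(void)
{
    unsigned a = atomic_load_explicit(&bits, memory_order_relaxed) & 1u;
    unsigned b = atomic_load_explicit(&bits, memory_order_relaxed) & 2u;
    return a && b;
}

/* The merged form: one load serves both tests, exactly as a CPU cache
 * effectively would. Volatile semantics would forbid this merge; relaxed
 * atomics are meant to allow it. */
int both_bits_merged(void)
{
    unsigned v = atomic_load_explicit(&bits, memory_order_relaxed);
    return (v & 1u) && (v & 2u);
}
```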
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote: > You need volatile semantics to force the compiler to ignore any proofs > it might otherwise attempt to construct. Hence all the ACCESS_ONCE() > calls in my email to Torvald. (Hopefully I translated your example > reasonably.) My brain gave out for today; but it did appear to have the right structure. I would prefer it if C11 would not require the volatile casts. It should simply _never_ speculate with atomic writes, volatile or not. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 11, 2014 at 10:06:34PM -0800, Torvald Riegel wrote: > On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote: > > On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote: > > > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel wrote: > > > > > > > > Intuitively, this is wrong because this lets the program take a step > > > > the abstract machine wouldn't do. This is different to the sequential > > > > code that Peter posted because it uses atomics, and thus one can't > > > > easily assume that the difference is not observable. > > > > > > Btw, what is the definition of "observable" for the atomics? > > > > > > Because I'm hoping that it's not the same as for volatiles, where > > > "observable" is about the virtual machine itself, and as such volatile > > > accesses cannot be combined or optimized at all. > > > > > > Now, I claim that atomic accesses cannot be done speculatively for > > > writes, and not re-done for reads (because the value could change), > > > but *combining* them would be possible and good. > > > > > > For example, we often have multiple independent atomic accesses that > > > could certainly be combined: testing the individual bits of an atomic > > > value with helper functions, causing things like "load atomic, test > > > bit, load same atomic, test another bit". The two atomic loads could > > > be done as a single load without possibly changing semantics on a real > > > machine, but if "visibility" is defined in the same way it is for > > > "volatile", that wouldn't be a valid transformation. Right now we use > > > "volatile" semantics for these kinds of things, and they really can > > > hurt. > > > > > > Same goes for multiple writes (possibly due to setting bits): > > > combining multiple accesses into a single one is generally fine, it's > > > *adding* write accesses speculatively that is broken by design.. 
> > > > > > At the same time, you can't combine atomic loads or stores infinitely > > > - "visibility" on a real machine definitely is about timeliness. > > > Removing all but the last write when there are multiple consecutive > > > writes is generally fine, even if you unroll a loop to generate those > > > writes. But if what remains is a loop, it might be a busy-loop > > > basically waiting for something, so it would be wrong ("untimely") to > > > hoist a store in a loop entirely past the end of the loop, or hoist a > > > load in a loop to before the loop. > > > > > > Does the standard allow for that kind of behavior? > > > > You asked! ;-) > > > > So the current standard allows merging of both loads and stores, unless of > > course ordering constraints prevent the merging. Volatile semantics may be > > used to prevent this merging, if desired, for example, for real-time code. > > Agreed. > > > Infinite merging is intended to be prohibited, but I am not certain that > > the current wording is bullet-proof (1.10p24 and 1.10p25). > > Yeah, maybe not. But it at least seems to rather clearly indicate the > intent ;) That is my hope. ;-) > > The only prohibition against speculative stores that I can see is in a > > non-normative note, and it can be argued to apply only to things that are > > not atomics (1.10p22). > > I think this one is specifically about speculative stores that would > affect memory locations that the abstract machine would not write to, > and that might be observable or create data races. While a compiler > could potentially prove that such stores aren't leading to a difference > in the behavior of the program (e.g., by proving that there are no > observers anywhere and this isn't overlapping with any volatile > locations), I think that this is hard in general and most compilers will > just not do such things. In GCC, bugs in that category were fixed after > researchers doing fuzz-testing found them (IIRC, speculative stores by > loops). 
And that is my fear. ;-) > > I don't see any prohibition against reordering > > a store to precede a load preceding a conditional branch -- which would > > not be speculative if the branch was known to be taken and the load > > hit in the store buffer. In a system where stores could be reordered, > > some other CPU might perceive the store as happening before the load > > that controlled the conditional branch. This needs to be addressed. > > I don't know the specifics of your example, but from how I understand > it, I don't see a problem if the compiler can prove that the store will > always happen. The current Documentation/memory-barriers.txt formulation requires that both the load and the store have volatile semantics. Does that help? > To be more specific, if the compiler can prove that the store will > happen anyway, and the region of code can be assumed to always run > atomically (e.g., there's no loop or such in there), then it is known > that we have one atomic region of code that will always perform the > store, so we might as well do the stuff in the region in some order. And it would be very hard to write a program that proved that the store had been reordered prior to the load in this case.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 12, 2014 at 10:19:07AM +0100, Peter Zijlstra wrote: > > I don't know the specifics of your example, but from how I understand > > it, I don't see a problem if the compiler can prove that the store will > > always happen. > > > > To be more specific, if the compiler can prove that the store will > > happen anyway, and the region of code can be assumed to always run > > atomically (e.g., there's no loop or such in there), then it is known > > that we have one atomic region of code that will always perform the > > store, so we might as well do the stuff in the region in some order. > > > > Now, if any of the memory accesses are atomic, then the whole region of > > code containing those accesses is often not atomic because other threads > > might observe intermediate results in a data-race-free way. > > > > (I know that this isn't a very precise formulation, but I hope it brings > > my line of reasoning across.) > > So given something like: > > if (x) > y = 3; > > assuming both x and y are atomic (so don't gimme crap for not knowing > the C11 atomic incantations); and you can prove x is always true; you > don't see a problem with not emitting the conditional? You need volatile semantics to force the compiler to ignore any proofs it might otherwise attempt to construct. Hence all the ACCESS_ONCE() calls in my email to Torvald. (Hopefully I translated your example reasonably.) Thanx, Paul > Avoiding the conditional changes the result; see that control dependency > email from earlier. In the above example the load of X and the store to > Y are strictly ordered, due to control dependencies. Not emitting the > condition and maybe not even emitting the load completely wrecks this. > > It's therefore an invalid optimization to take out the conditional or > speculate the store, since it takes out the dependency. 
Re: [RFC][PATCH 0/5] arch: atomic rework
> I don't know the specifics of your example, but from how I understand > it, I don't see a problem if the compiler can prove that the store will > always happen. > > To be more specific, if the compiler can prove that the store will > happen anyway, and the region of code can be assumed to always run > atomically (e.g., there's no loop or such in there), then it is known > that we have one atomic region of code that will always perform the > store, so we might as well do the stuff in the region in some order. > > Now, if any of the memory accesses are atomic, then the whole region of > code containing those accesses is often not atomic because other threads > might observe intermediate results in a data-race-free way. > > (I know that this isn't a very precise formulation, but I hope it brings > my line of reasoning across.) So given something like: if (x) y = 3; assuming both x and y are atomic (so don't gimme crap for not knowing the C11 atomic incantations); and you can prove x is always true; you don't see a problem with not emitting the conditional? Avoiding the conditional changes the result; see that control dependency email from earlier. In the above example the load of X and the store to Y are strictly ordered, due to control dependencies. Not emitting the condition and maybe not even emitting the load completely wrecks this. It's therefore an invalid optimization to take out the conditional or speculate the store, since it takes out the dependency.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 11, 2014 at 09:39:24PM -0800, Torvald Riegel wrote: On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote: On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel trie...@redhat.com wrote: Intuitively, this is wrong because this lets the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable. Btw, what is the definition of observable for the atomics? Because I'm hoping that it's not the same as for volatiles, where observable is about the virtual machine itself, and as such volatile accesses cannot be combined or optimized at all. No, atomics aren't an observable behavior of the abstract machine (unless they are volatile). See 1.8.p8 (citing the C++ standard). Us Linux-kernel hackers will often need to use volatile semantics in combination with C11 atomics in most cases. The C11 atomics do cover some of the reasons we currently use ACCESS_ONCE(), but not all of them -- in particular, it allows load/store merging. Now, I claim that atomic accesses cannot be done speculatively for writes, and not re-done for reads (because the value could change), Agreed, unless the compiler can prove that this doesn't make a difference in the program at hand and it's not volatile atomics. In general, that will be hard and thus won't happen often I suppose, but if correctly proved it would fall under the as-if rule I think. but *combining* them would be possible and good. Agreed. In some cases, agreed. But many uses in the Linux kernel will need volatile semantics in combination with C11 atomics. Which is OK, for the foreseeable future, anyway. For example, we often have multiple independent atomic accesses that could certainly be combined: testing the individual bits of an atomic value with helper functions, causing things like load atomic, test bit, load same atomic, test another bit. 
The two atomic loads could be done as a single load without possibly changing semantics on a real machine, but if visibility is defined in the same way it is for volatile, that wouldn't be a valid transformation. Right now we use volatile semantics for these kinds of things, and they really can hurt. Agreed. In your example, the compiler would have to prove that the abstract machine would always be able to run the two loads atomically (ie, as one load) without running into impossible/disallowed behavior of the program. But if there's no loop or branch or such in-between, this should be straight-forward because any hardware oddity or similar could merge those loads and it wouldn't be disallowed by the standard (considering that we're talking about a finite number of loads), so the compiler would be allowed to do it as well. As long as they are not marked volatile, agreed. Thanx, Paul Same goes for multiple writes (possibly due to setting bits): combining multiple accesses into a single one is generally fine, it's *adding* write accesses speculatively that is broken by design.. Agreed. As Paul points out, this being correct assumes that there are no other ordering guarantees or memory accesses interfering, but if the stores are to the same memory location and adjacent to each other in the program, then I don't see a reason why they wouldn't be combinable. At the same time, you can't combine atomic loads or stores infinitely - visibility on a real machine definitely is about timeliness. Removing all but the last write when there are multiple consecutive writes is generally fine, even if you unroll a loop to generate those writes. But if what remains is a loop, it might be a busy-loop basically waiting for something, so it would be wrong (untimely) to hoist a store in a loop entirely past the end of the loop, or hoist a load in a loop to before the loop. Agreed. 
That's what 1.10p24 and 1.10p25 are meant to specify for loads, although those might not be bullet-proof as Paul points out. Forward progress is rather vaguely specified in the standard, but at least parts of the committee (and people in ISO C++ SG1, in particular) are working on trying to improve this. Does the standard allow for that kind of behavior? I think the standard requires (or intends to require) the behavior that you (and I) seem to prefer in these examples.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 11, 2014 at 09:13:34PM -0800, Torvald Riegel wrote: On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote: On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote: On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote: On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote: On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote: There are also so many ways to blow your head off it's untrue. For example, cmpxchg takes a separate memory model parameter for failure and success, but then there are restrictions on the sets you can use for each. It's not hard to find well-known memory-ordering experts shouting Just use memory_model_seq_cst for everything, it's too hard otherwise. Then there's the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume atm and optimises all of the data dependencies away) as well as the definition of data races, which seem to be used as an excuse to miscompile a program at the earliest opportunity. Trust me, rcu_dereference() is not going to be defined in terms of memory_order_consume until the compilers implement it both correctly and efficiently. They are not there yet, and there is currently no shortage of compiler writers who would prefer to ignore memory_order_consume. Do you have any input on http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the language standard's definition of dependencies? Let's see... 
1.10p9 says that a dependency must be carried unless: — B is an invocation of any specialization of std::kill_dependency (29.3), or — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator, or — A is the left operand of a conditional (?:, see 5.16) operator, or — A is the left operand of the built-in comma (,) operator (5.18); So the use of flag before the ? is ignored. But the flag - flag after the ? will carry a dependency, so the code fragment in 59448 needs to do the ordering rather than just optimizing flag - flag out of existence. One way to do that on both ARM and Power is to actually emit code for flag - flag, but there are a number of other ways to make that work. And that's what would concern me, considering that these requirements seem to be able to creep out easily. Also, whereas the other atomics just constrain compilers wrt. reordering across atomic accesses or changes to the atomic accesses themselves, the dependencies are new requirements on pieces of otherwise non-synchronizing code. The latter seems far more involved to me. Well, the wording of 1.10p9 is pretty explicit on this point. There are only a few exceptions to the rule that dependencies from memory_order_consume loads must be tracked. And to your point about requirements being placed on pieces of otherwise non-synchronizing code, we already have that with plain old load acquire and store release -- both of these put ordering constraints that affect the surrounding non-synchronizing code. I think there's a significant difference. With acquire/release or more general memory orders, it's true that we can't order _across_ the atomic access. However, we can reorder and optimize without additional constraints if we do not reorder. This is not the case with consume memory order, as the (p + flag - flag) example shows. Agreed, memory_order_consume does introduce additional restrictions. 
This issue got a lot of discussion, and the compromise is that dependencies cannot leak into or out of functions unless the relevant parameters or return values are annotated with [[carries_dependency]]. This means that the compiler can see all the places where dependencies must be tracked. This is described in 7.6.4. I wasn't aware of 7.6.4 (but it isn't referred to as an additional constraint--what it is--in 1.10, so I guess at least that should be fixed). Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic difference, at least if one is assuming that normal optimization of sequential code is the default, and that maintaining things such as (flag-flag) is not; if optimizing away (flag-flag) would require the insertion of fences unless there is the carries_dependency attribute, then this would be bad I think. No, the
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Us Linux-kernel hackers will often need to use volatile semantics in combination with C11 atomics in most cases. The C11 atomics do cover some of the reasons we currently use ACCESS_ONCE(), but not all of them -- in particular, it allows load/store merging. I really disagree with the "will need to use volatile". We should never need to use volatile (outside of whatever MMIO we do using C) if C11 defines atomics correctly. Allowing load/store merging is *fine*. All sane CPU's do that anyway - it's called a cache - and there's no actual reason to think that ACCESS_ONCE() has to mean our current "volatile". Now, it's possible that the C standards simply get atomics _wrong_, so that they create visible semantics that are different from what a CPU cache already does, but that's a plain bug in the standard if so. But merging loads and stores is fine. And I *guarantee* it is fine, exactly because CPU's already do it, so claiming that the compiler couldn't do it is just insanity. Now, there are things that are *not* fine, like speculative stores that could be visible to other threads. Those are *bugs* (either in the compiler or in the standard), and anybody who claims otherwise is not worth discussing with. But I really really disagree with the "we might have to use 'volatile'". Because if we *ever* have to use 'volatile' with the standard C atomic types, then we're just better off ignoring the atomic types entirely, because they are obviously broken shit - and we're better off doing it ourselves the way we have forever. Seriously. This is not even hyperbole. It really is as simple as that. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney paul...@linux.vnet.ibm.com wrote: Us Linux-kernel hackers will often need to use volatile semantics in combination with C11 atomics in most cases. The C11 atomics do cover some of the reasons we currently use ACCESS_ONCE(), but not all of them -- in particular, it allows load/store merging. I really disagree with the will need to use volatile. We should never need to use volatile (outside of whatever MMIO we do using C) if C11 defines atomics correctly. Allowing load/store merging is *fine*. All sane CPU's do that anyway - it's called a cache - and there's no actual reason to think that ACCESS_ONCE() has to mean our current volatile. Now, it's possible that the C standards simply get atomics _wrong_, so that they create visible semantics that are different from what a CPU cache already does, but that's a plain bug in the standard if so. But merging loads and stores is fine. And I *guarantee* it is fine, exactly because CPU's already do it, so claiming that the compiler couldn't do it is just insanity. Agreed, both CPUs and compilers can merge loads and stores. But CPUs normally get their stores pushed through the store buffer in reasonable time, and CPUs also use things like invalidations to ensure that a store is seen in reasonable time by readers. Compilers don't always have these two properties, so we do need to be more careful of load and store merging by compilers. Now, there are things that are *not* fine, like speculative stores that could be visible to other threads. Those are *bugs* (either in the compiler or in the standard), and anybody who claims otherwise is not worth discussing with. And as near as I can tell, volatile semantics are required in C11 to avoid speculative stores. I might be wrong about this, and hope that I am wrong. But I am currently not seeing it in the current standard. 
(Though I expect that most compilers would avoid speculating stores, especially in the near term.)

> But I really really disagree with the "we might have to use 'volatile'". Because if we *ever* have to use 'volatile' with the standard C atomic types, then we're just better off ignoring the atomic types entirely, because they are obviously broken shit - and we're better off doing it ourselves the way we have forever.
>
> Seriously. This is not even hyperbole. It really is as simple as that.

Agreed, if we are talking about replacing ACCESS_ONCE() with C11 relaxed atomics any time soon. But someone porting Linux to a new CPU architecture might use a carefully chosen subset of C11 atomics to implement some of the Linux atomic operations, especially non-value-returning atomics such as atomic_inc().

							Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel wrote:
> > >
> > > Intuitively, this is wrong because this lets the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable.
> >
> > Btw, what is the definition of "observable" for the atomics?
> >
> > Because I'm hoping that it's not the same as for volatiles, where "observable" is about the virtual machine itself, and as such volatile accesses cannot be combined or optimized at all.
> >
> > Now, I claim that atomic accesses cannot be done speculatively for writes, and not re-done for reads (because the value could change), but *combining* them would be possible and good.
> >
> > For example, we often have multiple independent atomic accesses that could certainly be combined: testing the individual bits of an atomic value with helper functions, causing things like "load atomic, test bit, load same atomic, test another bit". The two atomic loads could be done as a single load without possibly changing semantics on a real machine, but if "visibility" is defined in the same way it is for "volatile", that wouldn't be a valid transformation. Right now we use "volatile" semantics for these kinds of things, and they really can hurt.
> >
> > Same goes for multiple writes (possibly due to setting bits): combining multiple accesses into a single one is generally fine, it's *adding* write accesses speculatively that is broken by design..
> >
> > At the same time, you can't combine atomic loads or stores infinitely - "visibility" on a real machine definitely is about timeliness.
> > Removing all but the last write when there are multiple consecutive writes is generally fine, even if you unroll a loop to generate those writes. But if what remains is a loop, it might be a busy-loop basically waiting for something, so it would be wrong ("untimely") to hoist a store in a loop entirely past the end of the loop, or hoist a load in a loop to before the loop.
> >
> > Does the standard allow for that kind of behavior?
>
> You asked! ;-)
>
> So the current standard allows merging of both loads and stores, unless of course ordering constraints prevent the merging. Volatile semantics may be used to prevent this merging, if desired, for example, for real-time code.

Agreed.

> Infinite merging is intended to be prohibited, but I am not certain that the current wording is bullet-proof (1.10p24 and 1.10p25).

Yeah, maybe not. But it at least seems to rather clearly indicate the intent ;)

> The only prohibition against speculative stores that I can see is in a non-normative note, and it can be argued to apply only to things that are not atomics (1.10p22).

I think this one is specifically about speculative stores that would affect memory locations that the abstract machine would not write to, and that might be observable or create data races. While a compiler could potentially prove that such stores aren't leading to a difference in the behavior of the program (e.g., by proving that there are no observers anywhere and this isn't overlapping with any volatile locations), I think that this is hard in general and most compilers will just not do such things. In GCC, bugs in that category were fixed after researchers doing fuzz-testing found them (IIRC, speculative stores by loops).

> I don't see any prohibition against reordering a store to precede a load preceding a conditional branch -- which would not be speculative if the branch was known to be taken and the load hit in the store buffer. In a system where stores could be reordered, some other CPU might perceive the store as happening before the load that controlled the conditional branch. This needs to be addressed.

I don't know the specifics of your example, but from how I understand it, I don't see a problem if the compiler can prove that the store will always happen. To be more specific, if the compiler can prove that the store will happen anyway, and the region of code can be assumed to always run atomically (e.g., there's no loop or such in there), then it is known that we have one atomic region of code that will always perform the store, so we might as well do the stuff in the region in some order.

Now, if any of the memory accesses are atomic, then the whole region of code containing those accesses is often not atomic because other threads might observe intermediate results in a data-race-free way. (I know that this isn't a very precise formulation, but I hope it brings my line of reasoning across.)

> Why this hole? At the time, the current formalizations of popular CPU architectures did not exist, and it was not clear that all popular hardware avoided speculative stores.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel wrote:
> >
> > Intuitively, this is wrong because this lets the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable.
>
> Btw, what is the definition of "observable" for the atomics?
>
> Because I'm hoping that it's not the same as for volatiles, where "observable" is about the virtual machine itself, and as such volatile accesses cannot be combined or optimized at all.

No, atomics aren't an observable behavior of the abstract machine (unless they are volatile). See 1.8.p8 (citing the C++ standard).

> Now, I claim that atomic accesses cannot be done speculatively for writes, and not re-done for reads (because the value could change),

Agreed, unless the compiler can prove that this doesn't make a difference in the program at hand and it's not volatile atomics. In general, that will be hard and thus won't happen often I suppose, but if correctly proved it would fall under the as-if rule I think.

> but *combining* them would be possible and good.

Agreed.

> For example, we often have multiple independent atomic accesses that could certainly be combined: testing the individual bits of an atomic value with helper functions, causing things like "load atomic, test bit, load same atomic, test another bit". The two atomic loads could be done as a single load without possibly changing semantics on a real machine, but if "visibility" is defined in the same way it is for "volatile", that wouldn't be a valid transformation. Right now we use "volatile" semantics for these kinds of things, and they really can hurt.

Agreed.
In your example, the compiler would have to prove that the abstract machine would always be able to run the two loads atomically (ie, as one load) without running into impossible/disallowed behavior of the program. But if there's no loop or branch or such in-between, this should be straightforward because any hardware oddity or similar could merge those loads and it wouldn't be disallowed by the standard (considering that we're talking about a finite number of loads), so the compiler would be allowed to do it as well.

> Same goes for multiple writes (possibly due to setting bits): combining multiple accesses into a single one is generally fine, it's *adding* write accesses speculatively that is broken by design..

Agreed. As Paul points out, this being correct assumes that there are no other ordering guarantees or memory accesses "interfering", but if the stores are to the same memory location and adjacent to each other in the program, then I don't see a reason why they wouldn't be combinable.

> At the same time, you can't combine atomic loads or stores infinitely - "visibility" on a real machine definitely is about timeliness. Removing all but the last write when there are multiple consecutive writes is generally fine, even if you unroll a loop to generate those writes. But if what remains is a loop, it might be a busy-loop basically waiting for something, so it would be wrong ("untimely") to hoist a store in a loop entirely past the end of the loop, or hoist a load in a loop to before the loop.

Agreed. That's what 1.10p24 and 1.10p25 are meant to specify for loads, although those might not be bullet-proof as Paul points out. Forward progress is rather vaguely specified in the standard, but at least parts of the committee (and people in ISO C++ SG1, in particular) are working on trying to improve this.

> Does the standard allow for that kind of behavior?
I think the standard requires (or intends to require) the behavior that you (and I) seem to prefer in these examples.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote:
> > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote:
> > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote:
> > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote:
> > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote:
> > > > > > > > There are also so many ways to blow your head off it's untrue. For example, cmpxchg takes a separate memory model parameter for failure and success, but then there are restrictions on the sets you can use for each. It's not hard to find well-known memory-ordering experts shouting "Just use memory_model_seq_cst for everything, it's too hard otherwise". Then there's the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume atm and optimises all of the data dependencies away) as well as the definition of "data races", which seem to be used as an excuse to miscompile a program at the earliest opportunity.
> > > > > > >
> > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of memory_order_consume until the compilers implement it both correctly and efficiently. They are not there yet, and there is currently no shortage of compiler writers who would prefer to ignore memory_order_consume.
> > > > > >
> > > > > > Do you have any input on http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the language standard's definition of dependencies?
> > > > >
> > > > > Let's see... 1.10p9 says that a dependency must be carried unless:
> > > > >
> > > > > — B is an invocation of any specialization of std::kill_dependency (29.3), or
> > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator, or
> > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or
> > > > > — A is the left operand of the built-in comma (,) operator (5.18);
> > > > >
> > > > > So the use of "flag" before the "?" is ignored. But the "flag - flag" after the "?" will carry a dependency, so the code fragment in 59448 needs to do the ordering rather than just optimizing "flag - flag" out of existence. One way to do that on both ARM and Power is to actually emit code for "flag - flag", but there are a number of other ways to make that work.
> > > >
> > > > And that's what would concern me, considering that these requirements seem to be able to creep out easily. Also, whereas the other atomics just constrain compilers wrt. reordering across atomic accesses or changes to the atomic accesses themselves, the dependencies are new requirements on pieces of otherwise non-synchronizing code. The latter seems far more involved to me.
> > >
> > > Well, the wording of 1.10p9 is pretty explicit on this point. There are only a few exceptions to the rule that dependencies from memory_order_consume loads must be tracked. And to your point about requirements being placed on pieces of otherwise non-synchronizing code, we already have that with plain old load acquire and store release -- both of these put ordering constraints that affect the surrounding non-synchronizing code.
> >
> > I think there's a significant difference. With acquire/release or more general memory orders, it's true that we can't order _across_ the atomic access. However, we can reorder and optimize without additional constraints if we do not reorder. This is not the case with consume memory order, as the (p + flag - flag) example shows.
>
> Agreed, memory_order_consume does introduce additional restrictions.
>
> > > This issue got a lot of discussion, and the compromise is that dependencies cannot leak into or out of functions unless the relevant parameters or return values are annotated with [[carries_dependency]]. This means that the compiler can see all the places where dependencies must be tracked. This is described in 7.6.4.
> >
> > I wasn't aware of 7.6.4 (but it isn't referred to as an additional constraint--what it is--in 1.10, so I guess at least that should be fixed).
> > Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic difference, at least if one is assuming that normal optimization of sequential code is the default, and that maintaining things such as (flag-flag) is not; if optimizing away (flag-flag) would require the insertion of fences unless there is the carries_dependency attribute, then this would be bad I think.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel wrote:
> >
> > Intuitively, this is wrong because this lets the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable.
>
> Btw, what is the definition of "observable" for the atomics?
>
> Because I'm hoping that it's not the same as for volatiles, where "observable" is about the virtual machine itself, and as such volatile accesses cannot be combined or optimized at all.
>
> Now, I claim that atomic accesses cannot be done speculatively for writes, and not re-done for reads (because the value could change), but *combining* them would be possible and good.
>
> For example, we often have multiple independent atomic accesses that could certainly be combined: testing the individual bits of an atomic value with helper functions, causing things like "load atomic, test bit, load same atomic, test another bit". The two atomic loads could be done as a single load without possibly changing semantics on a real machine, but if "visibility" is defined in the same way it is for "volatile", that wouldn't be a valid transformation. Right now we use "volatile" semantics for these kinds of things, and they really can hurt.
>
> Same goes for multiple writes (possibly due to setting bits): combining multiple accesses into a single one is generally fine, it's *adding* write accesses speculatively that is broken by design..
>
> At the same time, you can't combine atomic loads or stores infinitely - "visibility" on a real machine definitely is about timeliness. Removing all but the last write when there are multiple consecutive writes is generally fine, even if you unroll a loop to generate those writes. But if what remains is a loop, it might be a busy-loop basically waiting for something, so it would be wrong ("untimely") to hoist a store in a loop entirely past the end of the loop, or hoist a load in a loop to before the loop.
>
> Does the standard allow for that kind of behavior?

You asked! ;-)

So the current standard allows merging of both loads and stores, unless of course ordering constraints prevent the merging. Volatile semantics may be used to prevent this merging, if desired, for example, for real-time code.

Infinite merging is intended to be prohibited, but I am not certain that the current wording is bullet-proof (1.10p24 and 1.10p25). The only prohibition against speculative stores that I can see is in a non-normative note, and it can be argued to apply only to things that are not atomics (1.10p22).

I don't see any prohibition against reordering a store to precede a load preceding a conditional branch -- which would not be speculative if the branch was known to be taken and the load hit in the store buffer. In a system where stores could be reordered, some other CPU might perceive the store as happening before the load that controlled the conditional branch. This needs to be addressed.

Why this hole? At the time, the current formalizations of popular CPU architectures did not exist, and it was not clear that all popular hardware avoided speculative stores.

There is also fun with "out of thin air" values, which everyone agrees should be prohibited, but where there is not agreement on how to prohibit them in a mathematically constructive manner. The current draft contains a clause simply stating that out-of-thin-air values are prohibited, which doesn't help someone constructing tools to analyze C++ code.
One proposal requires that subsequent atomic stores never be reordered before prior atomic loads, which requires useless ordering code to be emitted on ARM and PowerPC (you may have seen Will Deacon's and Peter Zijlstra's reaction to this proposal a few days ago). Note that Itanium already pays this price in order to provide full single-variable cache coherence.

This out-of-thin-air discussion is also happening in the Java community in preparation for a new rev of the Java memory model.

There will also be some discussions on memory_order_consume, which is intended to (eventually) implement rcu_dereference(). The compiler writers don't like tracking dependencies, but there may be some ways of constraining optimizations to preserve the common dependencies that, while providing some syntax to force preservation of dependencies that would normally be optimized out. One example of this is where you have an RCU-protected array that might sometimes contain only a single element. In the single-element case, the compiler knows a priori which element will be used, and will therefore optimize the dependency away, so that the reader might see pre-initialization state. But this is rare, so if some syntax needs to be added in this case, I believe we should be OK with it. (If syntax is
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <trie...@redhat.com> wrote:
> > >
> > > Intuitively, this is wrong because this lets the program take a step
> > > the abstract machine wouldn't do. This is different to the sequential
> > > code that Peter posted because it uses atomics, and thus one can't
> > > easily assume that the difference is not observable.
> >
> > Btw, what is the definition of "observable" for the atomics?
> >
> > Because I'm hoping that it's not the same as for volatiles, where
> > "observable" is about the virtual machine itself, and as such volatile
> > accesses cannot be combined or optimized at all.
> >
> > Now, I claim that atomic accesses cannot be done speculatively for
> > writes, and not re-done for reads (because the value could change),
> > but *combining* them would be possible and good.
> >
> > For example, we often have multiple independent atomic accesses that
> > could certainly be combined: testing the individual bits of an atomic
> > value with helper functions, causing things like "load atomic, test
> > bit, load same atomic, test another bit". The two atomic loads could
> > be done as a single load without possibly changing semantics on a real
> > machine, but if "visibility" is defined in the same way it is for
> > "volatile", that wouldn't be a valid transformation. Right now we use
> > "volatile" semantics for these kinds of things, and they really can
> > hurt.
> >
> > Same goes for multiple writes (possibly due to setting bits):
> > combining multiple accesses into a single one is generally fine, it's
> > *adding* write accesses speculatively that is broken by design..
> >
> > At the same time, you can't combine atomic loads or stores infinitely
> > - "visibility" on a real machine definitely is about timeliness.
> > Removing all but the last write when there are multiple consecutive
> > writes is generally fine, even if you unroll a loop to generate those
> > writes. But if what remains is a loop, it might be a busy-loop
> > basically waiting for something, so it would be wrong ("untimely") to
> > hoist a store in a loop entirely past the end of the loop, or hoist a
> > load in a loop to before the loop.
> >
> > Does the standard allow for that kind of behavior?
>
> You asked! ;-) So the current standard allows merging of both loads and
> stores, unless of course ordering constraints prevent the merging.
> Volatile semantics may be used to prevent this merging, if desired, for
> example, for real-time code.

Agreed.

> Infinite merging is intended to be prohibited, but I am not certain that
> the current wording is bullet-proof (1.10p24 and 1.10p25).

Yeah, maybe not. But it at least seems to rather clearly indicate the
intent ;)

> The only prohibition against speculative stores that I can see is in a
> non-normative note, and it can be argued to apply only to things that
> are not atomics (1.10p22).

I think this one is specifically about speculative stores that would
affect memory locations that the abstract machine would not write to, and
that might be observable or create data races. While a compiler could
potentially prove that such stores aren't leading to a difference in the
behavior of the program (e.g., by proving that there are no observers
anywhere and this isn't overlapping with any volatile locations), I think
that this is hard in general and most compilers will just not do such
things. In GCC, bugs in that category were fixed after researchers doing
fuzz-testing found them (IIRC, speculative stores by loops).

> I don't see any prohibition against reordering a store to precede a load
> preceding a conditional branch -- which would not be speculative if the
> branch was known to be taken and the load hit in the store buffer. In a
> system where stores could be reordered, some other CPU might perceive
> the store as happening before the load that controlled the conditional
> branch. This needs to be addressed.

I don't know the specifics of your example, but from how I understand it,
I don't see a problem if the compiler can prove that the store will
always happen.

To be more specific, if the compiler can prove that the store will happen
anyway, and the region of code can be assumed to always run atomically
(e.g., there's no loop or such in there), then it is known that we have
one atomic region of code that will always perform the store, so we might
as well do the stuff in the region in some order.

Now, if any of the memory accesses are atomic, then the whole region of
code containing those accesses is often not atomic because other threads
might observe intermediate results in a data-race-free way.

(I know that this isn't a very precise formulation, but I hope it brings
my line of reasoning across.)

> Why this hole? At the time, the current formalizations of popular CPU
> architectures did not exist, and it was not clear that all popular
> hardware avoided speculative stores.

I'm not quite sure which hole you see
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 04:08:11PM -0500, Chris Metcalf wrote: > (+LKML again) > > On 2/10/2014 3:57 PM, Peter Zijlstra wrote: > > On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote: > >> On 2/6/2014 8:52 AM, Peter Zijlstra wrote: > >>> Its been compiled on everything I have a compiler for, however frv and > >>> tile are missing because they're special and I was tired. > >> So what's the specialness on tile? > > Its not doing the atomic work in ASM but uses magic builtins or such. > > > > I got the list of magic funcs for tilegx, but didn't look into the 32bit > > chips. > > Oh, I see. The asm/atomic.h files on tile are already reasonably > well-factored. > > It's possible you could do better, but I think not by too much, other than possibly > by using asm-generic/atomic.h for some of the common idioms like "subtraction > is addition with a negative second argument", etc., which hasn't been done > elsewhere. The last patch 5/5 adds a few atomic ops; I could of course use cmpxchg() loops for everything, but I found tilegx actually has fetch_or and fetch_and to implement atomic_or() / atomic_and(). It doesn't have fetch_xor() from what I've been told so atomic_xor() will have to become a cmpxchg() loop.
Re: [RFC][PATCH 0/5] arch: atomic rework
(+LKML again) On 2/10/2014 3:57 PM, Peter Zijlstra wrote: > On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote: >> On 2/6/2014 8:52 AM, Peter Zijlstra wrote: >>> Its been compiled on everything I have a compiler for, however frv and >>> tile are missing because they're special and I was tired. >> So what's the specialness on tile? > Its not doing the atomic work in ASM but uses magic builtins or such. > > I got the list of magic funcs for tilegx, but didn't look into the 32bit > chips. Oh, I see. The asm/atomic.h files on tile are already reasonably well-factored. It's possible you could do better, but I think not by too much, other than possibly by using asm-generic/atomic.h for some of the common idioms like "subtraction is addition with a negative second argument", etc., which hasn't been done elsewhere. -- Chris Metcalf, Tilera Corp. http://www.tilera.com
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel wrote: > > Intuitively, this is wrong because this lets the program take a step > the abstract machine wouldn't do. This is different to the sequential > code that Peter posted because it uses atomics, and thus one can't > easily assume that the difference is not observable. Btw, what is the definition of "observable" for the atomics? Because I'm hoping that it's not the same as for volatiles, where "observable" is about the virtual machine itself, and as such volatile accesses cannot be combined or optimized at all. Now, I claim that atomic accesses cannot be done speculatively for writes, and not re-done for reads (because the value could change), but *combining* them would be possible and good. For example, we often have multiple independent atomic accesses that could certainly be combined: testing the individual bits of an atomic value with helper functions, causing things like "load atomic, test bit, load same atomic, test another bit". The two atomic loads could be done as a single load without possibly changing semantics on a real machine, but if "visibility" is defined in the same way it is for "volatile", that wouldn't be a valid transformation. Right now we use "volatile" semantics for these kinds of things, and they really can hurt. Same goes for multiple writes (possibly due to setting bits): combining multiple accesses into a single one is generally fine, it's *adding* write accesses speculatively that is broken by design.. At the same time, you can't combine atomic loads or stores infinitely - "visibility" on a real machine definitely is about timeliness. Removing all but the last write when there are multiple consecutive writes is generally fine, even if you unroll a loop to generate those writes.
But if what remains is a loop, it might be a busy-loop basically waiting for something, so it would be wrong ("untimely") to hoist a store in a loop entirely past the end of the loop, or hoist a load in a loop to before the loop. Does the standard allow for that kind of behavior? Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 03:04:43PM +0000, Paul E. McKenney wrote: > On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote: > > On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote: > > > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: > > > > As near as I can tell, compiler writers hate the idea of prohibiting > > > > speculative-store optimizations because it requires them to introduce > > > > both control and data dependency tracking into their compilers. Many of > > > > them seem to hate dependency tracking with a purple passion. At least, > > > > such a hatred would go a long way towards explaining the incomplete > > > > and high-overhead implementations of memory_order_consume, the long > > > > and successful use of idioms based on the memory_order_consume pattern > > > > notwithstanding [*]. ;-) > > > > > > Just tell them that because the hardware provides control dependencies > > > we actually use and rely on them. > > > > s/control/address/ ? > > Both are important, but as Peter's reply noted, it was control > dependencies under discussion. Data dependencies (which include the > ARM/PowerPC notion of address dependencies) are called out by the standard > already, but control dependencies are not. I am not all that satisfied > by current implementations of data dependencies, admittedly. Should > be an interesting discussion. ;-) Ok, but since you can't use control dependencies to order LOAD -> LOAD, it's a pretty big ask of the compiler to make use of them for things like consume, where a data dependency will suffice for any combination of accesses. Will
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote: > On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote: > > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: > > > As near as I can tell, compiler writers hate the idea of prohibiting > > > speculative-store optimizations because it requires them to introduce > > > both control and data dependency tracking into their compilers. Many of > > > them seem to hate dependency tracking with a purple passion. At least, > > > such a hatred would go a long way towards explaining the incomplete > > > and high-overhead implementations of memory_order_consume, the long > > > and successful use of idioms based on the memory_order_consume pattern > > > notwithstanding [*]. ;-) > > > > Just tell them that because the hardware provides control dependencies > > we actually use and rely on them. > > s/control/address/ ? Both are important, but as Peter's reply noted, it was control dependencies under discussion. Data dependencies (which include the ARM/PowerPC notion of address dependencies) are called out by the standard already, but control dependencies are not. I am not all that satisfied by current implementations of data dependencies, admittedly. Should be an interesting discussion. ;-) Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote:
> On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > speculative-store optimizations because it requires them to introduce
> > > both control and data dependency tracking into their compilers. Many of
> > > them seem to hate dependency tracking with a purple passion. At least,
> > > such a hatred would go a long way towards explaining the incomplete
> > > and high-overhead implementations of memory_order_consume, the long
> > > and successful use of idioms based on the memory_order_consume pattern
> > > notwithstanding [*]. ;-)
> >
> > Just tell them that because the hardware provides control dependencies
> > we actually use and rely on them.
>
> s/control/address/ ?

Nope, control. Since stores cannot be speculated and thus require linear
control flow history we can use it to order LOAD -> STORE when the LOAD
is required for the control flow decision and the STORE depends on the
control flow path.

Also see commit 18c03c61444a211237f3d4782353cb38dba795df to
Documentation/memory-barriers.txt

---
commit c7f2e3cd6c1f4932ccc4135d050eae3f7c7aef63
Author: Peter Zijlstra
Date:   Mon Nov 25 11:49:10 2013 +0100

    perf: Optimize ring-buffer write by depending on control dependencies

    Remove a full barrier from the ring-buffer write path by relying on
    a control dependency to order a LOAD -> STORE scenario.

    Cc: "Paul E. McKenney"
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-8alv40z6ikk57jzbaobnx...@git.kernel.org
    Signed-off-by: Ingo Molnar

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index e8b168af135b..146a5792b1d2 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -61,19 +61,20 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 	 *
 	 *   kernel				user
 	 *
-	 *   READ ->data_tail			READ ->data_head
-	 *   smp_mb()	(A)			smp_rmb()	(C)
-	 *   WRITE $data			READ $data
-	 *   smp_wmb()	(B)			smp_mb()	(D)
-	 *   STORE ->data_head			WRITE ->data_tail
+	 *   if (LOAD ->data_tail) {		LOAD ->data_head
+	 *			(A)		smp_rmb()	(C)
+	 *	STORE $data			LOAD $data
+	 *	smp_wmb()	(B)		smp_mb()	(D)
+	 *	STORE ->data_head		STORE ->data_tail
+	 *   }
 	 *
 	 * Where A pairs with D, and B pairs with C.
 	 *
-	 * I don't think A needs to be a full barrier because we won't in fact
-	 * write data until we see the store from userspace. So we simply don't
-	 * issue the data WRITE until we observe it. Be conservative for now.
+	 * In our case (A) is a control dependency that separates the load of
+	 * the ->data_tail and the stores of $data. In case ->data_tail
+	 * indicates there is no room in the buffer to store $data we do not.
 	 *
-	 * OTOH, D needs to be a full barrier since it separates the data READ
+	 * D needs to be a full barrier since it separates the data READ
 	 * from the tail WRITE.
 	 *
 	 * For B a WMB is sufficient since it separates two WRITEs, and for C
@@ -81,7 +82,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 	 *
 	 * See perf_output_begin().
 	 */
-	smp_wmb();
+	smp_wmb(); /* B, matches C */
 	rb->user_page->data_head = head;

 	/*
@@ -144,17 +145,26 @@ int perf_output_begin(struct perf_output_handle *handle,
 		if (!rb->overwrite &&
 		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
 			goto fail;
+
+		/*
+		 * The above forms a control dependency barrier separating the
+		 * @tail load above from the data stores below. Since the @tail
+		 * load is required to compute the branch to fail below.
+		 *
+		 * A, matches D; the full memory barrier userspace SHOULD issue
+		 * after reading the data and before storing the new tail
+		 * position.
+		 *
+		 * See perf_output_put_handle().
+		 */
+
 		head += size;
 	} while (local_cmpxchg(&rb->head, offset, head) != offset);

 	/*
-	 * Separate the userpage->tail read from the data stores below.
-	 * Matches the MB userspace SHOULD issue after reading the data
-	 * and before storing the new tail position.
-	 *
-	 * See perf_output_put_handle().
+	 * We rely on
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote: > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: > > As near as I can tell, compiler writers hate the idea of prohibiting > > speculative-store optimizations because it requires them to introduce > > both control and data dependency tracking into their compilers. Many of > > them seem to hate dependency tracking with a purple passion. At least, > > such a hatred would go a long way towards explaining the incomplete > > and high-overhead implementations of memory_order_consume, the long > > and successful use of idioms based on the memory_order_consume pattern > > notwithstanding [*]. ;-) > > Just tell them that because the hardware provides control dependencies > we actually use and rely on them. s/control/address/ ? Will
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: > As near as I can tell, compiler writers hate the idea of prohibiting > speculative-store optimizations because it requires them to introduce > both control and data dependency tracking into their compilers. Many of > them seem to hate dependency tracking with a purple passion. At least, > such a hatred would go a long way towards explaining the incomplete > and high-overhead implementations of memory_order_consume, the long > and successful use of idioms based on the memory_order_consume pattern > notwithstanding [*]. ;-) Just tell them that because the hardware provides control dependencies we actually use and rely on them. Not that I expect they care too much what we do, given the current state of things.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote: > > Initial state: x == y == 0 > > > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > > atomic_store_explicit(42, y, memory_order_relaxed); > > if (r1 != 42) > > atomic_store_explicit(r1, y, memory_order_relaxed); > > > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > > atomic_store_explicit(r2, x, memory_order_relaxed); > > Intuitively, this is wrong because this lets the program take a step > the abstract machine wouldn't do. This is different to the sequential > code that Peter posted because it uses atomics, and thus one can't > easily assume that the difference is not observable. Yeah, my bad for not being familiar with the atrocious crap C11 made of atomics :/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote: Initial state: x == y == 0 T1: r1 = atomic_load_explicit(x, memory_order_relaxed); atomic_store_explicit(42, y, memory_order_relaxed); if (r1 != 42) atomic_store_explicit(r1, y, memory_order_relaxed); T2: r2 = atomic_load_explicit(y, memory_order_relaxed); atomic_store_explicit(r2, x, memory_order_relaxed); Intuitively, this is wrong because this let's the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable. Yeah, my bad for not being familiar with the atrocious crap C11 made of atomics :/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: As near as I can tell, compiler writers hate the idea of prohibiting speculative-store optimizations because it requires them to introduce both control and data dependency tracking into their compilers. Many of them seem to hate dependency tracking with a purple passion. At least, such a hatred would go a long way towards explaining the incomplete and high-overhead implementations of memory_order_consume, the long and successful use of idioms based on the memory_order_consume pattern notwithstanding [*]. ;-) Just tell them that because the hardware provides control dependencies we actually use and rely on them. Not that I expect they care too much what we do, given the current state of things. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote: On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: As near as I can tell, compiler writers hate the idea of prohibiting speculative-store optimizations because it requires them to introduce both control and data dependency tracking into their compilers. Many of them seem to hate dependency tracking with a purple passion. At least, such a hatred would go a long way towards explaining the incomplete and high-overhead implementations of memory_order_consume, the long and successful use of idioms based on the memory_order_consume pattern notwithstanding [*]. ;-) Just tell them that because the hardware provides control dependencies we actually use and rely on them. s/control/address/ ? Will -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 11:49:29AM +, Will Deacon wrote: On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote: On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: As near as I can tell, compiler writers hate the idea of prohibiting speculative-store optimizations because it requires them to introduce both control and data dependency tracking into their compilers. Many of them seem to hate dependency tracking with a purple passion. At least, such a hatred would go a long way towards explaining the incomplete and high-overhead implementations of memory_order_consume, the long and successful use of idioms based on the memory_order_consume pattern notwithstanding [*]. ;-) Just tell them that because the hardware provides control dependencies we actually use and rely on them. s/control/address/ ? Nope, control. Since stores cannot be speculated and thus require linear control flow history we can use it to order LOAD - STORE when the LOAD is required for the control flow decision and the STORE depends on the control flow path. Also see commit 18c03c61444a211237f3d4782353cb38dba795df to Documentation/memory-barriers.txt --- commit c7f2e3cd6c1f4932ccc4135d050eae3f7c7aef63 Author: Peter Zijlstra pet...@infradead.org Date: Mon Nov 25 11:49:10 2013 +0100 perf: Optimize ring-buffer write by depending on control dependencies Remove a full barrier from the ring-buffer write path by relying on a control dependency to order a LOAD - STORE scenario. Cc: Paul E. 
McKenney paul...@us.ibm.com Signed-off-by: Peter Zijlstra pet...@infradead.org Link: http://lkml.kernel.org/n/tip-8alv40z6ikk57jzbaobnx...@git.kernel.org Signed-off-by: Ingo Molnar mi...@kernel.org diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index e8b168af135b..146a5792b1d2 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -61,19 +61,20 @@ static void perf_output_put_handle(struct perf_output_handle *handle) * * kernel user * -* READ -data_tail READ -data_head -* smp_mb() (A) smp_rmb() (C) -* WRITE $dataREAD $data -* smp_wmb() (B) smp_mb()(D) -* STORE -data_head WRITE -data_tail +* if (LOAD -data_tail) {LOAD -data_head +* (A) smp_rmb() (C) +* STORE $data LOAD $data +* smp_wmb() (B) smp_mb()(D) +* STORE -data_head STORE -data_tail +* } * * Where A pairs with D, and B pairs with C. * -* I don't think A needs to be a full barrier because we won't in fact -* write data until we see the store from userspace. So we simply don't -* issue the data WRITE until we observe it. Be conservative for now. +* In our case (A) is a control dependency that separates the load of +* the -data_tail and the stores of $data. In case -data_tail +* indicates there is no room in the buffer to store $data we do not. * -* OTOH, D needs to be a full barrier since it separates the data READ +* D needs to be a full barrier since it separates the data READ * from the tail WRITE. * * For B a WMB is sufficient since it separates two WRITEs, and for C @@ -81,7 +82,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle) * * See perf_output_begin(). */ - smp_wmb(); + smp_wmb(); /* B, matches C */ rb-user_page-data_head = head; /* @@ -144,17 +145,26 @@ int perf_output_begin(struct perf_output_handle *handle, if (!rb-overwrite unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) size)) goto fail; + + /* +* The above forms a control dependency barrier separating the +* @tail load above from the data stores below. 
Since the @tail +* load is required to compute the branch to fail below. +* +* A, matches D; the full memory barrier userspace SHOULD issue +* after reading the data and before storing the new tail +* position. +* +* See perf_output_put_handle(). +*/ + head += size; } while (local_cmpxchg(rb-head, offset, head) != offset); /* -* Separate the userpage-tail read from the data stores below. -* Matches the MB userspace SHOULD issue after reading the data -* and before storing the new tail position. -* -* See perf_output_put_handle(). +
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 11:49:29AM +, Will Deacon wrote: On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote: On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: As near as I can tell, compiler writers hate the idea of prohibiting speculative-store optimizations because it requires them to introduce both control and data dependency tracking into their compilers. Many of them seem to hate dependency tracking with a purple passion. At least, such a hatred would go a long way towards explaining the incomplete and high-overhead implementations of memory_order_consume, the long and successful use of idioms based on the memory_order_consume pattern notwithstanding [*]. ;-) Just tell them that because the hardware provides control dependencies we actually use and rely on them. s/control/address/ ? Both are important, but as Peter's reply noted, it was control dependencies under discussion. Data dependencies (which include the ARM/PowerPC notion of address dependencies) are called out by the standard already, but control dependencies are not. I am not all that satisified by current implementations of data dependencies, admittedly. Should be an interesting discussion. ;-) Thanx, Paul -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 03:04:43PM +, Paul E. McKenney wrote: On Mon, Feb 10, 2014 at 11:49:29AM +, Will Deacon wrote: On Mon, Feb 10, 2014 at 11:48:13AM +, Peter Zijlstra wrote: On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: As near as I can tell, compiler writers hate the idea of prohibiting speculative-store optimizations because it requires them to introduce both control and data dependency tracking into their compilers. Many of them seem to hate dependency tracking with a purple passion. At least, such a hatred would go a long way towards explaining the incomplete and high-overhead implementations of memory_order_consume, the long and successful use of idioms based on the memory_order_consume pattern notwithstanding [*]. ;-) Just tell them that because the hardware provides control dependencies we actually use and rely on them. s/control/address/ ? Both are important, but as Peter's reply noted, it was control dependencies under discussion. Data dependencies (which include the ARM/PowerPC notion of address dependencies) are called out by the standard already, but control dependencies are not. I am not all that satisified by current implementations of data dependencies, admittedly. Should be an interesting discussion. ;-) Ok, but since you can't use control dependencies to order LOAD - LOAD, it's a pretty big ask of the compiler to make use of them for things like consume, where a data dependency will suffice for any combination of accesses. Will -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel trie...@redhat.com wrote: Intuitively, this is wrong because this let's the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable. Btw, what is the definition of observable for the atomics? Because I'm hoping that it's not the same as for volatiles, where observable is about the virtual machine itself, and as such volatile accesses cannot be combined or optimized at all. Now, I claim that atomic accesses cannot be done speculatively for writes, and not re-done for reads (because the value could change), but *combining* them would be possible and good. For example, we often have multiple independent atomic accesses that could certainly be combined: testing the individual bits of an atomic value with helper functions, causing things like load atomic, test bit, load same atomic, test another bit. The two atomic loads could be done as a single load without possibly changing semantics on a real machine, but if visibility is defined in the same way it is for volatile, that wouldn't be a valid transformation. Right now we use volatile semantics for these kinds of things, and they really can hurt. Same goes for multiple writes (possibly due to setting bits): combining multiple accesses into a single one is generally fine, it's *adding* write accesses speculatively that is broken by design.. At the same time, you can't combine atomic loads or stores infinitely - visibility on a real machine definitely is about timeliness. Removing all but the last write when there are multiple consecutive writes is generally fine, even if you unroll a loop to generate those writes. 
But if what remains is a loop, it might be a busy-loop basically waiting for something, so it would be wrong (untimely) to hoist a store in a loop entirely past the end of the loop, or hoist a load in a loop to before the loop. Does the standard allow for that kind of behavior? Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
(+LKML again) On 2/10/2014 3:57 PM, Peter Zijlstra wrote: On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote: On 2/6/2014 8:52 AM, Peter Zijlstra wrote: Its been compiled on everything I have a compiler for, however frv and tile are missing because they're special and I was tired. So what's the specialness on tile? Its not doing the atomic work in ASM but uses magic builtins or such. I got the list of magic funcs for tilegx, but didn't look into the 32bit chips. Oh, I see. The asm/atomic.h files on tile are already reasonably well-factored. It's possible you could do better, but I think not by too much, other than possibly by using asm-generic/atomic.h for some of the common idioms like subtraction is addition with a negative second argument, etc., which hasn't been done elsewhere. -- Chris Metcalf, Tilera Corp. http://www.tilera.com
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 04:08:11PM -0500, Chris Metcalf wrote: (+LKML again) On 2/10/2014 3:57 PM, Peter Zijlstra wrote: On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote: On 2/6/2014 8:52 AM, Peter Zijlstra wrote: Its been compiled on everything I have a compiler for, however frv and tile are missing because they're special and I was tired. So what's the specialness on tile? Its not doing the atomic work in ASM but uses magic builtins or such. I got the list of magic funcs for tilegx, but didn't look into the 32bit chips. Oh, I see. The asm/atomic.h files on tile are already reasonably well-factored. It's possible you could do better, but I think not by too much, other than possibly by using asm-generic/atomic.h for some of the common idioms like subtraction is addition with a negative second argument, etc., which hasn't been done elsewhere. The last patch 5/5 adds a few atomic ops; I could of course use cmpxchg() loops for everything, but I found tilegx actually has fetch_or and fetch_and to implement atomic_or() / atomic_and(). It doesn't have fetch_xor() from what I've been told so atomic_xor() will have to become a cmpxchg() loop.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote: > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote: > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote: > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +, Will Deacon wrote: > > > > > > > There are also so many ways to blow your head off it's untrue. > > > > > > > For example, > > > > > > > cmpxchg takes a separate memory model parameter for failure and > > > > > > > success, but > > > > > > > then there are restrictions on the sets you can use for each. > > > > > > > It's not hard > > > > > > > to find well-known memory-ordering experts shouting "Just use > > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". > > > > > > > Then there's > > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely > > > > > > > ignores consume > > > > > > > atm and optimises all of the data dependencies away) as well as > > > > > > > the definition > > > > > > > of "data races", which seem to be used as an excuse to miscompile > > > > > > > a program > > > > > > > at the earliest opportunity. > > > > > > > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of > > > > > > memory_order_consume until the compilers implement it both > > > > > > correctly and > > > > > > efficiently. They are not there yet, and there is currently no > > > > > > shortage > > > > > > of compiler writers who would prefer to ignore memory_order_consume. > > > > > > > > > > Do you have any input on > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the > > > > > language standard's definition of dependencies? > > > > > > > > Let's see... 
1.10p9 says that a dependency must be carried unless: > > > > > > > > — B is an invocation of any specialization of std::kill_dependency > > > > (29.3), or > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or > > > > logical OR (||, see 5.15) operator, > > > > or > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or > > > > — A is the left operand of the built-in comma (,) operator (5.18); > > > > > > > > So the use of "flag" before the "?" is ignored. But the "flag - flag" > > > > after the "?" will carry a dependency, so the code fragment in 59448 > > > > needs to do the ordering rather than just optimizing "flag - flag" out > > > > of existence. One way to do that on both ARM and Power is to actually > > > > emit code for "flag - flag", but there are a number of other ways to > > > > make that work. > > > > > > And that's what would concern me, considering that these requirements > > > seem to be able to creep out easily. Also, whereas the other atomics > > > just constrain compilers wrt. reordering across atomic accesses or > > > changes to the atomic accesses themselves, the dependencies are new > > > requirements on pieces of otherwise non-synchronizing code. The latter > > > seems far more involved to me. > > > > Well, the wording of 1.10p9 is pretty explicit on this point. > > There are only a few exceptions to the rule that dependencies from > > memory_order_consume loads must be tracked. And to your point about > > requirements being placed on pieces of otherwise non-synchronizing code, > > we already have that with plain old load acquire and store release -- > > both of these put ordering constraints that affect the surrounding > > non-synchronizing code. > > I think there's a significant difference. With acquire/release or more > general memory orders, it's true that we can't order _across_ the atomic > access. However, we can reorder and optimize without additional > constraints if we do not reorder. 
This is not the case with consume > memory order, as the (p + flag - flag) example shows. Agreed, memory_order_consume does introduce additional restrictions. > > This issue got a lot of discussion, and the compromise is that > > dependencies cannot leak into or out of functions unless the relevant > > parameters or return values are annotated with [[carries_dependency]]. > > This means that the compiler can see all the places where dependencies > > must be tracked. This is described in 7.6.4. > > I wasn't aware of 7.6.4 (but it isn't referred to as an additional > constraint--what it is--in 1.10, so I guess at least that should be > fixed). > Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic > difference, at least if one is assuming that normal optimization of > sequential code is the default, and that maintaining things such as > (flag-flag) is not; if optimizing away (flag-flag) would require the > insertion of fences unless there is the carries_dependency attribute, > then this would be bad I think. No, the attribute does not
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote: > On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote: > > On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote: [ . . . ] > > And then it is a short and uncontroversial step to the following: > > > > Initial state: x == y == 0 > > > > T1: atomic_store_explicit(42, y, memory_order_relaxed); > > r1 = atomic_load_explicit(x, memory_order_relaxed); > > if (r1 != 42) > > atomic_store_explicit(r1, y, memory_order_relaxed); > > > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > > atomic_store_explicit(r2, x, memory_order_relaxed); > > > > This can of course result in r1 == r2 == 42, even though the constant > > 42 never appeared in the original code. This is one way to generate > > an out-of-thin-air value. > > > > As near as I can tell, compiler writers hate the idea of prohibiting > > speculative-store optimizations because it requires them to introduce > > both control and data dependency tracking into their compilers. > > I wouldn't characterize the situation like this (although I can't speak > for others, obviously). IMHO, it's perfectly fine on sequential / > non-synchronizing code, because we know the difference isn't observable > by a correct program. For synchronizing code, compilers just shouldn't > do it, or they would have to truly prove that speculation is harmless. > That will be hard, so I think it should just be avoided. > > Synchronization code will likely have been tuned anyway (especially if > it uses relaxed MO), so I don't see a large need for trying to optimize > using speculative atomic stores. > > Thus, I think there's an easy and practical solution. I like this approach, but there has been resistance to it in the past. Definitely worth a good try, though! > > Many of > > them seem to hate dependency tracking with a purple passion. 
At least, > > such a hatred would go a long way towards explaining the incomplete > > and high-overhead implementations of memory_order_consume, the long > > and successful use of idioms based on the memory_order_consume pattern > > notwithstanding [*]. ;-) > > I still think that's different because it blurs the difference between > sequential code and synchronizing code (ie, atomic accesses). With > consume MO, the simple solution above doesn't work anymore, because > suddenly synchronizing code does affect optimizations in sequential > code, even if that wouldn't reorder across the synchronizing code (which > would be clearly "visible" to the implementation of the optimization). I understand that memory_order_consume is a bit harder on compiler writers than the other memory orders, but it is also pretty valuable. Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote: > On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote: > > On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote: > > > Hi Paul, > > > > > > On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote: > > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > > > > Hopefully some discussion of out-of-thin-air values as well. > > > > > > > > > > Yes, absolutely shoot store speculation in the head already. Then > > > > > drive > > > > > a wooden stake through its hart. > > > > > > > > > > C11/C++11 should not be allowed to claim itself a memory model until > > > > > that > > > > > is sorted. > > > > > > > > There actually is a proposal being put forward, but it might not make > > > > ARM > > > > and Power people happy because it involves adding a compare, a branch, > > > > and an ISB/isync after every relaxed load... Me, I agree with you, > > > > much preferring the no-store-speculation approach. > > > > > > Can you elaborate a bit on this please? We don't permit speculative stores > > > in the ARM architecture, so it seems counter-intuitive that GCC needs to > > > emit any additional instructions to prevent that from happening. > > > > Requiring a compare/branch/ISB after each relaxed load enables a simple(r) > > proof that out-of-thin-air values cannot be observed in the face of any > > compiler optimization that refrains from reordering a prior relaxed load > > with a subsequent relaxed store. > > > > > Stores can, of course, be observed out-of-order but that's a lot more > > > reasonable :) > > > > So let me try an example. 
I am sure that Torvald Riegel will jump in > > with any needed corrections or amplifications: > > > > Initial state: x == y == 0 > > > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > > atomic_store_explicit(r1, y, memory_order_relaxed); > > > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > > atomic_store_explicit(r2, x, memory_order_relaxed); > > > > One would intuitively expect r1 == r2 == 0 as the only possible outcome. > > But suppose that the compiler used specialization optimizations, as it > > would if there was a function that has a very lightweight implementation > > for some values and a very heavyweight one for other. In particular, > > suppose that the lightweight implementation was for the value 42. > > Then the compiler might do something like the following: > > > > Initial state: x == y == 0 > > > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > > if (r1 == 42) > > atomic_store_explicit(42, y, memory_order_relaxed); > > else > > atomic_store_explicit(r1, y, memory_order_relaxed); > > > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > > atomic_store_explicit(r2, x, memory_order_relaxed); > > > > Suddenly we have an explicit constant 42 showing up. Of course, if > > the compiler carefully avoided speculative stores (as both Peter and > > I believe that it should if its code generation is to be regarded as > > anything other than an act of vandalism, the words in the standard > > notwithstanding), there would be no problem. 
But currently, a number > > of compiler writers see absolutely nothing wrong with transforming > > the optimized-for-42 version above with something like this: > > > > Initial state: x == y == 0 > > > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > > atomic_store_explicit(42, y, memory_order_relaxed); > > if (r1 != 42) > > atomic_store_explicit(r1, y, memory_order_relaxed); > > > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > > atomic_store_explicit(r2, x, memory_order_relaxed); > Intuitively, this is wrong because this let's the program take a step > the abstract machine wouldn't do. This is different to the sequential > code that Peter posted because it uses atomics, and thus one can't > easily assume that the difference is not observable. > > For this to be correct, the compiler would actually have to prove that > the speculative store is "as-if correct", which in turn would mean that > it needs to be aware of all potential observers, and check whether those > observers aren't actually affected by the speculative store. > > I would guess that the compilers you have in mind don't really do that. > If they do, then I don't see why this should be okay, unless you think > out-of-thin-air values are something good (which I wouldn't agree with). OK, we agree that pulling the atomic store to y out of its "if" statement is a bad thing. Very good! Now we just have to convince others on the committee. ;-) Thanx, Paul
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Feb 9, 2014 at 5:46 PM, Torvald Riegel wrote: > > IOW, I wrote that such a compiler transformation would be wrong in my > opinion. Thus, it should *not* return 42. Ahh, I am happy to have misunderstood. The "intuitively" threw me, because I thought that was building up to a "but", and misread the rest. I then react strongly, because I've seen so much total crap (the type-based C aliasing rules topping my list) etc coming out of standards groups because it allows them to generate wrong code that goes faster, that I just assume compiler people are out to do stupid things in the name of "..but the standard allows it". Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, 2014-02-09 at 17:24 -0800, Linus Torvalds wrote: > On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel wrote: > > > > (a) seems to say that you don't like requiring programmers to mark > > atomic accesses specially. Is that the case? > > In Paul's example, they were marked specially. > > And you seemed to argue that Paul's example could possibly return > anything but 0/0. Just read my reply to Paul again. Here's an excerpt: > Initial state: x == y == 0 > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > atomic_store_explicit(42, y, memory_order_relaxed); > if (r1 != 42) > atomic_store_explicit(r1, y, memory_order_relaxed); > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > atomic_store_explicit(r2, x, memory_order_relaxed); Intuitively, this is wrong because this lets the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable. IOW, I wrote that such a compiler transformation would be wrong in my opinion. Thus, it should *not* return 42. If you still see how what I wrote could be misunderstood, please let me know because I really don't see it. Yes, I don't do so by swearing or calling others insane, and I try to see the reasoning that those compilers' authors might have had, even if I don't agree with it. In my personal opinion, that's a good thing. > If you really think that, I hope to God that you have nothing to do > with the C standard or any actual compiler I ever use. > > Because such a standard or compiler would be shit. It's sadly not too > uncommon. Thanks for the kind words. Perhaps writing that was quicker than reading what I actually wrote.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel wrote: > > (a) seems to say that you don't like requiring programmers to mark > atomic accesses specially. Is that the case? In Paul's example, they were marked specially. And you seemed to argue that Paul's example could possibly return anything but 0/0. If you really think that, I hope to God that you have nothing to do with the C standard or any actual compiler I ever use. Because such a standard or compiler would be shit. It's sadly not too uncommon. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, 2014-02-09 at 16:56 -0800, Linus Torvalds wrote: > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel wrote: > > > > I wouldn't characterize the situation like this (although I can't speak > > for others, obviously). IMHO, it's perfectly fine on sequential / > > non-synchronizing code, because we know the difference isn't observable > > by a correct program. > > What BS is that? If you use an "atomic_store_explicit()", by > definition you're either > > (a) f*cking insane > (b) not doing sequential non-synchronizing code > > and a compiler that assumes that the programmer is insane may actually > be correct more often than not, but it's still a shit compiler. > Agreed? Due to all the expletives, I can't really understand what you are saying. Nor does what I guess it might mean seem to relate to what I said. (a) seems to say that you don't like requiring programmers to mark atomic accesses specially. Is that the case? If so, I have to disagree. If you're writing concurrent code, marking the atomic accesses is the least of your problem. Instead, for the concurrent code I've written, it rather improved readability; it made code more verbose, but in turn it made the synchronization easier to see. Beside this question of taste (and I don't care what the Kernel style guides are), there is a real technical argument here, though: Requiring all synchronizing memory accesses to be annotated makes the compiler aware what is sequential code, and what is not. Without it, one can't exploit data-race-freedom. So unless we want to slow down optimization of sequential code, we need to tell the compiler what is what. If every access is potentially synchronizing, then this doesn't just prevent speculative stores. (b) seems as if you are saying that if there is a specially annotated atomic access in the code, that this isn't sequential/non-synchronizing code anymore. I don't agree with that, obviously. > So I don't see how any sane person can say that speculative writes are > ok. 
They are clearly not ok. We are discussing programs written against the C11/C++11 memory model here. At least that's the part that got forwarded to g...@gcc.gnu.org, and was subject of the nearest emails in the thread. This memory model requires programs to be data-race-free. Thus, we can optimize accordingly. If you don't like that, then this isn't C11/C++11 anymore. Which would be fine, but then complain about that specifically. > Speculative stores are a bad idea in general. I disagree in the context of C11/C++11 programs. At least from a correctness point of view. > They are completely > invalid for anything that says "atomic". I agree, as you will see when you read the other emails I posted as part of this thread. But those two things are separate.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel wrote: > > I wouldn't characterize the situation like this (although I can't speak > for others, obviously). IMHO, it's perfectly fine on sequential / > non-synchronizing code, because we know the difference isn't observable > by a correct program. What BS is that? If you use an "atomic_store_explicit()", by definition you're either (a) f*cking insane (b) not doing sequential non-synchronizing code and a compiler that assumes that the programmer is insane may actually be correct more often than not, but it's still a shit compiler. Agreed? So I don't see how any sane person can say that speculative writes are ok. They are clearly not ok. Speculative stores are a bad idea in general. They are completely invalid for anything that says "atomic". This is not even worth discussing. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote: > On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote: > > Hi Paul, > > > > On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote: > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > > > Hopefully some discussion of out-of-thin-air values as well. > > > > > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > > > a wooden stake through its hart. > > > > > > > > C11/C++11 should not be allowed to claim itself a memory model until > > > > that > > > > is sorted. > > > > > > There actually is a proposal being put forward, but it might not make ARM > > > and Power people happy because it involves adding a compare, a branch, > > > and an ISB/isync after every relaxed load... Me, I agree with you, > > > much preferring the no-store-speculation approach. > > > > Can you elaborate a bit on this please? We don't permit speculative stores > > in the ARM architecture, so it seems counter-intuitive that GCC needs to > > emit any additional instructions to prevent that from happening. > > Requiring a compare/branch/ISB after each relaxed load enables a simple(r) > proof that out-of-thin-air values cannot be observed in the face of any > compiler optimization that refrains from reordering a prior relaxed load > with a subsequent relaxed store. > > > Stores can, of course, be observed out-of-order but that's a lot more > > reasonable :) > > So let me try an example. I am sure that Torvald Riegel will jump in > with any needed corrections or amplifications: > > Initial state: x == y == 0 > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > atomic_store_explicit(r1, y, memory_order_relaxed); > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > atomic_store_explicit(r2, x, memory_order_relaxed); > > One would intuitively expect r1 == r2 == 0 as the only possible outcome. 
> But suppose that the compiler used specialization optimizations, as it > would if there was a function that has a very lightweight implementation > for some values and a very heavyweight one for other. In particular, > suppose that the lightweight implementation was for the value 42. > Then the compiler might do something like the following: > > Initial state: x == y == 0 > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > if (r1 == 42) > atomic_store_explicit(42, y, memory_order_relaxed); > else > atomic_store_explicit(r1, y, memory_order_relaxed); > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > atomic_store_explicit(r2, x, memory_order_relaxed); > > Suddenly we have an explicit constant 42 showing up. Of course, if > the compiler carefully avoided speculative stores (as both Peter and > I believe that it should if its code generation is to be regarded as > anything other than an act of vandalism, the words in the standard > notwithstanding), there would be no problem. But currently, a number > of compiler writers see absolutely nothing wrong with transforming > the optimized-for-42 version above with something like this: > > Initial state: x == y == 0 > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > atomic_store_explicit(42, y, memory_order_relaxed); > if (r1 != 42) > atomic_store_explicit(r1, y, memory_order_relaxed); > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > atomic_store_explicit(r2, x, memory_order_relaxed); Intuitively, this is wrong because this let's the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable. 
For this to be correct, the compiler would actually have to prove that the speculative store is "as-if correct", which in turn would mean that it needs to be aware of all potential observers, and check whether those observers aren't actually affected by the speculative store. I would guess that the compilers you have in mind don't really do that. If they do, then I don't see why this should be okay, unless you think out-of-thin-air values are something good (which I wouldn't agree with). > And then it is a short and uncontroversial step to the following: > > Initial state: x == y == 0 > > T1: atomic_store_explicit(42, y, memory_order_relaxed); > r1 = atomic_load_explicit(x, memory_order_relaxed); > if (r1 != 42) > atomic_store_explicit(r1, y, memory_order_relaxed); > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > atomic_store_explicit(r2, x, memory_order_relaxed); > > This can of course result in r1 == r2 == 42, even though the constant > 42 never appeared in the original code. This is one way to generate > an out-of-thin-air value.
Re: [RFC][PATCH 0/5] arch: atomic rework
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote: On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote: On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote: There are also so many ways to blow your head off it's untrue. For example, cmpxchg takes a separate memory model parameter for failure and success, but then there are restrictions on the sets you can use for each. It's not hard to find well-known memory-ordering experts shouting "Just use memory_model_seq_cst for everything, it's too hard otherwise". Then there's the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume atm and optimises all of the data dependencies away) as well as the definition of "data races", which seem to be used as an excuse to miscompile a program at the earliest opportunity. Trust me, rcu_dereference() is not going to be defined in terms of memory_order_consume until the compilers implement it both correctly and efficiently. They are not there yet, and there is currently no shortage of compiler writers who would prefer to ignore memory_order_consume. Do you have any input on http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the language standard's definition of dependencies? Let's see... 1.10p9 says that a dependency must be carried unless:

— B is an invocation of any specialization of std::kill_dependency (29.3), or
— A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator, or
— A is the left operand of a conditional (?:, see 5.16) operator, or
— A is the left operand of the built-in comma (,) operator (5.18);

So the use of "flag" before the "?" is ignored. But the "flag - flag" after the "?"
will carry a dependency, so the code fragment in 59448 needs to do the ordering rather than just optimizing "flag - flag" out of existence. One way to do that on both ARM and Power is to actually emit code for "flag - flag", but there are a number of other ways to make that work. And that's what would concern me, considering that these requirements seem to be able to creep out easily. Also, whereas the other atomics just constrain compilers wrt. reordering across atomic accesses or changes to the atomic accesses themselves, the dependencies are new requirements on pieces of otherwise non-synchronizing code. The latter seems far more involved to me. Well, the wording of 1.10p9 is pretty explicit on this point. There are only a few exceptions to the rule that dependencies from memory_order_consume loads must be tracked. And to your point about requirements being placed on pieces of otherwise non-synchronizing code, we already have that with plain old load acquire and store release -- both of these put ordering constraints that affect the surrounding non-synchronizing code. I think there's a significant difference. With acquire/release or more general memory orders, it's true that we can't order _across_ the atomic access. However, we can reorder and optimize without additional constraints if we do not reorder. This is not the case with consume memory order, as the (p + flag - flag) example shows. This issue got a lot of discussion, and the compromise is that dependencies cannot leak into or out of functions unless the relevant parameters or return values are annotated with [[carries_dependency]]. This means that the compiler can see all the places where dependencies must be tracked. This is described in 7.6.4. I wasn't aware of 7.6.4 (but it isn't referred to as an additional constraint--what it is--in 1.10, so I guess at least that should be fixed).
Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic difference, at least if one is assuming that normal optimization of sequential code is the default, and that maintaining things such as (flag-flag) is not; if optimizing away (flag-flag) would require the insertion of fences unless there is the carries_dependency attribute, then this would be bad, I think. IMHO, the dependency constructs (carries_dependency, kill_dependency) seem backwards to me. They assume that the compiler preserves all those dependencies, including (flag-flag), by default, which prohibits meaningful optimizations. Instead, I guess there should be a construct for explicitly exploiting the dependencies (e.g., a preserve_dependencies call, whose argument will not be optimized fully). Exploiting dependencies will be special code and isn't the common case, so it can require additional annotations. If a dependency chain headed by a memory_order_consume load
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote: On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote: Hi Paul, On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote: On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: Hopefully some discussion of out-of-thin-air values as well. Yes, absolutely shoot store speculation in the head already. Then drive a wooden stake through its heart. C11/C++11 should not be allowed to claim itself a memory model until that is sorted. There actually is a proposal being put forward, but it might not make ARM and Power people happy because it involves adding a compare, a branch, and an ISB/isync after every relaxed load... Me, I agree with you, much preferring the no-store-speculation approach. Can you elaborate a bit on this please? We don't permit speculative stores in the ARM architecture, so it seems counter-intuitive that GCC needs to emit any additional instructions to prevent that from happening. Requiring a compare/branch/ISB after each relaxed load enables a simple(r) proof that out-of-thin-air values cannot be observed in the face of any compiler optimization that refrains from reordering a prior relaxed load with a subsequent relaxed store. Stores can, of course, be observed out-of-order but that's a lot more reasonable :) So let me try an example. I am sure that Torvald Riegel will jump in with any needed corrections or amplifications:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);
    atomic_store_explicit(&y, r1, memory_order_relaxed);

T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
    atomic_store_explicit(&x, r2, memory_order_relaxed);

One would intuitively expect r1 == r2 == 0 as the only possible outcome.
But suppose that the compiler used specialization optimizations, as it would if there was a function that has a very lightweight implementation for some values and a very heavyweight one for others. In particular, suppose that the lightweight implementation was for the value 42. Then the compiler might do something like the following:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);
    if (r1 == 42)
        atomic_store_explicit(&y, 42, memory_order_relaxed);
    else
        atomic_store_explicit(&y, r1, memory_order_relaxed);

T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
    atomic_store_explicit(&x, r2, memory_order_relaxed);

Suddenly we have an explicit constant 42 showing up. Of course, if the compiler carefully avoided speculative stores (as both Peter and I believe that it should if its code generation is to be regarded as anything other than an act of vandalism, the words in the standard notwithstanding), there would be no problem. But currently, a number of compiler writers see absolutely nothing wrong with transforming the optimized-for-42 version above into something like this:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);
    atomic_store_explicit(&y, 42, memory_order_relaxed);
    if (r1 != 42)
        atomic_store_explicit(&y, r1, memory_order_relaxed);

T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
    atomic_store_explicit(&x, r2, memory_order_relaxed);

Intuitively, this is wrong because this lets the program take a step the abstract machine wouldn't do. This is different from the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable. For this to be correct, the compiler would actually have to prove that the speculative store is as-if correct, which in turn would mean that it needs to be aware of all potential observers, and check whether those observers aren't actually affected by the speculative store.
I would guess that the compilers you have in mind don't really do that. If they do, then I don't see why this should be okay, unless you think out-of-thin-air values are something good (which I wouldn't agree with). And then it is a short and uncontroversial step to the following:

Initial state: x == y == 0

T1: atomic_store_explicit(&y, 42, memory_order_relaxed);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    if (r1 != 42)
        atomic_store_explicit(&y, r1, memory_order_relaxed);

T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
    atomic_store_explicit(&x, r2, memory_order_relaxed);

This can of course result in r1 == r2 == 42, even though the constant 42 never appeared in the original code. This is one way to generate an out-of-thin-air value. As near as I can tell, compiler writers hate the idea of prohibiting speculative-store optimizations because it requires them to introduce
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel trie...@redhat.com wrote: I wouldn't characterize the situation like this (although I can't speak for others, obviously). IMHO, it's perfectly fine on sequential / non-synchronizing code, because we know the difference isn't observable by a correct program. What BS is that? If you use an atomic_store_explicit(), by definition you're either (a) f*cking insane (b) not doing sequential non-synchronizing code and a compiler that assumes that the programmer is insane may actually be correct more often than not, but it's still a shit compiler. Agreed? So I don't see how any sane person can say that speculative writes are ok. They are clearly not ok. Speculative stores are a bad idea in general. They are completely invalid for anything that says atomic. This is not even worth discussing. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, 2014-02-09 at 16:56 -0800, Linus Torvalds wrote: On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel trie...@redhat.com wrote: I wouldn't characterize the situation like this (although I can't speak for others, obviously). IMHO, it's perfectly fine on sequential / non-synchronizing code, because we know the difference isn't observable by a correct program. What BS is that? If you use an atomic_store_explicit(), by definition you're either (a) f*cking insane (b) not doing sequential non-synchronizing code and a compiler that assumes that the programmer is insane may actually be correct more often than not, but it's still a shit compiler. Agreed? Due to all the expletives, I can't really understand what you are saying. Nor does what I guess it might mean seem to relate to what I said. (a) seems to say that you don't like requiring programmers to mark atomic accesses specially. Is that the case? If so, I have to disagree. If you're writing concurrent code, marking the atomic accesses is the least of your problem. Instead, for the concurrent code I've written, it rather improved readability; it made code more verbose, but in turn it made the synchronization easier to see. Beside this question of taste (and I don't care what the Kernel style guides are), there is a real technical argument here, though: Requiring all synchronizing memory accesses to be annotated makes the compiler aware what is sequential code, and what is not. Without it, one can't exploit data-race-freedom. So unless we want to slow down optimization of sequential code, we need to tell the compiler what is what. If every access is potentially synchronizing, then this doesn't just prevent speculative stores. (b) seems as if you are saying that if there is a specially annotated atomic access in the code, that this isn't sequential/non-synchronizing code anymore. I don't agree with that, obviously. So I don't see how any sane person can say that speculative writes are ok. They are clearly not ok. 
We are discussing programs written against the C11/C++11 memory model here. At least that's the part that got forwarded to g...@gcc.gnu.org, and was subject of the nearest emails in the thread. This memory model requires programs to be data-race-free. Thus, we can optimize accordingly. If you don't like that, then this isn't C11/C++11 anymore. Which would be fine, but then complain about that specifically. Speculative stores are a bad idea in general. I disagree in the context of C11/C++11 programs. At least from a correctness point of view. They are completely invalid for anything that says atomic. I agree, as you will see when you read the other emails I posted as part of this thread. But those two things are separate.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel trie...@redhat.com wrote: (a) seems to say that you don't like requiring programmers to mark atomic accesses specially. Is that the case? In Paul's example, they were marked specially. And you seemed to argue that Paul's example could possibly return anything but 0/0. If you really think that, I hope to God that you have nothing to do with the C standard or any actual compiler I ever use. Because such a standard or compiler would be shit. It's sadly not too uncommon. Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, 2014-02-09 at 17:24 -0800, Linus Torvalds wrote: On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel trie...@redhat.com wrote: (a) seems to say that you don't like requiring programmers to mark atomic accesses specially. Is that the case? In Paul's example, they were marked specially. And you seemed to argue that Paul's example could possibly return anything but 0/0. Just read my reply to Paul again. Here's an excerpt:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);
    atomic_store_explicit(&y, 42, memory_order_relaxed);
    if (r1 != 42)
        atomic_store_explicit(&y, r1, memory_order_relaxed);

T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
    atomic_store_explicit(&x, r2, memory_order_relaxed);

Intuitively, this is wrong because this lets the program take a step the abstract machine wouldn't do. This is different from the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable. IOW, I wrote that such a compiler transformation would be wrong in my opinion. Thus, it should *not* return 42. If you still see how what I wrote could be misunderstood, please let me know because I really don't see it. Yes, I don't do so by swearing or calling others insane, and I try to see the reasoning that those compilers' authors might have had, even if I don't agree with it. In my personal opinion, that's a good thing. If you really think that, I hope to God that you have nothing to do with the C standard or any actual compiler I ever use. Because such a standard or compiler would be shit. It's sadly not too uncommon. Thanks for the kind words. Perhaps writing that was quicker than reading what I actually wrote.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Sun, Feb 9, 2014 at 5:46 PM, Torvald Riegel trie...@redhat.com wrote: IOW, I wrote that such a compiler transformation would be wrong in my opinion. Thus, it should *not* return 42. Ahh, I am happy to have misunderstood. The "intuitively" threw me, because I thought that was building up to a "but", and misread the rest. I then react strongly, because I've seen so much total crap (the type-based C aliasing rules topping my list) etc coming out of standards groups because it allows them to generate wrong code that goes faster, that I just assume compiler people are out to do stupid things in the name of "..but the standard allows it". Linus
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote: On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote: On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote: Hi Paul, On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote: On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: Hopefully some discussion of out-of-thin-air values as well. Yes, absolutely shoot store speculation in the head already. Then drive a wooden stake through its heart. C11/C++11 should not be allowed to claim itself a memory model until that is sorted. There actually is a proposal being put forward, but it might not make ARM and Power people happy because it involves adding a compare, a branch, and an ISB/isync after every relaxed load... Me, I agree with you, much preferring the no-store-speculation approach. Can you elaborate a bit on this please? We don't permit speculative stores in the ARM architecture, so it seems counter-intuitive that GCC needs to emit any additional instructions to prevent that from happening. Requiring a compare/branch/ISB after each relaxed load enables a simple(r) proof that out-of-thin-air values cannot be observed in the face of any compiler optimization that refrains from reordering a prior relaxed load with a subsequent relaxed store. Stores can, of course, be observed out-of-order but that's a lot more reasonable :) So let me try an example. I am sure that Torvald Riegel will jump in with any needed corrections or amplifications:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);
    atomic_store_explicit(&y, r1, memory_order_relaxed);

T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
    atomic_store_explicit(&x, r2, memory_order_relaxed);

One would intuitively expect r1 == r2 == 0 as the only possible outcome.
But suppose that the compiler used specialization optimizations, as it would if there was a function that has a very lightweight implementation for some values and a very heavyweight one for others. In particular, suppose that the lightweight implementation was for the value 42. Then the compiler might do something like the following:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);
    if (r1 == 42)
        atomic_store_explicit(&y, 42, memory_order_relaxed);
    else
        atomic_store_explicit(&y, r1, memory_order_relaxed);

T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
    atomic_store_explicit(&x, r2, memory_order_relaxed);

Suddenly we have an explicit constant 42 showing up. Of course, if the compiler carefully avoided speculative stores (as both Peter and I believe that it should if its code generation is to be regarded as anything other than an act of vandalism, the words in the standard notwithstanding), there would be no problem. But currently, a number of compiler writers see absolutely nothing wrong with transforming the optimized-for-42 version above into something like this:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(&x, memory_order_relaxed);
    atomic_store_explicit(&y, 42, memory_order_relaxed);
    if (r1 != 42)
        atomic_store_explicit(&y, r1, memory_order_relaxed);

T2: r2 = atomic_load_explicit(&y, memory_order_relaxed);
    atomic_store_explicit(&x, r2, memory_order_relaxed);

Intuitively, this is wrong because this lets the program take a step the abstract machine wouldn't do. This is different from the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable. For this to be correct, the compiler would actually have to prove that the speculative store is as-if correct, which in turn would mean that it needs to be aware of all potential observers, and check whether those observers aren't actually affected by the speculative store.
I would guess that the compilers you have in mind don't really do that. If they do, then I don't see why this should be okay, unless you think out-of-thin-air values are something good (which I wouldn't agree with). OK, we agree that pulling the atomic store to "y" out of its "if" statement is a bad thing. Very good! Now we just have to convince others on the committee. ;-) Thanx, Paul