Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics
On Fri, Jul 07, 2023 at 10:04:06AM -0400, Olivier Dion wrote:
> On Tue, 04 Jul 2023, Peter Zijlstra wrote:
> > On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
> [...]
> >> On x86-64 (gcc 13.1 -O2) we get:
> >>
> >>   t0():
> >>           movl    $1, x(%rip)
> >>           movl    $1, %eax
> >>           xchgl   dummy(%rip), %eax
> >>           lock orq $0, (%rsp)     ;; Redundant with previous exchange.
> >>           movl    y(%rip), %eax
> >>           movl    %eax, r0(%rip)
> >>           ret
> >>   t1():
> >>           movl    $1, y(%rip)
> >>           lock orq $0, (%rsp)
> >>           movl    x(%rip), %eax
> >>           movl    %eax, r1(%rip)
> >>           ret
> >
> > So I would expect the compilers to do better here. It should know those
> > __atomic_thread_fence() thingies are superfluous and simply not emit
> > them. This could even be done as a peephole pass later, where it sees
> > consecutive atomic ops and the second being a no-op.
>
> Indeed, a peephole optimization could work for this Dekker, if the
> compiler adds the pattern for it. However, AFAIK, a peephole cannot be
> applied when the two fences are in different basic blocks. For example,
> only emitting a fence on a compare_exchange success. This limitation
> implies that the optimization cannot be done across functions/modules
> (shared libraries).

LTO FTW :-)

> For example, it would be interesting to be able to
> promote an acquire fence of a pthread_mutex_lock() to a full fence on
> weakly ordered architectures while preventing a redundant fence on
> strongly ordered architectures.

That's a very non-trivial thing to do. I know Linux has
smp_mb__after_spinlock() and that x86 has it as a no-op, but even on x86
adding a full fence after a lock has observable differences IIRC.

Specifically, the actual store that acquires the lock is not well
ordered vs the critical section itself for non-trivial spinlock
implementations (notably qspinlock).

For RCU you mostly care about RCsc locks (IIRC), and upgrading unlock is
a 'simpler' (IMO) approach to achieve that (which is what RCU does with
smp_mb_after_unlock_lock()).
> We know that at least Clang has such peephole optimizations for some
> architecture backends. It seems however that they do not recognize
> lock-prefixed instructions as fences.

They seem confused in general for emitting MFENCE.

> AFAIK, GCC does not have that kind of optimization.
>
> We are also aware that some research has been done on this topic [0].
> The idea is to use PRE for elimination of redundant fences. This would
> work across multiple basic blocks, although the paper focuses on
> intra-procedural eliminations. However, it seems that the latest work
> on that [1] has never been completed [2].
>
> Our proposed approach provides a means for the user to express -- and
> document -- the wanted semantics in the source code. This allows the
> compiler to only emit wanted fences, therefore not relying on
> architecture-specific backend optimizations. In other words, this
> applies even on unoptimized binaries.

I'm not a tools person, but if I were, I'd be very hesitant to add
__builtin functions that 'conflict'/'overlap' with what an optimizer
should be able to do. Either way around you need work done on the
compilers, and I'm thinking 'fixing' the optimizer will benefit far
more people than adding __builtins.

Then again, I'm not a tools person, so you don't need to convince me.
But one of the selling points of the whole Atomics as a language
feature was that whole optimizer angle. Otherwise you might as well do
as we do, inline asm the world.

I'll shut up now, thanks for that PRE reference [0], that seems a fun
read for when I'm bored.
Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics
On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
> int x = 0;
> int y = 0;
> int r0, r1;
>
> int dummy;
>
> void t0(void)
> {
>         __atomic_store_n(&x, 1, __ATOMIC_RELAXED);
>
>         __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
>         __atomic_thread_fence(__ATOMIC_SEQ_CST);
>
>         r0 = __atomic_load_n(&y, __ATOMIC_RELAXED);
> }
>
> void t1(void)
> {
>         __atomic_store_n(&y, 1, __ATOMIC_RELAXED);
>         __atomic_thread_fence(__ATOMIC_SEQ_CST);
>         r1 = __atomic_load_n(&x, __ATOMIC_RELAXED);
> }
>
> // BUG_ON(r0 == 0 && r1 == 0)
>
> On x86-64 (gcc 13.1 -O2) we get:
>
>   t0():
>           movl    $1, x(%rip)
>           movl    $1, %eax
>           xchgl   dummy(%rip), %eax
>           lock orq $0, (%rsp)     ;; Redundant with previous exchange.
>           movl    y(%rip), %eax
>           movl    %eax, r0(%rip)
>           ret
>   t1():
>           movl    $1, y(%rip)
>           lock orq $0, (%rsp)
>           movl    x(%rip), %eax
>           movl    %eax, r1(%rip)
>           ret

So I would expect the compilers to do better here. It should know those
__atomic_thread_fence() thingies are superfluous and simply not emit
them. This could even be done as a peephole pass later, where it sees
consecutive atomic ops and the second being a no-op.

> On x86-64 (clang 16 -O2) we get:
>
>   t0():
>           movl    $1, x(%rip)
>           movl    $1, %eax
>           xchgl   %eax, dummy(%rip)
>           mfence                  ;; Redundant with previous exchange.

And that's just terrible :/ Nobody should be using MFENCE for this. And
using MFENCE after a LOCK-prefixed instruction (implicit in this case)
is just fail, because I don't think C++ atomics cover MMIO and other
such 'lovely' things.

>           movl    y(%rip), %eax
>           movl    %eax, r0(%rip)
>           retq
>   t1():
>           movl    $1, y(%rip)
>           mfence
>           movl    x(%rip), %eax
>           movl    %eax, r1(%rip)
>           retq
Re: GCC 12 miscompilation of volatile asm (was: Re: [PATCH] arm64/io: Remind compiler that there is a memory side effect)
On Tue, Apr 05, 2022 at 01:51:30PM +0100, Mark Rutland wrote:
> Hi all,
>
> [adding kernel folk who work on asm stuff]
>
> As a heads-up, GCC 12 (not yet released) appears to erroneously optimize away
> calls to functions with volatile asm. Szabolcs has raised an issue on the GCC
> bugzilla:
>
>   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105160
>
> ... which is a P1 release blocker, and is currently being investigated.
>
> Jeremy originally reported this as an issue with {readl,writel}_relaxed(), but
> the underlying problem doesn't have anything to do with those specifically.
>
> I'm dumping a bunch of info here largely for posterity / archival, and to find
> out who (from the kernel side) is willing and able to test proposed compiler
> fixes, once those are available.
>
> I'm happy to do so for aarch64; Peter, I assume you'd be happy to look at the
> x86 side?

Sure..
Re: Re: typeof and operands in named address spaces
On Mon, Nov 16, 2020 at 12:23:17PM +, Uecker, Martin wrote: > > > > Another way to drop qualifiers is using a cast. So you > > > > can use typeof twice: > > > > > > > > typeof((typeof(_var))_var) tmp__; > > > > > > > > This also works for non-scalars but this is a GCC extension. > > (That casts drop qualifiers is standard C. The extensions > are 'typeof' and that casts can be used for non-scalar types.) Ah, I'll clarify. Thanks!
Re: Re: typeof and operands in named address spaces
On Mon, Nov 16, 2020 at 12:10:56PM +0100, Peter Zijlstra wrote: > > > Another way to drop qualifiers is using a cast. So you > > > can use typeof twice: > > > > > > typeof((typeof(_var))_var) tmp__; > > > > > > This also works for non-scalars but this is a GCC extension. FWIW, clang seems to support this extension as well..
Re: Re: typeof and operands in named address spaces
( restoring at least linux-toolcha...@vger.kernel.org, since that seems
to have gone missing )

On Mon, Nov 16, 2020 at 10:11:50AM +0100, Richard Biener wrote:
> On Sun, Nov 15, 2020 at 11:53 AM Uecker, Martin wrote:
> > > On Wed, Nov 04, 2020 at 07:31:42PM +0100, Uros Bizjak wrote:
> > > > Hello!
> > > >
> > > > I was looking at the recent linux patch series [1] where segment
> > > > qualifiers (named address spaces) were introduced to handle percpu
> > > > variables. In the patch [2], the author mentions that:
> > > >
> > > > --q--
> > > > Unfortunately, gcc does not provide a way to remove segment
> > > > qualifiers, which is needed to use typeof() to create local instances
> > > > of the per-cpu variable. For this reason, do not use the segment
> > > > qualifier for per-cpu variables, and do casting using the segment
> > > > qualifier instead.
> > > > --/q--
> > >
> > > C in general does not provide means to strip qualifiers. We recently had
> > > a _lot_ of 'fun' trying to strip volatile from a type, see here:
> > >
> > >   https://lore.kernel.org/lkml/875zimp0ay@mpe.ellerman.id.au
> > >
> > > which resulted in the current __unqual_scalar_typeof() hack.
> > >
> > > If we're going to do compiler extensions here, can we pretty please have
> > > a sane means of modifying qualifiers in general?
> >
> > Another way to drop qualifiers is using a cast. So you
> > can use typeof twice:
> >
> >   typeof((typeof(_var))_var) tmp__;
> >
> > This also works for non-scalars but this is a GCC extension.
> >
> > WG14 plans to standardize typeof. I would like to hear opinions
> > whether we should have typeof drop qualifiers or not.
> >
> > Currently, it does not do this on all compilers I tested
> > (except _Atomic on GCC) and there are also use cases for
> > keeping qualifiers. This is an argument for keeping qualifiers
> > should we standardize it, but then we need a way to drop
> > qualifiers.
> >
> > lvalue conversion drops qualifiers in C.
> > In GCC, this is not
> > implemented correctly as it is unobservable in standard C
> > (but it is observable using typeof).
> >
> >   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97702
> >
> > I have a working patch in preparation to change this. Then you
> > could use
> >
> >   typeof( ((void)0, x) )

Neat, that actually already works with clang. And I suppose we can use
the above GCC extension until such time as that GCC is fixed. See
below..

> > to drop qualifiers. But this would then
> > also do array-to-pointer conversion. I am not sure
> > whether this is a problem.

I don't _think_ so, but..

> > Of course, we could also introduce a new feature for
> > dropping qualifiers. Thoughts?
>
> Just add a new qualifier that un-qualifies?
>
>   _Unqual volatile T x;
>
> is T with volatile (eventually) removed. Or just a way to drop
> all using _Unqual?
>
>   _Unqual T x;
>
> removing all qualifiers from T. Or add a special _Unqual_all
> to achieve that. I think removing a specific qualification is
> useful. Leaves cases like
>
>   _Unqual volatile volatile T x;
>
> to be specified (that is ordering and cancellation of the
> unqual and qual variants of qualifiers).

I rather like this, however I think I'd prefer the syntax be something
like:

  _Unqual T x;

for removing all qualifiers, and:

  _Unqual(volatile) volatile T X;

for stripping specific qualifiers. The syntax as proposed above seems
very error prone to me.

---
Subject: compiler: Improve __unqual_typeof()

Improve our __unqual_scalar_typeof() implementation by relying on C
dropping qualifiers for lvalue conversions. There is one small catch in
that GCC is currently known broken in this respect, however it happens
to have a C language extension that achieves the very same, it drops
qualifiers on casts.

This gets rid of the _Generic() usage and should improve compile times
(less preprocessor output) as well as increase the capabilities of the
macros.
XXX: I've only verified the below actually compiles, I've not verified
the generated code is actually 'correct'.

Suggested-by: "Uecker, Martin"
Signed-off-by: Peter Zijlstra (Intel)
---
diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index 74c6c0486eed..3c5cb52c12f9 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -156,3 +156,11 @@
 #else
 #define __diag_GCC_8(s)
 #endif
+
+/*
+ * GCC has a bug where lvalue conversion
Re: typeof and operands in named address spaces
On Tue, Nov 10, 2020 at 10:42:58AM -0800, Nick Desaulniers wrote:
> When I think of qualifiers, I think of const and volatile. I'm not
> sure why the first post I'm cc'ed on talks about "segment" qualifiers.
> Maybe it's in reference to a variable attribute that the kernel
> defines? Looking at Clang's Qualifier class, I see const, volatile,
> restrict (ah, right), some Objective-C stuff, and address space
> (TR18037 is referenced, I haven't looked up what that is) though maybe
> "segment" pseudo qualifiers the kernel defines expand to address space
> variable attributes?

Right, x86 Named Address Space:

  https://gcc.gnu.org/onlinedocs/gcc-10.2.0/gcc/Named-Address-Spaces.html#Named-Address-Spaces

Also, Google found me this:

  https://reviews.llvm.org/D64676

The basic problem seems to be they act exactly like qualifiers in that
typeof() preserves them, so if you have:

( and now I realize the parent isn't Cc'd to LKML, find here:
  https://gcc.gnu.org/pipermail/gcc/2020-November/234119.html )

> --cut here--
> #define foo(_var)                                       \
> ({                                                      \
>         typeof(_var) tmp__;                             \
>         asm ("mov %1, %0" : "=r"(tmp__) : "m"(_var));   \
>         tmp__;                                          \
> })
>
> __seg_fs int x;
>
> int test (void)
> {
>   int y;
>
>   y = foo (x);
>   return y;
> }
> --cut here--
>
> when compiled with -O2 for x86 target, the compiler reports:
>
> pcpu.c: In function 'test':
> pcpu.c:14:3: error: '__seg_fs' specified for auto variable 'tmp__'

> Maybe stripping all qualifiers is fine since you can add them back in
> if necessary?

So far that seems sufficient. Although the Devil's advocate in me is
trying to construct a case where we need to preserve const but strip
volatile, and that then means we need to detect if the original has
const or not, because unconditionally adding it will be wrong.
Re: typeof and operands in named address spaces
On Mon, Nov 09, 2020 at 11:50:15AM -0800, Nick Desaulniers wrote:
> On Mon, Nov 9, 2020 at 11:46 AM Segher Boessenkool wrote:
> >
> > On Mon, Nov 09, 2020 at 01:47:13PM +0100, Peter Zijlstra wrote:
> > >
> > > + lots of people and linux-toolchains
> > >
> > > On Wed, Nov 04, 2020 at 07:31:42PM +0100, Uros Bizjak wrote:
> > > > Hello!
> > > >
> > > > I was looking at the recent linux patch series [1] where segment
> > > > qualifiers (named address spaces) were introduced to handle percpu
> > > > variables. In the patch [2], the author mentions that:
> > > >
> > > > --q--
> > > > Unfortunately, gcc does not provide a way to remove segment
> > > > qualifiers, which is needed to use typeof() to create local instances
> > > > of the per-cpu variable. For this reason, do not use the segment
> > > > qualifier for per-cpu variables, and do casting using the segment
> > > > qualifier instead.
> > > > --/q--
> > >
> > > C in general does not provide means to strip qualifiers.
> >
> > Most ways you can try to use the result are undefined behaviour, even.
>
> Yes, removing `const` from a `const` declared variable (via cast) then
> expecting to use the result is a great way to have clang omit the use
> from the final program. This has bitten us in the past getting MIPS
> support up and running, and one of the MTK gfx drivers.

Stripping const to declare another variable is useful though.

Sure C has sharp edges, esp. if you cast stuff, but since when did that
stop anyone ;-)

The point is, C++ has these very nice template helpers that can strip
qualifiers, I want that too, for much of the same reasons. We might not
have templates :-(, but we've become very creative with our
pre-processor. Surely our __unqual_scalar_typeof() cries for a better
solution.
Re: typeof and operands in named address spaces
On Mon, Nov 09, 2020 at 01:38:51PM -0600, Segher Boessenkool wrote:
> On Mon, Nov 09, 2020 at 01:47:13PM +0100, Peter Zijlstra wrote:
> >
> > + lots of people and linux-toolchains
> >
> > On Wed, Nov 04, 2020 at 07:31:42PM +0100, Uros Bizjak wrote:
> > > Hello!
> > >
> > > I was looking at the recent linux patch series [1] where segment
> > > qualifiers (named address spaces) were introduced to handle percpu
> > > variables. In the patch [2], the author mentions that:
> > >
> > > --q--
> > > Unfortunately, gcc does not provide a way to remove segment
> > > qualifiers, which is needed to use typeof() to create local instances
> > > of the per-cpu variable. For this reason, do not use the segment
> > > qualifier for per-cpu variables, and do casting using the segment
> > > qualifier instead.
> > > --/q--
> >
> > C in general does not provide means to strip qualifiers.
>
> Most ways you can try to use the result are undefined behaviour, even.
>
> > We recently had
> > a _lot_ of 'fun' trying to strip volatile from a type, see here:
> >
> >   https://lore.kernel.org/lkml/875zimp0ay@mpe.ellerman.id.au
> >
> > which resulted in the current __unqual_scalar_typeof() hack.
> >
> > If we're going to do compiler extensions here, can we pretty please have
> > a sane means of modifying qualifiers in general?
>
> What do you want to do with it? It may be more feasible to do a
> compiler extension for *that*.

Like with the parent use-case it's pretty much always declaring
temporaries in macros. We don't want the temporaries to be volatile,
or as the parent post points out, to have a segment qualifier.
Re: typeof and operands in named address spaces
+ lots of people and linux-toolchains

On Wed, Nov 04, 2020 at 07:31:42PM +0100, Uros Bizjak wrote:
> Hello!
>
> I was looking at the recent linux patch series [1] where segment
> qualifiers (named address spaces) were introduced to handle percpu
> variables. In the patch [2], the author mentions that:
>
> --q--
> Unfortunately, gcc does not provide a way to remove segment
> qualifiers, which is needed to use typeof() to create local instances
> of the per-cpu variable. For this reason, do not use the segment
> qualifier for per-cpu variables, and do casting using the segment
> qualifier instead.
> --/q--

C in general does not provide means to strip qualifiers. We recently had
a _lot_ of 'fun' trying to strip volatile from a type, see here:

  https://lore.kernel.org/lkml/875zimp0ay@mpe.ellerman.id.au

which resulted in the current __unqual_scalar_typeof() hack.

If we're going to do compiler extensions here, can we pretty please have
a sane means of modifying qualifiers in general?
Re: Broken check rejecting -fcf-protection and -mindirect-branch=thunk-extern
On Tue, Apr 28, 2020 at 02:41:33PM +0100, Andrew Cooper wrote:
> It's fine to focus on userspace first, but the kernel is far more simple.
>
> Looking at that presentation, the only thing missing for kernel is the
> notrack thunks, in the unlikely case that such code would be tolerated
> (Frankly, I don't expect Xen or Linux to run with notrack enabled, as
> there is no legacy code to be concerned with).

Uhhh.. ftrace and kretprobes play dodgy games with the return stack,
doesn't that make the CET thing slightly more interesting?
Re: [PATCH] tell gcc optimizer to never introduce new data races
On Tue, Jun 10, 2014 at 05:04:55PM +0200, Marek Polacek wrote:
> On Tue, Jun 10, 2014 at 04:53:27PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 10, 2014 at 03:23:36PM +0200, Jiri Kosina wrote:
> > > +# Tell gcc to never replace conditional load with a non-conditional one
> > > +KBUILD_CFLAGS   += $(call cc-option,--param allow-store-data-races=0)
> > > +
> >
> > Why do we not want: -fmemory-model=safe? And should we not at the very
> > least also disable packed-store-data-races?
>
> Note that the option does not exist, even though it is mentioned in the
> documentation.

Urgh.. ok. Any word on the packed-store-data thing?
Re: [PATCH] tell gcc optimizer to never introduce new data races
On Tue, Jun 10, 2014 at 03:23:36PM +0200, Jiri Kosina wrote:
> +# Tell gcc to never replace conditional load with a non-conditional one
> +KBUILD_CFLAGS   += $(call cc-option,--param allow-store-data-races=0)
> +

Why do we not want: -fmemory-model=safe? And should we not at the very
least also disable packed-store-data-races?
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 19, 2014 at 12:07:02PM +0100, Torvald Riegel wrote:
> > It's not only hardware; also the kernel/user boundary has this same
> > problem. We cannot a priori say what userspace will do; in fact, because
> > we're a general purpose OS, we must assume it will willfully try its
> > bestest to wreck whatever assumptions we make about its behaviour.
>
> That's a good note, and I think a distinct case from those below,
> because here you're saying that you can't assume that the userspace code
> follows the C11 semantics ...

Right; we can malfunction in those cases though; as long as the
malfunctioning happens on the userspace side. That is, whatever
userspace does should not cause the kernel to crash, but userspace
crashing itself, or getting crap data or whatever, is its own damn
fault for not following expected behaviour.

To stay on topic; if the kernel/user interface requires memory ordering
and userspace explicitly omits the barriers, all malfunctioning should
be on the user. For instance it might lose a forward progress guarantee
or data integrity guarantees.

In specific, given a kernel/user lockless producer/consumer buffer, if
the user side allows the tail write to happen before its data reads are
complete, the kernel might overwrite the data it's still reading.

Or in case of futexes, if the user side doesn't use the appropriate
operations, its lock state gets corrupt, but only userspace should
suffer.

But yes, this does require some care and consideration from our side.
Re: [RFC][PATCH 0/5] arch: atomic rework
> > 4. Some drivers allow user-mode code to mmap() some of their
> > state. Any changes undertaken by the user-mode code would
> > be invisible to the compiler.
>
> A good point, but a compiler that doesn't try to (incorrectly) assume
> something about the semantics of mmap will simply see that the mmap'ed
> data will escape to stuff it can't analyze, so it will not be able to
> make a proof.
>
> This is different from, for example, malloc(), which is guaranteed to
> return "fresh" nonaliasing memory.

The kernel side of this is different.. it looks like 'normal' memory,
we just happen to allow it to end up in userspace too.

But on that point; how do you tell the compiler the difference between
malloc() and mmap()? Is that some function attribute?
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> Yes, I do. But that seems to be "volatile" territory. It crosses the
> boundaries of the abstract machine, and thus is input/output. Which
> fraction of your atomic accesses can read values produced by hardware?
> I would still suppose that lots of synchronization is not affected by
> this.

It's not only hardware; also the kernel/user boundary has this same
problem. We cannot a priori say what userspace will do; in fact,
because we're a general purpose OS, we must assume it will willfully
try its bestest to wreck whatever assumptions we make about its
behaviour.

We also have loadable modules -- much like regular userspace DSOs -- so
there too we cannot say what will or will not happen.

We also have JITs that generate code on demand.

And I'm absolutely sure (with the exception of the JITs, it's not an
area I've worked on) that we have atomic usage across all those
boundaries.

I must agree with Linus, global state driven optimizations are crack
brained; esp. for atomics. We simply cannot know all state at compile
time. The best we can hope for are local optimizations.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> Well, that's how atomics that aren't volatile are defined in the
> standard. I can see that you want something else too, but that doesn't
> mean that the other thing is broken.

Well, that other thing depends on being able to see the entire program
at compile time. PaulMck already listed various ways in which this is
not feasible even for normal userspace code.

In particular; DSOs and JITs were mentioned.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Tue, Feb 18, 2014 at 12:12:06PM +, Peter Sewell wrote:
> Several of you have said that the standard and compiler should not
> permit speculative writes of atomics, or (effectively) that the
> compiler should preserve dependencies.

The example below only deals with control dependencies; so I'll limit
myself to that.

> In simple examples it's easy
> to see what that means, but in general it's not so clear what the
> language should guarantee, because dependencies may go via non-atomic
> code in other compilation units, and we have to consider the extent to
> which it's desirable to limit optimisation there.
>
> For example, suppose we have, in one compilation unit:
>
>   void f(int ra, int *rb) {
>     if (ra == 42)
>       *rb = 42;
>     else
>       *rb = 42;
>   }
>
> and in another compilation unit the bodies of two threads:
>
>   // Thread 0
>   r1 = x;
>   f(r1, &r2);
>   y = r2;
>
>   // Thread 1
>   r3 = y;
>   f(r3, &r4);
>   x = r4;
>
> where accesses to x and y are annotated C11 atomic
> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0.

So I'm intuitively ok with this, however I would expect something like:

  void f(_Atomic int ra, _Atomic int *rb);

to preserve dependencies and not make the conditional go away, simply
because in that case, in:

  if (ra == 42)

the 'ra' usage can be seen as an atomic load.

> So as far as we can see, either:
>
> 1) if you can accept the latter behaviour (if the Linux codebase does
>    not rely on its absence), the language definition should permit it,
>    and current compiler optimisations can be used,

Currently there's exactly 1 site in the Linux kernel that relies on
control dependencies as far as I know -- the one I put in. And it's
limited to a single function, so no cross translation unit funnies
there.

Of course, nobody is going to tell me when or where they'll put in the
next one; since it's now documented as accepted practice.
However, PaulMck and our RCU usage very much do cross all sorts of TU boundaries; but those are data dependencies. ~ Peter
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote: > That depends on what your goal is. A compiler that we don't need to fight in order to generate sane code would be nice. But as Linus said; we can continue to ignore you lot and go on as we've done.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
> You need volatile semantics to force the compiler to ignore any proofs
> it might otherwise attempt to construct. Hence all the ACCESS_ONCE()
> calls in my email to Torvald. (Hopefully I translated your example
> reasonably.)

My brain gave out for today; but it did appear to have the right
structure.

I would prefer it if C11 did not require the volatile casts. It should
simply _never_ speculate with atomic writes, volatile or not.
Re: [RFC][PATCH 0/5] arch: atomic rework
> I don't know the specifics of your example, but from how I understand
> it, I don't see a problem if the compiler can prove that the store will
> always happen.
>
> To be more specific, if the compiler can prove that the store will
> happen anyway, and the region of code can be assumed to always run
> atomically (e.g., there's no loop or such in there), then it is known
> that we have one atomic region of code that will always perform the
> store, so we might as well do the stuff in the region in some order.
>
> Now, if any of the memory accesses are atomic, then the whole region of
> code containing those accesses is often not atomic because other threads
> might observe intermediate results in a data-race-free way.
>
> (I know that this isn't a very precise formulation, but I hope it brings
> my line of reasoning across.)

So given something like:

  if (x)
    y = 3;

assuming both x and y are atomic (so don't gimme crap for not knowing
the C11 atomic incantations); and you can prove x is always true; you
don't see a problem with not emitting the conditional?

Avoiding the conditional changes the result; see that control
dependency email from earlier. In the above example the load of x and
the store to y are strictly ordered, due to control dependencies. Not
emitting the condition and maybe not even emitting the load completely
wrecks this.

It's therefore an invalid optimization to take out the conditional or
speculate the store, since it takes out the dependency.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 11:49:29AM +, Will Deacon wrote:
> On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote:
> > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote:
> > > As near as I can tell, compiler writers hate the idea of prohibiting
> > > speculative-store optimizations because it requires them to introduce
> > > both control and data dependency tracking into their compilers. Many of
> > > them seem to hate dependency tracking with a purple passion. At least,
> > > such a hatred would go a long way towards explaining the incomplete
> > > and high-overhead implementations of memory_order_consume, the long
> > > and successful use of idioms based on the memory_order_consume pattern
> > > notwithstanding [*]. ;-)
> >
> > Just tell them that because the hardware provides control dependencies
> > we actually use and rely on them.
>
> s/control/address/ ?

Nope, control.

Since stores cannot be speculated and thus require linear control flow
history we can use it to order LOAD -> STORE when the LOAD is required
for the control flow decision and the STORE depends on the control flow
path.

Also see commit 18c03c61444a211237f3d4782353cb38dba795df to
Documentation/memory-barriers.txt

---
commit c7f2e3cd6c1f4932ccc4135d050eae3f7c7aef63
Author: Peter Zijlstra
Date:   Mon Nov 25 11:49:10 2013 +0100

    perf: Optimize ring-buffer write by depending on control dependencies

    Remove a full barrier from the ring-buffer write path by relying on
    a control dependency to order a LOAD -> STORE scenario.

    Cc: "Paul E. McKenney"
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-8alv40z6ikk57jzbaobnx...@git.kernel.org
    Signed-off-by: Ingo Molnar

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index e8b168af135b..146a5792b1d2 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -61,19 +61,20 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 	 *
 	 *   kernel				user
 	 *
-	 *   READ ->data_tail			READ ->data_head
-	 *   smp_mb()	(A)			smp_rmb()	(C)
-	 *   WRITE $data			READ $data
-	 *   smp_wmb()	(B)			smp_mb()	(D)
-	 *   STORE ->data_head			WRITE ->data_tail
+	 *   if (LOAD ->data_tail) {		LOAD ->data_head
+	 *			(A)		smp_rmb()	(C)
+	 *	STORE $data			LOAD $data
+	 *	smp_wmb()	(B)		smp_mb()	(D)
+	 *	STORE ->data_head		STORE ->data_tail
+	 *   }
 	 *
 	 * Where A pairs with D, and B pairs with C.
 	 *
-	 * I don't think A needs to be a full barrier because we won't in fact
-	 * write data until we see the store from userspace. So we simply don't
-	 * issue the data WRITE until we observe it. Be conservative for now.
+	 * In our case (A) is a control dependency that separates the load of
+	 * the ->data_tail and the stores of $data. In case ->data_tail
+	 * indicates there is no room in the buffer to store $data we do not.
 	 *
-	 * OTOH, D needs to be a full barrier since it separates the data READ
+	 * D needs to be a full barrier since it separates the data READ
 	 * from the tail WRITE.
 	 *
 	 * For B a WMB is sufficient since it separates two WRITEs, and for C
@@ -81,7 +82,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 	 *
 	 * See perf_output_begin().
 	 */
-	smp_wmb();
+	smp_wmb(); /* B, matches C */
 	rb->user_page->data_head = head;

 	/*
@@ -144,17 +145,26 @@ int perf_output_begin(struct perf_output_handle *handle,
 		if (!rb->overwrite &&
 		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
 			goto fail;
+
+		/*
+		 * The above forms a control dependency barrier separating the
+		 * @tail load above from the data stores below. Since the @tail
+		 * load is required to compute the branch to fail below.
+		 *
+		 * A, matches D; the full memory barrier userspace SHOULD issue
+		 * after reading the data and before storing the new tail
+		 * position.
+		 *
+		 * See perf_output_put_handle().
+		 */
+
 		head += size;
 	} while (local_cmpxchg(&rb->head, offset, head) != offset);

 	/*
-	 * Separate the userpage->tail read from the data stores below.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: > As near as I can tell, compiler writers hate the idea of prohibiting > speculative-store optimizations because it requires them to introduce > both control and data dependency tracking into their compilers. Many of > them seem to hate dependency tracking with a purple passion. At least, > such a hatred would go a long way towards explaining the incomplete > and high-overhead implementations of memory_order_consume, the long > and successful use of idioms based on the memory_order_consume pattern > notwithstanding [*]. ;-) Just tell them that because the hardware provides control dependencies we actually use and rely on them. Not that I expect they care too much what we do, given the current state of things.
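For reference, the kind of control dependency being talked about (hypothetical names; this relies on the hardware guarantee that stores are not speculated, not on anything the C11 standard promises) looks like this:

```cpp
#include <atomic>

std::atomic<int> flag{0};
std::atomic<int> data{0};

// The store to 'data' only executes on the branch that observed the
// value loaded from 'flag'.  Hardware that does not speculate stores
// therefore orders the store after the load, even though both
// accesses are relaxed -- this is the control dependency the kernel
// uses and relies on.
void consumer(void)
{
    if (flag.load(std::memory_order_relaxed))       // load
        data.store(1, std::memory_order_relaxed);   // dependent store
}
```

The complaint above is precisely that a C11/C++11 compiler is not obliged to preserve this load-to-store ordering.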
Re: [RFC][PATCH 0/5] arch: atomic rework
On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote:
> > Initial state: x == y == 0
> >
> > T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
> >     atomic_store_explicit(42, y, memory_order_relaxed);
> >     if (r1 != 42)
> >       atomic_store_explicit(r1, y, memory_order_relaxed);
> >
> > T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> >     atomic_store_explicit(r2, x, memory_order_relaxed);
>
> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do. This is different to the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.

Yeah, my bad for not being familiar with the atrocious crap C11 made of atomics :/
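The litmus test above, transcribed into compilable form (a sketch; scheduling makes several outcomes possible, and the point of the example is that r1 == r2 == 42 would be an out-of-thin-air result):

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

void t1(void)
{
    r1 = x.load(std::memory_order_relaxed);
    y.store(42, std::memory_order_relaxed);
    if (r1 != 42)
        y.store(r1, std::memory_order_relaxed);
}

void t2(void)
{
    r2 = y.load(std::memory_order_relaxed);
    x.store(r2, std::memory_order_relaxed);
}
```

Since T1 stores 42 to y on every path, a compiler could hoist that store above the load of x, at which point T2 can forward 42 into x and T1's load can observe it: the value 42 in r1 justifies itself.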
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 07, 2014 at 05:13:36PM +, Will Deacon wrote:
> Understood, but that doesn't explain why Paul wants to add ISB/isync
> instructions which affect the *CPU* rather than the compiler!

I doubt Paul wants it, but yeah, I'm curious about that proposal as well; sounds like someone took a big toke from the bong again; it seems a favourite pastime amongst committees.
Re: [RFC][PATCH 0/5] arch: atomic rework
On Fri, Feb 07, 2014 at 04:55:48PM +, Will Deacon wrote:
> Hi Paul,
>
> On Fri, Feb 07, 2014 at 04:50:28PM +, Paul E. McKenney wrote:
> > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote:
> > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> > > > Hopefully some discussion of out-of-thin-air values as well.
> > >
> > > Yes, absolutely shoot store speculation in the head already. Then drive
> > > a wooden stake through its heart.
> > >
> > > C11/C++11 should not be allowed to claim itself a memory model until that
> > > is sorted.
> >
> > There actually is a proposal being put forward, but it might not make ARM
> > and Power people happy because it involves adding a compare, a branch,
> > and an ISB/isync after every relaxed load... Me, I agree with you,
> > much preferring the no-store-speculation approach.
>
> Can you elaborate a bit on this please? We don't permit speculative stores
> in the ARM architecture, so it seems counter-intuitive that GCC needs to
> emit any additional instructions to prevent that from happening.
>
> Stores can, of course, be observed out-of-order but that's a lot more
> reasonable :)

This is more about the compiler speculating on stores; imagine:

	if (x)
		y = 1;
	else
		y = 2;

The compiler is allowed to change that into:

	y = 2;
	if (x)
		y = 1;

Which is of course a big problem when you want to rely on the ordering.

There are further problems where things like memset() can write outside the specified address range. Examples are memset() using single instructions to wipe entire cachelines and then 'restoring' the tail bit. While valid for single-threaded code, it's a complete disaster for concurrent code.

There's more, but it all boils down to doing stores you don't expect in a 'sane' concurrent environment and/or that don't respect the control flow.
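A sketch of how C11/C++11 atomics are supposed to rule that transformation out (names hypothetical): once y is an atomic object, the compiler may no longer invent the unconditional 'y = 2' store, because the transient value would be observable by another thread:

```cpp
#include <atomic>

std::atomic<int> y{0};

// The compiler must emit exactly one store, on the branch actually
// taken; speculating 'y = 2' ahead of the test would create a value
// that no execution of the abstract machine produces.
void set_y(bool x)
{
    if (x)
        y.store(1, std::memory_order_relaxed);
    else
        y.store(2, std::memory_order_relaxed);
}
```

With a plain (non-atomic) int the speculative form is a legal transformation precisely because concurrent observation would already be a data race.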
Re: [RFC][PATCH 0/5] arch: atomic rework
On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> Hopefully some discussion of out-of-thin-air values as well.

Yes, absolutely shoot store speculation in the head already. Then drive a wooden stake through its heart.

C11/C++11 should not be allowed to claim itself a memory model until that is sorted.
Re: [RFC] gcc feature request: Moving blocks into sections
On Mon, Aug 12, 2013 at 10:47:37AM -0700, H. Peter Anvin wrote:
> On 08/12/2013 09:09 AM, Peter Zijlstra wrote:
> >>
> >> On the majority of architectures, including x86, you cannot simply copy
> >> a piece of code elsewhere and have it still work.
> >
> > I thought we used -fPIC which would allow just that.
>
> Doubly wrong. The kernel is not compiled with -fPIC, nor does -fPIC
> allow this kind of movement for code that contains intramodule
> references (that is *all* references in the kernel). Since we really
> don't want to burden the kernel with a GOT and a PLT, that is life.

OK, never mind then..
Re: [RFC] gcc feature request: Moving blocks into sections
On Mon, Aug 12, 2013 at 09:02:02AM -0700, Andi Kleen wrote:
> "H. Peter Anvin" writes:
> > However, I would really like to
> > understand what the value is.
>
> Probably very little. When I last looked at it, the main overhead in
> perf currently seems to be backtraces and the ring buffer, not this
> code.

Backtraces do indeed blow and make pretty much everything else irrelevant, but when not using them the branch forest was significant when I last looked at it.
Re: [RFC] gcc feature request: Moving blocks into sections
On Mon, Aug 12, 2013 at 07:56:10AM -0700, H. Peter Anvin wrote:
> On 08/12/2013 02:17 AM, Peter Zijlstra wrote:
> >
> > I've been wanting to 'abuse' static_key/asm-goto to sort-of JIT
> > if-forest functions like perf_prepare_sample() and perf_output_sample().
> >
> > They are of the form:
> >
> >   void func(obj, args..)
> >   {
> >     unsigned long f = ...;
> >
> >     if (f & F1)
> >       do_f1();
> >
> >     if (f & F2)
> >       do_f2();
> >
> >     ...
> >
> >     if (f & FN)
> >       do_fn();
> >   }
>
> Am I reading this right that f can be a combination of any of these?

Correct.

> > Where f is constant for the entire lifetime of the particular object.
> >
> > So I was thinking of having these functions use static_key/asm-goto;
> > then write the proper static key values unsafe so as to avoid all
> > trickery (as these functions would never actually be used) and copy the
> > end result into object private memory. The object will then use indirect
> > calls into these functions.
>
> I'm really not following what you are proposing here, especially not
> "copy the end result into object private memory."
>
> With asm goto you end up with at minimum a jump or NOP for each of these
> function entries, whereas an actual JIT can elide that as well.
>
> On the majority of architectures, including x86, you cannot simply copy
> a piece of code elsewhere and have it still work.

I thought we used -fPIC which would allow just that.

> You end up doing a
> bunch of the work that a JIT would do anyway, and would end up with
> considerably higher complexity and worse results than a true JIT.

Well, less complexity but worse result, yes. We'd only poke the specific static_branch sites with either NOPs or the (relative) jump target for each of these branches. Then copy the result.

> You
> also say "the object will then use indirect calls into these
> functions"... you mean the JIT or pseudo-JIT generated functions, or the
> calls inside them?

The calls to these pseudo-JIT generated functions.

> > I suppose the question is, do people strenuously object to creativity
> > like that and/or is there something GCC can do to make this
> > easier/better still?
>
> I think it would be much easier to just write a minimal JIT for this,
> even though it is per architecture. However, I would really like to
> understand what the value is.

Removing a lot of the conditionals from the sample path. Depending on the configuration these can be quite expensive.
Re: [RFC] gcc feature request: Moving blocks into sections
On Mon, Aug 05, 2013 at 12:55:15PM -0400, Steven Rostedt wrote:
> [ sent to both Linux kernel mailing list and to gcc list ]

Let me hijack this thread for something related...

I've been wanting to 'abuse' static_key/asm-goto to sort-of JIT if-forest functions like perf_prepare_sample() and perf_output_sample(). They are of the form:

  void func(obj, args..)
  {
    unsigned long f = ...;

    if (f & F1)
      do_f1();

    if (f & F2)
      do_f2();

    ...

    if (f & FN)
      do_fn();
  }

Where f is constant for the entire lifetime of the particular object.

So I was thinking of having these functions use static_key/asm-goto; then write the proper static key values unsafe so as to avoid all trickery (as these functions would never actually be used) and copy the end result into object private memory. The object will then use indirect calls into these functions.

The advantage of using something like this is that it would work for all architectures that now support the asm-goto feature. For arch/gcc combinations that do not, we'd simply revert to the current state of affairs.

I suppose the question is, do people strenuously object to creativity like that and/or is there something GCC can do to make this easier/better still?
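Concretely, the shape being described - with made-up flag names and trivial bodies standing in for the real sample-preparation steps - is a chain of flag tests that are constant per object yet re-evaluated on every call:

```cpp
#include <cstdint>

// Hypothetical stand-in for perf_prepare_sample()-style code: 'f' is
// fixed for the lifetime of the object, but each call re-tests every bit.
enum : uint64_t { F1 = 1u << 0, F2 = 1u << 1, F3 = 1u << 2 };

struct obj {
    uint64_t f;   // constant per object
    int out;      // stands in for the work the do_f*() steps produce
};

static void do_f1(obj &o) { o.out += 1; }
static void do_f2(obj &o) { o.out += 10; }
static void do_f3(obj &o) { o.out += 100; }

void func(obj &o)
{
    if (o.f & F1)
        do_f1(o);
    if (o.f & F2)
        do_f2(o);
    if (o.f & F3)
        do_f3(o);
}
```

The proposal above amounts to patching each `if` site into a NOP or a direct jump once `f` is known, then copying the specialized body into per-object memory so the tests disappear entirely.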
Re: WTF?
On Wed, 2009-11-25 at 20:38 +0100, Richard Guenther wrote: > If you can offer advice on how to teach quilt > (which I belive uses patch) to ignore whitespace changes when > applying patches then more power to you QUILT_PATCH_OPTS="-l"
problems bootstrapping gcc-4.0-20051117 on i386-pc-solaris2.10
Hi,

I'm having a lot of problems bootstrapping this compiler on said target. What I did so far: from the build dir (which I located in the extracted source dir)

  ../configure --srcdir=.. --enable-languages="c c++" --with-gnu-as --with-as=/usr/sfw/bin/gas --without-gnu-ld --with-ld=/usr/ccs/bin/ld

however this fails to select '/usr/sfw/bin/gas' as the assembler; per the configure output it says:

  ...
  checking for i386-pc-solaris2.10-as... no
  checking for as... as
  ...

So I set the environment variable 'AS' to point to gas: export AS='/usr/sfw/bin/gas'; removed the build dir and started over. This gave:

  ...
  checking for i386-pc-solaris2.10-as... /usr/sfw/bin/gas
  ...

after which I continued with:

  make -j5 bootstrap

At that point it starts building, but fails with an assembler error:

  /usr/local/src/gcc-4.0-20051117/gcc/config/i386/gmon-sol2.c:406: warning: control reaches end of non-void function
  /usr/local/src/gcc-4.0-20051117/gcc/config/i386/gmon-sol2.c: At top level:
  /usr/local/src/gcc-4.0-20051117/gcc/config/i386/gmon-sol2.c:58: warning: 'sccsid' defined but not used
  Fixing headers into /usr/local/src/gcc-4.0-20051117/build/gcc/include for i386-pc-solaris2.10 target
  Assembler: "", line 1 : Illegal flag (-)
  make[2]: *** [gmon.o] Error 1
  make[2]: *** Waiting for unfinished jobs

Using truss I find that it is actually using /usr/ccs/bin/as?!? Luckily it stats a few paths before happening on /usr/ccs/bin/as, so what I did is to symlink /usr/sfw/bin/gas to one of the other paths (ln -s /usr/sfw/bin/gas /usr/local/i386-pc-solaris2.10/bin/as) and restart all over again.

This gets me further along, however not quite there. Now it fails to configure libstdc++ with the following error:

  configure: error: unable to detect exception model
  make[1]: *** [configure-target-libstdc++-v3] Error 1
  make[1]: Leaving directory `/usr/local/src/gcc-4.0-20051117/build'
  make: *** [bootstrap] Error 2

I'm sure I'm doing something horribly wrong here, can somebody point me to the way of a working compiler?

Kind regards,

Peter Zijlstra
Re: C++ vs. pthread_cancel
On Mon, 2005-08-15 at 09:53 -0400, Daniel Jacobowitz wrote:
> On Mon, Aug 15, 2005 at 09:51:17AM -0400, Andrew Pinski wrote:
> > > Yes, I'm aware of the list. My question was what the current behaviour
> > > of the various gcc versions is. And if gcc supports the various
> > > workarounds mentioned. Like explicitly configuring the behaviour of the
> > > 'catch (...)' etc..
> >
> > There is none yet because there have been no consensus yet. That is
> > why I mentioned the list. GCC is not going to implement anything
> > until there is a consensus of how to proceed.
>
> Eh, that's obviously incorrect. The feature was implemented back in
> 3.3; the behavior hasn't been _changed_ and won't be until there is
> consensus.
>
> I believe that it still can be caught, must be rethrown, or the
> program will be aborted. Someone who knows better than I may want to
> confirm this.

AFAICT this is indeed correct; I've been reading the libstdc++5 and NPTL sources. However, once it is caught it is impossible to distinguish from other exceptions due to the lack of exception information set in:

  glibc-2.3.5/nptl/unwind.c:__pthread_unwind()

Hence one is left in the situation where both forward and backward are not an option. Nor do I think they (being the company I work for) will allow me to ship patched versions of libpthread.so and libstdc++.so.5.

Too bad, guess I have to redesign the issue.

Peter Zijlstra
Re: C++ vs. pthread_cancel
On Mon, 2005-08-15 at 09:33 -0400, Andrew Pinski wrote:
> >
> > Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > > On this controversial subject, could somebody please - pretty please
> > > with a cherry on top - tell me what the current status is:
> > > - in general,
> > > - as implemented in the 3.4 series and
> > > - as implemented in the 4.0 series.
> > >
> > > At work we're using 3.4 and we have managed to shoot our foot off with
> > > this issue :-(, google gives a lot of hits on the issue but it is a bit
> > > hard to get the current impl. status for 3.4. Which in turn makes it
> > > hard to decide on how to bandage our foot.
> >
> > Could you provide a link to a description of the particular
> > problem? I looked around, and all I could find was
> > https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=111548
> >
> > I suppose the controversial part is that you're using
> > pthread_cancel, which is somewhat frowned upon as
> > inherently unsafe.
>
> There is a whole mailing list about this:
> http://www.codesourcery.com/archives/c++-pthreads/threads.html
>
> This has to be done correctly with the C++ standard and POSIX people and
> the GCC people will be involved but not on the GCC list as it just gets
> in the way.

Yes, I'm aware of the list. My question was what the current behaviour of the various gcc versions is, and whether gcc supports the various workarounds mentioned, like explicitly configuring the behaviour of the 'catch (...)' etc..

Peter Zijlstra
re: C++ vs. pthread_cancel
On Mon, 2005-08-15 at 06:12 -0700, Dan Kegel wrote:
> Peter Zijlstra <[EMAIL PROTECTED]> wrote:
> > On this controversial subject, could somebody please - pretty please
> > with a cherry on top - tell me what the current status is:
> > - in general,
> > - as implemented in the 3.4 series and
> > - as implemented in the 4.0 series.
> >
> > At work we're using 3.4 and we have managed to shoot our foot off with
> > this issue :-(, google gives a lot of hits on the issue but it is a bit
> > hard to get the current impl. status for 3.4. Which in turn makes it
> > hard to decide on how to bandage our foot.
>
> Could you provide a link to a description of the particular
> problem? I looked around, and all I could find was
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=111548
>
> I suppose the controversial part is that you're using
> pthread_cancel, which is somewhat frowned upon as
> inherently unsafe.
> - Dan

Yes, it seems to be that problem. Discussion here:

  http://gcc.gnu.org/ml/gcc/2003-12/msg00743.html

And this list seems dedicated to the problem:

  http://www.codesourcery.com/archives/c++-pthreads/maillist.html

The issue seems to be that pthread_cancel is implemented using forced unwinding, the same mechanism as used for exception handling, and the interaction is ill-defined. The behaviour of gcc-3.4 is that the unclassified exception caused by SIGCANCEL can be caught by the catch-all clause 'catch (...)', and then, when not rethrown, causes an abort.

Peter Zijlstra
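The behaviour being described can be demonstrated with a small sketch (Linux/NPTL assumed; on other runtimes the interaction differs): cancellation unwinds the thread like an exception, lands in 'catch (...)', and must be rethrown - swallowing it aborts the process:

```cpp
#include <pthread.h>

// Worker that spins on a cancellation point.  When pthread_cancel()
// takes effect, the forced unwind enters the catch-all clause below;
// rethrowing lets the unwind complete, so the thread terminates with
// PTHREAD_CANCELED instead of abort()ing the whole program.
extern "C" void *worker(void *)
{
    try {
        for (;;)
            pthread_testcancel();   // explicit cancellation point
    } catch (...) {
        // Swallowing the cancellation exception here would abort().
        throw;                      // mandatory rethrow
    }
    return nullptr;                 // not reached
}
```

A caller would pthread_create() this worker, pthread_cancel() it, and expect pthread_join() to report PTHREAD_CANCELED; the catch-all is only useful for cleanup, never for suppressing the cancellation.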
C++ vs. pthread_cancel
Hi all,

On this controversial subject, could somebody please - pretty please with a cherry on top - tell me what the current status is:
- in general,
- as implemented in the 3.4 series and
- as implemented in the 4.0 series.

At work we're using 3.4 and we have managed to shoot our foot off with this issue :-(, google gives a lot of hits on the issue but it is a bit hard to get the current impl. status for 3.4. Which in turn makes it hard to decide on how to bandage our foot.

Kind regards,

Peter Zijlstra