Re: [RFC] Fix full memory barrier on SPARC-V8
> Linux doesn't ever run the cpu in the RMO memory model any more.  All
> sparc64 chips run only in TSO now.
>
> All of the Niagara chips implement an even stricter than TSO memory
> model, and the membars we used to have all over the kernel to handle
> that properly were just wasted I-cache space.  So I just moved
> unilaterally to TSO everywhere and killed off the membars necessitated
> by RMO.

OK, thanks for the clarification.  That's also fine from GCC's viewpoint.

-- 
Eric Botcazou
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Eric Botcazou
Date: Tue, 28 Jun 2011 23:27:43 +0200

> With the pristine compiler, the test passes with -mcpu=v9 but fails
> otherwise.  It passes with the patched compiler.  However, I suspect
> that we would still have problems with newer UltraSparc CPUs supporting
> full RMO, because the new insn membar_v8 is only half a memory barrier
> for V9.

Linux doesn't ever run the cpu in the RMO memory model any more.  All
sparc64 chips run only in TSO now.

All of the Niagara chips implement an even stricter than TSO memory
model, and the membars we used to have all over the kernel to handle
that properly were just wasted I-cache space.  So I just moved
unilaterally to TSO everywhere and killed off the membars necessitated
by RMO.
Re: [RFC] Fix full memory barrier on SPARC-V8
> Fair enough, you can add this code if you want.

Thanks.  Note that this is marginal for Solaris as GCC defaults to -mcpu=v9
on Solaris but, in all other cases, it defaults to -mcpu=v8.

I can reproduce the problem on the SPARC/Linux machine 'grobluk' of the
CompileFarm:

  cpu          : TI UltraSparc II (BlackBird)
  fpu          : UltraSparc II integrated FPU
  prom         : OBP 3.2.30 2002/10/25 14:03
  type         : sun4u
  ncpus probed : 4
  ncpus active : 4

  Linux grobluk 2.6.26-2-sparc64-smp #1 SMP Thu Nov 5 03:34:29 UTC 2009
  sparc64 GNU/Linux

With the pristine compiler, the test passes with -mcpu=v9 but fails
otherwise.  It passes with the patched compiler.  However, I suspect that
we would still have problems with newer UltraSparc CPUs supporting full
RMO, because the new insn membar_v8 is only half a memory barrier for V9.

-- 
Eric Botcazou
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Eric Botcazou
Date: Tue, 28 Jun 2011 10:11:03 +0200

> The V8 architecture manual is quite clear about it: TSO allows stores
> to be reordered after subsequent loads (it's the only difference in TSO
> with Strong Consistency) so you need to do something to have a full
> memory barrier.  As there is no specific instruction to that effect in
> V8, you need to do what is done for pre-SSE2 x86, i.e. use an atomic
> instruction.

Fair enough, you can add this code if you want.
Re: [RFC] Fix full memory barrier on SPARC-V8
> Let's clarify something, did you run your testcase that triggered this
> bug on a v8 or a v9 machine?

Sun UltraSPARC, so V9 of course.

The point is that Solaris is TSO (TSO as defined for the V9 architecture,
i.e. backward compatible with V8) so you have a V8-compatible TSO
implementation, in particular not a Strong Consistency V8.

It is perfectly valid to compile with -mcpu=v8 on Solaris and expect to
get a working program.  Now if you start to play seriously with
__sync_synchronize, you conclude that it doesn't implement a full memory
barrier with -mcpu=v8.

The V8 architecture manual is quite clear about it: TSO allows stores to
be reordered after subsequent loads (it's the only difference in TSO with
Strong Consistency) so you need to do something to have a full memory
barrier.  As there is no specific instruction to that effect in V8, you
need to do what is done for pre-SSE2 x86, i.e. use an atomic instruction.

-- 
Eric Botcazou
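
To make the "use an atomic instruction" idea concrete, here is a minimal C sketch, not the GCC implementation. It contrasts a compiler-provided full fence with a fence built from an atomic read-modify-write on a dummy byte, which is the shape of both the pre-SSE2 x86 trick and the proposed stbar+ldstub sequence. The names full_fence_builtin, full_fence_rmw and dummy_slot are hypothetical; __atomic_thread_fence and __atomic_exchange_n are GCC builtins (GCC 4.7 and later, also Clang). Whether the read-modify-write really orders earlier stores before later loads is a property of the target; the thread claims it for ldstub/SWAP under V8 TSO.

```c
#include <stdint.h>

/* Hypothetical dummy byte: its value is irrelevant, only the atomic
   access to it matters. */
static uint8_t dummy_slot;

/* Full fence as provided by the compiler; this is what
   __sync_synchronize is expected to amount to. */
void full_fence_builtin(void)
{
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/* Full fence built from an atomic read-modify-write.  Like ldstub, it
   writes 0xff to a byte and returns the previous contents; the caller
   normally ignores the result. */
uint8_t full_fence_rmw(void)
{
    return __atomic_exchange_n(&dummy_slot, 0xff, __ATOMIC_SEQ_CST);
}
```

A caller would place full_fence_rmw() between a store and a later load that must not be satisfied before that store is globally visible.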
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Geert Bosch
Date: Mon, 27 Jun 2011 23:17:18 -0400

>> You then go on to speak about LEON, does LEON implement PSO?
> No, I'm not talking about PSO anywhere or SPARCv9 anywhere.
> Just plain old SPARCv8, using the TSO model.  This requires a
> load-store instruction to guarantee a full memory barrier.
>
> I'm not making this up, that is why I refer to the examples in
> the SPARC v8 architecture manual that specifically state that
> SWAP instructions need to be used instead of store instructions
> to make Dekker's algorithm work.

All v8 processors that I am aware of implement strong consistency, and if
so discussions about TSO are not relevant.  Is LEON an exception?

Let's clarify something, did you run your testcase that triggered this
bug on a v8 or a v9 machine?
Re: [RFC] Fix full memory barrier on SPARC-V8
On Jun 27, 2011, at 22:45, David Miller wrote:

> From: Geert Bosch
> Date: Mon, 27 Jun 2011 22:21:47 -0400
>
>> On Jun 27, 2011, at 19:53, David Miller wrote:
>>
>>> Adding a ldstub here is going to be really expensive, on UltraSparc
>>> that can be 36+ cycles even on a cache hit.
>>
>> Yes, synchronization in multi-CPU systems is expensive.
>> If it's really cheap, you're probably doing something wrong.
>
> First, I fundamentally disagree with this assertion.  The reason
> proper memory barriers exist is so that you don't need nonsense like
> these proposed atomics to get proper memory operation ordering.

Sorry, I see now I phrased this poorly, no offense intended.
We both agree that with TSO there is never a need for any STBAR
instructions on SPARCv8.

The point is that TSO is not sufficient for strong consistency.
The reason for this is the existence of write buffers (see fig 6.1,
or K-1 of the SPARC v8 architecture manual).  In particular, note
the CPU-local bypass from the store buffer.

Two processors both storing a value X in location Y and then reading
from Y might each see their own value.  In the end, one will reach
memory first and the stores will be ordered there.  The load-store
instructions are necessary to ensure the store will be seen by the
memory system before subsequent loads can use them.

The main issue is that SPARC's TSO does not guarantee Store-Load
ordering.  So, only by issuing a SWAP(A) or LDSTUB(A) instruction
can total ordering of all loads and stores be guaranteed.

> A proper membar on your v9 test system is orders of magnitude cheaper
> than this stbar+ldstub business.

That's true, but membar is a SPARC v9 instruction.  The issue
Eric and I are addressing is only about SPARCv8.

> You then go on to speak about LEON, does LEON implement PSO?

No, I'm not talking about PSO anywhere or SPARCv9 anywhere.
Just plain old SPARCv8, using the TSO model.  This requires a
load-store instruction to guarantee a full memory barrier.

I'm not making this up, that is why I refer to the examples in
the SPARC v8 architecture manual that specifically state that
SWAP instructions need to be used instead of store instructions
to make Dekker's algorithm work.

  -Geert
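
The Dekker-style entry protocol referred to above comes down to "store my flag, then load the other flag", and it is only safe if the load cannot pass the store. The C sketch below is a simplified try-enter variant written for illustration; it is not the code from the architecture manual. The names flag, try_enter and leave are hypothetical, and the __atomic builtins assume GCC 4.7 or later (or Clang). The sequentially consistent fence stands in for the store-load barrier that, according to this thread, needs SWAP or LDSTUB on V8 TSO.

```c
#include <stdbool.h>

/* flag[i] != 0 means thread i wants, or holds, the critical section. */
static int flag[2];

/* Try to enter the critical section as thread `self` (0 or 1).
   Returns true on success, false if the other thread is competing. */
bool try_enter(int self)
{
    int other = 1 - self;

    __atomic_store_n(&flag[self], 1, __ATOMIC_RELAXED);   /* store */

    /* Store-load barrier: without it, the load below may be satisfied
       while the store above still sits in the store buffer, and both
       threads can read 0 and enter together. */
    __atomic_thread_fence(__ATOMIC_SEQ_CST);

    if (__atomic_load_n(&flag[other], __ATOMIC_ACQUIRE)) { /* load */
        /* The other thread is in or entering: back off. */
        __atomic_store_n(&flag[self], 0, __ATOMIC_RELEASE);
        return false;
    }
    return true;
}

/* Leave the critical section; the release store keeps the critical
   section's accesses ahead of the flag being cleared. */
void leave(int self)
{
    __atomic_store_n(&flag[self], 0, __ATOMIC_RELEASE);
}
```

A caller would loop on try_enter(self), do its work, and then call leave(self). The back-off policy is deliberately naive, so this variant guarantees mutual exclusion but not progress.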
Re: [RFC] Fix full memory barrier on SPARC-V8
From: David Miller
Date: Mon, 27 Jun 2011 19:45:33 -0700 (PDT)

> You then go on to speak about LEON, does LEON implement PSO?

BTW, even if it does, I would be encouraging the person who submits LEON
kernel patches to not run the chip in this mode.  We don't even use PSO
for v9 chips, it's just not worth the hassle.

So I think even from a v8 LEON perspective this is a non-issue.
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Geert Bosch
Date: Mon, 27 Jun 2011 22:21:47 -0400

> On Jun 27, 2011, at 19:53, David Miller wrote:
>
>> Adding a ldstub here is going to be really expensive, on UltraSparc
>> that can be 36+ cycles even on a cache hit.
>
> Yes, synchronization in multi-CPU systems is expensive.
> If it's really cheap, you're probably doing something wrong.

First, I fundamentally disagree with this assertion.  The reason
proper memory barriers exist is so that you don't need nonsense like
these proposed atomics to get proper memory operation ordering.

A proper membar on your v9 test system is orders of magnitude cheaper
than this stbar+ldstub business.

You then go on to speak about LEON, does LEON implement PSO?
Re: [RFC] Fix full memory barrier on SPARC-V8
On Jun 27, 2011, at 19:53, David Miller wrote:

> I'm trying to find the part of the v8 manual that says there is
> a situation where we should use "stbar" and a "ldstub" to implement
> proper memory barriers.  In particular I'm looking in Appendix J,
> "Programming with the memory models."  Where is the description?

See J.7, and study why the store instructions are replaced by SWAP.

> Adding a ldstub here is going to be really expensive, on UltraSparc
> that can be 36+ cycles even on a cache hit.

Yes, synchronization in multi-CPU systems is expensive.
If it's really cheap, you're probably doing something wrong.

> Also, the more I think about it, the issue really is that one is
> trying to run v8 code on a v9 cpu.

Double no:

  1. No, my primary concern is about v8 code running on multiprocessor
     systems implementing the SPARC v8 architecture (LEON3 in particular).
  2. No, a SPARCv8 compliant binary should run correctly on both
     SPARCv8 and SPARCv9.

The entire raison d'être for the SPARC architecture is so we can
write code based on the architecture, and have it run correctly on
all implementations.

  -Geert
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Geert Bosch
Date: Mon, 27 Jun 2011 19:36:06 -0400

> On Jun 27, 2011, at 19:00, David Miller wrote:
>
>> V8 can only reorder stores, that's why it only has a 'stbar'
>> instruction.  I'm not so sure I agree with trying to paper over the
>> fact that someone has compiled code for v8 that's going to run on a v9
>> cpu.
>
> That's not the issue.  While it is true that all stores will be
> submitted in order, this does not guarantee store-load
> consistency.  In particular on a multiprocessor, each individual
> processor has its own store buffers and cannot see what is in the
> other CPUs' store buffers.  In the end all stores will be committed to
> memory in a sequential order, but that is not sufficient.  The use of
> a load-store instruction is needed to achieve a full barrier.  The
> SPARC architecture manuals describe this in detail.

I'm trying to find the part of the v8 manual that says there is
a situation where we should use "stbar" and a "ldstub" to implement
proper memory barriers.  In particular I'm looking in Appendix J,
"Programming with the memory models."  Where is the description?

Adding a ldstub here is going to be really expensive, on UltraSparc
that can be 36+ cycles even on a cache hit.

Also, the more I think about it, the issue really is that one is
trying to run v8 code on a v9 cpu.  And this is because no v8 cpu ever
implemented anything other than "Strong Consistency", so on a v8 cpu
you would never run into this problem.

I really think the answer in this situation is "compile code for the
actual processor you're targeting, especially if you want features
with processor specific behaviors, such as atomics and memory barriers,
to work properly."
Re: [RFC] Fix full memory barrier on SPARC-V8
On Jun 27, 2011, at 19:00, David Miller wrote:

> V8 can only reorder stores, that's why it only has a 'stbar'
> instruction.  I'm not so sure I agree with trying to paper over the
> fact that someone has compiled code for v8 that's going to run on a v9
> cpu.

That's not the issue.  While it is true that all stores will be
submitted in order, this does not guarantee store-load
consistency.  In particular on a multiprocessor, each individual
processor has its own store buffers and cannot see what is in the
other CPUs' store buffers.  In the end all stores will be committed to
memory in a sequential order, but that is not sufficient.  The use of
a load-store instruction is needed to achieve a full barrier.  The
SPARC architecture manuals describe this in detail.

  -Geert
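
The store-buffer effect described here can be reproduced with a small deterministic model. The following C sketch is a toy simulation under simplifying assumptions, not a model of any real SPARC implementation: each CPU has a private FIFO store buffer, loads check only the CPU's own buffer before falling back to memory, and a "full barrier" is modelled as draining the buffer. All names (struct machine, cpu_store, cpu_load, cpu_drain, both_read_zero) are hypothetical.

```c
#include <stdbool.h>

#define NCPU  2
#define NLOC  2
#define BUFSZ 4

struct store { int addr, val; };

struct machine {
    int mem[NLOC];                   /* shared memory */
    struct store buf[NCPU][BUFSZ];   /* per-CPU FIFO store buffers */
    int nbuf[NCPU];                  /* entries in each buffer */
};

/* Commit a CPU's buffered stores to memory in program order. */
void cpu_drain(struct machine *m, int cpu)
{
    for (int i = 0; i < m->nbuf[cpu]; i++)
        m->mem[m->buf[cpu][i].addr] = m->buf[cpu][i].val;
    m->nbuf[cpu] = 0;
}

/* A store only enters the local buffer; other CPUs cannot see it yet. */
void cpu_store(struct machine *m, int cpu, int addr, int val)
{
    if (m->nbuf[cpu] == BUFSZ)
        cpu_drain(m, cpu);           /* buffer full: make room */
    m->buf[cpu][m->nbuf[cpu]++] = (struct store){ addr, val };
}

/* A load bypasses from the CPU's own buffer (newest entry first),
   otherwise reads shared memory. */
int cpu_load(const struct machine *m, int cpu, int addr)
{
    for (int i = m->nbuf[cpu] - 1; i >= 0; i--)
        if (m->buf[cpu][i].addr == addr)
            return m->buf[cpu][i].val;
    return m->mem[addr];
}

/* CPU0: X = 1; r0 = Y.   CPU1: Y = 1; r1 = X.
   Returns true if both loads observe 0, the outcome that strong
   consistency forbids. */
bool both_read_zero(bool full_barrier)
{
    enum { X, Y };
    struct machine m = { 0 };

    cpu_store(&m, 0, X, 1);
    cpu_store(&m, 1, Y, 1);
    if (full_barrier) {              /* barrier between store and load */
        cpu_drain(&m, 0);
        cpu_drain(&m, 1);
    }
    int r0 = cpu_load(&m, 0, Y);
    int r1 = cpu_load(&m, 1, X);
    return r0 == 0 && r1 == 0;
}
```

In this model both_read_zero(false) returns true, because each store is still in its own CPU's buffer when the other CPU loads, while both_read_zero(true) returns false. The stores are individually in order in both runs; only the store-load ordering differs.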
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Eric Botcazou
Date: Mon, 27 Jun 2011 18:11:10 +0200

> 	* config/sparc/sync.md (*stbar): Delete.
> 	(*membar_v8): New insn to implement UNSPEC_MEMBAR in SPARC-V8.

Code which cares about memory ordering etc. really has to know the kind
of cpu it is running on.  This is why atomic and synchronization
primitives are typically restricted to shared libraries and similar,
where the dynamic linker can vet out what is the correct implementation
on a given piece of hardware.

V8 can only reorder stores, that's why it only has a 'stbar'
instruction.  I'm not so sure I agree with trying to paper over the
fact that someone has compiled code for v8 that's going to run on a v9
cpu.
[RFC] Fix full memory barrier on SPARC-V8
The memory_barrier pattern expands to UNSPEC_MEMBAR on the SPARC and the
latter is implemented differently for V8 and V9:

(define_insn "*stbar"
  [(set (match_operand:BLK 0 "" "")
        (unspec:BLK [(match_dup 0)] UNSPEC_MEMBAR))]
  "TARGET_V8"
  "stbar"
  [(set_attr "type" "multi")])

;; membar #StoreStore | #LoadStore | #StoreLoad | #LoadLoad
(define_insn "*membar"
  [(set (match_operand:BLK 0 "" "")
        (unspec:BLK [(match_dup 0)] UNSPEC_MEMBAR))]
  "TARGET_V9"
  "membar\t15"
  [(set_attr "type" "multi")])

This is surprising because, while membar 0x0F is a full memory barrier for
V9, stbar isn't one for V8.  stbar is only for PSO and a nop in TSO; now
TSO isn't Strong Consistency so there is something missing.

Geert has devised a nice testcase (in Ada) based on Peterson's algorithm
with 4 tasks (threads) and it fails on a 4-CPU Solaris machine with
-mcpu=v8 (Solaris is TSO).  Something like the attached patch is needed to
make it pass.

Now the GCC implementation seems to derive from that of the kernel, which
has:

/* XXX Change this if we ever use a PSO mode kernel. */
#define mb()    __asm__ __volatile__ ("" : : : "memory")

in include/asm-sparc/system.h and

#define mb()    \
        membar_safe("#LoadLoad | #LoadStore | #StoreStore | #StoreLoad")

in include/asm-sparc64/system.h.  So mb() isn't a full memory barrier for
V8 either.


        * config/sparc/sync.md (*stbar): Delete.
        (*membar_v8): New insn to implement UNSPEC_MEMBAR in SPARC-V8.


-- 
Eric Botcazou

Index: config/sparc/sync.md
===================================================================
--- config/sparc/sync.md        (revision 175408)
+++ config/sparc/sync.md        (working copy)
@@ -30,15 +30,20 @@ (define_expand "memory_barrier"
 {
   operands[0] = gen_rtx_MEM (BLKmode, gen_rtx_SCRATCH (Pmode));
   MEM_VOLATILE_P (operands[0]) = 1;
-
 })
 
-(define_insn "*stbar"
+;; In V8, loads are blocking and ordered wrt earlier loads, i.e. every load
+;; is virtually followed by a load barrier (membar #LoadStore | #LoadLoad).
+;; In PSO, stbar orders the stores (membar #StoreStore).
+;; In TSO, ldstub orders the stores wrt subsequent loads (membar #StoreLoad).
+;; The combination of the three yields a full memory barrier in all cases.
+(define_insn "*membar_v8"
   [(set (match_operand:BLK 0 "" "")
         (unspec:BLK [(match_dup 0)] UNSPEC_MEMBAR))]
   "TARGET_V8"
-  "stbar"
-  [(set_attr "type" "multi")])
+  "stbar\n\tldstub\t[%%sp-1], %%g0"
+  [(set_attr "type" "multi")
+   (set_attr "length" "2")])
 
 ;; membar #StoreStore | #LoadStore | #StoreLoad | #LoadLoad
 (define_insn "*membar"
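
The arithmetic behind "membar\t15", and the comment in the patch about which orderings each V8 model already provides, can be written out as a small C sketch. The enum values are the V9 membar mmask bit assignments as I understand them from the V9 manual; the macro and function names (TSO_IMPLICIT, PSO_IMPLICIT, membar_full, membar_missing) are hypothetical and exist only for this illustration.

```c
/* V9 membar ordering bits (mmask field). */
enum membar_mask {
    MEMBAR_LOADLOAD   = 0x1,
    MEMBAR_STORELOAD  = 0x2,
    MEMBAR_LOADSTORE  = 0x4,
    MEMBAR_STORESTORE = 0x8
};

/* Orderings each model guarantees without any explicit barrier,
   following the comment added by the patch. */
#define TSO_IMPLICIT (MEMBAR_LOADLOAD | MEMBAR_LOADSTORE | MEMBAR_STORESTORE)
#define PSO_IMPLICIT (MEMBAR_LOADLOAD | MEMBAR_LOADSTORE)

/* Mask for a full barrier: all four orderings, i.e. 0x0F. */
unsigned membar_full(void)
{
    return MEMBAR_LOADLOAD | MEMBAR_STORELOAD |
           MEMBAR_LOADSTORE | MEMBAR_STORESTORE;
}

/* Orderings that still have to be enforced explicitly under a model
   that already guarantees `implicit`. */
unsigned membar_missing(unsigned implicit)
{
    return membar_full() & ~implicit;
}
```

Under these assumptions membar_missing(TSO_IMPLICIT) leaves only #StoreLoad, which is the part ldstub is meant to cover, and membar_missing(PSO_IMPLICIT) leaves #StoreLoad | #StoreStore, which is why the proposed sequence pairs ldstub with stbar.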