On Sat, Sep 19, 2015 at 11:33:10PM +0800, Boqun Feng wrote:
> Hi Will,
> 
> On Fri, Sep 18, 2015 at 05:59:02PM +0100, Will Deacon wrote:
> > On Wed, Sep 16, 2015 at 04:49:31PM +0100, Boqun Feng wrote:
> > > On powerpc, we don't need a general memory barrier to achieve acquire and
> > > release semantics, so __atomic_op_{acquire,release} can be implemented
> > > using "lwsync" and "isync".
> > 
> > I'm assuming isync+ctrl isn't transitive, so we need to get to the bottom
> 
> Actually the transitivity is still guaranteed here, I think ;-)
> 
> (Before I put my reasoning, I have to admit I just learned about
> cumulativity recently, so my reasoning may be wrong. But the good thing
> is that we have our POWER experts CCed. In case I'm wrong, they
> could correct me.)
> 
> The thing is, on POWER, transitivity is implemented by a similar but
> slightly different concept, cumulativity, and as said in the link:
> 
> http://www.rdrop.com/users/paulmck/scalability/paper/N2745r.2011.03.04a.html
> 
> """
> The ordering done by a memory barrier is said to be “cumulative” if it
> also orders storage accesses that are performed by processors and
> mechanisms other than P1, as follows.
> 
> *	A includes all applicable storage accesses by any such processor
> 	or mechanism that have been performed with respect to P1 before
> 	the memory barrier is created.
> 
> *	B includes all applicable storage accesses by any such processor
> 	or mechanism that are performed after a Load instruction
> 	executed by that processor or mechanism has returned the value
> 	stored by a store that is in B.
> """
> 
> Please note that the set B can be extended indefinitely without any
> other cumulative barrier.
> 
> So for a RELEASE+ACQUIRE pair to the same variable, as long as the barrier
> in the RELEASE operation is cumulative, the transitivity is guaranteed.
> And lwsync is cumulative, so we are fine here.
> 
> I also wrote a herd litmus to test this. Due to the tool's limitation, I
> use the xchg_release and xchg_acquire to test. And since herd doesn't
Hmm.. I think I wanted to say atomic_xchg_release and atomic_xchg_acquire
here, sorry about that inaccuracy..

> support backward branching, some tricks are used here to work around:

And I checked again, herd does support backward branching; the problem is
just that with backward branching there are a lot more states for the tool
to check. But it seems there are not too many in this case, so I modified
the litmus a little bit as follows:

PPC lwsync+isync-transitivity
""
{
0:r1=1; 0:r2=x; 0:r3=1; 0:r10=0; 0:r11=0; 0:r12=a;
1:r1=9; 1:r2=x; 1:r3=1; 1:r10=0; 1:r11=0; 1:r12=a;
2:r1=9; 2:r2=x; 2:r3=2; 2:r10=0; 2:r11=0; 2:r12=a;
}
 P0           | P1                | P2                ;
 stw r1,0(r2) | lwz r1,0(r2)      | Fail2:            ;
              | lwsync            | lwarx r11,r10,r12 ;
              | Fail1:            | stwcx. r3,r10,r12 ;
              | lwarx r11,r10,r12 | bne Fail2         ;
              | stwcx. r3,r10,r12 | isync             ;
              | bne Fail1         | lwz r1,0(r2)      ;
exists
(1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1=0)

which is actually:

	CPU 0			CPU 1				CPU 2
	==============		=====================		=======================
	{int x = 0, atomic_t a = ATOMIC_INIT(0)}
	WRITE_ONCE(x,1);	t1 = READ_ONCE(x);		t2 = atomic_xchg_acquire(&a, 2);
				atomic_xchg_release(&a, 1);
								t3 = READ_ONCE(x);

	exists (t1 == 1 && t2 == 1 && t3 == 0)

The result is still (it may take a while to get the result):

Test lwsync+isync-transitivity Allowed
States 11
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=0; 2:r11=1;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=0; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=0; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=0; 1:r11=2; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=0;
1:r1=1; 1:r11=0; 2:r1=1; 2:r11=1;
1:r1=1; 1:r11=2; 2:r1=0; 2:r11=0;
1:r1=1; 1:r11=2; 2:r1=1; 2:r11=0;
Loop No
Witnesses
Positive: 0 Negative: 198
Condition exists (1:r1=1 /\ 1:r11=0 /\ 2:r11=1 /\ 2:r1=0)
Observation lwsync+isync-transitivity Never 0 198

, which means transitivity is guaranteed.
And I think it deserves more analysis based on cumulativity:

Initially, for the lwsync on P1 (CPU 1), we have sets A and B of the
storage accesses on the same processor which lwsync orders:

A includes:
	on CPU 1:	lwz r1, 0(r2)		// t1 = READ_ONCE(x);

B includes:
	on CPU 1:	lwarx r11,r10,r12	// atomic_xchg_release();
			stwcx. r3,r10,r12

and as t1 == 1, which means before the lwsync, P1 perceives the STORE of
x on CPU 0, another storage access gets included in A:

A now includes:
	on CPU 0:	stw r1, 0(r2)		// WRITE_ONCE(x,1);
	on CPU 1:	lwz r1, 0(r2)		// t1 = READ_ONCE(x);

B now includes:
	on CPU 1:	lwarx r11,r10,r12	// atomic_xchg_release();
			stwcx. r3,r10,r12

and as t2 == 1, which means on CPU 2, "lwarx r11,r10,r12" in
atomic_xchg_acquire() reads the value stored by "stwcx. r3,r10,r12" in
atomic_xchg_release() on CPU 1, all storage accesses performed after
atomic_xchg_acquire() get included in set B:

A now includes:
	on CPU 0:	stw r1, 0(r2)		// WRITE_ONCE(x,1);
	on CPU 1:	lwz r1, 0(r2)		// t1 = READ_ONCE(x);

B now includes:
	on CPU 1:	lwarx r11,r10,r12	// atomic_xchg_release();
			stwcx. r3,r10,r12
	on CPU 2:	lwz r1, 0(r2)		// t3 = READ_ONCE(x);

Therefore the STORE of x on CPU 0 and the LOAD of x on CPU 2 cannot be
reordered in this case, which means transitivity is guaranteed.

Regards,
Boqun
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev