Stephan Diestelhorst writes:
>
> David Holmes wrote:
> > Stephan Diestelhorst writes:
> > > On Tuesday, 25 November 2014, 11:15:36, Hans Boehm wrote:
> > > > I'm no hardware architect, but fundamentally it seems to me that
> > > >
> > > >   load x
> > > >   acquire_fence
> > > >
> > > > imposes a much more stringent constraint than
> > > >
> > > >   load_acquire x
> > > >
> > > > Consider the case in which the load from x is an L1 hit, but a
> > > > preceding load (from say y) is a long-latency miss. If we enforce
> > > > ordering by just waiting for completion of prior operation, the
> > > > former has to wait for the load from y to complete; while the
> > > > latter doesn't. I find it hard to believe that this doesn't leave
> > > > an appreciable amount of performance on the table, at least for
> > > > some interesting microarchitectures.
> > >
> > > I agree, Hans, that this is a reasonable assumption. Load_acquire x
> > > does allow roach motel, whereas the acquire fence does not.
> > >
> > > > In addition, for better or worse, fencing requirements on at least
> > > > Power are actually driven as much by store atomicity issues, as by
> > > > the ordering issues discussed in the cookbook. This was not
> > > > understood in 2005, and unfortunately doesn't seem to be amenable to
> > > > the kind of straightforward explanation as in Doug's cookbook.
> > >
> > > Coming from a strongly ordered architecture to a weakly ordered one
> > > myself, I also needed some mental adjustment about store (multi-copy)
> > > atomicity. I can imagine others will be unaware of this difference,
> > > too, even in 2014.
> >
> > Sorry I'm missing the connection between fences and multi-copy atomicity.
>
> One example is the classic IRIW.
> With non-multi-copy-atomic stores, but ordered (say through a
> dependency) loads in the following example:
>
> Memory: foo = bar = 0
> _T1_         _T2_         _T3_                             _T4_
> st (foo),1   st (bar),1   ld r1, (bar)                     ld r3,(foo)
>                           <addr dep / local "fence" here>  <addr dep>
>                           ld r2, (foo)                     ld r4, (bar)
>
> You may observe r1 = 1, r2 = 0, r3 = 1, r4 = 0 on non-multi-copy atomic
> machines. On TSO boxes, this is not possible. That means that the
> memory fence that will prevent such a behaviour (DMB on ARM) needs to
> carry some additional oomph in ensuring multi-copy atomicity, or rather
> prevent you from seeing it (which is the same thing).

I take it as given that any code for which you may have ordering
constraints must first have basic atomicity properties for loads and
stores. I would not expect any kind of fence to add multi-copy atomicity
where there was none.

David

> Stephan
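
For anyone who wants to experiment with the IRIW litmus test quoted above, here is a minimal sketch in C++11 atomics. It is only an illustration under stated assumptions: the names `Outcome`, `run_iriw_once`, `is_forbidden` and `count_forbidden` are made up for this example, and the choice of `memory_order_seq_cst` for every access is mine, not something proposed in the thread. With sequentially consistent accesses the language model rules out r1 = 1, r2 = 0, r3 = 1, r4 = 0; if the loads are weakened to `memory_order_acquire`, the model permits that outcome, and whether it actually shows up depends on the hardware.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Register values of the two reader threads (T3 and T4) after one run.
struct Outcome { int r1, r2, r3, r4; };

// Runs the four-thread IRIW test once.
// Assumption of this sketch: every access is memory_order_seq_cst.
Outcome run_iriw_once() {
    std::atomic<int> foo{0}, bar{0};
    Outcome o{};
    std::thread t1([&] { foo.store(1, std::memory_order_seq_cst); });
    std::thread t2([&] { bar.store(1, std::memory_order_seq_cst); });
    std::thread t3([&] {
        o.r1 = bar.load(std::memory_order_seq_cst);
        o.r2 = foo.load(std::memory_order_seq_cst);
    });
    std::thread t4([&] {
        o.r3 = foo.load(std::memory_order_seq_cst);
        o.r4 = bar.load(std::memory_order_seq_cst);
    });
    t1.join(); t2.join(); t3.join(); t4.join();
    return o;
}

// True when the two readers disagree on the order of the two stores.
bool is_forbidden(const Outcome& o) {
    return o.r1 == 1 && o.r2 == 0 && o.r3 == 1 && o.r4 == 0;
}

// Repeats the test and counts how often the readers disagree.
int count_forbidden(int iterations) {
    assert(iterations >= 0);
    int hits = 0;
    for (int i = 0; i < iterations; ++i) {
        if (is_forbidden(run_iriw_once())) ++hits;
    }
    return hits;
}
```

Calling `count_forbidden(n)` from a `main` and printing the result is enough to try it. Spawning fresh threads per iteration keeps the sketch short but gives the four threads little overlap, so a serious litmus harness would reuse threads and synchronise their start.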