On 17/05/2012 02:24, Benjamin Herrenschmidt wrote:
>>> Also, should I make the barrier conditional on kvm_enabled()? I.e. it's
>>> pointless in full emulation and might actually be a performance hit on
>>> something already quite slow...
>
> Finally ... something like smp_mb() in qemu will turn into a lock op or
> an mfence on x86, i.e. not a nop.
>
> That means overhead compared to today's implementation, which leads to
> the question ... is today's implementation correct? I.e., is a barrier
> needed on x86 as well or not?
It depends on what semantics you attach to dma_mb. In my opinion, having
a separate barrier for DMA is wrong, because you want the same semantics
on all architectures.

The x86 requirements are roughly as follows:

1) it never needs explicit rmb and wmb (as long as you don't use
   non-temporal stores etc.);

2) synchronized operations have an implicit mb before and after
   (unlike LL/SC on PowerPC);

3) everywhere else, you need an mb.

So, on x86 you have more or less an implicit wmb after each write and an
implicit rmb before each read. This explains why these kinds of bugs are
very hard (or often impossible) to see on x86.

Adding the barriers in cpu_physical_memory_rw has the advantage that x86
performance is not affected, but it would not cover the case of a device
model doing a DMA read after a DMA write. The device model would then
need to issue an smp_mb manually, on all architectures. I think this is
too brittle.

> If not (I'm trying to figure out why exactly x86 has a barrier in
> the first place and when it's in order), then I might add a new barrier
> type in qemu-barriers.h, something like dma_mb(), and define it as a nop
> on x86, a lwsync or sync (still thinking about it) on ppc, and
> __sync_synchronize() on unknown archs.

I don't think it is correct to think of this in terms of low-level
operations such as sync/lwsync. Rather, I think what we want is
sequentially consistent accesses; it's heavyweight, but you cannot go
wrong.

http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html says this is how
you do it:

  x86 -> Load Seq_Cst:  mov                        or  mfence; mov
         Store Seq_Cst: mov; mfence                or  mov

  ARM -> Load Seq_Cst:  ldr; dmb                   or  dmb; ldr; dmb
         Store Seq_Cst: dmb; str; dmb              or  dmb; str

  PPC -> Load Seq_Cst:  sync; ld; cmp; bc; isync
         Store Seq_Cst: sync; st

where cmp; bc; isync can be replaced by sync.
The same page also says: "As far as the memory model is concerned, the
ARM processor is broadly similar to PowerPC, differing mainly in having
a DMB barrier (analogous to the PowerPC sync in its
programmer-observable behaviour for normal memory) and no analogue of
the PowerPC lwsync".

So one of the two ARM mappings, with smp_mb instead of dmb, is what we
want in cpu_physical_memory_rw. Device models that want to do better
can just use cpu_physical_memory_map, or we can add a
cpu_physical_memory_rw_direct for them.

Paolo