On Tue, May 22, 2012 at 07:58:17AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
> > > mb is more than just "read flush writes" (besides it's not a statement
> > > about flushing, it's a statement about ordering. whether it has a
> > > flushing side effect on x86 is a separate issue, it doesn't on power for
> > > example).
> >
> > I referred to reads not bypassing writes on PCI.
>
> Again, from which originator ? From a given initiator, nothing bypasses
> anything, so the right thing to do here is a full mb(). However, I
> suspect what you are talking about here is read -responses- not
> bypassing writes in the direction of the response (ie, the "flushing"
> semantic of reads) which is a different matter.
No. My spec says:

	A3, A4: A Posted Request must be able to pass Non-Posted Requests
	to avoid deadlocks.

> Also don't forget that
> this is only a semantic of PCI, not of the system fabric, ie, a device
> DMA read doesn't flush a CPU write that is still in that CPU store
> queue.

We need to emulate what hardware does IMO. So if USB has different
rules, it needs different barriers.

> > This is the real argument in my eyes: that we
> > should behave the way real hardware does.
>
> But that doesn't really make much sense since we don't actually have a
> non-coherent bus sitting in the middle :-)
>
> However we should as much as possible be observed to behave as such, I
> agree, though I don't think we need to bother too much about timings
> since we don't really have a way to enforce the immediate visibility of
> stores within the coherent domain without a bunch of arch-specific very
> very heavy hammers which we really don't want to wield at this point.
>
> > > Real flushing out of writes matters very much in real life in two very
> > > different contexts that tend to not affect emulation in qemu as much.
> > >
> > > One is flushing writes in the opposite direction (or rather, having the
> > > read response queued up behind those writes) which is critical to
> > > ensuring proper completion of DMAs after an LSI from a guest driver
> > > perspective on real HW typically.
> > >
> > > The other classic case is to flush posted MMIO writes in order to ensure
> > > that a subsequent delay is respected.
> > >
> > > Most of those don't actually matter when doing emulation. Besides, a
> > > barrier won't provide you the second guarantee; you need a nastier
> > > construct, at least on some architectures like power.
> >
> > Exactly. This is what I was saying too.
>
> Right, and I'm reasonably sure that none of those above is our problem.
>
> As I said, at this point, what I want to sort out is purely the
> observable ordering of DMA transactions. The side effect of reads in one
> direction on writes in the other direction is an orthogonal problem
> which, as I wrote above, is probably not hurting us.
>
> > > However, we do need to ensure that reads and writes are properly ordered
> > > vs. each other (regardless of any "flush" semantic) or things could go
> > > very wrong on OO architectures (here too, my understanding of x86 is
> > > limited).
> >
> > Right. Here's a compromise:
> > - add smp_rmb() on any DMA read
> > - add smp_wmb() on any DMA write
> > This is almost zero cost on x86 at least.
> > So we are not regressing existing setups.
>
> And it's not correct. With that setup, DMA writes can pass DMA reads
> (and vice-versa) which doesn't correspond to the guarantees of the PCI
> spec.

Cite the spec please. The Express spec matches this at least.

> The question I suppose is whether this is a problem in practice...
>
> > Are there any places where devices do a read after a write?
>
> It's possible, things like update of a descriptor followed by reading of
> the next one, etc... I don't have an example hot in mind right now of
> a device that would be hurt, but I'm a bit nervous as this would be a
> violation of the PCI guaranteed ordering.
>
> > My preferred way is to find them and do a pci_dma_flush() invoking
> > smp_mb(). If there is such a case it's likely on the datapath anyway,
> > so we do care.
> >
> > But I can also live with a global flag "latest_dma_read"
> > and on read we could do
> >	if (unlikely(latest_dma_read))
> >		smp_mb();
> >
> > if you really insist on it,
> > though I do think it's inelegant.
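To make the disagreement concrete, here is a minimal sketch of the
compromise above (assuming qemu's cpu_physical_memory_* helpers,
target_phys_addr_t and Linux-style barrier macros; the function names
are made up):

    /* The proposed compromise: rmb on any DMA read, wmb on any DMA
     * write.  This orders reads vs. reads and writes vs. writes, but
     * nothing orders a read against a write from the same initiator,
     * which is the "DMA writes can pass DMA reads" objection, e.g. a
     * descriptor update followed by a read of the next descriptor.
     * A full smp_mb() in both helpers would close that hole.
     */
    static void dma_read_sketch(target_phys_addr_t addr, void *buf, int len)
    {
        smp_rmb();                /* vs. earlier DMA reads only */
        cpu_physical_memory_read(addr, buf, len);
    }

    static void dma_write_sketch(target_phys_addr_t addr,
                                 const void *buf, int len)
    {
        smp_wmb();                /* vs. earlier DMA writes only */
        cpu_physical_memory_write(addr, buf, len);
    }

The latest_dma_read flag quoted above is one way to upgrade to a full
smp_mb() only when the read and write streams actually mix; a
fleshed-out version of that idea is sketched further down.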
> Again, why do you object to simply making the default accessors fully
> ordered? Do you think it will be a measurable difference in most cases?
>
> Shouldn't we measure it first?

It's a lot of work. We measured the effect for virtio in the past.
I don't think we need to redo it.

> > You said above x86 is unaffected. This is portability, not safety.
>
> x86 is unaffected by the missing wmb/rmb, it might not be unaffected by
> the missing ordering between loads and stores, I don't know; as I said,
> I don't fully know the x86 memory model.

You don't need to understand it. Assume memory-barriers.h is correct.

> In any case, opposing "portability" to "safety" the way you do it means
> you are making assumptions that basically "qemu is written for x86 and
> nothing else matters".

No. But find a way to fix power without hurting working setups.

> If that's your point of view, so be it and be clear about it, but I will
> disagree :-) And while I can understand that powerpc might not be
> considered as the most important arch around at this point in time,
> these problems are going to affect ARM as well.
>
> > > Almost all our
> > > devices were written without any thought given to ordering, so they
> > > basically can and should be considered as all broken.
> >
> > Problem is, a lot of code is likely broken even after you sprinkle
> > barriers around. For example qemu might write A then B where the guest
> > driver expects to see B written before A.
>
> No, those are totally unrelated bugs, nothing to do with barriers. You are
> mixing up two completely different problems and using one as an excuse
> to not fix the other one :-)
>
> A device with the above problem would be broken today on x86 regardless.
>
> > > Since thinking
> > > about ordering is something that, by experience, very few programmers can
> > > do and get right, the default should absolutely be fully ordered.
> >
> > Give it bus ordering. That is not fully ordered.
>
> It pretty much is actually, look at your PCI spec :-)

I looked: 2.4.1, Transaction Ordering Rules.

> > > Performance regressions aren't a big deal to bisect in that case: if
> > > there's a regression for a given driver and it points to this specific
> > > patch adding the barriers, then we know precisely where the regression
> > > comes from, and we can get some insight about how this specific driver
> > > can be improved to use more relaxed accessors.
> > >
> > > I don't see the problem.
> > >
> > > One thing that might be worth looking at is whether mb() is indeed so
> > > much more costly than just wmb/rmb, in which case we could have some
> > > smarts in the accessors to make them skip the full mb based on knowledge
> > > of the previous access direction, though here too I would be tempted to
> > > only do that if absolutely necessary (ie if we can't instead just fix the
> > > sensitive driver to use explicitly relaxed accessors).
> >
> > We did this in virtio and yes, it is measurable.
>
> You did it in virtio in a very hot spot in a performance-critical
> driver. My argument is that:
>
> - We can do it in a way that doesn't affect virtio at all (by using the
>   dma accessors instead of cpu_*)
>
> - Only a few drivers have that kind of performance criticality and they
>   can be easily hand-fixed.
>
> > branches are pretty cheap though.
>
> Depends, not always, but yes, cheaper than barriers in many cases.

> > > So on that I will not compromise.
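A rough sketch of those direction-tracking smarts (everything here is
hypothetical, not an existing qemu API; assumes qemu's unlikely() and
Linux-style barriers): pay for the full mb() only when the access
direction flips.

    typedef enum { DMA_DIR_NONE, DMA_DIR_READ, DMA_DIR_WRITE } DMADir;

    static DMADir last_dma_dir = DMA_DIR_NONE;

    /* Full smp_mb() only on a direction change; same-direction
     * accesses get away with the cheaper rmb/wmb.
     */
    static void dma_barrier(DMADir dir)
    {
        if (unlikely(last_dma_dir != DMA_DIR_NONE && last_dma_dir != dir)) {
            smp_mb();        /* read-after-write or write-after-read */
        } else if (dir == DMA_DIR_READ) {
            smp_rmb();       /* read-after-read */
        } else {
            smp_wmb();       /* write-after-write */
        }
        last_dma_dir = dir;
    }

This is the branch-vs-barrier trade-off discussed above: streams of
same-direction accesses pay only the cheap branch plus rmb/wmb, and the
full barrier fires only on transitions such as a descriptor update
followed by reading the next one.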
> > > > > However, I think it might be better to leave the barrier in the dma
> > > > > accessor since that's how you also get iommu transparency etc... so it's
> > > > > not a bad place to put them, and leave the cpu_physical_* for use by
> > > > > lower-level device drivers which are thus responsible also for dealing
> > > > > with ordering if they have to.
> > > > >
> > > > > Cheers,
> > > > > Ben.
> > > >
> > > > You claim to understand what matters for all devices; I doubt that.
> > >
> > > It's pretty obvious that anything that does DMA using a classic
> > > descriptor + buffers structure is broken without appropriate ordering.
> > >
> > > And yes, I claim to have a fairly good idea of the problem, but I don't
> > > think throwing credentials around is going to be helpful.
> > >
> > > > Why don't we add safe APIs, then go over devices and switch over?
> > > > I counted 97 pci_dma_ accesses:
> > > > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > > >
> > > > Let maintainers decide where speed matters.
> > >
> > > No. Let's fix the general bug first. Then let people who know the
> > > individual drivers intimately and understand their access patterns make
> > > the call as to when things can/should be relaxed.
> > >
> > > Ben.
> >
> > As a maintainer of a device, if you send me a patch I can review it.
> > If you change core APIs, creating performance regressions,
> > I don't even know what to review without wasting time debugging
> > and bisecting.
>
> Well, if you don't know then you ask on the list, and others (such as
> myself or a certain mst) who happen to know those issues will help that
> lone clueless maintainer. Seriously, it's not like it's hard, and a lot
> easier than just keeping everything broken and hoping we get to audit
> them all properly.
>
> > According to what you said you want to fix kvm on powerpc.
> > Good. Find a way that looks non-intrusive on x86 please.
>
> And ARM. And any other OO arch.
>
> Ben.
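For reference, a sketch of the layering being argued for (signatures
simplified and hypothetical): the barrier lives in the dma_* accessor,
which is also where iommu translation would hook in, while
cpu_physical_memory_* stays raw for lower-level code that handles
ordering itself.

    static inline void dma_memory_read_sketch(dma_addr_t addr,
                                              void *buf, int len)
    {
        smp_mb();                 /* fully ordered, the safe default */
        cpu_physical_memory_read(addr, buf, len);
    }

    /* Devices converted to pci_dma_* get the ordered behaviour for
     * free; hot paths can later be hand-switched to relaxed variants.
     * (dev is unused in this sketch; in real code it would supply the
     * per-device iommu context.)
     */
    static inline void pci_dma_read_sketch(PCIDevice *dev, dma_addr_t addr,
                                           void *buf, int len)
    {
        dma_memory_read_sketch(addr, buf, len);
    }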