On Tue, May 22, 2012 at 07:58:17AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
> > > mb is more than just "read flush writes" (besides it's not a statement
> > > about flushing, it's a statement about ordering. whether it has a
> > > flushing side effect on x86 is a separate issue, it doesn't on power for
> > > example).
> >
> > I referred to reads not bypassing writes on PCI.
>
> Again, from which originator ? From a given initiator, nothing bypasses
> anything, so the right thing to do here is a full mb(). However, I
> suspect what you are talking about here is read -responses- not
> bypassing writes in the direction of the response (ie, the "flushing"
> semantic of reads) which is a different matter.
No. My spec says:

	A3, A4: A Posted Request must be able to pass Non-Posted Requests
	to avoid deadlocks.

> Also don't forget that
> this is only a semantic of PCI, not of the system fabric, ie, a device
> DMA read doesn't flush a CPU write that is still in that CPU store
> queue.

We need to emulate what hardware does IMO. So if USB has different
rules, it needs different barriers.

> > This is the real argument in my eyes: that we
> > should behave the way real hardware does.
>
> But that doesn't really make much sense since we don't actually have a
> non-coherent bus sitting in the middle :-)
>
> However we should as much as possible be observed to behave as such, I
> agree, though I don't think we need to bother too much about timings
> since we don't really have a way to enforce the immediate visibility of
> stores within the coherent domain without a bunch of arch-specific very
> very heavy hammers which we really don't want to wield at this point.
>
> > > Real flushing out of writes matters very much in real life in two very
> > > different contexts that tend to not affect emulation in qemu as much.
> > >
> > > One is flushing writes in the opposite direction (or rather, having the
> > > read response queued up behind those writes) which is critical to
> > > ensuring proper completion of DMAs after an LSI from a guest driver
> > > perspective on real HW typically.
> > >
> > > The other classic case is to flush posted MMIO writes in order to ensure
> > > that a subsequent delay is respected.
> > >
> > > Most of those don't actually matter when doing emulation. Besides, a
> > > barrier won't provide you the second guarantee; you need a nastier
> > > construct, at least on some architectures like power.
> >
> > Exactly. This is what I was saying too.
>
> Right, and I'm reasonably sure that none of those above is our problem.
>
> As I said, at this point, what I want to sort out is purely the
> observable ordering of DMA transactions. The side effect of reads in one
> direction on writes in the other direction is an orthogonal problem
> which, as I wrote above, is probably not hurting us.
>
> > > However, we do need to ensure that reads and writes are properly ordered
> > > vs. each other (regardless of any "flush" semantic) or things could go
> > > very wrong on OO architectures (here too, my understanding of x86 is
> > > limited).
> >
> > Right. Here's a compromise:
> > - add smp_rmb() on any DMA read
> > - add smp_wmb() on any DMA write
> > This is almost zero cost on x86 at least.
> > So we are not regressing existing setups.
>
> And it's not correct. With that setup, DMA writes can pass DMA reads
> (and vice-versa) which doesn't correspond to the guarantees of the PCI
> spec.

Cite the spec please. The Express spec matches this at least.

> The question I suppose is whether this is a problem in practice...
>
> > Are there any places where devices do a read after a write?
>
> It's possible, things like update of a descriptor followed by reading of
> the next one, etc... I don't have an example hot in mind right now of
> a device that would be hurt, but I'm a bit nervous as this would be a
> violation of the PCI guaranteed ordering.
>
> > My preferred way is to find them and do a pci_dma_flush() invoking
> > smp_mb(). If there is such a case it's likely on the datapath anyway,
> > so we do care.
> >
> > But I can also live with a global flag "latest_dma_read"
> > and on read we could do
> >	if (unlikely(latest_dma_read))
> >		smp_mb();
> >
> > if you really insist on it,
> > though I do think it's inelegant.
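To make the disagreement concrete, here is a minimal sketch of the
compromise above (assuming qemu's cpu_physical_memory_* helpers,
target_phys_addr_t and Linux-style barrier macros; the function names
are made up):

    /* The proposed compromise: rmb on any DMA read, wmb on any DMA
     * write.  This orders reads vs. reads and writes vs. writes, but
     * nothing orders a read against a write from the same initiator,
     * which is the "DMA writes can pass DMA reads" objection, e.g. a
     * descriptor update followed by a read of the next descriptor.
     * A full smp_mb() in both helpers would close that hole.
     */
    static void dma_read_sketch(target_phys_addr_t addr, void *buf, int len)
    {
        smp_rmb();                /* vs. earlier DMA reads only */
        cpu_physical_memory_read(addr, buf, len);
    }

    static void dma_write_sketch(target_phys_addr_t addr,
                                 const void *buf, int len)
    {
        smp_wmb();                /* vs. earlier DMA writes only */
        cpu_physical_memory_write(addr, buf, len);
    }

The latest_dma_read flag quoted above is one way to upgrade to a full
smp_mb() only when the read and write streams actually mix; a
fleshed-out version of that idea is sketched further down.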
> Again, why do you object to simply making the default accessors fully
> ordered? Do you think it will be a measurable difference in most cases?
>
> Shouldn't we measure it first?

It's a lot of work. We measured the effect for virtio in the past.
I don't think we need to redo it.

> > You said above x86 is unaffected. This is portability, not safety.
>
> x86 is unaffected by the missing wmb/rmb, it might not be unaffected by
> the missing ordering between loads and stores, I don't know; as I said,
> I don't fully know the x86 memory model.

You don't need to understand it. Assume memory-barriers.h is correct.

> In any case, opposing "portability" to "safety" the way you do it means
> you are making assumptions that basically "qemu is written for x86 and
> nothing else matters".

No. But find a way to fix power without hurting working setups.

> If that's your point of view, so be it and be clear about it, but I will
> disagree :-) And while I can understand that powerpc might not be
> considered as the most important arch around at this point in time,
> these problems are going to affect ARM as well.
>
> > > Almost all our
> > > devices were written without any thought given to ordering, so they
> > > basically can and should be considered as all broken.
> >
> > Problem is, a lot of code is likely broken even after you sprinkle
> > barriers around. For example qemu might write A then B where the guest
> > driver expects to see B written before A.
>
> No, those are totally unrelated bugs, nothing to do with barriers. You are
> mixing up two completely different problems and using one as an excuse
> to not fix the other one :-)
>
> A device with the above problem would be broken today on x86 regardless.
>
> > > Since thinking
> > > about ordering is something that, by experience, very few programmers can
> > > do and get right, the default should absolutely be fully ordered.
> >
> > Give it bus ordering. That is not fully ordered.
>
> It pretty much is actually, look at your PCI spec :-)

I looked: 2.4.1, Transaction Ordering Rules.

> > > Performance regressions aren't a big deal to bisect in that case: if
> > > there's a regression for a given driver and it points to this specific
> > > patch adding the barriers, then we know precisely where the regression
> > > comes from, and we can get some insight about how this specific driver
> > > can be improved to use more relaxed accessors.
> > >
> > > I don't see the problem.
> > >
> > > One thing that might be worth looking at is whether mb() is indeed so
> > > much more costly than just wmb/rmb, in which case we could have some
> > > smarts in the accessors to make them skip the full mb based on knowledge
> > > of the previous access direction, though here too I would be tempted to
> > > only do that if absolutely necessary (ie if we can't instead just fix the
> > > sensitive driver to use explicitly relaxed accessors).
> >
> > We did this in virtio and yes, it is measurable.
>
> You did it in virtio in a very hot spot in a performance-critical
> driver. My argument is that:
>
> - We can do it in a way that doesn't affect virtio at all (by using the
>   dma accessors instead of cpu_*)
>
> - Only a few drivers have that kind of performance criticality and they
>   can be easily hand-fixed.
>
> > branches are pretty cheap though.
>
> Depends, not always, but yes, cheaper than barriers in many cases.

> > > So on that I will not compromise.
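A rough sketch of those direction-tracking smarts (everything here is
hypothetical, not an existing qemu API; assumes qemu's unlikely() and
Linux-style barriers): pay for the full mb() only when the access
direction flips.

    typedef enum { DMA_DIR_NONE, DMA_DIR_READ, DMA_DIR_WRITE } DMADir;

    static DMADir last_dma_dir = DMA_DIR_NONE;

    /* Full smp_mb() only on a direction change; same-direction
     * accesses get away with the cheaper rmb/wmb.
     */
    static void dma_barrier(DMADir dir)
    {
        if (unlikely(last_dma_dir != DMA_DIR_NONE && last_dma_dir != dir)) {
            smp_mb();        /* read-after-write or write-after-read */
        } else if (dir == DMA_DIR_READ) {
            smp_rmb();       /* read-after-read */
        } else {
            smp_wmb();       /* write-after-write */
        }
        last_dma_dir = dir;
    }

This is the branch-vs-barrier trade-off discussed above: streams of
same-direction accesses pay only the cheap branch plus rmb/wmb, and the
full barrier fires only on transitions such as a descriptor update
followed by reading the next one.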
> > > > > However, I think it might be better to leave the barrier in the dma
> > > > > accessor since that's how you also get iommu transparency etc... so it's
> > > > > not a bad place to put them, and leave the cpu_physical_* for use by
> > > > > lower-level device drivers which are thus responsible also for dealing
> > > > > with ordering if they have to.
> > > > >
> > > > > Cheers,
> > > > > Ben.
> > > >
> > > > You claim to understand what matters for all devices; I doubt that.
> > >
> > > It's pretty obvious that anything that does DMA using a classic
> > > descriptor + buffers structure is broken without appropriate ordering.
> > >
> > > And yes, I claim to have a fairly good idea of the problem, but I don't
> > > think throwing credentials around is going to be helpful.
> > >
> > > > Why don't we add safe APIs, then go over devices and switch over?
> > > > I counted 97 pci_dma_ accesses:
> > > > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > > >
> > > > Let maintainers decide where speed matters.
> > >
> > > No. Let's fix the general bug first. Then let people who know the
> > > individual drivers intimately and understand their access patterns make
> > > the call as to when things can/should be relaxed.
> > >
> > > Ben.
> >
> > As a maintainer of a device, if you send me a patch I can review it.
> > If you change core APIs, creating performance regressions,
> > I don't even know what to review without wasting time debugging
> > and bisecting.
>
> Well, if you don't know then you ask on the list, and others (such as
> myself or a certain mst) who happen to know those issues will help that
> lone clueless maintainer. Seriously, it's not like it's hard, and a lot
> easier than just keeping everything broken and hoping we get to audit
> them all properly.
>
> > According to what you said you want to fix kvm on powerpc.
> > Good. Find a way that looks non-intrusive on x86 please.
>
> And ARM. And any other OO arch.
>
> Ben.
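For reference, a sketch of the layering being argued for (signatures
simplified and hypothetical): the barrier lives in the dma_* accessor,
which is also where iommu translation would hook in, while
cpu_physical_memory_* stays raw for lower-level code that handles
ordering itself.

    static inline void dma_memory_read_sketch(dma_addr_t addr,
                                              void *buf, int len)
    {
        smp_mb();                 /* fully ordered, the safe default */
        cpu_physical_memory_read(addr, buf, len);
    }

    /* Devices converted to pci_dma_* get the ordered behaviour for
     * free; hot paths can later be hand-switched to relaxed variants.
     * (dev is unused in this sketch; in real code it would supply the
     * per-device iommu context.)
     */
    static inline void pci_dma_read_sketch(PCIDevice *dev, dma_addr_t addr,
                                           void *buf, int len)
    {
        dma_memory_read_sketch(addr, buf, len);
    }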