Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function

2012-05-21 Thread Benjamin Herrenschmidt
On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:

  mb is more than just 'reads flush writes' (besides, it's not a statement
  about flushing, it's a statement about ordering. Whether it has a
  flushing side effect on x86 is a separate issue; it doesn't on power, for
  example).
 
 I referred to reads not bypassing writes on PCI.

Again, from which originator ? From a given initiator, nothing bypasses
anything, so the right thing to do here is a full mb(). However, I
suspect what you are talking about here is read -responses- not
bypassing writes in the direction of the response (ie, the flushing
semantic of reads) which is a different matter. Also don't forget that
this is only a semantic of PCI, not of the system fabric, ie, a device
DMA read doesn't flush a CPU write that is still in that CPU store
queue.
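
To make that concrete, here is a minimal sketch of DMA accessors that issue
a full mb() on every access. The helper names, signatures and the smp_mb()
primitive are illustrative only, not the actual functions in the tree:

    /* Sketch only: assumes QEMU-style helpers and a Linux-style smp_mb().
     * Every DMA access goes through a full barrier, so accesses issued by
     * a device model are observed by the guest in program order. */
    static inline void dma_read_ordered(target_phys_addr_t addr,
                                        void *buf, int len)
    {
        smp_mb();   /* full barrier: order against any previous DMA access */
        cpu_physical_memory_read(addr, buf, len);
    }

    static inline void dma_write_ordered(target_phys_addr_t addr,
                                         const void *buf, int len)
    {
        smp_mb();   /* full barrier: order against any previous DMA access */
        cpu_physical_memory_write(addr, buf, len);
    }

The cost is one full barrier per access, which is what the rest of the
discussion is about.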

 This is the real argument in my eyes: that we
 should behave the way real hardware does.

But that doesn't really make much sense since we don't actually have a
non-coherent bus sitting in the middle :-)

However we should as much as possible be observed to behave as such, I
agree, though I don't think we need to bother too much about timings
since we don't really have a way to enforce the immediate visibility of
stores within the coherent domain without a bunch of arch specific very
very heavy hammers which we really don't want to wield at this point.

  Real flushing out writes matters very much in real life in two very
  different contexts that tend to not affect emulation in qemu as much.
  
  One is flushing writes in the opposite direction (or rather, having the
  read response queued up behind those writes) which is critical to
  ensuring proper completion of DMAs after an LSI from a guest driver
  perspective on real HW typically.
  
  The other classic case is to flush posted MMIO writes in order to ensure
  that a subsequent delay is respected.
  
  Most of those don't actually matter when doing emulation. Besides a
  barrier won't provide you the second guarantee, you need a nastier
  construct at least on some architectures like power.
 
 Exactly. This is what I was saying too.

Right and I'm reasonably sure that none of those above is our problem. 

As I said, at this point, what I want to sort out is purely the
observable ordering of DMA transactions. The side effect of reads in one
direction on writes in the other direction is an orthogonal problem
which as I wrote above is probably not hurting us.

  However, we do need to ensure that reads and writes are properly ordered
  vs. each other (regardless of any flush semantic) or things could go
  very wrong on OO architectures (here too, my understanding on x86 is
  limited).
 
 Right. Here's a compromise:
 - add smp_rmb() on any DMA read
 - add smp_wmb() on any DMA write
 This is almost zero cost on x86 at least.
 So we are not regressing existing setups.
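
Concretely, that compromise amounts to something like the following sketch
(helper names and signatures are illustrative, not the actual tree
functions):

    static inline void dma_read_rmb(target_phys_addr_t addr,
                                    void *buf, int len)
    {
        smp_rmb();   /* order this read after earlier DMA reads only */
        cpu_physical_memory_read(addr, buf, len);
    }

    static inline void dma_write_wmb(target_phys_addr_t addr,
                                     const void *buf, int len)
    {
        smp_wmb();   /* order this write after earlier DMA writes only */
        cpu_physical_memory_write(addr, buf, len);
    }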

And it's not correct. With that setup, DMA writes can pass DMA reads
(and vice-versa) which doesn't correspond to the guarantees of the PCI
spec. The question I suppose is whether this is a problem in practice...

 Are there any places where devices do read after write?

It's possible: things like the update of a descriptor followed by reading of
the next one, etc.  I don't have an example in mind right now of
a device that would be hurt, but I'm a bit nervous as this would be a
violation of the ordering the PCI spec guarantees.

 My preferred way is to find them and do pci_dma_flush() invoking
 smp_mb(). If there is such a case it's likely on datapath anyway
 so we do care.
 
 But I can also live with a global flag latest_dma_read
 and on read we could do
   if (unlikely(latest_dma_read))
   smp_mb();
 
 if you really insist on it
 though I do think it's inelegant.
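
Fleshed out a little, one way to read the flag idea is the sketch below. It
is an interpretation rather than a literal transcription of the pseudo-code
above: the direction of the last DMA access is tracked, and the barrier is
promoted to a full smp_mb() only when a read follows a write, which is the
ordering the plain rmb/wmb variant does not provide. Names and placement
are illustrative:

    /* Global, as in the proposal; a real version would have to worry
     * about concurrency (per-device state or atomics). */
    static bool last_dma_was_write;

    static inline void dma_read_tracked(target_phys_addr_t addr,
                                        void *buf, int len)
    {
        if (unlikely(last_dma_was_write)) {
            smp_mb();    /* a read after a write needs the full barrier */
        } else {
            smp_rmb();
        }
        cpu_physical_memory_read(addr, buf, len);
        last_dma_was_write = false;
    }

    static inline void dma_write_tracked(target_phys_addr_t addr,
                                         const void *buf, int len)
    {
        smp_wmb();
        cpu_physical_memory_write(addr, buf, len);
        last_dma_was_write = true;
    }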

Again, why do you object to simply making the default accessors fully
ordered ? Do you think it will be a measurable difference in most cases ?

Shouldn't we measure it first ?

 You said above x86 is unaffected. This is portability, not safety.

x86 is unaffected by the missing wmb/rmb, it might not be unaffected by
the missing ordering between loads and stores, I don't know, as I said,
I don't fully know the x86 memory model.

In any case, opposing portability to safety the way you do it means
you are making assumptions that basically qemu is written for x86 and
nothing else matters.

If that's your point of view, so be it and be clear about it, but I will
disagree :-) And while I can understand that powerpc might not be
considered as the most important arch around at this point in time,
these problems are going to affect ARM as well.

  Almost all our
  devices were written without any thought given to ordering, so they
  basically can and should be considered as all broken.
 
 Problem is, a lot of code is likely broken even after you sprinkle
 barriers around. For example qemu might write A then B where guest driver
 expects to see B written before A.

No, this is totally unrelated bugs, nothing to do with barriers. You are
mixing up two completely different problems and using one as an excuse to
not fix the other one :-)

Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function

2012-05-21 Thread Michael S. Tsirkin
On Tue, May 22, 2012 at 07:58:17AM +1000, Benjamin Herrenschmidt wrote:
 On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
 
   mb is more than just 'reads flush writes' (besides, it's not a statement
   about flushing, it's a statement about ordering. Whether it has a
   flushing side effect on x86 is a separate issue; it doesn't on power, for
   example).
  
  I referred to reads not bypassing writes on PCI.
 
 Again, from which originator ? From a given initiator, nothing bypasses
 anything, so the right thing to do here is a full mb(). However, I
 suspect what you are talking about here is read -responses- not
 bypassing writes in the direction of the response (ie, the flushing
 semantic of reads) which is a different matter.

No. My spec says:
A3, A4
A Posted Request must be able to pass Non-Posted Requests to avoid
deadlocks.


 Also don't forget that
 this is only a semantic of PCI, not of the system fabric, ie, a device
 DMA read doesn't flush a CPU write that is still in that CPU store
 queue.

We need to emulate what hardware does IMO.
So if USB has different rules it needs different barriers.


  This is the real argument in my eyes: that we
  should behave the way real hardware does.
 
 But that doesn't really make much sense since we don't actually have a
 non-coherent bus sitting in the middle :-)
 
 However we should as much as possible be observed to behave as such, I
 agree, though I don't think we need to bother too much about timings
 since we don't really have a way to enforce the immediate visibility of
 stores within the coherent domain without a bunch of arch specific very
 very heavy hammers which we really don't want to wield at this point.
 
   Real flushing out writes matters very much in real life in two very
   different contexts that tend to not affect emulation in qemu as much.
   
   One is flushing writes in the opposite direction (or rather, having the
   read response queued up behind those writes) which is critical to
   ensuring proper completion of DMAs after an LSI from a guest driver
   perspective on real HW typically.
   
   The other classic case is to flush posted MMIO writes in order to ensure
   that a subsequent delay is respected.
   
   Most of those don't actually matter when doing emulation. Besides a
   barrier won't provide you the second guarantee, you need a nastier
   construct at least on some architectures like power.
  
  Exactly. This is what I was saying too.
 
 Right and I'm reasonably sure that none of those above is our problem. 
 
 As I said, at this point, what I want to sort out is purely the
 observable ordering of DMA transactions. The side effect of reads in one
 direction on writes in the other direction is an orthogonal problem
 which as I wrote above is probably not hurting us.
 
   However, we do need to ensure that reads and writes are properly ordered
   vs. each other (regardless of any flush semantic) or things could go
   very wrong on OO architectures (here too, my understanding on x86 is
   limited).
  
  Right. Here's a compromise:
  - add smp_rmb() on any DMA read
  - add smp_wmb() on any DMA write
  This is almost zero cost on x86 at least.
  So we are not regressing existing setups.
 
 And it's not correct. With that setup, DMA writes can pass DMA reads
 (and vice-versa) which doesn't correspond to the guarantees of the PCI
 spec.

Cite the spec please. The Express spec matches this, at least.

 The question I suppose is whether this is a problem in practice...
 
  Are there any places where devices do read after write?
 
 It's possible: things like the update of a descriptor followed by reading of
 the next one, etc.  I don't have an example in mind right now of
 a device that would be hurt, but I'm a bit nervous as this would be a
 violation of the ordering the PCI spec guarantees.
 
  My preferred way is to find them and do pci_dma_flush() invoking
  smp_mb(). If there is such a case it's likely on datapath anyway
  so we do care.
  
  But I can also live with a global flag latest_dma_read
  and on read we could do
  if (unlikely(latest_dma_read))
  smp_mb();
  
  if you really insist on it
  though I do think it's inelegant.
 
 Again, why do you object to simply making the default accessors fully
 ordered ? Do you think it will be a measurable difference in most cases ?
 
 Shouldn't we measure it first ?

It's a lot of work. We measured the effect for virtio in
the past. I don't think we need to redo it.

  You said above x86 is unaffected. This is portability, not safety.
 
 x86 is unaffected by the missing wmb/rmb, it might not be unaffected by
 the missing ordering between loads and stores, I don't know, as I said,
 I don't fully know the x86 memory model.

You don't need to understand it. Assume memory-barriers.h is correct.

 In any case, opposing portability to safety the way you do it means
 you are making assumptions that basically qemu is written for x86 and
 nothing else matters.

No. But find a way to fix power without hurting working setups.

Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function

2012-05-21 Thread Benjamin Herrenschmidt
On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
  Again, from which originator ? From a given initiator, nothing
 bypasses
  anything, so the right thing to do here is a full mb(). However, I
  suspect what you are talking about here is read -responses- not
  bypassing writes in the direction of the response (ie, the
 flushing
  semantic of reads) which is a different matter.
 
 No. My spec says:
 A3, A4
 A Posted Request must be able to pass Non-Posted Requests to avoid
 deadlocks.

Right, a read + write can become write + read at the target, I forgot
about that, or you can deadlock due to the flush semantics, but a write
+ read must remain in order or am I missing something ?

And write + read afaik is typically the one that x86 can re-order
without a barrier isn't it ?
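
For illustration, the write-then-read pattern at issue looks something like
this in a device model. The function, addresses and types are made up for
the example rather than taken from any existing device:

    /* Complete the current descriptor, then fetch the next one. */
    static void complete_and_fetch_next(target_phys_addr_t desc_addr,
                                        target_phys_addr_t next_addr)
    {
        uint32_t status = 1;    /* hypothetical "done" bit */
        uint32_t next_desc;

        /* DMA write: mark the current descriptor complete. */
        cpu_physical_memory_write(desc_addr, (void *)&status, sizeof(status));

        /* Without a full barrier, the read below can be satisfied before
         * the status write is visible to the guest: store-load reordering,
         * which even x86 permits. */
        smp_mb();

        /* DMA read: fetch the next descriptor. */
        cpu_physical_memory_read(next_addr, (void *)&next_desc,
                                 sizeof(next_desc));
    }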

  Also don't forget that
  this is only a semantic of PCI, not of the system fabric, ie, a
 device
  DMA read doesn't flush a CPU write that is still in that CPU store
  queue.
 
 We need to emulate what hardware does IMO.
 So if USB has different rules it needs different barriers.

Who talks about USB here ? Whatever rules USB has only matter between
the USB device and the USB controller emulation, USB doesn't sit
directly on the memory bus doing DMA, it all goes through the HCI, which
adheres to the ordering rules of whatever bus it sits on.

Here, sanity must prevail :-) I suggest ordering by default...

  And it's not correct. With that setup, DMA writes can pass DMA reads
  (and vice-versa) which doesn't correspond to the guarantees of the
 PCI
  spec.
 
 Cite the spec please. Express spec matches this at least.

Sure, see above. Yes, I did forget that a read + write could be
re-ordered on PCI, but that isn't the case for a write + read, or am I
reading the table sideways ?

 It's a lot of work. We measured the effect for virtio in
 the past. I don't think we need to redo it.

virtio is specifically our high performance case and what I'm proposing
isn't affecting it.

   You said above x86 is unaffected. This is portability, not safety.
  
  x86 is unaffected by the missing wmb/rmb, it might not be unaffected
 by
  the missing ordering between loads and stores, I don't know, as I
 said,
  I don't fully know the x86 memory model.
 
 You don't need to understand it. Assume memory-barriers.h is correct.

In which case we still need a full mb() unless we can convince ourselves
that the ordering between a write and a subsequent read can be relaxed
safely and I'm really not sure about it.

  In any case, opposing portability to safety the way you do it
 means
  you are making assumptions that basically qemu is written for x86
 and
  nothing else matters.
 
 No. But find a way to fix power without hurting working setups.

And ARM ;-)

Arguably x86 is wrong too anyway, at least from a strict interpretation
of the spec (and unless I missed something).

  If that's your point of view, so be it and be clear about it, but I
 will
  disagree :-) And while I can understand that powerpc might not be
  considered as the most important arch around at this point in time,
  these problems are going to affect ARM as well.
  
Almost all our
devices were written without any thought given to ordering, so
 they
basically can and should be considered as all broken.
   
   Problem is, a lot of code is likely broken even after you sprinkle
   barriers around. For example qemu might write A then B where guest
 driver
   expects to see B written before A.
  
  No, this is totally unrelated bugs, nothing to do with barriers. You
 are
  mixing up two completely different problems and using one as an
 excuse
  to not fix the other one :-)
  
  A device with the above problem would be broken today on x86
 regardless.
  
Since thinking
about ordering is something that, by experience, very few
 programmers can
do and get right, the default should absolutely be fully
 ordered.
   
   Give it bus ordering. That is not fully ordered.
  
  It pretty much is actually, look at your PCI spec :-)
 
 I looked. 2.4.1.  Transaction Ordering Rules
 
Performance regressions aren't a big deal to bisect in that
 case: If
there's a regression for a given driver and it points to this
 specific
patch adding the barriers then we know precisely where the
 regression
comes from, and we can get some insight about how this specific
 driver
can be improved to use more relaxed accessors.

I don't see the problem.

One thing that might be worth looking at is if indeed mb() is so
 much
more costly than just wmb/rmb, in which circumstances we could
 have some
smarts in the accessors to make them skip the full mb based on
 knowledge
of previous access direction, though here too I would be tempted
 to only
do that if absolutely necessary (ie if we can't instead just fix
 the
sensitive driver to use explicitly relaxed accessors).
   
   We did this in virtio and yes it is measurable.
  
  You did it in 

Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function

2012-05-21 Thread Benjamin Herrenschmidt
On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
  Again, from which originator ? From a given initiator, nothing
 bypasses
  anything, so the right thing to do here is a full mb(). However, I
  suspect what you are talking about here is read -responses- not
  bypassing writes in the direction of the response (ie, the
 flushing
  semantic of reads) which is a different matter.
 
 No. My spec says:
 A3, A4
 A Posted Request must be able to pass Non-Posted Requests to avoid
 deadlocks. 

An additional note about that one: It only applies obviously if the
initiator doesn't wait for the read response before shooting the write.

Most devices actually do wait. Here too, we would have to explicitly
work out what the right semantics are on an individual device basis.

Ben.





Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function

2012-05-21 Thread Rusty Russell
On Tue, 22 May 2012 07:58:17 +1000, Benjamin Herrenschmidt 
b...@kernel.crashing.org wrote:
 On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
  But I can also live with a global flag latest_dma_read
  and on read we could do
  if (unlikely(latest_dma_read))
  smp_mb();
  
  if you really insist on it
  though I do think it's inelegant.
 
 Again, why do you object to simply making the default accessors fully
 ordered ? Do you think it will be a measurable difference in most cases ?
 
 Shouldn't we measure it first ?

Yes.  It seems clear to me that qemu's default DMA operations should be
strictly ordered.  It's just far easier to get right.

After that, we can get tricky with conditional barriers, and we can get
tricky with using special unordered variants in critical drivers, but I
really don't want to be chasing subtle SMP ordering problems in
production.
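
As a sketch of that split, with purely illustrative names: the default
accessor carries the full barrier, and a separately named variant lets an
audited hot path opt out:

    static inline void dma_read(target_phys_addr_t addr, void *buf, int len)
    {
        smp_mb();    /* safe, strictly ordered default */
        cpu_physical_memory_read(addr, buf, len);
    }

    static inline void dma_read_unordered(target_phys_addr_t addr,
                                          void *buf, int len)
    {
        /* No barrier: the caller (e.g. an audited virtio fast path) is
         * expected to place its own smp_rmb()/smp_wmb() where required. */
        cpu_physical_memory_read(addr, buf, len);
    }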

If you're working on ARM or PPC, "it works for x86" makes it *worse*,
not better, since you have fewer users to find bugs.

Cheers,
Rusty.




Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function

2012-05-21 Thread Michael S. Tsirkin
On Tue, May 22, 2012 at 08:56:12AM +1000, Benjamin Herrenschmidt wrote:
 On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
   Again, from which originator ? From a given initiator, nothing
  bypasses
   anything, so the right thing to do here is a full mb(). However, I
   suspect what you are talking about here is read -responses- not
   bypassing writes in the direction of the response (ie, the
  flushing
   semantic of reads) which is a different matter.
  
  No. My spec says:
  A3, A4
  A Posted Request must be able to pass Non-Posted Requests to avoid
  deadlocks.
 
 Right, a read + write can become write + read at the target, I forgot
 about that, or you can deadlock due to the flush semantics, but a write
 + read must remain in order or am I missing something ?

Exactly.

 And write + read afaik is typically the one that x86 can re-order
 without a barrier isn't it ?

AFAIK without a barrier, x86 can reorder them however you initiate them.

-- 
MST