Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
> > mb is more than just "reads flush writes" (besides, it's not a
> > statement about flushing, it's a statement about ordering. Whether it
> > has a flushing side effect on x86 is a separate issue; it doesn't on
> > power, for example).
>
> I referred to reads not bypassing writes on PCI.

Again, from which originator? From a given initiator, nothing bypasses
anything, so the right thing to do here is a full mb(). However, I
suspect what you are talking about here is read -responses- not
bypassing writes in the direction of the response (ie, the flushing
semantic of reads), which is a different matter.

Also don't forget that this is only a semantic of PCI, not of the
system fabric; ie, a device DMA read doesn't flush a CPU write that is
still in that CPU's store queue.

> This is the real argument in my eyes: that we should behave the way
> real hardware does.

But that doesn't really make much sense, since we don't actually have a
non-coherent bus sitting in the middle :-)

However, we should as much as possible be observed to behave as such, I
agree, though I don't think we need to bother too much about timings,
since we don't really have a way to enforce the immediate visibility of
stores within the coherent domain without a bunch of arch-specific,
very heavy hammers which we really don't want to wield at this point.

> > Really flushing out writes matters very much in real life, in two
> > very different contexts that tend not to affect emulation in qemu as
> > much. One is flushing writes in the opposite direction (or rather,
> > having the read response queued up behind those writes), which is
> > critical to ensuring proper completion of DMAs after an LSI, from a
> > guest driver perspective, on real HW typically. The other classic
> > case is flushing posted MMIO writes in order to ensure that a
> > subsequent delay is respected. Most of those don't actually matter
> > when doing emulation. Besides, a barrier won't provide you the
> > second guarantee; you need a nastier construct, at least on some
> > architectures like power.
>
> Exactly. This is what I was saying too.

Right, and I'm reasonably sure that none of the above is our problem.

As I said, at this point what I want to sort out is purely the
observable ordering of DMA transactions. The side effect of reads in
one direction on writes in the other direction is an orthogonal problem
which, as I wrote above, is probably not hurting us. However, we do
need to ensure that reads and writes are properly ordered vs. each
other (regardless of any flush semantic), or things could go very wrong
on out-of-order architectures (here too, my understanding of x86 is
limited).

> Right. Here's a compromise:
>
> - add smp_rmb() on any DMA read
> - add smp_wmb() on any DMA write
>
> This is almost zero cost on x86 at least. So we are not regressing
> existing setups.

And it's not correct. With that setup, DMA writes can pass DMA reads
(and vice versa), which doesn't correspond to the guarantees of the PCI
spec. The question, I suppose, is whether this is a problem in
practice...

> Are there any places where devices do read after write?

It's possible: things like the update of a descriptor followed by the
reading of the next one, etc... I don't have an example hot in mind
right now of a device that would be hurt, but I'm a bit nervous, as
this would be a violation of the PCI guaranteed ordering.

> My preferred way is to find them and do pci_dma_flush() invoking
> smp_mb(). If there is such a case it's likely on the datapath anyway,
> so we do care.
>
> But I can also live with a global flag latest_dma_read, and on read
> we could do:
>
>   if (unlikely(latest_dma_read))
>       smp_mb();
>
> if you really insist on it, though I do think it's inelegant.

Again, why do you object to simply making the default accessors fully
ordered? Do you think it will be a measurable difference in most cases?
Shouldn't we measure it first?

> You said above x86 is unaffected. This is portability, not safety.

x86 is unaffected by the missing wmb/rmb; it might not be unaffected by
the missing ordering between loads and stores. I don't know; as I said,
I don't fully know the x86 memory model.

In any case, opposing portability to safety the way you do means you
are assuming that basically qemu is written for x86 and nothing else
matters. If that's your point of view, so be it, and be clear about it,
but I will disagree :-) And while I can understand that powerpc might
not be considered the most important arch around at this point in time,
these problems are going to affect ARM as well.

> > Almost all our devices were written without any thought given to
> > ordering, so they basically can and should be considered as all
> > broken.
>
> Problem is, a lot of code is likely broken even after you sprinkle
> barriers around. For example, qemu might write A then B where the
> guest driver expects to see B written before A.

No, these are totally unrelated bugs, nothing to do with barriers.
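To make the positions concrete, here is a minimal sketch of the "fully
ordered by default, explicitly relaxed where audited" scheme argued for
above. All names (dma_read, guest_ram, etc.) are invented for the
sketch and are not QEMU's actual API; the flat array stands in for the
real memory API purely to keep the example self-contained.

    #include <stdint.h>
    #include <string.h>

    /* Full barrier; the GCC builtin is a stand-in for the
     * kernel-style smp_mb() named in the thread. */
    #define smp_mb() __sync_synchronize()

    static uint8_t guest_ram[0x10000];   /* toy guest memory */

    /* Default accessors: fully ordered, one full barrier per access,
     * so DMA reads and writes are never reordered against each
     * other. */
    static void dma_read(uint64_t addr, void *buf, size_t len)
    {
        smp_mb();
        memcpy(buf, &guest_ram[addr], len);
    }

    static void dma_write(uint64_t addr, const void *buf, size_t len)
    {
        smp_mb();
        memcpy(&guest_ram[addr], buf, len);
    }

    /* Explicitly relaxed variants for audited hot paths (virtio being
     * the canonical example), where the device emulation supplies its
     * own barriers. */
    static void dma_read_relaxed(uint64_t addr, void *buf, size_t len)
    {
        memcpy(buf, &guest_ram[addr], len);
    }

    static void dma_write_relaxed(uint64_t addr, const void *buf,
                                  size_t len)
    {
        memcpy(&guest_ram[addr], buf, len);
    }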
Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
On Tue, May 22, 2012 at 07:58:17AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
> > > mb is more than just "reads flush writes" (besides, it's not a
> > > statement about flushing, it's a statement about ordering. Whether
> > > it has a flushing side effect on x86 is a separate issue; it
> > > doesn't on power, for example).
> >
> > I referred to reads not bypassing writes on PCI.
>
> Again, from which originator? From a given initiator, nothing bypasses
> anything, so the right thing to do here is a full mb(). However, I
> suspect what you are talking about here is read -responses- not
> bypassing writes in the direction of the response (ie, the flushing
> semantic of reads), which is a different matter.

No. My spec says:

  A3, A4: A Posted Request must be able to pass Non-Posted Requests to
  avoid deadlocks.

> Also don't forget that this is only a semantic of PCI, not of the
> system fabric; ie, a device DMA read doesn't flush a CPU write that is
> still in that CPU's store queue.

We need to emulate what hardware does IMO. So if usb has different
rules, it needs different barriers.

[...]

> And it's not correct. With that setup, DMA writes can pass DMA reads
> (and vice versa), which doesn't correspond to the guarantees of the
> PCI spec.

Cite the spec please. The Express spec matches this, at least.

> The question, I suppose, is whether this is a problem in practice...
>
> > Are there any places where devices do read after write?
>
> It's possible: things like the update of a descriptor followed by the
> reading of the next one, etc... I don't have an example hot in mind
> right now of a device that would be hurt, but I'm a bit nervous, as
> this would be a violation of the PCI guaranteed ordering.
>
> > My preferred way is to find them and do pci_dma_flush() invoking
> > smp_mb(). If there is such a case it's likely on the datapath
> > anyway, so we do care.
> >
> > But I can also live with a global flag latest_dma_read, and on read
> > we could do:
> >
> >   if (unlikely(latest_dma_read))
> >       smp_mb();
> >
> > if you really insist on it, though I do think it's inelegant.
>
> Again, why do you object to simply making the default accessors fully
> ordered? Do you think it will be a measurable difference in most
> cases? Shouldn't we measure it first?

It's a lot of work. We measured the effect for virtio in the past. I
don't think we need to redo it.

> > You said above x86 is unaffected. This is portability, not safety.
>
> x86 is unaffected by the missing wmb/rmb; it might not be unaffected
> by the missing ordering between loads and stores. I don't know; as I
> said, I don't fully know the x86 memory model.

You don't need to understand it. Assume memory-barriers.h is correct.

> In any case, opposing portability to safety the way you do means you
> are assuming that basically qemu is written for x86 and nothing else
> matters.

No. But find a way to fix power without hurting working setups.
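Michael's cheaper alternative can be sketched the same way. The
one-liner in the thread leaves open exactly which transition takes the
conditional smp_mb(), so this sketch conservatively takes the full
barrier on every change of direction and the cheap barrier otherwise;
pci_dma_flush() is the hypothetical explicit flush he mentions, not an
existing API, and guest_ram is the same toy stand-in as before.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define smp_mb()    __sync_synchronize()
    /* The GCC builtin is a stand-in: real smp_rmb()/smp_wmb() are
     * weaker, and nearly free on x86, which is the point of the
     * compromise. */
    #define smp_rmb()   __sync_synchronize()
    #define smp_wmb()   __sync_synchronize()
    #define unlikely(x) __builtin_expect(!!(x), 0)

    static uint8_t guest_ram[0x10000];
    static bool latest_dma_read;  /* did the previous DMA access read? */

    static void dma_read(uint64_t addr, void *buf, size_t len)
    {
        if (unlikely(!latest_dma_read)) {
            smp_mb();              /* read after write: full barrier */
            latest_dma_read = true;
        } else {
            smp_rmb();             /* read after read: cheap barrier */
        }
        memcpy(buf, &guest_ram[addr], len);
    }

    static void dma_write(uint64_t addr, const void *buf, size_t len)
    {
        if (unlikely(latest_dma_read)) {
            smp_mb();              /* write after read: full barrier */
            latest_dma_read = false;
        } else {
            smp_wmb();             /* write after write: cheap barrier */
        }
        memcpy(&guest_ram[addr], buf, len);
    }

    /* The explicit flush for hand-annotated read-after-write spots. */
    static void pci_dma_flush(void)
    {
        smp_mb();
    }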
Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
> > Again, from which originator? From a given initiator, nothing
> > bypasses anything, so the right thing to do here is a full mb().
> > However, I suspect what you are talking about here is read
> > -responses- not bypassing writes in the direction of the response
> > (ie, the flushing semantic of reads), which is a different matter.
>
> No. My spec says:
>
>   A3, A4: A Posted Request must be able to pass Non-Posted Requests
>   to avoid deadlocks.

Right, a read + write can become write + read at the target (I forgot
about that), or you can deadlock due to the flush semantics; but a
write + read must remain in order, or am I missing something? And
write + read, afaik, is typically the one that x86 can re-order without
a barrier, isn't it?

> > Also don't forget that this is only a semantic of PCI, not of the
> > system fabric; ie, a device DMA read doesn't flush a CPU write that
> > is still in that CPU's store queue.
>
> We need to emulate what hardware does IMO. So if usb has different
> rules, it needs different barriers.

Who talks about USB here? Whatever rules USB has only matter between
the USB device and the USB controller emulation; USB doesn't sit
directly on the memory bus doing DMA, it all goes through the HCI,
which adheres to the ordering rules of whatever bus it sits on. Here,
sanity must prevail :-) I suggest ordering by default...

> > And it's not correct. With that setup, DMA writes can pass DMA reads
> > (and vice versa), which doesn't correspond to the guarantees of the
> > PCI spec.
>
> Cite the spec please. The Express spec matches this, at least.

Sure, see above. Yes, I did forget that a read + write could be
re-ordered on PCI, but that isn't the case for a write + read, or am I
reading the table sideways?

> It's a lot of work. We measured the effect for virtio in the past. I
> don't think we need to redo it.

virtio is specifically our high-performance case, and what I'm
proposing isn't affecting it.

> > > You said above x86 is unaffected. This is portability, not safety.
> >
> > x86 is unaffected by the missing wmb/rmb; it might not be unaffected
> > by the missing ordering between loads and stores. I don't know; as I
> > said, I don't fully know the x86 memory model.
>
> You don't need to understand it. Assume memory-barriers.h is correct.

In which case we still need a full mb(), unless we can convince
ourselves that the ordering between a write and a subsequent read can
be relaxed safely, and I'm really not sure about that.

> > In any case, opposing portability to safety the way you do means you
> > are assuming that basically qemu is written for x86 and nothing else
> > matters.
>
> No. But find a way to fix power without hurting working setups.

And ARM ;-) Arguably x86 is wrong too anyway, at least from a strict
interpretation of the spec (and unless I missed something).

> > If that's your point of view, so be it, and be clear about it, but I
> > will disagree :-) And while I can understand that powerpc might not
> > be considered the most important arch around at this point in time,
> > these problems are going to affect ARM as well.
> >
> > > > Almost all our devices were written without any thought given to
> > > > ordering, so they basically can and should be considered as all
> > > > broken.
> > >
> > > Problem is, a lot of code is likely broken even after you sprinkle
> > > barriers around. For example, qemu might write A then B where the
> > > guest driver expects to see B written before A.
> >
> > No, these are totally unrelated bugs, nothing to do with barriers.

You are mixing up two completely different problems, and using one as
an excuse to not fix the other one :-) A device with the above problem
would be broken today on x86 regardless.

> > > > Since thinking about ordering is something that, by experience,
> > > > very few programmers can do and get right, the default should
> > > > absolutely be fully ordered.
> > >
> > > Give it bus ordering. That is not fully ordered.
> >
> > It pretty much is, actually; look at your PCI spec :-)
>
> I looked. 2.4.1. Transaction Ordering Rules.

> > Performance regressions aren't a big deal to bisect in that case: if
> > there's a regression for a given driver and it points to this
> > specific patch adding the barriers, then we know precisely where the
> > regression comes from, and we can get some insight about how this
> > specific driver can be improved to use more relaxed accessors. I
> > don't see the problem.
> >
> > One thing that might be worth looking at is whether indeed mb() is
> > so much more costly than just wmb/rmb, in which case we could have
> > some smarts in the accessors to make them skip the full mb based on
> > knowledge of the previous access direction, though here too I would
> > be tempted to only do that if absolutely necessary (ie, if we can't
> > instead just fix the sensitive driver to use explicitly relaxed
> > accessors).
>
> We did this in virtio and yes, it is measurable.

You did it in [...]
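The part of the 2.4.1 table the two of them are reading can be
condensed into a predicate. This is a paraphrase of the base rules only
(no Relaxed Ordering attribute, I/O and configuration requests
ignored); check the spec table itself before relying on it.

    #include <stdbool.h>

    /* Simplified PCI/PCIe transaction types. */
    typedef enum {
        POSTED_WRITE,      /* memory write: posted, no completion */
        READ_REQUEST,      /* memory read: non-posted */
        READ_COMPLETION    /* data returning for an earlier read */
    } pci_txn_type;

    /* May a later transaction be observed ahead of an earlier one?
     * Base rules:
     *  - nothing may pass a posted write, so a read issued after a
     *    write stays behind it, and a read completion pushes writes
     *    ahead of it (the "flush" semantic);
     *  - a posted write must be able to pass a non-posted request
     *    (rows A3/A4: required, to avoid deadlocks);
     *  - the remaining combinations are allowed to pass. */
    static bool may_pass(pci_txn_type later, pci_txn_type earlier)
    {
        if (earlier == POSTED_WRITE) {
            return false;
        }
        return true;
    }

In those terms, may_pass(READ_REQUEST, POSTED_WRITE) == false is the
write + read pair Ben says must stay ordered (and the pair a bare
smp_rmb()/smp_wmb() scheme fails to protect), while
may_pass(POSTED_WRITE, READ_REQUEST) == true is the read + write pair
he concedes can flip.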
Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
> > Again, from which originator? From a given initiator, nothing
> > bypasses anything, so the right thing to do here is a full mb().
> > However, I suspect what you are talking about here is read
> > -responses- not bypassing writes in the direction of the response
> > (ie, the flushing semantic of reads), which is a different matter.
>
> No. My spec says:
>
>   A3, A4: A Posted Request must be able to pass Non-Posted Requests
>   to avoid deadlocks.

An additional note about that one: it obviously only applies if the
initiator doesn't wait for the read response before shooting off the
write. Most devices actually do wait. Here too, we would have to work
out explicitly what the right semantic is on an individual device
basis.

Ben.
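In qemu this "initiator waits" case is in fact the norm, because the
DMA accessors are synchronous: by the time the read call returns, the
emulated read response is already in hand, so a subsequent write cannot
overtake it inside a single device model. A hypothetical
descriptor-ring fragment to illustrate, reusing the sketched accessors
from earlier (the device and its field names are invented):

    #include <stddef.h>
    #include <stdint.h>

    /* Invented descriptor layout for an imaginary DMA device. */
    struct desc {
        uint64_t buf_addr;
        uint32_t len;
        uint32_t status;     /* written back once processing is done */
    };

    /* The accessors from the earlier sketches. */
    void dma_read(uint64_t addr, void *buf, size_t len);
    void dma_write(uint64_t addr, const void *buf, size_t len);

    static void process_one_descriptor(uint64_t ring_base, unsigned idx)
    {
        struct desc d;
        uint64_t addr = ring_base + (uint64_t)idx * sizeof(d);

        /* Non-posted read of the descriptor. The function only
         * returns once the data is in hand, i.e. the device has
         * "waited for the read response". */
        dma_read(addr, &d, sizeof(d));

        /* ... consume d.buf_addr / d.len here ... */

        /* Posted write of the status. It is issued strictly after the
         * read response above, so the A3/A4 "posted may pass
         * non-posted" rule never gets a chance to apply. */
        d.status = 1;
        dma_write(addr + offsetof(struct desc, status),
                  &d.status, sizeof(d.status));
    }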
Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
On Tue, 22 May 2012 07:58:17 +1000, Benjamin Herrenschmidt
<b...@kernel.crashing.org> wrote:
> On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
> > But I can also live with a global flag latest_dma_read, and on read
> > we could do:
> >
> >   if (unlikely(latest_dma_read))
> >       smp_mb();
> >
> > if you really insist on it, though I do think it's inelegant.
>
> Again, why do you object to simply making the default accessors fully
> ordered? Do you think it will be a measurable difference in most
> cases? Shouldn't we measure it first?

Yes. It seems clear to me that qemu's default DMA operations should be
strictly ordered. It's just far easier to get right. After that, we can
get tricky with conditional barriers, and we can get tricky with using
special unordered variants in critical drivers, but I really don't want
to be chasing subtle SMP ordering problems in production.

If you're working on ARM or PPC, "it works for x86" makes it *worse*,
not better, since you have fewer users to find bugs.

Cheers,
Rusty.
Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
On Tue, May 22, 2012 at 08:56:12AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2012-05-22 at 01:22 +0300, Michael S. Tsirkin wrote:
> > > Again, from which originator? From a given initiator, nothing
> > > bypasses anything, so the right thing to do here is a full mb().
> > > However, I suspect what you are talking about here is read
> > > -responses- not bypassing writes in the direction of the response
> > > (ie, the flushing semantic of reads), which is a different matter.
> >
> > No. My spec says:
> >
> >   A3, A4: A Posted Request must be able to pass Non-Posted Requests
> >   to avoid deadlocks.
>
> Right, a read + write can become write + read at the target (I forgot
> about that), or you can deadlock due to the flush semantics; but a
> write + read must remain in order, or am I missing something?

Exactly.

> And write + read, afaik, is typically the one that x86 can re-order
> without a barrier, isn't it?

AFAIK, without a barrier, x86 can reorder them however you initiate
them.

--
MST
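That last point is easy to demonstrate on real hardware with the
classic store-buffer litmus test, sketched below. On x86, each thread's
store can be reordered after its load (write + read becoming read +
write), so r1 == r2 == 0 shows up occasionally; uncommenting the full
barriers forbids it. The reorder window is tiny with thread creation
inside the loop, so it may take many runs to observe. Build with
-pthread.

    #include <pthread.h>
    #include <stdio.h>

    volatile int x, y, r1, r2;

    static void *thread_a(void *arg)
    {
        x = 1;
        /* __sync_synchronize(); */  /* full barrier forbids reorder */
        r1 = y;
        return NULL;
    }

    static void *thread_b(void *arg)
    {
        y = 1;
        /* __sync_synchronize(); */
        r2 = x;
        return NULL;
    }

    int main(void)
    {
        long seen = 0;

        for (long i = 0; i < 1000000; i++) {
            pthread_t a, b;

            x = y = r1 = r2 = 0;
            pthread_create(&a, NULL, thread_a, NULL);
            pthread_create(&b, NULL, thread_b, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);

            /* Both loads ran before either store became visible. */
            if (r1 == 0 && r2 == 0) {
                printf("store-load reorder #%ld at iteration %ld\n",
                       ++seen, i);
            }
        }
        return 0;
    }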