Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Mon, 19 Dec 2011 10:50:13 +0800, Amos Kong ak...@redhat.com wrote:
> The format is broken in webpage, attached the result file.
> It's also available here: http://amosk.info/download/rusty-fix-perf.txt

Thanks Amos.

Linus, please apply.  Fixes virtio-mmio without breaking virtio_pci
performance.

Thanks,
Rusty.

From: Rusty Russell ru...@rustcorp.com.au
Subject: virtio: harsher barriers for virtio-mmio.

We were cheating with our barriers: using the smp ones rather than the
real device ones.  That was fine, until virtio-mmio came along, which
could be talking to a real device (a non-SMP CPU).

Unfortunately, just putting back the real barriers (reverting d57ed95d)
causes a performance regression on virtio-pci.  In particular, Amos
reports that with netperf's TCP_RR over virtio_net, CPU utilization
increased by up to 35% while throughput went down by up to 14%.  By
comparison, this branch is in the noise.

Reference: https://lkml.org/lkml/2011/12/11/22

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/lguest/lguest_device.c |   10 ++++++----
 drivers/s390/kvm/kvm_virtio.c  |    2 +-
 drivers/virtio/virtio_mmio.c   |    7 ++++---
 drivers/virtio/virtio_pci.c    |    4 ++--
 drivers/virtio/virtio_ring.c   |   34 +++++++++++++++++++-------------
 include/linux/virtio_ring.h    |    1 +
 tools/virtio/linux/virtio.h    |    1 +
 tools/virtio/virtio_test.c     |    3 ++-
 8 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -291,11 +291,13 @@ static struct virtqueue *lg_find_vq(stru
 	}

 	/*
-	 * OK, tell virtio_ring.c to set up a virtqueue now we know its size
-	 * and we've got a pointer to its pages.
+	 * OK, tell virtio_ring.c to set up a virtqueue now we know its size
+	 * and we've got a pointer to its pages.  Note that we set weak_barriers
+	 * to 'true': the host is just a(nother) SMP CPU, so we only need
+	 * inter-cpu barriers.
 	 */
-	vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN,
-				 vdev, lvq->pages, lg_notify, callback, name);
+	vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
+				 true, lvq->pages, lg_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto unmap;
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(str
 		goto out;

 	vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
-				 vdev, (void *) config->address,
+				 vdev, true, (void *) config->address,
 				 kvm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -309,9 +309,10 @@ static struct virtqueue *vm_setup_vq(str
 	writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
 			vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);

-	/* Create the vring */
-	vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN,
-				 vdev, info->queue, vm_notify, callback, name);
+	/* Create the vring: no weak barriers, the other side could
+	 * be an independent device. */
+	vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+				 false, info->queue, vm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -414,8 +414,8 @@ static struct virtqueue *setup_vq(struct
 		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);

 	/* create the vring */
-	vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN,
-				 vdev, info->queue, vp_notify, callback, name);
+	vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
+				 true, info->queue, vp_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto out_activate_queue;
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -28,17 +28,20 @@
 #ifdef CONFIG_SMP
 /* Where possible, use SMP barriers which are more lightweight than mandatory
  * barriers, because mandatory barriers control MMIO effects on accesses
- * through relaxed memory I/O windows (which virtio does not use). */
-#define virtio_mb() smp_mb()
-#define
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On 12/12/11 13:12, Rusty Russell wrote:
> On Mon, 12 Dec 2011 11:06:53 +0800, Amos Kong ak...@redhat.com wrote:
>> On 12/12/11 06:27, Benjamin Herrenschmidt wrote:
>>> On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:
>>>> Forwarding some results by Amos, who ran multiple netperf streams in
>>>> parallel, from an external box to the guest.
>>>>
>>>> TCP_STREAM results were noisy.  This could be due to buffering done by
>>>> TCP, where packet size varies even as message size is constant.
>>>>
>>>> TCP_RR results were consistent.  In this benchmark, after switching to
>>>> mandatory barriers, CPU utilization increased by up to 35% while
>>>> throughput went down by up to 14%.  The normalized throughput/cpu
>>>> regressed consistently, between 7 and 35%.
>>>>
>>>> The fix applied was simply this:
>>> What machine/processor was this?
>> Pinned guest memory to numa node 1.
>
> Please try this patch.  How much does the branch cost us?
>
> (Compiles, untested).
>
> Thanks,
> Rusty.
>
> From: Rusty Russell ru...@rustcorp.com.au
> Subject: virtio: harsher barriers for virtio-mmio.
>
> We were cheating with our barriers: using the smp ones rather than the
> real device ones.  That was fine, until virtio-mmio came along, which
> could be talking to a real device (a non-SMP CPU).
>
> Unfortunately, just putting back the real barriers (reverting d57ed95d)
> causes a performance regression on virtio-pci.  In particular, Amos
> reports that with netperf's TCP_RR over virtio_net, CPU utilization
> increased by up to 35% while throughput went down by up to 14%.  By
> comparison, this branch costs us???
>
> Reference: https://lkml.org/lkml/2011/12/11/22
>
> Signed-off-by: Rusty Russell ru...@rustcorp.com.au
> ---
>  drivers/lguest/lguest_device.c |   10 ++++++----
>  drivers/s390/kvm/kvm_virtio.c  |    2 +-
>  drivers/virtio/virtio_mmio.c   |    7 ++++---
>  drivers/virtio/virtio_pci.c    |    4 ++--
>  drivers/virtio/virtio_ring.c   |   34 +++++++++++++++++++-------------
>  include/linux/virtio_ring.h    |    1 +
>  tools/virtio/linux/virtio.h    |    1 +
>  tools/virtio/virtio_test.c     |    3 ++-
>  8 files changed, 38 insertions(+), 24 deletions(-)

Hi all,

I tested with the same environment and scenarios.
Tested each scenario three times and computed the average for more precision.

Thanks, Amos

- compare results --- Mon Dec 19 09:51:09 2011
1 - avg-old.netperf.exhost_guest.txt
2 - avg-fixed.netperf.exhost_guest.txt
== TCP_STREAM
sessions| size|throughput| cpu| normalize| #tx-pkts| #rx-pkts| #tx-byts| #rx-byts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj|#tpkt/#exit| #rpkt/#irq
1 1| 64| 1073.54| 10.50| 102| 0| 31| 0| 1612| 0| 16| 487641| 489753| 504764| 0.00| 0.00
2 1| 64| 1079.44| 10.29| 104| 0| 30| 0| 1594| 0| 17| 487156| 488828| 504411| 0.00| 0.00
%  | 0.0| +0.5| -2.0| +2.0| 0| -3.2| 0| -1.1| 0| +6.2| -0.1| -0.2| -0.1| 0| 0
1 2| 64| 2141.12| 15.72| 136| 0| 33| 0| 1744| 0| 34| 873777| 972303| 928926| 0.00| 0.00
2 2| 64| 2140.88| 15.64| 137| 0| 33| 0| 1744| 0| 34| 926588| 942841| 974095| 0.00| 0.00
%  | 0.0| -0.0| -0.5| +0.7| 0| 0.0| 0| 0.0| 0| 0.0| +6.0| -3.0| +4.9| 0| 0
1 4| 64| 4076.80| 19.82| 205| 0| 30| 0| 1577| 0| 67| 1422282| 1166425| 1539219| 0.00| 0.00
2 4| 64| 4094.32| 20.70| 197| 0| 31| 0| 1612| 0| 68| 1704330| 1314077| 1833394| 0.00| 0.00
%  | 0.0| +0.4| +4.4| -3.9| 0| +3.3| 0| +2.2| 0| +1.5| +19.8| +12.7| +19.1| 0| 0
1 1| 256| 2867.48| 13.44| 213| 0| 32| 0| 1726| 0| 14| 666430| 694922| 690730| 0.00| 0.00
2 1| 256| 2874.20| 12.71| 226| 0| 32| 0| 1709| 0| 14| 697960| 740407| 721807| 0.00| 0.00
%  | 0.0| +0.2| -5.4| +6.1| 0| 0.0| 0| -1.0| 0| 0.0| +4.7| +6.5| +4.5| 0| 0
1 2| 256| 5642.82| 17.61| 320| 0| 30| 0| 1594| 0| 30| 1226861| 1236081| 1268562| 0.00| 0.00
2 2| 256| 5661.06| 17.41| 326| 0| 30| 0| 1594| 0| 29| 1175696| 1143490| 1221528| 0.00| 0.00
%  | 0.0| +0.3| -1.1|
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Mon, 2011-12-19 at 10:19 +0800, Amos Kong wrote:
> I tested with the same environment and scenarios.
> Tested each scenario three times and computed the average for more
> precision.
>
> Thanks, Amos
>
> - compare results --- Mon Dec 19 09:51:09 2011
> 1 - avg-old.netperf.exhost_guest.txt
> 2 - avg-fixed.netperf.exhost_guest.txt

The output is word wrapped and generally unreadable.  Any chance you can
provide us with a summary of the outcome?

Cheers,
Ben.

> == TCP_STREAM
> sessions| size|throughput| cpu| normalize| #tx-pkts| #rx-pkts| #tx-byts| #rx-byts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj|#tpkt/#exit| #rpkt/#irq
> 1 1| 64| 1073.54| 10.50| 102| 0| 31| 0| 1612| 0| 16| 487641| 489753| 504764| 0.00| 0.00
> 2 1| 64| 1079.44| 10.29| 104| 0| 30| 0| 1594| 0| 17| 487156| 488828| 504411| 0.00| 0.00
> %  | 0.0| +0.5| -2.0| +2.0| 0| -3.2| 0| -1.1| 0| +6.2| -0.1| -0.2| -0.1| 0| 0
> 1 2| 64| 2141.12| 15.72| 136| 0| 33| 0| 1744| 0| 34| 873777| 972303| 928926| 0.00| 0.00
> 2 2| 64| 2140.88| 15.64| 137| 0| 33| 0| 1744| 0| 34| 926588| 942841| 974095| 0.00| 0.00
> %  | 0.0| -0.0| -0.5| +0.7| 0| 0.0| 0| 0.0| 0| 0.0| +6.0| -3.0| +4.9| 0| 0
> 1 4| 64| 4076.80| 19.82| 205| 0| 30| 0| 1577| 0| 67| 1422282| 1166425| 1539219| 0.00| 0.00
> 2 4| 64| 4094.32| 20.70| 197| 0| 31| 0| 1612| 0| 68| 1704330| 1314077| 1833394| 0.00| 0.00
> %  | 0.0| +0.4| +4.4| -3.9| 0| +3.3| 0| +2.2| 0| +1.5| +19.8| +12.7| +19.1| 0| 0
> 1 1| 256| 2867.48| 13.44| 213| 0| 32| 0| 1726| 0| 14| 666430| 694922| 690730| 0.00| 0.00
> 2 1| 256| 2874.20| 12.71| 226| 0| 32| 0| 1709| 0| 14| 697960| 740407| 721807| 0.00| 0.00
> %  | 0.0| +0.2| -5.4| +6.1| 0| 0.0| 0| -1.0| 0| 0.0| +4.7| +6.5| +4.5| 0| 0
> 1 2| 256| 5642.82| 17.61| 320| 0| 30| 0| 1594| 0| 30| 1226861| 1236081| 1268562| 0.00| 0.00
> 2 2| 256| 5661.06| 17.41| 326| 0| 30| 0| 1594| 0| 29| 1175696| 1143490| 1221528| 0.00| 0.00
> %  | 0.0| +0.3| -1.1| +1.9| 0| 0.0| 0| 0.0| 0| -3.3| -4.2| -7.5| -3.7| 0| 0
> 1 4| 256| 9404.27| 23.55| 399| 0| 33| 0| 1744| 0| 37| 1692245| 659975| 1765103| 0.00| 0.00
> 2 4| 256| 8761.11| 23.18| 376| 0| 32| 0| 1726| 0| 36| 1699382| 418992| 1870804| 0.00| 0.00
> %  | 0.0| -6.8| -1.6| -5.8| 0| -3.0| 0| -1.0| 0| -2.7| +0.4| -36.5| +6.0| 0| 0
> 1 1| 512| 3803.66| 14.20| 267| 0| 30| 0| 1594| 0| 14| 693992| 750078| 721107| 0.00| 0.00
> 2 1| 512| 3838.02| 15.47| 248| 0| 31| 0| 1612| 0| 15| 811709| 773505| 838788| 0.00| 0.00
> %  | 0.0| +0.9| +8.9| -7.1| 0| +3.3| 0| +1.1| 0| +7.1| +17.0| +3.1| +16.3| 0| 0
> 1 2| 512| 8606.11| 19.34| 444| 0| 32| 0| 1709| 0| 29| 1264624| 647652| 1309740| 0.00| 0.00
> 2 2| 512| 8127.80| 18.93| 428| 0| 32| 0| 1726| 0| 28| 1216606| 1179269| 1266260| 0.00| 0.00
> %  | 0.0| -5.6| -2.1| -3.6| 0| 0.0| 0| +1.0| 0| -3.4| -3.8| +82.1| -3.3| 0| 0
> 1 4| 512| 9409.41| 22.88| 413| 0| 30| 0| 1577|
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On 19/12/11 10:19, Amos Kong wrote: On 12/12/11 13:12, Rusty Russell wrote: On Mon, 12 Dec 2011 11:06:53 +0800, Amos Kongak...@redhat.com wrote: On 12/12/11 06:27, Benjamin Herrenschmidt wrote: On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote: Forwarding some results by Amos, who run multiple netperf streams in parallel, from an external box to the guest. TCP_STREAM results were noisy. This could be due to buffering done by TCP, where packet size varies even as message size is constant. TCP_RR results were consistent. In this benchmark, after switching to mandatory barriers, CPU utilization increased by up to 35% while throughput went down by up to 14%. the normalized throughput/cpu regressed consistently, between 7 and 35% The fix applied was simply this: What machine processor was this ? pined guest memory to numa node 1 Please try this patch. How much does the branch cost us? (Compiles, untested). Thanks, Rusty. From: Rusty Russellru...@rustcorp.com.au Subject: virtio: harsher barriers for virtio-mmio. We were cheating with our barriers; using the smp ones rather than the real device ones. That was fine, until virtio-mmio came along, which could be talking to a real device (a non-SMP CPU). Unfortunately, just putting back the real barriers (reverting d57ed95d) causes a performance regression on virtio-pci. In particular, Amos reports netbench's TCP_RR over virtio_net CPU utilization increased up to 35% while throughput went down by up to 14%. By comparison, this branch costs us??? 
Reference: https://lkml.org/lkml/2011/12/11/22 Signed-off-by: Rusty Russellru...@rustcorp.com.au --- drivers/lguest/lguest_device.c | 10 ++ drivers/s390/kvm/kvm_virtio.c | 2 +- drivers/virtio/virtio_mmio.c | 7 --- drivers/virtio/virtio_pci.c | 4 ++-- drivers/virtio/virtio_ring.c | 34 +- include/linux/virtio_ring.h | 1 + tools/virtio/linux/virtio.h | 1 + tools/virtio/virtio_test.c | 3 ++- 8 files changed, 38 insertions(+), 24 deletions(-) Hi all, I tested with the same environment and scenarios. tested one scenarios for three times and compute the average for more precision. Thanks, Amos - compare results --- Mon Dec 19 09:51:09 2011 1 - avg-old.netperf.exhost_guest.txt 2 - avg-fixed.netperf.exhost_guest.txt == TCP_STREAM sessions| size|throughput| cpu| normalize| #tx-pkts| #rx-pkts| #tx-byts| #rx-byts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj|#tpkt/#exit| #rpkt/#irq 1 1| 64| 1073.54| 10.50| 102| 0| 31| 0| 1612| 0| 16| 487641| 489753| 504764| 0.00| 0.00 2 1| 64| 1079.44| 10.29| 104| 0| 30| 0| 1594| 0| 17| 487156| 488828| 504411| 0.00| 0.00 % | 0.0| +0.5| -2.0| +2.0| 0| -3.2| 0| -1.1| 0| +6.2| -0.1| -0.2| -0.1| The format is broken in webpage, attached the result file. 
It's also available here: http://amosk.info/download/rusty-fix-perf.txt

- compare results --- Mon Dec 19 09:51:09 2011
1 - avg-old.netperf.exhost_guest.txt
2 - avg-fixed.netperf.exhost_guest.txt
== TCP_STREAM
sessions| size|throughput| cpu| normalize| #tx-pkts| #rx-pkts| #tx-byts| #rx-byts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj|#tpkt/#exit| #rpkt/#irq
1 1| 64| 1073.54| 10.50| 102| 0| 31| 0| 1612| 0| 16| 487641| 489753| 504764| 0.00| 0.00
2 1| 64| 1079.44| 10.29| 104| 0| 30| 0| 1594| 0| 17| 487156| 488828| 504411| 0.00| 0.00
%  | 0.0| +0.5| -2.0| +2.0| 0| -3.2| 0| -1.1| 0| +6.2| -0.1| -0.2| -0.1| 0| 0
1 2| 64| 2141.12| 15.72| 136| 0| 33| 0| 1744| 0| 34| 873777| 972303| 928926| 0.00| 0.00
2 2| 64| 2140.88| 15.64| 137| 0| 33| 0| 1744| 0| 34| 926588| 942841| 974095| 0.00| 0.00
%  | 0.0| -0.0| -0.5| +0.7| 0| 0.0| 0| 0.0| 0| 0.0| +6.0| -3.0| +4.9| 0| 0
1 4| 64| 4076.80| 19.82| 205| 0| 30| 0| 1577| 0| 67| 1422282| 1166425| 1539219| 0.00| 0.00
2 4| 64| 4094.32| 20.70| 197| 0| 31| 0| 1612| 0| 68| 1704330| 1314077| 1833394| 0.00| 0.00
%  | 0.0| +0.4| +4.4| -3.9| 0| +3.3| 0| +2.2| 0| +1.5| +19.8| +12.7| +19.1| 0| 0
1 1| 256| 2867.48| 13.44| 213| 0| 32| 0| 1726| 0| 14| 666430| 694922| 690730| 0.00| 0.00
2 1| 256|
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Tue, 13 Dec 2011 07:56:36 +0800, Amos Kong ak...@redhat.com wrote: On 12/12/2011 01:12 PM, Rusty Russell wrote: On Mon, 12 Dec 2011 11:06:53 +0800, Amos Kong ak...@redhat.com wrote: On 12/12/11 06:27, Benjamin Herrenschmidt wrote: On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote: Forwarding some results by Amos, who run multiple netperf streams in parallel, from an external box to the guest. TCP_STREAM results were noisy. This could be due to buffering done by TCP, where packet size varies even as message size is constant. TCP_RR results were consistent. In this benchmark, after switching to mandatory barriers, CPU utilization increased by up to 35% while throughput went down by up to 14%. the normalized throughput/cpu regressed consistently, between 7 and 35% The fix applied was simply this: What machine processor was this ? pined guest memory to numa node 1 Please try this patch. How much does the branch cost us? Ok, I will provide the result later. Any news? We're cutting it very fine. I've CC'd Linus so he knows this is coming... From: Rusty Russell ru...@rustcorp.com.au Subject: virtio: harsher barriers for virtio-mmio. We were cheating with our barriers; using the smp ones rather than the real device ones. That was fine, until virtio-mmio came along, which could be talking to a real device (a non-SMP CPU). Unfortunately, just putting back the real barriers (reverting d57ed95d) causes a performance regression on virtio-pci. In particular, Amos reports netbench's TCP_RR over virtio_net CPU utilization increased up to 35% while throughput went down by up to 14%. By comparison, this branch costs us??? 
Reference: https://lkml.org/lkml/2011/12/11/22

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/lguest/lguest_device.c |   10 ++++++----
 drivers/s390/kvm/kvm_virtio.c  |    2 +-
 drivers/virtio/virtio_mmio.c   |    7 ++++---
 drivers/virtio/virtio_pci.c    |    4 ++--
 drivers/virtio/virtio_ring.c   |   34 +++++++++++++++++++-------------
 include/linux/virtio_ring.h    |    1 +
 tools/virtio/linux/virtio.h    |    1 +
 tools/virtio/virtio_test.c     |    3 ++-
 8 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -291,11 +291,13 @@ static struct virtqueue *lg_find_vq(stru
 	}

 	/*
-	 * OK, tell virtio_ring.c to set up a virtqueue now we know its size
-	 * and we've got a pointer to its pages.
+	 * OK, tell virtio_ring.c to set up a virtqueue now we know its size
+	 * and we've got a pointer to its pages.  Note that we set weak_barriers
+	 * to 'true': the host is just a(nother) SMP CPU, so we only need
+	 * inter-cpu barriers.
 	 */
-	vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN,
-				 vdev, lvq->pages, lg_notify, callback, name);
+	vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
+				 true, lvq->pages, lg_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto unmap;
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(str
 		goto out;

 	vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
-				 vdev, (void *) config->address,
+				 vdev, true, (void *) config->address,
 				 kvm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -309,9 +309,10 @@ static struct virtqueue *vm_setup_vq(str
 	writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
 			vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);

-	/* Create the vring */
-	vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN,
-				 vdev, info->queue, vm_notify, callback, name);
+	/* Create the vring: no weak barriers, the other side could
+	 * be an independent device. */
+	vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+				 false, info->queue, vm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -414,8 +414,8 @@ static struct virtqueue *setup_vq(struct
 		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);

 	/* create the vring */
-	vq =
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On 19/12/11 10:41, Benjamin Herrenschmidt wrote:
> On Mon, 2011-12-19 at 10:19 +0800, Amos Kong wrote:
>> I tested with the same environment and scenarios.
>> Tested each scenario three times and computed the average for more
>> precision.
> The output is word wrapped and generally unreadable.  Any chance you
> can provide us with a summary of the outcome?
>
> Cheers,
> Ben.

Hi Ben,

The change in TCP_RR throughput is very small.
external host -> guest: some of the TCP_STREAM and TCP_MAERTS throughput dropped a little.
local host -> guest: some of the TCP_STREAM and TCP_MAERTS throughput increased a little.

About the compare-result format:
---
1 - avg-old.netperf.exhost_guest.txt     <- average result (tested 3 times) file of test 1
2 - avg-fixed.netperf.exhost_guest.txt   <- average result file of test 2

== TCP_STREAM                            <- protocol
sessions| size|throughput| cpu| normalize| #tx-pkts| #rx-pkts| #tx-byts| #rx-byts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj|#tpkt/#exit| #rpkt/#irq
1 1| 64| 1073.54| 10.50| 102|
^^^ average result of old kernel; netserver started in the guest, netperf client(s) started in the external host
2 1| 64| 1079.44| 10.29| 104|
^^^ average result of fixed kernel
%  | 0.0| +0.5| -2.0| +2.0|
^^^ change rate between test 1 and test 2

sessions: number of netperf clients
size: request/response sizes
#rx-pkts: number of received packets
#rx-byts: number of received bytes
#rx-intr: number of receive interrupts
#io_exit: number of io exits
#irq_inj: number of injected irqs

Thanks, Amos.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On 12/12/2011 01:12 PM, Rusty Russell wrote: On Mon, 12 Dec 2011 11:06:53 +0800, Amos Kong ak...@redhat.com wrote: On 12/12/11 06:27, Benjamin Herrenschmidt wrote: On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote: Forwarding some results by Amos, who run multiple netperf streams in parallel, from an external box to the guest. TCP_STREAM results were noisy. This could be due to buffering done by TCP, where packet size varies even as message size is constant. TCP_RR results were consistent. In this benchmark, after switching to mandatory barriers, CPU utilization increased by up to 35% while throughput went down by up to 14%. the normalized throughput/cpu regressed consistently, between 7 and 35% The fix applied was simply this: What machine processor was this ? pined guest memory to numa node 1 Please try this patch. How much does the branch cost us? Ok, I will provide the result later. Thanks, Amos (Compiles, untested). Thanks, Rusty. From: Rusty Russell ru...@rustcorp.com.au Subject: virtio: harsher barriers for virtio-mmio. We were cheating with our barriers; using the smp ones rather than the real device ones. That was fine, until virtio-mmio came along, which could be talking to a real device (a non-SMP CPU). Unfortunately, just putting back the real barriers (reverting d57ed95d) causes a performance regression on virtio-pci. In particular, Amos reports netbench's TCP_RR over virtio_net CPU utilization increased up to 35% while throughput went down by up to 14%. By comparison, this branch costs us??? 
Reference: https://lkml.org/lkml/2011/12/11/22

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/lguest/lguest_device.c |   10 ++++++----
 drivers/s390/kvm/kvm_virtio.c  |    2 +-
 drivers/virtio/virtio_mmio.c   |    7 ++++---
 drivers/virtio/virtio_pci.c    |    4 ++--
 drivers/virtio/virtio_ring.c   |   34 +++++++++++++++++++-------------
 include/linux/virtio_ring.h    |    1 +
 tools/virtio/linux/virtio.h    |    1 +
 tools/virtio/virtio_test.c     |    3 ++-
 8 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -291,11 +291,13 @@ static struct virtqueue *lg_find_vq(stru
 	}

 	/*
-	 * OK, tell virtio_ring.c to set up a virtqueue now we know its size
-	 * and we've got a pointer to its pages.
+	 * OK, tell virtio_ring.c to set up a virtqueue now we know its size
+	 * and we've got a pointer to its pages.  Note that we set weak_barriers
+	 * to 'true': the host is just a(nother) SMP CPU, so we only need
+	 * inter-cpu barriers.
 	 */
-	vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN,
-				 vdev, lvq->pages, lg_notify, callback, name);
+	vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
+				 true, lvq->pages, lg_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto unmap;
diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(str
 		goto out;

 	vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
-				 vdev, (void *) config->address,
+				 vdev, true, (void *) config->address,
 				 kvm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -309,9 +309,10 @@ static struct virtqueue *vm_setup_vq(str
 	writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
 			vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);

-	/* Create the vring */
-	vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN,
-				 vdev, info->queue, vm_notify, callback, name);
+	/* Create the vring: no weak barriers, the other side could
+	 * be an independent device. */
+	vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+				 false, info->queue, vm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto error_new_virtqueue;
diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -414,8 +414,8 @@ static struct virtqueue *setup_vq(struct
 		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);

 	/* create the vring */
-	vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN,
-				 vdev, info->queue, vp_notify, callback, name);
+
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Sat, Dec 03, 2011 at 03:44:36PM +1030, Rusty Russell wrote:
> On Sat, 03 Dec 2011 10:09:44 +1100, Benjamin Herrenschmidt b...@kernel.crashing.org wrote:
>> On Tue, 2011-11-29 at 14:31 +0200, Ohad Ben-Cohen wrote:
>>> A trivial, albeit sub-optimal, solution would be to simply revert
>>> commit d57ed95 "virtio: use smp_XX barriers on SMP".  Obviously,
>>> though, that's going to have a negative impact on performance of
>>> SMP-based virtualization use cases.
>> Have you measured the impact of using normal barriers (non-SMP ones)
>> like we use on normal HW drivers unconditionally?  IE. if the
>> difference is small enough I'd say just go for it and avoid the bloat.
>
> Yep.  Plan is:
> 1) Measure the difference.
> 2) Difference unmeasurable?  Use normal barriers (ie. revert d57ed95).
> 3) Difference small?  Revert d57ed95 for 3.2, revisit for 3.3.
> 4) Difference large?  Runtime switch based on if you're PCI for 3.2,
>    revisit for 3.3.
>
> Cheers,
> Rusty.

Forwarding some results by Amos, who ran multiple netperf streams in
parallel, from an external box to the guest.

TCP_STREAM results were noisy.  This could be due to buffering done by
TCP, where packet size varies even as message size is constant.

TCP_RR results were consistent.  In this benchmark, after switching to
mandatory barriers, CPU utilization increased by up to 35% while
throughput went down by up to 14%.  The normalized throughput/cpu
regressed consistently, between 7 and 35%.

The fix applied was simply this:

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 3198f2e..fdccb77 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -23,7 +23,7 @@
 /* virtio guest is communicating with a virtual device that actually runs on
  * a host processor. Memory barriers are used to control SMP effects. */
-#ifdef CONFIG_SMP
+#if 0
 /* Where possible, use SMP barriers which are more lightweight than mandatory
  * barriers, because mandatory barriers control MMIO effects on accesses
  * through relaxed memory I/O windows (which virtio does not use). */

--
MST

Fri Dec 9 23:57:33 2011
1 - old-exhost_guest.txt
2 - fixed-exhost_guest.txt
== TCP_STREAM
sessions| size|throughput| cpu| normalize| #tx-pkts| #rx-pkts| #re-trans| #tx-intr| #rx-intr| #io_exit| #irq_inj|#tpkt/#exit| #rpkt/#irq
1 1| 64| 949.64| 10.64| 89| 1170134| 1368739| 0| 17| 487392| 488820| 504716| 2.39| 2.71
2 1| 64| 946.03| 10.87| 87| 1119582| 1325851| 0| 17| 493763| 485865| 516161| 2.30| 2.57
%  | 0.0| -0.4| +2.2| -2.2| -4.3| -3.1| 0| 0.0| +1.3| -0.6| +2.3| -3.8| -5.2
1 2| 64| 1877.15| 15.45| 121| 2151267| 2561929| 0| 33| 923916| 971093| 969360| 2.22| 2.64
2 2| 64| 1867.63| 15.06| 124| 2212457| 2607606| 0| 33| 836160| 927721| 883964| 2.38| 2.95
%  | 0.0| -0.5| -2.5| +2.5| +2.8| +1.8| 0| 0.0| -9.5| -4.5| -8.8| +7.2| +11.7
1 4| 64| 3577.38| 19.62| 182| 4176151| 5036661| 0| 64| 1677417| 1412979| 1859101| 2.96| 2.71
2 4| 64| 3583.17| 20.05| 178| 4215327| 5063534| 0| 65| 1682582| 1549394| 1759033| 2.72| 2.88
%  | 0.0| +0.2| +2.2| -2.2| +0.9| +0.5| 0| +1.6| +0.3| +9.7| -5.4| -8.1| +6.3
1 1| 256| 2654.52| 11.41| 232| 925787| 1029214| 0| 14| 597763| 670927| 619414| 1.38| 1.66
2 1| 256| 2632.22| 20.32| 129| 977446| 1036094| 0| 15| 742699| 715460| 764512| 1.37| 1.36
%  | 0.0| -0.8| +78.1| -44.4| +5.6| +0.7| 0| +7.1| +24.2| +6.6| +23.4| -0.7| -18.1
1 2| 256| 5228.76| 16.94| 308| 1949442| 2082492| 0| 30| 1230329| 1323945| 1274262| 1.47| 1.63
2 2| 256| 5140.98| 19.58| 262| 1991090| 2093206| 0| 30| 1400232| 1271363| 1441564| 1.57| 1.45
%  | 0.0| -1.7| +15.6| -14.9| +2.1| +0.5| 0| 0.0| +13.8| -4.0| +13.1| +6.8| -11.0
1 4| 256| 9412.61| 24.04| 391| 2292404| 2351356| 0| 35| 1669864| 555786| 1741742| 4.12| 1.35
2 4| 256| 9408.92| 22.80| 412| 2303267|
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:
> Forwarding some results by Amos, who ran multiple netperf streams in
> parallel, from an external box to the guest.
>
> TCP_STREAM results were noisy.  This could be due to buffering done by
> TCP, where packet size varies even as message size is constant.
>
> TCP_RR results were consistent.  In this benchmark, after switching to
> mandatory barriers, CPU utilization increased by up to 35% while
> throughput went down by up to 14%.  The normalized throughput/cpu
> regressed consistently, between 7 and 35%.
>
> The fix applied was simply this:

What machine/processor was this?

Cheers,
Ben.

> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 3198f2e..fdccb77 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -23,7 +23,7 @@
>  /* virtio guest is communicating with a virtual device that actually runs on
>   * a host processor. Memory barriers are used to control SMP effects. */
> -#ifdef CONFIG_SMP
> +#if 0
>  /* Where possible, use SMP barriers which are more lightweight than mandatory
>   * barriers, because mandatory barriers control MMIO effects on accesses
>   * through relaxed memory I/O windows (which virtio does not use). */
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On 12/12/11 06:27, Benjamin Herrenschmidt wrote: On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote: Forwarding some results by Amos, who run multiple netperf streams in parallel, from an external box to the guest. TCP_STREAM results were noisy. This could be due to buffering done by TCP, where packet size varies even as message size is constant. TCP_RR results were consistent. In this benchmark, after switching to mandatory barriers, CPU utilization increased by up to 35% while throughput went down by up to 14%. the normalized throughput/cpu regressed consistently, between 7 and 35% The fix applied was simply this: What machine processor was this ? pined guest memory to numa node 1 # numactl -m 1 qemu-kvm ... pined guest vcpu threads and vhost thread to single cpu of numa node 1 # taskset -p 0x10 8348 (vhost_net_thread) # taskset -p 0x20 8353 (vcpu 1 thread) # taskset -p 0x40 8357 (vcpu 2 thread) pined cpu/memory of netperf client process to node 1 # numactl --cpunodebind=1 --membind=1 netperf ... 
8 cores --- processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz stepping: 2 microcode : 0xc cpu MHz : 1596.000 cache size : 12288 KB physical id : 1 siblings: 4 core id : 10 cpu cores : 4 apicid : 52 initial apicid : 52 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes lahf_lm ida arat epb dts tpr_shadow vnmi flexpriority ept vpid bogomips: 4787.76 clflush size: 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: # cat /proc/meminfo MemTotal: 16446616 kB MemFree:15874092 kB Buffers: 30404 kB Cached: 238640 kB SwapCached:0 kB Active: 100204 kB Inactive: 184312 kB Active(anon): 15724 kB Inactive(anon):4 kB Active(file): 84480 kB Inactive(file): 184308 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 8388604 kB SwapFree:8388604 kB Dirty:56 kB Writeback: 0 kB AnonPages: 15548 kB Mapped:11540 kB Shmem: 256 kB Slab: 82444 kB SReclaimable: 19220 kB SUnreclaim:63224 kB KernelStack:1224 kB PageTables: 2256 kB NFS_Unstable: 0 kB Bounce:0 kB WritebackTmp: 0 kB CommitLimit:16611912 kB Committed_AS: 209068 kB VmallocTotal: 34359738367 kB VmallocUsed: 224244 kB VmallocChunk: 34351073668 kB HardwareCorrupted: 0 kB AnonHugePages: 0 kB HugePages_Total: 0 HugePages_Free:0 HugePages_Rsvd:0 HugePages_Surp:0 Hugepagesize: 2048 kB DirectMap4k:9876 kB DirectMap2M: 2070528 kB DirectMap1G:14680064 kB # numactl --hardware available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 node 0 size: 8175 MB node 0 free: 7706 MB node 1 cpus: 4 5 6 7 node 1 size: 8192 MB node 1 free: 7796 MB node distances: node 0 1 0: 10 20 1: 20 10 # numactl --show policy: default 
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0 1
nodebind: 0 1
membind: 0 1

Cheers, Ben.

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 3198f2e..fdccb77 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -23,7 +23,7 @@
 /* virtio guest is communicating with a virtual device that actually runs on
  * a host processor. Memory barriers are used to control SMP effects. */
-#ifdef CONFIG_SMP
+#if 0
 /* Where possible, use SMP barriers which are more lightweight than mandatory
  * barriers, because mandatory barriers control MMIO effects on accesses
  * through relaxed memory I/O windows (which virtio does not use). */
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Mon, 12 Dec 2011 11:06:53 +0800, Amos Kong ak...@redhat.com wrote:
On 12/12/11 06:27, Benjamin Herrenschmidt wrote:
On Sun, 2011-12-11 at 14:25 +0200, Michael S. Tsirkin wrote:
Forwarding some results by Amos, who ran multiple netperf streams in parallel, from an external box to the guest. TCP_STREAM results were noisy. This could be due to buffering done by TCP, where packet size varies even as message size is constant. TCP_RR results were consistent. In this benchmark, after switching to mandatory barriers, CPU utilization increased by up to 35% while throughput went down by up to 14%; the normalized throughput/cpu regressed consistently, between 7 and 35%. The fix applied was simply this:
What machine/processor was this?
Pinned guest memory to numa node 1.

Please try this patch. How much does the branch cost us? (Compiles, untested). Thanks, Rusty.

From: Rusty Russell ru...@rustcorp.com.au
Subject: virtio: harsher barriers for virtio-mmio.

We were cheating with our barriers; using the smp ones rather than the real device ones. That was fine, until virtio-mmio came along, which could be talking to a real device (a non-SMP CPU). Unfortunately, just putting back the real barriers (reverting d57ed95d) causes a performance regression on virtio-pci. In particular, Amos reports netbench's TCP_RR over virtio_net: CPU utilization increased up to 35% while throughput went down by up to 14%. By comparison, this branch costs us???
Reference: https://lkml.org/lkml/2011/12/11/22
Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/lguest/lguest_device.c | 10 ++
 drivers/s390/kvm/kvm_virtio.c  |  2 +-
 drivers/virtio/virtio_mmio.c   |  7 ---
 drivers/virtio/virtio_pci.c    |  4 ++--
 drivers/virtio/virtio_ring.c   | 34 +-
 include/linux/virtio_ring.h    |  1 +
 tools/virtio/linux/virtio.h    |  1 +
 tools/virtio/virtio_test.c     |  3 ++-
 8 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/drivers/lguest/lguest_device.c b/drivers/lguest/lguest_device.c
--- a/drivers/lguest/lguest_device.c
+++ b/drivers/lguest/lguest_device.c
@@ -291,11 +291,13 @@ static struct virtqueue *lg_find_vq(stru
 	}
 
 	/*
-	 * OK, tell virtio_ring.c to set up a virtqueue now we know its size
-	 * and we've got a pointer to its pages.
+	 * OK, tell virtio_ring.c to set up a virtqueue now we know its size
+	 * and we've got a pointer to its pages. Note that we set weak_barriers
+	 * to 'true': the host is just a(nother) SMP CPU, so we only need
+	 * inter-cpu barriers.
 	 */
-	vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN,
-				 vdev, lvq->pages, lg_notify, callback, name);
+	vq = vring_new_virtqueue(lvq->config.num, LGUEST_VRING_ALIGN, vdev,
+				 true, lvq->pages, lg_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto unmap;

diff --git a/drivers/s390/kvm/kvm_virtio.c b/drivers/s390/kvm/kvm_virtio.c
--- a/drivers/s390/kvm/kvm_virtio.c
+++ b/drivers/s390/kvm/kvm_virtio.c
@@ -198,7 +198,7 @@ static struct virtqueue *kvm_find_vq(str
 		goto out;
 
 	vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
-				 vdev, (void *) config->address,
+				 vdev, true, (void *) config->address,
 				 kvm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;

diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c
--- a/drivers/virtio/virtio_mmio.c
+++ b/drivers/virtio/virtio_mmio.c
@@ -309,9 +309,10 @@ static struct virtqueue *vm_setup_vq(str
 	writel(virt_to_phys(info->queue) >> PAGE_SHIFT,
 			vm_dev->base + VIRTIO_MMIO_QUEUE_PFN);
 
-	/* Create the vring */
-	vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN,
-				 vdev, info->queue, vm_notify, callback, name);
+	/* Create the vring: no weak barriers, the other side could
+	 * be an independent device. */
+	vq = vring_new_virtqueue(info->num, VIRTIO_MMIO_VRING_ALIGN, vdev,
+				 false, info->queue, vm_notify, callback, name);
 	if (!vq) {
 		err = -ENOMEM;
 		goto error_new_virtqueue;

diff --git a/drivers/virtio/virtio_pci.c b/drivers/virtio/virtio_pci.c
--- a/drivers/virtio/virtio_pci.c
+++ b/drivers/virtio/virtio_pci.c
@@ -414,8 +414,8 @@ static struct virtqueue *setup_vq(struct
 		  vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN);
 
 	/* create the vring */
-	vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN,
-				 vdev, info->queue, vp_notify, callback, name);
+	vq = vring_new_virtqueue(info->num, VIRTIO_PCI_VRING_ALIGN, vdev,
+
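The API change in the patch above can be sketched in isolation. This is a hypothetical userspace model (the names `vq_sketch` and `new_vq` are invented, not kernel code), showing only the rule each transport now encodes: weak barriers when the peer is just another CPU, mandatory barriers when the peer may be a real device.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in for the virtqueue: records only the new flag. */
struct vq_sketch {
	bool weak_barriers;
};

/* Model of the per-transport rule in the patch:
 *   lguest, s390 kvm_virtio, virtio_pci -> weak_barriers = true
 *   virtio_mmio                         -> weak_barriers = false */
static struct vq_sketch new_vq(bool peer_is_just_another_cpu)
{
	struct vq_sketch vq = { .weak_barriers = peer_is_just_another_cpu };
	return vq;
}
```

The point of threading the flag through `vring_new_virtqueue()` is that the decision is made once, at setup time, by the transport that knows what it is talking to.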
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Sat, 03 Dec 2011 10:09:44 +1100, Benjamin Herrenschmidt b...@kernel.crashing.org wrote:
On Tue, 2011-11-29 at 14:31 +0200, Ohad Ben-Cohen wrote:
A trivial, albeit sub-optimal, solution would be to simply revert commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously, though, that's going to have a negative impact on performance of SMP-based virtualization use cases.
Have you measured the impact of using normal barriers (non-SMP ones), like we use on normal HW drivers, unconditionally? IE. if the difference is small enough I'd say just go for it and avoid the bloat.

Yep. Plan is:
1) Measure the difference.
2) Difference unmeasurable? Use normal barriers (ie. revert d57ed95).
3) Difference small? Revert d57ed95 for 3.2, revisit for 3.3.
4) Difference large? Runtime switch based on whether you're PCI for 3.2, revisit for 3.3.

Cheers, Rusty.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Tue, 2011-11-29 at 14:31 +0200, Ohad Ben-Cohen wrote: Virtio is using memory barriers to control the ordering of references to the vrings on SMP systems. When the guest is compiled with SMP support, virtio is only using SMP barriers in order to avoid incurring the overhead involved with mandatory barriers. Lately, though, virtio is being increasingly used with inter-processor communication scenarios too, which involve running two (separate) instances of operating systems on two (separate) processors, each of which might either be UP or SMP. To control the ordering of memory references when the vrings are shared between two external processors, we must always use mandatory barriers. A trivial, albeit sub-optimal, solution would be to simply revert commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously, though, that's going to have a negative impact on performance of SMP-based virtualization use cases. Have you measured the impact of using normal barriers (non-SMP ones) like we use on normal HW drivers unconditionally ? IE. If the difference is small enough I'd say just go for it and avoid the bloat. Ben.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Sat, Dec 3, 2011 at 1:09 AM, Benjamin Herrenschmidt b...@kernel.crashing.org wrote: Have you measured the impact of using normal barriers (non-SMP ones) like we use on normal HW drivers unconditionally ? IE. If the difference is small enough I'd say just go for it and avoid the bloat. I agree. MST wanted to give this a try this week. If all goes well and there's no unreasonable regression, we'd just switch to mandatory barriers. Ohad.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Thu, Dec 01, 2011 at 12:58:59PM +1030, Rusty Russell wrote:
On Thu, 1 Dec 2011 01:13:07 +0200, Michael S. Tsirkin m...@redhat.com wrote:
For x86, stores into memory are ordered. So I think that yes, smp_XXX can be selected at compile time. So let's forget the virtio strangeness for a minute,

Hmm, we got away with light barriers because we knew we were not *really* talking to a device. But now with virtio-mmio, turns out we are :)

You think virtio-mmio has this issue too? It's reported on remoteproc...

I'm really tempted to revert d57ed95 for 3.2, and we can revisit this optimization later if it proves worthwhile. Thoughts? Rusty.

Generally it does seem the best we can do for 3.2. Given it's rc3, I'd be a bit wary of introducing regressions - I'll try to find some real setups (as in - not my laptop) to run some benchmarks on, to verify there's no major problem. I hope I can report on this in about a week from now - want to hold onto this meanwhile?

Further, if we do revert, we need to remember to apply the following beforehand, to avoid breaking the virtio tool:

tools/virtio: implement mandatory barriers for x86

Signed-off-by: Michael S. Tsirkin m...@redhat.com

diff --git a/tools/virtio/linux/virtio.h b/tools/virtio/linux/virtio.h
index 68b8b8d..1bf0e80 100644
--- a/tools/virtio/linux/virtio.h
+++ b/tools/virtio/linux/virtio.h
@@ -172,11 +172,18 @@ struct virtqueue {
 #define MODULE_LICENSE(__MODULE_LICENSE_value) \
 	const char *__MODULE_LICENSE_name = __MODULE_LICENSE_value
 
 #define CONFIG_SMP
 
 #if defined(__i386__) || defined(__x86_64__)
 #define barrier() asm volatile("" ::: "memory")
 #define mb() __sync_synchronize()
+#if defined(__i386__)
+#define wmb() mb()
+#define rmb() mb()
+#else
+#define wmb() asm volatile("sfence" ::: "memory")
+#define rmb() asm volatile("lfence" ::: "memory")
+#endif
 #define smp_mb() mb()
 # define smp_rmb() barrier()

-- MST
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Thu, Dec 01, 2011 at 08:14:26AM +0200, Ohad Ben-Cohen wrote: On Thu, Dec 1, 2011 at 1:13 AM, Michael S. Tsirkin m...@redhat.com wrote: For x86, stores into memory are ordered. So I think that yes, smp_XXX can be selected at compile time. But then you can't use the same kernel image for both scenarios. I was talking about virtio-pci. That always allocates the ring in the normal memory. It won't take long until people will use virtio on ARM for both virtualization and for talking to devices, and having to rebuild the kernel for different use cases is nasty. Yes, I understand that it's nasty.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Thu, 1 Dec 2011 10:12:37 +0200, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Dec 01, 2011 at 12:58:59PM +1030, Rusty Russell wrote: On Thu, 1 Dec 2011 01:13:07 +0200, Michael S. Tsirkin m...@redhat.com wrote: For x86, stores into memory are ordered. So I think that yes, smp_XXX can be selected at compile time. So let's forget the virtio strangeness for a minute, Hmm, we got away with light barriers because we knew we were not *really* talking to a device. But now with virtio-mmio, turns out we are :) You think virtio-mmio has this issue too? It's reported on remoteproc... I think any non-virtual, non-PCI device has to worry about it. Perhaps all virtio-mmio devices are virtual (at this point). I'm tempted to say we want permission from the device to do relaxed barriers (so I don't have to worry about it!) I'm really tempted to revert d57ed95 for 3.2, and we can revisit this optimization later if it proves worthwhile. Generally it does seem the best we can do for 3.2. Given it's rc3, I'd be a bit wary of introducing regressions - I'll try to find some real setups (as in - not my laptop) to run some benchmarks on, to verify there's no major problem. I hope I can report on this in about a week from now - want to hold onto this meanwhile? Yep, no huge hurry. Thanks! Cheers, Rusty.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Tue, Nov 29, 2011 at 5:16 PM, Michael S. Tsirkin m...@redhat.com wrote: This mentions iommu - is there a need to use the dma api to let the firmware access the rings? Or does it have access to all of memory? IOMMU may or may not be used, it really depends on the hardware (my personal SoC does employ one, while others don't). The vrings are created in non-cacheable memory, which is allocated using dma_alloc_coherent, but that isn't necessarily controlling the remote processor access to the memory (a notable example is an iommu-less remote processor which can directly access the physical memory). Is there cache snooping? If yes, access from an external device typically works mostly in the same way as smp ... No, nothing fancy like that. Every processor has its own cache, with no coherency protocol. The remote processor should really be treated as a device, and not as a processor that is part of an SMP configuration, and we must prohibit both the compiler and the CPU from reordering memory operations. So you put virtio rings in MMIO memory? I'll be precise: the vrings are created in non-cacheable memory, which both processors have access to. Could you please give a couple of examples of breakage? Sure. Basically, the order of the vring memory operations appears differently to the observing processor. For example, avail->idx gets updated before the new entry is put in the available array... Thanks, Ohad.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Tue, Nov 29, 2011 at 5:19 PM, Michael S. Tsirkin m...@redhat.com wrote: On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote: Is an extra branch faster or slower than reverting d57ed95? Sorry, unfortunately I have no way to measure this, as I don't have any virtualization/x86 setup. I'm developing on ARM SoCs, where virtualization hardware is coming, but not here yet. You can try using the micro-benchmark in tools/virtio/. Hmm, care to show me exactly what you mean? Though I somewhat suspect that any micro-benchmarking I'll do with my random ARM SoC will not have much value to real virtualization/x86 workloads. Thanks, Ohad.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Wed, Nov 30, 2011 at 01:55:53PM +0200, Ohad Ben-Cohen wrote: On Tue, Nov 29, 2011 at 5:19 PM, Michael S. Tsirkin m...@redhat.com wrote: On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote: Is an extra branch faster or slower than reverting d57ed95? Sorry, unfortunately I have no way to measure this, as I don't have any virtualization/x86 setup. I'm developing on ARM SoCs, where virtualization hardware is coming, but not here yet. You can try using the micro-benchmark in tools/virtio/. Hmm, care to show me exactly what you mean?

make headers_install
make -C tools/virtio/
(you'll need an empty stub for tools/virtio/linux/module.h, I just sent a patch to add that)
sudo insmod tools/virtio/vhost_test/vhost_test.ko
./tools/virtio/virtio_test

Though I somewhat suspect that any micro-benchmarking I'll do with my random ARM SoC will not have much value to real virtualization/x86 workloads. Thanks, Ohad.

Real virtualization/x86 can keep using the current smp_XX barriers, right? We can have some config for your kind of setup. -- MST
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Wed, Nov 30, 2011 at 01:45:05PM +0200, Ohad Ben-Cohen wrote: So you put virtio rings in MMIO memory? I'll be precise: the vrings are created in non-cacheable memory, which both processors have access to. Could you please give a couple of examples of breakage? Sure. Basically, the order of the vring memory operations appears differently to the observing processor. For example, avail->idx gets updated before the new entry is put in the available array... I see. And this happens because the ARM processor reorders memory writes to this uncacheable memory? And in an SMP configuration, writes are somehow not reordered? For example, if we had such an AMP configuration with an x86 processor, wmb() (sfence) would be wrong and smp_wmb() would be sufficient. Just checking that this is not a bug in the smp_wmb implementation for the specific platform. -- MST
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Wed, Nov 30, 2011 at 4:59 PM, Michael S. Tsirkin m...@redhat.com wrote: I see. And this happens because the ARM processor reorders memory writes Yes. And in an SMP configuration, writes are somehow not reordered? They are, but then the smp memory barriers are enough to control these effects. It's not enough to control reordering as seen by a device (which is what our AMP processors are) though. (btw, the difference between an SMP processor and a device here lies in how the memory is mapped: normal memory vs. device memory attributes. it's an ARM thingy). Just checking that this is not a bug in the smp_wmb implementation for the specific platform. No, it's not. ARM's smp memory barriers use ARM's DMB instruction, which is enough to control SMP effects, whereas ARM's mandatory memory barriers use ARM's DSB instruction, which is required to ensure the ordering between Device and Normal memory accesses. Thanks, Ohad.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Wed, Nov 30, 2011 at 06:04:56PM +0200, Ohad Ben-Cohen wrote: On Wed, Nov 30, 2011 at 4:59 PM, Michael S. Tsirkin m...@redhat.com wrote: I see. And this happens because the ARM processor reorders memory writes Yes. And in an SMP configuration, writes are somehow not reordered? They are, but then the smp memory barriers are enough to control these effects. It's not enough to control reordering as seen by a device (which is what our AMP processors are) though. (btw, the difference between an SMP processor and a device here lies in how the memory is mapped: normal memory vs. device memory attributes. it's an ARM thingy). How are the rings mapped? Normal memory, right? We allocate them with plain alloc_pages_exact in virtio_pci.c ... Just checking that this is not a bug in the smp_wmb implementation for the specific platform. No, it's not. ARM's smp memory barriers use ARM's DMB instruction, which is enough to control SMP effects, whereas ARM's mandatory memory barriers use ARM's DSB instruction, which is required to ensure the ordering between Device and Normal memory accesses. Thanks, Ohad. Yes, wmb() is required to ensure ordering for MMIO. But here both accesses - index and ring - are for memory, not MMIO. I could understand a ring kick bypassing an index write, maybe ... But you described an index write bypassing a descriptor write. Is this something you see in practice? -- MST
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Wed, Nov 30, 2011 at 6:15 PM, Michael S. Tsirkin m...@redhat.com wrote: How are the rings mapped? normal memory, right? No, device memory. We allocate them with plain alloc_pages_exact in virtio_pci.c ... I'm not using virtio_pci.c; remoteproc is allocating the rings using the DMA API. Yes wmb() is required to ensure ordering for MMIO. But here both accesses: index and ring - are for memory, not MMIO. I'm doing IO with a device over shared memory. It does require mandatory barriers as I explained. Is this something you see in practice? Yes. These bugs are very real.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Wed, Nov 30, 2011 at 4:50 PM, Michael S. Tsirkin m...@redhat.com wrote: make headers_install make -C tools/virtio/ (you'll need an empty stub for tools/virtio/linux/module.h, I just sent a patch to add that) sudo insmod tools/virtio/vhost_test/vhost_test.ko ./tools/virtio/virtio_test Ok, I gave this a spin. I've tried to see if reverting d57ed95 has any measurable effect on the execution time of virtio_test's run_test(), but I couldn't see any (several attempts with and without d57ed95 yielded a very similar range of execution times). YMMV though, especially with real workloads. Real virtualization/x86 can keep using current smp_XX barriers, right? Yes, sure. ARM virtualization can too, since smp_XX barriers are enough for that scenario. We can have some config for your kind of setup. Please note that it can't be a compile-time decision though (unless we're willing to effectively revert d57ed95 when this config kicks in): it's not unlikely that one would want to have both use cases running at the same time. Thanks, Ohad.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Thu, Dec 01, 2011 at 12:43:08AM +0200, Ohad Ben-Cohen wrote: On Wed, Nov 30, 2011 at 4:50 PM, Michael S. Tsirkin m...@redhat.com wrote: make headers_install make -C tools/virtio/ (you'll need an empty stub for tools/virtio/linux/module.h, I just sent a patch to add that) sudo insmod tools/virtio/vhost_test/vhost_test.ko ./tools/virtio/virtio_test Ok, I gave this a spin. I've tried to see if reverting d57ed95 has any measurable effect on the execution time of virtio_test's run_test(), but I couldn't see any (several attempts with and without d57ed95 yielded a very similar range of execution times). YMMV though, especially with real workloads. Real virtualization/x86 can keep using current smp_XX barriers, right? Yes, sure. ARM virtualization can too, since smp_XX barriers are enough for that scenario. We can have some config for your kind of setup. Please note that it can't be a compile-time decision though (unless we're willing to effectively revert d57ed95 when this config kicks in): it's not unlikely that one would want to have both use cases running at the same time. Thanks, Ohad.

For x86, stores into memory are ordered. So I think that yes, smp_XXX can be selected at compile time. So let's forget the virtio strangeness for a minute. To me it starts looking like we need some new kind of barrier that handles accesses to DMA coherent memory. dma_Xmb()? dma_coherent_Xmb()? For example, on x86, dma_wmb() can be barrier(), but on your system it needs to do DSB. We can set the rule that dma barriers are guaranteed stronger than smp ones, and we can just use dma_ everywhere. So the strength will be: smp < dma < mandatory. And now virtio can use DMA barriers, and instead of adding overhead for x86, x86 will actually gain from this, as we'll drop mandatory barriers on UP systems. Hmm? -- MST
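The hierarchy MST proposes can be sketched as a strength ordering. This is an illustration only (the enum, function, and numeric ranks are invented, not kernel definitions): smp barriers are the weakest, the suggested dma barriers sit between them and the mandatory ones, and on x86 the smp and dma write barriers could both collapse to a compiler barrier.

```c
#include <assert.h>

/* Invented ranks modelling the proposed rule: smp <= dma <= mandatory. */
enum barrier_kind { BARRIER_SMP, BARRIER_DMA, BARRIER_MANDATORY };

static int barrier_strength(enum barrier_kind k)
{
	switch (k) {
	case BARRIER_SMP:       return 0; /* e.g. DMB on ARM, barrier() on x86 */
	case BARRIER_DMA:       return 1; /* e.g. DSB on ARM, still barrier() on x86 */
	case BARRIER_MANDATORY: return 2; /* e.g. DSB on ARM, sfence/lfence/mfence on x86 */
	default:                return -1;
	}
}
```

The appeal of the middle tier is exactly what the mail says: code that shares DMA-coherent memory with a possibly-external peer can say `dma_*()` everywhere and get the cheapest barrier that is still correct on each architecture.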
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Wed, Nov 30, 2011 at 6:24 PM, Ohad Ben-Cohen o...@wizery.com wrote: On Wed, Nov 30, 2011 at 6:15 PM, Michael S. Tsirkin m...@redhat.com wrote: How are the rings mapped? normal memory, right? No, device memory. Ok, I have more info. Originally remoteproc was mapping the rings using ioremap, and that meant ARM Device memory. Recently, though, we moved to CMA (allocating memory for the rings via dma_alloc_coherent), and that isn't Device memory anymore: it's uncacheable Normal memory (on ARM v6+). We still require mandatory barriers though: one very reproducible problem I personally face is that the avail index doesn't get updated before the kick. As a result, the remote processor misses a buffer that was just added (the kick wakes it up only to find that the avail index wasn't changed yet). In this case, it probably happens because the mailbox, used to kick the remote processor, is mapped as Device memory, and therefore the kick can be reordered before the updates to the ring can be observed. I did get two additional reports about reordering issues, on different setups than my own, and which I can't personally reproduce: the one I've described earlier (avail index gets updated before the avail array) and one in the receive path (reading a used buffer which we already read). I couldn't personally verify those, but both issues were reported to be gone when mandatory barriers were used. I expect those reports only to increase: the diversity of platforms that are now looking into adopting virtio for this kind of inter-processor communication is quite huge, with several different architectures and even more hardware implementations on the way (not only ARM). Thanks, Ohad.
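The producer-side ordering Ohad describes can be sketched in plain C11. This is a hypothetical userspace model, not the kernel's vring code (the names `ring`, `publish_buffer`, and the counter standing in for the MMIO kick are invented): the descriptor and avail entry must be visible before the index, and everything must be visible before the doorbell.

```c
#include <assert.h>
#include <stdatomic.h>

#define RING_SIZE 4

/* Invented stand-in for a vring shared with a remote processor. */
struct ring {
	int desc[RING_SIZE];     /* descriptor slots */
	int avail[RING_SIZE];    /* "available" array of slot indices */
	atomic_ushort avail_idx; /* producer publishes this last */
	int kicks;               /* counter standing in for the MMIO doorbell */
};

static void publish_buffer(struct ring *r, int slot, int value)
{
	unsigned short idx = atomic_load_explicit(&r->avail_idx,
						  memory_order_relaxed);

	r->desc[slot] = value;               /* 1. fill the descriptor */
	r->avail[idx % RING_SIZE] = slot;    /* 2. expose it in avail[] */
	/* The fence here plays the role of virtio_wmb(): without it the
	 * remote side may observe avail_idx before the entry it covers. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&r->avail_idx, (unsigned short)(idx + 1),
			      memory_order_relaxed);
	/* And a full barrier before the kick, so the doorbell (Device
	 * memory on ARM) cannot overtake the ring updates. */
	atomic_thread_fence(memory_order_seq_cst);
	r->kicks++;                          /* 3. notify the other side */
}
```

On ARM the two fences would need to be DSB-class (mandatory) barriers, since the consumer is a device rather than another cache-coherent CPU; that is precisely the bug reported here when only DMB-class smp barriers were used.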
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Thu, Dec 01, 2011 at 01:27:10AM +0200, Ohad Ben-Cohen wrote: On Wed, Nov 30, 2011 at 6:24 PM, Ohad Ben-Cohen o...@wizery.com wrote: On Wed, Nov 30, 2011 at 6:15 PM, Michael S. Tsirkin m...@redhat.com wrote: How are the rings mapped? normal memory, right? No, device memory. Ok, I have more info. Originally remoteproc was mapping the rings using ioremap, and that meant ARM Device memory. Recently, though, we moved to CMA (allocating memory for the rings via dma_alloc_coherent), and that isn't Device memory anymore: it's uncacheable Normal memory (on ARM v6+). And these accesses need to be ordered with DSB? Or DMB? We still require mandatory barriers though: one very reproducible problem I personally face is that the avail index doesn't get updated before the kick. Aha! The *kick* really is MMIO. So I think we do need a mandatory barrier before the kick. Maybe we need it for virtio-pci as well (not on kvm, naturally :) Off-hand this seems to belong in the transport layer but need to think about it. As a result, the remote processor misses a buffer that was just added (the kick wakes it up only to find that the avail index wasn't changed yet). In this case, it probably happens because the mailbox, used to kick the remote processor, is mapped as Device memory, and therefore the kick can be reordered before the updates to the ring can be observed. I did get two additional reports about reordering issues, on different setups than my own, and which I can't personally reproduce: the one I've described earlier (avail index gets updated before the avail array) and one in the receive path (reading a used buffer which we already read). I couldn't personally verify those, but both issues were reported to be gone when mandatory barriers were used. Hmm. So it's a hint that something is wrong with memory but not what's wrong exactly. 
I expect those reports only to increase: the diversity of platforms that are now looking into adopting virtio for this kind of inter-processor communication is quite huge, with several different architectures and even more hardware implementations on the way (not only ARM). Thanks, Ohad. Right. We need to be very careful with memory, it's a tricky field. One known problem with virtio is its insistence on using native endianness for some fields. If power is used, we'll have to fix this. -- MST
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Thu, 1 Dec 2011 01:13:07 +0200, Michael S. Tsirkin m...@redhat.com wrote: For x86, stores into memory are ordered. So I think that yes, smp_XXX can be selected at compile time. So let's forget the virtio strangeness for a minute, Hmm, we got away with light barriers because we knew we were not *really* talking to a device. But now with virtio-mmio, turns out we are :) I'm really tempted to revert d57ed95 for 3.2, and we can revisit this optimization later if it proves worthwhile. Thoughts? Rusty.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Thu, Dec 1, 2011 at 1:13 AM, Michael S. Tsirkin m...@redhat.com wrote: For x86, stores into memory are ordered. So I think that yes, smp_XXX can be selected at compile time. But then you can't use the same kernel image for both scenarios. It won't take long until people will use virtio on ARM for both virtualization and for talking to devices, and having to rebuild the kernel for different use cases is nasty.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Thu, Dec 1, 2011 at 1:43 AM, Michael S. Tsirkin m...@redhat.com wrote: And these accesses need to be ordered with DSB? Or DMB? DMB (i.e. smp barriers) should be enough within Normal memory accesses, though the other issues that were reported to me are a bit concerning. I'm still trying to get more information about them, in the hopes that I can eventually reproduce them myself.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Thu, Dec 1, 2011 at 4:28 AM, Rusty Russell ru...@rustcorp.com.au wrote: Hmm, we got away with light barriers because we knew we were not *really* talking to a device. But now with virtio-mmio, turns out we are :) I'm really tempted to revert d57ed95 for 3.2, and we can revisit this optimization later if it proves worthwhile. +1
[RFC] virtio: use mandatory barriers for remote processor vdevs
Virtio is using memory barriers to control the ordering of references
to the vrings on SMP systems. When the guest is compiled with SMP
support, virtio is only using SMP barriers in order to avoid incurring
the overhead involved with mandatory barriers.

Lately, though, virtio is being increasingly used with inter-processor
communication scenarios too, which involve running two (separate)
instances of operating systems on two (separate) processors, each of
which might either be UP or SMP.

To control the ordering of memory references when the vrings are
shared between two external processors, we must always use mandatory
barriers.

A trivial, albeit sub-optimal, solution would be to simply revert
commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously,
though, that's going to have a negative impact on performance of
SMP-based virtualization use cases.

A different approach, as demonstrated by this patch, would pick the
type of memory barriers at run time, according to the requirements of
the virtio device. This way, both SMP virtualization scenarios and
inter-processor communication use cases would run correctly, without
making any performance compromises (except for those incurred by an
additional branch or level of indirection).

This patch introduces VIRTIO_RING_F_REMOTEPROC, a new virtio transport
feature, which should be used by virtio devices that run on remote
processors. The CONFIG_SMP variant of virtio_{mb, rmb, wmb} is then
changed to use SMP barriers only if VIRTIO_RING_F_REMOTEPROC was
absent.

Signed-off-by: Ohad Ben-Cohen o...@wizery.com
---
Alternatively, we can also introduce some kind of virtio_mb_ops and
set it according to the nature of the vdev with handlers that just do
the right thing, instead of introducing that branch.

Though I also wonder how big really is the performance gain of d57ed95?
 drivers/virtio/virtio_ring.c |   78 +++++++++++++++++++++++++++++--------
 include/linux/virtio_ring.h  |    6 +++
 2 files changed, 60 insertions(+), 24 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index c7a2c20..cf66a2d 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -23,24 +23,6 @@
 #include <linux/slab.h>
 #include <linux/module.h>
 
-/* virtio guest is communicating with a virtual device that actually runs on
- * a host processor. Memory barriers are used to control SMP effects. */
-#ifdef CONFIG_SMP
-/* Where possible, use SMP barriers which are more lightweight than mandatory
- * barriers, because mandatory barriers control MMIO effects on accesses
- * through relaxed memory I/O windows (which virtio does not use). */
-#define virtio_mb() smp_mb()
-#define virtio_rmb() smp_rmb()
-#define virtio_wmb() smp_wmb()
-#else
-/* We must force memory ordering even if guest is UP since host could be
- * running on another CPU, but SMP barriers are defined to barrier() in that
- * configuration. So fall back to mandatory barriers instead. */
-#define virtio_mb() mb()
-#define virtio_rmb() rmb()
-#define virtio_wmb() wmb()
-#endif
-
 #ifdef DEBUG
 /* For development, we want to crash whenever the ring is screwed. */
 #define BAD_RING(_vq, fmt, args...)		\
@@ -86,6 +68,9 @@ struct vring_virtqueue
 	/* Host publishes avail event idx */
 	bool event;
 
+	/* Host runs on a remote processor */
+	bool rproc;
+
 	/* Number of free buffers */
 	unsigned int num_free;
 	/* Head of free buffer list. */
@@ -108,6 +93,48 @@ struct vring_virtqueue
 	void *data[];
 };
 
+/*
+ * virtio guest is communicating with a virtual device that may either run
+ * on the host processor, or on an external processor. The former requires
+ * memory barriers in order to control SMP effects, but the latter must
+ * use mandatory barriers.
+ */
+#ifdef CONFIG_SMP
+/* Where possible, use SMP barriers which are more lightweight than mandatory
+ * barriers, because mandatory barriers control MMIO effects on accesses
+ * through relaxed memory I/O windows. */
+static inline void virtio_mb(struct vring_virtqueue *vq)
+{
+	if (vq->rproc)
+		mb();
+	else
+		smp_mb();
+}
+
+static inline void virtio_rmb(struct vring_virtqueue *vq)
+{
+	if (vq->rproc)
+		rmb();
+	else
+		smp_rmb();
+}
+
+static inline void virtio_wmb(struct vring_virtqueue *vq)
+{
+	if (vq->rproc)
+		wmb();
+	else
+		smp_wmb();
+}
+#else
+/* We must force memory ordering even if guest is UP since host could be
+ * running on another CPU, but SMP barriers are defined to barrier() in that
+ * configuration. So fall back to mandatory barriers instead. */
+static inline void virtio_mb(struct vring_virtqueue *vq) { mb(); }
+static inline void virtio_rmb(struct vring_virtqueue *vq) { rmb(); }
+static inline void virtio_wmb(struct vring_virtqueue *vq) { wmb(); }
+#endif
+
 #define to_vvq(_vq) container_of(_vq,
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Tue, Nov 29, 2011 at 02:31:26PM +0200, Ohad Ben-Cohen wrote:
> Virtio is using memory barriers to control the ordering of references
> to the vrings on SMP systems. When the guest is compiled with SMP
> support, virtio is only using SMP barriers in order to avoid incurring
> the overhead involved with mandatory barriers.
>
> Lately, though, virtio is being increasingly used with inter-processor
> communication scenarios too, which involve running two (separate)
> instances of operating systems on two (separate) processors, each of
> which might either be UP or SMP.

Is that using virtio-mmio? If yes, would the extra serialization belong
at that layer?

> To control the ordering of memory references when the vrings are
> shared between two external processors, we must always use mandatory
> barriers.

Sorry, could you pls explain what are 'two external processors'?
I think I know that if two x86 CPUs in an SMP system run kernels built
in an SMP configuration, smp_*mb barriers are enough.

Documentation/memory-barriers.txt says:
	Mandatory barriers should not be used to control SMP effects ...
	They may, however, be used to control MMIO effects on accesses
	through relaxed memory I/O windows.

We don't control MMIO/relaxed memory I/O windows here, so what exactly
is the issue? Could you please give an example of a setup that is
currently broken?

> A trivial, albeit sub-optimal, solution would be to simply revert
> commit d57ed95 "virtio: use smp_XX barriers on SMP". Obviously,
> though, that's going to have a negative impact on performance of
> SMP-based virtualization use cases.
>
> A different approach, as demonstrated by this patch, would pick the
> type of memory barriers, in run time, according to the requirements
> of the virtio device. This way, both SMP virtualization scenarios and
> inter-processor communication use cases would run correctly, without
> making any performance compromises (except for those incurred by an
> additional branch or level of indirection).
Is an extra branch faster or slower than reverting d57ed95?

> This patch introduces VIRTIO_RING_F_REMOTEPROC, a new virtio transport
> feature, which should be used by virtio devices that run on remote
> processors. The CONFIG_SMP variant of virtio_{mb, rmb, wmb} is then
> changed to use SMP barriers only if VIRTIO_RING_F_REMOTEPROC was
> absent.

One wonders how the remote side knows enough to set this flag?

> Signed-off-by: Ohad Ben-Cohen o...@wizery.com
> ---
> Alternatively, we can also introduce some kind of virtio_mb_ops and
> set it according to the nature of the vdev with handlers that just do
> the right thing, instead of introducing that branch.
>
> Though I also wonder how big really is the performance gain of d57ed95?

Want to check and tell us?

>  drivers/virtio/virtio_ring.c |   78 +++++++++++++++++++++++++++++--------
>  include/linux/virtio_ring.h  |    6 +++
>  2 files changed, 60 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index c7a2c20..cf66a2d 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -23,24 +23,6 @@
>  #include <linux/slab.h>
>  #include <linux/module.h>
>
> -/* virtio guest is communicating with a virtual device that actually runs on
> - * a host processor. Memory barriers are used to control SMP effects. */
> -#ifdef CONFIG_SMP
> -/* Where possible, use SMP barriers which are more lightweight than mandatory
> - * barriers, because mandatory barriers control MMIO effects on accesses
> - * through relaxed memory I/O windows (which virtio does not use). */
> -#define virtio_mb() smp_mb()
> -#define virtio_rmb() smp_rmb()
> -#define virtio_wmb() smp_wmb()
> -#else
> -/* We must force memory ordering even if guest is UP since host could be
> - * running on another CPU, but SMP barriers are defined to barrier() in that
> - * configuration. So fall back to mandatory barriers instead. */
> -#define virtio_mb() mb()
> -#define virtio_rmb() rmb()
> -#define virtio_wmb() wmb()
> -#endif
> -
>  #ifdef DEBUG
>  /* For development, we want to crash whenever the ring is screwed. */
>  #define BAD_RING(_vq, fmt, args...)		\
> @@ -86,6 +68,9 @@ struct vring_virtqueue
>  	/* Host publishes avail event idx */
>  	bool event;
>
> +	/* Host runs on a remote processor */
> +	bool rproc;
> +
>  	/* Number of free buffers */
>  	unsigned int num_free;
>  	/* Head of free buffer list. */
> @@ -108,6 +93,48 @@ struct vring_virtqueue
>  	void *data[];
>  };
>
> +/*
> + * virtio guest is communicating with a virtual device that may either run
> + * on the host processor, or on an external processor. The former requires
> + * memory barriers in order to control SMP effects, but the latter must
> + * use mandatory barriers.
> + */
> +#ifdef CONFIG_SMP
> +/* Where possible, use SMP barriers which are more lightweight than mandatory
> + * barriers, because mandatory barriers control MMIO effects on accesses
> + * through
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Tue, Nov 29, 2011 at 3:11 PM, Michael S. Tsirkin m...@redhat.com wrote:
> On Tue, Nov 29, 2011 at 02:31:26PM +0200, Ohad Ben-Cohen wrote:
>> Virtio is using memory barriers to control the ordering of references
>> to the vrings on SMP systems. When the guest is compiled with SMP
>> support, virtio is only using SMP barriers in order to avoid incurring
>> the overhead involved with mandatory barriers.
>>
>> Lately, though, virtio is being increasingly used with inter-processor
>> communication scenarios too, which involve running two (separate)
>> instances of operating systems on two (separate) processors, each of
>> which might either be UP or SMP.
>
> Is that using virtio-mmio?

No, I'm using this: https://lkml.org/lkml/2011/10/25/139

> Sorry, could you pls explain what are 'two external processors'?
> I think I know that if two x86 CPUs in an SMP system run kernels built
> in an SMP configuration, smp_*mb barriers are enough.

Sure: my setup is not SMP-based; it's two separate processors running
in an AMP configuration. The processors have completely different
architectures, are not cache coherent, and simply share some memory,
which is used for communication using virtio as the shared-memory wire
protocol (i.e. we're not even doing virtualization: we have Linux on
one processor and some RTOS on another processor, and they use virtio
to send and receive buffers).

So it's not SMP effects we're controlling; we're pretty much doing
MMIO and must use mandatory barriers (otherwise we see breakage).

> Is an extra branch faster or slower than reverting d57ed95?

Sorry, unfortunately I have no way to measure this, as I don't have
any virtualization/x86 setup. I'm developing on ARM SoCs, where
virtualization hardware is coming, but not here yet.

> One wonders how the remote side knows enough to set this flag?
The remote side is dedicated firmware loaded in order to boot the
other processor, so it really is intimate with the platform and its
requirements (and it publishes the virtio device features for every
supported virtio device).

>> Though I also wonder how big really is the performance gain of d57ed95?
>
> Want to check and tell us?

Sorry, as I said, I have no way to do that... Any chance you did some
measurements back when d57ed95 was introduced? I just guessed it
improved performance, otherwise you probably wouldn't have bothered.
But if not, then reverting it is surely preferable to adding more
complexity.

Thanks!
Ohad.
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote:
> On Tue, Nov 29, 2011 at 3:11 PM, Michael S. Tsirkin m...@redhat.com wrote:
>> Is that using virtio-mmio?
>
> No, I'm using this: https://lkml.org/lkml/2011/10/25/139

This mentions iommu - is there a need to use the dma api to let the
firmware access the rings? Or does it have access to all of memory?

>> Sorry, could you pls explain what are 'two external processors'?
>> I think I know that if two x86 CPUs in an SMP system run kernels built
>> in an SMP configuration, smp_*mb barriers are enough.
>
> Sure: my setup is not SMP-based; it's two separate processors running
> in an AMP configuration. The processors have completely different
> architectures, are not cache coherent, and simply share some memory,
> which is used for communication using virtio as the shared-memory wire
> protocol (i.e. we're not even doing virtualization: we have Linux on
> one processor and some RTOS on another processor, and they use virtio
> to send and receive buffers).

I'd like to make sure I understand the memory model some more. Is there
cache snooping? If yes, access from an external device typically works
mostly in the same way as smp ...

> So it's not SMP effects we're controlling; we're pretty much doing
> MMIO and must use mandatory barriers

So you put virtio rings in MMIO memory?

> (otherwise we see breakage).

Could you please give a couple of examples of breakage?
--
MST
Re: [RFC] virtio: use mandatory barriers for remote processor vdevs
On Tue, Nov 29, 2011 at 03:57:19PM +0200, Ohad Ben-Cohen wrote:
>> Is an extra branch faster or slower than reverting d57ed95?
>
> Sorry, unfortunately I have no way to measure this, as I don't have
> any virtualization/x86 setup. I'm developing on ARM SoCs, where
> virtualization hardware is coming, but not here yet.

You can try using the micro-benchmark in tools/virtio/. That does not
need virtualization.
--
MST