Re: Network performance with small packets
On Thu, 14 Apr 2011 19:03:59 +0300, Michael S. Tsirkin m...@redhat.com wrote:

On Thu, Apr 14, 2011 at 08:58:41PM +0930, Rusty Russell wrote: They have to offer the feature, so if they have some way of allocating non-page-aligned amounts of memory, they'll have to add those extra 2 bytes. So I think it's OK... Rusty.

To clarify, my concern is that we always seem to try to map these extra 2 bytes, which conceivably might fail?

No, if you look at the layout it's clear that there's always most of a page left for this extra room, both in the middle and at the end.

Cheers, Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets
On Tue, 12 Apr 2011 23:01:12 +0300, Michael S. Tsirkin m...@redhat.com wrote:

On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote: Here's an old patch where I played with implementing this: ... virtio: put last_used and last_avail index into ring itself. Generally, the other end of the virtio ring doesn't need to see where you're up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed. For example, if you want to save and restore a virtio_ring, but you're not the consumer because the kernel is using it directly. Fortunately, we have room to expand:

This seems to be true for x86 kvm and lguest but is it true for s390?

Yes, as the ring is page aligned so there's always room.

Will this last bit work on s390? If I understand correctly the memory is allocated by the host there?

They have to offer the feature, so if they have some way of allocating non-page-aligned amounts of memory, they'll have to add those extra 2 bytes. So I think it's OK...

Rusty.
Re: Network performance with small packets
On Thu, Apr 14, 2011 at 08:58:41PM +0930, Rusty Russell wrote:

On Tue, 12 Apr 2011 23:01:12 +0300, Michael S. Tsirkin m...@redhat.com wrote: On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote: Here's an old patch where I played with implementing this: ... virtio: put last_used and last_avail index into ring itself. Generally, the other end of the virtio ring doesn't need to see where you're up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed. For example, if you want to save and restore a virtio_ring, but you're not the consumer because the kernel is using it directly. Fortunately, we have room to expand:

This seems to be true for x86 kvm and lguest but is it true for s390?

Yes, as the ring is page aligned so there's always room.

Will this last bit work on s390? If I understand correctly the memory is allocated by the host there?

They have to offer the feature, so if they have some way of allocating non-page-aligned amounts of memory, they'll have to add those extra 2 bytes. So I think it's OK... Rusty.

To clarify, my concern is that we always seem to try to map these extra 2 bytes, which conceivably might fail?

-- MST
Re: Network performance with small packets
On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote: Here's an old patch where I played with implementing this: ... virtio: put last_used and last_avail index into ring itself. Generally, the other end of the virtio ring doesn't need to see where you're up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed. For example, if you want to save and restore a virtio_ring, but you're not the consumer because the kernel is using it directly. Fortunately, we have room to expand:

This seems to be true for x86 kvm and lguest but is it true for s390?

	err = vmem_add_mapping(config->address,
			       vring_size(config->num,
					  KVM_S390_VIRTIO_RING_ALIGN));
	if (err)
		goto out;

	vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
				 vdev, (void *) config->address,
				 kvm_notify, callback, name);
	if (!vq) {
		err = -ENOMEM;
		goto unmap;
	}

the ring is always a whole number of pages and there's hundreds of bytes of padding after the avail ring and the used ring, whatever the number of descriptors (which must be a power of 2). We add a feature bit so the guest can tell the host that it's writing out the current value there, if it wants to use that.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -29,6 +29,9 @@
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC	28

+/* We publish our last-seen used index at the end of the avail ring. */
+#define VIRTIO_RING_F_PUBLISH_INDICES	29
+
 /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
 struct vring_desc {
@@ -87,6 +90,7 @@ struct vring {
  *	__u16 avail_flags;
  *	__u16 avail_idx;
  *	__u16 available[num];
+ *	__u16 last_used_idx;
  *
  *	// Padding to the next align boundary.
  *	char pad[];
@@ -95,6 +99,7 @@ struct vring {
  *	__u16 used_flags;
  *	__u16 used_idx;
  *	struct vring_used_elem used[num];
+ *	__u16 last_avail_idx;
  * };
  */
 static inline void vring_init(struct vring *vr, unsigned int num, void *p,
@@ -111,9 +116,14 @@ static inline unsigned vring_size(unsign
 {
 	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (2 + num)
 		 + align - 1) & ~(align - 1))
-		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
+		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num + 2;
 }

+/* We publish the last-seen used index at the end of the available ring, and
+ * vice-versa.  These are at the end for backwards compatibility. */
+#define vring_last_used(vr) ((vr)->avail->ring[(vr)->num])
+#define vring_last_avail(vr) (*(__u16 *)&(vr)->used->ring[(vr)->num])
+
Will this last bit work on s390? If I understand correctly the memory is allocated by the host there?

 #ifdef __KERNEL__
 #include <linux/irqreturn.h>
 struct virtio_device;
Re: Network performance with small packets - continued
On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:

On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote: As for which CPU the interrupt gets pinned to, that doesn't matter - see below.

So what hurts us the most is that the IRQ jumps between the VCPUs?

Yes, it appears that allowing the IRQ to run on more than one vCPU hurts. Without the publish last used index patch, vhost keeps injecting an irq for every received packet until the guest eventually turns off notifications. Because the irq injections end up overlapping we get contention on the irq_desc_lock_class lock.

Here are some results using the baseline setup with irqbalance running:

Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
Exits: 121,050.45 Exits/Sec
TxCPU: 9.61% RxCPU: 99.45%
Virtio1-input Interrupts/Sec (CPU0/CPU1): 13,975/0
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 24% increase over baseline. Irqbalance essentially pinned the virtio irq to CPU0, preventing the irq lock contention and resulting in nice gains.
Re: Network performance with small packets - continued
On Thu, Mar 10, 2011 at 09:23:42AM -0600, Tom Lendacky wrote:

On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote: On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote: As for which CPU the interrupt gets pinned to, that doesn't matter - see below.

So what hurts us the most is that the IRQ jumps between the VCPUs?

Yes, it appears that allowing the IRQ to run on more than one vCPU hurts. Without the publish last used index patch, vhost keeps injecting an irq for every received packet until the guest eventually turns off notifications.

Are you sure you see that? If yes publish used should help a lot.

Because the irq injections end up overlapping we get contention on the irq_desc_lock_class lock. Here are some results using the baseline setup with irqbalance running:

Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
Exits: 121,050.45 Exits/Sec
TxCPU: 9.61% RxCPU: 99.45%
Virtio1-input Interrupts/Sec (CPU0/CPU1): 13,975/0
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 24% increase over baseline. Irqbalance essentially pinned the virtio irq to CPU0, preventing the irq lock contention and resulting in nice gains.

OK, so we probably want some form of delayed free for TX on top, and that should get us nice results already.
Re: Network performance with small packets - continued
On Thursday, March 10, 2011 09:34:22 am Michael S. Tsirkin wrote: On Thu, Mar 10, 2011 at 09:23:42AM -0600, Tom Lendacky wrote: On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote: On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote: As for which CPU the interrupt gets pinned to, that doesn't matter - see below. So what hurts us the most is that the IRQ jumps between the VCPUs? Yes, it appears that allowing the IRQ to run on more than one vCPU hurts. Without the publish last used index patch, vhost keeps injecting an irq for every received packet until the guest eventually turns off notifications. Are you sure you see that? If yes publish used should help a lot. I definitely see that. I ran lockstat in the guest and saw the contention on the lock when the irq was able to run on either vCPU. Once the irq was pinned the contention disappeared. The publish used index patch should eliminate the extra irq injections and then the pinning or use of irqbalance shouldn't be required. I'm getting a kernel oops during boot with the publish last used patches that I pulled from the mailing list - I had to make some changes in order to get them to apply and compile and might not have done the right things. Can you re-spin that patchset against kvm.git? Because the irq injections end up overlapping we get contention on the irq_desc_lock_class lock. Here are some results using the baseline setup with irqbalance running. Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec Exits: 121,050.45 Exits/Sec TxCPU: 9.61% RxCPU: 99.45% Virtio1-input Interrupts/Sec (CPU0/CPU1): 13,975/0 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 24% increase over baseline. Irqbalance essentially pinned the virtio irq to CPU0 preventing the irq lock contention and resulting in nice gains. OK, so we probably want some form of delayed free for TX on top, and that should get us nice results already. 
Re: Network performance with small packets
On Tue, 08 Mar 2011 20:21:18 -0600, Andrew Theurer haban...@linux.vnet.ibm.com wrote:

On Tue, 2011-03-08 at 13:57 -0800, Shirley Ma wrote: On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote: I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kinds of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact...

Should we also add similar stats on the vhost vq as well, for monitoring vhost_signal and vhost_notify?

Tom L has started using Rusty's patches and found some interesting results, sent yesterday: http://marc.info/?l=kvm&m=129953710930124&w=2

Hmm, I'm not subscribed to kvm@ any more, so I didn't get this, so replying here:

Also, it looks like vhost is sending a lot of notifications for packets it has received before the guest can get scheduled to disable notifications and begin processing the packets resulting in some lock contention in the guest (and high interrupt rates).

Yes, this is a virtio design flaw, but one that should be fixable. We have room at the end of the ring, where we can put a last_used count. Then we can tell if wakeups are redundant, before the guest updates the flag. Here's an old patch where I played with implementing this:

virtio: put last_used and last_avail index into ring itself.

Generally, the other end of the virtio ring doesn't need to see where you're up to in consuming the ring. However, to completely understand what's going on from the outside, this information must be exposed. For example, if you want to save and restore a virtio_ring, but you're not the consumer because the kernel is using it directly.

Fortunately, we have room to expand: the ring is always a whole number of pages and there's hundreds of bytes of padding after the avail ring and the used ring, whatever the number of descriptors (which must be a power of 2).
We add a feature bit so the guest can tell the host that it's writing out the current value there, if it wants to use that.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/virtio/virtio_ring.c |   23 +++++++++++----------
 include/linux/virtio_ring.h  |   12 +++++++++---
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -71,9 +71,6 @@ struct vring_virtqueue
 	/* Number we've added since last sync. */
 	unsigned int num_added;

-	/* Last used index we've seen. */
-	u16 last_used_idx;
-
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);
@@ -278,12 +275,13 @@ static void detach_buf(struct vring_virt
 static inline bool more_used(const struct vring_virtqueue *vq)
 {
-	return vq->last_used_idx != vq->vring.used->idx;
+	return vring_last_used(&vq->vring) != vq->vring.used->idx;
 }

 static void *vring_get_buf(struct virtqueue *_vq, unsigned int *len)
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
+	struct vring_used_elem *u;
 	void *ret;
 	unsigned int i;
@@ -300,8 +298,11 @@ static void *vring_get_buf(struct virtqu
 		return NULL;
 	}

-	i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id;
-	*len = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].len;
+	u = &vq->vring.used->ring[vring_last_used(&vq->vring) % vq->vring.num];
+	i = u->id;
+	*len = u->len;
+	/* Make sure we don't reload i after doing checks. */
+	rmb();

 	if (unlikely(i >= vq->vring.num)) {
 		BAD_RING(vq, "id %u out of range\n", i);
@@ -315,7 +316,8 @@ static void *vring_get_buf(struct virtqu
 	/* detach_buf clears data, so grab it now. */
 	ret = vq->data[i];
 	detach_buf(vq, i);
-	vq->last_used_idx++;
+	vring_last_used(&vq->vring)++;
+
 	END_USE(vq);
 	return ret;
 }
@@ -402,7 +404,6 @@ struct virtqueue *vring_new_virtqueue(un
 	vq->vq.name = name;
 	vq->notify = notify;
 	vq->broken = false;
-	vq->last_used_idx = 0;
 	vq->num_added = 0;
 	list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
@@ -413,6 +414,10 @@ struct virtqueue *vring_new_virtqueue(un
 	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);

+	/* We publish indices whether they offer it or not: if not, it's junk
+	 * space anyway.  But calling this acknowledges the feature. */
+	virtio_has_feature(vdev, VIRTIO_RING_F_PUBLISH_INDICES);
+
 	/* No callback?  Tell other side not to bother us. */
 	if (!callback)
 		vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
@@ -443,6 +448,8 @@ void vring_transport_features(struct vir
 		switch (i) {
 		case VIRTIO_RING_F_INDIRECT_DESC:
 			break;
+
Re: Network performance with small packets
On Tue, 2011-03-08 at 20:21 -0600, Andrew Theurer wrote: Tom L has started using Rusty's patches and found some interesting results, sent yesterday: http://marc.info/?l=kvm&m=129953710930124&w=2

Thanks. Very good experiments. I have been struggling with guest/vhost optimization work for a while. I created different experimental patches; performance results really depend on workloads. Based on the discussions and findings, it seems that to improve the virtio_net/vhost optimization work we really need to collect more statistics on both virtio_net and vhost, for both TX and RX. A way to count guest exits, I/O exits, and irq injections caused by the guest networking stack alone would be helpful.

Thanks
Shirley
Re: Network performance with small packets - continued
On Wed, 2011-03-09 at 09:15 +0200, Michael S. Tsirkin wrote:

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 82dba5a..ebe3337 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;

-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
-		tot_sgs += skb_vnet_hdr(skb)->num_sg;
+		tot_sgs = 2+MAX_SKB_FRAGS;
 		dev_kfree_skb_any(skb);
 	}
 	return tot_sgs;

Return value should be different based on indirect or direct buffers here?

@@ -576,9 +576,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;

-	/* Free up any pending old buffers before queueing new ones. */
-	free_old_xmit_skbs(vi);
-
 	/* Try to transmit */
 	capacity = xmit_skb(vi, skb);
@@ -605,6 +602,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_orphan(skb);
 	nf_reset(skb);

+	/* Free up any old buffers so we can queue new ones. */
+	if (capacity < 2+MAX_SKB_FRAGS)
+		capacity += free_old_xmit_skbs(vi);
+
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {

I tried a similar patch before, it didn't help much on TCP stream performance. But I didn't try multiple stream TCP_RR.

Shirley
Re: Network performance with small packets - continued
On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:

On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote: We've been doing some more experimenting with the small packet network performance problem in KVM. I have a different setup than what Steve D. was using, so I re-baselined things on the kvm.git kernel on both the host and guest with a 10GbE adapter. I also made use of the virtio-stats patch.

The virtual machine has 2 vCPUs, 8GB of memory and two virtio network adapters (the first connected to a 1GbE adapter and a LAN, the second connected to a 10GbE adapter that is direct connected to another system with the same 10GbE adapter) running the kvm.git kernel. The test was a TCP_RR test with 100 connections from a baremetal client to the KVM guest using a 256 byte message size in both directions. I used the uperf tool to do this after verifying the results against netperf. Uperf allows the specification of the number of connections as a parameter in an XML file as opposed to launching, in this case, 100 separate instances of netperf.

Here is the baseline for baremetal using 2 physical CPUs:

Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
TxCPU: 7.88% RxCPU: 99.41%

To be sure to get consistent results with KVM I disabled the hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and ethernet adapter interrupts (this resulted in runs that differed by only about 2% from lowest to highest). The fact that pinning is required to get consistent results is a different problem that we'll have to look into later...

Here is the KVM baseline (average of six runs):

Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
Exits: 148,444.58 Exits/Sec
TxCPU: 2.40% RxCPU: 99.35%

About 42% of baremetal.

Can you add interrupt stats as well please?

Yes I can. Just the guest interrupts for the virtio device?
So I coded a quick patch to delay freeing of the used Tx buffers until more than half the ring was used (I did not test this under a stream condition so I don't know if this would have a negative impact). Here are the results from delaying the freeing of used Tx buffers (average of six runs):

Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
Exits: 142,681.67 Exits/Sec
TxCPU: 2.78% RxCPU: 99.36%

About a 4% increase over baseline and about 44% of baremetal.

Hmm, I am not sure what you mean by delaying freeing.

In the start_xmit function of virtio_net.c the first thing done is to free any used entries from the ring. I patched the code to track the number of used tx ring entries and only free the used entries when they are greater than half the capacity of the ring (similar to the way the rx ring is re-filled).

I think we do have a problem that free_old_xmit_skbs tries to flush out the ring aggressively: it always polls until the ring is empty, so there could be bursts of activity where we spend a lot of time flushing the old entries before e.g. sending an ack, resulting in latency bursts. Generally we'll need some smarter logic, but with indirect at the moment we can just poll a single packet after we post a new one, and be done with it. Is your patch something like the patch below? Could you try mine as well please?

Yes, I'll try the patch and post the results.

This spread out the kick_notify but still resulted in a lot of them. I decided to build on the delayed Tx buffer freeing and code up an ethtool-like coalescing patch in order to delay the kick_notify until there were at least 5 packets on the ring or 2000 usecs had passed, whichever occurred first. Here are the results of delaying the kick_notify (average of six runs):

Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
Exits: 102,587.28 Exits/Sec
TxCPU: 3.03% RxCPU: 99.33%

About a 23% increase over baseline and about 52% of baremetal.
Running the perf command against the guest I noticed almost 19% of the time being spent in _raw_spin_lock. Enabling lockstat in the guest showed a lot of contention in the irq_desc_lock_class. Pinning the virtio1-input interrupt to a single cpu in the guest and re-running the last test resulted in tremendous gains (average of six runs):

Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
Exits: 62,603.37 Exits/Sec
TxCPU: 3.73% RxCPU: 98.52%

About a 77% increase over baseline and about 74% of baremetal.

Vhost is receiving a lot of notifications for packets that are to be transmitted (over 60% of the packets generate a kick_notify). Also, it looks like vhost is sending a lot of notifications for packets it has received before the guest can get scheduled to disable notifications and begin processing the packets

Hmm, is this really what happens to you? The effect would be that guest gets an interrupt while notifications are disabled in guest,
Re: Network performance with small packets - continued
On Wed, Mar 09, 2011 at 07:45:43AM -0800, Shirley Ma wrote:

On Wed, 2011-03-09 at 09:15 +0200, Michael S. Tsirkin wrote:

@@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;

-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
-		tot_sgs += skb_vnet_hdr(skb)->num_sg;
+		tot_sgs = 2+MAX_SKB_FRAGS;
 		dev_kfree_skb_any(skb);
 	}
 	return tot_sgs;

Return value should be different based on indirect or direct buffers here?

Something like that. Or we can assume no indirect, worst-case. But just for testing, I think it should work as an estimation.

I tried a similar patch before, it didn't help much on TCP stream performance. But I didn't try multiple stream TCP_RR.

Shirley

There's a bug in my patch by the way. Pls try the following instead (still untested).

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 82dba5a..4477b9a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;

-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
-		tot_sgs += skb_vnet_hdr(skb)->num_sg;
+		tot_sgs = 2+MAX_SKB_FRAGS;
 		dev_kfree_skb_any(skb);
 	}
 	return tot_sgs;
@@ -576,7 +576,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;

-	/* Free up any pending old buffers before queueing new ones. */
+	/* Free up any old buffers so we can queue new ones. */
 	free_old_xmit_skbs(vi);

 	/* Try to transmit */
@@ -605,6 +605,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_orphan(skb);
 	nf_reset(skb);

+	/* Free up any old buffers so we can queue new ones. */
+	if (capacity < 2+MAX_SKB_FRAGS)
+		capacity += free_old_xmit_skbs(vi);
+
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {
Re: Network performance with small packets - continued
On Wednesday, March 09, 2011 01:17:44 am Michael S. Tsirkin wrote:

On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote: I used the uperf tool to do this after verifying the results against netperf. Uperf allows the specification of the number of connections as a parameter in an XML file as opposed to launching, in this case, 100 separate instances of netperf.

Could you post the XML on the list please?

Environment variables are used to specify some of the values:

uperf_instances=100
uperf_dest=192.168.100.28
uperf_duration=300
uperf_tx_msgsize=256
uperf_rx_msgsize=256

You can also change from threads to processes by specifying nprocs instead of nthreads in the group element. I found this out later so all of my runs are using threads. Using processes will give you some improved performance but I need to be consistent with my runs and stay with threads for now.

<?xml version="1.0"?>
<profile name="TCP_RR">
  <group nthreads="$uperf_instances">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=$uperf_dest protocol=tcp"/>
    </transaction>
    <transaction duration="$uperf_duration">
      <flowop type="write" options="size=$uperf_tx_msgsize"/>
      <flowop type="read" options="size=$uperf_rx_msgsize"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>
Re: Network performance with small packets - continued
On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote:

This spread out the kick_notify but still resulted in a lot of them. I decided to build on the delayed Tx buffer freeing and code up an ethtool-like coalescing patch in order to delay the kick_notify until there were at least 5 packets on the ring or 2000 usecs had passed, whichever occurred first. Here are the results of delaying the kick_notify (average of six runs):

Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
Exits: 102,587.28 Exits/Sec
TxCPU: 3.03% RxCPU: 99.33%

About a 23% increase over baseline and about 52% of baremetal.

Running the perf command against the guest I noticed almost 19% of the time being spent in _raw_spin_lock. Enabling lockstat in the guest showed a lot of contention in the irq_desc_lock_class. Pinning the virtio1-input interrupt to a single cpu in the guest and re-running the last test resulted in tremendous gains (average of six runs):

Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
Exits: 62,603.37 Exits/Sec
TxCPU: 3.73% RxCPU: 98.52%

About a 77% increase over baseline and about 74% of baremetal.

Vhost is receiving a lot of notifications for packets that are to be transmitted (over 60% of the packets generate a kick_notify). Also, it looks like vhost is sending a lot of notifications for packets it has received before the guest can get scheduled to disable notifications and begin processing the packets

Hmm, is this really what happens to you? The effect would be that guest gets an interrupt while notifications are disabled in guest, right? Could you add a counter and check this please?

The disabling of the interrupt/notifications is done by the guest. So the guest has to get scheduled and handle the notification before it disables them. The vhost_signal routine will keep injecting an interrupt until this happens, causing the contention in the guest. I'll try the patches you specify below and post the results. They look like they should take care of this issue.

In the guest TX path, the guest interrupt should be disabled from the start, since start_xmit itself calls free_old_xmit_skbs; it's not necessary to receive any send completion interrupts to free old skbs. The interrupt is only enabled when the netif queue is full. For a multiple stream TCP_RR test we never hit the netif-queue-full situation, so the cat /proc/interrupts send completion interrupt rate is 0, right?

Shirley
Re: Network performance with small packets - continued
On Wed, 2011-03-09 at 18:10 +0200, Michael S. Tsirkin wrote:

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 82dba5a..4477b9a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;

-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
-		tot_sgs += skb_vnet_hdr(skb)->num_sg;
+		tot_sgs = 2+MAX_SKB_FRAGS;
 		dev_kfree_skb_any(skb);
 	}
 	return tot_sgs;
@@ -576,7 +576,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;

-	/* Free up any pending old buffers before queueing new ones. */
+	/* Free up any old buffers so we can queue new ones. */
 	free_old_xmit_skbs(vi);

 	/* Try to transmit */
@@ -605,6 +605,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_orphan(skb);
 	nf_reset(skb);

+	/* Free up any old buffers so we can queue new ones. */
+	if (capacity < 2+MAX_SKB_FRAGS)
+		capacity += free_old_xmit_skbs(vi);
+
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {

I tried this one as well. It might improve TCP_RR performance but not TCP_STREAM. :) Let's wait for Tom's TCP_RR results.

Thanks
Shirley
Re: Network performance with small packets - continued
On Wed, Mar 09, 2011 at 10:09:26AM -0600, Tom Lendacky wrote: On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote: On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote: We've been doing some more experimenting with the small packet network performance problem in KVM. I have a different setup than what Steve D. was using so I re-baselined things on the kvm.git kernel on both the host and guest with a 10GbE adapter. I also made use of the virtio-stats patch. The virtual machine has 2 vCPUs, 8GB of memory and two virtio network adapters (the first connected to a 1GbE adapter and a LAN, the second connected to a 10GbE adapter that is direct connected to another system with the same 10GbE adapter) running the kvm.git kernel. The test was a TCP_RR test with 100 connections from a baremetal client to the KVM guest using a 256 byte message size in both directions. One thing that might be happening is that we are out of the atomic memory pool in the guest, so indirect allocations start failing, and this is the slow path. Could you check this please? I used the uperf tool to do this after verifying the results against netperf. Uperf allows the specification of the number of connections as a parameter in an XML file as opposed to launching, in this case, 100 separate instances of netperf. Here is the baseline for baremetal using 2 physical CPUs: Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec TxCPU: 7.88% RxCPU: 99.41% To be sure to get consistent results with KVM I disabled the hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and ethernet adapter interrupts (this resulted in runs that differed by only about 2% from lowest to highest). The fact that pinning is required to get consistent results is a different problem that we'll have to look into later... Here is the KVM baseline (average of six runs): Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec Exits: 148,444.58 Exits/Sec TxCPU: 2.40% RxCPU: 99.35% About 42% of baremetal. 
Can you add interrupt stats as well please? Yes I can. Just the guest interrupts for the virtio device? Guess so: tx and rx. empty. So I coded a quick patch to delay freeing of the used Tx buffers until more than half the ring was used (I did not test this under a stream condition so I don't know if this would have a negative impact). Here are the results from delaying the freeing of used Tx buffers (average of six runs): Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec Exits: 142,681.67 Exits/Sec TxCPU: 2.78% RxCPU: 99.36% About a 4% increase over baseline and about 44% of baremetal. Hmm, I am not sure what you mean by delaying freeing. In the start_xmit function of virtio_net.c the first thing done is to free any used entries from the ring. I patched the code to track the number of used tx ring entries and only free the used entries when they are greater than half the capacity of the ring (similar to the way the rx ring is re-filled). We don't even need that: just max skb frags + 2. Also we don't need to free them all: just enough to make room for max skb frags + 2 entries. I think we do have a problem that free_old_xmit_skbs tries to flush out the ring aggressively: it always polls until the ring is empty, so there could be bursts of activity where we spend a lot of time flushing the old entries before e.g. sending an ack, resulting in latency bursts. Generally we'll need some smarter logic, but with indirect at the moment we can just poll a single packet after we post a new one, and be done with it. Is your patch something like the patch below? Could you try mine as well please? Yes, I'll try the patch and post the results. This spread out the kick_notify but still resulted in a lot of them. I decided to build on the delayed Tx buffer freeing and code up an ethtool-like coalescing patch in order to delay the kick_notify until there were at least 5 packets on the ring or 2000 usecs, whichever occurred first. 
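Tom's delayed-freeing rule above — reclaim used TX entries only once more than half the ring is consumed, the way the rx ring refill batches work — can be sketched as a pure function. The ring size and names here are illustrative, not the actual virtio_net fields:

```c
#include <assert.h>

#define RING_SIZE 256  /* assumed TX ring size for this sketch */

/* Model of delayed TX buffer freeing: return how many used entries
 * start_xmit would reclaim given the count currently pending.
 * Below half capacity, do nothing and let the used entries accumulate;
 * past half capacity, reclaim them all in one batch. */
static unsigned int entries_to_reclaim(unsigned int used_pending)
{
    if (used_pending <= RING_SIZE / 2)
        return 0;            /* delay: batch the freeing work */
    return used_pending;     /* past half capacity: free everything */
}
```

The batching trades a little ring slack for fewer passes over the used ring per transmitted packet, which is where the ~4% gain in the numbers above comes from.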
Here are the results of delaying the kick_notify (average of six runs): Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec Exits: 102,587.28 Exits/Sec TxCPU: 3.03% RxCPU: 99.33% About a 23% increase over baseline and about 52% of baremetal. Running the perf command against the guest I noticed almost 19% of the time being spent in _raw_spin_lock. Enabling lockstat in the guest showed a lot of contention in the irq_desc_lock_class. Pinning the virtio1-input interrupt to a single CPU in the guest and re-running the last test resulted in tremendous gains (average of six runs): Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec Exits: 62,603.37 Exits/Sec TxCPU: 3.73% RxCPU: 98.52% About a 77% increase over
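The coalescing rule Tom describes — kick after at least 5 queued packets or 2000 usecs, whichever occurs first — reduces to a small predicate. The thresholds are the ones quoted in the thread; everything else is a hypothetical sketch, not the actual patch:

```c
#include <assert.h>
#include <stdbool.h>

#define KICK_PKT_THRESHOLD   5     /* kick after 5 unkicked packets... */
#define KICK_USEC_THRESHOLD  2000  /* ...or 2000 usecs, whichever first */

/* Decide whether the guest should kick_notify the host now, given how
 * many packets have been queued since the last kick and how long ago
 * the first of them was posted. */
static bool should_kick(unsigned int unkicked_pkts,
                        unsigned long usecs_since_first)
{
    return unkicked_pkts >= KICK_PKT_THRESHOLD ||
           usecs_since_first >= KICK_USEC_THRESHOLD;
}
```

The time bound is what keeps the scheme safe for latency: a lone packet still gets kicked within 2000 usecs even if four more never arrive.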
Re: Network performance with small packets - continued
On Wed, Mar 09, 2011 at 08:25:34AM -0800, Shirley Ma wrote: On Wed, 2011-03-09 at 18:10 +0200, Michael S. Tsirkin wrote:

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 82dba5a..4477b9a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
 	struct sk_buff *skb;
 	unsigned int len, tot_sgs = 0;

-	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+	if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 		pr_debug("Sent skb %p\n", skb);
 		vi->dev->stats.tx_bytes += skb->len;
 		vi->dev->stats.tx_packets++;
-		tot_sgs += skb_vnet_hdr(skb)->num_sg;
+		tot_sgs = 2+MAX_SKB_FRAGS;
 		dev_kfree_skb_any(skb);
 	}
 	return tot_sgs;
@@ -576,7 +576,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct virtnet_info *vi = netdev_priv(dev);
 	int capacity;

-	/* Free up any pending old buffers before queueing new ones. */
+	/* Free up any old buffers so we can queue new ones. */
 	free_old_xmit_skbs(vi);

 	/* Try to transmit */
@@ -605,6 +605,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 	skb_orphan(skb);
 	nf_reset(skb);

+	/* Free up any old buffers so we can queue new ones. */
+	if (capacity < 2+MAX_SKB_FRAGS)
+		capacity += free_old_xmit_skbs(vi);
+
 	/* Apparently nice girls don't return TX_BUSY; stop the queue
 	 * before it gets out of hand.  Naturally, this wastes entries. */
 	if (capacity < 2+MAX_SKB_FRAGS) {

I tried this one as well. It might improve TCP_RR performance but not TCP_STREAM. :) Let's wait for Tom's TCP_RR results. Thanks Shirley

I think your issues are with TX overrun. Besides delaying IRQ on TX, I don't have many ideas. The one interesting thing is that you see better speed if you drop packets. netdev crowd says this should not happen, so could be an indicator of a problem somewhere. 
-- MST
Re: Network performance with small packets - continued
On Wed, 2011-03-09 at 18:32 +0200, Michael S. Tsirkin wrote: I think your issues are with TX overrun. Besides delaying IRQ on TX, I don't have many ideas. The one interesting thing is that you see better speed if you drop packets. netdev crowd says this should not happen, so could be an indicator of a problem somewhere. Yes, I am looking at why the guest didn't see used_buffers on time from vhost send TX completion. I am trying to collect some data on vhost. I also wonder whether it's a scheduler issue. Thanks Shirley
Re: Network performance with small packets - continued
On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote: Vhost is receiving a lot of notifications for packets that are to be transmitted (over 60% of the packets generate a kick_notify). This is guest TX send notification when vhost enables notification. In TCP_STREAM test, vhost exits from reaching NAPI WEIGHT, it rarely enables the notification, vhost re-enters handle_tx from NAPI poll, so guest doesn't do much kick_notify. In multiple TCP_RR test, seems vhost exits from nothing to send in TX vq very often, so it enables notification most of the time. Shirley
Re: Network performance with small packets - continued
On Wed, Mar 09, 2011 at 08:51:33AM -0800, Shirley Ma wrote: On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote: Vhost is receiving a lot of notifications for packets that are to be transmitted (over 60% of the packets generate a kick_notify). This is guest TX send notification when vhost enables notification. In TCP_STREAM test, vhost exits from reaching NAPI WEIGHT, You mean virtio? it rarely enables the notification, vhost re-enters handle_tx from NAPI poll, Does NAPI really call handle_tx? Not rx? so guest doesn't do much kick_notify. In multiple TCP_RR test, seems vhost exits from nothing to send in TX vq very often, so it enables notification most of the time. Shirley
Re: Network performance with small packets - continued
On Wed, 2011-03-09 at 19:16 +0200, Michael S. Tsirkin wrote: On Wed, Mar 09, 2011 at 08:51:33AM -0800, Shirley Ma wrote: On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote: Vhost is receiving a lot of notifications for packets that are to be transmitted (over 60% of the packets generate a kick_notify). This is guest TX send notification when vhost enables notification. In TCP_STREAM test, vhost exits from reaching NAPI WEIGHT, You mean virtio? Sorry, I mixed up NAPI_WEIGHT and VHOST_NET_WEIGHT. I meant VHOST_NET_WEIGHT: vhost exits handle_tx() from reaching VHOST_NET_WEIGHT w/o enabling notification. it rarely enables the notification, vhost re-enters handle_tx from NAPI poll, Does NAPI really call handle_tx? Not rx? I meant for TX/RX, vhost re-enters handle_tx from vhost_poll_queue(), not from kick_notify. Shirley
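The two handle_tx exit paths Shirley distinguishes can be sketched as a simplified model: exiting on VHOST_NET_WEIGHT leaves notification disabled (vhost requeues itself via vhost_poll_queue), while exiting on an empty ring re-enables notification, so the guest's next send costs a kick. The loop below is a model with fixed-size packets, not the real drivers/vhost/net.c code:

```c
#include <assert.h>

enum tx_exit { EXIT_EMPTY, EXIT_WEIGHT };

#define VHOST_NET_WEIGHT 0x80000  /* max bytes per handle_tx pass */

/* Drain a ring holding n packets of pkt_len bytes each and report
 * which exit path handle_tx would take. */
static enum tx_exit handle_tx_model(unsigned long pkt_len, unsigned int n)
{
    unsigned long total_len = 0;
    unsigned int i;

    for (i = 0; i < n; i++) {
        total_len += pkt_len;
        if (total_len >= VHOST_NET_WEIGHT)
            return EXIT_WEIGHT;  /* vhost_poll_queue(); notify stays off */
    }
    return EXIT_EMPTY;           /* ring drained; re-enable notification */
}
```

With 256-byte TCP_RR messages, even 100 in-flight packets are far below the 512KB weight, so vhost keeps exiting on an empty ring and re-enabling notification — consistent with the 60%+ kick_notify rate Tom measured.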
Re: Network performance with small packets - continued
Here are the results again with the addition of the interrupt rate that occurred on the guest virtio_net device: Here is the KVM baseline (average of six runs): Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec Exits: 148,444.58 Exits/Sec TxCPU: 2.40% RxCPU: 99.35% Virtio1-input Interrupts/Sec (CPU0/CPU1): 5,154/5,222 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About 42% of baremetal. Delayed freeing of TX buffers (average of six runs): Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec Exits: 142,681.67 Exits/Sec TxCPU: 2.78% RxCPU: 99.36% Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,796/4,908 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 4% increase over baseline and about 44% of baremetal. Delaying kick_notify (kick every 5 packets - average of six runs): Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec Exits: 102,587.28 Exits/Sec TxCPU: 3.03% RxCPU: 99.33% Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,200/4,293 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 23% increase over baseline and about 52% of baremetal. Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs): Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec Exits: 62,603.37 Exits/Sec TxCPU: 3.73% RxCPU: 98.52% Virtio1-input Interrupts/Sec (CPU0/CPU1): 11,564/0 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 77% increase over baseline and about 74% of baremetal. On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote: On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote: We've been doing some more experimenting with the small packet network performance problem in KVM. I have a different setup than what Steve D. was using so I re-baselined things on the kvm.git kernel on both the host and guest with a 10GbE adapter. I also made use of the virtio-stats patch. 
The virtual machine has 2 vCPUs, 8GB of memory and two virtio network adapters (the first connected to a 1GbE adapter and a LAN, the second connected to a 10GbE adapter that is direct connected to another system with the same 10GbE adapter) running the kvm.git kernel. The test was a TCP_RR test with 100 connections from a baremetal client to the KVM guest using a 256 byte message size in both directions. I used the uperf tool to do this after verifying the results against netperf. Uperf allows the specification of the number of connections as a parameter in an XML file as opposed to launching, in this case, 100 separate instances of netperf. Here is the baseline for baremetal using 2 physical CPUs: Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec TxCPU: 7.88% RxCPU: 99.41% To be sure to get consistent results with KVM I disabled the hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and ethernet adapter interrupts (this resulted in runs that differed by only about 2% from lowest to highest). The fact that pinning is required to get consistent results is a different problem that we'll have to look into later... Here is the KVM baseline (average of six runs): Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec Exits: 148,444.58 Exits/Sec TxCPU: 2.40% RxCPU: 99.35% About 42% of baremetal. Can you add interrupt stats as well please? empty. So I coded a quick patch to delay freeing of the used Tx buffers until more than half the ring was used (I did not test this under a stream condition so I don't know if this would have a negative impact). Here are the results from delaying the freeing of used Tx buffers (average of six runs): Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec Exits: 142,681.67 Exits/Sec TxCPU: 2.78% RxCPU: 99.36% About a 4% increase over baseline and about 44% of baremetal. Hmm, I am not sure what you mean by delaying freeing. 
I think we do have a problem that free_old_xmit_skbs tries to flush out the ring aggressively: it always polls until the ring is empty, so there could be bursts of activity where we spend a lot of time flushing the old entries before e.g. sending an ack, resulting in latency bursts. Generally we'll need some smarter logic, but with indirect at the moment we can just poll a single packet after we post a new one, and be done with it. Is your patch something like the patch below? Could you try mine as well please? This spread out the kick_notify but still resulted in a lot of them. I decided to build on the delayed Tx buffer freeing and code up an ethtool-like coalescing patch in order to delay the kick_notify until there were at least 5 packets on the ring or 2000 usecs, whichever occurred first. Here are the results of delaying the kick_notify (average of six runs): Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec Exits: 102,587.28 Exits/Sec TxCPU: 3.03% RxCPU: 99.33% About a 23% increase over
Re: Network performance with small packets - continued
On Wed, Mar 09, 2011 at 02:11:07PM -0600, Tom Lendacky wrote: Here are the results again with the addition of the interrupt rate that occurred on the guest virtio_net device: Here is the KVM baseline (average of six runs): Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec Exits: 148,444.58 Exits/Sec TxCPU: 2.40% RxCPU: 99.35% Virtio1-input Interrupts/Sec (CPU0/CPU1): 5,154/5,222 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About 42% of baremetal. Delayed freeing of TX buffers (average of six runs): Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec Exits: 142,681.67 Exits/Sec TxCPU: 2.78% RxCPU: 99.36% Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,796/4,908 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 4% increase over baseline and about 44% of baremetal. Looks like delayed freeing is a good idea generally. Is this my patch? Yours? Delaying kick_notify (kick every 5 packets - average of six runs): Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec Exits: 102,587.28 Exits/Sec TxCPU: 3.03% RxCPU: 99.33% Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,200/4,293 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 23% increase over baseline and about 52% of baremetal. Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs): What exactly moves the interrupt handler between CPUs? irqbalancer? Does it matter which CPU you pin it to? If yes, do you have any idea why? Also, what happens without delaying kick_notify but with pinning? Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec Exits: 62,603.37 Exits/Sec TxCPU: 3.73% RxCPU: 98.52% Virtio1-input Interrupts/Sec (CPU0/CPU1): 11,564/0 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 77% increase over baseline and about 74% of baremetal. Hmm we get about 20 packets per interrupt on average. That's pretty decent. The problem is with exits. Let's try something adaptive in the host? 
-- MST
Re: Network performance with small packets - continued
Hello Tom, Do you also have Rusty's virtio stat patch results for both send queue and recv queue to share here? Thanks Shirley
Re: Network performance with small packets - continued
On Wednesday, March 09, 2011 10:09:26 am Tom Lendacky wrote: On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote: On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote: We've been doing some more experimenting with the small packet network performance problem in KVM. I have a different setup than what Steve D. was using so I re-baselined things on the kvm.git kernel on both the host and guest with a 10GbE adapter. I also made use of the virtio-stats patch. The virtual machine has 2 vCPUs, 8GB of memory and two virtio network adapters (the first connected to a 1GbE adapter and a LAN, the second connected to a 10GbE adapter that is direct connected to another system with the same 10GbE adapter) running the kvm.git kernel. The test was a TCP_RR test with 100 connections from a baremetal client to the KVM guest using a 256 byte message size in both directions. I used the uperf tool to do this after verifying the results against netperf. Uperf allows the specification of the number of connections as a parameter in an XML file as opposed to launching, in this case, 100 separate instances of netperf. Here is the baseline for baremetal using 2 physical CPUs: Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec TxCPU: 7.88% RxCPU: 99.41% To be sure to get consistent results with KVM I disabled the hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and ethernet adapter interrupts (this resulted in runs that differed by only about 2% from lowest to highest). The fact that pinning is required to get consistent results is a different problem that we'll have to look into later... Here is the KVM baseline (average of six runs): Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec Exits: 148,444.58 Exits/Sec TxCPU: 2.40% RxCPU: 99.35% About 42% of baremetal. Can you add interrupt stats as well please? Yes I can. Just the guest interrupts for the virtio device? empty. 
So I coded a quick patch to delay freeing of the used Tx buffers until more than half the ring was used (I did not test this under a stream condition so I don't know if this would have a negative impact). Here are the results from delaying the freeing of used Tx buffers (average of six runs): Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec Exits: 142,681.67 Exits/Sec TxCPU: 2.78% RxCPU: 99.36% About a 4% increase over baseline and about 44% of baremetal. Hmm, I am not sure what you mean by delaying freeing. In the start_xmit function of virtio_net.c the first thing done is to free any used entries from the ring. I patched the code to track the number of used tx ring entries and only free the used entries when they are greater than half the capacity of the ring (similar to the way the rx ring is re-filled). I think we do have a problem that free_old_xmit_skbs tries to flush out the ring aggressively: it always polls until the ring is empty, so there could be bursts of activity where we spend a lot of time flushing the old entries before e.g. sending an ack, resulting in latency bursts. Generally we'll need some smarter logic, but with indirect at the moment we can just poll a single packet after we post a new one, and be done with it. Is your patch something like the patch below? Could you try mine as well please? Yes, I'll try the patch and post the results. This spread out the kick_notify but still resulted in a lot of them. I decided to build on the delayed Tx buffer freeing and code up an ethtool-like coalescing patch in order to delay the kick_notify until there were at least 5 packets on the ring or 2000 usecs, whichever occurred first. Here are the results of delaying the kick_notify (average of six runs): Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec Exits: 102,587.28 Exits/Sec TxCPU: 3.03% RxCPU: 99.33% About a 23% increase over baseline and about 52% of baremetal. 
Running the perf command against the guest I noticed almost 19% of the time being spent in _raw_spin_lock. Enabling lockstat in the guest showed a lot of contention in the irq_desc_lock_class. Pinning the virtio1-input interrupt to a single CPU in the guest and re-running the last test resulted in tremendous gains (average of six runs): Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec Exits: 62,603.37 Exits/Sec TxCPU: 3.73% RxCPU: 98.52% About a 77% increase over baseline and about 74% of baremetal. Vhost is receiving a lot of notifications for packets that are to be transmitted (over 60% of the packets generate a kick_notify). Also, it looks like vhost is sending a lot of notifications for packets it has received before the guest can get scheduled to disable notifications and begin
Re: Network performance with small packets - continued
On Wednesday, March 09, 2011 04:45:12 pm Shirley Ma wrote: Hello Tom, Do you also have Rusty's virtio stat patch results for both send queue and recv queue to share here? Let me see what I can do about getting the data extracted, averaged and in a form that I can put in an email. Thanks Shirley
Re: Network performance with small packets - continued
On Wednesday, March 09, 2011 03:56:15 pm Michael S. Tsirkin wrote: On Wed, Mar 09, 2011 at 02:11:07PM -0600, Tom Lendacky wrote: Here are the results again with the addition of the interrupt rate that occurred on the guest virtio_net device: Here is the KVM baseline (average of six runs): Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec Exits: 148,444.58 Exits/Sec TxCPU: 2.40% RxCPU: 99.35% Virtio1-input Interrupts/Sec (CPU0/CPU1): 5,154/5,222 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About 42% of baremetal. Delayed freeing of TX buffers (average of six runs): Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec Exits: 142,681.67 Exits/Sec TxCPU: 2.78% RxCPU: 99.36% Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,796/4,908 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 4% increase over baseline and about 44% of baremetal. Looks like delayed freeing is a good idea generally. Is this my patch? Yours? These results are for my patch, I haven't had a chance to run your patch yet. Delaying kick_notify (kick every 5 packets -average of six runs): Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec Exits: 102,587.28 Exits/Sec TxCPU: 3.03% RxCPU: 99.33% Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,200/4,293 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 23% increase over baseline and about 52% of baremetal. Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs): What exactly moves the interrupt handler between CPUs? irqbalancer? Does it matter which CPU you pin it to? If yes, do you have any idea why? Looking at the guest, irqbalance isn't running and the smp_affinity for the irq is set to 3 (both CPUs). It could be that irqbalance would help in this situation since it would probably change the smp_affinity mask to a single CPU and remove the irq lock contention (I think the last used index patch would be best though since it will avoid the extra irq injections). I'll kick off a run with irqbalance running. 
As for which CPU the interrupt gets pinned to, that doesn't matter - see below. Also, what happens without delaying kick_notify but with pinning? Here are the results of a single baseline run with the IRQ pinned to CPU0: Txn Rate: 108,212.12 Txn/Sec, Pkt Rate: 214,994 Pkts/Sec Exits: 119,310.21 Exits/Sec TxCPU: 9.63% RxCPU: 99.47% Virtio1-input Interrupts/Sec (CPU0/CPU1): Virtio1-output Interrupts/Sec (CPU0/CPU1): and CPU1: Txn Rate: 108,053.02 Txn/Sec, Pkt Rate: 214,678 Pkts/Sec Exits: 119,320.12 Exits/Sec TxCPU: 9.64% RxCPU: 99.42% Virtio1-input Interrupts/Sec (CPU0/CPU1): 13,608/0 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/13,830 About a 24% increase over baseline. Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec Exits: 62,603.37 Exits/Sec TxCPU: 3.73% RxCPU: 98.52% Virtio1-input Interrupts/Sec (CPU0/CPU1): 11,564/0 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 77% increase over baseline and about 74% of baremetal. Hmm we get about 20 packets per interrupt on average. That's pretty decent. The problem is with exits. Let's try something adaptive in the host?
Re: Network performance with small packets - continued
On Wed, 2011-03-09 at 23:56 +0200, Michael S. Tsirkin wrote: Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec Exits: 62,603.37 Exits/Sec TxCPU: 3.73% RxCPU: 98.52% Virtio1-input Interrupts/Sec (CPU0/CPU1): 11,564/0 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 77% increase over baseline and about 74% of baremetal. Hmm we get about 20 packets per interrupt on average. That's pretty decent. The problem is with exits. Let's try something adaptive in the host? I did some hack before, for 32-64 multiple stream TCP_RR cases: either queue multiple skbs per kick or delay vhost exit from handle_tx. Both improved TCP_RR aggregation performance, but single TCP_RR latency increased. Here, the test is about 100 TCP_RR streams from a bare metal client to the KVM guest, so the kick_notify from the guest RX path should be small (every 1/2 ring size it does a kick, and even under that kick, vhost might have already disabled the notification). The kick_notify from the guest TX path seems to be the main cause of the guest's huge number of exits (it does a kick for every sent skb; under that kick vhost will most likely exit from an empty ring, not reaching VHOST_NET_WEIGHT). The indirect buffer is used, so I wonder how many packets are processed per handle_tx here? In theory, for lots of TCP_RR streams, the guest should be able to keep sending xmit skbs to the send vq, so vhost should be able to disable notification most of the time; then the number of guest exits should be significantly reduced. Why do we still see lots of guest exits here? Is it worth trying 256 (send queue size) TCP_RRs? Tom's kick_notify data from Rusty's patch would be helpful to understand what's going on here. Thanks Shirley
Re: Network performance with small packets - continued
On Wed, 2011-03-09 at 16:59 -0800, Shirley Ma wrote: In theory, for lots of TCP_RR streams, the guest should be able to keep sending xmit skbs to send vq, so vhost should be able to disable notification most of the time, then number of guest exits should be significantly reduced? Why we saw lots of guest exits here still? Is it worth to try 256 (send queue size) TCP_RRs? If these are single-transaction-at-a-time TCP_RRs rather than burst mode then the number may be something other than send queue size to keep it constantly active given the RTTs. In the bare iron world at least, that is one of the reasons I added the burst mode to the _RR test - because it could take a Very Large Number of concurrent netperfs to take a link to saturation, at which point it might have been just as much a context switching benchmark as anything else :) happy benchmarking, rick jones
Re: Network performance with small packets - continued
On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote: As for which CPU the interrupt gets pinned to, that doesn't matter - see below. So what hurts us the most is that the IRQ jumps between the VCPUs?
Re: Network performance with small packets
On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote: I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kinds of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Should we also add similar stats on the vhost vq for monitoring vhost_signal/vhost_notify? Shirley
Re: Network performance with small packets
On Tue, 2011-03-08 at 13:57 -0800, Shirley Ma wrote: On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote: I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kinds of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Should we also add similar stats on the vhost vq for monitoring vhost_signal/vhost_notify? Tom L has started using Rusty's patches and found some interesting results, sent yesterday: http://marc.info/?l=kvm&m=129953710930124&w=2 -Andrew Shirley
Re: Network performance with small packets - continued
Hi Tom, My two cents. Please look for [Chaks] snip Comparing the transmit path to the receive path, the guest disables notifications after the first kick and vhost re-enables notifications after completing processing of the tx ring. Can a similar thing be done for the receive path? Once vhost sends the first notification for a received packet it can disable notifications and let the guest re-enable notifications when it has finished processing the receive ring. Also, can the virtio-net driver do some adaptive polling (or does napi take care of that for the guest)? [Chaks] A better method is to have the producer generate the kick notifications only when the queue/ring transitions from empty to non-empty state. The consumer is not burdened with the task of reenabling the notifications. This of course assumes that notifications will never get lost. If loss of notifications is a possibility, producer can keep generating the notifications till guest signals (via some atomically manipulated memory variable) that it started consuming. The next notification will go out when the ring/queue again transitions from empty to non-empty state. Chaks
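A minimal sketch of the edge-triggered scheme Chaks proposes: the producer kicks only on the empty-to-non-empty transition, and the consumer never has to re-enable notifications itself. This assumes notifications are never lost; the queue model and names are hypothetical, not a real virtio ring:

```c
#include <assert.h>

/* Count the kicks a producer generates for n back-to-back enqueues when
 * it only notifies on the empty -> non-empty edge. The consumer is not
 * modeled: depth only grows, which is the worst case for kick savings. */
static unsigned int kicks_for_enqueues(unsigned int initial_depth,
                                       unsigned int n)
{
    unsigned int depth = initial_depth, kicks = 0, i;

    for (i = 0; i < n; i++) {
        if (depth == 0)
            kicks++;    /* queue was empty: consumer must be woken */
        depth++;        /* enqueue one item */
    }
    return kicks;
}
```

A burst of 100 sends into an idle queue costs exactly one kick instead of up to 100; the trade-off is the correctness dependence on notifications never being lost, which is why Chaks adds the fallback of repeated kicks until the consumer signals it has started consuming.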
Re: Network performance with small packets - continued
On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote: We've been doing some more experimenting with the small packet network performance problem in KVM. I have a different setup than what Steve D. was using so I re-baselined things on the kvm.git kernel on both the host and guest with a 10GbE adapter. I also made use of the virtio-stats patch. The virtual machine has 2 vCPUs, 8GB of memory and two virtio network adapters (the first connected to a 1GbE adapter and a LAN, the second connected to a 10GbE adapter that is direct connected to another system with the same 10GbE adapter) running the kvm.git kernel. The test was a TCP_RR test with 100 connections from a baremetal client to the KVM guest using a 256 byte message size in both directions. I used the uperf tool to do this after verifying the results against netperf. Uperf allows the specification of the number of connections as a parameter in an XML file as opposed to launching, in this case, 100 separate instances of netperf. Here is the baseline for baremetal using 2 physical CPUs: Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec TxCPU: 7.88% RxCPU: 99.41% To be sure to get consistent results with KVM I disabled the hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and ethernet adapter interrupts (this resulted in runs that differed by only about 2% from lowest to highest). The fact that pinning is required to get consistent results is a different problem that we'll have to look into later... Here is the KVM baseline (average of six runs): Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec Exits: 148,444.58 Exits/Sec TxCPU: 2.40% RxCPU: 99.35% About 42% of baremetal. Can you add interrupt stats as well please? empty. So I coded a quick patch to delay freeing of the used Tx buffers until more than half the ring was used (I did not test this under a stream condition so I don't know if this would have a negative impact). 
Here are the results from delaying the freeing of used Tx buffers (average of six runs): Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec Exits: 142,681.67 Exits/Sec TxCPU: 2.78% RxCPU: 99.36% About a 4% increase over baseline and about 44% of baremetal. Hmm, I am not sure what you mean by delaying freeing. I think we do have a problem that free_old_xmit_skbs tries to flush out the ring aggressively: it always polls until the ring is empty, so there could be bursts of activity where we spend a lot of time flushing the old entries before e.g. sending an ack, resulting in latency bursts. Generally we'll need some smarter logic, but with indirect at the moment we can just poll a single packet after we post a new one, and be done with it. Is your patch something like the patch below? Could you try mine as well please? This spread out the kick_notify but still resulted in a lot of them. I decided to build on the delayed Tx buffer freeing and code up an ethtool-like coalescing patch in order to delay the kick_notify until there were at least 5 packets on the ring or 2000 usecs, whichever occurred first. Here are the results of delaying the kick_notify (average of six runs): Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec Exits: 102,587.28 Exits/Sec TxCPU: 3.03% RxCPU: 99.33% About a 23% increase over baseline and about 52% of baremetal. Running the perf command against the guest I noticed almost 19% of the time being spent in _raw_spin_lock. Enabling lockstat in the guest showed a lot of contention in the irq_desc_lock_class. Pinning the virtio1-input interrupt to a single cpu in the guest and re-running the last test resulted in tremendous gains (average of six runs): Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec Exits: 62,603.37 Exits/Sec TxCPU: 3.73% RxCPU: 98.52% About a 77% increase over baseline and about 74% of baremetal.
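The kick coalescing described above (kick after at least 5 packets or 2000 usecs, whichever comes first) can be sketched as a small decision helper. The thresholds mirror the experiment; the struct and function names are illustrative, not Tom's actual patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of ethtool-like tx kick coalescing: delay kick_notify until
 * either KICK_MAX_PKTS packets are queued or KICK_MAX_USECS have
 * elapsed since the last kick. Illustrative names, not virtio API. */
#define KICK_MAX_PKTS	5
#define KICK_MAX_USECS	2000

struct kick_coalesce {
	unsigned int pending;	/* packets queued since the last kick */
	long long last_kick_us;	/* timestamp of the last kick */
};

/* Call once per packet added; returns true when a kick should be sent. */
static bool coalesce_add_packet(struct kick_coalesce *c, long long now_us)
{
	c->pending++;
	if (c->pending >= KICK_MAX_PKTS ||
	    now_us - c->last_kick_us >= KICK_MAX_USECS) {
		c->pending = 0;
		c->last_kick_us = now_us;
		return true;
	}
	return false;
}
```

Either condition firing resets both counters, so a slow trickle of packets still gets kicked out on the timer bound while a burst is batched on the packet bound.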
Vhost is receiving a lot of notifications for packets that are to be transmitted (over 60% of the packets generate a kick_notify). Also, it looks like vhost is sending a lot of notifications for packets it has received before the guest can get scheduled to disable notifications and begin processing the packets Hmm, is this really what happens to you? The effect would be that guest gets an interrupt while notifications are disabled in guest, right? Could you add a counter and check this please? Another possible thing to try would be these old patches to publish used index from guest to make sure this double interrupt does not happen: [PATCHv2] virtio: put last seen used index into ring itself [PATCHv2] vhost-net: utilize PUBLISH_USED_IDX feature resulting in some lock contention in the guest (and high interrupt rates). Some thoughts for the transmit path... can vhost be enhanced to do some adaptive polling so that the number of kick_notify events are reduced and replaced by
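The publish-used-index idea referenced above can be modeled the same way: the guest exposes how far it has consumed the used ring, and the host interrupts only when a completion crosses from "guest has seen everything" to "guest has unseen work". This is a toy model of the concept only, not the actual PUBLISH_USED_IDX patches; all names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of publishing the last-seen used index: the guest writes
 * last_seen_used into memory the host can read; the host raises an
 * interrupt only when the guest had already caught up (and so may
 * have gone idle and needs waking). */
struct used_publish {
	unsigned int used_idx;		/* host: completions posted */
	unsigned int last_seen_used;	/* guest: completions consumed */
};

/* Host side: post one completion; returns true if it should interrupt. */
static bool host_complete_buffer(struct used_publish *s)
{
	bool guest_caught_up = (s->last_seen_used == s->used_idx);

	s->used_idx++;
	return guest_caught_up;
}

/* Guest side: consume one completion and publish the new index. */
static void guest_consume_buffer(struct used_publish *s)
{
	if (s->last_seen_used != s->used_idx)
		s->last_seen_used++;
}
```

With this, a batch of completions arriving while the guest is still processing generates one interrupt rather than one per packet, which is exactly the double-interrupt case being diagnosed above.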
Re: Network performance with small packets - continued
On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote: I used the uperf tool to do this after verifying the results against netperf. Uperf allows the specification of the number of connections as a parameter in an XML file as opposed to launching, in this case, 100 separate instances of netperf. Could you post the XML on the list please? -- MST
Re: Network performance with small packets
On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. 
Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -7,6 +7,14 @@ config VIRTIO_RING
 	tristate
 	depends on VIRTIO
 
+config VIRTIO_STATS
+	bool "Virtio debugging stats (EXPERIMENTAL)"
+	depends on VIRTIO_RING
+	select DEBUG_FS
+	---help---
+	  Virtio stats collected by how full the ring is at any time,
+	  presented under debugfs/virtio/name-vq/num-used/
+
 config VIRTIO_PCI
 	tristate "PCI driver for virtio devices (EXPERIMENTAL)"
 	depends on PCI && EXPERIMENTAL
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -21,6 +21,7 @@
 #include <linux/virtio_config.h>
 #include <linux/device.h>
 #include <linux/slab.h>
+#include <linux/debugfs.h>
 
 /* virtio guest is communicating with a virtual "device" that actually runs on
  * a host processor. Memory barriers are used to control SMP effects. */
@@ -95,6 +96,11 @@ struct vring_virtqueue
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);
 
+#ifdef CONFIG_VIRTIO_STATS
+	struct vring_stat *stats;
+	struct dentry *statdir;
+#endif
+
 #ifdef DEBUG
 	/* They're supposed to lock for us. */
 	unsigned int in_use;
@@ -106,6 +112,87 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+#ifdef CONFIG_VIRTIO_STATS
+/* We have an array of these, indexed by how full the ring is. */
+struct vring_stat {
+	/* How many interrupts? */
+	size_t interrupt_nowork, interrupt_work;
+	/* How many non-notify kicks, how many notify kicks, how many add notify? */
+	size_t kick_no_notify, kick_notify, add_notify;
+	/* How many adds? */
+	size_t add_direct, add_indirect, add_fail;
+	/* How many gets? */
+	size_t get;
+	/* How many disable callbacks? */
+	size_t disable_cb;
+	/* How many enables? */
+	size_t enable_cb_retry, enable_cb_success;
+};
+
+static struct dentry *virtio_stats;
+
+static void create_stat_files(struct vring_virtqueue *vq)
+{
+	char name[80];
+	unsigned int i;
+
+	/* Racy in theory, but we don't care. */
+	if (!virtio_stats)
+		virtio_stats = debugfs_create_dir("virtio-stats", NULL);
+
+	sprintf(name, "%s-%s", dev_name(&vq->vq.vdev->dev), vq->vq.name);
+	vq->statdir = debugfs_create_dir(name, virtio_stats);
+
+	for (i = 0; i < vq->vring.num; i++) {
+		struct dentry *dir;
+
+		sprintf(name, "%i", i);
+		dir = debugfs_create_dir(name, vq->statdir);
+		debugfs_create_size_t("interrupt_nowork", 0400, dir,
+				      &vq->stats[i].interrupt_nowork);
+		debugfs_create_size_t("interrupt_work", 0400, dir,
+				      &vq->stats[i].interrupt_work);
+		debugfs_create_size_t("kick_no_notify", 0400, dir,
+				      &vq->stats[i].kick_no_notify);
+		debugfs_create_size_t("kick_notify", 0400, dir,
+				      &vq->stats[i].kick_notify);
+		debugfs_create_size_t("add_notify", 0400, dir,
+				      &vq->stats[i].add_notify);
+		debugfs_create_size_t("add_direct", 0400, dir,
+				      &vq->stats[i].add_direct);
+		debugfs_create_size_t("add_indirect", 0400, dir,
+
Re: Network performance with small packets
On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote: On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Not sure whether the intent is to merge this. If yes - would it make sense to use tracing for this instead? That's what kvm does. 
diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -7,6 +7,14 @@ config VIRTIO_RING
 	tristate
 	depends on VIRTIO
 
+config VIRTIO_STATS
+	bool "Virtio debugging stats (EXPERIMENTAL)"
+	depends on VIRTIO_RING
+	select DEBUG_FS
+	---help---
+	  Virtio stats collected by how full the ring is at any time,
+	  presented under debugfs/virtio/name-vq/num-used/
+
 config VIRTIO_PCI
 	tristate "PCI driver for virtio devices (EXPERIMENTAL)"
 	depends on PCI && EXPERIMENTAL
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -21,6 +21,7 @@
 #include <linux/virtio_config.h>
 #include <linux/device.h>
 #include <linux/slab.h>
+#include <linux/debugfs.h>
 
 /* virtio guest is communicating with a virtual "device" that actually runs on
  * a host processor. Memory barriers are used to control SMP effects. */
@@ -95,6 +96,11 @@ struct vring_virtqueue
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);
 
+#ifdef CONFIG_VIRTIO_STATS
+	struct vring_stat *stats;
+	struct dentry *statdir;
+#endif
+
 #ifdef DEBUG
 	/* They're supposed to lock for us. */
 	unsigned int in_use;
@@ -106,6 +112,87 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+#ifdef CONFIG_VIRTIO_STATS
+/* We have an array of these, indexed by how full the ring is. */
+struct vring_stat {
+	/* How many interrupts? */
+	size_t interrupt_nowork, interrupt_work;
+	/* How many non-notify kicks, how many notify kicks, how many add notify? */
+	size_t kick_no_notify, kick_notify, add_notify;
+	/* How many adds? */
+	size_t add_direct, add_indirect, add_fail;
+	/* How many gets? */
+	size_t get;
+	/* How many disable callbacks? */
+	size_t disable_cb;
+	/* How many enables? */
+	size_t enable_cb_retry, enable_cb_success;
+};
+
+static struct dentry *virtio_stats;
+
+static void create_stat_files(struct vring_virtqueue *vq)
+{
+	char name[80];
+	unsigned int i;
+
+	/* Racy in theory, but we don't care. */
+	if (!virtio_stats)
+		virtio_stats = debugfs_create_dir("virtio-stats", NULL);
+
+	sprintf(name, "%s-%s", dev_name(&vq->vq.vdev->dev), vq->vq.name);
+	vq->statdir = debugfs_create_dir(name, virtio_stats);
+
+	for (i = 0; i < vq->vring.num; i++) {
+		struct dentry *dir;
+
+		sprintf(name, "%i", i);
+		dir = debugfs_create_dir(name, vq->statdir);
+		debugfs_create_size_t("interrupt_nowork", 0400, dir,
+				      &vq->stats[i].interrupt_nowork);
+		debugfs_create_size_t("interrupt_work", 0400, dir,
+				      &vq->stats[i].interrupt_work);
+		debugfs_create_size_t("kick_no_notify", 0400, dir,
+				      &vq->stats[i].kick_no_notify);
+		debugfs_create_size_t("kick_notify", 0400, dir,
+				      &vq->stats[i].kick_notify);
+		debugfs_create_size_t("add_notify", 0400, dir,
+
Re: Network performance with small packets
On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote: On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote: On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Not sure whether the intent is to merge this. If yes - would it make sense to use tracing for this instead? That's what kvm does. Intent wasn't; I've not used tracepoints before, but maybe we should consider a longer-term monitoring solution? Patch welcome! Cheers, Rusty.
Re: Network performance with small packets
On Wed, Feb 09, 2011 at 12:09:35PM +1030, Rusty Russell wrote: On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote: On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote: On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Not sure whether the intent is to merge this. If yes - would it make sense to use tracing for this instead? That's what kvm does. Intent wasn't; I've not used tracepoints before, but maybe we should consider a longer-term monitoring solution? Patch welcome! Cheers, Rusty. Sure, I'll look into this. 
-- MST
Re: Network performance with small packets
On Wed, Feb 9, 2011 at 1:55 AM, Michael S. Tsirkin m...@redhat.com wrote: On Wed, Feb 09, 2011 at 12:09:35PM +1030, Rusty Russell wrote: On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote: On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote: On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ... Current guest kernel use indirect buffers, num_free returns how many available descriptors not skb frags. So it's wrong here. Shirley I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? I've finally read this thread... I think we need to get more serious with our stats gathering to diagnose these kind of performance issues. This is a start; it should tell us what is actually happening to the virtio ring(s) without significant performance impact... Subject: virtio: CONFIG_VIRTIO_STATS For performance problems we'd like to know exactly what the ring looks like. This patch adds stats indexed by how-full-ring-is; we could extend it to also record them by how-used-ring-is if we need. Signed-off-by: Rusty Russell ru...@rustcorp.com.au Not sure whether the intent is to merge this. If yes - would it make sense to use tracing for this instead? That's what kvm does. Intent wasn't; I've not used tracepoints before, but maybe we should consider a longer-term monitoring solution? Patch welcome! Cheers, Rusty. Sure, I'll look into this. 
There are several virtio trace events already in QEMU today (see the trace-events file):

virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) "vq %p elem %p len %u idx %u"
virtqueue_flush(void *vq, unsigned int count) "vq %p count %u"
virtqueue_pop(void *vq, void *elem, unsigned int in_num, unsigned int out_num) "vq %p elem %p in_num %u out_num %u"
virtio_queue_notify(void *vdev, int n, void *vq) "vdev %p n %d vq %p"
virtio_irq(void *vq) "vq %p"
virtio_notify(void *vdev, void *vq) "vdev %p vq %p"

These can be used by building QEMU with a suitable tracing backend like SystemTap (see docs/tracing.txt). Inside the guest I've used dynamic ftrace in the past, although static tracepoints would be nice. Stefan
Re: Network performance with small packets
On Thu, 2011-02-03 at 08:13 +0200, Michael S. Tsirkin wrote: Initial TCP_STREAM performance results I got for guest to local host 4.2Gb/s for 1K message size, (vs. 2.5Gb/s) 6.2Gb/s for 2K message size, and (vs. 3.8Gb/s) 9.8Gb/s for 4K message size. (vs.5.xGb/s) What is the average packet size, # bytes per ack, and the # of interrupts per packet? It could be that just slowing down transmission makes GSO work better. There are no TX interrupts with dropping packets. GSO/TSO is the key for small message performance; w/o GSO/TSO, the performance is limited to about 2Gb/s no matter how big the message size is. I think any work we try here will increase the large packet size rate. BTW, with dropping packets, TCP increased fast retransmits, not slow start. I will collect tcpdump and netstat data before and after to compare packet size/rate w/o and w/i the patch. Thanks Shirley
Re: Network performance with small packets
On Thu, Feb 03, 2011 at 07:58:00AM -0800, Shirley Ma wrote: On Thu, 2011-02-03 at 08:13 +0200, Michael S. Tsirkin wrote: Initial TCP_STREAM performance results I got for guest to local host 4.2Gb/s for 1K message size, (vs. 2.5Gb/s) 6.2Gb/s for 2K message size, and (vs. 3.8Gb/s) 9.8Gb/s for 4K message size. (vs.5.xGb/s) What is the average packet size, # bytes per ack, and the # of interrupts per packet? It could be that just slowing down transmission makes GSO work better. There are no TX interrupts with dropping packets. GSO/TSO is the key for small message performance; w/o GSO/TSO, the performance is limited to about 2Gb/s no matter how big the message size is. I think any work we try here will increase the large packet size rate. BTW, with dropping packets, TCP increased fast retransmits, not slow start. I will collect tcpdump and netstat data before and after to compare packet size/rate w/o and w/i the patch. Thanks Shirley Just a thought: does it help to make tx queue len of the virtio device smaller? E.g. match the vq size? -- MST
Re: Network performance with small packets
On Thu, 2011-02-03 at 18:20 +0200, Michael S. Tsirkin wrote: Just a thought: does it help to make tx queue len of the virtio device smaller? Yes, that's what I did before; reducing txqueuelen causes qdisc to drop packets early. But it's hard to control for performance gain by using tx queuelen: I tried on different systems, and it required different values. Also, I tried another patch: instead of dropping packets, I used a timer (2 jiffies) to enable/disable the queue on the guest without interrupt notification. It gets better performance than the original, but worse performance than dropping packets because netif stop/wake-up happens too often. vhost definitely needs improvement for handling small message sizes. It's unable to handle the small message packet rate for queue size 256, even with ring size 1024. QEMU doesn't seem to allow increasing the TX ring size to 2K (qemu-kvm fails to start with no errors), so I am not able to test that. Thanks Shirley
Re: Network performance with small packets
On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote: Yes, I think doing this in the host is much simpler, just send an interrupt after there's a decent amount of space in the queue. Having said that the simple heuristic that I coded might be a bit too simple. From the debugging output, what I have seen so far (a single small message TCP_STREAM test), I think the right approach is to patch both guest and vhost. The problem I have found is a regression for the single small message TCP_STREAM test. The old kernel works well for TCP_STREAM; only the new kernel has the problem. For Steven's problem, it's a multiple stream TCP_RR issue: the old guest doesn't perform well, and neither does the new guest kernel. We tested the reduced vhost signaling patch before; it didn't help the performance at all. Thanks Shirley
Re: Network performance with small packets
On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote: On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote: w/i guest change, I played around the parameters, for example: I could get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message size, w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU usage. I meant w/o guest change, only vhost changes. Sorry about that. Shirley Ah, excellent. What were the parameters? I used half of the ring size, 129, for the packet counters, but the performance is still not as good as dropping packets on the guest, 3.7 Gb/s vs. 6.2Gb/s. Shirley
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote: Yes, I think doing this in the host is much simpler, just send an interrupt after there's a decent amount of space in the queue. Having said that the simple heuristic that I coded might be a bit too simple. From the debugging output, what I have seen so far (a single small message TCP_STREAM test), I think the right approach is to patch both guest and vhost. One problem is slowing down the guest helps here. So there's a chance that just by adding complexity in guest driver we get a small improvement :( We can't rely on a patched guest anyway, so I think it is best to test guest and host changes separately. And I do agree something needs to be done in guest too, for example when vqs share an interrupt, we might invoke a callback when we see vq is not empty even though it's not requested. Probably should check interrupts enabled here? The problem I have found is a regression for the single small message TCP_STREAM test. The old kernel works well for TCP_STREAM; only the new kernel has the problem. Likely new kernel is faster :) For Steven's problem, it's a multiple stream TCP_RR issue: the old guest doesn't perform well, and neither does the new guest kernel. We tested the reduced vhost signaling patch before; it didn't help the performance at all. Thanks Shirley Yes, it seems unrelated to tx interrupts. -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote: On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote: w/i guest change, I played around the parameters, for example: I could get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message size, w/i dropping packet, I was able to get up to 6.2Gb/s with similar CPU usage. I meant w/o guest change, only vhost changes. Sorry about that. Shirley Ah, excellent. What were the parameters? I used half of the ring size, 129, for the packet counters, but the performance is still not as good as dropping packets on the guest, 3.7 Gb/s vs. 6.2Gb/s. Shirley And this is with sndbuf=0 in host, yes? And do you see a lot of tx interrupts? How many packets per interrupt? -- MST
Re: Network performance with small packets
On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote: Yes, I think doing this in the host is much simpler, just send an interrupt after there's a decent amount of space in the queue. Having said that the simple heuristic that I coded might be a bit too simple. From the debugging output, what I have seen so far (a single small message TCP_STREAM test), I think the right approach is to patch both guest and vhost. One problem is slowing down the guest helps here. So there's a chance that just by adding complexity in guest driver we get a small improvement :( We can't rely on a patched guest anyway, so I think it is best to test guest and host changes separately. And I do agree something needs to be done in guest too, for example when vqs share an interrupt, we might invoke a callback when we see vq is not empty even though it's not requested. Probably should check interrupts enabled here? Yes, I modified xmit callback something like below:

static void skb_xmit_done(struct virtqueue *svq)
{
	struct virtnet_info *vi = svq->vdev->priv;

	/* Suppress further interrupts. */
	virtqueue_disable_cb(svq);

	/* We were probably waiting for more output buffers. */
	if (netif_queue_stopped(vi->dev)) {
		free_old_xmit_skbs(vi);
		if (virtqueue_free_size(svq) <= svq->vring.num / 2) {
			virtqueue_enable_cb(svq);
			return;
		}
	}
	netif_wake_queue(vi->dev);
}

The problem I have found is a regression for the single small message TCP_STREAM test. The old kernel works well for TCP_STREAM; only the new kernel has the problem. Likely new kernel is faster :) For Steven's problem, it's a multiple stream TCP_RR issue: the old guest doesn't perform well, and neither does the new guest kernel. We tested the reduced vhost signaling patch before; it didn't help the performance at all. Thanks Shirley Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. 
Do you have anything in mind on how to reduce vhost latency? Thanks Shirley
Re: Network performance with small packets
On Wed, 2011-02-02 at 17:48 +0200, Michael S. Tsirkin wrote: And this is with sndbuf=0 in host, yes? And do you see a lot of tx interrupts? How many packets per interrupt? Nope, sndbuf doesn't matter since I never hit the sock wmem condition in vhost. I am still playing around, let me know what data you would like to collect. Thanks Shirley
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 09:10:35AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote: Yes, I think doing this in the host is much simpler, just send an interrupt after there's a decent amount of space in the queue. Having said that, the simple heuristic that I coded might be a bit too simple. From the debugging output I have seen so far (a single small message TCP_STREAM test), I think the right approach is to patch both guest and vhost. One problem is that slowing down the guest helps here. So there's a chance that just by adding complexity in the guest driver we get a small improvement :( We can't rely on a patched guest anyway, so I think it is best to test guest and host changes separately. And I do agree something needs to be done in the guest too, for example when vqs share an interrupt, we might invoke a callback when we see the vq is not empty even though it's not requested. Probably should check interrupts enabled here? Yes, I modified the xmit callback to something like below:

static void skb_xmit_done(struct virtqueue *svq)
{
	struct virtnet_info *vi = svq->vdev->priv;

	/* Suppress further interrupts. */
	virtqueue_disable_cb(svq);

	/* We were probably waiting for more output buffers. */
	if (netif_queue_stopped(vi->dev)) {
		free_old_xmit_skbs(vi);
		if (virtqueue_free_size(svq) < svq->vring.num / 2) {
			virtqueue_enable_cb(svq);
			return;
		}
	}
	netif_wake_queue(vi->dev);
}

OK, but this should have no effect with a vhost patch, which should ensure that we don't get an interrupt until the queue is at least half empty. Right? The problem I have found is a regression for the single small message TCP_STREAM test. The old kernel works well for TCP_STREAM; only the new kernel has the problem. Likely the new kernel is faster :) For Steven's problem, it's a multiple stream TCP_RR issue; the old guest doesn't perform well, and neither does the new guest kernel.
We tested a patch reducing vhost signaling before; it didn't help the performance at all. Thanks Shirley Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. Could be. Why do you think so? Do you have anything in mind on how to reduce vhost latency? Thanks Shirley Hmm, bypassing the bridge might help a bit. Are you using tap+bridge or macvtap?
Re: Network performance with small packets
On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote: OK, but this should have no effect with a vhost patch which should ensure that we don't get an interrupt until the queue is at least half empty. Right? There should be some coordination between guest and vhost. We shouldn't count the TX packets when the netif queue is enabled, since the next guest TX xmit will free any used buffers in vhost. We need to be careful here in case we miss interrupts when the netif queue has stopped. However we can't change old guests, so we can test the patches separately for guest only, vhost only, and the combination. Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. Could be. Why do you think so? Since I played with the latency hack, I can see the performance difference for different latencies. Do you have anything in mind on how to reduce vhost latency? Thanks Shirley Hmm, bypassing the bridge might help a bit. Are you using tap+bridge or macvtap? I am using tap+bridge for the TCP_RR test; I think Steven tested macvtap before. He might have some data from his workload performance measurement. Shirley
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote: On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote: w/i guest change, I played around with the parameters, for example: I could get 3.7Gb/s with a 42% CPU BW increase, up from 2.5Gb/s for 1K message size; with dropping packets, I was able to get up to 6.2Gb/s with similar CPU usage. I meant w/o guest change, only vhost changes. Sorry about that. Shirley Ah, excellent. What were the parameters? I used half of the ring size, 129, for the packet counters, but the performance is still not as good as dropping packets on the guest: 3.7Gb/s vs. 6.2Gb/s. Shirley How many packets and bytes per interrupt are sent? Also, what about other values for the counters, and other counters? What does your patch do? Just drop packets instead of stopping the interface? To understand when we should drop packets in the guest, we need to know *why* it helps. Otherwise, how do we know it will work for others? Note that qdisc will drop packets when it overruns - so what is different? Also, are we over-running some other queue somewhere? -- MST
Re: Network performance with small packets
On Wed, 2011-02-02 at 20:20 +0200, Michael S. Tsirkin wrote: How many packets and bytes per interrupt are sent? Also, what about other values for the counters, and other counters? What does your patch do? Just drop packets instead of stopping the interface? To understand when we should drop packets in the guest, we need to know *why* it helps. Otherwise, how do we know it will work for others? Note that qdisc will drop packets when it overruns - so what is different? Also, are we over-running some other queue somewhere? Agreed. I am trying to put in more debugging output to look for all these answers. Shirley
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote: OK, but this should have no effect with a vhost patch which should ensure that we don't get an interrupt until the queue is at least half empty. Right? There should be some coordination between guest and vhost. What kind of coordination? With a patched vhost and a full ring, you should get an interrupt per 100 packets. Is this what you see? And if yes, isn't the guest patch doing nothing then? We shouldn't count the TX packets when the netif queue is enabled, since the next guest TX xmit will free any used buffers in vhost. We need to be careful here in case we miss interrupts when the netif queue has stopped. However we can't change old guests, so we can test the patches separately for guest only, vhost only, and the combination. Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. Could be. Why do you think so? Since I played with the latency hack, I can see the performance difference for different latencies. Which hack was that? Do you have anything in mind on how to reduce vhost latency? Thanks Shirley Hmm, bypassing the bridge might help a bit. Are you using tap+bridge or macvtap? I am using tap+bridge for the TCP_RR test; I think Steven tested macvtap before. He might have some data from his workload performance measurement. Shirley
Re: Network performance with small packets
On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote: I am working on a KVM network performance issue found in our lab running the DayTrader benchmark. The benchmark throughput takes a significant hit when running the application server in a KVM guest versus on bare metal. We have dug into the problem and found that DayTrader's use of small packets exposes KVM's overhead of handling network packets. I have been able to reproduce the performance hit with a simpler setup using the netperf benchmark with the TCP_RR test and the request and response sizes set to 256 bytes. I run the benchmark between two physical systems, each using a 1Gb link. In order to get the maximum throughput for the system I have to run 100 instances of netperf. When I run the netserver processes in a guest, I see a maximum throughput that is 51% of what I get if I run the netserver processes directly on the host. The CPU utilization in the guest is only 85% at maximum throughput, whereas it is 100% on bare metal. You are stressing the scheduler pretty hard with this test :) Is your real benchmark also using a huge number of threads? If it's not, you might be seeing a different issue. IOW, the netperf degradation might not be network-related at all, but might have to do with the speed of context switches in the guest. Thoughts? The KVM host has 16 CPUs. The KVM guest is configured with 2 VCPUs. When I run netperf on the host I boot the host with maxcpus=2 on the kernel command line. The host is running the current KVM upstream kernel along with the current upstream qemu.
Here is the qemu command used to launch the guest: /build/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 -name glasgow-RH60 -m 32768 -drive file=/build/guest-data/glasgow-RH60.img,if=virtio,index=0,boot=on -drive file=/dev/virt/WAS,if=virtio,index=1 -net nic,model=virtio,vlan=3,macaddr=00:1A:64:E5:00:63,netdev=nic0 -netdev tap,id=nic0,vhost=on -smp 2 -vnc :1 -monitor telnet::4499,server,nowait -serial telnet::8899,server,nowait --mem-path /libhugetlbfs -daemonize We have tried various proposed fixes, each with varying amounts of success. One such fix was to add code to the vhost thread such that when it found the work queue empty it wouldn't just exit the thread but rather would delay for 50 microseconds and then recheck the queue. If there was work on the queue it would loop back and process it, else it would exit the thread. The change got us a 13% improvement in the DayTrader throughput. Running the same netperf configuration on the same hardware but using a different hypervisor gets us significantly better throughput numbers. The guest on that hypervisor runs at 100% CPU utilization. The various fixes we have tried have not gotten us close to the throughput seen on the other hypervisor. I'm looking for ideas/input from the KVM experts on how to make KVM perform better when handling small packets. Thanks, Steve
Re: Network performance with small packets
Michael S. Tsirkin m...@redhat.com wrote on 02/02/2011 12:38:47 PM: On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote: I am working on a KVM network performance issue found in our lab running the DayTrader benchmark. The benchmark throughput takes a significant hit when running the application server in a KVM guest versus on bare metal. We have dug into the problem and found that DayTrader's use of small packets exposes KVM's overhead of handling network packets. I have been able to reproduce the performance hit with a simpler setup using the netperf benchmark with the TCP_RR test and the request and response sizes set to 256 bytes. I run the benchmark between two physical systems, each using a 1Gb link. In order to get the maximum throughput for the system I have to run 100 instances of netperf. When I run the netserver processes in a guest, I see a maximum throughput that is 51% of what I get if I run the netserver processes directly on the host. The CPU utilization in the guest is only 85% at maximum throughput, whereas it is 100% on bare metal. You are stressing the scheduler pretty hard with this test :) Is your real benchmark also using a huge number of threads? Yes. The real benchmark has 60 threads handling client requests and 48 threads talking to a database server. If it's not, you might be seeing a different issue. IOW, the netperf degradation might not be network-related at all, but have to do with speed of context switch in guest. Thoughts? Yes, context switches can add to the overhead. We have that data captured, and I can look at it. What makes me think that's not the issue is that the CPU utilization in the guest is only about 85% at maximum throughput. Throughput/CPU is comparable to a different hypervisor, but that hypervisor runs at full CPU utilization and gets better throughput. I can't help but think KVM would get better throughput if it could just keep the guest VCPUs busy.
Recently I have been playing with different CPU pinnings for the guest VCPUs and the vhost thread. Certain combinations can get us up to a 35% improvement in throughput with the same throughput/CPU ratio. CPU utilization was 94% -- not full CPU utilization, but it does illustrate that we can get better throughput if we keep the guest VCPUs busy. At this point it's looking more like a scheduler issue. We're starting to dig through the scheduler code for clues. Steve D.
Re: Network performance with small packets
On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote: OK, but this should have no effect with a vhost patch which should ensure that we don't get an interrupt until the queue is at least half empty. Right? There should be some coordination between guest and vhost. What kind of coordination? With a patched vhost and a full ring, you should get an interrupt per 100 packets. Is this what you see? And if yes, isn't the guest patch doing nothing then? vhost_signal won't be able to send any TX interrupts to the guest when the guest TX interrupt is disabled. The guest TX interrupt is only enabled when running out of descriptors. We shouldn't count the TX packets when the netif queue is enabled, since the next guest TX xmit will free any used buffers in vhost. We need to be careful here in case we miss interrupts when the netif queue has stopped. However we can't change old guests, so we can test the patches separately for guest only, vhost only, and the combination. Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. Could be. Why do you think so? Since I played with the latency hack, I can see the performance difference for different latencies. Which hack was that? I tried to accumulate multiple guest to host notifications for TX xmits; it did help multiple streams TCP_RR results. I also forced vhost handle_tx to handle more packets; both hacks seemed to help. Thanks Shirley
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 11:29:35AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote: On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote: OK, but this should have no effect with a vhost patch which should ensure that we don't get an interrupt until the queue is at least half empty. Right? There should be some coordination between guest and vhost. What kind of coordination? With a patched vhost, and a full ring. you should get an interrupt per 100 packets. Is this what you see? And if yes, isn't the guest patch doing nothing then? vhost_signal won't be able send any TX interrupts to guest when guest TX interrupt is disabled. Guest TX interrupt is only enabled when running out of descriptors. Well, this is also the only case where the queue is stopped, no? We shouldn't count the TX packets when netif queue is enabled since next guest TX xmit will free any used buffers in vhost. We need to be careful here in case we miss the interrupts when netif queue has stopped. However we can't change old guest so we can test the patches separately for guest only, vhost only, and the combination. Yes, it seems unrelated to tx interrupts. The issue is more likely related to latency. Could be. Why do you think so? Since I played with latency hack, I can see performance difference for different latency. Which hack was that? I tried to accumulate multiple guest to host notifications for TX xmits, it did help multiple streams TCP_RR results; I don't see a point to delay used idx update, do you? So delaying just signal seems better, right? I also forced vhost handle_tx to handle more packets; both hack seemed help. Thanks Shirley Haven't noticed that part, how does your patch make it handle more packets? 
-- MST
Re: Network performance with small packets
On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote: Well, this is also the only case where the queue is stopped, no? Yes. I got some debugging data; I saw that sometimes there were many packets waiting to be freed in the guest between vhost_signal and the guest xmit callback. Looks like too much time is spent between vhost_signal and the guest xmit callback? I tried to accumulate multiple guest to host notifications for TX xmits; it did help multiple streams TCP_RR results. I don't see a point to delay the used idx update, do you? It might cause vhost handle_tx to process more packets per call. So delaying just the signal seems better, right? I think I need to define the test matrix to collect data for TX xmit from guest to host here for different tests.

Data to be collected:
1. kvm_stat for VM, I/O exits
2. cpu utilization for both guest and host
3. cat /proc/interrupts on guest
4. packet rate from vhost handle_tx per loop
5. guest netif queue stop rate
6. how many packets are waiting to be freed between vhost signaling and the guest callback
7. performance results

Tests:
1. TCP_STREAM single stream test for 1K to 4K message sizes
2. TCP_RR (64 instance test): 128 - 1K request/response sizes

Different hacks:
1. Baseline data (with the patch to fix the capacity check first; free_old_xmit_skbs returns number of skbs)
2. Drop packet data (will put some debugging in generic networking code)
3. Delay guest netif queue wake up until certain descriptors (1/2 ring size, 1/4 ring size...) are available once the queue has stopped
4. Accumulate more packets per vhost signal in handle_tx?
5. 3 & 4 combinations
6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer?
7. Accumulate more packets per vhost handle_tx() by adding some delay?

Haven't noticed that part, how does your patch make it handle more packets? Added a delay in handle_tx(). What else? It would take some time to do this.
Shirley
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 01:03:05PM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote: Well, this is also the only case where the queue is stopped, no? Yes. I got some debugging data; I saw that sometimes there were many packets waiting to be freed in the guest between vhost_signal and the guest xmit callback. What does this mean? Looks like too much time is spent between vhost_signal and the guest xmit callback? I tried to accumulate multiple guest to host notifications for TX xmits; it did help multiple streams TCP_RR results. I don't see a point to delay the used idx update, do you? It might cause vhost handle_tx to process more packets per call. I don't understand. It's a couple of writes - what is the issue? So delaying just the signal seems better, right? I think I need to define the test matrix to collect data for TX xmit from guest to host here for different tests.

Data to be collected:
1. kvm_stat for VM, I/O exits
2. cpu utilization for both guest and host
3. cat /proc/interrupts on guest
4. packet rate from vhost handle_tx per loop
5. guest netif queue stop rate
6. how many packets are waiting to be freed between vhost signaling and the guest callback
7. performance results

Tests:
1. TCP_STREAM single stream test for 1K to 4K message sizes
2. TCP_RR (64 instance test): 128 - 1K request/response sizes

Different hacks:
1. Baseline data (with the patch to fix the capacity check first; free_old_xmit_skbs returns number of skbs)
2. Drop packet data (will put some debugging in generic networking code)
3. Delay guest netif queue wake up until certain descriptors (1/2 ring size, 1/4 ring size...) are available once the queue has stopped
4. Accumulate more packets per vhost signal in handle_tx?
5. 3 & 4 combinations
6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer?
7. Accumulate more packets per vhost handle_tx() by adding some delay?

Haven't noticed that part, how does your patch make it handle more packets? Added a delay in handle_tx().
What else? It would take some time to do this. Shirley Need to think about this.
Re: Network performance with small packets
On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote: On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote: Well, this is also the only case where the queue is stopped, no? Yes. I got some debugging data; I saw that sometimes there were many packets waiting to be freed in the guest between vhost_signal and the guest xmit callback. What does this mean? Let's look at the sequence here:

guest start_xmit()
	xmit_skb()
	if ring is full, enable_cb()

guest skb_xmit_done()
	disable_cb, printk free_old_xmit_skbs -- it was between more than 1/2 to full ring size
	printk vq->num_free

vhost handle_tx()
	if (guest interrupt is enabled) signal guest to free xmit buffers

So between the guest's queue full/stopped queue/enable callback and the guest receiving the callback from the host to free_old_xmit_skbs, there were about 1/2 to a full ring of descriptors available. I thought there were only a few. (I disabled your vhost patch for this test.) Looks like too much time is spent between vhost_signal and the guest xmit callback? I tried to accumulate multiple guest to host notifications for TX xmits; it did help multiple streams TCP_RR results. I don't see a point to delay the used idx update, do you? It might cause vhost handle_tx to process more packets per call. I don't understand. It's a couple of writes - what is the issue? Oh, handle_tx could process more packets per loop for the multiple streams TCP_RR case. I need to print out the data rate per loop to confirm this. Shirley
Re: Network performance with small packets
On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote: I think I need to define the test matrix to collect data for TX xmit from guest to host here for different tests.

Data to be collected:
1. kvm_stat for VM, I/O exits
2. cpu utilization for both guest and host
3. cat /proc/interrupts on guest
4. packet rate from vhost handle_tx per loop
5. guest netif queue stop rate
6. how many packets are waiting to be freed between vhost signaling and the guest callback
7. performance results

Tests:
1. TCP_STREAM single stream test for 1K to 4K message sizes
2. TCP_RR (64 instance test): 128 - 1K request/response sizes

Different hacks:
1. Baseline data (with the patch to fix the capacity check first; free_old_xmit_skbs returns number of skbs)
2. Drop packet data (will put some debugging in generic networking code)

Since I found that the netif queue stop/wake up is so expensive, I created a packet-dropping patch on the guest side so I don't need to debug generic networking code:

guest start_xmit()
	capacity = free_old_xmit_skb() + virtqueue_get_num_freed()
	if (capacity == 0)
		drop this packet and return;

In the patch, both guest TX interrupts and the callback have been omitted. Host vhost_signal in handle_tx can be removed entirely as well. (A new virtio_ring API is needed for exporting the total number of free descriptors here -- virtqueue_get_num_freed.) Initial TCP_STREAM performance results I got for guest to local host: 4.2Gb/s for 1K message size (vs. 2.5Gb/s), 6.2Gb/s for 2K message size (vs. 3.8Gb/s), and 9.8Gb/s for 4K message size (vs. 5.xGb/s). Since the large message size (64K) doesn't hit the (capacity == 0) case, its performance is only a little better (from 13.xGb/s to 14.xGb/s). kvm_stat output shows a significant exit reduction for both VM and I/O, and no guest TX interrupts. With dropping packets, TCP retransmissions have increased here, so the performance numbers vary.
This might not be a good solution, but it gave us some ideas about the expensive netif queue stop/wake up between guest and host notification. I couldn't find a better solution for reducing the netif queue stop/wake up rate for small message sizes. But I think once we can address this, guest TX performance for small message sizes will take off. I also compared this with the return-TX_BUSY approach when (capacity == 0); it is not as good as dropping packets.

3. Delay guest netif queue wake up until certain descriptors (1/2 ring size, 1/4 ring size...) are available once the queue has stopped
4. Accumulate more packets per vhost signal in handle_tx?
5. 3 & 4 combinations
6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer?
7. Accumulate more packets per vhost handle_tx() by adding some delay?

Haven't noticed that part, how does your patch make it handle more packets? Added a delay in handle_tx(). What else? It would take some time to do this. Shirley Need to think about this.
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 01:41:33PM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote: On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote: Well, this is also the only case where the queue is stopped, no? Yes. I got some debugging data; I saw that sometimes there were many packets waiting to be freed in the guest between vhost_signal and the guest xmit callback. What does this mean? Let's look at the sequence here:

guest start_xmit()
	xmit_skb()
	if ring is full, enable_cb()

guest skb_xmit_done()
	disable_cb, printk free_old_xmit_skbs -- it was between more than 1/2 to full ring size
	printk vq->num_free

vhost handle_tx()
	if (guest interrupt is enabled) signal guest to free xmit buffers

So between the guest's queue full/stopped queue/enable callback and the guest receiving the callback from the host to free_old_xmit_skbs, there were about 1/2 to a full ring of descriptors available. I thought there were only a few. (I disabled your vhost patch for this test.) The expected number is vq->num - max skb frags - 2. Looks like too much time is spent between vhost_signal and the guest xmit callback? I tried to accumulate multiple guest to host notifications for TX xmits; it did help multiple streams TCP_RR results. I don't see a point to delay the used idx update, do you? It might cause vhost handle_tx to process more packets per call. I don't understand. It's a couple of writes - what is the issue? Oh, handle_tx could process more packets per loop for the multiple streams TCP_RR case. I need to print out the data rate per loop to confirm this. Shirley
Re: Network performance with small packets
On Thu, 2011-02-03 at 07:59 +0200, Michael S. Tsirkin wrote: Let's look at the sequence here:

guest start_xmit()
	xmit_skb()
	if ring is full, enable_cb()

guest skb_xmit_done()
	disable_cb, printk free_old_xmit_skbs -- it was between more than 1/2 to full ring size
	printk vq->num_free

vhost handle_tx()
	if (guest interrupt is enabled) signal guest to free xmit buffers

So between the guest's queue full/stopped queue/enable callback and the guest receiving the callback from the host to free_old_xmit_skbs, there were about 1/2 to a full ring of descriptors available. I thought there were only a few. (I disabled your vhost patch for this test.) The expected number is vq->num - max skb frags - 2. It varied (up to the ring size, 256). This is using indirect buffers; it returned how many descriptors were freed, not the number of buffers. Why do you think it is vq->num - max skb frags - 2 here? Shirley
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 09:05:56PM -0800, Shirley Ma wrote: On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote: I think I need to define the test matrix to collect data for TX xmit from guest to host here for different tests.

Data to be collected:
1. kvm_stat for VM, I/O exits
2. cpu utilization for both guest and host
3. cat /proc/interrupts on guest
4. packet rate from vhost handle_tx per loop
5. guest netif queue stop rate
6. how many packets are waiting to be freed between vhost signaling and the guest callback
7. performance results

Tests:
1. TCP_STREAM single stream test for 1K to 4K message sizes
2. TCP_RR (64 instance test): 128 - 1K request/response sizes

Different hacks:
1. Baseline data (with the patch to fix the capacity check first; free_old_xmit_skbs returns number of skbs)
2. Drop packet data (will put some debugging in generic networking code)

Since I found that the netif queue stop/wake up is so expensive, I created a packet-dropping patch on the guest side so I don't need to debug generic networking code:

guest start_xmit()
	capacity = free_old_xmit_skb() + virtqueue_get_num_freed()
	if (capacity == 0)
		drop this packet and return;

In the patch, both guest TX interrupts and the callback have been omitted. Host vhost_signal in handle_tx can be removed entirely as well. (A new virtio_ring API is needed for exporting the total number of free descriptors here -- virtqueue_get_num_freed.) Initial TCP_STREAM performance results I got for guest to local host: 4.2Gb/s for 1K message size (vs. 2.5Gb/s), 6.2Gb/s for 2K message size (vs. 3.8Gb/s), and 9.8Gb/s for 4K message size (vs. 5.xGb/s). What is the average packet size, # bytes per ack, and the # of interrupts per packet? It could be that just slowing down transmission makes GSO work better. Since the large message size (64K) doesn't hit the (capacity == 0) case, its performance is only a little better (from 13.xGb/s to 14.xGb/s). kvm_stat output shows a significant exit reduction for both VM and I/O, and no guest TX interrupts.
With dropping packets, TCP retransmissions have increased here, so the performance numbers vary. This might not be a good solution, but it gave us some ideas about the expensive netif queue stop/wake up between guest and host notification. I couldn't find a better solution for reducing the netif queue stop/wake up rate for small message sizes. But I think once we can address this, guest TX performance for small message sizes will take off. I also compared this with the return-TX_BUSY approach when (capacity == 0); it is not as good as dropping packets.

3. Delay guest netif queue wake up until certain descriptors (1/2 ring size, 1/4 ring size...) are available once the queue has stopped
4. Accumulate more packets per vhost signal in handle_tx?
5. 3 & 4 combinations
6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer?
7. Accumulate more packets per vhost handle_tx() by adding some delay?

Haven't noticed that part, how does your patch make it handle more packets? Added a delay in handle_tx(). What else? It would take some time to do this. Shirley Need to think about this.
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 10:09:14PM -0800, Shirley Ma wrote: On Thu, 2011-02-03 at 07:59 +0200, Michael S. Tsirkin wrote: Let's look at the sequence here:

    guest start_xmit()
        xmit_skb()
        if ring is full, enable_cb()

    guest skb_xmit_done()
        disable_cb, printk
        free_old_xmit_skbs -- it was between more than 1/2 to full ring size
        printk vq->num_free

    vhost handle_tx()
        if (guest interrupt is enabled)
            signal guest to free xmit buffers

So between the guest's queue full / stopped queue / enable callback, and the guest receiving the callback from the host to free_old_xmit_skbs, there were about 1/2 to full ring size descriptors available. I thought there were only a few. (I disabled your vhost patch for this test.)

The expected number is vq->num - max skb frags - 2.

It varied (up to the ring size, 256). This is using indirect buffers; it returned how many descriptors were freed, not the number of buffers. Why do you think it is vq->num - max skb frags - 2 here? Shirley

Well, the queue is stopped, which happens when

    if (capacity < 2+MAX_SKB_FRAGS) {
        netif_stop_queue(dev);
        if (unlikely(!virtqueue_enable_cb(vi->svq))) {
            /* More just got used, free them then recheck. */
            capacity += free_old_xmit_skbs(vi);
            if (capacity >= 2+MAX_SKB_FRAGS) {
                netif_start_queue(dev);
                virtqueue_disable_cb(vi->svq);
            }
        }
    }

This should be the most common case. I guess the case with += free_old_xmit_skbs is what can get us more. But it should be rare. Can you count how common it is? -- MST
Re: Network performance with small packets
On Tue, 2011-02-01 at 22:17 +0200, Michael S. Tsirkin wrote: On Tue, Feb 01, 2011 at 12:09:03PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 19:23 +0200, Michael S. Tsirkin wrote: On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote: On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote: Interesting. Could this be a variant of the now famous bufferbloat then?

Sigh, bufferbloat is the new global warming... :-/

Yep, some places become colder, some other places become warmer; same as the BW results, sometimes faster, sometimes slower. :) Shirley

Sent a tuning patch (v2) that might help. Could you try it and play with the module parameters please?

Hello Michael, Sure, I will play with this patch to see how it could help. I am looking at the guest side as well; I found a couple of issues on the guest side:

1. free_old_xmit_skbs() should return the number of skbs instead of the total of sgs, since we are using ring size to stop/start the netif queue.

    static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
    {
        struct sk_buff *skb;
        unsigned int len, tot_sgs = 0;

        while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
            pr_debug("Sent skb %p\n", skb);
            vi->dev->stats.tx_bytes += skb->len;
            vi->dev->stats.tx_packets++;
            tot_sgs += skb_vnet_hdr(skb)->num_sg;
            dev_kfree_skb_any(skb);
        }
        return tot_sgs; /* should return number of skbs to track ring usage here, I think */
    }

Did the old guest use the number of buffers to track ring usage before?

2. In start_xmit, I think we should move capacity += free_old_xmit_skbs before netif_stop_queue(), so we avoid unnecessary netif queue stop/start. This condition is heavily hit for small message sizes. Also, shouldn't the capacity checking condition change to something like half of the vring.num size, instead of comparing against 2+MAX_SKB_FRAGS?

    if (capacity < 2+MAX_SKB_FRAGS) {
        netif_stop_queue(dev);
        if (unlikely(!virtqueue_enable_cb(vi->svq))) {
            /* More just got used, free them then recheck. */
            capacity += free_old_xmit_skbs(vi);
            if (capacity >= 2+MAX_SKB_FRAGS) {
                netif_start_queue(dev);
                virtqueue_disable_cb(vi->svq);
            }
        }
    }

3. Looks like the xmit callback is only used to wake the queue when the queue has stopped, right? Should we put a condition check here?

    static void skb_xmit_done(struct virtqueue *svq)
    {
        struct virtnet_info *vi = svq->vdev->priv;

        /* Suppress further interrupts. */
        virtqueue_disable_cb(svq);

        /* We were probably waiting for more output buffers. */
    --- if (netif_queue_stopped(vi->dev))
            netif_wake_queue(vi->dev);
    }

Shirley

Well, the return value is used to calculate capacity, and that counts the # of s/g. No?

Nope, the current guest kernel uses descriptors, not the number of sgs. I am not sure about the old guest.

From a cache utilization POV it might be better to read from the skb and not peek at the virtio header though... Pls Cc the lists on any discussions in the future. -- MST

Sorry I missed reply all. :( Shirley
Re: Network performance with small packets
On Mon, 2011-01-31 at 17:30 -0800, Sridhar Samudrala wrote: Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables, by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for the small message guest-to-host stream test, and the throughput went up a little.

    diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
    index 9b3ca10..5f9fae9 100644
    --- a/drivers/vhost/net.c
    +++ b/drivers/vhost/net.c
    @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
     		if (err != len)
     			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
    -		vhost_add_used_and_signal(net->dev, vq, head, 0);
    +		vhost_add_used(vq, head, 0);
     		total_len += len;
     		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
     			vhost_poll_queue(&vq->poll);
    @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
     		}
     	}
    +	if (total_len > 0)
    +		vhost_signal(net->dev, vq);
     	mutex_unlock(&vq->mutex);
     }

Reducing the signaling will reduce the CPU utilization by reducing VM exits. The small message BW is a problem we have seen with a fast guest / slow vhost; even when I increased VHOST_NET_WEIGHT, it didn't help that much for BW. For large message sizes, vhost is able to process all packets on time. I played around with the guest/host code; so far I only see a huge BW improvement by dropping packets on the guest side. Thanks Shirley
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 12:25:08PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 22:17 +0200, Michael S. Tsirkin wrote: On Tue, Feb 01, 2011 at 12:09:03PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 19:23 +0200, Michael S. Tsirkin wrote: On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote: On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote: Interesting. Could this be a variant of the now famous bufferbloat then?

Sigh, bufferbloat is the new global warming... :-/

Yep, some places become colder, some other places become warmer; same as the BW results, sometimes faster, sometimes slower. :) Shirley

Sent a tuning patch (v2) that might help. Could you try it and play with the module parameters please?

Hello Michael, Sure, I will play with this patch to see how it could help. I am looking at the guest side as well; I found a couple of issues on the guest side:

1. free_old_xmit_skbs() should return the number of skbs instead of the total of sgs, since we are using ring size to stop/start the netif queue.

    static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
    {
        struct sk_buff *skb;
        unsigned int len, tot_sgs = 0;

        while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
            pr_debug("Sent skb %p\n", skb);
            vi->dev->stats.tx_bytes += skb->len;
            vi->dev->stats.tx_packets++;
            tot_sgs += skb_vnet_hdr(skb)->num_sg;
            dev_kfree_skb_any(skb);
        }
        return tot_sgs; /* should return number of skbs to track ring usage here, I think */
    }

Did the old guest use the number of buffers to track ring usage before?

2. In start_xmit, I think we should move capacity += free_old_xmit_skbs before netif_stop_queue(), so we avoid unnecessary netif queue stop/start. This condition is heavily hit for small message sizes. Also, shouldn't the capacity checking condition change to something like half of the vring.num size, instead of comparing against 2+MAX_SKB_FRAGS?

    if (capacity < 2+MAX_SKB_FRAGS) {
        netif_stop_queue(dev);
        if (unlikely(!virtqueue_enable_cb(vi->svq))) {
            /* More just got used, free them then recheck. */
            capacity += free_old_xmit_skbs(vi);
            if (capacity >= 2+MAX_SKB_FRAGS) {
                netif_start_queue(dev);
                virtqueue_disable_cb(vi->svq);
            }
        }
    }

3. Looks like the xmit callback is only used to wake the queue when the queue has stopped, right? Should we put a condition check here?

    static void skb_xmit_done(struct virtqueue *svq)
    {
        struct virtnet_info *vi = svq->vdev->priv;

        /* Suppress further interrupts. */
        virtqueue_disable_cb(svq);

        /* We were probably waiting for more output buffers. */
    --- if (netif_queue_stopped(vi->dev))
            netif_wake_queue(vi->dev);
    }

Shirley

Well, the return value is used to calculate capacity, and that counts the # of s/g. No?

Nope, the current guest kernel uses descriptors, not the number of sgs.

Confused. We compare capacity to skb frags, no? That's sg I think ...

not sure about the old guest.

From a cache utilization POV it might be better to read from the skb and not peek at the virtio header though... Pls Cc the lists on any discussions in the future. -- MST

Sorry I missed reply all. :( Shirley
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 01:09:45PM -0800, Shirley Ma wrote: On Mon, 2011-01-31 at 17:30 -0800, Sridhar Samudrala wrote: Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables, by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for the small message guest-to-host stream test, and the throughput went up a little.

    diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
    index 9b3ca10..5f9fae9 100644
    --- a/drivers/vhost/net.c
    +++ b/drivers/vhost/net.c
    @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
     		if (err != len)
     			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
    -		vhost_add_used_and_signal(net->dev, vq, head, 0);
    +		vhost_add_used(vq, head, 0);
     		total_len += len;
     		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
     			vhost_poll_queue(&vq->poll);
    @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
     		}
     	}
    +	if (total_len > 0)
    +		vhost_signal(net->dev, vq);
     	mutex_unlock(&vq->mutex);
     }

Reducing the signaling will reduce the CPU utilization by reducing VM exits. The small message BW is a problem we have seen with a fast guest / slow vhost; even when I increased VHOST_NET_WEIGHT, it didn't help that much for BW. For large message sizes, vhost is able to process all packets on time. I played around with the guest/host code; so far I only see a huge BW improvement by dropping packets on the guest side. Thanks Shirley

My theory is that the issue is not signalling. Rather, our queue fills up, then the host handles one packet and sends an interrupt, and we immediately wake the queue. So the vq, once it gets full, stays full.

If you try my patch with the bufs threshold set to e.g. half the vq, what we will do is send an interrupt after we have processed half the vq. So the host has half the vq to go, and the guest has half the vq to fill. See?
-- MST
Re: Network performance with small packets
On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ...

The current guest kernel uses indirect buffers; num_free returns how many descriptors are available, not skb frags. So it's wrong here. Shirley
Re: Network performance with small packets
On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote: My theory is that the issue is not signalling. Rather, our queue fills up, then the host handles one packet and sends an interrupt, and we immediately wake the queue. So the vq, once it gets full, stays full.

From the printk debugging output, it might not be exactly the case. The ring gets full, runs a bit, then gets full, then runs a bit, then full...

If you try my patch with the bufs threshold set to e.g. half the vq, what we will do is send an interrupt after we have processed half the vq. So the host has half the vq to go, and the guest has half the vq to fill. See?

I am cleaning up my setup to run your patch ... Shirley
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ...

The current guest kernel uses indirect buffers; num_free returns how many descriptors are available, not skb frags. So it's wrong here. Shirley

I see. Good point. In other words, when we complete the buffer it was indirect, but when we add a new one we may not be able to allocate indirect, so we consume multiple descriptors. And then we start the queue and the add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect?

Another failure mode is when skb_xmit_done wakes the queue: it might be too early; there might not be space for the next packet in the vq yet.

A solution might be to keep some kind of pool around for indirect; we wanted to do it for block anyway ... -- MST
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote: My theory is that the issue is not signalling. Rather, our queue fills up, then the host handles one packet and sends an interrupt, and we immediately wake the queue. So the vq, once it gets full, stays full.

From the printk debugging output, it might not be exactly the case. The ring gets full, runs a bit, then gets full, then runs a bit, then full...

Yes, but does it get even half empty in between?

If you try my patch with the bufs threshold set to e.g. half the vq, what we will do is send an interrupt after we have processed half the vq. So the host has half the vq to go, and the guest has half the vq to fill. See?

I am cleaning up my setup to run your patch ... Shirley
Re: Network performance with small packets
On Tue, 2011-02-01 at 23:42 +0200, Michael S. Tsirkin wrote: On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote: My theory is that the issue is not signalling. Rather, our queue fills up, then the host handles one packet and sends an interrupt, and we immediately wake the queue. So the vq, once it gets full, stays full.

From the printk debugging output, it might not be exactly the case. The ring gets full, runs a bit, then gets full, then runs a bit, then full...

Yes, but does it get even half empty in between?

Sometimes; most of the time it does not get half empty in between. But printk slows down the traffic, so it's not accurate. I think your patch will improve performance if it signals the guest when half of the ring size is empty. But you manage signaling using TX bytes; I would like to change it to half of the ring size instead for signaling. Is that OK? Shirley
Re: Network performance with small packets
On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote: There are flags for bytes, buffers and packets. Try playing with any one of them :) Just be sure to use v2.

I would like to change it to half of the ring size instead for signaling. Is that OK? Shirley

Sure, that is why I made it a parameter, so you can experiment.

The initial test results show that CPU utilization has been reduced some and BW has increased some with the default parameters: for 1K message size, BW goes from 2.5Gb/s to about 2.8Gb/s and CPU utilization is down from 4x% to 38% (similar results to the patch I submitted a while ago to reduce signaling on vhost), but far away from the dropping-packet results. I am going to change the code to use 1/2 ring size to wake the netif queue. Shirley
Re: Network performance with small packets
Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ...

The current guest kernel uses indirect buffers; num_free returns how many descriptors are available, not skb frags. So it's wrong here. Shirley

I see. Good point. In other words, when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect, so we consume multiple descriptors. And then we start the queue and the add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect?

Another failure mode is when skb_xmit_done wakes the queue: it might be too early; there might not be space for the next packet in the vq yet.

I am not sure if this is the problem -- shouldn't you see these messages:

    if (likely(capacity == -ENOMEM)) {
        dev_warn(&dev->dev, "TX queue failure: out of memory\n");
    } else {
        dev->stats.tx_fifo_errors++;
        dev_warn(&dev->dev, "Unexpected TX queue failure: %d\n", capacity);
    }

in the next xmit? I am not getting this in my testing.

A solution might be to keep some kind of pool around for indirect; we wanted to do it for block anyway ...

Your vhost patch should fix this automatically. Right? Thanks, - KK
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 02:59:57PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote: There are flags for bytes, buffers and packets. Try playing with any one of them :) Just be sure to use v2.

I would like to change it to half of the ring size instead for signaling. Is that OK? Shirley

Sure, that is why I made it a parameter, so you can experiment.

The initial test results show that CPU utilization has been reduced some and BW has increased some with the default parameters: for 1K message size, BW goes from 2.5Gb/s to about 2.8Gb/s and CPU utilization is down from 4x% to 38% (similar results to the patch I submitted a while ago to reduce signaling on vhost), but far away from the dropping-packet results. I am going to change the code to use 1/2 ring size to wake the netif queue. Shirley

Just tweak the parameters with sysfs, you do not have to edit the code:

    echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce

Or in a similar way for tx_packets_coalesce (since we use indirect, packets will typically use 1 buffer each). -- MST
Re: Network performance with small packets
On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote: Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ...

The current guest kernel uses indirect buffers; num_free returns how many descriptors are available, not skb frags. So it's wrong here. Shirley

I see. Good point. In other words, when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect, so we consume multiple descriptors. And then we start the queue and the add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect?

Another failure mode is when skb_xmit_done wakes the queue: it might be too early; there might not be space for the next packet in the vq yet.

I am not sure if this is the problem -- shouldn't you see these messages:

    if (likely(capacity == -ENOMEM)) {
        dev_warn(&dev->dev, "TX queue failure: out of memory\n");
    } else {
        dev->stats.tx_fifo_errors++;
        dev_warn(&dev->dev, "Unexpected TX queue failure: %d\n", capacity);
    }

in the next xmit? I am not getting this in my testing.

Yes, I don't think we hit this in our testing, simply because we don't stress memory. Disable indirect, then you might see this.

A solution might be to keep some kind of pool around for indirect; we wanted to do it for block anyway ...

Your vhost patch should fix this automatically. Right?

Reduce the chance of it happening, yes.

Thanks, - KK
Re: Network performance with small packets
On Wed, 2011-02-02 at 06:40 +0200, Michael S. Tsirkin wrote: Just tweak the parameters with sysfs, you do not have to edit the code:

    echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce

Or in a similar way for tx_packets_coalesce (since we use indirect, packets will typically use 1 buffer each).

We should use packets instead of buffers: in the indirect case one packet has multiple buffers, but each packet uses one descriptor from the ring (default size is 256).

    echo 128 > /sys/module/vhost_net/parameters/tx_packets_coalesce

The way I am changing it is: only when the netif queue has stopped do we start to count num_free descriptors to send the signal to wake the netif queue. Shirley
Re: Network performance with small packets
On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote: The way I am changing it is: only when the netif queue has stopped do we start to count num_free descriptors to send the signal to wake the netif queue.

I forgot to mention, the code change I am making is in the guest kernel: in the xmit callback, only wake up the queue when it's stopped and num_free >= 1/2 * vq->num. I add a new API in virtio_ring. However, vhost signaling reduction is needed as well. The patch I submitted a while ago showed both CPU and BW improvement. Thanks Shirley
Re: Network performance with small packets
On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote: The way I am changing it is: only when the netif queue has stopped do we start to count num_free descriptors to send the signal to wake the netif queue.

I forgot to mention, the code change I am making is in the guest kernel: in the xmit callback, only wake up the queue when it's stopped and num_free >= 1/2 * vq->num. I add a new API in virtio_ring.

Interesting. Yes, I agree an API extension would be helpful. However, wouldn't just the signaling reduction be enough, without guest changes?

However, vhost signaling reduction is needed as well. The patch I submitted a while ago showed both CPU and BW improvement. Thanks Shirley

Which patch was that? -- MST
Re: Network performance with small packets
On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote: The way I am changing it is: only when the netif queue has stopped do we start to count num_free descriptors to send the signal to wake the netif queue.

I forgot to mention, the code change I am making is in the guest kernel: in the xmit callback, only wake up the queue when it's stopped and num_free >= 1/2 * vq->num. I add a new API in virtio_ring.

FYI :) I have tried this before. There are a couple of issues:

1. The free count will not reduce until you run free_old_xmit_skbs, which will not run anymore since the tx queue is stopped.

2. You cannot call free_old_xmit_skbs directly, as it races with a queue that was just awakened (the current cb was due to the delay in disabling cbs). You have to call free_old_xmit_skbs() under a netif_queue_stopped() check to avoid the race.

I got a small improvement in my testing up to some number of threads (32 or 48?), but beyond that I was getting a regression. Thanks, - KK

However, vhost signaling reduction is needed as well. The patch I submitted a while ago showed both CPU and BW improvement.
Re: Network performance with small packets
On Wed, 2011-02-02 at 12:04 +0530, Krishna Kumar2 wrote: On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote: The way I am changing it is: only when the netif queue has stopped do we start to count num_free descriptors to send the signal to wake the netif queue.

I forgot to mention, the code change I am making is in the guest kernel: in the xmit callback, only wake up the queue when it's stopped and num_free >= 1/2 * vq->num. I add a new API in virtio_ring.

FYI :) I have tried this before. There are a couple of issues:

1. The free count will not reduce until you run free_old_xmit_skbs, which will not run anymore since the tx queue is stopped.

2. You cannot call free_old_xmit_skbs directly, as it races with a queue that was just awakened (the current cb was due to the delay in disabling cbs). You have to call free_old_xmit_skbs() under a netif_queue_stopped() check to avoid the race.

Yes, that's what I did: when the netif queue stops, don't enable the queue, just free_old_xmit_skbs(); if not enough were freed, then enable the callback until half of the ring size is freed, then wake the netif queue. But somehow I didn't reach the performance of dropping packets; need to think about it more. :) Thanks Shirley
Re: Network performance with small packets
On Wed, 2011-02-02 at 08:29 +0200, Michael S. Tsirkin wrote: On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote: The way I am changing it is: only when the netif queue has stopped do we start to count num_free descriptors to send the signal to wake the netif queue.

I forgot to mention, the code change I am making is in the guest kernel: in the xmit callback, only wake up the queue when it's stopped and num_free >= 1/2 * vq->num. I add a new API in virtio_ring.

Interesting. Yes, I agree an API extension would be helpful. However, wouldn't just the signaling reduction be enough, without guest changes?

w/i guest change, I played around with the parameters; for example, I could get 3.7Gb/s with 42% CPU for 1K message size, BW increasing from 2.5Gb/s. With dropping packets, I was able to get up to 6.2Gb/s with similar CPU usage.

However, vhost signaling reduction is needed as well. The patch I submitted a while ago showed both CPU and BW improvement. Thanks Shirley

Which patch was that?

The patch was called "vhost: TX used buffer guest signal accumulation". You suggested splitting add_used_bufs and signal. I am still thinking about the best approach to coordinate guest (virtio_kick) and vhost (handle_tx), and vhost (signaling) and guest (xmit callback), to reduce the overheads, so I haven't submitted the new patch yet. Thanks Shirley
Re: Network performance with small packets
On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote: w/i guest change, I played around with the parameters; for example, I could get 3.7Gb/s with 42% CPU for 1K message size, BW increasing from 2.5Gb/s. With dropping packets, I was able to get up to 6.2Gb/s with similar CPU usage.

I meant w/o guest change, only vhost changes. Sorry about that. Shirley
Re: Network performance with small packets
Shirley Ma mashi...@us.ibm.com wrote: I have tried this before. There are a couple of issues:

1. The free count will not reduce until you run free_old_xmit_skbs, which will not run anymore since the tx queue is stopped.

2. You cannot call free_old_xmit_skbs directly, as it races with a queue that was just awakened (the current cb was due to the delay in disabling cbs). You have to call free_old_xmit_skbs() under a netif_queue_stopped() check to avoid the race.

Yes, that's what I did: when the netif queue stops, don't enable the queue, just free_old_xmit_skbs(); if not enough were freed, then enable the callback until half of the ring size is freed, then wake the netif queue. But somehow I didn't reach the performance of dropping packets; need to think about it more. :)

Did you check if the number of vmexits increased with this patch? This is possible if the device was keeping up (and not going into a stop, start, xmit 1 packet, stop, start loop). Also, maybe you should try 1/4th instead of 1/2? MST's delayed signalling should avoid this issue; I haven't tried both together. Thanks, - KK
Re: Network performance with small packets
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again.

Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value, etc.

Warning: completely untested.

    diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
    index aac05bc..6769cdc 100644
    --- a/drivers/vhost/net.c
    +++ b/drivers/vhost/net.c
    @@ -32,6 +32,13 @@
      * Using this limit prevents one virtqueue from starving others. */
     #define VHOST_NET_WEIGHT 0x80000
     
    +int tx_bytes_coalesce = 0;
    +module_param(tx_bytes_coalesce, int, 0644);
    +int tx_bufs_coalesce = 0;
    +module_param(tx_bufs_coalesce, int, 0644);
    +int tx_packets_coalesce = 0;
    +module_param(tx_packets_coalesce, int, 0644);
    +
     enum {
     	VHOST_NET_VQ_RX = 0,
     	VHOST_NET_VQ_TX = 1,
    @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
     	int err, wmem;
     	size_t hdr_size;
     	struct socket *sock;
    +	int bytes_coalesced = 0;
    +	int bufs_coalesced = 0;
    +	int packets_coalesced = 0;
     
     	/* TODO: check that we are running from vhost_worker? */
     	sock = rcu_dereference_check(vq->private_data, 1);
    @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
     		if (err != len)
     			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
    -		vhost_add_used_and_signal(net->dev, vq, head, 0);
     		total_len += len;
    +		packets_coalesced += 1;
    +		bytes_coalesced += len;
    +		bufs_coalesced += in;

Should this instead be:

    +		bufs_coalesced += out;

Perusing the code I see that earlier there is a check to see if in is not zero and, if so, to error out of the loop. After the check, in is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below. Or am I missing something?

    +		if (unlikely(packets_coalesced > tx_packets_coalesce ||
    +			     bytes_coalesced > tx_bytes_coalesce ||
    +			     bufs_coalesced > tx_bufs_coalesce))
    +			vhost_add_used_and_signal(net->dev, vq, head, 0);
    +		else
    +			vhost_add_used(vq, head, 0);
     		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
     			vhost_poll_queue(&vq->poll);
     			break;
     		}
     	}
    +	if (likely(packets_coalesced > tx_packets_coalesce ||
    +		   bytes_coalesced > tx_bytes_coalesce ||
    +		   bufs_coalesced > tx_bufs_coalesce))
    +		vhost_signal(net->dev, vq);
     	mutex_unlock(&vq->mutex);
     }

Steve D.
Re: Network performance with small packets
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:

Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value, etc. Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000

+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;

 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

		bufs_coalesced += out;

Perusing the code I see that earlier there is a check to see if "in" is not zero and, if so, error out of the loop. After the check, "in" is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below.

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for the small-message guest-to-host stream test, and the throughput went up a little.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..5f9fae9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(net->dev, vq, head, 0);
+		vhost_add_used(vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
@@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}

+	if (total_len > 0)
+		vhost_signal(net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }

Or am I missing something?

+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }

It is possible that we can miss signaling the guest even after processing a few pkts, if we don't hit any of these conditions.

Steve D.
Re: Network performance with small packets
On Mon, Jan 31, 2011 at 06:24:34PM -0600, Steve Dobbelstein wrote:

Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value, etc. Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000

+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;

 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

		bufs_coalesced += out;

Correct.

Perusing the code I see that earlier there is a check to see if "in" is not zero and, if so, error out of the loop. After the check, "in" is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below. Or am I missing something?

+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }

Steve D.
Re: Network performance with small packets
On Mon, Jan 31, 2011 at 05:30:38PM -0800, Sridhar Samudrala wrote:
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:

Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value, etc. Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000

+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;

 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;

Should this instead be:

		bufs_coalesced += out;

Perusing the code I see that earlier there is a check to see if "in" is not zero and, if so, error out of the loop. After the check, "in" is not touched until it is added to bufs_coalesced, effectively not changing bufs_coalesced, meaning bufs_coalesced will never trigger the conditions below.

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables by delaying the signaling until we come out of the for loop. It definitely reduced the number of vmexits significantly for the small-message guest-to-host stream test, and the throughput went up a little.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..5f9fae9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(net->dev, vq, head, 0);
+		vhost_add_used(vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
@@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
 		}
 	}

+	if (total_len > 0)
+		vhost_signal(net->dev, vq);
 	mutex_unlock(&vq->mutex);
 }

Or am I missing something?

+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }

It is possible that we can miss signaling the guest even after processing a few pkts, if we don't hit any of these conditions.

Yes. It really should be

	if (likely(packets_coalesced && bytes_coalesced && bufs_coalesced))
		vhost_signal(net->dev, vq);

Steve D.
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote:
On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote:

Interesting. Could this be a variant of the now famous bufferbloat, then?

Sigh, bufferbloat is the new global warming... :-/

Yep, some places become colder, some other places become warmer; same as the BW results, sometimes faster, sometimes slower. :)

Shirley

OK, so thinking about it more, maybe the issue is this: tx becomes full. We process one request and interrupt the guest, then it adds one request and the queue is full again. Maybe the following will help it stabilize? By itself it does nothing, but if you set all the parameters to a huge value we will only interrupt when we see an empty ring. Which might be too much: pls try other values in the middle: e.g. make bufs half the ring, or bytes some small value, or packets some small value, etc. Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000

+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	int bytes_coalesced = 0;
+	int bufs_coalesced = 0;
+	int packets_coalesced = 0;

 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(net->dev, vq, head, 0);
 		total_len += len;
+		packets_coalesced += 1;
+		bytes_coalesced += len;
+		bufs_coalesced += in;
+		if (unlikely(packets_coalesced > tx_packets_coalesce ||
+			     bytes_coalesced > tx_bytes_coalesce ||
+			     bufs_coalesced > tx_bufs_coalesce))
+			vhost_add_used_and_signal(net->dev, vq, head, 0);
+		else
+			vhost_add_used(vq, head, 0);
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
 			break;
 		}
 	}
+	if (likely(packets_coalesced > tx_packets_coalesce ||
+		   bytes_coalesced > tx_bytes_coalesce ||
+		   bufs_coalesced > tx_bufs_coalesce))
+		vhost_signal(net->dev, vq);

 	mutex_unlock(&vq->mutex);
 }
Re: Network performance with small packets
mashi...@linux.vnet.ibm.com wrote on 01/27/2011 02:15:05 PM:
On Thu, 2011-01-27 at 22:05 +0200, Michael S. Tsirkin wrote:

One simple theory is that the guest net stack became faster and so the host can't keep up.

Yes, that's what I think here. Some qdisc code has been changed recently.

I ran a test with txqueuelen set to 128, instead of the default of 1000, in the guest in an attempt to slow down the guest transmits. The change had no effect on either the throughput or the CPU usage.

On the other hand, I ran some tests with different CPU pinnings and with/without hyperthreading enabled. Here is a summary of the results.

Pinning configuration 1: pin the VCPUs and pin the vhost thread to one of the VCPU CPUs.
Pinning configuration 2: pin the VCPUs and pin the vhost thread to a separate CPU on the same socket.
Pinning configuration 3: pin the VCPUs and pin the vhost thread to a separate CPU on a different socket.

HT   Pinning    Throughput   CPU
Yes  config 1   -40%         -40%
Yes  config 2   -37%         -35%
Yes  config 3   -37%         -36%
No   none         0%          -5%
No   config 1   -41%         -43%
No   config 2   +32%          -4%
No   config 3   +34%          +9%

Pinning the vhost thread to the same CPU as a guest VCPU hurts performance. Turning off hyperthreading and pinning the VCPUs and the vhost thread to separate CPUs significantly improves performance, getting it into the competitive range with other hypervisors.

Steve D.
Re: Network performance with small packets
ste...@us.ibm.com wrote on 01/28/2011 12:29:37 PM:
On Thu, 2011-01-27 at 22:05 +0200, Michael S. Tsirkin wrote:

One simple theory is that the guest net stack became faster and so the host can't keep up.

Yes, that's what I think here. Some qdisc code has been changed recently.

I ran a test with txqueuelen set to 128, instead of the default of 1000, in the guest in an attempt to slow down the guest transmits. The change had no effect on either the throughput or the CPU usage.

On the other hand, I ran some tests with different CPU pinnings and with/without hyperthreading enabled. Here is a summary of the results.

Pinning configuration 1: pin the VCPUs and pin the vhost thread to one of the VCPU CPUs.
Pinning configuration 2: pin the VCPUs and pin the vhost thread to a separate CPU on the same socket.
Pinning configuration 3: pin the VCPUs and pin the vhost thread to a separate CPU on a different socket.

HT   Pinning    Throughput   CPU
Yes  config 1   -40%         -40%
Yes  config 2   -37%         -35%
Yes  config 3   -37%         -36%
No   none         0%          -5%
No   config 1   -41%         -43%
No   config 2   +32%          -4%
No   config 3   +34%          +9%

Pinning the vhost thread to the same CPU as a guest VCPU hurts performance. Turning off hyperthreading and pinning the VCPUs and the vhost thread to separate CPUs significantly improves performance, getting it into the competitive range with other hypervisors.

Steve D.

Those results for configs 2 and 3 with hyperthreading on are a little strange. Digging into the cause I found that my automation script for pinning the vhost thread failed and pinned it to CPU 1, the same as config 1, giving results similar to config 1. I reran the tests, making sure the pinning script did the right thing. The results are more consistent.

HT   Pinning    Throughput   CPU
Yes  config 1   -40%         -40%
Yes  config 2   +33%          -8%
Yes  config 3   +34%          +9%
No   none         0%          -5%
No   config 1   -41%         -43%
No   config 2   +32%          -4%
No   config 3   +34%          +9%

It appears that we have a scheduling problem. If the processes are pinned we can get good performance. We also see that hyperthreading makes little difference. Sorry for the initial misleading data.

Steve D.
Re: Network performance with small packets
On Wed, 2011-01-26 at 17:17 +0200, Michael S. Tsirkin wrote:

I am seeing a similar problem, and am trying to fix that. My current theory is that this is a variant of a receive livelock: if the application isn't fast enough to process incoming data, the guest net stack switches from prequeue to backlog handling. One thing I noticed is that locking the vhost thread and the vcpu to the same physical CPU almost doubles the bandwidth. Can you confirm that in your setup? My current guess is that when we lock both to a single CPU, netperf in the guest gets scheduled, slowing down the vhost thread in the host. I also noticed that this specific workload performs better with vhost off: presumably we are loading the guest less.

I found a similar issue in the small-message-size TCP_STREAM test with the guest as TX: when I slow down TX, the BW performance doubles for 1K to 4K message sizes.

Shirley
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 10:44:34AM -0800, Shirley Ma wrote:
On Wed, 2011-01-26 at 17:17 +0200, Michael S. Tsirkin wrote:

I am seeing a similar problem, and am trying to fix that. My current theory is that this is a variant of a receive livelock: if the application isn't fast enough to process incoming data, the guest net stack switches from prequeue to backlog handling. One thing I noticed is that locking the vhost thread and the vcpu to the same physical CPU almost doubles the bandwidth. Can you confirm that in your setup? My current guess is that when we lock both to a single CPU, netperf in the guest gets scheduled, slowing down the vhost thread in the host. I also noticed that this specific workload performs better with vhost off: presumably we are loading the guest less.

I found a similar issue in the small-message-size TCP_STREAM test with the guest as TX: when I slow down TX, the BW performance doubles for 1K to 4K message sizes.

Shirley

Interesting. In particular, running vhost and the transmitting guest on the same host would have the effect of slowing down TX. Does it double the BW for you too?

-- MST
Re: Network performance with small packets
On Thu, 2011-01-27 at 21:00 +0200, Michael S. Tsirkin wrote:

Interesting. In particular, running vhost and the transmitting guest on the same host would have the effect of slowing down TX. Does it double the BW for you too?

Running vhost and the TX guest on the same host doesn't seem to be enough to slow down TX. To gain double or even triple BW for guest TX to the local host I still need to play around; for a 1K message size, BW is able to increase from 2.X Gb/s to 6.X Gb/s.

Thanks
Shirley
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 11:09:00AM -0800, Shirley Ma wrote:
On Thu, 2011-01-27 at 21:00 +0200, Michael S. Tsirkin wrote:

Interesting. In particular, running vhost and the transmitting guest on the same host would have the effect of slowing down TX. Does it double the BW for you too?

Running vhost and the TX guest on the same host doesn't seem to be enough to slow down TX. To gain double or even triple BW for guest TX to the local host I still need to play around; for a 1K message size, BW is able to increase from 2.X Gb/s to 6.X Gb/s.

Thanks
Shirley

Well, slowing down the guest does not sound hard: for example we can request guest notifications, or send extra interrupts :) A slightly more sophisticated thing to try is to poll the vq a bit more aggressively. For example, if we handled some requests and now the tx vq is empty, reschedule and yield. Worth a try?

-- MST
Re: Network performance with small packets
On Thu, 2011-01-27 at 21:31 +0200, Michael S. Tsirkin wrote:

Well, slowing down the guest does not sound hard: for example we can request guest notifications, or send extra interrupts :) A slightly more sophisticated thing to try is to poll the vq a bit more aggressively. For example, if we handled some requests and now the tx vq is empty, reschedule and yield. Worth a try?

I used dropping packets at a high level to slow down TX. I am still thinking about what the right approach is here. Requesting guest notifications and extra interrupts is what we want to avoid, to reduce VM exits and save CPU; I don't think it's good. By polling the vq a bit more aggressively, you meant vhost, right?

Shirley
Re: Network performance with small packets
On Thu, Jan 27, 2011 at 11:45:47AM -0800, Shirley Ma wrote:
On Thu, 2011-01-27 at 21:31 +0200, Michael S. Tsirkin wrote:

Well, slowing down the guest does not sound hard: for example we can request guest notifications, or send extra interrupts :) A slightly more sophisticated thing to try is to poll the vq a bit more aggressively. For example, if we handled some requests and now the tx vq is empty, reschedule and yield. Worth a try?

I used dropping packets at a high level to slow down TX. I am still thinking about what the right approach is here.

Interesting. Could this be a variant of the now famous bufferbloat, then? I guess we could drop some packets if we see we are not keeping up. For example, if we see that the ring is X% full, we could quickly complete Y% without transmitting the packets on. Or maybe we should drop some bytes, not packets.

Requesting guest notifications and extra interrupts is what we want to avoid, to reduce VM exits and save CPU; I don't think it's good.

Yes, but how do you explain the regression? One simple theory is that the guest net stack became faster and so the host can't keep up.

By polling the vq a bit more aggressively, you meant vhost, right?

Shirley

Yes.