Re: RFT: virtio_net: limit xmit polling
On Sunday, June 19, 2011 05:27:00 AM Michael S. Tsirkin wrote:
> OK, different people seem to test different trees. In the hope to get
> everyone on the same page, I created several variants of this patch so
> they can be compared. Whoever's interested, please check out the
> following, and tell me how these compare:
>
> kernel: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
>
> virtio-net-limit-xmit-polling/base - this is net-next baseline to test against
> virtio-net-limit-xmit-polling/v0 - fixes checks on out of capacity
> virtio-net-limit-xmit-polling/v1 - previous revision of the patch
>   this does xmit,free,xmit,2*free,free
> virtio-net-limit-xmit-polling/v2 - new revision of the patch
>   this does free,xmit,2*free,free

Here's a summary of the results. I've also attached an ODS format spreadsheet (30 KB in size) that might be easier to analyze and also has some pinned VM results data. I broke the tests down into a local guest-to-guest scenario and a remote host-to-guest scenario, run over a 10GbE link.

Within the local guest-to-guest scenario I ran:
- TCP_RR tests using two different message sizes and four different instance counts among 1 pair of VMs and 2 pairs of VMs.
- TCP_STREAM tests using four different message sizes and two different instance counts among 1 pair of VMs and 2 pairs of VMs.

Within the remote host-to-guest scenario I ran:
- TCP_RR tests using two different message sizes and four different instance counts to 1 VM and 4 VMs.
- TCP_STREAM and TCP_MAERTS tests using four different message sizes and two different instance counts to 1 VM and 4 VMs.
*** Local Guest-to-Guest ***

Here's the local guest-to-guest summary for 1 VM pair doing TCP_RR with 256/256 request/response message size in transactions per second:

Instances       Base         V0          V1          V2
1           8,151.56    8,460.72    8,439.16    9,990.37
25         48,761.74   51,032.62   51,103.25   49,533.52
50         55,687.38   55,974.18   56,854.10   54,888.65
100        58,255.06   58,255.86   60,380.90   59,308.36

Here's the local guest-to-guest summary for 2 VM pairs doing TCP_RR with 256/256 request/response message size in transactions per second:

Instances       Base         V0          V1          V2
1          18,758.48   19,112.50   18,597.07   19,252.04
25         80,500.50   78,801.78   80,590.68   78,782.07
50         80,594.20   77,985.44   80,431.72   77,246.90
100        82,023.23   81,325.96   81,303.32   81,727.54

Here's the local guest-to-guest summary for 1 VM pair doing TCP_STREAM with 256, 1K, 4K and 16K message size in Mbps:

256:
Instances       Base         V0          V1          V2
1             961.78    1,115.92      794.02      740.37
4           2,498.33    2,541.82    2,441.60    2,308.26

1K:
1           3,476.61    3,522.02    2,170.86    1,395.57
4           6,344.30    7,056.57    7,275.16    7,174.09

4K:
1           9,213.57   10,647.44    9,883.42    9,007.29
4          11,070.66   11,300.37   11,001.02   12,103.72

16K:
1          12,065.94    9,437.78   11,710.60    6,989.93
4          12,755.28   13,050.78   12,518.06   13,227.33

Here's the local guest-to-guest summary for 2 VM pairs doing TCP_STREAM with 256, 1K, 4K and 16K message size in Mbps:

256:
Instances       Base         V0          V1          V2
1           2,434.98    2,403.23    2,308.69    2,261.35
4           5,973.82    5,729.48    5,956.76    5,831.86

1K:
1           5,305.99    5,148.72    4,960.67    5,067.76
4          10,628.38   10,649.49   10,098.90   10,380.09

4K:
1          11,577.03   10,710.33   11,700.53   10,304.09
4          14,580.66   14,881.38   14,551.17   15,053.02

16K:
1          16,801.46   16,072.50   15,773.78   15,835.66
4          17,194.00   17,294.02   17,319.78   17,121.09

*** Remote Host-to-Guest ***

Here's the remote host-to-guest summary for 1 VM doing TCP_RR with 256/256 request/response message size in transactions per second:

Instances       Base         V0          V1          V2
1           9,732.99   10,307.98   10,529.82    8,889.28
25         43,976.18   49,480.50   46,536.66   45,682.38
50         63,031.33   67,127.15   60,073.34   65,748.62
100        64,778.43   65,338.07   66,774.12   69,391.22

Here's the remote host-to-guest summary for 4 VMs doing TCP_RR with 256/256 request/response
Re: RFT: virtio_net: limit xmit polling
On Sunday, June 19, 2011 05:27:00 AM Michael S. Tsirkin wrote:
> OK, different people seem to test different trees. In the hope to get
> everyone on the same page, I created several variants of this patch so
> they can be compared. Whoever's interested, please check out the
> following, and tell me how these compare:

I'm in the process of testing these patches. Base and v0 are complete and v1 is near complete with v2 to follow. I'm testing with a variety of TCP_RR and TCP_STREAM/TCP_MAERTS tests involving local guest-to-guest tests and remote host-to-guest tests. I'll post the results in the next day or two when the tests finish.

Thanks,
Tom

> kernel: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
>
> virtio-net-limit-xmit-polling/base - this is net-next baseline to test against
> virtio-net-limit-xmit-polling/v0 - fixes checks on out of capacity
> virtio-net-limit-xmit-polling/v1 - previous revision of the patch
>   this does xmit,free,xmit,2*free,free
> virtio-net-limit-xmit-polling/v2 - new revision of the patch
>   this does free,xmit,2*free,free
>
> There's also this on top:
> virtio-net-limit-xmit-polling/v3 - don't delay avail index update
> I don't think it's important to test this one, yet
>
> Userspace to use: event index work is not yet merged upstream so the
> revision to use is still this:
> git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git
> virtio-net-event-idx-v3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/18] virtio: use avail_event index
On Monday, May 16, 2011 02:12:21 AM Rusty Russell wrote:
> On Sun, 15 May 2011 16:55:41 +0300, Michael S. Tsirkin <m...@redhat.com> wrote:
> > On Mon, May 09, 2011 at 02:03:26PM +0930, Rusty Russell wrote:
> > > On Wed, 4 May 2011 23:51:47 +0300, Michael S. Tsirkin <m...@redhat.com> wrote:
> > > > Use the new avail_event feature to reduce the number of exits
> > > > from the guest.
> > >
> > > Figures here would be nice :)
> >
> > You mean ASCII art in comments?
>
> I mean benchmarks of some kind.

I'm working on getting some benchmark results for the patches. I should hopefully have something in the next day or two.

Tom

> > > > @@ -228,6 +237,12 @@ add_head:
> > > >  	 * new available array entries. */
> > > >  	virtio_wmb();
> > > >  	vq->vring.avail->idx++;
> > > > +	/* If the driver never bothers to kick in a very long while,
> > > > +	 * avail index might wrap around. If that happens, invalidate
> > > > +	 * kicked_avail index we stored. TODO: make sure all drivers
> > > > +	 * kick at least once in 2^16 and remove this. */
> > > > +	if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
> > > > +		vq->kicked_avail_valid = true;
> > >
> > > If they don't, they're already buggy. Simply do:
> > > 	WARN_ON(vq->vring.avail->idx == vq->kicked_avail);
> >
> > Hmm, but does it say that somewhere?
>
> AFAICT it's a corollary of:
> 1) You have a finite ring of size <= 2^16.
> 2) You need to kick the other side once you've done some work.
>
> > > > @@ -482,6 +517,8 @@ void vring_transport_features(struct virtio_device *vdev)
> > > >  		break;
> > > >  	case VIRTIO_RING_F_USED_EVENT_IDX:
> > > >  		break;
> > > > +	case VIRTIO_RING_F_AVAIL_EVENT_IDX:
> > > > +		break;
> > > >  	default:
> > > >  		/* We don't understand this bit. */
> > > >  		clear_bit(i, vdev->features);
> > >
> > > Does this belong in a prior patch?
> > > Thanks, Rusty.
> >
> > Well if we don't support the feature in the ring we should not ack
> > the feature, right?
>
> Ah, you're right.
> Thanks, Rusty.
Re: [PATCH 05/18] virtio: used event index interface
On Wednesday, May 04, 2011 03:51:09 PM Michael S. Tsirkin wrote:
> Define a new feature bit for the guest to utilize a used_event index
> (like Xen) instead of a flag bit to enable/disable interrupts.
>
> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> ---
>  include/linux/virtio_ring.h |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
> index e4d144b..f5c1b75 100644
> --- a/include/linux/virtio_ring.h
> +++ b/include/linux/virtio_ring.h
> @@ -29,6 +29,10 @@
>  /* We support indirect buffer descriptors */
>  #define VIRTIO_RING_F_INDIRECT_DESC	28
>
> +/* The Guest publishes the used index for which it expects an interrupt
> + * at the end of the avail ring. Host should ignore the avail->flags field. */
> +#define VIRTIO_RING_F_USED_EVENT_IDX	29
> +
>  /* Virtio ring descriptors: 16 bytes. These can chain together via "next". */
>  struct vring_desc {
>  	/* Address (guest-physical). */
> @@ -83,6 +87,7 @@ struct vring {
>   *	__u16 avail_flags;
>   *	__u16 avail_idx;
>   *	__u16 available[num];
> + *	__u16 used_event_idx;
>   *
>   *	// Padding to the next align boundary.
>   *	char pad[];
> @@ -93,6 +98,10 @@ struct vring {
>   *	struct vring_used_elem used[num];
>   * };
>   */
> +/* We publish the used event index at the end of the available ring.
> + * It is at the end for backwards compatibility. */
> +#define vring_used_event(vr) ((vr)->avail->ring[(vr)->num])
> +
>  static inline void vring_init(struct vring *vr, unsigned int num, void *p,
>  			      unsigned long align)
>  {

You should update the vring_size procedure to account for the extra field at the end of the available ring by changing the (2 + num) to (3 + num):

	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (3 + num)

Tom
Re: [PATCH 09/18] virtio: use avail_event index
On Wednesday, May 04, 2011 03:51:47 PM Michael S. Tsirkin wrote:
> Use the new avail_event feature to reduce the number of exits from the
> guest.
>
> Signed-off-by: Michael S. Tsirkin <m...@redhat.com>
> ---
>  drivers/virtio/virtio_ring.c |   39 ++-
>  1 files changed, 38 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 3a3ed75..262dfe6 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -82,6 +82,15 @@ struct vring_virtqueue
>  	/* Host supports indirect buffers */
>  	bool indirect;
>
> +	/* Host publishes avail event idx */
> +	bool event;
> +
> +	/* Is kicked_avail below valid? */
> +	bool kicked_avail_valid;
> +
> +	/* avail idx value we already kicked. */
> +	u16 kicked_avail;
> +
>  	/* Number of free buffers */
>  	unsigned int num_free;
>  	/* Head of free buffer list. */
> @@ -228,6 +237,12 @@ add_head:
>  	 * new available array entries. */
>  	virtio_wmb();
>  	vq->vring.avail->idx++;
> +	/* If the driver never bothers to kick in a very long while,
> +	 * avail index might wrap around. If that happens, invalidate
> +	 * kicked_avail index we stored. TODO: make sure all drivers
> +	 * kick at least once in 2^16 and remove this. */
> +	if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
> +		vq->kicked_avail_valid = true;

vq->kicked_avail_valid should be set to false here.

Tom

>  	pr_debug("Added buffer head %i to %p\n", head, vq);
>  	END_USE(vq);
> @@ -236,6 +251,23 @@ add_head:
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
>
> +static bool vring_notify(struct vring_virtqueue *vq)
> +{
> +	u16 old, new;
> +	bool v;
> +	if (!vq->event)
> +		return !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY);
> +
> +	v = vq->kicked_avail_valid;
> +	old = vq->kicked_avail;
> +	new = vq->kicked_avail = vq->vring.avail->idx;
> +	vq->kicked_avail_valid = true;
> +	if (unlikely(!v))
> +		return true;
> +	return vring_need_event(vring_avail_event(&vq->vring), new, old);
> +}
> +
>  void virtqueue_kick(struct virtqueue *_vq)
>  {
>  	struct vring_virtqueue *vq = to_vvq(_vq);
> @@ -244,7 +276,7 @@ void virtqueue_kick(struct virtqueue *_vq)
>  	/* Need to update avail index before checking if we should notify */
>  	virtio_mb();
>
> -	if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
> +	if (vring_notify(vq))
>  		/* Prod other side to tell it about changes. */
>  		vq->notify(&vq->vq);
>
> @@ -437,6 +469,8 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
>  	vq->vq.name = name;
>  	vq->notify = notify;
>  	vq->broken = false;
> +	vq->kicked_avail_valid = false;
> +	vq->kicked_avail = 0;
>  	vq->last_used_idx = 0;
>  	list_add_tail(&vq->vq.list, &vdev->vqs);
> #ifdef DEBUG
> @@ -444,6 +478,7 @@
> #endif
>
>  	vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
> +	vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_AVAIL_EVENT_IDX);
>
>  	/* No callback? Tell other side not to bother us. */
>  	if (!callback)
> @@ -482,6 +517,8 @@ void vring_transport_features(struct virtio_device *vdev)
>  		break;
>  	case VIRTIO_RING_F_USED_EVENT_IDX:
>  		break;
> +	case VIRTIO_RING_F_AVAIL_EVENT_IDX:
> +		break;
>  	default:
>  		/* We don't understand this bit. */
>  		clear_bit(i, vdev->features);
Re: Network performance with small packets - continued
On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:
> On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
> > As for which CPU the interrupt gets pinned to, that doesn't matter -
> > see below.
>
> So what hurts us the most is that the IRQ jumps between the VCPUs?

Yes, it appears that allowing the IRQ to run on more than one vCPU hurts. Without the publish last used index patch, vhost keeps injecting an irq for every received packet until the guest eventually turns off notifications. Because the irq injections end up overlapping we get contention on the irq_desc_lock_class lock.

Here are some results using the baseline setup with irqbalance running:
  Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
  Exits: 121,050.45 Exits/Sec
  TxCPU: 9.61% RxCPU: 99.45%
  Virtio1-input Interrupts/Sec (CPU0/CPU1): 13,975/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
About a 24% increase over baseline. Irqbalance essentially pinned the virtio irq to CPU0 preventing the irq lock contention and resulting in nice gains.
Re: Network performance with small packets - continued
On Thursday, March 10, 2011 09:34:22 am Michael S. Tsirkin wrote:
> On Thu, Mar 10, 2011 at 09:23:42AM -0600, Tom Lendacky wrote:
> > On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:
> > > On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
> > > > As for which CPU the interrupt gets pinned to, that doesn't
> > > > matter - see below.
> > >
> > > So what hurts us the most is that the IRQ jumps between the VCPUs?
> >
> > Yes, it appears that allowing the IRQ to run on more than one vCPU
> > hurts. Without the publish last used index patch, vhost keeps
> > injecting an irq for every received packet until the guest
> > eventually turns off notifications.
>
> Are you sure you see that? If yes publish used should help a lot.

I definitely see that. I ran lockstat in the guest and saw the contention on the lock when the irq was able to run on either vCPU. Once the irq was pinned the contention disappeared. The publish used index patch should eliminate the extra irq injections and then the pinning or use of irqbalance shouldn't be required.

I'm getting a kernel oops during boot with the publish last used patches that I pulled from the mailing list - I had to make some changes in order to get them to apply and compile and might not have done the right things. Can you re-spin that patchset against kvm.git?

> > Because the irq injections end up overlapping we get contention on
> > the irq_desc_lock_class lock. Here are some results using the
> > baseline setup with irqbalance running:
> >   Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
> >   Exits: 121,050.45 Exits/Sec
> >   TxCPU: 9.61% RxCPU: 99.45%
> >   Virtio1-input Interrupts/Sec (CPU0/CPU1): 13,975/0
> >   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
> > About a 24% increase over baseline. Irqbalance essentially pinned
> > the virtio irq to CPU0 preventing the irq lock contention and
> > resulting in nice gains.
>
> OK, so we probably want some form of delayed free for TX on top, and
> that should get us nice results already.
Re: Network performance with small packets - continued
On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
> On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
> > We've been doing some more experimenting with the small packet
> > network performance problem in KVM. I have a different setup than
> > what Steve D. was using so I re-baselined things on the kvm.git
> > kernel on both the host and guest with a 10GbE adapter. I also made
> > use of the virtio-stats patch.
> >
> > The virtual machine has 2 vCPUs, 8GB of memory and two virtio
> > network adapters (the first connected to a 1GbE adapter and a LAN,
> > the second connected to a 10GbE adapter that is direct connected to
> > another system with the same 10GbE adapter) running the kvm.git
> > kernel. The test was a TCP_RR test with 100 connections from a
> > baremetal client to the KVM guest using a 256 byte message size in
> > both directions.
> >
> > I used the uperf tool to do this after verifying the results against
> > netperf. Uperf allows the specification of the number of connections
> > as a parameter in an XML file as opposed to launching, in this case,
> > 100 separate instances of netperf.
> >
> > Here is the baseline for baremetal using 2 physical CPUs:
> >   Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
> >   TxCPU: 7.88% RxCPU: 99.41%
> >
> > To be sure to get consistent results with KVM I disabled the
> > hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
> > ethernet adapter interrupts (this resulted in runs that differed by
> > only about 2% from lowest to highest). The fact that pinning is
> > required to get consistent results is a different problem that we'll
> > have to look into later...
> >
> > Here is the KVM baseline (average of six runs):
> >   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
> >   Exits: 148,444.58 Exits/Sec
> >   TxCPU: 2.40% RxCPU: 99.35%
> > About 42% of baremetal.
>
> Can you add interrupt stats as well please?

Yes I can. Just the guest interrupts for the virtio device?

> > So I coded a quick patch to delay freeing of the used Tx buffers
> > until more than half the ring was used (I did not test this under a
> > stream condition so I don't know if this would have a negative
> > impact). Here are the results from delaying the freeing of used Tx
> > buffers (average of six runs):
> >   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
> >   Exits: 142,681.67 Exits/Sec
> >   TxCPU: 2.78% RxCPU: 99.36%
> > About a 4% increase over baseline and about 44% of baremetal.
>
> Hmm, I am not sure what you mean by delaying freeing.

In the start_xmit function of virtio_net.c the first thing done is to free any used entries from the ring. I patched the code to track the number of used tx ring entries and only free the used entries when they are greater than half the capacity of the ring (similar to the way the rx ring is re-filled).

> I think we do have a problem that free_old_xmit_skbs tries to flush
> out the ring aggressively: it always polls until the ring is empty,
> so there could be bursts of activity where we spend a lot of time
> flushing the old entries before e.g. sending an ack, resulting in
> latency bursts. Generally we'll need some smarter logic, but with
> indirect at the moment we can just poll a single packet after we post
> a new one, and be done with it. Is your patch something like the
> patch below? Could you try mine as well please?

Yes, I'll try the patch and post the results.

> > This spread out the kick_notify but still resulted in a lot of them.
> > I decided to build on the delayed Tx buffer freeing and code up an
> > ethtool like coalescing patch in order to delay the kick_notify
> > until there were at least 5 packets on the ring or 2000 usecs,
> > whichever occurred first. Here are the results of delaying the
> > kick_notify (average of six runs):
> >   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
> >   Exits: 102,587.28 Exits/Sec
> >   TxCPU: 3.03% RxCPU: 99.33%
> > About a 23% increase over baseline and about 52% of baremetal.
> > Running the perf command against the guest I noticed almost 19% of
> > the time being spent in _raw_spin_lock. Enabling lockstat in the
> > guest showed a lot of contention in the irq_desc_lock_class. Pinning
> > the virtio1-input interrupt to a single cpu in the guest and
> > re-running the last test resulted in tremendous gains (average of
> > six runs):
> >   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
> >   Exits: 62,603.37 Exits/Sec
> >   TxCPU: 3.73% RxCPU: 98.52%
> > About a 77% increase over baseline and about 74% of baremetal.
> >
> > Vhost is receiving a lot of notifications for packets that are to be
> > transmitted (over 60% of the packets generate a kick_notify). Also,
> > it looks like vhost is sending a lot of notifications for packets it
> > has received before the guest can get scheduled to disable
> > notifications and begin processing the packets
>
> Hmm, is this really what happens to you? The effect would be that
> guest gets an interrupt while notifications are disabled in guest
Re: Network performance with small packets - continued
On Wednesday, March 09, 2011 01:17:44 am Michael S. Tsirkin wrote:
> On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
> > I used the uperf tool to do this after verifying the results against
> > netperf. Uperf allows the specification of the number of connections
> > as a parameter in an XML file as opposed to launching, in this case,
> > 100 separate instances of netperf.
>
> Could you post the XML on the list please?

Environment variables are used to specify some of the values:

  uperf_instances=100
  uperf_dest=192.168.100.28
  uperf_duration=300
  uperf_tx_msgsize=256
  uperf_rx_msgsize=256

You can also change from threads to processes by specifying nprocs instead of nthreads in the group element. I found this out later so all of my runs are using threads. Using processes will give you some improved performance but I need to be consistent with my runs and stay with threads for now.

<?xml version="1.0"?>
<profile name="TCP_RR">
  <group nthreads="$uperf_instances">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=$uperf_dest protocol=tcp"/>
    </transaction>
    <transaction duration="$uperf_duration">
      <flowop type="write" options="size=$uperf_tx_msgsize"/>
      <flowop type="read" options="size=$uperf_rx_msgsize"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>
Re: Network performance with small packets - continued
Here are the results again with the addition of the interrupt rate that occurred on the guest virtio_net device:

Here is the KVM baseline (average of six runs):
  Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
  Exits: 148,444.58 Exits/Sec
  TxCPU: 2.40% RxCPU: 99.35%
  Virtio1-input Interrupts/Sec (CPU0/CPU1): 5,154/5,222
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
About 42% of baremetal.

Delayed freeing of TX buffers (average of six runs):
  Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
  Exits: 142,681.67 Exits/Sec
  TxCPU: 2.78% RxCPU: 99.36%
  Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,796/4,908
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
About a 4% increase over baseline and about 44% of baremetal.

Delaying kick_notify (kick every 5 packets - average of six runs):
  Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
  Exits: 102,587.28 Exits/Sec
  TxCPU: 3.03% RxCPU: 99.33%
  Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,200/4,293
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
About a 23% increase over baseline and about 52% of baremetal.

Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs):
  Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
  Exits: 62,603.37 Exits/Sec
  TxCPU: 3.73% RxCPU: 98.52%
  Virtio1-input Interrupts/Sec (CPU0/CPU1): 11,564/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
About a 77% increase over baseline and about 74% of baremetal.

On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
> On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
> > We've been doing some more experimenting with the small packet
> > network performance problem in KVM. I have a different setup than
> > what Steve D. was using so I re-baselined things on the kvm.git
> > kernel on both the host and guest with a 10GbE adapter. I also made
> > use of the virtio-stats patch.
> > The virtual machine has 2 vCPUs, 8GB of memory and two virtio
> > network adapters (the first connected to a 1GbE adapter and a LAN,
> > the second connected to a 10GbE adapter that is direct connected to
> > another system with the same 10GbE adapter) running the kvm.git
> > kernel. The test was a TCP_RR test with 100 connections from a
> > baremetal client to the KVM guest using a 256 byte message size in
> > both directions.
> >
> > I used the uperf tool to do this after verifying the results against
> > netperf. Uperf allows the specification of the number of connections
> > as a parameter in an XML file as opposed to launching, in this case,
> > 100 separate instances of netperf.
> >
> > Here is the baseline for baremetal using 2 physical CPUs:
> >   Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
> >   TxCPU: 7.88% RxCPU: 99.41%
> >
> > To be sure to get consistent results with KVM I disabled the
> > hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
> > ethernet adapter interrupts (this resulted in runs that differed by
> > only about 2% from lowest to highest). The fact that pinning is
> > required to get consistent results is a different problem that we'll
> > have to look into later...
> >
> > Here is the KVM baseline (average of six runs):
> >   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
> >   Exits: 148,444.58 Exits/Sec
> >   TxCPU: 2.40% RxCPU: 99.35%
> > About 42% of baremetal.
>
> Can you add interrupt stats as well please?
>
> > So I coded a quick patch to delay freeing of the used Tx buffers
> > until more than half the ring was used (I did not test this under a
> > stream condition so I don't know if this would have a negative
> > impact). Here are the results from delaying the freeing of used Tx
> > buffers (average of six runs):
> >   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
> >   Exits: 142,681.67 Exits/Sec
> >   TxCPU: 2.78% RxCPU: 99.36%
> > About a 4% increase over baseline and about 44% of baremetal.
>
> Hmm, I am not sure what you mean by delaying freeing.
>
> I think we do have a problem that free_old_xmit_skbs tries to flush
> out the ring aggressively: it always polls until the ring is empty,
> so there could be bursts of activity where we spend a lot of time
> flushing the old entries before e.g. sending an ack, resulting in
> latency bursts. Generally we'll need some smarter logic, but with
> indirect at the moment we can just poll a single packet after we post
> a new one, and be done with it. Is your patch something like the
> patch below? Could you try mine as well please?
>
> > This spread out the kick_notify but still resulted in a lot of them.
> > I decided to build on the delayed Tx buffer freeing and code up an
> > ethtool like coalescing patch in order to delay the kick_notify
> > until there were at least 5 packets on the ring or 2000 usecs,
> > whichever occurred first. Here are the results of delaying the
> > kick_notify (average of six runs):
> >   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
> >   Exits: 102,587.28 Exits/Sec
> >   TxCPU: 3.03% RxCPU: 99.33%
> > About a 23% increase over
Re: Network performance with small packets - continued
On Wednesday, March 09, 2011 10:09:26 am Tom Lendacky wrote:
> On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
> > On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
> > > We've been doing some more experimenting with the small packet
> > > network performance problem in KVM. I have a different setup than
> > > what Steve D. was using so I re-baselined things on the kvm.git
> > > kernel on both the host and guest with a 10GbE adapter. I also
> > > made use of the virtio-stats patch.
> > >
> > > The virtual machine has 2 vCPUs, 8GB of memory and two virtio
> > > network adapters (the first connected to a 1GbE adapter and a LAN,
> > > the second connected to a 10GbE adapter that is direct connected
> > > to another system with the same 10GbE adapter) running the kvm.git
> > > kernel. The test was a TCP_RR test with 100 connections from a
> > > baremetal client to the KVM guest using a 256 byte message size in
> > > both directions. I used the uperf tool to do this after verifying
> > > the results against netperf. Uperf allows the specification of the
> > > number of connections as a parameter in an XML file as opposed to
> > > launching, in this case, 100 separate instances of netperf.
> > >
> > > Here is the baseline for baremetal using 2 physical CPUs:
> > >   Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
> > >   TxCPU: 7.88% RxCPU: 99.41%
> > >
> > > To be sure to get consistent results with KVM I disabled the
> > > hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
> > > ethernet adapter interrupts (this resulted in runs that differed
> > > by only about 2% from lowest to highest). The fact that pinning is
> > > required to get consistent results is a different problem that
> > > we'll have to look into later...
> > >
> > > Here is the KVM baseline (average of six runs):
> > >   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
> > >   Exits: 148,444.58 Exits/Sec
> > >   TxCPU: 2.40% RxCPU: 99.35%
> > > About 42% of baremetal.
> >
> > Can you add interrupt stats as well please?
>
> Yes I can. Just the guest interrupts for the virtio device?
>
> > > So I coded a quick patch to delay freeing of the used Tx buffers
> > > until more than half the ring was used (I did not test this under
> > > a stream condition so I don't know if this would have a negative
> > > impact). Here are the results from delaying the freeing of used Tx
> > > buffers (average of six runs):
> > >   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
> > >   Exits: 142,681.67 Exits/Sec
> > >   TxCPU: 2.78% RxCPU: 99.36%
> > > About a 4% increase over baseline and about 44% of baremetal.
> >
> > Hmm, I am not sure what you mean by delaying freeing.
>
> In the start_xmit function of virtio_net.c the first thing done is to
> free any used entries from the ring. I patched the code to track the
> number of used tx ring entries and only free the used entries when
> they are greater than half the capacity of the ring (similar to the
> way the rx ring is re-filled).
>
> > I think we do have a problem that free_old_xmit_skbs tries to flush
> > out the ring aggressively: it always polls until the ring is empty,
> > so there could be bursts of activity where we spend a lot of time
> > flushing the old entries before e.g. sending an ack, resulting in
> > latency bursts. Generally we'll need some smarter logic, but with
> > indirect at the moment we can just poll a single packet after we
> > post a new one, and be done with it. Is your patch something like
> > the patch below? Could you try mine as well please?
>
> Yes, I'll try the patch and post the results.
>
> > > This spread out the kick_notify but still resulted in a lot of
> > > them. I decided to build on the delayed Tx buffer freeing and code
> > > up an ethtool like coalescing patch in order to delay the
> > > kick_notify until there were at least 5 packets on the ring or
> > > 2000 usecs, whichever occurred first. Here are the results of
> > > delaying the kick_notify (average of six runs):
> > >   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
> > >   Exits: 102,587.28 Exits/Sec
> > >   TxCPU: 3.03% RxCPU: 99.33%
> > > About a 23% increase over baseline and about 52% of baremetal.
> > >
> > > Running the perf command against the guest I noticed almost 19% of
> > > the time being spent in _raw_spin_lock. Enabling lockstat in the
> > > guest showed a lot of contention in the irq_desc_lock_class.
> > > Pinning the virtio1-input interrupt to a single cpu in the guest
> > > and re-running the last test resulted in tremendous gains (average
> > > of six runs):
> > >   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
> > >   Exits: 62,603.37 Exits/Sec
> > >   TxCPU: 3.73% RxCPU: 98.52%
> > > About a 77% increase over baseline and about 74% of baremetal.
> > >
> > > Vhost is receiving a lot of notifications for packets that are to
> > > be transmitted (over 60% of the packets generate a kick_notify).
> > > Also, it looks like vhost is sending a lot of notifications for
> > > packets it has received before the guest can get scheduled to
> > > disable notifications and begin
Re: Network performance with small packets - continued
On Wednesday, March 09, 2011 04:45:12 pm Shirley Ma wrote: Hello Tom, Do you also have Rusty's virtio stat patch results for both send queue and recv queue to share here? Let me see what I can do about getting the data extracted, averaged and in a form that I can put in an email. Thanks Shirley -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Network performance with small packets - continued
On Wednesday, March 09, 2011 03:56:15 pm Michael S. Tsirkin wrote: On Wed, Mar 09, 2011 at 02:11:07PM -0600, Tom Lendacky wrote: Here are the results again with the addition of the interrupt rate that occurred on the guest virtio_net device: Here is the KVM baseline (average of six runs): Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec Exits: 148,444.58 Exits/Sec TxCPU: 2.40% RxCPU: 99.35% Virtio1-input Interrupts/Sec (CPU0/CPU1): 5,154/5,222 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About 42% of baremetal. Delayed freeing of TX buffers (average of six runs): Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec Exits: 142,681.67 Exits/Sec TxCPU: 2.78% RxCPU: 99.36% Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,796/4,908 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 4% increase over baseline and about 44% of baremetal. Looks like delayed freeing is a good idea generally. Is this my patch? Yours? These results are for my patch, I haven't had a chance to run your patch yet. Delaying kick_notify (kick every 5 packets -average of six runs): Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec Exits: 102,587.28 Exits/Sec TxCPU: 3.03% RxCPU: 99.33% Virtio1-input Interrupts/Sec (CPU0/CPU1): 4,200/4,293 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 23% increase over baseline and about 52% of baremetal. Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs): What exactly moves the interrupt handler between CPUs? irqbalancer? Does it matter which CPU you pin it to? If yes, do you have any idea why? Looking at the guest, irqbalance isn't running and the smp_affinity for the irq is set to 3 (both CPUs). It could be that irqbalance would help in this situation since it would probably change the smp_affinity mask to a single CPU and remove the irq lock contention (I think the last used index patch would be best though since it will avoid the extra irq injections). I'll kick off a run with irqbalance running. 
As for which CPU the interrupt gets pinned to, that doesn't matter - see below. Also, what happens without delaying kick_notify but with pinning? Here are the results of a single baseline run with the IRQ pinned to CPU0: Txn Rate: 108,212.12 Txn/Sec, Pkt Rate: 214,994 Pkts/Sec Exits: 119,310.21 Exits/Sec TxCPU: 9.63% RxCPU: 99.47% Virtio1-input Interrupts/Sec (CPU0/CPU1): Virtio1-output Interrupts/Sec (CPU0/CPU1): and CPU1: Txn Rate: 108,053.02 Txn/Sec, Pkt Rate: 214,678 Pkts/Sec Exits: 119,320.12 Exits/Sec TxCPU: 9.64% RxCPU: 99.42% Virtio1-input Interrupts/Sec (CPU0/CPU1): 13,608/0 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/13,830 About a 24% increase over baseline. Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec Exits: 62,603.37 Exits/Sec TxCPU: 3.73% RxCPU: 98.52% Virtio1-input Interrupts/Sec (CPU0/CPU1): 11,564/0 Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0 About a 77% increase over baseline and about 74% of baremetal. Hmm we get about 20 packets per interrupt on average. That's pretty decent. The problem is with exits. Let's try something adaptive in the host?
Network performance with small packets - continued
We've been doing some more experimenting with the small packet network performance problem in KVM. I have a different setup than what Steve D. was using so I re-baselined things on the kvm.git kernel on both the host and guest with a 10GbE adapter. I also made use of the virtio-stats patch. The virtual machine has 2 vCPUs, 8GB of memory and two virtio network adapters (the first connected to a 1GbE adapter and a LAN, the second connected to a 10GbE adapter that is direct connected to another system with the same 10GbE adapter) running the kvm.git kernel. The test was a TCP_RR test with 100 connections from a baremetal client to the KVM guest using a 256 byte message size in both directions. I used the uperf tool to do this after verifying the results against netperf. Uperf allows the specification of the number of connections as a parameter in an XML file as opposed to launching, in this case, 100 separate instances of netperf. Here is the baseline for baremetal using 2 physical CPUs: Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec TxCPU: 7.88% RxCPU: 99.41% To be sure to get consistent results with KVM I disabled the hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and ethernet adapter interrupts (this resulted in runs that differed by only about 2% from lowest to highest). The fact that pinning is required to get consistent results is a different problem that we'll have to look into later... Here is the KVM baseline (average of six runs): Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec Exits: 148,444.58 Exits/Sec TxCPU: 2.40% RxCPU: 99.35% About 42% of baremetal. The virtio stats output showed a lot of kick_notify events happening when the ring was empty. So I coded a quick patch to delay freeing of the used Tx buffers until more than half the ring was used (I did not test this under a stream condition so I don't know if this would have a negative impact).
Here are the results from delaying the freeing of used Tx buffers (average of six runs): Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec Exits: 142,681.67 Exits/Sec TxCPU: 2.78% RxCPU: 99.36% About a 4% increase over baseline and about 44% of baremetal. This spread out the kick_notify calls but still resulted in a lot of them. I decided to build on the delayed Tx buffer freeing and code up an ethtool-like coalescing patch in order to delay the kick_notify until there were at least 5 packets on the ring or 2000 usecs had passed, whichever occurred first. Here are the results of delaying the kick_notify (average of six runs): Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec Exits: 102,587.28 Exits/Sec TxCPU: 3.03% RxCPU: 99.33% About a 23% increase over baseline and about 52% of baremetal. Running the perf command against the guest I noticed almost 19% of the time being spent in _raw_spin_lock. Enabling lockstat in the guest showed a lot of contention in the irq_desc_lock_class. Pinning the virtio1-input interrupt to a single cpu in the guest and re-running the last test resulted in tremendous gains (average of six runs): Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec Exits: 62,603.37 Exits/Sec TxCPU: 3.73% RxCPU: 98.52% About a 77% increase over baseline and about 74% of baremetal. Vhost is receiving a lot of notifications for packets that are to be transmitted (over 60% of the packets generate a kick_notify). Also, it looks like vhost is sending a lot of notifications for packets it has received before the guest can get scheduled to disable notifications and begin processing the packets resulting in some lock contention in the guest (and high interrupt rates). Some thoughts for the transmit path... can vhost be enhanced to do some adaptive polling so that the number of kick_notify events is reduced and replaced by kick_no_notify events?
Comparing the transmit path to the receive path, the guest disables notifications after the first kick and vhost re-enables notifications after completing processing of the tx ring. Can a similar thing be done for the receive path? Once vhost sends the first notification for a received packet it can disable notifications and let the guest re-enable notifications when it has finished processing the receive ring. Also, can the virtio-net driver do some adaptive polling (or does napi take care of that for the guest)? Running the same workload on the same configuration with a different hypervisor results in performance that is almost equivalent to baremetal without doing any pinning. Thanks, Tom Lendacky
Re: Network shutdown under load
Fix a race condition where qemu finds that there are not enough virtio ring buffers available and the guest makes more buffers available before qemu can enable notifications.

Signed-off-by: Tom Lendacky t...@us.ibm.com
Signed-off-by: Anthony Liguori aligu...@us.ibm.com

 hw/virtio-net.c |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 6e48997..5c0093e 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -379,7 +379,15 @@ static int virtio_net_has_buffers(VirtIONet *n, int bufsize)
         (n->mergeable_rx_bufs &&
          !virtqueue_avail_bytes(n->rx_vq, bufsize, 0))) {
         virtio_queue_set_notification(n->rx_vq, 1);
-        return 0;
+
+        /* To avoid a race condition where the guest has made some buffers
+         * available after the above check but before notification was
+         * enabled, check for available buffers again.
+         */
+        if (virtio_queue_empty(n->rx_vq) ||
+            (n->mergeable_rx_bufs &&
+             !virtqueue_avail_bytes(n->rx_vq, bufsize, 0)))
+            return 0;
     }
     virtio_queue_set_notification(n->rx_vq, 0);

On Friday 29 January 2010 02:06:41 pm Tom Lendacky wrote: There's been some discussion of this already in the kvm list, but I want to summarize what I've found and also include the qemu-devel list in an effort to find a solution to this problem. Running a netperf test between two kvm guests results in the guest's network interface shutting down. I originally found this using kvm guests on two different machines that were connected via a 10GbE link. However, I found this problem can be easily reproduced using two guests on the same machine. I am running the 2.6.32 level of the kvm.git tree and the 0.12.1.2 level of the qemu-kvm.git tree. The setup includes two bridges, br0 and br1.
The commands used to start the guests are as follows: usr/local/bin/qemu-system-x86_64 -name cape-vm001 -m 1024 -drive file=/autobench/var/tmp/cape-vm001-raw.img,if=virtio,index=0,media=disk,boot=on -net nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:51,netdev=cape-vm001-eth0 -netdev tap,id=cape-vm001-eth0,script=/autobench/var/tmp/ifup-kvm-br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:D1,netdev=cape-vm001-eth1 -netdev tap,id=cape-vm001-eth1,script=/autobench/var/tmp/ifup-kvm-br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :1 -monitor telnet::5701,server,nowait -snapshot -daemonize usr/local/bin/qemu-system-x86_64 -name cape-vm002 -m 1024 -drive file=/autobench/var/tmp/cape-vm002-raw.img,if=virtio,index=0,media=disk,boot=on -net nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:61,netdev=cape-vm002-eth0 -netdev tap,id=cape-vm002-eth0,script=/autobench/var/tmp/ifup-kvm-br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:E1,netdev=cape-vm002-eth1 -netdev tap,id=cape-vm002-eth1,script=/autobench/var/tmp/ifup-kvm-br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :2 -monitor telnet::5702,server,nowait -snapshot -daemonize The ifup-kvm-br0 script takes the (first) qemu created tap device and brings it up and adds it to bridge br0. The ifup-kvm-br1 script takes the (second) qemu created tap device and brings it up and adds it to bridge br1. Each ethernet device within a guest is on its own subnet.
For example: guest 1 eth0 has addr 192.168.100.32 and eth1 has addr 192.168.101.32 guest 2 eth0 has addr 192.168.100.64 and eth1 has addr 192.168.101.64 On one of the guests run netserver: netserver -L 192.168.101.32 -p 12000 On the other guest run netperf: netperf -L 192.168.101.64 -H 192.168.101.32 -p 12000 -t TCP_STREAM -l 60 -c -C -- -m 16K -M 16K It may take more than one netperf run (I find that my second run almost always causes the shutdown) but the network on the eth1 links will stop working. I did some debugging and found that in qemu on the guest running netserver: - the receive_disabled variable is set and never gets reset - the read_poll event handler for the eth1 tap device is disabled and never re-enabled These conditions result in no packets being read from the tap device and sent to the guest - effectively shutting down the network. Network connectivity can be restored by shutting down the guest interfaces, unloading the virtio_net module, re-loading the virtio_net module and re-starting the guest interfaces. I'm continuing to work on debugging this, but would appreciate it if some folks with more qemu network experience could try to recreate and debug this. If my kernel config matters, I can provide that. Thanks, Tom
Re: Multiple TAP Interfaces, with multiple bridges
On Wednesday 03 February 2010 10:56:43 am J L wrote: Hi, I am having an odd networking issue. It is one of those "it used to work, and now it doesn't" kind of things. I can't work out what I am doing differently. I have a virtual machine, started with (among other things): -net nic,macaddr=fa:9e:0b:53:d2:7d,model=rtl8139 -net tap,script=/images/1/ifup-eth0,downscript=/images/1/ifdown-eth0 -net nic,macaddr=fa:02:4e:86:ed:ce,model=e1000 -net tap,script=/images/1/ifup-eth1,downscript=/images/1/ifdown-eth1 I believe this has to do with the qemu vlan support. If you don't specify the vlan= option you end up with nics on the same vlan. You need to assign the two nics to separate vlans using vlan= on each net parameter, eg: -net nic,vlan=0,macaddr=fa:9e:0b:53:d2:7d,model=rtl8139 -net tap,vlan=0,script=/images/1/ifup-eth0,downscript=/images/1/ifdown-eth0 -net nic,vlan=1,macaddr=fa:02:4e:86:ed:ce,model=e1000 -net tap,vlan=1,script=/images/1/ifup-eth1,downscript=/images/1/ifdown-eth1 Try that and see if you get the results you expect. Tom The ifup-ethX script inserts the tap interface into the correct bridge (of which there are multiple.) The Virtual Machine is Centos 5.3, with a 2.6.27.21 kernel. The Host is Ubuntu 9.10 with a 2.6.31 kernel. My network then looks like: The Virtual Machine has an eth0 interface, which is matched with tap0 on the host. The Virtual Machine has an eth1 interface, which is matched with tap1 on the host. The host has a bridge br0, which contains tap0 and eth0. The host has a bridge br1, which contains tap1. There is a server on the same network as the Host's eth0. The Virtual Machine's eth0 interface is down. The Virtual Machine's eth1 interface has an IP address of 192.168.1.10/24. The Virtual Machine has a default gateway of 192.168.1.1. The host's br0 has an IP address of 192.168.0.1/24. The host's br1 has an IP address of 192.168.1.1/24. The server has an IP address of 192.168.0.20/24, and a default gateway of 192.168.0.1.
Firewalling is disabled everywhere. I have allowed time for the bridges and STP to settle. If I go to the Virtual Machine, and ping 192.168.0.20 (the server), I would expect tcpdumps to show:
* VM: eth1, dest MAC of Host's tap1/br0
* Host: tap1, dest MAC of Host's tap1/br0
* Host: br1, dest MAC of Host's tap1/br0
* Host now routes from br1 to br0
* Host: tap0, no packet
* Host: br0, dest MAC of Server
* Host: eth0, dest MAC of Server
* Server: eth0, dest MAC of Server
What I actually get:
* VM: eth1, dest MAC of Host's tap1/br0
* Host: tap1, dest MAC of Host's tap1/br0
* Host: br1, dest MAC of Host's tap1/br0
* Host should, but does not, route from br0 to br1
* Host: tap0, dest MAC of ***Host's tap1/br0***
* Host: br0, dest MAC of ***Host's tap1/br0***
* Host: eth0, no packet
* Server: eth0, no packet
As you can see, the packet has egressed *both* tap interfaces! Is this expected behaviour? What can I do about this? If I remove tap0 from the bridge, I then get:
* VM: eth1, dest MAC of Host's tap1/br0
* Host: tap1, dest MAC of Host's tap1/br0
* Host: br1, dest MAC of Host's tap1/br0
* Host should, but does not, route from br0 to br1
* Host: tap0, no packet
* Host: br0, no packet
* Host: eth0, no packet
* Server: eth0, no packet
This is the other half of my problem: in this case, with effectively only one tap, the host is not routing between br1 and br0. The packet just gets silently dropped. Does anyone know what I am doing wrong? I hope I have managed to explain this well enough! Thanks, -- Jarrod Lowe
Re: network shutdown under heavy load
On Wednesday 20 January 2010 09:48:04 am Tom Lendacky wrote: On Tuesday 19 January 2010 05:57:53 pm Chris Wright wrote: * Tom Lendacky (t...@linux.vnet.ibm.com) wrote: On Wednesday 13 January 2010 03:52:28 pm Chris Wright wrote: (Mark cc'd, sound familiar?) * Tom Lendacky (t...@linux.vnet.ibm.com) wrote: On Sunday 10 January 2010 06:38:54 am Avi Kivity wrote: On 01/10/2010 02:35 PM, Herbert Xu wrote: On Sun, Jan 10, 2010 at 02:30:12PM +0200, Avi Kivity wrote: This isn't in 2.6.27.y. Herbert, can you send it there? It appears that now that TX is fixed we have a similar problem with RX. Once I figure that one out I'll send them together. I've been experiencing the network shutdown issue also. I've been running netperf tests across 10GbE adapters with Qemu 0.12.1.2, RHEL5.4 guests and 2.6.32 kernel (from kvm.git) guests. I instrumented Qemu to print out some network statistics. It appears that at some point in the netperf test the receiving guest ends up having the 10GbE device receive_disabled variable in its VLANClientState structure stuck at 1. From looking at the code it appears that the virtio-net driver in the guest should cause qemu_flush_queued_packets in net.c to eventually run and clear the receive_disabled variable but it's not happening. I don't seem to have these issues when I have a lot of debug settings active in the guest kernel which results in very low/poor network performance - maybe some kind of race condition? Ok, here's an update. After realizing that none of the ethtool offload options were enabled in my guest, I found that I needed to be using the -netdev option on the qemu command line. Once I did that, some ethtool offload options were enabled and the deadlock did not appear when I did networking between guests on different machines. However, the deadlock did appear when I did networking between guests on the same machine. What does your full command line look like? 
And when the networking stops does your same receive_disabled hack make things work? The command line when using the -net option for the tap device is: /usr/local/bin/qemu-system-x86_64 -name cape-vm001 -m 1024 -drive file=/autobench/var/tmp/cape-vm001-raw.img,if=virtio,index=0,media=disk,boot=on -net nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:51 -net tap,vlan=0,script=/autobench/var/tmp/ifup-kvm-br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:D1 -net tap,vlan=1,script=/autobench/var/tmp/ifup-kvm-br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :1 -monitor telnet::5701,server,nowait -snapshot -daemonize when using the -netdev option for the tap device: /usr/local/bin/qemu-system-x86_64 -name cape-vm001 -m 1024 -drive file=/autobench/var/tmp/cape-vm001-raw.img,if=virtio,index=0,media=disk,boot=on -net nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:51,netdev=cape-vm001-eth0 -netdev tap,id=cape-vm001-eth0,script=/autobench/var/tmp/ifup-kvm-br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:D1,netdev=cape-vm001-eth1 -netdev tap,id=cape-vm001-eth1,script=/autobench/var/tmp/ifup-kvm-br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :1 -monitor telnet::5701,server,nowait -snapshot -daemonize The first ethernet device is a 1GbE device for communicating with the automation infrastructure we have. The second ethernet device is the 10GbE device that the netperf tests run on. I can get the networking to work again by bringing down the interfaces and reloading the virtio_net module (modprobe -r virtio_net / modprobe virtio_net). I haven't had a chance yet to run the tests against a modified version of qemu that does not set the receive_disabled variable. I got a chance to run with the setting of the receive_disabled variable commented out and I still run into the problem. It's easier to reproduce when running netperf between two guests on the same machine.
I instrumented qemu and virtio a little bit to try and track this down. What I'm seeing is that, with two guests on the same machine, the receiving (netserver) guest eventually gets into a condition where the tap read poll callback is disabled and never re-enabled. So packets are never delivered from tap to qemu and to the guest. On the sending (netperf) side the transmit queue eventually runs out of capacity and it can no longer send packets (I believe this is unique to having the guests on the same machine). And as before, bringing down the interfaces, reloading the virtio_net module, and restarting the interfaces clears things up. Tom thanks, -chris
Re: network shutdown under heavy load
On Tuesday 19 January 2010 05:57:53 pm Chris Wright wrote: * Tom Lendacky (t...@linux.vnet.ibm.com) wrote: On Wednesday 13 January 2010 03:52:28 pm Chris Wright wrote: (Mark cc'd, sound familiar?) * Tom Lendacky (t...@linux.vnet.ibm.com) wrote: On Sunday 10 January 2010 06:38:54 am Avi Kivity wrote: On 01/10/2010 02:35 PM, Herbert Xu wrote: On Sun, Jan 10, 2010 at 02:30:12PM +0200, Avi Kivity wrote: This isn't in 2.6.27.y. Herbert, can you send it there? It appears that now that TX is fixed we have a similar problem with RX. Once I figure that one out I'll send them together. I've been experiencing the network shutdown issue also. I've been running netperf tests across 10GbE adapters with Qemu 0.12.1.2, RHEL5.4 guests and 2.6.32 kernel (from kvm.git) guests. I instrumented Qemu to print out some network statistics. It appears that at some point in the netperf test the receiving guest ends up having the 10GbE device receive_disabled variable in its VLANClientState structure stuck at 1. From looking at the code it appears that the virtio-net driver in the guest should cause qemu_flush_queued_packets in net.c to eventually run and clear the receive_disabled variable but it's not happening. I don't seem to have these issues when I have a lot of debug settings active in the guest kernel which results in very low/poor network performance - maybe some kind of race condition? Ok, here's an update. After realizing that none of the ethtool offload options were enabled in my guest, I found that I needed to be using the -netdev option on the qemu command line. Once I did that, some ethtool offload options were enabled and the deadlock did not appear when I did networking between guests on different machines. However, the deadlock did appear when I did networking between guests on the same machine. What does your full command line look like? And when the networking stops does your same receive_disabled hack make things work? 
The command line when using the -net option for the tap device is: /usr/local/bin/qemu-system-x86_64 -name cape-vm001 -m 1024 -drive file=/autobench/var/tmp/cape-vm001-raw.img,if=virtio,index=0,media=disk,boot=on -net nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:51 -net tap,vlan=0,script=/autobench/var/tmp/ifup-kvm-br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:D1 -net tap,vlan=1,script=/autobench/var/tmp/ifup-kvm-br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :1 -monitor telnet::5701,server,nowait -snapshot -daemonize when using the -netdev option for the tap device: /usr/local/bin/qemu-system-x86_64 -name cape-vm001 -m 1024 -drive file=/autobench/var/tmp/cape-vm001-raw.img,if=virtio,index=0,media=disk,boot=on -net nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:51,netdev=cape-vm001-eth0 -netdev tap,id=cape-vm001-eth0,script=/autobench/var/tmp/ifup-kvm-br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:D1,netdev=cape-vm001-eth1 -netdev tap,id=cape-vm001-eth1,script=/autobench/var/tmp/ifup-kvm-br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :1 -monitor telnet::5701,server,nowait -snapshot -daemonize The first ethernet device is a 1GbE device for communicating with the automation infrastructure we have. The second ethernet device is the 10GbE device that the netperf tests run on. I can get the networking to work again by bringing down the interfaces and reloading the virtio_net module (modprobe -r virtio_net / modprobe virtio_net). I haven't had a chance yet to run the tests against a modified version of qemu that does not set the receive_disabled variable.
Tom thanks, -chris
Re: network shutdown under heavy load
On Wednesday 13 January 2010 03:52:28 pm Chris Wright wrote: (Mark cc'd, sound familiar?) * Tom Lendacky (t...@linux.vnet.ibm.com) wrote: On Sunday 10 January 2010 06:38:54 am Avi Kivity wrote: On 01/10/2010 02:35 PM, Herbert Xu wrote: On Sun, Jan 10, 2010 at 02:30:12PM +0200, Avi Kivity wrote: This isn't in 2.6.27.y. Herbert, can you send it there? It appears that now that TX is fixed we have a similar problem with RX. Once I figure that one out I'll send them together. I've been experiencing the network shutdown issue also. I've been running netperf tests across 10GbE adapters with Qemu 0.12.1.2, RHEL5.4 guests and 2.6.32 kernel (from kvm.git) guests. I instrumented Qemu to print out some network statistics. It appears that at some point in the netperf test the receiving guest ends up having the 10GbE device receive_disabled variable in its VLANClientState structure stuck at 1. From looking at the code it appears that the virtio-net driver in the guest should cause qemu_flush_queued_packets in net.c to eventually run and clear the receive_disabled variable but it's not happening. I don't seem to have these issues when I have a lot of debug settings active in the guest kernel which results in very low/poor network performance - maybe some kind of race condition? Ok, here's an update. After realizing that none of the ethtool offload options were enabled in my guest, I found that I needed to be using the -netdev option on the qemu command line. Once I did that, some ethtool offload options were enabled and the deadlock did not appear when I did networking between guests on different machines. However, the deadlock did appear when I did networking between guests on the same machine. 
Tom
Re: network shutdown under heavy load
On Sunday 10 January 2010 06:38:54 am Avi Kivity wrote: On 01/10/2010 02:35 PM, Herbert Xu wrote: On Sun, Jan 10, 2010 at 02:30:12PM +0200, Avi Kivity wrote: This isn't in 2.6.27.y. Herbert, can you send it there? It appears that now that TX is fixed we have a similar problem with RX. Once I figure that one out I'll send them together. I've been experiencing the network shutdown issue also. I've been running netperf tests across 10GbE adapters with Qemu 0.12.1.2, RHEL5.4 guests and 2.6.32 kernel (from kvm.git) guests. I instrumented Qemu to print out some network statistics. It appears that at some point in the netperf test the receiving guest ends up having the 10GbE device receive_disabled variable in its VLANClientState structure stuck at 1. From looking at the code it appears that the virtio-net driver in the guest should cause qemu_flush_queued_packets in net.c to eventually run and clear the receive_disabled variable but it's not happening. I don't seem to have these issues when I have a lot of debug settings active in the guest kernel which results in very low/poor network performance - maybe some kind of race condition? Tom Thanks. Who is maintaining that BTW, sta...@kernel.org? Yes.
KVM virtio network performance on RHEL5.4
I've been trying to understand why the performance from guest to guest over a 10GbE link using virtio, as measured by netperf, dramatically decreases when the socket buffer size is increased on the receiving guest. This is an Intel X3210 4-core 2.13GHz system running RHEL5.4. I don't see this drop in performance when going from guest to host or host to guest over the 10GbE link. Here are the results from netperf:

Default socket buffer sizes:

Recv   Send    Send                        Utilization      Service Demand
Socket Socket  Message  Elapsed            Send     Recv    Send    Recv
Size   Size    Size     Time   Throughput  local    remote  local   remote
bytes  bytes   bytes    secs.  10^6bits/s  % S      % S     us/KB   us/KB

 87380  16384  16384    60.01     2268.47  47.69    99.95   1.722   3.609

Receiver 256K socket buffer size (actually rmem_max * 2):

Recv   Send    Send                        Utilization      Service Demand
Socket Socket  Message  Elapsed            Send     Recv    Send    Recv
Size   Size    Size     Time   Throughput  local    remote  local   remote
bytes  bytes   bytes    secs.  10^6bits/s  % S      % S     us/KB   us/KB

262142  16384  16384    60.00     1583.75  39.00    74.09   2.018   3.832

There is increased idle time in the receiver. Using systemtap I found that the idle time is because we are waiting for data (tcp_recvmsg calling sk_wait_data). I instrumented qemu on the receiver side to print out some statistics related to xmit/recv events.
Rx-Could not receive is incremented whenever do_virtio_net_can_receive returns 0. Rx-Ring full is incremented in do_virtio_net_can_receive whenever there are no available entries/space in the receive ring. Rx-Count is incremented whenever virtio_net_receive2 is called (and can receive data). Rx-Bytes is increased in virtio_net_receive2 by the number of bytes to be read from the tap device. Rx-Ring buffers is increased by the number of buffers used for the data in virtio_net_receive2. Tx-Notify is incremented whenever virtio_net_handle_tx is invoked. Tx-Sched BH is incremented whenever virtio_net_handle_tx is invoked and the qemu_bh hasn't been scheduled yet. Tx-Packets is incremented in virtio_net_flush_tx whenever a packet is removed from the transmit ring and sent to qemu. Tx-Bytes is increased in virtio_net_flush_tx by the number of bytes sent to qemu. Here are the stats for the two cases:

                            Default            256K
Rx-Could not receive          3,559               0
Rx-Ring full                  3,559               0
Rx-Count                  1,063,056         805,012
Rx-Bytes             18,131,704,980  12,593,270,826
Rx-Ring buffers           4,963,793       3,541,010
Tx-Notify                   125,068         125,702
Tx-Sched BH                 125,068         125,702
Tx-Packets                  147,256         232,219
Tx-Bytes                 11,486,448      18,113,586

Dividing the Tx-Bytes by Tx-Packets in each case yields about 78 bytes/packet so these are most likely ACKs. But why am I seeing almost 85,000 more of these in the 256K socket buffer case? Also, dividing the Rx-Bytes by the Rx-Count shows that the tap device is delivering about 1413 bytes less per call to qemu in the 256K socket buffer case. Does anyone have some insight as to what is happening? Thanks, Tom Lendacky
QemuOpts changes breaks multiple nic options
The recent change to QemuOpts for the -net nic option breaks specifying -net nic,... more than once. The net_init_nic function's return value in net.c is a table index, which is non-zero after the first time it is called. The qemu_opts_foreach function in qemu-option.c receives the non-zero return value and stops processing further -net options (like associated -net tap options). It looks like the usb net function makes use of the index value, so the fix might best be to have qemu_opts_foreach check for a return code < 0 as being an error? Tom Lendacky