Re: RFT: virtio_net: limit xmit polling

2011-06-28 Thread Tom Lendacky
On Sunday, June 19, 2011 05:27:00 AM Michael S. Tsirkin wrote:
 OK, different people seem to test different trees.  In the hope to get
 everyone on the same page, I created several variants of this patch so
 they can be compared. Whoever's interested, please check out the
 following, and tell me how these compare:
 
 kernel:
 
 git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
 
 virtio-net-limit-xmit-polling/base - this is net-next baseline to test against
 virtio-net-limit-xmit-polling/v0 - fixes checks on out of capacity
 virtio-net-limit-xmit-polling/v1 - previous revision of the patch
   this does xmit,free,xmit,2*free,free
 virtio-net-limit-xmit-polling/v2 - new revision of the patch
   this does free,xmit,2*free,free
 

Here's a summary of the results.  I've also attached an ODS format spreadsheet
(30 KB in size) that might be easier to analyze and also has some pinned VM
results data.  I broke the tests down into a local guest-to-guest scenario
and a remote host-to-guest scenario.

Within the local guest-to-guest scenario I ran:
  - TCP_RR tests using two different message sizes and four different
instance counts among 1 pair of VMs and 2 pairs of VMs.
  - TCP_STREAM tests using four different message sizes and two different
instance counts among 1 pair of VMs and 2 pairs of VMs.

Within the remote host-to-guest scenario (over a 10GbE link) I ran:
  - TCP_RR tests using two different message sizes and four different
instance counts to 1 VM and 4 VMs.
  - TCP_STREAM and TCP_MAERTS tests using four different message sizes and
two different instance counts to 1 VM and 4 VMs.

*** Local Guest-to-Guest ***

Here's the local guest-to-guest summary for 1 VM pair doing TCP_RR with
256/256 request/response message size in transactions per second:

Instances   Base        V0          V1          V2
1            8,151.56    8,460.72    8,439.16    9,990.37
25          48,761.74   51,032.62   51,103.25   49,533.52
50          55,687.38   55,974.18   56,854.10   54,888.65
100         58,255.06   58,255.86   60,380.90   59,308.36

Here's the local guest-to-guest summary for 2 VM pairs doing TCP_RR with
256/256 request/response message size in transactions per second:

Instances   Base        V0          V1          V2
1           18,758.48   19,112.50   18,597.07   19,252.04
25          80,500.50   78,801.78   80,590.68   78,782.07
50          80,594.20   77,985.44   80,431.72   77,246.90
100         82,023.23   81,325.96   81,303.32   81,727.54

Here's the local guest-to-guest summary for 1 VM pair doing TCP_STREAM with
256, 1K, 4K and 16K message size in Mbps:

256:
Instances   Base        V0          V1          V2
1              961.78    1,115.92      794.02      740.37
4            2,498.33    2,541.82    2,441.60    2,308.26

1K:
1            3,476.61    3,522.02    2,170.86    1,395.57
4            6,344.30    7,056.57    7,275.16    7,174.09

4K:
1            9,213.57   10,647.44    9,883.42    9,007.29
4           11,070.66   11,300.37   11,001.02   12,103.72

16K:
1           12,065.94    9,437.78   11,710.60    6,989.93
4           12,755.28   13,050.78   12,518.06   13,227.33

Here's the local guest-to-guest summary for 2 VM pairs doing TCP_STREAM with
256, 1K, 4K and 16K message size in Mbps:

256:
Instances   Base        V0          V1          V2
1            2,434.98    2,403.23    2,308.69    2,261.35
4            5,973.82    5,729.48    5,956.76    5,831.86

1K:
1            5,305.99    5,148.72    4,960.67    5,067.76
4           10,628.38   10,649.49   10,098.90   10,380.09

4K:
1           11,577.03   10,710.33   11,700.53   10,304.09
4           14,580.66   14,881.38   14,551.17   15,053.02

16K:
1           16,801.46   16,072.50   15,773.78   15,835.66
4           17,194.00   17,294.02   17,319.78   17,121.09


*** Remote Host-to-Guest ***

Here's the remote host-to-guest summary for 1 VM doing TCP_RR with
256/256 request/response message size in transactions per second:

Instances   Base        V0          V1          V2
1            9,732.99   10,307.98   10,529.82    8,889.28
25          43,976.18   49,480.50   46,536.66   45,682.38
50          63,031.33   67,127.15   60,073.34   65,748.62
100         64,778.43   65,338.07   66,774.12   69,391.22

Here's the remote host-to-guest summary for 4 VMs doing TCP_RR with
256/256 request/response 

Re: RFT: virtio_net: limit xmit polling

2011-06-21 Thread Tom Lendacky
On Sunday, June 19, 2011 05:27:00 AM Michael S. Tsirkin wrote:
 OK, different people seem to test different trees.  In the hope to get
 everyone on the same page, I created several variants of this patch so
 they can be compared. Whoever's interested, please check out the
 following, and tell me how these compare:

I'm in the process of testing these patches.  Base and v0 are complete
and v1 is near complete with v2 to follow.  I'm testing with a variety
of TCP_RR and TCP_STREAM/TCP_MAERTS tests involving local guest-to-guest
tests and remote host-to-guest tests.  I'll post the results in the next
day or two when the tests finish.

Thanks,
Tom

 
 kernel:
 
 git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git
 
 virtio-net-limit-xmit-polling/base - this is net-next baseline to test against
 virtio-net-limit-xmit-polling/v0 - fixes checks on out of capacity
 virtio-net-limit-xmit-polling/v1 - previous revision of the patch
   this does xmit,free,xmit,2*free,free
 virtio-net-limit-xmit-polling/v2 - new revision of the patch
   this does free,xmit,2*free,free
 
 There's also this on top:
 virtio-net-limit-xmit-polling/v3 - don't delay avail index update
 I don't think it's important to test this one, yet
 
 Userspace to use: event index work is not yet merged upstream
 so the revision to use is still this:
 git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git
 virtio-net-event-idx-v3


Re: [PATCH 09/18] virtio: use avail_event index

2011-05-17 Thread Tom Lendacky

On Monday, May 16, 2011 02:12:21 AM Rusty Russell wrote:
 On Sun, 15 May 2011 16:55:41 +0300, Michael S. Tsirkin m...@redhat.com 
wrote:
  On Mon, May 09, 2011 at 02:03:26PM +0930, Rusty Russell wrote:
   On Wed, 4 May 2011 23:51:47 +0300, Michael S. Tsirkin m...@redhat.com 
wrote:
Use the new avail_event feature to reduce the number
of exits from the guest.
   
   Figures here would be nice :)
  
  You mean ASCII art in comments?
 
 I mean benchmarks of some kind.

I'm working on getting some benchmark results for the patches.  I should 
hopefully have something in the next day or two.

Tom
 
@@ -228,6 +237,12 @@ add_head:
 * new available array entries. */

virtio_wmb();
vq->vring.avail->idx++;

+   /* If the driver never bothers to kick in a very long while,
+* avail index might wrap around. If that happens, invalidate
+* kicked_avail index we stored. TODO: make sure all drivers
+* kick at least once in 2^16 and remove this. */
+   if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
+   vq->kicked_avail_valid = true;
   
   If they don't, they're already buggy.  Simply do:
   WARN_ON(vq->vring.avail->idx == vq->kicked_avail);
  
  Hmm, but does it say that somewhere?
 
 AFAICT it's a corollary of:
 1) You have a finite ring of size <= 2^16.
 2) You need to kick the other side once you've done some work.
 
@@ -482,6 +517,8 @@ void vring_transport_features(struct
virtio_device *vdev)

break;

case VIRTIO_RING_F_USED_EVENT_IDX:
break;

+   case VIRTIO_RING_F_AVAIL_EVENT_IDX:
+   break;

default:
/* We don't understand this bit. */
clear_bit(i, vdev->features);
   
   Does this belong in a prior patch?
   
   Thanks,
   Rusty.
  
  Well if we don't support the feature in the ring we should not
  ack the feature, right?
 
 Ah, you're right.
 
 Thanks,
 Rusty.


Re: [PATCH 05/18] virtio: used event index interface

2011-05-04 Thread Tom Lendacky
On Wednesday, May 04, 2011 03:51:09 PM Michael S. Tsirkin wrote:
 Define a new feature bit for the guest to utilize a used_event index
 (like Xen) instead of a flag bit to enable/disable interrupts.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  include/linux/virtio_ring.h |9 +
  1 files changed, 9 insertions(+), 0 deletions(-)
 
 diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
 index e4d144b..f5c1b75 100644
 --- a/include/linux/virtio_ring.h
 +++ b/include/linux/virtio_ring.h
 @@ -29,6 +29,10 @@
  /* We support indirect buffer descriptors */
  #define VIRTIO_RING_F_INDIRECT_DESC  28
 
 +/* The Guest publishes the used index for which it expects an interrupt
 + * at the end of the avail ring. Host should ignore the avail->flags
 + * field. */
 +#define VIRTIO_RING_F_USED_EVENT_IDX   29
 +
  /* Virtio ring descriptors: 16 bytes.  These can chain together via
   * next. */
  struct vring_desc {
   /* Address (guest-physical). */
 @@ -83,6 +87,7 @@ struct vring {
   *   __u16 avail_flags;
   *   __u16 avail_idx;
   *   __u16 available[num];
 + *   __u16 used_event_idx;
   *
   *   // Padding to the next align boundary.
   *   char pad[];
 @@ -93,6 +98,10 @@ struct vring {
   *   struct vring_used_elem used[num];
   * };
   */
 +/* We publish the used event index at the end of the available ring.
 + * It is at the end for backwards compatibility. */
 +#define vring_used_event(vr) ((vr)->avail->ring[(vr)->num])
 +
  static inline void vring_init(struct vring *vr, unsigned int num, void *p,
 unsigned long align)
  {

You should update the vring_size procedure to account for the extra field at
the end of the available ring by changing the (2 + num) to (3 + num):
return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (3 + num)
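
For reference, a minimal sketch of what the adjusted procedure could look like
(illustrative only, not the actual patch; it assumes the alignment and
used-ring terms of the existing vring_size stay unchanged):

/* Sketch: the avail ring now carries one extra __u16 (used_event) after
 * available[num], so the avail-side term grows from (2 + num) to (3 + num). */
static inline unsigned vring_size(unsigned int num, unsigned long align)
{
	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (3 + num)
		 + align - 1) & ~(align - 1))
		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
}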

Tom


Re: [PATCH 09/18] virtio: use avail_event index

2011-05-04 Thread Tom Lendacky

On Wednesday, May 04, 2011 03:51:47 PM Michael S. Tsirkin wrote:
 Use the new avail_event feature to reduce the number
 of exits from the guest.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  drivers/virtio/virtio_ring.c |   39 ++-
  1 files changed, 38 insertions(+), 1 deletions(-)
 
 diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
 index 3a3ed75..262dfe6 100644
 --- a/drivers/virtio/virtio_ring.c
 +++ b/drivers/virtio/virtio_ring.c
 @@ -82,6 +82,15 @@ struct vring_virtqueue
   /* Host supports indirect buffers */
   bool indirect;
 
 + /* Host publishes avail event idx */
 + bool event;
 +
 + /* Is kicked_avail below valid? */
 + bool kicked_avail_valid;
 +
 + /* avail idx value we already kicked. */
 + u16 kicked_avail;
 +
   /* Number of free buffers */
   unsigned int num_free;
   /* Head of free buffer list. */
 @@ -228,6 +237,12 @@ add_head:
* new available array entries. */
   virtio_wmb();
   vq->vring.avail->idx++;
 + /* If the driver never bothers to kick in a very long while,
 +  * avail index might wrap around. If that happens, invalidate
 +  * kicked_avail index we stored. TODO: make sure all drivers
 +  * kick at least once in 2^16 and remove this. */
 + if (unlikely(vq->vring.avail->idx == vq->kicked_avail))
 + vq->kicked_avail_valid = true;

vq->kicked_avail_valid should be set to false here.

Tom

 
   pr_debug("Added buffer head %i to %p\n", head, vq);
   END_USE(vq);
 @@ -236,6 +251,23 @@ add_head:
  }
  EXPORT_SYMBOL_GPL(virtqueue_add_buf_gfp);
 
 +
 +static bool vring_notify(struct vring_virtqueue *vq)
 +{
 + u16 old, new;
 + bool v;
 + if (!vq->event)
 + return !(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY);
 +
 + v = vq->kicked_avail_valid;
 + old = vq->kicked_avail;
 + new = vq->kicked_avail = vq->vring.avail->idx;
 + vq->kicked_avail_valid = true;
 + if (unlikely(!v))
 + return true;
 + return vring_need_event(vring_avail_event(&vq->vring), new, old);
 +}
 +
  void virtqueue_kick(struct virtqueue *_vq)
  {
   struct vring_virtqueue *vq = to_vvq(_vq);
 @@ -244,7 +276,7 @@ void virtqueue_kick(struct virtqueue *_vq)
   /* Need to update avail index before checking if we should notify */
   virtio_mb();
 
 - if (!(vq->vring.used->flags & VRING_USED_F_NO_NOTIFY))
 + if (vring_notify(vq))
   /* Prod other side to tell it about changes. */
   vq->notify(&vq->vq);
 
 @@ -437,6 +469,8 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
  vq->vq.name = name;
  vq->notify = notify;
  vq->broken = false;
 + vq->kicked_avail_valid = false;
 + vq->kicked_avail = 0;
  vq->last_used_idx = 0;
  list_add_tail(&vq->vq.list, &vdev->vqs);
  #ifdef DEBUG
 @@ -444,6 +478,7 @@ struct virtqueue *vring_new_virtqueue(unsigned int num,
  #endif
 
  vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
 + vq->event = virtio_has_feature(vdev, VIRTIO_RING_F_AVAIL_EVENT_IDX);
 
   /* No callback?  Tell other side not to bother us. */
   if (!callback)
 @@ -482,6 +517,8 @@ void vring_transport_features(struct virtio_device *vdev)
   break;
   case VIRTIO_RING_F_USED_EVENT_IDX:
   break;
 + case VIRTIO_RING_F_AVAIL_EVENT_IDX:
 + break;
   default:
   /* We don't understand this bit. */
  clear_bit(i, vdev->features);
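
For context, the kick decision in vring_notify() above hinges on the
vring_need_event() comparison introduced earlier in this series; the definition
as it ended up upstream is reproduced here purely as a reference:

/* Returns true if the other side asked, via its event index, to be
 * notified now that the producer has moved from old to new_idx. */
static inline int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
{
	return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
}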


Re: Network performance with small packets - continued

2011-03-10 Thread Tom Lendacky
On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:
 On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
  As for which CPU the interrupt gets pinned to, that doesn't matter - see
  below.
 
 So what hurts us the most is that the IRQ jumps between the VCPUs?

Yes, it appears that allowing the IRQ to run on more than one vCPU hurts.  
Without the publish last used index patch, vhost keeps injecting an irq for 
every received packet until the guest eventually turns off notifications. 
Because the irq injections end up overlapping we get contention on the 
irq_desc_lock_class lock. Here are some results using the baseline setup 
with irqbalance running.

  Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
  Exits: 121,050.45 Exits/Sec
  TxCPU: 9.61%  RxCPU: 99.45%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,975/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 24% increase over baseline.  Irqbalance essentially pinned the virtio 
irq to CPU0 preventing the irq lock contention and resulting in nice gains.



Re: Network performance with small packets - continued

2011-03-10 Thread Tom Lendacky
On Thursday, March 10, 2011 09:34:22 am Michael S. Tsirkin wrote:
 On Thu, Mar 10, 2011 at 09:23:42AM -0600, Tom Lendacky wrote:
  On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:
   On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
As for which CPU the interrupt gets pinned to, that doesn't matter -
see below.
   
   So what hurts us the most is that the IRQ jumps between the VCPUs?
  
  Yes, it appears that allowing the IRQ to run on more than one vCPU hurts.
  Without the publish last used index patch, vhost keeps injecting an irq
  for every received packet until the guest eventually turns off
  notifications.
 
 Are you sure you see that? If yes publish used should help a lot.

I definitely see that.  I ran lockstat in the guest and saw the contention on 
the lock when the irq was able to run on either vCPU.  Once the irq was pinned 
the contention disappeared.  The publish used index patch should eliminate the 
extra irq injections and then the pinning or use of irqbalance shouldn't be 
required.  I'm getting a kernel oops during boot with the publish last used 
patches that I pulled from the mailing list - I had to make some changes in 
order to get them to apply and compile and might not have done the right 
things.  Can you re-spin that patchset against kvm.git?

 
  Because the irq injections end up overlapping we get contention on the
  irq_desc_lock_class lock. Here are some results using the baseline
  setup with irqbalance running.
  
Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
Exits: 121,050.45 Exits/Sec
TxCPU: 9.61%  RxCPU: 99.45%
Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,975/0
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
  
  About a 24% increase over baseline.  Irqbalance essentially pinned the
  virtio irq to CPU0 preventing the irq lock contention and resulting in
  nice gains.
 
 OK, so we probably want some form of delayed free for TX
 on top, and that should get us nice results already.
 


Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
 On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
  We've been doing some more experimenting with the small packet network
  performance problem in KVM.  I have a different setup than what Steve D.
  was using so I re-baselined things on the kvm.git kernel on both the
  host and guest with a 10GbE adapter.  I also made use of the
  virtio-stats patch.
  
  The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
  adapters (the first connected to a 1GbE adapter and a LAN, the second
  connected to a 10GbE adapter that is direct connected to another system
  with the same 10GbE adapter) running the kvm.git kernel.  The test was a
  TCP_RR test with 100 connections from a baremetal client to the KVM
  guest using a 256 byte message size in both directions.
  
  I used the uperf tool to do this after verifying the results against
  netperf. Uperf allows the specification of the number of connections as
  a parameter in an XML file as opposed to launching, in this case, 100
  separate instances of netperf.
  
  Here is the baseline for baremetal using 2 physical CPUs:
Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
TxCPU: 7.88%  RxCPU: 99.41%
  
  To be sure to get consistent results with KVM I disabled the
  hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
  ethernet adapter interrupts (this resulted in runs that differed by only
  about 2% from lowest to highest).  The fact that pinning is required to
  get consistent results is a different problem that we'll have to look
  into later...
  
  Here is the KVM baseline (average of six runs):
Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
Exits: 148,444.58 Exits/Sec
TxCPU: 2.40%  RxCPU: 99.35%
  
  About 42% of baremetal.
 
 Can you add interrupt stats as well please?

Yes I can.  Just the guest interrupts for the virtio device?

 
  empty.  So I coded a quick patch to delay freeing of the used Tx buffers
  until more than half the ring was used (I did not test this under a
  stream condition so I don't know if this would have a negative impact). 
  Here are the results
  
  from delaying the freeing of used Tx buffers (average of six runs):
Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
Exits: 142,681.67 Exits/Sec
TxCPU: 2.78%  RxCPU: 99.36%
  
  About a 4% increase over baseline and about 44% of baremetal.
 
 Hmm, I am not sure what you mean by delaying freeing.

In the start_xmit function of virtio_net.c the first thing done is to free any 
used entries from the ring.  I patched the code to track the number of used tx 
ring entries and only free the used entries when they are greater than half 
the capacity of the ring (similar to the way the rx ring is re-filled).
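
To make the mechanism concrete, here is a minimal sketch of the idea (this is
not the actual patch; sq_used_count and sq_ring_size are hypothetical
bookkeeping fields, vi->svq is assumed to be the TX virtqueue as in the
virtio_net driver of that era, and the rest of the driver plumbing is omitted):

/* Sketch of the delayed-free policy described above: skip reclaiming
 * used TX entries until more than half of the ring has been consumed,
 * similar to how the RX ring is refilled in batches. */
static void free_old_xmit_skbs_lazy(struct virtnet_info *vi)
{
	struct sk_buff *skb;
	unsigned int len;

	if (vi->sq_used_count <= vi->sq_ring_size / 2)
		return;

	/* Reclaim everything the host has already consumed. */
	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
		vi->sq_used_count--;
		dev_kfree_skb_any(skb);
	}
}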

 I think we do have a problem that free_old_xmit_skbs
 tries to flush out the ring aggressively:
 it always polls until the ring is empty,
 so there could be bursts of activity where
 we spend a lot of time flushing the old entries
 before e.g. sending an ack, resulting in
 latency bursts.
 
 Generally we'll need some smarter logic,
 but with indirect at the moment we can just poll
 a single packet after we post a new one, and be done with it.
 Is your patch something like the patch below?
 Could you try mine as well please?

Yes, I'll try the patch and post the results.

 
  This spread out the kick_notify but still resulted in alot of them.  I
  decided to build on the delayed Tx buffer freeing and code up an
  ethtool like coalescing patch in order to delay the kick_notify until
  there were at least 5 packets on the ring or 2000 usecs, whichever
  occurred first.  Here are the
  
  results of delaying the kick_notify (average of six runs):
Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
Exits: 102,587.28 Exits/Sec
TxCPU: 3.03%  RxCPU: 99.33%
  
  About a 23% increase over baseline and about 52% of baremetal.
  
  Running the perf command against the guest I noticed almost 19% of the
  time being spent in _raw_spin_lock.  Enabling lockstat in the guest
  showed alot of contention in the irq_desc_lock_class. Pinning the
  virtio1-input interrupt to a single cpu in the guest and re-running the
  last test resulted in
  
  tremendous gains (average of six runs):
Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkgs/Sec
Exits: 62,603.37 Exits/Sec
TxCPU: 3.73%  RxCPU: 98.52%
  
  About a 77% increase over baseline and about 74% of baremetal.
  
  Vhost is receiving a lot of notifications for packets that are to be
  transmitted (over 60% of the packets generate a kick_notify).  Also, it
  looks like vhost is sending a lot of notifications for packets it has
  received before the guest can get scheduled to disable notifications and
  begin processing the packets
 
 Hmm, is this really what happens to you?  The effect would be that guest
 gets an interrupt while notifications are disabled in guest

Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
On Wednesday, March 09, 2011 01:17:44 am Michael S. Tsirkin wrote:
 On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
  I used the uperf tool to do this after verifying the results against
  netperf. Uperf allows the specification of the number of connections as
  a parameter in an XML file as opposed to launching, in this case, 100
  separate instances of netperf.
 
 Could you post the XML on the list please?

Environment variables are used to specify some of the values:
  uperf_instances=100
  uperf_dest=192.168.100.28
  uperf_duration=300
  uperf_tx_msgsize=256
  uperf_rx_msgsize=256

You can also change from threads to processes by specifying nprocs instead of 
nthreads in the group element.  I found this out later so all of my runs are 
using threads. Using processes will give you some improved performance but I
need to be consistent with my runs and stay with threads for now.

<?xml version="1.0"?>
<profile name="TCP_RR">
  <group nthreads="$uperf_instances">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=$uperf_dest protocol=tcp"/>
    </transaction>
    <transaction duration="$uperf_duration">
      <flowop type="write" options="size=$uperf_tx_msgsize"/>
      <flowop type="read"  options="size=$uperf_rx_msgsize"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect" />
    </transaction>
  </group>
</profile>


Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
Here are the results again with the addition of the interrupt rate that 
occurred on the guest virtio_net device:

Here is the KVM baseline (average of six runs):
  Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
  Exits: 148,444.58 Exits/Sec
  TxCPU: 2.40%  RxCPU: 99.35%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 5,154/5,222
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About 42% of baremetal.

Delayed freeing of TX buffers (average of six runs):
  Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
  Exits: 142,681.67 Exits/Sec
  TxCPU: 2.78%  RxCPU: 99.36%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,796/4,908
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 4% increase over baseline and about 44% of baremetal.

Delaying kick_notify (kick every 5 packets - average of six runs):
  Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
  Exits: 102,587.28 Exits/Sec
  TxCPU: 3.03%  RxCPU: 99.33%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,200/4,293
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 23% increase over baseline and about 52% of baremetal.

Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs):
  Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
  Exits: 62,603.37 Exits/Sec
  TxCPU: 3.73%  RxCPU: 98.52%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 11,564/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 77% increase over baseline and about 74% of baremetal.


On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
 On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
  We've been doing some more experimenting with the small packet network
  performance problem in KVM.  I have a different setup than what Steve D.
  was using so I re-baselined things on the kvm.git kernel on both the
  host and guest with a 10GbE adapter.  I also made use of the
  virtio-stats patch.
  
  The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
  adapters (the first connected to a 1GbE adapter and a LAN, the second
  connected to a 10GbE adapter that is direct connected to another system
  with the same 10GbE adapter) running the kvm.git kernel.  The test was a
  TCP_RR test with 100 connections from a baremetal client to the KVM
  guest using a 256 byte message size in both directions.
  
  I used the uperf tool to do this after verifying the results against
  netperf. Uperf allows the specification of the number of connections as
  a parameter in an XML file as opposed to launching, in this case, 100
  separate instances of netperf.
  
  Here is the baseline for baremetal using 2 physical CPUs:
Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
TxCPU: 7.88%  RxCPU: 99.41%
  
  To be sure to get consistent results with KVM I disabled the
  hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
  ethernet adapter interrupts (this resulted in runs that differed by only
  about 2% from lowest to highest).  The fact that pinning is required to
  get consistent results is a different problem that we'll have to look
  into later...
  
  Here is the KVM baseline (average of six runs):
Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
Exits: 148,444.58 Exits/Sec
TxCPU: 2.40%  RxCPU: 99.35%
  
  About 42% of baremetal.
 
 Can you add interrupt stats as well please?
 
  empty.  So I coded a quick patch to delay freeing of the used Tx buffers
  until more than half the ring was used (I did not test this under a
  stream condition so I don't know if this would have a negative impact). 
  Here are the results
  
  from delaying the freeing of used Tx buffers (average of six runs):
Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
Exits: 142,681.67 Exits/Sec
TxCPU: 2.78%  RxCPU: 99.36%
  
  About a 4% increase over baseline and about 44% of baremetal.
 
 Hmm, I am not sure what you mean by delaying freeing.
 I think we do have a problem that free_old_xmit_skbs
 tries to flush out the ring aggressively:
 it always polls until the ring is empty,
 so there could be bursts of activity where
 we spend a lot of time flushing the old entries
 before e.g. sending an ack, resulting in
 latency bursts.
 
 Generally we'll need some smarter logic,
 but with indirect at the moment we can just poll
 a single packet after we post a new one, and be done with it.
 Is your patch something like the patch below?
 Could you try mine as well please?
 
  This spread out the kick_notify but still resulted in alot of them.  I
  decided to build on the delayed Tx buffer freeing and code up an
  ethtool like coalescing patch in order to delay the kick_notify until
  there were at least 5 packets on the ring or 2000 usecs, whichever
  occurred first.  Here are the
  
  results of delaying the kick_notify (average of six runs):
Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
Exits: 102,587.28 Exits/Sec
TxCPU: 3.03%  RxCPU: 99.33%
  
  About a 23% increase over

Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
On Wednesday, March 09, 2011 10:09:26 am Tom Lendacky wrote:
 On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
  On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
   We've been doing some more experimenting with the small packet network
   performance problem in KVM.  I have a different setup than what Steve
   D. was using so I re-baselined things on the kvm.git kernel on both
   the host and guest with a 10GbE adapter.  I also made use of the
   virtio-stats patch.
   
   The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
   adapters (the first connected to a 1GbE adapter and a LAN, the second
   connected to a 10GbE adapter that is direct connected to another system
   with the same 10GbE adapter) running the kvm.git kernel.  The test was
   a TCP_RR test with 100 connections from a baremetal client to the KVM
   guest using a 256 byte message size in both directions.
   
   I used the uperf tool to do this after verifying the results against
   netperf. Uperf allows the specification of the number of connections as
   a parameter in an XML file as opposed to launching, in this case, 100
   separate instances of netperf.
   
   Here is the baseline for baremetal using 2 physical CPUs:
 Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
 TxCPU: 7.88%  RxCPU: 99.41%
   
   To be sure to get consistent results with KVM I disabled the
   hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
   ethernet adapter interrupts (this resulted in runs that differed by
   only about 2% from lowest to highest).  The fact that pinning is
   required to get consistent results is a different problem that we'll
   have to look into later...
   
   Here is the KVM baseline (average of six runs):
 Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
 Exits: 148,444.58 Exits/Sec
 TxCPU: 2.40%  RxCPU: 99.35%
   
   About 42% of baremetal.
  
  Can you add interrupt stats as well please?
 
 Yes I can.  Just the guest interrupts for the virtio device?
 
   empty.  So I coded a quick patch to delay freeing of the used Tx
   buffers until more than half the ring was used (I did not test this
   under a stream condition so I don't know if this would have a negative
   impact). Here are the results
   
   from delaying the freeing of used Tx buffers (average of six runs):
 Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
 Exits: 142,681.67 Exits/Sec
 TxCPU: 2.78%  RxCPU: 99.36%
   
   About a 4% increase over baseline and about 44% of baremetal.
  
  Hmm, I am not sure what you mean by delaying freeing.
 
 In the start_xmit function of virtio_net.c the first thing done is to free
 any used entries from the ring.  I patched the code to track the number of
 used tx ring entries and only free the used entries when they are greater
 than half the capacity of the ring (similar to the way the rx ring is
 re-filled).
 
  I think we do have a problem that free_old_xmit_skbs
  tries to flush out the ring aggressively:
  it always polls until the ring is empty,
  so there could be bursts of activity where
  we spend a lot of time flushing the old entries
  before e.g. sending an ack, resulting in
  latency bursts.
  
  Generally we'll need some smarter logic,
  but with indirect at the moment we can just poll
  a single packet after we post a new one, and be done with it.
  Is your patch something like the patch below?
  Could you try mine as well please?
 
 Yes, I'll try the patch and post the results.
 
   This spread out the kick_notify but still resulted in alot of them.  I
   decided to build on the delayed Tx buffer freeing and code up an
   ethtool like coalescing patch in order to delay the kick_notify until
   there were at least 5 packets on the ring or 2000 usecs, whichever
   occurred first.  Here are the
   
   results of delaying the kick_notify (average of six runs):
 Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
 Exits: 102,587.28 Exits/Sec
 TxCPU: 3.03%  RxCPU: 99.33%
   
   About a 23% increase over baseline and about 52% of baremetal.
   
   Running the perf command against the guest I noticed almost 19% of the
   time being spent in _raw_spin_lock.  Enabling lockstat in the guest
   showed alot of contention in the irq_desc_lock_class. Pinning the
   virtio1-input interrupt to a single cpu in the guest and re-running the
   last test resulted in
   
   tremendous gains (average of six runs):
 Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkgs/Sec
 Exits: 62,603.37 Exits/Sec
 TxCPU: 3.73%  RxCPU: 98.52%
   
   About a 77% increase over baseline and about 74% of baremetal.
   
   Vhost is receiving a lot of notifications for packets that are to be
   transmitted (over 60% of the packets generate a kick_notify).  Also, it
   looks like vhost is sending a lot of notifications for packets it has
   received before the guest can get scheduled to disable notifications
   and begin

Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
On Wednesday, March 09, 2011 04:45:12 pm Shirley Ma wrote:
 Hello Tom,
 
 Do you also have Rusty's virtio stat patch results for both send queue
 and recv queue to share here?

Let me see what I can do about getting the data extracted, averaged and in a 
form that I can put in an email.

 
 Thanks
 Shirley
 


Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
On Wednesday, March 09, 2011 03:56:15 pm Michael S. Tsirkin wrote:
 On Wed, Mar 09, 2011 at 02:11:07PM -0600, Tom Lendacky wrote:
  Here are the results again with the addition of the interrupt rate that
  occurred on the guest virtio_net device:
  
  Here is the KVM baseline (average of six runs):
Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
Exits: 148,444.58 Exits/Sec
TxCPU: 2.40%  RxCPU: 99.35%
Virtio1-input  Interrupts/Sec (CPU0/CPU1): 5,154/5,222
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
  
  About 42% of baremetal.
  
  Delayed freeing of TX buffers (average of six runs):
Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
Exits: 142,681.67 Exits/Sec
TxCPU: 2.78%  RxCPU: 99.36%
Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,796/4,908
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
  
  About a 4% increase over baseline and about 44% of baremetal.
 
 Looks like delayed freeing is a good idea generally.
 Is this my patch? Yours?

These results are for my patch, I haven't had a chance to run your patch yet.

 
  Delaying kick_notify (kick every 5 packets -average of six runs):
Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
Exits: 102,587.28 Exits/Sec
TxCPU: 3.03%  RxCPU: 99.33%
Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,200/4,293
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
  
  About a 23% increase over baseline and about 52% of baremetal.
 
  Delaying kick_notify and pinning virtio1-input to CPU0 (average of six 
runs):
 What exactly moves the interrupt handler between CPUs?
 irqbalancer?  Does it matter which CPU you pin it to?
 If yes, do you have any idea why?

Looking at the guest, irqbalance isn't running and the smp_affinity for the 
irq is set to 3 (both CPUs).  It could be that irqbalance would help in this 
situation since it would probably change the smp_affinity mask to a single CPU 
and remove the irq lock contention (I think the last used index patch would be 
best though since it will avoid the extra irq injections).  I'll kick off a 
run with irqbalance running.

As for which CPU the interrupt gets pinned to, that doesn't matter - see 
below.

 
 Also, what happens without delaying kick_notify
 but with pinning?

Here are the results of a single baseline run with the IRQ pinned to CPU0:

  Txn Rate: 108,212.12 Txn/Sec, Pkt Rate: 214,994 Pkts/Sec
  Exits: 119,310.21 Exits/Sec
  TxCPU: 9.63%  RxCPU: 99.47%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 
  Virtio1-output Interrupts/Sec (CPU0/CPU1):

and CPU1:

  Txn Rate: 108,053.02 Txn/Sec, Pkt Rate: 214,678 Pkts/Sec
  Exits: 119,320.12 Exits/Sec
  TxCPU: 9.64%  RxCPU: 99.42%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,608/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/13,830

About a 24% increase over baseline.

 
Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
Exits: 62,603.37 Exits/Sec
TxCPU: 3.73%  RxCPU: 98.52%
Virtio1-input  Interrupts/Sec (CPU0/CPU1): 11,564/0
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
  
  About a 77% increase over baseline and about 74% of baremetal.
 
 Hmm we get about 20 packets per interrupt on average.
 That's pretty decent. The problem is with exits.
 Let's try something adaptive in the host?


Network performance with small packets - continued

2011-03-07 Thread Tom Lendacky
We've been doing some more experimenting with the small packet network 
performance problem in KVM.  I have a different setup than what Steve D. was 
using so I re-baselined things on the kvm.git kernel on both the host and 
guest with a 10GbE adapter.  I also made use of the virtio-stats patch.

The virtual machine has 2 vCPUs, 8GB of memory and two virtio network adapters 
(the first connected to a 1GbE adapter and a LAN, the second connected to a 
10GbE adapter that is direct connected to another system with the same 10GbE 
adapter) running the kvm.git kernel.  The test was a TCP_RR test with 100 
connections from a baremetal client to the KVM guest using a 256 byte message 
size in both directions.

I used the uperf tool to do this after verifying the results against netperf.  
Uperf allows the specification of the number of connections as a parameter in 
an XML file as opposed to launching, in this case, 100 separate instances of 
netperf.

Here is the baseline for baremetal using 2 physical CPUs:
  Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
  TxCPU: 7.88%  RxCPU: 99.41%

To be sure to get consistent results with KVM I disabled the hyperthreads, 
pinned the qemu-kvm process, vCPUs, vhost thread and ethernet adapter 
interrupts (this resulted in runs that differed by only about 2% from lowest 
to highest).  The fact that pinning is required to get consistent results is a 
different problem that we'll have to look into later...

Here is the KVM baseline (average of six runs):
  Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
  Exits: 148,444.58 Exits/Sec
  TxCPU: 2.40%  RxCPU: 99.35%
About 42% of baremetal.

The virtio stats output showed a lot of kick_notify happening when the ring was
empty.  So I coded a quick patch to delay freeing of the used Tx buffers until 
more than half the ring was used (I did not test this under a stream condition 
so I don't know if this would have a negative impact).  Here are the results 
from delaying the freeing of used Tx buffers (average of six runs):
  Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
  Exits: 142,681.67 Exits/Sec
  TxCPU: 2.78%  RxCPU: 99.36%
About a 4% increase over baseline and about 44% of baremetal.

This spread out the kick_notify but still resulted in a lot of them.  I decided
to build on the delayed Tx buffer freeing and code up an ethtool like 
coalescing patch in order to delay the kick_notify until there were at least 5 
packets on the ring or 2000 usecs, whichever occurred first.  Here are the 
results of delaying the kick_notify (average of six runs):
  Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
  Exits: 102,587.28 Exits/Sec
  TxCPU: 3.03%  RxCPU: 99.33%
About a 23% increase over baseline and about 52% of baremetal.
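
As a rough illustration of the coalescing logic (again not the actual patch;
tx_pending and last_kick_ns are hypothetical fields, and only the 5-packet /
2000-usec thresholds come from the description above):

/* Sketch: count posted TX packets and kick the host only when at least
 * 5 packets are pending or 2000 usecs have passed since the last kick. */
static void virtnet_kick_coalesced(struct virtnet_info *vi)
{
	u64 now = ktime_to_ns(ktime_get());

	vi->tx_pending++;
	if (vi->tx_pending >= 5 ||
	    now - vi->last_kick_ns >= 2000 * NSEC_PER_USEC) {
		virtqueue_kick(vi->svq);
		vi->tx_pending = 0;
		vi->last_kick_ns = now;
	}
}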

Running the perf command against the guest I noticed almost 19% of the time 
being spent in _raw_spin_lock.  Enabling lockstat in the guest showed a lot of
contention in the irq_desc_lock_class. Pinning the virtio1-input interrupt 
to a single cpu in the guest and re-running the last test resulted in 
tremendous gains (average of six runs):
  Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
  Exits: 62,603.37 Exits/Sec
  TxCPU: 3.73%  RxCPU: 98.52%
About a 77% increase over baseline and about 74% of baremetal.

Vhost is receiving a lot of notifications for packets that are to be 
transmitted (over 60% of the packets generate a kick_notify).  Also, it looks 
like vhost is sending a lot of notifications for packets it has received 
before the guest can get scheduled to disable notifications and begin 
processing the packets resulting in some lock contention in the guest (and 
high interrupt rates).

Some thoughts for the transmit path...  can vhost be enhanced to do some 
adaptive polling so that the number of kick_notify events is reduced and
replaced by kick_no_notify events?

Comparing the transmit path to the receive path, the guest disables 
notifications after the first kick and vhost re-enables notifications after 
completing processing of the tx ring.  Can a similar thing be done for the 
receive path?  Once vhost sends the first notification for a received packet 
it can disable notifications and let the guest re-enable notifications when it 
has finished processing the receive ring.  Also, can the virtio-net driver do 
some adaptive polling (or does napi take care of that for the guest)?

Running the same workload on the same configuration with a different 
hypervisor results in performance that is almost equivalent to baremetal 
without doing any pinning.

Thanks,
Tom Lendacky


Re: Network shutdown under load

2010-02-08 Thread Tom Lendacky

Fix a race condition where qemu finds that there are not enough virtio
ring buffers available and the guest makes more buffers available before
qemu can enable notifications.

Signed-off-by: Tom Lendacky t...@us.ibm.com
Signed-off-by: Anthony Liguori aligu...@us.ibm.com

 hw/virtio-net.c |   10 +-
 1 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 6e48997..5c0093e 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -379,7 +379,15 @@ static int virtio_net_has_buffers(VirtIONet *n, int bufsize)
 (n->mergeable_rx_bufs &&
  !virtqueue_avail_bytes(n->rx_vq, bufsize, 0))) {
 virtio_queue_set_notification(n->rx_vq, 1);
-return 0;
+
+/* To avoid a race condition where the guest has made some buffers
+ * available after the above check but before notification was
+ * enabled, check for available buffers again.
+ */
+if (virtio_queue_empty(n->rx_vq) ||
+(n->mergeable_rx_bufs &&
+ !virtqueue_avail_bytes(n->rx_vq, bufsize, 0)))
+return 0;
 }
 
 virtio_queue_set_notification(n->rx_vq, 0);

On Friday 29 January 2010 02:06:41 pm Tom Lendacky wrote:
 There's been some discussion of this already in the kvm list, but I want to
 summarize what I've found and also include the qemu-devel list in an effort
  to find a solution to this problem.
 
 Running a netperf test between two kvm guests results in the guest's
  network interface shutting down. I originally found this using kvm guests
  on two different machines that were connected via a 10GbE link.  However,
  I found this problem can be easily reproduced using two guests on the same
  machine.
 
 I am running the 2.6.32 level of the kvm.git tree and the 0.12.1.2 level of
 the qemu-kvm.git tree.
 
 The setup includes two bridges, br0 and br1.
 
 The commands used to start the guests are as follows:
 usr/local/bin/qemu-system-x86_64 -name cape-vm001 -m 1024 -drive
 file=/autobench/var/tmp/cape-vm001-
 raw.img,if=virtio,index=0,media=disk,boot=on -net
 nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:51,netdev=cape-vm001-eth0 -
 netdev tap,id=cape-vm001-eth0,script=/autobench/var/tmp/ifup-kvm-
 br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net
 nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:D1,netdev=cape-vm001-eth1 -
 netdev tap,id=cape-vm001-eth1,script=/autobench/var/tmp/ifup-kvm-
 br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :1 -monitor
 telnet::5701,server,nowait -snapshot -daemonize
 
 usr/local/bin/qemu-system-x86_64 -name cape-vm002 -m 1024 -drive
 file=/autobench/var/tmp/cape-vm002-
 raw.img,if=virtio,index=0,media=disk,boot=on -net
 nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:61,netdev=cape-vm002-eth0 -
 netdev tap,id=cape-vm002-eth0,script=/autobench/var/tmp/ifup-kvm-
 br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net
 nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:E1,netdev=cape-vm002-eth1 -
 netdev tap,id=cape-vm002-eth1,script=/autobench/var/tmp/ifup-kvm-
 br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :2 -monitor
 telnet::5702,server,nowait -snapshot -daemonize
 
 The ifup-kvm-br0 script takes the (first) qemu created tap device and
  brings it up and adds it to bridge br0.  The ifup-kvm-br1 script take the
  (second) qemu created tap device and brings it up and adds it to bridge
  br1.
 
 Each ethernet device within a guest is on its own subnet.  For example:
   guest 1 eth0 has addr 192.168.100.32 and eth1 has addr 192.168.101.32
   guest 2 eth0 has addr 192.168.100.64 and eth1 has addr 192.168.101.64
 
 On one of the guests run netserver:
   netserver -L 192.168.101.32 -p 12000
 
 On the other guest run netperf:
   netperf -L 192.168.101.64 -H 192.168.101.32 -p 12000 -t TCP_STREAM -l 60
  -c -C -- -m 16K -M 16K
 
 It may take more than one netperf run (I find that my second run almost
  always causes the shutdown) but the network on the eth1 links will stop
  working.
 
 I did some debugging and found that in qemu on the guest running netserver:
  - the receive_disabled variable is set and never gets reset
  - the read_poll event handler for the eth1 tap device is disabled and
  never re-enabled
 These conditions result in no packets being read from the tap device and
  sent to the guest - effectively shutting down the network.  Network
  connectivity can be restored by shutting down the guest interfaces,
  unloading the virtio_net module, re-loading the virtio_net module and
  re-starting the guest interfaces.
 
 I'm continuing to work on debugging this, but would appreciate if some
  folks with more qemu network experience could try to recreate and debug
  this.
 
 If my kernel config matters, I can provide that.
 
 Thanks,
 Tom

Re: Multiple TAP Interfaces, with multiple bridges

2010-02-03 Thread Tom Lendacky
On Wednesday 03 February 2010 10:56:43 am J L wrote:
 Hi,
 
 I am having an odd networking issue. It is one of those it used to
 work, and now it doesn't kind of things. I can't work out what I am
 doing differently.
 
 I have a virtual machine, started with (among other things):
   -net nic,macaddr=fa:9e:0b:53:d2:7d,model=rtl8139 -net
 tap,script=/images/1/ifup-eth0,downscript=/images/1/ifdown-eth0
   -net nic,macaddr=fa:02:4e:86:ed:ce,model=e1000 -net
 tap,script=/images/1/ifup-eth1,downscript=/images/1/ifdown-eth1
 

I believe this has to do with the qemu vlan support. If you don't specify the 
vlan= option you end up with nics on the same vlan. You need to assign the two 
nics to separate vlans using vlan= on each net parameter, eg:


   -net nic,vlan=0,macaddr=fa:9e:0b:53:d2:7d,model=rtl8139 -net
 tap,vlan=0,script=/images/1/ifup-eth0,downscript=/images/1/ifdown-eth0
   -net nic,vlan=1,macaddr=fa:02:4e:86:ed:ce,model=e1000 -net
 tap,vlan=1,script=/images/1/ifup-eth1,downscript=/images/1/ifdown-eth1

Try that and see if you get the results you expect.

Tom

 The ifup-ethX script inserts the tap interface into the correct bridge
 (of which there are multiple.)
 
 The Virtual Machine is Centos 5.3, with a 2.6.27.21 kernel. The Host
 is Ubuntu 9.10 with a 2.6.31 kernel.
 
 
 My network then looks like:
 
 The Virtual Machine has an eth0 interface, which is matched with tap0
 on the host.
 The Virtual Machine has an eth1 interface, which is matched with tap1
 on the host.
 
 The host has a bridge br0, which contains tap0 and eth0.
 The host has a bridge br1, which contains tap1.
 
 There is a server on the same network as the Host's eth0.
 
 The Virtual Machines eth0 interface is down.
 The Virtual Machines eth1 interface has an IP address of 192.168.1.10/24.
 The Virtual Machine has a default gateway of 192.168.1.1.
 
 The host's br0 has an IP address of 192.168.0.1/24.
 The host's br1 has an IP address of 192.168.1.1/24.
 
 The server has an IP address of 192.168.0.20/24, and a default gateway
 of 192.168.0.1.
 
 Firewalling is disabled everywhere. I have allowed time for the
 bridges and STP to settle.
 
 
 
 If I go to the Virtual Machine, and ping 192.168.0.20 (the server), I
 would expect tcpdumps to show:
   * VM: eth1, dest MAC of Host's tap1/br0
   * Host: tap1, dest MAC of Host's tap1/br0
   * Host: br1, dest MAC of Host's tap1/br0
   * Host now routes from br1 to br0
   * Host: tap0, no packet
   * Host: br0, dest MAC of Server
   * Host: eth0, dest MAC of Server
   * Server: eth0, dest MAC of Server
 
 What I actually get:
   * VM: eth1, dest MAC of Host's tap1/br0
   * Host: tap1, dest MAC of Host's tap1/br0
   * Host: br1, dest MAC of Host's tap1/br0
   * Host should, but does not route from br0 to br1
   * Host: tap0, dest MAC of ***Host's tap1/br0***
   * Host: br0, dest MAC of ***Host's tap1/br0**
   * Host: eth0, no packet
   * Server: eth0, no packet
 
 As you can see, the packet has egressed *both* tap interfaces! Is this
 expected behaviour? What can I do about this?
 
 
 
 
 If I remove tap0 from the bridge, I then get:
   * VM: eth1, dest MAC of Host's tap1/br0
   * Host: tap1, dest MAC of Host's tap1/br0
   * Host: br1, dest MAC of Host's tap1/br0
   * Host should, but does not, route from br0 to br1
   * Host: tap0, no packet
   * Host: br0, no packet
   * Host: eth0, no packet
   * Server: eth0, no packet
 
 This is the other half of my problem: in this case, with effectively
 only one tap, the host is not routing between br1 and br0. The packet
 just gets silently dropped. Does anyone know what I am doing wrong?
 
 I hope I have managed to explain this well enough!
 
 Thanks,
 --
 Jarrod Lowe


Re: network shutdown under heavy load

2010-01-26 Thread Tom Lendacky
On Wednesday 20 January 2010 09:48:04 am Tom Lendacky wrote:
 On Tuesday 19 January 2010 05:57:53 pm Chris Wright wrote:
  * Tom Lendacky (t...@linux.vnet.ibm.com) wrote:
   On Wednesday 13 January 2010 03:52:28 pm Chris Wright wrote:
(Mark cc'd, sound familiar?)
   
* Tom Lendacky (t...@linux.vnet.ibm.com) wrote:
 On Sunday 10 January 2010 06:38:54 am Avi Kivity wrote:
  On 01/10/2010 02:35 PM, Herbert Xu wrote:
   On Sun, Jan 10, 2010 at 02:30:12PM +0200, Avi Kivity wrote:
   This isn't in 2.6.27.y.  Herbert, can you send it there?
  
   It appears that now that TX is fixed we have a similar problem
   with RX.  Once I figure that one out I'll send them together.

 I've been experiencing the network shutdown issue also.  I've been
 running netperf tests across 10GbE adapters with Qemu 0.12.1.2,
 RHEL5.4 guests and 2.6.32 kernel (from kvm.git) guests.  I
 instrumented Qemu to print out some network statistics.  It appears
 that at some point in the netperf test the receiving guest ends up
 having the 10GbE device receive_disabled variable in its
 VLANClientState structure stuck at 1. From looking at the code it
 appears that the virtio-net driver in the guest should cause
 qemu_flush_queued_packets in net.c to eventually run and clear the
 receive_disabled variable but it's not happening.  I don't seem
 to have these issues when I have a lot of debug settings active in
 the guest kernel which results in very low/poor network performance
 - maybe some kind of race condition?
  
   Ok, here's an update. After realizing that none of the ethtool offload
   options were enabled in my guest, I found that I needed to be using the
   -netdev option on the qemu command line.  Once I did that, some ethtool
   offload options were enabled and the deadlock did not appear when I did
   networking between guests on different machines.  However, the deadlock
   did appear when I did networking between guests on the same machine.
 
  What does your full command line look like?  And when the networking
  stops does your same receive_disabled hack make things work?
 
 The command line when using the -net option for the tap device is:
 
 /usr/local/bin/qemu-system-x86_64 -name cape-vm001 -m 1024 -drive
 file=/autobench/var/tmp/cape-vm001-
 raw.img,if=virtio,index=0,media=disk,boot=on -net
 nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:51 -net
 tap,vlan=0,script=/autobench/var/tmp/ifup-kvm-
 br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net
 nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:D1 -net
 tap,vlan=1,script=/autobench/var/tmp/ifup-kvm-
 br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :1 -monitor
 telnet::5701,server,nowait -snapshot -daemonize
 
 when using the -netdev option for the tap device:
 
 /usr/local/bin/qemu-system-x86_64 -name cape-vm001 -m 1024 -drive
 file=/autobench/var/tmp/cape-vm001-
 raw.img,if=virtio,index=0,media=disk,boot=on -net
 nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:51,netdev=cape-vm001-eth0 -
 netdev tap,id=cape-vm001-eth0,script=/autobench/var/tmp/ifup-kvm-
 br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net
 nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:D1,netdev=cape-vm001-eth1 -
 netdev tap,id=cape-vm001-eth1,script=/autobench/var/tmp/ifup-kvm-
 br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :1 -monitor
 telnet::5701,server,nowait -snapshot -daemonize
 
 
 The first ethernet device is a 1GbE device for communicating with the
 automation infrastructure we have.  The second ethernet device is the 10GbE
 device that the netperf tests run on.
 
 I can get the networking to work again by bringing down the interfaces and
 reloading the virtio_net module (modprobe -r virtio_net / modprobe
 virtio_net).
 
 I haven't had a chance yet to run the tests against a modified version of
  qemu that does not set the receive_disabled variable.

I got a chance to run with the setting of the receive_disabled variable
commented out and I still run into the problem.  It's easier to reproduce when 
running netperf between two guests on the same machine.  I instrumented qemu 
and virtio a little bit to try and track this down.  What I'm seeing is that, 
with two guests on the same machine, the receiving (netserver) guest 
eventually gets into a condition where the tap read poll callback is disabled 
and never re-enabled.  So packets are never delivered from tap to qemu and to 
the guest.  On the sending (netperf) side the transmit queue eventually runs 
out of capacity and it can no longer send packets (I believe this is unique to 
having the guests on the same machine).  And as before, bringing down the 
interfaces, reloading the virtio_net module, and restarting the interfaces 
clears things up.
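
For reference, here's a minimal, self-contained sketch of the receive_disabled 
gating as I understand it from reading net.c.  The struct and helpers below are 
my own simplification for illustration, not the actual qemu code:

#include <stdio.h>
#include <stdbool.h>

struct client {
    bool receive_disabled;   /* stands in for VLANClientState.receive_disabled */
    int  ring_space;         /* free slots in the guest rx ring (toy model) */
};

/* Deliver one packet; returns 0 when the client can't take it. */
static int deliver_packet(struct client *c)
{
    if (c->receive_disabled)
        return 0;                    /* nothing delivered */
    if (c->ring_space == 0) {
        c->receive_disabled = true;  /* gate all further delivery */
        return 0;
    }
    c->ring_space--;
    return 1;
}

/* Normally triggered when the guest posts rx buffers / kicks the ring. */
static void flush_queued_packets(struct client *c)
{
    c->receive_disabled = false;
}

int main(void)
{
    struct client c = { .receive_disabled = false, .ring_space = 1 };

    printf("delivered=%d\n", deliver_packet(&c));  /* 1: buffer available */
    printf("delivered=%d\n", deliver_packet(&c));  /* 0: ring full, gate set */
    printf("delivered=%d\n", deliver_packet(&c));  /* 0: still gated */
    c.ring_space = 1;                              /* guest refills the ring... */
    printf("delivered=%d\n", deliver_packet(&c));  /* 0: ...but no flush, still gated */
    flush_queued_packets(&c);                      /* clears receive_disabled */
    printf("delivered=%d\n", deliver_packet(&c));  /* 1: delivery resumes */
    return 0;
}

The wedged state we're hitting looks like the middle of that sequence: the flag 
gets set, the flush that should clear it never runs, and delivery stays stuck 
until the interfaces are bounced and virtio_net is reloaded.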

Tom

 
 Tom
 
  thanks,
  -chris

Re: network shutdown under heavy load

2010-01-20 Thread Tom Lendacky
On Tuesday 19 January 2010 05:57:53 pm Chris Wright wrote:
 * Tom Lendacky (t...@linux.vnet.ibm.com) wrote:
  On Wednesday 13 January 2010 03:52:28 pm Chris Wright wrote:
   (Mark cc'd, sound familiar?)
  
   * Tom Lendacky (t...@linux.vnet.ibm.com) wrote:
On Sunday 10 January 2010 06:38:54 am Avi Kivity wrote:
 On 01/10/2010 02:35 PM, Herbert Xu wrote:
  On Sun, Jan 10, 2010 at 02:30:12PM +0200, Avi Kivity wrote:
  This isn't in 2.6.27.y.  Herbert, can you send it there?
 
  It appears that now that TX is fixed we have a similar problem
  with RX.  Once I figure that one out I'll send them together.
   
I've been experiencing the network shutdown issue also.  I've been
running netperf tests across 10GbE adapters with Qemu 0.12.1.2,
RHEL5.4 guests and 2.6.32 kernel (from kvm.git) guests.  I
instrumented Qemu to print out some network statistics.  It appears
that at some point in the netperf test the receiving guest ends up
having the 10GbE device receive_disabled variable in its
VLANClientState structure stuck at 1. From looking at the code it
appears that the virtio-net driver in the guest should cause
qemu_flush_queued_packets in net.c to eventually run and clear the
receive_disabled variable but it's not happening.  I don't seem to
have these issues when I have a lot of debug settings active in the
guest kernel which results in very low/poor network performance -
maybe some kind of race condition?
 
  Ok, here's an update. After realizing that none of the ethtool offload
  options were enabled in my guest, I found that I needed to be using the
  -netdev option on the qemu command line.  Once I did that, some ethtool
  offload options were enabled and the deadlock did not appear when I did
  networking between guests on different machines.  However, the deadlock
  did appear when I did networking between guests on the same machine.
 
 What does your full command line look like?  And when the networking
 stops does your same receive_disabled hack make things work?

The command line when using the -net option for the tap device is:

/usr/local/bin/qemu-system-x86_64 -name cape-vm001 -m 1024 -drive 
file=/autobench/var/tmp/cape-vm001-
raw.img,if=virtio,index=0,media=disk,boot=on -net 
nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:51 -net 
tap,vlan=0,script=/autobench/var/tmp/ifup-kvm-
br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net 
nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:D1 -net 
tap,vlan=1,script=/autobench/var/tmp/ifup-kvm-
br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :1 -monitor 
telnet::5701,server,nowait -snapshot -daemonize

when using the -netdev option for the tap device:

/usr/local/bin/qemu-system-x86_64 -name cape-vm001 -m 1024 -drive 
file=/autobench/var/tmp/cape-vm001-
raw.img,if=virtio,index=0,media=disk,boot=on -net 
nic,model=virtio,vlan=0,macaddr=00:16:3E:00:62:51,netdev=cape-vm001-eth0 -
netdev tap,id=cape-vm001-eth0,script=/autobench/var/tmp/ifup-kvm-
br0,downscript=/autobench/var/tmp/ifdown-kvm-br0 -net 
nic,model=virtio,vlan=1,macaddr=00:16:3E:00:62:D1,netdev=cape-vm001-eth1 -
netdev tap,id=cape-vm001-eth1,script=/autobench/var/tmp/ifup-kvm-
br1,downscript=/autobench/var/tmp/ifdown-kvm-br1 -vnc :1 -monitor 
telnet::5701,server,nowait -snapshot -daemonize


The first ethernet device is a 1GbE device for communicating with the 
automation infrastructure we have.  The second ethernet device is the 10GbE 
device that the netperf tests run on.

I can get the networking to work again by bringing down the interfaces and 
reloading the virtio_net module (modprobe -r virtio_net / modprobe 
virtio_net).

I haven't had a chance yet to run the tests against a modified version of qemu 
that does not set the receive_disabled variable.

Tom

 
 thanks,
 -chris


Re: network shutdown under heavy load

2010-01-19 Thread Tom Lendacky
On Wednesday 13 January 2010 03:52:28 pm Chris Wright wrote:
 (Mark cc'd, sound familiar?)
 
 * Tom Lendacky (t...@linux.vnet.ibm.com) wrote:
  On Sunday 10 January 2010 06:38:54 am Avi Kivity wrote:
   On 01/10/2010 02:35 PM, Herbert Xu wrote:
On Sun, Jan 10, 2010 at 02:30:12PM +0200, Avi Kivity wrote:
This isn't in 2.6.27.y.  Herbert, can you send it there?
   
It appears that now that TX is fixed we have a similar problem
with RX.  Once I figure that one out I'll send them together.
 
  I've been experiencing the network shutdown issue also.  I've been
  running netperf tests across 10GbE adapters with Qemu 0.12.1.2, RHEL5.4
  guests and 2.6.32 kernel (from kvm.git) guests.  I instrumented Qemu to
  print out some network statistics.  It appears that at some point in the
  netperf test the receiving guest ends up having the 10GbE device
  receive_disabled variable in its VLANClientState structure stuck at 1. 
  From looking at the code it appears that the virtio-net driver in the
  guest should cause qemu_flush_queued_packets in net.c to eventually run
  and clear the receive_disabled variable but it's not happening.  I
  don't seem to have these issues when I have a lot of debug settings
  active in the guest kernel which results in very low/poor network
  performance - maybe some kind of race condition?
 

Ok, here's an update. After realizing that none of the ethtool offload options 
were enabled in my guest, I found that I needed to be using the -netdev option 
on the qemu command line.  Once I did that, some ethtool offload options were 
enabled and the deadlock did not appear when I did networking between guests 
on different machines.  However, the deadlock did appear when I did networking 
between guests on the same machine.

Tom



Re: network shutdown under heavy load

2010-01-13 Thread Tom Lendacky
On Sunday 10 January 2010 06:38:54 am Avi Kivity wrote:
 On 01/10/2010 02:35 PM, Herbert Xu wrote:
  On Sun, Jan 10, 2010 at 02:30:12PM +0200, Avi Kivity wrote:
  This isn't in 2.6.27.y.  Herbert, can you send it there?
 
  It appears that now that TX is fixed we have a similar problem
  with RX.  Once I figure that one out I'll send them together.
 

I've been experiencing the network shutdown issue also.  I've been running 
netperf tests across 10GbE adapters with Qemu 0.12.1.2, RHEL5.4 guests and 
2.6.32 kernel (from kvm.git) guests.  I instrumented Qemu to print out some 
network statistics.  It appears that at some point in the netperf test the 
receiving guest ends up having the 10GbE device receive_disabled variable in 
its VLANClientState structure stuck at 1.  From looking at the code it appears 
that the virtio-net driver in the guest should cause qemu_flush_queued_packets 
in net.c to eventually run and clear the receive_disabled variable but it's 
not happening.  I don't seem to have these issues when I have a lot of debug 
settings active in the guest kernel which results in very low/poor network 
performance - maybe some kind of race condition?

Tom

 Thanks.
 
  Who is maintaining that BTW, sta...@kernel.org?
 
 Yes.
 


KVM virtio network performance on RHEL5.4

2009-10-28 Thread Tom Lendacky
I've been trying to understand why the performance from guest to guest over a 
10GbE link using virtio, as measured by netperf, dramatically decreases when 
the socket buffer size is increased on the receiving guest.  This is an Intel 
X3210 4-core 2.13GHz system running RHEL5.4.  I don't see this drop in 
performance when going from guest to host or host to guest over the 10GbE 
link.  Here are the results from netperf:

Default socket buffer sizes:
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

 87380  16384  16384    60.01      2268.47   47.69    99.95    1.722   3.609

Receiver 256K socket buffer size (actually rmem_max * 2):
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % S      us/KB   us/KB

262142  16384  16384    60.00      1583.75   39.00    74.09    2.018   3.832

There is increased idle time in the receiver.  Using systemtap I found that 
the idle time is because we are waiting for data (tcp_recvmsg calling 
sk_wait_data).

I instrumented qemu on the receiver side to print out some statistics related 
to xmit/recv events.

  - Rx-Could not receive is incremented whenever do_virtio_net_can_receive
    returns 0.
  - Rx-Ring full is incremented in do_virtio_net_can_receive whenever there
    are no available entries/space in the receive ring.
  - Rx-Count is incremented whenever virtio_net_receive2 is called (and can
    receive data).
  - Rx-Bytes is increased in virtio_net_receive2 by the number of bytes to be
    read from the tap device.
  - Rx-Ring buffers is increased by the number of buffers used for the data in
    virtio_net_receive2.
  - Tx-Notify is incremented whenever virtio_net_handle_tx is invoked.
  - Tx-Sched BH is incremented whenever virtio_net_handle_tx is invoked and
    the qemu_bh hasn't been scheduled yet.
  - Tx-Packets is incremented in virtio_net_flush_tx whenever a packet is
    removed from the transmit ring and sent to qemu.
  - Tx-Bytes is increased in virtio_net_flush_tx by the number of bytes sent
    to qemu.
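
If it helps, this is roughly the shape of the instrumentation -- just a block of 
counters bumped at the points listed above.  The struct is my own illustration; 
only the function names are the ones referenced above:

#include <stdint.h>

/* Counter block matching the statistics below; each field is bumped
 * at the point named in its comment. */
struct vnet_stats {
    uint64_t rx_could_not_receive; /* do_virtio_net_can_receive returned 0 */
    uint64_t rx_ring_full;         /* no space left in the receive ring */
    uint64_t rx_count;             /* calls into virtio_net_receive2 */
    uint64_t rx_bytes;             /* bytes read from the tap device */
    uint64_t rx_ring_buffers;      /* rx ring buffers consumed */
    uint64_t tx_notify;            /* virtio_net_handle_tx invocations */
    uint64_t tx_sched_bh;          /* times the qemu_bh had to be scheduled */
    uint64_t tx_packets;           /* packets pulled off the tx ring in virtio_net_flush_tx */
    uint64_t tx_bytes;             /* bytes sent from virtio_net_flush_tx */
};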

Here are the stats for the two cases:

                             Default            256K
Rx-Could not receive           3,559               0
Rx-Ring full                   3,559               0
Rx-Count                   1,063,056         805,012
Rx-Bytes              18,131,704,980  12,593,270,826
Rx-Ring buffers            4,963,793       3,541,010
Tx-Notify                    125,068         125,702
Tx-Sched BH                  125,068         125,702
Tx-Packets                   147,256         232,219
Tx-Bytes                  11,486,448      18,113,586

Dividing Tx-Bytes by Tx-Packets in each case yields about 78 bytes/packet, 
so these are most likely ACKs.  But why am I seeing almost 85,000 more of 
these in the 256K socket buffer case?  Also, dividing Rx-Bytes by Rx-Count 
shows that the tap device is delivering about 1,413 fewer bytes per call to 
qemu in the 256K socket buffer case.

Does anyone have some insight as to what is happening?

Thanks,
Tom Lendacky


QemuOpts changes breaks multiple nic options

2009-10-12 Thread Tom Lendacky
The recent change to QemuOpts for the -net nic option breaks specifying -net 
nic,... more than once.  The net_init_nic function's return value in net.c is 
a table index, which is non-zero after the first time it is called.  The 
qemu_opts_foreach function in qemu-option.c receives the non-zero return value 
and stops processing further -net options (like associated -net tap options).  
It looks like the usb net function makes use of the index value, so the fix 
might best be to have qemu_opts_foreach check for a return code < 0 as being 
an error?
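
Something like the following is what I have in mind -- a standalone sketch of 
the loop convention where only a negative callback return stops the iteration. 
The names here are illustrative, not the actual qemu-option.c code:

#include <stdio.h>

typedef int (*opt_func)(int opt_index, void *opaque);

/* Iterate over options; only a negative callback return is treated
 * as an error.  A positive return (e.g. a nic table index) no longer
 * stops the iteration. */
static int opts_foreach(int nopts, opt_func func, void *opaque)
{
    int i;

    for (i = 0; i < nopts; i++) {
        int rc = func(i, opaque);
        if (rc < 0)
            return rc;
    }
    return 0;
}

/* Stand-in for net_init_nic: returns a table index, which is
 * non-zero after the first call. */
static int init_nic(int opt_index, void *opaque)
{
    (void)opaque;
    printf("initialized nic for -net option %d\n", opt_index);
    return opt_index;
}

int main(void)
{
    /* With the < 0 convention, all three -net nic options get processed. */
    return opts_foreach(3, init_nic, NULL);
}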

Tom Lendacky