Re: Network performance with small packets

2011-04-18 Thread Rusty Russell
On Thu, 14 Apr 2011 19:03:59 +0300, Michael S. Tsirkin m...@redhat.com 
wrote:
 On Thu, Apr 14, 2011 at 08:58:41PM +0930, Rusty Russell wrote:
  They have to offer the feature, so if they have some way of allocating
  non-page-aligned amounts of memory, they'll have to add those extra 2
  bytes.
  
  So I think it's OK...
  Rusty.
 
 To clarify, my concern is that we always seem to try to map
 these extra 2 bytes, which conceivably might fail?

No, if you look at the layout it's clear that there's always most of a
page left for this extra room, both in the middle and at the end.
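For the record, here's a quick back-of-the-envelope check (a sketch of my own,
assuming num=256 descriptors, 4096-byte alignment, and the structure sizes
quoted later in the thread) showing how much slack is actually there:

#include <stdio.h>

/* Rough model of the split-ring layout: descriptors, then the avail ring,
 * padding up to the alignment boundary, then the used ring.  The real
 * arithmetic is vring_size() in include/linux/virtio_ring.h. */
int main(void)
{
	unsigned num = 256, align = 4096;
	unsigned desc  = 16 * num;        /* struct vring_desc[num]        */
	unsigned avail = 2 * (2 + num);   /* __u16 flags, idx, ring[num]   */
	unsigned used  = 2 * 2 + 8 * num; /* flags, idx, vring_used_elem[] */

	unsigned first  = desc + avail;
	unsigned padded = (first + align - 1) & ~(align - 1);

	printf("slack after the avail ring: %u bytes\n", padded - first);
	printf("slack after the used ring:  %u bytes\n", align - (used % align));
	return 0;
}

With these numbers the slack works out to a few thousand bytes in the middle
and a couple of thousand at the end, so the extra 2 bytes always fit.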

Cheers,
Rusty.


Re: Network performance with small packets

2011-04-14 Thread Rusty Russell
On Tue, 12 Apr 2011 23:01:12 +0300, Michael S. Tsirkin m...@redhat.com 
wrote:
 On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote:
  Here's an old patch where I played with implementing this:
 
 ...
 
  
  virtio: put last_used and last_avail index into ring itself.
  
  Generally, the other end of the virtio ring doesn't need to see where
  you're up to in consuming the ring.  However, to completely understand
  what's going on from the outside, this information must be exposed.
  For example, if you want to save and restore a virtio_ring, but you're
  not the consumer because the kernel is using it directly.
  
  Fortunately, we have room to expand:
 
 This seems to be true for x86 kvm and lguest but is it true
 for s390?

Yes, as the ring is page aligned so there's always room.

 Will this last bit work on s390?
 If I understand correctly the memory is allocated by host there?

They have to offer the feature, so if they have some way of allocating
non-page-aligned amounts of memory, they'll have to add those extra 2
bytes.

So I think it's OK...
Rusty.


Re: Network performance with small packets

2011-04-14 Thread Michael S. Tsirkin
On Thu, Apr 14, 2011 at 08:58:41PM +0930, Rusty Russell wrote:
 On Tue, 12 Apr 2011 23:01:12 +0300, Michael S. Tsirkin m...@redhat.com 
 wrote:
  On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote:
   Here's an old patch where I played with implementing this:
  
  ...
  
   
   virtio: put last_used and last_avail index into ring itself.
   
   Generally, the other end of the virtio ring doesn't need to see where
   you're up to in consuming the ring.  However, to completely understand
   what's going on from the outside, this information must be exposed.
   For example, if you want to save and restore a virtio_ring, but you're
   not the consumer because the kernel is using it directly.
   
   Fortunately, we have room to expand:
  
  This seems to be true for x86 kvm and lguest but is it true
  for s390?
 
 Yes, as the ring is page aligned so there's always room.
 
  Will this last bit work on s390?
  If I understand correctly the memory is allocated by host there?
 
 They have to offer the feature, so if they have some way of allocating
 non-page-aligned amounts of memory, they'll have to add those extra 2
 bytes.
 
 So I think it's OK...
 Rusty.

To clarify, my concern is that we always seem to try to map
these extra 2 bytes, which conceivably might fail?

-- 
MST


Re: Network performance with small packets

2011-04-12 Thread Michael S. Tsirkin
On Thu, Mar 10, 2011 at 12:19:42PM +1030, Rusty Russell wrote:
 Here's an old patch where I played with implementing this:

...

 
 virtio: put last_used and last_avail index into ring itself.
 
 Generally, the other end of the virtio ring doesn't need to see where
 you're up to in consuming the ring.  However, to completely understand
 what's going on from the outside, this information must be exposed.
 For example, if you want to save and restore a virtio_ring, but you're
 not the consumer because the kernel is using it directly.
 
 Fortunately, we have room to expand:

This seems to be true for x86 kvm and lguest but is it true
for s390?

	err = vmem_add_mapping(config->address,
			       vring_size(config->num,
					  KVM_S390_VIRTIO_RING_ALIGN));
	if (err)
		goto out;

	vq = vring_new_virtqueue(config->num, KVM_S390_VIRTIO_RING_ALIGN,
				 vdev, (void *) config->address,
				 kvm_notify, callback, name);
	if (!vq) {
		err = -ENOMEM;
		goto unmap;
	}



 the ring is always a whole number
 of pages and there's hundreds of bytes of padding after the avail ring
 and the used ring, whatever the number of descriptors (which must be a
 power of 2).
 
 We add a feature bit so the guest can tell the host that it's writing
 out the current value there, if it wants to use that.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 ---



 --- a/include/linux/virtio_ring.h
 +++ b/include/linux/virtio_ring.h
 @@ -29,6 +29,9 @@
  /* We support indirect buffer descriptors */
  #define VIRTIO_RING_F_INDIRECT_DESC  28
  
 +/* We publish our last-seen used index at the end of the avail ring. */
 +#define VIRTIO_RING_F_PUBLISH_INDICES	29
 +
  /* Virtio ring descriptors: 16 bytes.  These can chain together via "next". */
  struct vring_desc
  {
 @@ -87,6 +90,7 @@ struct vring {
   *   __u16 avail_flags;
   *   __u16 avail_idx;
   *   __u16 available[num];
 + *   __u16 last_used_idx;
   *
   *   // Padding to the next align boundary.
   *   char pad[];
 @@ -95,6 +99,7 @@ struct vring {
   *   __u16 used_flags;
   *   __u16 used_idx;
   *   struct vring_used_elem used[num];
 + *   __u16 last_avail_idx;
   * };
   */
  static inline void vring_init(struct vring *vr, unsigned int num, void *p,
 @@ -111,9 +116,14 @@ static inline unsigned vring_size(unsign
  {
  	return ((sizeof(struct vring_desc) * num + sizeof(__u16) * (2 + num)
		 + align - 1) & ~(align - 1))
 -		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num;
 +		+ sizeof(__u16) * 2 + sizeof(struct vring_used_elem) * num + 2;
  }
  
 +/* We publish the last-seen used index at the end of the available ring, and
 + * vice-versa.  These are at the end for backwards compatibility. */
 +#define vring_last_used(vr) ((vr)->avail->ring[(vr)->num])
 +#define vring_last_avail(vr) (*(__u16 *)&(vr)->used->ring[(vr)->num])
 +

Will this last bit work on s390?
If I understand correctly the memory is allocated by host there?

  #ifdef __KERNEL__
  #include <linux/irqreturn.h>
  struct virtio_device;
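
As an aside, a minimal sketch (illustrative only; the helper is made up, the
layout is the one from the vring_last_used() macro quoted above) of how the
device side could use the published index to decide whether another signal is
actually needed:

/* Assumes the patched linux/virtio_ring.h layout quoted above.
 * Hypothetical helper, not part of any driver. */
static bool need_signal(const struct vring *vr, __u16 prev_used_idx)
{
	/* The guest publishes its last-seen used index at avail->ring[num]. */
	__u16 guest_seen = vr->avail->ring[vr->num];

	/* If the guest had already consumed everything published before the
	 * new used entries were added, it may have gone idle and needs a
	 * signal; otherwise it is still processing the ring. */
	return guest_seen == prev_used_idx;
}

If the guest is still behind, it will pick up the new entries in the pass it
is already making, so an extra signal buys nothing.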


Re: Network performance with small packets - continued

2011-03-10 Thread Tom Lendacky
On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:
 On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
  As for which CPU the interrupt gets pinned to, that doesn't matter - see
  below.
 
 So what hurts us the most is that the IRQ jumps between the VCPUs?

Yes, it appears that allowing the IRQ to run on more than one vCPU hurts.  
Without the publish last used index patch, vhost keeps injecting an irq for 
every received packet until the guest eventually turns off notifications. 
Because the irq injections end up overlapping we get contention on the 
irq_desc_lock_class lock. Here are some results using the baseline setup 
with irqbalance running.

  Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
  Exits: 121,050.45 Exits/Sec
  TxCPU: 9.61%  RxCPU: 99.45%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,975/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 24% increase over baseline.  Irqbalance essentially pinned the virtio 
irq to CPU0 preventing the irq lock contention and resulting in nice gains.



Re: Network performance with small packets - continued

2011-03-10 Thread Michael S. Tsirkin
On Thu, Mar 10, 2011 at 09:23:42AM -0600, Tom Lendacky wrote:
 On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:
  On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
   As for which CPU the interrupt gets pinned to, that doesn't matter - see
   below.
  
  So what hurts us the most is that the IRQ jumps between the VCPUs?
 
 Yes, it appears that allowing the IRQ to run on more than one vCPU hurts.  
 Without the publish last used index patch, vhost keeps injecting an irq for 
 every received packet until the guest eventually turns off notifications. 

Are you sure you see that? If yes publish used should help a lot.

 Because the irq injections end up overlapping we get contention on the 
 irq_desc_lock_class lock. Here are some results using the baseline setup 
 with irqbalance running.
 
   Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
   Exits: 121,050.45 Exits/Sec
   TxCPU: 9.61%  RxCPU: 99.45%
   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,975/0
   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
 
 About a 24% increase over baseline.  Irqbalance essentially pinned the virtio 
 irq to CPU0 preventing the irq lock contention and resulting in nice gains.

OK, so we probably want some form of delayed free for TX
on top, and that should get us nice results already.



Re: Network performance with small packets - continued

2011-03-10 Thread Tom Lendacky
On Thursday, March 10, 2011 09:34:22 am Michael S. Tsirkin wrote:
 On Thu, Mar 10, 2011 at 09:23:42AM -0600, Tom Lendacky wrote:
  On Thursday, March 10, 2011 12:54:58 am Michael S. Tsirkin wrote:
   On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
As for which CPU the interrupt gets pinned to, that doesn't matter -
see below.
   
   So what hurts us the most is that the IRQ jumps between the VCPUs?
  
  Yes, it appears that allowing the IRQ to run on more than one vCPU hurts.
  Without the publish last used index patch, vhost keeps injecting an irq
  for every received packet until the guest eventually turns off
  notifications.
 
 Are you sure you see that? If yes publish used should help a lot.

I definitely see that.  I ran lockstat in the guest and saw the contention on 
the lock when the irq was able to run on either vCPU.  Once the irq was pinned 
the contention disappeared.  The publish used index patch should eliminate the 
extra irq injections and then the pinning or use of irqbalance shouldn't be 
required.  I'm getting a kernel oops during boot with the publish last used 
patches that I pulled from the mailing list - I had to make some changes in 
order to get them to apply and compile and might not have done the right 
things.  Can you re-spin that patchset against kvm.git?

 
  Because the irq injections end up overlapping we get contention on the
  irq_desc_lock_class lock. Here are some results using the baseline
  setup with irqbalance running.
  
Txn Rate: 107,714.53 Txn/Sec, Pkt Rate: 214,006 Pkts/Sec
Exits: 121,050.45 Exits/Sec
TxCPU: 9.61%  RxCPU: 99.45%
Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,975/0
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
  
  About a 24% increase over baseline.  Irqbalance essentially pinned the
  virtio irq to CPU0 preventing the irq lock contention and resulting in
  nice gains.
 
 OK, so we probably want some form of delayed free for TX
 on top, and that should get us nice results already.
 


Re: Network performance with small packets

2011-03-10 Thread Rusty Russell
On Tue, 08 Mar 2011 20:21:18 -0600, Andrew Theurer 
haban...@linux.vnet.ibm.com wrote:
 On Tue, 2011-03-08 at 13:57 -0800, Shirley Ma wrote:
  On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote:
   I've finally read this thread... I think we need to get more serious
   with our stats gathering to diagnose these kind of performance issues.
   
   This is a start; it should tell us what is actually happening to the
   virtio ring(s) without significant performance impact... 
  
  Should we also add similar stats on the vhost vq as well, for monitoring
  vhost_signal & vhost_notify?
 
 Tom L has started using Rusty's patches and found some interesting
 results, sent yesterday:
 http://marc.info/?l=kvmm=129953710930124w=2

Hmm, I'm not subscribed to kvm@ any more, so I didn't get this, so
replying here:

 Also, it looks like vhost is sending a lot of notifications for
 packets it has received before the guest can get scheduled to disable
 notifications and begin processing the packets resulting in some lock
 contention in the guest (and high interrupt rates).

Yes, this is a virtio design flaw, but one that should be fixable.
We have room at the end of the ring, in which we can put a last_used
count.  Then we can tell if wakeups are redundant, before the guest
updates the flag.

Here's an old patch where I played with implementing this:

virtio: put last_used and last_avail index into ring itself.

Generally, the other end of the virtio ring doesn't need to see where
you're up to in consuming the ring.  However, to completely understand
what's going on from the outside, this information must be exposed.
For example, if you want to save and restore a virtio_ring, but you're
not the consumer because the kernel is using it directly.

Fortunately, we have room to expand: the ring is always a whole number
of pages and there's hundreds of bytes of padding after the avail ring
and the used ring, whatever the number of descriptors (which must be a
power of 2).

We add a feature bit so the guest can tell the host that it's writing
out the current value there, if it wants to use that.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/virtio/virtio_ring.c |   23 +++
 include/linux/virtio_ring.h  |   12 +++-
 2 files changed, 26 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -71,9 +71,6 @@ struct vring_virtqueue
/* Number we've added since last sync. */
unsigned int num_added;
 
-   /* Last used index we've seen. */
-   u16 last_used_idx;
-
/* How to notify other side. FIXME: commonalize hcalls! */
void (*notify)(struct virtqueue *vq);
 
@@ -278,12 +275,13 @@ static void detach_buf(struct vring_virt
 
 static inline bool more_used(const struct vring_virtqueue *vq)
 {
-   return vq->last_used_idx != vq->vring.used->idx;
+   return vring_last_used(&vq->vring) != vq->vring.used->idx;
 }
 
 static void *vring_get_buf(struct virtqueue *_vq, unsigned int *len)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
+   struct vring_used_elem *u;
void *ret;
unsigned int i;
 
@@ -300,8 +298,11 @@ static void *vring_get_buf(struct virtqu
return NULL;
}
 
-   i = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].id;
-   *len = vq->vring.used->ring[vq->last_used_idx%vq->vring.num].len;
+   u = &vq->vring.used->ring[vring_last_used(&vq->vring) % vq->vring.num];
+   i = u->id;
+   *len = u->len;
+   /* Make sure we don't reload i after doing checks. */
+   rmb();
 
if (unlikely(i >= vq->vring.num)) {
BAD_RING(vq, "id %u out of range\n", i);
@@ -315,7 +316,8 @@ static void *vring_get_buf(struct virtqu
/* detach_buf clears data, so grab it now. */
ret = vq->data[i];
detach_buf(vq, i);
-   vq->last_used_idx++;
+   vring_last_used(&vq->vring)++;
+
END_USE(vq);
return ret;
 }
@@ -402,7 +404,6 @@ struct virtqueue *vring_new_virtqueue(un
vq->vq.name = name;
vq->notify = notify;
vq->broken = false;
-   vq->last_used_idx = 0;
vq->num_added = 0;
list_add_tail(&vq->vq.list, &vdev->vqs);
 #ifdef DEBUG
@@ -413,6 +414,10 @@ struct virtqueue *vring_new_virtqueue(un
 
vq->indirect = virtio_has_feature(vdev, VIRTIO_RING_F_INDIRECT_DESC);
 
+   /* We publish indices whether they offer it or not: if not, it's junk
+* space anyway.  But calling this acknowledges the feature. */
+   virtio_has_feature(vdev, VIRTIO_RING_F_PUBLISH_INDICES);
+
/* No callback?  Tell other side not to bother us. */
if (!callback)
vq->vring.avail->flags |= VRING_AVAIL_F_NO_INTERRUPT;
@@ -443,6 +448,8 @@ void vring_transport_features(struct vir
switch (i) {
case VIRTIO_RING_F_INDIRECT_DESC:
break;
+   

Re: Network performance with small packets

2011-03-09 Thread Shirley Ma
On Tue, 2011-03-08 at 20:21 -0600, Andrew Theurer wrote:
 Tom L has started using Rusty's patches and found some interesting
 results, sent yesterday:
 http://marc.info/?l=kvmm=129953710930124w=2

Thanks. Very good experiment. I have been struggling with guest/vhost
optimization work for a while. I have created different experimental patches;
the performance results really depend on the workload.

Based on the discussions and findings, it seems that to improve the
virtio_net/vhost optimization work, we really need to collect more
statistics on both virtio_net and vhost, for both TX and RX. 

A way to count the guest exits, I/O exits and irq injections caused by the
guest networking stack alone would be helpful.

Thanks
Shirley



Re: Network performance with small packets - continued

2011-03-09 Thread Shirley Ma
On Wed, 2011-03-09 at 09:15 +0200, Michael S. Tsirkin wrote:
 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 82dba5a..ebe3337 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct
 virtnet_info *vi)
 struct sk_buff *skb;
 unsigned int len, tot_sgs = 0;
 
 -   while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 +   if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 pr_debug("Sent skb %p\n", skb);
 vi->dev->stats.tx_bytes += skb->len;
 vi->dev->stats.tx_packets++;
 -   tot_sgs += skb_vnet_hdr(skb)->num_sg;
 +   tot_sgs = 2+MAX_SKB_FRAGS;
 dev_kfree_skb_any(skb);
 }
 return tot_sgs;

Return value should be different based on indirect or direct buffers
here?

 @@ -576,9 +576,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
 struct net_device *dev)
 struct virtnet_info *vi = netdev_priv(dev);
 int capacity;
 
 -   /* Free up any pending old buffers before queueing new ones.
 */
 -   free_old_xmit_skbs(vi);
 -
 /* Try to transmit */
 capacity = xmit_skb(vi, skb);
 
 @@ -605,6 +602,10 @@ static netdev_tx_t start_xmit(struct sk_buff
 *skb, struct net_device *dev)
 skb_orphan(skb);
 nf_reset(skb);
 
 +   /* Free up any old buffers so we can queue new ones. */
 +   if (capacity < 2+MAX_SKB_FRAGS)
 +   capacity += free_old_xmit_skbs(vi);
 +
 /* Apparently nice girls don't return TX_BUSY; stop the queue
  * before it gets out of hand.  Naturally, this wastes
 entries. */
 if (capacity < 2+MAX_SKB_FRAGS) { 

I tried a similar patch before; it didn't help much on TCP stream
performance.  But I didn't try multiple-stream TCP_RR.

Shirley



Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
 On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
  We've been doing some more experimenting with the small packet network
  performance problem in KVM.  I have a different setup than what Steve D.
  was using so I re-baselined things on the kvm.git kernel on both the
  host and guest with a 10GbE adapter.  I also made use of the
  virtio-stats patch.
  
  The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
  adapters (the first connected to a 1GbE adapter and a LAN, the second
  connected to a 10GbE adapter that is direct connected to another system
  with the same 10GbE adapter) running the kvm.git kernel.  The test was a
  TCP_RR test with 100 connections from a baremetal client to the KVM
  guest using a 256 byte message size in both directions.
  
  I used the uperf tool to do this after verifying the results against
  netperf. Uperf allows the specification of the number of connections as
  a parameter in an XML file as opposed to launching, in this case, 100
  separate instances of netperf.
  
  Here is the baseline for baremetal using 2 physical CPUs:
Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
TxCPU: 7.88%  RxCPU: 99.41%
  
  To be sure to get consistent results with KVM I disabled the
  hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
  ethernet adapter interrupts (this resulted in runs that differed by only
  about 2% from lowest to highest).  The fact that pinning is required to
  get consistent results is a different problem that we'll have to look
  into later...
  
  Here is the KVM baseline (average of six runs):
Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
Exits: 148,444.58 Exits/Sec
TxCPU: 2.40%  RxCPU: 99.35%
  
  About 42% of baremetal.
 
 Can you add interrupt stats as well please?

Yes I can.  Just the guest interrupts for the virtio device?

 
  empty.  So I coded a quick patch to delay freeing of the used Tx buffers
  until more than half the ring was used (I did not test this under a
  stream condition so I don't know if this would have a negative impact). 
  Here are the results
  
  from delaying the freeing of used Tx buffers (average of six runs):
Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
Exits: 142,681.67 Exits/Sec
TxCPU: 2.78%  RxCPU: 99.36%
  
  About a 4% increase over baseline and about 44% of baremetal.
 
 Hmm, I am not sure what you mean by delaying freeing.

In the start_xmit function of virtio_net.c the first thing done is to free any 
used entries from the ring.  I patched the code to track the number of used tx 
ring entries and only free the used entries when they are greater than half 
the capacity of the ring (similar to the way the rx ring is re-filled).
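
Roughly, the idea looks like this (a sketch of what is described above, not
the actual patch; tx_outstanding and tx_ring_size are illustrative fields,
not part of the real struct virtnet_info):

static unsigned int free_old_xmit_skbs_lazy(struct virtnet_info *vi)
{
	struct sk_buff *skb;
	unsigned int len, freed = 0;

	/* Leave completed entries on the ring until more than half of it
	 * is consumed, then reap them all in one pass. */
	if (vi->tx_outstanding <= vi->tx_ring_size / 2)
		return 0;

	while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
		vi->dev->stats.tx_bytes += skb->len;
		vi->dev->stats.tx_packets++;
		freed += skb_vnet_hdr(skb)->num_sg;
		vi->tx_outstanding--;
		dev_kfree_skb_any(skb);
	}
	return freed;
}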

 I think we do have a problem that free_old_xmit_skbs
 tries to flush out the ring aggressively:
 it always polls until the ring is empty,
 so there could be bursts of activity where
 we spend a lot of time flushing the old entries
 before e.g. sending an ack, resulting in
 latency bursts.
 
 Generally we'll need some smarter logic,
 but with indirect at the moment we can just poll
 a single packet after we post a new one, and be done with it.
 Is your patch something like the patch below?
 Could you try mine as well please?

Yes, I'll try the patch and post the results.

 
  This spread out the kick_notify but still resulted in alot of them.  I
  decided to build on the delayed Tx buffer freeing and code up an
  ethtool like coalescing patch in order to delay the kick_notify until
  there were at least 5 packets on the ring or 2000 usecs, whichever
  occurred first.  Here are the
  
  results of delaying the kick_notify (average of six runs):
Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
Exits: 102,587.28 Exits/Sec
TxCPU: 3.03%  RxCPU: 99.33%
  
  About a 23% increase over baseline and about 52% of baremetal.
  
  Running the perf command against the guest I noticed almost 19% of the
  time being spent in _raw_spin_lock.  Enabling lockstat in the guest
  showed alot of contention in the irq_desc_lock_class. Pinning the
  virtio1-input interrupt to a single cpu in the guest and re-running the
  last test resulted in
  
  tremendous gains (average of six runs):
Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkgs/Sec
Exits: 62,603.37 Exits/Sec
TxCPU: 3.73%  RxCPU: 98.52%
  
  About a 77% increase over baseline and about 74% of baremetal.
  
  Vhost is receiving a lot of notifications for packets that are to be
  transmitted (over 60% of the packets generate a kick_notify).  Also, it
  looks like vhost is sending a lot of notifications for packets it has
  received before the guest can get scheduled to disable notifications and
  begin processing the packets
 
 Hmm, is this really what happens to you?  The effect would be that guest
 gets an interrupt while notifications are disabled in guest, 

Re: Network performance with small packets - continued

2011-03-09 Thread Michael S. Tsirkin
On Wed, Mar 09, 2011 at 07:45:43AM -0800, Shirley Ma wrote:
 On Wed, 2011-03-09 at 09:15 +0200, Michael S. Tsirkin wrote:
  diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
  index 82dba5a..ebe3337 100644
  --- a/drivers/net/virtio_net.c
  +++ b/drivers/net/virtio_net.c
  @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct
  virtnet_info *vi)
  struct sk_buff *skb;
  unsigned int len, tot_sgs = 0;
  
  -   while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
  +   if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
  pr_debug("Sent skb %p\n", skb);
  vi->dev->stats.tx_bytes += skb->len;
  vi->dev->stats.tx_packets++;
  -   tot_sgs += skb_vnet_hdr(skb)->num_sg;
  +   tot_sgs = 2+MAX_SKB_FRAGS;
  dev_kfree_skb_any(skb);
  }
  return tot_sgs;
 
 Return value should be different based on indirect or direct buffers
 here?

Something like that. Or we can assume no indirect, worst-case.
But just for testing, I think it should work as an estimation.
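
Something along these lines, perhaps (an illustrative sketch only; whether a
given skb actually went out as an indirect chain depends on the queueing path,
so this just keys off the feature bit):

/* Rough estimate of how many descriptor slots a completed skb gave back:
 * an indirect chain occupies a single slot, a direct chain one slot per
 * scatter-gather entry.  Hypothetical helper, not part of the driver. */
static unsigned int slots_freed(struct virtnet_info *vi, struct sk_buff *skb)
{
	if (virtio_has_feature(vi->vdev, VIRTIO_RING_F_INDIRECT_DESC))
		return 1;
	return skb_vnet_hdr(skb)->num_sg;
}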

  @@ -576,9 +576,6 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
  struct net_device *dev)
  struct virtnet_info *vi = netdev_priv(dev);
  int capacity;
  
  -   /* Free up any pending old buffers before queueing new ones.
  */
  -   free_old_xmit_skbs(vi);
  -
  /* Try to transmit */
  capacity = xmit_skb(vi, skb);
  
  @@ -605,6 +602,10 @@ static netdev_tx_t start_xmit(struct sk_buff
  *skb, struct net_device *dev)
  skb_orphan(skb);
  nf_reset(skb);
  
  +   /* Free up any old buffers so we can queue new ones. */
  +   if (capacity < 2+MAX_SKB_FRAGS)
  +   capacity += free_old_xmit_skbs(vi);
  +
  /* Apparently nice girls don't return TX_BUSY; stop the queue
   * before it gets out of hand.  Naturally, this wastes
  entries. */
  if (capacity < 2+MAX_SKB_FRAGS) { 
 
 I tried a similar patch before, it didn't help much on TCP stream
 performance. But I didn't try multiple stream TCP_RR.
 
 Shirley

There's a bug in myh patch by the way. Pls try the following
instead (still untested).

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 82dba5a..4477b9a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct 
virtnet_info *vi)
struct sk_buff *skb;
unsigned int len, tot_sgs = 0;
 
-   while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
+   if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
pr_debug("Sent skb %p\n", skb);
vi->dev->stats.tx_bytes += skb->len;
vi->dev->stats.tx_packets++;
-   tot_sgs += skb_vnet_hdr(skb)->num_sg;
+   tot_sgs = 2+MAX_SKB_FRAGS;
dev_kfree_skb_any(skb);
}
return tot_sgs;
@@ -576,7 +576,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct 
net_device *dev)
struct virtnet_info *vi = netdev_priv(dev);
int capacity;
 
-   /* Free up any pending old buffers before queueing new ones. */
+   /* Free up any old buffers so we can queue new ones. */
free_old_xmit_skbs(vi);
 
/* Try to transmit */
@@ -605,6 +605,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct 
net_device *dev)
skb_orphan(skb);
nf_reset(skb);
 
+   /* Free up any old buffers so we can queue new ones. */
+   if (capacity < 2+MAX_SKB_FRAGS)
+   capacity += free_old_xmit_skbs(vi);
+
/* Apparently nice girls don't return TX_BUSY; stop the queue
 * before it gets out of hand.  Naturally, this wastes entries. */
if (capacity < 2+MAX_SKB_FRAGS) {


Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
On Wednesday, March 09, 2011 01:17:44 am Michael S. Tsirkin wrote:
 On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
  I used the uperf tool to do this after verifying the results against
  netperf. Uperf allows the specification of the number of connections as
  a parameter in an XML file as opposed to launching, in this case, 100
  separate instances of netperf.
 
 Could you post the XML on the list please?

Environment variables are used to specify some of the values:
  uperf_instances=100
  uperf_dest=192.168.100.28
  uperf_duration=300
  uperf_tx_msgsize=256
  uperf_rx_msgsize=256

You can also change from threads to processes by specifying nprocs instead of 
nthreads in the group element.  I found this out later, so all of my runs are 
using threads.  Using processes will give you some improved performance, but I 
need to be consistent with my runs and stay with threads for now.

<?xml version="1.0"?>
<profile name="TCP_RR">
  <group nthreads="$uperf_instances">
    <transaction iterations="1">
      <flowop type="connect" options="remotehost=$uperf_dest protocol=tcp"/>
    </transaction>
    <transaction duration="$uperf_duration">
      <flowop type="write" options="size=$uperf_tx_msgsize"/>
      <flowop type="read"  options="size=$uperf_rx_msgsize"/>
    </transaction>
    <transaction iterations="1">
      <flowop type="disconnect"/>
    </transaction>
  </group>
</profile>


Re: Network performance with small packets - continued

2011-03-09 Thread Shirley Ma
On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote:
  
   This spread out the kick_notify but still resulted in a lot of them.  I
   decided to build on the delayed Tx buffer freeing and code up an
   ethtool like coalescing patch in order to delay the kick_notify until
   there were at least 5 packets on the ring or 2000 usecs, whichever
   occurred first.  Here are the results of delaying the kick_notify
   (average of six runs):
     Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
     Exits: 102,587.28 Exits/Sec
     TxCPU: 3.03%  RxCPU: 99.33%
   
   About a 23% increase over baseline and about 52% of baremetal.
   
   Running the perf command against the guest I noticed almost 19% of the
   time being spent in _raw_spin_lock.  Enabling lockstat in the guest
   showed a lot of contention in the irq_desc_lock_class.  Pinning the
   virtio1-input interrupt to a single cpu in the guest and re-running the
   last test resulted in tremendous gains (average of six runs):
     Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
     Exits: 62,603.37 Exits/Sec
     TxCPU: 3.73%  RxCPU: 98.52%
   
   About a 77% increase over baseline and about 74% of baremetal.
   
   Vhost is receiving a lot of notifications for packets that are to be
   transmitted (over 60% of the packets generate a kick_notify).  Also, it
   looks like vhost is sending a lot of notifications for packets it has
   received before the guest can get scheduled to disable notifications and
   begin processing the packets
  
  Hmm, is this really what happens to you?  The effect would be that guest
  gets an interrupt while notifications are disabled in guest, right?  Could
  you add a counter and check this please?
 
 The disabling of the interrupt/notifications is done by the guest.  So the
 guest has to get scheduled and handle the notification before it disables
 them.  The vhost_signal routine will keep injecting an interrupt until this
 happens, causing the contention in the guest.  I'll try the patches you
 specify below and post the results.  They look like they should take care of
 this issue.

In the guest TX path, the guest interrupt should be disabled from the start:
since start_xmit calls free_old_xmit_skbs, it's not necessary to receive any
send-completion interrupts in order to free old skbs.  The interrupt is only
enabled when the netif queue becomes full.  For the multiple-stream TCP_RR
test we never hit the netif-queue-full situation, so the send-completion
interrupt rate in cat /proc/interrupts should be 0, right?
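
For reference, the pattern being described is roughly the tail of start_xmit
(a simplified sketch based on the driver of that era, not the exact code):

	/* Sketch (simplified): TX completion interrupts stay disabled and old
	 * buffers are reaped opportunistically on the next transmit.  Only
	 * when the ring is nearly full do we stop the queue and re-enable
	 * the callback so an interrupt can wake us up again. */
	if (capacity < 2 + MAX_SKB_FRAGS) {
		netif_stop_queue(dev);
		if (unlikely(!virtqueue_enable_cb(vi->svq))) {
			/* more completions arrived meanwhile: reap and go on */
			capacity += free_old_xmit_skbs(vi);
			if (capacity >= 2 + MAX_SKB_FRAGS) {
				netif_start_queue(dev);
				virtqueue_disable_cb(vi->svq);
			}
		}
	}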

Shirley



Re: Network performance with small packets - continued

2011-03-09 Thread Shirley Ma
On Wed, 2011-03-09 at 18:10 +0200, Michael S. Tsirkin wrote:
 
 diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
 index 82dba5a..4477b9a 100644
 --- a/drivers/net/virtio_net.c
 +++ b/drivers/net/virtio_net.c
 @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct
 virtnet_info *vi)
 struct sk_buff *skb;
 unsigned int len, tot_sgs = 0;
 
 -   while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 +   if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
 pr_debug("Sent skb %p\n", skb);
 vi->dev->stats.tx_bytes += skb->len;
 vi->dev->stats.tx_packets++;
 -   tot_sgs += skb_vnet_hdr(skb)->num_sg;
 +   tot_sgs = 2+MAX_SKB_FRAGS;
 dev_kfree_skb_any(skb);
 }
 return tot_sgs;
 @@ -576,7 +576,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
 struct net_device *dev)
 struct virtnet_info *vi = netdev_priv(dev);
 int capacity;
 
 -   /* Free up any pending old buffers before queueing new ones.
 */
 +   /* Free up any old buffers so we can queue new ones. */
 free_old_xmit_skbs(vi);
 
 /* Try to transmit */
 @@ -605,6 +605,10 @@ static netdev_tx_t start_xmit(struct sk_buff
 *skb, struct net_device *dev)
 skb_orphan(skb);
 nf_reset(skb);
 
 +   /* Free up any old buffers so we can queue new ones. */
 +   if (capacity < 2+MAX_SKB_FRAGS)
 +   capacity += free_old_xmit_skbs(vi);
 +
 /* Apparently nice girls don't return TX_BUSY; stop the queue
  * before it gets out of hand.  Naturally, this wastes
 entries. */
 if (capacity < 2+MAX_SKB_FRAGS) {
 -- 

I tried this one as well.  It might improve TCP_RR performance but not
TCP_STREAM. :) Let's wait for Tom's TCP_RR results.

Thanks
Shirley



Re: Network performance with small packets - continued

2011-03-09 Thread Michael S. Tsirkin
On Wed, Mar 09, 2011 at 10:09:26AM -0600, Tom Lendacky wrote:
 On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
  On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
   We've been doing some more experimenting with the small packet network
   performance problem in KVM.  I have a different setup than what Steve D.
   was using so I re-baselined things on the kvm.git kernel on both the
   host and guest with a 10GbE adapter.  I also made use of the
   virtio-stats patch.
   
   The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
   adapters (the first connected to a 1GbE adapter and a LAN, the second
   connected to a 10GbE adapter that is direct connected to another system
   with the same 10GbE adapter) running the kvm.git kernel.  The test was a
   TCP_RR test with 100 connections from a baremetal client to the KVM
   guest using a 256 byte message size in both directions.

One thing that might be happening is that we are running out of the
atomic memory pool in the guest, so indirect allocations
start failing, and this is a slow path.
Could you check this please?


   I used the uperf tool to do this after verifying the results against
   netperf. Uperf allows the specification of the number of connections as
   a parameter in an XML file as opposed to launching, in this case, 100
   separate instances of netperf.
   
   Here is the baseline for baremetal using 2 physical CPUs:
 Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
 TxCPU: 7.88%  RxCPU: 99.41%
   
   To be sure to get consistent results with KVM I disabled the
   hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
   ethernet adapter interrupts (this resulted in runs that differed by only
   about 2% from lowest to highest).  The fact that pinning is required to
   get consistent results is a different problem that we'll have to look
   into later...
   
   Here is the KVM baseline (average of six runs):
 Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
 Exits: 148,444.58 Exits/Sec
 TxCPU: 2.40%  RxCPU: 99.35%
   
   About 42% of baremetal.
  
  Can you add interrupt stats as well please?
 
 Yes I can.  Just the guest interrupts for the virtio device?

Guess so: tx and rx.

  
   empty.  So I coded a quick patch to delay freeing of the used Tx buffers
   until more than half the ring was used (I did not test this under a
   stream condition so I don't know if this would have a negative impact). 
   Here are the results
   
   from delaying the freeing of used Tx buffers (average of six runs):
 Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
 Exits: 142,681.67 Exits/Sec
 TxCPU: 2.78%  RxCPU: 99.36%
   
   About a 4% increase over baseline and about 44% of baremetal.
  
  Hmm, I am not sure what you mean by delaying freeing.
 
 In the start_xmit function of virtio_net.c the first thing done is to free 
 any 
 used entries from the ring.  I patched the code to track the number of used 
 tx 
 ring entries and only free the used entries when they are greater than half 
 the capacity of the ring (similar to the way the rx ring is re-filled).

We don't even need that: just MAX_SKB_FRAGS + 2.
Also we don't need to free them all: just enough to make
room for MAX_SKB_FRAGS + 2 entries.
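
i.e. something like (a sketch of that suggestion, not a tested patch):

	/* Sketch: reap only until there's room for one worst-case packet
	 * (2 + MAX_SKB_FRAGS descriptors), instead of draining the ring. */
	while (capacity < 2 + MAX_SKB_FRAGS &&
	       (skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
		vi->dev->stats.tx_bytes += skb->len;
		vi->dev->stats.tx_packets++;
		capacity += skb_vnet_hdr(skb)->num_sg;
		dev_kfree_skb_any(skb);
	}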

  I think we do have a problem that free_old_xmit_skbs
  tries to flush out the ring aggressively:
  it always polls until the ring is empty,
  so there could be bursts of activity where
  we spend a lot of time flushing the old entries
  before e.g. sending an ack, resulting in
  latency bursts.
  
  Generally we'll need some smarter logic,
  but with indirect at the moment we can just poll
  a single packet after we post a new one, and be done with it.
  Is your patch something like the patch below?
  Could you try mine as well please?
 
 Yes, I'll try the patch and post the results.
 
  
   This spread out the kick_notify but still resulted in alot of them.  I
   decided to build on the delayed Tx buffer freeing and code up an
   ethtool like coalescing patch in order to delay the kick_notify until
   there were at least 5 packets on the ring or 2000 usecs, whichever
   occurred first.  Here are the
   
   results of delaying the kick_notify (average of six runs):
 Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
 Exits: 102,587.28 Exits/Sec
 TxCPU: 3.03%  RxCPU: 99.33%
   
   About a 23% increase over baseline and about 52% of baremetal.
   
   Running the perf command against the guest I noticed almost 19% of the
   time being spent in _raw_spin_lock.  Enabling lockstat in the guest
   showed alot of contention in the irq_desc_lock_class. Pinning the
   virtio1-input interrupt to a single cpu in the guest and re-running the
   last test resulted in
   
   tremendous gains (average of six runs):
 Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkgs/Sec
 Exits: 62,603.37 Exits/Sec
 TxCPU: 3.73%  RxCPU: 98.52%
   
   About a 77% increase over 

Re: Network performance with small packets - continued

2011-03-09 Thread Michael S. Tsirkin
On Wed, Mar 09, 2011 at 08:25:34AM -0800, Shirley Ma wrote:
 On Wed, 2011-03-09 at 18:10 +0200, Michael S. Tsirkin wrote:
  
  diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
  index 82dba5a..4477b9a 100644
  --- a/drivers/net/virtio_net.c
  +++ b/drivers/net/virtio_net.c
  @@ -514,11 +514,11 @@ static unsigned int free_old_xmit_skbs(struct
  virtnet_info *vi)
  struct sk_buff *skb;
  unsigned int len, tot_sgs = 0;
  
  -   while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
  +   if ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
  pr_debug("Sent skb %p\n", skb);
  vi->dev->stats.tx_bytes += skb->len;
  vi->dev->stats.tx_packets++;
  -   tot_sgs += skb_vnet_hdr(skb)->num_sg;
  +   tot_sgs = 2+MAX_SKB_FRAGS;
  dev_kfree_skb_any(skb);
  }
  return tot_sgs;
  @@ -576,7 +576,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb,
  struct net_device *dev)
  struct virtnet_info *vi = netdev_priv(dev);
  int capacity;
  
  -   /* Free up any pending old buffers before queueing new ones.
  */
  +   /* Free up any old buffers so we can queue new ones. */
  free_old_xmit_skbs(vi);
  
  /* Try to transmit */
  @@ -605,6 +605,10 @@ static netdev_tx_t start_xmit(struct sk_buff
  *skb, struct net_device *dev)
  skb_orphan(skb);
  nf_reset(skb);
  
  +   /* Free up any old buffers so we can queue new ones. */
  +   if (capacity < 2+MAX_SKB_FRAGS)
  +   capacity += free_old_xmit_skbs(vi);
  +
  /* Apparently nice girls don't return TX_BUSY; stop the queue
   * before it gets out of hand.  Naturally, this wastes
  entries. */
  if (capacity < 2+MAX_SKB_FRAGS) {
  -- 
 
 I tried this one as well. It might improve TCP_RR performance but not
 TCP_STREAM. :) Let's wait for Tom's TCP_RR resutls.
 
 Thanks
 Shirley

I think your issues are with TX overrun.
Besides delaying IRQ on TX, I don't have many ideas.

The one interesting thing is that you see better speed
if you drop packets.  The netdev crowd says this should not happen,
so it could be an indicator of a problem somewhere.


-- 
MST


Re: Network performance with small packets - continued

2011-03-09 Thread Shirley Ma
On Wed, 2011-03-09 at 18:32 +0200, Michael S. Tsirkin wrote:
 I think your issues are with TX overrun.
 Besides delaying IRQ on TX, I don't have many ideas.
 
 The one interesting thing is that you see better speed
 if you drop packets. netdev crowd says this should not happen,
 so could be an indicator of a problem somewhere.

Yes, I am looking at why the guest didn't see the used buffers in time after
vhost sent the TX completion.  I am trying to collect some data on vhost.

I also wonder whether it's a scheduler issue.

Thanks
Shirley



Re: Network performance with small packets - continued

2011-03-09 Thread Shirley Ma
On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote:
   Vhost is receiving a lot of notifications for packets that are to be
   transmitted (over 60% of the packets generate a kick_notify). 

This is guest TX send notification when vhost enables notification.

In TCP_STREAM test, vhost exits from reaching NAPI WEIGHT, it rarely
enables the notification, vhost re-enters handle_tx from NAPI poll, so
guest doesn't do much kick_notify.

In multiple TCP_RR test, seems vhost exits from nothing to send in TX vq
very often, so it enables notification most of the time.

Shirley



Re: Network performance with small packets - continued

2011-03-09 Thread Michael S. Tsirkin
On Wed, Mar 09, 2011 at 08:51:33AM -0800, Shirley Ma wrote:
 On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote:
  Vhost is receiving a lot of notifications for packets that are to be
  transmitted (over 60% of the packets generate a kick_notify). 
 
 This is guest TX send notification when vhost enables notification.
 
 In TCP_STREAM test, vhost exits from reaching NAPI WEIGHT,


You mean virtio?

 it rarely
 enables the notification, vhost re-enters handle_tx from NAPI poll,

Does NAPI really call handle_tx? Not rx?

 so
 guest doesn't do much kick_notify.
 
 In multiple TCP_RR test, seems vhost exits from nothing to send in TX vq
 very often, so it enables notification most of the time.
 
 Shirley


Re: Network performance with small packets - continued

2011-03-09 Thread Shirley Ma
On Wed, 2011-03-09 at 19:16 +0200, Michael S. Tsirkin wrote:
 On Wed, Mar 09, 2011 at 08:51:33AM -0800, Shirley Ma wrote:
  On Wed, 2011-03-09 at 10:09 -0600, Tom Lendacky wrote:
   Vhost is receiving a lot of notifications for packets that are to be
   transmitted (over 60% of the packets generate a kick_notify). 
  
  This is guest TX send notification when vhost enables notification.
  
  In TCP_STREAM test, vhost exits from reaching NAPI WEIGHT,
 
 
 You mean virtio?

Sorry, I messed up NAPI WEIGHT and VHOST NET WEIGHT.

I meant VHOST_NET_WEIGHT: vhost exits handle_tx() from VHOST_NET_WEIGHT
without enabling notification.

 
  it rarely
  enables the notification, vhost re-enters handle_tx from NAPI poll,
 
 Does NAPI really call handle_tx? Not rx? 

I meant that for TX/RX, vhost re-enters handle_tx from vhost_poll_queue(), not
from kick_notify.

Shirley



Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
Here are the results again with the addition of the interrupt rate that 
occurred on the guest virtio_net device:

Here is the KVM baseline (average of six runs):
  Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
  Exits: 148,444.58 Exits/Sec
  TxCPU: 2.40%  RxCPU: 99.35%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 5,154/5,222
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About 42% of baremetal.

Delayed freeing of TX buffers (average of six runs):
  Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
  Exits: 142,681.67 Exits/Sec
  TxCPU: 2.78%  RxCPU: 99.36%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,796/4,908
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 4% increase over baseline and about 44% of baremetal.

Delaying kick_notify (kick every 5 packets -average of six runs):
  Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
  Exits: 102,587.28 Exits/Sec
  TxCPU: 3.03%  RxCPU: 99.33%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,200/4,293
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 23% increase over baseline and about 52% of baremetal.

Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs):
  Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
  Exits: 62,603.37 Exits/Sec
  TxCPU: 3.73%  RxCPU: 98.52%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 11,564/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0

About a 77% increase over baseline and about 74% of baremetal.
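
For reference, the kick coalescing in the last two runs is conceptually
something like this (an illustrative sketch only, not the actual patch; the
fields and thresholds are made up to match the "at least 5 packets or 2000
usecs" description):

static void maybe_kick(struct virtnet_info *vi)
{
	/* pkts_since_kick and last_kick are illustrative fields, not part
	 * of the real struct virtnet_info. */
	vi->pkts_since_kick++;
	if (vi->pkts_since_kick >= 5 ||
	    time_after(jiffies, vi->last_kick + usecs_to_jiffies(2000))) {
		virtqueue_kick(vi->svq);
		vi->pkts_since_kick = 0;
		vi->last_kick = jiffies;
	}
}

A real version would also need a timer so that a lone packet still gets
kicked once the timeout expires.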


On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
 On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
  We've been doing some more experimenting with the small packet network
  performance problem in KVM.  I have a different setup than what Steve D.
  was using so I re-baselined things on the kvm.git kernel on both the
  host and guest with a 10GbE adapter.  I also made use of the
  virtio-stats patch.
  
  The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
  adapters (the first connected to a 1GbE adapter and a LAN, the second
  connected to a 10GbE adapter that is direct connected to another system
  with the same 10GbE adapter) running the kvm.git kernel.  The test was a
  TCP_RR test with 100 connections from a baremetal client to the KVM
  guest using a 256 byte message size in both directions.
  
  I used the uperf tool to do this after verifying the results against
  netperf. Uperf allows the specification of the number of connections as
  a parameter in an XML file as opposed to launching, in this case, 100
  separate instances of netperf.
  
  Here is the baseline for baremetal using 2 physical CPUs:
Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
TxCPU: 7.88%  RxCPU: 99.41%
  
  To be sure to get consistent results with KVM I disabled the
  hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
  ethernet adapter interrupts (this resulted in runs that differed by only
  about 2% from lowest to highest).  The fact that pinning is required to
  get consistent results is a different problem that we'll have to look
  into later...
  
  Here is the KVM baseline (average of six runs):
Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
Exits: 148,444.58 Exits/Sec
TxCPU: 2.40%  RxCPU: 99.35%
  
  About 42% of baremetal.
 
 Can you add interrupt stats as well please?
 
  empty.  So I coded a quick patch to delay freeing of the used Tx buffers
  until more than half the ring was used (I did not test this under a
  stream condition so I don't know if this would have a negative impact). 
  Here are the results
  
  from delaying the freeing of used Tx buffers (average of six runs):
Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
Exits: 142,681.67 Exits/Sec
TxCPU: 2.78%  RxCPU: 99.36%
  
  About a 4% increase over baseline and about 44% of baremetal.
 
 Hmm, I am not sure what you mean by delaying freeing.
 I think we do have a problem that free_old_xmit_skbs
 tries to flush out the ring aggressively:
 it always polls until the ring is empty,
 so there could be bursts of activity where
 we spend a lot of time flushing the old entries
 before e.g. sending an ack, resulting in
 latency bursts.
 
 Generally we'll need some smarter logic,
 but with indirect at the moment we can just poll
 a single packet after we post a new one, and be done with it.
 Is your patch something like the patch below?
 Could you try mine as well please?
 
  This spread out the kick_notify but still resulted in alot of them.  I
  decided to build on the delayed Tx buffer freeing and code up an
  ethtool like coalescing patch in order to delay the kick_notify until
  there were at least 5 packets on the ring or 2000 usecs, whichever
  occurred first.  Here are the
  
  results of delaying the kick_notify (average of six runs):
Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
Exits: 102,587.28 Exits/Sec
TxCPU: 3.03%  RxCPU: 99.33%
  
  About a 23% increase over 

Re: Network performance with small packets - continued

2011-03-09 Thread Michael S. Tsirkin
On Wed, Mar 09, 2011 at 02:11:07PM -0600, Tom Lendacky wrote:
 Here are the results again with the addition of the interrupt rate that 
 occurred on the guest virtio_net device:
 
 Here is the KVM baseline (average of six runs):
   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
   Exits: 148,444.58 Exits/Sec
   TxCPU: 2.40%  RxCPU: 99.35%
   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 5,154/5,222
   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
 
 About 42% of baremetal.
 
 Delayed freeing of TX buffers (average of six runs):
   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
   Exits: 142,681.67 Exits/Sec
   TxCPU: 2.78%  RxCPU: 99.36%
   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,796/4,908
   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
 
 About a 4% increase over baseline and about 44% of baremetal.

Looks like delayed freeing is a good idea generally.
Is this my patch? Yours?



 Delaying kick_notify (kick every 5 packets -average of six runs):
   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
   Exits: 102,587.28 Exits/Sec
   TxCPU: 3.03%  RxCPU: 99.33%
   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,200/4,293
   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
 
 About a 23% increase over baseline and about 52% of baremetal.
 
 Delaying kick_notify and pinning virtio1-input to CPU0 (average of six runs):

What exactly moves the interrupt handler between CPUs?
irqbalancer?  Does it matter which CPU you pin it to?
If yes, do you have any idea why?

Also, what happens without delaying kick_notify
but with pinning?

   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkgs/Sec
   Exits: 62,603.37 Exits/Sec
   TxCPU: 3.73%  RxCPU: 98.52%
   Virtio1-input  Interrupts/Sec (CPU0/CPU1): 11,564/0
   Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
 
 About a 77% increase over baseline and about 74% of baremetal.

Hmm we get about 20 packets per interrupt on average.
That's pretty decent. The problem is with exits.
Let's try something adaptive in the host?
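
Conceptually, something like this on the vhost side (purely an illustrative
sketch; pending_used and signal_thresh are invented fields, and the real
vhost code paths differ):

/* Illustrative only: batch used-buffer signaling in the host and let the
 * threshold adapt to the observed rate. */
static void add_used_coalesced(struct vhost_virtqueue *vq,
			       unsigned int head, int len)
{
	vhost_add_used(vq, head, len);
	if (++vq->pending_used >= vq->signal_thresh) {
		vhost_signal(vq->dev, vq);
		vq->pending_used = 0;
		/* adapt: raise the threshold under load, decay it when idle */
	}
}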

-- 
MST


Re: Network performance with small packets - continued

2011-03-09 Thread Shirley Ma
Hello Tom,

Do you also have Rusty's virtio stat patch results for both send queue
and recv queue to share here?

Thanks
Shirley



Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
On Wednesday, March 09, 2011 10:09:26 am Tom Lendacky wrote:
 On Wednesday, March 09, 2011 01:15:58 am Michael S. Tsirkin wrote:
  On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
   We've been doing some more experimenting with the small packet network
   performance problem in KVM.  I have a different setup than what Steve
   D. was using so I re-baselined things on the kvm.git kernel on both
   the host and guest with a 10GbE adapter.  I also made use of the
   virtio-stats patch.
   
   The virtual machine has 2 vCPUs, 8GB of memory and two virtio network
   adapters (the first connected to a 1GbE adapter and a LAN, the second
   connected to a 10GbE adapter that is direct connected to another system
   with the same 10GbE adapter) running the kvm.git kernel.  The test was
   a TCP_RR test with 100 connections from a baremetal client to the KVM
   guest using a 256 byte message size in both directions.
   
   I used the uperf tool to do this after verifying the results against
   netperf. Uperf allows the specification of the number of connections as
   a parameter in an XML file as opposed to launching, in this case, 100
   separate instances of netperf.
   
   Here is the baseline for baremetal using 2 physical CPUs:
 Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
 TxCPU: 7.88%  RxCPU: 99.41%
   
   To be sure to get consistent results with KVM I disabled the
   hyperthreads, pinned the qemu-kvm process, vCPUs, vhost thread and
   ethernet adapter interrupts (this resulted in runs that differed by
   only about 2% from lowest to highest).  The fact that pinning is
   required to get consistent results is a different problem that we'll
   have to look into later...
   
   Here is the KVM baseline (average of six runs):
 Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
 Exits: 148,444.58 Exits/Sec
 TxCPU: 2.40%  RxCPU: 99.35%
   
   About 42% of baremetal.
  
  Can you add interrupt stats as well please?
 
 Yes I can.  Just the guest interrupts for the virtio device?
 
   empty.  So I coded a quick patch to delay freeing of the used Tx
   buffers until more than half the ring was used (I did not test this
   under a stream condition so I don't know if this would have a negative
   impact). Here are the results
   
   from delaying the freeing of used Tx buffers (average of six runs):
 Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
 Exits: 142,681.67 Exits/Sec
 TxCPU: 2.78%  RxCPU: 99.36%
   
   About a 4% increase over baseline and about 44% of baremetal.
  
  Hmm, I am not sure what you mean by delaying freeing.
 
 In the start_xmit function of virtio_net.c the first thing done is to free
 any used entries from the ring.  I patched the code to track the number of
 used tx ring entries and only free the used entries when they are greater
 than half the capacity of the ring (similar to the way the rx ring is
 re-filled).
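A minimal sketch of that approach (illustrative only; the posted patch is not in this archive, and the tx_used counter and tx_ring_size field are made-up names):

static void maybe_free_old_xmit_skbs(struct virtnet_info *vi)
{
	/* Skip the reclaim until more than half of the TX ring is in use,
	 * instead of polling the used ring at the top of every start_xmit(). */
	if (vi->tx_used <= vi->tx_ring_size / 2)
		return;

	free_old_xmit_skbs(vi);		/* existing helper: drains all used entries */
	vi->tx_used = 0;
}

The counter would be incremented in start_xmit() for every queued skb and reset here once the used entries are reclaimed.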
 
  I think we do have a problem that free_old_xmit_skbs
  tries to flush out the ring aggressively:
  it always polls until the ring is empty,
  so there could be bursts of activity where
  we spend a lot of time flushing the old entries
  before e.g. sending an ack, resulting in
  latency bursts.
  
  Generally we'll need some smarter logic,
  but with indirect at the moment we can just poll
  a single packet after we post a new one, and be done with it.
  Is your patch something like the patch below?
  Could you try mine as well please?
 
 Yes, I'll try the patch and post the results.
 
   This spread out the kick_notify but still resulted in a lot of them.  I
   decided to build on the delayed Tx buffer freeing and code up an
   ethtool like coalescing patch in order to delay the kick_notify until
   there were at least 5 packets on the ring or 2000 usecs, whichever
   occurred first.  Here are the
   
   results of delaying the kick_notify (average of six runs):
 Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
 Exits: 102,587.28 Exits/Sec
 TxCPU: 3.03%  RxCPU: 99.33%
   
   About a 23% increase over baseline and about 52% of baremetal.
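A rough sketch of such 5-packets-or-2000-usecs TX kick coalescing (the threshold names, the tx_pending counter and the tx_coal_timer are illustrative, not the actual patch):

#define TX_COAL_FRAMES	5
#define TX_COAL_USECS	2000

static void xmit_maybe_kick(struct virtnet_info *vi)
{
	vi->tx_pending++;
	if (vi->tx_pending >= TX_COAL_FRAMES) {
		virtqueue_kick(vi->svq);	/* enough packets queued: notify the host */
		vi->tx_pending = 0;
		del_timer(&vi->tx_coal_timer);
	} else {
		/* Make sure a lone packet still goes out within ~2000 usecs. */
		mod_timer(&vi->tx_coal_timer,
			  jiffies + usecs_to_jiffies(TX_COAL_USECS));
	}
}

The timer callback would simply perform the kick and clear the counter.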
   
   Running the perf command against the guest I noticed almost 19% of the
   time being spent in _raw_spin_lock.  Enabling lockstat in the guest
   showed a lot of contention in the irq_desc_lock_class. Pinning the
   virtio1-input interrupt to a single cpu in the guest and re-running the
   last test resulted in
   
   tremendous gains (average of six runs):
 Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
 Exits: 62,603.37 Exits/Sec
 TxCPU: 3.73%  RxCPU: 98.52%
   
   About a 77% increase over baseline and about 74% of baremetal.
   
   Vhost is receiving a lot of notifications for packets that are to be
   transmitted (over 60% of the packets generate a kick_notify).  Also, it
   looks like vhost is sending a lot of notifications for packets it has
   received before the guest can get scheduled to disable notifications
 and begin processing the packets

Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
On Wednesday, March 09, 2011 04:45:12 pm Shirley Ma wrote:
 Hello Tom,
 
 Do you also have Rusty's virtio stat patch results for both send queue
 and recv queue to share here?

Let me see what I can do about getting the data extracted, averaged and in a 
form that I can put in an email.

 
 Thanks
 Shirley
 


Re: Network performance with small packets - continued

2011-03-09 Thread Tom Lendacky
On Wednesday, March 09, 2011 03:56:15 pm Michael S. Tsirkin wrote:
 On Wed, Mar 09, 2011 at 02:11:07PM -0600, Tom Lendacky wrote:
  Here are the results again with the addition of the interrupt rate that
  occurred on the guest virtio_net device:
  
  Here is the KVM baseline (average of six runs):
Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
Exits: 148,444.58 Exits/Sec
TxCPU: 2.40%  RxCPU: 99.35%
Virtio1-input  Interrupts/Sec (CPU0/CPU1): 5,154/5,222
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
  
  About 42% of baremetal.
  
  Delayed freeing of TX buffers (average of six runs):
Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
Exits: 142,681.67 Exits/Sec
TxCPU: 2.78%  RxCPU: 99.36%
Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,796/4,908
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
  
  About a 4% increase over baseline and about 44% of baremetal.
 
 Looks like delayed freeing is a good idea generally.
 Is this my patch? Yours?

These results are for my patch, I haven't had a chance to run your patch yet.

 
  Delaying kick_notify (kick every 5 packets -average of six runs):
Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
Exits: 102,587.28 Exits/Sec
TxCPU: 3.03%  RxCPU: 99.33%
Virtio1-input  Interrupts/Sec (CPU0/CPU1): 4,200/4,293
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
  
  About a 23% increase over baseline and about 52% of baremetal.
 
  Delaying kick_notify and pinning virtio1-input to CPU0 (average of six 
runs):
 What exactly moves the interrupt handler between CPUs?
 irqbalancer?  Does it matter which CPU you pin it to?
 If yes, do you have any idea why?

Looking at the guest, irqbalance isn't running and the smp_affinity for the 
irq is set to 3 (both CPUs).  It could be that irqbalance would help in this 
situation since it would probably change the smp_affinity mask to a single CPU 
and remove the irq lock contention (I think the last used index patch would be 
best though since it will avoid the extra irq injections).  I'll kick off a 
run with irqbalance running.

As for which CPU the interrupt gets pinned to, that doesn't matter - see 
below.

 
 Also, what happens without delaying kick_notify
 but with pinning?

Here are the results of a single baseline run with the IRQ pinned to CPU0:

  Txn Rate: 108,212.12 Txn/Sec, Pkt Rate: 214,994 Pkts/Sec
  Exits: 119,310.21 Exits/Sec
  TxCPU: 9.63%  RxCPU: 99.47%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 
  Virtio1-output Interrupts/Sec (CPU0/CPU1):

and CPU1:

  Txn Rate: 108,053.02 Txn/Sec, Pkt Rate: 214,678 Pkts/Sec
  Exits: 119,320.12 Exits/Sec
  TxCPU: 9.64%  RxCPU: 99.42%
  Virtio1-input  Interrupts/Sec (CPU0/CPU1): 13,608/0
  Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/13,830

About a 24% increase over baseline.

 
Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
Exits: 62,603.37 Exits/Sec
TxCPU: 3.73%  RxCPU: 98.52%
Virtio1-input  Interrupts/Sec (CPU0/CPU1): 11,564/0
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
  
  About a 77% increase over baseline and about 74% of baremetal.
 
 Hmm we get about 20 packets per interrupt on average.
 That's pretty decent. The problem is with exits.
 Let's try something adaptive in the host?


Re: Network performance with small packets - continued

2011-03-09 Thread Shirley Ma
On Wed, 2011-03-09 at 23:56 +0200, Michael S. Tsirkin wrote:
Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
Exits: 62,603.37 Exits/Sec
TxCPU: 3.73%  RxCPU: 98.52%
Virtio1-input  Interrupts/Sec (CPU0/CPU1): 11,564/0
Virtio1-output Interrupts/Sec (CPU0/CPU1): 0/0
  
  About a 77% increase over baseline and about 74% of baremetal.
 
 Hmm we get about 20 packets per interrupt on average.
 That's pretty decent. The problem is with exits.
 Let's try something adaptive in the host? 

I did some hacks before: for 32-64 multiple-stream TCP_RR cases, either
queueing multiple skbs per kick or delaying the vhost exit from handle_tx
improved TCP_RR aggregate performance, but single TCP_RR latency
increased.

Here, the test is about 100 TCP_RR streams from a bare metal client to a
KVM guest, so the kick_notify rate from the guest RX path should be small
(it only kicks every half ring size, and even under that kick vhost might
have already disabled the notification).

The kick_notify rate from the guest TX path seems to be the main cause of
the huge number of guest exits (it does a kick for every sent skb, and
under that kick vhost most likely exits on an empty ring without reaching
VHOST_NET_WEIGHT). The indirect buffer is used, so I wonder how many
packets are processed per handle_tx here?

In theory, for lots of TCP_RR streams, the guest should be able to keep
sending xmit skbs to the send vq, so vhost should be able to disable
notification most of the time and the number of guest exits should be
significantly reduced. Why do we still see lots of guest exits here? Is it
worth trying 256 (the send queue size) TCP_RR streams?

Tom's kick_notify data from Rusty's patch would be helpful to understand
what's going on here.

Thanks
Shirley





Re: Network performance with small packets - continued

2011-03-09 Thread Rick Jones
On Wed, 2011-03-09 at 16:59 -0800, Shirley Ma wrote:
 In theory, for lots of TCP_RR streams, the guest should be able to keep
 sending xmit skbs to send vq, so vhost should be able to disable
 notification most of the time, then number of guest exits should be
 significantly reduced? Why we saw lots of guest exits here still? Is it
 worth to try 256 (send queue size) TCP_RRs?

If these are single-transaction-at-a-time TCP_RRs rather than burst
mode then the number may be something other than send queue size to
keep it constantly active given the RTTs.  In the bare iron world at
least, that is one of the reasons I added the burst mode to the _RR
test - because it could take a Very Large Number of concurrent netperfs
to take a link to saturation, at which point it might have been just as
much a context switching benchmark as anything else :)

happy benchmarking,

rick jones



Re: Network performance with small packets - continued

2011-03-09 Thread Michael S. Tsirkin
On Wed, Mar 09, 2011 at 05:25:11PM -0600, Tom Lendacky wrote:
 As for which CPU the interrupt gets pinned to, that doesn't matter - see 
 below.

So what hurts us the most is that the IRQ jumps between the VCPUs?


Re: Network performance with small packets

2011-03-08 Thread Shirley Ma
On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote:
 I've finally read this thread... I think we need to get more serious
 with our stats gathering to diagnose these kind of performance issues.
 
 This is a start; it should tell us what is actually happening to the
 virtio ring(s) without significant performance impact... 

Should we also add similar stat on vhost vq as well for monitoring
vhost_signal & vhost_notify?

Shirley



Re: Network performance with small packets

2011-03-08 Thread Andrew Theurer
On Tue, 2011-03-08 at 13:57 -0800, Shirley Ma wrote:
 On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote:
  I've finally read this thread... I think we need to get more serious
  with our stats gathering to diagnose these kind of performance issues.
  
  This is a start; it should tell us what is actually happening to the
  virtio ring(s) without significant performance impact... 
 
 Should we also add similar stat on vhost vq as well for monitoring
 vhost_signal & vhost_notify?

Tom L has started using Rusty's patches and found some interesting
results, sent yesterday:
http://marc.info/?l=kvm&m=129953710930124&w=2


-Andrew
 
 Shirley
 


Re: Network performance with small packets - continued

2011-03-08 Thread Chigurupati, Chaks
Hi Tom,

My two cents. Please look for [Chaks]

snip
Comparing the transmit path to the receive path, the guest disables
notifications after the first kick and vhost re-enables notifications
after
completing processing of the tx ring.  Can a similar thing be done for
the
receive path?  Once vhost sends the first notification for a received
packet
it can disable notifications and let the guest re-enable notifications
when it
has finished processing the receive ring.  Also, can the virtio-net
driver do
some adaptive polling (or does napi take care of that for the guest)?

[Chaks] A better method is to have the producer generate the kick
notifications only when the queue/ring transitions from empty to
non-empty state. The consumer is not burdened with the task of reenabling
the notifications. This of course assumes that notifications will never
get lost. If loss of notifications is a possibility, producer can keep
generating the notifications till guest signals (via some atomically
manipulated memory variable) that it started consuming. The next
notification will go out when the ring/queue again transitions from empty
to non-empty state.
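A standalone sketch of that scheme (illustrative only, not virtio code):

#include <stdbool.h>

struct simple_ring {
	unsigned int head, tail, size;		/* head == tail means empty */
};

static bool ring_empty(const struct simple_ring *r)
{
	return r->head == r->tail;
}

/* Producer: kick only on the empty -> non-empty transition, so a consumer
 * that is already running never sees redundant notifications. */
static void producer_add(struct simple_ring *r, void (*kick)(void))
{
	bool was_empty = ring_empty(r);

	r->head = (r->head + 1) % r->size;	/* publish the new entry */
	if (was_empty)
		kick();
}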

Chaks





Re: Network performance with small packets - continued

2011-03-08 Thread Michael S. Tsirkin
On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
 We've been doing some more experimenting with the small packet network 
 performance problem in KVM.  I have a different setup than what Steve D. was 
 using so I re-baselined things on the kvm.git kernel on both the host and 
 guest with a 10GbE adapter.  I also made use of the virtio-stats patch.
 
 The virtual machine has 2 vCPUs, 8GB of memory and two virtio network 
 adapters 
 (the first connected to a 1GbE adapter and a LAN, the second connected to a 
 10GbE adapter that is direct connected to another system with the same 10GbE 
 adapter) running the kvm.git kernel.  The test was a TCP_RR test with 100 
 connections from a baremetal client to the KVM guest using a 256 byte message 
 size in both directions.
 
 I used the uperf tool to do this after verifying the results against netperf. 
  
 Uperf allows the specification of the number of connections as a parameter in 
 an XML file as opposed to launching, in this case, 100 separate instances of 
 netperf.
 
 Here is the baseline for baremetal using 2 physical CPUs:
   Txn Rate: 206,389.59 Txn/Sec, Pkt Rate: 410,048 Pkts/Sec
   TxCPU: 7.88%  RxCPU: 99.41%
 
 To be sure to get consistent results with KVM I disabled the hyperthreads, 
 pinned the qemu-kvm process, vCPUs, vhost thread and ethernet adapter 
 interrupts (this resulted in runs that differed by only about 2% from lowest 
 to highest).  The fact that pinning is required to get consistent results is 
 a 
 different problem that we'll have to look into later...
 
 Here is the KVM baseline (average of six runs):
   Txn Rate: 87,070.34 Txn/Sec, Pkt Rate: 172,992 Pkts/Sec
   Exits: 148,444.58 Exits/Sec
   TxCPU: 2.40%  RxCPU: 99.35%
 About 42% of baremetal.
 

Can you add interrupt stats as well please?

 empty.  So I coded a quick patch to delay freeing of the used Tx buffers 
 until 
 more than half the ring was used (I did not test this under a stream 
 condition 
 so I don't know if this would have a negative impact).  Here are the results 
 from delaying the freeing of used Tx buffers (average of six runs):
   Txn Rate: 90,886.19 Txn/Sec, Pkt Rate: 180,571 Pkts/Sec
   Exits: 142,681.67 Exits/Sec
   TxCPU: 2.78%  RxCPU: 99.36%
 About a 4% increase over baseline and about 44% of baremetal.

Hmm, I am not sure what you mean by delaying freeing.
I think we do have a problem that free_old_xmit_skbs
tries to flush out the ring aggressively:
it always polls until the ring is empty,
so there could be bursts of activity where
we spend a lot of time flushing the old entries
before e.g. sending an ack, resulting in
latency bursts.

Generally we'll need some smarter logic,
but with indirect at the moment we can just poll
a single packet after we post a new one, and be done with it.
Is your patch something like the patch below?
Could you try mine as well please?
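(The patch referred to above did not survive in this archive. A sketch in that spirit, reclaiming at most one used buffer per newly posted packet, might look like this; the helper name is made up:)

static void free_one_old_xmit_skb(struct virtnet_info *vi)
{
	struct sk_buff *skb;
	unsigned int len;

	/* Poll a single used entry instead of draining the whole ring. */
	skb = virtqueue_get_buf(vi->svq, &len);
	if (!skb)
		return;

	vi->dev->stats.tx_bytes += skb->len;
	vi->dev->stats.tx_packets++;
	dev_kfree_skb_any(skb);
}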


 This spread out the kick_notify but still resulted in a lot of them.  I 
 decided 
 to build on the delayed Tx buffer freeing and code up an ethtool like 
 coalescing patch in order to delay the kick_notify until there were at least 
 5 
 packets on the ring or 2000 usecs, whichever occurred first.  Here are the 
 results of delaying the kick_notify (average of six runs):
   Txn Rate: 107,106.36 Txn/Sec, Pkt Rate: 212,796 Pkts/Sec
   Exits: 102,587.28 Exits/Sec
   TxCPU: 3.03%  RxCPU: 99.33%
 About a 23% increase over baseline and about 52% of baremetal.
 
 Running the perf command against the guest I noticed almost 19% of the time 
 being spent in _raw_spin_lock.  Enabling lockstat in the guest showed a lot of 
 contention in the irq_desc_lock_class. Pinning the virtio1-input interrupt 
 to a single cpu in the guest and re-running the last test resulted in 
 tremendous gains (average of six runs):
   Txn Rate: 153,696.59 Txn/Sec, Pkt Rate: 305,358 Pkts/Sec
   Exits: 62,603.37 Exits/Sec
   TxCPU: 3.73%  RxCPU: 98.52%
 About a 77% increase over baseline and about 74% of baremetal.
 
 Vhost is receiving a lot of notifications for packets that are to be 
 transmitted (over 60% of the packets generate a kick_notify).  Also, it looks 
 like vhost is sending a lot of notifications for packets it has received 
 before the guest can get scheduled to disable notifications and begin 
 processing the packets

Hmm, is this really what happens to you?  The effect would be that guest
gets an interrupt while notifications are disabled in guest, right? Could
you add a counter and check this please?

Another possible thing to try would be these old patches to publish used index
from guest to make sure this double interrupt does not happen:
 [PATCHv2] virtio: put last seen used index into ring itself
 [PATCHv2] vhost-net: utilize PUBLISH_USED_IDX feature
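The gist of those patches, sketched here from the description rather than copied from them: the guest publishes the last used entry it has consumed where the host can read it, and the host signals only when it has just filled past that point. A wrap-safe host-side check could look like:

#include <stdbool.h>
#include <stdint.h>

/* True iff 'seen' lies in [old_used, new_used) modulo 2^16: the guest had
 * caught up into the range of used entries just filled, so it may be idle
 * and needs an interrupt.  If it is still lagging behind 'old_used', it
 * already has work pending and the signal can be skipped. */
static bool need_signal(uint16_t seen, uint16_t new_used, uint16_t old_used)
{
	return (uint16_t)(new_used - seen - 1) < (uint16_t)(new_used - old_used);
}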

 resulting in some lock contention in the guest (and 
 high interrupt rates).
 
 Some thoughts for the transmit path...  can vhost be enhanced to do some 
 adaptive polling so that the number of kick_notify events are reduced and 
 replaced by 

Re: Network performance with small packets - continued

2011-03-08 Thread Michael S. Tsirkin
On Mon, Mar 07, 2011 at 04:31:41PM -0600, Tom Lendacky wrote:
 I used the uperf tool to do this after verifying the results against netperf. 
  
 Uperf allows the specification of the number of connections as a parameter in 
 an XML file as opposed to launching, in this case, 100 separate instances of 
 netperf.

Could you post the XML on the list please?

-- 
MST


Re: Network performance with small packets

2011-02-08 Thread Rusty Russell
On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote:
 On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
   Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM
  
   On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
 Confused. We compare capacity to skb frags, no?
 That's sg I think ...
   
Current guest kernel use indirect buffers, num_free returns how many
available descriptors not skb frags. So it's wrong here.
   
Shirley
  
   I see. Good point. In other words when we complete the buffer
   it was indirect, but when we add a new one we
   can not allocate indirect so we consume.
   And then we start the queue and add will fail.
   I guess we need some kind of API to figure out
   whether the buf we complete was indirect?

I've finally read this thread... I think we need to get more serious
with our stats gathering to diagnose these kind of performance issues.

This is a start; it should tell us what is actually happening to the
virtio ring(s) without significant performance impact...

Subject: virtio: CONFIG_VIRTIO_STATS

For performance problems we'd like to know exactly what the ring looks
like.  This patch adds stats indexed by how-full-ring-is; we could extend
it to also record them by how-used-ring-is if we need.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au

diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
--- a/drivers/virtio/Kconfig
+++ b/drivers/virtio/Kconfig
@@ -7,6 +7,14 @@ config VIRTIO_RING
 	tristate
 	depends on VIRTIO
 
+config VIRTIO_STATS
+	bool "Virtio debugging stats (EXPERIMENTAL)"
+	depends on VIRTIO_RING
+	select DEBUG_FS
+	---help---
+	  Virtio stats collected by how full the ring is at any time,
+	  presented under debugfs/virtio/<name>-<vq>/<num-used>/
+
 config VIRTIO_PCI
 	tristate "PCI driver for virtio devices (EXPERIMENTAL)"
 	depends on PCI && EXPERIMENTAL
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -21,6 +21,7 @@
 #include <linux/virtio_config.h>
 #include <linux/device.h>
 #include <linux/slab.h>
+#include <linux/debugfs.h>
 
 /* virtio guest is communicating with a virtual device that actually runs on
  * a host processor.  Memory barriers are used to control SMP effects. */
@@ -95,6 +96,11 @@ struct vring_virtqueue
 	/* How to notify other side. FIXME: commonalize hcalls! */
 	void (*notify)(struct virtqueue *vq);
 
+#ifdef CONFIG_VIRTIO_STATS
+	struct vring_stat *stats;
+	struct dentry *statdir;
+#endif
+
 #ifdef DEBUG
 	/* They're supposed to lock for us. */
 	unsigned int in_use;
@@ -106,6 +112,87 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+#ifdef CONFIG_VIRTIO_STATS
+/* We have an array of these, indexed by how full the ring is. */
+struct vring_stat {
+	/* How many interrupts? */
+	size_t interrupt_nowork, interrupt_work;
+	/* How many non-notify kicks, how many notify kicks, how many add notify? */
+	size_t kick_no_notify, kick_notify, add_notify;
+	/* How many adds? */
+	size_t add_direct, add_indirect, add_fail;
+	/* How many gets? */
+	size_t get;
+	/* How many disable callbacks? */
+	size_t disable_cb;
+	/* How many enables? */
+	size_t enable_cb_retry, enable_cb_success;
+};
+
+static struct dentry *virtio_stats;
+
+static void create_stat_files(struct vring_virtqueue *vq)
+{
+	char name[80];
+	unsigned int i;
+
+	/* Racy in theory, but we don't care. */
+	if (!virtio_stats)
+		virtio_stats = debugfs_create_dir("virtio-stats", NULL);
+
+	sprintf(name, "%s-%s", dev_name(&vq->vq.vdev->dev), vq->vq.name);
+	vq->statdir = debugfs_create_dir(name, virtio_stats);
+
+	for (i = 0; i < vq->vring.num; i++) {
+		struct dentry *dir;
+
+		sprintf(name, "%i", i);
+		dir = debugfs_create_dir(name, vq->statdir);
+		debugfs_create_size_t("interrupt_nowork", 0400, dir,
+				      &vq->stats[i].interrupt_nowork);
+		debugfs_create_size_t("interrupt_work", 0400, dir,
+				      &vq->stats[i].interrupt_work);
+		debugfs_create_size_t("kick_no_notify", 0400, dir,
+				      &vq->stats[i].kick_no_notify);
+		debugfs_create_size_t("kick_notify", 0400, dir,
+				      &vq->stats[i].kick_notify);
+		debugfs_create_size_t("add_notify", 0400, dir,
+				      &vq->stats[i].add_notify);
+		debugfs_create_size_t("add_direct", 0400, dir,
+				      &vq->stats[i].add_direct);
+		debugfs_create_size_t("add_indirect", 0400, dir,
+

Re: Network performance with small packets

2011-02-08 Thread Michael S. Tsirkin
On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote:
 On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote:
  On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM
   
On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
 On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
  Confused. We compare capacity to skb frags, no?
  That's sg I think ...

 Current guest kernel use indirect buffers, num_free returns how many
 available descriptors not skb frags. So it's wrong here.

 Shirley
   
I see. Good point. In other words when we complete the buffer
it was indirect, but when we add a new one we
can not allocate indirect so we consume.
And then we start the queue and add will fail.
I guess we need some kind of API to figure out
whether the buf we complete was indirect?
 
 I've finally read this thread... I think we need to get more serious
 with our stats gathering to diagnose these kind of performance issues.
 
 This is a start; it should tell us what is actually happening to the
 virtio ring(s) without significant performance impact...
 
 Subject: virtio: CONFIG_VIRTIO_STATS
 
 For performance problems we'd like to know exactly what the ring looks
 like.  This patch adds stats indexed by how-full-ring-is; we could extend
 it to also record them by how-used-ring-is if we need.
 
 Signed-off-by: Rusty Russell ru...@rustcorp.com.au

Not sure whether the intent is to merge this. If yes -
would it make sense to use tracing for this instead?
That's what kvm does.

 diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
 --- a/drivers/virtio/Kconfig
 +++ b/drivers/virtio/Kconfig
 @@ -7,6 +7,14 @@ config VIRTIO_RING
  	tristate
  	depends on VIRTIO
  
 +config VIRTIO_STATS
 +	bool "Virtio debugging stats (EXPERIMENTAL)"
 +	depends on VIRTIO_RING
 +	select DEBUG_FS
 +	---help---
 +	  Virtio stats collected by how full the ring is at any time,
 +	  presented under debugfs/virtio/<name>-<vq>/<num-used>/
 +
  config VIRTIO_PCI
  	tristate "PCI driver for virtio devices (EXPERIMENTAL)"
  	depends on PCI && EXPERIMENTAL
 diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
 --- a/drivers/virtio/virtio_ring.c
 +++ b/drivers/virtio/virtio_ring.c
 @@ -21,6 +21,7 @@
  #include <linux/virtio_config.h>
  #include <linux/device.h>
  #include <linux/slab.h>
 +#include <linux/debugfs.h>
  
  /* virtio guest is communicating with a virtual device that actually runs on
   * a host processor.  Memory barriers are used to control SMP effects. */
 @@ -95,6 +96,11 @@ struct vring_virtqueue
  	/* How to notify other side. FIXME: commonalize hcalls! */
  	void (*notify)(struct virtqueue *vq);
  
 +#ifdef CONFIG_VIRTIO_STATS
 +	struct vring_stat *stats;
 +	struct dentry *statdir;
 +#endif
 +
  #ifdef DEBUG
  	/* They're supposed to lock for us. */
  	unsigned int in_use;
 @@ -106,6 +112,87 @@ struct vring_virtqueue
  
  #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
  
 +#ifdef CONFIG_VIRTIO_STATS
 +/* We have an array of these, indexed by how full the ring is. */
 +struct vring_stat {
 +	/* How many interrupts? */
 +	size_t interrupt_nowork, interrupt_work;
 +	/* How many non-notify kicks, how many notify kicks, how many add notify? */
 +	size_t kick_no_notify, kick_notify, add_notify;
 +	/* How many adds? */
 +	size_t add_direct, add_indirect, add_fail;
 +	/* How many gets? */
 +	size_t get;
 +	/* How many disable callbacks? */
 +	size_t disable_cb;
 +	/* How many enables? */
 +	size_t enable_cb_retry, enable_cb_success;
 +};
 +
 +static struct dentry *virtio_stats;
 +
 +static void create_stat_files(struct vring_virtqueue *vq)
 +{
 +	char name[80];
 +	unsigned int i;
 +
 +	/* Racy in theory, but we don't care. */
 +	if (!virtio_stats)
 +		virtio_stats = debugfs_create_dir("virtio-stats", NULL);
 +
 +	sprintf(name, "%s-%s", dev_name(&vq->vq.vdev->dev), vq->vq.name);
 +	vq->statdir = debugfs_create_dir(name, virtio_stats);
 +
 +	for (i = 0; i < vq->vring.num; i++) {
 +		struct dentry *dir;
 +
 +		sprintf(name, "%i", i);
 +		dir = debugfs_create_dir(name, vq->statdir);
 +		debugfs_create_size_t("interrupt_nowork", 0400, dir,
 +				      &vq->stats[i].interrupt_nowork);
 +		debugfs_create_size_t("interrupt_work", 0400, dir,
 +				      &vq->stats[i].interrupt_work);
 +		debugfs_create_size_t("kick_no_notify", 0400, dir,
 +				      &vq->stats[i].kick_no_notify);
 +		debugfs_create_size_t("kick_notify", 0400, dir,
 +				      &vq->stats[i].kick_notify);
 +		debugfs_create_size_t("add_notify", 0400, dir,
 +

Re: Network performance with small packets

2011-02-08 Thread Rusty Russell
On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote:
 On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote:
  On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote:
   On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
 Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM

 On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
  On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
   Confused. We compare capacity to skb frags, no?
   That's sg I think ...
 
  Current guest kernel use indirect buffers, num_free returns how many
  available descriptors not skb frags. So it's wrong here.
 
  Shirley

 I see. Good point. In other words when we complete the buffer
 it was indirect, but when we add a new one we
 can not allocate indirect so we consume.
 And then we start the queue and add will fail.
 I guess we need some kind of API to figure out
 whether the buf we complete was indirect?
  
  I've finally read this thread... I think we need to get more serious
  with our stats gathering to diagnose these kind of performance issues.
  
  This is a start; it should tell us what is actually happening to the
  virtio ring(s) without significant performance impact...
  
  Subject: virtio: CONFIG_VIRTIO_STATS
  
  For performance problems we'd like to know exactly what the ring looks
  like.  This patch adds stats indexed by how-full-ring-is; we could extend
  it to also record them by how-used-ring-is if we need.
  
  Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 
 Not sure whether the intent is to merge this. If yes -
 would it make sense to use tracing for this instead?
 That's what kvm does.

Intent wasn't; I've not used tracepoints before, but maybe we should
consider a longer-term monitoring solution?

Patch welcome!

Cheers,
Rusty.


Re: Network performance with small packets

2011-02-08 Thread Michael S. Tsirkin
On Wed, Feb 09, 2011 at 12:09:35PM +1030, Rusty Russell wrote:
 On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote:
  On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote:
   On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote:
On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
  Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM
 
  On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
   On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
Confused. We compare capacity to skb frags, no?
That's sg I think ...
  
   Current guest kernel use indirect buffers, num_free returns how 
   many
   available descriptors not skb frags. So it's wrong here.
  
   Shirley
 
  I see. Good point. In other words when we complete the buffer
  it was indirect, but when we add a new one we
  can not allocate indirect so we consume.
  And then we start the queue and add will fail.
  I guess we need some kind of API to figure out
  whether the buf we complete was indirect?
   
   I've finally read this thread... I think we need to get more serious
   with our stats gathering to diagnose these kind of performance issues.
   
   This is a start; it should tell us what is actually happening to the
   virtio ring(s) without significant performance impact...
   
   Subject: virtio: CONFIG_VIRTIO_STATS
   
   For performance problems we'd like to know exactly what the ring looks
   like.  This patch adds stats indexed by how-full-ring-is; we could extend
   it to also record them by how-used-ring-is if we need.
   
   Signed-off-by: Rusty Russell ru...@rustcorp.com.au
  
  Not sure whether the intent is to merge this. If yes -
  would it make sense to use tracing for this instead?
  That's what kvm does.
 
 Intent wasn't; I've not used tracepoints before, but maybe we should
 consider a longer-term monitoring solution?
 
 Patch welcome!
 
 Cheers,
 Rusty.

Sure, I'll look into this.

-- 
MST


Re: Network performance with small packets

2011-02-08 Thread Stefan Hajnoczi
On Wed, Feb 9, 2011 at 1:55 AM, Michael S. Tsirkin m...@redhat.com wrote:
 On Wed, Feb 09, 2011 at 12:09:35PM +1030, Rusty Russell wrote:
 On Wed, 9 Feb 2011 11:23:45 am Michael S. Tsirkin wrote:
  On Wed, Feb 09, 2011 at 11:07:20AM +1030, Rusty Russell wrote:
   On Wed, 2 Feb 2011 03:12:22 pm Michael S. Tsirkin wrote:
On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
  Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM
 
  On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
   On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
Confused. We compare capacity to skb frags, no?
That's sg I think ...
  
   Current guest kernel use indirect buffers, num_free returns how 
   many
   available descriptors not skb frags. So it's wrong here.
  
   Shirley
 
  I see. Good point. In other words when we complete the buffer
  it was indirect, but when we add a new one we
  can not allocate indirect so we consume.
  And then we start the queue and add will fail.
  I guess we need some kind of API to figure out
  whether the buf we complete was indirect?
  
   I've finally read this thread... I think we need to get more serious
   with our stats gathering to diagnose these kind of performance issues.
  
   This is a start; it should tell us what is actually happening to the
   virtio ring(s) without significant performance impact...
  
   Subject: virtio: CONFIG_VIRTIO_STATS
  
   For performance problems we'd like to know exactly what the ring looks
   like.  This patch adds stats indexed by how-full-ring-is; we could extend
   it to also record them by how-used-ring-is if we need.
  
   Signed-off-by: Rusty Russell ru...@rustcorp.com.au
 
  Not sure whether the intent is to merge this. If yes -
  would it make sense to use tracing for this instead?
  That's what kvm does.

 Intent wasn't; I've not used tracepoints before, but maybe we should
 consider a longer-term monitoring solution?

 Patch welcome!

 Cheers,
 Rusty.

 Sure, I'll look into this.

There are several virtio trace events already in QEMU today (see the
trace-events file):
virtqueue_fill(void *vq, const void *elem, unsigned int len, unsigned int idx) "vq %p elem %p len %u idx %u"
virtqueue_flush(void *vq, unsigned int count) "vq %p count %u"
virtqueue_pop(void *vq, void *elem, unsigned int in_num, unsigned int out_num) "vq %p elem %p in_num %u out_num %u"
virtio_queue_notify(void *vdev, int n, void *vq) "vdev %p n %d vq %p"
virtio_irq(void *vq) "vq %p"
virtio_notify(void *vdev, void *vq) "vdev %p vq %p"

These can be used by building QEMU with a suitable tracing backend
like SystemTap (see docs/tracing.txt).

Inside the guest I've used dynamic ftrace in the past, although static
tracepoints would be nice.

Stefan


Re: Network performance with small packets

2011-02-03 Thread Shirley Ma
On Thu, 2011-02-03 at 08:13 +0200, Michael S. Tsirkin wrote:
  Initial TCP_STREAM performance results I got for guest to local
 host 
  4.2Gb/s for 1K message size, (vs. 2.5Gb/s)
  6.2Gb/s for 2K message size, and (vs. 3.8Gb/s)
  9.8Gb/s for 4K message size. (vs.5.xGb/s)
 
 What is the average packet size, # bytes per ack, and the # of
 interrupts
 per packet? It could be that just slowing down transmission
 makes GSO work better. 

There are no TX interrupts with packet dropping.

GSO/TSO is the key for small-message performance; w/o GSO/TSO, the
performance is limited to about 2Gb/s no matter how big the message size
is. I think any work we try here will increase the large-packet rate.
BTW, with packet dropping, TCP increased fast retransmits, not slow
starts.

I will collect tcpdump and netstat data before and after to compare packet
size/rate with and without the patch.

Thanks
Shirley



Re: Network performance with small packets

2011-02-03 Thread Michael S. Tsirkin
On Thu, Feb 03, 2011 at 07:58:00AM -0800, Shirley Ma wrote:
 On Thu, 2011-02-03 at 08:13 +0200, Michael S. Tsirkin wrote:
   Initial TCP_STREAM performance results I got for guest to local
  host 
   4.2Gb/s for 1K message size, (vs. 2.5Gb/s)
   6.2Gb/s for 2K message size, and (vs. 3.8Gb/s)
   9.8Gb/s for 4K message size. (vs.5.xGb/s)
  
  What is the average packet size, # bytes per ack, and the # of
  interrupts
  per packet? It could be that just slowing down transmission
  makes GSO work better. 
 
 There is no TX interrupts with dropping packet.
 
 GSO/TSO is the key for small message performance, w/o GSO/TSO, the
 performance is limited to about 2Gb/s no matter how big the message size
 it is. I think any work we try here will increase large packet size
 rate. BTW for dropping packet, TCP increased fast retrans, not slow
 start. 
 
 I will collect tcpdump, netstart before and after data to compare packet
 size/rate w/o w/i the patch.
 
 Thanks
 Shirley

Just a thought: does it help to make tx queue len of the
virtio device smaller?
E.g. match the vq size?

-- 
MST


Re: Network performance with small packets

2011-02-03 Thread Shirley Ma
On Thu, 2011-02-03 at 18:20 +0200, Michael S. Tsirkin wrote:
 Just a thought: does it help to make tx queue len of the
 virtio device smaller? 

Yes, that's what I did before. Reducing txqueuelen causes qdisc to drop
packets early, but it's hard to get a performance gain by tuning tx
queue len: I tried it on different systems and each required different
values.

Also, I tried another patch: instead of dropping packets, I used a timer
(2 jiffies) to enable/disable the queue on the guest without interrupt
notification. It gets better performance than the original, but worse
than dropping packets, because netif stop/wake-up happens too often.

vhost definitely needs improvement for handling small message sizes.
It's unable to handle the small-message packet rate for queue size 256,
even with ring size 1024. QEMU doesn't seem to allow increasing the TX
ring size to 2K (qemu-kvm fails to start with no errors), so I am not
able to test that.

Thanks
Shirley



Re: Network performance with small packets

2011-02-02 Thread Shirley Ma
On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
 Yes, I think doing this in the host is much simpler,
 just send an interrupt after there's a decent amount
 of space in the queue.
 
 Having said that the simple heuristic that I coded
 might be a bit too simple.

From the debugging output I have seen so far (a single small-message
TCP_STREAM test), I think the right approach is to patch both guest and
vhost. The problem I have found is a regression for the single
small-message TCP_STREAM test: the old kernel works well for TCP_STREAM,
only the new kernel has the problem.

For Steven's problem, it's a multiple-stream TCP_RR issue; the old guest
doesn't perform well, and neither does the new guest kernel. We tested
the reduced vhost signaling patch before, and it didn't help the
performance at all.

Thanks
Shirley



Re: Network performance with small packets

2011-02-02 Thread Shirley Ma
On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
 On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
  On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
   w/i guest change, I played around the parameters,for example: I
 could
   get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
   size,
   w/i dropping packet, I was able to get up to 6.2Gb/s with similar
 CPU
   usage. 
  
  I meant w/o guest change, only vhost changes. Sorry about that.
  
  Shirley
 
 Ah, excellent. What were the parameters? 

I used half of the ring size (129) for the packet counter, but the
performance is still not as good as dropping packets on the guest:
3.7Gb/s vs. 6.2Gb/s.

Shirley



Re: Network performance with small packets

2011-02-02 Thread Michael S. Tsirkin
On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
 On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
  Yes, I think doing this in the host is much simpler,
  just send an interrupt after there's a decent amount
  of space in the queue.
  
  Having said that the simple heuristic that I coded
  might be a bit too simple.
 
 From the debugging out what I have seen so far (a single small message
 TCP_STREAM test), I think the right approach is to patch both guest and
 vhost.

One problem is that slowing down the guest helps here.
So there's a chance that just by adding complexity
in the guest driver we get a small improvement :(

We can't rely on a patched guest anyway, so
I think it is best to test guest and host changes separately.

And I do agree something needs to be done in guest too,
for example when vqs share an interrupt, we
might invoke a callback when we see vq is not empty
even though it's not requested. Probably should
check interrupts enabled here?

 The problem I have found is a regression for single  small
 message TCP_STREAM test. Old kernel works well for TCP_STREAM, only new
 kernel has problem.

Likely new kernel is faster :)

 For Steven's problem, it's multiple stream TCP_RR issues, the old guest
 doesn't perform well, so does new guest kernel. We tested reducing vhost
 signaling patch before, it didn't help the performance at all.
 
 Thanks
 Shirley

Yes, it seems unrelated to tx interrupts.

-- 
MST


Re: Network performance with small packets

2011-02-02 Thread Michael S. Tsirkin
On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote:
 On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
  On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
   On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
w/i guest change, I played around the parameters,for example: I
  could
get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
size,
w/i dropping packet, I was able to get up to 6.2Gb/s with similar
  CPU
usage. 
   
   I meant w/o guest change, only vhost changes. Sorry about that.
   
   Shirley
  
  Ah, excellent. What were the parameters? 
 
 I used half of the ring size 129 for packet counters, but the
 performance is still not as good as dropping packets on guest, 3.7 Gb/s
 vs. 6.2Gb/s.
 
 Shirley

And this is with sndbuf=0 in host, yes?
And do you see a lot of tx interrupts?
How many packets per interrupt?

-- 
MST


Re: Network performance with small packets

2011-02-02 Thread Shirley Ma
On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote:
 On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
  On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
   Yes, I think doing this in the host is much simpler,
   just send an interrupt after there's a decent amount
   of space in the queue.
   
   Having said that the simple heuristic that I coded
   might be a bit too simple.
  
  From the debugging out what I have seen so far (a single small
 message
  TCP_STREAM test), I think the right approach is to patch both guest
 and
  vhost.
 
 One problem is slowing down the guest helps here.
 So there's a chance that just by adding complexity
 in guest driver we get a small improvement :(
 
 We can't rely on a patched guest anyway, so
 I think it is best to test guest and host changes separately.
 
 And I do agree something needs to be done in guest too,
 for example when vqs share an interrupt, we
 might invoke a callback when we see vq is not empty
 even though it's not requested. Probably should
 check interrupts enabled here?

Yes, I modified the xmit callback to something like below:

static void skb_xmit_done(struct virtqueue *svq)
{
	struct virtnet_info *vi = svq->vdev->priv;

	/* Suppress further interrupts. */
	virtqueue_disable_cb(svq);

	/* We were probably waiting for more output buffers. */
	if (netif_queue_stopped(vi->dev)) {
		free_old_xmit_skbs(vi);
		if (virtqueue_free_size(svq) <= svq->vring.num / 2) {
			virtqueue_enable_cb(svq);
			return;
		}
	}
	netif_wake_queue(vi->dev);
}

  The problem I have found is a regression for single  small
   message TCP_STREAM test. Old kernel works well for TCP_STREAM, only
 new
  kernel has problem.
 
 Likely new kernel is faster :)

  For Steven's problem, it's multiple stream TCP_RR issues, the old
 guest
  doesn't perform well, so does new guest kernel. We tested reducing
 vhost
  signaling patch before, it didn't help the performance at all.
  
  Thanks
  Shirley
 
 Yes, it seems unrelated to tx interrupts. 

The issue is more likely related to latency. Do you have anything in
mind on how to reduce vhost latency?

Thanks
Shirley



Re: Network performance with small packets

2011-02-02 Thread Shirley Ma
On Wed, 2011-02-02 at 17:48 +0200, Michael S. Tsirkin wrote:
 And this is with sndbuf=0 in host, yes?
 And do you see a lot of tx interrupts?
 How packets per interrupt?

Nope, sndbuf doesn't matter since I never hit the sock wmem condition in
vhost. I am still playing around; let me know what data you would like me
to collect.

Thanks
Shirley



Re: Network performance with small packets

2011-02-02 Thread Michael S. Tsirkin
On Wed, Feb 02, 2011 at 09:10:35AM -0800, Shirley Ma wrote:
 On Wed, 2011-02-02 at 17:47 +0200, Michael S. Tsirkin wrote:
  On Wed, Feb 02, 2011 at 07:39:45AM -0800, Shirley Ma wrote:
   On Wed, 2011-02-02 at 12:48 +0200, Michael S. Tsirkin wrote:
Yes, I think doing this in the host is much simpler,
just send an interrupt after there's a decent amount
of space in the queue.

Having said that the simple heuristic that I coded
might be a bit too simple.
   
   From the debugging out what I have seen so far (a single small
  message
   TCP_STREAM test), I think the right approach is to patch both guest
  and
   vhost.
  
  One problem is slowing down the guest helps here.
  So there's a chance that just by adding complexity
  in guest driver we get a small improvement :(
  
  We can't rely on a patched guest anyway, so
  I think it is best to test guest and host changes separately.
  
  And I do agree something needs to be done in guest too,
  for example when vqs share an interrupt, we
  might invoke a callback when we see vq is not empty
  even though it's not requested. Probably should
  check interrupts enabled here?
 
 Yes, I modified xmit callback something like below:
 
 static void skb_xmit_done(struct virtqueue *svq)
 {
 	struct virtnet_info *vi = svq->vdev->priv;
 
 	/* Suppress further interrupts. */
 	virtqueue_disable_cb(svq);
 
 	/* We were probably waiting for more output buffers. */
 	if (netif_queue_stopped(vi->dev)) {
 		free_old_xmit_skbs(vi);
 		if (virtqueue_free_size(svq) <= svq->vring.num / 2) {
 			virtqueue_enable_cb(svq);
 			return;
 		}
 	}
 	netif_wake_queue(vi->dev);
 }

OK, but this should have no effect with a vhost patch
which should ensure that we don't get an interrupt
until the queue is at least half empty.
Right?
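A hedged sketch of such a vhost-side heuristic (the pending_used counter is illustrative; this is not the actual patch):

static void vhost_maybe_signal(struct vhost_dev *dev,
			       struct vhost_virtqueue *vq,
			       unsigned int just_used)
{
	/* Accumulate used entries and signal the guest only once a sizeable
	 * chunk of the ring has been consumed since the last interrupt. */
	vq->pending_used += just_used;
	if (vq->pending_used >= vq->num / 2) {
		vhost_signal(dev, vq);
		vq->pending_used = 0;
	}
}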

   The problem I have found is a regression for single  small
    message TCP_STREAM test. Old kernel works well for TCP_STREAM, only
  new
   kernel has problem.
  
  Likely new kernel is faster :)
 
   For Steven's problem, it's multiple stream TCP_RR issues, the old
  guest
   doesn't perform well, so does new guest kernel. We tested reducing
  vhost
   signaling patch before, it didn't help the performance at all.
   
   Thanks
   Shirley
  
  Yes, it seems unrelated to tx interrupts. 
 
 The issue is more likely related to latency.

Could be. Why do you think so?

 Do you have anything in
 mind on how to reduce vhost latency?
 
 Thanks
 Shirley

Hmm, bypassing the bridge might help a bit.
Are you using tap+bridge or macvtap?


Re: Network performance with small packets

2011-02-02 Thread Shirley Ma
On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
 OK, but this should have no effect with a vhost patch
 which should ensure that we don't get an interrupt
 until the queue is at least half empty.
 Right?

There should be some coordination between guest and vhost. We shouldn't
count the TX packets when netif queue is enabled since next guest TX
xmit will free any used buffers in vhost. We need to be careful here in
case we miss the interrupts when netif queue has stopped.

However we can't change old guest so we can test the patches separately
for guest only, vhost only, and the combination.

   
   Yes, it seems unrelated to tx interrupts. 
  
  The issue is more likely related to latency.
 
 Could be. Why do you think so?

Since I played with a latency hack, I can see a performance difference for
different latencies.

  Do you have anything in
  mind on how to reduce vhost latency?
  
  Thanks
  Shirley
 
 Hmm, bypassing the bridge might help a bit.
 Are you using tap+bridge or macvtap? 

I am using tap+bridge for TCP_RR test, I think Steven tested macvtap
before. He might have some data from his workload performance
measurement.

Shirley



Re: Network performance with small packets

2011-02-02 Thread Michael S. Tsirkin
On Wed, Feb 02, 2011 at 07:42:51AM -0800, Shirley Ma wrote:
 On Wed, 2011-02-02 at 12:49 +0200, Michael S. Tsirkin wrote:
  On Tue, Feb 01, 2011 at 11:33:49PM -0800, Shirley Ma wrote:
   On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
w/i guest change, I played around the parameters,for example: I
  could
get 3.7Gb/s with 42% CPU BW increasing from 2.5Gb/s for 1K message
size,
w/i dropping packet, I was able to get up to 6.2Gb/s with similar
  CPU
usage. 
   
   I meant w/o guest change, only vhost changes. Sorry about that.
   
   Shirley
  
  Ah, excellent. What were the parameters? 
 
 I used half of the ring size 129 for packet counters,
 but the
 performance is still not as good as dropping packets on guest, 3.7 Gb/s
 vs. 6.2Gb/s.
 
 Shirley

How many packets and bytes per interrupt are sent?
Also, what about other values for the counters and other counters?

What does your patch do? Just drop packets instead of
stopping the interface?

To have an understanding when should we drop packets
in the guest, we need to know *why* does it help.
Otherwise, how do we know it will work for others?
Note that qdisc will drop packets when it overruns -
so what is different? Also, are we over-running some other queue
somewhere?

-- 
MST


Re: Network performance with small packets

2011-02-02 Thread Shirley Ma
On Wed, 2011-02-02 at 20:20 +0200, Michael S. Tsirkin wrote:
 How many packets and bytes per interrupt are sent?
 Also, what about other values for the counters and other counters?
 
 What does your patch do? Just drop packets instead of
 stopping the interface?
 
 To have an understanding when should we drop packets
 in the guest, we need to know *why* does it help.
 Otherwise, how do we know it will work for others?
 Note that qdisc will drop packets when it overruns -
 so what is different? Also, are we over-running some other queue
 somewhere? 

Agreed. I am trying to put more debugging output to look for all these
answers.

Shirley



Re: Network performance with small packets

2011-02-02 Thread Michael S. Tsirkin
On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
 On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
  OK, but this should have no effect with a vhost patch
  which should ensure that we don't get an interrupt
  until the queue is at least half empty.
  Right?
 
 There should be some coordination between guest and vhost.

What kind of coordination? With a patched vhost and a full ring,
you should get an interrupt per 100 packets.
Is this what you see? And if yes, isn't the guest patch
doing nothing then?

 We shouldn't
 count the TX packets when netif queue is enabled since next guest TX
 xmit will free any used buffers in vhost. We need to be careful here in
 case we miss the interrupts when netif queue has stopped.
 
 However we can't change old guest so we can test the patches separately
 for guest only, vhost only, and the combination.
 

Yes, it seems unrelated to tx interrupts. 
   
   The issue is more likely related to latency.
  
  Could be. Why do you think so?
 
 Since I played with latency hack, I can see performance difference for
 different latency.

Which hack was that?

   Do you have anything in
   mind on how to reduce vhost latency?
   
   Thanks
   Shirley
  
  Hmm, bypassing the bridge might help a bit.
  Are you using tap+bridge or macvtap? 
 
 I am using tap+bridge for TCP_RR test, I think Steven tested macvtap
 before. He might have some data from his workload performance
 measurement.
 
 Shirley


Re: Network performance with small packets

2011-02-02 Thread Michael S. Tsirkin
On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote:
 
 I am working on a KVM network performance issue found in our lab running
 the DayTrader benchmark.  The benchmark throughput takes a significant hit
 when running the application server in a KVM guest versus on bare metal.
 We have dug into the problem and found that DayTrader's use of small
 packets exposes KVM's overhead of handling network packets.  I have been
 able to reproduce the performance hit with a simpler setup using the
 netperf benchmark with the TCP_RR test and the request and response sizes
 set to 256 bytes.  I run the benchmark between two physical systems, each
 using a 1GB link.  In order to get the maximum throughput for the system I
 have to run 100 instances of netperf.  When I run the netserver processes
 in a guest, I see a maximum throughput that is 51% of what I get if I run
 the netserver processes directly on the host.  The CPU utilization in the
 guest is only 85% at maximum throughput, whereas it is 100% on bare metal.

You are stressing the scheduler pretty hard with this test :)
Is your real benchmark also using a huge number of threads?
If it's not, you might be seeing a different issue.
IOW, the netperf degradation might not be network-related at all,
but have to do with speed of context switch in guest.
Thoughts?

 The KVM host has 16 CPUs.  The KVM guest is configured with 2 VCPUs.  When
 I run netperf on the host I boot the host with maxcpus=2 on the kernel
 command line.  The host is running the current KVM upstream kernel along
 with the current upstream qemu.  Here is the qemu command used to launch
 the guest:
 /build/qemu-kvm/x86_64-softmmu/qemu-system-x86_64 -name glasgow-RH60 -m 32768 
 -drive file=/build/guest-data/glasgow-RH60.img,if=virtio,index=0,boot=on
  -drive file=/dev/virt/WAS,if=virtio,index=1 -net 
 nic,model=virtio,vlan=3,macaddr=00:1A:64:E5:00:63,netdev=nic0 -netdev 
 tap,id=nic0,vhost=on -smp 2
 -vnc :1 -monitor telnet::4499,server,nowait -serial 
 telnet::8899,server,nowait --mem-path /libhugetlbfs -daemonize
 
 We have tried various proposed fixes, each with varying amounts of success.
 One such fix was to add code to the vhost thread such that when it found
 the work queue empty it wouldn't just exit the thread but rather would
 delay for 50 microseconds and then recheck the queue.  If there was work on
 the queue it would loop back and process it, else it would exit the thread.
 The change got us a 13% improvement in the DayTrader throughput.
 
 Running the same netperf configuration on the same hardware but using a
 different hypervisor gets us significantly better throughput numbers.   The
 guest on that hypervisor runs at 100% CPU utilization.  The various fixes
 we have tried have not gotten us close to the throughput seen on the other
 hypervisor.  I'm looking for ideas/input from the KVM experts on how to
 make KVM perform better when handling small packets.
 
 Thanks,
 Steve
 


Re: Network performance with small packets

2011-02-02 Thread Steve Dobbelstein
Michael S. Tsirkin m...@redhat.com wrote on 02/02/2011 12:38:47 PM:

 On Tue, Jan 25, 2011 at 03:09:34PM -0600, Steve Dobbelstein wrote:
 
  I am working on a KVM network performance issue found in our lab
running
  the DayTrader benchmark.  The benchmark throughput takes a significant
hit
  when running the application server in a KVM guest verses on bare
metal.
  We have dug into the problem and found that DayTrader's use of small
  packets exposes KVM's overhead of handling network packets.  I have
been
  able to reproduce the performance hit with a simpler setup using the
  netperf benchmark with the TCP_RR test and the request and response
sizes
  set to 256 bytes.  I run the benchmark between two physical systems,
each
  using a 1GB link.  In order to get the maximum throughput for the
system I
  have to run 100 instances of netperf.  When I run the netserver
processes
  in a guest, I see a maximum throughput that is 51% of what I get if I
run
  the netserver processes directly on the host.  The CPU utilization in
the
  guest is only 85% at maximum throughput, whereas it is 100% on bare
metal.

 You are stressing the scheduler pretty hard with this test :)
 Is your real benchmark also using a huge number of threads?

Yes.  The real benchmark has 60 threads handling client requests and 48
threads talking to a database server.

 If it's not, you might be seeing a different issue.
 IOW, the netperf degradation might not be network-related at all,
 but have to do with speed of context switch in guest.
 Thoughts?

Yes, context switches can add to the overhead.  We have that data captured,
and I can look at it.  What makes me think that's not the issue is that the
CPU utilization in the guest is only about 85% at maximum throughput.
Throughput/CPU is comparable to a different hypervisor, but that hypervisor
runs at full CPU utilization and gets better throughput.  I can't help but
think KVM would get better throughput if it could just keep the guest VCPUs
busy.

Recently I have been playing with different CPU pinnings for the guest
VCPUs and the vhost thread.  Certain combinations can get us up to a 35%
improvement in throughput with the same throughput/CPU ratio.  CPU
utilization was 94% -- not full CPU utilization, but it does illustrate
that we can get better throughput if we keep the guest VCPUs busy.  At this
point it's looking more like a scheduler issue.  We're starting to dig
through the scheduler code for clues.

Steve D.



Re: Network performance with small packets

2011-02-02 Thread Shirley Ma
On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote:
 On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
  On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
   OK, but this should have no effect with a vhost patch
   which should ensure that we don't get an interrupt
   until the queue is at least half empty.
   Right?
  
  There should be some coordination between guest and vhost.
 
 What kind of coordination? With a patched vhost, and a full ring.
 you should get an interrupt per 100 packets.
 Is this what you see? And if yes, isn't the guest patch
 doing nothing then?

vhost_signal won't be able to send any TX interrupts to the guest when the guest TX
interrupt is disabled. The guest TX interrupt is only enabled when running
out of descriptors.

  We shouldn't
  count the TX packets when netif queue is enabled since next guest TX
  xmit will free any used buffers in vhost. We need to be careful here
 in
  case we miss the interrupts when netif queue has stopped.
  
  However we can't change old guest so we can test the patches
 separately
  for guest only, vhost only, and the combination.
  
 
 Yes, it seems unrelated to tx interrupts. 

The issue is more likely related to latency.
   
   Could be. Why do you think so?
  
  Since I played with latency hack, I can see performance difference
 for
  different latency.
 
 Which hack was that? 

I tried to accumulate multiple guest to host notifications for TX xmits;
it did help multiple-stream TCP_RR results. I also forced vhost
handle_tx to handle more packets; both hacks seemed to help.

Thanks
Shirley



Re: Network performance with small packets

2011-02-02 Thread Michael S. Tsirkin
On Wed, Feb 02, 2011 at 11:29:35AM -0800, Shirley Ma wrote:
 On Wed, 2011-02-02 at 20:27 +0200, Michael S. Tsirkin wrote:
  On Wed, Feb 02, 2011 at 10:11:51AM -0800, Shirley Ma wrote:
   On Wed, 2011-02-02 at 19:32 +0200, Michael S. Tsirkin wrote:
OK, but this should have no effect with a vhost patch
which should ensure that we don't get an interrupt
until the queue is at least half empty.
Right?
   
   There should be some coordination between guest and vhost.
  
  What kind of coordination? With a patched vhost, and a full ring.
  you should get an interrupt per 100 packets.
  Is this what you see? And if yes, isn't the guest patch
  doing nothing then?
 
 vhost_signal won't be able send any TX interrupts to guest when guest TX
 interrupt is disabled. Guest TX interrupt is only enabled when running
 out of descriptors.

Well, this is also the only case where the queue is stopped, no?

   We shouldn't
   count the TX packets when netif queue is enabled since next guest TX
   xmit will free any used buffers in vhost. We need to be careful here
  in
   case we miss the interrupts when netif queue has stopped.
   
   However we can't change old guest so we can test the patches
  separately
   for guest only, vhost only, and the combination.
   
  
  Yes, it seems unrelated to tx interrupts. 
 
 The issue is more likely related to latency.

Could be. Why do you think so?
   
   Since I played with latency hack, I can see performance difference
  for
   different latency.
  
  Which hack was that? 
 
 I tried to accumulate multiple guest to host notifications for TX xmits,
 it did help multiple streams TCP_RR results;

I don't see a point to delay used idx update, do you?
So delaying just signal seems better, right?

 I also forced vhost
 handle_tx to handle more packets; both hack seemed help.
 
 Thanks
 Shirley

Haven't noticed that part, how does your patch make it
handle more packets?

-- 
MST


Re: Network performance with small packets

2011-02-02 Thread Shirley Ma
On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
 Well, this is also the only case where the queue is stopped, no?
Yes. I got some debugging data; I saw that sometimes there were many
packets waiting to be freed in the guest between vhost_signal & the guest xmit
callback. Looks like the time spent from vhost_signal to the guest
xmit callback is too long?

  I tried to accumulate multiple guest to host notifications for TX
 xmits,
  it did help multiple streams TCP_RR results;
 I don't see a point to delay used idx update, do you?

It might cause each vhost handle_tx pass to process more packets.

 So delaying just signal seems better, right?

I think I need to define the test matrix to collect data for TX xmit
from guest to host here for different tests.

Data to be collected:
-
1. kvm_stat for VM, I/O exits
2. cpu utilization for both guest and host
3. cat /proc/interrupts on guest
4. packets rate from vhost handle_tx per loop
5. guest netif queue stop rate
6. how many packets are waiting for free between vhost signaling and
guest callback
7. performance results

Test

1. TCP_STREAM single stream test for 1K to 4K message size
2. TCP_RR (64 instance test): 128 - 1K request/response size

Different hacks
---
1. Base line data ( with the patch to fix capacity check first,
free_old_xmit_skbs returns number of skbs)

2. Drop packet data (will put some debugging in generic networking code)

3. Delay guest netif queue wake up until certain descriptors (1/2 ring
size, 1/4 ring size...) are available once the queue has stopped.

4. Accumulate more packets per vhost signal in handle_tx?

5. 3 & 4 combinations

6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer? 

7. Accumulate more packets per vhost handle_tx() by adding some delay?

 Haven't noticed that part, how does your patch make it
handle more packets?

Added a delay in handle_tx().

What else?

It would take some time to do this.

Shirley



Re: Network performance with small packets

2011-02-02 Thread Michael S. Tsirkin
On Wed, Feb 02, 2011 at 01:03:05PM -0800, Shirley Ma wrote:
 On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
  Well, this is also the only case where the queue is stopped, no?
 Yes. I got some debugging data, I saw that sometimes there were so many
 packets were waiting for free in guest between vhost_signal & guest xmit
 callback.

What does this mean?

 Looks like the time spent too long from vhost_signal to guest
 xmit callback?



   I tried to accumulate multiple guest to host notifications for TX
  xmits,
   it did help multiple streams TCP_RR results;
  I don't see a point to delay used idx update, do you?
 
 It might cause per vhost handle_tx processed more packets.

I don't understand. It's a couple of writes - what is the issue?

  So delaying just signal seems better, right?
 
 I think I need to define the test matrix to collect data for TX xmit
 from guest to host here for different tests.
 
 Data to be collected:
 -
 1. kvm_stat for VM, I/O exits
 2. cpu utilization for both guest and host
 3. cat /proc/interrupts on guest
 4. packets rate from vhost handle_tx per loop
 5. guest netif queue stop rate
 6. how many packets are waiting for free between vhost signaling and
 guest callback
 7. performance results
 
 Test
 
 1. TCP_STREAM single stream test for 1K to 4K message size
 2. TCP_RR (64 instance test): 128 - 1K request/response size
 
 Different hacks
 ---
 1. Base line data ( with the patch to fix capacity check first,
 free_old_xmit_skbs returns number of skbs)
 
 2. Drop packet data (will put some debugging in generic networking code)
 
 3. Delay guest netif queue wake up until certain descriptors (1/2 ring
 size, 1/4 ring size...) are available once the queue has stopped.
 
 4. Accumulate more packets per vhost signal in handle_tx?
 
 5. 3 & 4 combinations
 
 6. Accumulate more packets per guest kick() (TCP_RR) by adding a timer? 
 
 7. Accumulate more packets per vhost handle_tx() by adding some delay?
 
  Haven't noticed that part, how does your patch make it
 handle more packets?
 
 Added a delay in handle_tx().
 
 What else?
 
 It would take sometimes to do this.
 
 Shirley


Need to think about this.


Re: Network performance with small packets

2011-02-02 Thread Shirley Ma
On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
  On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
   Well, this is also the only case where the queue is stopped, no?
  Yes. I got some debugging data, I saw that sometimes there were so
 many
  packets were waiting for free in guest between vhost_signal & guest
 xmit
  callback.
 
 What does this mean?

Let's look at the sequence here:

guest start_xmit()
    xmit_skb()
    if ring is full,
        enable_cb()

guest skb_xmit_done()
    disable_cb,
    printk free_old_xmit_skbs -- it was between more than 1/2 to
    full ring size
    printk vq->num_free

vhost handle_tx()
    if (guest interrupt is enabled)
        signal guest to free xmit buffers

So between the guest queue getting full / the queue being stopped / the callback
being enabled, and the guest receiving the callback from the host to
free_old_xmit_skbs, there were about 1/2 to a full ring of descriptors available.
I thought there would only be a few. (I disabled your vhost patch for this test.)
 

  Looks like the time spent too long from vhost_signal to guest
  xmit callback?
 
 
 
I tried to accumulate multiple guest to host notifications for
 TX
   xmits,
it did help multiple streams TCP_RR results;
   I don't see a point to delay used idx update, do you?
  
  It might cause per vhost handle_tx processed more packets.
 
 I don't understand. It's a couple of writes - what is the issue?

Oh, handle_tx could process more packets per loop for multiple streams
TCP_RR case. I need to print out the data rate per loop to confirm this.

Shirley



Re: Network performance with small packets

2011-02-02 Thread Shirley Ma
On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
  I think I need to define the test matrix to collect data for TX xmit
  from guest to host here for different tests.
  
  Data to be collected:
  -
  1. kvm_stat for VM, I/O exits
  2. cpu utilization for both guest and host
  3. cat /proc/interrupts on guest
  4. packets rate from vhost handle_tx per loop
  5. guest netif queue stop rate
  6. how many packets are waiting for free between vhost signaling and
  guest callback
  7. performance results
  
  Test
  
  1. TCP_STREAM single stream test for 1K to 4K message size
  2. TCP_RR (64 instance test): 128 - 1K request/response size
  
  Different hacks
  ---
  1. Base line data ( with the patch to fix capacity check first,
  free_old_xmit_skbs returns number of skbs)
  
  2. Drop packet data (will put some debugging in generic networking
 code)

Since I found that the netif queue stop/wake up is so expensive, I
created a dropping packets patch on guest side so I don't need to debug
generic networking code.

guest start_xmit()
    capacity = free_old_xmit_skb() + virtqueue_get_num_freed()
    if (capacity == 0)
        drop this packet;
        return;

In the patch, both guest TX interrupts and callback have been omitted.
Host vhost_signal in handle_tx can totally be removed as well. (A new
virtio_ring API is needed for exporting total of num_free descriptors
here -- virtioqueue_get_num_freed)
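
For illustration, a rough C sketch of the drop-on-full path described above.
It follows the pseudo code, not the upstream driver; virtqueue_get_num_freed()
is the hypothetical new virtio_ring API named in this mail, and stats/error
handling is simplified:

/* Sketch only: drop-on-full TX path, assuming virtqueue_get_num_freed()
 * returns the number of currently free descriptors. */
static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
{
    struct virtnet_info *vi = netdev_priv(dev);
    int capacity;

    /* Reclaim completed buffers, then check remaining ring space. */
    capacity = free_old_xmit_skbs(vi) + virtqueue_get_num_freed(vi->svq);
    if (capacity == 0) {
        /* Ring full: drop instead of stopping the queue, so no TX
         * completion interrupt is needed at all. */
        dev->stats.tx_dropped++;
        dev_kfree_skb_any(skb);
        return NETDEV_TX_OK;
    }

    xmit_skb(vi, skb);
    virtqueue_kick(vi->svq);
    return NETDEV_TX_OK;
}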

Initial TCP_STREAM performance results I got for guest to local host 
4.2Gb/s for 1K message size, (vs. 2.5Gb/s)
6.2Gb/s for 2K message size, and (vs. 3.8Gb/s)
9.8Gb/s for 4K message size. (vs. 5.xGb/s)

Since the large message size (64K) doesn't hit the (capacity == 0) case, the
performance is only a little better. (from 13.xGb/s to 14.x Gb/s)

kvm_stat output shows significant exits reduction for both VM and I/O,
no guest TX interrupts.

With dropping packets, TCP retransmissions have increased here, so the
performance numbers vary.

This might not be a good solution, but it gave us some idea of how expensive
the netif queue stop/wake up and guest/host notification are.

I couldn't find a better solution on how to reduce netif queue stop/wake
up rate for small message size. But I think once we can address this,
the guest TX performance will burst for small message size.

I also compared this with the return-TX_BUSY approach when (capacity == 0);
it is not as good as dropping packets.

  3. Delay guest netif queue wake up until certain descriptors (1/2
 ring
  size, 1/4 ring size...) are available once the queue has stopped.
  
  4. Accumulate more packets per vhost signal in handle_tx?
  
  5. 3 & 4 combinations
  
  6. Accumulate more packets per guest kick() (TCP_RR) by adding a
 timer? 
  
  7. Accumulate more packets per vhost handle_tx() by adding some
 delay?
  
   Haven't noticed that part, how does your patch make it
  handle more packets?
  
  Added a delay in handle_tx().
  
  What else?
  
  It would take sometimes to do this.
  
  Shirley
 
 
 Need to think about this.
 
 



Re: Network performance with small packets

2011-02-02 Thread Michael S. Tsirkin
On Wed, Feb 02, 2011 at 01:41:33PM -0800, Shirley Ma wrote:
 On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
   On Wed, 2011-02-02 at 22:17 +0200, Michael S. Tsirkin wrote:
Well, this is also the only case where the queue is stopped, no?
   Yes. I got some debugging data, I saw that sometimes there were so
  many
   packets were waiting for free in guest between vhost_signal & guest
  xmit
   callback.
  
  What does this mean?
 
 Let's look at the sequence here:
 
 guest start_xmit()
   xmit_skb()
   if ring is full,
   enable_cb()
 
 guest skb_xmit_done()
   disable_cb,
 printk free_old_xmit_skbs -- it was between more than 1/2 to
 full ring size
   printk vq->num_free
 
 vhost handle_tx()
   if (guest interrupt is enabled)
   signal guest to free xmit buffers
 
 So between guest queue full/stopped queue/enable call back to guest
 receives the callback from host to free_old_xmit_skbs, there were about
 1/2 to full ring size descriptors available. I thought there were only a
 few. (I disabled your vhost patch for this test.)


The expected number is vq->num - max skb frags - 2.

 
   Looks like the time spent too long from vhost_signal to guest
   xmit callback?
  
  
  
 I tried to accumulate multiple guest to host notifications for
  TX
xmits,
 it did help multiple streams TCP_RR results;
I don't see a point to delay used idx update, do you?
   
   It might cause per vhost handle_tx processed more packets.
  
  I don't understand. It's a couple of writes - what is the issue?
 
 Oh, handle_tx could process more packets per loop for multiple streams
 TCP_RR case. I need to print out the data rate per loop to confirm this.
 
 Shirley


Re: Network performance with small packets

2011-02-02 Thread Shirley Ma
On Thu, 2011-02-03 at 07:59 +0200, Michael S. Tsirkin wrote:
  Let's look at the sequence here:
  
  guest start_xmit()
xmit_skb()
if ring is full,
enable_cb()
  
  guest skb_xmit_done()
disable_cb,
  printk free_old_xmit_skbs -- it was between more than 1/2
 to
  full ring size
  printk vq->num_free
  
  vhost handle_tx()
if (guest interrupt is enabled)
signal guest to free xmit buffers
  
  So between guest queue full/stopped queue/enable call back to guest
  receives the callback from host to free_old_xmit_skbs, there were
 about
  1/2 to full ring size descriptors available. I thought there were
 only a
  few. (I disabled your vhost patch for this test.)
 
 
  The expected number is vq->num - max skb frags - 2.

It varied (up to the ring size, 256). This is using indirect buffers;
it returned how many descriptors were freed, not the number of buffers.

Why do you think it is vq->num - max skb frags - 2 here?

Shirley



Re: Network performance with small packets

2011-02-02 Thread Michael S. Tsirkin
On Wed, Feb 02, 2011 at 09:05:56PM -0800, Shirley Ma wrote:
 On Wed, 2011-02-02 at 23:20 +0200, Michael S. Tsirkin wrote:
   I think I need to define the test matrix to collect data for TX xmit
   from guest to host here for different tests.
   
   Data to be collected:
   -
   1. kvm_stat for VM, I/O exits
   2. cpu utilization for both guest and host
   3. cat /proc/interrupts on guest
   4. packets rate from vhost handle_tx per loop
   5. guest netif queue stop rate
   6. how many packets are waiting for free between vhost signaling and
   guest callback
   7. performance results
   
   Test
   
   1. TCP_STREAM single stream test for 1K to 4K message size
   2. TCP_RR (64 instance test): 128 - 1K request/response size
   
   Different hacks
   ---
   1. Base line data ( with the patch to fix capacity check first,
   free_old_xmit_skbs returns number of skbs)
   
   2. Drop packet data (will put some debugging in generic networking
  code)
 
 Since I found that the netif queue stop/wake up is so expensive, I
 created a dropping packets patch on guest side so I don't need to debug
 generic networking code.
 
 guest start_xmit()
   capacity = free_old_xmit_skb() + virtqueue_get_num_freed()
   if (capacity == 0)
   drop this packet;
   return;
 
 In the patch, both guest TX interrupts and callback have been omitted.
 Host vhost_signal in handle_tx can totally be removed as well. (A new
 virtio_ring API is needed for exporting total of num_free descriptors
 here -- virtioqueue_get_num_freed)
 
 Initial TCP_STREAM performance results I got for guest to local host 
 4.2Gb/s for 1K message size, (vs. 2.5Gb/s)
 6.2Gb/s for 2K message size, and (vs. 3.8Gb/s)
 9.8Gb/s for 4K message size. (vs.5.xGb/s)

What is the average packet size, # bytes per ack, and the # of interrupts
per packet? It could be that just slowing down transmission
makes GSO work better.

 Since large message size (64K) doesn't hit (capacity == 0) case, so the
 performance only has a little better. (from 13.xGb/s to 14.x Gb/s)
 
 kvm_stat output shows significant exits reduction for both VM and I/O,
 no guest TX interrupts.
 
 With dropping packets, TCP retrans has been increased here, so I can see
 performance numbers are various.
 
 This might be not a good solution, but it gave us some ideas on
 expensive netif queue stop/wake up between guest and host notification.
 
 I couldn't find a better solution on how to reduce netif queue stop/wake
 up rate for small message size. But I think once we can address this,
 the guest TX performance will burst for small message size.
 
 I also compared this with return TX_BUSY approach when (capacity == 0),
 it is not as good as dropping packets.
 
   3. Delay guest netif queue wake up until certain descriptors (1/2
  ring
   size, 1/4 ring size...) are available once the queue has stopped.
   
   4. Accumulate more packets per vhost signal in handle_tx?
   
   5. 3 & 4 combinations
   
   6. Accumulate more packets per guest kick() (TCP_RR) by adding a
  timer? 
   
   7. Accumulate more packets per vhost handle_tx() by adding some
  delay?
   
Haven't noticed that part, how does your patch make it
   handle more packets?
   
   Added a delay in handle_tx().
   
   What else?
   
   It would take sometimes to do this.
   
   Shirley
  
  
  Need to think about this.
  
  


Re: Network performance with small packets

2011-02-02 Thread Michael S. Tsirkin
On Wed, Feb 02, 2011 at 10:09:14PM -0800, Shirley Ma wrote:
 On Thu, 2011-02-03 at 07:59 +0200, Michael S. Tsirkin wrote:
   Let's look at the sequence here:
   
   guest start_xmit()
 xmit_skb()
 if ring is full,
 enable_cb()
   
   guest skb_xmit_done()
 disable_cb,
   printk free_old_xmit_skbs -- it was between more than 1/2
  to
   full ring size
  printk vq->num_free
   
   vhost handle_tx()
 if (guest interrupt is enabled)
 signal guest to free xmit buffers
   
   So between guest queue full/stopped queue/enable call back to guest
   receives the callback from host to free_old_xmit_skbs, there were
  about
   1/2 to full ring size descriptors available. I thought there were
  only a
   few. (I disabled your vhost patch for this test.)
  
  
  The expected number is vq->num - max skb frags - 2.
 
 It was various (up to the ring size 256). This is using indirection
 buffers, it returned how many freed descriptors, not number of buffers.
 
 Why do you think it is vq->num - max skb frags - 2 here?
 
 Shirley

Well, the queue is stopped, which happens when:

if (capacity < 2+MAX_SKB_FRAGS) {
    netif_stop_queue(dev);
    if (unlikely(!virtqueue_enable_cb(vi->svq))) {
        /* More just got used, free them then recheck.
         * */
        capacity += free_old_xmit_skbs(vi);
        if (capacity >= 2+MAX_SKB_FRAGS) {
            netif_start_queue(dev);
            virtqueue_disable_cb(vi->svq);
        }
    }
}

This should be the most common case.
I guess the case with += free_old_xmit_skbs is what can get us more.
But it should be rare. Can you count how common it is?
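
Purely as an illustration of that counting, a debug-only variant of the
snippet above; the recheck_hits counter and the printk are made up, the rest
mirrors the existing driver logic:

if (capacity < 2+MAX_SKB_FRAGS) {
    static unsigned long recheck_hits;

    netif_stop_queue(dev);
    if (unlikely(!virtqueue_enable_cb(vi->svq))) {
        /* More just got used, free them then recheck. */
        capacity += free_old_xmit_skbs(vi);
        recheck_hits++;
        if (printk_ratelimit())
            printk(KERN_DEBUG "virtio_net: recheck path hit %lu times\n",
                   recheck_hits);
        if (capacity >= 2+MAX_SKB_FRAGS) {
            netif_start_queue(dev);
            virtqueue_disable_cb(vi->svq);
        }
    }
}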

-- 
MST


Re: Network performance with small packets

2011-02-01 Thread Shirley Ma
On Tue, 2011-02-01 at 22:17 +0200, Michael S. Tsirkin wrote:
 On Tue, Feb 01, 2011 at 12:09:03PM -0800, Shirley Ma wrote:
  On Tue, 2011-02-01 at 19:23 +0200, Michael S. Tsirkin wrote:
   On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote:
On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote:
  Interesting. Could this be a variant of the now famous
 bufferbloat then?
 
 Sigh, bufferbloat is the new global warming... :-/ 

Yep, some places become colder, some other places become warmer;
   Same as
BW results, sometimes faster, sometimes slower. :)

Shirley
   
   Sent a tuning patch (v2) that might help.
   Could you try it and play with the module parameters please? 
  
  Hello Michael,
  
  Sure I will play with this patch to see how it could help. 
  
  I am looking at guest side as well, I found a couple issues on guest
  side:
  
  1. free_old_xmit_skbs() should return the number of skbs instead of
 the
  total of sgs since we are using ring size to stop/start netif queue.
  static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
  {
      struct sk_buff *skb;
      unsigned int len, tot_sgs = 0;

      while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
          pr_debug("Sent skb %p\n", skb);
          vi->dev->stats.tx_bytes += skb->len;
          vi->dev->stats.tx_packets++;
          tot_sgs += skb_vnet_hdr(skb)->num_sg;
          dev_kfree_skb_any(skb);
      }
      return tot_sgs;  <-- should return numbers of skbs to track
      ring usage here, I think;
  }
  
  Did the old guest use number of buffers to track ring usage before?
  
  2. In start_xmit, I think we should move capacity +=
 free_old_xmit_skbs
  before netif_stop_queue(); so we avoid unnecessary netif queue
  stop/start. This condition is heavily hit for small message size.
  
  Also we capacity checking condition should change to something like
 half
  of the vring.num size, instead of comparing 2+MAX_SKB_FRAGS?
  
 if (capacity < 2+MAX_SKB_FRAGS) {
      netif_stop_queue(dev);
      if (unlikely(!virtqueue_enable_cb(vi->svq))) {
          /* More just got used, free them then
             recheck. */
          capacity += free_old_xmit_skbs(vi);
          if (capacity >= 2+MAX_SKB_FRAGS) {
              netif_start_queue(dev);
              virtqueue_disable_cb(vi->svq);
          }
      }
  }
  
  3. Looks like the xmit callback is only used to wake the queue when
 the
  queue has stopped, right? Should we put a condition check here?
  static void skb_xmit_done(struct virtqueue *svq)
  {
      struct virtnet_info *vi = svq->vdev->priv;

      /* Suppress further interrupts. */
      virtqueue_disable_cb(svq);

      /* We were probably waiting for more output buffers. */
  --->  if (netif_queue_stopped(vi->dev))
          netif_wake_queue(vi->dev);
  }
  
  
  Shirley
 
 Well the return value is used to calculate capacity and that counts
 the # of s/g. No?

Nope, the current guest kernel uses descriptors, not the number of sgs. I am
not sure about the old guest.

 From cache utilization POV it might be better to read from the skb and
 not peek at virtio header though...
 Pls Cc the lists on any discussions in the future.
 
 -- 
 MST

Sorry I missed reply all. :(

Shirley



Re: Network performance with small packets

2011-02-01 Thread Shirley Ma
On Mon, 2011-01-31 at 17:30 -0800, Sridhar Samudrala wrote:
 Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.
 
 I tried a simpler version of this patch without any tunables by
 delaying the signaling until we come out of the for loop.
 It definitely reduced the number of vmexits significantly for small
 message
 guest to host stream test and the throughput went up a little.
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 9b3ca10..5f9fae9 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
         if (err != len)
             pr_debug("Truncated TX packet: "
                      " len %d != %zd\n", err, len);
 -       vhost_add_used_and_signal(&net->dev, vq, head, 0);
 +       vhost_add_used(vq, head, 0);
         total_len += len;
         if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
             vhost_poll_queue(&vq->poll);
 @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
         }
     }
 
 +   if (total_len > 0)
 +       vhost_signal(&net->dev, vq);
     mutex_unlock(&vq->mutex);
  }

Reducing the signaling will reduce the CPU utilization by reducing VM
exits. 

The small message BW is a problem where we have seen a faster guest / slower
vhost; even when I increased VHOST_NET_WEIGHT several times, it didn't help
that much for BW. For large message sizes, vhost is able to process all
packets on time. I played around with guest/host code; so far I only see a
huge BW improvement by dropping packets on the guest side.

Thanks
Shirley



Re: Network performance with small packets

2011-02-01 Thread Michael S. Tsirkin
On Tue, Feb 01, 2011 at 12:25:08PM -0800, Shirley Ma wrote:
 On Tue, 2011-02-01 at 22:17 +0200, Michael S. Tsirkin wrote:
  On Tue, Feb 01, 2011 at 12:09:03PM -0800, Shirley Ma wrote:
   On Tue, 2011-02-01 at 19:23 +0200, Michael S. Tsirkin wrote:
On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote:
 On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote:
   Interesting. Could this be a variant of the now famous
  bufferbloat then?
  
  Sigh, bufferbloat is the new global warming... :-/ 
 
 Yep, some places become colder, some other places become warmer;
Same as
 BW results, sometimes faster, sometimes slower. :)
 
 Shirley

Sent a tuning patch (v2) that might help.
Could you try it and play with the module parameters please? 
   
   Hello Michael,
   
   Sure I will play with this patch to see how it could help. 
   
   I am looking at guest side as well, I found a couple issues on guest
   side:
   
   1. free_old_xmit_skbs() should return the number of skbs instead of
  the
   total of sgs since we are using ring size to stop/start netif queue.
   static unsigned int free_old_xmit_skbs(struct virtnet_info *vi)
   {
       struct sk_buff *skb;
       unsigned int len, tot_sgs = 0;

       while ((skb = virtqueue_get_buf(vi->svq, &len)) != NULL) {
           pr_debug("Sent skb %p\n", skb);
           vi->dev->stats.tx_bytes += skb->len;
           vi->dev->stats.tx_packets++;
           tot_sgs += skb_vnet_hdr(skb)->num_sg;
           dev_kfree_skb_any(skb);
       }
       return tot_sgs;  <-- should return numbers of skbs to track
       ring usage here, I think;
   }
   
   Did the old guest use number of buffers to track ring usage before?
   
   2. In start_xmit, I think we should move capacity +=
  free_old_xmit_skbs
   before netif_stop_queue(); so we avoid unnecessary netif queue
   stop/start. This condition is heavily hit for small message size.
   
   Also we capacity checking condition should change to something like
  half
   of the vring.num size, instead of comparing 2+MAX_SKB_FRAGS?
   
  if (capacity < 2+MAX_SKB_FRAGS) {
       netif_stop_queue(dev);
       if (unlikely(!virtqueue_enable_cb(vi->svq))) {
           /* More just got used, free them then
              recheck. */
           capacity += free_old_xmit_skbs(vi);
           if (capacity >= 2+MAX_SKB_FRAGS) {
               netif_start_queue(dev);
               virtqueue_disable_cb(vi->svq);
           }
       }
   }
   
   3. Looks like the xmit callback is only used to wake the queue when
  the
   queue has stopped, right? Should we put a condition check here?
   static void skb_xmit_done(struct virtqueue *svq)
   {
       struct virtnet_info *vi = svq->vdev->priv;

       /* Suppress further interrupts. */
       virtqueue_disable_cb(svq);

       /* We were probably waiting for more output buffers. */
   --->  if (netif_queue_stopped(vi->dev))
           netif_wake_queue(vi->dev);
   }
   
   
   Shirley
  
  Well the return value is used to calculate capacity and that counts
  the # of s/g. No?
 
 Nope, the current guest kernel uses descriptors not number of sgs.

Confused. We compare capacity to skb frags, no?
That's sg I think ...

 not sure the old guest.
 
  From cache utilization POV it might be better to read from the skb and
  not peek at virtio header though...
  Pls Cc the lists on any discussions in the future.
  
  -- 
  MST
 
 Sorry I missed reply all. :(
 
 Shirley


Re: Network performance with small packets

2011-02-01 Thread Michael S. Tsirkin
On Tue, Feb 01, 2011 at 01:09:45PM -0800, Shirley Ma wrote:
 On Mon, 2011-01-31 at 17:30 -0800, Sridhar Samudrala wrote:
  Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.
  
  I tried a simpler version of this patch without any tunables by
  delaying the signaling until we come out of the for loop.
  It definitely reduced the number of vmexits significantly for small
  message
  guest to host stream test and the throughput went up a little.
  
  diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
  index 9b3ca10..5f9fae9 100644
  --- a/drivers/vhost/net.c
  +++ b/drivers/vhost/net.c
  @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
          if (err != len)
              pr_debug("Truncated TX packet: "
                       " len %d != %zd\n", err, len);
  -       vhost_add_used_and_signal(&net->dev, vq, head, 0);
  +       vhost_add_used(vq, head, 0);
          total_len += len;
          if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
              vhost_poll_queue(&vq->poll);
  @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
          }
      }
  
  +   if (total_len > 0)
  +       vhost_signal(&net->dev, vq);
      mutex_unlock(&vq->mutex);
   }
 
 Reducing the signaling will reduce the CPU utilization by reducing VM
 exits. 
 
 The small message BW is a problem we have seen faster guest/slow vhost,
 even I increased VHOST_NET_WEIGHT times, it didn't help that much for
 BW. For large message size, vhost is able to process all packets on
 time. I played around with guest/host codes, I only see huge BW
 improvement by dropping packets on guest side so far.
 
 Thanks
 Shirley


My theory is that the issue is not signalling.
Rather, our queue fills up, then host handles
one packet and sends an interrupt, and we
immediately wake the queue. So the vq
once it gets full, stays full.

If you try my patch with bufs threshold set to e.g.
half the vq, what we will do is send interrupt after we have processed
half the vq.  So host has half the vq to go, and guest has half the vq
to fill.

See?
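
To make the idea concrete, a very rough sketch of that kind of batching on
the vhost side (this is not the actual patch, which appears later in the
thread; the helper name, the coalesced counter and the threshold are
illustrative):

/* Complete one TX buffer; signal the guest only once per 'threshold'
 * completions instead of per packet. */
static void tx_complete_coalesced(struct vhost_net *net,
                                  struct vhost_virtqueue *vq,
                                  unsigned int head, int *coalesced,
                                  int threshold)
{
    /* The used ring is still updated per packet. */
    vhost_add_used(vq, head, 0);
    if (++(*coalesced) >= threshold) {
        vhost_signal(&net->dev, vq);    /* one interrupt per batch */
        *coalesced = 0;
    }
}

The caller would flush any remainder with one more vhost_signal() after the
loop, which is essentially what the simpler delayed-signal diff earlier in
the thread does for the whole batch.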

-- 
MST


Re: Network performance with small packets

2011-02-01 Thread Shirley Ma
On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
 Confused. We compare capacity to skb frags, no?
 That's sg I think ...

The current guest kernel uses indirect buffers; num_free returns how many
descriptors are available, not skb frags. So it's wrong here.

Shirley



Re: Network performance with small packets

2011-02-01 Thread Shirley Ma
On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote:
 My theory is that the issue is not signalling.
 Rather, our queue fills up, then host handles
 one packet and sends an interrupt, and we
 immediately wake the queue. So the vq
 once it gets full, stays full.

From the printk debugging output, it might not be exactly the case. The
ring gets full, run a bit, then gets full, then run a bit, then full...

 If you try my patch with bufs threshold set to e.g.
 half the vq, what we will do is send interrupt after we have processed
 half the vq.  So host has half the vq to go, and guest has half the vq
 to fill.
 
 See?

I am cleaning up my set up to run your patch ...

Shirley




Re: Network performance with small packets

2011-02-01 Thread Michael S. Tsirkin
On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
 On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
  Confused. We compare capacity to skb frags, no?
  That's sg I think ...
 
 Current guest kernel use indirect buffers, num_free returns how many
 available descriptors not skb frags. So it's wrong here.
 
 Shirley

I see. Good point. In other words, when we complete the buffer
it was indirect, but when we add a new one we
cannot allocate indirect, so we consume more descriptors.
And then we start the queue and the add will fail.
I guess we need some kind of API to figure out
whether the buf we complete was indirect?

Another failure mode is when skb_xmit_done
wakes the queue: it might be too early, there
might not be space for the next packet in the vq yet.

A solution might be to keep some kind of pool
around for indirect, we wanted to do it for block anyway ...
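
For what it's worth, a minimal sketch of what such a pool could look like;
none of these names exist in the kernel, and allocation, sizing and DMA
details are glossed over:

/* Preallocated pool of indirect descriptor tables (illustrative only). */
struct indirect_pool {
    struct vring_desc *tables;  /* pool_size tables of max_sg entries each */
    unsigned long *bitmap;      /* one bit per table: set = in use */
    unsigned long pool_size;
    unsigned int max_sg;
    spinlock_t lock;
};

static struct vring_desc *indirect_get(struct indirect_pool *p)
{
    unsigned long i, flags;

    spin_lock_irqsave(&p->lock, flags);
    i = find_first_zero_bit(p->bitmap, p->pool_size);
    if (i < p->pool_size)
        __set_bit(i, p->bitmap);
    spin_unlock_irqrestore(&p->lock, flags);

    /* NULL means the pool is exhausted: fall back to direct descriptors. */
    return i < p->pool_size ? &p->tables[i * p->max_sg] : NULL;
}

static void indirect_put(struct indirect_pool *p, struct vring_desc *table)
{
    unsigned long i = (table - p->tables) / p->max_sg, flags;

    spin_lock_irqsave(&p->lock, flags);
    __clear_bit(i, p->bitmap);
    spin_unlock_irqrestore(&p->lock, flags);
}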

-- 
MST


Re: Network performance with small packets

2011-02-01 Thread Michael S. Tsirkin
On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote:
 On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote:
  My theory is that the issue is not signalling.
  Rather, our queue fills up, then host handles
  one packet and sends an interrupt, and we
  immediately wake the queue. So the vq
  once it gets full, stays full.
 
 From the printk debugging output, it might not be exactly the case. The
 ring gets full, run a bit, then gets full, then run a bit, then full...

Yes, but does it get even half empty in between?

  If you try my patch with bufs threshold set to e.g.
  half the vq, what we will do is send interrupt after we have processed
  half the vq.  So host has half the vq to go, and guest has half the vq
  to fill.
  
  See?
 
 I am cleaning up my set up to run your patch ...
 
 Shirley
 


Re: Network performance with small packets

2011-02-01 Thread Shirley Ma
On Tue, 2011-02-01 at 23:42 +0200, Michael S. Tsirkin wrote:
 On Tue, Feb 01, 2011 at 01:32:35PM -0800, Shirley Ma wrote:
  On Tue, 2011-02-01 at 23:24 +0200, Michael S. Tsirkin wrote:
   My theory is that the issue is not signalling.
   Rather, our queue fills up, then host handles
   one packet and sends an interrupt, and we
   immediately wake the queue. So the vq
   once it gets full, stays full.
  
  From the printk debugging output, it might not be exactly the case.
 The
  ring gets full, run a bit, then gets full, then run a bit, then
 full...
 
 Yes, but does it get even half empty in between?

Sometimes, but most of the time it is not half empty in between. But printk slows
down the traffic, so it's not accurate. I think your patch will improve
the performance if it signals the guest when half of the ring size is
empty.

But you manage the signal by using TX bytes; I would like to change it to
half of the ring size instead for signaling. Is that OK?

Shirley





Re: Network performance with small packets

2011-02-01 Thread Shirley Ma
On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote:
 There are flags for bytes, buffers and packets.
 Try playing with any one of them :)
 Just be sure to use v2.
 
 
 I would like to change it to
  half of the ring size instead for signaling. Is that OK?
  
  Shirley
  
  
 
 Sure that is why I made it a parameter so you can experiment. 

The initial test results show that CPU utilization has been reduced somewhat
and BW has increased somewhat with the default parameters; for example, 1K
message size BW goes from 2.5Gb/s to about 2.8Gb/s and CPU utilization drops
from 4x% to 38% (similar results to the patch I submitted a while ago to
reduce signaling on vhost), but still far away from the dropping-packet
results.

I am going to change the code to use 1/2 ring size to wake the netif
queue.

Shirley



Re: Network performance with small packets

2011-02-01 Thread Krishna Kumar2
 Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM

 On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
  On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
   Confused. We compare capacity to skb frags, no?
   That's sg I think ...
 
  Current guest kernel use indirect buffers, num_free returns how many
  available descriptors not skb frags. So it's wrong here.
 
  Shirley

 I see. Good point. In other words when we complete the buffer
 it was indirect, but when we add a new one we
 can not allocate indirect so we consume.
 And then we start the queue and add will fail.
 I guess we need some kind of API to figure out
 whether the buf we complete was indirect?

 Another failure mode is when skb_xmit_done
 wakes the queue: it might be too early, there
 might not be space for the next packet in the vq yet.

I am not sure if this is the problem - shouldn't you
see these messages:
    if (likely(capacity == -ENOMEM)) {
        dev_warn(&dev->dev,
                 "TX queue failure: out of memory\n");
    } else {
        dev->stats.tx_fifo_errors++;
        dev_warn(&dev->dev,
                 "Unexpected TX queue failure: %d\n",
                 capacity);
    }
in next xmit? I am not getting this in my testing.

 A solution might be to keep some kind of pool
 around for indirect, we wanted to do it for block anyway ...

Your vhost patch should fix this automatically. Right?

Thanks,

- KK



Re: Network performance with small packets

2011-02-01 Thread Michael S. Tsirkin
On Tue, Feb 01, 2011 at 02:59:57PM -0800, Shirley Ma wrote:
 On Tue, 2011-02-01 at 23:56 +0200, Michael S. Tsirkin wrote:
  There are flags for bytes, buffers and packets.
  Try playing with any one of them :)
  Just be sure to use v2.
  
  
  I would like to change it to
   half of the ring size instead for signaling. Is that OK?
   
   Shirley
   
   
  
  Sure that is why I made it a parameter so you can experiment. 
 
 The initial test results shows that the CPUs utilization has been
 reduced some, and BW has increased some with the default parameters,
 like 1K message size BW goes from 2.5Gb/s about 2.8Gb/s, CPU utilization
 down from 4x% to 38%, (Similar results from the patch I submitted a
 while ago to reduce signaling on vhost) but far away from dropping
 packet results.
 
 I am going to change the code to use 1/2 ring size to wake the netif
 queue.
 
 Shirley

Just tweak the parameters with sysfs, you do not have to edit the code:
echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce

Or in a similar way for tx_packets_coalesce (since we use indirect,
packets will typically use 1 buffer each).

-- 
MST


Re: Network performance with small packets

2011-02-01 Thread Michael S. Tsirkin
On Wed, Feb 02, 2011 at 10:09:18AM +0530, Krishna Kumar2 wrote:
  Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM
 
  On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
   On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
Confused. We compare capacity to skb frags, no?
That's sg I think ...
  
   Current guest kernel use indirect buffers, num_free returns how many
   available descriptors not skb frags. So it's wrong here.
  
   Shirley
 
  I see. Good point. In other words when we complete the buffer
  it was indirect, but when we add a new one we
  can not allocate indirect so we consume.
  And then we start the queue and add will fail.
  I guess we need some kind of API to figure out
  whether the buf we complete was indirect?
 
  Another failure mode is when skb_xmit_done
  wakes the queue: it might be too early, there
  might not be space for the next packet in the vq yet.
 
 I am not sure if this is the problem - shouldn't you
 see these messages:
   if (likely(capacity == -ENOMEM)) {
       dev_warn(&dev->dev,
                "TX queue failure: out of memory\n");
   } else {
       dev->stats.tx_fifo_errors++;
       dev_warn(&dev->dev,
                "Unexpected TX queue failure: %d\n",
                capacity);
   }
 in next xmit? I am not getting this in my testing.

Yes, I don't think we hit this in our testing,
simply because we don't stress memory.
Disable indirect, then you might see this.

  A solution might be to keep some kind of pool
  around for indirect, we wanted to do it for block anyway ...
 
 Your vhost patch should fix this automatically. Right?

Reduce the chance of it happening, yes.

 
 Thanks,
 
 - KK


Re: Network performance with small packets

2011-02-01 Thread Shirley Ma
On Wed, 2011-02-02 at 06:40 +0200, Michael S. Tsirkin wrote:
 Just tweak the parameters with sysfs, you do not have to edit the code:
 echo 64 > /sys/module/vhost_net/parameters/tx_bufs_coalesce
 
 Or in a similar way for tx_packets_coalesce (since we use indirect,
 packets will typically use 1 buffer each).

We should use packets instead of buffers; in the indirect case, one packet
has multiple buffers, but each packet uses one descriptor from the ring
(default size is 256).

echo 128 > /sys/module/vhost_net/parameters/tx_packets_coalesce

The way I am changing is only when netif queue has stopped, then we
start to count num_free descriptors to send the signal to wake netif
queue.

Shirley



Re: Network performance with small packets

2011-02-01 Thread Shirley Ma
On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
 
 The way I am changing is only when netif queue has stopped, then we
 start to count num_free descriptors to send the signal to wake netif
 queue. 

I forgot to mention, the code change I am making is in the guest kernel: in the
xmit callback, only wake up the queue when it's stopped && num_free >=
1/2 * vq->num. I add a new API in virtio_ring.
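
Illustrative only: one possible shape for that helper, assuming it lives in
virtio_ring.c next to the other virtqueue_* functions (the name is made up;
num_free and vring.num are the real struct vring_virtqueue fields):

bool virtqueue_half_free(struct virtqueue *_vq)
{
    struct vring_virtqueue *vq = to_vvq(_vq);

    /* True once at least half the ring's descriptors are free again. */
    return vq->num_free >= vq->vring.num / 2;
}
EXPORT_SYMBOL_GPL(virtqueue_half_free);

The xmit callback then wakes the queue only when it is stopped and
virtqueue_half_free() returns true, instead of waking it unconditionally.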

However vhost signaling reduction is needed as well. The patch I
submitted a while ago showed both CPUs and BW improvement.

Thanks
Shirley



Re: Network performance with small packets

2011-02-01 Thread Michael S. Tsirkin
On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
 On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
  
  The way I am changing is only when netif queue has stopped, then we
  start to count num_free descriptors to send the signal to wake netif
  queue. 
 
 I forgot to mention, the code change I am making is in guest kernel, in
 xmit call back only wake up the queue when it's stopped && num_free >=
 1/2 * vq->num, I add a new API in virtio_ring.

Interesting. Yes, I agree an API extension would be helpful. However,
wouldn't just the signaling reduction be enough, without guest changes?

 However vhost signaling reduction is needed as well. The patch I
 submitted a while ago showed both CPUs and BW improvement.
 
 Thanks
 Shirley

Which patch was that?

-- 
MST


Re: Network performance with small packets

2011-02-01 Thread Krishna Kumar2
 On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
 
  The way I am changing is only when netif queue has stopped, then we
  start to count num_free descriptors to send the signal to wake netif
  queue.

 I forgot to mention, the code change I am making is in guest kernel, in
 xmit call back only wake up the queue when it's stopped && num_free >=
 1/2 * vq->num, I add a new API in virtio_ring.

FYI :)

I have tried this before. There are a couple of issues:

1. the free count will not reduce until you run free_old_xmit_skbs,
   which will not run anymore since the tx queue is stopped.
2. You cannot call free_old_xmit_skbs directly as it races with a
   queue that was just awakened (current cb was due to the delay
   in disabling cb's).

You have to call free_old_xmit_skbs() under netif_queue_stopped()
check to avoid the race.
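
A sketch of that ordering in the callback (locking details and the half-ring
threshold discussed elsewhere in the thread are omitted for brevity):

static void skb_xmit_done(struct virtqueue *svq)
{
    struct virtnet_info *vi = svq->vdev->priv;

    /* Suppress further interrupts. */
    virtqueue_disable_cb(svq);

    if (netif_queue_stopped(vi->dev)) {
        /* Per the reasoning above, start_xmit does not run while the
         * queue is stopped, so reclaiming here does not race with it. */
        free_old_xmit_skbs(vi);
        netif_wake_queue(vi->dev);
    }
}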

I got a small improvement in my testing up to some number of threads
(32 or 48?), but beyond that I was getting a regression.

Thanks,

- KK

 However vhost signaling reduction is needed as well. The patch I
 submitted a while ago showed both CPUs and BW improvement.



Re: Network performance with small packets

2011-02-01 Thread Shirley Ma
On Wed, 2011-02-02 at 12:04 +0530, Krishna Kumar2 wrote:
  On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
  
   The way I am changing is only when netif queue has stopped, then
 we
   start to count num_free descriptors to send the signal to wake
 netif
   queue.
 
  I forgot to mention, the code change I am making is in guest kernel,
 in
  xmit call back only wake up the queue when it's stopped && num_free
 >=
  1/2 * vq->num, I add a new API in virtio_ring.
 
 FYI :)

 I have tried this before. There are a couple of issues:
 
 1. the free count will not reduce until you run free_old_xmit_skbs,
which will not run anymore since the tx queue is stopped.
 2. You cannot call free_old_xmit_skbs directly as it races with a
queue that was just awakened (current cb was due to the delay
in disabling cb's).
 
 You have to call free_old_xmit_skbs() under netif_queue_stopped()
 check to avoid the race.

Yes, that's what I did: when the netif queue stops, don't enable the
queue, just free_old_xmit_skbs(); if not enough is freed, then enable the
callback until half of the ring size is freed, then wake the netif
queue. But somehow I didn't reach the performance of dropping
packets; need to think about it more. :)

Thanks
Shirley



Re: Network performance with small packets

2011-02-01 Thread Shirley Ma
On Wed, 2011-02-02 at 08:29 +0200, Michael S. Tsirkin wrote:
 On Tue, Feb 01, 2011 at 10:19:09PM -0800, Shirley Ma wrote:
  On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
   
   The way I am changing is only when netif queue has stopped, then
 we
   start to count num_free descriptors to send the signal to wake
 netif
   queue. 
  
  I forgot to mention, the code change I am making is in guest kernel,
 in
  xmit call back only wake up the queue when it's stopped && num_free
 >=
  1/2 * vq->num, I add a new API in virtio_ring.
 
 Interesting. Yes, I agree an API extension would be helpful. However,
 wouldn't just the signaling reduction be enough, without guest
 changes?

w/i guest change, I played around with the parameters; for example, I could
get 3.7Gb/s with 42% CPU (BW increasing from 2.5Gb/s) for 1K message size;
w/i dropping packets, I was able to get up to 6.2Gb/s with similar CPU
usage.

  However vhost signaling reduction is needed as well. The patch I
  submitted a while ago showed both CPUs and BW improvement.
  
  Thanks
  Shirley
 
 Which patch was that? 

The patch was called "vhost: TX used buffer guest signal accumulation".
You suggested to split add_used_bufs and signal. I am still thinking about
what's the best approach to coordinate guest (virtio_kick) and
vhost (handle_tx), and vhost (signaling) and guest (xmit callback), to reduce
the overheads, so I haven't submitted the new patch yet.

Thanks
Shirley



Re: Network performance with small packets

2011-02-01 Thread Shirley Ma
On Tue, 2011-02-01 at 23:14 -0800, Shirley Ma wrote:
 w/i guest change, I played around the parameters, for example: I could
 get 3.7Gb/s with 42% CPU, BW increasing from 2.5Gb/s for 1K message
 size;
 w/i dropping packets, I was able to get up to 6.2Gb/s with similar CPU
 usage.

I meant w/o guest change, only vhost changes. Sorry about that.

Shirley



Re: Network performance with small packets

2011-02-01 Thread Krishna Kumar2
 Shirley Ma mashi...@us.ibm.com wrote:

  I have tried this before. There are a couple of issues:
 
  1. the free count will not reduce until you run free_old_xmit_skbs,
 which will not run anymore since the tx queue is stopped.
  2. You cannot call free_old_xmit_skbs directly as it races with a
 queue that was just awakened (current cb was due to the delay
 in disabling cb's).
 
  You have to call free_old_xmit_skbs() under netif_queue_stopped()
  check to avoid the race.

 Yes, that's what I did: when the netif queue stops, don't enable the
 queue, just call free_old_xmit_skbs(); if not enough is freed, enable the
 callback until half of the ring size is freed, then wake the netif
 queue. But somehow I didn't reach the performance of dropping
 packets, need to think about it more. :)

Did you check if the number of vmexits increased with this
patch? This is possible if the device was keeping up (and
not going into a stop, start, xmit 1 packet, stop, start
loop). Also maybe you should try for 1/4th instead of 1/2?

MST's delayed signalling should avoid this issue; I haven't
tried both together.

Thanks,

- KK



Re: Network performance with small packets

2011-01-31 Thread Steve Dobbelstein
Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:

 OK, so thinking about it more, maybe the issue is this:
 tx becomes full. We process one request and interrupt the guest,
 then it adds one request and the queue is full again.

 Maybe the following will help it stabilize?
 By itself it does nothing, but if you set
 all the parameters to a huge value we will
 only interrupt when we see an empty ring.
 Which might be too much: pls try other values
 in the middle: e.g. make bufs half the ring,
 or bytes some small value, or packets some
 small value etc.

 Warning: completely untested.

 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index aac05bc..6769cdc 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -32,6 +32,13 @@
   * Using this limit prevents one virtqueue from starving others. */
  #define VHOST_NET_WEIGHT 0x80000

 +int tx_bytes_coalesce = 0;
 +module_param(tx_bytes_coalesce, int, 0644);
 +int tx_bufs_coalesce = 0;
 +module_param(tx_bufs_coalesce, int, 0644);
 +int tx_packets_coalesce = 0;
 +module_param(tx_packets_coalesce, int, 0644);
 +
  enum {
 VHOST_NET_VQ_RX = 0,
 VHOST_NET_VQ_TX = 1,
 @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
 int err, wmem;
 size_t hdr_size;
 struct socket *sock;
 +   int bytes_coalesced = 0;
 +   int bufs_coalesced = 0;
 +   int packets_coalesced = 0;

 /* TODO: check that we are running from vhost_worker? */
 sock = rcu_dereference_check(vq->private_data, 1);
 @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
if (err != len)
   pr_debug("Truncated TX packet: "
            " len %d != %zd\n", err, len);
 -  vhost_add_used_and_signal(&net->dev, vq, head, 0);
total_len += len;
 +  packets_coalesced += 1;
 +  bytes_coalesced += len;
 +  bufs_coalesced += in;

Should this instead be:
  bufs_coalesced += out;

Perusing the code I see that earlier there is a check to see if in is not
zero, and, if so, error out of the loop.  After the check, in is not
touched until it is added to bufs_coalesced, effectively not changing
bufs_coalesced, meaning bufs_coalesced will never trigger the conditions
below.

Or am I missing something?

 +  if (unlikely(packets_coalesced > tx_packets_coalesce ||
 +  bytes_coalesced > tx_bytes_coalesce ||
 +  bufs_coalesced > tx_bufs_coalesce))
 + vhost_add_used_and_signal(&net->dev, vq, head, 0);
 +  else
 + vhost_add_used(vq, head, 0);
if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
   vhost_poll_queue(&vq->poll);
   break;
}
 }

 +   if (likely(packets_coalesced > tx_packets_coalesce ||
 + bytes_coalesced > tx_bytes_coalesce ||
 + bufs_coalesced > tx_bufs_coalesce))
 +  vhost_signal(&net->dev, vq);
 mutex_unlock(&vq->mutex);
  }


Steve D.



Re: Network performance with small packets

2011-01-31 Thread Sridhar Samudrala
On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:
 
  OK, so thinking about it more, maybe the issue is this:
  tx becomes full. We process one request and interrupt the guest,
  then it adds one request and the queue is full again.
 
  Maybe the following will help it stabilize?
  By itself it does nothing, but if you set
  all the parameters to a huge value we will
  only interrupt when we see an empty ring.
  Which might be too much: pls try other values
  in the middle: e.g. make bufs half the ring,
  or bytes some small value, or packets some
  small value etc.
 
  Warning: completely untested.
 
  diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
  index aac05bc..6769cdc 100644
  --- a/drivers/vhost/net.c
  +++ b/drivers/vhost/net.c
  @@ -32,6 +32,13 @@
* Using this limit prevents one virtqueue from starving others. */
   #define VHOST_NET_WEIGHT 0x80000
 
  +int tx_bytes_coalesce = 0;
  +module_param(tx_bytes_coalesce, int, 0644);
  +int tx_bufs_coalesce = 0;
  +module_param(tx_bufs_coalesce, int, 0644);
  +int tx_packets_coalesce = 0;
  +module_param(tx_packets_coalesce, int, 0644);
  +
   enum {
  VHOST_NET_VQ_RX = 0,
  VHOST_NET_VQ_TX = 1,
  @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
  int err, wmem;
  size_t hdr_size;
  struct socket *sock;
  +   int bytes_coalesced = 0;
  +   int bufs_coalesced = 0;
  +   int packets_coalesced = 0;
 
  /* TODO: check that we are running from vhost_worker? */
  sock = rcu_dereference_check(vq->private_data, 1);
  @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 if (err != len)
pr_debug("Truncated TX packet: "
         " len %d != %zd\n", err, len);
  -  vhost_add_used_and_signal(&net->dev, vq, head, 0);
 total_len += len;
  +  packets_coalesced += 1;
  +  bytes_coalesced += len;
  +  bufs_coalesced += in;
 
 Should this instead be:
   bufs_coalesced += out;
 
 Perusing the code I see that earlier there is a check to see if in is not
 zero, and, if so, error out of the loop.  After the check, in is not
 touched until it is added to bufs_coalesced, effectively not changing
 bufs_coalesced, meaning bufs_coalesced will never trigger the conditions
 below.

Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.

I tried a simpler version of this patch without any tunables by
delaying the signaling until we come out of the for loop.
It definitely reduced the number of vmexits significantly for small message
guest to host stream test and the throughput went up a little.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9b3ca10..5f9fae9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
                 if (err != len)
                         pr_debug("Truncated TX packet: "
                                  " len %d != %zd\n", err, len);
-                vhost_add_used_and_signal(&net->dev, vq, head, 0);
+                vhost_add_used(vq, head, 0);
                 total_len += len;
                 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
                         vhost_poll_queue(&vq->poll);
@@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
                 }
         }
 
+        if (total_len > 0)
+                vhost_signal(&net->dev, vq);
         mutex_unlock(&vq->mutex);
 }
 

 
 Or am I missing something?
 
  +  if (unlikely(packets_coalesced > tx_packets_coalesce ||
  +  bytes_coalesced > tx_bytes_coalesce ||
  +  bufs_coalesced > tx_bufs_coalesce))
  + vhost_add_used_and_signal(&net->dev, vq, head, 0);
  +  else
  + vhost_add_used(vq, head, 0);
 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
vhost_poll_queue(&vq->poll);
break;
 }
  }

  +   if (likely(packets_coalesced > tx_packets_coalesce ||
  + bytes_coalesced > tx_bytes_coalesce ||
  + bufs_coalesced > tx_bufs_coalesce))
  +  vhost_signal(&net->dev, vq);
  mutex_unlock(&vq->mutex);
   }

It is possible that we can miss signaling the guest even after
processing a few pkts, if we don't hit any of these conditions.
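
One way to guarantee the flush, keeping the same counters as in the patch
above, is to reset them whenever the in-loop signal fires and then signal
once more at the end if anything is left over. A rough, untested sketch:

                /* Inside the loop: reset the counters after signaling. */
                if (unlikely(packets_coalesced > tx_packets_coalesce ||
                             bytes_coalesced > tx_bytes_coalesce ||
                             bufs_coalesced > tx_bufs_coalesce)) {
                        vhost_add_used_and_signal(&net->dev, vq, head, 0);
                        packets_coalesced = bytes_coalesced = bufs_coalesced = 0;
                } else
                        vhost_add_used(vq, head, 0);

        /* After the loop: flush whatever was added but never signaled. */
        if (packets_coalesced)
                vhost_signal(&net->dev, vq);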

 
 
 Steve D.
 



Re: Network performance with small packets

2011-01-31 Thread Michael S. Tsirkin
On Mon, Jan 31, 2011 at 06:24:34PM -0600, Steve Dobbelstein wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:
 
  OK, so thinking about it more, maybe the issue is this:
  tx becomes full. We process one request and interrupt the guest,
  then it adds one request and the queue is full again.
 
  Maybe the following will help it stabilize?
  By itself it does nothing, but if you set
  all the parameters to a huge value we will
  only interrupt when we see an empty ring.
  Which might be too much: pls try other values
  in the middle: e.g. make bufs half the ring,
  or bytes some small value, or packets some
  small value etc.
 
  Warning: completely untested.
 
  diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
  index aac05bc..6769cdc 100644
  --- a/drivers/vhost/net.c
  +++ b/drivers/vhost/net.c
  @@ -32,6 +32,13 @@
* Using this limit prevents one virtqueue from starving others. */
    #define VHOST_NET_WEIGHT 0x80000
 
  +int tx_bytes_coalesce = 0;
  +module_param(tx_bytes_coalesce, int, 0644);
  +int tx_bufs_coalesce = 0;
  +module_param(tx_bufs_coalesce, int, 0644);
  +int tx_packets_coalesce = 0;
  +module_param(tx_packets_coalesce, int, 0644);
  +
   enum {
  VHOST_NET_VQ_RX = 0,
  VHOST_NET_VQ_TX = 1,
  @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
  int err, wmem;
  size_t hdr_size;
  struct socket *sock;
  +   int bytes_coalesced = 0;
  +   int bufs_coalesced = 0;
  +   int packets_coalesced = 0;
 
  /* TODO: check that we are running from vhost_worker? */
  sock = rcu_dereference_check(vq->private_data, 1);
  @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
 if (err != len)
pr_debug("Truncated TX packet: "
         " len %d != %zd\n", err, len);
  -  vhost_add_used_and_signal(&net->dev, vq, head, 0);
 total_len += len;
  +  packets_coalesced += 1;
  +  bytes_coalesced += len;
  +  bufs_coalesced += in;
 
 Should this instead be:
   bufs_coalesced += out;

Correct.

 Perusing the code I see that earlier there is a check to see if in is not
 zero, and, if so, error out of the loop.  After the check, in is not
 touched until it is added to bufs_coalesced, effectively not changing
 bufs_coalesced, meaning bufs_coalesced will never trigger the conditions
 below.
 
 Or am I missing something?
 
  +  if (unlikely(packets_coalesced > tx_packets_coalesce ||
  +  bytes_coalesced > tx_bytes_coalesce ||
  +  bufs_coalesced > tx_bufs_coalesce))
  + vhost_add_used_and_signal(&net->dev, vq, head, 0);
  +  else
  + vhost_add_used(vq, head, 0);
 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
vhost_poll_queue(&vq->poll);
break;
 }
  }

  +   if (likely(packets_coalesced > tx_packets_coalesce ||
  + bytes_coalesced > tx_bytes_coalesce ||
  + bufs_coalesced > tx_bufs_coalesce))
  +  vhost_signal(&net->dev, vq);
  mutex_unlock(&vq->mutex);
   }
 
 
 Steve D.


Re: Network performance with small packets

2011-01-31 Thread Michael S. Tsirkin
On Mon, Jan 31, 2011 at 05:30:38PM -0800, Sridhar Samudrala wrote:
 On Mon, 2011-01-31 at 18:24 -0600, Steve Dobbelstein wrote:
  Michael S. Tsirkin m...@redhat.com wrote on 01/28/2011 06:16:16 AM:
  
   OK, so thinking about it more, maybe the issue is this:
   tx becomes full. We process one request and interrupt the guest,
   then it adds one request and the queue is full again.
  
   Maybe the following will help it stabilize?
   By itself it does nothing, but if you set
   all the parameters to a huge value we will
   only interrupt when we see an empty ring.
   Which might be too much: pls try other values
   in the middle: e.g. make bufs half the ring,
   or bytes some small value, or packets some
   small value etc.
  
   Warning: completely untested.
  
   diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
   index aac05bc..6769cdc 100644
   --- a/drivers/vhost/net.c
   +++ b/drivers/vhost/net.c
   @@ -32,6 +32,13 @@
 * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
  
   +int tx_bytes_coalesce = 0;
   +module_param(tx_bytes_coalesce, int, 0644);
   +int tx_bufs_coalesce = 0;
   +module_param(tx_bufs_coalesce, int, 0644);
   +int tx_packets_coalesce = 0;
   +module_param(tx_packets_coalesce, int, 0644);
   +
enum {
   VHOST_NET_VQ_RX = 0,
   VHOST_NET_VQ_TX = 1,
   @@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
   int err, wmem;
   size_t hdr_size;
   struct socket *sock;
   +   int bytes_coalesced = 0;
   +   int bufs_coalesced = 0;
   +   int packets_coalesced = 0;
  
   /* TODO: check that we are running from vhost_worker? */
   sock = rcu_dereference_check(vq->private_data, 1);
   @@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
  if (err != len)
 pr_debug("Truncated TX packet: "
          " len %d != %zd\n", err, len);
   -  vhost_add_used_and_signal(&net->dev, vq, head, 0);
  total_len += len;
   +  packets_coalesced += 1;
   +  bytes_coalesced += len;
   +  bufs_coalesced += in;
  
  Should this instead be:
bufs_coalesced += out;
  
  Perusing the code I see that earlier there is a check to see if in is not
  zero, and, if so, error out of the loop.  After the check, in is not
  touched until it is added to bufs_coalesced, effectively not changing
  bufs_coalesced, meaning bufs_coalesced will never trigger the conditions
  below.
 
 Yes. It definitely should be 'out'. 'in' should be 0 in the tx path.
 
 I tried a simpler version of this patch without any tunables by
 delaying the signaling until we come out of the for loop.
 It definitely reduced the number of vmexits significantly for small message
 guest to host stream test and the throughput went up a little.
 
 diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
 index 9b3ca10..5f9fae9 100644
 --- a/drivers/vhost/net.c
 +++ b/drivers/vhost/net.c
 @@ -197,7 +197,7 @@ static void handle_tx(struct vhost_net *net)
   if (err != len)
   pr_debug("Truncated TX packet: "
            " len %d != %zd\n", err, len);
 - vhost_add_used_and_signal(&net->dev, vq, head, 0);
 + vhost_add_used(vq, head, 0);
   total_len += len;
   if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
   vhost_poll_queue(&vq->poll);
 @@ -205,6 +205,8 @@ static void handle_tx(struct vhost_net *net)
   }
   }
  
 + if (total_len > 0)
 + vhost_signal(&net->dev, vq);
   mutex_unlock(&vq->mutex);
  }
  
 
  
  Or am I missing something?
  
   +  if (unlikely(packets_coalesced > tx_packets_coalesce ||
   +  bytes_coalesced > tx_bytes_coalesce ||
   +  bufs_coalesced > tx_bufs_coalesce))
   + vhost_add_used_and_signal(&net->dev, vq, head, 0);
   +  else
   + vhost_add_used(vq, head, 0);
  if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 vhost_poll_queue(&vq->poll);
 break;
  }
   }
  
   +   if (likely(packets_coalesced > tx_packets_coalesce ||
   + bytes_coalesced > tx_bytes_coalesce ||
   + bufs_coalesced > tx_bufs_coalesce))
   +  vhost_signal(&net->dev, vq);
   mutex_unlock(&vq->mutex);
}
 
 It is possible that we can miss signaling the guest even after
 processing a few pkts, if we don't hit any of these conditions.

Yes. It really should be
   if (likely(packets_coalesced && bytes_coalesced && bufs_coalesced))
           vhost_signal(&net->dev, vq);

  
  
  Steve D.
  


Re: Network performance with small packets

2011-01-28 Thread Michael S. Tsirkin
On Thu, Jan 27, 2011 at 01:30:38PM -0800, Shirley Ma wrote:
 On Thu, 2011-01-27 at 13:02 -0800, David Miller wrote:
   Interesting. Could this be a variant of the now famous
  bufferbloat then?
  
  Sigh, bufferbloat is the new global warming... :-/ 
 
 Yep, some places become colder, some other places become warmer; Same as
 BW results, sometimes faster, sometimes slower. :)
 
 Shirley

OK, so thinking about it more, maybe the issue is this:
tx becomes full. We process one request and interrupt the guest,
then it adds one request and the queue is full again.

Maybe the following will help it stabilize?
By itself it does nothing, but if you set
all the parameters to a huge value we will
only interrupt when we see an empty ring.
Which might be too much: pls try other values
in the middle: e.g. make bufs half the ring,
or bytes some small value, or packets some
small value etc.

Warning: completely untested.

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index aac05bc..6769cdc 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,13 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
+int tx_bytes_coalesce = 0;
+module_param(tx_bytes_coalesce, int, 0644);
+int tx_bufs_coalesce = 0;
+module_param(tx_bufs_coalesce, int, 0644);
+int tx_packets_coalesce = 0;
+module_param(tx_packets_coalesce, int, 0644);
+
 enum {
         VHOST_NET_VQ_RX = 0,
         VHOST_NET_VQ_TX = 1,
@@ -127,6 +134,9 @@ static void handle_tx(struct vhost_net *net)
         int err, wmem;
         size_t hdr_size;
         struct socket *sock;
+        int bytes_coalesced = 0;
+        int bufs_coalesced = 0;
+        int packets_coalesced = 0;
 
         /* TODO: check that we are running from vhost_worker? */
         sock = rcu_dereference_check(vq->private_data, 1);
@@ -196,14 +206,26 @@ static void handle_tx(struct vhost_net *net)
                 if (err != len)
                         pr_debug("Truncated TX packet: "
                                  " len %d != %zd\n", err, len);
-                vhost_add_used_and_signal(&net->dev, vq, head, 0);
                 total_len += len;
+                packets_coalesced += 1;
+                bytes_coalesced += len;
+                bufs_coalesced += in;
+                if (unlikely(packets_coalesced > tx_packets_coalesce ||
+                             bytes_coalesced > tx_bytes_coalesce ||
+                             bufs_coalesced > tx_bufs_coalesce))
+                        vhost_add_used_and_signal(&net->dev, vq, head, 0);
+                else
+                        vhost_add_used(vq, head, 0);
                 if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
                         vhost_poll_queue(&vq->poll);
                         break;
                 }
         }
 
+        if (likely(packets_coalesced > tx_packets_coalesce ||
+                   bytes_coalesced > tx_bytes_coalesce ||
+                   bufs_coalesced > tx_bufs_coalesce))
+                vhost_signal(&net->dev, vq);
         mutex_unlock(&vq->mutex);
 }
 


Re: Network performance with small packets

2011-01-28 Thread Steve Dobbelstein
mashi...@linux.vnet.ibm.com wrote on 01/27/2011 02:15:05 PM:

 On Thu, 2011-01-27 at 22:05 +0200, Michael S. Tsirkin wrote:
  One simple theory is that guest net stack became faster
  and so the host can't keep up.

 Yes, that's what I think here. Some qdisc code has been changed
 recently.

I ran a test with txqueuelen set to 128, instead of the default of 1000, in
the guest in an attempt to slow down the guest transmits.  The change had
no effect on the throughput nor on the CPU usage.

On the other hand, I ran some tests with different CPU pinnings and
with/without hyperthreading enabled.  Here is a summary of the results.

Pinning configuration 1:  pin the VCPUs and pin the vhost thread to one of
the VCPU CPUs
Pinning configuration 2:  pin the VCPUs and pin the vhost thread to a
separate CPU on the same socket
Pinning configuration 3:  pin the VCPUs and pin the vhost thread to a
separate CPU a different socket

HT   Pinning   Throughput  CPU
Yes  config 1  - 40%   - 40%
Yes  config 2  - 37%   - 35%
Yes  config 3  - 37%   - 36%
No   none 0%   -  5%
No   config 1  - 41%   - 43%
No   config 2  + 32%   -  4%
No   config 3  + 34%   +  9%

Pinning the vhost thread to the same CPU as a guest VCPU hurts performance.
Turning off hyperthreading and pinning the VCPUs and vhost thread to
separate CPUs significantly improves performance, getting it into the
competitive range with other hypervisors.
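
For anyone reproducing these configurations: the pinning itself is plain
CPU affinity applied to the qemu VCPU threads and to the vhost worker
thread (named vhost-<pid> after the owning qemu process), e.g. with
taskset. A minimal standalone sketch of the same operation, with the task
id and CPU number taken as placeholders on the command line:

/* pin.c: roughly what "taskset -pc CPU PID" does. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
        pid_t pid;
        int cpu;
        cpu_set_t set;

        if (argc < 3) {
                fprintf(stderr, "usage: %s <pid> <cpu>\n", argv[0]);
                return 1;
        }
        pid = atoi(argv[1]);    /* task/thread id to pin */
        cpu = atoi(argv[2]);    /* target CPU */

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(pid, sizeof(set), &set)) {
                perror("sched_setaffinity");
                return 1;
        }
        printf("pinned task %d to cpu %d\n", (int)pid, cpu);
        return 0;
}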

Steve D.



Re: Network performance with small packets

2011-01-28 Thread Steve Dobbelstein
ste...@us.ibm.com wrote on 01/28/2011 12:29:37 PM:

  On Thu, 2011-01-27 at 22:05 +0200, Michael S. Tsirkin wrote:
   One simple theory is that guest net stack became faster
   and so the host can't keep up.
 
  Yes, that's what I think here. Some qdisc code has been changed
  recently.

 I ran a test with txqueuelen set to 128, instead of the default of 1000,
in
 the guest in an attempt to slow down the guest transmits.  The change had
 no effect on the throughput nor on the CPU usage.

 On the other hand, I ran some tests with different CPU pinnings and
 with/without hyperthreading enabled.  Here is a summary of the results.

 Pinning configuration 1:  pin the VCPUs and pin the vhost thread to one
of
 the VCPU CPUs
 Pinning configuration 2:  pin the VCPUs and pin the vhost thread to a
 separate CPU on the same socket
 Pinning configuration 3:  pin the VCPUs and pin the vhost thread to a
 separate CPU a different socket

 HT   Pinning   Throughput  CPU
 Yes  config 1  - 40%   - 40%
 Yes  config 2  - 37%   - 35%
 Yes  config 3  - 37%   - 36%
 No   none 0%   -  5%
 No   config 1  - 41%   - 43%
 No   config 2  + 32%   -  4%
 No   config 3  + 34%   +  9%

 Pinning the vhost thread to the same CPU as a guest VCPU hurts
performance.
 Turning off hyperthreading and pinning the VCPUs and vhost thread to
 separate CPUs significantly improves performance, getting it into the
 competitive range with other hypervisors.

 Steve D.

Those results for configs 2 and 3 with hyperthreading on are a little
strange.  Digging into the cause I found that my automation script for
pinning the vhost thread failed and pinned it to CPU 1, the same as config
1, giving results similar to config 1.  I reran the tests making sure the
pinning script did the right thing.  The results are more consistent.

HT   Pinning   Throughput  CPU
Yes  config 1  - 40%   - 40%
Yes  config 2  + 33%   -  8%
Yes  config 3  + 34%   +  9%
No   none 0%   -  5%
No   config 1  - 41%   - 43%
No   config 2  + 32%   -  4%
No   config 3  + 34%   +  9%

It appears that we have a scheduling problem.  If the processes are pinned
we can get good performance.

We also see that hyperthreading makes little difference.

Sorry for the initial misleading data.

Steve D.




Re: Network performance with small packets

2011-01-27 Thread Shirley Ma
On Wed, 2011-01-26 at 17:17 +0200, Michael S. Tsirkin wrote:
 I am seeing a similar problem, and am trying to fix that.
 My current theory is that this is a variant of a receive livelock:
 if the application isn't fast enough to process
 incoming data, the guest net stack switches
 from prequeue to backlog handling.
 
 One thing I noticed is that locking the vhost thread
 and the vcpu to the same physical CPU almost doubles the
 bandwidth.  Can you confirm that in your setup?
 
 My current guess is that when we lock both to
 a single CPU, netperf in guest gets scheduled
 slowing down the vhost thread in the host.
 
 I also noticed that this specific workload
 performs better with vhost off: presumably
 we are loading the guest less. 

I found a similar issue for the small-message-size TCP_STREAM test when
the guest is the TX side. I found that when I slow down TX, the BW
doubles for 1K to 4K message sizes.

Shirley



Re: Network performance with small packets

2011-01-27 Thread Michael S. Tsirkin
On Thu, Jan 27, 2011 at 10:44:34AM -0800, Shirley Ma wrote:
 On Wed, 2011-01-26 at 17:17 +0200, Michael S. Tsirkin wrote:
  I am seeing a similar problem, and am trying to fix that.
  My current theory is that this is a variant of a receive livelock:
  if the application isn't fast enough to process
  incoming data, the guest net stack switches
  from prequeue to backlog handling.
  
  One thing I noticed is that locking the vhost thread
  and the vcpu to the same physical CPU almost doubles the
  bandwidth.  Can you confirm that in your setup?
  
  My current guess is that when we lock both to
  a single CPU, netperf in guest gets scheduled
  slowing down the vhost thread in the host.
  
  I also noticed that this specific workload
  performs better with vhost off: presumably
  we are loading the guest less. 
 
 I found a similar issue for the small-message-size TCP_STREAM test when
 the guest is the TX side. I found that when I slow down TX, the BW
 doubles for 1K to 4K message sizes.
 
 Shirley

Interesting. In particular running vhost and the transmitting guest
on the same host would have the effect of slowing down TX.
Does it double the BW for you too?

-- 
MST


Re: Network performance with small packets

2011-01-27 Thread Shirley Ma
On Thu, 2011-01-27 at 21:00 +0200, Michael S. Tsirkin wrote:
 Interesting. In particular running vhost and the transmitting guest
 on the same host would have the effect of slowing down TX.
 Does it double the BW for you too?
 

Running vhost and the TX guest on the same host doesn't seem to be enough
to slow down TX. To gain double or even triple BW for guest TX to local
host I still need to play around; for 1K message size, BW is able to
increase from 2.XGb/s to 6.XGb/s.

Thanks
Shirley



Re: Network performance with small packets

2011-01-27 Thread Michael S. Tsirkin
On Thu, Jan 27, 2011 at 11:09:00AM -0800, Shirley Ma wrote:
 On Thu, 2011-01-27 at 21:00 +0200, Michael S. Tsirkin wrote:
  Interesting. In particular running vhost and the transmitting guest
  on the same host would have the effect of slowing down TX.
  Does it double the BW for you too?
  
 
 Running vhost and the TX guest on the same host doesn't seem to be enough
 to slow down TX. To gain double or even triple BW for guest TX to local
 host I still need to play around; for 1K message size, BW is able to
 increase from 2.XGb/s to 6.XGb/s.
 
 Thanks
 Shirley

Well slowing down the guest does not sound hard - for example we can
request guest notifications, or send extra interrupts :)
A slightly more sophisticated thing to try is to
poll the vq a bit more aggressively.
For example if we handled some requests and now tx vq is empty,
reschedule and yield. Worth a try?
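
In handle_tx() terms, that idea might look something like the untested
sketch below; polled_some is a hypothetical local that the loop would set
after processing at least one buffer:

                if (head == vq->num) {
                        /* Ring looks empty.  If we already did some work,
                         * requeue ourselves and yield instead of going
                         * straight back to guest notifications. */
                        if (polled_some) {
                                vhost_poll_queue(&vq->poll);
                                break;
                        }
                        if (unlikely(vhost_enable_notify(vq))) {
                                vhost_disable_notify(vq);
                                continue;
                        }
                        break;
                }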

-- 
MST


Re: Network performance with small packets

2011-01-27 Thread Shirley Ma
On Thu, 2011-01-27 at 21:31 +0200, Michael S. Tsirkin wrote:
 Well slowing down the guest does not sound hard - for example we can
 request guest notifications, or send extra interrupts :)
 A slightly more sophisticated thing to try is to
 poll the vq a bit more aggressively.
 For example if we handled some requests and now tx vq is empty,
 reschedule and yield. Worth a try?

I used dropping packets at a high level to slow down TX. I am still
thinking about what the right approach is here.

Requesting guest notifications and extra interrupts is what we want to
avoid in order to reduce VM exits and save CPU. I don't think it's good.

By polling the vq a bit more aggressively, you meant vhost, right?

Shirley



Re: Network performance with small packets

2011-01-27 Thread Michael S. Tsirkin
On Thu, Jan 27, 2011 at 11:45:47AM -0800, Shirley Ma wrote:
 On Thu, 2011-01-27 at 21:31 +0200, Michael S. Tsirkin wrote:
  Well slowing down the guest does not sound hard - for example we can
  request guest notifications, or send extra interrupts :)
  A slightly more sophisticated thing to try is to
  poll the vq a bit more aggressively.
  For example if we handled some requests and now tx vq is empty,
  reschedule and yeild. Worth a try?
 
 I used dropping packets at a high level to slow down TX. I am still
 thinking about what the right approach is here.

Interesting. Could this be a variant of the now famous bufferbloat then?

I guess we could drop some packets if we see we are not keeping up. For
example if we see that the ring is > X% full, we could quickly complete
Y% of the buffers without transmitting the packets on. Or maybe we should
drop some bytes, not packets.
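
In handle_tx() that policy might look something like the untested sketch
below; tx_drop_thresh is a made-up knob, and the backlog estimate just
uses the cached avail/last_avail indices:

                /* After vhost_get_vq_desc() gave us head: if the guest is
                 * far ahead of us, complete this buffer back to it without
                 * handing the packet to the socket, i.e. drop it.
                 * (Signal batching is ignored here.) */
                u16 backlog = (u16)(vq->avail_idx - vq->last_avail_idx);

                if (unlikely(backlog > tx_drop_thresh)) {
                        vhost_add_used(vq, head, 0);
                        continue;
                }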

 
 Requesting guest notifications and extra interrupts is what we want to
 avoid in order to reduce VM exits and save CPU. I don't think it's good.

Yes but how do you explain regression?
One simple theory is that guest net stack became faster
and so the host can't keep up.


 
 By polling the vq a bit more aggressively, you meant vhost, right?
 
 Shirley

Yes.

