Re: [RFC] kvm tools: Implement multiple VQ for virtio-net

2011-11-16 Thread Krishna Kumar2
jason wang jasow...@redhat.com wrote on 11/16/2011 11:40:45 AM:

Hi Jason,

 Have any thoughts in mind to solve the issue of flow handling?

So far nothing concrete.

 Maybe getting some performance numbers first would be better; it would let us
 know where we are. During the testing of my patchset, I found a big regression
 in small packet transmission, and more retransmissions were noticed. This
 may also be an issue of flow affinity. One interesting thing is to see
 whether this happens with your patches :)

I haven't got any results for small packets yet, but will run the tests
this week and send an update. I remember my earlier patches having a
regression for small packets.

 I've played with a basic flow director implementation based on my series
 which tries to make sure the packets of a flow are handled by the same
 vhost thread/guest vcpu. This is done by:

 - bind virtqueue to guest cpu
 - record the hash to queue mapping when guest sending packets and use
 this mapping to choose the virtqueue when forwarding packets to guest

 Tests show some improvement for receiving packets from an external host and
 for sending packets to the local host. But it hurts the performance of
 sending packets to a remote host. This is not a perfect solution as it
 cannot handle the guest moving processes among vcpus; I plan to try
 accelerated RFS and sharing the mapping between host and guest.

 Anyway, this is just for receiving; small packet sending needs more
 thought.
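
For reference, a minimal sketch of the hash-to-queue mapping Jason
describes (purely illustrative, not code from his series; the names and
table size are made up):

#define FLOW_TABLE_SIZE 1024		/* illustrative, power of two */

/* rxhash -> queue index, filled in on the guest TX path */
static u16 flow_to_queue[FLOW_TABLE_SIZE];

/* Guest TX: remember which queue (and hence which vcpu) this flow used. */
static void flow_record(u32 rxhash, u16 queue)
{
	flow_to_queue[rxhash & (FLOW_TABLE_SIZE - 1)] = queue;
}

/* Host->guest RX: steer the packet back to the queue the flow used for TX. */
static u16 flow_lookup(u32 rxhash, u16 num_queues)
{
	return flow_to_queue[rxhash & (FLOW_TABLE_SIZE - 1)] % num_queues;
}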

I don't recollect small packet performance for guest -> local host.
Also, using multiple tun devices on the bridge (instead of mq-tun)
balances the rx/tx of a flow to a single vq. Then you can avoid
mq-tun with its queue selector function, etc. Have you tried it?

I will run my tests this week and get back.

thanks,

- KK



Re: [RFC] kvm tools: Implement multiple VQ for virtio-net

2011-11-14 Thread Krishna Kumar2
Sasha Levin levinsasha...@gmail.com wrote on 11/14/2011 03:45:40 PM:

  Why are both the bandwidth and latency performance dropping so
  dramatically with multiple VQs?

 It looks like there's no hash sync between host and guest, which makes
 the RX VQ change for every packet. This is my guess.

Yes, I confirmed this happens for macvtap. I am
using ixgbe - it calls skb_record_rx_queue when
an skb is allocated, but sets rxhash when a packet
arrives. Macvtap relies on record_rx_queue first,
ahead of rxhash (as part of my patch making
macvtap multiqueue), hence different skbs result
in macvtap selecting different VQs.

Reordering macvtap to use rxhash first results in
all packets going to the same VQ. The code snippet
is:

{
	...
	if (!numvtaps)
		goto out;

	rxq = skb_get_rxhash(skb);
	if (rxq) {
		tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
		if (tap)
			goto out;
	}

	if (likely(skb_rx_queue_recorded(skb))) {
		rxq = skb_get_rx_queue(skb);

		while (unlikely(rxq >= numvtaps))
			rxq -= numvtaps;

		tap = rcu_dereference(vlan->taps[rxq]);
		if (tap)
			goto out;
	}
}

I will submit a patch for macvtap separately. I am working
towards the other issue pointed out - different vhost
threads handling rx/tx of a single flow.

thanks,

- KK



Re: [PATCHv2 RFC 0/4] virtio and vhost-net capacity handling

2011-06-13 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 06/07/2011 09:38:30 PM:

  This is on top of the patches applied by Rusty.
 
  Warning: untested. Posting now to give people chance to
  comment on the API.

 OK, this seems to have survived some testing so far,
 after I dropped patch 4 and fixed build for patch 3
 (build fixup patch sent in reply to the original).

 I'll be mostly offline until Sunday, would appreciate
 testing reports.

Hi Michael,

I ran the latest patches with 1K I/O (guest -> local host) and
the results are (60 sec run for each test case):

__________________________________
#sessions     BW%       SD%
__________________________________
1            -25.6      47.0
2            -29.3      22.9
4               .8       1.6
8              1.6       0
16            -1.6       4.1
32            -5.3       2.1
48            11.3      -7.8
64            -2.8        .7
96            -6.2        .6
128          -10.6      12.7
__________________________________
BW: -4.8        SD: 5.4

I tested it again to see if the regression is fleeting (since
the numbers vary quite a bit for 1K I/O even between guest ->
local host), but:

__________________________________
#sessions     BW%       SD%
__________________________________
1             14.0     -17.3
2             19.9     -11.1
4              7.9     -15.3
8              9.6     -13.1
16             1.2      -7.3
32             -.6     -13.5
48           -28.7      10.0
64            -5.7       -.7
96            -9.4      -8.1
128           -9.4        .7
__________________________________
BW: -3.7        SD: -2.0


With 16K, there was an improvement in SD, but
higher sessions seem to slightly degrade BW/SD:

__________________________________
#sessions     BW%       SD%
__________________________________
1             30.9     -25.0
2             16.5     -19.4
4             -1.3       7.9
8              1.4       6.2
16             3.9      -5.4
32             0         4.3
48             -.5        .1
64            32.1      -1.5
96            -2.1      23.2
128           -7.4       3.8
__________________________________
BW: 5.0         SD: 7.5


Thanks,

- KK



Re: [PATCHv2 RFC 0/4] virtio and vhost-net capacity handling

2011-06-13 Thread Krishna Kumar2
 Krishna Kumar2/India/IBM@IBMIN wrote on 06/13/2011 07:02:27 PM:

...

 With 16K, there was an improvement in SD, but
 higher sessions seem to slightly degrade BW/SD:

I meant to say "With 16K, there was an improvement in BW"
above. Again, the numbers are not very reproducible;
I will test with a remote host also to see if I get more
consistent numbers.

Thanks,

- KK


 __________________________________
 #sessions     BW%       SD%
 __________________________________
 1             30.9     -25.0
 2             16.5     -19.4
 4             -1.3       7.9
 8              1.4       6.2
 16             3.9      -5.4
 32             0         4.3
 48             -.5        .1
 64            32.1      -1.5
 96            -2.1      23.2
 128           -7.4       3.8
 __________________________________
 BW: 5.0         SD: 7.5



Re: [PATCHv2 RFC 0/4] virtio and vhost-net capacity handling

2011-06-13 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 06/13/2011 07:05:13 PM:

  I ran the latest patches with 1K I/O (guest-local host) and
  the results are (60 sec run for each test case):

 Hi!
 Did you apply this one:
 [PATCHv2 RFC 4/4] Revert virtio: make add_buf return capacity remaining
 ?

 It turns out that that patch has a bug and should be reverted,
 only patches 1-3 should be applied.

 Could you confirm please?

No, I didn't apply that patch. I had also seen your mail
earlier on this patch breaking receive buffer processing
if applied.

thanks,

- KK



Re: [PATCH RFC 3/3] virtio_net: limit xmit polling

2011-06-02 Thread Krishna Kumar2
 OK, I have something very similar, but I still dislike the "screw the
 latency" part: this path is exactly what the IBM guys seem to hit.  So I
 created two functions: one tries to free a constant number and another
 one up to capacity. I'll post that now.

Please review this patch to see if it looks reasonable (inline and
attachment):

1. Picked comments/code from Michael's code and Rusty's review.
2. virtqueue_min_capacity() needs to be called only if it returned
   empty the last time it was called.
3. Fix return value bug in free_old_xmit_skbs (hangs guest).
4. Stop queue only if capacity is not enough for next xmit.
5. Fix/clean some likely/unlikely checks (hopefully).
6. I think xmit_skb cannot return an error, since
   virtqueue_enable_cb_delayed() can return false only if
   3/4th of the space became available, which is what we check.
7. The comments for free_old_xmit_skbs need to be made
   clearer (not done).

I have done some minimal netperf tests with this.

With this patch, add_buf returning capacity seems to be useful - it
allows using fewer virtio API calls.

(See attached file: patch)

Signed-off-by: Krishna Kumar krkum...@in.ibm.com
---
 drivers/net/virtio_net.c |  105 ++---
 1 file changed, 64 insertions(+), 41 deletions(-)

diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c
--- org/drivers/net/virtio_net.c	2011-06-02 15:49:25.0 +0530
+++ new/drivers/net/virtio_net.c	2011-06-02 19:13:02.0 +0530
@@ -509,27 +509,43 @@ again:
 	return received;
 }
 
-/* Check capacity and try to free enough pending old buffers to enable queueing
- * new ones.  If min_skbs > 0, try to free at least the specified number of skbs
- * even if the ring already has sufficient capacity.  Return true if we can
- * guarantee that a following virtqueue_add_buf will succeed. */
-static bool free_old_xmit_skbs(struct virtnet_info *vi, int min_skbs)
+/* Return true if freed a skb, else false */
+static inline bool free_one_old_xmit_skb(struct virtnet_info *vi)
 {
 	struct sk_buff *skb;
 	unsigned int len;
-	bool r;
 
-	while ((r = virtqueue_min_capacity(vi->svq) < MAX_SKB_FRAGS + 2) ||
-	       min_skbs-- > 0) {
-		skb = virtqueue_get_buf(vi->svq, &len);
-		if (unlikely(!skb))
+	skb = virtqueue_get_buf(vi->svq, &len);
+	if (unlikely(!skb))
+		return 0;
+
+	pr_debug("Sent skb %p\n", skb);
+	vi->dev->stats.tx_bytes += skb->len;
+	vi->dev->stats.tx_packets++;
+	dev_kfree_skb_any(skb);
+	return 1;
+}
+
+static bool free_old_xmit_skbs(struct virtnet_info *vi, int to_free)
+{
+	bool empty = virtqueue_min_capacity(vi->svq) < MAX_SKB_FRAGS + 2;
+
+	do {
+		if (!free_one_old_xmit_skb(vi)) {
+			/* No more skbs to free up */
 			break;
-		pr_debug("Sent skb %p\n", skb);
-		vi->dev->stats.tx_bytes += skb->len;
-		vi->dev->stats.tx_packets++;
-		dev_kfree_skb_any(skb);
-	}
-	return r;
+		}
+
+		if (empty) {
+			/* Check again if there is enough space */
+			empty = virtqueue_min_capacity(vi->svq) <
+				MAX_SKB_FRAGS + 2;
+		} else {
+			--to_free;
+		}
+	} while (to_free > 0);
+
+	return !empty;
 }
 
 static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
@@ -582,46 +598,53 @@ static int xmit_skb(struct virtnet_info
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	int ret, n;
+	int capacity;
 
-	/* Free up space in the ring in case this is the first time we get
-	 * woken up after ring full condition.  Note: this might try to free
-	 * more than strictly necessary if the skb has a small
-	 * number of fragments, but keep it simple. */
-	free_old_xmit_skbs(vi, 0);
+	/* Try to free 2 buffers for every 1 xmit, to stay ahead. */
+	free_old_xmit_skbs(vi, 2);
 
 	/* Try to transmit */
-	ret = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, skb);
 
-	/* Failure to queue is unlikely. It's not a bug though: it might happen
-	 * if we get an interrupt while the queue is still mostly full.
-	 * We could stop the queue and re-enable callbacks (and possibly return
-	 * TX_BUSY), but as this should be rare, we don't bother. */
-	if (unlikely(ret < 0)) {
+	if (unlikely(capacity < 0)) {
+		/*
+		 * Failure to queue should be impossible. The only way to
+		 * reach here is if we got a cb before 3/4th of space was
+		 * available. We could stop the queue and re-enable
+		 * callbacks (and possibly return TX_BUSY), but we don't
+		 * bother since this is impossible.

Re: [PATCH RFC 3/3] virtio_net: limit xmit polling

2011-06-02 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 06/02/2011 08:13:46 PM:

  Please review this patch to see if it looks reasonable:

 Hmm, since you decided to work on top of my patch,
 I'd appreciate split-up fixes.

OK (that also explains your next comment).

  1. Picked comments/code from MST's code and Rusty's review.
  2. virtqueue_min_capacity() needs to be called only if it returned
 empty the last time it was called.
  3. Fix return value bug in free_old_xmit_skbs (hangs guest).
  4. Stop queue only if capacity is not enough for next xmit.

 That's what we always did ...

I had made the patch against your patch, hence this change (sorry for
the confusion!).

  5. Fix/clean some likely/unlikely checks (hopefully).
 
  I have done some minimal netperf tests with this.
 
   With this patch, add_buf returning capacity seems to be useful - it
   allows using fewer virtio API calls.

 Why bother? It's cheap ...

If add_buf retains its functionality to return the capacity (it
is going to need a change to return 0 otherwise anyway), is it
useful to call another function at each xmit?

  +static bool free_old_xmit_skbs(struct virtnet_info *vi, int to_free)
  +{
  +   bool empty = virtqueue_min_capacity(vi->svq) < MAX_SKB_FRAGS + 2;
  +
  +   do {
  +      if (!free_one_old_xmit_skb(vi)) {
  +         /* No more skbs to free up */
            break;
  -      pr_debug("Sent skb %p\n", skb);
  -      vi->dev->stats.tx_bytes += skb->len;
  -      vi->dev->stats.tx_packets++;
  -      dev_kfree_skb_any(skb);
  -   }
  -   return r;
  +      }
  +
  +      if (empty) {
  +         /* Check again if there is enough space */
  +         empty = virtqueue_min_capacity(vi->svq) <
  +                 MAX_SKB_FRAGS + 2;
  +      } else {
  +         --to_free;
  +      }
  +   } while (to_free > 0);
  +
  +   return !empty;
   }

 Why bother doing the capacity check in this function?

To return whether we have enough space for next xmit. It should call
it only once unless space is running out. Does it sound OK?

  -   if (unlikely(ret < 0)) {
  +   if (unlikely(capacity < 0)) {
  +      /*
  +       * Failure to queue should be impossible. The only way to
  +       * reach here is if we got a cb before 3/4th of space was
  +       * available. We could stop the queue and re-enable
  +       * callbacks (and possibly return TX_BUSY), but we don't
  +       * bother since this is impossible.

 It's far from impossible.  The 3/4 thing is only a hint, and old devices
 don't support it anyway.

OK, I will re-put back your comment.

  -   if (!likely(free_old_xmit_skbs(vi, 2))) {
  -      netif_stop_queue(dev);
  -      if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
  -         /* More just got used, free them and recheck. */
  -         if (!likely(free_old_xmit_skbs(vi, 0))) {
  -            netif_start_queue(dev);
  -            virtqueue_disable_cb(vi->svq);
  +   /*
  +    * Apparently nice girls don't return TX_BUSY; check capacity and
  +    * stop the queue before it gets out of hand. Naturally, this wastes
  +    * entries.
  +    */
  +   if (capacity < 2+MAX_SKB_FRAGS) {
  +      /*
  +       * We don't have enough space for the next packet. Try
  +       * freeing more.
  +       */
  +      if (likely(!free_old_xmit_skbs(vi, UINT_MAX))) {
  +         netif_stop_queue(dev);
  +         if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
  +            /* More just got used, free them and recheck. */
  +            if (likely(free_old_xmit_skbs(vi, UINT_MAX))) {

 Is this where the bug was?

Return value in free_old_xmit() was wrong. I will re-do against the
mainline kernel.

Thanks,

- KK



Re: [PATCH RFC 3/3] virtio_net: limit xmit polling

2011-06-02 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 06/02/2011 09:04:23 PM:

   Is this where the bug was?
 
  Return value in free_old_xmit() was wrong. I will re-do against the
  mainline kernel.
 
  Thanks,
 
  - KK

 Just noting that I'm working on that patch as well, it might
 be more efficient if we don't both of us do this in parallel :)

OK, but my intention was to work on an alternate approach, which
was the reason to base it against your patch.

I will check your latest patch.

thanks,

- KK



[PERF RESULTS] virtio and vhost-net performance enhancements

2011-05-26 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/20/2011 04:40:07 AM:

 OK, here is the large patchset that implements the virtio spec update
 that I sent earlier (the spec itself needs a minor update, will send
 that out too next week, but I think we are on the same page here
 already). It supersedes the PUBLISH_USED_IDX patches I sent
 out earlier.

I was able to get this tested by applying the v2 patches
to the git-next tree (somehow MST's git tree hung on my guest,
which never got resolved). Testing was from Guest -> Remote
node, using an ixgbe 10g card. The test results are
*excellent* (table: #netperf sessions, BW% improvement,
SD% improvement, CPU% improvement):

____________________________________________
              512 byte I/O
#       BW%       SD%       CPU%
____________________________________________
1       151.6     -65.1     -10.7
2       180.6     -66.6      -6.4
4        15.5     -35.8     -26.1
8         1.8     -28.4     -26.7
16        3.1     -29.0     -26.5
32        1.1     -27.4     -27.5
64        3.8     -30.9     -26.7
96        5.4     -21.7     -24.2
128       5.7     -24.4     -25.5
____________________________________________
BW: 16.6%     SD: -24.6%     CPU: -25.5%


____________________________________________
                 1K I/O
#       BW%       SD%       CPU%
____________________________________________
1       233.9     -76.5     -18.0
2       112.2     -64.0     -23.2
4         9.2     -31.6     -26.1
8        -1.7     -26.8     -30.3
16        3.5     -31.5     -30.6
32        4.8     -25.2     -30.5
64        5.7     -31.0     -28.9
96        5.3     -32.2     -31.7
128       4.6     -38.2     -33.6
____________________________________________
BW: 16.4%     SD: -35%       CPU: -31.5%


____________________________________________
                 16K I/O
#       BW%       SD%       CPU%
____________________________________________
1        18.8     -27.2     -18.3
2        14.8     -36.7     -27.7
4        12.7     -45.2     -38.1
8         4.4     -56.4     -54.4
16        4.8     -38.3     -36.1
32        0        78.0      79.2
64        3.8     -38.1     -37.5
96        7.3     -35.2     -31.1
128       3.4     -31.1     -32.1
____________________________________________
BW: 7.6%      SD: -30.1%     CPU: -23.7%


I plan to run some more tests tomorrow. Please let
me know if any other scenario will help.

Thanks,

- KK



Re: [PERF RESULTS] virtio and vhost-net performance enhancements

2011-05-26 Thread Krishna Kumar2
Shirley Ma x...@us.ibm.com wrote on 05/26/2011 09:12:22 PM:

 Could you please try TCP_RRs as well?

Right. Here's the result for TCP_RR:

______________________________________
#      RR%       SD%       CPU%
______________________________________
1      4.5       -31.4     -27.9
2      5.1       -9.7      -5.4
4      60.4      -13.4     38.8
8      67.8      -13.5     45.0
16     55.8      -8.0      43.2
32     66.9      -14.1     43.3
64     47.2      -23.7     12.2
96     29.7      -11.8     14.3
128    8.0        2.2      10.7
______________________________________
RR: 37.3%    SD: -6.7%    CPU: 15.7%
______________________________________

Thanks,

- KK



Re: [PERF RESULTS] virtio and vhost-net performance enhancements

2011-05-26 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 05/26/2011 09:51:32 PM:

  Could you please try TCP_RRs as well?

 Right. Here's the result for TCP_RR:

The actual transaction rate/second numbers are:

_______________________________________________________________
#     RR1       RR2 (%)          SD1         SD2 (%)
_______________________________________________________________
1     9476      9903 (4.5)       28.9        19.8 (-31.4)
2     17337     18225 (5.1)      92.7        83.7 (-9.7)
4     17385     27902 (60.4)     364.8       315.8 (-13.4)
8     25560     42912 (67.8)     1428.1      1234.0 (-13.5)
16    35898     55934 (55.8)     4391.6      4038.1 (-8.0)
32    48048     80228 (66.9)     17391.4     14932.0 (-14.1)
64    60412     88929 (47.2)     71087.7     54230.1 (-23.7)
96    71263     92439 (29.7)     145434.1    128214.0 (-11.8)
128   84208     91014 (8.0)      233668.2    23.6 (2.2)
_______________________________________________________________
RR: 37.3%     SD: -6.7%
_______________________________________________________________

Thanks,

- KK



Re: [PATCHv2 10/14] virtio_net: limit xmit polling

2011-05-24 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/23/2011 04:49:00 PM:

  To do this properly, we should really be using the actual number of sg
  elements needed, but we'd have to do most of xmit_skb beforehand so we
  know how many.
 
  Cheers,
  Rusty.

 Maybe I'm confused here.  The problem isn't the failing
 add_buf for the given skb IIUC.  What we are trying to do here is stop
 the queue *before xmit_skb fails*. We can't look at the
 number of fragments in the current skb - the next one can be
 much larger.  That's why we check capacity after xmit_skb,
 not before it, right?

Maybe Rusty means it is a simpler model to free the amount
of space that this xmit needs. We will still fail anyway
at some time, but it is unlikely, since the earlier iteration
freed up at least the space that it was going to use. The
code could become much simpler:

start_xmit()
{
        num_sgs = get num_sgs for this skb;

        /* Free enough pending old buffers to enable queueing this one */
        free_old_xmit_skbs(vi, num_sgs * 2);    /* ?? */

        if (virtqueue_get_capacity() < num_sgs) {
                netif_stop_queue(dev);
                if (virtqueue_enable_cb_delayed(vi->svq) ||
                    free_old_xmit_skbs(vi, num_sgs)) {
                        /* Nothing freed up, or not enough freed up */
                        kfree_skb(skb);
                        return NETDEV_TX_OK;
                }
                netif_start_queue(dev);
                virtqueue_disable_cb(vi->svq);
        }

        /* xmit_skb cannot fail now, also pass 'num_sgs' */
        xmit_skb(vi, skb, num_sgs);
        virtqueue_kick(vi->svq);

        skb_orphan(skb);
        nf_reset(skb);

        return NETDEV_TX_OK;
}

We could even return TX_BUSY since that makes the dequeue
code more efficient. See dev_dequeue_skb() - you can skip a
lot of code (and avoid taking locks) to check if the queue
is already stopped but that code runs only if you return
TX_BUSY in the earlier iteration.
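
For context, a rough paraphrase of that dequeue path (simplified from
net/sched/sch_generic.c of that era; not the exact source):

static inline struct sk_buff *dequeue_skb(struct Qdisc *q)
{
	struct sk_buff *skb = q->gso_skb;

	if (unlikely(skb)) {
		/* Requeued packet: peek at the txq state without the tx lock */
		struct netdev_queue *txq;

		txq = netdev_get_tx_queue(qdisc_dev(q), skb_get_queue_mapping(skb));
		if (!netif_tx_queue_stopped(txq) && !netif_tx_queue_frozen(txq))
			q->gso_skb = NULL;	/* queue woke up, hand it to the driver */
		else
			skb = NULL;		/* still stopped, try again later */
	} else {
		skb = q->dequeue(q);		/* normal path */
	}

	return skb;
}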

BTW, shouldn't the check in start_xmit be:
if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
...
}

Thanks,

- KK



Re: [PATCHv2 10/14] virtio_net: limit xmit polling

2011-05-24 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/24/2011 02:42:55 PM:

To do this properly, we should really be using the actual number of sg
elements needed, but we'd have to do most of xmit_skb beforehand so we
know how many.
   
Cheers,
Rusty.
  
   Maybe I'm confused here.  The problem isn't the failing
   add_buf for the given skb IIUC.  What we are trying to do here is stop
   the queue *before xmit_skb fails*. We can't look at the
   number of fragments in the current skb - the next one can be
   much larger.  That's why we check capacity after xmit_skb,
   not before it, right?
 
  Maybe Rusty means it is a simpler model to free the amount
  of space that this xmit needs. We will still fail anyway
  at some time, but it is unlikely, since the earlier iteration
  freed up at least the space that it was going to use.

 Not sure I understand.  We can't know space is freed in the previous
 iteration as buffers might not have been used by then.

Yes, the first few iterations may not have freed up space, but
later ones should. The amount of free space should increase
from then on, especially since we try to free double of what
we consume.

  The
  code could become much simpler:
 
  start_xmit()
  {
  num_sgs = get num_sgs for this skb;
 
  /* Free enough pending old buffers to enable queueing this one */
  free_old_xmit_skbs(vi, num_sgs * 2); /* ?? */
 
  if (virtqueue_get_capacity() < num_sgs) {
  netif_stop_queue(dev);
  if (virtqueue_enable_cb_delayed(vi->svq) ||
  free_old_xmit_skbs(vi, num_sgs)) {
  /* Nothing freed up, or not enough freed up */
  kfree_skb(skb);
  return NETDEV_TX_OK;

 This packet drop is what we wanted to avoid.

Please see below on returning NETDEV_TX_BUSY.


  }
  netif_start_queue(dev);
  virtqueue_disable_cb(vi->svq);
  }
 
  /* xmit_skb cannot fail now, also pass 'num_sgs' */
  xmit_skb(vi, skb, num_sgs);
  virtqueue_kick(vi->svq);
 
  skb_orphan(skb);
  nf_reset(skb);
 
  return NETDEV_TX_OK;
  }
 
  We could even return TX_BUSY since that makes the dequeue
  code more efficient. See dev_dequeue_skb() - you can skip a
  lot of code (and avoid taking locks) to check if the queue
  is already stopped but that code runs only if you return
  TX_BUSY in the earlier iteration.
 
  BTW, shouldn't the check in start_xmit be:
 if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
...
 }
 
  Thanks,
 
  - KK

 I thought we used to do basically this but other devices moved to a
 model where they stop *before* queueing fails, so we did too.

I am not sure why it was changed, since returning TX_BUSY
seems more efficient IMHO. qdisc_restart() handles requeued
packets much better than a stopped queue, as a significant
part of this code is skipped if gso_skb is present (the qdisc
will eventually start dropping packets when tx_queue_len is
exceeded anyway).

Thanks,

- KK



Re: [PATCHv2 10/14] virtio_net: limit xmit polling

2011-05-24 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/24/2011 04:59:39 PM:

Maybe Rusty means it is a simpler model to free the amount
of space that this xmit needs. We will still fail anyway
at some time but it is unlikely, since earlier iteration
freed up atleast the space that it was going to use.
  
   Not sure I understand.  We can't know space is freed in the previous
   iteration as buffers might not have been used by then.
 
  Yes, the first few iterations may not have freed up space, but
  later ones should. The amount of free space should increase
  from then on, especially since we try to free double of what
  we consume.

 Hmm. This is only an upper limit on the # of entries in the queue.
 Assume that vq size is 4 and we transmit 4 entries without
 getting anything in the used ring. The next transmit will fail.

 So I don't really see why it's unlikely that we reach the packet
 drop code with your patch.

I was assuming 256 entries :) I will try to get some
numbers to see how often it is true tomorrow.

  I am not sure of why it was changed, since returning TX_BUSY
  seems more efficient IMHO.
  qdisc_restart() handles requeue'd
  packets much better than a stopped queue, as a significant
  part of this code is skipped if gso_skb is present

 I think this is the argument:
 http://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg06364.html

Thanks for digging up that thread! Yes, that one skb would get
sent first ahead of possibly higher priority skbs. However,
from a performance point of view, the TX_BUSY code skips a lot
of checks and code for all subsequent packets till the device is
restarted. I can test performance with both cases and report
what I find (the requeue code has gone from horribly complex to
very simple and clean, thanks to Herbert and Dave).

  (qdisc
  will eventually start dropping packets when tx_queue_len is

 tx_queue_len is a pretty large buffer so maybe no.

I remember seeing tons of drops (pfifo_fast_enqueue) when
xmit returns TX_BUSY.

 I think the packet drops from the scheduler queue can also be
 done intelligently (e.g. with CHOKe) which should
 work better than dropping a random packet?

I am not sure of that - choke_enqueue compares against a random
skb to decide whether to drop the current skb, and also acts during
congestion. But for my sample driver xmit, returning TX_BUSY
could still allow it to be used with CHOKe.

thanks,

- KK



Re: [PATCH 00/18] virtio and vhost-net performance enhancements

2011-05-11 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 02:20:18 AM:

 [PATCH 00/18] virtio and vhost-net performance enhancements

 OK, here's a large patchset that implements the virtio spec update that I
 sent earlier. It supercedes the PUBLISH_USED_IDX patches
 I sent out earlier.

 I know it's a lot to ask but please test, and please consider for
2.6.40 :)

 I see nice performance improvements: one run showed going from 12
 to 18 Gbit/s host to guest with netperf, but I did not spend a lot
 of time testing performance, so no guarantees it's not a fluke,
 I hope others will try this out and report.
 Pls note I will be away from keyboard for the next week.

I tested with the git tree (which also contains the later
additional patch), and got this error in the guest:

May 11 08:06:08 localhost kernel: net eth0: Unexpected TX queue failure:
-28
May 11 08:06:08 localhost kernel: net eth0: Unexpected TX queue failure:
-28
May 11 08:06:08 localhost kernel: net eth0: Unexpected TX queue failure:
-28
May 11 08:06:08 localhost kernel: net eth0: Unexpected TX queue failure:
-28
...

The network stops after that and requires a modprobe restart to
get it working again. This is with the new qemu/vhost/virtio-net.

Please let me know if I am missing something.

thanks,

- KK



Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance

2011-05-07 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 09:04:13 PM:

  I haven't tuned the threshold, it is left at 3/4. I ran
  the new qemu/vhost/guest, and the results for 1K, 2K and 16K
  are below. Note this is a different kernel version from my
  earlier test results. So, f.e., BW1 represents 2.6.39-rc2,
  the original kernel; while BW2 represents 2.6.37-rc5 (MST's
  kernel).

 Weird. My kernel is actually 2.6.39-rc2. So which is which?

I cloned git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git

# git branch -a
  vhost
* vhost-net-next-event-idx-v1
  remotes/origin/HEAD -> origin/vhost
  remotes/origin/for-linus
  remotes/origin/master
  remotes/origin/net-2.6
  remotes/origin/vhost
  remotes/origin/vhost-broken
  remotes/origin/vhost-devel
  remotes/origin/vhost-mrg-rxbuf
  remotes/origin/vhost-net
  remotes/origin/vhost-net-next
  remotes/origin/vhost-net-next-event-idx-v1
  remotes/origin/vhost-net-next-rebased
  remotes/origin/virtio-layout-aligned
  remotes/origin/virtio-layout-minimal
  remotes/origin/virtio-layout-original
  remotes/origin/virtio-layout-padded
  remotes/origin/virtio-publish-used

# git checkout vhost-net-next-event-idx-v1
Already on 'vhost-net-next-event-idx-v1'

# head -4 Makefile
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 37
EXTRAVERSION = -rc5

I am not sure what I am missing.

thanks,

- KK



Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance

2011-05-05 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 02:53:59 AM:

  Not hope exactly. If the device is not ready, then
  the packet is requeued. The main idea is to avoid
  drops/stop/starts, etc.

 Yes, I see that, definitely. I guess it's a win if the
 interrupt takes at least a jiffy to arrive anyway,
 and a loss if not. Is there some reason interrupts
 might be delayed until the next jiffy?

I can explain this a bit as I have three debug counters
in start_xmit() just for this:

1. Whether the current xmit call was good, i.e. we had
   returned BUSY last time and this xmit was successful.
2. Whether the current xmit call was bad, i.e. we had
   returned BUSY last time and this xmit still failed.
3. The free capacity when we *resumed* xmits. This is
   after calling free_old_xmit_skbs where this function
   is not throttled, in effect it processes *all* the
   completed skbs. This counter is a sum:

   if (If_I_had_returned_EBUSY_last_iteration)
   free_slots += virtqueue_get_capacity();
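
Roughly, the counters sit in start_xmit() like this (illustrative only,
not part of the posted patches; xmit_restart is the flag mentioned
below, the remaining names are made up):

static unsigned long stat_good, stat_bad, stat_free_slots;

static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct virtnet_info *vi = netdev_priv(dev);
	bool resumed = vi->xmit_restart;	/* last call returned EBUSY */

	if (resumed) {
		/* Unthrottled: process *all* completed skbs ... */
		free_old_xmit_skbs(vi, UINT_MAX);
		/* ... and record counter 3: capacity seen when resuming */
		stat_free_slots += virtqueue_get_capacity(vi->svq);
	}

	if (xmit_skb(vi, skb) < 0) {
		if (resumed)
			stat_bad++;		/* counter 2: resumed, still failed */
		vi->xmit_restart = true;
		return NETDEV_TX_BUSY;
	}

	if (resumed) {
		stat_good++;			/* counter 1: resumed successfully */
		vi->xmit_restart = false;
	}

	virtqueue_kick(vi->svq);
	return NETDEV_TX_OK;
}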

The counters after a 30 min run of 1K,2K,16K netperf
sessions are:

Good:  1059172
Bad:   31226
Sum of slots:  47551557

(Total of Good+Bad tallies with the total number of requeues
as shown by tc:

qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1
1 1 1 1
 Sent 1560854473453 bytes 1075873684 pkt (dropped 718379, overlimits 0
requeues 1090398)
 backlog 0b 0p requeues 1090398
)

It shows that 2.9% of the time, the 1 jiffy was not enough
to free up space in the txq. That could also mean that we
had set xmit_restart just before jiffies changed. But the
average free capacity when we *resumed* xmits is:
Sum of slots / (Good + Bad) = 43.

So the delay of 1 jiffy helped the host clean up, on average,
just 43 entries, which is 16% of total entries. This is
intended to show that the guest is not sitting idle waiting
for the jiffy to expire.

   I can post it, mind testing this?
 
  Sure.

 Just posted. Would appreciate feedback.

Do I need to apply all the patches and simply test?

Thanks,

- KK



Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance

2011-05-05 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 02:34:39 PM:

  It shows that 2.9% of the time, the 1 jiffy was not enough
  to free up space in the txq.

 How common is it to free up space in *less than* 1 jiffy?

True, but the point is that the space freed is just
enough for 43 entries; keeping it lower means a flood
of (pseudo) stops and restarts.

  That could also mean that we
  had set xmit_restart just before jiffies changed. But the
  average free capacity when we *resumed* xmits is:
  Sum of slots / (Good + Bad) = 43.
 
  So the delay of 1 jiffy helped the host clean up, on average,
  just 43 entries, which is 16% of total entries. This is
  intended to show that the guest is not sitting idle waiting
  for the jiffy to expire.

 OK, nice, this is exactly what my patchset is trying
 to do, without playing with timers: tell the host
 to interrupt us after 3/4 of the ring is free.
 Why 3/4 and not all of the ring? My hope is we can
 get some parallelism with the host this way.
 Why 3/4 and not 7/8? No idea :)

 I can post it, mind testing this?
   
Sure.
  
   Just posted. Would appreciate feedback.
 
  Do I need to apply all the patches and simply test?
 
  Thanks,
 
  - KK

 Exactly. You can also try to tune the threshold
 for interrupts as well.

Could you send me (privately) the entire virtio-net/vhost
patch in a single file? It will help me quite a bit :)
Either attachment or inline is fine.

thanks,

- KK



Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance

2011-05-05 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 03:42:29 PM:

It shows that 2.9% of the time, the 1 jiffy was not enough
to free up space in the txq.
  
   How common is it to free up space in *less than* 1 jiffy?
 
  True,

 Sorry, which statement do you say is true? That interrupt
 after less than 1 jiffy is common?

I meant to say that, 97% of the time, space was enough for
the next xmit to succeed. This is keeping in mind that on
average 43 slots were freed up, indicating that the guest
was not waiting around for too long.

Regarding whether interrupts in less than 1 jiffy are
common, I think most of the time they are. But
increasing the limit for when to do the cb would
push it towards a jiffy.

To confirm, I just put some counters in the original
code and found that interrupts happen in less than a
jiffy around 96.75% of the time, only 3.25% took 1
jiffy. But as expected, this is with the host
interrupting immediately, which leads to many
stop/start/interrupts due to very little free capacity.

  but the point is that the space freed is just
  enough for 43 entries, keeping it lower means a flood
  of (psuedo) stop's and restart's.

 Better yet, here they are in git:

 git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next-event-idx-v1
 git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git virtio-net-event-idx-v1

Great, I will pick up from here.

thanks,

- KK



Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance

2011-05-05 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 02:34:39 PM:

  Do I need to apply all the patches and simply test?
 
  Thanks,
 
  - KK

 Exactly. You can also try to tune the threshold
 for interrupts as well.

I haven't tuned the threshold, it is left at 3/4. I ran
the new qemu/vhost/guest, and the results for 1K, 2K and 16K
are below. Note this is a different kernel version from my
earlier test results. So, f.e., BW1 represents 2.6.39-rc2,
the original kernel; while BW2 represents 2.6.37-rc5 (MST's
kernel). This also isn't with the fixes you have sent just
now. I will get a run with that either late tonight or
tomorrow.


_______________________________________________________________
                       I/O size: 1K
#     BW1      BW2 (%)         SD1         SD2 (%)
_______________________________________________________________
1     1723     3016 (75.0)     4.7         2.6 (-44.6)
2     3223     6712 (108.2)    18.0        7.1 (-60.5)
4     7223     8258 (14.3)     36.5        24.3 (-33.4)
8     8689     7943 (-8.5)     131.5       101.6 (-22.7)
16    8059     7398 (-8.2)     578.3       406.4 (-29.7)
32    7758     7208 (-7.0)     2281.4      1574.7 (-30.9)
64    7503     7155 (-4.6)     9734.0      6368.0 (-34.5)
96    7496     7078 (-5.5)     21980.9     15477.6 (-29.5)
128   7389     6900 (-6.6)     40467.5     26031.9 (-35.6)
_______________________________________________________________
Summary: BW: (4.4)  SD: (-33.5)


_______________________________________________________________
                       I/O size: 2K
#     BW1      BW2 (%)         SD1         SD2 (%)
_______________________________________________________________
1     1608     4968 (208.9)    5.0         1.3 (-74.0)
2     3354     6974 (107.9)    18.6        4.9 (-73.6)
4     8234     8344 (1.3)      35.6        17.9 (-49.7)
8     8427     7818 (-7.2)     103.5       71.2 (-31.2)
16    7995     7491 (-6.3)     410.1       273.9 (-33.2)
32    7863     7149 (-9.0)     1678.6      1080.4 (-35.6)
64    7661     7092 (-7.4)     7245.3      4717.2 (-34.8)
96    7517     6984 (-7.0)     15711.2     9838.9 (-37.3)
128   7389     6851 (-7.2)     27121.6     18255.7 (-32.6)
_______________________________________________________________
Summary: BW: (6.0)  SD: (-34.5)


_______________________________________________________________
                       I/O size: 16K
#     BW1      BW2 (%)         SD1         SD2 (%)
_______________________________________________________________
1     6684     7019 (5.0)      1.1         1.1 (0)
2     7674     7196 (-6.2)     5.0         4.8 (-4.0)
4     7358     8032 (9.1)      21.3        20.4 (-4.2)
8     7393     8015 (8.4)      82.7        82.0 (-.8)
16    7958     8366 (5.1)      283.2       310.7 (9.7)
32    7792     8113 (4.1)      1257.5      1363.0 (8.3)
64    7673     8040 (4.7)      5723.1      5812.4 (1.5)
96    7462     7883 (5.6)      12731.8     12119.8 (-4.8)
128   7338     7800 (6.2)      21331.7     21094.7 (-1.1)
_______________________________________________________________
Summary: BW: (4.6)  SD: (-1.5)

Thanks,

- KK



Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance

2011-05-05 Thread Krishna Kumar2
Krishna Kumar wrote on 05/05/2011 08:57:13 PM:

Oops, I sent my patch's test results for the 16K case.
The correct one is:


_______________________________________________________________
                       I/O size: 16K
#     BW1      BW2 (%)         SD1         SD2 (%)
_______________________________________________________________
1     6684     6670 (-.2)      1.1         .6 (-45.4)
2     7674     7859 (2.4)      5.0         2.6 (-48.0)
4     7358     7421 (.8)       21.3        11.6 (-45.5)
8     7393     7289 (-1.4)     82.7        44.8 (-45.8)
16    7958     7280 (-8.5)     283.2       166.3 (-41.2)
32    7792     7163 (-8.0)     1257.5      692.4 (-44.9)
64    7673     7096 (-7.5)     5723.1      2870.3 (-49.8)
96    7462     6963 (-6.6)     12731.8     6475.6 (-49.1)
128   7338     6919 (-5.7)     21331.7     12345.7 (-42.1)
_______________________________________________________________
Summary: BW: (-3.9)  SD: (-45.4)

Sorry for the confusion.

Regards,

- KK

 
 _______________________________________________________________
                        I/O size: 16K
 #     BW1      BW2 (%)         SD1         SD2 (%)
 _______________________________________________________________
 1     6684     7019 (5.0)      1.1         1.1 (0)
 2     7674     7196 (-6.2)     5.0         4.8 (-4.0)
 4     7358     8032 (9.1)      21.3        20.4 (-4.2)
 8     7393     8015 (8.4)      82.7        82.0 (-.8)
 16    7958     8366 (5.1)      283.2       310.7 (9.7)
 32    7792     8113 (4.1)      1257.5      1363.0 (8.3)
 64    7673     8040 (4.7)      5723.1      5812.4 (1.5)
 96    7462     7883 (5.6)      12731.8     12119.8 (-4.8)
 128   7338     7800 (6.2)      21331.7     21094.7 (-1.1)
 _______________________________________________________________
 Summary: BW: (4.6)  SD: (-1.5)



Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance

2011-05-04 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/04/2011 08:16:22 PM:

  A. virtio:
     - Provide an API to get the available number of slots.

  B. virtio-net:
     - Remove stop/start txq's and associated callback.
     - Pre-calculate the number of slots needed to transmit
       the skb in xmit_skb and bail out early if enough space
       is not available. My testing shows that 2.5-3% of
       packets benefit from using this API.
     - Do not drop skbs but instead return TX_BUSY like other
       drivers.
     - When returning EBUSY, set a per-txq variable to indicate
       to dev_queue_xmit() whether to restart xmits on this txq.

  C. net/sched/sch_generic.c:
     Since virtio-net now returns EBUSY, the skb is requeued to
     gso_skb. This allows adding the additional check for restarting
     xmits in just the slow path (the first re-queued packet
     case of dequeue_skb, where it checks for gso_skb) before
     deciding whether to call the driver or not.
 
  Patch was also tested between two servers with Emulex OneConnect
  10G cards to confirm there is no regression. Though the patch is
  an attempt to improve only small packet performance, there was
  improvement for 1K, 2K and also 16K, both in BW and SD. Results
  from Guest -> Remote Host (BW in Mbps) for 1K and 16K I/O sizes:
 
  
   I/O Size: 1K
  #   BW1   BW2 (%)  SD1   SD2 (%)
  
  1   1226   3313 (170.2)   6.6   1.9 (-71.2)
  2   3223   7705 (139.0)   18.0   7.1 (-60.5)
  4   7223   8716 (20.6)   36.5   29.7 (-18.6)
  8   8689   8693 (0)   131.5   123.0 (-6.4)
  16   8059   8285 (2.8)   578.3   506.2 (-12.4)
  32   7758   7955 (2.5)   2281.4   2244.2 (-1.6)
  64   7503   7895 (5.2)   9734.0   9424.4 (-3.1)
  96   7496   7751 (3.4)   21980.9   20169.3 (-8.2)
  128   7389   7741 (4.7)   40467.5   34995.5 (-13.5)
  
  Summary:   BW: 16.2%   SD: -10.2%
 
  
   I/O Size: 16K
  #   BW1   BW2 (%)  SD1   SD2 (%)
  
  1   6684   7019 (5.0)   1.1   1.1 (0)
  2   7674   7196 (-6.2)   5.0   4.8 (-4.0)
  4   7358   8032 (9.1)   21.3   20.4 (-4.2)
  8   7393   8015 (8.4)   82.7   82.0 (-.8)
  16   7958   8366 (5.1)   283.2   310.7 (9.7)
  32   7792   8113 (4.1)   1257.5   1363.0 (8.3)
  64   7673   8040 (4.7)   5723.1   5812.4 (1.5)
  96   7462   7883 (5.6)   12731.8   12119.8 (-4.8)
  128   7338   7800 (6.2)   21331.7   21094.7 (-1.1)
  
  Summary:   BW: 4.6%   SD: -1.5%
 
  Signed-off-by: Krishna Kumar krkum...@in.ibm.com
  ---

 So IIUC, we delay transmit by an arbitrary value and hope
 that the host is done with the packets by then?

Not hope exactly. If the device is not ready, then
the packet is requeued. The main idea is to avoid
drops/stop/starts, etc.

 Interesting.

 I am currently testing an approach where
 we tell the host explicitly to interrupt us only after
 a large part of the queue is empty.
 With 256 entries in a queue, we should get 1 interrupt per
 on the order of 100 packets which does not seem like a lot.

 I can post it, mind testing this?

Sure.

- KK



Re: [PATCH 2/4] [RFC] virtio: Introduce new API to get free space

2011-05-04 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 01:30:23 AM:

    @@ -185,11 +193,6 @@ int virtqueue_add_buf_gfp(struct virtque
       if (vq->num_free < out + in) {
          pr_debug("Can't add buf len %i - avail = %i\n",
                   out + in, vq->num_free);
    -     /* FIXME: for historical reasons, we force a notify here if
    -      * there are outgoing parts to the buffer.  Presumably the
    -      * host should service the ring ASAP. */
    -     if (out)
    -        vq->notify(&vq->vq);
          END_USE(vq);
          return -ENOSPC;
       }
 
  This will break qemu versions 0.13 and back.
  I'm adding some new virtio ring flags, we'll be
  able to reuse one of these to mean 'no need for
  work around', I think.

 Not really, it won't. We shall almost never get here at all.
 But then, why would this help performance?

Yes, it is not needed. I will be testing it without this
also.

thanks,

- KK



Re: [RFC PATCH 0/2] Multiqueue support for qemu(virtio-net)

2011-04-20 Thread Krishna Kumar2
Thanks Jason!

So I can use my virtio-net guest driver and test with this patch?
Please provide the script you use to start MQ guest.

Regards,

- KK

Jason Wang jasow...@redhat.com wrote on 04/20/2011 02:03:07 PM:

 Jason Wang jasow...@redhat.com
 04/20/2011 02:03 PM

 To

 Krishna Kumar2/India/IBM@IBMIN, kvm@vger.kernel.org, m...@redhat.com,
 net...@vger.kernel.org, ru...@rustcorp.com.au, qemu-
 de...@nongnu.org, anth...@codemonkey.ws

 cc

 Subject

 [RFC PATCH 0/2] Multiqueue support for qemu(virtio-net)

 Inspired by Krishna's patch
(http://www.spinics.net/lists/kvm/msg52098.html
 ) and
 Michael's suggestions.  The following series adds multiqueue support for
 qemu and enables it for virtio-net (both userspace and vhost).

 The aim of this series is to simplify the management and achieve the same
 performance with less code.

 Following are the differences between this series and Krishna's:

 - Add the multiqueue support for qemu and also for userspace virtio-net
 - Instead of hacking the vhost module to manipulate kthreads, this patch just
   implements the userspace-based multiqueues and thus can re-use the
   existing vhost kernel-side code without any modification.
 - Use a 1:1 mapping between TX/RX pairs and vhost kthreads because the
   implementation is based on userspace.
 - The cli is also changed to make the mgmt easier; the -netdev option of qdev
   can now accept more than one id. You can start a multiqueue virtio-net
   device through:
 ./qemu-system-x86_64 -netdev tap,id=hn0,vhost=on,fd=X -netdev
 tap,id=hn0,vhost=on,fd=Y -device
virtio-net-pci,netdev=hn0#hn1,queues=2 ...

 The series is very primitive and still needs polishing.

 Suggestions are welcomed.
 ---

 Jason Wang (2):
   net: Add multiqueue support
   virtio-net: add multiqueue support


  hw/qdev-properties.c |   37 -
  hw/qdev.h|3
  hw/vhost.c   |   26 ++-
  hw/vhost.h   |1
  hw/vhost_net.c   |7 +
  hw/vhost_net.h   |2
  hw/virtio-net.c  |  409 +++
 +--
  hw/virtio-net.h  |2
  hw/virtio-pci.c  |1
  hw/virtio.h  |1
  net.c|   34 +++-
  net.h|   15 +-
  12 files changed, 353 insertions(+), 185 deletions(-)

 --
 Jason Wang



Re: [PATCH 2/4] [RFC rev2] virtio-net changes

2011-04-13 Thread Krishna Kumar2
Hi Rusty,

Thanks for your feedback. I agree with all the changes, and will
make them and resubmit next.

thanks,

- KK

Rusty Russell ru...@rustcorp.com.au wrote on 04/13/2011 06:58:02 AM:

 Rusty Russell ru...@rustcorp.com.au
 04/13/2011 06:58 AM

 To

 Krishna Kumar2/India/IBM@IBMIN, da...@davemloft.net, m...@redhat.com

 cc

 eric.duma...@gmail.com, a...@arndb.de, net...@vger.kernel.org,
 ho...@verge.net.au, a...@redhat.com, anth...@codemonkey.ws,
 kvm@vger.kernel.org, Krishna Kumar2/India/IBM@IBMIN

 Subject

 Re: [PATCH 2/4] [RFC rev2] virtio-net changes

 On Tue, 05 Apr 2011 20:38:52 +0530, Krishna Kumar krkum...@in.ibm.com
wrote:
  Implement mq virtio-net driver.
 
  Though struct virtio_net_config changes, it works with the old
  qemu since the last element is not accessed unless qemu sets
  VIRTIO_NET_F_MULTIQUEUE.
 
  Signed-off-by: Krishna Kumar krkum...@in.ibm.com

 Hi Krishna!

 This change looks fairly solid, but I'd prefer it split into a few
 stages for clarity.

 The first patch should extract out the struct send_queue and struct
 receive_queue, even though there's still only one.  The second patch
 can then introduce VIRTIO_NET_F_MULTIQUEUE.

 You could split into more parts if that makes sense, but I'd prefer to
 see the mechanical changes separate from the feature addition.

  -struct virtnet_info {
  -   struct virtio_device *vdev;
  -   struct virtqueue *rvq, *svq, *cvq;
  -   struct net_device *dev;
  +/* Internal representation of a send virtqueue */
  +struct send_queue {
  +   /* Virtqueue associated with this send _queue */
  +   struct virtqueue *svq;

 You can simply call this vq now it's inside 'send_queue'.

  +
  +   /* TX: fragments + linear part + virtio header */
  +   struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];

 Similarly, this can just be sg.

  +static void free_receive_bufs(struct virtnet_info *vi)
  +{
  +   int i;
  +
  +   for (i = 0; i < vi->numtxqs; i++) {
  +      BUG_ON(vi->rq[i] == NULL);
  +      while (vi->rq[i]->pages)
  +         __free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
  +   }
  +}

 You can skip the BUG_ON(), since the next line will have the same effect.

  +/* Free memory allocated for send and receive queues */
  +static void free_rq_sq(struct virtnet_info *vi)
  +{
  +   int i;
  +
  +   if (vi->rq) {
  +      for (i = 0; i < vi->numtxqs; i++)
  +         kfree(vi->rq[i]);
  +      kfree(vi->rq);
  +   }
  +
  +   if (vi->sq) {
  +      for (i = 0; i < vi->numtxqs; i++)
  +         kfree(vi->sq[i]);
  +      kfree(vi->sq);
  +   }

 This looks weird, even though it's correct.

 I think we need a better name than numtxqs and shorter than
 num_queue_pairs.  Let's just use num_queues; sure, there are both tx and
 rx queues, but I still think it's pretty clear.

  +   for (i = 0; i < vi->numtxqs; i++) {
  +      struct virtqueue *svq = vi->sq[i]->svq;
  +
  +      while (1) {
  +         buf = virtqueue_detach_unused_buf(svq);
  +         if (!buf)
  +            break;
  +         dev_kfree_skb(buf);
  +      }
  +   }

 I know this isn't your code, but it's ugly :)

 while ((buf = virtqueue_detach_unused_buf(svq)) != NULL)
 dev_kfree_skb(buf);

  +   for (i = 0; i < vi->numtxqs; i++) {
  +      struct virtqueue *rvq = vi->rq[i]->rvq;
  +
  +      while (1) {
  +         buf = virtqueue_detach_unused_buf(rvq);
  +         if (!buf)
  +            break;

 Here too...

  +#define MAX_DEVICE_NAME  16

 This isn't a good idea, see below.

  +static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
  +{
  +   vq_callback_t **callbacks;
  +   struct virtqueue **vqs;
  +   int i, err = -ENOMEM;
  +   int totalvqs;
  +   char **names;

 This whole routine is really messy.  How about doing find_vqs first?
 Then having routines like setup_rxq(), setup_txq() and setup_controlq()
 would make this neater:

 static int setup_rxq(struct send_queue *sq, char *name);

 Also, use kasprintf() instead of kmalloc & sprintf.
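
For illustration, the kasprintf() pattern could look like this (a sketch
only, with hypothetical names; not from the posted patches):

static char **alloc_vq_names(int num_queues)
{
	char **names;
	int i;

	names = kcalloc(num_queues * 2, sizeof(*names), GFP_KERNEL);
	if (!names)
		return NULL;

	for (i = 0; i < num_queues; i++) {
		names[2 * i]     = kasprintf(GFP_KERNEL, "input.%d", i);
		names[2 * i + 1] = kasprintf(GFP_KERNEL, "output.%d", i);
		if (!names[2 * i] || !names[2 * i + 1])
			goto err;
	}
	return names;

err:
	for (; i >= 0; i--) {
		kfree(names[2 * i]);		/* kfree(NULL) is safe */
		kfree(names[2 * i + 1]);
	}
	kfree(names);
	return NULL;
}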

  +#if 1
  +   /* Allocate/initialize parameters for recv/send virtqueues */

 Why is this #if 1'd?

 I do prefer the #else method of doing two loops, myself (but use
 kasprintf).

 Cheers,
 Rusty.



Re: [PATCH 0/4] [RFC rev2] Implement multiqueue (RX TX) virtio-net

2011-04-13 Thread Krishna Kumar2
Avi Kivity a...@redhat.com wrote on 04/13/2011 05:30:11 PM:

Hi Avi,

  
  1. Reduce vectors for find_vqs().
  2. Make vhost changes minimal. For now, I have restricted the number of
  vhost threads to 4. This can be either made unrestricted; or if the
  userspace vhost works, it can be removed altogether.
 
  Please review and provide feedback. I am travelling a bit in the next
  few days but will respond at the earliest.

 Do you have an update to the virtio-pci spec for this?

Not yet, will keep it in my TODO list.

thanks,

- KK

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/3] [RFC] Changes for MQ virtio-net

2011-03-08 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 03/02/2011 03:36:00 PM:

Sorry for the delayed response, I have been sick the last few days.
I am responding to both your posts here.

  Both virtio-net and vhost need some check to make sure very
  high values are not passed by userspace. Is this not required?

 Whatever we stick in the header is effectively part of
 host/guest interface. Are you sure we'll never want
 more than 16 VQs? This value does not seem that high.

OK, so even constants cannot change?  Given that, should I remove all
checks and use kcalloc?

  OK, so virtio_net_config has num_queue_pairs, and this gets converted to
  numtxqs in virtnet_info?

 Or put num_queue_pairs in virtnet_info too.

For virtnet_info, having numtxqs is easier since all code that loops needs
only 'numtxq'.

  Also, vhost has some
  code that processes tx first before rx (e.g. vhost_net_stop/flush),

 No idea why did I do it this way. I don't think it matters.

  so this approach seemed helpful.
  I am OK either way, what do you
  suggest?

 We get less code generated but also less flexibility.
 I am not sure, I'll play around with code, for now
 let's keep it as is.

OK.

  Yes, it is a waste to have these vectors for tx ints. I initially
  thought of adding a flag to virtio_device to pass to vp_find_vqs,
  but it won't work, so a new API is needed. I can work with you on
  this in the background if you like.

 OK. For starters, how about we change find_vqs to get a structure?  Then
 we can easily add flags that tell us that some interrupts are rare.

Yes. OK to work on this outside this patch series, I guess?
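
One possible shape for such a structure (purely illustrative; this is
not an existing virtio API):

struct virtqueue_req {
	vq_callback_t *callback;	/* NULL for no callback */
	const char *name;
	bool rare_intr;			/* hint: interrupts are rare, OK to
					 * share one MSI-X vector */
};

int find_vqs_req(struct virtio_device *vdev, unsigned int nvqs,
		 struct virtqueue *vqs[], struct virtqueue_req reqs[]);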

  vq's are matched between qemu, virtio-net and vhost. Isn't some check
  required that userspace has not passed a bad value?


 For virtio, I'm not too concerned: qemu can already easily
 crash the guest :)
 For vhost yes, but I'm concerned that even with 16 VQs we are
 drinking a lot of resources already. I would be happier
 if we had a file descriptor per VQs pair in some way.
 The the amount of memory userspace can use up is
 limited by the # of file descriptors.

I will start working on this approach this week and see how it goes.

  OK, so define free_unused_bufs() as:
 
  static void free_unused_bufs(struct virtnet_info *vi, struct virtqueue
  *svq,
struct virtqueue *rvq)
  {
   /* Use svq and rvq with the remaining code unchanged */
  }

 Not sure I understand. I am just suggesting
 adding symmetrical functions like init/cleanup
 alloc/free etc instead of adding stuff in random
 functions that just happens to be called at the right time.

OK, I will clean up this part in the next revision.

  I was not sure what is the best way - a sysctl parameter? Or should the
  maximum depend on number of host cpus? But that results in too many
  threads, e.g. if I have 16 cpus and 16 txqs.

 I guess the question is, wouldn't # of threads == # of vqs work best?
 If we process stuff on a single CPU, let's make it pass through
 a single VQ.
 And to do this, we could simply open multiple vhost fds without
 changing vhost at all.

 Would this work well?

 - enum vhost_net_poll_state tx_poll_state;
 + enum vhost_net_poll_state *tx_poll_state;
  
   another array?
 
  Yes... I am also allocating twice the space that is required,
  to make its usage simple.

 Where's the allocation? Couldn't find it.

vhost_setup_vqs(net.c) allocates it based on nvqs, though numtxqs is
enough.

Thanks,

- KK



Re: [PATCH 2/3] [RFC] Changes for MQ virtio-net

2011-03-08 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 03/08/2011 09:11:04 PM:

 Also, could you post your current version of the qemu code pls?
 It's useful for testing and to see the whole picture.

Sorry for the delay on this.

I am attaching the qemu changes. Some parts of the patch are
completely redundant, eg MAX_TUN_DEVICES and I will remove it
later.

It works with latest qemu and the kernel patch sent earlier.

Please let me know if there are any issues.

thanks,

- KK


(See attached file: qemu.patch)



Re: [PATCH 0/3] [RFC] Implement multiqueue (RX TX) virtio-net

2011-03-04 Thread Krishna Kumar2
Andrew Theurer haban...@linux.vnet.ibm.com wrote on 03/04/2011 12:31:24
AM:

Hi Andrew,

  ___
  TCP: Guest -> Local Host (TCP_STREAM)
  TCP: Local Host -> Guest (TCP_MAERTS)
  UDP: Local Host -> Guest (UDP_STREAM)


 Any reason why the tests don't include guest-to-guest on the same host, or
 on different hosts?  Seems like those would be a lot more common than
 guest-to/from-localhost.

This was missing in my test plan, but good point. I will run
these tests also and send the results soon.

Thanks,

- KK



Re: [PATCH 2/3] [RFC] Changes for MQ virtio-net

2011-03-01 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 02/28/2011 03:13:20 PM:

Thank you once again for your feedback on both these patches.
I will send the qemu patch tomorrow. I will also send the next
version incorporating these suggestions once we finalize some
minor points.

 Overall looks good.
 The numtxqs meaning the number of rx queues needs some cleanup.
 init/cleanup routines need more symmetry.
 Error handling on setup also seems slightly buggy or at least
asymmetrical.
 Finally, this will use up a large number of MSI vectors,
 while TX interrupts mostly stay unused.

 Some comments below.

  +/* Maximum number of individual RX/TX queues supported */
  +#define VIRTIO_MAX_TXQS 16
  +

 This also does not seem to belong in the header.

Both virtio-net and vhost need some check to make sure very
high values are not passed by userspace. Is this not required?
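If it is, a trivial clamp in the guest probe path should be enough
(sketch only; numtxqs here is whatever value we end up reading from the
config space):

	if (numtxqs > VIRTIO_MAX_TXQS) {
		dev_err(&vdev->dev, "bad numtxqs: %d\n", numtxqs);
		return -EINVAL;
	}

with a similar sanity check in vhost when userspace sets up the vqs.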

  +#define VIRTIO_NET_F_NUMTXQS	21	/* Device supports multiple TX queue */

 VIRTIO_NET_F_MULTIQUEUE ?

Yes, that's a better name.

  @@ -34,6 +38,8 @@ struct virtio_net_config {
   __u8 mac[6];
   /* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
   __u16 status;
  +/* number of RX/TX queues */
  +__u16 numtxqs;

 The interface here is a bit ugly:
 - this is really both # of tx and rx queues but called numtxqs
 - there's a hardcoded max value
 - 0 is assumed to be same as 1
 - assumptions above are undocumented.

 One way to address this could be num_queue_pairs, and something like
	/* The actual number of TX and RX queues is num_queue_pairs + 1 each. */
	__u16 num_queue_pairs;
 (and tweak code to match).

 Alternatively, have separate registers for the number of tx and rx
queues.

OK, so virtio_net_config has num_queue_pairs, and this gets converted to
numtxqs in virtnet_info?
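i.e., something roughly like this (sketch, names still open):

struct virtio_net_config {
	__u8 mac[6];
	__u16 status;
	/* The actual number of TX and RX queues is num_queue_pairs + 1 each */
	__u16 num_queue_pairs;
} __attribute__((packed));

and in virtnet_probe():

	u16 num_queue_pairs = 0;

	/* left at 0 (single queue pair) if the feature/config is absent */
	virtio_config_val(vdev, VIRTIO_NET_F_MULTIQUEUE,
			  offsetof(struct virtio_net_config, num_queue_pairs),
			  &num_queue_pairs);
	vi->numtxqs = num_queue_pairs + 1;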

  +struct virtnet_info {
  +	struct send_queue **sq;
  +	struct receive_queue **rq;
  +
  +	/* read-mostly variables */
  +	int numtxqs ____cacheline_aligned_in_smp;

 Why do you think this alignment is a win?

Actually this code was from the earlier patchset (MQ TX only) where
the layout was different. Now rq and sq are allocated as follows:
	vi->sq = kzalloc(numtxqs * sizeof(*vi->sq), GFP_KERNEL);
	for (i = 0; i < numtxqs; i++) {
		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
Since the two pointers become read-only during use, there is no cache
line dirtying.  I will remove this directive.

  +/*
  + * Note for 'qnum' below:
  + *  first 'numtxqs' vqs are RX, next 'numtxqs' vqs are TX.
  + */

 Another option to consider is to have them RX,TX,RX,TX:
 this way vq->queue_index / 2 gives you the
 queue pair number, no need to read numtxqs. On the other hand, it makes
the
 #RX==#TX assumption even more entrenched.

OK. I was following how many drivers were allocating RX and TX's
together - eg ixgbe_adapter has tx_ring and rx_ring arrays; bnx2
has rx_buf_ring and tx_buf_ring arrays, etc. Also, vhost has some
code that processes tx first before rx (e.g. vhost_net_stop/flush),
so this approach seemed helpful. I am OK either way, what do you
suggest?

  +	err = vi->vdev->config->find_vqs(vi->vdev, totalvqs, vqs, callbacks,
  +					 (const char **)names);
  +	if (err)
  +		goto free_params;
  +

 This would use up quite a lot of vectors. However,
 tx interrupt is, in fact, slow path. So, assuming we don't have
 enough vectors to use per vq, I think it's a good idea to
 support reducing MSI vector usage by mapping all TX VQs to the same
vector
 and separate vectors for RX.
 The hypervisor actually allows this, but we don't have an API at the
virtio
 level to pass that info to virtio pci ATM.
 Any idea what a good API to use would be?

Yes, it is a waste to have these vectors for tx ints. I initially
thought of adding a flag to virtio_device to pass to vp_find_vqs,
but it won't work, so a new API is needed. I can work with you on
this in the background if you like.

  +	for (i = 0; i < numtxqs; i++) {
  +		vi->rq[i]->rvq = vqs[i];
  +		vi->sq[i]->svq = vqs[i + numtxqs];

 This logic is spread all over. We need some kind of macro to
 get queue number of vq number and back.

Will add this.
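Something like the below perhaps, assuming the first-half-RX /
second-half-TX layout stays (with the interleaved RX,TX,RX,TX layout it
would simply be queue_index / 2) -- sketch only:

/* vqs [0..numtxqs-1] are RX, vqs [numtxqs..2*numtxqs-1] are TX */
#define rx_vq_index(qpair, numtxqs)	(qpair)
#define tx_vq_index(qpair, numtxqs)	((qpair) + (numtxqs))
#define vq_to_qpair(qnum, numtxqs)	((qnum) % (numtxqs))
#define vq_is_tx(qnum, numtxqs)		((qnum) >= (numtxqs))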

  +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
  +		vi->cvq = vqs[i + numtxqs];
  +
  +		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
  +			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;

 This bit does not seem to belong in initialize_vqs.

I will move it back to probe.

  +err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
  +

Re: [PATCH 3/3] [RFC] Changes for MQ vhost

2011-03-01 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 02/28/2011 03:34:23 PM:

  The number of vhost threads is <= #txqs.  Threads handle more
  than one txq when #txqs is more than MAX_VHOST_THREADS (4).

 It is this sharing that prevents us from just reusing multiple vhost
 descriptors?

Sorry, I didn't understand this question.

 4 seems a bit arbitrary - do you have an explanation
 on why this is a good number?

I was not sure what is the best way - a sysctl parameter? Or should the
maximum depend on number of host cpus? But that results in too many
threads, e.g. if I have 16 cpus and 16 txqs.

  +	struct task_struct *worker;	/* worker for this vq */
  +	spinlock_t *work_lock;		/* points to a dev->work_lock[] entry */
  +	struct list_head *work_list;	/* points to a dev->work_list[] entry */
  +	int qnum;			/* 0 for RX, 1 - n-1 for TX */

 Is this right?

Will fix this.

  @@ -122,12 +128,33 @@ struct vhost_dev {
   	int nvqs;
   	struct file *log_file;
   	struct eventfd_ctx *log_ctx;
  -	spinlock_t work_lock;
  -	struct list_head work_list;
  -	struct task_struct *worker;
  +	spinlock_t *work_lock[MAX_VHOST_THREADS];
  +	struct list_head *work_list[MAX_VHOST_THREADS];

 This looks a bit strange. Won't sticking everything in a single
 array of structures rather than multiple arrays be better for cache
 utilization?

Correct. In that context, which is better:
struct {
	spinlock_t *work_lock;
	struct list_head *work_list;
} work[MAX_VHOST_THREADS];
or, to make sure work_lock/work_list is cache-aligned:
struct work_lock_list {
	spinlock_t work_lock;
	struct list_head work_list;
} ____cacheline_aligned_in_smp;
and define:
struct vhost_dev {
	...
	struct work_lock_list work[MAX_VHOST_THREADS];
};
The second method uses a little more space but each vhost needs only
one (read-only) cache line. I tested with this and can confirm it
aligns each element on a cache line. BW improved slightly (up to
3%), and remote SD improves by up to 4% or so.
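With either layout the queueing path itself stays what it is today,
just indirected through the per-vq pointers, roughly like this
(sketch based on the fields quoted earlier, not the exact patch):

static void vhost_work_queue(struct vhost_virtqueue *vq,
			     struct vhost_work *work)
{
	unsigned long flags;

	spin_lock_irqsave(vq->work_lock, flags);
	if (list_empty(&work->node)) {
		/* queue the work on this vq's assigned worker thread */
		list_add_tail(&work->node, vq->work_list);
		work->queue_seq++;
		wake_up_process(vq->worker);
	}
	spin_unlock_irqrestore(vq->work_lock, flags);
}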

  +static inline int get_nvhosts(int nvqs)

 nvhosts - nthreads?

Yes.

  +static inline int vhost_get_thread_index(int index, int numtxqs, int nvhosts)
  +{
  +	return (index % numtxqs) % nvhosts;
  +}
  +

 As the only caller passes MAX_VHOST_THREADS,
 just use that?

Yes, nice catch.

   struct vhost_net {
  	struct vhost_dev dev;
  -	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
  -	struct vhost_poll poll[VHOST_NET_VQ_MAX];
  +	struct vhost_virtqueue *vqs;
  +	struct vhost_poll *poll;
  +	struct socket **socks;
  	/* Tells us whether we are polling a socket for TX.
  	 * We only do this when socket buffer fills up.
  	 * Protected by tx vq lock. */
  -	enum vhost_net_poll_state tx_poll_state;
  +	enum vhost_net_poll_state *tx_poll_state;

 another array?

Yes... I am also allocating twice the space that is required
to make its usage simple. Please let me know what you feel about
this.

Thanks,

- KK



Re: [PATCH 0/3] [RFC] Implement multiqueue (RX TX) virtio-net

2011-02-28 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 02/28/2011 01:05:15 PM:

  This patch series is a continuation of an earlier one that
  implemented guest MQ TX functionality.  This new patchset
  implements both RX and TX MQ.  Qemu changes are not being
  included at this time solely to aid in easier review.
  Compatibility testing with old/new combinations of qemu/guest
  and vhost was done without any issues.
 
  Some early TCP/UDP test results are at the bottom of this
  post, I plan to submit more test results in the coming days.
 
  Please review and provide feedback on what can improve.
 
  Thanks!
 
  Signed-off-by: Krishna Kumar krkum...@in.ibm.com


 To help testing, could you post the qemu changes separately please?

Thanks Michael for your review and feedback. I will send the qemu
changes and respond to your comments tomorrow.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-24 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 09:25:34 PM:

  Sure, will get a build/test on latest bits and send in 1-2 days.
 
The TX-only patch helped the guest TX path but didn't help
host-guest much (as tested using TCP_MAERTS from the guest).
But with the TX+RX patch, both directions are getting
improvements.
  
   Also, my hope is that with appropriate queue mapping,
   we might be able to do away with heuristics to detect
   single stream load that TX only code needs.
 
  Yes, that whole stuff is removed, and the TX/RX path is
  unchanged with this patch (thankfully :)

 Cool. I was wondering whether in that case, we can
 do without host kernel changes at all,
 and use a separate fd for each TX/RX pair.
 The advantage of that approach is that this way,
 the max fd limit naturally sets an upper bound
 on the amount of resources userspace can use up.

 Thoughts?

 In any case, pls don't let the above delay
 sending an RFC.

I will look into this also.

Please excuse the delay in sending the patch out faster - my
bits are a little old, so it is taking some time to move to
the latest kernel and get some initial TCP/UDP test results.
I should have it ready by tomorrow.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-22 Thread Krishna Kumar2
Simon Horman ho...@verge.net.au wrote on 02/22/2011 01:17:09 PM:

Hi Simon,


 I have a few questions about the results below:

 1. Are the (%) comparisons between non-mq and mq virtio?

Yes - mainline kernel with transmit-only MQ patch.

 2. Was UDP or TCP used?

TCP. I had done some initial testing on UDP, but don't have
the results now as they are really old. But I will be running
it again.

 3. What was the transmit size (-m option to netperf)?

I didn't use the -m option, so it defaults to 16K. The
script does:

netperf -t TCP_STREAM -c -C -l 60 -H $SERVER

 Also, I'm interested to know what the status of these patches is.
 Are you planing a fresh series?

Yes. Michael Tsirkin had wanted to see how the MQ RX patch
would look like, so I was in the process of getting the two
working together. The patch is ready and is being tested.
Should I send a RFC patch at this time?

The TX-only patch helped the guest TX path but didn't help
host-guest much (as tested using TCP_MAERTS from the guest).
But with the TX+RX patch, both directions are getting
improvements. Remote testing is still to be done.

Thanks,

- KK

Changes from rev2:
--
  1. Define (in virtio_net.h) the maximum send txqs; and use in
 virtio-net and vhost-net.
  2. vi->sq[i] is allocated individually, resulting in cache line
 aligned sq[0] to sq[n].  Another option was to define
 'send_queue' as:
 struct send_queue {
 struct virtqueue *svq;
 struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
  } ____cacheline_aligned_in_smp;
 and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
 the submitted method is preferable.
  3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
 handles TX[0-n].
  4. Further change TX handling such that vhost[0] handles both RX/TX
 for single stream case.
 
Enabling MQ on virtio:
---
  When following options are passed to qemu:
  - smp > 1
  - vhost=on
  - mq=on (new option, default:off)
  then #txqueues = #cpus.  The #txqueues can be changed by using an
  optional 'numtxqs' option.  e.g. for a smp=4 guest:
  vhost=on                   ->  #txqueues = 1
  vhost=on,mq=on             ->  #txqueues = 4
  vhost=on,mq=on,numtxqs=2   ->  #txqueues = 2
  vhost=on,mq=on,numtxqs=8   ->  #txqueues = 8
 
 
 Performance (guest -> local host):
 ---
  System configuration:
  Host:  8 Intel Xeon, 8 GB memory
  Guest: 4 cpus, 2 GB memory
  Test: Each test case runs for 60 secs, sum over three runs (except
  when number of netperf sessions is 1, which has 10 runs of 12 secs
  each).  No tuning (default netperf) other than taskset vhost's to
  cpus 0-3.  numtxqs=32 gave the best results though the guest had
  only 4 vcpus (I haven't tried beyond that).
 
  __ numtxqs=2, vhosts=3  
  #sessions  BW%      CPU%     RCPU%    SD%      RSD%
  ----------------------------------------------------
  1          4.46     -1.96    .19      -12.50   -6.06
  2          4.93     -1.16    2.10     0        -2.38
  4          46.17    64.77    33.72    19.51    -2.48
  8          47.89    70.00    36.23    41.46    13.35
  16         48.97    80.44    40.67    21.11    -5.46
  24         49.03    78.78    41.22    20.51    -4.78
  32         51.11    77.15    42.42    15.81    -6.87
  40         51.60    71.65    42.43    9.75     -8.94
  48         50.10    69.55    42.85    11.80    -5.81
  64         46.24    68.42    42.67    14.18    -3.28
  80         46.37    63.13    41.62    7.43     -6.73
  96         46.40    63.31    42.20    9.36     -4.78
  128        50.43    62.79    42.16    13.11    -1.23
  
  BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
 
  __ numtxqs=8, vhosts=5  
  #sessions   BW%  CPU% RCPU% SD%  RSD%
  
  1          -.76     -1.56    2.33     0        3.03
  2          17.41    11.11    11.41    0        -4.76
  4          42.12    55.11    30.20    19.51    .62
  8          54.69    80.00    39.22    24.39    -3.88
  16         54.77    81.62    40.89    20.34    -6.58
  24         54.66    79.68    41.57    15.49    -8.99
  32         54.92    76.82    41.79    17.59    -5.70
  40         51.79    68.56    40.53    15.31    -3.87
  48         51.72    66.40    40.84    9.72     -7.13
  64         51.11    63.94    41.10    5.93     -8.82
  80         46.51    59.50    39.80    9.33     -4.18
  96         47.72    57.75    39.84    4.20     -7.62
  128        54.35    58.95    40.66    3.24     -8.63
  
  BW: 38.9%,  

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-22 Thread Krishna Kumar2
 Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 12:09:15 PM:

Hi Michael,

  Yes. Michael Tsirkin had wanted to see how the MQ RX patch
  would look like, so I was in the process of getting the two
  working together. The patch is ready and is being tested.
  Should I send a RFC patch at this time?

 Yes, please do.

Sure, will get a build/test on latest bits and send in 1-2 days.

  The TX-only patch helped the guest TX path but didn't help
  host-guest much (as tested using TCP_MAERTS from the guest).
  But with the TX+RX patch, both directions are getting
  improvements.

 Also, my hope is that with appropriate queue mapping,
 we might be able to do away with heuristics to detect
 single stream load that TX only code needs.

Yes, that whole stuff is removed, and the TX/RX path is
unchanged with this patch (thankfully :)

  Remote testing is still to be done.

 Others might be able to help here once you post the patch.

That's great, will appreciate any help.

Thanks,

- KK



Re: Network performance with small packets

2011-02-01 Thread Krishna Kumar2
 Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM

 On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote:
  On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote:
   Confused. We compare capacity to skb frags, no?
   That's sg I think ...
 
  Current guest kernel use indirect buffers, num_free returns how many
  available descriptors not skb frags. So it's wrong here.
 
  Shirley

 I see. Good point. In other words when we complete the buffer
 it was indirect, but when we add a new one we
 can not allocate indirect so we consume.
 And then we start the queue and add will fail.
 I guess we need some kind of API to figure out
 whether the buf we complete was indirect?

 Another failure mode is when skb_xmit_done
 wakes the queue: it might be too early, there
 might not be space for the next packet in the vq yet.

I am not sure if this is the problem - shouldn't you
see these messages:
	if (likely(capacity == -ENOMEM)) {
		dev_warn(&dev->dev,
			 "TX queue failure: out of memory\n");
	} else {
		dev->stats.tx_fifo_errors++;
		dev_warn(&dev->dev,
			 "Unexpected TX queue failure: %d\n",
			 capacity);
	}
in next xmit? I am not getting this in my testing.

 A solution might be to keep some kind of pool
 around for indirect, we wanted to do it for block anyway ...

Your vhost patch should fix this automatically. Right?

Thanks,

- KK



Re: Network performance with small packets

2011-02-01 Thread Krishna Kumar2
 On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:
 
  The way I am changing is only when netif queue has stopped, then we
  start to count num_free descriptors to send the signal to wake netif
  queue.

 I forgot to mention, the code change I am making is in guest kernel, in
 xmit call back only wake up the queue when it's stopped && num_free >=
 1/2 * vq->num, I add a new API in virtio_ring.

FYI :)

I have tried this before. There are a couple of issues:

1. the free count will not reduce until you run free_old_xmit_skbs,
   which will not run anymore since the tx queue is stopped.
2. You cannot call free_old_xmit_skbs directly as it races with a
   queue that was just awakened (current cb was due to the delay
   in disabling cb's).

You have to call free_old_xmit_skbs() under netif_queue_stopped()
check to avoid the race.
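Roughly, the shape being discussed is the following (sketch only;
virtqueue_free_count()/virtqueue_size() are stand-ins for the new
virtio_ring API mentioned above, and serialization with a concurrently
running start_xmit is not shown):

static void skb_xmit_done(struct virtqueue *svq)
{
	struct virtnet_info *vi = svq->vdev->priv;

	/* Suppress further interrupts while we decide what to do */
	virtqueue_disable_cb(svq);

	if (netif_queue_stopped(vi->dev)) {
		/* Safe only because the queue is stopped (see above) */
		free_old_xmit_skbs(vi);
		if (virtqueue_free_count(svq) >= virtqueue_size(svq) / 4) {
			netif_wake_queue(vi->dev);
			return;
		}
	}

	/* Not enough room yet (or queue still running): wait for next cb */
	virtqueue_enable_cb(svq);
}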

I got a small improvement in my testing up to some number of threads
(32 or 48?), but beyond that I was getting a regression.

Thanks,

- KK

 However vhost signaling reduction is needed as well. The patch I
 submitted a while ago showed both CPUs and BW improvement.



Re: Network performance with small packets

2011-02-01 Thread Krishna Kumar2
 Shirley Ma mashi...@us.ibm.com wrote:

  I have tried this before. There are a couple of issues:
 
  1. the free count will not reduce until you run free_old_xmit_skbs,
 which will not run anymore since the tx queue is stopped.
  2. You cannot call free_old_xmit_skbs directly as it races with a
 queue that was just awakened (current cb was due to the delay
 in disabling cb's).
 
  You have to call free_old_xmit_skbs() under netif_queue_stopped()
  check to avoid the race.

 Yes, that' what I did, when the netif queue stop, don't enable the
 queue, just free_old_xmit_skbs(), if not enough freed, then enabling
 callback until half of the ring size are freed, then wake the netif
 queue. But somehow I didn't reach the performance compared to drop
 packets, need to think about it more. :)

Did you check if the number of vmexits increased with this
patch? This is possible if the device was keeping up (and
not going into a stop, start, xmit 1 packet, stop, start
loop). Also maybe you should try for 1/4th instead of 1/2?

MST's delayed signalling should avoid this issue, I haven't
tried both together.

Thanks,

- KK



MQ performance on other cards (cxgb3)

2010-11-15 Thread Krishna Kumar2
I had sent this mail to Michael last week - he agrees that I should
share this information on the list:

On latest net-next-2.6, virtio-net (guest -> host) results are:
__
 SQ vs MQ (#txqs=8)
#  BW1  BW2 (%)  CPU1 CPU2 (%)   RCPU1   RCPU2 (%)
___
1     105774  112256 (6.1)    257    255 (-.7)      532    549 (3.1)
2     20842   30674 (47.1)    107    150 (40.1)     208    279 (34.1)
4     22500   31953 (42.0)    241    409 (69.7)     467    619 (32.5)
8     22416   44507 (98.5)    477    1039 (117.8)   960    1459 (51.9)
16    22605   45372 (100.7)   905    2060 (127.6)   1895   2962 (56.3)
24    23192   44201 (90.5)    1360   3028 (122.6)   2833   4437 (56.6)
32    23158   43394 (87.3)    1811   3957 (118.4)   3770   5936 (57.4)
40    23322   42550 (82.4)    2276   4986 (119.0)   4711   7417 (57.4)
48    23564   41931 (77.9)    2757   5966 (116.3)   5653   8896 (57.3)
64    23949   41092 (71.5)    3788   7898 (108.5)   7609   11826 (55.4)
80    23256   41343 (77.7)    4597   9887 (115.0)   9503   14801 (55.7)
96    23310   40645 (74.3)    5588   11758 (110.4)  11381  17761 (56.0)
128   24095   41082 (70.5)    7587   15574 (105.2)  15029  23716 (57.8)
__
Avg:  BW: (58.3)  CPU: (110.8)  RCPU: (55.9)

It's true that average CPU% on guest is almost double that of the BW
improvement. But I don't think this is due to the patch (driver does no
synchronization, etc). To compare MQ vs SQ on a 10G card, I ran the
same test from host to remote host across cxgb3. The results are
somewhat similar:

(I changed cxgb_open on the client system to:
netif_set_real_num_tx_queues(dev, 1);
err = netif_set_real_num_rx_queues(dev, 1);
to simulate single queue (SQ))
_
cxgb3 SQ vs cxgb3 MQ
# BW1  BW2 (%)  CPU1   CPU2 (%)
_
1     8301    8315 (.1)      5      4.66 (-6.6)
2     9395    9380 (-.1)     16     16 (0)
4     9411    9414 (0)       33     26 (-21.2)
8     9411    9398 (-.1)     60     62 (3.3)
16    9412    9413 (0)       116    117 (.8)
24    9442    9963 (5.5)     179    198 (10.6)
32    10031   10025 (0)      230    249 (8.2)
40    9953    10024 (.7)     300    312 (4.0)
48    10002   10015 (.1)     351    376 (7.1)
64    10022   10024 (0)      494    515 (4.2)
80    8894    10011 (12.5)   537    630 (17.3)
96    8465    9907 (17.0)    612    749 (22.3)
128   7541    9617 (27.5)    760    989 (30.1)
_
Avg: BW: (3.8) CPU: (14.8)

(Each case runs once for 60 secs)

The BW increased modestly but CPU increased much more. I assume
the change I made above to convert the driver from MQ to SQ is not
incorrect.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-09 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 11/09/2010 09:03:25 PM:

   Something strange here, right?
   1. You are consistently getting 10G/s here, and even with a single
  stream?
 
  Sorry, I should have mentioned this though I had stated in my
  earlier mails. Each test result has two iterations, each of 60
  seconds, except when #netperfs is 1 for which I do 10 iteration
  (sum across 10 iterations).

 So need to divide the number by 10?

Yes, that is what I get with 512/1K macvtap I/O size :)

   I started doing many more iterations
  for 1 netperf after finding the issue earlier with single stream.
  So the BW is only 4.5-7 Gbps.
 
   2. With 2 streams, is where we get > 10G/s originally. Instead of
  doubling that we get a marginal improvement with 2 queues and
  about 30% worse with 1 queue.
 
  (doubling happens consistently for guest -> host, but never for
  remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
  testing scenario. In first case, there is a slight improvement in
  BW and good reduction in SD. In the second case, only SD improves
  (though BW drops for 2 stream for some reason).  In both cases,
  BW and SD improves as the number of sessions increase.

 I guess this is another indication that something's wrong.

The patch - both virtio-net and vhost-net, doesn't have any
locking/mutex's/ or any synchronization method. Guest - host
performance improvement of upto 100% shows the patch is not
doing anything wrong.

 We are quite far from line rate, the fact BW does not scale
 means there's some contention in the code.

Attaining line speed with macvtap seems to be a generic issue
and unrelated to my patch specifically. IMHO if there is nothing
wrong in the code (review) and it is accepted, it will benefit as
others can also help to find what needs to be implemented in
vhost/macvtap/qemu to get line speed for guest -> remote host.

PS: bare-metal performance for host -> remote host is also
2.7 Gbps and 2.8 Gbps for 512/1024 for the same card.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-08 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 02:27:09 PM:

 Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

 On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
   Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
 
  Any feedback, comments, objections, issues or bugs about the
  patches? Please let me know if something needs to be done.
 
  Some more test results:
  _
   Host -> Guest BW (numtxqs=2)
  #   BW% CPU%RCPU%   SD% RSD%
  _

 I think we discussed the need for external to guest testing
 over 10G. For large messages we should not see any change
 but you should be able to get better numbers for small messages
 assuming a MQ NIC card.

I had to make a few changes to qemu (and a minor change in macvtap
driver) to get multiple TXQ support using macvtap working. The NIC
is a ixgbe card.

__
Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
#     BW1     BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
__
1     14367   13142 (-8.5)   56      62 (10.7)      8       8 (0)
2     3652    3855 (5.5)     37      35 (-5.4)      7       6 (-14.2)
4     12529   12059 (-3.7)   65      77 (18.4)      35      35 (0)
8     13912   14668 (5.4)    288     332 (15.2)     175     184 (5.1)
16    13433   14455 (7.6)    1218    1321 (8.4)     920     943 (2.5)
24    12750   13477 (5.7)    2876    2985 (3.7)     2514    2348 (-6.6)
32    11729   12632 (7.6)    5299    5332 (.6)      4934    4497 (-8.8)
40    11061   11923 (7.7)    8482    8364 (-1.3)    8374    7495 (-10.4)
48    10624   11267 (6.0)    12329   12258 (-.5)    12762   11538 (-9.5)
64    10524   10596 (.6)     21689   22859 (5.3)    23626   22403 (-5.1)
80    9856    10284 (4.3)    35769   36313 (1.5)    39932   36419 (-8.7)
96    9691    10075 (3.9)    52357   52259 (-.1)    58676   53463 (-8.8)
128   9351    9794 (4.7)     114707  94275 (-17.8)  114050  97337 (-14.6)
__
Avg:  BW: (3.3)  SD: (-7.3)  RSD: (-11.0)

__
Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
#     BW1     BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
__
1     16509   15985 (-3.1)   45      47 (4.4)       7       7 (0)
2     6963    4499 (-35.3)   17      51 (200.0)     7       7 (0)
4     12932   11080 (-14.3)  49      74 (51.0)      35      35 (0)
8     13878   14095 (1.5)    223     292 (30.9)     175     181 (3.4)
16    13440   13698 (1.9)    980     1131 (15.4)    926     942 (1.7)
24    12680   12927 (1.9)    2387    2463 (3.1)     2526    2342 (-7.2)
32    11714   12261 (4.6)    4506    4486 (-.4)     4941    4463 (-9.6)
40    11059   11651 (5.3)    7244    7081 (-2.2)    8349    7437 (-10.9)
48    10580   11095 (4.8)    10811   10500 (-2.8)   12809   11403 (-10.9)
64    10569   10566 (0)      19194   19270 (.3)     23648   21717 (-8.1)
80    9827    10753 (9.4)    31668   29425 (-7.0)   39991   33824 (-15.4)
96    10043   10150 (1.0)    45352   44227 (-2.4)   57766   51131 (-11.4)
128   9360    9979 (6.6)     92058   79198 (-13.9)  114381  92873 (-18.8)
__
Avg:  BW: (-.5)  SD: (-7.5)  RSD: (-14.7)

Is there anything else you would like me to test/change, or shall
I submit the next version (with the above macvtap changes)?

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-28 Thread Krishna Kumar2
 Michael S. Tsirkin m...@redhat.com

   I think we discussed the need for external to guest testing
   over 10G. For large messages we should not see any change
   but you should be able to get better numbers for small messages
   assuming a MQ NIC card.
 
  For external host, there is a contention among different
  queues (vhosts) when packets are processed in tun/bridge,
  unless I implement MQ TX for macvtap (tun/bridge?).  So
  my testing shows a small improvement (1 to 1.5% average)
  in BW and a rise in SD (between 10-15%).  For remote host,
  I think tun/macvtap needs MQ TX support?

 Confused. I thought this *is* with a multiqueue tun/macvtap?
 bridge does not do any queueing AFAIK ...
 I think we need to fix the contention. With migration what was guest to
 host a minute ago might become guest to external now ...

Macvtap RX is MQ but not TX. I don't think MQ TX support is
required for macvtap, though. Is it enough for existing
macvtap sendmsg to work, since it calls dev_queue_xmit
which selects the txq for the outgoing device?

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-28 Thread Krishna Kumar2
 Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:

 Results for UDP BW tests (unidirectional, sum across
 3 iterations, each iteration of 45 seconds, default
 netperf, vhosts bound to cpus 0-3; no other tuning):
   
Is binding vhost threads to CPUs really required?
What happens if we let the scheduler do its job?
  
   Nothing drastic, I remember BW% and SD% both improved a
   bit as a result of binding.
 
  If there's a significant improvement this would mean that
  we need to rethink the vhost-net interaction with the scheduler.

 I will get a test run with and without binding and post the
 results later today.

Correction: The result with binding is much better for
SD/CPU compared to without binding:

_
 numtxqs=8,vhosts=5, Bind vs No-bind
# BW% CPU% RCPU% SD%   RSD%
_
1     11.25   10.77    1.89    0        -6.06
2     18.66   7.20     7.20    -14.28   -7.40
4     4.24    -1.27    1.56    -2.70    -.98
8     14.91   -3.79    5.46    -12.19   -3.76
16    12.32   -8.67    4.63    -35.97   -26.66
24    11.68   -7.83    5.10    -40.73   -32.37
32    13.09   -10.51   6.57    -51.52   -42.28
40    11.04   -4.12    11.23   -50.69   -42.81
48    8.61    -10.30   6.04    -62.38   -55.54
64    7.55    -6.05    6.41    -61.20   -56.04
80    8.74    -11.45   6.29    -72.65   -67.17
96    9.84    -6.01    9.87    -69.89   -64.78
128   5.57    -6.23    8.99    -75.03   -70.97
_
BW: 10.4%,  CPU/RCPU: -7.4%,7.7%,  SD: -70.5%,-65.7%

Notes:
1.  All my test results earlier were with vhost bound
    to cpus 0-3 for both org and new kernel.
2.  I am not using MST's use_mm patch, only mainline
    kernel. However, I reported earlier that I got
    better results with that patch. The result for
    MQ vs MQ+use_mm patch (from my earlier mail):

BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-27 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 04:39:13 PM:

(merging two posts into one)

 I think we discussed the need for external to guest testing
 over 10G. For large messages we should not see any change
 but you should be able to get better numbers for small messages
 assuming a MQ NIC card.

For external host, there is a contention among different
queues (vhosts) when packets are processed in tun/bridge,
unless I implement MQ TX for macvtap (tun/bridge?).  So
my testing shows a small improvement (1 to 1.5% average)
in BW and a rise in SD (between 10-15%).  For remote host,
I think tun/macvtap needs MQ TX support?

Results for UDP BW tests (unidirectional, sum across
3 iterations, each iteration of 45 seconds, default
netperf, vhosts bound to cpus 0-3; no other tuning):
  
   Is binding vhost threads to CPUs really required?
   What happens if we let the scheduler do its job?
 
  Nothing drastic, I remember BW% and SD% both improved a
  bit as a result of binding.

 If there's a significant improvement this would mean that
 we need to rethink the vhost-net interaction with the scheduler.

I will get a test run with and without binding and post the
results later today.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 10/26/2010 10:40:35 AM:

  I am trying to wrap my head around kernel/user interface here.
  E.g., will we need another incompatible change when we add multiple RX
  queues?

 Though I added a 'mq' option to qemu, there shouldn't be
 any incompatibility between old and new qemu's wrt vhost
 and virtio-net drivers. So the old qemu will run new host
 and new guest without issues, and new qemu can also run
 old host and old guest. Multiple RXQ will also not add
 any incompatibility.

 With MQ RX, I will be able to remove the heuristic (idea
 from David Stevens).  The idea is: Guest sends out packets
 on, say TXQ#2, vhost#2 processes the packets but packets
 going out from host to guest might be sent out on a
 different RXQ, say RXQ#4.  Guest receives the packet on
 RXQ#4, and all future responses on that connection are sent
 on TXQ#4.  Now vhost#4 processes both RX and TX packets for
 this connection.  Without needing to hash on the connection,
 guest can make sure that the same vhost thread will handle
 a single connection.

  Also need to think about how robust our single stream heuristic is,
  e.g. what are the chances it will misdetect a bidirectional
  UDP stream as a single TCP?

 I think it should not happen. The heuristic code gets
 called for handling just the transmit packets, packets
 that vhost sends out to the guest skip this path.

 I tested unidirectional and bidirectional UDP to confirm:

 8 iterations of iperf tests, each iteration of 15 secs,
 result is the sum of all 8 iterations in Gbits/sec
 __
 Uni-directional  Bi-directional
   Org  New Org  New
 __
   71.7871.77   71.74   72.07
 __


Results for UDP BW tests (unidirectional, sum across
3 iterations, each iteration of 45 seconds, default
netperf, vhosts bound to cpus 0-3; no other tuning):

-- numtxqs=8, vhosts=5 -
#     BW%      CPU%     SD%

1     .49      1.07     0
2     23.51    52.51    26.66
4     75.17    72.43    8.57
8     86.54    80.21    27.85
16    92.37    85.99    6.27
24    91.37    84.91    8.41
32    89.78    82.90    3.31
48    89.85    79.95    -3.57
64    85.83    80.28    2.22
80    88.90    79.47    -23.18
96    90.12    79.98    14.71
128   86.13    80.60    4.42

BW: 71.3%, CPU: 80.4%, SD: 1.2%


-- numtxqs=16, vhosts=5 
#     BW%      CPU%     SD%

1     1.80     0        0
2     19.81    50.68    26.66
4     57.31    52.77    8.57
8     108.44   88.19    -5.21
16    106.09   85.03    -4.44
24    102.34   84.23    -.82
32    102.77   82.71    -5.81
48    100.00   79.62    -7.29
64    96.86    79.75    -6.10
80    99.26    79.82    -27.34
96    94.79    80.02    -5.08
128   98.14    81.15    -15.25

BW: 77.9%,  CPU: 80.4%,  SD: -13.6%

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Krishna Kumar2
 Michael S. Tsirkin m...@redhat.com

 On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote:
  Results for UDP BW tests (unidirectional, sum across
  3 iterations, each iteration of 45 seconds, default
  netperf, vhosts bound to cpus 0-3; no other tuning):

 Is binding vhost threads to CPUs really required?
 What happens if we let the scheduler do its job?

Nothing drastic, I remember BW% and SD% both improved a
bit as a result of binding. I started binding vhost threads
after Avi suggested it in response to my v1 patch (he
suggested some more that I haven't done), and have been
doing only this tuning ever since. This is part of his
mail for the tuning:

vhost:
thread #0:  CPU0
thread #1:  CPU1
thread #2:  CPU2
thread #3:  CPU3

I simply bound each thread to CPU0-3 instead.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-25 Thread Krishna Kumar2
 Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:

Any feedback, comments, objections, issues or bugs about the
patches? Please let me know if something needs to be done.

Some more test results:
_
 Host -> Guest BW (numtxqs=2)
#   BW% CPU%RCPU%   SD% RSD%
_
1   5.53.31 .67 -5.88   0
2   -2.11   -1.01   -2.08   4.34    0
4   13.53   10.77   13.87   -1.96   0
8   34.22   22.80   30.53   -8.46   -2.50
16  30.89   24.06   35.17   -5.20   3.20
24  33.22   26.30   43.39   -5.17   7.58
32  30.85   27.27   47.74   -.59    15.51
40  33.80   27.33   48.00   -7.42   7.59
48  45.93   26.33   45.46   -12.24  1.10
64  33.51   27.11   45.00   -3.27   10.30
80  39.28   29.21   52.33   -4.88   12.17
96  32.05   31.01   57.72   -1.02   19.05
128 35.66   32.04   60.00   -.66    20.41
_
BW: 23.5%  CPU/RCPU: 28.6%,51.2%  SD/RSD: -2.6%,15.8%


Guest -> Host 512 byte (numtxqs=2):
#   BW% CPU%RCPU%   SD% RSD%
_
1   3.02-3.84   -4.76   -12.50  -7.69
2   52.77   -15.73  -8.66   -45.31  -40.33
4   -23.14  13.84   7.50    50.58   40.81
8   -21.44  28.08   16.32   63.06   47.43
16  33.53   46.50   27.19   7.61    -6.60
24  55.77   42.81   30.49   -8.65   -16.48
32  52.59   38.92   29.08   -9.18   -15.63
40  50.92   36.11   28.92   -10.59  -15.30
48  46.63   34.73   28.17   -7.83   -12.32
64  45.56   37.12   28.81   -5.05   -10.80
80  44.55   36.60   28.45   -4.95   -10.61
96  43.02   35.97   28.89   -.11-5.31
128 38.54   33.88   27.19   -4.79   -9.54
_
BW: 34.4%  CPU/RCPU: 35.9%,27.8%  SD/RSD: -4.1%,-9.3%


Thanks,

- KK



 [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

 Following set of patches implement transmit MQ in virtio-net.  Also
 included is the user qemu changes.  MQ is disabled by default unless
 qemu specifies it.

   Changes from rev2:
   --
 1. Define (in virtio_net.h) the maximum send txqs; and use in
virtio-net and vhost-net.
 2. vi->sq[i] is allocated individually, resulting in cache line
aligned sq[0] to sq[n].  Another option was to define
'send_queue' as:
struct send_queue {
struct virtqueue *svq;
struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
} ____cacheline_aligned_in_smp;
and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
the submitted method is preferable.
 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
handles TX[0-n].
 4. Further change TX handling such that vhost[0] handles both RX/TX
for single stream case.

   Enabling MQ on virtio:
   ---
 When following options are passed to qemu:
 - smp > 1
 - vhost=on
 - mq=on (new option, default:off)
 then #txqueues = #cpus.  The #txqueues can be changed by using an
 optional 'numtxqs' option.  e.g. for a smp=4 guest:
 vhost=on                   ->  #txqueues = 1
 vhost=on,mq=on             ->  #txqueues = 4
 vhost=on,mq=on,numtxqs=2   ->  #txqueues = 2
 vhost=on,mq=on,numtxqs=8   ->  #txqueues = 8


Performance (guest -> local host):
---
 System configuration:
 Host:  8 Intel Xeon, 8 GB memory
 Guest: 4 cpus, 2 GB memory
 Test: Each test case runs for 60 secs, sum over three runs (except
 when number of netperf sessions is 1, which has 10 runs of 12 secs
 each).  No tuning (default netperf) other than taskset vhost's to
 cpus 0-3.  numtxqs=32 gave the best results though the guest had
 only 4 vcpus (I haven't tried beyond that).

 __ numtxqs=2, vhosts=3  
 #sessions  BW%      CPU%     RCPU%    SD%      RSD%
 ----------------------------------------------------
 1          4.46     -1.96    .19      -12.50   -6.06
 2          4.93     -1.16    2.10     0        -2.38
 4          46.17    64.77    33.72    19.51    -2.48
 8          47.89    70.00    36.23    41.46    13.35
 16         48.97    80.44    40.67    21.11    -5.46
 24         49.03    78.78    41.22    20.51    -4.78
 32         51.11    77.15    42.42    15.81    -6.87
 40         51.60    71.65    42.43    9.75     -8.94
 48         50.10    69.55    42.85    11.80    -5.81
 64         46.24    68.42    42.67    14.18    -3.28
 80         46.37    63.13    41.62    7.43     -6.73
 96         46.40    63.31    42.20    9.36     -4.78
 128        50.43    62.79    42.16    13.11    -1.23
 
 BW: 37.2

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-25 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/25/2010 09:47:18 PM:

  Any feedback, comments, objections, issues or bugs about the
  patches? Please let me know if something needs to be done.

 I am trying to wrap my head around kernel/user interface here.
 E.g., will we need another incompatible change when we add multiple RX
 queues?

Though I added a 'mq' option to qemu, there shouldn't be
any incompatibility between old and new qemu's wrt vhost
and virtio-net drivers. So the old qemu will run new host
and new guest without issues, and new qemu can also run
old host and old guest. Multiple RXQ will also not add
any incompatibility.

With MQ RX, I will be able to remove the heuristic (idea
from David Stevens).  The idea is: Guest sends out packets
on, say TXQ#2, vhost#2 processes the packets but packets
going out from host to guest might be sent out on a
different RXQ, say RXQ#4.  Guest receives the packet on
RXQ#4, and all future responses on that connection are sent
on TXQ#4.  Now vhost#4 processes both RX and TX packets for
this connection.  Without needing to hash on the connection,
guest can make sure that the same vhost thread will handle
a single connection.
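In guest terms the selection could look something like this
(illustrative sketch only; how the rx queue gets associated with a
connection's outgoing skbs is exactly what the MQ RX support has to
provide):

static u16 virtnet_select_txq(struct sk_buff *skb, u16 numtxqs)
{
	/* Send on the queue pair the flow was last received on, so the
	 * same vhost thread handles both directions of the flow. */
	if (skb_rx_queue_recorded(skb))
		return skb_get_rx_queue(skb) % numtxqs;

	/* No rx history for this skb: fall back to a flow hash */
	return skb_get_rxhash(skb) % numtxqs;
}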

 Also need to think about how robust our single stream heuristic is,
 e.g. what are the chances it will misdetect a bidirectional
 UDP stream as a single TCP?

I think it should not happen. The heuristic code gets
called for handling just the transmit packets, packets
that vhost sends out to the guest skip this path.

I tested unidirectional and bidirectional UDP to confirm:

8 iterations of iperf tests, each iteration of 15 secs,
result is the sum of all 8 iterations in Gbits/sec
__
Uni-directional  Bi-directional
  Org  New Org  New
__
  71.7871.77   71.74   72.07
__

Thanks,

- KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/12/2010 10:39:07 PM:

  Sorry for the delay, I was sick last couple of days. The results
  with your patch are (%'s over original code):
 
  Code   BW%   CPU%   RemoteCPU
  MQ (#txq=16)   31.4% 38.42% 6.41%
  MQ+MST (#txq=16)   28.3% 18.9%  -10.77%
 
  The patch helps CPU utilization but didn't help single stream
  drop.
 
  Thanks,

 What other shared TX/RX locks are there?  In your setup, is the same
 macvtap socket structure used for RX and TX?  If yes this will create
 cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
 there might also be contention on the lock in sk_sleep waitqueue.
 Anything else?

The patch is not introducing any locking (both vhost and virtio-net).
The single stream drop is due to different vhost threads handling the
RX/TX traffic.

I added a heuristic (fuzzy) to determine if more than one flow
is being used on the device, and if not, use vhost[0] for both
tx and rx (vhost_poll_queue figures this out before waking up
the suitable vhost thread).  Testing shows that single stream
performance is as good as the original code.

__
   #txqs = 2 (#vhosts = 3)
# BW1 BW2   (%)   CPU1CPU2 (%)   RCPU1   RCPU2 (%)
__
1     77344   74973 (-3.06)   172    143 (-16.86)   358    324 (-9.49)
2     20924   21107 (.87)     107    103 (-3.73)    220    217 (-1.36)
4     21629   32911 (52.16)   214    391 (82.71)    446    616 (38.11)
8     21678   34359 (58.49)   428    845 (97.42)    892    1286 (44.17)
16    22046   34401 (56.04)   841    1677 (99.40)   1785   2585 (44.81)
24    22396   35117 (56.80)   1272   2447 (92.37)   2667   3863 (44.84)
32    22750   35158 (54.54)   1719   3233 (88.07)   3569   5143 (44.10)
40    23041   35345 (53.40)   2219   3970 (78.90)   4478   6410 (43.14)
48    23209   35219 (51.74)   2707   4685 (73.06)   5386   7684 (42.66)
64    23215   35209 (51.66)   3639   6195 (70.23)   7206   10218 (41.79)
80    23443   35179 (50.06)   4633   7625 (64.58)   9051   12745 (40.81)
96    24006   36108 (50.41)   5635   9096 (61.41)   10864  15283 (40.67)
128   23601   35744 (51.45)   7475   12104 (61.92)  14495  20405 (40.77)
__
SUM: BW: (37.6) CPU: (69.0) RCPU: (41.2)

__
   #txqs = 8 (#vhosts = 5)
# BW1 BW2(%)  CPU1 CPU2 (%)  RCPU1 RCPU2 (%)
__
1     77344   75341 (-2.58)   172    171 (-.58)     358    356 (-.55)
2     20924   26872 (28.42)   107    135 (26.16)    220    262 (19.09)
4     21629   33594 (55.31)   214    394 (84.11)    446    615 (37.89)
8     21678   39714 (83.19)   428    949 (121.72)   892    1358 (52.24)
16    22046   39879 (80.88)   841    1791 (112.96)  1785   2737 (53.33)
24    22396   38436 (71.61)   1272   2111 (65.95)   2667   3453 (29.47)
32    22750   38776 (70.44)   1719   3594 (109.07)  3569   5421 (51.89)
40    23041   38023 (65.02)   2219   4358 (96.39)   4478   6507 (45.31)
48    23209   33811 (45.68)   2707   4047 (49.50)   5386   6222 (15.52)
64    23215   30212 (30.13)   3639   3858 (6.01)    7206   5819 (-19.24)
80    23443   34497 (47.15)   4633   7214 (55.70)   9051   10776 (19.05)
96    24006   30990 (29.09)   5635   5731 (1.70)    10864  8799 (-19.00)
128   23601   29413 (24.62)   7475   7804 (4.40)    14495  11638 (-19.71)
__
SUM: BW: (40.1) CPU: (35.7) RCPU: (4.1)
___


The SD numbers are also good (same table as before, but SD
instead of CPU:

__
   #txqs = 2 (#vhosts = 3)
# BW%   SD1 SD2 (%)RSD1 RSD2 (%)
__
1     -3.06   5      4 (-20.00)      21     19 (-9.52)
2     .87     6      6 (0)           27     27 (0)
4     52.16   26     32 (23.07)      108    103 (-4.62)
8     58.49   103    146 (41.74)     431    445 (3.24)
16    56.04   407    514 (26.28)     1729   1586 (-8.27)
24    56.80   934    1161 (24.30)    3916   3665 (-6.40)
32    54.54   1668   2160 (29.49)    6925   6872 (-.76)
40    53.40   2655   3317 (24.93)    10712  10707 (-.04)
48    51.74   3920   4486 (14.43)    15598  14715 (-5.66)
64    51.66   7096   8250 (16.26)    28099  27211 (-3.16)
80    50.06   11240  12586 (11.97)   43913  42070 (-4.19)
96    50.41   16342  16976

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
 Michael S. Tsirkin m...@redhat.com
   What other shared TX/RX locks are there?  In your setup, is the same
   macvtap socket structure used for RX and TX?  If yes this will create
   cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
   there might also be contention on the lock in sk_sleep waitqueue.
   Anything else?
 
  The patch is not introducing any locking (both vhost and virtio-net).
  The single stream drop is due to different vhost threads handling the
  RX/TX traffic.
 
  I added a heuristic (fuzzy) to determine if more than one flow
  is being used on the device, and if not, use vhost[0] for both
  tx and rx (vhost_poll_queue figures this out before waking up
  the suitable vhost thread).  Testing shows that single stream
  performance is as good as the original code.

 ...

  This approach works nicely for both single and multiple stream.
  Does this look good?
 
  Thanks,
 
  - KK

 Yes, but I guess it depends on the heuristic :) What's the logic?

I define how recently a txq was used. If 0 or 1 txq's were used
recently, use vq[0] (which also handles rx). Otherwise, use
multiple txq (vq[1-n]). The code is:

/*
 * Algorithm for selecting vq:
 *
 * Condition                                    Return
 * RX vq                                        vq[0]
 * If all txqs unused                           vq[0]
 * If one txq used, and new txq is same         vq[0]
 * If one txq used, and new txq is different    vq[vq->qnum]
 * If > 1 txqs used                             vq[vq->qnum]
 *      Where "used" means the txq was used in the last 'n' jiffies.
 *
 * Note: locking is not required as an update race will only result in
 * a different worker being woken up.
 */
static inline struct vhost_virtqueue *vhost_find_vq(struct vhost_poll *poll)
{
	if (poll->vq->qnum) {
		struct vhost_dev *dev = poll->vq->dev;
		struct vhost_virtqueue *vq = dev->vqs[0];
		unsigned long max_time = jiffies - 5; /* Some macro needed */
		unsigned long *table = dev->jiffies;
		int i, used = 0;

		for (i = 0; i < dev->nvqs - 1; i++) {
			if (time_after_eq(table[i], max_time) && ++used > 1) {
				vq = poll->vq;
				break;
			}
		}
		table[poll->vq->qnum - 1] = jiffies;
		return vq;
	}

	/* RX is handled by the same worker thread */
	return poll->vq;
}

void vhost_poll_queue(struct vhost_poll *poll)
{
	struct vhost_virtqueue *vq = vhost_find_vq(poll);

	vhost_work_queue(vq, &poll->work);
}

Since poll batches packets, find_vq does not seem to add much
to the CPU utilization (or BW). I am sure that code can be
optimized much better.

The results I sent in my last mail were without your use_mm
patch, and the only tuning was to make vhost threads run on
only cpus 0-3 (though the performance is good even without
that). I will test it later today with the use_mm patch too.

Thanks,

- KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 10/14/2010 02:34:01 PM:

 void vhost_poll_queue(struct vhost_poll *poll)
 {
 	struct vhost_virtqueue *vq = vhost_find_vq(poll);

 	vhost_work_queue(vq, &poll->work);
 }

 Since poll batches packets, find_vq does not seem to add much
 to the CPU utilization (or BW). I am sure that code can be
 optimized much better.

 The results I sent in my last mail were without your use_mm
 patch, and the only tuning was to make vhost threads run on
 only cpus 0-3 (though the performance is good even without
 that). I will test it later today with the use_mm patch too.

There's a significant reduction in CPU/SD utilization with your
patch. Following is the performance of ORG vs MQ+mm patch:

_
   Org vs MQ+mm patch txq=2
# BW% CPU/RCPU% SD/RSD%
_
1     2.26     -1.16    .27      -20.00   0
2     35.07    29.90    21.81    0        -11.11
4     55.03    84.57    37.66    26.92    -4.62
8     73.16    118.69   49.21    45.63    -.46
16    77.43    98.81    47.89    24.07    -7.80
24    71.59    105.18   48.44    62.84    18.18
32    70.91    102.38   47.15    49.22    8.54
40    63.26    90.58    41.00    85.27    37.33
48    45.25    45.99    11.23    14.31    -12.91
64    42.78    41.82    5.50     .43      -25.12
80    31.40    7.31     -18.69   15.78    -11.93
96    27.60    7.79     -18.54   17.39    -10.98
128   23.46    -11.89   -34.41   -.41     -25.53
_
BW: 40.2  CPU/RCPU: 29.9,-2.2   SD/RSD: 12.0,-15.6


Following is the performance of MQ vs MQ+mm patch:
_
MQ vs MQ+mm patch
# BW%  CPU%   RCPU%SD%  RSD%
_
1     4.98     -.58     .84      -20.00   0
2     5.17     2.96     2.29     0        -4.00
4     -.18     .25      -.16     3.12     .98
8     -5.47    -1.36    -1.98    17.18    16.57
16    -1.90    -6.64    -3.54    -14.83   -12.12
24    -.01     23.63    14.65    57.61    46.64
32    .27      -3.19    -3.11    -22.98   -22.91
40    -1.06    -2.96    -2.96    -4.18    -4.10
48    -.28     -2.34    -3.71    -2.41    -3.81
64    9.71     33.77    30.65    81.44    77.09
80    -10.69   -31.07   -31.70   -29.22   -29.88
96    -1.14    5.98     .56      -11.57   -16.14
128   -.93     -15.60   -18.31   -19.89   -22.65
_
  BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6
_

Each test case is for 60 secs, sum over two runs (except
when number of netperf sessions is 1, which has 7 runs
of 10 secs each), numcpus=4, numtxqs=8, etc. No tuning
other than taskset each vhost to cpus 0-3.

Thanks,

- KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 10/14/2010 05:47:54 PM:

Sorry, it should read txq=8 below.

- KK

 There's a significant reduction in CPU/SD utilization with your
 patch. Following is the performance of ORG vs MQ+mm patch:

 _
Org vs MQ+mm patch txq=2
 # BW% CPU/RCPU% SD/RSD%
 _
 1     2.26     -1.16    .27      -20.00   0
 2     35.07    29.90    21.81    0        -11.11
 4     55.03    84.57    37.66    26.92    -4.62
 8     73.16    118.69   49.21    45.63    -.46
 16    77.43    98.81    47.89    24.07    -7.80
 24    71.59    105.18   48.44    62.84    18.18
 32    70.91    102.38   47.15    49.22    8.54
 40    63.26    90.58    41.00    85.27    37.33
 48    45.25    45.99    11.23    14.31    -12.91
 64    42.78    41.82    5.50     .43      -25.12
 80    31.40    7.31     -18.69   15.78    -11.93
 96    27.60    7.79     -18.54   17.39    -10.98
 128   23.46    -11.89   -34.41   -.41     -25.53
 _
 BW: 40.2  CPU/RCPU: 29.9,-2.2   SD/RSD: 12.0,-15.6

 Following is the performance of MQ vs MQ+mm patch:
 _
 MQ vs MQ+mm patch
 # BW%  CPU%   RCPU%SD%  RSD%
 _
 1     4.98     -.58     .84      -20.00   0
 2     5.17     2.96     2.29     0        -4.00
 4     -.18     .25      -.16     3.12     .98
 8     -5.47    -1.36    -1.98    17.18    16.57
 16    -1.90    -6.64    -3.54    -14.83   -12.12
 24    -.01     23.63    14.65    57.61    46.64
 32    .27      -3.19    -3.11    -22.98   -22.91
 40    -1.06    -2.96    -2.96    -4.18    -4.10
 48    -.28     -2.34    -3.71    -2.41    -3.81
 64    9.71     33.77    30.65    81.44    77.09
 80    -10.69   -31.07   -31.70   -29.22   -29.88
 96    -1.14    5.98     .56      -11.57   -16.14
 128   -.93     -15.60   -18.31   -19.89   -22.65
 _
   BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6
 _

 Each test case is for 60 secs, sum over two runs (except
 when number of netperf sessions is 1, which has 7 runs
 of 10 secs each), numcpus=4, numtxqs=8, etc. No tuning
 other than taskset each vhost to cpus 0-3.

 Thanks,

 - KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-11 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/06/2010 07:04:31 PM:

 On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
  For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
  for degradation for 1 stream case:

 I thought about possible RX/TX contention reasons, and I realized that
 we get/put the mm counter all the time.  So I write the following: I
 haven't seen any performance gain from this in a single queue case, but
 maybe this will help multiqueue?

Sorry for the delay, I was sick last couple of days. The results
with your patch are (%'s over original code):

Code   BW%   CPU%   RemoteCPU
MQ (#txq=16)   31.4% 38.42% 6.41%
MQ+MST (#txq=16)   28.3% 18.9%  -10.77%

The patch helps CPU utilization but didn't help single stream
drop.

Thanks,

- KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/06/2010 07:04:31 PM:

 Michael S. Tsirkin m...@redhat.com
 10/06/2010 07:04 PM

 To

 Krishna Kumar2/India/i...@ibmin

 cc

 ru...@rustcorp.com.au, da...@davemloft.net, kvm@vger.kernel.org,
 a...@arndb.de, net...@vger.kernel.org, a...@redhat.com,
anth...@codemonkey.ws

 Subject

 Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

 On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
  For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
  for degradation for 1 stream case:

 I thought about possible RX/TX contention reasons, and I realized that
 we get/put the mm counter all the time.  So I write the following: I
 haven't seen any performance gain from this in a single queue case, but
 maybe this will help multiqueue?

Great! I am on vacation tomorrow, but will test with this patch
tomorrow night.

Thanks,

- KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Krishna Kumar2
Arnd Bergmann a...@arndb.de wrote on 10/06/2010 05:49:00 PM:

  I don't see any reasons mentioned above.  However, for higher
  number of netperf sessions, I see a big increase in retransmissions:
  ___
  #netperf  ORG   NEW
  BW (#retr)BW (#retr)
  ___
  1  70244 (0) 64102 (0)
  4  21421 (0) 36570 (416)
  8  21746 (0) 38604 (148)
  16 21783 (0) 40632 (464)
  32 22677 (0) 37163 (1053)
  64 23648 (4) 36449 (2197)
  12823251 (2) 31676 (3185)
  ___


 This smells like it could be related to a problem that Ben Greear found
 recently (see macvlan:  Enable qdisc backoff logic). When the hardware
 is busy, we used to just drop the packet. With Ben's patch, we return
-EAGAIN
 to qemu (or vhost-net) to trigger a resend.

 I suppose what we really should do is feed that condition back to the
 guest network stack and implement the backoff in there.

Thanks for the pointer. I will take a look at this as I hadn't seen
this patch earlier. Is there any way to figure out if this is the
issue?

Thanks,

- KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/05/2010 11:53:23 PM:

   Any idea where does this come from?
   Do you see more TX interrupts? RX interrupts? Exits?
   Do interrupts bounce more between guest CPUs?
   4. Identify reasons for single netperf BW regression.
 
  After testing various combinations of #txqs, #vhosts, #netperf
  sessions, I think the drop for 1 stream is due to TX and RX for
  a flow being processed on different cpus.

 Right. Can we fix it?

I am not sure how to. My initial patch had one thread but gave
small gains and ran into limitations once number of sessions
became large.

   I did two more tests:
  1. Pin vhosts to same CPU:
  - BW drop is much lower for 1 stream case (- 5 to -8% range)
  - But performance is not so high for more sessions.
  2. Changed vhost to be single threaded:
- No degradation for 1 session, and improvement for upto
   8, sometimes 16 streams (5-12%).
- BW degrades after that, all the way till 128 netperf
sessions.
- But overall CPU utilization improves.
  Summary of the entire run (for 1-128 sessions):
  txq=4:  BW: (-2.3)  CPU: (-16.5)  RCPU: (-5.3)
  txq=16: BW: (-1.9)  CPU: (-24.9)  RCPU: (-9.6)
 
  I don't see any reasons mentioned above.  However, for higher
  number of netperf sessions, I see a big increase in retransmissions:

 Hmm, ok, and do you see any errors?

I haven't seen any in any statistics, messages, etc. Also no
retransmissions for txq=1.

  Single netperf case didn't have any retransmissions so that is not
  the cause for drop.  I tested ixgbe (MQ):
  ___
  #netperf  ixgbe ixgbe (pin intrs to cpu#0 on
 both server/client)
  BW (#retr)  BW (#retr)
  ___
  1   3567 (117)  6000 (251)
  2   4406 (477)  6298 (725)
  4   6119 (1085) 7208 (3387)
  8   6595 (4276) 7381 (15296)
  16  6651 (11651)    6856 (30394)

 Interesting.
 You are saying we get much more retransmissions with physical nic as
 well?

Yes, with ixgbe. I re-ran with 16 netperfs running for 15 secs on
both ixgbe and cxgb3 just now to reconfirm:

ixgbe: BW: 6186.85  SD/Remote: 135.711, 339.376  CPU/Remote: 79.99, 200.00, Retrans: 545
cxgb3: BW: 8051.07  SD/Remote: 144.416, 260.487  CPU/Remote: 110.88, 200.00, Retrans: 0

However 64 netperfs for 30 secs gave:

ixgbe: BW: 6691.12  SD/Remote: 8046.617, 5259.992  CPU/Remote: 1223.86, 799.97, Retrans: 1424
cxgb3: BW: 7799.16  SD/Remote: 2589.875, 4317.013  CPU/Remote: 480.39, 800.64, Retrans: 649

# ethtool -i eth4
driver: ixgbe
version: 2.0.84-k2
firmware-version: 0.9-3
bus-info: :1f:00.1

# ifconfig output:
   RX packets:783241 errors:0 dropped:0 overruns:0 frame:0
   TX packets:689533 errors:0 dropped:0 overruns:0 carrier:0
   collisions:0 txqueuelen:1000

# lspci output:
1f:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
Subsystem: Intel Corporation Ethernet Server Adapter X520-2
Flags: bus master, fast devsel, latency 0, IRQ 30
Memory at 9890 (64-bit, prefetchable) [size=512K]
I/O ports at 2020 [size=32]
Memory at 98a0 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
Capabilities: [a0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 00-1b-21-ff-ff-40-4a-b4
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
Kernel driver in use: ixgbe
Kernel modules: ixgbe

  I haven't done this right now since I don't have a setup.  I guess
  it would be limited by wire speed and gains may not be there.  I
  will try to do this later when I get the setup.

 OK but at least need to check that it does not hurt things.

Yes, sure.

  Summary:
 
  1. Average BW increase for regular I/O is best for #txq=16 with the
 least CPU utilization increase.
  2. The average BW for 512 byte I/O is best for lower #txq=2. For higher
  #txqs, BW increased only beyond a certain number of netperf sessions - in
 my testing that limit was 32 netperf sessions.
  3. Multiple txq for guest by itself doesn't seem to have any issues.
 Guest CPU% increase is slightly higher than BW improvement.  I
 think it is true for all mq drivers since more paths run in parallel
 upto the device instead of sleeping and allowing one thread to send
 all packets via qdisc_restart.
  4. Having high number of txqs gives better gains 

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-05 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 09/19/2010 06:14:43 PM:

 Could you document how exactly do you measure multistream bandwidth:
 netperf flags, etc?

All results were without any netperf flags or system tuning:
for i in $list
do
        netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
done
wait
Another script processes the result files.  It also displays the
start time/end time of each iteration to make sure skew due to
parallel netperfs is minimal.

I changed the vhost functionality once more to try to get the
best model, the new model being:
1. #numtxqs=1 -> #vhosts=1, this thread handles both RX/TX.
2. #numtxqs>1 -> vhost[0] handles RX and vhost[1-MAX] handle
   TX[0-n], where MAX is 4.  Beyond numtxqs=4, the remaining TX
   queues are handled by vhost threads in round-robin fashion
   (see the sketch below).
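
A minimal sketch of that mapping, with illustrative names (this is
the model described above, not the code in the patch):

#define MAX_TX_VHOSTS 4         /* vhost[1..4] serve TX, vhost[0] serves RX */

/* Which vhost thread serves TX queue 'txq' for a guest with 'numtxqs'
 * TX queues; beyond 4 TX queues the threads are reused round-robin. */
static inline int txq_to_vhost(int txq, int numtxqs)
{
        if (numtxqs == 1)
                return 0;       /* the single thread handles both RX and TX */
        return 1 + (txq % MAX_TX_VHOSTS);
}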

Results from here on are with these changes, and only tuning is
to set each vhost's affinity to CPUs[0-3] (taskset -p f vhost-pids).

 Any idea where does this come from?
 Do you see more TX interrupts? RX interrupts? Exits?
 Do interrupts bounce more between guest CPUs?
 4. Identify reasons for single netperf BW regression.

After testing various combinations of #txqs, #vhosts, #netperf
sessions, I think the drop for 1 stream is due to TX and RX for
a flow being processed on different cpus.  I did two more tests:
1. Pin vhosts to same CPU:
- BW drop is much lower for 1 stream case (- 5 to -8% range)
- But performance is not so high for more sessions.
2. Changed vhost to be single threaded:
  - No degradation for 1 session, and improvement for upto
  8, sometimes 16 streams (5-12%).
  - BW degrades after that, all the way till 128 netperf sessions.
  - But overall CPU utilization improves.
Summary of the entire run (for 1-128 sessions):
txq=4:  BW: (-2.3)  CPU: (-16.5)  RCPU: (-5.3)
txq=16: BW: (-1.9)  CPU: (-24.9)  RCPU: (-9.6)

I don't see any reasons mentioned above.  However, for higher
number of netperf sessions, I see a big increase in retransmissions:
___
#netperf  ORG   NEW
BW (#retr)BW (#retr)
___
1  70244 (0) 64102 (0)
4  21421 (0) 36570 (416)
8  21746 (0) 38604 (148)
16 21783 (0) 40632 (464)
32 22677 (0) 37163 (1053)
64 23648 (4) 36449 (2197)
128 23251 (2) 31676 (3185)
___

Single netperf case didn't have any retransmissions so that is not
the cause for drop.  I tested ixgbe (MQ):
___
#netperf  ixgbe ixgbe (pin intrs to cpu#0 on
   both server/client)
BW (#retr)  BW (#retr)
___
1   3567 (117)  6000 (251)
2   4406 (477)  6298 (725)
4   6119 (1085) 7208 (3387)
8   6595 (4276) 7381 (15296)
16  6651 (11651)    6856 (30394)
___

  5. Test perf in more scenarios:
small packets

512 byte packets - BW drop for upto 8 (sometimes 16) netperf sessions,
but increases with #sessions:
___
#   BW1    BW2 (%)        CPU1   CPU2 (%)      RCPU1  RCPU2 (%)
___
1   4043   3800 (-6.0)    50     50 (0)        86     98 (13.9)
2   8358   7485 (-10.4)   153    178 (16.3)    230    264 (14.7)
4   20664  13567 (-34.3)  448    490 (9.3)     530    624 (17.7)
8   25198  17590 (-30.1)  967    1021 (5.5)    1085   1257 (15.8)
16  23791  24057 (1.1)    1904   2220 (16.5)   2156   2578 (19.5)
24  23055  26378 (14.4)   2807   3378 (20.3)   3225   3901 (20.9)
32  22873  27116 (18.5)   3748   4525 (20.7)   4307   5239 (21.6)
40  22876  29106 (27.2)   4705   5717 (21.5)   5388   6591 (22.3)
48  23099  31352 (35.7)   5642   6986 (23.8)   6475   8085 (24.8)
64  22645  30563 (34.9)   7527   9027 (19.9)   8619   10656 (23.6)
80  22497  31922 (41.8)   9375   11390 (21.4)  10736  13485 (25.6)
96  22509  32718 (45.3)   11271  13710 (21.6)  12927  16269 (25.8)
128 22255  32397 (45.5)   15036  18093 (20.3)  17144  21608 (26.0)
___
SUM:BW: (16.7)  CPU: (20.6) RCPU: (24.3)
___

 host -> guest
___
#   BW1    BW2 (%)     CPU1   CPU2 (%)    RCPU1

Re: [v2 RFC PATCH 2/4] Changes for virtio-net

2010-09-17 Thread Krishna Kumar2
Eric Dumazet eric.duma...@gmail.com wrote on 09/17/2010 03:55:54 PM:

  +/* Our representation of a send virtqueue */
  +struct send_queue {
  +   struct virtqueue *svq;
  +
  +   /* TX: fragments + linear part + virtio header */
  +   struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
  +};

 You probably want cacheline_aligned_in_smp

I had tried this and mentioned this in Patch 0/4:
2. Cache-align data structures: I didn't see any BW/SD improvement
   after making the sq's (and similarly for vhost) cache-aligned
   statically:
struct virtnet_info {
...
struct send_queue sq[16] cacheline_aligned_in_smp;
...
};


I am not sure why this made no difference?

  +
   struct virtnet_info {
  struct virtio_device *vdev;
  -   struct virtqueue *rvq, *svq, *cvq;
  +   int numtxqs; /* Number of tx queues */
  +   struct send_queue *sq;
  +   struct virtqueue *rvq;
  +   struct virtqueue *cvq;
  struct net_device *dev;

 struct napi will probably be dirtied by RX processing

 You should make sure it doesn't dirty cache line of above (read mostly)
 fields

I am changing the layout of napi wrt other pointers in
this patch, though the to-be-submitted RX patch does that.
Should I do something for this TX-only patch?
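
For reference, the kind of layout Eric is asking for would look
roughly like this (a sketch only, not the posted patch):

struct virtnet_info {
        /* read-mostly fields shared by the TX and RX paths */
        struct virtio_device *vdev;
        struct virtqueue *rvq, *cvq;
        struct send_queue *sq;
        int numtxqs;
        struct net_device *dev;

        /* dirtied by RX processing: start a new cache line so it does
         * not invalidate the read-mostly pointers above */
        struct napi_struct napi ____cacheline_aligned_in_smp;

        /* remaining fields ... */
};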

  +#define MAX_DEVICE_NAME  16
  +static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
  +{
  +   vq_callback_t **callbacks;
  +   struct virtqueue **vqs;
  +   int i, err = -ENOMEM;
  +   int totalvqs;
  +   char **names;
  +
  +   /* Allocate send queues */

 no check on numtxqs ? Hmm...

 Please then use kcalloc(numtxqs, sizeof(*vi->sq), GFP_KERNEL) so that
 some check is done for you ;)

Right! I need to re-introduce some limit. Rusty, should I simply
add a check for a constant (like 256) here?
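
Something along these lines, with a hypothetical upper bound (the
constant and its value below are only illustrative, not from the
patch):

#define VIRTNET_MAX_TXQS        256     /* illustrative limit */

        if (numtxqs < 1 || numtxqs > VIRTNET_MAX_TXQS)
                return -EINVAL;

        /* kcalloc() also checks numtxqs * sizeof(*vi->sq) for overflow */
        vi->sq = kcalloc(numtxqs, sizeof(*vi->sq), GFP_KERNEL);
        if (!vi->sq)
                return -ENOMEM;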

Thanks for your review, Eric!

- KK



Re: [v2 RFC PATCH 2/4] Changes for virtio-net

2010-09-17 Thread Krishna Kumar2
 Krishna Kumar2/India/i...@ibmin
 Sent by: netdev-ow...@vger.kernel.org

   +
struct virtnet_info {
   struct virtio_device *vdev;
   -   struct virtqueue *rvq, *svq, *cvq;
   +   int numtxqs; /* Number of tx queues */
   +   struct send_queue *sq;
   +   struct virtqueue *rvq;
   +   struct virtqueue *cvq;
   struct net_device *dev;
 
  struct napi will probably be dirtied by RX processing
 
  You should make sure it doesn't dirty cache line of above (read mostly)
  fields

 I am changing the layout of napi wrt other pointers in
 this patch, though the to-be-submitted RX patch does that.
 Should I do something for this TX-only patch?

Sorry, I think my sentence is not clear! I will make this
change (and also cache-line align the send queues), test
and let you know the result.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-13 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 09/13/2010 05:20:55 PM:

  Results with the original kernel:
  _
  #   BW  SD  RSD
  __
  1   20903   1   6
  2   21963   6   25
  4   22042   23  102
  8   21674   97  419
  16  22281   379 1663
  24  22521   857 3748
  32  22976   1528    6594
  40  23197   2390    10239
  48  22973   3542    15074
  64  23809   6486    27244
  80  23564   10169   43118
  96  22977   14954   62948
  128 23649   27067   113892
  
 
  With higher number of threads running in parallel, SD
  increased. In this case most threads run in parallel
  only till __dev_xmit_skb (#numtxqs=1). With mq TX patch,
  higher number of threads run in parallel through
  ndo_start_xmit. I *think* the increase in SD is to do
  with higher # of threads running for a larger code path.
  From the numbers I posted with the patch (cut-n-paste
  only the % parts), BW increased much more than the SD,
  sometimes more than twice the increase in SD.

 Service demand is BW/CPU, right? So if BW goes up by 50%
 and SD by 40%, this means that CPU more than doubled.

I think the SD calculation might be more complicated; I think it
is based on adding up averages sampled and stored during the run.
But I still don't see how CPU can double? e.g.
BW:  1000 -> 1500 (50%)
SD:  100 -> 140 (40%)
CPU: 10 -> 10.71 (7.1%)

  N#  BW% SD%  RSD%
  4   54.30   40.00    -1.16
  8   71.79   46.59    -2.68
  16  71.89   50.40    -2.50
  32  72.24   34.26    -14.52
  48  70.10   31.51    -14.35
  64  69.01   38.81    -9.66
  96  70.68   71.26    10.74
 
  I also think SD calculation gets skewed for guest-local
  host testing.

 If it's broken, let's fix it?

  For this test, I ran a guest with numtxqs=16.
  The first result below is with my patch, which creates 16
  vhosts. The second result is with a modified patch which
  creates only 2 vhosts (testing with #netperfs = 64):

 My guess is it's not a good idea to have more TX VQs than guest CPUs.

Definitely, I will try to run tomorrow with more reasonable
values, also will test with my second version of the patch
that creates restricted number of vhosts and post results.

 I realize for management it's easier to pass in a single vhost fd, but
 just for testing it's probably easier to add code in userspace to open
 /dev/vhost multiple times.

 
  #vhosts  BW%     SD%     RSD%
  16       20.79   186.01  149.74
  2        30.89   34.55   18.44
 
  The remote SD increases with the number of vhost threads,
  but that number seems to correlate with guest SD. So though
  BW% increased slightly from 20% to 30%, SD fell drastically
  from 186% to 34%. I think it could be a calculation skew
  with host SD, which also fell from 150% to 18%.

 I think by default netperf looks in /proc/stat for CPU utilization data:
 so host CPU utilization will include the guest CPU, I think?

It appears that way to me too, but the data above seems to
suggest the opposite...

 I would go further and claim that for host/guest TCP
 CPU utilization and SD should always be identical.
 Makes sense?

It makes sense to me, but once again I am not sure how SD
is really done, or whether it is linear to CPU. Cc'ing Rick
in case he can comment


 
  I am planning to submit 2nd patch rev with restricted
  number of vhosts.
 
Likely cause for the 1 stream degradation with multiple
vhost patch:
   
1. Two vhosts run handling the RX and TX respectively.
   I think the issue is related to cache ping-pong esp
   since these run on different cpus/sockets.
  
   Right. With TCP I think we are better off handling
   TX and RX for a socket by the same vhost, so that
   packet and its ack are handled by the same thread.
   Is this what happens with RX multiqueue patch?
   How do we select an RX queue to put the packet on?
 
  My (unsubmitted) RX patch doesn't do this yet, that is
  something I will check.
 
  Thanks,
 
  - KK

 You'll want to work on top of net-next, I think there's
 RX flow filtering work going on there.

Thanks Michael, I will follow up on that for the RX patch,
plus your suggestion on tying RX with TX.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-12 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 09/12/2010 05:10:25 PM:

  SINGLE vhost (Guest - Host):
  1 netperf:     BW: 10.7%  SD: -1.4%
  4 netperfs:    BW: 3%     SD: 1.4%
  8 netperfs:    BW: 17.7%  SD: -10%
 16 netperfs:    BW: 4.7%   SD: -7.0%
 32 netperfs:    BW: -6.1%  SD: -5.7%
  BW and SD both improves (guest multiple txqs help). For 32
  netperfs, SD improves.
 
  But with multiple vhosts, guest is able to send more packets
  and BW increases much more (SD too increases, but I think
  that is expected).

 Why is this expected?

Results with the original kernel:
_
#   BW  SD  RSD
__
1   20903   1   6
2   21963   6   25
4   22042   23  102
8   21674   97  419
16  22281   379 1663
24  22521   857 3748
32  22976   1528    6594
40  23197   2390    10239
48  22973   3542    15074
64  23809   6486    27244
80  23564   10169   43118
96  22977   14954   62948
128 23649   27067   113892


With higher number of threads running in parallel, SD
increased. In this case most threads run in parallel
only till __dev_xmit_skb (#numtxqs=1). With mq TX patch,
higher number of threads run in parallel through
ndo_start_xmit. I *think* the increase in SD is to do
with higher # of threads running for a larger code path.
From the numbers I posted with the patch (cut-n-paste
only the % parts), BW increased much more than the SD,
sometimes more than twice the increase in SD.

N#  BW% SD%  RSD%
4   54.30   40.00    -1.16
8   71.79   46.59    -2.68
16  71.89   50.40    -2.50
32  72.24   34.26    -14.52
48  70.10   31.51    -14.35
64  69.01   38.81    -9.66
96  70.68   71.26    10.74

I also think SD calculation gets skewed for guest-local
host testing. For this test, I ran a guest with numtxqs=16.
The first result below is with my patch, which creates 16
vhosts. The second result is with a modified patch which
creates only 2 vhosts (testing with #netperfs = 64):

#vhosts  BW%     SD%     RSD%
16       20.79   186.01  149.74
2        30.89   34.55   18.44

The remote SD increases with the number of vhost threads,
but that number seems to correlate with guest SD. So though
BW% increased slightly from 20% to 30%, SD fell drastically
from 186% to 34%. I think it could be a calculation skew
with host SD, which also fell from 150% to 18%.

I am planning to submit 2nd patch rev with restricted
number of vhosts.

  Likely cause for the 1 stream degradation with multiple
  vhost patch:
 
  1. Two vhosts run handling the RX and TX respectively.
 I think the issue is related to cache ping-pong esp
 since these run on different cpus/sockets.

 Right. With TCP I think we are better off handling
 TX and RX for a socket by the same vhost, so that
 packet and its ack are handled by the same thread.
 Is this what happens with RX multiqueue patch?
 How do we select an RX queue to put the packet on?

My (unsubmitted) RX patch doesn't do this yet, that is
something I will check.

Thanks,

- KK



Re: [RFC PATCH 1/4] Add a new API to virtio-pci

2010-09-12 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 09/12/2010 05:16:37 PM:

 On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:
  Unfortunately I need a
  constant in vhost for now.

 Maybe not even that: you create multiple vhost-net
 devices so vhost-net in kernel does not care about these
 either, right? So this can be just part of vhost_net.h
 in qemu.

Sorry, I didn't understand what you meant.

I can remove all socks[] arrays/constants by pre-allocating
sockets in vhost_setup_vqs. Then I can remove all socks
parameters in vhost_net_stop, vhost_net_release and
vhost_net_reset_owner.

Does this make sense?

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-09 Thread Krishna Kumar2
 Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:

Some more results and likely cause for single netperf
degradation below.


 Guest - Host (single netperf):
 I am getting a drop of almost 20%. I am trying to figure out
 why.

 Host - guest (single netperf):
 I am getting an improvement of almost 15%. Again - unexpected.

 Guest - Host TCP_RR: I get an average 7.4% increase in #packets
 for runs upto 128 sessions. With fewer netperf (under 8), there
 was a drop of 3-7% in #packets, but beyond that, the #packets
 improved significantly to give an average improvement of 7.4%.

 So it seems that fewer sessions is having negative effect for
 some reason on the tx side. The code path in virtio-net has not
 changed much, so the drop in some cases is quite unexpected.

The drop for the single netperf seems to be due to multiple vhost.
I changed the patch to start *single* vhost:

Guest - Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
Guest - Host (1 netperf) : Latency: -3%, SD: 3.5%

Single vhost performs well but hits the barrier at 16 netperf
sessions:

SINGLE vhost (Guest - Host):
1 netperf:     BW: 10.7%  SD: -1.4%
4 netperfs:    BW: 3%     SD: 1.4%
8 netperfs:    BW: 17.7%  SD: -10%
16 netperfs:   BW: 4.7%   SD: -7.0%
32 netperfs:   BW: -6.1%  SD: -5.7%
BW and SD both improves (guest multiple txqs help). For 32
netperfs, SD improves.

But with multiple vhosts, guest is able to send more packets
and BW increases much more (SD too increases, but I think
that is expected). From the earlier results:

N#  BW1    BW2   (%)       SD1    SD2   (%)       RSD1   RSD2   (%)
___
4   26387  40716 (54.30)   20     28    (40.00)   86     85     (-1.16)
8   24356  41843 (71.79)   88     129   (46.59)   372    362    (-2.68)
16  23587  40546 (71.89)   375    564   (50.40)   1558   1519   (-2.50)
32  22927  39490 (72.24)   1617   2171  (34.26)   6694   5722   (-14.52)
48  23067  39238 (70.10)   3931   5170  (31.51)   15823  13552  (-14.35)
64  22927  38750 (69.01)   7142   9914  (38.81)   28972  26173  (-9.66)
96  22568  38520 (70.68)   16258  27844 (71.26)   65944  73031  (10.74)
___
(All tests were done without any tuning)

From my testing:

1. Single vhost improves mq guest performance upto 16
   netperfs but degrades after that.
2. Multiple vhost degrades single netperf guest
   performance, but significantly improves performance
   for any number of netperf sessions.

Likely cause for the 1 stream degradation with multiple
vhost patch:

1. Two vhosts run handling the RX and TX respectively.
   I think the issue is related to cache ping-pong esp
   since these run on different cpus/sockets.
2. I (re-)modified the patch to share RX with TX[0]. The
   performance drop is the same, but the reason is the
   guest is not using txq[0] in most cases (dev_pick_tx),
   so vhost's rx and tx are running on different threads.
   But whenever the guest uses txq[0], only one vhost
   runs and the performance is similar to original.

I went back to my *submitted* patch and started a guest
with numtxq=16 and pinned every vhost to cpus #01. Now
whether guest used txq[0] or txq[n], the performance is
similar or better (between 10-27% across 10 runs) than
original code. Also, -6% to -24% improvement in SD.

I will start a full test run of original vs submitted
code with minimal tuning (Avi also suggested the same),
and re-send. Please let me know if you need any other
data.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-09 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 09/09/2010 03:15:53 PM:

 I will start a full test run of original vs submitted
 code with minimal tuning (Avi also suggested the same),
 and re-send. Please let me know if you need any other
 data.

Same patch, the only change is that I ran taskset -p 03 on
all vhost threads; no other tuning on host or guest.
Default netperf without any options. The BW is the sum
across two iterations, each is 60secs. Guest is started
with 2 txqs.

BW1/BW2: BW for org & new in mbps
SD1/SD2: SD for org & new
RSD1/RSD2: Remote SD for org & new
___
#    BW1    BW2    (%)      SD1    SD2   (%)      RSD1    RSD2   (%)
___
1    20903  19422  (-7.08)  1      1     (0)      6       7      (16.66)
2    21963  24330  (10.77)  6      6     (0)      25      25     (0)
4    22042  31841  (44.45)  23     28    (21.73)  102     110    (7.84)
8    21674  32045  (47.84)  97     111   (14.43)  419     421    (.47)
16   22281  31361  (40.75)  379    551   (45.38)  1663    2110   (26.87)
24   22521  31945  (41.84)  857    981   (14.46)  3748    3742   (-.16)
32   22976  32473  (41.33)  1528   1806  (18.19)  6594    6885   (4.41)
40   23197  32594  (40.50)  2390   2755  (15.27)  10239   10450  (2.06)
48   22973  32757  (42.58)  3542   3786  (6.88)   15074   14395  (-4.50)
64   23809  32814  (37.82)  6486   6981  (7.63)   27244   26381  (-3.16)
80   23564  32682  (38.69)  10169  11133 (9.47)   43118   41397  (-3.99)
96   22977  33069  (43.92)  14954  15881 (6.19)   62948   59071  (-6.15)
128  23649  33032  (39.67)  27067  28832 (6.52)   113892  106096 (-6.84)
___
     294534 400371 (35.9)   67504  72858 (7.9)    285077  271096 (-4.9)
___

I will try more tuning later as Avi suggested, wanted to test
the minimal for now.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-09 Thread Krishna Kumar2
Arnd Bergmann a...@arndb.de wrote on 09/09/2010 04:10:27 PM:

   Can you live migrate a new guest from new-qemu/new-kernel
   to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
   If not, do we need to support all those cases?
 
  I have not tried this, though I added some minimal code in
  virtio_net_load and virtio_net_save. I don't know what needs
  to be done exactly at this time. I forgot to put this in the
  Next steps list of things to do.

 I was mostly trying to find out if you think it should work
 or if there are specific reasons why it would not.
 E.g. when migrating to a machine that has an old qemu, the guest
 gets reduced to a single queue, but it's not clear to me how
 it can learn about this, or if it can get hidden by the outbound
 qemu.

I agree, I am also not sure how the old guest will handle this.
Sorry about my ignorance on migration :(

Regards,

- KK



Re: [RFC PATCH 1/4] Add a new API to virtio-pci

2010-09-09 Thread Krishna Kumar2
Rusty Russell ru...@rustcorp.com.au wrote on 09/09/2010 05:44:25 PM:

    This seems a bit weird.  I mean, the driver used vdev->config->find_vqs
    to find the queues, which returns them (in order).  So, can't you put
    this into your struct send_queue?
 
  I am saving the vqs in the send_queue, but the cb needs
  to locate the device txq from the svq. The only other way
  I could think of is to iterate through the send_queue's
  and compare svq against sq[i]->svq, but cb's happen quite
  a bit. Is there a better way?

 Ah, good point.  Move the queue index into the struct virtqueue?

Is it OK to move the queue_index from virtio_pci_vq_info
to virtqueue? I didn't want to change any data structures
in virtio for this patch, but I can do it either way.
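
For illustration, what Rusty suggests might look roughly like this
(a sketch only, not a posted patch):

/* in struct virtqueue (sketch): */
struct virtqueue {
        ...
        int queue_index;        /* position in the find_vqs() order */
};

/* the TX completion callback could then avoid any lookup: */
static void skb_xmit_done(struct virtqueue *svq)
{
        struct virtnet_info *vi = svq->vdev->priv;

        virtqueue_disable_cb(svq);
        netif_wake_subqueue(vi->dev, svq->queue_index - 1); /* vq 0 is RX */
}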

   Also, why define VIRTIO_MAX_TXQS?  If the driver can't handle all of
  them,
   it should simply not use them...
 
  The main reason was vhost :) Since vhost_net_release
  should not fail (__fput can't handle f_op->release()
  failure), I needed a maximum number of socks to
  clean up:

 Ah, then it belongs in the vhost headers.  The guest shouldn't see such
 a restriction if it doesn't apply; it's a host thing.

 Oh, and I think you could profitably use virtio_config_val(), too.

OK, I will make those changes. Thanks for the reference to
virtio_config_val(), I will use it in guest probe instead of
the cumbersome way I am doing now. Unfortunately I need a
constant in vhost for now.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-09 Thread Krishna Kumar2
Sridhar Samudrala s...@us.ibm.com wrote on 09/10/2010 04:30:24 AM:

 I remember seeing similar issue when using a separate vhost thread for
 TX and
 RX queues.  Basically, we should have the same vhost thread process a
 TCP flow
 in both directions. I guess this allows the data and ACKs to be
 processed in sync.

I was trying that by sharing threads between rx and tx[0], but
that didn't work either since guest rarely picks txq=0. I was
able to get reasonable single stream performance by pinning
vhosts to the same cpu.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
Avi Kivity a...@redhat.com wrote on 09/08/2010 01:17:34 PM:

   On 09/08/2010 10:28 AM, Krishna Kumar wrote:
  Following patches implement Transmit mq in virtio-net.  Also
  included is the user qemu changes.
 
  1. This feature was first implemented with a single vhost.
  Testing showed 3-8% performance gain for upto 8 netperf
  sessions (and sometimes 16), but BW dropped with more
  sessions.  However, implementing per-txq vhost improved
  BW significantly all the way to 128 sessions.

 Why were vhost kernel changes required?  Can't you just instantiate more
 vhost queues?

I did try using a single thread processing packets from multiple
vq's on host, but the BW dropped beyond a certain number of
sessions. I don't have the code and performance numbers for that
right now since it is a bit ancient, I can try to resuscitate
that if you want.

  Guest interrupts for a 4 TXQ device after a 5 min test:
  # egrep virtio0|CPU /proc/interrupts
            CPU0     CPU1     CPU2    CPU3
  40:       0        0        0       0        PCI-MSI-edge  virtio0-config
  41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
  42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
  43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
  44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
  45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3

 How are vhost threads and host interrupts distributed?  We need to move
 vhost queue threads to be colocated with the related vcpu threads (if no
 extra cores are available) or on the same socket (if extra cores are
 available).  Similarly, move device interrupts to the same core as the
 vhost thread.

All my testing was without any tuning, including binding netperf &
netserver (irqbalance is also off). I assume (maybe wrongly) that
the above might give better results? Are you suggesting this
combination:
IRQ on guest:
40: CPU0
41: CPU1
42: CPU2
43: CPU3 (all CPUs are on socket #0)
vhost:
thread #0:  CPU0
thread #1:  CPU1
thread #2:  CPU2
thread #3:  CPU3
qemu:
thread #0:  CPU4
thread #1:  CPU5
thread #2:  CPU6
thread #3:  CPU7 (all CPUs are on socket#1)
netperf/netserver:
Run on CPUs 0-4 on both sides

The reason I did not optimize anything from user space is because
I felt showing the default works reasonably well is important.
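
For completeness, pinning can also be done programmatically; the
sketch below is just the userspace equivalent of the taskset
commands used in these tests (it assumes the vhost thread's pid is
already known):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Pin task 'pid' to a single host CPU, like "taskset -p <mask> <pid>". */
static int pin_task_to_cpu(pid_t pid, int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(pid, sizeof(set), &set) < 0) {
                perror("sched_setaffinity");
                return -1;
        }
        return 0;
}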

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 09/08/2010 01:40:11 PM:


___
 TCP (#numtxqs=2)
  N#  BW1    BW2   (%)       SD1    SD2   (%)       RSD1   RSD2   (%)
___
  4   26387  40716 (54.30)   20     28    (40.00)   86     85     (-1.16)
  8   24356  41843 (71.79)   88     129   (46.59)   372    362    (-2.68)
  16  23587  40546 (71.89)   375    564   (50.40)   1558   1519   (-2.50)
  32  22927  39490 (72.24)   1617   2171  (34.26)   6694   5722   (-14.52)
  48  23067  39238 (70.10)   3931   5170  (31.51)   15823  13552  (-14.35)
  64  22927  38750 (69.01)   7142   9914  (38.81)   28972  26173  (-9.66)
  96  22568  38520 (70.68)   16258  27844 (71.26)   65944  73031  (10.74)

 That's a significant hit in TCP SD. Is it caused by the imbalance between
 number of queues for TX and RX? Since you mention RX is complete,
 maybe measure with a balanced TX/RX?

Yes, I am not sure why it is so high. I found the same with #RX=#TX
too. As a hack, I tried ixgbe without MQ (set indices=1 before
calling alloc_etherdev_mq, not sure if that is entirely correct) -
here too SD worsened by around 40%. I can't explain it, since the
virtio-net driver runs lock free once sch_direct_xmit gets
HARD_TX_LOCK for the specific txq. Maybe the SD calculation is not
strictly correct since more threads are now running in parallel and
load is higher? E.g., if you compare SD between #netperfs = 8 vs 16
for the original code (cut-n-paste relevant columns only) ...

N#   BW      SD
8    24356   88
16   23587   375

... SD has increased more than 4 times for the same BW.

 What happens with a single netperf?
 host - guest performance with TCP and small packet speed
 are also worth measuring.

OK, I will do this and send the results later today.

 At some level, host/guest communication is easy in that we don't really
 care which queue is used.  I would like to give some thought (and
 testing) to how is this going to work with a real NIC card and packet
 steering at the backend.
 Any idea?

I have done a little testing with guest -> remote server, both
using a bridge and with macvtap (mq is required only for rx).
I didn't understand what you mean by packet steering though,
is it whether packets go out of the NIC on different queues?
If so, I verified that is the case by putting a counter and
displaying through /debug interface on the host. dev_queue_xmit
on the host handles it by calling dev_pick_tx().
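
A per-queue counter of this sort can be exposed via debugfs; the
sketch below is only an illustration (names are made up), not the
actual debug code used here: bump a counter per txq in the transmit
path and read /sys/kernel/debug/mq_txq_stats/txq* to see how the
flows are spread across queues.

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/debugfs.h>

#define NR_TXQS 8

static u64 txq_pkts[NR_TXQS];   /* incremented per queue in the tx path */
static struct dentry *txq_dir;

static int __init txq_stats_init(void)
{
        char name[8];
        int i;

        txq_dir = debugfs_create_dir("mq_txq_stats", NULL);
        for (i = 0; i < NR_TXQS; i++) {
                snprintf(name, sizeof(name), "txq%d", i);
                debugfs_create_u64(name, 0444, txq_dir, &txq_pkts[i]);
        }
        return 0;
}

static void __exit txq_stats_exit(void)
{
        debugfs_remove_recursive(txq_dir);
}

module_init(txq_stats_init);
module_exit(txq_stats_exit);
MODULE_LICENSE("GPL");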

  Guest interrupts for a 4 TXQ device after a 5 min test:
  # egrep virtio0|CPU /proc/interrupts
          CPU0     CPU1     CPU2    CPU3
  40:     0        0        0       0        PCI-MSI-edge  virtio0-config
  41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
  42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
  43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
  44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
  45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3

 Does this mean each interrupt is constantly bouncing between CPUs?

Yes. I didn't do *any* tuning for the tests. The only tuning
was to use 64K IO size with netperf. When I ran default netperf
(16K), I got a little lesser improvement in BW and worse(!) SD
than with 64K.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
Hi Michael,

Michael S. Tsirkin m...@redhat.com wrote on 09/08/2010 01:43:26 PM:

 On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
  1. mq RX patch is also complete - plan to submit once TX is OK.

 It's good that you split patches, I think it would be interesting to see
 the RX patches at least once to complete the picture.
 You could make it a separate patchset, tag them as RFC.

OK, I need to re-do some parts of it, since I started the TX-only
branch a couple of weeks earlier and the RX side is outdated. I
will try to send that out in the next couple of days; as you say,
it will help to complete the picture. Reasons for sending only the
TX part now:

- Reduce size of patch and complexity
- I didn't get much improvement with the multiple RX patch (netperf
  from host -> guest), so I needed some time to figure out the
  reason and fix it.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
Avi Kivity a...@redhat.com wrote on 09/08/2010 02:58:21 PM:

  1. This feature was first implemented with a single vhost.
   Testing showed 3-8% performance gain for upto 8 netperf
   sessions (and sometimes 16), but BW dropped with more
   sessions.  However, implementing per-txq vhost improved
   BW significantly all the way to 128 sessions.
  Why were vhost kernel changes required?  Can't you just instantiate
more
  vhost queues?
  I did try using a single thread processing packets from multiple
  vq's on host, but the BW dropped beyond a certain number of
  sessions.

 Oh - so the interface has not changed (which can be seen from the
 patch).  That was my concern, I remembered that we planned for vhost-net
 to be multiqueue-ready.

 The new guest and qemu code work with old vhost-net, just with reduced
 performance, yes?

Yes, I have tested new guest/qemu with old vhost but using
#numtxqs=1 (or not passing any arguments at all to qemu to
enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
since vhost_net_set_backend in the unmodified vhost checks
for boundary overflow.
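
(For reference, the boundary check in unmodified vhost that produces
the ENOBUFS above is roughly the following, simplified from
drivers/vhost/net.c:)

        /* vhost_net_set_backend(): only the built-in RX/TX pair is
         * accepted, i.e. index must be < VHOST_NET_VQ_MAX (== 2) */
        if (index >= VHOST_NET_VQ_MAX)
                return -ENOBUFS;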

I have also tested running an unmodified guest with new
vhost/qemu, but qemu should not specify numtxqs > 1.

  Are you suggesting this
  combination:
 IRQ on guest:
40: CPU0
41: CPU1
42: CPU2
43: CPU3 (all CPUs are on socket #0)
 vhost:
thread #0:  CPU0
thread #1:  CPU1
thread #2:  CPU2
thread #3:  CPU3
 qemu:
thread #0:  CPU4
thread #1:  CPU5
thread #2:  CPU6
thread #3:  CPU7 (all CPUs are on socket#1)

 May be better to put vcpu threads and vhost threads on the same socket.

 Also need to affine host interrupts.

 netperf/netserver:
Run on CPUs 0-4 on both sides
 
  The reason I did not optimize anything from user space is because
  I felt showing the default works reasonably well is important.

 Definitely.  Heavy tuning is not a useful path for general end users.
 We need to make sure the the scheduler is able to arrive at the optimal
 layout without pinning (but perhaps with hints).

OK, I will see if I can get results with this.

Thanks for your suggestions,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 09/08/2010 04:18:33 PM:


___

    TCP (#numtxqs=2)
     N#  BW1    BW2   (%)       SD1    SD2   (%)       RSD1   RSD2   (%)
___
     4   26387  40716 (54.30)   20     28    (40.00)   86     85     (-1.16)
     8   24356  41843 (71.79)   88     129   (46.59)   372    362    (-2.68)
     16  23587  40546 (71.89)   375    564   (50.40)   1558   1519   (-2.50)
     32  22927  39490 (72.24)   1617   2171  (34.26)   6694   5722   (-14.52)
     48  23067  39238 (70.10)   3931   5170  (31.51)   15823  13552  (-14.35)
     64  22927  38750 (69.01)   7142   9914  (38.81)   28972  26173  (-9.66)
     96  22568  38520 (70.68)   16258  27844 (71.26)   65944  73031  (10.74)

    That's a significant hit in TCP SD. Is it caused by the imbalance between
    number of queues for TX and RX? Since you mention RX is complete,
    maybe measure with a balanced TX/RX?
 
  Yes, I am not sure why it is so high.

 Any errors at higher levels? Are any packets reordered?

I haven't seen any messages logged, and retransmission is similar
to non-mq case. Device also has no errors/dropped packets. Anything
else I should look for?

On the host:

# ifconfig vnet0
vnet0 Link encap:Ethernet  HWaddr 9A:9D:99:E1:CA:CE
  inet6 addr: fe80::989d:99ff:fee1:cace/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:5090371 errors:0 dropped:0 overruns:0 frame:0
  TX packets:5054616 errors:0 dropped:0 overruns:65 carrier:0
  collisions:0 txqueuelen:500
  RX bytes:237793761392 (221.4 GiB)  TX bytes:333630070 (318.1 MiB)
# netstat -s  |grep -i retrans
1310 segments retransmited
35 times recovered from packet loss due to fast retransmit
1 timeouts after reno fast retransmit
41 fast retransmits
1236 retransmits in slow start

So retransmissions are 0.025% of total packets received from the guest.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 09/08/2010 01:40:11 PM:


___

 UDP (#numtxqs=8)
  N#  BW1 BW2   (%)  SD1 SD2   (%)
  __
  4   29836   56761 (90.24)   67     63     (-5.97)
  8   27666   63767 (130.48)  326    265    (-18.71)
  16  25452   60665 (138.35)  1396   1269   (-9.09)
  32  26172   63491 (142.59)  5617   4202   (-25.19)
  48  26146   64629 (147.18)  12813  9316   (-27.29)
  64  25575   65448 (155.90)  23063  16346  (-29.12)
  128 26454   63772 (141.06)  91054  85051  (-6.59)
  __
  N#: Number of netperf sessions, 90 sec runs
  BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
SD for original code
  BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
SD for new code. e.g. BW2=40716 means average BW2 was
20358 mbps.
 

 What happens with a single netperf?
 host - guest performance with TCP and small packet speed
 are also worth measuring.

Guest - Host (single netperf):
I am getting a drop of almost 20%. I am trying to figure out
why.

Host - guest (single netperf):
I am getting an improvement of almost 15%. Again - unexpected.

Guest - Host TCP_RR: I get an average 7.4% increase in #packets
for runs upto 128 sessions. With fewer netperf (under 8), there
was a drop of 3-7% in #packets, but beyond that, the #packets
improved significantly to give an average improvement of 7.4%.

So it seems that fewer sessions is having negative effect for
some reason on the tx side. The code path in virtio-net has not
changed much, so the drop in some cases is quite unexpected.

Thanks,

- KK



Re: [RFC PATCH 1/4] Add a new API to virtio-pci

2010-09-08 Thread Krishna Kumar2
Rusty Russell ru...@rustcorp.com.au wrote on 09/09/2010 09:19:39 AM:

 On Wed, 8 Sep 2010 04:59:05 pm Krishna Kumar wrote:
  Add virtio_get_queue_index() to get the queue index of a
  vq.  This is needed by the cb handler to locate the queue
  that should be processed.

 This seems a bit weird.  I mean, the driver used vdev->config->find_vqs
 to find the queues, which returns them (in order).  So, can't you put
this
 into your struct send_queue?

I am saving the vqs in the send_queue, but the cb needs
to locate the device txq from the svq. The only other way
I could think of is to iterate through the send_queue's
and compare svq against sq[i]->svq, but cb's happen quite
a bit. Is there a better way?

static void skb_xmit_done(struct virtqueue *svq)
{
        struct virtnet_info *vi = svq->vdev->priv;
        int qnum = virtio_get_queue_index(svq) - 1; /* 0 is RX vq */

        /* Suppress further interrupts. */
        virtqueue_disable_cb(svq);

        /* We were probably waiting for more output buffers. */
        netif_wake_subqueue(vi->dev, qnum);
}

 Also, why define VIRTIO_MAX_TXQS?  If the driver can't handle all of
them,
 it should simply not use them...

The main reason was vhost :) Since vhost_net_release
should not fail (__fput can't handle f_op->release()
failure), I needed a maximum number of socks to
clean up:

#define MAX_VQS (1 + VIRTIO_MAX_TXQS)
static int vhost_net_release(struct inode *inode, struct file *f)
{
        struct vhost_net *n = f->private_data;
        struct vhost_dev *dev = &n->dev;
        struct socket *socks[MAX_VQS];
        int i;

        vhost_net_stop(n, socks);
        vhost_net_flush(n);
        vhost_dev_cleanup(dev);

        for (i = n->dev.nvqs - 1; i >= 0; i--)
                if (socks[i])
                        fput(socks[i]->file);
        ...
}

Thanks,

- KK
