Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/12/2010 10:39:07 PM:

  Sorry for the delay, I was sick last couple of days. The results
  with your patch are (%'s over original code):
 
  Code   BW%   CPU%   RemoteCPU
  MQ (#txq=16)   31.4% 38.42% 6.41%
  MQ+MST (#txq=16)   28.3% 18.9%  -10.77%
 
  The patch helps CPU utilization but didn't help single stream
  drop.
 
  Thanks,

 What other shared TX/RX locks are there?  In your setup, is the same
 macvtap socket structure used for RX and TX?  If yes this will create
 cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
 there might also be contention on the lock in sk_sleep waitqueue.
 Anything else?
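
 For illustration only (not from the patch or the kernel sources): a minimal
 kernel-style sketch of the false-sharing pattern being described, where
 putting the TX- and RX-side counters on separate cache lines avoids the
 bounce. The struct and field names are made up to mirror the socket
 counters mentioned above.

 /* Hypothetical layout: if wmem_alloc (written by the TX path) and
  * rmem_alloc (written by the RX path) share one cache line, two CPUs
  * keep pulling that line back and forth.  Forcing each hot counter
  * onto its own cache line removes the bounce. */
 struct demo_sock_counters {
         atomic_t wmem_alloc ____cacheline_aligned_in_smp;  /* TX side */
         atomic_t rmem_alloc ____cacheline_aligned_in_smp;  /* RX side */
 };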

The patch is not introducing any locking (both vhost and virtio-net).
The single stream drop is due to different vhost threads handling the
RX/TX traffic.

I added a heuristic (fuzzy) to determine if more than one flow
is being used on the device, and if not, use vhost[0] for both
tx and rx (vhost_poll_queue figures this out before waking up
the suitable vhost thread).  Testing shows that single stream
performance is as good as the original code.

__
   #txqs = 2 (#vhosts = 3)
#     BW1     BW2     (%)       CPU1   CPU2   (%)        RCPU1   RCPU2   (%)
__________________________________________________________________________
1     77344   74973   (-3.06)   172    143    (-16.86)   358     324     (-9.49)
2     20924   21107   (.87)     107    103    (-3.73)    220     217     (-1.36)
4     21629   32911   (52.16)   214    391    (82.71)    446     616     (38.11)
8     21678   34359   (58.49)   428    845    (97.42)    892     1286    (44.17)
16    22046   34401   (56.04)   841    1677   (99.40)    1785    2585    (44.81)
24    22396   35117   (56.80)   1272   2447   (92.37)    2667    3863    (44.84)
32    22750   35158   (54.54)   1719   3233   (88.07)    3569    5143    (44.10)
40    23041   35345   (53.40)   2219   3970   (78.90)    4478    6410    (43.14)
48    23209   35219   (51.74)   2707   4685   (73.06)    5386    7684    (42.66)
64    23215   35209   (51.66)   3639   6195   (70.23)    7206    10218   (41.79)
80    23443   35179   (50.06)   4633   7625   (64.58)    9051    12745   (40.81)
96    24006   36108   (50.41)   5635   9096   (61.41)    10864   15283   (40.67)
128   23601   35744   (51.45)   7475   12104  (61.92)    14495   20405   (40.77)
__
SUM: BW: (37.6) CPU: (69.0) RCPU: (41.2)

__
   #txqs = 8 (#vhosts = 5)
#     BW1     BW2     (%)       CPU1   CPU2   (%)        RCPU1   RCPU2   (%)
__________________________________________________________________________
1     77344   75341   (-2.58)   172    171    (-.58)     358     356     (-.55)
2     20924   26872   (28.42)   107    135    (26.16)    220     262     (19.09)
4     21629   33594   (55.31)   214    394    (84.11)    446     615     (37.89)
8     21678   39714   (83.19)   428    949    (121.72)   892     1358    (52.24)
16    22046   39879   (80.88)   841    1791   (112.96)   1785    2737    (53.33)
24    22396   38436   (71.61)   1272   2111   (65.95)    2667    3453    (29.47)
32    22750   38776   (70.44)   1719   3594   (109.07)   3569    5421    (51.89)
40    23041   38023   (65.02)   2219   4358   (96.39)    4478    6507    (45.31)
48    23209   33811   (45.68)   2707   4047   (49.50)    5386    6222    (15.52)
64    23215   30212   (30.13)   3639   3858   (6.01)     7206    5819    (-19.24)
80    23443   34497   (47.15)   4633   7214   (55.70)    9051    10776   (19.05)
96    24006   30990   (29.09)   5635   5731   (1.70)     10864   8799    (-19.00)
128   23601   29413   (24.62)   7475   7804   (4.40)     14495   11638   (-19.71)
__
SUM: BW: (40.1) CPU: (35.7) RCPU: (4.1)
___


The SD numbers are also good (same table as before, but SD
instead of CPU):

__
   #txqs = 2 (#vhosts = 3)
#     BW%     SD1     SD2    (%)        RSD1    RSD2    (%)
____________________________________________________________
1     -3.06   5       4      (-20.00)   21      19      (-9.52)
2     .87     6       6      (0)        27      27      (0)
4     52.16   26      32     (23.07)    108     103     (-4.62)
8     58.49   103     146    (41.74)    431     445     (3.24)
16    56.04   407     514    (26.28)    1729    1586    (-8.27)
24    56.80   934     1161   (24.30)    3916    3665    (-6.40)
32    54.54   1668    2160   (29.49)    6925    6872    (-.76)
40    53.40   2655    3317   (24.93)    10712   10707   (-.04)
48    51.74   3920    4486   (14.43)    15598   14715   (-5.66)
64    51.66   7096    8250   (16.26)    28099   27211   (-3.16)
80    50.06   11240   12586  (11.97)    43913   42070   (-4.19)
96    50.41   16342   16976

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Michael S. Tsirkin
On Thu, Oct 14, 2010 at 01:28:58PM +0530, Krishna Kumar2 wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 10/12/2010 10:39:07 PM:
 
   Sorry for the delay, I was sick last couple of days. The results
   with your patch are (%'s over original code):
  
   Code   BW%   CPU%   RemoteCPU
   MQ (#txq=16)   31.4% 38.42% 6.41%
   MQ+MST (#txq=16)   28.3% 18.9%  -10.77%
  
   The patch helps CPU utilization but didn't help single stream
   drop.
  
   Thanks,
 
  What other shared TX/RX locks are there?  In your setup, is the same
  macvtap socket structure used for RX and TX?  If yes this will create
  cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
  there might also be contention on the lock in sk_sleep waitqueue.
  Anything else?
 
 The patch is not introducing any locking (both vhost and virtio-net).
 The single stream drop is due to different vhost threads handling the
 RX/TX traffic.
 
 I added a heuristic (fuzzy) to determine if more than one flow
 is being used on the device, and if not, use vhost[0] for both
 tx and rx (vhost_poll_queue figures this out before waking up
 the suitable vhost thread).  Testing shows that single stream
 performance is as good as the original code.

...

 This approach works nicely for both single and multiple stream.
 Does this look good?
 
 Thanks,
 
 - KK

Yes, but I guess it depends on the heuristic :) What's the logic?

-- 
MST


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
 Michael S. Tsirkin m...@redhat.com
   What other shared TX/RX locks are there?  In your setup, is the same
   macvtap socket structure used for RX and TX?  If yes this will create
   cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
   there might also be contention on the lock in sk_sleep waitqueue.
   Anything else?
 
  The patch is not introducing any locking (both vhost and virtio-net).
  The single stream drop is due to different vhost threads handling the
  RX/TX traffic.
 
  I added a heuristic (fuzzy) to determine if more than one flow
  is being used on the device, and if not, use vhost[0] for both
  tx and rx (vhost_poll_queue figures this out before waking up
  the suitable vhost thread).  Testing shows that single stream
  performance is as good as the original code.

 ...

  This approach works nicely for both single and multiple stream.
  Does this look good?
 
  Thanks,
 
  - KK

 Yes, but I guess it depends on the heuristic :) What's the logic?

I track how recently each txq was used. If 0 or 1 txqs were used
recently, use vq[0] (which also handles rx); otherwise, use the
per-queue txq vqs (vq[1-n]). The code is:

/*
 * Algorithm for selecting vq:
 *
 * Condition                                     Return
 * RX vq                                         vq[0]
 * If all txqs unused                            vq[0]
 * If one txq used, and new txq is same          vq[0]
 * If one txq used, and new txq is different     vq[vq->qnum]
 * If > 1 txqs used                              vq[vq->qnum]
 *      Where "used" means the txq was used in the last 'n' jiffies.
 *
 * Note: locking is not required as an update race will only result in
 * a different worker being woken up.
 */
static inline struct vhost_virtqueue *vhost_find_vq(struct vhost_poll *poll)
{
        if (poll->vq->qnum) {
                struct vhost_dev *dev = poll->vq->dev;
                struct vhost_virtqueue *vq = &dev->vqs[0];
                unsigned long max_time = jiffies - 5; /* Some macro needed */
                unsigned long *table = dev->jiffies;
                int i, used = 0;

                /* Count txqs used within the window; fall back to the
                 * caller's own vq as soon as a second one is found. */
                for (i = 0; i < dev->nvqs - 1; i++) {
                        if (time_after_eq(table[i], max_time) && ++used > 1) {
                                vq = poll->vq;
                                break;
                        }
                }
                table[poll->vq->qnum - 1] = jiffies;
                return vq;
        }

        /* RX is handled by the same worker thread */
        return poll->vq;
}

void vhost_poll_queue(struct vhost_poll *poll)
{
        struct vhost_virtqueue *vq = vhost_find_vq(poll);

        vhost_work_queue(vq, &poll->work);
}

Since poll batches packets, find_vq does not seem to add much
to the CPU utilization (or BW). I am sure that code can be
optimized much better.

The results I sent in my last mail were without your use_mm
patch, and the only tuning was to make vhost threads run on
only cpus 0-3 (though the performance is good even without
that). I will test it later today with the use_mm patch too.

Thanks,

- KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 10/14/2010 02:34:01 PM:

 void vhost_poll_queue(struct vhost_poll *poll)
 {
 struct vhost_virtqueue *vq = vhost_find_vq(poll);

 vhost_work_queue(vq, &poll->work);
 }

 Since poll batches packets, find_vq does not seem to add much
 to the CPU utilization (or BW). I am sure that code can be
 optimized much better.

 The results I sent in my last mail were without your use_mm
 patch, and the only tuning was to make vhost threads run on
 only cpus 0-3 (though the performance is good even without
 that). I will test it later today with the use_mm patch too.

There's a significant reduction in CPU/SD utilization with your
patch. Following is the performance of ORG vs MQ+mm patch:

_
   Org vs MQ+mm patch txq=2
#     BW%     CPU%     RCPU%    SD%      RSD%
_________________________________________________
1     2.26    -1.16    .27      -20.00   0
2     35.07   29.90    21.81    0        -11.11
4     55.03   84.57    37.66    26.92    -4.62
8     73.16   118.69   49.21    45.63    -.46
16    77.43   98.81    47.89    24.07    -7.80
24    71.59   105.18   48.44    62.84    18.18
32    70.91   102.38   47.15    49.22    8.54
40    63.26   90.58    41.00    85.27    37.33
48    45.25   45.99    11.23    14.31    -12.91
64    42.78   41.82    5.50     .43      -25.12
80    31.40   7.31     -18.69   15.78    -11.93
96    27.60   7.79     -18.54   17.39    -10.98
128   23.46   -11.89   -34.41   -.41     -25.53
_
BW: 40.2  CPU/RCPU: 29.9,-2.2   SD/RSD: 12.0,-15.6


Following is the performance of MQ vs MQ+mm patch:
_
MQ vs MQ+mm patch
#     BW%      CPU%     RCPU%    SD%      RSD%
_________________________________________________
1     4.98     -.58     .84      -20.00   0
2     5.17     2.96     2.29     0        -4.00
4     -.18     .25      -.16     3.12     .98
8     -5.47    -1.36    -1.98    17.18    16.57
16    -1.90    -6.64    -3.54    -14.83   -12.12
24    -.01     23.63    14.65    57.61    46.64
32    .27      -3.19    -3.11    -22.98   -22.91
40    -1.06    -2.96    -2.96    -4.18    -4.10
48    -.28     -2.34    -3.71    -2.41    -3.81
64    9.71     33.77    30.65    81.44    77.09
80    -10.69   -31.07   -31.70   -29.22   -29.88
96    -1.14    5.98     .56      -11.57   -16.14
128   -.93     -15.60   -18.31   -19.89   -22.65
_
  BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6
_

Each test case is for 60 secs, sum over two runs (except
when number of netperf sessions is 1, which has 7 runs
of 10 secs each), numcpus=4, numtxqs=8, etc. No tuning
other than taskset each vhost to cpus 0-3.

Thanks,

- KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 10/14/2010 05:47:54 PM:

Sorry, it should read txq=8 below.

- KK

 There's a significant reduction in CPU/SD utilization with your
 patch. Following is the performance of ORG vs MQ+mm patch:

 _
Org vs MQ+mm patch txq=2
 #     BW%     CPU%     RCPU%    SD%      RSD%
 _________________________________________________
 1     2.26    -1.16    .27      -20.00   0
 2     35.07   29.90    21.81    0        -11.11
 4     55.03   84.57    37.66    26.92    -4.62
 8     73.16   118.69   49.21    45.63    -.46
 16    77.43   98.81    47.89    24.07    -7.80
 24    71.59   105.18   48.44    62.84    18.18
 32    70.91   102.38   47.15    49.22    8.54
 40    63.26   90.58    41.00    85.27    37.33
 48    45.25   45.99    11.23    14.31    -12.91
 64    42.78   41.82    5.50     .43      -25.12
 80    31.40   7.31     -18.69   15.78    -11.93
 96    27.60   7.79     -18.54   17.39    -10.98
 128   23.46   -11.89   -34.41   -.41     -25.53
 _
 BW: 40.2  CPU/RCPU: 29.9,-2.2   SD/RSD: 12.0,-15.6

 Following is the performance of MQ vs MQ+mm patch:
 _
 MQ vs MQ+mm patch
 #     BW%      CPU%     RCPU%    SD%      RSD%
 _________________________________________________
 1     4.98     -.58     .84      -20.00   0
 2     5.17     2.96     2.29     0        -4.00
 4     -.18     .25      -.16     3.12     .98
 8     -5.47    -1.36    -1.98    17.18    16.57
 16    -1.90    -6.64    -3.54    -14.83   -12.12
 24    -.01     23.63    14.65    57.61    46.64
 32    .27      -3.19    -3.11    -22.98   -22.91
 40    -1.06    -2.96    -2.96    -4.18    -4.10
 48    -.28     -2.34    -3.71    -2.41    -3.81
 64    9.71     33.77    30.65    81.44    77.09
 80    -10.69   -31.07   -31.70   -29.22   -29.88
 96    -1.14    5.98     .56      -11.57   -16.14
 128   -.93     -15.60   -18.31   -19.89   -22.65
 _
   BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6
 _

 Each test case is for 60 secs, sum over two runs (except
 when number of netperf sessions is 1, which has 7 runs
 of 10 secs each), numcpus=4, numtxqs=8, etc. No tuning
 other than taskset each vhost to cpus 0-3.

 Thanks,

 - KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-11 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/06/2010 07:04:31 PM:

 On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
  For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
  for degradation for 1 stream case:

 I thought about possible RX/TX contention reasons, and I realized that
 we get/put the mm counter all the time.  So I write the following: I
 haven't seen any performance gain from this in a single queue case, but
 maybe this will help multiqueue?

Sorry for the delay, I was sick last couple of days. The results
with your patch are (%'s over original code):

Code   BW%   CPU%   RemoteCPU
MQ (#txq=16)   31.4% 38.42% 6.41%
MQ+MST (#txq=16)   28.3% 18.9%  -10.77%

The patch helps CPU utilization but didn't help single stream
drop.

Thanks,

- KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Arnd Bergmann
On Tuesday 05 October 2010, Krishna Kumar2 wrote:
 After testing various combinations of #txqs, #vhosts, #netperf
 sessions, I think the drop for 1 stream is due to TX and RX for
 a flow being processed on different cpus.  I did two more tests:
 1. Pin vhosts to same CPU:
 - BW drop is much lower for 1 stream case (- 5 to -8% range)
 - But performance is not so high for more sessions.
 2. Changed vhost to be single threaded:
   - No degradation for 1 session, and improvement for upto
   8, sometimes 16 streams (5-12%).
   - BW degrades after that, all the way till 128 netperf sessions.
   - But overall CPU utilization improves.
 Summary of the entire run (for 1-128 sessions):
 txq=4:  BW: (-2.3)  CPU: (-16.5)RCPU: (-5.3)
 txq=16: BW: (-1.9)  CPU: (-24.9)RCPU: (-9.6)
 
 I don't see any reasons mentioned above.  However, for higher
 number of netperf sessions, I see a big increase in retransmissions:
 ___
 #netperf  ORG   NEW
 BW (#retr)BW (#retr)
 ___
 1  70244 (0) 64102 (0)
 4  21421 (0) 36570 (416)
 8  21746 (0) 38604 (148)
 16 21783 (0) 40632 (464)
 32 22677 (0) 37163 (1053)
 64 23648 (4) 36449 (2197)
 128  23251 (2) 31676 (3185)
 ___


This smells like it could be related to a problem that Ben Greear found
recently (see "macvlan: Enable qdisc backoff logic"). When the hardware
is busy, we used to just drop the packet. With Ben's patch, we return -EAGAIN
to qemu (or vhost-net) to trigger a resend.

I suppose what we really should do is feed that condition back to the
guest network stack and implement the backoff in there.
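
To make the backoff idea concrete, here is a small self-contained sketch of
the protocol being discussed (an assumption of the shape, not Ben's actual
macvlan patch; tx_ring and tx_enqueue are made-up names): a bounded transmit
ring that reports "busy" instead of silently dropping, so the producer
(qemu/vhost-net, or ultimately the guest stack) can requeue and retry.

#include <errno.h>
#include <stdbool.h>

struct tx_ring {
        unsigned int head, tail, size;  /* free-running producer/consumer counters */
};

static bool tx_ring_full(const struct tx_ring *r)
{
        return r->head - r->tail >= r->size;
}

static int tx_enqueue(struct tx_ring *r, const void *pkt)
{
        (void)pkt;                      /* payload handling elided */
        if (tx_ring_full(r))
                return -EAGAIN;         /* busy: caller backs off and resends */
        /* ... copy pkt into slot r->head % r->size ... */
        r->head++;
        return 0;
}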

Arnd


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Michael S. Tsirkin
On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
 For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
 for degradation for 1 stream case:

I thought about possible RX/TX contention reasons, and I realized that
we get/put the mm counter all the time.  So I write the following: I
haven't seen any performance gain from this in a single queue case, but
maybe this will help multiqueue?

Thanks,

Michael S. Tsirkin (2):
  vhost: put mm after thread stop
  vhost-net: batch use/unuse mm

 drivers/vhost/net.c   |7 ---
 drivers/vhost/vhost.c |   16 ++--
 2 files changed, 10 insertions(+), 13 deletions(-)
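
A rough sketch of what batching use/unuse mm means in practice, as I read the
patch titles above -- not the actual patches.  use_mm()/unuse_mm() are the
real kernel helpers; vhost_worker_sketch() and vhost_dequeue_work() are
made-up placeholders for the worker loop and however it pulls its next item.

/* Sketch only: attach to the owner's mm once for the life of the worker
 * thread and detach when the thread stops, instead of getting/putting
 * the mm around every piece of work. */
static int vhost_worker_sketch(void *data)
{
        struct vhost_dev *dev = data;

        use_mm(dev->mm);                        /* once, up front */

        while (!kthread_should_stop()) {
                struct vhost_work *work;

                work = vhost_dequeue_work(dev); /* placeholder helper */
                if (work)
                        work->fn(work);
                else
                        schedule();
        }

        unuse_mm(dev->mm);                      /* once, on the way out */
        return 0;
}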

-- 
1.7.3-rc1


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/06/2010 07:04:31 PM:

 On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
  For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
  for degradation for 1 stream case:

 I thought about possible RX/TX contention reasons, and I realized that
 we get/put the mm counter all the time.  So I write the following: I
 haven't seen any performance gain from this in a single queue case, but
 maybe this will help multiqueue?

Great! I am on vacation tomorrow, but will test with this patch
tomorrow night.

Thanks,

- KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Krishna Kumar2
Arnd Bergmann a...@arndb.de wrote on 10/06/2010 05:49:00 PM:

  I don't see any reasons mentioned above.  However, for higher
  number of netperf sessions, I see a big increase in retransmissions:
  ___
  #netperf  ORG   NEW
  BW (#retr)BW (#retr)
  ___
  1  70244 (0) 64102 (0)
  4  21421 (0) 36570 (416)
  8  21746 (0) 38604 (148)
  16 21783 (0) 40632 (464)
  32 22677 (0) 37163 (1053)
  64 23648 (4) 36449 (2197)
   128  23251 (2) 31676 (3185)
  ___


  This smells like it could be related to a problem that Ben Greear found
  recently (see "macvlan: Enable qdisc backoff logic"). When the hardware
  is busy, we used to just drop the packet. With Ben's patch, we return
  -EAGAIN to qemu (or vhost-net) to trigger a resend.

 I suppose what we really should do is feed that condition back to the
 guest network stack and implement the backoff in there.

Thanks for the pointer. I will take a look at this as I hadn't seen
this patch earlier. Is there any way to figure out if this is the
issue?

Thanks,

- KK



Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 10/05/2010 11:53:23 PM:

   Any idea where does this come from?
   Do you see more TX interrupts? RX interrupts? Exits?
   Do interrupts bounce more between guest CPUs?
   4. Identify reasons for single netperf BW regression.
 
  After testing various combinations of #txqs, #vhosts, #netperf
  sessions, I think the drop for 1 stream is due to TX and RX for
  a flow being processed on different cpus.

 Right. Can we fix it?

I am not sure how to. My initial patch had one thread but gave
small gains and ran into limitations once number of sessions
became large.

   I did two more tests:
  1. Pin vhosts to same CPU:
  - BW drop is much lower for 1 stream case (- 5 to -8% range)
  - But performance is not so high for more sessions.
  2. Changed vhost to be single threaded:
- No degradation for 1 session, and improvement for upto
   8, sometimes 16 streams (5-12%).
- BW degrades after that, all the way till 128 netperf
sessions.
- But overall CPU utilization improves.
  Summary of the entire run (for 1-128 sessions):
  txq=4:  BW: (-2.3)  CPU: (-16.5)RCPU: (-5.3)
  txq=16: BW: (-1.9)  CPU: (-24.9)RCPU: (-9.6)
 
  I don't see any reasons mentioned above.  However, for higher
  number of netperf sessions, I see a big increase in retransmissions:

 Hmm, ok, and do you see any errors?

I haven't seen any errors in any statistics, messages, etc. Also no
retransmissions for txq=1.

  Single netperf case didn't have any retransmissions so that is not
  the cause for drop.  I tested ixgbe (MQ):
  ___
  #netperf  ixgbe ixgbe (pin intrs to cpu#0 on
 both server/client)
  BW (#retr)  BW (#retr)
  ___
  1   3567 (117)  6000 (251)
  2   4406 (477)  6298 (725)
  4   6119 (1085) 7208 (3387)
  8   6595 (4276) 7381 (15296)
   16  6651 (11651)    6856 (30394)

 Interesting.
 You are saying we get much more retransmissions with physical nic as
 well?

Yes, with ixgbe. I re-ran with 16 netperfs running for 15 secs on
both ixgbe and cxgb3 just now to reconfirm:

ixgbe: BW: 6186.85  SD/Remote: 135.711, 339.376   CPU/Remote: 79.99, 200.00   Retrans: 545
cxgb3: BW: 8051.07  SD/Remote: 144.416, 260.487   CPU/Remote: 110.88, 200.00  Retrans: 0

However 64 netperfs for 30 secs gave:

ixgbe: BW: 6691.12  SD/Remote: 8046.617, 5259.992  CPU/Remote: 1223.86, 799.97  Retrans: 1424
cxgb3: BW: 7799.16  SD/Remote: 2589.875, 4317.013  CPU/Remote: 480.39, 800.64   Retrans: 649

# ethtool -i eth4
driver: ixgbe
version: 2.0.84-k2
firmware-version: 0.9-3
bus-info: :1f:00.1

# ifconfig output:
   RX packets:783241 errors:0 dropped:0 overruns:0 frame:0
   TX packets:689533 errors:0 dropped:0 overruns:0 carrier:0
   collisions:0 txqueuelen:1000

# lspci output:
1f:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
Subsystem: Intel Corporation Ethernet Server Adapter X520-2
Flags: bus master, fast devsel, latency 0, IRQ 30
Memory at 9890 (64-bit, prefetchable) [size=512K]
I/O ports at 2020 [size=32]
Memory at 98a0 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
Capabilities: [a0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 00-1b-21-ff-ff-40-4a-b4
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
Kernel driver in use: ixgbe
Kernel modules: ixgbe

  I haven't done this right now since I don't have a setup.  I guess
  it would be limited by wire speed and gains may not be there.  I
  will try to do this later when I get the setup.

 OK but at least need to check that it does not hurt things.

Yes, sure.

  Summary:
 
  1. Average BW increase for regular I/O is best for #txq=16 with the
 least CPU utilization increase.
  2. The average BW for 512 byte I/O is best for lower #txq=2. For higher
 #txqs, BW increased only after a particular #netperf sessions - in
 my testing that limit was 32 netperf sessions.
  3. Multiple txq for guest by itself doesn't seem to have any issues.
 Guest CPU% increase is slightly higher than BW improvement.  I
 think it is true for all mq drivers since more paths run in parallel
 upto the device instead of sleeping and allowing one thread to send
 all packets via qdisc_restart.
  4. Having high number of txqs gives better gains 

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Arnd Bergmann
On Wednesday 06 October 2010 19:14:42 Krishna Kumar2 wrote:
 Arnd Bergmann a...@arndb.de wrote on 10/06/2010 05:49:00 PM:
 
   I don't see any reasons mentioned above.  However, for higher
   number of netperf sessions, I see a big increase in retransmissions:
   ___
   #netperf  ORG   NEW
   BW (#retr)BW (#retr)
   ___
   1  70244 (0) 64102 (0)
   4  21421 (0) 36570 (416)
   8  21746 (0) 38604 (148)
   16 21783 (0) 40632 (464)
   32 22677 (0) 37163 (1053)
   64 23648 (4) 36449 (2197)
   128  23251 (2) 31676 (3185)
   ___
 
 
   This smells like it could be related to a problem that Ben Greear found
   recently (see "macvlan: Enable qdisc backoff logic"). When the hardware
   is busy, we used to just drop the packet. With Ben's patch, we return
   -EAGAIN to qemu (or vhost-net) to trigger a resend.
 
  I suppose what we really should do is feed that condition back to the
  guest network stack and implement the backoff in there.
 
 Thanks for the pointer. I will take a look at this as I hadn't seen
 this patch earlier. Is there any way to figure out if this is the
 issue?

I think a good indication would be if this changes with/without the
patch, and if you see -EAGAIN in qemu with the patch applied.

Arnd


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Michael S. Tsirkin
On Wed, Oct 06, 2010 at 11:13:31PM +0530, Krishna Kumar2 wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 10/05/2010 11:53:23 PM:
 
Any idea where does this come from?
Do you see more TX interrupts? RX interrupts? Exits?
Do interrupts bounce more between guest CPUs?
4. Identify reasons for single netperf BW regression.
  
   After testing various combinations of #txqs, #vhosts, #netperf
   sessions, I think the drop for 1 stream is due to TX and RX for
   a flow being processed on different cpus.
 
  Right. Can we fix it?
 
 I am not sure how to. My initial patch had one thread but gave
 small gains and ran into limitations once number of sessions
 became large.

Sure. We will need multiple RX queues, and have a single
thread handle a TX and RX pair. Then we need to make sure packets
from a given flow on TX land on the same thread on RX.
As flows can be hashed differently, for this to work we'll have to
expose this info in host/guest interface.
But since multiqueue implies host/guest ABI changes anyway,
this point is moot.
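
To make the pairing idea concrete, a tiny self-contained sketch (all names
here are illustrative, not an existing interface): host and guest map a flow
to the same queue-pair index with an agreed hash, so TX[i] and RX[i] of one
flow end up on the same vhost thread.  The real interface would still have
to carry the hash (or the chosen queue) across the host/guest boundary, as
noted above.

#include <stdint.h>

struct flow_key {
        uint32_t saddr, daddr;
        uint16_t sport, dport;
};

static uint32_t flow_hash(const struct flow_key *k)
{
        /* Any hash works, as long as host and guest agree on it. */
        uint32_t h = k->saddr ^ k->daddr;

        h ^= ((uint32_t)k->sport << 16) | k->dport;
        h ^= h >> 16;
        return h * 0x9e3779b1u;
}

static unsigned int flow_to_pair(const struct flow_key *k, unsigned int num_pairs)
{
        /* TX[i] and RX[i] are then served by the same vhost thread. */
        return flow_hash(k) % num_pairs;
}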

BTW, an interesting approach could be using bonding
and multiple virtio-net interfaces.
What are the disadvantages of such a setup?  One advantage
is it can be made to work in existing guests.

I did two more tests:
   1. Pin vhosts to same CPU:
   - BW drop is much lower for 1 stream case (- 5 to -8% range)
   - But performance is not so high for more sessions.
   2. Changed vhost to be single threaded:
 - No degradation for 1 session, and improvement for upto
8, sometimes 16 streams (5-12%).
 - BW degrades after that, all the way till 128 netperf
 sessions.
 - But overall CPU utilization improves.
   Summary of the entire run (for 1-128 sessions):
   txq=4:  BW: (-2.3)  CPU: (-16.5)RCPU: (-5.3)
   txq=16: BW: (-1.9)  CPU: (-24.9)RCPU: (-9.6)
  
   I don't see any reasons mentioned above.  However, for higher
   number of netperf sessions, I see a big increase in retransmissions:
 
  Hmm, ok, and do you see any errors?
 
 I haven't seen any in any statistics, messages, etc.

Herbert, could you help out debugging this increase in retransmissions
please?  Older mail on netdev in this thread has some numbers that seem
to imply that we start hitting retransmissions much more as # of flows
goes up.

 Also no
 retranmissions for txq=1.

While it's nice that we have this parameter, the need to choose between
single stream and multi stream performance when you start the vm makes
this patch much less interesting IMHO.


-- 
MST


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-05 Thread Krishna Kumar2
Michael S. Tsirkin m...@redhat.com wrote on 09/19/2010 06:14:43 PM:

 Could you document how exactly do you measure multistream bandwidth:
 netperf flags, etc?

All results were without any netperf flags or system tuning:
for i in $list
do
        netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
done
wait
Another script processes the result files.  It also displays the
start time/end time of each iteration to make sure skew due to
parallel netperfs is minimal.

I changed the vhost functionality once more to try to get the
best model, the new model being:
1. #numtxqs=1  ->  #vhosts=1, this thread handles both RX/TX.
2. #numtxqs>1  ->  vhost[0] handles RX and vhost[1-MAX] handles
   TX[0-n], where MAX is 4.  Beyond numtxqs=4, the remaining TX
   queues are handled by vhost threads in round-robin fashion
   (see the sketch below).
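
The mapping in point 2, written out as a small sketch (my reading of the
description, not code from the patch; txq_to_vhost and MAX_TX_VHOSTS are
illustrative names):

#define MAX_TX_VHOSTS   4       /* the "MAX" above */

/* vhost[0] owns RX (and everything when numtxqs <= 1); TX queue i goes
 * to one of vhost[1..MAX_TX_VHOSTS], wrapping round-robin once numtxqs
 * exceeds MAX_TX_VHOSTS. */
static unsigned int txq_to_vhost(unsigned int txq, unsigned int numtxqs)
{
        if (numtxqs <= 1)
                return 0;
        return 1 + txq % MAX_TX_VHOSTS;
}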

Results from here on are with these changes, and only tuning is
to set each vhost's affinity to CPUs[0-3] (taskset -p f vhost-pids).

 Any idea where does this come from?
 Do you see more TX interrupts? RX interrupts? Exits?
 Do interrupts bounce more between guest CPUs?
 4. Identify reasons for single netperf BW regression.

After testing various combinations of #txqs, #vhosts, #netperf
sessions, I think the drop for 1 stream is due to TX and RX for
a flow being processed on different cpus.  I did two more tests:
1. Pin vhosts to same CPU:
- BW drop is much lower for 1 stream case (- 5 to -8% range)
- But performance is not so high for more sessions.
2. Changed vhost to be single threaded:
  - No degradation for 1 session, and improvement for up to
  8, sometimes 16 streams (5-12%).
  - BW degrades after that, all the way till 128 netperf sessions.
  - But overall CPU utilization improves.
Summary of the entire run (for 1-128 sessions):
txq=4:  BW: (-2.3)  CPU: (-16.5)  RCPU: (-5.3)
txq=16: BW: (-1.9)  CPU: (-24.9)  RCPU: (-9.6)

I don't see any reasons mentioned above.  However, for higher
number of netperf sessions, I see a big increase in retransmissions:
___
#netperf  ORG   NEW
          BW (#retr)    BW (#retr)
___
1  70244 (0) 64102 (0)
4  21421 (0) 36570 (416)
8  21746 (0) 38604 (148)
16 21783 (0) 40632 (464)
32 22677 (0) 37163 (1053)
64 23648 (4) 36449 (2197)
128  23251 (2) 31676 (3185)
___

Single netperf case didn't have any retransmissions so that is not
the cause for drop.  I tested ixgbe (MQ):
___
#netperf  ixgbe ixgbe (pin intrs to cpu#0 on
   both server/client)
BW (#retr)  BW (#retr)
___
1   3567 (117)  6000 (251)
2   4406 (477)  6298 (725)
4   6119 (1085) 7208 (3387)
8   6595 (4276) 7381 (15296)
16  6651 (11651)    6856 (30394)
___

 5. Test perf in more scenarios:
small packets

512 byte packets - BW drop for upto 8 (sometimes 16) netperf sessions,
but increases with #sessions:
___
#     BW1     BW2    (%)       CPU1    CPU2   (%)      RCPU1   RCPU2   (%)
___________________________________________________________________________
1     4043    3800   (-6.0)    50      50     (0)      86      98      (13.9)
2     8358    7485   (-10.4)   153     178    (16.3)   230     264     (14.7)
4     20664   13567  (-34.3)   448     490    (9.3)    530     624     (17.7)
8     25198   17590  (-30.1)   967     1021   (5.5)    1085    1257    (15.8)
16    23791   24057  (1.1)     1904    2220   (16.5)   2156    2578    (19.5)
24    23055   26378  (14.4)    2807    3378   (20.3)   3225    3901    (20.9)
32    22873   27116  (18.5)    3748    4525   (20.7)   4307    5239    (21.6)
40    22876   29106  (27.2)    4705    5717   (21.5)   5388    6591    (22.3)
48    23099   31352  (35.7)    5642    6986   (23.8)   6475    8085    (24.8)
64    22645   30563  (34.9)    7527    9027   (19.9)   8619    10656   (23.6)
80    22497   31922  (41.8)    9375    11390  (21.4)   10736   13485   (25.6)
96    22509   32718  (45.3)    11271   13710  (21.6)   12927   16269   (25.8)
128   22255   32397  (45.5)    15036   18093  (20.3)   17144   21608   (26.0)
___
SUM:BW: (16.7)  CPU: (20.6) RCPU: (24.3)
___

 host -> guest
___
#   BW1   BW2 (%)     CPU1   CPU2 (%)    RCPU1

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-05 Thread Michael S. Tsirkin
On Tue, Oct 05, 2010 at 04:10:00PM +0530, Krishna Kumar2 wrote:
 Michael S. Tsirkin m...@redhat.com wrote on 09/19/2010 06:14:43 PM:
 
  Could you document how exactly do you measure multistream bandwidth:
  netperf flags, etc?
 
 All results were without any netperf flags or system tuning:
 for i in $list
 do
 netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
 done
 wait
 Another script processes the result files.  It also displays the
 start time/end time of each iteration to make sure skew due to
 parallel netperfs is minimal.
 
 I changed the vhost functionality once more to try to get the
 best model, the new model being:
 1. #numtxqs=1  ->  #vhosts=1, this thread handles both RX/TX.
 2. #numtxqs>1  ->  vhost[0] handles RX and vhost[1-MAX] handles
    TX[0-n], where MAX is 4.  Beyond numtxqs=4, the remaining TX
    queues are handled by vhost threads in round-robin fashion.
 
 Results from here on are with these changes, and only tuning is
 to set each vhost's affinity to CPUs[0-3] (taskset -p f vhost-pids).
 
  Any idea where does this come from?
  Do you see more TX interrupts? RX interrupts? Exits?
  Do interrupts bounce more between guest CPUs?
  4. Identify reasons for single netperf BW regression.
 
 After testing various combinations of #txqs, #vhosts, #netperf
 sessions, I think the drop for 1 stream is due to TX and RX for
 a flow being processed on different cpus.

Right. Can we fix it?

  I did two more tests:
 1. Pin vhosts to same CPU:
 - BW drop is much lower for 1 stream case (- 5 to -8% range)
 - But performance is not so high for more sessions.
 2. Changed vhost to be single threaded:
   - No degradation for 1 session, and improvement for upto
 8, sometimes 16 streams (5-12%).
   - BW degrades after that, all the way till 128 netperf sessions.
   - But overall CPU utilization improves.
 Summary of the entire run (for 1-128 sessions):
 txq=4:  BW: (-2.3)  CPU: (-16.5)RCPU: (-5.3)
 txq=16: BW: (-1.9)  CPU: (-24.9)RCPU: (-9.6)
 
 I don't see any reasons mentioned above.  However, for higher
 number of netperf sessions, I see a big increase in retransmissions:

Hmm, ok, and do you see any errors?

 ___
 #netperf  ORG   NEW
 BW (#retr)BW (#retr)
 ___
 1  70244 (0) 64102 (0)
 4  21421 (0) 36570 (416)
 8  21746 (0) 38604 (148)
 16 21783 (0) 40632 (464)
 32 22677 (0) 37163 (1053)
 64 23648 (4) 36449 (2197)
  128  23251 (2) 31676 (3185)
 ___
 
 Single netperf case didn't have any retransmissions so that is not
 the cause for drop.  I tested ixgbe (MQ):
 ___
 #netperf  ixgbe ixgbe (pin intrs to cpu#0 on
both server/client)
 BW (#retr)  BW (#retr)
 ___
 1   3567 (117)  6000 (251)
 2   4406 (477)  6298 (725)
 4   6119 (1085) 7208 (3387)
 8   6595 (4276) 7381 (15296)
  16  6651 (11651)    6856 (30394)

Interesting.
You are saying we get much more retransmissions with physical nic as
well?

 ___
 
  5. Test perf in more scenarios:
 small packets
 
 512 byte packets - BW drop for upto 8 (sometimes 16) netperf sessions,
 but increases with #sessions:
 ___
 #     BW1     BW2    (%)       CPU1    CPU2   (%)      RCPU1   RCPU2   (%)
 ___________________________________________________________________________
 1     4043    3800   (-6.0)    50      50     (0)      86      98      (13.9)
 2     8358    7485   (-10.4)   153     178    (16.3)   230     264     (14.7)
 4     20664   13567  (-34.3)   448     490    (9.3)    530     624     (17.7)
 8     25198   17590  (-30.1)   967     1021   (5.5)    1085    1257    (15.8)
 16    23791   24057  (1.1)     1904    2220   (16.5)   2156    2578    (19.5)
 24    23055   26378  (14.4)    2807    3378   (20.3)   3225    3901    (20.9)
 32    22873   27116  (18.5)    3748    4525   (20.7)   4307    5239    (21.6)
 40    22876   29106  (27.2)    4705    5717   (21.5)   5388    6591    (22.3)
 48    23099   31352  (35.7)    5642    6986   (23.8)   6475    8085    (24.8)
 64    22645   30563  (34.9)    7527    9027   (19.9)   8619    10656   (23.6)
 80    22497   31922  (41.8)    9375    11390  (21.4)   10736   13485   (25.6)
 96    22509   32718  (45.3)    11271   13710  (21.6)   12927   16269   (25.8)
 128   22255   32397  (45.5)    15036   18093  (20.3)   17144   21608   (26.0)
 

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-19 Thread Michael S. Tsirkin
On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
 For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
 for degradation for 1 stream case:

Could you document how exactly do you measure multistream bandwidth:
netperf flags, etc?

 1. Without any tuning, BW falls -6.5%.

Any idea where does this come from?
Do you see more TX interrupts? RX interrupts? Exits?
Do interrupts bounce more between guest CPUs?


 2. When vhosts on server were bound to CPU0, BW was as good
as with original code.
 3. When new code was started with numtxqs=1 (or mq=off, which
is the default), there was no degradation.
 
Next steps:
---
 1. MQ RX patch is also complete - plan to submit once TX is OK (as
well as after identifying bandwidth degradations for some test
cases).
 2. Cache-align data structures: I didn't see any BW/SD improvement
after making the sq's (and similarly for vhost) cache-aligned
statically:
 struct virtnet_info {
 ...
 struct send_queue sq[16] ____cacheline_aligned_in_smp;
 ...
 };
 3. Migration is not tested.

4. Identify reasons for single netperf BW regression.

5. Test perf in more scenarios:
   small packets
   host -> guest
   guest -> external
   in last case:
 find some other way to measure host CPU utilization,
 try multiqueue and single queue devices

6. Use above to figure out what is a sane default for numtxqs.

 
 Review/feedback appreciated.
 
 Signed-off-by: Krishna Kumar krkum...@in.ibm.com
 ---


[v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-17 Thread Krishna Kumar
Following patches implement transmit MQ in virtio-net.  Also
included are the user qemu changes. MQ is disabled by default
unless qemu specifies it.

1. This feature was first implemented with a single vhost.
   Testing showed 3-8% performance gain for up to 8 netperf
   sessions (and sometimes 16), but BW dropped with more
   sessions.  However, adding more vhosts improved BW
   significantly all the way to 128 sessions. Multiple
   vhost is implemented in-kernel by passing an argument
   to SET_OWNER (retaining backward compatibility). The
   vhost patch adds 173 source lines (incl comments).
2. BW - CPU/SD equation: Average TCP performance increased
   23% compared to almost 70% for earlier patch (with
   unrestricted #vhosts).  SD improved -4.2% while it had
   increased 55% for the earlier patch.  Increasing #vhosts
   has its pros and cons, but this patch lays emphasis on
   reducing CPU utilization.  Another option could be a
   tunable to select number of vhosts threads.
3. Interoperability: Many combinations, but not all, of qemu,
   host, guest tested together.  Tested with multiple i/f's
   on guest, with both mq=on/off, vhost=on/off, etc.

  Changes from rev1:
  --
1. Move queue_index from virtio_pci_vq_info to virtqueue,
   and resulting changes to existing code and to the patch.
2. virtio-net probe uses virtio_config_val.
3. Remove constants: VIRTIO_MAX_TXQS, MAX_VQS, all arrays
   allocated on stack, etc.
4. Restrict number of vhost threads to 2 - I get much better
   cpu/sd results (without any tuning) with low number of vhost
   threads.  Higher vhosts gives better average BW performance
   (from average of 45%), but SD increases significantly (90%).
5. Working of vhost threads changes, eg for numtxqs=4:
   vhost-0: handles RX
   vhost-1: handles TX[0]
   vhost-0: handles TX[1]
   vhost-1: handles TX[2]
   vhost-0: handles TX[3]

  Enabling MQ on virtio:
  ---
When following options are passed to qemu:
- smp > 1
- vhost=on
- mq=on (new option, default:off)
then #txqueues = #cpus.  The #txqueues can be changed by using
an optional 'numtxqs' option.  e.g. for a smp=4 guest:
vhost=on                   ->  #txqueues = 1
vhost=on,mq=on             ->  #txqueues = 4
vhost=on,mq=on,numtxqs=8   ->  #txqueues = 8
vhost=on,mq=on,numtxqs=2   ->  #txqueues = 2


   Performance (guest -> local host):
   ---
System configuration:
Host:  8 Intel Xeon, 8 GB memory
Guest: 4 cpus, 2 GB memory, numtxqs=4
All testing without any system tuning, and default netperf
Results split across two tables to show SD and CPU usage:

TCP: BW vs CPU/Remote CPU utilization:
#    BW1    BW2   (%)      CPU1  CPU2 (%)       RCPU1  RCPU2 (%)

1    69971  65376 (-6.56)  134   170  (26.86)   322    376   (16.77)
2    20911  24839 (18.78)  107   139  (29.90)   217    264   (21.65)
4    21431  28912 (34.90)  213   318  (49.29)   444    541   (21.84)
8    21857  34592 (58.26)  444   859  (93.46)   901    1247  (38.40)
16   22368  33083 (47.90)  899   1523 (69.41)   1813   2410  (32.92)
24   22556  32578 (44.43)  1347  2249 (66.96)   2712   3606  (32.96)
32   22727  30923 (36.06)  1806  2506 (38.75)   3622   3952  (9.11)
40   23054  29334 (27.24)  2319  2872 (23.84)   4544   4551  (.15)
48   23006  28800 (25.18)  2827  2990 (5.76)5465   4718  (-13.66)
64   23411  27661 (18.15)  3708  3306 (-10.84)  7231   5218  (-27.83)
80   23175  27141 (17.11)  4796  4509 (-5.98)   9152   7182  (-21.52)
96   23337  26759 (14.66)  5603  4543 (-18.91)  10890  7162  (-34.23)
128  22726  28339 (24.69)  7559  6395 (-15.39)  14600  10169 (-30.34)

Summary:     BW: 22.8%     CPU: 1.9%     RCPU: -17.0%

TCP: BW vs SD/Remote SD:
#    BW1    BW2   (%)      SD1   SD2  (%)       RSD1   RSD2  (%)

1    69971  65376 (-6.56)  4     6    (50.00)   21     26    (23.80)
2    20911  24839 (18.78)  6     7    (16.66)   27     28    (3.70)
4    21431  28912 (34.90)  26    31   (19.23)   108    111   (2.77)
8    21857  34592 (58.26)  106   135  (27.35)   432    393   (-9.02)
16   22368  33083 (47.90)  431   577  (33.87)   1742   1828  (4.93)
24   22556  32578 (44.43)  972   1393 (43.31)   3915   4479  (14.40)
32   22727  30923 (36.06)  1723  2165 (25.65)   6908   6842  (-.95)
40   23054  29334 (27.24)  2774  2761 (-.46)    10874  8764  (-19.40)
48   23006  28800 (25.18)