Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-24 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 02/23/2011 09:25:34 PM:

> > Sure, will get a build/test on latest bits and send in 1-2 days.
> >
> > > > The TX-only patch helped the guest TX path but didn't help
> > > > host->guest much (as tested using TCP_MAERTS from the guest).
> > > > But with the TX+RX patch, both directions are getting
> > > > improvements.
> > >
> > > Also, my hope is that with appropriate queue mapping,
> > > we might be able to do away with heuristics to detect
> > > single stream load that TX only code needs.
> >
> > Yes, that whole stuff is removed, and the TX/RX path is
> > unchanged with this patch (thankfully :)
>
> Cool. I was wondering whether in that case, we can
> do without host kernel changes at all,
> and use a separate fd for each TX/RX pair.
> The advantage of that approach is that this way,
> the max fd limit naturally sets an upper bound
> on the amount of resources userspace can use up.
>
> Thoughts?
>
> In any case, pls don't let the above delay
> sending an RFC.

I will look into this also.

Please excuse the delay in getting the patch out - my bits are a
little old, so it is taking some time to move to the latest kernel
and get some initial TCP/UDP test results. I should have it ready
by tomorrow.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-23 Thread Simon Horman
On Wed, Feb 23, 2011 at 10:52:09AM +0530, Krishna Kumar2 wrote:
> Simon Horman  wrote on 02/22/2011 01:17:09 PM:
> 
> Hi Simon,
> 
> 
> > I have a few questions about the results below:
> >
> > 1. Are the (%) comparisons between non-mq and mq virtio?
> 
> Yes - mainline kernel with transmit-only MQ patch.
> 
> > 2. Was UDP or TCP used?
> 
> TCP. I had done some initial testing on UDP, but I don't have
> those results any more as they are quite old. I will be running
> it again.
> 
> > 3. What was the transmit size (-m option to netperf)?
> 
> I didn't use the -m option, so it defaults to 16K. The
> script does:
> 
> netperf -t TCP_STREAM -c -C -l 60 -H $SERVER
> 
> > Also, I'm interested to know what the status of these patches is.
> > Are you planning a fresh series?
> 
> Yes. Michael Tsirkin had wanted to see what the MQ RX patch
> would look like, so I was in the process of getting the two
> working together. The patch is ready and is being tested.
> Should I send an RFC patch at this time?
> 
> The TX-only patch helped the guest TX path but didn't help
> host->guest much (as tested using TCP_MAERTS from the guest).
> But with the TX+RX patch, both directions are getting
> improvements. Remote testing is still to be done.

Hi Krishna,

thanks for clarifying the test results.
I'm looking forward to the forthcoming RFC patches.


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-23 Thread Michael S. Tsirkin
On Wed, Feb 23, 2011 at 12:18:36PM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin"  wrote on 02/23/2011 12:09:15 PM:
> 
> Hi Michael,
> 
> > > Yes. Michael Tsirkin had wanted to see what the MQ RX patch
> > > would look like, so I was in the process of getting the two
> > > working together. The patch is ready and is being tested.
> > > Should I send an RFC patch at this time?
> >
> > Yes, please do.
> 
> Sure, will get a build/test on latest bits and send in 1-2 days.
> 
> > > The TX-only patch helped the guest TX path but didn't help
> > > host->guest much (as tested using TCP_MAERTS from the guest).
> > > But with the TX+RX patch, both directions are getting
> > > improvements.
> >
> > Also, my hope is that with appropriate queue mapping,
> > we might be able to do away with heuristics to detect
> > single stream load that TX only code needs.
> 
> Yes, that whole stuff is removed, and the TX/RX path is
> unchanged with this patch (thankfully :)

Cool. I was wondering whether in that case, we can
do without host kernel changes at all,
and use a separate fd for each TX/RX pair.
The advantage of that approach is that this way,
the max fd limit naturally sets an upper bound
on the amount of resources userspace can use up.

Thoughts?

In any case, pls don't let the above delay
sending an RFC.

> > > Remote testing is still to be done.
> >
> > Others might be able to help here once you post the patch.
> 
> That's great, will appreciate any help.
> 
> Thanks,
> 
> - KK


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-22 Thread Krishna Kumar2
> "Michael S. Tsirkin"  wrote on 02/23/2011 12:09:15 PM:

Hi Michael,

> > Yes. Michael Tsirkin had wanted to see what the MQ RX patch
> > would look like, so I was in the process of getting the two
> > working together. The patch is ready and is being tested.
> > Should I send an RFC patch at this time?
>
> Yes, please do.

Sure, will get a build/test on latest bits and send in 1-2 days.

> > The TX-only patch helped the guest TX path but didn't help
> > host->guest much (as tested using TCP_MAERTS from the guest).
> > But with the TX+RX patch, both directions are getting
> > improvements.
>
> Also, my hope is that with appropriate queue mapping,
> we might be able to do away with heuristics to detect
> single stream load that TX only code needs.

Yes, that whole stuff is removed, and the TX/RX path is
unchanged with this patch (thankfully :)

> > Remote testing is still to be done.
>
> Others might be able to help here once you post the patch.

That's great, will appreciate any help.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-22 Thread Michael S. Tsirkin
On Wed, Feb 23, 2011 at 10:52:09AM +0530, Krishna Kumar2 wrote:
> Simon Horman  wrote on 02/22/2011 01:17:09 PM:
> 
> Hi Simon,
> 
> 
> > I have a few questions about the results below:
> >
> > 1. Are the (%) comparisons between non-mq and mq virtio?
> 
> Yes - mainline kernel with transmit-only MQ patch.
> 
> > 2. Was UDP or TCP used?
> 
> TCP. I had done some initial testing on UDP, but I don't have
> those results any more as they are quite old. I will be running
> it again.
> 
> > 3. What was the transmit size (-m option to netperf)?
> 
> I didn't use the -m option, so it defaults to 16K. The
> script does:
> 
> netperf -t TCP_STREAM -c -C -l 60 -H $SERVER
> 
> > Also, I'm interested to know what the status of these patches is.
> > Are you planning a fresh series?
> 
> Yes. Michael Tsirkin had wanted to see what the MQ RX patch
> would look like, so I was in the process of getting the two
> working together. The patch is ready and is being tested.
> Should I send an RFC patch at this time?

Yes, please do.

> The TX-only patch helped the guest TX path but didn't help
> host->guest much (as tested using TCP_MAERTS from the guest).
> But with the TX+RX patch, both directions are getting
> improvements.

Also, my hope is that with appropriate queue mapping,
we might be able to do away with heuristics to detect
single stream load that TX only code needs.

> Remote testing is still to be done.

Others might be able to help here once you post the patch.

> Thanks,
> 
> - KK
> 
> > >   Changes from rev2:
> > >   --
> > > 1. Define (in virtio_net.h) the maximum send txqs; and use in
> > >virtio-net and vhost-net.
> > > 2. vi->sq[i] is allocated individually, resulting in cache line
> > >aligned sq[0] to sq[n].  Another option was to define
> > >'send_queue' as:
> > >struct send_queue {
> > >struct virtqueue *svq;
> > >struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
> > >} cacheline_aligned_in_smp;
> > >and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
> > >the submitted method is preferable.
> > > 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
> > >handles TX[0-n].
> > > 4. Further change TX handling such that vhost[0] handles both RX/TX
> > >for single stream case.
> > >
> > >   Enabling MQ on virtio:
> > >   ---
> > > When following options are passed to qemu:
> > > - smp > 1
> > > - vhost=on
> > > - mq=on (new option, default:off)
> > > then #txqueues = #cpus.  The #txqueues can be changed by using an
> > > optional 'numtxqs' option.  e.g. for a smp=4 guest:
> > > vhost=on   ->   #txqueues = 1
> > > vhost=on,mq=on ->   #txqueues = 4
> > > vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
> > > vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
> > >
> > >
> > >Performance (guest -> local host):
> > >---
> > > System configuration:
> > > Host:  8 Intel Xeon, 8 GB memory
> > > Guest: 4 cpus, 2 GB memory
> > > Test: Each test case runs for 60 secs, sum over three runs (except
> > > when number of netperf sessions is 1, which has 10 runs of 12 secs
> > > each).  No tuning (default netperf) other than tasksetting vhosts to
> > > cpus 0-3.  numtxqs=32 gave the best results though the guest had
> > > only 4 vcpus (I haven't tried beyond that).
> > >
> > > __ numtxqs=2, vhosts=3  
> > > #sessions  BW%    CPU%   RCPU%  SD%     RSD%
> > > 
> > > 1          4.46   -1.96  .19    -12.50  -6.06
> > > 2          4.93   -1.16  2.10   0       -2.38
> > > 4          46.17  64.77  33.72  19.51   -2.48
> > > 8          47.89  70.00  36.23  41.46   13.35
> > > 16         48.97  80.44  40.67  21.11   -5.46
> > > 24         49.03  78.78  41.22  20.51   -4.78
> > > 32         51.11  77.15  42.42  15.81   -6.87
> > > 40         51.60  71.65  42.43  9.75    -8.94
> > > 48         50.10  69.55  42.85  11.80   -5.81
> > > 64         46.24  68.42  42.67  14.18   -3.28
> > > 80         46.37  63.13  41.62  7.43    -6.73
> > > 96         46.40  63.31  42.20  9.36    -4.78
> > > 128        50.43  62.79  42.16  13.11   -1.23
> > > 
> > > BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
> > >
> > > __ numtxqs=8, vhosts=5  
> > > #sessions  BW%    CPU%   RCPU%  SD%     RSD%
> > > 
> > > 1          -.76   -1.56  2.33   0       3.03
> > > 2          17.41  11.11  11.41  0       -4.76
> > > 4          42.12  55.11  30.20  19.51   .62
> > 

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-22 Thread Krishna Kumar2
Simon Horman  wrote on 02/22/2011 01:17:09 PM:

Hi Simon,


> I have a few questions about the results below:
>
> 1. Are the (%) comparisons between non-mq and mq virtio?

Yes - mainline kernel with transmit-only MQ patch.

> 2. Was UDP or TCP used?

TCP. I had done some initial testing on UDP, but don't have
the results now as it is really old. But I will be running
it again.

> 3. What was the transmit size (-m option to netperf)?

I didn't use the -m option, so it defaults to 16K. The
script does:

netperf -t TCP_STREAM -c -C -l 60 -H $SERVER

> Also, I'm interested to know what the status of these patches is.
> Are you planning a fresh series?

Yes. Michael Tsirkin had wanted to see what the MQ RX patch
would look like, so I was in the process of getting the two
working together. The patch is ready and is being tested.
Should I send an RFC patch at this time?

The TX-only patch helped the guest TX path but didn't help
host->guest much (as tested using TCP_MAERTS from the guest).
But with the TX+RX patch, both directions are getting
improvements. Remote testing is still to be done.

Thanks,

- KK

> >   Changes from rev2:
> >   --
> > 1. Define (in virtio_net.h) the maximum send txqs; and use in
> >virtio-net and vhost-net.
> > 2. vi->sq[i] is allocated individually, resulting in cache line
> >aligned sq[0] to sq[n].  Another option was to define
> >'send_queue' as:
> >struct send_queue {
> >struct virtqueue *svq;
> >struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
> >} cacheline_aligned_in_smp;
> >and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
> >the submitted method is preferable.
> > 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
> >handles TX[0-n].
> > 4. Further change TX handling such that vhost[0] handles both RX/TX
> >for single stream case.
> >
> >   Enabling MQ on virtio:
> >   ---
> > When following options are passed to qemu:
> > - smp > 1
> > - vhost=on
> > - mq=on (new option, default:off)
> > then #txqueues = #cpus.  The #txqueues can be changed by using an
> > optional 'numtxqs' option.  e.g. for a smp=4 guest:
> > vhost=on   ->   #txqueues = 1
> > vhost=on,mq=on ->   #txqueues = 4
> > vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
> > vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
> >
> >
> >Performance (guest -> local host):
> >---
> > System configuration:
> > Host:  8 Intel Xeon, 8 GB memory
> > Guest: 4 cpus, 2 GB memory
> > Test: Each test case runs for 60 secs, sum over three runs (except
> > when number of netperf sessions is 1, which has 10 runs of 12 secs
> > each).  No tuning (default netperf) other than tasksetting vhosts to
> > cpus 0-3.  numtxqs=32 gave the best results though the guest had
> > only 4 vcpus (I haven't tried beyond that).
> >
> > __ numtxqs=2, vhosts=3  
> > #sessions  BW%    CPU%   RCPU%  SD%     RSD%
> > 
> > 1          4.46   -1.96  .19    -12.50  -6.06
> > 2          4.93   -1.16  2.10   0       -2.38
> > 4          46.17  64.77  33.72  19.51   -2.48
> > 8          47.89  70.00  36.23  41.46   13.35
> > 16         48.97  80.44  40.67  21.11   -5.46
> > 24         49.03  78.78  41.22  20.51   -4.78
> > 32         51.11  77.15  42.42  15.81   -6.87
> > 40         51.60  71.65  42.43  9.75    -8.94
> > 48         50.10  69.55  42.85  11.80   -5.81
> > 64         46.24  68.42  42.67  14.18   -3.28
> > 80         46.37  63.13  41.62  7.43    -6.73
> > 96         46.40  63.31  42.20  9.36    -4.78
> > 128        50.43  62.79  42.16  13.11   -1.23
> > 
> > BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
> >
> > __ numtxqs=8, vhosts=5  
> > #sessions  BW%    CPU%   RCPU%  SD%     RSD%
> > 
> > 1          -.76   -1.56  2.33   0       3.03
> > 2          17.41  11.11  11.41  0       -4.76
> > 4          42.12  55.11  30.20  19.51   .62
> > 8          54.69  80.00  39.22  24.39   -3.88
> > 16         54.77  81.62  40.89  20.34   -6.58
> > 24         54.66  79.68  41.57  15.49   -8.99
> > 32         54.92  76.82  41.79  17.59   -5.70
> > 40         51.79  68.56  40.53  15.31   -3.87
> > 48         51.72  66.40  40.84  9.72    -7.13
> > 64         51.11  63.94  41.10  5.93    -8.82
> > 80         46.51  59.50  39.80  9.33    -4.18
> > 96         47.72  57.75  39.84  4.20    -7.62

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2011-02-22 Thread Simon Horman
On Wed, Oct 20, 2010 at 02:24:52PM +0530, Krishna Kumar wrote:
> Following set of patches implement transmit MQ in virtio-net.  Also
> included is the user qemu changes.  MQ is disabled by default unless
> qemu specifies it.

Hi Krishna,

I have a few questions about the results below:

1. Are the (%) comparisons between non-mq and mq virtio?
2. Was UDP or TCP used?
3. What was the transmit size (-m option to netperf)?

Also, I'm interested to know what the status of these patches is.
Are you planning a fresh series?

> 
>   Changes from rev2:
>   --
> 1. Define (in virtio_net.h) the maximum send txqs; and use in
>virtio-net and vhost-net.
> 2. vi->sq[i] is allocated individually, resulting in cache line
>aligned sq[0] to sq[n].  Another option was to define
>'send_queue' as:
>struct send_queue {
>struct virtqueue *svq;
>struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>} cacheline_aligned_in_smp;
>and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
>the submitted method is preferable.
> 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
>handles TX[0-n].
> 4. Further change TX handling such that vhost[0] handles both RX/TX
>for single stream case.
> 
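A minimal sketch of the per-queue allocation described in item 2 above - the
function and field names (alloc_send_queues, virtnet_info, numtxqs) and the
error handling are illustrative, not taken from the posted patch:

struct send_queue {
	struct virtqueue *svq;
	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
};

struct virtnet_info {
	int numtxqs;
	struct send_queue **sq;		/* array of per-queue pointers */
};

static int alloc_send_queues(struct virtnet_info *vi, int numtxqs)
{
	int i;

	vi->sq = kcalloc(numtxqs, sizeof(*vi->sq), GFP_KERNEL);
	if (!vi->sq)
		return -ENOMEM;

	for (i = 0; i < numtxqs; i++) {
		/* each queue gets its own allocation, so sq[0]..sq[n]
		 * start on separate cache lines and do not bounce */
		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
		if (!vi->sq[i])
			return -ENOMEM;	/* caller frees on error */
	}
	vi->numtxqs = numtxqs;
	return 0;
}
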
>   Enabling MQ on virtio:
>   ---
> When following options are passed to qemu:
> - smp > 1
> - vhost=on
> - mq=on (new option, default:off)
> then #txqueues = #cpus.  The #txqueues can be changed by using an
> optional 'numtxqs' option.  e.g. for a smp=4 guest:
> vhost=on   ->   #txqueues = 1
> vhost=on,mq=on ->   #txqueues = 4
> vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
> vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
> 
> 
>Performance (guest -> local host):
>---
> System configuration:
> Host:  8 Intel Xeon, 8 GB memory
> Guest: 4 cpus, 2 GB memory
> Test: Each test case runs for 60 secs, sum over three runs (except
> when number of netperf sessions is 1, which has 10 runs of 12 secs
> each).  No tuning (default netperf) other than tasksetting vhosts to
> cpus 0-3.  numtxqs=32 gave the best results though the guest had
> only 4 vcpus (I haven't tried beyond that).
> 
> __ numtxqs=2, vhosts=3  
> #sessions  BW%    CPU%   RCPU%  SD%     RSD%
> 
> 1          4.46   -1.96  .19    -12.50  -6.06
> 2          4.93   -1.16  2.10   0       -2.38
> 4          46.17  64.77  33.72  19.51   -2.48
> 8          47.89  70.00  36.23  41.46   13.35
> 16         48.97  80.44  40.67  21.11   -5.46
> 24         49.03  78.78  41.22  20.51   -4.78
> 32         51.11  77.15  42.42  15.81   -6.87
> 40         51.60  71.65  42.43  9.75    -8.94
> 48         50.10  69.55  42.85  11.80   -5.81
> 64         46.24  68.42  42.67  14.18   -3.28
> 80         46.37  63.13  41.62  7.43    -6.73
> 96         46.40  63.31  42.20  9.36    -4.78
> 128        50.43  62.79  42.16  13.11   -1.23
> 
> BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%
> 
> __ numtxqs=8, vhosts=5  
> #sessions  BW%    CPU%   RCPU%  SD%     RSD%
> 
> 1          -.76   -1.56  2.33   0       3.03
> 2          17.41  11.11  11.41  0       -4.76
> 4          42.12  55.11  30.20  19.51   .62
> 8          54.69  80.00  39.22  24.39   -3.88
> 16         54.77  81.62  40.89  20.34   -6.58
> 24         54.66  79.68  41.57  15.49   -8.99
> 32         54.92  76.82  41.79  17.59   -5.70
> 40         51.79  68.56  40.53  15.31   -3.87
> 48         51.72  66.40  40.84  9.72    -7.13
> 64         51.11  63.94  41.10  5.93    -8.82
> 80         46.51  59.50  39.80  9.33    -4.18
> 96         47.72  57.75  39.84  4.20    -7.62
> 128        54.35  58.95  40.66  3.24    -8.63
> 
> BW: 38.9%,  CPU/RCPU: 63.0%,40.1%,  SD/RSD: 6.0%,-7.4%
> 
> __ numtxqs=16, vhosts=5  ___
> #sessions  BW%    CPU%    RCPU%  SD%     RSD%
> 
> 1          -1.43  -3.52   1.55   0       3.03
> 2          33.09  21.63   20.12  -10.00  -9.52
> 4          67.17  94.60   44.28  19.51   -11.80
> 8          75.72  108.14  49.15  25.00   -10.71
> 16         80.34  101.77  52.94  25.93   -4.49
> 24         70.84  93.12   43.62  27.63   -5.03
> 32         69.01

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-10 Thread Michael S. Tsirkin
On Tue, Nov 09, 2010 at 10:54:57PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 11/09/2010 09:03:25 PM:
> 
> > > > Something strange here, right?
> > > > 1. You are consistently getting >10G/s here, and even with a single
> > > stream?
> > >
> > > Sorry, I should have mentioned this though I had stated in my
> > > earlier mails. Each test result has two iterations, each of 60
> > > seconds, except when #netperfs is 1 for which I do 10 iterations
> > > (sum across 10 iterations).
> >
> > So need to divide the number by 10?
> 
> Yes, that is what I get with 512/1K macvtap I/O size :)
> 
> > >  I started doing many more iterations
> > > for 1 netperf after finding the issue earlier with single stream.
> > > So the BW is only 4.5-7 Gbps.
> > >
> > > > 2. With 2 streams, is where we get < 10G/s originally. Instead of
> > > >doubling that we get a marginal improvement with 2 queues and
> > > >about 30% worse with 1 queue.
> > >
> > > (doubling happens consistently for guest -> host, but never for
> > > remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
> > > testing scenario. In the first case, there is a slight improvement in
> > > BW and a good reduction in SD. In the second case, only SD improves
> > > (though BW drops for 2 streams for some reason).  In both cases,
> > > BW and SD improves as the number of sessions increase.
> >
> > I guess this is another indication that something's wrong.
> 
> The patch - both virtio-net and vhost-net - doesn't add any
> locking/mutexes or other synchronization. The guest -> host
> performance improvement of up to 100% shows the patch is not
> doing anything wrong.

My concern is this: we don't seem to do anything in tap or macvtap to
help packets from separate virtio queues get to separate queues in the
hardware device and to avoid reordering when we do this.

- skb_tx_hash calculation will get different results
- hash math that e.g. tcp does will run on guest and seems to be discarded

etc

Maybe it's as simple as some tap/macvtap ioctls to set up the queue number
in skbs. Or maybe we need to pass the skb hash from guest to host.
It's this last option that should make us especially cautious as it'll
affect the guest/host interface.

Also see d5a9e24afb4ab38110ebb777588ea0bd0eacbd0a: if we have
hardware which records an RX queue, it appears important to
pass that info to guest and to use that in selecting the TX queue.
Of course we won't see this in netperf runs but this needs to
be given thought too - supporting this seems to suggest either
sticking the hash in the virtio net header for both tx and rx,
or using multiple RX queues.
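A rough sketch of the kind of plumbing being discussed - not code from the
posted patches: if tap/macvtap learned which guest TX queue an skb came from
(via an ioctl or a virtio-net header field, both hypothetical here), it could
record that on the skb so TX queue selection follows it instead of
recomputing a hash that may disagree with the guest's choice:

#include <linux/skbuff.h>
#include <linux/netdevice.h>

/* Record the guest's queue choice on the skb (hypothetical helper). */
static void tap_tag_skb_queue(struct sk_buff *skb, u16 guest_txq)
{
	skb_record_rx_queue(skb, guest_txq);
}

/* Queue selection can then follow the recorded queue, falling back
 * to the normal hash when nothing was recorded. */
static u16 pick_txq_following_guest(struct net_device *dev,
				    struct sk_buff *skb)
{
	if (skb_rx_queue_recorded(skb))
		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;
	return skb_tx_hash(dev, skb);
}
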

> > We are quite far from line rate, the fact BW does not scale
> > means there's some contention in the code.
> 
> Attaining line speed with macvtap seems to be a generic issue
> and unrelated to my patch specifically. IMHO if there is nothing
> wrong in the code (review) and is accepted, it will benefit as
> others can also help to find what needs to be implemented in
> vhost/macvtap/qemu to get line speed for guest->remote-host.

No problem, I will queue these patches in some branch
to help enable cooperation, as well as help you
iterate with incremental patches instead of resending it all each time.


> PS: bare-metal performance for host->remote-host is also
> 2.7 Gbps and 2.8 Gbps for 512/1024 for the same card.
> 
> Thanks,

You mean native linux BW does not scale for your host with
# of connections either? I guess this just means we need another
setup for testing?

> - KK


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-09 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 11/09/2010 09:03:25 PM:

> > > Something strange here, right?
> > > 1. You are consistently getting >10G/s here, and even with a single
> > stream?
> >
> > Sorry, I should have mentioned this though I had stated in my
> > earlier mails. Each test result has two iterations, each of 60
> > seconds, except when #netperfs is 1 for which I do 10 iterations
> > (sum across 10 iterations).
>
> So need to divide the number by 10?

Yes, that is what I get with 512/1K macvtap I/O size :)

> >  I started doing many more iterations
> > for 1 netperf after finding the issue earlier with single stream.
> > So the BW is only 4.5-7 Gbps.
> >
> > > 2. With 2 streams, is where we get < 10G/s originally. Instead of
> > >doubling that we get a marginal improvement with 2 queues and
> > >about 30% worse with 1 queue.
> >
> > (doubling happens consistently for guest -> host, but never for
> > remote host) I tried 512/txqs=2 and 1024/txqs=8 to get a varied
> > testing scenario. In the first case, there is a slight improvement in
> > BW and a good reduction in SD. In the second case, only SD improves
> > (though BW drops for 2 streams for some reason).  In both cases,
> > BW and SD improves as the number of sessions increase.
>
> I guess this is another indication that something's wrong.

The patch - both virtio-net and vhost-net - doesn't add any
locking/mutexes or other synchronization. The guest -> host
performance improvement of up to 100% shows the patch is not
doing anything wrong.

> We are quite far from line rate, the fact BW does not scale
> means there's some contention in the code.

Attaining line speed with macvtap seems to be a generic issue
and unrelated to my patch specifically. IMHO if the code review
finds nothing wrong and the patch is accepted, it will benefit as
others can also help to find what needs to be implemented in
vhost/macvtap/qemu to get line speed for guest->remote-host.

PS: bare-metal performance for host->remote-host is also
2.7 Gbps and 2.8 Gbps for 512/1024 for the same card.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-09 Thread Michael S. Tsirkin
On Tue, Nov 09, 2010 at 08:58:44PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 11/09/2010 06:52:39 PM:
> 
> > > > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> > > >
> > > > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > > > Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
> > > > >
> > > > > Any feedback, comments, objections, issues or bugs about the
> > > > > patches? Please let me know if something needs to be done.
> > > > >
> > > > > Some more test results:
> > > > > _
> > > > >  Host->Guest BW (numtxqs=2)
> > > > > #   BW%  CPU%  RCPU%  SD%  RSD%
> > > > > _
> > > >
> > > > I think we discussed the need for external to guest testing
> > > > over 10G. For large messages we should not see any change
> > > > but you should be able to get better numbers for small messages
> > > > assuming a MQ NIC card.
> > >
> > > I had to make a few changes to qemu (and a minor change in macvtap
> > > driver) to get multiple TXQ support using macvtap working. The NIC
> > > is a ixgbe card.
> > >
> > >
> > > __
> > > Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> > > #    BW1    BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
> > >
> > > __
> > > 1    14367  13142 (-8.5)   56      62 (10.7)      8       8 (0)
> > > 2    3652   3855 (5.5)     37      35 (-5.4)      7       6 (-14.2)
> > > 4    12529  12059 (-3.7)   65      77 (18.4)      35      35 (0)
> > > 8    13912  14668 (5.4)    288     332 (15.2)     175     184 (5.1)
> > > 16   13433  14455 (7.6)    1218    1321 (8.4)     920     943 (2.5)
> > > 24   12750  13477 (5.7)    2876    2985 (3.7)     2514    2348 (-6.6)
> > > 32   11729  12632 (7.6)    5299    5332 (.6)      4934    4497 (-8.8)
> > > 40   11061  11923 (7.7)    8482    8364 (-1.3)    8374    7495 (-10.4)
> > > 48   10624  11267 (6.0)    12329   12258 (-.5)    12762   11538 (-9.5)
> > > 64   10524  10596 (.6)     21689   22859 (5.3)    23626   22403 (-5.1)
> > > 80   9856   10284 (4.3)    35769   36313 (1.5)    39932   36419 (-8.7)
> > > 96   9691   10075 (3.9)    52357   52259 (-.1)    58676   53463 (-8.8)
> > > 128  9351   9794 (4.7)     114707  94275 (-17.8)  114050  97337 (-14.6)
> > >
> > > __
> > > Avg:  BW: (3.3)  SD: (-7.3)  RSD: (-11.0)
> > >
> > >
> > > __
> > > Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> > > #    BW1    BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
> > >
> > > __
> > > 1    16509  15985 (-3.1)   45      47 (4.4)       7       7 (0)
> > > 2    6963   4499 (-35.3)   17      51 (200.0)     7       7 (0)
> > > 4    12932  11080 (-14.3)  49      74 (51.0)      35      35 (0)
> > > 8    13878  14095 (1.5)    223     292 (30.9)     175     181 (3.4)
> > > 16   13440  13698 (1.9)    980     1131 (15.4)    926     942 (1.7)
> > > 24   12680  12927 (1.9)    2387    2463 (3.1)     2526    2342 (-7.2)
> > > 32   11714  12261 (4.6)    4506    4486 (-.4)     4941    4463 (-9.6)
> > > 40   11059  11651 (5.3)    7244    7081 (-2.2)    8349    7437 (-10.9)
> > > 48   10580  11095 (4.8)    10811   10500 (-2.8)   12809   11403 (-10.9)
> > > 64   10569  10566 (0)      19194   19270 (.3)     23648   21717 (-8.1)
> > > 80   9827   10753 (9.4)    31668   29425 (-7.0)   39991   33824 (-15.4)
> > > 96   10043  10150 (1.0)    45352   44227 (-2.4)   57766   51131 (-11.4)
> > > 128  9360   9979 (6.6)     92058   79198 (-13.9)  114381  92873 (-18.8)
> > >
> > > __
> > > Avg:  BW: (-.5)  SD: (-7.5)  RSD: (-14.7)
> > >
> > > Is there anything else you would like me to test/change, or shall
> > > I submit the next version (with the above macvtap changes)?

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-09 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 11/09/2010 06:52:39 PM:

> > > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> > >
> > > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > > Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
> > > >
> > > > Any feedback, comments, objections, issues or bugs about the
> > > > patches? Please let me know if something needs to be done.
> > > >
> > > > Some more test results:
> > > > _
> > > >  Host->Guest BW (numtxqs=2)
> > > > #   BW%  CPU%  RCPU%  SD%  RSD%
> > > > _
> > >
> > > I think we discussed the need for external to guest testing
> > > over 10G. For large messages we should not see any change
> > > but you should be able to get better numbers for small messages
> > > assuming a MQ NIC card.
> >
> > I had to make a few changes to qemu (and a minor change in macvtap
> > driver) to get multiple TXQ support using macvtap working. The NIC
> > is a ixgbe card.
> >
> >
> > __
> > Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> > #    BW1    BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
> >
> > __
> > 1    14367  13142 (-8.5)   56      62 (10.7)      8       8 (0)
> > 2    3652   3855 (5.5)     37      35 (-5.4)      7       6 (-14.2)
> > 4    12529  12059 (-3.7)   65      77 (18.4)      35      35 (0)
> > 8    13912  14668 (5.4)    288     332 (15.2)     175     184 (5.1)
> > 16   13433  14455 (7.6)    1218    1321 (8.4)     920     943 (2.5)
> > 24   12750  13477 (5.7)    2876    2985 (3.7)     2514    2348 (-6.6)
> > 32   11729  12632 (7.6)    5299    5332 (.6)      4934    4497 (-8.8)
> > 40   11061  11923 (7.7)    8482    8364 (-1.3)    8374    7495 (-10.4)
> > 48   10624  11267 (6.0)    12329   12258 (-.5)    12762   11538 (-9.5)
> > 64   10524  10596 (.6)     21689   22859 (5.3)    23626   22403 (-5.1)
> > 80   9856   10284 (4.3)    35769   36313 (1.5)    39932   36419 (-8.7)
> > 96   9691   10075 (3.9)    52357   52259 (-.1)    58676   53463 (-8.8)
> > 128  9351   9794 (4.7)     114707  94275 (-17.8)  114050  97337 (-14.6)
> >
> > __
> > Avg:  BW: (3.3)  SD: (-7.3)  RSD: (-11.0)
> >
> >
> > __
> > Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> > #    BW1    BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
> >
> > __
> > 1    16509  15985 (-3.1)   45      47 (4.4)       7       7 (0)
> > 2    6963   4499 (-35.3)   17      51 (200.0)     7       7 (0)
> > 4    12932  11080 (-14.3)  49      74 (51.0)      35      35 (0)
> > 8    13878  14095 (1.5)    223     292 (30.9)     175     181 (3.4)
> > 16   13440  13698 (1.9)    980     1131 (15.4)    926     942 (1.7)
> > 24   12680  12927 (1.9)    2387    2463 (3.1)     2526    2342 (-7.2)
> > 32   11714  12261 (4.6)    4506    4486 (-.4)     4941    4463 (-9.6)
> > 40   11059  11651 (5.3)    7244    7081 (-2.2)    8349    7437 (-10.9)
> > 48   10580  11095 (4.8)    10811   10500 (-2.8)   12809   11403 (-10.9)
> > 64   10569  10566 (0)      19194   19270 (.3)     23648   21717 (-8.1)
> > 80   9827   10753 (9.4)    31668   29425 (-7.0)   39991   33824 (-15.4)
> > 96   10043  10150 (1.0)    45352   44227 (-2.4)   57766   51131 (-11.4)
> > 128  9360   9979 (6.6)     92058   79198 (-13.9)  114381  92873 (-18.8)
> >
> > __
> > Avg:  BW: (-.5)  SD: (-7.5)  RSD: (-14.7)
> >
> > Is there anything else you would like me to test/change, or shall
> > I submit the next version (with the above macvtap changes)?
> >
> > Thanks,
> >
> > - KK
>
> Something strange here, right?
> 1. You are consistently getting >10G/s here, and even with a single
stream?

Sorry, I should have mentioned this though I had stated it in my
earlier mails. Each test result has two iterations, each of 60
seconds, except when #netperfs is 1, for which I do 10 iterations
(sum across 10 iterations).  I started doing many more iterations
for 1 netperf after finding the issue earlier with single stream.

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-09 Thread Michael S. Tsirkin
On Tue, Nov 09, 2010 at 10:08:21AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 10/26/2010 02:27:09 PM:
> 
> > Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> >
> > On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > > Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
> > >
> > > Any feedback, comments, objections, issues or bugs about the
> > > patches? Please let me know if something needs to be done.
> > >
> > > Some more test results:
> > > _
> > >  Host->Guest BW (numtxqs=2)
> > > #   BW%  CPU%  RCPU%  SD%  RSD%
> > > _
> >
> > I think we discussed the need for external to guest testing
> > over 10G. For large messages we should not see any change
> > but you should be able to get better numbers for small messages
> > assuming a MQ NIC card.
> 
> I had to make a few changes to qemu (and a minor change in macvtap
> driver) to get multiple TXQ support using macvtap working. The NIC
> is a ixgbe card.
> 
> __
> Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
> #    BW1    BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
> __
> 1    14367  13142 (-8.5)   56      62 (10.7)      8       8 (0)
> 2    3652   3855 (5.5)     37      35 (-5.4)      7       6 (-14.2)
> 4    12529  12059 (-3.7)   65      77 (18.4)      35      35 (0)
> 8    13912  14668 (5.4)    288     332 (15.2)     175     184 (5.1)
> 16   13433  14455 (7.6)    1218    1321 (8.4)     920     943 (2.5)
> 24   12750  13477 (5.7)    2876    2985 (3.7)     2514    2348 (-6.6)
> 32   11729  12632 (7.6)    5299    5332 (.6)      4934    4497 (-8.8)
> 40   11061  11923 (7.7)    8482    8364 (-1.3)    8374    7495 (-10.4)
> 48   10624  11267 (6.0)    12329   12258 (-.5)    12762   11538 (-9.5)
> 64   10524  10596 (.6)     21689   22859 (5.3)    23626   22403 (-5.1)
> 80   9856   10284 (4.3)    35769   36313 (1.5)    39932   36419 (-8.7)
> 96   9691   10075 (3.9)    52357   52259 (-.1)    58676   53463 (-8.8)
> 128  9351   9794 (4.7)     114707  94275 (-17.8)  114050  97337 (-14.6)
> __
> Avg:  BW: (3.3)  SD: (-7.3)  RSD: (-11.0)
> 
> __
> Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
> #    BW1    BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
> __
> 1    16509  15985 (-3.1)   45      47 (4.4)       7       7 (0)
> 2    6963   4499 (-35.3)   17      51 (200.0)     7       7 (0)
> 4    12932  11080 (-14.3)  49      74 (51.0)      35      35 (0)
> 8    13878  14095 (1.5)    223     292 (30.9)     175     181 (3.4)
> 16   13440  13698 (1.9)    980     1131 (15.4)    926     942 (1.7)
> 24   12680  12927 (1.9)    2387    2463 (3.1)     2526    2342 (-7.2)
> 32   11714  12261 (4.6)    4506    4486 (-.4)     4941    4463 (-9.6)
> 40   11059  11651 (5.3)    7244    7081 (-2.2)    8349    7437 (-10.9)
> 48   10580  11095 (4.8)    10811   10500 (-2.8)   12809   11403 (-10.9)
> 64   10569  10566 (0)      19194   19270 (.3)     23648   21717 (-8.1)
> 80   9827   10753 (9.4)    31668   29425 (-7.0)   39991   33824 (-15.4)
> 96   10043  10150 (1.0)    45352   44227 (-2.4)   57766   51131 (-11.4)
> 128  9360   9979 (6.6)     92058   79198 (-13.9)  114381  92873 (-18.8)
> __
> Avg:  BW: (-.5)  SD: (-7.5)  RSD: (-14.7)
> 
> Is there anything else you would like me to test/change, or shall
> I submit the next version (with the above macvtap changes)?
> 
> Thanks,
> 
> - KK

Something strange here, right?
1. You are consistently getting >10G/s here, and even with a single stream?
2. With 2 streams, is where we get < 10G/s originally. Instead of
   doubling that we get a marginal improvement with 2 queues and
   about 30% worse with 1 queue.

Is your card MQ?

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-08 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 10/26/2010 02:27:09 PM:

> Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
>
> On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > > Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
> >
> > Any feedback, comments, objections, issues or bugs about the
> > patches? Please let me know if something needs to be done.
> >
> > Some more test results:
> > _
> >  Host->Guest BW (numtxqs=2)
> > #   BW%  CPU%  RCPU%  SD%  RSD%
> > _
>
> I think we discussed the need for external to guest testing
> over 10G. For large messages we should not see any change
> but you should be able to get better numbers for small messages
> assuming a MQ NIC card.

I had to make a few changes to qemu (and a minor change in macvtap
driver) to get multiple TXQ support using macvtap working. The NIC
is a ixgbe card.

__
Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
#    BW1    BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
__
1    14367  13142 (-8.5)   56      62 (10.7)      8       8 (0)
2    3652   3855 (5.5)     37      35 (-5.4)      7       6 (-14.2)
4    12529  12059 (-3.7)   65      77 (18.4)      35      35 (0)
8    13912  14668 (5.4)    288     332 (15.2)     175     184 (5.1)
16   13433  14455 (7.6)    1218    1321 (8.4)     920     943 (2.5)
24   12750  13477 (5.7)    2876    2985 (3.7)     2514    2348 (-6.6)
32   11729  12632 (7.6)    5299    5332 (.6)      4934    4497 (-8.8)
40   11061  11923 (7.7)    8482    8364 (-1.3)    8374    7495 (-10.4)
48   10624  11267 (6.0)    12329   12258 (-.5)    12762   11538 (-9.5)
64   10524  10596 (.6)     21689   22859 (5.3)    23626   22403 (-5.1)
80   9856   10284 (4.3)    35769   36313 (1.5)    39932   36419 (-8.7)
96   9691   10075 (3.9)    52357   52259 (-.1)    58676   53463 (-8.8)
128  9351   9794 (4.7)     114707  94275 (-17.8)  114050  97337 (-14.6)
__
Avg:  BW: (3.3)  SD: (-7.3)  RSD: (-11.0)

__
Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
#    BW1    BW2 (%)        SD1     SD2 (%)        RSD1    RSD2 (%)
__
1    16509  15985 (-3.1)   45      47 (4.4)       7       7 (0)
2    6963   4499 (-35.3)   17      51 (200.0)     7       7 (0)
4    12932  11080 (-14.3)  49      74 (51.0)      35      35 (0)
8    13878  14095 (1.5)    223     292 (30.9)     175     181 (3.4)
16   13440  13698 (1.9)    980     1131 (15.4)    926     942 (1.7)
24   12680  12927 (1.9)    2387    2463 (3.1)     2526    2342 (-7.2)
32   11714  12261 (4.6)    4506    4486 (-.4)     4941    4463 (-9.6)
40   11059  11651 (5.3)    7244    7081 (-2.2)    8349    7437 (-10.9)
48   10580  11095 (4.8)    10811   10500 (-2.8)   12809   11403 (-10.9)
64   10569  10566 (0)      19194   19270 (.3)     23648   21717 (-8.1)
80   9827   10753 (9.4)    31668   29425 (-7.0)   39991   33824 (-15.4)
96   10043  10150 (1.0)    45352   44227 (-2.4)   57766   51131 (-11.4)
128  9360   9979 (6.6)     92058   79198 (-13.9)  114381  92873 (-18.8)
__
Avg:  BW: (-.5)  SD: (-7.5)  RSD: (-14.7)

Is there anything else you would like me to test/change, or shall
I submit the next version (with the above macvtap changes)?

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-11-03 Thread Michael S. Tsirkin
On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
> >
> > > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > > >
> > > > > Is binding vhost threads to CPUs really required?
> > > > > What happens if we let the scheduler do its job?
> > > >
> > > > Nothing drastic, I remember BW% and SD% both improved a
> > > > bit as a result of binding.
> > >
> > > If there's a significant improvement this would mean that
> > > we need to rethink the vhost-net interaction with the scheduler.
> >
> > I will get a test run with and without binding and post the
> > results later today.
> 
> Correction: The result with binding is much better for
> SD/CPU compared to without-binding:

Something that was suggested to me off-list is
trying to set smp affinity for the NIC: in the host to guest
case probably virtio-net; for external to guest,
the host NIC as well.
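For reference, setting IRQ affinity is normally done through
/proc/irq/<n>/smp_affinity; a small userspace sketch (the IRQ numbers are
placeholders that would come from /proc/interrupts):

#include <stdio.h>

/* Write a hex CPU bitmask (e.g. 0xf for CPUs 0-3) for one IRQ. */
static int set_irq_affinity(int irq, unsigned int cpu_mask)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%x\n", cpu_mask);
	fclose(f);
	return 0;
}

int main(void)
{
	int irq, cpu;

	/* Example only: pin hypothetical NIC IRQs 41-44 to CPUs 0-3. */
	for (irq = 41, cpu = 0; irq <= 44; irq++, cpu++)
		set_irq_affinity(irq, 1u << cpu);
	return 0;
}
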

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-29 Thread linux_kvm
On Fri, 29 Oct 2010 13:26 +0200, "Michael S. Tsirkin" 
wrote:
> On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote:
> > > Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
> In practice users are very unlikely to pin threads to CPUs.

I may be misunderstanding what you're referring to. It caught my
attention since I'm working on a configuration to do what you say is
unlikely, so I'll chime in for what it's worth.

An option in Vyatta allows assigning CPU affinity to network adapters,
since apparently separate L2 caches can have a significant impact on
throughput.

Although much of their focus seems to be on commercial virtualization
platforms, I do see quite a few forum posts with regard to KVM.
Maybe this still qualifies as an edge case, but as for virtualized
routing theirs seems to offer the most functionality.

http://www.vyatta.org/forum/viewtopic.php?t=2697

-cb


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-29 Thread Michael S. Tsirkin
On Thu, Oct 28, 2010 at 12:48:57PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
> >
> > > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > > >
> > > > > Is binding vhost threads to CPUs really required?
> > > > > What happens if we let the scheduler do its job?
> > > >
> > > > Nothing drastic, I remember BW% and SD% both improved a
> > > > bit as a result of binding.
> > >
> > > If there's a significant improvement this would mean that
> > > we need to rethink the vhost-net interaction with the scheduler.
> >
> > I will get a test run with and without binding and post the
> > results later today.
> 
> Correction: The result with binding is much better for
> SD/CPU compared to without-binding:

Can you please try finding out why that is?  Is some thread bouncing between
CPUs?  Does a wrong NUMA node get picked up?
In practice users are very unlikely to pin threads to CPUs.

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-28 Thread Krishna Kumar2
> Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:
>
> > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > >
> > > > Is binding vhost threads to CPUs really required?
> > > > What happens if we let the scheduler do its job?
> > >
> > > Nothing drastic, I remember BW% and SD% both improved a
> > > bit as a result of binding.
> >
> > If there's a significant improvement this would mean that
> > we need to rethink the vhost-net interaction with the scheduler.
>
> I will get a test run with and without binding and post the
> results later today.

Correction: The result with binding is much better for
SD/CPU compared to without-binding:

_
 numtxqs=8,vhosts=5, Bind vs No-bind
#    BW%    CPU%    RCPU%  SD%     RSD%
_
1    11.25  10.77   1.89   0       -6.06
2    18.66  7.20    7.20   -14.28  -7.40
4    4.24   -1.27   1.56   -2.70   -.98
8    14.91  -3.79   5.46   -12.19  -3.76
16   12.32  -8.67   4.63   -35.97  -26.66
24   11.68  -7.83   5.10   -40.73  -32.37
32   13.09  -10.51  6.57   -51.52  -42.28
40   11.04  -4.12   11.23  -50.69  -42.81
48   8.61   -10.30  6.04   -62.38  -55.54
64   7.55   -6.05   6.41   -61.20  -56.04
80   8.74   -11.45  6.29   -72.65  -67.17
96   9.84   -6.01   9.87   -69.89  -64.78
128  5.57   -6.23   8.99   -75.03  -70.97
_
BW: 10.4%,  CPU/RCPU: -7.4%,7.7%,  SD: -70.5%,-65.7%

Notes:
1.  All my earlier test results were with vhost bound
to cpus 0-3 for both org and new kernels.
2.  I am not using MST's use_mq patch, only mainline
kernel. However, I reported earlier that I got
better results with that patch. The result for
MQ vs MQ+use_mm patch (from my earlier mail):

BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-27 Thread Michael S. Tsirkin
On Thu, Oct 28, 2010 at 11:42:05AM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin" 
> 
> > > > I think we discussed the need for external to guest testing
> > > > over 10G. For large messages we should not see any change
> > > > but you should be able to get better numbers for small messages
> > > > assuming a MQ NIC card.
> > >
> > > For external host, there is contention among different
> > > queues (vhosts) when packets are processed in tun/bridge,
> > > unless I implement MQ TX for macvtap (tun/bridge?).  So
> > > my testing shows a small improvement (1 to 1.5% average)
> > > in BW and a rise in SD (between 10-15%).  For remote host,
> > > I think tun/macvtap needs MQ TX support?
> >
> > Confused. I thought this *is* with a multiqueue tun/macvtap?
> > bridge does not do any queueing AFAIK ...
> > I think we need to fix the contention. With migration what was guest to
> > host a minute ago might become guest to external now ...
> 
> Macvtap RX is MQ but not TX. I don't think MQ TX support is
> required for macvtap, though. Is it enough for existing
> macvtap sendmsg to work, since it calls dev_queue_xmit
> which selects the txq for the outgoing device?
> 
> Thanks,
> 
> - KK

I think there would be an issue with using a single poll notifier and
contention on the send buffer atomic variable.
Is tun different from macvtap? We need to support both long term ...

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-27 Thread Krishna Kumar2
> "Michael S. Tsirkin" 

> > > I think we discussed the need for external to guest testing
> > > over 10G. For large messages we should not see any change
> > > but you should be able to get better numbers for small messages
> > > assuming a MQ NIC card.
> >
> > For external host, there is contention among different
> > queues (vhosts) when packets are processed in tun/bridge,
> > unless I implement MQ TX for macvtap (tun/bridge?).  So
> > my testing shows a small improvement (1 to 1.5% average)
> > in BW and a rise in SD (between 10-15%).  For remote host,
> > I think tun/macvtap needs MQ TX support?
>
> Confused. I thought this *is* with a multiqueue tun/macvtap?
> bridge does not do any queueing AFAIK ...
> I think we need to fix the contention. With migration what was guest to
> host a minute ago might become guest to external now ...

Macvtap RX is MQ but not TX. I don't think MQ TX support is
required for macvtap, though. Is the existing macvtap sendmsg
enough, since it calls dev_queue_xmit, which selects the txq
for the outgoing device?
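For context, a rough, simplified paraphrase (not verbatim kernel code) of the
selection referred to above: in kernels of that era dev_queue_xmit() let the
driver pick the queue via ndo_select_queue() if it provided one, otherwise it
fell back to skb_tx_hash(), so macvtap's sendmsg inherits whatever policy the
outgoing device implements:

#include <linux/skbuff.h>
#include <linux/netdevice.h>

/* Simplified paraphrase of TX queue selection inside dev_queue_xmit(). */
static struct netdev_queue *pick_tx_queue(struct net_device *dev,
					   struct sk_buff *skb)
{
	const struct net_device_ops *ops = dev->netdev_ops;
	u16 index = 0;

	if (dev->real_num_tx_queues > 1) {
		if (ops->ndo_select_queue)
			index = ops->ndo_select_queue(dev, skb);
		else
			index = skb_tx_hash(dev, skb);
	}
	skb_set_queue_mapping(skb, index);
	return netdev_get_tx_queue(dev, index);
}
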

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-27 Thread Michael S. Tsirkin
On Thu, Oct 28, 2010 at 10:44:14AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 10/26/2010 04:39:13 PM:
> 
> (merging two posts into one)
> 
> > I think we discussed the need for external to guest testing
> > over 10G. For large messages we should not see any change
> > but you should be able to get better numbers for small messages
> > assuming a MQ NIC card.
> 
> For external host, there is contention among different
> queues (vhosts) when packets are processed in tun/bridge,
> unless I implement MQ TX for macvtap (tun/bridge?).  So
> my testing shows a small improvement (1 to 1.5% average)
> in BW and a rise in SD (between 10-15%).  For remote host,
> I think tun/macvtap needs MQ TX support?

Confused. I thought this *is* with a multiqueue tun/macvtap?
bridge does not do any queueing AFAIK ...
I think we need to fix the contention. With migration what was guest to
host a minute ago might become guest to external now ...

> > > > > Results for UDP BW tests (unidirectional, sum across
> > > > > 3 iterations, each iteration of 45 seconds, default
> > > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > > >
> > > > Is binding vhost threads to CPUs really required?
> > > > What happens if we let the scheduler do its job?
> > >
> > > Nothing drastic, I remember BW% and SD% both improved a
> > > bit as a result of binding.
> >
> > If there's a significant improvement this would mean that
> > we need to rethink the vhost-net interaction with the scheduler.
> 
> I will get a test run with and without binding and post the
> results later today.
> 
> Thanks,
> 
> - KK


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-27 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 10/26/2010 04:39:13 PM:

(merging two posts into one)

> I think we discussed the need for external to guest testing
> over 10G. For large messages we should not see any change
> but you should be able to get better numbers for small messages
> assuming a MQ NIC card.

For external host, there is contention among different
queues (vhosts) when packets are processed in tun/bridge,
unless I implement MQ TX for macvtap (tun/bridge?).  So
my testing shows a small improvement (1 to 1.5% average)
in BW and a rise in SD (between 10-15%).  For remote host,
I think tun/macvtap needs MQ TX support?

> > > > Results for UDP BW tests (unidirectional, sum across
> > > > 3 iterations, each iteration of 45 seconds, default
> > > > netperf, vhosts bound to cpus 0-3; no other tuning):
> > >
> > > Is binding vhost threads to CPUs really required?
> > > What happens if we let the scheduler do its job?
> >
> > Nothing drastic, I remember BW% and SD% both improved a
> > bit as a result of binding.
>
> If there's a significant improvement this would mean that
> we need to rethink the vhost-net interaction with the scheduler.

I will get a test run with and without binding and post the
results later today.

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Michael S. Tsirkin
On Tue, Oct 26, 2010 at 03:31:39PM +0530, Krishna Kumar2 wrote:
> > "Michael S. Tsirkin" 
> >
> > On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote:
> > > Results for UDP BW tests (unidirectional, sum across
> > > 3 iterations, each iteration of 45 seconds, default
> > > netperf, vhosts bound to cpus 0-3; no other tuning):
> >
> > Is binding vhost threads to CPUs really required?
> > What happens if we let the scheduler do its job?
> 
> Nothing drastic, I remember BW% and SD% both improved a
> bit as a result of binding.

If there's a significant improvement this would mean that
we need to rethink the vhost-net interaction with the scheduler.

> I started binding vhost threads
> after Avi suggested it in response to my v1 patch (he
> suggested some more that I haven't done), and have been
> doing only this tuning ever since. This is part of his
> mail for the tuning:
> 
> >  vhost:
> >  thread #0:  CPU0
> >  thread #1:  CPU1
> >  thread #2:  CPU2
> >  thread #3:  CPU3
> 
> I simply bound each thread to CPU0-3 instead.
> 
> Thanks,
> 
> - KK


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Krishna Kumar2
> "Michael S. Tsirkin" 
>
> On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote:
> > Results for UDP BW tests (unidirectional, sum across
> > 3 iterations, each iteration of 45 seconds, default
> > netperf, vhosts bound to cpus 0-3; no other tuning):
>
> Is binding vhost threads to CPUs really required?
> What happens if we let the scheduler do its job?

Nothing drastic, I remember BW% and SD% both improved a
bit as a result of binding. I started binding vhost threads
after Avi suggested it in response to my v1 patch (he
suggested some more that I haven't done), and have been
doing only this tuning ever since. This is part of his
mail for the tuning:

>vhost:
>thread #0:  CPU0
>thread #1:  CPU1
>thread #2:  CPU2
>thread #3:  CPU3

I simply bound each thread to CPU0-3 instead.
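The binding itself is usually just taskset on the vhost worker PIDs; an
equivalent userspace sketch using sched_setaffinity (the thread IDs are
assumptions supplied by the caller, e.g. read from ps):

#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Pin one vhost worker thread to a single CPU. */
static int bind_thread_to_cpu(pid_t tid, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(tid, sizeof(set), &set);
}
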

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Michael S. Tsirkin
On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote:
> Results for UDP BW tests (unidirectional, sum across
> 3 iterations, each iteration of 45 seconds, default
> netperf, vhosts bound to cpus 0-3; no other tuning):

Is binding vhost threads to CPUs really required?
What happens if we let the scheduler do its job?

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 10/26/2010 10:40:35 AM:

> > I am trying to wrap my head around kernel/user interface here.
> > E.g., will we need another incompatible change when we add multiple RX
> > queues?
>
> Though I added a 'mq' option to qemu, there shouldn't be
> any incompatibility between old and new qemu's wrt vhost
> and virtio-net drivers. So the old qemu will run new host
> and new guest without issues, and new qemu can also run
> old host and old guest. Multiple RXQ will also not add
> any incompatibility.
>
> With MQ RX, I will be able to remove the heuristic (idea
> from David Stevens).  The idea is: the guest sends out packets
> on, say, TXQ#2; vhost#2 processes the packets, but packets
> going out from host to guest might be sent out on a
> different RXQ, say RXQ#4.  The guest receives the packet on
> RXQ#4, and all future responses on that connection are sent
> on TXQ#4.  Now vhost#4 processes both RX and TX packets for
> this connection.  Without needing to hash on the connection,
> the guest can make sure that the same vhost thread will handle
> a single connection.
>
> > Also need to think about how robust our single stream heuristic is,
> > e.g. what are the chances it will misdetect a bidirectional
> > UDP stream as a single TCP?

> I think it should not happen. The heuristic code gets
> called only for transmit packets; packets that vhost
> sends out to the guest skip this path.
>
> I tested unidirectional and bidirectional UDP to confirm:
>
> 8 iterations of iperf tests, each iteration of 15 secs,
> result is the sum of all 8 iterations in Gbits/sec
> __
> Uni-directional      Bi-directional
> Org      New         Org      New
> __
> 71.78    71.77       71.74    72.07
> __


Results for UDP BW tests (unidirectional, sum across
3 iterations, each iteration of 45 seconds, default
netperf, vhosts bound to cpus 0-3; no other tuning):

-- numtxqs=8, vhosts=5 -
#    BW%    CPU%   SD%

1    .49    1.07   0
2    23.51  52.51  26.66
4    75.17  72.43  8.57
8    86.54  80.21  27.85
16   92.37  85.99  6.27
24   91.37  84.91  8.41
32   89.78  82.90  3.31
48   89.85  79.95  -3.57
64   85.83  80.28  2.22
80   88.90  79.47  -23.18
96   90.12  79.98  14.71
128  86.13  80.60  4.42

BW: 71.3%, CPU: 80.4%, SD: 1.2%


-- numtxqs=16, vhosts=5 
#    BW%     CPU%   SD%

1    1.80    0      0
2    19.81   50.68  26.66
4    57.31   52.77  8.57
8    108.44  88.19  -5.21
16   106.09  85.03  -4.44
24   102.34  84.23  -.82
32   102.77  82.71  -5.81
48   100.00  79.62  -7.29
64   96.86   79.75  -6.10
80   99.26   79.82  -27.34
96   94.79   80.02  -5.08
128  98.14   81.15  -15.25

BW: 77.9%,  CPU: 80.4%,  SD: -13.6%

Thanks,

- KK



Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-26 Thread Michael S. Tsirkin
On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
> 
> Any feedback, comments, objections, issues or bugs about the
> patches? Please let me know if something needs to be done.
> 
> Some more test results:
> _
>  Host->Guest BW (numtxqs=2)
> #   BW%  CPU%  RCPU%  SD%  RSD%
> _

I think we discussed the need for external to guest testing
over 10G. For large messages we should not see any change
but you should be able to get better numbers for small messages
assuming a MQ NIC card.

-- 
MST


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-25 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 10/25/2010 09:47:18 PM:

> > Any feedback, comments, objections, issues or bugs about the
> > patches? Please let me know if something needs to be done.
>
> I am trying to wrap my head around kernel/user interface here.
> E.g., will we need another incompatible change when we add multiple RX
> queues?

Though I added a 'mq' option to qemu, there shouldn't be
any incompatibility between old and new qemu's wrt vhost
and virtio-net drivers. So the old qemu will run new host
and new guest without issues, and new qemu can also run
old host and old guest. Multiple RXQ will also not add
any incompatibility.

With MQ RX, I will be able to remove the heuristic (idea
from David Stevens).  The idea is: Guest sends out packets
on, say TXQ#2, vhost#2 processes the packets but packets
going out from host to guest might be sent out on a
different RXQ, say RXQ#4.  Guest receives the packet on
RXQ#4, and all future responses on that connection are sent
on TXQ#4.  Now vhost#4 processes both RX and TX packets for
this connection.  Without needing to hash on the connection,
guest can make sure that the same vhost thread will handle
a single connection.
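
As an illustration only (this is not code from the posted patch;
flow_hash(), FLOW_TABLE_SIZE, last_rxq[] and numtxqs are hypothetical
names), the guest-side selection described above amounts to remembering,
per flow, which queue its RX traffic last arrived on:

#define NO_QUEUE	0xffff

/* TX path: send on whatever queue the host last used for this flow.
 * flow_hash() stands for any per-connection hash, e.g. over the 4-tuple. */
static u16 pick_txq(struct virtnet_info *vi, struct sk_buff *skb)
{
	u16 q = vi->last_rxq[flow_hash(skb) % FLOW_TABLE_SIZE];

	if (q == NO_QUEUE)			/* flow not yet seen on RX */
		q = flow_hash(skb) % vi->numtxqs;
	return q;				/* same queue pair -> same vhost thread */
}

/* RX path: remember which queue the host delivered this flow on */
static void note_rxq(struct virtnet_info *vi, struct sk_buff *skb, u16 rxq)
{
	vi->last_rxq[flow_hash(skb) % FLOW_TABLE_SIZE] = rxq;
}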

> Also need to think about how robust our single stream heuristic is,
> e.g. what are the chances it will misdetect a bidirectional
> UDP stream as a single TCP?

I think it should not happen. The heuristic code gets
called for handling just the transmit packets, packets
that vhost sends out to the guest skip this path.

I tested unidirectional and bidirectional UDP to confirm:

8 iterations of iperf tests, each iteration of 15 secs,
result is the sum of all 8 iterations in Gbits/sec
______________________________________
    Uni-directional     Bi-directional
     Org      New        Org      New
______________________________________
   71.78    71.77      71.74    72.07
______________________________________

Thanks,

- KK

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-25 Thread Michael S. Tsirkin
On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:
> 
> Any feedback, comments, objections, issues or bugs about the
> patches? Please let me know if something needs to be done.

I am trying to wrap my head around kernel/user interface here.
E.g., will we need another incompatible change when we add multiple RX
queues? Also need to think about how robust our single stream heuristic is,
e.g. what are the chances it will misdetect a bidirectional
UDP stream as a single TCP?

> Some more test results:
> _
>  Host->Guest BW (numtxqs=2)
> #   BW% CPU%RCPU%   SD% RSD%
> _
> 1   5.53.31 .67 -5.88   0
> 2   -2.11   -1.01   -2.08   4.340
> 4   13.53   10.77   13.87   -1.96   0
> 8   34.22   22.80   30.53   -8.46   -2.50
> 16  30.89   24.06   35.17   -5.20   3.20
> 24  33.22   26.30   43.39   -5.17   7.58
> 32  30.85   27.27   47.74   -.5915.51
> 40  33.80   27.33   48.00   -7.42   7.59
> 48  45.93   26.33   45.46   -12.24  1.10
> 64  33.51   27.11   45.00   -3.27   10.30
> 80  39.28   29.21   52.33   -4.88   12.17
> 96  32.05   31.01   57.72   -1.02   19.05
> 128 35.66   32.04   60.00   -.6620.41
> _
> BW: 23.5%  CPU/RCPU: 28.6%,51.2%  SD/RSD: -2.6%,15.8%
> 
> 
> Guest->Host 512 byte (numtxqs=2):
> #   BW% CPU%RCPU%   SD% RSD%
> _
> 1   3.02-3.84   -4.76   -12.50  -7.69
> 2   52.77   -15.73  -8.66   -45.31  -40.33
> 4   -23.14  13.84   7.5050.58   40.81
> 8   -21.44  28.08   16.32   63.06   47.43
> 16  33.53   46.50   27.19   7.61-6.60
> 24  55.77   42.81   30.49   -8.65   -16.48
> 32  52.59   38.92   29.08   -9.18   -15.63
> 40  50.92   36.11   28.92   -10.59  -15.30
> 48  46.63   34.73   28.17   -7.83   -12.32
> 64  45.56   37.12   28.81   -5.05   -10.80
> 80  44.55   36.60   28.45   -4.95   -10.61
> 96  43.02   35.97   28.89   -.11-5.31
> 128 38.54   33.88   27.19   -4.79   -9.54
> _____________________
> BW: 34.4%  CPU/RCPU: 35.9%,27.8%  SD/RSD: -4.1%,-9.3%
> 
> 
> Thanks,
> 
> - KK
> 
> 
> 
> > [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
> >
> > Following set of patches implement transmit MQ in virtio-net.  Also
> > included is the user qemu changes.  MQ is disabled by default unless
> > qemu specifies it.
> >
> >   Changes from rev2:
> >   --
> > 1. Define (in virtio_net.h) the maximum send txqs; and use in
> >virtio-net and vhost-net.
> > 2. vi->sq[i] is allocated individually, resulting in cache line
> >aligned sq[0] to sq[n].  Another option was to define
> >'send_queue' as:
> >struct send_queue {
> >struct virtqueue *svq;
> >struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
> >} cacheline_aligned_in_smp;
> >and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
> >the submitted method is preferable.
> > 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
> >handles TX[0-n].
> > 4. Further change TX handling such that vhost[0] handles both RX/TX
> >for single stream case.
> >
> >   Enabling MQ on virtio:
> >   ---
> > When following options are passed to qemu:
> > - smp > 1
> > - vhost=on
> > - mq=on (new option, default:off)
> > then #txqueues = #cpus.  The #txqueues can be changed by using an
> > optional 'numtxqs' option.  e.g. for a smp=4 guest:
> > vhost=on   ->   #txqueues = 1
> > vhost=on,mq=on ->   #txqueues = 4
> > vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
> > vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
> >
> >
> >Performance (guest -> local host):
> >---
> > System configuration:
> > Host:  8 Intel Xeon, 8 GB memory
> > Guest: 4 cpus, 2 GB memory
> > Test: Each test case runs for 60 secs, sum over three runs (except
> > when number of netperf sessions is 1, which has 10 runs of 12 secs

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-25 Thread Krishna Kumar2
> Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:

Any feedback, comments, objections, issues or bugs about the
patches? Please let me know if something needs to be done.

Some more test results:
_
 Host->Guest BW (numtxqs=2)
#   BW% CPU%RCPU%   SD% RSD%
_
1   5.53.31 .67 -5.88   0
2   -2.11   -1.01   -2.08   4.340
4   13.53   10.77   13.87   -1.96   0
8   34.22   22.80   30.53   -8.46   -2.50
16  30.89   24.06   35.17   -5.20   3.20
24  33.22   26.30   43.39   -5.17   7.58
32  30.85   27.27   47.74   -.5915.51
40  33.80   27.33   48.00   -7.42   7.59
48  45.93   26.33   45.46   -12.24  1.10
64  33.51   27.11   45.00   -3.27   10.30
80  39.28   29.21   52.33   -4.88   12.17
96  32.05   31.01   57.72   -1.02   19.05
128 35.66   32.04   60.00   -.6620.41
_
BW: 23.5%  CPU/RCPU: 28.6%,51.2%  SD/RSD: -2.6%,15.8%


Guest->Host 512 byte (numtxqs=2):
#     BW%      CPU%     RCPU%     SD%      RSD%
_______________________________________________
1      3.02    -3.84    -4.76   -12.50    -7.69
2     52.77   -15.73    -8.66   -45.31   -40.33
4    -23.14    13.84     7.50    50.58    40.81
8    -21.44    28.08    16.32    63.06    47.43
16    33.53    46.50    27.19     7.61    -6.60
24    55.77    42.81    30.49    -8.65   -16.48
32    52.59    38.92    29.08    -9.18   -15.63
40    50.92    36.11    28.92   -10.59   -15.30
48    46.63    34.73    28.17    -7.83   -12.32
64    45.56    37.12    28.81    -5.05   -10.80
80    44.55    36.60    28.45    -4.95   -10.61
96    43.02    35.97    28.89     -.11    -5.31
128   38.54    33.88    27.19    -4.79    -9.54
_______________________________________________
BW: 34.4%  CPU/RCPU: 35.9%,27.8%  SD/RSD: -4.1%,-9.3%


Thanks,

- KK



> [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
>
> Following set of patches implement transmit MQ in virtio-net.  Also
> included is the user qemu changes.  MQ is disabled by default unless
> qemu specifies it.
>
>   Changes from rev2:
>   --
> 1. Define (in virtio_net.h) the maximum send txqs; and use in
>virtio-net and vhost-net.
> 2. vi->sq[i] is allocated individually, resulting in cache line
>aligned sq[0] to sq[n].  Another option was to define
>'send_queue' as:
>struct send_queue {
>struct virtqueue *svq;
>struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>} cacheline_aligned_in_smp;
>and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
>the submitted method is preferable.
> 3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
>handles TX[0-n].
> 4. Further change TX handling such that vhost[0] handles both RX/TX
>for single stream case.
>
>   Enabling MQ on virtio:
>   ---
> When following options are passed to qemu:
> - smp > 1
> - vhost=on
> - mq=on (new option, default:off)
> then #txqueues = #cpus.  The #txqueues can be changed by using an
> optional 'numtxqs' option.  e.g. for a smp=4 guest:
> vhost=on   ->   #txqueues = 1
> vhost=on,mq=on ->   #txqueues = 4
> vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
> vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
>
>
>Performance (guest -> local host):
>---
> System configuration:
> Host:  8 Intel Xeon, 8 GB memory
> Guest: 4 cpus, 2 GB memory
> Test: Each test case runs for 60 secs, sum over three runs (except
> when number of netperf sessions is 1, which has 10 runs of 12 secs
> each).  No tuning (default netperf) other than taskset vhost's to
> cpus 0-3.  numtxqs=32 gave the best results though the guest had
> only 4 vcpus (I haven't tried beyond that).
>
> __ numtxqs=2, vhosts=3  
> #sessions  BW%  CPU%RCPU%SD%  RSD%
> 
> 1  4.46-1.96 .19 -12.50   -6.06
> 2  4.93-1.162.10  0   -2.38
> 4  46.1764.77   33.72 19.51   -2.48
> 8  47.8970.00   36.23 41.4613.35
> 16 48.9780.44   40.67 21.11   -5.46
> 24 49.0378.78   41.22 20.51   -4.78
> 32 51.1177.15   42.42 15.81   -6.87
> 40 51.6071.65   42.43 9.75-8.94
> 48

[v3 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-20 Thread Krishna Kumar
Following set of patches implement transmit MQ in virtio-net.  Also
included is the user qemu changes.  MQ is disabled by default unless
qemu specifies it.

  Changes from rev2:
  --
1. Define (in virtio_net.h) the maximum send txqs; and use in
   virtio-net and vhost-net.
2. vi->sq[i] is allocated individually, resulting in cache line
   aligned sq[0] to sq[n].  Another option was to define
   'send_queue' as:
   struct send_queue {
   struct virtqueue *svq;
   struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
   } cacheline_aligned_in_smp;
   and to statically allocate 'VIRTIO_MAX_SQ' of those.  I hope
   the submitted method is preferable.
3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX]
   handles TX[0-n].
4. Further change TX handling such that vhost[0] handles both RX/TX
   for single stream case.

  Enabling MQ on virtio:
  ---
When following options are passed to qemu:
- smp > 1
- vhost=on
- mq=on (new option, default:off)
then #txqueues = #cpus.  The #txqueues can be changed by using an
optional 'numtxqs' option.  e.g. for a smp=4 guest:
vhost=on   ->   #txqueues = 1
vhost=on,mq=on ->   #txqueues = 4
vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8


   Performance (guest -> local host):
   ---
System configuration:
Host:  8 Intel Xeon, 8 GB memory
Guest: 4 cpus, 2 GB memory
Test: Each test case runs for 60 secs, sum over three runs (except
when number of netperf sessions is 1, which has 10 runs of 12 secs
each).  No tuning (default netperf) other than taskset vhost's to
cpus 0-3.  numtxqs=32 gave the best results though the guest had
only 4 vcpus (I haven't tried beyond that).

___________________ numtxqs=2, vhosts=3 ____________________
#sessions   BW%      CPU%     RCPU%     SD%      RSD%
_____________________________________________________________
1           4.46    -1.96      .19    -12.50    -6.06
2           4.93    -1.16     2.10      0       -2.38
4          46.17    64.77    33.72     19.51    -2.48
8          47.89    70.00    36.23     41.46    13.35
16         48.97    80.44    40.67     21.11    -5.46
24         49.03    78.78    41.22     20.51    -4.78
32         51.11    77.15    42.42     15.81    -6.87
40         51.60    71.65    42.43      9.75    -8.94
48         50.10    69.55    42.85     11.80    -5.81
64         46.24    68.42    42.67     14.18    -3.28
80         46.37    63.13    41.62      7.43    -6.73
96         46.40    63.31    42.20      9.36    -4.78
128        50.43    62.79    42.16     13.11    -1.23
_____________________________________________________________
BW: 37.2%,  CPU/RCPU: 66.3%,41.6%,  SD/RSD: 11.5%,-3.7%

___________________ numtxqs=8, vhosts=5 ____________________
#sessions   BW%      CPU%     RCPU%     SD%      RSD%
_____________________________________________________________
1           -.76    -1.56     2.33      0        3.03
2          17.41    11.11    11.41      0       -4.76
4          42.12    55.11    30.20     19.51      .62
8          54.69    80.00    39.22     24.39    -3.88
16         54.77    81.62    40.89     20.34    -6.58
24         54.66    79.68    41.57     15.49    -8.99
32         54.92    76.82    41.79     17.59    -5.70
40         51.79    68.56    40.53     15.31    -3.87
48         51.72    66.40    40.84      9.72    -7.13
64         51.11    63.94    41.10      5.93    -8.82
80         46.51    59.50    39.80      9.33    -4.18
96         47.72    57.75    39.84      4.20    -7.62
128        54.35    58.95    40.66      3.24    -8.63
_____________________________________________________________
BW: 38.9%,  CPU/RCPU: 63.0%,40.1%,  SD/RSD: 6.0%,-7.4%

___________________ numtxqs=16, vhosts=5 ___________________
#sessions   BW%      CPU%     RCPU%     SD%      RSD%
_____________________________________________________________
1          -1.43    -3.52     1.55      0        3.03
2          33.09    21.63    20.12    -10.00    -9.52
4          67.17    94.60    44.28     19.51   -11.80
8          75.72   108.14    49.15     25.00   -10.71
16         80.34   101.77    52.94     25.93    -4.49
24         70.84    93.12    43.62     27.63    -5.03
32         69.01    94.16    47.33     29.68    -1.51
40         58.56    63.47    25.91     -3.92   -25.85
48         61.16    74.70    34.88       .89   -22.08
64         54.37    69.09    26.80     -6.68   -30.04
80         36.22    22.73    -2.97     -8.25   -27.23
96         41.51    50.59    13.24      9.84   -16.77
128        48.98    38.15     6.41      -.33   -22.80
_____________________________________________________________
BW: 46.2%,  CPU/RCPU: 55.2%,18.8%,  SD/RSD: 1.2%,-22.0%

__ numtxqs=32, vhosts=5  ___
# 

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 10/14/2010 05:47:54 PM:

Sorry, it should read "txq=8" below.

- KK

> There's a significant reduction in CPU/SD utilization with your
> patch. Following is the performance of ORG vs MQ+mm patch:
>
> _
>Org vs MQ+mm patch txq=2
> # BW% CPU/RCPU% SD/RSD%
> _
> 1 2.26-1.16.27  -20.00  0
> 2 35.07   29.9021.81 0  -11.11
> 4 55.03   84.5737.66 26.92  -4.62
> 8 73.16   118.69   49.21 45.63  -.46
> 1677.43   98.8147.89 24.07  -7.80
> 2471.59   105.18   48.44 62.84  18.18
> 3270.91   102.38   47.15 49.22  8.54
> 4063.26   90.5841.00 85.27  37.33
> 4845.25   45.9911.23 14.31  -12.91
> 6442.78   41.825.50  .43-25.12
> 8031.40   7.31 -18.6915.78  -11.93
> 9627.60   7.79 -18.5417.39  -10.98
> 128   23.46   -11.89   -34.41-.41   -25.53
> _
> BW: 40.2  CPU/RCPU: 29.9,-2.2   SD/RSD: 12.0,-15.6
>
> Following is the performance of MQ vs MQ+mm patch:
> _
> MQ vs MQ+mm patch
> # BW%  CPU%   RCPU%SD%  RSD%
> _
> 1  4.98-.58   .84  -20.000
> 2  5.17 2.96  2.29  0   -4.00
> 4 -.18  .25  -.16   3.12 .98
> 8 -5.47-1.36 -1.98  17.1816.57
> 16-1.90-6.64 -3.54 -14.83   -12.12
> 24-.01  23.63 14.65 57.6146.64
> 32 .27 -3.19  -3.11-22.98   -22.91
> 40-1.06-2.96  -2.96-4.18-4.10
> 48-.28 -2.34  -3.71-2.41-3.81
> 64 9.71 33.77  30.6581.4477.09
> 80-10.69-31.07-31.70   -29.22   -29.88
> 96-1.14 5.98   .56 -11.57   -16.14
> 128   -.93 -15.60 -18.31   -19.89   -22.65
> _
>   BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6
> _
>
> Each test case is for 60 secs, sum over two runs (except
> when number of netperf sessions is 1, which has 7 runs
> of 10 secs each), numcpus=4, numtxqs=8, etc. No tuning
> other than taskset each vhost to cpus 0-3.
>
> Thanks,
>
> - KK

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 10/14/2010 02:34:01 PM:

> void vhost_poll_queue(struct vhost_poll *poll)
> {
> struct vhost_virtqueue *vq = vhost_find_vq(poll);
>
> vhost_work_queue(vq, &poll->work);
> }
>
> Since poll batches packets, find_vq does not seem to add much
> to the CPU utilization (or BW). I am sure that code can be
> optimized much better.
>
> The results I sent in my last mail were without your use_mm
> patch, and the only tuning was to make vhost threads run on
> only cpus 0-3 (though the performance is good even without
> that). I will test it later today with the use_mm patch too.

There's a significant reduction in CPU/SD utilization with your
patch. Following is the performance of ORG vs MQ+mm patch:

_____________________________________________________________
               Org vs MQ+mm patch txq=2
#     BW%      CPU%     RCPU%     SD%      RSD%
_____________________________________________________________
1      2.26    -1.16      .27    -20.00     0
2     35.07    29.90    21.81      0      -11.11
4     55.03    84.57    37.66     26.92    -4.62
8     73.16   118.69    49.21     45.63     -.46
16    77.43    98.81    47.89     24.07    -7.80
24    71.59   105.18    48.44     62.84    18.18
32    70.91   102.38    47.15     49.22     8.54
40    63.26    90.58    41.00     85.27    37.33
48    45.25    45.99    11.23     14.31   -12.91
64    42.78    41.82     5.50       .43   -25.12
80    31.40     7.31   -18.69     15.78   -11.93
96    27.60     7.79   -18.54     17.39   -10.98
128   23.46   -11.89   -34.41      -.41   -25.53
_____________________________________________________________
BW: 40.2  CPU/RCPU: 29.9,-2.2   SD/RSD: 12.0,-15.6


Following is the performance of MQ vs MQ+mm patch:
_____________________________________________________________
               MQ vs MQ+mm patch
#     BW%      CPU%     RCPU%     SD%      RSD%
_____________________________________________________________
1      4.98     -.58      .84    -20.00     0
2      5.17     2.96     2.29      0       -4.00
4      -.18      .25     -.16      3.12      .98
8     -5.47    -1.36    -1.98     17.18    16.57
16    -1.90    -6.64    -3.54    -14.83   -12.12
24     -.01    23.63    14.65     57.61    46.64
32      .27    -3.19    -3.11    -22.98   -22.91
40    -1.06    -2.96    -2.96     -4.18    -4.10
48     -.28    -2.34    -3.71     -2.41    -3.81
64     9.71    33.77    30.65     81.44    77.09
80   -10.69   -31.07   -31.70    -29.22   -29.88
96    -1.14     5.98      .56    -11.57   -16.14
128    -.93   -15.60   -18.31    -19.89   -22.65
_____________________________________________________________
BW: 0   CPU/RCPU: -4.2,-6.1  SD/RSD: -13.1,-15.6
_____________________________________________________________

Each test case is for 60 secs, sum over two runs (except
when number of netperf sessions is 1, which has 7 runs
of 10 secs each), numcpus=4, numtxqs=8, etc. No tuning
other than taskset each vhost to cpus 0-3.

Thanks,

- KK

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
> "Michael S. Tsirkin" 
> > > What other shared TX/RX locks are there?  In your setup, is the same
> > > macvtap socket structure used for RX and TX?  If yes this will create
> > > cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
> > > there might also be contention on the lock in sk_sleep waitqueue.
> > > Anything else?
> >
> > The patch is not introducing any locking (both vhost and virtio-net).
> > The single stream drop is due to different vhost threads handling the
> > RX/TX traffic.
> >
> > I added a heuristic (fuzzy) to determine if more than one flow
> > is being used on the device, and if not, use vhost[0] for both
> > tx and rx (vhost_poll_queue figures this out before waking up
> > the suitable vhost thread).  Testing shows that single stream
> > performance is as good as the original code.
>
> ...
>
> > This approach works nicely for both single and multiple stream.
> > Does this look good?
> >
> > Thanks,
> >
> > - KK
>
> Yes, but I guess it depends on the heuristic :) What's the logic?

I define how recently a txq was used. If 0 or 1 txq's were used
recently, use vq[0] (which also handles rx). Otherwise, use
multiple txq (vq[1-n]). The code is:

/*
 * Algorithm for selecting vq:
 *
 * Condition                                    Return
 * RX vq                                        vq[0]
 * If all txqs unused                           vq[0]
 * If one txq used, and new txq is same         vq[0]
 * If one txq used, and new txq is different    vq[vq->qnum]
 * If > 1 txqs used                             vq[vq->qnum]
 *	Where "used" means the txq was used in the last 'n' jiffies.
 *
 * Note: locking is not required as an update race will only result in
 * a different worker being woken up.
 */
static inline struct vhost_virtqueue *vhost_find_vq(struct vhost_poll *poll)
{
	if (poll->vq->qnum) {
		struct vhost_dev *dev = poll->vq->dev;
		struct vhost_virtqueue *vq = &dev->vqs[0];
		unsigned long max_time = jiffies - 5; /* Some macro needed */
		unsigned long *table = dev->jiffies;
		int i, used = 0;

		for (i = 0; i < dev->nvqs - 1; i++) {
			if (time_after_eq(table[i], max_time) && ++used > 1) {
				vq = poll->vq;
				break;
			}
		}
		table[poll->vq->qnum - 1] = jiffies;
		return vq;
	}

	/* RX is handled by the same worker thread */
	return poll->vq;
}

void vhost_poll_queue(struct vhost_poll *poll)
{
	struct vhost_virtqueue *vq = vhost_find_vq(poll);

	vhost_work_queue(vq, &poll->work);
}

Since poll batches packets, find_vq does not seem to add much
to the CPU utilization (or BW). I am sure that code can be
optimized much better.

The results I sent in my last mail were without your use_mm
patch, and the only tuning was to make vhost threads run on
only cpus 0-3 (though the performance is good even without
that). I will test it later today with the use_mm patch too.

Thanks,

- KK

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Michael S. Tsirkin
On Thu, Oct 14, 2010 at 01:28:58PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 10/12/2010 10:39:07 PM:
> 
> > > Sorry for the delay, I was sick last couple of days. The results
> > > with your patch are (%'s over original code):
> > >
> > > Code   BW%   CPU%   RemoteCPU
> > > MQ (#txq=16)   31.4% 38.42% 6.41%
> > > MQ+MST (#txq=16)   28.3% 18.9%  -10.77%
> > >
> > > The patch helps CPU utilization but didn't help single stream
> > > drop.
> > >
> > > Thanks,
> >
> > What other shared TX/RX locks are there?  In your setup, is the same
> > macvtap socket structure used for RX and TX?  If yes this will create
> > cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
> > there might also be contention on the lock in sk_sleep waitqueue.
> > Anything else?
> 
> The patch is not introducing any locking (both vhost and virtio-net).
> The single stream drop is due to different vhost threads handling the
> RX/TX traffic.
> 
> I added a heuristic (fuzzy) to determine if more than one flow
> is being used on the device, and if not, use vhost[0] for both
> tx and rx (vhost_poll_queue figures this out before waking up
> the suitable vhost thread).  Testing shows that single stream
> performance is as good as the original code.

...

> This approach works nicely for both single and multiple stream.
> Does this look good?
> 
> Thanks,
> 
> - KK

Yes, but I guess it depends on the heuristic :) What's the logic?

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-14 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 10/12/2010 10:39:07 PM:

> > Sorry for the delay, I was sick last couple of days. The results
> > with your patch are (%'s over original code):
> >
> > Code   BW%   CPU%   RemoteCPU
> > MQ (#txq=16)   31.4% 38.42% 6.41%
> > MQ+MST (#txq=16)   28.3% 18.9%  -10.77%
> >
> > The patch helps CPU utilization but didn't help single stream
> > drop.
> >
> > Thanks,
>
> What other shared TX/RX locks are there?  In your setup, is the same
> macvtap socket structure used for RX and TX?  If yes this will create
> cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
> there might also be contention on the lock in sk_sleep waitqueue.
> Anything else?

The patch is not introducing any locking (both vhost and virtio-net).
The single stream drop is due to different vhost threads handling the
RX/TX traffic.

I added a heuristic (fuzzy) to determine if more than one flow
is being used on the device, and if not, use vhost[0] for both
tx and rx (vhost_poll_queue figures this out before waking up
the suitable vhost thread).  Testing shows that single stream
performance is as good as the original code.

__________________________________________________________________________
               #txqs = 2 (#vhosts = 3)
#     BW1     BW2   (%)        CPU1   CPU2   (%)        RCPU1   RCPU2  (%)
__________________________________________________________________________
1     77344   74973 (-3.06)    172    143    (-16.86)   358     324    (-9.49)
2     20924   21107 (.87)      107    103    (-3.73)    220     217    (-1.36)
4     21629   32911 (52.16)    214    391    (82.71)    446     616    (38.11)
8     21678   34359 (58.49)    428    845    (97.42)    892     1286   (44.17)
16    22046   34401 (56.04)    841    1677   (99.40)    1785    2585   (44.81)
24    22396   35117 (56.80)    1272   2447   (92.37)    2667    3863   (44.84)
32    22750   35158 (54.54)    1719   3233   (88.07)    3569    5143   (44.10)
40    23041   35345 (53.40)    2219   3970   (78.90)    4478    6410   (43.14)
48    23209   35219 (51.74)    2707   4685   (73.06)    5386    7684   (42.66)
64    23215   35209 (51.66)    3639   6195   (70.23)    7206    10218  (41.79)
80    23443   35179 (50.06)    4633   7625   (64.58)    9051    12745  (40.81)
96    24006   36108 (50.41)    5635   9096   (61.41)    10864   15283  (40.67)
128   23601   35744 (51.45)    7475   12104  (61.92)    14495   20405  (40.77)
__________________________________________________________________________
SUM: BW: (37.6) CPU: (69.0) RCPU: (41.2)

__________________________________________________________________________
               #txqs = 8 (#vhosts = 5)
#     BW1     BW2   (%)        CPU1   CPU2   (%)        RCPU1   RCPU2  (%)
__________________________________________________________________________
1     77344   75341 (-2.58)    172    171    (-.58)     358     356    (-.55)
2     20924   26872 (28.42)    107    135    (26.16)    220     262    (19.09)
4     21629   33594 (55.31)    214    394    (84.11)    446     615    (37.89)
8     21678   39714 (83.19)    428    949    (121.72)   892     1358   (52.24)
16    22046   39879 (80.88)    841    1791   (112.96)   1785    2737   (53.33)
24    22396   38436 (71.61)    1272   2111   (65.95)    2667    3453   (29.47)
32    22750   38776 (70.44)    1719   3594   (109.07)   3569    5421   (51.89)
40    23041   38023 (65.02)    2219   4358   (96.39)    4478    6507   (45.31)
48    23209   33811 (45.68)    2707   4047   (49.50)    5386    6222   (15.52)
64    23215   30212 (30.13)    3639   3858   (6.01)     7206    5819   (-19.24)
80    23443   34497 (47.15)    4633   7214   (55.70)    9051    10776  (19.05)
96    24006   30990 (29.09)    5635   5731   (1.70)     10864   8799   (-19.00)
128   23601   29413 (24.62)    7475   7804   (4.40)     14495   11638  (-19.71)
__________________________________________________________________________
SUM: BW: (40.1) CPU: (35.7) RCPU: (4.1)
___


The SD numbers are also good (same table as before, but SD
instead of CPU:

______________________________________________________________
               #txqs = 2 (#vhosts = 3)
#     BW%     SD1    SD2   (%)        RSD1    RSD2   (%)
______________________________________________________________
1    -3.06      5      4   (-20.00)     21      19   (-9.52)
2      .87      6      6   (0)          27      27   (0)
4    52.16     26     32   (23.07)     108     103   (-4.62)
8    58.49    103    146   (41.74)     431     445   (3.24)
16   56.04    407    514   (26.28)    1729    1586   (-8.27)
24   56.80    934   1161   (24.30)    3916    3665   (-6.40)
32   54.54   1668   2160   (29.49)    6925    6872   (-.76)
40   53.40   2655   3317   (24.93)   10712   10707   (-.04)
48   51.74   3920   4486   (14.43)   15598   14715   (-5.66)
64   51.66   7096   8250   (16.26)   28099   27211   (-3.16)
80   50.06  11240  12586   (11.97)   43913   42070   (-4.19)
96   50.41   1634

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-12 Thread Michael S. Tsirkin
On Mon, Oct 11, 2010 at 12:51:27PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 10/06/2010 07:04:31 PM:
> 
> > On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> > > For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> > > for degradation for 1 stream case:
> >
> > I thought about possible RX/TX contention reasons, and I realized that
> > we get/put the mm counter all the time.  So I wrote the following: I
> > haven't seen any performance gain from this in a single queue case, but
> > maybe this will help multiqueue?
> 
> Sorry for the delay, I was sick last couple of days. The results
> with your patch are (%'s over original code):
> 
> Code   BW%   CPU%   RemoteCPU
> MQ (#txq=16)   31.4% 38.42% 6.41%
> MQ+MST (#txq=16)   28.3% 18.9%  -10.77%
> 
> The patch helps CPU utilization but didn't help single stream
> drop.
> 
> Thanks,

What other shared TX/RX locks are there?  In your setup, is the same
macvtap socket structure used for RX and TX?  If yes this will create
cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line,
there might also be contention on the lock in sk_sleep waitqueue.
Anything else?

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-11 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 10/06/2010 07:04:31 PM:

> On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> > For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> > for degradation for 1 stream case:
>
> I thought about possible RX/TX contention reasons, and I realized that
> we get/put the mm counter all the time.  So I wrote the following: I
> haven't seen any performance gain from this in a single queue case, but
> maybe this will help multiqueue?

Sorry for the delay, I was sick last couple of days. The results
with your patch are (%'s over original code):

Code   BW%   CPU%   RemoteCPU
MQ (#txq=16)   31.4% 38.42% 6.41%
MQ+MST (#txq=16)   28.3% 18.9%  -10.77%

The patch helps CPU utilization but didn't help single stream
drop.

Thanks,

- KK

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Michael S. Tsirkin
On Wed, Oct 06, 2010 at 11:13:31PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 10/05/2010 11:53:23 PM:
> 
> > > > Any idea where does this come from?
> > > > Do you see more TX interrupts? RX interrupts? Exits?
> > > > Do interrupts bounce more between guest CPUs?
> > > > 4. Identify reasons for single netperf BW regression.
> > >
> > > After testing various combinations of #txqs, #vhosts, #netperf
> > > sessions, I think the drop for 1 stream is due to TX and RX for
> > > a flow being processed on different cpus.
> >
> > Right. Can we fix it?
> 
> I am not sure how to. My initial patch had one thread but gave
> small gains and ran into limitations once number of sessions
> became large.

Sure. We will need multiple RX queues, and have a single
thread handle a TX and RX pair. Then we need to make sure packets
from a given flow on TX land on the same thread on RX.
As flows can be hashed differently, for this to work we'll have to
expose this info in host/guest interface.
But since multiqueue implies host/guest ABI changes anyway,
this point is moot.

BTW, an interesting approach could be using bonding
and multiple virtio-net interfaces.
What are the disadvantages of such a setup?  One advantage
is it can be made to work in existing guests.

> > >  I did two more tests:
> > > 1. Pin vhosts to same CPU:
> > > - BW drop is much lower for 1 stream case (- 5 to -8% range)
> > > - But performance is not so high for more sessions.
> > > 2. Changed vhost to be single threaded:
> > >   - No degradation for 1 session, and improvement for upto
> > >  8, sometimes 16 streams (5-12%).
> > >   - BW degrades after that, all the way till 128 netperf
> sessions.
> > >   - But overall CPU utilization improves.
> > > Summary of the entire run (for 1-128 sessions):
> > > txq=4:  BW: (-2.3)  CPU: (-16.5)RCPU: (-5.3)
> > > txq=16: BW: (-1.9)  CPU: (-24.9)RCPU: (-9.6)
> > >
> > > I don't see any reasons mentioned above.  However, for higher
> > > number of netperf sessions, I see a big increase in retransmissions:
> >
> > Hmm, ok, and do you see any errors?
> 
> I haven't seen any in any statistics, messages, etc.

Herbert, could you help out debugging this increase in retransmissions
please?  Older mail on netdev in this thread has some numbers that seem
to imply that we start hitting retransmissions much more as # of flows
goes up.

> Also no
> retransmissions for txq=1.

While it's nice that we have this parameter, the need to choose between
single stream and multi stream performance when you start the vm makes
this patch much less interesting IMHO.


-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Arnd Bergmann
On Wednesday 06 October 2010 19:14:42 Krishna Kumar2 wrote:
> Arnd Bergmann  wrote on 10/06/2010 05:49:00 PM:
> 
> > > I don't see any reasons mentioned above.  However, for higher
> > > number of netperf sessions, I see a big increase in retransmissions:
> > > ___
> > > #netperf  ORG   NEW
> > > BW (#retr)BW (#retr)
> > > ___
> > > 1  70244 (0) 64102 (0)
> > > 4  21421 (0) 36570 (416)
> > > 8  21746 (0) 38604 (148)
> > > 16 21783 (0) 40632 (464)
> > > 32 22677 (0) 37163 (1053)
> > > 64 23648 (4) 36449 (2197)
> > > 12823251 (2) 31676 (3185)
> > > ___
> >
> >
> > This smells like it could be related to a problem that Ben Greear found
> > recently (see "macvlan:  Enable qdisc backoff logic"). When the hardware
> > is busy, macvlan used to just drop the packet. With Ben's patch, we return
> > -EAGAIN
> > to qemu (or vhost-net) to trigger a resend.
> >
> > I suppose what we really should do is feed that condition back to the
> > guest network stack and implement the backoff in there.
> 
> Thanks for the pointer. I will take a look at this as I hadn't seen
> this patch earlier. Is there any way to figure out if this is the
> issue?

I think a good indication would be if this changes with/without the
patch, and if you see -EAGAIN in qemu with the patch applied.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 10/05/2010 11:53:23 PM:

> > > Any idea where does this come from?
> > > Do you see more TX interrupts? RX interrupts? Exits?
> > > Do interrupts bounce more between guest CPUs?
> > > 4. Identify reasons for single netperf BW regression.
> >
> > After testing various combinations of #txqs, #vhosts, #netperf
> > sessions, I think the drop for 1 stream is due to TX and RX for
> > a flow being processed on different cpus.
>
> Right. Can we fix it?

I am not sure how to. My initial patch had one thread but gave
small gains and ran into limitations once number of sessions
became large.

> >  I did two more tests:
> > 1. Pin vhosts to same CPU:
> > - BW drop is much lower for 1 stream case (- 5 to -8% range)
> > - But performance is not so high for more sessions.
> > 2. Changed vhost to be single threaded:
> >   - No degradation for 1 session, and improvement for upto
> >  8, sometimes 16 streams (5-12%).
> >   - BW degrades after that, all the way till 128 netperf
sessions.
> >   - But overall CPU utilization improves.
> > Summary of the entire run (for 1-128 sessions):
> > txq=4:  BW: (-2.3)  CPU: (-16.5)RCPU: (-5.3)
> > txq=16: BW: (-1.9)  CPU: (-24.9)RCPU: (-9.6)
> >
> > I don't see any reasons mentioned above.  However, for higher
> > number of netperf sessions, I see a big increase in retransmissions:
>
> Hmm, ok, and do you see any errors?

I haven't seen any in any statistics, messages, etc. Also no
retransmissions for txq=1.

> > Single netperf case didn't have any retransmissions so that is not
> > the cause for drop.  I tested ixgbe (MQ):
> > ___
> > #netperf  ixgbe ixgbe (pin intrs to cpu#0 on
> >both server/client)
> > BW (#retr)  BW (#retr)
> > ___
> > 1   3567 (117)  6000 (251)
> > 2   4406 (477)  6298 (725)
> > 4   6119 (1085) 7208 (3387)
> > 8   6595 (4276) 7381 (15296)
> > 16  6651 (11651)6856 (30394)
>
> Interesting.
> You are saying we get much more retransmissions with physical nic as
> well?

Yes, with ixgbe. I re-ran with 16 netperfs running for 15 secs on
both ixgbe and cxgb3 just now to reconfirm:

ixgbe: BW: 6186.85  SD/Remote: 135.711, 339.376  CPU/Remote: 79.99, 200.00,
Retrans: 545
cxgb3: BW: 8051.07  SD/Remote: 144.416, 260.487  CPU/Remote: 110.88,
200.00, Retrans: 0

However 64 netperfs for 30 secs gave:

ixgbe: BW: 6691.12  SD/Remote: 8046.617, 5259.992  CPU/Remote: 1223.86,
799.97, Retrans: 1424
cxgb3: BW: 7799.16  SD/Remote: 2589.875, 4317.013  CPU/Remote: 480.39
800.64, Retrans: 649

# ethtool -i eth4
driver: ixgbe
version: 2.0.84-k2
firmware-version: 0.9-3
bus-info: :1f:00.1

# ifconfig output:
   RX packets:783241 errors:0 dropped:0 overruns:0 frame:0
   TX packets:689533 errors:0 dropped:0 overruns:0 carrier:0
   collisions:0 txqueuelen:1000

# lspci output:
1f:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network
Connec
tion (rev 01)
Subsystem: Intel Corporation Ethernet Server Adapter X520-2
Flags: bus master, fast devsel, latency 0, IRQ 30
Memory at 9890 (64-bit, prefetchable) [size=512K]
I/O ports at 2020 [size=32]
Memory at 98a0 (64-bit, prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
Capabilities: [a0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 00-1b-21-ff-ff-40-4a-b4
Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
Kernel driver in use: ixgbe
Kernel modules: ixgbe

> > I haven't done this right now since I don't have a setup.  I guess
> > it would be limited by wire speed and gains may not be there.  I
> > will try to do this later when I get the setup.
>
> OK but at least need to check that it does not hurt things.

Yes, sure.

> > Summary:
> >
> > 1. Average BW increase for regular I/O is best for #txq=16 with the
> >least CPU utilization increase.
> > 2. The average BW for 512 byte I/O is best for lower #txq=2. For higher
> >#txqs, BW increased only after a particular #netperf sessions - in
> >my testing that limit was 32 netperf sessions.
> > 3. Multiple txq for guest by itself doesn't seem to have any issues.
> >Guest CPU% increase is slightly higher than BW improvement.  I
> >think it is true for all mq drivers since more paths run in parallel
> >upto the device instead of sleeping and allowing one threa

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Krishna Kumar2
Arnd Bergmann  wrote on 10/06/2010 05:49:00 PM:

> > I don't see any reasons mentioned above.  However, for higher
> > number of netperf sessions, I see a big increase in retransmissions:
> > ___
> > #netperf  ORG   NEW
> > BW (#retr)BW (#retr)
> > ___
> > 1  70244 (0) 64102 (0)
> > 4  21421 (0) 36570 (416)
> > 8  21746 (0) 38604 (148)
> > 16 21783 (0) 40632 (464)
> > 32 22677 (0) 37163 (1053)
> > 64 23648 (4) 36449 (2197)
> > 12823251 (2) 31676 (3185)
> > ___
>
>
> This smells like it could be related to a problem that Ben Greear found
> recently (see "macvlan:  Enable qdisc backoff logic"). When the hardware
> is busy, macvlan used to just drop the packet. With Ben's patch, we return
> -EAGAIN
> to qemu (or vhost-net) to trigger a resend.
>
> I suppose what we really should do is feed that condition back to the
> guest network stack and implement the backoff in there.

Thanks for the pointer. I will take a look at this as I hadn't seen
this patch earlier. Is there any way to figure out if this is the
issue?

Thanks,

- KK

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 10/06/2010 07:04:31 PM:

> "Michael S. Tsirkin" 
> 10/06/2010 07:04 PM
>
> To
>
> Krishna Kumar2/India/i...@ibmin
>
> cc
>
> ru...@rustcorp.com.au, da...@davemloft.net, kvm@vger.kernel.org,
> a...@arndb.de, net...@vger.kernel.org, a...@redhat.com,
anth...@codemonkey.ws
>
> Subject
>
> Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
>
> On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> > For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> > for degradation for 1 stream case:
>
> I thought about possible RX/TX contention reasons, and I realized that
> we get/put the mm counter all the time.  So I wrote the following: I
> haven't seen any performance gain from this in a single queue case, but
> maybe this will help multiqueue?

Great! I am on vacation tomorrow, but will test with this patch
tomorrow night.

Thanks,

- KK

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Michael S. Tsirkin
On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> for degradation for 1 stream case:

I thought about possible RX/TX contention reasons, and I realized that
we get/put the mm counter all the time.  So I wrote the following: I
haven't seen any performance gain from this in a single queue case, but
maybe this will help multiqueue?

Thanks,

Michael S. Tsirkin (2):
  vhost: put mm after thread stop
  vhost-net: batch use/unuse mm

 drivers/vhost/net.c   |7 ---
 drivers/vhost/vhost.c |   16 ++--
 2 files changed, 10 insertions(+), 13 deletions(-)
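
The intent of the "batch use/unuse mm" change, sketched here rather than
quoted (vhost_next_work() is a hypothetical helper; see the referenced
patches for the real diff): the worker thread adopts the owner's mm once
instead of switching around every queued work item:

static int vhost_worker(void *data)
{
	struct vhost_dev *dev = data;

	use_mm(dev->mm);		/* switch to the guest's address space once */
	for (;;) {
		struct vhost_work *work = vhost_next_work(dev);	/* hypothetical */

		if (!work)
			break;
		work->fn(work);		/* no per-item use_mm()/unuse_mm() */
	}
	unuse_mm(dev->mm);
	return 0;
}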

-- 
1.7.3-rc1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-06 Thread Arnd Bergmann
On Tuesday 05 October 2010, Krishna Kumar2 wrote:
> After testing various combinations of #txqs, #vhosts, #netperf
> sessions, I think the drop for 1 stream is due to TX and RX for
> a flow being processed on different cpus.  I did two more tests:
> 1. Pin vhosts to same CPU:
> - BW drop is much lower for 1 stream case (- 5 to -8% range)
> - But performance is not so high for more sessions.
> 2. Changed vhost to be single threaded:
>   - No degradation for 1 session, and improvement for upto
>   8, sometimes 16 streams (5-12%).
>   - BW degrades after that, all the way till 128 netperf sessions.
>   - But overall CPU utilization improves.
> Summary of the entire run (for 1-128 sessions):
> txq=4:  BW: (-2.3)  CPU: (-16.5)RCPU: (-5.3)
> txq=16: BW: (-1.9)  CPU: (-24.9)RCPU: (-9.6)
> 
> I don't see any reasons mentioned above.  However, for higher
> number of netperf sessions, I see a big increase in retransmissions:
> ___
> #netperf  ORG   NEW
> BW (#retr)BW (#retr)
> ___
> 1  70244 (0) 64102 (0)
> 4  21421 (0) 36570 (416)
> 8  21746 (0) 38604 (148)
> 16 21783 (0) 40632 (464)
> 32 22677 (0) 37163 (1053)
> 64 23648 (4) 36449 (2197)
> 12823251 (2) 31676 (3185)
> ___


This smells like it could be related to a problem that Ben Greear found
recently (see "macvlan:  Enable qdisc backoff logic"). When the hardware
is busy, macvlan used to just drop the packet. With Ben's patch, we return -EAGAIN
to qemu (or vhost-net) to trigger a resend.

I suppose what we really should do is feed that condition back to the
guest network stack and implement the backoff in there.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-05 Thread Michael S. Tsirkin
On Tue, Oct 05, 2010 at 04:10:00PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 09/19/2010 06:14:43 PM:
> 
> > Could you document how exactly do you measure multistream bandwidth:
> > netperf flags, etc?
> 
> All results were without any netperf flags or system tuning:
> for i in $list
> do
> netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
> done
> wait
> Another script processes the result files.  It also displays the
> start time/end time of each iteration to make sure skew due to
> parallel netperfs is minimal.
> 
> I changed the vhost functionality once more to try to get the
> best model, the new model being:
> 1. #numtxqs=1 -> #vhosts=1, this thread handles both RX/TX.
> 2. #numtxqs>1 -> vhost[0] handles RX and vhost[1-MAX] handles
>TX[0-n], where MAX is 4.  Beyond numtxqs=4, the remaining TX
>queues are handled by vhost threads in round-robin fashion.
> 
> Results from here on are with these changes, and only "tuning" is
> to set each vhost's affinity to CPUs[0-3] ("taskset -p f ").
> 
> > Any idea where does this come from?
> > Do you see more TX interrupts? RX interrupts? Exits?
> > Do interrupts bounce more between guest CPUs?
> > 4. Identify reasons for single netperf BW regression.
> 
> After testing various combinations of #txqs, #vhosts, #netperf
> sessions, I think the drop for 1 stream is due to TX and RX for
> a flow being processed on different cpus.

Right. Can we fix it?

>  I did two more tests:
> 1. Pin vhosts to same CPU:
> - BW drop is much lower for 1 stream case (- 5 to -8% range)
> - But performance is not so high for more sessions.
> 2. Changed vhost to be single threaded:
>   - No degradation for 1 session, and improvement for upto
> 8, sometimes 16 streams (5-12%).
>   - BW degrades after that, all the way till 128 netperf sessions.
>   - But overall CPU utilization improves.
> Summary of the entire run (for 1-128 sessions):
> txq=4:  BW: (-2.3)  CPU: (-16.5)RCPU: (-5.3)
> txq=16: BW: (-1.9)  CPU: (-24.9)RCPU: (-9.6)
> 
> I don't see any reasons mentioned above.  However, for higher
> number of netperf sessions, I see a big increase in retransmissions:

Hmm, ok, and do you see any errors?

> ___
> #netperf  ORG   NEW
> BW (#retr)BW (#retr)
> ___
> 1  70244 (0) 64102 (0)
> 4  21421 (0) 36570 (416)
> 8  21746 (0) 38604 (148)
> 16 21783 (0) 40632 (464)
> 32 22677 (0) 37163 (1053)
> 64 23648 (4) 36449 (2197)
> 12823251 (2) 31676 (3185)
> ___
> 
> Single netperf case didn't have any retransmissions so that is not
> the cause for drop.  I tested ixgbe (MQ):
> ___
> #netperf  ixgbe ixgbe (pin intrs to cpu#0 on
>both server/client)
> BW (#retr)  BW (#retr)
> ___
> 1   3567 (117)  6000 (251)
> 2   4406 (477)  6298 (725)
> 4   6119 (1085) 7208 (3387)
> 8   6595 (4276) 7381 (15296)
> 16  6651 (11651)6856 (30394)

Interesting.
You are saying we get much more retransmissions with physical nic as
well?

> ___
> 
> > 5. Test perf in more scenarios:
> >small packets
> 
> 512 byte packets - BW drop for upto 8 (sometimes 16) netperf sessions,
> but increases with #sessions:
> ___
> #   BW1 BW2 (%) CPU1CPU2 (%)RCPU1   RCPU2 (%)
> ___
> 1   40433800 (-6.0) 50  50 (0)  86  98 (13.9)
> 2   83587485 (-10.4)153 178 (16.3)  230 264 (14.7)
> 4   20664   13567 (-34.3)   448 490 (9.3)   530 624 (17.7)
> 8   25198   17590 (-30.1)   967 1021 (5.5)  10851257 (15.8)
> 16  23791   24057 (1.1) 19042220 (16.5) 21562578 (19.5)
> 24  23055   26378 (14.4)28073378 (20.3) 32253901 (20.9)
> 32  22873   27116 (18.5)37484525 (20.7) 43075239 (21.6)
> 40  22876   29106 (27.2)47055717 (21.5) 53886591 (22.3)
> 48  23099   31352 (35.7)56426986 (23.8) 64758085 (24.8)
> 64  22645   30563 (34.9)75279027 (19.9) 861910656 (23.6)
> 80  22497   31922 (41.8)937511390 (21.4)10736   13485 (25.6)
> 96  22509   32718 (45.3)11271   13710 (21.6)12927   16269 (25.8)
> 128 22255   32397 (45.5)15036   18093 (20.3

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-10-05 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 09/19/2010 06:14:43 PM:

> Could you document how exactly do you measure multistream bandwidth:
> netperf flags, etc?

All results were without any netperf flags or system tuning:
for i in $list
do
netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
done
wait
Another script processes the result files.  It also displays the
start time/end time of each iteration to make sure skew due to
parallel netperfs is minimal.

I changed the vhost functionality once more to try to get the
best model, the new model being:
1. #numtxqs=1 -> #vhosts=1, this thread handles both RX/TX.
2. #numtxqs>1 -> vhost[0] handles RX and vhost[1-MAX] handles
   TX[0-n], where MAX is 4.  Beyond numtxqs=4, the remaining TX
   queues are handled by vhost threads in round-robin fashion.
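
For clarity, the queue-to-thread mapping in (1) and (2) can be written as
a small helper (a sketch only, not code from the patch; MAX_TX_VHOSTS
stands in for the MAX=4 above):

#define MAX_TX_VHOSTS	4

/* vhost[0] always handles RX in this model */
static int vhost_for_txq(int txq, int numtxqs)
{
	if (numtxqs == 1)
		return 0;			/* one thread does both RX and TX */
	return 1 + (txq % MAX_TX_VHOSTS);	/* vhost[1..4]; extra TXQs shared round-robin */
}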

Results from here on are with these changes, and only "tuning" is
to set each vhost's affinity to CPUs[0-3] ("taskset -p f ").

> Any idea where does this come from?
> Do you see more TX interrupts? RX interrupts? Exits?
> Do interrupts bounce more between guest CPUs?
> 4. Identify reasons for single netperf BW regression.

After testing various combinations of #txqs, #vhosts, #netperf
sessions, I think the drop for 1 stream is due to TX and RX for
a flow being processed on different cpus.  I did two more tests:
1. Pin vhosts to same CPU:
- BW drop is much lower for 1 stream case (- 5 to -8% range)
- But performance is not so high for more sessions.
2. Changed vhost to be single threaded:
  - No degradation for 1 session, and improvement for upto
  8, sometimes 16 streams (5-12%).
  - BW degrades after that, all the way till 128 netperf sessions.
  - But overall CPU utilization improves.
Summary of the entire run (for 1-128 sessions):
txq=4:  BW: (-2.3)  CPU: (-16.5)RCPU: (-5.3)
txq=16: BW: (-1.9)  CPU: (-24.9)RCPU: (-9.6)

I don't see any reasons mentioned above.  However, for higher
number of netperf sessions, I see a big increase in retransmissions:
___
#netperf  ORG   NEW
BW (#retr)BW (#retr)
___
1  70244 (0) 64102 (0)
4  21421 (0) 36570 (416)
8  21746 (0) 38604 (148)
16 21783 (0) 40632 (464)
32 22677 (0) 37163 (1053)
64 23648 (4) 36449 (2197)
12823251 (2) 31676 (3185)
___

Single netperf case didn't have any retransmissions so that is not
the cause for drop.  I tested ixgbe (MQ):
___
#netperf  ixgbe ixgbe (pin intrs to cpu#0 on
   both server/client)
BW (#retr)  BW (#retr)
___
1   3567 (117)  6000 (251)
2   4406 (477)  6298 (725)
4   6119 (1085) 7208 (3387)
8   6595 (4276) 7381 (15296)
16  6651 (11651)6856 (30394)
___

> 5. Test perf in more scenarios:
>small packets

512 byte packets - BW drop for upto 8 (sometimes 16) netperf sessions,
but increases with #sessions:
______________________________________________________________________________
#     BW1     BW2   (%)        CPU1    CPU2   (%)       RCPU1   RCPU2  (%)
______________________________________________________________________________
1     4043    3800  (-6.0)     50      50     (0)       86      98     (13.9)
2     8358    7485  (-10.4)    153     178    (16.3)    230     264    (14.7)
4     20664   13567 (-34.3)    448     490    (9.3)     530     624    (17.7)
8     25198   17590 (-30.1)    967     1021   (5.5)     1085    1257   (15.8)
16    23791   24057 (1.1)      1904    2220   (16.5)    2156    2578   (19.5)
24    23055   26378 (14.4)     2807    3378   (20.3)    3225    3901   (20.9)
32    22873   27116 (18.5)     3748    4525   (20.7)    4307    5239   (21.6)
40    22876   29106 (27.2)     4705    5717   (21.5)    5388    6591   (22.3)
48    23099   31352 (35.7)     5642    6986   (23.8)    6475    8085   (24.8)
64    22645   30563 (34.9)     7527    9027   (19.9)    8619    10656  (23.6)
80    22497   31922 (41.8)     9375    11390  (21.4)    10736   13485  (25.6)
96    22509   32718 (45.3)     11271   13710  (21.6)    12927   16269  (25.8)
128   22255   32397 (45.5)     15036   18093  (20.3)    17144   21608  (26.0)
______________________________________________________________________________
SUM:  BW: (16.7)  CPU: (20.6)  RCPU: (24.3)
______________________________________________________________________________

> host -> guest
___
#   BW1 BW2 (%) CPU1CPU2 (%)RCPU1   RCPU2 

Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-19 Thread Michael S. Tsirkin
On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote:
> For 1 TCP netperf, I ran 7 iterations and summed it. Explanation
> for degradation for 1 stream case:

Could you document how exactly do you measure multistream bandwidth:
netperf flags, etc?

> 1. Without any tuning, BW falls -6.5%.

Any idea where does this come from?
Do you see more TX interrupts? RX interrupts? Exits?
Do interrupts bounce more between guest CPUs?


> 2. When vhosts on server were bound to CPU0, BW was as good
>as with original code.
> 3. When new code was started with numtxqs=1 (or mq=off, which
>is the default), there was no degradation.
> 
>Next steps:
>---
> 1. MQ RX patch is also complete - plan to submit once TX is OK (as
>well as after identifying bandwidth degradations for some test
>cases).
> 2. Cache-align data structures: I didn't see any BW/SD improvement
>after making the sq's (and similarly for vhost) cache-aligned
>statically:
> struct virtnet_info {
> ...
> struct send_queue sq[16] cacheline_aligned_in_smp;
> ...
> };
> 3. Migration is not tested.

4. Identify reasons for single netperf BW regression.

5. Test perf in more scenarios:
   small packets
   host -> guest
   guest <-> external
   in last case:
 find some other way to measure host CPU utilization,
 try multiqueue and single queue devices

6. Use above to figure out what is a sane default for numtxqs.

> 
> Review/feedback appreciated.
> 
> Signed-off-by: Krishna Kumar 
> ---
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-17 Thread Sridhar Samudrala
On Fri, 2010-09-17 at 15:33 +0530, Krishna Kumar wrote:
> Following patches implement transmit MQ in virtio-net.  Also
> included is the user qemu changes. MQ is disabled by default
> unless qemu specifies it.
> 
> 1. This feature was first implemented with a single vhost.
>Testing showed 3-8% performance gain for upto 8 netperf
>sessions (and sometimes 16), but BW dropped with more
>sessions.  However, adding more vhosts improved BW
>significantly all the way to 128 sessions. Multiple
>vhost is implemented in-kernel by passing an argument
>to SET_OWNER (retaining backward compatibility). The
>vhost patch adds 173 source lines (incl comments).
> 2. BW -> CPU/SD equation: Average TCP performance increased
>23% compared to almost 70% for earlier patch (with
>unrestricted #vhosts).  SD improved -4.2% while it had
>increased 55% for the earlier patch.  Increasing #vhosts
>has it's pros and cons, but this patch lays emphasis on
>reducing CPU utilization.  Another option could be a
>tunable to select number of vhosts threads.
> 3. Interoperability: Many combinations, but not all, of qemu,
>host, guest tested together.  Tested with multiple i/f's
>on guest, with both mq=on/off, vhost=on/off, etc.
> 
>   Changes from rev1:
>   --
> 1. Move queue_index from virtio_pci_vq_info to virtqueue,
>and resulting changes to existing code and to the patch.
> 2. virtio-net probe uses virtio_config_val.
> 3. Remove constants: VIRTIO_MAX_TXQS, MAX_VQS, all arrays
>allocated on stack, etc.
> 4. Restrict number of vhost threads to 2 - I get much better
>cpu/sd results (without any tuning) with low number of vhost
>threads.  Higher vhosts gives better average BW performance
>(from average of 45%), but SD increases significantly (90%).
> 5. Working of vhost threads changes, eg for numtxqs=4:
>vhost-0: handles RX
>vhost-1: handles TX[0]
>vhost-0: handles TX[1]
>vhost-1: handles TX[2]
>vhost-0: handles TX[3]

This doesn't look symmetrical.
TCP flows that go via TX(1,3) use the same vhost thread for RX packets,
whereas flows via TX(0,2) use a different vhost thread.

Thanks
Sridhar

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[v2 RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-17 Thread Krishna Kumar
Following patches implement transmit MQ in virtio-net.  Also
included is the user qemu changes. MQ is disabled by default
unless qemu specifies it.

1. This feature was first implemented with a single vhost.
   Testing showed 3-8% performance gain for upto 8 netperf
   sessions (and sometimes 16), but BW dropped with more
   sessions.  However, adding more vhosts improved BW
   significantly all the way to 128 sessions. Multiple
   vhost is implemented in-kernel by passing an argument
   to SET_OWNER (retaining backward compatibility). The
   vhost patch adds 173 source lines (incl comments).
2. BW -> CPU/SD equation: Average TCP performance increased
   23% compared to almost 70% for earlier patch (with
   unrestricted #vhosts).  SD improved -4.2% while it had
   increased 55% for the earlier patch.  Increasing #vhosts
   has it's pros and cons, but this patch lays emphasis on
   reducing CPU utilization.  Another option could be a
   tunable to select number of vhosts threads.
3. Interoperability: Many combinations, but not all, of qemu,
   host, guest tested together.  Tested with multiple i/f's
   on guest, with both mq=on/off, vhost=on/off, etc.

  Changes from rev1:
  --
1. Move queue_index from virtio_pci_vq_info to virtqueue,
   and resulting changes to existing code and to the patch.
2. virtio-net probe uses virtio_config_val.
3. Remove constants: VIRTIO_MAX_TXQS, MAX_VQS, all arrays
   allocated on stack, etc.
4. Restrict number of vhost threads to 2 - I get much better
   cpu/sd results (without any tuning) with low number of vhost
   threads.  Higher vhosts gives better average BW performance
   (from average of 45%), but SD increases significantly (90%).
5. Working of vhost threads changes, eg for numtxqs=4:
   vhost-0: handles RX
   vhost-1: handles TX[0]
   vhost-0: handles TX[1]
   vhost-1: handles TX[2]
   vhost-0: handles TX[3]
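
A sketch (not from the patch) of the numtxqs=4 assignment listed above;
the plain alternation is what produces the asymmetry Sridhar points out
in his reply, since vhost-0 serves RX plus TX[1] and TX[3]:

/* vhost-0 also handles RX */
static int txq_to_vhost(int txq)
{
	return (txq % 2) ? 0 : 1;	/* TX[0],TX[2] -> vhost-1; TX[1],TX[3] -> vhost-0 */
}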

  Enabling MQ on virtio:
  ---
When following options are passed to qemu:
- smp > 1
- vhost=on
- mq=on (new option, default:off)
then #txqueues = #cpus.  The #txqueues can be changed by using
an optional 'numtxqs' option.  e.g. for a smp=4 guest:
vhost=on   ->   #txqueues = 1
vhost=on,mq=on ->   #txqueues = 4
vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2


   Performance (guest -> local host):
   ---
System configuration:
Host:  8 Intel Xeon, 8 GB memory
Guest: 4 cpus, 2 GB memory, numtxqs=4
All testing without any system tuning, and default netperf
Results split across two tables to show SD and CPU usage:

TCP: BW vs CPU/Remote CPU utilization:
#    BW1    BW2   (%)       CPU1  CPU2  (%)       RCPU1  RCPU2  (%)

1    69971  65376 (-6.56)   134   170   (26.86)   322    376    (16.77)
2    20911  24839 (18.78)   107   139   (29.90)   217    264    (21.65)
4    21431  28912 (34.90)   213   318   (49.29)   444    541    (21.84)
8    21857  34592 (58.26)   444   859   (93.46)   901    1247   (38.40)
16   22368  33083 (47.90)   899   1523  (69.41)   1813   2410   (32.92)
24   22556  32578 (44.43)   1347  2249  (66.96)   2712   3606   (32.96)
32   22727  30923 (36.06)   1806  2506  (38.75)   3622   3952   (9.11)
40   23054  29334 (27.24)   2319  2872  (23.84)   4544   4551   (.15)
48   23006  28800 (25.18)   2827  2990  (5.76)    5465   4718   (-13.66)
64   23411  27661 (18.15)   3708  3306  (-10.84)  7231   5218   (-27.83)
80   23175  27141 (17.11)   4796  4509  (-5.98)   9152   7182   (-21.52)
96   23337  26759 (14.66)   5603  4543  (-18.91)  10890  7162   (-34.23)
128  22726  28339 (24.69)   7559  6395  (-15.39)  14600  10169  (-30.34)

Summary:  BW: 22.8%   CPU: 1.9%   RCPU: -17.0%

TCP: BW vs SD/Remote SD:
#    BW1    BW2   (%)       SD1   SD2   (%)       RSD1   RSD2   (%)

1    69971  65376 (-6.56)   4     6     (50.00)   21     26     (23.80)
2    20911  24839 (18.78)   6     7     (16.66)   27     28     (3.70)
4    21431  28912 (34.90)   26    31    (19.23)   108    111    (2.77)
8    21857  34592 (58.26)   106   135   (27.35)   432    393    (-9.02)
16   22368  33083 (47.90)   431   577   (33.87)   1742   1828   (4.93)
24   22556  32578 (44.43)   972   1393  (43.31)   3915   4479   (14.40)
32   22727  30923 (36.06)   1723  2165  (25.65)   6908   6842   (-.95)
40   23054  29334 (27.24)   2774  2761  (-.46)    10874  8764   (-19.40)
48   23006  28800 (25.18

Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-14 Thread Michael S. Tsirkin
On Mon, Sep 13, 2010 at 09:53:40PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 09/13/2010 05:20:55 PM:
> 
> > > Results with the original kernel:
> > > _
> > > #   BW  SD  RSD
> > > __
> > > 1   20903   1   6
> > > 2   21963   6   25
> > > 4   22042   23  102
> > > 8   21674   97  419
> > > 16  22281   379 1663
> > > 24  22521   857 3748
> > > 32  22976   1528    6594
> > > 40  23197   2390    10239
> > > 48  22973   3542    15074
> > > 64  23809   6486    27244
> > > 80  23564   10169   43118
> > > 96  22977   14954   62948
> > > 128 23649   27067   113892
> > > 
> > >
> > > With higher number of threads running in parallel, SD
> > > increased. In this case most threads run in parallel
> > > only till __dev_xmit_skb (#numtxqs=1). With mq TX patch,
> > > higher number of threads run in parallel through
> > > ndo_start_xmit. I *think* the increase in SD is to do
> > > with higher # of threads running for larger code path
> > > From the numbers I posted with the patch (cut-n-paste
> > > only the % parts), BW increased much more than the SD,
> > > sometimes more than twice the increase in SD.
> >
> > Service demand is BW/CPU, right? So if BW goes up by 50%
> > and SD by 40%, this means that CPU more than doubled.
> 
> I think the SD calculation might be more complicated,
> I think it does it based on adding up averages sampled
> and stored during the run. But, I still don't see how CPU
> can double?? e.g.
>   BW: 1000 -> 1500 (50%)
>   SD: 100 -> 140 (40%)
>   CPU: 10 -> 10.71 (7.1%)

Hmm. Time to look at the source. Which netperf version did you use?

> > > N#  BW% SD%  RSD%
> > > 4   54.30   40.00   -1.16
> > > 8   71.79   46.59   -2.68
> > > 16  71.89   50.40   -2.50
> > > 32  72.24   34.26   -14.52
> > > 48  70.10   31.51   -14.35
> > > 64  69.01   38.81   -9.66
> > > 96  70.68   71.26   10.74
> > >
> > > I also think SD calculation gets skewed for guest->local
> > > host testing.
> >
> > If it's broken, let's fix it?
> >
> > > For this test, I ran a guest with numtxqs=16.
> > > The first result below is with my patch, which creates 16
> > > vhosts. The second result is with a modified patch which
> > > creates only 2 vhosts (testing with #netperfs = 64):
> >
> > My guess is it's not a good idea to have more TX VQs than guest CPUs.
> 
> Definitely, I will try to run tomorrow with more reasonable
> values, also will test with my second version of the patch
> that creates restricted number of vhosts and post results.
> 
> > I realize for management it's easier to pass in a single vhost fd, but
> > just for testing it's probably easier to add code in userspace to open
> > /dev/vhost multiple times.
> >
> > >
> > > #vhosts  BW%     SD%     RSD%
> > > 16       20.79   186.01  149.74
> > > 2        30.89   34.55   18.44
> > >
> > > The remote SD increases with the number of vhost threads,
> > > but that number seems to correlate with guest SD. So though
> > > BW% increased slightly from 20% to 30%, SD fell drastically
> > > from 186% to 34%. I think it could be a calculation skew
> > > with host SD, which also fell from 150% to 18%.
> >
> > I think by default netperf looks in /proc/stat for CPU utilization data:
> > so host CPU utilization will include the guest CPU, I think?
> 
> It appears that way to me too, but the data above seems to
> suggest the opposite...
> 
> > I would go further and claim that for host/guest TCP
> > CPU utilization and SD should always be identical.
> > Makes sense?
> 
> It makes sense to me, but once again I am not sure how SD
> is really done, or whether it is linear to CPU. Cc'ing Rick
> in case he can comment

Me neither. I should rephrase: I think we should always
use host CPU utilization.

> >
> > >
> > > I am planning to submit 2nd patch rev with restricted
> > > number of vhosts.
> > >
> > > > > Likely cause for the 1 stream degradation with multiple
> > > > > vhost patch:
> > > > >
> > > > > 1. Two vhosts run handling the RX and TX respectively.
> > > > >I think the issue is related to cache ping-pong esp
> > > > >since these run on different cpus/sockets.
> > > >
> > > > Right. With TCP I think we are better off handling
> > > > TX and RX for a socket by the same vhost, so that
> > > > packet and its ack are handled by the same thread.
> > > > Is this what happens with RX multiqueue patch?
> > > > How do we select an RX queue to put the packet on?
> > >
> > > My (unsubmitted) RX patch doesn't do this yet, that is
> > > something I will check.
> > >
> > > Thanks,
> > >
> > > - KK
> >
> > You'll want to work on top of net-next, I think there's
> > RX flow filtering work going on there.
> 
> Thanks Michael, I will follow up on that for the RX patch,
> plus your suggestion on tying RX with TX.
> 
> Th

Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-13 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 09/13/2010 05:20:55 PM:

> > Results with the original kernel:
> > _
> > #   BW  SD  RSD
> > __
> > 1   20903   1   6
> > 2   21963   6   25
> > 4   22042   23  102
> > 8   21674   97  419
> > 16  22281   379 1663
> > 24  22521   857 3748
> > 32  22976   1528    6594
> > 40  23197   2390    10239
> > 48  22973   3542    15074
> > 64  23809   6486    27244
> > 80  23564   10169   43118
> > 96  22977   14954   62948
> > 128 23649   27067   113892
> > 
> >
> > With higher number of threads running in parallel, SD
> > increased. In this case most threads run in parallel
> > only till __dev_xmit_skb (#numtxqs=1). With mq TX patch,
> > higher number of threads run in parallel through
> > ndo_start_xmit. I *think* the increase in SD is to do
> > with higher # of threads running for larger code path
> > From the numbers I posted with the patch (cut-n-paste
> > only the % parts), BW increased much more than the SD,
> > sometimes more than twice the increase in SD.
>
> Service demand is BW/CPU, right? So if BW goes up by 50%
> and SD by 40%, this means that CPU more than doubled.

I think the SD calculation might be more complicated,
I think it does it based on adding up averages sampled
and stored during the run. But, I still don't see how CPU
can double?? e.g.
BW: 1000 -> 1500 (50%)
SD: 100 -> 140 (40%)
CPU: 10 -> 10.71 (7.1%)
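
To make the two readings concrete, here is a tiny illustrative
calculation (toy numbers; as far as I know netperf documents service
demand as CPU consumed per unit of work, e.g. usec/KB, i.e. SD ~
CPU/BW, but nothing below depends on taking sides):

#include <stdio.h>

/* Toy calculation only: how the two readings of "service demand"
 * lead to different conclusions about CPU cost when BW rises 50%
 * and SD rises 40%.
 */
int main(void)
{
        double bw_ratio = 1.5;  /* BW: 1000 -> 1500 */
        double sd_ratio = 1.4;  /* SD:  100 ->  140 */

        /* If SD = CPU/BW, then CPU = SD * BW */
        printf("SD = CPU/BW: CPU changes by %.2fx\n", sd_ratio * bw_ratio);

        /* If SD = BW/CPU, then CPU = BW / SD */
        printf("SD = BW/CPU: CPU changes by %.2fx\n", bw_ratio / sd_ratio);

        return 0;   /* prints 2.10x and 1.07x respectively */
}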

> > N#  BW% SD%  RSD%
> > 4   54.30   40.00   -1.16
> > 8   71.79   46.59   -2.68
> > 16  71.89   50.40   -2.50
> > 32  72.24   34.26   -14.52
> > 48  70.10   31.51   -14.35
> > 64  69.01   38.81   -9.66
> > 96  70.68   71.26   10.74
> >
> > I also think SD calculation gets skewed for guest->local
> > host testing.
>
> If it's broken, let's fix it?
>
> > For this test, I ran a guest with numtxqs=16.
> > The first result below is with my patch, which creates 16
> > vhosts. The second result is with a modified patch which
> > creates only 2 vhosts (testing with #netperfs = 64):
>
> My guess is it's not a good idea to have more TX VQs than guest CPUs.

Definitely, I will try to run tomorrow with more reasonable
values, also will test with my second version of the patch
that creates restricted number of vhosts and post results.

> I realize for management it's easier to pass in a single vhost fd, but
> just for testing it's probably easier to add code in userspace to open
> /dev/vhost multiple times.
>
> >
> > #vhosts  BW%     SD%     RSD%
> > 16       20.79   186.01  149.74
> > 2        30.89   34.55   18.44
> >
> > The remote SD increases with the number of vhost threads,
> > but that number seems to correlate with guest SD. So though
> > BW% increased slightly from 20% to 30%, SD fell drastically
> > from 186% to 34%. I think it could be a calculation skew
> > with host SD, which also fell from 150% to 18%.
>
> I think by default netperf looks in /proc/stat for CPU utilization data:
> so host CPU utilization will include the guest CPU, I think?

It appears that way to me too, but the data above seems to
suggest the opposite...

> I would go further and claim that for host/guest TCP
> CPU utilization and SD should always be identical.
> Makes sense?

It makes sense to me, but once again I am not sure how SD
is really done, or whether it is linear to CPU. Cc'ing Rick
in case he can comment

>
> >
> > I am planning to submit 2nd patch rev with restricted
> > number of vhosts.
> >
> > > > Likely cause for the 1 stream degradation with multiple
> > > > vhost patch:
> > > >
> > > > 1. Two vhosts run handling the RX and TX respectively.
> > > >I think the issue is related to cache ping-pong esp
> > > >since these run on different cpus/sockets.
> > >
> > > Right. With TCP I think we are better off handling
> > > TX and RX for a socket by the same vhost, so that
> > > packet and its ack are handled by the same thread.
> > > Is this what happens with RX multiqueue patch?
> > > How do we select an RX queue to put the packet on?
> >
> > My (unsubmitted) RX patch doesn't do this yet, that is
> > something I will check.
> >
> > Thanks,
> >
> > - KK
>
> You'll want to work on top of net-next, I think there's
> RX flow filtering work going on there.

Thanks Michael, I will follow up on that for the RX patch,
plus your suggestion on tying RX with TX.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-13 Thread Michael S. Tsirkin
On Mon, Sep 13, 2010 at 09:42:22AM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 09/12/2010 05:10:25 PM:
> 
> > > SINGLE vhost (Guest -> Host):
> > >1 netperf:BW: 10.7% SD: -1.4%
> > >4 netperfs:   BW: 3%SD: 1.4%
> > >8 netperfs:   BW: 17.7% SD: -10%
> > >   16 netperfs:  BW: 4.7%  SD: -7.0%
> > >   32 netperfs:  BW: -6.1% SD: -5.7%
> > > BW and SD both improves (guest multiple txqs help). For 32
> > > netperfs, SD improves.
> > >
> > > But with multiple vhosts, guest is able to send more packets
> > > and BW increases much more (SD too increases, but I think
> > > that is expected).
> >
> > Why is this expected?
> 
> Results with the original kernel:
> _
> #   BW  SD  RSD
> __
> 1   20903   1   6
> 2   21963   6   25
> 4   22042   23  102
> 8   21674   97  419
> 16  22281   379 1663
> 24  22521   857 3748
> 32  22976   1528    6594
> 40  23197   2390    10239
> 48  22973   3542    15074
> 64  23809   6486    27244
> 80  23564   10169   43118
> 96  22977   14954   62948
> 128 23649   27067   113892
> 
> 
> With higher number of threads running in parallel, SD
> increased. In this case most threads run in parallel
> only till __dev_xmit_skb (#numtxqs=1). With mq TX patch,
> higher number of threads run in parallel through
> ndo_start_xmit. I *think* the increase in SD is to do
> with higher # of threads running for larger code path
> From the numbers I posted with the patch (cut-n-paste
> only the % parts), BW increased much more than the SD,
> sometimes more than twice the increase in SD.

Service demand is BW/CPU, right? So if BW goes up by 50%
and SD by 40%, this means that CPU more than doubled.

> N#  BW% SD%  RSD%
> 4   54.30   40.00   -1.16
> 8   71.79   46.59   -2.68
> 16  71.89   50.40   -2.50
> 32  72.24   34.26   -14.52
> 48  70.10   31.51   -14.35
> 64  69.01   38.81   -9.66
> 96  70.68   71.26   10.74
> 
> I also think SD calculation gets skewed for guest->local
> host testing.

If it's broken, let's fix it?

> For this test, I ran a guest with numtxqs=16.
> The first result below is with my patch, which creates 16
> vhosts. The second result is with a modified patch which
> creates only 2 vhosts (testing with #netperfs = 64):

My guess is it's not a good idea to have more TX VQs than guest CPUs.

I realize for management it's easier to pass in a single vhost fd, but
just for testing it's probably easier to add code in userspace to open
/dev/vhost multiple times.
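
A rough userspace sketch of that suggestion (illustrative only; the
mainline device node is /dev/vhost-net, and error handling plus the
rest of the vhost setup - SET_MEM_TABLE, SET_VRING_*, NET_SET_BACKEND -
is omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vhost.h>

#define MAX_QPAIRS 8

/* Open one vhost fd per TX/RX pair; each fd gets its own worker
 * context in the kernel, so no in-kernel multiqueue awareness is
 * needed just for testing.
 */
int open_vhost_fds(int nqpairs, int fds[MAX_QPAIRS])
{
        int i;

        for (i = 0; i < nqpairs && i < MAX_QPAIRS; i++) {
                fds[i] = open("/dev/vhost-net", O_RDWR);
                if (fds[i] < 0)
                        return -1;
                if (ioctl(fds[i], VHOST_SET_OWNER) < 0)
                        return -1;
        }
        return 0;
}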

> 
> #vhosts  BW%     SD%     RSD%
> 16       20.79   186.01  149.74
> 2        30.89   34.55   18.44
> 
> The remote SD increases with the number of vhost threads,
> but that number seems to correlate with guest SD. So though
> BW% increased slightly from 20% to 30%, SD fell drastically
> from 186% to 34%. I think it could be a calculation skew
> with host SD, which also fell from 150% to 18%.

I think by default netperf looks in /proc/stat for CPU utilization data:
so host CPU utilization will include the guest CPU, I think?

I would go further and claim that for host/guest TCP
CPU utilization and SD should always be identical.
Makes sense?

> 
> I am planning to submit 2nd patch rev with restricted
> number of vhosts.
> 
> > > Likely cause for the 1 stream degradation with multiple
> > > vhost patch:
> > >
> > > 1. Two vhosts run handling the RX and TX respectively.
> > >I think the issue is related to cache ping-pong esp
> > >since these run on different cpus/sockets.
> >
> > Right. With TCP I think we are better off handling
> > TX and RX for a socket by the same vhost, so that
> > packet and its ack are handled by the same thread.
> > Is this what happens with RX multiqueue patch?
> > How do we select an RX queue to put the packet on?
> 
> My (unsubmitted) RX patch doesn't do this yet, that is
> something I will check.
> 
> Thanks,
> 
> - KK

You'll want to work on top of net-next, I think there's
RX flow filtering work going on there.

-- 
MST


Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-12 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 09/12/2010 05:10:25 PM:

> > SINGLE vhost (Guest -> Host):
> >1 netperf:BW: 10.7% SD: -1.4%
> >4 netperfs:   BW: 3%SD: 1.4%
> >8 netperfs:   BW: 17.7% SD: -10%
> >   16 netperfs:  BW: 4.7%  SD: -7.0%
> >   32 netperfs:  BW: -6.1% SD: -5.7%
> > BW and SD both improves (guest multiple txqs help). For 32
> > netperfs, SD improves.
> >
> > But with multiple vhosts, guest is able to send more packets
> > and BW increases much more (SD too increases, but I think
> > that is expected).
>
> Why is this expected?

Results with the original kernel:
_
#   BW  SD  RSD
__
1   20903   1   6
2   21963   6   25
4   22042   23  102
8   21674   97  419
16  22281   379 1663
24  22521   857 3748
32  22976   1528    6594
40  23197   2390    10239
48  22973   3542    15074
64  23809   6486    27244
80  23564   10169   43118
96  22977   14954   62948
128 23649   27067   113892


With higher number of threads running in parallel, SD
increased. In this case most threads run in parallel
only till __dev_xmit_skb (#numtxqs=1). With mq TX patch,
higher number of threads run in parallel through
ndo_start_xmit. I *think* the increase in SD is to do
with higher # of threads running for larger code path
From the numbers I posted with the patch (cut-n-paste
only the % parts), BW increased much more than the SD,
sometimes more than twice the increase in SD.

N#  BW% SD%  RSD%
4   54.30   40.00   -1.16
8   71.79   46.59   -2.68
16  71.89   50.40   -2.50
32  72.24   34.26   -14.52
48  70.10   31.51   -14.35
64  69.01   38.81   -9.66
96  70.68   71.26   10.74

I also think SD calculation gets skewed for guest->local
host testing. For this test, I ran a guest with numtxqs=16.
The first result below is with my patch, which creates 16
vhosts. The second result is with a modified patch which
creates only 2 vhosts (testing with #netperfs = 64):

#vhosts  BW%     SD%     RSD%
16       20.79   186.01  149.74
2        30.89   34.55   18.44

The remote SD increases with the number of vhost threads,
but that number seems to correlate with guest SD. So though
BW% increased slightly from 20% to 30%, SD fell drastically
from 186% to 34%. I think it could be a calculation skew
with host SD, which also fell from 150% to 18%.

I am planning to submit 2nd patch rev with restricted
number of vhosts.

> > Likely cause for the 1 stream degradation with multiple
> > vhost patch:
> >
> > 1. Two vhosts run handling the RX and TX respectively.
> >I think the issue is related to cache ping-pong esp
> >since these run on different cpus/sockets.
>
> Right. With TCP I think we are better off handling
> TX and RX for a socket by the same vhost, so that
> packet and its ack are handled by the same thread.
> Is this what happens with RX multiqueue patch?
> How do we select an RX queue to put the packet on?

My (unsubmitted) RX patch doesn't do this yet, that is
something I will check.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-12 Thread Michael S. Tsirkin
On Thu, Sep 09, 2010 at 03:15:53PM +0530, Krishna Kumar2 wrote:
> > Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:
> 
> Some more results and likely cause for single netperf
> degradation below.
> 
> 
> > Guest -> Host (single netperf):
> > I am getting a drop of almost 20%. I am trying to figure out
> > why.
> >
> > Host -> guest (single netperf):
> > I am getting an improvement of almost 15%. Again - unexpected.
> >
> > Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
> > for runs upto 128 sessions. With fewer netperf (under 8), there
> > was a drop of 3-7% in #packets, but beyond that, the #packets
> > improved significantly to give an average improvement of 7.4%.
> >
> > So it seems that fewer sessions is having negative effect for
> > some reason on the tx side. The code path in virtio-net has not
> > changed much, so the drop in some cases is quite unexpected.
> 
> The drop for the single netperf seems to be due to multiple vhost.
> I changed the patch to start *single* vhost:
> 
> Guest -> Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
> Guest -> Host (1 netperf) : Latency: -3%, SD: 3.5%
> 
> Single vhost performs well but hits the barrier at 16 netperf
> sessions:
> 
> SINGLE vhost (Guest -> Host):
>   1 netperf:BW: 10.7% SD: -1.4%
>   4 netperfs:   BW: 3%SD: 1.4%
>   8 netperfs:   BW: 17.7% SD: -10%
>   16 netperfs:  BW: 4.7%  SD: -7.0%
>   32 netperfs:  BW: -6.1% SD: -5.7%
> BW and SD both improves (guest multiple txqs help). For 32
> netperfs, SD improves.
> 
> But with multiple vhosts, guest is able to send more packets
> and BW increases much more (SD too increases, but I think
> that is expected).

Why is this expected?

> From the earlier results:
> 
> N#  BW1    BW2   (%)       SD1    SD2    (%)       RSD1   RSD2   (%)
> _____________________________________________________________________
> 4   26387  40716 (54.30)   20     28     (40.00)   86     85     (-1.16)
> 8   24356  41843 (71.79)   88     129    (46.59)   372    362    (-2.68)
> 16  23587  40546 (71.89)   375    564    (50.40)   1558   1519   (-2.50)
> 32  22927  39490 (72.24)   1617   2171   (34.26)   6694   5722   (-14.52)
> 48  23067  39238 (70.10)   3931   5170   (31.51)   15823  13552  (-14.35)
> 64  22927  38750 (69.01)   7142   9914   (38.81)   28972  26173  (-9.66)
> 96  22568  38520 (70.68)   16258  27844  (71.26)   65944  73031  (10.74)
> _____________________________________________________________________
> (All tests were done without any tuning)
> 
> From my testing:
> 
> 1. Single vhost improves mq guest performance upto 16
>netperfs but degrades after that.
> 2. Multiple vhost degrades single netperf guest
>performance, but significantly improves performance
>for any number of netperf sessions.
> 
> Likely cause for the 1 stream degradation with multiple
> vhost patch:
> 
> 1. Two vhosts run handling the RX and TX respectively.
>I think the issue is related to cache ping-pong esp
>since these run on different cpus/sockets.

Right. With TCP I think we are better off handling
TX and RX for a socket by the same vhost, so that
packet and its ack are handled by the same thread.
Is this what happens with RX multiqueue patch?
How do we select an RX queue to put the packet on?
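
One possible answer, sketched here purely for illustration (this is
not from the patch; the helper name and the scheme are made up): hash
the flow so that a connection's RX queue is stable, and - if the
guest's TX queue selection uses the same hash - it lands on the same
queue pair and hence the same vhost thread:

#include <linux/jhash.h>
#include <linux/types.h>

/* Hypothetical helper: pick an RX virtqueue from the flow 4-tuple so
 * that all packets (and hence ACK processing) of one TCP flow stay on
 * one queue.
 */
static u16 pick_rx_queue(__be32 saddr, __be32 daddr,
                         __be16 sport, __be16 dport, u16 num_queues)
{
        u32 hash = jhash_3words((__force u32)saddr,
                                (__force u32)daddr,
                                ((__force u32)sport << 16) |
                                (__force u32)dport, 0);

        /* Scale the 32-bit hash down to [0, num_queues) */
        return (u16)(((u64)hash * num_queues) >> 32);
}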


> 2. I (re-)modified the patch to share RX with TX[0]. The
>performance drop is the same, but the reason is the
>guest is not using txq[0] in most cases (dev_pick_tx),
>so vhost's rx and tx are running on different threads.
>But whenever the guest uses txq[0], only one vhost
>runs and the performance is similar to original.
> 
> I went back to my *submitted* patch and started a guest
> with numtxq=16 and pinned every vhost to cpus #0&1. Now
> whether guest used txq[0] or txq[n], the performance is
> similar or better (between 10-27% across 10 runs) than
> original code. Also, -6% to -24% improvement in SD.
> 
> I will start a full test run of original vs submitted
> code with minimal tuning (Avi also suggested the same),
> and re-send. Please let me know if you need any other
> data.
> 
> Thanks,
> 
> - KK


Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-09 Thread Krishna Kumar2
Sridhar Samudrala  wrote on 09/10/2010 04:30:24 AM:

> I remember seeing similar issue when using a separate vhost thread for
> TX and
> RX queues.  Basically, we should have the same vhost thread process a
> TCP flow
> in both directions. I guess this allows the data and ACKs to be
> processed in sync.

I was trying that by sharing threads between rx and tx[0], but
that didn't work either since guest rarely picks txq=0. I was
able to get reasonable single stream performance by pinning
vhosts to the same cpu.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-09 Thread Sridhar Samudrala

 On 9/9/2010 2:45 AM, Krishna Kumar2 wrote:

Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:

Some more results and likely cause for single netperf
degradation below.



Guest ->  Host (single netperf):
I am getting a drop of almost 20%. I am trying to figure out
why.

Host ->  guest (single netperf):
I am getting an improvement of almost 15%. Again - unexpected.

Guest ->  Host TCP_RR: I get an average 7.4% increase in #packets
for runs upto 128 sessions. With fewer netperf (under 8), there
was a drop of 3-7% in #packets, but beyond that, the #packets
improved significantly to give an average improvement of 7.4%.

So it seems that fewer sessions is having negative effect for
some reason on the tx side. The code path in virtio-net has not
changed much, so the drop in some cases is quite unexpected.

The drop for the single netperf seems to be due to multiple vhost.
I changed the patch to start *single* vhost:

Guest ->  Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
Guest ->  Host (1 netperf) : Latency: -3%, SD: 3.5%
I remember seeing similar issue when using a separate vhost thread for 
TX and
RX queues.  Basically, we should have the same vhost thread process a 
TCP flow
in both directions. I guess this allows the data and ACKs to be 
processed in sync.



Thanks
Sridhar

Single vhost performs well but hits the barrier at 16 netperf
sessions:

SINGLE vhost (Guest ->  Host):
1 netperf:BW: 10.7% SD: -1.4%
4 netperfs:   BW: 3%SD: 1.4%
8 netperfs:   BW: 17.7% SD: -10%
   16 netperfs:  BW: 4.7%  SD: -7.0%
   32 netperfs:  BW: -6.1% SD: -5.7%
BW and SD both improves (guest multiple txqs help). For 32
netperfs, SD improves.

But with multiple vhosts, guest is able to send more packets
and BW increases much more (SD too increases, but I think
that is expected). From the earlier results:

N#  BW1    BW2   (%)       SD1    SD2    (%)       RSD1   RSD2   (%)
_____________________________________________________________________
4   26387  40716 (54.30)   20     28     (40.00)   86     85     (-1.16)
8   24356  41843 (71.79)   88     129    (46.59)   372    362    (-2.68)
16  23587  40546 (71.89)   375    564    (50.40)   1558   1519   (-2.50)
32  22927  39490 (72.24)   1617   2171   (34.26)   6694   5722   (-14.52)
48  23067  39238 (70.10)   3931   5170   (31.51)   15823  13552  (-14.35)
64  22927  38750 (69.01)   7142   9914   (38.81)   28972  26173  (-9.66)
96  22568  38520 (70.68)   16258  27844  (71.26)   65944  73031  (10.74)
_____________________________________________________________________
(All tests were done without any tuning)

 From my testing:

1. Single vhost improves mq guest performance upto 16
netperfs but degrades after that.
2. Multiple vhost degrades single netperf guest
performance, but significantly improves performance
for any number of netperf sessions.

Likely cause for the 1 stream degradation with multiple
vhost patch:

1. Two vhosts run handling the RX and TX respectively.
I think the issue is related to cache ping-pong esp
since these run on different cpus/sockets.
2. I (re-)modified the patch to share RX with TX[0]. The
performance drop is the same, but the reason is the
guest is not using txq[0] in most cases (dev_pick_tx),
so vhost's rx and tx are running on different threads.
But whenever the guest uses txq[0], only one vhost
runs and the performance is similar to original.

I went back to my *submitted* patch and started a guest
with numtxq=16 and pinned every vhost to cpus #0&1. Now
whether guest used txq[0] or txq[n], the performance is
similar or better (between 10-27% across 10 runs) than
original code. Also, -6% to -24% improvement in SD.

I will start a full test run of original vs submitted
code with minimal tuning (Avi also suggested the same),
and re-send. Please let me know if you need any other
data.

Thanks,

- KK






Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-09 Thread Krishna Kumar2
Arnd Bergmann  wrote on 09/09/2010 04:10:27 PM:

> > > Can you live migrate a new guest from new-qemu/new-kernel
> > > to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
> > > If not, do we need to support all those cases?
> >
> > I have not tried this, though I added some minimal code in
> > virtio_net_load and virtio_net_save. I don't know what needs
> > to be done exactly at this time. I forgot to put this in the
> > "Next steps" list of things to do.
>
> I was mostly trying to find out if you think it should work
> or if there are specific reasons why it would not.
> E.g. when migrating to a machine that has an old qemu, the guest
> gets reduced to a single queue, but it's not clear to me how
> it can learn about this, or if it can get hidden by the outbound
> qemu.

I agree, I am also not sure how the old guest will handle this.
Sorry about my ignorance on migration :(

Regards,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-09 Thread Krishna Kumar2
Krishna Kumar2/India/IBM wrote on 09/09/2010 03:15:53 PM:

> I will start a full test run of original vs submitted
> code with minimal tuning (Avi also suggested the same),
> and re-send. Please let me know if you need any other
> data.

Same patch, only change is that I ran "taskset -p 03
", no other tuning on host or guest.
Default netperf without any options. The BW is the sum
across two iterations, each is 60secs. Guest is started
with 2 txqs.

BW1/BW2: BW for org & new in mbps
SD1/SD2: SD for org & new
RSD1/RSD2: Remote SD for org & new
_____________________________________________________________________________
#     BW1     BW2     (%)       SD1     SD2     (%)      RSD1     RSD2     (%)
_____________________________________________________________________________
1     20903   19422   (-7.08)   1       1       (0)      6        7        (16.66)
2     21963   24330   (10.77)   6       6       (0)      25       25       (0)
4     22042   31841   (44.45)   23      28      (21.73)  102      110      (7.84)
8     21674   32045   (47.84)   97      111     (14.43)  419      421      (.47)
16    22281   31361   (40.75)   379     551     (45.38)  1663     2110     (26.87)
24    22521   31945   (41.84)   857     981     (14.46)  3748     3742     (-.16)
32    22976   32473   (41.33)   1528    1806    (18.19)  6594     6885     (4.41)
40    23197   32594   (40.50)   2390    2755    (15.27)  10239    10450    (2.06)
48    22973   32757   (42.58)   3542    3786    (6.88)   15074    14395    (-4.50)
64    23809   32814   (37.82)   6486    6981    (7.63)   27244    26381    (-3.16)
80    23564   32682   (38.69)   10169   11133   (9.47)   43118    41397    (-3.99)
96    22977   33069   (43.92)   14954   15881   (6.19)   62948    59071    (-6.15)
128   23649   33032   (39.67)   27067   28832   (6.52)   113892   106096   (-6.84)
_____________________________________________________________________________
      294534  400371  (35.9)    67504   72858   (7.9)    285077   271096   (-4.9)
_____________________________________________________________________________

I will try more tuning later as Avi suggested, wanted to test
the minimal for now.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-09 Thread Arnd Bergmann
On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > > > The new guest and qemu code work with old vhost-net, just with
> reduced
> > > > performance, yes?
> > >
> > > Yes, I have tested new guest/qemu with old vhost but using
> > > #numtxqs=1 (or not passing any arguments at all to qemu to
> > > enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
> > > since vhost_net_set_backend in the unmodified vhost checks
> > > for boundary overflow.
> > >
> > > I have also tested running an unmodified guest with new
> > > vhost/qemu, but qemu should not specify numtxqs>1.
> >
> > Can you live migrate a new guest from new-qemu/new-kernel
> > to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
> > If not, do we need to support all those cases?
> 
> I have not tried this, though I added some minimal code in
> virtio_net_load and virtio_net_save. I don't know what needs
> to be done exactly at this time. I forgot to put this in the
> "Next steps" list of things to do.

I was mostly trying to find out if you think it should work
or if there are specific reasons why it would not.
E.g. when migrating to a machine that has an old qemu, the guest
gets reduced to a single queue, but it's not clear to me how
it can learn about this, or if it can get hidden by the outbound
qemu.

Arnd


Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-09 Thread Krishna Kumar2
> Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:

Some more results and likely cause for single netperf
degradation below.


> Guest -> Host (single netperf):
> I am getting a drop of almost 20%. I am trying to figure out
> why.
>
> Host -> guest (single netperf):
> I am getting an improvement of almost 15%. Again - unexpected.
>
> Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
> for runs upto 128 sessions. With fewer netperf (under 8), there
> was a drop of 3-7% in #packets, but beyond that, the #packets
> improved significantly to give an average improvement of 7.4%.
>
> So it seems that fewer sessions is having negative effect for
> some reason on the tx side. The code path in virtio-net has not
> changed much, so the drop in some cases is quite unexpected.

The drop for the single netperf seems to be due to multiple vhost.
I changed the patch to start *single* vhost:

Guest -> Host (1 netperf, 64K): BW: 10.79%, SD: -1.45%
Guest -> Host (1 netperf) : Latency: -3%, SD: 3.5%

Single vhost performs well but hits the barrier at 16 netperf
sessions:

SINGLE vhost (Guest -> Host):
1 netperf:BW: 10.7% SD: -1.4%
4 netperfs:   BW: 3%SD: 1.4%
8 netperfs:   BW: 17.7% SD: -10%
  16 netperfs:  BW: 4.7%  SD: -7.0%
  32 netperfs:  BW: -6.1% SD: -5.7%
BW and SD both improves (guest multiple txqs help). For 32
netperfs, SD improves.

But with multiple vhosts, guest is able to send more packets
and BW increases much more (SD too increases, but I think
that is expected). From the earlier results:

N#  BW1    BW2   (%)       SD1    SD2    (%)       RSD1   RSD2   (%)
_____________________________________________________________________
4   26387  40716 (54.30)   20     28     (40.00)   86     85     (-1.16)
8   24356  41843 (71.79)   88     129    (46.59)   372    362    (-2.68)
16  23587  40546 (71.89)   375    564    (50.40)   1558   1519   (-2.50)
32  22927  39490 (72.24)   1617   2171   (34.26)   6694   5722   (-14.52)
48  23067  39238 (70.10)   3931   5170   (31.51)   15823  13552  (-14.35)
64  22927  38750 (69.01)   7142   9914   (38.81)   28972  26173  (-9.66)
96  22568  38520 (70.68)   16258  27844  (71.26)   65944  73031  (10.74)
_____________________________________________________________________
(All tests were done without any tuning)

From my testing:

1. Single vhost improves mq guest performance upto 16
   netperfs but degrades after that.
2. Multiple vhost degrades single netperf guest
   performance, but significantly improves performance
   for any number of netperf sessions.

Likely cause for the 1 stream degradation with multiple
vhost patch:

1. Two vhosts run handling the RX and TX respectively.
   I think the issue is related to cache ping-pong esp
   since these run on different cpus/sockets.
2. I (re-)modified the patch to share RX with TX[0]. The
   performance drop is the same, but the reason is the
   guest is not using txq[0] in most cases (dev_pick_tx),
   so vhost's rx and tx are running on different threads.
   But whenever the guest uses txq[0], only one vhost
   runs and the performance is similar to original.

I went back to my *submitted* patch and started a guest
with numtxq=16 and pinned every vhost to cpus #0&1. Now
whether guest used txq[0] or txq[n], the performance is
similar or better (between 10-27% across 10 runs) than
original code. Also, -6% to -24% improvement in SD.

I will start a full test run of original vs submitted
code with minimal tuning (Avi also suggested the same),
and re-send. Please let me know if you need any other
data.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 09/08/2010 01:40:11 PM:

>
___

> >UDP (#numtxqs=8)
> > N#  BW1    BW2   (%)        SD1    SD2    (%)
> > __
> > 4   29836  56761 (90.24)    67     63     (-5.97)
> > 8   27666  63767 (130.48)   326    265    (-18.71)
> > 16  25452  60665 (138.35)   1396   1269   (-9.09)
> > 32  26172  63491 (142.59)   5617   4202   (-25.19)
> > 48  26146  64629 (147.18)   12813  9316   (-27.29)
> > 64  25575  65448 (155.90)   23063  16346  (-29.12)
> > 128 26454  63772 (141.06)   91054  85051  (-6.59)
> > __
> > N#: Number of netperf sessions, 90 sec runs
> > BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
> >   SD for original code
> > BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
> >   SD for new code. e.g. BW2=40716 means average BW2 was
> >   20358 mbps.
> >
>
> What happens with a single netperf?
> host -> guest performance with TCP and small packet speed
> are also worth measuring.

Guest -> Host (single netperf):
I am getting a drop of almost 20%. I am trying to figure out
why.

Host -> guest (single netperf):
I am getting an improvement of almost 15%. Again - unexpected.

Guest -> Host TCP_RR: I get an average 7.4% increase in #packets
for runs upto 128 sessions. With fewer netperf (under 8), there
was a drop of 3-7% in #packets, but beyond that, the #packets
improved significantly to give an average improvement of 7.4%.

So it seems that fewer sessions is having negative effect for
some reason on the tx side. The code path in virtio-net has not
changed much, so the drop in some cases is quite unexpected.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
> On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > > The new guest and qemu code work with old vhost-net, just with
reduced
> > > performance, yes?
> >
> > Yes, I have tested new guest/qemu with old vhost but using
> > #numtxqs=1 (or not passing any arguments at all to qemu to
> > enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
> > since vhost_net_set_backend in the unmodified vhost checks
> > for boundary overflow.
> >
> > I have also tested running an unmodified guest with new
> > vhost/qemu, but qemu should not specify numtxqs>1.
>
> Can you live migrate a new guest from new-qemu/new-kernel
> to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
> If not, do we need to support all those cases?

I have not tried this, though I added some minimal code in
virtio_net_load and virtio_net_save. I don't know what needs
to be done exactly at this time. I forgot to put this in the
"Next steps" list of things to do.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Arnd Bergmann
On Wednesday 08 September 2010, Krishna Kumar2 wrote:
> > The new guest and qemu code work with old vhost-net, just with reduced
> > performance, yes?
> 
> Yes, I have tested new guest/qemu with old vhost but using
> #numtxqs=1 (or not passing any arguments at all to qemu to
> enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
> since vhost_net_set_backend in the unmodified vhost checks
> for boundary overflow.
> 
> I have also tested running an unmodified guest with new
> vhost/qemu, but qemu should not specify numtxqs>1.

Can you live migrate a new guest from new-qemu/new-kernel
to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel?
If not, do we need to support all those cases?

Arnd


Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 09/08/2010 04:18:33 PM:

>
> > > > ___
> > > > TCP (#numtxqs=2)
> > > > N#  BW1    BW2   (%)       SD1    SD2    (%)       RSD1   RSD2   (%)
> > > > ___
> > > > 4   26387  40716 (54.30)   20     28     (40.00)   86     85     (-1.16)
> > > > 8   24356  41843 (71.79)   88     129    (46.59)   372    362    (-2.68)
> > > > 16  23587  40546 (71.89)   375    564    (50.40)   1558   1519   (-2.50)
> > > > 32  22927  39490 (72.24)   1617   2171   (34.26)   6694   5722   (-14.52)
> > > > 48  23067  39238 (70.10)   3931   5170   (31.51)   15823  13552  (-14.35)
> > > > 64  22927  38750 (69.01)   7142   9914   (38.81)   28972  26173  (-9.66)
> > > > 96  22568  38520 (70.68)   16258  27844  (71.26)   65944  73031  (10.74)
> > >
> > > That's a significant hit in TCP SD. Is it caused by the imbalance between
> > > number of queues for TX and RX? Since you mention RX is complete,
> > > maybe measure with a balanced TX/RX?
> >
> > Yes, I am not sure why it is so high.
>
> Any errors at higher levels? Are any packets reordered?

I haven't seen any messages logged, and retransmission is similar
to non-mq case. Device also has no errors/dropped packets. Anything
else I should look for?

On the host:

# ifconfig vnet0
vnet0 Link encap:Ethernet  HWaddr 9A:9D:99:E1:CA:CE
  inet6 addr: fe80::989d:99ff:fee1:cace/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:5090371 errors:0 dropped:0 overruns:0 frame:0
  TX packets:5054616 errors:0 dropped:0 overruns:65 carrier:0
  collisions:0 txqueuelen:500
  RX bytes:237793761392 (221.4 GiB)  TX bytes:333630070 (318.1 MiB)
# netstat -s  |grep -i retrans
1310 segments retransmited
35 times recovered from packet loss due to fast retransmit
1 timeouts after reno fast retransmit
41 fast retransmits
1236 retransmits in slow start

So retransmissions are 0.025% of total packets received from the guest.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Michael S. Tsirkin
On Wed, Sep 08, 2010 at 02:53:03PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin"  wrote on 09/08/2010 01:40:11 PM:
> 
> >
> > > ___
> > > TCP (#numtxqs=2)
> > > N#  BW1    BW2   (%)       SD1    SD2    (%)       RSD1   RSD2   (%)
> > > ___
> > > 4   26387  40716 (54.30)   20     28     (40.00)   86     85     (-1.16)
> > > 8   24356  41843 (71.79)   88     129    (46.59)   372    362    (-2.68)
> > > 16  23587  40546 (71.89)   375    564    (50.40)   1558   1519   (-2.50)
> > > 32  22927  39490 (72.24)   1617   2171   (34.26)   6694   5722   (-14.52)
> > > 48  23067  39238 (70.10)   3931   5170   (31.51)   15823  13552  (-14.35)
> > > 64  22927  38750 (69.01)   7142   9914   (38.81)   28972  26173  (-9.66)
> > > 96  22568  38520 (70.68)   16258  27844  (71.26)   65944  73031  (10.74)
> >
> > That's a significant hit in TCP SD. Is it caused by the imbalance between
> > number of queues for TX and RX? Since you mention RX is complete,
> > maybe measure with a balanced TX/RX?
> 
> Yes, I am not sure why it is so high.

Any errors at higher levels? Are any packets reordered?

> I found the same with #RX=#TX
> too. As a hack, I tried ixgbe without MQ (set "indices=1" before
> calling alloc_etherdev_mq, not sure if that is entirely correct) -
> here too SD worsened by around 40%. I can't explain it, since the
> virtio-net driver runs lock free once sch_direct_xmit gets
> HARD_TX_LOCK for the specific txq. Maybe the SD calculation is not strictly
> correct since
> more threads are now running parallel and load is higher? Eg, if you
> compare SD between
> #netperfs = 8 vs 16 for original code (cut-n-paste relevant columns
> only) ...
> 
> N# BWSD
> 8   24356   88
> 16 23587   375
> 
> ... SD has increased more than 4 times for the same BW.
> 
> > What happens with a single netperf?
> > host -> guest performance with TCP and small packet speed
> > are also worth measuring.
> 
> OK, I will do this and send the results later today.
> 
> > At some level, host/guest communication is easy in that we don't really
> > care which queue is used.  I would like to give some thought (and
> > testing) to how is this going to work with a real NIC card and packet
> > steering at the backend.
> > Any idea?
> 
> I have done a little testing with guest -> remote server both
> using a bridge and with macvtap (mq is required only for rx).
> I didn't understand what you mean by packet steering though,
> is it whether packets go out of the NIC on different queues?
> If so, I verified that is the case by putting a counter and
> displaying through /debug interface on the host. dev_queue_xmit
> on the host handles it by calling dev_pick_tx().
> 
> > > Guest interrupts for a 4 TXQ device after a 5 min test:
> > > # egrep "virtio0|CPU" /proc/interrupts
> > >       CPU0     CPU1     CPU2    CPU3
> > > 40:   0        0        0       0        PCI-MSI-edge  virtio0-config
> > > 41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
> > > 42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
> > > 43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
> > > 44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
> > > 45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3
> >
> > Does this mean each interrupt is constantly bouncing between CPUs?
> 
> Yes. I didn't do *any* tuning for the tests. The only "tuning"
> was to use 64K IO size with netperf. When I ran default netperf
> (16K), I got a little lesser improvement in BW and worse(!) SD
> than with 64K.
> 
> Thanks,
> 
> - KK


Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
Avi Kivity  wrote on 09/08/2010 02:58:21 PM:

> >>> 1. This feature was first implemented with a single vhost.
> >>>  Testing showed 3-8% performance gain for upto 8 netperf
> >>>  sessions (and sometimes 16), but BW dropped with more
> >>>  sessions.  However, implementing per-txq vhost improved
> >>>  BW significantly all the way to 128 sessions.
> >> Why were vhost kernel changes required?  Can't you just instantiate
more
> >> vhost queues?
> > I did try using a single thread processing packets from multiple
> > vq's on host, but the BW dropped beyond a certain number of
> > sessions.
>
> Oh - so the interface has not changed (which can be seen from the
> patch).  That was my concern, I remembered that we planned for vhost-net
> to be multiqueue-ready.
>
> The new guest and qemu code work with old vhost-net, just with reduced
> performance, yes?

Yes, I have tested new guest/qemu with old vhost but using
#numtxqs=1 (or not passing any arguments at all to qemu to
enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost,
since vhost_net_set_backend in the unmodified vhost checks
for boundary overflow.

I have also tested running an unmodified guest with new
vhost/qemu, but qemu should not specify numtxqs>1.

> > Are you suggesting this
> > combination:
> >IRQ on guest:
> >   40: CPU0
> >   41: CPU1
> >   42: CPU2
> >   43: CPU3 (all CPUs are on socket #0)
> >vhost:
> >   thread #0:  CPU0
> >   thread #1:  CPU1
> >   thread #2:  CPU2
> >   thread #3:  CPU3
> >qemu:
> >   thread #0:  CPU4
> >   thread #1:  CPU5
> >   thread #2:  CPU6
> >   thread #3:  CPU7 (all CPUs are on socket#1)
>
> May be better to put vcpu threads and vhost threads on the same socket.
>
> Also need to affine host interrupts.
>
> >netperf/netserver:
> >   Run on CPUs 0-4 on both sides
> >
> > The reason I did not optimize anything from user space is because
> > I felt showing the default works reasonably well is important.
>
> Definitely.  Heavy tuning is not a useful path for general end users.
> We need to make sure the scheduler is able to arrive at the optimal
> layout without pinning (but perhaps with hints).

OK, I will see if I can get results with this.

Thanks for your suggestions,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Avi Kivity

 On 09/08/2010 12:22 PM, Krishna Kumar2 wrote:

Avi Kivity  wrote on 09/08/2010 01:17:34 PM:


   On 09/08/2010 10:28 AM, Krishna Kumar wrote:

Following patches implement Transmit mq in virtio-net.  Also
included is the user qemu changes.

1. This feature was first implemented with a single vhost.
 Testing showed 3-8% performance gain for upto 8 netperf
 sessions (and sometimes 16), but BW dropped with more
 sessions.  However, implementing per-txq vhost improved
 BW significantly all the way to 128 sessions.

Why were vhost kernel changes required?  Can't you just instantiate more
vhost queues?

I did try using a single thread processing packets from multiple
vq's on host, but the BW dropped beyond a certain number of
sessions.


Oh - so the interface has not changed (which can be seen from the 
patch).  That was my concern, I remembered that we planned for vhost-net 
to be multiqueue-ready.


The new guest and qemu code work with old vhost-net, just with reduced 
performance, yes?



I don't have the code and performance numbers for that
right now since it is a bit ancient, I can try to resuscitate
that if you want.


No need.


Guest interrupts for a 4 TXQ device after a 5 min test:
# egrep "virtio0|CPU" /proc/interrupts
      CPU0     CPU1     CPU2    CPU3
40:   0        0        0       0        PCI-MSI-edge  virtio0-config
41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3

How are vhost threads and host interrupts distributed?  We need to move
vhost queue threads to be colocated with the related vcpu threads (if no
extra cores are available) or on the same socket (if extra cores are
available).  Similarly, move device interrupts to the same core as the
vhost thread.

All my testing was without any tuning, including binding netperf &
netserver (irqbalance is also off). I assume (maybe wrongly) that
the above might give better results?


I hope so!


Are you suggesting this
combination:
IRQ on guest:
40: CPU0
41: CPU1
42: CPU2
43: CPU3 (all CPUs are on socket #0)
vhost:
thread #0:  CPU0
thread #1:  CPU1
thread #2:  CPU2
thread #3:  CPU3
qemu:
thread #0:  CPU4
thread #1:  CPU5
thread #2:  CPU6
thread #3:  CPU7 (all CPUs are on socket#1)


May be better to put vcpu threads and vhost threads on the same socket.

Also need to affine host interrupts.
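
For reference, a rough host-side sketch of that kind of pinning
(illustrative only; in practice taskset(1) and a write to
/proc/irq/<N>/smp_affinity do the same job, and the vhost thread ids
and IRQ numbers have to be looked up first):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

/* Pin a (vhost or vcpu) thread to one CPU. */
static int pin_thread_to_cpu(pid_t tid, int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return sched_setaffinity(tid, sizeof(set), &set);
}

/* Point a host interrupt at the same CPU (hex bitmask). */
static int affine_irq_to_cpu(int irq, int cpu)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%x\n", 1 << cpu);
        fclose(f);
        return 0;
}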


netperf/netserver:
Run on CPUs 0-4 on both sides

The reason I did not optimize anything from user space is because
I felt showing the default works reasonably well is important.


Definitely.  Heavy tuning is not a useful path for general end users.  
We need to make sure the scheduler is able to arrive at the optimal 
layout without pinning (but perhaps with hints).


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
Hi Michael,

"Michael S. Tsirkin"  wrote on 09/08/2010 01:43:26 PM:

> On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
> > 1. mq RX patch is also complete - plan to submit once TX is OK.
>
> It's good that you split patches, I think it would be interesting to see
> the RX patches at least once to complete the picture.
> You could make it a separate patchset, tag them as RFC.

OK, I need to re-do some parts of it, since I started the TX only
branch a couple of weeks earlier and the RX side is outdated. I
will try to send that out in the next couple of days, as you say
it will help to complete the picture. Reasons to send it only TX
now:

- Reduce size of patch and complexity
- I didn't get much improvement on multiple RX patch (netperf from
  host -> guest), so needed some time to figure out the reason and
  fix it.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
"Michael S. Tsirkin"  wrote on 09/08/2010 01:40:11 PM:

>
> > ___
> > TCP (#numtxqs=2)
> > N#  BW1    BW2   (%)       SD1    SD2    (%)       RSD1   RSD2   (%)
> > ___
> > 4   26387  40716 (54.30)   20     28     (40.00)   86     85     (-1.16)
> > 8   24356  41843 (71.79)   88     129    (46.59)   372    362    (-2.68)
> > 16  23587  40546 (71.89)   375    564    (50.40)   1558   1519   (-2.50)
> > 32  22927  39490 (72.24)   1617   2171   (34.26)   6694   5722   (-14.52)
> > 48  23067  39238 (70.10)   3931   5170   (31.51)   15823  13552  (-14.35)
> > 64  22927  38750 (69.01)   7142   9914   (38.81)   28972  26173  (-9.66)
> > 96  22568  38520 (70.68)   16258  27844  (71.26)   65944  73031  (10.74)
>
> That's a significant hit in TCP SD. Is it caused by the imbalance between
> number of queues for TX and RX? Since you mention RX is complete,
> maybe measure with a balanced TX/RX?

Yes, I am not sure why it is so high. I found the same with #RX=#TX
too. As a hack, I tried ixgbe without MQ (set "indices=1" before
calling alloc_etherdev_mq, not sure if that is entirely correct) -
here too SD worsened by around 40%. I can't explain it, since the
virtio-net driver runs lock free once sch_direct_xmit gets
HARD_TX_LOCK for the specific txq. Maybe the SD calculation is not strictly
correct since
more threads are now running parallel and load is higher? Eg, if you
compare SD between
#netperfs = 8 vs 16 for original code (cut-n-paste relevant columns
only) ...

N# BWSD
8   24356   88
16 23587   375

... SD has increased more than 4 times for the same BW.

> What happens with a single netperf?
> host -> guest performance with TCP and small packet speed
> are also worth measuring.

OK, I will do this and send the results later today.

> At some level, host/guest communication is easy in that we don't really
> care which queue is used.  I would like to give some thought (and
> testing) to how is this going to work with a real NIC card and packet
> steering at the backend.
> Any idea?

I have done a little testing with guest -> remote server both
using a bridge and with macvtap (mq is required only for rx).
I didn't understand what you mean by packet steering though,
is it whether packets go out of the NIC on different queues?
If so, I verified that is the case by putting a counter and
displaying through /debug interface on the host. dev_queue_xmit
on the host handles it by calling dev_pick_tx().
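
The counter mentioned above could look roughly like this (illustrative
sketch only, all names are made up; the instrumentation actually used
in the test may differ):

#include <linux/debugfs.h>
#include <linux/module.h>
#include <linux/skbuff.h>

#define MAX_TXQS 16

/* Bumped in the xmit path: txq_pkts[skb_get_queue_mapping(skb)]++; */
static u32 txq_pkts[MAX_TXQS];
static struct dentry *dbg_dir;

static int __init txq_stats_init(void)
{
        char name[8];
        int i;

        /* Expose one read-only counter per TX queue under debugfs */
        dbg_dir = debugfs_create_dir("txq_stats", NULL);
        for (i = 0; i < MAX_TXQS; i++) {
                snprintf(name, sizeof(name), "txq%d", i);
                debugfs_create_u32(name, 0444, dbg_dir, &txq_pkts[i]);
        }
        return 0;
}
module_init(txq_stats_init);
MODULE_LICENSE("GPL");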

> > Guest interrupts for a 4 TXQ device after a 5 min test:
> > # egrep "virtio0|CPU" /proc/interrupts
> >       CPU0     CPU1     CPU2    CPU3
> > 40:   0        0        0       0        PCI-MSI-edge  virtio0-config
> > 41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
> > 42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
> > 43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
> > 44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
> > 45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3
>
> Does this mean each interrupt is constantly bouncing between CPUs?

Yes. I didn't do *any* tuning for the tests. The only "tuning"
was to use 64K IO size with netperf. When I ran default netperf
(16K), I got a little lesser improvement in BW and worse(!) SD
than with 64K.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar2
Avi Kivity  wrote on 09/08/2010 01:17:34 PM:

>   On 09/08/2010 10:28 AM, Krishna Kumar wrote:
> > Following patches implement Transmit mq in virtio-net.  Also
> > included is the user qemu changes.
> >
> > 1. This feature was first implemented with a single vhost.
> > Testing showed 3-8% performance gain for upto 8 netperf
> > sessions (and sometimes 16), but BW dropped with more
> > sessions.  However, implementing per-txq vhost improved
> > BW significantly all the way to 128 sessions.
>
> Why were vhost kernel changes required?  Can't you just instantiate more
> vhost queues?

I did try using a single thread processing packets from multiple
vq's on host, but the BW dropped beyond a certain number of
sessions. I don't have the code and performance numbers for that
right now since it is a bit ancient, I can try to resuscitate
that if you want.

> > Guest interrupts for a 4 TXQ device after a 5 min test:
> > # egrep "virtio0|CPU" /proc/interrupts
> >       CPU0     CPU1     CPU2    CPU3
> > 40:   0        0        0       0        PCI-MSI-edge  virtio0-config
> > 41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
> > 42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
> > 43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
> > 44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
> > 45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3
>
> How are vhost threads and host interrupts distributed?  We need to move
> vhost queue threads to be colocated with the related vcpu threads (if no
> extra cores are available) or on the same socket (if extra cores are
> available).  Similarly, move device interrupts to the same core as the
> vhost thread.

All my testing was without any tuning; I did not even bind netperf &
netserver (irqbalance is also off). I assume (maybe wrongly) that
such tuning would give better results? Are you suggesting this
combination:
IRQ on guest:
40: CPU0
41: CPU1
42: CPU2
43: CPU3 (all CPUs are on socket #0)
vhost:
thread #0:  CPU0
thread #1:  CPU1
thread #2:  CPU2
thread #3:  CPU3
qemu:
thread #0:  CPU4
thread #1:  CPU5
thread #2:  CPU6
thread #3:  CPU7 (all CPUs are on socket#1)
netperf/netserver:
Run on CPUs 0-4 on both sides

The reason I did not optimize anything from user space is that I felt
it was important to show that the defaults work reasonably well.

Thanks,

- KK



Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Michael S. Tsirkin
On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
> 1. mq RX patch is also complete - plan to submit once TX is OK.

It's good that you split the patches; I think it would be interesting to see
the RX patches at least once to complete the picture.
You could make it a separate patchset and tag them as RFC.

-- 
MST


Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Michael S. Tsirkin
On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:
> Following patches implement Transmit mq in virtio-net.  Also
> included is the user qemu changes.
> 
> 1. This feature was first implemented with a single vhost.
>    Testing showed 3-8% performance gain for up to 8 netperf
>sessions (and sometimes 16), but BW dropped with more
>sessions.  However, implementing per-txq vhost improved
>BW significantly all the way to 128 sessions.
> 2. For this mq TX patch, 1 daemon is created for RX and 'n'
>daemons for the 'n' TXQ's, for a total of (n+1) daemons.
>The (subsequent) RX mq patch changes that to a total of
>'n' daemons, where RX and TX vq's share 1 daemon.
> 3. Service Demand increases for TCP, but significantly
>improves for UDP.
> 4. Interoperability: Many combinations, but not all, of
>qemu, host, guest tested together.
> 
> 
>   Enabling mq on virtio:
>   ---
> 
> When the following options are passed to qemu:
> - smp > 1
> - vhost=on
> - mq=on (new option, default:off)
> then #txqueues = #cpus.  The #txqueues can be changed by using
> an optional 'numtxqs' option. e.g.  for a smp=4 guest:
> vhost=on,mq=on ->   #txqueues = 4
> vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
> vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2
> 
> 
>Performance (guest -> local host):
>---
> 
> System configuration:
> Host:  8 Intel Xeon, 8 GB memory
> Guest: 4 cpus, 2 GB memory
> All testing without any tuning, and TCP netperf with 64K I/O
> ______________________________________________________________________
>                            TCP (#numtxqs=2)
> N#   BW1    BW2    (%)       SD1    SD2    (%)       RSD1   RSD2   (%)
> ______________________________________________________________________
> 4    26387  40716  (54.30)   20     28     (40.00)   86     85     (-1.16)
> 8    24356  41843  (71.79)   88     129    (46.59)   372    362    (-2.68)
> 16   23587  40546  (71.89)   375    564    (50.40)   1558   1519   (-2.50)
> 32   22927  39490  (72.24)   1617   2171   (34.26)   6694   5722   (-14.52)
> 48   23067  39238  (70.10)   3931   5170   (31.51)   15823  13552  (-14.35)
> 64   22927  38750  (69.01)   7142   9914   (38.81)   28972  26173  (-9.66)
> 96   22568  38520  (70.68)   16258  27844  (71.26)   65944  73031  (10.74)

That's a significant hit in TCP SD. Is it caused by the imbalance between
number of queues for TX and RX? Since you mention RX is complete,
maybe measure with a balanced TX/RX?


> ______________________________________________________________________
>                            UDP (#numtxqs=8)
> N#   BW1    BW2    (%)        SD1    SD2    (%)
> ______________________________________________________________________
> 4    29836  56761  (90.24)    67     63     (-5.97)
> 8    27666  63767  (130.48)   326    265    (-18.71)
> 16   25452  60665  (138.35)   1396   1269   (-9.09)
> 32   26172  63491  (142.59)   5617   4202   (-25.19)
> 48   26146  64629  (147.18)   12813  9316   (-27.29)
> 64   25575  65448  (155.90)   23063  16346  (-29.12)
> 128  26454  63772  (141.06)   91054  85051  (-6.59)
> ______________________________________________________________________
> N#: Number of netperf sessions, 90 sec runs
> BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
>   SD for original code
> BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
>   SD for new code. e.g. BW2=40716 means average BW2 was
>   20358 mbps.
> 

What happens with a single netperf?
host -> guest performance with TCP and small packet speed
are also worth measuring.


>Next steps:
>---
> 
> 1. mq RX patch is also complete - plan to submit once TX is OK.
> 2. Cache-align data structures: I didn't see any BW/SD improvement
>after making the sq's (and similarly for vhost) cache-aligned
>statically:
> struct virtnet_info {
> ...
> struct send_queue sq[16] cacheline_aligned_in_smp;
> ...
> };
> 

At some level, host/guest communication is easy in that we don't really
care which queue is used.  I would like to give some thought (and
testing) to how this is going to work with a real NIC card and packet
steering at the backend.
Any idea?

> Guest interrupts for a 4 TXQ device after a 5 min test:
> # egrep "virtio0|CPU" /proc/interrupts 
>   CPU0 CPU1 CPU2CPU3   
> 40:        0        0        0       0   PCI-MSI-edge  virtio0-config
> 41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
> 42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
> 43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1

Re: [RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Avi Kivity

 On 09/08/2010 10:28 AM, Krishna Kumar wrote:

Following patches implement Transmit mq in virtio-net.  Also
included is the user qemu changes.

1. This feature was first implemented with a single vhost.
Testing showed 3-8% performance gain for up to 8 netperf
sessions (and sometimes 16), but BW dropped with more
sessions.  However, implementing per-txq vhost improved
BW significantly all the way to 128 sessions.


Why were vhost kernel changes required?  Can't you just instantiate more 
vhost queues?



2. For this mq TX patch, 1 daemon is created for RX and 'n'
daemons for the 'n' TXQ's, for a total of (n+1) daemons.
The (subsequent) RX mq patch changes that to a total of
'n' daemons, where RX and TX vq's share 1 daemon.
3. Service Demand increases for TCP, but significantly
improves for UDP.
4. Interoperability: Many combinations, but not all, of
qemu, host, guest tested together.


Please update the virtio-pci spec @ http://ozlabs.org/~rusty/virtio-spec/.



   Enabling mq on virtio:
   ---

When the following options are passed to qemu:
 - smp > 1
 - vhost=on
 - mq=on (new option, default:off)
then #txqueues = #cpus.  The #txqueues can be changed by using
an optional 'numtxqs' option. e.g.  for a smp=4 guest:
 vhost=on,mq=on             ->   #txqueues = 4
 vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
 vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2


Performance (guest ->  local host):
---

System configuration:
 Host:  8 Intel Xeon, 8 GB memory
 Guest: 4 cpus, 2 GB memory
All testing without any tuning, and TCP netperf with 64K I/O
_______________________________________________________________________
                           TCP (#numtxqs=2)
N#   BW1    BW2    (%)       SD1    SD2    (%)       RSD1   RSD2   (%)
_______________________________________________________________________
4    26387  40716  (54.30)   20     28     (40.00)   86     85     (-1.16)
8    24356  41843  (71.79)   88     129    (46.59)   372    362    (-2.68)
16   23587  40546  (71.89)   375    564    (50.40)   1558   1519   (-2.50)
32   22927  39490  (72.24)   1617   2171   (34.26)   6694   5722   (-14.52)
48   23067  39238  (70.10)   3931   5170   (31.51)   15823  13552  (-14.35)
64   22927  38750  (69.01)   7142   9914   (38.81)   28972  26173  (-9.66)
96   22568  38520  (70.68)   16258  27844  (71.26)   65944  73031  (10.74)
_______________________________________________________________________
                           UDP (#numtxqs=8)
N#   BW1    BW2    (%)        SD1    SD2    (%)
_______________________________________________________________________
4    29836  56761  (90.24)    67     63     (-5.97)
8    27666  63767  (130.48)   326    265    (-18.71)
16   25452  60665  (138.35)   1396   1269   (-9.09)
32   26172  63491  (142.59)   5617   4202   (-25.19)
48   26146  64629  (147.18)   12813  9316   (-27.29)
64   25575  65448  (155.90)   23063  16346  (-29.12)
128  26454  63772  (141.06)   91054  85051  (-6.59)


Impressive results.


__
N#: Number of netperf sessions, 90 sec runs
BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
   SD for original code
BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
   SD for new code. e.g. BW2=40716 means average BW2 was
   20358 mbps.


Next steps:
---

1. mq RX patch is also complete - plan to submit once TX is OK.
2. Cache-align data structures: I didn't see any BW/SD improvement
after making the sq's (and similarly for vhost) cache-aligned
statically:
 struct virtnet_info {
 ...
 struct send_queue sq[16] cacheline_aligned_in_smp;
 ...
 };

Guest interrupts for a 4 TXQ device after a 5 min test:
# egrep "virtio0|CPU" /proc/interrupts
   CPU0 CPU1 CPU2CPU3
40:        0        0        0       0   PCI-MSI-edge  virtio0-config
41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3


How are vhost threads and host interrupts distributed?  We need to move 
vhost queue threads to be colocated with the related vcpu threads (if no 
extra cores are available) or on the same socket (if extra cores are 
available).  Similarly, move device interrupts to the same core as the 
vhost thread.
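
For completeness, a minimal sketch of doing that placement by hand from
userspace: pin each vhost thread with sched_setaffinity() and steer the
matching host-side MSI vector by writing a CPU mask to
/proc/irq/N/smp_affinity. The pids, IRQ numbers and CPU ids below are made up
for illustration; in practice they come from ps and from /proc/interrupts on
the host:

/*
 * Pin (vhost) threads and their IRQs to the same CPUs.  All pids, IRQ
 * numbers and CPU ids here are illustrative placeholders.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

static int pin_task_to_cpu(pid_t pid, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(pid, sizeof(set), &set);
}

static int pin_irq_to_cpu(int irq, int cpu)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%x\n", 1u << cpu);	/* single-CPU hex mask */
	return fclose(f);
}

int main(void)
{
	pid_t vhost_pid[4] = { 4001, 4002, 4003, 4004 };  /* hypothetical */
	int   host_irq[4]  = { 60, 61, 62, 63 };          /* hypothetical */

	for (int i = 0; i < 4; i++) {
		pin_task_to_cpu(vhost_pid[i], i);  /* vhost thread i -> CPU i */
		pin_irq_to_cpu(host_irq[i], i);    /* matching IRQ   -> CPU i */
	}
	return 0;
}

The qemu vcpu threads can be pinned the same way (or with taskset), so that
vcpu i, vhost thread i and IRQ i end up on the same core, or at least on the
same socket.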




--
I have a truly marvellous patch that fixes the bug which this signature is
too narrow to contain.

[RFC PATCH 0/4] Implement multiqueue virtio-net

2010-09-08 Thread Krishna Kumar
Following patches implement Transmit mq in virtio-net.  Also
included is the user qemu changes.

1. This feature was first implemented with a single vhost.
   Testing showed 3-8% performance gain for up to 8 netperf
   sessions (and sometimes 16), but BW dropped with more
   sessions.  However, implementing per-txq vhost improved
   BW significantly all the way to 128 sessions.
2. For this mq TX patch, 1 daemon is created for RX and 'n'
   daemons for the 'n' TXQ's, for a total of (n+1) daemons.
   The (subsequent) RX mq patch changes that to a total of
   'n' daemons, where RX and TX vq's share 1 daemon.
3. Service Demand increases for TCP, but significantly
   improves for UDP.
4. Interoperability: Many combinations, but not all, of
   qemu, host, guest tested together.


  Enabling mq on virtio:
  ---

When the following options are passed to qemu:
- smp > 1
- vhost=on
- mq=on (new option, default:off)
then #txqueues = #cpus.  The #txqueues can be changed by using
an optional 'numtxqs' option. e.g.  for a smp=4 guest:
vhost=on,mq=on ->   #txqueues = 4
vhost=on,mq=on,numtxqs=8   ->   #txqueues = 8
vhost=on,mq=on,numtxqs=2   ->   #txqueues = 2


   Performance (guest -> local host):
   ---

System configuration:
Host:  8 Intel Xeon, 8 GB memory
Guest: 4 cpus, 2 GB memory
All testing without any tuning, and TCP netperf with 64K I/O
_______________________________________________________________________
                           TCP (#numtxqs=2)
N#   BW1    BW2    (%)       SD1    SD2    (%)       RSD1   RSD2   (%)
_______________________________________________________________________
4    26387  40716  (54.30)   20     28     (40.00)   86     85     (-1.16)
8    24356  41843  (71.79)   88     129    (46.59)   372    362    (-2.68)
16   23587  40546  (71.89)   375    564    (50.40)   1558   1519   (-2.50)
32   22927  39490  (72.24)   1617   2171   (34.26)   6694   5722   (-14.52)
48   23067  39238  (70.10)   3931   5170   (31.51)   15823  13552  (-14.35)
64   22927  38750  (69.01)   7142   9914   (38.81)   28972  26173  (-9.66)
96   22568  38520  (70.68)   16258  27844  (71.26)   65944  73031  (10.74)
_______________________________________________________________________
                           UDP (#numtxqs=8)
N#   BW1    BW2    (%)        SD1    SD2    (%)
_______________________________________________________________________
4    29836  56761  (90.24)    67     63     (-5.97)
8    27666  63767  (130.48)   326    265    (-18.71)
16   25452  60665  (138.35)   1396   1269   (-9.09)
32   26172  63491  (142.59)   5617   4202   (-25.19)
48   26146  64629  (147.18)   12813  9316   (-27.29)
64   25575  65448  (155.90)   23063  16346  (-29.12)
128  26454  63772  (141.06)   91054  85051  (-6.59)
__
N#: Number of netperf sessions, 90 sec runs
BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote
  SD for original code
BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote
  SD for new code. e.g. BW2=40716 means average BW2 was
  20358 mbps.


   Next steps:
   ---

1. mq RX patch is also complete - plan to submit once TX is OK.
2. Cache-align data structures: I didn't see any BW/SD improvement
   after making the sq's (and similarly for vhost) cache-aligned
   statically:
struct virtnet_info {
...
struct send_queue sq[16] cacheline_aligned_in_smp;
...
};
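
   The concern that the alignment addresses is false sharing between
   adjacent per-queue structures. A tiny, purely illustrative userspace
   version, assuming a 64-byte cache line (the counter fields are made up):

/*
 * False-sharing illustration: without alignment, stats for queue 0 and
 * queue 1 can sit on the same cache line, so two CPUs transmitting on
 * different queues still bounce that line between them.
 */
#include <stdint.h>

#define CACHE_LINE 64	/* assumed line size */

struct per_queue_stats_packed {
	uint64_t packets;
	uint64_t bytes;
};				/* 16 bytes: 4 queues share one 64-byte line */

struct per_queue_stats_aligned {
	uint64_t packets;
	uint64_t bytes;
} __attribute__((aligned(CACHE_LINE)));	/* each element gets its own line */

struct per_queue_stats_packed  packed_stats[16];
struct per_queue_stats_aligned aligned_stats[16];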

Guest interrupts for a 4 TXQ device after a 5 min test:
# egrep "virtio0|CPU" /proc/interrupts 
  CPU0 CPU1 CPU2CPU3   
40:        0        0        0       0   PCI-MSI-edge  virtio0-config
41:   126955   126912   126505  126940   PCI-MSI-edge  virtio0-input
42:   108583   107787   107853  107716   PCI-MSI-edge  virtio0-output.0
43:   300278   297653   299378  300554   PCI-MSI-edge  virtio0-output.1
44:   372607   374884   371092  372011   PCI-MSI-edge  virtio0-output.2
45:   162042   162261   163623  162923   PCI-MSI-edge  virtio0-output.3

Review/feedback appreciated.

Signed-off-by: Krishna Kumar 
---