Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
jason wang jasow...@redhat.com wrote on 11/16/2011 11:40:45 AM:

> > Hi Jason,
> >
> > Any thoughts in mind to solve the issue of flow handling?
>
> So far nothing concrete. Maybe getting some performance numbers first
> is better - it would let us know where we are. During the testing of my
> patchset, I found a big regression for small packet transmission, and
> more retransmissions were noticed. This may also be an issue of flow
> affinity. One interesting thing is to see whether this happens with
> your patches :)

I haven't got any results for small packets, but will run tests this week
and send an update. I remember my earlier patches having a regression for
small packets.

> I've played with a basic flow director implementation based on my
> series, which tries to make sure the packets of a flow are handled by
> the same vhost thread/guest vcpu. This is done by:
>
> - binding each virtqueue to a guest cpu
> - recording the hash-to-queue mapping when the guest sends packets, and
>   using this mapping to choose the virtqueue when forwarding packets to
>   the guest
>
> Tests show some benefit when receiving packets from an external host
> and when sending packets to the local host, but it hurts the
> performance of sending packets to a remote host. This is not a perfect
> solution as it cannot handle the guest moving processes among vcpus; I
> plan to try accelerated RFS and sharing the mapping between host and
> guest. Anyway, this is just for receiving; small packet sending needs
> more thought.

I don't recollect small packet performance for guest -> local host. Also,
using multiple tun devices on the bridge (instead of mq-tun) keeps the
rx/tx of a flow on a single vq. Then you can avoid mq-tun with its queue
selector function, etc. Have you tried it? I will run my tests this week
and get back.

thanks,

- KK
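To make the flow-director idea above concrete, here is a minimal sketch of
a hash-to-queue mapping table; the table size, all names, and the lack of
locking are illustrative assumptions, not Jason's actual code:

	#include <linux/types.h>

	/* Hypothetical hash -> queue table for the flow director idea. */
	#define FLOW_TABLE_SIZE	1024

	struct flow_entry {
		u32 rxhash;	/* skb_get_rxhash() of the flow */
		u16 queue;	/* vq the guest last transmitted on */
	};

	static struct flow_entry flow_table[FLOW_TABLE_SIZE];

	/* Learn the mapping when the guest sends a packet on 'txq'. */
	static void flow_record(u32 rxhash, u16 txq)
	{
		struct flow_entry *e = &flow_table[rxhash % FLOW_TABLE_SIZE];

		e->rxhash = rxhash;
		e->queue = txq;
	}

	/* Pick the RX vq when forwarding a packet to the guest. */
	static u16 flow_select_queue(u32 rxhash, u16 numqueues)
	{
		struct flow_entry *e = &flow_table[rxhash % FLOW_TABLE_SIZE];

		if (e->rxhash == rxhash)
			return e->queue;
		return rxhash % numqueues;	/* fall back to plain hashing */
	}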
Re: [RFC] kvm tools: Implement multiple VQ for virtio-net
Sasha Levin levinsasha...@gmail.com wrote on 11/14/2011 03:45:40 PM:

> > Why are both the bandwidth and latency performance dropping so
> > dramatically with multiple VQs?
>
> It looks like there's no hash sync between host and guest, which makes
> the RX VQ change for every packet. This is my guess.

Yes, I confirmed this happens for macvtap. I am using ixgbe - it calls
skb_record_rx_queue when a skb is allocated, but sets rxhash only when a
packet arrives. Macvtap relies on record_rx_queue ahead of rxhash (as part
of my patch making macvtap multiqueue), hence different skbs result in
macvtap selecting different vq's. Reordering macvtap to use rxhash first
results in all packets of a flow going to the same VQ. The code snippet is:

	{
		...
		if (!numvtaps)
			goto out;

		rxq = skb_get_rxhash(skb);
		if (rxq) {
			tap = rcu_dereference(vlan->taps[rxq % numvtaps]);
			if (tap)
				goto out;
		}

		if (likely(skb_rx_queue_recorded(skb))) {
			rxq = skb_get_rx_queue(skb);

			while (unlikely(rxq >= numvtaps))
				rxq -= numvtaps;

			tap = rcu_dereference(vlan->taps[rxq]);
			if (tap)
				goto out;
		}
	}

I will submit a patch for macvtap separately. I am also working on the
other issue pointed out - different vhost threads handling rx/tx of a
single flow.

thanks,

- KK
Re: [PATCHv2 RFC 0/4] virtio and vhost-net capacity handling
Michael S. Tsirkin m...@redhat.com wrote on 06/07/2011 09:38:30 PM:

> > This is on top of the patches applied by Rusty. Warning: untested.
> > Posting now to give people a chance to comment on the API.
>
> OK, this seems to have survived some testing so far, after I dropped
> patch 4 and fixed the build for patch 3 (build fixup patch sent in
> reply to the original). I'll be mostly offline until Sunday, would
> appreciate testing reports.

Hi Michael,

I ran the latest patches with 1K I/O (guest -> local host) and the results
are (60 sec run for each test case):

	_____________________________
	#sessions     BW%       SD%
	_____________________________
	1            -25.6      47.0
	2            -29.3      22.9
	4               .8       1.6
	8              1.6       0
	16            -1.6       4.1
	32            -5.3       2.1
	48            11.3      -7.8
	64            -2.8        .7
	96            -6.2        .6
	128          -10.6      12.7
	_____________________________
	BW: -4.8     SD: 5.4

I tested it again to see if the regression is fleeting (since the numbers
vary quite a bit for 1K I/O even between guest -> local host), but:

	_____________________________
	#sessions     BW%       SD%
	_____________________________
	1             14.0     -17.3
	2             19.9     -11.1
	4              7.9     -15.3
	8              9.6     -13.1
	16             1.2      -7.3
	32             -.6     -13.5
	48           -28.7      10.0
	64            -5.7       -.7
	96            -9.4      -8.1
	128           -9.4        .7
	_____________________________
	BW: -3.7     SD: -2.0

With 16K, there was an improvement in SD, but higher sessions seem to
slightly degrade BW/SD:

	_____________________________
	#sessions     BW%       SD%
	_____________________________
	1             30.9     -25.0
	2             16.5     -19.4
	4             -1.3       7.9
	8              1.4       6.2
	16             3.9      -5.4
	32             0         4.3
	48             -.5        .1
	64            32.1      -1.5
	96            -2.1      23.2
	128           -7.4       3.8
	_____________________________
	BW: 5.0      SD: 7.5

Thanks,

- KK
Re: [PATCHv2 RFC 0/4] virtio and vhost-net capacity handling
Krishna Kumar2/India/IBM@IBMIN wrote on 06/13/2011 07:02:27 PM:

> ...
> With 16K, there was an improvement in SD, but higher sessions seem to
> slightly degrade BW/SD:

I meant to say "With 16K, there was an improvement in BW" above. Again,
the numbers are not very reproducible; I will test with a remote host
also to see if I get more consistent numbers.

Thanks,

- KK

	_____________________________
	#sessions     BW%       SD%
	_____________________________
	1             30.9     -25.0
	2             16.5     -19.4
	4             -1.3       7.9
	8              1.4       6.2
	16             3.9      -5.4
	32             0         4.3
	48             -.5        .1
	64            32.1      -1.5
	96            -2.1      23.2
	128           -7.4       3.8
	_____________________________
	BW: 5.0      SD: 7.5
Re: [PATCHv2 RFC 0/4] virtio and vhost-net capacity handling
Michael S. Tsirkin m...@redhat.com wrote on 06/13/2011 07:05:13 PM:

> > I ran the latest patches with 1K I/O (guest -> local host) and the
> > results are (60 sec run for each test case):
>
> Hi!
> Did you apply this one: [PATCHv2 RFC 4/4] Revert "virtio: make add_buf
> return capacity remaining"? It turns out that that patch has a bug and
> should be reverted, only patches 1-3 should be applied. Could you
> confirm please?

No, I didn't apply that patch. I had also seen your earlier mail about
this patch breaking receive buffer processing if applied.

thanks,

- KK
Re: [PATCH RFC 3/3] virtio_net: limit xmit polling
> OK, I have something very similar, but I still dislike the "screw the
> latency" part: this path is exactly what the IBM guys seem to hit. So I
> created two functions: one tries to free a constant number and another
> one up to capacity. I'll post that now.

Please review this patch to see if it looks reasonable (inline and
attachment):

1. Picked comments/code from Michael's code and Rusty's review.
2. virtqueue_min_capacity() needs to be called only if it returned
   "empty" the last time it was called.
3. Fix return value bug in free_old_xmit_skbs (hangs guest).
4. Stop queue only if capacity is not enough for the next xmit.
5. Fix/clean some likely/unlikely checks (hopefully).
6. I think xmit_skb cannot return an error, since
   virtqueue_enable_cb_delayed() can return false only if 3/4th of the
   space became available, which is what we check.
7. The comments for free_old_xmit_skbs need to be made clearer (not done).

I have done some minimal netperf tests with this. With this patch,
add_buf returning capacity seems to be useful - it allows using fewer
virtio API calls.

(See attached file: patch)

Signed-off-by: Krishna Kumar krkum...@in.ibm.com
---
 drivers/net/virtio_net.c |  105 ++++++++++++++++++++++++---------------
 1 file changed, 64 insertions(+), 41 deletions(-)

diff -ruNp org/drivers/net/virtio_net.c new/drivers/net/virtio_net.c
--- org/drivers/net/virtio_net.c	2011-06-02 15:49:25.000000000 +0530
+++ new/drivers/net/virtio_net.c	2011-06-02 19:13:02.000000000 +0530
@@ -509,27 +509,43 @@ again:
 	return received;
 }

-/* Check capacity and try to free enough pending old buffers to enable queueing
- * new ones. If min_skbs > 0, try to free at least the specified number of skbs
- * even if the ring already has sufficient capacity. Return true if we can
- * guarantee that a following virtqueue_add_buf will succeed. */
-static bool free_old_xmit_skbs(struct virtnet_info *vi, int min_skbs)
+/* Return true if freed a skb, else false */
+static inline bool free_one_old_xmit_skb(struct virtnet_info *vi)
 {
 	struct sk_buff *skb;
 	unsigned int len;
-	bool r;

-	while ((r = virtqueue_min_capacity(vi->svq) < MAX_SKB_FRAGS + 2) ||
-	       min_skbs-- > 0) {
-		skb = virtqueue_get_buf(vi->svq, &len);
-		if (unlikely(!skb))
+	skb = virtqueue_get_buf(vi->svq, &len);
+	if (unlikely(!skb))
+		return 0;
+
+	pr_debug("Sent skb %p\n", skb);
+	vi->dev->stats.tx_bytes += skb->len;
+	vi->dev->stats.tx_packets++;
+	dev_kfree_skb_any(skb);
+	return 1;
+}
+
+static bool free_old_xmit_skbs(struct virtnet_info *vi, int to_free)
+{
+	bool empty = virtqueue_min_capacity(vi->svq) < MAX_SKB_FRAGS + 2;
+
+	do {
+		if (!free_one_old_xmit_skb(vi)) {
+			/* No more skbs to free up */
 			break;
-		pr_debug("Sent skb %p\n", skb);
-		vi->dev->stats.tx_bytes += skb->len;
-		vi->dev->stats.tx_packets++;
-		dev_kfree_skb_any(skb);
-	}
-	return r;
+		}
+
+		if (empty) {
+			/* Check again if there is enough space */
+			empty = virtqueue_min_capacity(vi->svq) <
+				MAX_SKB_FRAGS + 2;
+		} else {
+			--to_free;
+		}
+	} while (to_free > 0);
+
+	return !empty;
 }

 static int xmit_skb(struct virtnet_info *vi, struct sk_buff *skb)
@@ -582,46 +598,53 @@ static int xmit_skb(struct virtnet_info
 static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	int ret, n;
+	int capacity;

-	/* Free up space in the ring in case this is the first time we get
-	 * woken up after ring full condition. Note: this might try to free
-	 * more than strictly necessary if the skb has a small
-	 * number of fragments, but keep it simple. */
-	free_old_xmit_skbs(vi, 0);
+	/* Try to free 2 buffers for every 1 xmit, to stay ahead. */
+	free_old_xmit_skbs(vi, 2);

 	/* Try to transmit */
-	ret = xmit_skb(vi, skb);
+	capacity = xmit_skb(vi, skb);

-	/* Failure to queue is unlikely. It's not a bug though: it might happen
-	 * if we get an interrupt while the queue is still mostly full.
-	 * We could stop the queue and re-enable callbacks (and possibly return
-	 * TX_BUSY), but as this should be rare, we don't bother. */
-	if (unlikely(ret < 0)) {
+	if (unlikely(capacity < 0)) {
+		/*
+		 * Failure to queue should be impossible. The only way to
+		 * reach here is if we got a cb before 3/4th of space was
+		 * available. We could stop the queue and re-enable
+		 * callbacks (and possibly return TX_BUSY), but we don't
+		 * bother since this is
Re: [PATCH RFC 3/3] virtio_net: limit xmit polling
Michael S. Tsirkin m...@redhat.com wrote on 06/02/2011 08:13:46 PM:

> > Please review this patch to see if it looks reasonable:
>
> Hmm, since you decided to work on top of my patch, I'd appreciate
> split-up fixes.

OK (that also explains your next comment).

> > 1. Picked comments/code from MST's code and Rusty's review.
> > 2. virtqueue_min_capacity() needs to be called only if it returned
> >    "empty" the last time it was called.
> > 3. Fix return value bug in free_old_xmit_skbs (hangs guest).
> > 4. Stop queue only if capacity is not enough for next xmit.
>
> That's what we always did ...

I had made the patch against your patch, hence this change (sorry for the
confusion!).

> > 5. Fix/clean some likely/unlikely checks (hopefully).
> >
> > I have done some minimal netperf tests with this. With this patch,
> > add_buf returning capacity seems to be useful - it allows fewer
> > virtio API calls.
>
> Why bother? It's cheap ...

If add_buf retains its functionality of returning the capacity (it is
going to need a change to return 0 otherwise anyway), is it useful to
call another function at each xmit?

> > +static bool free_old_xmit_skbs(struct virtnet_info *vi, int to_free)
> > +{
> > +	bool empty = virtqueue_min_capacity(vi->svq) < MAX_SKB_FRAGS + 2;
> > +
> > +	do {
> > +		if (!free_one_old_xmit_skb(vi)) {
> > +			/* No more skbs to free up */
> >  			break;
> > -		pr_debug("Sent skb %p\n", skb);
> > -		vi->dev->stats.tx_bytes += skb->len;
> > -		vi->dev->stats.tx_packets++;
> > -		dev_kfree_skb_any(skb);
> > -	}
> > -	return r;
> > +		}
> > +
> > +		if (empty) {
> > +			/* Check again if there is enough space */
> > +			empty = virtqueue_min_capacity(vi->svq) <
> > +				MAX_SKB_FRAGS + 2;
> > +		} else {
> > +			--to_free;
> > +		}
> > +	} while (to_free > 0);
> > +
> > +	return !empty;
> >  }
>
> Why bother doing the capacity check in this function?

To return whether we have enough space for the next xmit. It should call
virtqueue_min_capacity() only once unless space is running out. Does it
sound OK?

> > +	if (unlikely(capacity < 0)) {
> > +		/*
> > +		 * Failure to queue should be impossible. The only way to
> > +		 * reach here is if we got a cb before 3/4th of space was
> > +		 * available. We could stop the queue and re-enable
> > +		 * callbacks (and possibly return TX_BUSY), but we don't
> > +		 * bother since this is impossible.
>
> It's far from impossible. The 3/4 thing is only a hint, and old devices
> don't support it anyway.

OK, I will put your comment back.

> > -	if (!likely(free_old_xmit_skbs(vi, 2))) {
> > -		netif_stop_queue(dev);
> > -		if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
> > -			/* More just got used, free them and recheck. */
> > -			if (!likely(free_old_xmit_skbs(vi, 0))) {
> > -				netif_start_queue(dev);
> > -				virtqueue_disable_cb(vi->svq);
> > +	/*
> > +	 * Apparently nice girls don't return TX_BUSY; check capacity and
> > +	 * stop the queue before it gets out of hand. Naturally, this
> > +	 * wastes entries.
> > +	 */
> > +	if (capacity < 2+MAX_SKB_FRAGS) {
> > +		/*
> > +		 * We don't have enough space for the next packet. Try
> > +		 * freeing more.
> > +		 */
> > +		if (likely(!free_old_xmit_skbs(vi, UINT_MAX))) {
> > +			netif_stop_queue(dev);
> > +			if (unlikely(!virtqueue_enable_cb_delayed(vi->svq))) {
> > +				/* More just got used, free them and recheck. */
> > +				if (likely(free_old_xmit_skbs(vi, UINT_MAX))) {
>
> Is this where the bug was?

Return value in free_old_xmit() was wrong. I will re-do against the
mainline kernel.

Thanks,

- KK
Re: [PATCH RFC 3/3] virtio_net: limit xmit polling
Michael S. Tsirkin m...@redhat.com wrote on 06/02/2011 09:04:23 PM:

> > Is this where the bug was? Return value in free_old_xmit() was wrong.
> > I will re-do against the mainline kernel. Thanks, - KK
>
> Just noting that I'm working on that patch as well, it might be more
> efficient if we don't both of us do this in parallel :)

OK, but my intention was to work on an alternate approach, which was the
reason to base it against your patch. I will check your latest patch.

thanks,

- KK
[PERF RESULTS] virtio and vhost-net performance enhancements
Michael S. Tsirkin m...@redhat.com wrote on 05/20/2011 04:40:07 AM:

> OK, here is the large patchset that implements the virtio spec update
> that I sent earlier (the spec itself needs a minor update, will send
> that out too next week, but I think we are on the same page here
> already). It supersedes the PUBLISH_USED_IDX patches I sent out
> earlier.

I was able to get this tested by applying the v2 patches to the git-next
tree (somehow MST's git tree hung on my guest, which never got resolved).
Testing was from Guest -> Remote node, using an ixgbe 10g card. The test
results are *excellent* (table: #netperf sessions, BW% improvement, SD%
improvement, CPU% improvement):

	___________________________________
	512 byte I/O
	#       BW%      SD%      CPU%
	___________________________________
	1       151.6    -65.1    -10.7
	2       180.6    -66.6    -6.4
	4       15.5     -35.8    -26.1
	8       1.8      -28.4    -26.7
	16      3.1      -29.0    -26.5
	32      1.1      -27.4    -27.5
	64      3.8      -30.9    -26.7
	96      5.4      -21.7    -24.2
	128     5.7      -24.4    -25.5
	___________________________________
	BW: 16.6%    SD: -24.6%    CPU: -25.5%

	___________________________________
	1K I/O
	#       BW%      SD%      CPU%
	___________________________________
	1       233.9    -76.5    -18.0
	2       112.2    -64.0    -23.2
	4       9.2      -31.6    -26.1
	8       -1.7     -26.8    -30.3
	16      3.5      -31.5    -30.6
	32      4.8      -25.2    -30.5
	64      5.7      -31.0    -28.9
	96      5.3      -32.2    -31.7
	128     4.6      -38.2    -33.6
	___________________________________
	BW: 16.4%    SD: -35%    CPU: -31.5%

	___________________________________
	16K I/O
	#       BW%      SD%      CPU%
	___________________________________
	1       18.8     -27.2    -18.3
	2       14.8     -36.7    -27.7
	4       12.7     -45.2    -38.1
	8       4.4      -56.4    -54.4
	16      4.8      -38.3    -36.1
	32      0        78.0     79.2
	64      3.8      -38.1    -37.5
	96      7.3      -35.2    -31.1
	128     3.4      -31.1    -32.1
	___________________________________
	BW: 7.6%    SD: -30.1%    CPU: -23.7%

I plan to run some more tests tomorrow. Please let me know if any other
scenario would help.

Thanks,

- KK
Re: [PERF RESULTS] virtio and vhost-net performance enhancements
Shirley Ma x...@us.ibm.com wrote on 05/26/2011 09:12:22 PM:

> Could you please try TCP_RRs as well?

Right. Here's the result for TCP_RR:

	______________________________________
	#       RR%      SD%      CPU%
	______________________________________
	1       4.5      -31.4    -27.9
	2       5.1      -9.7     -5.4
	4       60.4     -13.4    38.8
	8       67.8     -13.5    45.0
	16      55.8     -8.0     43.2
	32      66.9     -14.1    43.3
	64      47.2     -23.7    12.2
	96      29.7     -11.8    14.3
	128     8.0      2.2      10.7
	______________________________________
	RR: 37.3%    SD: -6.7%    CPU: 15.7%
	______________________________________

Thanks,

- KK
Re: [PERF RESULTS] virtio and vhost-net performance enhancements
Krishna Kumar2/India/IBM wrote on 05/26/2011 09:51:32 PM:

> > Could you please try TCP_RRs as well?
>
> Right. Here's the result for TCP_RR:

The actual transaction rates/second are:

	______________________________________________________________
	#     RR1     RR2     (%)      SD1        SD2        (%)
	______________________________________________________________
	1     9476    9903    (4.5)    28.9       19.8       (-31.4)
	2     17337   18225   (5.1)    92.7       83.7       (-9.7)
	4     17385   27902   (60.4)   364.8      315.8      (-13.4)
	8     25560   42912   (67.8)   1428.1     1234.0     (-13.5)
	16    35898   55934   (55.8)   4391.6     4038.1     (-8.0)
	32    48048   80228   (66.9)   17391.4    14932.0    (-14.1)
	64    60412   88929   (47.2)   71087.7    54230.1    (-23.7)
	96    71263   92439   (29.7)   145434.1   128214.0   (-11.8)
	128   84208   91014   (8.0)    233668.2   23.6       (2.2)
	______________________________________________________________
	RR: 37.3%    SD: -6.7%
	______________________________________________________________

Thanks,

- KK
Re: [PATCHv2 10/14] virtio_net: limit xmit polling
Michael S. Tsirkin m...@redhat.com wrote on 05/23/2011 04:49:00 PM:

> > To do this properly, we should really be using the actual number of
> > sg elements needed, but we'd have to do most of xmit_skb beforehand
> > so we know how many.
> >
> > Cheers,
> > Rusty.
>
> Maybe I'm confused here. The problem isn't the failing add_buf for the
> given skb IIUC. What we are trying to do here is stop the queue *before
> xmit_skb fails*. We can't look at the number of fragments in the
> current skb - the next one can be much larger. That's why we check
> capacity after xmit_skb, not before it, right?

Maybe Rusty means it is a simpler model to free the amount of space that
this xmit needs. We will still fail anyway at some point, but it is
unlikely, since each earlier iteration freed up at least the space that
it was going to use. The code could become much simpler (see the sg-count
sketch after this message):

	start_xmit()
	{
		num_sgs = get num_sgs for this skb;

		/* Free enough pending old buffers to enable queueing this one */
		free_old_xmit_skbs(vi, num_sgs * 2);	/* ?? */

		if (virtqueue_get_capacity() < num_sgs) {
			netif_stop_queue(dev);
			if (virtqueue_enable_cb_delayed(vi->svq) ||
			    free_old_xmit_skbs(vi, num_sgs)) {
				/* Nothing freed up, or not enough freed up */
				kfree_skb(skb);
				return NETDEV_TX_OK;
			}
			netif_start_queue(dev);
			virtqueue_disable_cb(vi->svq);
		}

		/* xmit_skb cannot fail now, also pass 'num_sgs' */
		xmit_skb(vi, skb, num_sgs);
		virtqueue_kick(vi->svq);

		skb_orphan(skb);
		nf_reset(skb);

		return NETDEV_TX_OK;
	}

We could even return TX_BUSY since that makes the dequeue code more
efficient. See dev_dequeue_skb() - you can skip a lot of code (and avoid
taking locks) to check if the queue is already stopped, but that code
runs only if you return TX_BUSY in the earlier iteration.

BTW, shouldn't the check in start_xmit be:

	if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
		...
	}

Thanks,

- KK
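For the "get num_sgs for this skb" step above, a minimal sketch of how the
slot count could be derived, assuming the usual virtio-net layout of one
slot for the virtio header, one for the linear part and one per page
fragment (the helper name is made up):

	#include <linux/skbuff.h>

	/* Hypothetical helper: TX ring slots this skb would consume,
	 * assuming hdr + linear data + one slot per page fragment. */
	static unsigned int skb_xmit_slots(const struct sk_buff *skb)
	{
		return 2 + skb_shinfo(skb)->nr_frags;
	}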
Re: [PATCHv2 10/14] virtio_net: limit xmit polling
Michael S. Tsirkin m...@redhat.com wrote on 05/24/2011 02:42:55 PM:

> > > > To do this properly, we should really be using the actual number
> > > > of sg elements needed, but we'd have to do most of xmit_skb
> > > > beforehand so we know how many.
> > > >
> > > > Cheers,
> > > > Rusty.
> > >
> > > Maybe I'm confused here. The problem isn't the failing add_buf for
> > > the given skb IIUC. What we are trying to do here is stop the queue
> > > *before xmit_skb fails*. We can't look at the number of fragments
> > > in the current skb - the next one can be much larger. That's why we
> > > check capacity after xmit_skb, not before it, right?
> >
> > Maybe Rusty means it is a simpler model to free the amount of space
> > that this xmit needs. We will still fail anyway at some point, but it
> > is unlikely, since each earlier iteration freed up at least the space
> > that it was going to use.
>
> Not sure I understand. We can't know space is freed in the previous
> iteration as buffers might not have been used by then.

Yes, the first few iterations may not have freed up space, but later ones
should. The amount of free space should increase from then on, especially
since we try to free double of what we consume.

> > The code could become much simpler:
> >
> >	start_xmit()
> >	{
> >		num_sgs = get num_sgs for this skb;
> >
> >		/* Free enough pending old buffers to enable queueing this one */
> >		free_old_xmit_skbs(vi, num_sgs * 2);	/* ?? */
> >
> >		if (virtqueue_get_capacity() < num_sgs) {
> >			netif_stop_queue(dev);
> >			if (virtqueue_enable_cb_delayed(vi->svq) ||
> >			    free_old_xmit_skbs(vi, num_sgs)) {
> >				/* Nothing freed up, or not enough freed up */
> >				kfree_skb(skb);
> >				return NETDEV_TX_OK;
>
> This packet drop is what we wanted to avoid. Please see below on
> returning NETDEV_TX_BUSY.

> >			}
> >			netif_start_queue(dev);
> >			virtqueue_disable_cb(vi->svq);
> >		}
> >
> >		/* xmit_skb cannot fail now, also pass 'num_sgs' */
> >		xmit_skb(vi, skb, num_sgs);
> >		virtqueue_kick(vi->svq);
> >
> >		skb_orphan(skb);
> >		nf_reset(skb);
> >
> >		return NETDEV_TX_OK;
> >	}
> >
> > We could even return TX_BUSY since that makes the dequeue code more
> > efficient. See dev_dequeue_skb() - you can skip a lot of code (and
> > avoid taking locks) to check if the queue is already stopped, but
> > that code runs only if you return TX_BUSY in the earlier iteration.
> >
> > BTW, shouldn't the check in start_xmit be:
> >
> >	if (likely(!free_old_xmit_skbs(vi, 2+MAX_SKB_FRAGS))) {
> >		...
> >	}
>
> I thought we used to do basically this but other devices moved to a
> model where they stop *before* queueing fails, so we did too.

I am not sure why it was changed, since returning TX_BUSY seems more
efficient IMHO. qdisc_restart() handles requeued packets much better than
a stopped queue, as a significant part of this code is skipped if gso_skb
is present (the qdisc will eventually start dropping packets when
tx_queue_len is exceeded anyway).

Thanks,

- KK
Re: [PATCHv2 10/14] virtio_net: limit xmit polling
Michael S. Tsirkin m...@redhat.com wrote on 05/24/2011 04:59:39 PM:

> > > > Maybe Rusty means it is a simpler model to free the amount of
> > > > space that this xmit needs. We will still fail anyway at some
> > > > point, but it is unlikely, since each earlier iteration freed up
> > > > at least the space that it was going to use.
> > >
> > > Not sure I understand. We can't know space is freed in the previous
> > > iteration as buffers might not have been used by then.
> >
> > Yes, the first few iterations may not have freed up space, but later
> > ones should. The amount of free space should increase from then on,
> > especially since we try to free double of what we consume.
>
> Hmm. This is only an upper limit on the # of entries in the queue.
> Assume that the vq size is 4 and we transmit 4 entries without getting
> anything in the used ring. The next transmit will fail. So I don't
> really see why it's unlikely that we reach the packet drop code with
> your patch.

I was assuming 256 entries :) I will try to get some numbers tomorrow to
see how often this is true.

> > I am not sure why it was changed, since returning TX_BUSY seems more
> > efficient IMHO. qdisc_restart() handles requeued packets much better
> > than a stopped queue, as a significant part of this code is skipped
> > if gso_skb is present
>
> I think this is the argument:
> http://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg06364.html

Thanks for digging up that thread! Yes, that one skb would get sent first
ahead of possibly higher priority skbs. However, from a performance point
of view, the TX_BUSY code skips a lot of checks and code for all
subsequent packets till the device is restarted. I can test performance
with both cases and report what I find (the requeue code has become very
simple and clean, from horribly complex, thanks to Herbert and Dave).

> > (the qdisc will eventually start dropping packets when tx_queue_len
> > is exceeded anyway)
>
> tx_queue_len is a pretty large buffer so maybe no.

I remember seeing tons of drops (pfifo_fast_enqueue) when xmit returns
TX_BUSY.

> I think the packet drops from the scheduler queue can also be done
> intelligently (e.g. with CHOKe), which should work better than dropping
> a random packet?

I am not sure of that - choke_enqueue checks against a random skb to drop
the current skb, and only during congestion. But for my sample driver
xmit, returning TX_BUSY could still allow it to be used with CHOKe.

thanks,

- KK
Re: [PATCH 00/18] virtio and vhost-net performance enhancements
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 02:20:18 AM:

> [PATCH 00/18] virtio and vhost-net performance enhancements
>
> OK, here's a large patchset that implements the virtio spec update
> that I sent earlier. It supersedes the PUBLISH_USED_IDX patches I sent
> out earlier.
>
> I know it's a lot to ask but please test, and please consider for
> 2.6.40 :)
>
> I see nice performance improvements: one run showed going from 12 to
> 18 Gbit/s host to guest with netperf, but I did not spend a lot of
> time testing performance, so no guarantees it's not a fluke. I hope
> others will try this out and report. Pls note I will be away from
> keyboard for the next week.

I tested with the git tree (which also contains the later additional
patch), and get this error on the guest:

	May 11 08:06:08 localhost kernel: net eth0: Unexpected TX queue failure: -28
	May 11 08:06:08 localhost kernel: net eth0: Unexpected TX queue failure: -28
	May 11 08:06:08 localhost kernel: net eth0: Unexpected TX queue failure: -28
	May 11 08:06:08 localhost kernel: net eth0: Unexpected TX queue failure: -28
	...

The network stops after that and requires a modprobe restart to get it
working again. This is with the new qemu/vhost/virtio-net. Please let me
know if I am missing something.

thanks,

- KK
Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 09:04:13 PM:

> > I haven't tuned the threshold, it is left at 3/4. I ran the new
> > qemu/vhost/guest, and the results for 1K, 2K and 16K are below. Note
> > this is a different kernel version from my earlier test results. So,
> > f.e., BW1 represents 2.6.39-rc2, the original kernel; while BW2
> > represents 2.6.37-rc5 (MST's kernel).
>
> Weird. My kernel is actually 2.6.39-rc2. So which is which?

I cloned git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git

	# git branch -a
	  vhost
	* vhost-net-next-event-idx-v1
	  remotes/origin/HEAD -> origin/vhost
	  remotes/origin/for-linus
	  remotes/origin/master
	  remotes/origin/net-2.6
	  remotes/origin/vhost
	  remotes/origin/vhost-broken
	  remotes/origin/vhost-devel
	  remotes/origin/vhost-mrg-rxbuf
	  remotes/origin/vhost-net
	  remotes/origin/vhost-net-next
	  remotes/origin/vhost-net-next-event-idx-v1
	  remotes/origin/vhost-net-next-rebased
	  remotes/origin/virtio-layout-aligned
	  remotes/origin/virtio-layout-minimal
	  remotes/origin/virtio-layout-original
	  remotes/origin/virtio-layout-padded
	  remotes/origin/virtio-publish-used
	# git checkout vhost-net-next-event-idx-v1
	Already on 'vhost-net-next-event-idx-v1'
	# head -4 Makefile
	VERSION = 2
	PATCHLEVEL = 6
	SUBLEVEL = 37
	EXTRAVERSION = -rc5

I am not sure what I am missing.

thanks,

- KK
Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 02:53:59 AM:

> > Not hope exactly. If the device is not ready, then the packet is
> > requeued. The main idea is to avoid drops/stops/starts, etc.
>
> Yes, I see that, definitely. I guess it's a win if the interrupt takes
> at least a jiffy to arrive anyway, and a loss if not. Is there some
> reason interrupts might be delayed until the next jiffy?

I can explain this a bit, as I have three debug counters in start_xmit()
just for this:

1. Whether the current xmit call was "good", i.e. we had returned BUSY
   last time and this xmit was successful.
2. Whether the current xmit call was "bad", i.e. we had returned BUSY
   last time and this xmit still failed.
3. The free capacity when we *resumed* xmits. This is after calling
   free_old_xmit_skbs, where this function is not throttled; in effect
   it processes *all* the completed skbs. This counter is a sum:

	if (If_I_had_returned_EBUSY_last_iteration)
		free_slots += virtqueue_get_capacity();

The counters after a 30 min run of 1K, 2K and 16K netperf sessions are:

	Good:         1059172
	Bad:          31226
	Sum of slots: 47551557

(The total of Good + Bad tallies with the total number of requeues as
shown by tc:

	qdisc pfifo_fast 0: root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
	Sent 1560854473453 bytes 1075873684 pkt (dropped 718379, overlimits 0 requeues 1090398)
	backlog 0b 0p requeues 1090398
)

It shows that 2.9% of the time, 1 jiffy was not enough to free up space
in the txq. That could also mean that we had set xmit_restart just before
jiffies changed. But the average free capacity when we *resumed* xmits
is:

	Sum of slots / (Good + Bad) = 43

So the delay of 1 jiffy helped the host clean up, on average, just 43
entries, which is 16% of the total entries. This is intended to show that
the guest is not sitting idle waiting for the jiffy to expire.

> > > I can post it, mind testing this?
> >
> > Sure.
>
> Just posted. Would appreciate feedback.

Do I need to apply all the patches and simply test?

Thanks,

- KK
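A minimal sketch of how such debug counters could be wired into
start_xmit(); the flag, the counter names and the virtqueue_get_capacity()
call are illustrative assumptions based on the description above, not the
actual debug patch:

	/* Hypothetical debug counters for xmits that follow a BUSY return;
	 * 'returned_busy_last_time' is an assumed per-device flag. */
	static unsigned long xmit_good;		/* xmit after BUSY succeeded  */
	static unsigned long xmit_bad;		/* xmit after BUSY failed too */
	static unsigned long free_slots;	/* sum of capacity at resume  */

	static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		struct virtnet_info *vi = netdev_priv(dev);
		bool resumed = vi->returned_busy_last_time;
		int err;

		if (resumed) {
			/* How much did the host free while we were stopped? */
			free_slots += virtqueue_get_capacity(vi->svq);
			vi->returned_busy_last_time = false;
		}

		err = xmit_skb(vi, skb);
		if (resumed) {
			if (err >= 0)
				xmit_good++;
			else
				xmit_bad++;
		}

		if (err < 0) {
			vi->returned_busy_last_time = true;
			return NETDEV_TX_BUSY;
		}

		/* ... kick, orphan, etc. as in the real driver ... */
		return NETDEV_TX_OK;
	}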
Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 02:34:39 PM:

> > It shows that 2.9% of the time, 1 jiffy was not enough to free up
> > space in the txq.
>
> How common is it to free up space in *less than* 1 jiffy?

True, but the point is that the space freed is just enough for 43
entries; keeping the delay lower means a flood of (pseudo) stops and
restarts.

> > That could also mean that we had set xmit_restart just before jiffies
> > changed. But the average free capacity when we *resumed* xmits is:
> >
> >	Sum of slots / (Good + Bad) = 43
> >
> > So the delay of 1 jiffy helped the host clean up, on average, just 43
> > entries, which is 16% of the total entries. This is intended to show
> > that the guest is not sitting idle waiting for the jiffy to expire.
>
> OK, nice, this is exactly what my patchset is trying to do, without
> playing with timers: tell the host to interrupt us after 3/4 of the
> ring is free.
>
> Why 3/4 and not all of the ring? My hope is we can get some parallelism
> with the host this way. Why 3/4 and not 7/8? No idea :)

> > > Just posted. Would appreciate feedback.
> >
> > Do I need to apply all the patches and simply test?
>
> Exactly. You can also try to tune the threshold for interrupts as well.

Could you send me (privately) the entire virtio-net/vhost patch in a
single file? It will help me quite a bit :) Either attachment or inline
is fine.

thanks,

- KK
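For a sense of the arithmetic behind the 3/4 threshold discussed above: a
sketch assuming a 256-entry TX ring and an event-index style of delayed
callback (the helper below is illustrative, not the actual event-idx
implementation):

	/* Illustrative only: ask the host to interrupt after it has
	 * consumed 3/4 of the buffers currently in flight. With a full
	 * 256-entry ring this is one interrupt per ~192 completions,
	 * leaving 1/4 of the ring for the guest to keep transmitting
	 * into while the interrupt is being delivered. */
	static u16 delayed_event_at(u16 last_used_idx, u16 in_flight)
	{
		return last_used_idx + (in_flight * 3) / 4;
	}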
Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 03:42:29 PM:

> > > > It shows that 2.9% of the time, 1 jiffy was not enough to free up
> > > > space in the txq.
> > >
> > > How common is it to free up space in *less than* 1 jiffy?
> >
> > True,
>
> Sorry, which statement do you say is true? That an interrupt after less
> than 1 jiffy is common?

I meant to say that, 97% of the time, space was enough for the next xmit
to succeed. This is keeping in mind that on average 43 slots were freed
up, indicating that the guest was not waiting around for too long.

Regarding whether interrupts in less than 1 jiffy are common: I think
most of the time they should be, but increasing the limit on when to do
the cb would push it toward a jiffy. To confirm, I just put some counters
in the original code and found that interrupts happen in less than a
jiffy around 96.75% of the time; only 3.25% took 1 jiffy. But as
expected, this is with the host interrupting immediately, which leads to
many stops/starts/interrupts due to very little free capacity.

> > but the point is that the space freed is just enough for 43 entries;
> > keeping the delay lower means a flood of (pseudo) stops and restarts.
>
> Better yet, here they are in git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next-event-idx-v1
> git://git.kernel.org/pub/scm/linux/kernel/git/mst/qemu-kvm.git virtio-net-event-idx-v1

Great, I will pick up from here.

thanks,

- KK
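A sketch of the kind of instrumentation that could produce the
"interrupts within a jiffy" numbers above; the hook names and the
per-device field are assumptions for illustration only:

	#include <linux/jiffies.h>

	/* Hypothetical counters: was the TX-complete interrupt delivered
	 * in the same jiffy that the queue was stopped, or later? */
	static unsigned long intr_same_jiffy, intr_later_jiffy;

	static void note_queue_stopped(struct virtnet_info *vi)
	{
		vi->stop_jiffy = jiffies;	/* assumed per-device field */
	}

	static void note_tx_interrupt(struct virtnet_info *vi)
	{
		if (jiffies == vi->stop_jiffy)
			intr_same_jiffy++;
		else
			intr_later_jiffy++;
	}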
Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 02:34:39 PM:

> > Do I need to apply all the patches and simply test? Thanks, - KK
>
> Exactly. You can also try to tune the threshold for interrupts as well.

I haven't tuned the threshold, it is left at 3/4. I ran the new
qemu/vhost/guest, and the results for 1K, 2K and 16K are below. Note this
is a different kernel version from my earlier test results. So, f.e., BW1
represents 2.6.39-rc2, the original kernel; while BW2 represents
2.6.37-rc5 (MST's kernel). This also isn't with the fixes you have sent
just now; I will get a run with that either late tonight or tomorrow.

	I/O size: 1K
	________________________________________________________
	#     BW1    BW2    (%)       SD1      SD2      (%)
	________________________________________________________
	1     1723   3016   (75.0)    4.7      2.6      (-44.6)
	2     3223   6712   (108.2)   18.0     7.1      (-60.5)
	4     7223   8258   (14.3)    36.5     24.3     (-33.4)
	8     8689   7943   (-8.5)    131.5    101.6    (-22.7)
	16    8059   7398   (-8.2)    578.3    406.4    (-29.7)
	32    7758   7208   (-7.0)    2281.4   1574.7   (-30.9)
	64    7503   7155   (-4.6)    9734.0   6368.0   (-34.5)
	96    7496   7078   (-5.5)    21980.9  15477.6  (-29.5)
	128   7389   6900   (-6.6)    40467.5  26031.9  (-35.6)
	________________________________________________________
	Summary:    BW: (4.4)    SD: (-33.5)

	I/O size: 2K
	________________________________________________________
	#     BW1    BW2    (%)       SD1      SD2      (%)
	________________________________________________________
	1     1608   4968   (208.9)   5.0      1.3      (-74.0)
	2     3354   6974   (107.9)   18.6     4.9      (-73.6)
	4     8234   8344   (1.3)     35.6     17.9     (-49.7)
	8     8427   7818   (-7.2)    103.5    71.2     (-31.2)
	16    7995   7491   (-6.3)    410.1    273.9    (-33.2)
	32    7863   7149   (-9.0)    1678.6   1080.4   (-35.6)
	64    7661   7092   (-7.4)    7245.3   4717.2   (-34.8)
	96    7517   6984   (-7.0)    15711.2  9838.9   (-37.3)
	128   7389   6851   (-7.2)    27121.6  18255.7  (-32.6)
	________________________________________________________
	Summary:    BW: (6.0)    SD: (-34.5)

	I/O size: 16K
	________________________________________________________
	#     BW1    BW2    (%)       SD1      SD2      (%)
	________________________________________________________
	1     6684   7019   (5.0)     1.1      1.1      (0)
	2     7674   7196   (-6.2)    5.0      4.8      (-4.0)
	4     7358   8032   (9.1)     21.3     20.4     (-4.2)
	8     7393   8015   (8.4)     82.7     82.0     (-.8)
	16    7958   8366   (5.1)     283.2    310.7    (9.7)
	32    7792   8113   (4.1)     1257.5   1363.0   (8.3)
	64    7673   8040   (4.7)     5723.1   5812.4   (1.5)
	96    7462   7883   (5.6)     12731.8  12119.8  (-4.8)
	128   7338   7800   (6.2)     21331.7  21094.7  (-1.1)
	________________________________________________________
	Summary:    BW: (4.6)    SD: (-1.5)

Thanks,

- KK
Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance
Krishna Kumar wrote on 05/05/2011 08:57:13 PM:

Oops, I sent my patch's test results for the 16K case. The correct one
is:

	I/O size: 16K
	________________________________________________________
	#     BW1    BW2    (%)       SD1      SD2      (%)
	________________________________________________________
	1     6684   6670   (-.2)     1.1      .6       (-45.4)
	2     7674   7859   (2.4)     5.0      2.6      (-48.0)
	4     7358   7421   (.8)      21.3     11.6     (-45.5)
	8     7393   7289   (-1.4)    82.7     44.8     (-45.8)
	16    7958   7280   (-8.5)    283.2    166.3    (-41.2)
	32    7792   7163   (-8.0)    1257.5   692.4    (-44.9)
	64    7673   7096   (-7.5)    5723.1   2870.3   (-49.8)
	96    7462   6963   (-6.6)    12731.8  6475.6   (-49.1)
	128   7338   6919   (-5.7)    21331.7  12345.7  (-42.1)
	________________________________________________________
	Summary:    BW: (-3.9)    SD: (-45.4)

Sorry for the confusion.

Regards,

- KK

> 	I/O size: 16K
> 	________________________________________________________
> 	#     BW1    BW2    (%)       SD1      SD2      (%)
> 	________________________________________________________
> 	1     6684   7019   (5.0)     1.1      1.1      (0)
> 	2     7674   7196   (-6.2)    5.0      4.8      (-4.0)
> 	4     7358   8032   (9.1)     21.3     20.4     (-4.2)
> 	8     7393   8015   (8.4)     82.7     82.0     (-.8)
> 	16    7958   8366   (5.1)     283.2    310.7    (9.7)
> 	32    7792   8113   (4.1)     1257.5   1363.0   (8.3)
> 	64    7673   8040   (4.7)     5723.1   5812.4   (1.5)
> 	96    7462   7883   (5.6)     12731.8  12119.8  (-4.8)
> 	128   7338   7800   (6.2)     21331.7  21094.7  (-1.1)
> 	________________________________________________________
> 	Summary:    BW: (4.6)    SD: (-1.5)
Re: [PATCH 0/4] [RFC] virtio-net: Improve small packet performance
Michael S. Tsirkin m...@redhat.com wrote on 05/04/2011 08:16:22 PM:

> > A. virtio:
> >    - Provide an API to get the available number of slots.
> >
> > B. virtio-net:
> >    - Remove stop/start txq's and associated callback.
> >    - Pre-calculate the number of slots needed to transmit the skb in
> >      xmit_skb and bail out early if enough space is not available. My
> >      testing shows that 2.5-3% of packets benefit from using this API.
> >    - Do not drop skbs but instead return TX_BUSY like other drivers.
> >    - When returning EBUSY, set a per-txq variable to indicate to
> >      dev_queue_xmit() whether to restart xmits on this txq.
> >
> > C. net/sched/sch_generic.c:
> >    Since virtio-net now returns EBUSY, the skb is requeued to gso_skb.
> >    This allows adding the additional check for restarting xmits in
> >    just the slow path (the first requeued-packet case of dequeue_skb,
> >    where it checks for gso_skb) before deciding whether to call the
> >    driver or not.
> >
> > The patch was also tested between two servers with Emulex OneConnect
> > 10G cards to confirm there is no regression. Though the patch is an
> > attempt to improve only small packet performance, there was
> > improvement for 1K, 2K and also 16K, both in BW and SD. Results from
> > Guest -> Remote Host (BW in Mbps) for 1K and 16K I/O sizes:
> >
> >	I/O size: 1K
> >	________________________________________________________
> >	#     BW1    BW2    (%)       SD1      SD2      (%)
> >	________________________________________________________
> >	1     1226   3313   (170.2)   6.6      1.9      (-71.2)
> >	2     3223   7705   (139.0)   18.0     7.1      (-60.5)
> >	4     7223   8716   (20.6)    36.5     29.7     (-18.6)
> >	8     8689   8693   (0)       131.5    123.0    (-6.4)
> >	16    8059   8285   (2.8)     578.3    506.2    (-12.4)
> >	32    7758   7955   (2.5)     2281.4   2244.2   (-1.6)
> >	64    7503   7895   (5.2)     9734.0   9424.4   (-3.1)
> >	96    7496   7751   (3.4)     21980.9  20169.3  (-8.2)
> >	128   7389   7741   (4.7)     40467.5  34995.5  (-13.5)
> >	________________________________________________________
> >	Summary:    BW: 16.2%    SD: -10.2%
> >
> >	I/O size: 16K
> >	________________________________________________________
> >	#     BW1    BW2    (%)       SD1      SD2      (%)
> >	________________________________________________________
> >	1     6684   7019   (5.0)     1.1      1.1      (0)
> >	2     7674   7196   (-6.2)    5.0      4.8      (-4.0)
> >	4     7358   8032   (9.1)     21.3     20.4     (-4.2)
> >	8     7393   8015   (8.4)     82.7     82.0     (-.8)
> >	16    7958   8366   (5.1)     283.2    310.7    (9.7)
> >	32    7792   8113   (4.1)     1257.5   1363.0   (8.3)
> >	64    7673   8040   (4.7)     5723.1   5812.4   (1.5)
> >	96    7462   7883   (5.6)     12731.8  12119.8  (-4.8)
> >	128   7338   7800   (6.2)     21331.7  21094.7  (-1.1)
> >	________________________________________________________
> >	Summary:    BW: 4.6%    SD: -1.5%
> >
> > Signed-off-by: Krishna Kumar krkum...@in.ibm.com
> > ---
>
> So IIUC, we delay transmit by an arbitrary value and hope that the host
> is done with the packets by then?

Not hope exactly. If the device is not ready, then the packet is
requeued. The main idea is to avoid drops/stops/starts, etc.

> Interesting. I am currently testing an approach where we tell the host
> explicitly to interrupt us only after a large part of the queue is
> empty. With 256 entries in a queue, we should get 1 interrupt per on
> the order of 100 packets, which does not seem like a lot. I can post
> it, mind testing this?

Sure.

- KK
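The "API to get the available number of slots" in (A) above could look
like the following; a minimal sketch assuming it simply exposes the
ring's free-entry count (the function name matches the later discussion
in this thread, but treat the body as illustrative):

	/* Sketch: expose the number of free ring entries so the driver
	 * can check capacity around building the scatterlist. */
	unsigned int virtqueue_get_capacity(struct virtqueue *_vq)
	{
		struct vring_virtqueue *vq = to_vvq(_vq);

		return vq->num_free;
	}
	EXPORT_SYMBOL_GPL(virtqueue_get_capacity);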
Re: [PATCH 2/4] [RFC] virtio: Introduce new API to get free space
Michael S. Tsirkin m...@redhat.com wrote on 05/05/2011 01:30:23 AM:

> > @@ -185,11 +193,6 @@ int virtqueue_add_buf_gfp(struct virtque
> >  	if (vq->num_free < out + in) {
> >  		pr_debug("Can't add buf len %i - avail = %i\n",
> >  			 out + in, vq->num_free);
> > -		/* FIXME: for historical reasons, we force a notify here if
> > -		 * there are outgoing parts to the buffer.  Presumably the
> > -		 * host should service the ring ASAP. */
> > -		if (out)
> > -			vq->notify(&vq->vq);
> >  		END_USE(vq);
> >  		return -ENOSPC;
> >  	}
>
> This will break qemu versions 0.13 and back. I'm adding some new virtio
> ring flags, we'll be able to reuse one of these to mean 'no need for
> workaround', I think.

Not really, it won't. We shall almost never get here at all.

> But then, why would this help performance?

Yes, it is not needed. I will be testing without this also.

thanks,

- KK
Re: [RFC PATCH 0/2] Multiqueue support for qemu(virtio-net)
Thanks Jason! So I can use my virtio-net guest driver and test with this
patch? Please provide the script you use to start an MQ guest.

Regards,

- KK

Jason Wang jasow...@redhat.com wrote on 04/20/2011 02:03:07 PM:

> Subject: [RFC PATCH 0/2] Multiqueue support for qemu(virtio-net)
>
> Inspired by Krishna's patch
> (http://www.spinics.net/lists/kvm/msg52098.html) and Michael's
> suggestions, the following series adds multiqueue support to qemu and
> enables it for virtio-net (both userspace and vhost). The aim of this
> series is to simplify the management and achieve the same performance
> with less code.
>
> The differences between this series and Krishna's:
>
> - Add multiqueue support to qemu, and also to userspace virtio-net.
> - Instead of hacking the vhost module to manipulate kthreads, this
>   patch just implements the multiqueues in userspace and thus can
>   re-use the existing vhost kernel-side code without any modification.
> - Use a 1:1 mapping between TX/RX pairs and vhost kthreads because the
>   implementation is userspace based.
> - The cli is also changed to make the mgmt easier; the -netdev option
>   of qdev can now accept more than one id. You can start a multiqueue
>   virtio-net device through:
>
>   ./qemu-system-x86_64 -netdev tap,id=hn0,vhost=on,fd=X \
>       -netdev tap,id=hn1,vhost=on,fd=Y \
>       -device virtio-net-pci,netdev=hn0#hn1,queues=2 ...
>
> The series is very primitive and still needs polish. Suggestions are
> welcomed.
>
> ---
>
> Jason Wang (2):
>       net: Add multiqueue support
>       virtio-net: add multiqueue support
>
>  hw/qdev-properties.c |   37 +-
>  hw/qdev.h            |    3
>  hw/vhost.c           |   26 ++-
>  hw/vhost.h           |    1
>  hw/vhost_net.c       |    7 +
>  hw/vhost_net.h       |    2
>  hw/virtio-net.c      |  409 +++++---
>  hw/virtio-net.h      |    2
>  hw/virtio-pci.c      |    1
>  hw/virtio.h          |    1
>  net.c                |   34 +++-
>  net.h                |   15 +-
>
>  12 files changed, 353 insertions(+), 185 deletions(-)
Re: [PATCH 2/4] [RFC rev2] virtio-net changes
Hi Rusty,

Thanks for your feedback. I agree with all the changes, and will make
them and resubmit next.

thanks,

- KK

Rusty Russell ru...@rustcorp.com.au wrote on 04/13/2011 06:58:02 AM:

> On Tue, 05 Apr 2011 20:38:52 +0530, Krishna Kumar
> krkum...@in.ibm.com wrote:
> > Implement mq virtio-net driver.
> >
> > Though struct virtio_net_config changes, it works with the old qemu
> > since the last element is not accessed unless qemu sets
> > VIRTIO_NET_F_MULTIQUEUE.
> >
> > Signed-off-by: Krishna Kumar krkum...@in.ibm.com
>
> Hi Krishna!
>
> This change looks fairly solid, but I'd prefer it split into a few
> stages for clarity. The first patch should extract out the struct
> send_queue and struct receive_queue, even though there's still only
> one. The second patch can then introduce VIRTIO_NET_F_MULTIQUEUE.
>
> You could split into more parts if that makes sense, but I'd prefer to
> see the mechanical changes separate from the feature addition.
>
> > -struct virtnet_info {
> > -	struct virtio_device *vdev;
> > -	struct virtqueue *rvq, *svq, *cvq;
> > -	struct net_device *dev;
> > +/* Internal representation of a send virtqueue */
> > +struct send_queue {
> > +	/* Virtqueue associated with this send queue */
> > +	struct virtqueue *svq;
>
> You can simply call this vq now it's inside 'send_queue'.
>
> > +	/* TX: fragments + linear part + virtio header */
> > +	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
>
> Similarly, this can just be sg.
>
> > +static void free_receive_bufs(struct virtnet_info *vi)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < vi->numtxqs; i++) {
> > +		BUG_ON(vi->rq[i] == NULL);
> > +		while (vi->rq[i]->pages)
> > +			__free_pages(get_a_page(vi->rq[i], GFP_KERNEL), 0);
> > +	}
> > +}
>
> You can skip the BUG_ON(), since the next line will have the same
> effect.
>
> > +/* Free memory allocated for send and receive queues */
> > +static void free_rq_sq(struct virtnet_info *vi)
> > +{
> > +	int i;
> > +
> > +	if (vi->rq) {
> > +		for (i = 0; i < vi->numtxqs; i++)
> > +			kfree(vi->rq[i]);
> > +		kfree(vi->rq);
> > +	}
> > +
> > +	if (vi->sq) {
> > +		for (i = 0; i < vi->numtxqs; i++)
> > +			kfree(vi->sq[i]);
> > +		kfree(vi->sq);
> > +	}
>
> This looks weird, even though it's correct. I think we need a better
> name than numtxqs and shorter than num_queue_pairs. Let's just use
> num_queues; sure, there are both tx and rx queues, but I still think
> it's pretty clear.
>
> > +	for (i = 0; i < vi->numtxqs; i++) {
> > +		struct virtqueue *svq = vi->sq[i]->svq;
> > +
> > +		while (1) {
> > +			buf = virtqueue_detach_unused_buf(svq);
> > +			if (!buf)
> > +				break;
> > +			dev_kfree_skb(buf);
> > +		}
> > +	}
>
> I know this isn't your code, but it's ugly :)
>
>	while ((buf = virtqueue_detach_unused_buf(svq)) != NULL)
>		dev_kfree_skb(buf);
>
> > +	for (i = 0; i < vi->numtxqs; i++) {
> > +		struct virtqueue *rvq = vi->rq[i]->rvq;
> > +
> > +		while (1) {
> > +			buf = virtqueue_detach_unused_buf(rvq);
> > +			if (!buf)
> > +				break;
>
> Here too...
>
> > +#define MAX_DEVICE_NAME		16
>
> This isn't a good idea, see below.
>
> > +static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
> > +{
> > +	vq_callback_t **callbacks;
> > +	struct virtqueue **vqs;
> > +	int i, err = -ENOMEM;
> > +	int totalvqs;
> > +	char **names;
>
> This whole routine is really messy. How about doing find_vqs first,
> then having routines like setup_rxq(), setup_txq() and setup_controlq()?
> That would make this neater:
>
>	static int setup_rxq(struct send_queue *sq, char *name);
>
> Also, use kasprintf() instead of kmalloc + sprintf.
>
> > +#if 1
> > +	/* Allocate/initialize parameters for recv/send virtqueues */
>
> Why is this #if 1'd? I do prefer the #else method of doing two loops,
> myself (but use kasprintf).
>
> Cheers,
> Rusty.
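To illustrate the kasprintf() suggestion above: a sketch of how the
per-queue vq names could be built without a fixed MAX_DEVICE_NAME buffer
(the function and field names are assumptions based on the patch context,
not Rusty's actual code):

	/* Hypothetical: name the vqs "input.N"/"output.N" with kasprintf()
	 * instead of a kmalloc'ed fixed-size buffer plus sprintf(). */
	static int alloc_vq_names(struct virtnet_info *vi, char **names)
	{
		int i;

		for (i = 0; i < vi->numtxqs; i++) {
			names[i] = kasprintf(GFP_KERNEL, "input.%d", i);
			names[vi->numtxqs + i] = kasprintf(GFP_KERNEL,
							   "output.%d", i);
			if (!names[i] || !names[vi->numtxqs + i])
				return -ENOMEM;
		}
		return 0;
	}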
Re: [PATCH 0/4] [RFC rev2] Implement multiqueue (RX TX) virtio-net
Avi Kivity a...@redhat.com wrote on 04/13/2011 05:30:11 PM:

Hi Avi,

> > 1. Reduce vectors for find_vqs().
> > 2. Make vhost changes minimal. For now, I have restricted the number
> >    of vhost threads to 4. This can either be made unrestricted; or,
> >    if the userspace vhost works, it can be removed altogether.
> >
> > Please review and provide feedback. I am travelling a bit in the
> > next few days but will respond at the earliest.
>
> Do you have an update to the virtio-pci spec for this?

Not yet, will keep it in my TODO list.

thanks,

- KK
Re: [PATCH 2/3] [RFC] Changes for MQ virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 03/02/2011 03:36:00 PM:

Sorry for the delayed response, I have been sick the last few days. I am
responding to both your posts here.

> > Both virtio-net and vhost need some check to make sure very high
> > values are not passed by userspace. Is this not required?
>
> Whatever we stick in the header is effectively part of the host/guest
> interface. Are you sure we'll never want more than 16 VQs? This value
> does not seem that high.

OK, so even constants cannot change? Given that, should I remove all
checks and use kcalloc?

> > OK, so virtio_net_config has num_queue_pairs, and this gets converted
> > to numtxqs in virtnet_info?
>
> Or put num_queue_pairs in virtnet_info too.

For virtnet_info, having numtxqs is easier since all code that loops
needs only 'numtxqs'.

> > Also, vhost has some code that processes tx first before rx (e.g.
> > vhost_net_stop/flush),
>
> No idea why I did it this way. I don't think it matters.

> > so this approach seemed helpful. I am OK either way, what do you
> > suggest?
>
> We get less code generated but also less flexibility. I am not sure,
> I'll play around with the code, for now let's keep it as is.

OK.

> > Yes, it is a waste to have these vectors for tx ints. I initially
> > thought of adding a flag to virtio_device to pass to vp_find_vqs,
> > but it won't work, so a new API is needed. I can work with you on
> > this in the background if you like.
>
> OK. For starters, how about we change find_vqs to get a structure? Then
> we can easily add flags that tell us that some interrupts are rare.

Yes. OK to work on this outside this patch series, I guess?

> > vq's are matched between qemu, virtio-net and vhost. Isn't some check
> > required that userspace has not passed a bad value?
>
> For virtio, I'm not too concerned: qemu can already easily crash the
> guest :) For vhost yes, but I'm concerned that even with 16 VQs we are
> drinking a lot of resources already. I would be happier if we had a
> file descriptor per VQ pair in some way. Then the amount of memory
> userspace can use up is limited by the # of file descriptors.

I will start working on this approach this week and see how it goes.

> > OK, so define free_unused_bufs() as:
> >
> >	static void free_unused_bufs(struct virtnet_info *vi,
> >				     struct virtqueue *svq,
> >				     struct virtqueue *rvq)
> >	{
> >		/* Use svq and rvq with the remaining code unchanged */
> >	}
>
> Not sure I understand. I am just suggesting adding symmetrical
> functions like init/cleanup, alloc/free, etc. instead of adding stuff
> in random functions that just happen to be called at the right time.

OK, I will clean up this part in the next revision.

> > I was not sure what the best way is - a sysctl parameter? Or should
> > the maximum depend on the number of host cpus? But that results in
> > too many threads, e.g. if I have 16 cpus and 16 txqs.
>
> I guess the question is, wouldn't # of threads == # of vqs work best?
> If we process stuff on a single CPU, let's make it pass through a
> single VQ. And to do this, we could simply open multiple vhost fds
> without changing vhost at all. Would this work well?

> > -	enum vhost_net_poll_state tx_poll_state;
> > +	enum vhost_net_poll_state *tx_poll_state;
>
> another array?

Yes... I am also allocating twice the space than what is required to
make its usage simple.

> Where's the allocation? Couldn't find it.

vhost_setup_vqs (net.c) allocates it based on nvqs, though numtxqs is
enough.

Thanks,

- KK
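A rough userspace sketch of the "file descriptor per VQ pair" direction
discussed above; purely illustrative - the ioctl shown is just the
standard single-queue vhost-net ownership setup repeated per pair:

	#include <fcntl.h>
	#include <stddef.h>
	#include <sys/ioctl.h>
	#include <linux/vhost.h>

	/* Illustrative: one vhost-net instance (and hence one worker
	 * thread) per TX/RX queue pair, with no changes to the vhost
	 * kernel module itself. */
	static int open_vhost_pairs(int *fds, int npairs)
	{
		int i;

		for (i = 0; i < npairs; i++) {
			fds[i] = open("/dev/vhost-net", O_RDWR);
			if (fds[i] < 0)
				return -1;
			if (ioctl(fds[i], VHOST_SET_OWNER, NULL) < 0)
				return -1;
		}
		return 0;
	}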
Re: [PATCH 2/3] [RFC] Changes for MQ virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 03/08/2011 09:11:04 PM:

> Also, could you post your current version of the qemu code pls? It's
> useful for testing and to see the whole picture.

Sorry for the delay on this. I am attaching the qemu changes. Some parts
of the patch are completely redundant, e.g. MAX_TUN_DEVICES, and I will
remove it later. It works with the latest qemu and the kernel patch sent
earlier. Please let me know if there are any issues.

thanks,

- KK

(See attached file: qemu.patch)
Re: [PATCH 0/3] [RFC] Implement multiqueue (RX TX) virtio-net
Andrew Theurer haban...@linux.vnet.ibm.com wrote on 03/04/2011 12:31:24 AM:

Hi Andrew,

> > ___________
> > TCP: Guest -> Local Host (TCP_STREAM)
> > TCP: Local Host -> Guest (TCP_MAERTS)
> > UDP: Local Host -> Guest (UDP_STREAM)
>
> Any reason why the tests don't include guest-to-guest on the same
> host, or on different hosts? Seems like those would be a lot more
> common than guest-to/from-localhost.

This was missing in my test plan, but good point. I will run these tests
also and send the results soon.

Thanks,

- KK
Re: [PATCH 2/3] [RFC] Changes for MQ virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 02/28/2011 03:13:20 PM:

Thank you once again for your feedback on both these patches. I will
send the qemu patch tomorrow. I will also send the next version
incorporating these suggestions once we finalize some minor points.

> Overall looks good.
> The numtxqs meaning the number of rx queues needs some cleanup.
> init/cleanup routines need more symmetry. Error handling on setup also
> seems slightly buggy or at least asymmetrical. Finally, this will use
> up a large number of MSI vectors, while TX interrupts mostly stay
> unused. Some comments below.
>
> > +/* Maximum number of individual RX/TX queues supported */
> > +#define VIRTIO_MAX_TXQS		16
>
> This also does not seem to belong in the header.

Both virtio-net and vhost need some check to make sure very high values
are not passed by userspace. Is this not required?

> > +#define VIRTIO_NET_F_NUMTXQS	21	/* Device supports multiple TX queues */
>
> VIRTIO_NET_F_MULTIQUEUE ?

Yes, that's a better name.

> > @@ -34,6 +38,8 @@ struct virtio_net_config {
> >  	__u8 mac[6];
> >  	/* See VIRTIO_NET_F_STATUS and VIRTIO_NET_S_* above */
> >  	__u16 status;
> > +	/* number of RX/TX queues */
> > +	__u16 numtxqs;
>
> The interface here is a bit ugly:
> - this is really both # of tx and rx queues but called numtxqs
> - there's a hardcoded max value
> - 0 is assumed to be the same as 1
> - the assumptions above are undocumented.
>
> One way to address this could be num_queue_pairs, and something like
>
>	/* The actual number of TX and RX queues is num_queue_pairs + 1 each. */
>	__u16 num_queue_pairs;
>
> (and tweak code to match). Alternatively, have separate registers for
> the number of tx and rx queues.

OK, so virtio_net_config has num_queue_pairs, and this gets converted to
numtxqs in virtnet_info?

> > +struct virtnet_info {
> > +	struct send_queue **sq;
> > +	struct receive_queue **rq;
> > +
> > +	/* read-mostly variables */
> > +	int numtxqs ____cacheline_aligned_in_smp;
>
> Why do you think this alignment is a win?

Actually this code was from the earlier patchset (MQ TX only) where the
layout was different. Now rq and sq are allocated as follows:

	vi->sq = kzalloc(numtxqs * sizeof(*vi->sq), GFP_KERNEL);
	for (i = 0; i < numtxqs; i++) {
		vi->sq[i] = kzalloc(sizeof(*vi->sq[i]), GFP_KERNEL);
		...
	}

Since the two pointers become read-only during use, there is no cache
line dirtying. I will remove this directive.

> > +/*
> > + * Note for 'qnum' below:
> > + * first 'numtxqs' vqs are RX, next 'numtxqs' vqs are TX.
> > + */
>
> Another option to consider is to have them RX,TX,RX,TX: this way
> vq->queue_index / 2 gives you the queue pair number, no need to read
> numtxqs. On the other hand, it makes the #RX == #TX assumption even
> more entrenched.

OK. I was following how many drivers allocate RX and TX's together -
e.g. ixgbe_adapter has tx_ring and rx_ring arrays; bnx2 has rx_buf_ring
and tx_buf_ring arrays, etc. Also, vhost has some code that processes tx
first before rx (e.g. vhost_net_stop/flush), so this approach seemed
helpful. I am OK either way, what do you suggest?

> > +	err = vi->vdev->config->find_vqs(vi->vdev, totalvqs, vqs, callbacks,
> > +					 (const char **)names);
> > +	if (err)
> > +		goto free_params;
>
> This would use up quite a lot of vectors. However, the tx interrupt
> is, in fact, a slow path. So, assuming we don't have enough vectors to
> use one per vq, I think it's a good idea to support reducing MSI
> vector usage by mapping all TX VQs to the same vector and using
> separate vectors for RX. The hypervisor actually allows this, but we
> don't have an API at the virtio level to pass that info to virtio pci
> ATM. Any idea what a good API to use would be?

Yes, it is a waste to have these vectors for tx ints. I initially
thought of adding a flag to virtio_device to pass to vp_find_vqs, but it
won't work, so a new API is needed. I can work with you on this in the
background if you like.

> > +	for (i = 0; i < numtxqs; i++) {
> > +		vi->rq[i]->rvq = vqs[i];
> > +		vi->sq[i]->svq = vqs[i + numtxqs];
>
> This logic is spread all over. We need some kind of macro to get the
> queue number from the vq number and back.

Will add this (see the sketch after this message).

> > +	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)) {
> > +		vi->cvq = vqs[i + numtxqs];
> > +
> > +		if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VLAN))
> > +			vi->dev->features |= NETIF_F_HW_VLAN_FILTER;
>
> This bit does not seem to belong in initialize_vqs.

I will move it back to probe.

> > +	err = virtio_config_val(vdev, VIRTIO_NET_F_NUMTXQS,
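For the queue-number macro discussed above, a sketch under the
interleaved RX,TX,RX,TX layout Michael suggests (the names are made up,
and the vq index is assumed to be exposed as vq->queue_index, as in his
comment):

	/* Hypothetical helpers: map a vq index to its queue pair and
	 * back, assuming vqs are laid out RX0,TX0,RX1,TX1,... */
	#define vq_to_qpair(vq)		((vq)->queue_index / 2)
	#define qpair_to_rx_vq(qpair)	((qpair) * 2)
	#define qpair_to_tx_vq(qpair)	((qpair) * 2 + 1)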
Re: [PATCH 3/3] [RFC] Changes for MQ vhost
Michael S. Tsirkin m...@redhat.com wrote on 02/28/2011 03:34:23 PM:

The number of vhost threads is <= #txqs. Threads handle more than one txq when #txqs is more than MAX_VHOST_THREADS (4).

It is this sharing that prevents us from just reusing multiple vhost descriptors?

Sorry, I didn't understand this question.

4 seems a bit arbitrary - do you have an explanation on why this is a good number?

I was not sure what is the best way - a sysctl parameter? Or should the maximum depend on the number of host cpus? But that results in too many threads, e.g. if I have 16 cpus and 16 txqs.

	+	struct task_struct *worker;	/* worker for this vq */
	+	spinlock_t *work_lock;		/* points to a dev->work_lock[] entry */
	+	struct list_head *work_list;	/* points to a dev->work_list[] entry */
	+	int qnum;			/* 0 for RX, 1 - n-1 for TX */

Is this right?

Will fix this.

	@@ -122,12 +128,33 @@ struct vhost_dev {
	 	int nvqs;
	 	struct file *log_file;
	 	struct eventfd_ctx *log_ctx;
	-	spinlock_t work_lock;
	-	struct list_head work_list;
	-	struct task_struct *worker;
	+	spinlock_t *work_lock[MAX_VHOST_THREADS];
	+	struct list_head *work_list[MAX_VHOST_THREADS];

This looks a bit strange. Won't sticking everything in a single array of structures rather than multiple arrays be better for cache utilization?

Correct. In that context, which is better:
	struct {
		spinlock_t *work_lock;
		struct list_head *work_list;
	} work[MAX_VHOST_THREADS];
or, to make sure work_lock/work_list is cache-aligned:
	struct work_lock_list {
		spinlock_t work_lock;
		struct list_head work_list;
	} ____cacheline_aligned_in_smp;
and define:
	struct vhost_dev {
		...
		struct work_lock_list work[MAX_VHOST_THREADS];
	};
The second method uses a little more space but each vhost needs only one (read-only) cache line. I tested with this and can confirm it aligns each element on a cache line. BW improved slightly (up to 3%), remote SD improves by up to 4% or so.

	+static inline int get_nvhosts(int nvqs)

nvhosts -> nthreads?

Yes.

	+static inline int vhost_get_thread_index(int index, int numtxqs, int nvhosts)
	+{
	+	return (index % numtxqs) % nvhosts;
	+}
	+

As the only caller passes MAX_VHOST_THREADS, just use that?

Yes, nice catch.

	 struct vhost_net {
	 	struct vhost_dev dev;
	-	struct vhost_virtqueue vqs[VHOST_NET_VQ_MAX];
	-	struct vhost_poll poll[VHOST_NET_VQ_MAX];
	+	struct vhost_virtqueue *vqs;
	+	struct vhost_poll *poll;
	+	struct socket **socks;
	 	/* Tells us whether we are polling a socket for TX.
	 	 * We only do this when socket buffer fills up.
	 	 * Protected by tx vq lock. */
	-	enum vhost_net_poll_state tx_poll_state;
	+	enum vhost_net_poll_state *tx_poll_state;

another array?

Yes... I am also allocating twice the space than what is required to make its usage simple. Please let me know what you feel about this.

Thanks, - KK
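For illustration, dispatching a vq's work through the cacheline-aligned per-thread structure could look like this sketch (workers[] is an assumed companion array of thread task_structs; this is not the posted patch):

	static void vhost_work_queue_mq(struct vhost_dev *dev, int qnum,
					struct vhost_work *work)
	{
		/* Map the vq to its owning thread, then queue on that
		 * thread's aligned lock/list pair. */
		int t = vhost_get_thread_index(qnum, dev->nvqs - 1,
					       MAX_VHOST_THREADS);
		struct work_lock_list *w = &dev->work[t];
		unsigned long flags;

		spin_lock_irqsave(&w->work_lock, flags);
		if (list_empty(&work->node)) {
			list_add_tail(&work->node, &w->work_list);
			wake_up_process(dev->workers[t]);
		}
		spin_unlock_irqrestore(&w->work_lock, flags);
	}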
Re: [PATCH 0/3] [RFC] Implement multiqueue (RX & TX) virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 02/28/2011 01:05:15 PM:

This patch series is a continuation of an earlier one that implemented guest MQ TX functionality. This new patchset implements both RX and TX MQ. Qemu changes are not being included at this time solely to aid in easier review. Compatibility testing with old/new combinations of qemu/guest and vhost was done without any issues. Some early TCP/UDP test results are at the bottom of this post, I plan to submit more test results in the coming days. Please review and provide feedback on what can improve. Thanks! Signed-off-by: Krishna Kumar krkum...@in.ibm.com

To help testing, could you post the qemu changes separately please?

Thanks Michael for your review and feedback. I will send the qemu changes and respond to your comments tomorrow.

Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 09:25:34 PM:

Sure, will get a build/test on latest bits and send in 1-2 days.

The TX-only patch helped the guest TX path but didn't help host->guest much (as tested using TCP_MAERTS from the guest). But with the TX+RX patch, both directions are getting improvements. Also, my hope is that with appropriate queue mapping, we might be able to do away with heuristics to detect the single stream load that the TX-only code needs.

Yes, that whole stuff is removed, and the TX/RX path is unchanged with this patch (thankfully :)

Cool. I was wondering whether in that case, we can do without host kernel changes at all, and use a separate fd for each TX/RX pair. The advantage of that approach is that this way, the max fd limit naturally sets an upper bound on the amount of resources userspace can use up. Thoughts? In any case, pls don't let the above delay sending an RFC.

I will look into this also. Please excuse the delay in sending the patch out faster - my bits are a little old, so it is taking some time to move to the latest kernel and get some initial TCP/UDP test results. I should have it ready by tomorrow.

Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Simon Horman ho...@verge.net.au wrote on 02/22/2011 01:17:09 PM:

Hi Simon, I have a few questions about the results below:

1. Are the (%) comparisons between non-mq and mq virtio?

Yes - mainline kernel with transmit-only MQ patch.

2. Was UDP or TCP used?

TCP. I had done some initial testing on UDP, but don't have the results now as it is really old. But I will be running it again.

3. What was the transmit size (-m option to netperf)?

I didn't use the -m option, so it defaults to 16K. The script does:
	netperf -t TCP_STREAM -c -C -l 60 -H $SERVER

Also, I'm interested to know what the status of these patches is. Are you planning a fresh series?

Yes. Michael Tsirkin had wanted to see how the MQ RX patch would look like, so I was in the process of getting the two working together. The patch is ready and is being tested. Should I send an RFC patch at this time?

The TX-only patch helped the guest TX path but didn't help host->guest much (as tested using TCP_MAERTS from the guest). But with the TX+RX patch, both directions are getting improvements. Remote testing is still to be done.

Thanks, - KK

Changes from rev2:
1. Define (in virtio_net.h) the maximum send txqs; and use in virtio-net and vhost-net.
2. vi->sq[i] is allocated individually, resulting in cache line aligned sq[0] to sq[n]. Another option was to define 'send_queue' as:
	struct send_queue {
		struct virtqueue *svq;
		struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
	} ____cacheline_aligned_in_smp;
and to statically allocate 'VIRTIO_MAX_SQ' of those. I hope the submitted method is preferable.
3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX] handles TX[0-n].
4. Further change TX handling such that vhost[0] handles both RX/TX for the single stream case.

Enabling MQ on virtio: when the following options are passed to qemu:
	- smp > 1
	- vhost=on
	- mq=on (new option, default:off)
then #txqueues = #cpus. The #txqueues can be changed by using an optional 'numtxqs' option. e.g. for a smp=4 guest:
	vhost=on                   -> #txqueues = 1
	vhost=on,mq=on             -> #txqueues = 4
	vhost=on,mq=on,numtxqs=2   -> #txqueues = 2
	vhost=on,mq=on,numtxqs=8   -> #txqueues = 8

Performance (guest -> local host):
System configuration: Host: 8 Intel Xeon, 8 GB memory; Guest: 4 cpus, 2 GB memory. Test: Each test case runs for 60 secs, sum over three runs (except when number of netperf sessions is 1, which has 10 runs of 12 secs each). No tuning (default netperf) other than taskset'ing vhosts to cpus 0-3. numtxqs=32 gave the best results though the guest had only 4 vcpus (I haven't tried beyond that).

	numtxqs=2, vhosts=3
	#sessions  BW%    CPU%   RCPU%  SD%     RSD%
	1          4.46   -1.96  .19    -12.50  -6.06
	2          4.93   -1.16  2.10   0       -2.38
	4          46.17  64.77  33.72  19.51   -2.48
	8          47.89  70.00  36.23  41.46   13.35
	16         48.97  80.44  40.67  21.11   -5.46
	24         49.03  78.78  41.22  20.51   -4.78
	32         51.11  77.15  42.42  15.81   -6.87
	40         51.60  71.65  42.43  9.75    -8.94
	48         50.10  69.55  42.85  11.80   -5.81
	64         46.24  68.42  42.67  14.18   -3.28
	80         46.37  63.13  41.62  7.43    -6.73
	96         46.40  63.31  42.20  9.36    -4.78
	128        50.43  62.79  42.16  13.11   -1.23

	BW: 37.2%, CPU/RCPU: 66.3%,41.6%, SD/RSD: 11.5%,-3.7%

	numtxqs=8, vhosts=5
	#sessions  BW%    CPU%   RCPU%  SD%     RSD%
	1          -.76   -1.56  2.33   0       3.03
	2          17.41  11.11  11.41  0       -4.76
	4          42.12  55.11  30.20  19.51   .62
	8          54.69  80.00  39.22  24.39   -3.88
	16         54.77  81.62  40.89  20.34   -6.58
	24         54.66  79.68  41.57  15.49   -8.99
	32         54.92  76.82  41.79  17.59   -5.70
	40         51.79  68.56  40.53  15.31   -3.87
	48         51.72  66.40  40.84  9.72    -7.13
	64         51.11  63.94  41.10  5.93    -8.82
	80         46.51  59.50  39.80  9.33    -4.18
	96         47.72  57.75  39.84  4.20    -7.62
	128        54.35  58.95  40.66  3.24    -8.63

	BW: 38.9%,
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 02/23/2011 12:09:15 PM:

Hi Michael,

Yes. Michael Tsirkin had wanted to see how the MQ RX patch would look like, so I was in the process of getting the two working together. The patch is ready and is being tested. Should I send an RFC patch at this time?

Yes, please do.

Sure, will get a build/test on latest bits and send in 1-2 days.

The TX-only patch helped the guest TX path but didn't help host->guest much (as tested using TCP_MAERTS from the guest). But with the TX+RX patch, both directions are getting improvements. Also, my hope is that with appropriate queue mapping, we might be able to do away with heuristics to detect the single stream load that the TX-only code needs.

Yes, that whole stuff is removed, and the TX/RX path is unchanged with this patch (thankfully :)

Remote testing is still to be done.

Others might be able to help here once you post the patch.

That's great, will appreciate any help.

Thanks, - KK
Re: Network performance with small packets
Michael S. Tsirkin m...@redhat.com 02/02/2011 03:11 AM:

On Tue, Feb 01, 2011 at 01:28:45PM -0800, Shirley Ma wrote: On Tue, 2011-02-01 at 23:21 +0200, Michael S. Tsirkin wrote: Confused. We compare capacity to skb frags, no? That's sg I think ...

The current guest kernel uses indirect buffers, num_free returns how many available descriptors, not skb frags. So it's wrong here. Shirley

I see. Good point. In other words when we complete the buffer it was indirect, but when we add a new one we can not allocate indirect so we consume. And then we start the queue and add will fail. I guess we need some kind of API to figure out whether the buf we complete was indirect? Another failure mode is when skb_xmit_done wakes the queue: it might be too early, there might not be space for the next packet in the vq yet.

I am not sure if this is the problem - shouldn't you see these messages:
	if (likely(capacity == -ENOMEM)) {
		dev_warn(&dev->dev, "TX queue failure: out of memory\n");
	} else {
		dev->stats.tx_fifo_errors++;
		dev_warn(&dev->dev, "Unexpected TX queue failure: %d\n", capacity);
	}
in the next xmit? I am not getting this in my testing.

A solution might be to keep some kind of pool around for indirect, we wanted to do it for block anyway ...

Your vhost patch should fix this automatically. Right?

Thanks, - KK
Re: Network performance with small packets
On Tue, 2011-02-01 at 22:05 -0800, Shirley Ma wrote:

The way I am changing it is: only when the netif queue has stopped do we start to count num_free descriptors to send the signal to wake the netif queue.

I forgot to mention, the code change I am making is in the guest kernel, in the xmit callback: only wake up the queue when it's stopped and num_free >= 1/2 * vq->num. I add a new API in virtio_ring. FYI :)

I have tried this before. There are a couple of issues:

1. The free count will not reduce until you run free_old_xmit_skbs, which will not run anymore since the tx queue is stopped.
2. You cannot call free_old_xmit_skbs directly as it races with a queue that was just awakened (the current cb was due to the delay in disabling cb's). You have to call free_old_xmit_skbs() under a netif_queue_stopped() check to avoid the race.

I got a small improvement in my testing up to some number of threads (32 or 48?), but beyond that I was getting a regression.

Thanks, - KK

However vhost signaling reduction is needed as well. The patch I submitted a while ago showed both CPU and BW improvement.
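A minimal sketch of that ordering in the guest xmit-done callback (vq_free_count() and vq_ring_size() stand in for the new virtio_ring API Shirley mentions; this is not the code as posted):

	static void skb_xmit_done(struct virtqueue *svq)
	{
		struct virtnet_info *vi = svq->vdev->priv;

		/* Reap only while the queue is still stopped so a freshly
		 * awakened queue cannot race with us. */
		if (netif_queue_stopped(vi->dev)) {
			free_old_xmit_skbs(vi);
			if (vq_free_count(svq) >= vq_ring_size(svq) / 2)
				netif_wake_queue(vi->dev);
			else
				virtqueue_enable_cb(svq); /* wait for more completions */
		}
	}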
Re: Network performance with small packets
Shirley Ma mashi...@us.ibm.com wrote:

I have tried this before. There are a couple of issues: 1. the free count will not reduce until you run free_old_xmit_skbs, which will not run anymore since the tx queue is stopped. 2. You cannot call free_old_xmit_skbs directly as it races with a queue that was just awakened (the current cb was due to the delay in disabling cb's). You have to call free_old_xmit_skbs() under a netif_queue_stopped() check to avoid the race.

Yes, that's what I did: when the netif queue stops, don't enable the queue, just free_old_xmit_skbs(); if not enough was freed, then enable the callback until half of the ring size is freed, then wake the netif queue. But somehow I didn't reach the performance compared to dropping packets, need to think about it more. :)

Did you check if the number of vmexits increased with this patch? This is possible if the device was keeping up (and not going into a stop, start, xmit 1 packet, stop, start loop). Also maybe you should try 1/4th instead of 1/2? MST's delayed signalling should avoid this issue, I haven't tried both together.

Thanks, - KK
MQ performance on other cards (cxgb3)
I had sent this mail to Michael last week - he agrees that I should share this information on the list:

On latest net-next-2.6, virtio-net (guest->host) results are:

	SQ vs MQ (#txqs=8)
	#    BW1     BW2 (%)         CPU1  CPU2 (%)       RCPU1  RCPU2 (%)
	1    105774  112256 (6.1)    257   255 (-.7)      532    549 (3.1)
	2    20842   30674 (47.1)    107   150 (40.1)     208    279 (34.1)
	4    22500   31953 (42.0)    241   409 (69.7)     467    619 (32.5)
	8    22416   44507 (98.5)    477   1039 (117.8)   960    1459 (51.9)
	16   22605   45372 (100.7)   905   2060 (127.6)   1895   2962 (56.3)
	24   23192   44201 (90.5)    1360  3028 (122.6)   2833   4437 (56.6)
	32   23158   43394 (87.3)    1811  3957 (118.4)   3770   5936 (57.4)
	40   23322   42550 (82.4)    2276  4986 (119.0)   4711   7417 (57.4)
	48   23564   41931 (77.9)    2757  5966 (116.3)   5653   8896 (57.3)
	64   23949   41092 (71.5)    3788  7898 (108.5)   7609   11826 (55.4)
	80   23256   41343 (77.7)    4597  9887 (115.0)   9503   14801 (55.7)
	96   23310   40645 (74.3)    5588  11758 (110.4)  11381  17761 (56.0)
	128  24095   41082 (70.5)    7587  15574 (105.2)  15029  23716 (57.8)

	Avg: BW: (58.3) CPU: (110.8) RCPU: (55.9)

It's true that average CPU% on the guest is almost double that of the BW improvement. But I don't think this is due to the patch (the driver does no synchronization, etc).

To compare MQ vs SQ on a 10G card, I ran the same test from host to remote host across cxgb3. The results are somewhat similar. (I changed cxgb_open on the client system to:
	netif_set_real_num_tx_queues(dev, 1);
	err = netif_set_real_num_rx_queues(dev, 1);
to simulate single queue (SQ))

	cxgb3 SQ vs cxgb3 MQ
	#    BW1    BW2 (%)        CPU1  CPU2 (%)
	1    8301   8315 (.1)      5     4.66 (-6.6)
	2    9395   9380 (-.1)     16    16 (0)
	4    9411   9414 (0)       33    26 (-21.2)
	8    9411   9398 (-.1)     60    62 (3.3)
	16   9412   9413 (0)       116   117 (.8)
	24   9442   9963 (5.5)     179   198 (10.6)
	32   10031  10025 (0)      230   249 (8.2)
	40   9953   10024 (.7)     300   312 (4.0)
	48   10002  10015 (.1)     351   376 (7.1)
	64   10022  10024 (0)      494   515 (4.2)
	80   8894   10011 (12.5)   537   630 (17.3)
	96   8465   9907 (17.0)    612   749 (22.3)
	128  7541   9617 (27.5)    760   989 (30.1)

	Avg: BW: (3.8) CPU: (14.8)

(Each case runs once for 60 secs.) The BW increased modestly but CPU increased much more. I assume the change I made above to convert the driver from MQ to SQ is not incorrect.

Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 11/09/2010 09:03:25 PM:

Something strange here, right? 1. You are consistently getting 10G/s here, and even with a single stream?

Sorry, I should have mentioned this though I had stated it in my earlier mails. Each test result has two iterations, each of 60 seconds, except when #netperfs is 1, for which I do 10 iterations (sum across 10 iterations).

So need to divide the number by 10?

Yes, that is what I get with 512/1K macvtap I/O size :) I started doing many more iterations for 1 netperf after finding the issue earlier with single stream. So the BW is only 4.5-7 Gbps.

2. With 2 streams, is where we get 10G/s originally. Instead of doubling that we get a marginal improvement with 2 queues and about 30% worse with 1 queue. (doubling happens consistently for guest -> host, but never for remote host)

I tried 512/txqs=2 and 1024/txqs=8 to get a varied testing scenario. In the first case, there is a slight improvement in BW and a good reduction in SD. In the second case, only SD improves (though BW drops for 2 streams for some reason). In both cases, BW and SD improve as the number of sessions increases.

I guess this is another indication that something's wrong.

The patch - both virtio-net and vhost-net - doesn't have any locking/mutexes or any synchronization method. Guest -> host performance improvement of up to 100% shows the patch is not doing anything wrong.

We are quite far from line rate, the fact that BW does not scale means there's some contention in the code.

Attaining line speed with macvtap seems to be a generic issue and unrelated to my patch specifically. IMHO if there is nothing wrong in the code (review) and it is accepted, it will benefit as others can also help to find what needs to be implemented in vhost/macvtap/qemu to get line speed for guest->remote host.

PS: bare-metal performance for host->remote host is also 2.7 Gbps and 2.8 Gbps for 512/1024 for the same card.

Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 02:27:09 PM:

Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net

On Mon, Oct 25, 2010 at 09:20:38PM +0530, Krishna Kumar2 wrote: Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM: Any feedback, comments, objections, issues or bugs about the patches? Please let me know if something needs to be done. Some more test results:

	Host->Guest BW (numtxqs=2)
	#  BW%  CPU%  RCPU%  SD%  RSD%

I think we discussed the need for external to guest testing over 10G. For large messages we should not see any change but you should be able to get better numbers for small messages assuming a MQ NIC card.

I had to make a few changes to qemu (and a minor change in the macvtap driver) to get multiple TXQ support using macvtap working. The NIC is an ixgbe card.

	Org vs New (I/O: 512 bytes, #numtxqs=2, #vhosts=3)
	#    BW1    BW2 (%)         SD1     SD2 (%)          RSD1    RSD2 (%)
	1    14367  13142 (-8.5)    56      62 (10.7)        8       8 (0)
	2    3652   3855 (5.5)      37      35 (-5.4)        7       6 (-14.2)
	4    12529  12059 (-3.7)    65      77 (18.4)        35      35 (0)
	8    13912  14668 (5.4)     288     332 (15.2)       175     184 (5.1)
	16   13433  14455 (7.6)     1218    1321 (8.4)       920     943 (2.5)
	24   12750  13477 (5.7)     2876    2985 (3.7)       2514    2348 (-6.6)
	32   11729  12632 (7.6)     5299    5332 (.6)        4934    4497 (-8.8)
	40   11061  11923 (7.7)     8482    8364 (-1.3)      8374    7495 (-10.4)
	48   10624  11267 (6.0)     12329   12258 (-.5)      12762   11538 (-9.5)
	64   10524  10596 (.6)      21689   22859 (5.3)      23626   22403 (-5.1)
	80   9856   10284 (4.3)     35769   36313 (1.5)      39932   36419 (-8.7)
	96   9691   10075 (3.9)     52357   52259 (-.1)      58676   53463 (-8.8)
	128  9351   9794 (4.7)      114707  94275 (-17.8)    114050  97337 (-14.6)

	Avg: BW: (3.3) SD: (-7.3) RSD: (-11.0)

	Org vs New (I/O: 1K, #numtxqs=8, #vhosts=5)
	#    BW1    BW2 (%)          SD1    SD2 (%)          RSD1    RSD2 (%)
	1    16509  15985 (-3.1)     45     47 (4.4)         7       7 (0)
	2    6963   4499 (-35.3)     17     51 (200.0)       7       7 (0)
	4    12932  11080 (-14.3)    49     74 (51.0)        35      35 (0)
	8    13878  14095 (1.5)      223    292 (30.9)       175     181 (3.4)
	16   13440  13698 (1.9)      980    1131 (15.4)      926     942 (1.7)
	24   12680  12927 (1.9)      2387   2463 (3.1)       2526    2342 (-7.2)
	32   11714  12261 (4.6)      4506   4486 (-.4)       4941    4463 (-9.6)
	40   11059  11651 (5.3)      7244   7081 (-2.2)      8349    7437 (-10.9)
	48   10580  11095 (4.8)      10811  10500 (-2.8)     12809   11403 (-10.9)
	64   10569  10566 (0)        19194  19270 (.3)       23648   21717 (-8.1)
	80   9827   10753 (9.4)      31668  29425 (-7.0)     39991   33824 (-15.4)
	96   10043  10150 (1.0)      45352  44227 (-2.4)     57766   51131 (-11.4)
	128  9360   9979 (6.6)       92058  79198 (-13.9)    114381  92873 (-18.8)

	Avg: BW: (-.5) SD: (-7.5) RSD: (-14.7)

Is there anything else you would like me to test/change, or shall I submit the next version (with the above macvtap changes)?

Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com:

I think we discussed the need for external to guest testing over 10G. For large messages we should not see any change but you should be able to get better numbers for small messages assuming a MQ NIC card.

For external host, there is a contention among different queues (vhosts) when packets are processed in tun/bridge, unless I implement MQ TX for macvtap (tun/bridge?). So my testing shows a small improvement (1 to 1.5% average) in BW and a rise in SD (between 10-15%). For remote host, I think tun/macvtap needs MQ TX support?

Confused. I thought this *is* with a multiqueue tun/macvtap? bridge does not do any queueing AFAIK ... I think we need to fix the contention. With migration what was guest to host a minute ago might become guest to external now ...

Macvtap RX is MQ but not TX. I don't think MQ TX support is required for macvtap, though. Is it enough for the existing macvtap sendmsg to work, since it calls dev_queue_xmit which selects the txq for the outgoing device?

Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Krishna Kumar2/India/IBM wrote on 10/28/2010 10:44:14 AM:

Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning):

Is binding vhost threads to CPUs really required? What happens if we let the scheduler do its job?

Nothing drastic, I remember BW% and SD% both improved a bit as a result of binding.

If there's a significant improvement this would mean that we need to rethink the vhost-net interaction with the scheduler.

I will get a test run with and without binding and post the results later today.

Correction: The result with binding is much better for SD/CPU compared to without-binding:

	numtxqs=8, vhosts=5, Bind vs No-bind
	#    BW%    CPU%    RCPU%  SD%     RSD%
	1    11.25  10.77   1.89   0       -6.06
	2    18.66  7.20    7.20   -14.28  -7.40
	4    4.24   -1.27   1.56   -2.70   -.98
	8    14.91  -3.79   5.46   -12.19  -3.76
	16   12.32  -8.67   4.63   -35.97  -26.66
	24   11.68  -7.83   5.10   -40.73  -32.37
	32   13.09  -10.51  6.57   -51.52  -42.28
	40   11.04  -4.12   11.23  -50.69  -42.81
	48   8.61   -10.30  6.04   -62.38  -55.54
	64   7.55   -6.05   6.41   -61.20  -56.04
	80   8.74   -11.45  6.29   -72.65  -67.17
	96   9.84   -6.01   9.87   -69.89  -64.78
	128  5.57   -6.23   8.99   -75.03  -70.97

	BW: 10.4%, CPU/RCPU: -7.4%,7.7%, SD: -70.5%,-65.7%

Notes:
1. All my earlier test results were with vhosts bound to cpus 0-3 for both the org and new kernels.
2. I am not using MST's use_mm patch, only the mainline kernel. However, I reported earlier that I got better results with that patch. The result for MQ vs MQ+use_mm patch (from my earlier mail): BW: 0 CPU/RCPU: -4.2,-6.1 SD/RSD: -13.1,-15.6

Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 10/26/2010 04:39:13 PM:

(merging two posts into one)

I think we discussed the need for external to guest testing over 10G. For large messages we should not see any change but you should be able to get better numbers for small messages assuming a MQ NIC card.

For external host, there is a contention among different queues (vhosts) when packets are processed in tun/bridge, unless I implement MQ TX for macvtap (tun/bridge?). So my testing shows a small improvement (1 to 1.5% average) in BW and a rise in SD (between 10-15%). For remote host, I think tun/macvtap needs MQ TX support?

Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning):

Is binding vhost threads to CPUs really required? What happens if we let the scheduler do its job?

Nothing drastic, I remember BW% and SD% both improved a bit as a result of binding.

If there's a significant improvement this would mean that we need to rethink the vhost-net interaction with the scheduler.

I will get a test run with and without binding and post the results later today.

Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Krishna Kumar2/India/IBM wrote on 10/26/2010 10:40:35 AM:

I am trying to wrap my head around the kernel/user interface here. E.g., will we need another incompatible change when we add multiple RX queues?

Though I added a 'mq' option to qemu, there shouldn't be any incompatibility between old and new qemu's wrt vhost and virtio-net drivers. So the old qemu will run new host and new guest without issues, and new qemu can also run old host and old guest. Multiple RXQ will also not add any incompatibility.

With MQ RX, I will be able to remove the heuristic (idea from David Stevens). The idea is: Guest sends out packets on, say TXQ#2, vhost#2 processes the packets but packets going out from host to guest might be sent out on a different RXQ, say RXQ#4. Guest receives the packet on RXQ#4, and all future responses on that connection are sent on TXQ#4. Now vhost#4 processes both RX and TX packets for this connection. Without needing to hash on the connection, the guest can make sure that the same vhost thread will handle a single connection.

Also need to think about how robust our single stream heuristic is, e.g. what are the chances it will misdetect a bidirectional UDP stream as a single TCP?

I think it should not happen. The heuristic code gets called for handling just the transmit packets; packets that vhost sends out to the guest skip this path. I tested unidirectional and bidirectional UDP to confirm: 8 iterations of iperf tests, each iteration of 15 secs, result is the sum of all 8 iterations in Gbits/sec:

	Uni-directional    Bi-directional
	Org      New       Org      New
	71.78    71.77     71.74    72.07

Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning):

	numtxqs=8, vhosts=5
	#    BW%     CPU%   SD%
	1    .49     1.07   0
	2    23.51   52.51  26.66
	4    75.17   72.43  8.57
	8    86.54   80.21  27.85
	16   92.37   85.99  6.27
	24   91.37   84.91  8.41
	32   89.78   82.90  3.31
	48   89.85   79.95  -3.57
	64   85.83   80.28  2.22
	80   88.90   79.47  -23.18
	96   90.12   79.98  14.71
	128  86.13   80.60  4.42

	BW: 71.3%, CPU: 80.4%, SD: 1.2%

	numtxqs=16, vhosts=5
	#    BW%     CPU%   SD%
	1    1.80    0      0
	2    19.81   50.68  26.66
	4    57.31   52.77  8.57
	8    108.44  88.19  -5.21
	16   106.09  85.03  -4.44
	24   102.34  84.23  -.82
	32   102.77  82.71  -5.81
	48   100.00  79.62  -7.29
	64   96.86   79.75  -6.10
	80   99.26   79.82  -27.34
	96   94.79   80.02  -5.08
	128  98.14   81.15  -15.25

	BW: 77.9%, CPU: 80.4%, SD: -13.6%

Thanks, - KK
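As a rough sketch of the pairing idea (not the posted code), a guest-side queue selector that keeps a connection on one queue pair could look like:

	static u16 virtnet_select_queue(struct net_device *dev, struct sk_buff *skb)
	{
		u16 txq;

		/* Prefer the queue the flow was last received on, so the
		 * same vhost thread sees both directions; fall back to the
		 * flow hash.  Assumes the host steers RX by the same hash. */
		if (skb_rx_queue_recorded(skb))
			txq = skb_get_rx_queue(skb);
		else
			txq = skb_get_rxhash(skb);

		return txq % dev->real_num_tx_queues;
	}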
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com:

On Tue, Oct 26, 2010 at 02:38:53PM +0530, Krishna Kumar2 wrote: Results for UDP BW tests (unidirectional, sum across 3 iterations, each iteration of 45 seconds, default netperf, vhosts bound to cpus 0-3; no other tuning):

Is binding vhost threads to CPUs really required? What happens if we let the scheduler do its job?

Nothing drastic, I remember BW% and SD% both improved a bit as a result of binding.

I started binding vhost threads after Avi suggested it in response to my v1 patch (he suggested some more that I haven't done), and have been doing only this tuning ever since. This is part of his mail for the tuning:

	vhost:
		thread #0: CPU0
		thread #1: CPU1
		thread #2: CPU2
		thread #3: CPU3

I simply bound each thread to CPU0-3 instead.

Thanks, - KK
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Krishna Kumar2/India/i...@ibmin wrote on 10/20/2010 02:24:52 PM:

Any feedback, comments, objections, issues or bugs about the patches? Please let me know if something needs to be done. Some more test results:

	Host->Guest BW (numtxqs=2)
	#    BW%    CPU%   RCPU%  SD%     RSD%
	1    5.53   .31    .67    -5.88   0
	2    -2.11  -1.01  -2.08  4.34    0
	4    13.53  10.77  13.87  -1.96   0
	8    34.22  22.80  30.53  -8.46   -2.50
	16   30.89  24.06  35.17  -5.20   3.20
	24   33.22  26.30  43.39  -5.17   7.58
	32   30.85  27.27  47.74  -.59    15.51
	40   33.80  27.33  48.00  -7.42   7.59
	48   45.93  26.33  45.46  -12.24  1.10
	64   33.51  27.11  45.00  -3.27   10.30
	80   39.28  29.21  52.33  -4.88   12.17
	96   32.05  31.01  57.72  -1.02   19.05
	128  35.66  32.04  60.00  -.66    20.41

	BW: 23.5% CPU/RCPU: 28.6%,51.2% SD/RSD: -2.6%,15.8%

	Guest->Host 512 byte (numtxqs=2)
	#    BW%     CPU%    RCPU%  SD%     RSD%
	1    3.02    -3.84   -4.76  -12.50  -7.69
	2    52.77   -15.73  -8.66  -45.31  -40.33
	4    -23.14  13.84   7.50   50.58   40.81
	8    -21.44  28.08   16.32  63.06   47.43
	16   33.53   46.50   27.19  7.61    -6.60
	24   55.77   42.81   30.49  -8.65   -16.48
	32   52.59   38.92   29.08  -9.18   -15.63
	40   50.92   36.11   28.92  -10.59  -15.30
	48   46.63   34.73   28.17  -7.83   -12.32
	64   45.56   37.12   28.81  -5.05   -10.80
	80   44.55   36.60   28.45  -4.95   -10.61
	96   43.02   35.97   28.89  -.11    -5.31
	128  38.54   33.88   27.19  -4.79   -9.54

	BW: 34.4% CPU/RCPU: 35.9%,27.8% SD/RSD: -4.1%,-9.3%

Thanks, - KK

[v3 RFC PATCH 0/4] Implement multiqueue virtio-net

The following set of patches implement transmit MQ in virtio-net. Also included are the user qemu changes. MQ is disabled by default unless qemu specifies it.

Changes from rev2:
1. Define (in virtio_net.h) the maximum send txqs; and use in virtio-net and vhost-net.
2. vi->sq[i] is allocated individually, resulting in cache line aligned sq[0] to sq[n]. Another option was to define 'send_queue' as:
	struct send_queue {
		struct virtqueue *svq;
		struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
	} ____cacheline_aligned_in_smp;
and to statically allocate 'VIRTIO_MAX_SQ' of those. I hope the submitted method is preferable.
3. Changed vhost model such that vhost[0] handles RX and vhost[1-MAX] handles TX[0-n].
4. Further change TX handling such that vhost[0] handles both RX/TX for the single stream case.

Enabling MQ on virtio: when the following options are passed to qemu:
	- smp > 1
	- vhost=on
	- mq=on (new option, default:off)
then #txqueues = #cpus. The #txqueues can be changed by using an optional 'numtxqs' option. e.g. for a smp=4 guest:
	vhost=on                   -> #txqueues = 1
	vhost=on,mq=on             -> #txqueues = 4
	vhost=on,mq=on,numtxqs=2   -> #txqueues = 2
	vhost=on,mq=on,numtxqs=8   -> #txqueues = 8

Performance (guest -> local host):
System configuration: Host: 8 Intel Xeon, 8 GB memory; Guest: 4 cpus, 2 GB memory. Test: Each test case runs for 60 secs, sum over three runs (except when number of netperf sessions is 1, which has 10 runs of 12 secs each). No tuning (default netperf) other than taskset'ing vhosts to cpus 0-3. numtxqs=32 gave the best results though the guest had only 4 vcpus (I haven't tried beyond that).

	numtxqs=2, vhosts=3
	#sessions  BW%    CPU%   RCPU%  SD%     RSD%
	1          4.46   -1.96  .19    -12.50  -6.06
	2          4.93   -1.16  2.10   0       -2.38
	4          46.17  64.77  33.72  19.51   -2.48
	8          47.89  70.00  36.23  41.46   13.35
	16         48.97  80.44  40.67  21.11   -5.46
	24         49.03  78.78  41.22  20.51   -4.78
	32         51.11  77.15  42.42  15.81   -6.87
	40         51.60  71.65  42.43  9.75    -8.94
	48         50.10  69.55  42.85  11.80   -5.81
	64         46.24  68.42  42.67  14.18   -3.28
	80         46.37  63.13  41.62  7.43    -6.73
	96         46.40  63.31  42.20  9.36    -4.78
	128        50.43  62.79  42.16  13.11   -1.23

	BW: 37.2
Re: [v3 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 10/25/2010 09:47:18 PM:

Any feedback, comments, objections, issues or bugs about the patches? Please let me know if something needs to be done.

I am trying to wrap my head around the kernel/user interface here. E.g., will we need another incompatible change when we add multiple RX queues?

Though I added a 'mq' option to qemu, there shouldn't be any incompatibility between old and new qemu's wrt vhost and virtio-net drivers. So the old qemu will run new host and new guest without issues, and new qemu can also run old host and old guest. Multiple RXQ will also not add any incompatibility.

With MQ RX, I will be able to remove the heuristic (idea from David Stevens). The idea is: Guest sends out packets on, say TXQ#2, vhost#2 processes the packets but packets going out from host to guest might be sent out on a different RXQ, say RXQ#4. Guest receives the packet on RXQ#4, and all future responses on that connection are sent on TXQ#4. Now vhost#4 processes both RX and TX packets for this connection. Without needing to hash on the connection, the guest can make sure that the same vhost thread will handle a single connection.

Also need to think about how robust our single stream heuristic is, e.g. what are the chances it will misdetect a bidirectional UDP stream as a single TCP?

I think it should not happen. The heuristic code gets called for handling just the transmit packets; packets that vhost sends out to the guest skip this path. I tested unidirectional and bidirectional UDP to confirm: 8 iterations of iperf tests, each iteration of 15 secs, result is the sum of all 8 iterations in Gbits/sec:

	Uni-directional    Bi-directional
	Org      New       Org      New
	71.78    71.77     71.74    72.07

Thanks, - KK
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 10/12/2010 10:39:07 PM:

Sorry for the delay, I was sick last couple of days. The results with your patch are (%'s over original code):

	Code              BW%    CPU%    RemoteCPU%
	MQ (#txq=16)      31.4%  38.42%  6.41%
	MQ+MST (#txq=16)  28.3%  18.9%   -10.77%

The patch helps CPU utilization but didn't help the single stream drop. Thanks,

What other shared TX/RX locks are there? In your setup, is the same macvtap socket structure used for RX and TX? If yes this will create cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line, there might also be contention on the lock in sk_sleep waitqueue. Anything else?

The patch is not introducing any locking (both vhost and virtio-net). The single stream drop is due to different vhost threads handling the RX/TX traffic. I added a heuristic (fuzzy) to determine if more than one flow is being used on the device, and if not, use vhost[0] for both tx and rx (vhost_poll_queue figures this out before waking up the suitable vhost thread). Testing shows that single stream performance is as good as the original code.

	#txqs = 2 (#vhosts = 3)
	#    BW1    BW2 (%)          CPU1  CPU2 (%)        RCPU1  RCPU2 (%)
	1    77344  74973 (-3.06)    172   143 (-16.86)    358    324 (-9.49)
	2    20924  21107 (.87)      107   103 (-3.73)     220    217 (-1.36)
	4    21629  32911 (52.16)    214   391 (82.71)     446    616 (38.11)
	8    21678  34359 (58.49)    428   845 (97.42)     892    1286 (44.17)
	16   22046  34401 (56.04)    841   1677 (99.40)    1785   2585 (44.81)
	24   22396  35117 (56.80)    1272  2447 (92.37)    2667   3863 (44.84)
	32   22750  35158 (54.54)    1719  3233 (88.07)    3569   5143 (44.10)
	40   23041  35345 (53.40)    2219  3970 (78.90)    4478   6410 (43.14)
	48   23209  35219 (51.74)    2707  4685 (73.06)    5386   7684 (42.66)
	64   23215  35209 (51.66)    3639  6195 (70.23)    7206   10218 (41.79)
	80   23443  35179 (50.06)    4633  7625 (64.58)    9051   12745 (40.81)
	96   24006  36108 (50.41)    5635  9096 (61.41)    10864  15283 (40.67)
	128  23601  35744 (51.45)    7475  12104 (61.92)   14495  20405 (40.77)

	SUM: BW: (37.6) CPU: (69.0) RCPU: (41.2)

	#txqs = 8 (#vhosts = 5)
	#    BW1    BW2 (%)          CPU1  CPU2 (%)        RCPU1  RCPU2 (%)
	1    77344  75341 (-2.58)    172   171 (-.58)      358    356 (-.55)
	2    20924  26872 (28.42)    107   135 (26.16)     220    262 (19.09)
	4    21629  33594 (55.31)    214   394 (84.11)     446    615 (37.89)
	8    21678  39714 (83.19)    428   949 (121.72)    892    1358 (52.24)
	16   22046  39879 (80.88)    841   1791 (112.96)   1785   2737 (53.33)
	24   22396  38436 (71.61)    1272  2111 (65.95)    2667   3453 (29.47)
	32   22750  38776 (70.44)    1719  3594 (109.07)   3569   5421 (51.89)
	40   23041  38023 (65.02)    2219  4358 (96.39)    4478   6507 (45.31)
	48   23209  33811 (45.68)    2707  4047 (49.50)    5386   6222 (15.52)
	64   23215  30212 (30.13)    3639  3858 (6.01)     7206   5819 (-19.24)
	80   23443  34497 (47.15)    4633  7214 (55.70)    9051   10776 (19.05)
	96   24006  30990 (29.09)    5635  5731 (1.70)     10864  8799 (-19.00)
	128  23601  29413 (24.62)    7475  7804 (4.40)     14495  11638 (-19.71)

	SUM: BW: (40.1) CPU: (35.7) RCPU: (4.1)

The SD numbers are also good (same table as before, but SD instead of CPU):

	#txqs = 2 (#vhosts = 3)
	#    BW%      SD1    SD2 (%)          RSD1   RSD2 (%)
	1    (-3.06)  5      4 (-20.00)       21     19 (-9.52)
	2    .87      6      6 (0)            27     27 (0)
	4    52.16    26     32 (23.07)       108    103 (-4.62)
	8    58.49    103    146 (41.74)      431    445 (3.24)
	16   56.04    407    514 (26.28)      1729   1586 (-8.27)
	24   56.80    934    1161 (24.30)     3916   3665 (-6.40)
	32   54.54    1668   2160 (29.49)     6925   6872 (-.76)
	40   53.40    2655   3317 (24.93)     10712  10707 (-.04)
	48   51.74    3920   4486 (14.43)     15598  14715 (-5.66)
	64   51.66    7096   8250 (16.26)     28099  27211 (-3.16)
	80   50.06    11240  12586 (11.97)    43913  42070 (-4.19)
	96   50.41    16342  16976
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com:

What other shared TX/RX locks are there? In your setup, is the same macvtap socket structure used for RX and TX? If yes this will create cacheline bounces as sk_wmem_alloc/sk_rmem_alloc share a cache line, there might also be contention on the lock in sk_sleep waitqueue. Anything else?

The patch is not introducing any locking (both vhost and virtio-net). The single stream drop is due to different vhost threads handling the RX/TX traffic. I added a heuristic (fuzzy) to determine if more than one flow is being used on the device, and if not, use vhost[0] for both tx and rx (vhost_poll_queue figures this out before waking up the suitable vhost thread). Testing shows that single stream performance is as good as the original code. ... This approach works nicely for both single and multiple streams. Does this look good? Thanks, - KK

Yes, but I guess it depends on the heuristic :) What's the logic?

I define how recently a txq was used. If 0 or 1 txq's were used recently, use vq[0] (which also handles rx). Otherwise, use multiple txqs (vq[1-n]). The code is:

	/*
	 * Algorithm for selecting vq:
	 *
	 * Condition                                    Return
	 * RX vq                                        vq[0]
	 * If all txqs unused                           vq[0]
	 * If one txq used, and new txq is same         vq[0]
	 * If one txq used, and new txq is different    vq[vq->qnum]
	 * If > 1 txqs used                             vq[vq->qnum]
	 * Where "used" means the txq was used in the last 'n' jiffies.
	 *
	 * Note: locking is not required as an update race will only result in
	 * a different worker being woken up.
	 */
	static inline struct vhost_virtqueue *vhost_find_vq(struct vhost_poll *poll)
	{
		if (poll->vq->qnum) {
			struct vhost_dev *dev = poll->vq->dev;
			struct vhost_virtqueue *vq = &dev->vqs[0];
			unsigned long max_time = jiffies - 5;	/* Some macro needed */
			unsigned long *table = dev->jiffies;
			int i, used = 0;

			for (i = 0; i < dev->nvqs - 1; i++) {
				if (time_after_eq(table[i], max_time) && ++used > 1) {
					vq = poll->vq;
					break;
				}
			}
			table[poll->vq->qnum - 1] = jiffies;
			return vq;
		}

		/* RX is handled by the same worker thread */
		return poll->vq;
	}

	void vhost_poll_queue(struct vhost_poll *poll)
	{
		struct vhost_virtqueue *vq = vhost_find_vq(poll);

		vhost_work_queue(vq, poll->work);
	}

Since poll batches packets, find_vq does not seem to add much to the CPU utilization (or BW). I am sure that code can be optimized much better. The results I sent in my last mail were without your use_mm patch, and the only tuning was to make vhost threads run on only cpus 0-3 (though the performance is good even without that). I will test it later today with the use_mm patch too.

Thanks, - KK
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
Krishna Kumar2/India/IBM wrote on 10/14/2010 02:34:01 PM:

	void vhost_poll_queue(struct vhost_poll *poll)
	{
		struct vhost_virtqueue *vq = vhost_find_vq(poll);

		vhost_work_queue(vq, poll->work);
	}

Since poll batches packets, find_vq does not seem to add much to the CPU utilization (or BW). I am sure that code can be optimized much better. The results I sent in my last mail were without your use_mm patch, and the only tuning was to make vhost threads run on only cpus 0-3 (though the performance is good even without that). I will test it later today with the use_mm patch too.

There's a significant reduction in CPU/SD utilization with your patch. Following is the performance of ORG vs MQ+mm patch:

	Org vs MQ+mm patch, txq=2
	#    BW%    CPU/RCPU%        SD/RSD%
	1    2.26   -1.16   .27      -20.00  0
	2    35.07  29.90   21.81    0       -11.11
	4    55.03  84.57   37.66    26.92   -4.62
	8    73.16  118.69  49.21    45.63   -.46
	16   77.43  98.81   47.89    24.07   -7.80
	24   71.59  105.18  48.44    62.84   18.18
	32   70.91  102.38  47.15    49.22   8.54
	40   63.26  90.58   41.00    85.27   37.33
	48   45.25  45.99   11.23    14.31   -12.91
	64   42.78  41.82   5.50     .43     -25.12
	80   31.40  7.31    -18.69   15.78   -11.93
	96   27.60  7.79    -18.54   17.39   -10.98
	128  23.46  -11.89  -34.41   -.41    -25.53

	BW: 40.2 CPU/RCPU: 29.9,-2.2 SD/RSD: 12.0,-15.6

Following is the performance of MQ vs MQ+mm patch:

	MQ vs MQ+mm patch
	#    BW%     CPU%    RCPU%   SD%     RSD%
	1    4.98    -.58    .84     -20.00  0
	2    5.17    2.96    2.29    0       -4.00
	4    -.18    .25     -.16    3.12    .98
	8    -5.47   -1.36   -1.98   17.18   16.57
	16   -1.90   -6.64   -3.54   -14.83  -12.12
	24   -.01    23.63   14.65   57.61   46.64
	32   .27     -3.19   -3.11   -22.98  -22.91
	40   -1.06   -2.96   -2.96   -4.18   -4.10
	48   -.28    -2.34   -3.71   -2.41   -3.81
	64   9.71    33.77   30.65   81.44   77.09
	80   -10.69  -31.07  -31.70  -29.22  -29.88
	96   -1.14   5.98    .56     -11.57  -16.14
	128  -.93    -15.60  -18.31  -19.89  -22.65

	BW: 0 CPU/RCPU: -4.2,-6.1 SD/RSD: -13.1,-15.6

Each test case is for 60 secs, sum over two runs (except when number of netperf sessions is 1, which has 7 runs of 10 secs each), numcpus=4, numtxqs=8, etc. No tuning other than taskset'ing each vhost to cpus 0-3.

Thanks, - KK
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
Krishna Kumar2/India/IBM wrote on 10/14/2010 05:47:54 PM:

Sorry, it should read txq=8 below. - KK

There's a significant reduction in CPU/SD utilization with your patch. Following is the performance of ORG vs MQ+mm patch:

	Org vs MQ+mm patch, txq=2
	#    BW%    CPU/RCPU%        SD/RSD%
	1    2.26   -1.16   .27      -20.00  0
	2    35.07  29.90   21.81    0       -11.11
	4    55.03  84.57   37.66    26.92   -4.62
	8    73.16  118.69  49.21    45.63   -.46
	16   77.43  98.81   47.89    24.07   -7.80
	24   71.59  105.18  48.44    62.84   18.18
	32   70.91  102.38  47.15    49.22   8.54
	40   63.26  90.58   41.00    85.27   37.33
	48   45.25  45.99   11.23    14.31   -12.91
	64   42.78  41.82   5.50     .43     -25.12
	80   31.40  7.31    -18.69   15.78   -11.93
	96   27.60  7.79    -18.54   17.39   -10.98
	128  23.46  -11.89  -34.41   -.41    -25.53

	BW: 40.2 CPU/RCPU: 29.9,-2.2 SD/RSD: 12.0,-15.6

Following is the performance of MQ vs MQ+mm patch:

	MQ vs MQ+mm patch
	#    BW%     CPU%    RCPU%   SD%     RSD%
	1    4.98    -.58    .84     -20.00  0
	2    5.17    2.96    2.29    0       -4.00
	4    -.18    .25     -.16    3.12    .98
	8    -5.47   -1.36   -1.98   17.18   16.57
	16   -1.90   -6.64   -3.54   -14.83  -12.12
	24   -.01    23.63   14.65   57.61   46.64
	32   .27     -3.19   -3.11   -22.98  -22.91
	40   -1.06   -2.96   -2.96   -4.18   -4.10
	48   -.28    -2.34   -3.71   -2.41   -3.81
	64   9.71    33.77   30.65   81.44   77.09
	80   -10.69  -31.07  -31.70  -29.22  -29.88
	96   -1.14   5.98    .56     -11.57  -16.14
	128  -.93    -15.60  -18.31  -19.89  -22.65

	BW: 0 CPU/RCPU: -4.2,-6.1 SD/RSD: -13.1,-15.6

Each test case is for 60 secs, sum over two runs (except when number of netperf sessions is 1, which has 7 runs of 10 secs each), numcpus=4, numtxqs=8, etc. No tuning other than taskset'ing each vhost to cpus 0-3.

Thanks, - KK
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 10/06/2010 07:04:31 PM:

On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote: For 1 TCP netperf, I ran 7 iterations and summed it. Explanation for degradation for 1 stream case:

I thought about possible RX/TX contention reasons, and I realized that we get/put the mm counter all the time. So I wrote the following: I haven't seen any performance gain from this in a single queue case, but maybe this will help multiqueue?

Sorry for the delay, I was sick last couple of days. The results with your patch are (%'s over original code):

	Code              BW%    CPU%    RemoteCPU%
	MQ (#txq=16)      31.4%  38.42%  6.41%
	MQ+MST (#txq=16)  28.3%  18.9%   -10.77%

The patch helps CPU utilization but didn't help the single stream drop.

Thanks, - KK
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 10/06/2010 07:04:31 PM:

On Fri, Sep 17, 2010 at 03:33:07PM +0530, Krishna Kumar wrote: For 1 TCP netperf, I ran 7 iterations and summed it. Explanation for degradation for 1 stream case:

I thought about possible RX/TX contention reasons, and I realized that we get/put the mm counter all the time. So I wrote the following: I haven't seen any performance gain from this in a single queue case, but maybe this will help multiqueue?

Great! I am on vacation tomorrow, but will test with this patch tomorrow night.

Thanks, - KK
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
Arnd Bergmann a...@arndb.de wrote on 10/06/2010 05:49:00 PM:

I don't see any reasons mentioned above. However, for a higher number of netperf sessions, I see a big increase in retransmissions:

	#netperf  ORG         NEW
	          BW (#retr)  BW (#retr)
	1         70244 (0)   64102 (0)
	4         21421 (0)   36570 (416)
	8         21746 (0)   38604 (148)
	16        21783 (0)   40632 (464)
	32        22677 (0)   37163 (1053)
	64        23648 (4)   36449 (2197)
	128       23251 (2)   31676 (3185)

This smells like it could be related to a problem that Ben Greear found recently (see "macvlan: Enable qdisc backoff logic"). When the hardware is busy, we used to just drop the packet. With Ben's patch, we return -EAGAIN to qemu (or vhost-net) to trigger a resend. I suppose what we really should do is feed that condition back to the guest network stack and implement the backoff in there.

Thanks for the pointer. I will take a look at this as I hadn't seen this patch earlier. Is there any way to figure out if this is the issue?

Thanks, - KK
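For illustration, inside vhost's handle_tx loop the backoff Arnd describes could be consumed roughly like this (a sketch, not Ben's patch or the vhost code as posted):

	err = sock->ops->sendmsg(NULL, sock, &msg, len);
	if (unlikely(err == -EAGAIN)) {
		/* Device busy: put the descriptor back and poll the
		 * socket so we retry instead of dropping the packet. */
		vhost_discard_vq_desc(vq);
		tx_poll_start(net, sock);
		break;
	}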
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 10/05/2010 11:53:23 PM:

Any idea where does this come from? Do you see more TX interrupts? RX interrupts? Exits? Do interrupts bounce more between guest CPUs?

4. Identify reasons for single netperf BW regression. After testing various combinations of #txqs, #vhosts, #netperf sessions, I think the drop for 1 stream is due to TX and RX for a flow being processed on different cpus.

Right. Can we fix it?

I am not sure how to. My initial patch had one thread but gave small gains and ran into limitations once the number of sessions became large. I did two more tests:
1. Pin vhosts to same CPU: BW drop is much lower for the 1 stream case (-5 to -8% range), but performance is not so high for more sessions.
2. Changed vhost to be single threaded: No degradation for 1 session, and improvement for up to 8, sometimes 16 streams (5-12%). BW degrades after that, all the way till 128 netperf sessions, but overall CPU utilization improves.
	Summary of the entire run (for 1-128 sessions):
	txq=4:  BW: (-2.3) CPU: (-16.5) RCPU: (-5.3)
	txq=16: BW: (-1.9) CPU: (-24.9) RCPU: (-9.6)

I don't see any reasons mentioned above. However, for a higher number of netperf sessions, I see a big increase in retransmissions:

Hmm, ok, and do you see any errors?

I haven't seen any in any statistics, messages, etc. Also no retransmissions for txq=1. The single netperf case didn't have any retransmissions so that is not the cause for the drop. I tested ixgbe (MQ):

	#netperf  ixgbe          ixgbe (pin intrs to cpu#0 on both server/client)
	          BW (#retr)     BW (#retr)
	1         3567 (117)     6000 (251)
	2         4406 (477)     6298 (725)
	4         6119 (1085)    7208 (3387)
	8         6595 (4276)    7381 (15296)
	16        6651 (11651)   6856 (30394)

Interesting. You are saying we get much more retransmissions with a physical nic as well?

Yes, with ixgbe. I re-ran with 16 netperfs running for 15 secs on both ixgbe and cxgb3 just now to reconfirm:
	ixgbe: BW: 6186.85  SD/Remote: 135.711, 339.376  CPU/Remote: 79.99, 200.00  Retrans: 545
	cxgb3: BW: 8051.07  SD/Remote: 144.416, 260.487  CPU/Remote: 110.88, 200.00  Retrans: 0
However 64 netperfs for 30 secs gave:
	ixgbe: BW: 6691.12  SD/Remote: 8046.617, 5259.992  CPU/Remote: 1223.86, 799.97  Retrans: 1424
	cxgb3: BW: 7799.16  SD/Remote: 2589.875, 4317.013  CPU/Remote: 480.39, 800.64  Retrans: 649

	# ethtool -i eth4
	driver: ixgbe
	version: 2.0.84-k2
	firmware-version: 0.9-3
	bus-info: :1f:00.1

	# ifconfig output:
	RX packets:783241 errors:0 dropped:0 overruns:0 frame:0
	TX packets:689533 errors:0 dropped:0 overruns:0 carrier:0
	collisions:0 txqueuelen:1000

	# lspci output:
	1f:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit Network Connection (rev 01)
		Subsystem: Intel Corporation Ethernet Server Adapter X520-2
		Flags: bus master, fast devsel, latency 0, IRQ 30
		Memory at 9890 (64-bit, prefetchable) [size=512K]
		I/O ports at 2020 [size=32]
		Memory at 98a0 (64-bit, prefetchable) [size=16K]
		Capabilities: [40] Power Management version 3
		Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
		Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
		Capabilities: [a0] Express Endpoint, MSI 00
		Capabilities: [100] Advanced Error Reporting
		Capabilities: [140] Device Serial Number 00-1b-21-ff-ff-40-4a-b4
		Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
		Capabilities: [160] Single Root I/O Virtualization (SR-IOV)
		Kernel driver in use: ixgbe
		Kernel modules: ixgbe

I haven't done this right now since I don't have a setup. I guess it would be limited by wire speed and gains may not be there.
I will try to do this later when I get the setup.

OK but at least need to check that it does not hurt things.

Yes, sure.

Summary:
1. Average BW increase for regular I/O is best for #txq=16 with the least CPU utilization increase.
2. The average BW for 512 byte I/O is best for the lower #txq=2. For higher #txqs, BW increased only after a particular number of netperf sessions - in my testing that limit was 32 netperf sessions.
3. Multiple txq for the guest by itself doesn't seem to have any issues. Guest CPU% increase is slightly higher than the BW improvement. I think it is true for all mq drivers since more paths run in parallel up to the device instead of sleeping and allowing one thread to send all packets via qdisc_restart.
4. Having a high number of txqs gives better gains
Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 09/19/2010 06:14:43 PM:

Could you document how exactly do you measure multistream bandwidth: netperf flags, etc?

All results were without any netperf flags or system tuning:
	for i in $list
	do
		netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
	done
	wait
Another script processes the result files. It also displays the start time/end time of each iteration to make sure skew due to parallel netperfs is minimal.

I changed the vhost functionality once more to try to get the best model, the new model being:
1. #numtxqs=1 -> #vhosts=1, this thread handles both RX/TX.
2. #numtxqs>1 -> vhost[0] handles RX and vhost[1-MAX] handles TX[0-n], where MAX is 4. Beyond numtxqs=4, the remaining TX queues are handled by vhost threads in round-robin fashion.

Results from here on are with these changes, and the only tuning is to set each vhost's affinity to CPUs[0-3] (taskset -p f vhost-pids).

Any idea where does this come from? Do you see more TX interrupts? RX interrupts? Exits? Do interrupts bounce more between guest CPUs?

4. Identify reasons for single netperf BW regression. After testing various combinations of #txqs, #vhosts, #netperf sessions, I think the drop for 1 stream is due to TX and RX for a flow being processed on different cpus. I did two more tests:
1. Pin vhosts to same CPU: BW drop is much lower for the 1 stream case (-5 to -8% range), but performance is not so high for more sessions.
2. Changed vhost to be single threaded: No degradation for 1 session, and improvement for up to 8, sometimes 16 streams (5-12%). BW degrades after that, all the way till 128 netperf sessions, but overall CPU utilization improves.
	Summary of the entire run (for 1-128 sessions):
	txq=4:  BW: (-2.3) CPU: (-16.5) RCPU: (-5.3)
	txq=16: BW: (-1.9) CPU: (-24.9) RCPU: (-9.6)

I don't see any reasons mentioned above. However, for a higher number of netperf sessions, I see a big increase in retransmissions:

	#netperf  ORG         NEW
	          BW (#retr)  BW (#retr)
	1         70244 (0)   64102 (0)
	4         21421 (0)   36570 (416)
	8         21746 (0)   38604 (148)
	16        21783 (0)   40632 (464)
	32        22677 (0)   37163 (1053)
	64        23648 (4)   36449 (2197)
	128       23251 (2)   31676 (3185)

The single netperf case didn't have any retransmissions so that is not the cause for the drop. I tested ixgbe (MQ):

	#netperf  ixgbe          ixgbe (pin intrs to cpu#0 on both server/client)
	          BW (#retr)     BW (#retr)
	1         3567 (117)     6000 (251)
	2         4406 (477)     6298 (725)
	4         6119 (1085)    7208 (3387)
	8         6595 (4276)    7381 (15296)
	16        6651 (11651)   6856 (30394)

5. Test perf in more scenarios: small packets. 512 byte packets - BW drop for up to 8 (sometimes 16) netperf sessions, but increases with #sessions:

	#    BW1    BW2 (%)         CPU1   CPU2 (%)        RCPU1  RCPU2 (%)
	1    4043   3800 (-6.0)     50     50 (0)          86     98 (13.9)
	2    8358   7485 (-10.4)    153    178 (16.3)      230    264 (14.7)
	4    20664  13567 (-34.3)   448    490 (9.3)       530    624 (17.7)
	8    25198  17590 (-30.1)   967    1021 (5.5)      1085   1257 (15.8)
	16   23791  24057 (1.1)     1904   2220 (16.5)     2156   2578 (19.5)
	24   23055  26378 (14.4)    2807   3378 (20.3)     3225   3901 (20.9)
	32   22873  27116 (18.5)    3748   4525 (20.7)     4307   5239 (21.6)
	40   22876  29106 (27.2)    4705   5717 (21.5)     5388   6591 (22.3)
	48   23099  31352 (35.7)    5642   6986 (23.8)     6475   8085 (24.8)
	64   22645  30563 (34.9)    7527   9027 (19.9)     8619   10656 (23.6)
	80   22497  31922 (41.8)    9375   11390 (21.4)    10736  13485 (25.6)
	96   22509  32718 (45.3)    11271  13710 (21.6)    12927  16269 (25.8)
	128  22255  32397 (45.5)    15036  18093 (20.3)    17144  21608 (26.0)

	SUM: BW: (16.7) CPU: (20.6) RCPU: (24.3)

	host -> guest
	#    BW1    BW2 (%)    CPU1  CPU2 (%)    RCPU1
Re: [v2 RFC PATCH 2/4] Changes for virtio-net
Eric Dumazet eric.duma...@gmail.com wrote on 09/17/2010 03:55:54 PM:

	+/* Our representation of a send virtqueue */
	+struct send_queue {
	+	struct virtqueue *svq;
	+
	+	/* TX: fragments + linear part + virtio header */
	+	struct scatterlist tx_sg[MAX_SKB_FRAGS + 2];
	+};

You probably want ____cacheline_aligned_in_smp

I had tried this and mentioned this in Patch 0/4:

2. Cache-align data structures: I didn't see any BW/SD improvement after making the sq's (and similarly for vhost) cache-aligned statically:
	struct virtnet_info {
		...
		struct send_queue sq[16] ____cacheline_aligned_in_smp;
		...
	};
I am not sure why this made no difference?

	 struct virtnet_info {
	 	struct virtio_device *vdev;
	-	struct virtqueue *rvq, *svq, *cvq;
	+	int numtxqs;	/* Number of tx queues */
	+	struct send_queue *sq;
	+	struct virtqueue *rvq;
	+	struct virtqueue *cvq;
	 	struct net_device *dev;

struct napi will probably be dirtied by RX processing. You should make sure it doesn't dirty the cache line of the above (read mostly) fields.

I am changing the layout of napi wrt other pointers in this patch, though the to-be-submitted RX patch does that. Should I do something for this TX-only patch?

	+#define MAX_DEVICE_NAME 16
	+static int initialize_vqs(struct virtnet_info *vi, int numtxqs)
	+{
	+	vq_callback_t **callbacks;
	+	struct virtqueue **vqs;
	+	int i, err = -ENOMEM;
	+	int totalvqs;
	+	char **names;
	+
	+	/* Allocate send queues */

no check on numtxqs ?

Hmm... Please then use kcalloc(numtxqs, sizeof(*vi->sq), GFP_KERNEL) so that some check is done for you ;)

Right! I need to re-introduce some limit. Rusty, should I simply add a check for a constant (like 256) here?

Thanks for your review, Eric! - KK
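A bounds check along the lines Eric suggests might look like the following sketch (VIRTNET_MAX_TXQS is an assumed constant, e.g. the 256 mentioned above):

	/* Sketch: cap and validate numtxqs before allocating */
	if (numtxqs < 1 || numtxqs > VIRTNET_MAX_TXQS)
		return -EINVAL;

	vi->sq = kcalloc(numtxqs, sizeof(*vi->sq), GFP_KERNEL);
	if (!vi->sq)
		return -ENOMEM;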
Re: [v2 RFC PATCH 2/4] Changes for virtio-net
Krishna Kumar2/India/i...@ibmin, sent by netdev-ow...@vger.kernel.org:

 struct virtnet_info {
 	struct virtio_device *vdev;
-	struct virtqueue *rvq, *svq, *cvq;
+	int numtxqs;	/* Number of tx queues */
+	struct send_queue *sq;
+	struct virtqueue *rvq;
+	struct virtqueue *cvq;
 	struct net_device *dev;

struct napi will probably be dirtied by RX processing. You should make sure it doesnt dirty the cache line of the above (read mostly) fields.

I am not changing the layout of napi wrt the other pointers in this patch, though the to-be-submitted RX patch does that. Should I do something for this TX-only patch?

Sorry, I think my sentence is not clear! I will make this change (and also cache-line align the send queues), test and let you know the result; one possible layout is sketched below.

Thanks,

- KK
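One possible layout along those lines - a sketch, not the patch that was eventually posted: read-mostly fields grouped together, with the RX-dirtied napi pushed onto its own cacheline.

    struct virtnet_info {
    	/* Read-mostly after probe */
    	struct virtio_device *vdev;
    	struct virtqueue *rvq, *cvq;
    	struct net_device *dev;
    	int numtxqs;
    	struct send_queue *sq;	/* entries cacheline-aligned */

    	/* Dirtied by RX processing; kept off the lines above */
    	struct napi_struct napi ____cacheline_aligned_in_smp;
    };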
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 09/13/2010 05:20:55 PM:

Results with the original kernel:
_____________________________
#     BW      SD      RSD
_____________________________
1     20903   1       6
2     21963   6       25
4     22042   23      102
8     21674   97      419
16    22281   379     1663
24    22521   857     3748
32    22976   1528    6594
40    23197   2390    10239
48    22973   3542    15074
64    23809   6486    27244
80    23564   10169   43118
96    22977   14954   62948
128   23649   27067   113892
_____________________________

With a higher number of threads running in parallel, SD increased. In this case most threads run in parallel only till __dev_xmit_skb (#numtxqs=1). With the mq TX patch, a higher number of threads run in parallel through ndo_start_xmit. I *think* the increase in SD is to do with the higher number of threads running through the larger code path.

From the numbers I posted with the patch (cut-n-paste of only the % parts), BW increased much more than the SD, sometimes more than twice the increase in SD.

Service demand is CPU/BW, right? So if BW goes up by 50% and SD by 40%, this means that CPU more than doubled.

I think the SD calculation might be more complicated; I think it does it based on adding up averages sampled and stored during the run. But I still don't see how CPU can double? e.g.:

	BW:  1000 -> 1500  (50%)
	SD:  100  -> 140   (40%)
	CPU: 10   -> 10.71 (7.1%)

(Both readings of this arithmetic are worked out in the sketch at the end of this message.)

N#    BW%     SD%     RSD%
4     54.30   40.00   -1.16
8     71.79   46.59   -2.68
16    71.89   50.40   -2.50
32    72.24   34.26   -14.52
48    70.10   31.51   -14.35
64    69.01   38.81   -9.66
96    70.68   71.26   10.74

I also think the SD calculation gets skewed for guest -> local host testing.

If it's broken, let's fix it?

For this test, I ran a guest with numtxqs=16. The first result below is with my patch, which creates 16 vhosts. The second result is with a modified patch which creates only 2 vhosts (testing with #netperfs = 64):

My guess is it's not a good idea to have more TX VQs than guest CPUs.

Definitely. I will try to run tomorrow with more reasonable values; I will also test with my second version of the patch that creates a restricted number of vhosts, and post results.

I realize for management it's easier to pass in a single vhost fd, but just for testing it's probably easier to add code in userspace to open /dev/vhost multiple times.

#vhosts   BW%     SD%      RSD%
16        20.79   186.01   149.74
2         30.89   34.55    18.44

The remote SD increases with the number of vhost threads, but that number seems to correlate with guest SD. So though BW% increased slightly from 20% to 30%, SD fell drastically from 186% to 34%. I think it could be a calculation skew with host SD, which also fell from 150% to 18%.

I think by default netperf looks in /proc/stat for CPU utilization data: so host CPU utilization will include the guest CPU, I think?

It appears that way to me too, but the data above seems to suggest the opposite...

I would go further and claim that for host/guest TCP, CPU utilization and SD should always be identical. Makes sense?

It makes sense to me, but once again I am not sure how SD is really done, or whether it is linear to CPU. Cc'ing Rick in case he can comment.

I am planning to submit the 2nd patch rev with a restricted number of vhosts.

Likely cause for the 1 stream degradation with the multiple vhost patch:

1. Two vhosts run, handling the RX and TX respectively. I think the issue is related to cache ping-pong, esp since these run on different cpus/sockets.

Right. With TCP I think we are better off handling TX and RX for a socket by the same vhost, so that a packet and its ack are handled by the same thread. Is this what happens with the RX multiqueue patch? How do we select an RX queue to put the packet on?

My (unsubmitted) RX patch doesn't do this yet, that is something I will check.
Thanks,

- KK

You'll want to work on top of net-next, I think there's RX flow filtering work going on there.

Thanks Michael, I will follow up on that for the RX patch, plus your suggestion on tying RX with TX.

Thanks,

- KK
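The SD disagreement in this exchange comes down to which way netperf's service demand relates BW and CPU. A standalone check of both readings, using the example numbers quoted above (netperf defines SD as CPU cost per unit of work, which is the reading under which CPU "more than doubles"):

    #include <stdio.h>

    int main(void)
    {
    	double bw1 = 1000, bw2 = 1500;	/* +50% */
    	double sd1 = 100,  sd2 = 140;	/* +40% */

    	/* SD as cost per unit of work: CPU scales as BW * SD */
    	printf("cost/work reading: CPU x%.2f\n",
    	       (bw2 * sd2) / (bw1 * sd1));	/* prints x2.10 */

    	/* SD as work per unit of cost: CPU scales as BW / SD */
    	printf("work/cost reading: CPU x%.2f\n",
    	       (bw2 / sd2) / (bw1 / sd1));	/* prints x1.07 */
    	return 0;
    }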
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 09/12/2010 05:10:25 PM:

SINGLE vhost (Guest -> Host):
	1 netperf:    BW: 10.7%   SD: -1.4%
	4 netperfs:   BW: 3%      SD: 1.4%
	8 netperfs:   BW: 17.7%   SD: -10%
	16 netperfs:  BW: 4.7%    SD: -7.0%
	32 netperfs:  BW: -6.1%   SD: -5.7%

BW and SD both improve (guest multiple txqs help). For 32 netperfs, SD improves. But with multiple vhosts, the guest is able to send more packets and BW increases much more (SD too increases, but I think that is expected).

Why is this expected?

Results with the original kernel:
_____________________________
#     BW      SD      RSD
_____________________________
1     20903   1       6
2     21963   6       25
4     22042   23      102
8     21674   97      419
16    22281   379     1663
24    22521   857     3748
32    22976   1528    6594
40    23197   2390    10239
48    22973   3542    15074
64    23809   6486    27244
80    23564   10169   43118
96    22977   14954   62948
128   23649   27067   113892
_____________________________

With a higher number of threads running in parallel, SD increased. In this case most threads run in parallel only till __dev_xmit_skb (#numtxqs=1). With the mq TX patch, a higher number of threads run in parallel through ndo_start_xmit. I *think* the increase in SD is to do with the higher number of threads running through the larger code path. From the numbers I posted with the patch (cut-n-paste of only the % parts), BW increased much more than the SD, sometimes more than twice the increase in SD.

N#    BW%     SD%     RSD%
4     54.30   40.00   -1.16
8     71.79   46.59   -2.68
16    71.89   50.40   -2.50
32    72.24   34.26   -14.52
48    70.10   31.51   -14.35
64    69.01   38.81   -9.66
96    70.68   71.26   10.74

I also think the SD calculation gets skewed for guest -> local host testing. For this test, I ran a guest with numtxqs=16. The first result below is with my patch, which creates 16 vhosts. The second result is with a modified patch which creates only 2 vhosts (testing with #netperfs = 64):

#vhosts   BW%     SD%      RSD%
16        20.79   186.01   149.74
2         30.89   34.55    18.44

The remote SD increases with the number of vhost threads, but that number seems to correlate with guest SD. So though BW% increased slightly from 20% to 30%, SD fell drastically from 186% to 34%. I think it could be a calculation skew with host SD, which also fell from 150% to 18%.

I am planning to submit the 2nd patch rev with a restricted number of vhosts.

Likely cause for the 1 stream degradation with the multiple vhost patch:

1. Two vhosts run, handling the RX and TX respectively. I think the issue is related to cache ping-pong, esp since these run on different cpus/sockets.

Right. With TCP I think we are better off handling TX and RX for a socket by the same vhost, so that a packet and its ack are handled by the same thread. Is this what happens with the RX multiqueue patch? How do we select an RX queue to put the packet on?

My (unsubmitted) RX patch doesn't do this yet, that is something I will check.

Thanks,

- KK
Re: [RFC PATCH 1/4] Add a new API to virtio-pci
Michael S. Tsirkin m...@redhat.com wrote on 09/12/2010 05:16:37 PM:

On Thu, Sep 09, 2010 at 07:19:33PM +0530, Krishna Kumar2 wrote:

Unfortunately I need a constant in vhost for now.

Maybe not even that: you create multiple vhost-net devices, so vhost-net in the kernel does not care about these either, right? So this can be just part of vhost_net.h in qemu.

Sorry, I didn't understand what you meant. I can remove all socks[] arrays/constants by pre-allocating the sockets in vhost_setup_vqs (sketched below). Then I can remove the socks parameters in vhost_net_stop, vhost_net_release and vhost_net_reset_owner. Does this make sense?

Thanks,

- KK
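A sketch of the pre-allocation idea, with the structure shape and field names guessed for illustration (not the actual patch): the sock array is sized once at setup, so the stop/release/reset paths need no on-stack socks[] array and nothing in them can fail.

    static int vhost_setup_vqs(struct vhost_net *n, int nvqs)
    {
    	/* n->socks is a hypothetical per-device array */
    	n->socks = kcalloc(nvqs, sizeof(*n->socks), GFP_KERNEL);
    	if (!n->socks)
    		return -ENOMEM;
    	return 0;
    }

vhost_net_release() would then walk n->socks[0..nvqs-1] and fput() whatever is non-NULL, with no compile-time MAX_VQS constant anywhere.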
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Krishna Kumar2/India/IBM wrote on 09/08/2010 10:17:49 PM:

Some more results, and the likely cause for the single netperf degradation, below.

Guest -> Host (single netperf): I am getting a drop of almost 20%. I am trying to figure out why.

Host -> guest (single netperf): I am getting an improvement of almost 15%. Again - unexpected.

Guest -> Host TCP_RR: I get an average 7.4% increase in #packets for runs upto 128 sessions. With fewer netperfs (under 8), there was a drop of 3-7% in #packets, but beyond that, the #packets improved significantly to give an average improvement of 7.4%.

So it seems that fewer sessions is having a negative effect for some reason on the tx side. The code path in virtio-net has not changed much, so the drop in some cases is quite unexpected.

The drop for the single netperf seems to be due to multiple vhosts. I changed the patch to start a *single* vhost:

	Guest -> Host (1 netperf, 64K): BW: 10.79%   SD: -1.45%
	Guest -> Host (1 netperf):      Latency: -3%   SD: 3.5%

A single vhost performs well but hits the barrier at 16 netperf sessions:

SINGLE vhost (Guest -> Host):
	1 netperf:    BW: 10.7%   SD: -1.4%
	4 netperfs:   BW: 3%      SD: 1.4%
	8 netperfs:   BW: 17.7%   SD: -10%
	16 netperfs:  BW: 4.7%    SD: -7.0%
	32 netperfs:  BW: -6.1%   SD: -5.7%

BW and SD both improve (guest multiple txqs help). For 32 netperfs, SD improves. But with multiple vhosts, the guest is able to send more packets and BW increases much more (SD too increases, but I think that is expected). From the earlier results:

___________________________________________________________________________
N#    BW1     BW2 (%)         SD1     SD2 (%)        RSD1    RSD2 (%)
___________________________________________________________________________
4     26387   40716 (54.30)   20      28 (40.00)     86      85 (-1.16)
8     24356   41843 (71.79)   88      129 (46.59)    372     362 (-2.68)
16    23587   40546 (71.89)   375     564 (50.40)    1558    1519 (-2.50)
32    22927   39490 (72.24)   1617    2171 (34.26)   6694    5722 (-14.52)
48    23067   39238 (70.10)   3931    5170 (31.51)   15823   13552 (-14.35)
64    22927   38750 (69.01)   7142    9914 (38.81)   28972   26173 (-9.66)
96    22568   38520 (70.68)   16258   27844 (71.26)  65944   73031 (10.74)
___________________________________________________________________________

(All tests were done without any tuning.)

From my testing:

1. A single vhost improves mq guest performance upto 16 netperfs but degrades after that.
2. Multiple vhosts degrade single netperf guest performance, but significantly improve performance for any higher number of netperf sessions.

Likely cause for the 1 stream degradation with the multiple vhost patch:

1. Two vhosts run, handling the RX and TX respectively. I think the issue is related to cache ping-pong, esp since these run on different cpus/sockets.
2. I (re-)modified the patch to share RX with TX[0]. The performance drop is the same, but the reason is the guest is not using txq[0] in most cases (dev_pick_tx), so vhost's rx and tx are running on different threads. But whenever the guest uses txq[0], only one vhost runs and the performance is similar to the original.

I went back to my *submitted* patch and started a guest with numtxqs=16 and pinned every vhost to CPUs #0/#1. Now whether the guest used txq[0] or txq[n], the performance is similar or better (between 10-27% across 10 runs) than the original code. Also, -6% to -24% improvement in SD.

I will start a full test run of original vs submitted code with minimal tuning (Avi also suggested the same), and re-send. Please let me know if you need any other data.

Thanks,

- KK
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Krishna Kumar2/India/IBM wrote on 09/09/2010 03:15:53 PM:

I will start a full test run of original vs submitted code with minimal tuning (Avi also suggested the same), and re-send. Please let me know if you need any other data.

Same patch; the only change is that I ran "taskset -p 03" on all vhost threads, with no other tuning on host or guest. Default netperf without any options. The BW is the sum across two iterations, each of 60 secs. The guest is started with 2 txqs.

BW1/BW2:   BW for org/new in mbps
SD1/SD2:   SD for org/new
RSD1/RSD2: Remote SD for org/new

___________________________________________________________________________
#     BW1     BW2 (%)         SD1     SD2 (%)        RSD1     RSD2 (%)
___________________________________________________________________________
1     20903   19422 (-7.08)   1       1 (0)          6        7 (16.66)
2     21963   24330 (10.77)   6       6 (0)          25       25 (0)
4     22042   31841 (44.45)   23      28 (21.73)     102      110 (7.84)
8     21674   32045 (47.84)   97      111 (14.43)    419      421 (.47)
16    22281   31361 (40.75)   379     551 (45.38)    1663     2110 (26.87)
24    22521   31945 (41.84)   857     981 (14.46)    3748     3742 (-.16)
32    22976   32473 (41.33)   1528    1806 (18.19)   6594     6885 (4.41)
40    23197   32594 (40.50)   2390    2755 (15.27)   10239    10450 (2.06)
48    22973   32757 (42.58)   3542    3786 (6.88)    15074    14395 (-4.50)
64    23809   32814 (37.82)   6486    6981 (7.63)    27244    26381 (-3.16)
80    23564   32682 (38.69)   10169   11133 (9.47)   43118    41397 (-3.99)
96    22977   33069 (43.92)   14954   15881 (6.19)   62948    59071 (-6.15)
128   23649   33032 (39.67)   27067   28832 (6.52)   113892   106096 (-6.84)
___________________________________________________________________________
      294534  400371 (35.9)   67504   72858 (7.9)    285077   271096 (-4.9)
___________________________________________________________________________

I will try more tuning later as Avi suggested; I wanted to test the minimal case for now.

Thanks,

- KK
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Arnd Bergmann a...@arndb.de wrote on 09/09/2010 04:10:27 PM:

Can you live migrate a new guest from new-qemu/new-kernel to old-qemu/old-kernel, new-qemu/old-kernel and old-qemu/new-kernel? If not, do we need to support all those cases?

I have not tried this, though I added some minimal code in virtio_net_load and virtio_net_save. I don't know what needs to be done exactly at this time. I forgot to put this in the "Next steps" list of things to do.

I was mostly trying to find out if you think it should work, or if there are specific reasons why it would not. E.g. when migrating to a machine that has an old qemu, the guest gets reduced to a single queue, but it's not clear to me how it can learn about this, or whether it can be hidden by the outbound qemu.

I agree, I am also not sure how the old guest will handle this. Sorry about my ignorance on migration :(

Regards,

- KK
Re: [RFC PATCH 1/4] Add a new API to virtio-pci
Rusty Russell ru...@rustcorp.com.au wrote on 09/09/2010 05:44:25 PM:

This seems a bit weird. I mean, the driver used vdev->config->find_vqs to find the queues, which returns them (in order). So, can't you put this into your struct send_queue?

I am saving the vqs in the send_queue, but the cb needs to locate the device txq from the svq. The only other way I could think of is to iterate through the send_queues and compare the svq against sq[i]->svq, but cb's happen quite a bit. Is there a better way?

Ah, good point. Move the queue index into the struct virtqueue?

Is it OK to move the queue_index from virtio_pci_vq_info to virtqueue? I didn't want to change any data structures in virtio for this patch, but I can do it either way. (A sketch of this variant is below.)

Also, why define VIRTIO_MAX_TXQS? If the driver can't handle all of them, it should simply not use them...

The main reason was vhost :) Since vhost_net_release should not fail (__fput can't handle f_op->release() failure), I needed a maximum number of socks to clean up.

Ah, then it belongs in the vhost headers. The guest shouldn't see such a restriction if it doesn't apply; it's a host thing.

Oh, and I think you could profitably use virtio_config_val(), too.

OK, I will make those changes. Thanks for the reference to virtio_config_val(), I will use it in the guest probe instead of the cumbersome way I am doing it now. Unfortunately I need a constant in vhost for now.

Thanks,

- KK
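For comparison, the variant Rusty suggests could look roughly like this (the placement of the field within struct virtqueue is an assumption about the structure of that era):

    struct virtqueue {
    	void (*callback)(struct virtqueue *vq);
    	const char *name;
    	struct virtio_device *vdev;
    	unsigned int queue_index;	/* set by find_vqs() in order */
    	void *priv;
    };

    static void skb_xmit_done(struct virtqueue *svq)
    {
    	struct virtnet_info *vi = svq->vdev->priv;
    	int qnum = svq->queue_index - 1;	/* vq 0 is RX */

    	virtqueue_disable_cb(svq);
    	netif_wake_subqueue(vi->dev, qnum);
    }

This drops both the exported helper and any per-callback search through the send queues.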
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Sridhar Samudrala s...@us.ibm.com wrote on 09/10/2010 04:30:24 AM:

I remember seeing a similar issue when using a separate vhost thread for the TX and RX queues. Basically, we should have the same vhost thread process a TCP flow in both directions. I guess this allows the data and ACKs to be processed in sync.

I was trying that by sharing threads between rx and tx[0], but that didn't work either since the guest rarely picks txq=0. I was able to get reasonable single stream performance by pinning the vhosts to the same cpu. (A sketch of a flow-to-queue pairing is below.)

Thanks,

- KK
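A sketch of the pairing idea: remember which TX queue a flow used and steer its RX to the same index, so one vhost thread sees both data and ACKs. Table size and names are hypothetical:

    #define FLOW_TAB_SZ 256

    static u16 flow_to_txq[FLOW_TAB_SZ];

    /* TX path: record the flow -> queue mapping */
    static void record_flow(struct sk_buff *skb, u16 txq)
    {
    	flow_to_txq[skb_get_rxhash(skb) % FLOW_TAB_SZ] = txq;
    }

    /* RX path: replay the mapping when choosing the RX vq */
    static u16 pick_rxq(struct sk_buff *skb, u16 numqs)
    {
    	return flow_to_txq[skb_get_rxhash(skb) % FLOW_TAB_SZ] % numqs;
    }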
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Avi Kivity a...@redhat.com wrote on 09/08/2010 01:17:34 PM:

On 09/08/2010 10:28 AM, Krishna Kumar wrote:

Following patches implement Transmit mq in virtio-net. Also included are the user qemu changes.

1. This feature was first implemented with a single vhost. Testing showed 3-8% performance gain for upto 8 netperf sessions (and sometimes 16), but BW dropped with more sessions. However, implementing per-txq vhost improved BW significantly all the way to 128 sessions.

Why were vhost kernel changes required? Can't you just instantiate more vhost queues?

I did try using a single thread processing packets from multiple vq's on the host, but the BW dropped beyond a certain number of sessions. I don't have the code and performance numbers for that right now since it is a bit ancient; I can try to resuscitate that if you want.

Guest interrupts for a 4 TXQ device after a 5 min test:

# egrep "virtio0|CPU" /proc/interrupts
      CPU0     CPU1     CPU2     CPU3
40:   0        0        0        0       PCI-MSI-edge  virtio0-config
41:   126955   126912   126505   126940  PCI-MSI-edge  virtio0-input
42:   108583   107787   107853   107716  PCI-MSI-edge  virtio0-output.0
43:   300278   297653   299378   300554  PCI-MSI-edge  virtio0-output.1
44:   372607   374884   371092   372011  PCI-MSI-edge  virtio0-output.2
45:   162042   162261   163623   162923  PCI-MSI-edge  virtio0-output.3

How are vhost threads and host interrupts distributed? We need to move the vhost queue threads to be colocated with the related vcpu threads (if no extra cores are available) or on the same socket (if extra cores are available). Similarly, move device interrupts to the same core as the vhost thread.

All my testing was without any tuning, including binding the netperf netserver (irqbalance is also off). I assume (maybe wrongly) that the above might give better results? Are you suggesting this combination:

IRQs on guest:
	40: CPU0,  41: CPU1,  42: CPU2,  43: CPU3
	(all CPUs are on socket #0)
vhost:
	thread #0: CPU0,  thread #1: CPU1,  thread #2: CPU2,  thread #3: CPU3
qemu:
	thread #0: CPU4,  thread #1: CPU5,  thread #2: CPU6,  thread #3: CPU7
	(all CPUs are on socket #1)
netperf/netserver:
	Run on CPUs 0-4 on both sides

The reason I did not optimize anything from user space is because I felt showing that the default works reasonably well is important.

Thanks,

- KK
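The pinning itself needs no kernel changes; from host userspace it is one syscall per thread (equivalent to taskset), e.g.:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <sys/types.h>

    /* Pin a vhost (or vcpu) thread, identified by pid/tid, to one cpu */
    static int pin_task(pid_t pid, int cpu)
    {
    	cpu_set_t set;

    	CPU_ZERO(&set);
    	CPU_SET(cpu, &set);
    	return sched_setaffinity(pid, sizeof(set), &set);
    }

IRQ affinity is set separately, by writing a mask to /proc/irq/<n>/smp_affinity.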
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 09/08/2010 01:40:11 PM:

___________________________________________________________________________
TCP (#numtxqs=2)
N#    BW1     BW2 (%)         SD1     SD2 (%)        RSD1    RSD2 (%)
___________________________________________________________________________
4     26387   40716 (54.30)   20      28 (40.00)     86      85 (-1.16)
8     24356   41843 (71.79)   88      129 (46.59)    372     362 (-2.68)
16    23587   40546 (71.89)   375     564 (50.40)    1558    1519 (-2.50)
32    22927   39490 (72.24)   1617    2171 (34.26)   6694    5722 (-14.52)
48    23067   39238 (70.10)   3931    5170 (31.51)   15823   13552 (-14.35)
64    22927   38750 (69.01)   7142    9914 (38.81)   28972   26173 (-9.66)
96    22568   38520 (70.68)   16258   27844 (71.26)  65944   73031 (10.74)
___________________________________________________________________________

That's a significant hit in TCP SD. Is it caused by the imbalance between the number of queues for TX and RX? Since you mention RX is complete, maybe measure with a balanced TX/RX?

Yes, I am not sure why it is so high. I found the same with #RX=#TX too. As a hack, I tried ixgbe without MQ (set indices=1 before calling alloc_etherdev_mq; not sure if that is entirely correct) - here too SD worsened by around 40%. I can't explain it, since the virtio-net driver runs lock free once sch_direct_xmit gets HARD_TX_LOCK for the specific txq. Maybe the SD calculation is not strictly correct since more threads are now running in parallel and the load is higher? E.g. if you compare SD between #netperfs = 8 vs 16 for the original code (cut-n-paste of the relevant columns only):

	N#    BW      SD
	8     24356   88
	16    23587   375

SD has increased more than 4 times for the same BW.

What happens with a single netperf? host -> guest performance with TCP and small packet speed are also worth measuring.

OK, I will do this and send the results later today.

At some level, host/guest communication is easy in that we don't really care which queue is used. I would like to give some thought (and testing) to how this is going to work with a real NIC card and packet steering at the backend. Any idea?

I have done a little testing with guest -> remote server, both using a bridge and with macvtap (mq is required only for rx). I didn't understand what you mean by packet steering though - is it whether packets go out of the NIC on different queues? If so, I verified that is the case by putting a counter and displaying it through the /debug interface on the host. dev_queue_xmit on the host handles it by calling dev_pick_tx() (sketched below).

Guest interrupts for a 4 TXQ device after a 5 min test:

# egrep "virtio0|CPU" /proc/interrupts
      CPU0     CPU1     CPU2     CPU3
40:   0        0        0        0       PCI-MSI-edge  virtio0-config
41:   126955   126912   126505   126940  PCI-MSI-edge  virtio0-input
42:   108583   107787   107853   107716  PCI-MSI-edge  virtio0-output.0
43:   300278   297653   299378   300554  PCI-MSI-edge  virtio0-output.1
44:   372607   374884   371092   372011  PCI-MSI-edge  virtio0-output.2
45:   162042   162261   163623   162923  PCI-MSI-edge  virtio0-output.3

Does this mean each interrupt is constantly bouncing between CPUs?

Yes. I didn't do *any* tuning for the tests. The only tuning was to use 64K IO size with netperf. When I ran default netperf (16K), I got a slightly smaller improvement in BW and worse(!) SD than with 64K.

Thanks,

- KK
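For reference, when the driver supplies no queue-selection hook, the selection dev_pick_tx() falls back to is roughly the following (a simplified sketch of skb_tx_hash(), not the exact kernel code):

    static u16 pick_txq(const struct net_device *dev, struct sk_buff *skb)
    {
    	u32 hash;

    	if (skb_rx_queue_recorded(skb))
    		return skb_get_rx_queue(skb) % dev->real_num_tx_queues;

    	hash = skb_get_rxhash(skb);	/* flow hash */
    	/* scale the 32-bit hash down to a queue index */
    	return (u16)(((u64)hash * dev->real_num_tx_queues) >> 32);
    }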
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Hi Michael,

Michael S. Tsirkin m...@redhat.com wrote on 09/08/2010 01:43:26 PM:

On Wed, Sep 08, 2010 at 12:58:59PM +0530, Krishna Kumar wrote:

1. mq RX patch is also complete - plan to submit once TX is OK.

It's good that you split the patches. I think it would be interesting to see the RX patches at least once to complete the picture. You could make it a separate patchset, tagged as RFC.

OK, I need to re-do some parts of it, since I started the TX-only branch a couple of weeks earlier and the RX side is outdated. I will try to send that out in the next couple of days; as you say, it will help to complete the picture.

Reasons to send only TX now:
	- Reduce size of patch and complexity
	- I didn't get much improvement on the multiple RX patch (netperf from host -> guest), so I needed some time to figure out the reason and fix it.

Thanks,

- KK
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Avi Kivity a...@redhat.com wrote on 09/08/2010 02:58:21 PM:

1. This feature was first implemented with a single vhost. Testing showed 3-8% performance gain for upto 8 netperf sessions (and sometimes 16), but BW dropped with more sessions. However, implementing per-txq vhost improved BW significantly all the way to 128 sessions.

Why were vhost kernel changes required? Can't you just instantiate more vhost queues?

I did try using a single thread processing packets from multiple vq's on the host, but the BW dropped beyond a certain number of sessions.

Oh - so the interface has not changed (which can be seen from the patch). That was my concern; I remembered that we planned for vhost-net to be multiqueue-ready. The new guest and qemu code work with old vhost-net, just with reduced performance, yes?

Yes, I have tested new guest/qemu with old vhost, but using #numtxqs=1 (or not passing any arguments at all to qemu to enable MQ). Giving numtxqs > 1 fails with ENOBUFS in vhost, since vhost_net_set_backend in the unmodified vhost checks for boundary overflow (see the condensed check below). I have also tested running an unmodified guest with new vhost/qemu, but qemu should not specify numtxqs > 1.

Are you suggesting this combination:

IRQs on guest:
	40: CPU0,  41: CPU1,  42: CPU2,  43: CPU3
	(all CPUs are on socket #0)
vhost:
	thread #0: CPU0,  thread #1: CPU1,  thread #2: CPU2,  thread #3: CPU3
qemu:
	thread #0: CPU4,  thread #1: CPU5,  thread #2: CPU6,  thread #3: CPU7
	(all CPUs are on socket #1)
netperf/netserver:
	Run on CPUs 0-4 on both sides

May be better to put the vcpu threads and vhost threads on the same socket. Also need to affine host interrupts.

The reason I did not optimize anything from user space is because I felt showing that the default works reasonably well is important.

Definitely. Heavy tuning is not a useful path for general end users. We need to make sure the scheduler is able to arrive at the optimal layout without pinning (but perhaps with hints).

OK, I will see if I can get results with this.

Thanks for your suggestions,

- KK
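The boundary check being referred to, condensed from unmodified vhost-net (drivers/vhost/net.c, where VHOST_NET_VQ_MAX is 2):

    static long vhost_net_set_backend(struct vhost_net *n,
    				  unsigned index, int fd)
    {
    	if (index >= VHOST_NET_VQ_MAX)
    		return -ENOBUFS;
    	/* ... attach the socket behind fd to vq[index] ... */
    }

So any request to set a backend for a third virtqueue fails before a socket is attached.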
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 09/08/2010 04:18:33 PM:

___________________________________________________________________________
TCP (#numtxqs=2)
N#    BW1     BW2 (%)         SD1     SD2 (%)        RSD1    RSD2 (%)
___________________________________________________________________________
4     26387   40716 (54.30)   20      28 (40.00)     86      85 (-1.16)
8     24356   41843 (71.79)   88      129 (46.59)    372     362 (-2.68)
16    23587   40546 (71.89)   375     564 (50.40)    1558    1519 (-2.50)
32    22927   39490 (72.24)   1617    2171 (34.26)   6694    5722 (-14.52)
48    23067   39238 (70.10)   3931    5170 (31.51)   15823   13552 (-14.35)
64    22927   38750 (69.01)   7142    9914 (38.81)   28972   26173 (-9.66)
96    22568   38520 (70.68)   16258   27844 (71.26)  65944   73031 (10.74)
___________________________________________________________________________

That's a significant hit in TCP SD. Is it caused by the imbalance between the number of queues for TX and RX? Since you mention RX is complete, maybe measure with a balanced TX/RX?

Yes, I am not sure why it is so high.

Any errors at higher levels? Are any packets reordered?

I haven't seen any messages logged, and retransmission is similar to the non-mq case. The device also has no errors/dropped packets. Anything else I should look for?

On the host:

# ifconfig vnet0
vnet0	Link encap:Ethernet  HWaddr 9A:9D:99:E1:CA:CE
	inet6 addr: fe80::989d:99ff:fee1:cace/64 Scope:Link
	UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
	RX packets:5090371 errors:0 dropped:0 overruns:0 frame:0
	TX packets:5054616 errors:0 dropped:0 overruns:65 carrier:0
	collisions:0 txqueuelen:500
	RX bytes:237793761392 (221.4 GiB)  TX bytes:333630070 (318.1 MiB)

# netstat -s | grep -i retrans
	1310 segments retransmited
	35 times recovered from packet loss due to fast retransmit
	1 timeouts after reno fast retransmit
	41 fast retransmits
	1236 retransmits in slow start

So retransmissions are 0.025% of the total packets received from the guest.

Thanks,

- KK
Re: [RFC PATCH 0/4] Implement multiqueue virtio-net
Michael S. Tsirkin m...@redhat.com wrote on 09/08/2010 01:40:11 PM:

________________________________________________________
UDP (#numtxqs=8)
N#    BW1     BW2 (%)          SD1     SD2 (%)
________________________________________________________
4     29836   56761 (90.24)    67      63 (-5.97)
8     27666   63767 (130.48)   326     265 (-18.71)
16    25452   60665 (138.35)   1396    1269 (-9.09)
32    26172   63491 (142.59)   5617    4202 (-25.19)
48    26146   64629 (147.18)   12813   9316 (-27.29)
64    25575   65448 (155.90)   23063   16346 (-29.12)
128   26454   63772 (141.06)   91054   85051 (-6.59)
________________________________________________________

N#: Number of netperf sessions, 90 sec runs
BW1,SD1,RSD1: Bandwidth (sum across 2 runs in mbps), SD and Remote SD for the original code
BW2,SD2,RSD2: Bandwidth (sum across 2 runs in mbps), SD and Remote SD for the new code. e.g. BW2=40716 means average BW2 was 20358 mbps.

What happens with a single netperf? host -> guest performance with TCP and small packet speed are also worth measuring.

Guest -> Host (single netperf): I am getting a drop of almost 20%. I am trying to figure out why.

Host -> guest (single netperf): I am getting an improvement of almost 15%. Again - unexpected.

Guest -> Host TCP_RR: I get an average 7.4% increase in #packets for runs upto 128 sessions. With fewer netperfs (under 8), there was a drop of 3-7% in #packets, but beyond that, the #packets improved significantly to give an average improvement of 7.4%.

So it seems that fewer sessions is having a negative effect for some reason on the tx side. The code path in virtio-net has not changed much, so the drop in some cases is quite unexpected.

Thanks,

- KK
Re: [RFC PATCH 1/4] Add a new API to virtio-pci
Rusty Russell ru...@rustcorp.com.au wrote on 09/09/2010 09:19:39 AM:

On Wed, 8 Sep 2010 04:59:05 pm Krishna Kumar wrote:

Add virtio_get_queue_index() to get the queue index of a vq. This is needed by the cb handler to locate the queue that should be processed.

This seems a bit weird. I mean, the driver used vdev->config->find_vqs to find the queues, which returns them (in order). So, can't you put this into your struct send_queue?

I am saving the vqs in the send_queue, but the cb needs to locate the device txq from the svq. The only other way I could think of is to iterate through the send_queues and compare the svq against sq[i]->svq, but cb's happen quite a bit. Is there a better way?

static void skb_xmit_done(struct virtqueue *svq)
{
	struct virtnet_info *vi = svq->vdev->priv;
	int qnum = virtio_get_queue_index(svq) - 1;	/* 0 is the RX vq */

	/* Suppress further interrupts. */
	virtqueue_disable_cb(svq);

	/* We were probably waiting for more output buffers. */
	netif_wake_subqueue(vi->dev, qnum);
}

Also, why define VIRTIO_MAX_TXQS? If the driver can't handle all of them, it should simply not use them...

The main reason was vhost :) Since vhost_net_release should not fail (__fput can't handle f_op->release() failure), I needed a maximum number of socks to clean up:

#define MAX_VQS	(1 + VIRTIO_MAX_TXQS)

static int vhost_net_release(struct inode *inode, struct file *f)
{
	struct vhost_net *n = f->private_data;
	struct vhost_dev *dev = &n->dev;
	struct socket *socks[MAX_VQS];
	int i;

	vhost_net_stop(n, socks);
	vhost_net_flush(n);
	vhost_dev_cleanup(dev);
	for (i = n->dev.nvqs - 1; i >= 0; i--)
		if (socks[i])
			fput(socks[i]->file);
	...
}

Thanks,

- KK