Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Thu, Apr 16, 2015 at 1:42 PM, Eric Dumazet eric.duma...@gmail.com wrote: On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote: He suggested that after he'd been prodded by 4 more e-mails in which two of us guessed what he was trying to get at. That's what I was complaining about. My big complaint is that I suggested a test doubling the sysctl, which gave good results. Then you provided a patch using an 8x factor. How does that sound? Next time I ask for a raise, I should try an 8x factor as well; who knows, it might be accepted. I see. I chose the value that Stefano had determined completely eliminated the overhead. Doubling the value reduces the overhead to 8%, which should be fine as a short-term fix while we get a proper mid/long-term fix. -George -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On 04/16/2015 10:20 AM, Daniel Borkmann wrote: So mid term, it would be much more beneficial if you attempt to fix the underlying driver issues that actually cause high tx completion delays, instead of reintroducing bufferbloat. So that we all can move forward and not backwards in time. Yes, I think we definitely see the need for this. I think we certainly agree that bufferbloat needs to be reduced, and minimizing the data we need in the pipe for full performance on xennet is an important part of that. It should be said, however, that any virtual device is always going to have higher latency than a physical device. Hopefully we'll be able to get the latency of xennet down to something that's more reasonable, but it may just not be possible. And in any case, if we're going to be cranking down these limits to just barely within the tolerance of physical NICs, virtual devices (either xennet or virtio_net) are never going to be able to catch up. (Without cheating, that is.) What Eric described to you was that you introduce a new netdev member like netdev->needs_bufferbloat, set that indication from the driver side, and cache that in the socket that binds to it, so you can adjust the test in tcp_xmit_size_goal(). It should merely be seen as a hint/indication for such devices. Hmm? He suggested that after he'd been prodded by 4 more e-mails in which two of us guessed what he was trying to get at. That's what I was complaining about. Having a per-device long-transmit-latency hint sounds like a sensible short-term solution to me. -George
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On 04/15/2015 07:19 PM, Eric Dumazet wrote: On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote: Maybe you should stop wasting all of our time and just tell us what you're thinking. I think you are making me waste my time. I already gave all the hints in prior discussions. Right, and I suggested these two options: Obviously one solution would be to allow the drivers themselves to set tcp_limit_output_bytes, but that seems like a maintenance nightmare. Another simple solution would be to allow drivers to indicate whether they have a high transmit latency, and have the kernel use a higher value by default when that's the case. [1] Neither of which you commented on. Instead you pointed me to a comment that only partially described what the limitations were. (I.e., it described the two packets or 1ms, but not how they related to each other, nor how they related to the max of two 64k packets outstanding of the default tcp_limit_output_bytes setting.) -George [1] http://marc.info/?i=CAFLBxZYt7-v29ysm=f+5qmow64_qhesjzj98udba+1cs-pf...@mail.gmail.com
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Thu, Apr 16, 2015 at 10:22 AM, David Laight david.lai...@aculab.com wrote: ISTM that you are changing the wrong knob. You need to change something that affects the global amount of pending tx data, not the amount that can be buffered by a single connection. Well it seems like the problem is that the global amount of pending tx data is high enough, but that the per-stream amount is too low for only a single stream. If you change tcp_limit_output_bytes and then have 1000 connections trying to send data you'll suffer 'bufferbloat'. Right -- so are you worried about the buffers in the local device here, or are you worried about buffers elsewhere in the network? If you're worried about buffers on the local device, don't you have a similar problem for physical NICs? I.e., if a NIC has a big buffer that you're trying to keep mostly empty, limiting a single TCP stream may keep that buffer empty, but if you have 1000 connections, 1000*limit will still fill up the buffer. Or am I missing something? If you call skb_orphan() in the tx setup path then the total number of buffers is limited, but a single connection can (and will) fill the tx ring, leading to incorrect RTT calculations and additional latency for other connections. This will give high single-connection throughput but isn't ideal. One possibility might be to call skb_orphan() when enough time has elapsed since the packet was queued for transmit that it is very likely to have actually been transmitted - even though 'transmit done' has not yet been signalled. Not at all sure how this would fit in though... Right -- so it sounds like the problem with skb_orphan() is making sure that the tx ring is shared properly between different streams. That would mean that ideally we wouldn't call it until the tx ring actually had space to add more packets onto it.
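The 1000-connection arithmetic above can be sketched in a toy model (the per-flow limit and device buffer sizes below are invented purely for illustration, not taken from any real driver):

```python
# Toy model: a per-connection byte limit does not bound the total amount
# of data queued at a shared device buffer.

def total_queued(per_flow_limit_bytes, n_flows):
    """Worst case: every flow keeps its full per-flow allowance in flight."""
    return per_flow_limit_bytes * n_flows

PER_FLOW_LIMIT = 128 * 1024   # hypothetical per-flow TSQ limit (128k)
DEVICE_BUFFER = 1024 * 1024   # hypothetical 1 MB device buffer

# One flow stays comfortably within the buffer...
assert total_queued(PER_FLOW_LIMIT, 1) < DEVICE_BUFFER
# ...but 1000 flows can queue ~125 MB against the same 1 MB buffer.
assert total_queued(PER_FLOW_LIMIT, 1000) > DEVICE_BUFFER
```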
The Xen project is having a sort of developer meeting in a few weeks; if we can get a good picture of all the constraints, maybe we can hash out a solution that works for everyone. -George
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On 04/15/2015 07:17 PM, Eric Dumazet wrote: Do not expect me to fight bufferbloat alone. Be part of the challenge, instead of trying to get back to proven bad solutions. I tried that. I wrote a description of what I thought the situation was, so that you could correct me if my understanding was wrong, and then what I thought we could do about it. You apparently didn't even read it, but just pointed me to a single cryptic comment that doesn't give me enough information to actually figure out what the situation is. We all agree that bufferbloat is a problem for everybody, and I can definitely understand the desire to actually make the situation better rather than dying the death of a thousand exceptions. If you want help fighting bufferbloat, you have to educate people to help you; or alternately, if you don't want to bother educating people, you have to fight it alone -- or lose the battle due to having a thousand exceptions. So, back to TSQ limits. What's so magical about 2 packets being *in the device itself*? And what do 1ms, or 2*64k packets (the default for tcp_limit_output_bytes), have to do with it? Your comment lists three benefits: 1. better RTT estimation 2. faster recovery 3. high rates #3 is just marketing fluff; it's also contradicted by the statement that immediately follows it -- i.e., there are drivers for which the limitation does *not* give high rates. #1, as far as I can tell, has to do with measuring the *actual* minimal round trip time of an empty pipe, rather than the round trip time you get when there's 512MB of packets in the device buffer. If a device has a large internal buffer, then having a large number of packets outstanding means that the measured RTT is skewed. The goal here, I take it, is to have this pipe *exactly* full; having it significantly more than full is what leads to bufferbloat.
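For point #1, the skew is easy to model: a packet entering a queue behind N bytes waits N/bandwidth before it even reaches the wire, and that wait is folded into the measured RTT (the link rate and queue sizes below are invented for illustration):

```python
# How a standing queue inflates measured RTT: queueing delay adds directly
# to the real path round-trip time.

def measured_rtt(base_rtt_s, queued_bytes, line_rate_bps):
    queueing_delay_s = queued_bytes * 8 / line_rate_bps
    return base_rtt_s + queueing_delay_s

GIG = 10**9  # assume a 1 Gbit/s link

# Near-empty pipe: the measurement reflects the real 100 us path RTT.
assert measured_rtt(100e-6, 2 * 1500, GIG) < 200e-6
# A 5 MB standing queue adds 40 ms of delay, swamping the real RTT.
assert measured_rtt(100e-6, 5_000_000, GIG) > 0.04
```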
#2 sounds like you're saying that if there are too many packets outstanding when you discover that you need to adjust things, it takes a long time for your changes to have an effect; i.e., if you have 5ms of data in the pipe, it will take at least 5ms for your reduced transmission rate to actually have an effect. Is that accurate, or have I misunderstood something? -George
Re: [Xen-devel] [PATCH RFC] tcp: Allow sk_wmem_alloc to exceed sysctl_tcp_limit_output_bytes
On Mon, Apr 13, 2015 at 4:03 PM, Malcolm Crossley malcolm.cross...@citrix.com wrote: But the main concern here is it basically breaks back pressure. And you do not want this, unless there is no other choice. virtio_net already uses skb_orphan() in its transmit path. It seems only fair that other virtual network drivers behave in the same way. There are no easy solutions to decrease the transmit latency for netback/netfront. We map the guest memory through to the backend to avoid memory copies. The frontend memory can only be freed once the network driver has completed transmitting the packet in the backend. Modern network drivers can be quite slow at freeing the skbs once transmitted (the packet is already on the wire as far as they are concerned), and this delay is compounded by needing to signal the completion of the transmit back to the frontend (by IPI in the worst case). From a networking point of view, the backend is a switch. Is it OK to consider the packet to have been transmitted from the guest point of view once the backend is aware of the packet? This would help justify the skb_orphan() in the frontend. This sounds sensible to me, particularly if virtio_net is already doing it. -George
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On Mon, Apr 13, 2015 at 2:49 PM, Eric Dumazet eric.duma...@gmail.com wrote: On Mon, 2015-04-13 at 11:56 +0100, George Dunlap wrote: Is the problem perhaps that netback/netfront delays TX completion? Would it be better to see if that can be addressed properly, so that the original purpose of the patch (fighting bufferbloat) can be achieved while not degrading performance for Xen? Or at least, so that people get decent performance out of the box without having to tweak TCP parameters? Sure, please provide a patch that does not break back pressure. But just in case, if Xen performance relied on bufferbloat, it might be very difficult to reach a stable equilibrium: any small change in stack or scheduling might introduce a significant difference in 'raw performance'. So help me understand this a little bit here. tcp_limit_output_bytes limits the amount of data allowed to be in transit between a send() and the wire, is that right? And so the bufferbloat problem you're talking about here is TCP buffers inside the kernel, and/or buffers in the NIC, is that right? So ideally, you want this to be large enough to fill the pipeline all the way from send() down to actually getting out on the wire; otherwise, you'll have gaps in the pipeline, and the machinery won't be working at full throttle. And the reason it's a problem is that many NICs now come with large send buffers; and effectively what happens then is that this makes the pipeline longer -- as the buffer fills up, the time between send() and the wire is increased. This increased latency causes delays in round-trip times and interferes with the mechanisms TCP uses to try to determine what the actual sustainable rate of data transmission is. By limiting the number of in-transit bytes, you make sure that neither the kernel nor the NIC is going to have packets queued up for long periods of time in buffers, and you keep this pipeline as close to the actual minimal length of the pipeline as possible.
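The "fill the pipeline exactly" idea amounts to a bandwidth-delay product for the host-to-wire path; as a sketch (the latencies below are invented for illustration, not measured values):

```python
# Ideal in-flight byte limit ~= min_tx_latency * max_bandwidth: just enough
# data to keep the wire busy for one full send()-to-wire latency.

def ideal_limit_bytes(min_tx_latency_s, bandwidth_bps):
    # Bits that fit in the pipe during one minimum transmit latency, as bytes.
    return int(min_tx_latency_s * bandwidth_bps / 8)

# Assumed latencies: ~25 us for a fast physical NIC, ~1 ms for a virtual
# path whose completions come back via IPI.
physical = ideal_limit_bytes(25e-6, 40 * 10**9)  # 40G NIC -> 125000 bytes
virtual = ideal_limit_bytes(1e-3, 10 * 10**9)    # xennet-like -> 1250000 bytes

assert physical == 125_000       # same ballpark as the 128k default
assert virtual == 10 * physical  # longer pipe -> proportionally bigger limit
```

So with these (made-up) numbers, a limit tuned for the physical NIC starves the virtual path by an order of magnitude.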
And it sounds like for your 40G NIC, 128k is big enough to fill the pipeline without unduly making it longer by introducing buffering. Is that an accurate picture of what you're trying to achieve? But the problem for xennet (and a number of other drivers), as I understand it, is that at the moment the pipeline itself is just longer -- it just takes a longer time from the time you send a packet to the time it actually gets out on the wire. So it's not actually accurate to say that Xen performance relies on bufferbloat. There's no buffering involved -- the pipeline is just longer, and so to fill up the pipeline you need more data. Basically, to maximize throughput while minimizing buffering, for *any* connection, tcp_limit_output_bytes should ideally be around (min_tx_latency * max_bandwidth). For physical NICs, the minimum latency is really small, but for xennet -- and I'm guessing for a lot of virtualized cards -- the min_tx_latency will be a lot higher, requiring a much higher ideal tcp_limit_output_bytes value. Rather than trying to pick a single value which will be good for all NICs, it seems like it would make more sense to have this vary depending on the parameters of the NIC. After all, for NICs that have low throughput -- say, old 100Mbit NICs -- even 128k may still introduce a significant amount of buffering. Obviously one solution would be to allow the drivers themselves to set tcp_limit_output_bytes, but that seems like a maintenance nightmare. Another simple solution would be to allow drivers to indicate whether they have a high transmit latency, and have the kernel use a higher value by default when that's the case. Probably the most sustainable solution would be to have the networking layer keep track of the average and minimum transmit latencies, and automatically adjust tcp_limit_output_bytes based on that.
(Keeping the minimum as well as the average because the whole problem with bufferbloat is that the more data you give it, the longer the apparent pipeline becomes.) Thoughts? -George
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On 04/15/2015 05:38 PM, Eric Dumazet wrote: My thought is that instead of these long talks you guys should read the code:

/* TCP Small Queues :
 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
 * This allows for :
 *  - better RTT estimation and ACK scheduling
 *  - faster recovery
 *  - high rates
 * Alas, some drivers / subsystems require a fair amount
 * of queued bytes to ensure line rate.
 * One example is wifi aggregation (802.11 AMPDU)
 */
limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

Then you'll see that most of your questions are already answered. Feel free to try to improve the behavior, if it does not hurt critical workloads like TCP_RR, where we send very small messages, millions of times per second. First of all, with regard to critical workloads: once this patch gets into distros, *normal TCP streams* on every VM running on Amazon, Rackspace, Linode, etc. will get a 30% hit in performance *by default*. Normal TCP streams on xennet *are* a critical workload, and deserve the same kind of accommodation as TCP_RR (if not more). The same goes for virtio_net. Secondly, according to Stefano's and Jonathan's tests, raising tcp_limit_output_bytes completely fixes the problem for Xen. Which means that max(2 * skb->truesize, sk->sk_pacing_rate >> 10) is *already* larger for Xen; the calculation mentioned in the comment is *already* doing the right thing. As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an automatic TSQ calculation which is actually choosing an effective value for xennet. It certainly makes sense for sysctl_tcp_limit_output_bytes to be an actual maximum limit. I went back and looked at the original patch which introduced it (46d3ceabd), and it looks to me like it was designed to be a rough, quick estimate of two packets outstanding (by choosing the maximum size of a packet, 64k, and multiplying it by two).
Now that you have a better algorithm -- the size of 2 actual packets or the amount transmitted in 1ms -- it seems like the default sysctl_tcp_limit_output_bytes should be higher, letting the automatic TSQ calculation on the first line throttle things down when necessary. -George
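For reference, the two-line limit George and Eric are arguing over behaves like this when modelled directly (the truesize and pacing-rate numbers below are invented for illustration):

```python
# Model of the TSQ limit from tcp_write_xmit:
#   limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10)
#   limit = min(limit, sysctl_tcp_limit_output_bytes)
# sk_pacing_rate is in bytes/sec, so >> 10 approximates one millisecond's
# worth of data.

def tsq_limit(truesize, pacing_rate_bytes_per_s, sysctl_limit):
    limit = max(2 * truesize, pacing_rate_bytes_per_s >> 10)
    return min(limit, sysctl_limit)

SYSCTL_DEFAULT = 131072  # the 128k static default under discussion

# Slow flow (10 MB/s pacing): the dynamic ~1ms term governs, well under 128k.
assert tsq_limit(2048, 10_000_000, SYSCTL_DEFAULT) == 9765
# Fast flow (~1.25 GB/s pacing): the dynamic limit wants ~1.2 MB, but the
# static sysctl clamps it to 128k -- the override George describes.
assert tsq_limit(65536, 1_250_000_000, SYSCTL_DEFAULT) == SYSCTL_DEFAULT
```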
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On 04/15/2015 06:29 PM, Eric Dumazet wrote: On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote: On 04/15/2015 05:38 PM, Eric Dumazet wrote: My thought is that instead of these long talks you guys should read the code:

/* TCP Small Queues :
 * Control number of packets in qdisc/devices to two packets / or ~1 ms.
 * This allows for :
 *  - better RTT estimation and ACK scheduling
 *  - faster recovery
 *  - high rates
 * Alas, some drivers / subsystems require a fair amount
 * of queued bytes to ensure line rate.
 * One example is wifi aggregation (802.11 AMPDU)
 */
limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);

Then you'll see that most of your questions are already answered. Feel free to try to improve the behavior, if it does not hurt critical workloads like TCP_RR, where we send very small messages, millions of times per second. First of all, with regard to critical workloads: once this patch gets into distros, *normal TCP streams* on every VM running on Amazon, Rackspace, Linode, etc. will get a 30% hit in performance *by default*. Normal TCP streams on xennet *are* a critical workload, and deserve the same kind of accommodation as TCP_RR (if not more). The same goes for virtio_net. Secondly, according to Stefano's and Jonathan's tests, raising tcp_limit_output_bytes completely fixes the problem for Xen. Which means that max(2 * skb->truesize, sk->sk_pacing_rate >> 10) is *already* larger for Xen; the calculation mentioned in the comment is *already* doing the right thing. As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an automatic TSQ calculation which is actually choosing an effective value for xennet. It certainly makes sense for sysctl_tcp_limit_output_bytes to be an actual maximum limit.
I went back and looked at the original patch which introduced it (46d3ceabd), and it looks to me like it was designed to be a rough, quick estimate of two packets outstanding (by choosing the maximum size of a packet, 64k, and multiplying it by two). Now that you have a better algorithm -- the size of 2 actual packets or the amount transmitted in 1ms -- it seems like the default sysctl_tcp_limit_output_bytes should be higher, letting the automatic TSQ calculation on the first line throttle things down when necessary. I asked you guys to make a test by increasing sysctl_tcp_limit_output_bytes So you'd be OK with a patch like this? (With perhaps a better changelog?) -George

---
TSQ: Raise default static TSQ limit

A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the size of actual packets and the amount of data being transmitted. Raise the default static limit to allow that new limit to actually come into effect. This fixes a regression where NICs with large transmit completion times (such as xennet) had a 30% hit unless the user manually tweaked the value in /proc.

Signed-off-by: George Dunlap george.dun...@eu.citrix.com

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..8ad7cdf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
  */
 int sysctl_tcp_workaround_signed_windows __read_mostly = 0;

-/* Default TSQ limit of two TSO segments */
-int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
+/* Static TSQ limit. A more dynamic limit is calculated in tcp_write_xmit. */
+int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;

 /* This limits the percentage of the congestion window which we
  * will allow a single TSO frame to consume. Building TSO frames
Re: [Xen-devel] tcp: refine TSO autosizing causes performance regression on Xen
On 04/15/2015 06:52 PM, Eric Dumazet wrote: On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote: So you'd be OK with a patch like this? (With perhaps a better changelog?) -George

---
TSQ: Raise default static TSQ limit

A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the size of actual packets and the amount of data being transmitted. Raise the default static limit to allow that new limit to actually come into effect. This fixes a regression where NICs with large transmit completion times (such as xennet) had a 30% hit unless the user manually tweaked the value in /proc.

Signed-off-by: George Dunlap george.dun...@eu.citrix.com

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..8ad7cdf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
  */
 int sysctl_tcp_workaround_signed_windows __read_mostly = 0;

-/* Default TSQ limit of two TSO segments */
-int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
+/* Static TSQ limit. A more dynamic limit is calculated in tcp_write_xmit. */
+int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;

 /* This limits the percentage of the congestion window which we
  * will allow a single TSO frame to consume. Building TSO frames

Have you tested this patch on a NIC without GSO/TSO? This would allow more than 500 packets for a single flow. Hello bufferbloat. So my answer to this patch is a no. You said: I asked you guys to make a test by increasing sysctl_tcp_limit_output_bytes You have no need to explain to me the code I wrote, thank you. Which implies to me that you think you've already pointed us to the answer you want and we're just not getting it. Maybe you should stop wasting all of our time and just tell us what you're thinking.
-George
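Eric's "more than 500 packets" figure checks out with quick arithmetic (assuming ~1500-byte MTU-sized frames on a NIC without GSO/TSO, and ignoring per-skb truesize overhead, which would only lower the counts):

```python
# Without GSO/TSO each skb carries roughly one MTU-sized frame, so a static
# byte limit translates directly into a per-flow packet count.

MTU_FRAME = 1500
PROPOSED_LIMIT = 1048576  # the 1 MB default from the patch above
OLD_LIMIT = 131072        # the existing 128k default

assert PROPOSED_LIMIT // MTU_FRAME > 500  # ~699 frames: "hello bufferbloat"
assert OLD_LIMIT // MTU_FRAME < 100       # ~87 frames under the old default
```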