Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
> On June 22, 2016 at 7:53 AM Yuchung Cheng wrote:
>
> Thanks for the patience. I've collected data from some Google Web
> servers. They serve a mix of US and SouthAm users using
> HTTP1 and HTTP2. The traffic is Web browsing (e.g., search, maps,
> gmails, etc but not Youtube videos). The mean RTT is about 100ms.
>
> The user connections were split into 4 groups of different TCP RTO
> configs. Each group has many millions of connections but the
> size variation among groups is well under 1%.
>
> B: baseline Linux
> D: this patch
> R: change RTTVAR averaging as in D, but bound RTO to 1sec per RFC6298
> Y: change RTTVAR averaging as in D, but bound RTTVAR to 200ms instead (like B)
>
> For mean TCP latency of HTTP responses (first byte sent to last byte
> acked), B < R < Y < D. But the differences are so insignificant (<1%).
> The median, 95pctl, and 99pctl has similar indifference. In summary
> there's hardly visible impact on latency. I also look at only response
> less than 4KB but do not see a different picture.
>
> The main difference is the retransmission rate where R =~ Y < B =~ D.
> R and Y are ~20% lower than B and D. Parsing the SNMP stats reveal
> more interesting details. The table shows the deltas in percentage to
> the baseline B.
>
>                  D      R      Y
> --------------------------------------
> Timeout         +12%   -16%   -16%
> TailLossProb    +28%    -7%    -7%
> DSACK_rcvd      +37%    -7%    -7%
> Cwnd-undo       +16%   -29%   -29%
>
> RTO change affects TLP because TLP will use the min of RTO and TLP
> timer value to arm the probe timer.
>
> The stats indicate that the main culprit of spurious timeouts / rtx is
> the RTO lower-bound. But they also show the RFC RTTVAR averaging is as
> good as current Linux approach.
>
> Given that I would recommend we revise this patch to use the RFC
> averaging but keep existing lower-bound (of RTTVAR to 200ms). We can
> further experiment the lower-bound and change that in a separate
> patch.

Great news Yuchung!

Then Daniel will prepare v4 with a min-rto lower bound:

    max(RTTVAR, tcp_rto_min_us(struct sock))

Any further suggestions, Yuchung, Eric? We will also feed this v4 into our
test environment to check the behavior for sender-limited, non-continuous
flows.

Hagen
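For readers following along, here is a minimal userspace sketch of the combination discussed above: RFC 6298 averaging (alpha = 1/8, beta = 1/4) with a lower bound applied to the variance term only. The struct, the function names and the assumption that the bound applies to K*RTTVAR are illustrative; this is not the kernel code and not the actual v4 patch.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical estimator state; field names are illustrative only. */
struct rtt_est {
    int64_t srtt_us;     /* smoothed RTT in microseconds */
    int64_t rttvar_us;   /* RTT variation in microseconds */
    int     have_sample; /* 0 until the first measurement arrives */
};

/* RFC 6298 sections 2.2 and 2.3 with alpha = 1/8 and beta = 1/4. */
static void rtt_est_update(struct rtt_est *e, int64_t r_us)
{
    if (!e->have_sample) {
        e->srtt_us = r_us;
        e->rttvar_us = r_us / 2;
        e->have_sample = 1;
        return;
    }
    /* RTTVAR is updated first, using the SRTT from before this sample. */
    int64_t err = llabs(e->srtt_us - r_us);
    e->rttvar_us += (err - e->rttvar_us) / 4;
    e->srtt_us += (r_us - e->srtt_us) / 8;
}

/* Assumed reading of the lower bound: RTO = SRTT + max(4 * RTTVAR, floor). */
static int64_t rto_with_var_floor_us(const struct rtt_est *e, int64_t floor_us)
{
    int64_t var_term = 4 * e->rttvar_us;
    return e->srtt_us + (var_term > floor_us ? var_term : floor_us);
}
```

With a steady 100 ms RTT and a 200 ms floor, this sketch keeps the RTO at 300 ms no matter how small RTTVAR becomes, which is the conservative behaviour the measurements above argue for.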
Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
> On June 15, 2016 at 10:38 PM Eric Dumazet wrote:
>
> I guess the problem is that some folks use smaller rto than
> RTAX_RTO_MIN , look at tcp_rto_min()

Due to the nature of the Linux calculation, this is probably more of a reason
to use the RFC 6298 calculation. When a MinRTO smaller than 200ms is used, the
Linux "advantage" of accounting for Delayed ACKs of up to 200ms shrinks.
Assuming a MinRTO of 0ms, the Linux ability and the RFC ability to account for
sudden Delayed ACKs are pretty much equal: zero.

To illustrate this:

RTT: 50ms, RTTVAR: 0ms, MinRTO: 50ms, Delayed ACKs: 200ms

Before any ACK is delayed:
  Linux RTO    ~ 100+ms (tested)
  RFC 6298 RTO ~ 50+ms (tested)
RTT of the first delayed ACK, if it is not shortened by another data packet:
  ~250ms

This is not tied to the RTT:

RTT: 1000ms, RTTVAR: 0ms, MinRTO: 50ms, Delayed ACKs: 200ms

Before any ACK is delayed:
  Linux RTO    ~ 1050+ms (tested)
  RFC 6298 RTO ~ 1000+ms (tested)
RTT of the first delayed ACK, if it is not shortened by another data packet:
  ~1200ms

The one RFC 6298 problem we have run into so far was with extremely steady
RTTs and sender-limited data. A spurious retransmission occurred from time to
time in this case.

Hagen
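The arithmetic in this message can be checked with a tiny helper. This is only a sketch: the function name is made up, and it assumes the timer fires spuriously whenever the RTO is shorter than the path RTT plus the receiver's ACK delay.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: does the retransmission timer expire before an ACK
 * that the receiver held back for ack_delay_ms can arrive?
 */
static bool rto_fires_before_delayed_ack(uint32_t rto_ms, uint32_t rtt_ms,
                                         uint32_t ack_delay_ms)
{
    return rto_ms < rtt_ms + ack_delay_ms;
}
```

For the 50 ms case above, both the ~100 ms Linux RTO and the ~50 ms RFC 6298 RTO are below the ~250 ms delayed-ACK RTT, so both variants time out; the same holds for 1050 ms and 1000 ms against ~1200 ms.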
Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
> On June 15, 2016 at 8:02 PM Yuchung Cheng wrote:
>
> Let's say the SRTT is 100ms and RTT variations is 10ms. The variation
> is low because we've been sending large chunks, and RTT is fairly
> stable, and we sample on every ACK. The RTOs produced are
>
> RFC6298: RTO=1s
> Linux: RTO=300ms
> This patch: RTO=200ms
>
> Then we send 1 packet out. The receiver delays the ACK up to 200ms.
> The actual RTT can be longer because other network components further
> delay the data or the ACK. This patch would surely fire the RTO
> spuriously.
>
> so we can either implement RFC6298 faithfully, or apply the
> lower-bound as-is, or something in between. But the current patch
> as-is is more aggressive. Did I miss something?

We analyzed the impact for a wide variety of network characteristics: from
bulk data to chatty, sender-limited transmissions, from low RTTs to high RTTs,
with small and large variances as well as different queue characteristics. For
one group of tests we measured advantages of an RFC 6298 compliant
implementation: sender-limited flows. For bulk data we did not measure any
difference compared to standard Linux. As a result we concluded that the RFC
conformant implementation, mapped to real-world protocols, is beneficial.

For the mentioned use case, yes, the new implementation is a little bit more
aggressive: when delayed ACK kicks in, a spurious retransmission can be
triggered. We asked ourselves whether this is a real-world scenario or more of
a theoretical issue. Furthermore, if it is a real-world problem, is the
retransmission negligible compared to the advantages?

Yuchung, can you test the patch and see if it has any downsides? And thank you
for the comments!

Hagen
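The three RTO values in Yuchung's example differ only in where the lower bound is applied. The sketch below reproduces them under that assumption; the function names are hypothetical, K = 4 is taken from RFC 6298, and the 200 ms floor and 1 s minimum are the values mentioned in the thread.

```c
#include <stdint.h>

#define RTO_K 4 /* multiplier for RTTVAR, per RFC 6298 */

static uint32_t max_u32(uint32_t a, uint32_t b)
{
    return a > b ? a : b;
}

/* RFC 6298: RTO = SRTT + max(G, K*RTTVAR), then rounded up to 1 second. */
static uint32_t rto_rfc6298_ms(uint32_t srtt, uint32_t rttvar, uint32_t g)
{
    return max_u32(srtt + max_u32(g, RTO_K * rttvar), 1000);
}

/* Linux-like: the floor applies to the variance term only. */
static uint32_t rto_var_floor_ms(uint32_t srtt, uint32_t rttvar, uint32_t floor)
{
    return srtt + max_u32(RTO_K * rttvar, floor);
}

/* Assumed behaviour of the patch under review: the floor applies to the sum. */
static uint32_t rto_sum_floor_ms(uint32_t srtt, uint32_t rttvar, uint32_t floor)
{
    return max_u32(srtt + RTO_K * rttvar, floor);
}
```

With SRTT = 100 ms and RTTVAR = 10 ms these return 1000, 300 and 200 ms respectively, matching the numbers quoted above.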
Re: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation
* Yuchung Cheng | 2016-06-14 14:33:18 [-0700]:

>> + tp->rttvar_us = tp->mdev_us;
>AFAICT we can update rttvar_us directly and don't need mdev_us anymore?

Yes, v3 will remove mdev_us.

>This is more aggressive than RFC6298 that RTO <- SRTT + max (G,
>K*RTTVAR) where G = MIN_RTO = 200ms
>
>based on our discussion, in the spirit of keeping RTO more
>conservative, I recommend we implement RFC formula. Acks being delayed
>over 200ms is not uncommon (unfortunately due to bloat or other
>issues).
>
>Also I think we should change __tcp_set_rto so that the formula
>applies to backoffs or ICMP timeouts calculations too.

We are unsure what you mean, Yuchung. We do not believe this patch is more
aggressive than RFC 6298. In fact, we believe it is RFC 6298 compliant: in
RFC 6298, G is the clock granularity, and we don't see where the patch
deviates from the RFC. However, it is more aggressive than
"RTO <- SRTT + max (G, K*RTTVAR) where G = MIN_RTO = 200ms".

Which formula do you want to implement?

Hagen
Re: [PATCH net-next] tcp: use RFC6298 compliant TCP RTO calculation
* Yuchung Cheng | 2016-06-13 15:38:24 [-0700]:

Hey Eric, Yuchung,

regarding the missed mdev_max_us: internal communication problem. Daniel will
respin a v2 removing the no longer required mdev_max_us.

>Thanks for the patch. I also have long wanted to evaluate Linux's RTO vs RFC's.
>
>Since this is not a small change, and your patch is only tested on
>emulation-based testbed AFAICT, I'd like to try your patch on Google
>servers to get more data. But this would take a few days to setup &
>collect.

Great - no hurry! So far we have tried hard to find any downsides of RFC 6298,
without result. If you have any special & concrete tests in mind: Daniel will
test them!

>Note that this paper
>https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf has
>detailed rationale of current design (section 4). IMO having a "tight"
>RTO is less necessary now after TLP. I am also testing a new set of
>patches to install a quick reordering timer. But it's worth mentioning
>the paper in the commit message.

We had "difficulties" finding scenarios where the RTO kicks in. For the
majority of use cases, duplicate ACKs trigger the TCP retransmission. For bulk
data transmissions almost 100% of retransmissions are triggered by duplicate
ACKs (except at connection teardown). TLP will reduce the need for the RTO
even further, and window probes sometimes help as well. The use case we
identified was sender-limited, non-continuous flows, where an RFC 6298
compliant implementation is better.

Thank you Yuchung, we will add a reference in v2.

Hagen
Re: Fwd: Re: Section 4 No. 9,10 Failed was occurred by IPv6 Ready Logo Conformance Test
> On April 15, 2016 at 10:47 AM Yuki Machida <machida.y...@jp.fujitsu.com>
> wrote:
>
> >> commit 9d289715eb5c252ae15bd547cb252ca547a3c4f2
> >> Author: Hagen Paul Pfeifer <ha...@jauu.net>
> >> Date: Thu Jan 15 22:34:25 2015 +0100
> >>
> >> ipv6: stop sending PTB packets for MTU < 1280
> >>
> >> Reduce the attack vector and stop generating IPv6 Fragment Header for
> >> paths with an MTU smaller than the minimum required IPv6 MTU
> >> size (1280 byte) - called atomic fragments.
> >>
> >> See IETF I-D "Deprecating the Generation of IPv6 Atomic Fragments"
> >> [1]
> >> for more information and how this "feature" can be misused.
> >>
> >> [1]
> >> https://tools.ietf.org/html/draft-ietf-6man-deprecate-atomfrag-generation-00
> >>
> >> Signed-off-by: Fernando Gont <fg...@si6networks.com>
> >> Signed-off-by: Hagen Paul Pfeifer <ha...@jauu.net>
> >> Acked-by: Hannes Frederic Sowa <han...@stressinduktion.org>
> >> Signed-off-by: David S. Miller <da...@davemloft.net>
> >
> > I will try.
>
> I confirmed that v4.1.20 revert above patch is passed Section 4 No. 9 and 10
> testcases
> in IPv6 Ready Logo Conformance Test.
> I can't immediately revert above patch from v4.6-rc1 by implementation has
> changed.

Is this just to please a conformance test tool, or does "revert 9d289715eb5c2"
fix a real problem? If so: which problem do you have with 9d289715eb5c2 or
draft-ietf-6man-deprecate-atomfrag-generation-06?

Hagen
Re: [PATCH v2] jhash: Deinline jhash, jhash2 and __jhash_nwords
On July 16, 2015 at 9:23 PM Joe Perches j...@perches.com wrote:

> It might be useful to have these performance impacting changes
> guarded by something like CONFIG_CC_OPTIMIZE_FOR_SIZE with
> another static __always_inline __func and a function
> EXPORT_SYMBOL or just a static inline so that where code size
> is critical it's uninlined.

But keep in mind that jhash, jhash2 and __jhash_nwords are *not*
one-instruction functions. We duplicate the code over and over, which probably
results in more cache misses. __always_inline__ is probably too strict, and a
vanilla inline is already an __always_inline__ for 99% of all distribution
builds, see ARCH_SUPPORTS_OPTIMIZED_INLINING and CONFIG_CC_OPTIMIZE_FOR_SIZE.

The answer depends on the specific workload. Sometimes an enforced inline
performs better and sometimes a call is the better solution (read: fewer cache
misses). General purpose vendors with a larger working set size should reduce
cache misses by deinlining many functions. For high-performance, special
fast-path operations a strongly inlined kernel build is probably faster.
__always_inline__ takes the choice of whether to deinline a function away from
the user.

Hagen

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
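The trade-off discussed here can be sketched with a build-time switch. Everything below is illustrative: DEMO_OPTIMIZE_FOR_SIZE stands in for a Kconfig symbol, demo_mix is a toy function and not jhash, and the attributes are the GCC/Clang spellings.

```c
#include <stdint.h>

/*
 * DEMO_OPTIMIZE_FOR_SIZE is a hypothetical stand-in for a config option.
 * When it is defined the helper is forced out of line (one copy, smaller
 * text); otherwise it is forced inline into every caller.
 */
#ifdef DEMO_OPTIMIZE_FOR_SIZE
#define DEMO_HASH_FN static __attribute__((noinline))
#else
#define DEMO_HASH_FN static inline __attribute__((always_inline))
#endif

/* Rotate left; masking the counts keeps both shifts well defined. */
static inline uint32_t demo_rol32(uint32_t word, unsigned int shift)
{
    return (word << (shift & 31)) | (word >> ((-shift) & 31));
}

/* Toy mixing step, only here to give the compiler something to inline. */
DEMO_HASH_FN uint32_t demo_mix(uint32_t a, uint32_t b)
{
    return demo_rol32(a, 5) ^ b;
}
```

Building once with and once without -DDEMO_OPTIMIZE_FOR_SIZE and comparing the object size is one way to see the effect for a given compiler; which variant is faster still depends on the workload, as argued above.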
Re: What queues/buffers does tc-netem use?
On July 16, 2015 at 1:28 PM Motejlek, Petr pmote...@akamai.com wrote:

> I was wondering what queues/buffers does netem use and how does one control
> or monitor them?

netem uses its own rbtree-based queue. You can use tc(1) to get statistics.

> I could not find this information anywhere and I am not that good in reading
> the sources to be able to tell enough about this :) If we talk only about the
> situation where netem is the root qdisc for a particular interface, I would
> imagine it might be using the txqueue of that interface, but I am not sure if
> that's really the case...

Sadly there is no documentation of the netem implementation, but the source
code is straightforward. You may take a look:

http://lxr.free-electrons.com/source/net/sched/sch_netem.c

Cheers, Hagen
Re: What queues/buffers does tc-netem use?
On July 16, 2015 at 2:48 PM Motejlek, Petr pmote...@akamai.com wrote:

> Could you please give me some example of such a tc command that would tell me
> the statistics? I am not sure what you mean.

tc -s qdisc show dev eth0

> Is there a way I can manipulate the internal rbtree queue size, please?

Sure, the option is called limit.

> Thank you

You are welcome!

Hagen
Re: netstat and dual stack sockets
On 15 June 2015 at 22:54, Phil Sutter p...@nwl.cc wrote:

> As I see it, a user has no way of detecting the listening socket in this
> address family: it does not show in /proc/net/{tcp,udp} nor do 'netstat',
> 'ss' or 'lsof' print any additional information about those sockets over
> pure IPv6 ones.

Probably a combination of IPV6_V6ONLY(1, 0) and IN6_IS_ADDR_V4MAPPED fulfills
all user requirements, ... so far. Your proposal is to hand over
sk->sk_ipv6only?

Hagen
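For context, this is roughly what an application can already find out about its own sockets from userspace. It is a sketch with made-up helper names, and it does not solve Phil's actual problem of inspecting other processes' listeners via /proc or ss.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Returns 1 for a v4-mapped address, 0 for other IPv6, -1 on parse error. */
static int addr_is_v4mapped(const char *text)
{
    struct in6_addr addr;

    if (inet_pton(AF_INET6, text, &addr) != 1)
        return -1;
    return IN6_IS_ADDR_V4MAPPED(&addr) ? 1 : 0;
}

/*
 * Returns 1 if the AF_INET6 socket also accepts IPv4 (IPV6_V6ONLY is 0),
 * 0 if it is IPv6-only, and -1 if the option cannot be read.
 */
static int socket_is_dual_stack(int fd)
{
    int v6only = 0;
    socklen_t len = sizeof(v6only);

    if (getsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &v6only, &len) < 0)
        return -1;
    return v6only ? 0 : 1;
}
```

An accepted peer address such as ::ffff:192.0.2.1 on a dual-stack listener is reported as v4-mapped by the first helper, which is how an application can tell IPv4 clients apart today.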
Re: [KJ] [patch] net/tipc: sprintf/strcpy conversion
* Alexey Dobriyan | 2006-11-03 03:09:05 [+0300]:

> On Wed, Nov 01, 2006 at 03:06:24PM +0100, Florian Westphal wrote:
> > convert sprintf(a,b) to strcpy(a,b). Make tipc_bclink_name[] const.
>
> Ahhh, I missed the start of threads.
>
> Patch is useless because it changes one unbounded string function into
> another unbounded string function.

The discussion in this thread is really back-breaking!

1. There is absolutely no objection to making tipc_bclink_name const.

2. Replacing sprintf with strcpy:

   a) First of all: if you _copy_ a string, then also use strCPY().
      That's a question of good style!

   b) If the compiler is smart enough, it realizes that you want to copy a
      string and replaces the sprintf call with a

          pushl %ebx
          call strcpy

      Surprise, surprise! This assumes you use gcc with -Os or -O2! I don't
      know how icc handles this case. If you compile without optimization
      you save at least a

          repz movsb %ds:(%esi),%es:(%edi)

      instruction.

   c) Last but not least, I keep reading that this patch doesn't introduce
      bounds checking. This isn't an argument, because the author is aware
      of the length of the destination buffer.

BTW: grep for (sprintf|strcpy) in /usr/src/linux and be surprised how insecure
the kernel is (that's meant ironically).

This patch is 100% OK!

HGN
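To make the two positions in this thread concrete, here is a small userspace sketch. The string and function names are hypothetical; it shows the plain copy the patch switches to and, for comparison, the bounded copy the objection asks for. Note that sprintf(dst, src) also interprets src as a format string, which strcpy does not.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical constant, standing in for a name like tipc_bclink_name[]. */
static const char demo_link_name[] = "demo-link";

/* What the patch does: a plain copy, fine when dst is known to be large enough. */
static void copy_unbounded(char *dst, const char *src)
{
    strcpy(dst, src);
}

/* Bounded alternative: returns 0 if src fit, -1 if it had to be truncated. */
static int copy_bounded(char *dst, size_t size, const char *src)
{
    int n = snprintf(dst, size, "%s", src);

    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

Whether the bounded form is worth it depends on the point made in 2c: if the author controls both the constant and the destination size, the plain copy cannot overflow.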
Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm
* David Miller | 2006-10-26 17:02:21 [-0700]:

> Your email client turned the tabs into spaces in the patch
> making it useless.

Sorry, my mistake! I am en route and pasted the patch into my editor, which
ate all the tabs. One more time: sorry!

Check if user has CAP_NET_ADMIN capability to change congestion control
algorithm.

Signed-off-by: Hagen Paul Pfeifer [EMAIL PROTECTED]
---
 net/ipv4/tcp_cong.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index af0aca1..c1ae2e9 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -10,6 +10,7 @@ #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/types.h>
 #include <linux/list.h>
+#include <linux/capability.h>
 #include <net/tcp.h>
 
 static DEFINE_SPINLOCK(tcp_cong_list_lock);
@@ -151,6 +152,9 @@ int tcp_set_congestion_control(struct so
 	struct tcp_congestion_ops *ca;
 	int err = 0;
 
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
 	rcu_read_lock();
 	ca = tcp_ca_find(name);
 	if (ca == icsk->icsk_ca_ops)
-- 
1.4.1.1
Re: [PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm
* Stephen Hemminger | 2006-10-27 07:41:02 [-0700]:

> Please no, it makes the socket option useless.

Technically no; in the sense of usability for everybody, yes.

You are right Stephen, as a programmer I understand you completely! But on the
other side: we know for sure that it IS a problem if we allow everybody to
give preference to his own socket. In my opinion we should put fairness before
usability!

As John Heffner suggested, we could introduce a ranking system for congestion
control algorithms - but this solution seems a little bit oversized and
probably can't be fully guaranteed (complex interaction between the protocols
in different environments and so on, you know).

HGN

-- 
/°\   --- JOIN NOW!!! ---
\ /   ASCII ribbon campaign
 X    against HTML
/ \   in mail and news
Re: TCP congestion graphs
Hi Stephen,

is your rt-patch to netem publicly available?

Best regards
HGN

-- 
Signed and/or encrypted mails preferred.
Key-Id = 0x98350C22
Fingerprint = 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22
Key available under: www.jauu.net/download/gnupg_key
Re: [RFC] tcp: setsockopt congestion control autoload
* John Heffner | 2006-10-26 13:29:26 [-0400]:

> My reservation in doing this would be that as an administrator, I may
> want to choose exactly what congestion control is available at any given
> time. The different congestion control algorithms are not necessarily
> fair to each other.

ACK, completely right. A user without CAP_NET_ADMIN MUST NOT change the
algorithm. We know that there is some unfairness out there. And maybe at some
point someone introduces a satellite algorithm which is by definition
completely unfair to vanilla TCP. We should guard this with the CAP_NET_ADMIN
capability, so that built-in modules cannot be enabled without it either.

HGN

-- 
Signed and/or encrypted mails preferred.
Key-Id = 0x98350C22
Fingerprint = 490F 557B 6C48 6D7E 5706 2EA2 4A22 8D45 9835 0C22
[PATCH] Check if user has CAP_NET_ADMIN to change congestion control algorithm
Check if user has CAP_NET_ADMIN capability to change congestion control
algorithm.

Under normal circumstances an application programmer doesn't have enough
information to choose the right algorithm (unless he is the pchar/pathchar
maintainer). In 99.9% of cases only the local host administrator has the
knowledge to select a proper standard, system-wide algorithm (the remaining
0.1% are for testing purposes).

If we let the user select an alternative algorithm we introduce a potential
weak spot - so we rule this possibility out.

HGN

Signed-off-by: Hagen Paul Pfeifer [EMAIL PROTECTED]

diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index af0aca1..c1ae2e9 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -10,6 +10,7 @@ #include <linux/module.h>
 #include <linux/mm.h>
 #include <linux/types.h>
 #include <linux/list.h>
+#include <linux/capability.h>
 #include <net/tcp.h>
 
 static DEFINE_SPINLOCK(tcp_cong_list_lock);
@@ -151,6 +152,9 @@ int tcp_set_congestion_control(struct so
        struct tcp_congestion_ops *ca;
        int err = 0;
 
+       if (!capable(CAP_NET_ADMIN))
+               return -EPERM;
+
        rcu_read_lock();
        ca = tcp_ca_find(name);
        if (ca == icsk->icsk_ca_ops)
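From the application side, the socket option this patch restricts is TCP_CONGESTION. The sketch below is a hypothetical userspace wrapper (Linux only, names made up) that shows where an -EPERM from a capability check like the one proposed would surface.

```c
#include <errno.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical wrapper: ask the kernel to use the named congestion control
 * algorithm on this socket. Returns 0 on success or a negative errno value;
 * with a check like the one in the patch, callers lacking CAP_NET_ADMIN
 * would be expected to see -EPERM here.
 */
static int set_congestion_control(int fd, const char *name)
{
    if (setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name,
                   (socklen_t)strlen(name)) < 0)
        return -errno;
    return 0;
}
```

A caller would typically fall back to the system default when this fails, for example `if (set_congestion_control(fd, "reno") == -EPERM) { /* keep default */ }`.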
Re: How to grab a block of binary data w/out using ioctls?
* Ben Greear | 2006-10-23 17:44:24 [-0700]:

> Since IOCTLs are out of favor these days, what would be a preferred way
> to get a block of binary data out of the kernel?

I suggest a netlink socket for that purpose! Netlink also scales well if the
amount of data grows unexpectedly.

> Thanks,
> Ben

HGN
Re: [PATCH 2.6.16.19 2/2] LARTC: trace control for netem: kernelspace
* Rainer Baumann | 2006-09-22 08:15:13 [+0200]:

> Patch for linux kernel 2.6.16.19:
> http://tcn.hypert.net/tcnKernel_procfs.patch

The coding style needs at least some work: whitespace around operators and
parentheses, useless parentheses, braces for the else branch, mixed C99/C89
comments, indentation. proc_read_stats() looks unclean (bzero), and there may
be other issues too - the code as a whole looks a little bit grubby.

HGN

-- 
43rd Law of Computing: Anything that can go wr
fortune: Segmentation violation -- Core dumped
Re: [PATCH] tcp: set congestion default through Kconfig
* Andi Kleen | 2006-09-19 12:03:51 [+0200]:

> How about a single auto selection heuristic: e.g. check the handshake
> latencies and if they are too long switch from reno to the newer one deemed
> most stable?

That's absolutely no practicable way to discover the 'right' algorithm.
Latency is only one piece of the puzzle in determining the optimal congestion
control algorithm, and on its own it is completely inadequate. You must also
discover the bandwidth and other factors to make a decision. This is nearly
impossible and a subject of current research. Due to the heterogeneity of the
net it is often not possible to select the right one. Selecting a general
purpose algorithm here is the better choice.

BTW: BIC is sometimes too aggressive in comparison to standard TCP behaviour
(e.g. in short RTT environments) - this is a known issue. Why not cubic as the
default one?

> -Andi

HGN

-- 
You need the computing power of a Pentium, 16 MB RAM and 1 GB Harddisk to run
Win95. It took the computing power of 3 Commodore 64 to fly to the Moon.
Something is wrong here, and it wasn't the Apollo.
Re: means to artificially alter the bandwidth of a system
* Irfan Habib | 2006-08-02 23:04:41 [+0500]:

> Hi,
> For research purposes we are considering to develop a program to alter
> the bandwidth of a system via the software, for instance: a machine has
> 100 MB/s and we change it to 1MB/s. Does something like this already
> exist? Or is there a way to do this without creating a program/kernel
> module

Of course: see http://linux-net.osdl.org/index.php/Iproute2 (especially tc)

> Any help will be highly appreciated!
> Irfan Habib

HGN
Re: Congestion Avoidance Monitoring Tools
* Stephen Hemminger | 2006-04-21 08:19:17 [-0700]:

> 2.6.13 still had lots of problems, things didn't really get working right
> till 2.6.15 or later. Especially with TSO.

--verbose?

> I have a tool using kprobe's see
> http://developer.osdl.org/shemminger/prototypes/tcpprobe.tar.gz
> I try to keep it up to date with current kernel and build process, last
> used it on 2.6.16.

wget http://developer.osdl.org/shemminger/prototypes/tcpprobe.tar.gz

Ended with the following error code: ;-)

00:32:48 ERROR 403: Forbidden.

HGN

-- 
Microsoft is to software what McDonalds is to gourmet cooking.