Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
Ok, I missed that bit in the submitting-patches documentation; sorry about that.

2007/2/22, David Miller [EMAIL PROTECTED]:
Please never submit patches like this: submit the infrastructure FIRST, then submit the stuff that uses it. When a sequence of patches is applied, in sequence, the tree should build properly (even with all available new options enabled) at each step along the way. Otherwise we have the situation we have now, in that YeAH is in my tree but doesn't build successfully. What I'm going to do to fix this is yank the YeAH implementation out of my tree, add this second patch first, then add the YeAH patch back. Please never do this again.

- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
John Heffner ha scritto:
Sorry for the confusion. The patch I attached to my message was compile-tested only.

Well, I read your reply late at night and didn't notice that you had attached a patch. Sorry about that.

Kind regards,
Angelo
Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
Forgot the patch..

Angelo P. Castellani ha scritto:
From: Angelo P. Castellani [EMAIL PROTECTED]

RFC3742: limited slow start

See http://www.ietf.org/rfc/rfc3742.txt

Signed-off-by: Angelo P. Castellani [EMAIL PROTECTED]
---
To allow code reuse I've added the limited slow start procedure as an exported symbol of the Linux TCP congestion control code.

On large-BDP networks canonical slow start should be avoided because it requires large packet losses to converge, whereas at lower BDPs slow start and limited slow start are identical. Large BDP is defined through the max_ssthresh variable.

I think limited slow start could safely replace the canonical slow start procedure in Linux.

Regards,
Angelo P. Castellani

p.s.: the attached patch adds an exported function currently used only by YeAH-TCP

 include/net/tcp.h   |  1 +
 net/ipv4/tcp_cong.c | 23 +++
 2 files changed, 24 insertions(+)

diff -uprN linux-2.6.20-a/include/net/tcp.h linux-2.6.20-c/include/net/tcp.h
--- linux-2.6.20-a/include/net/tcp.h	2007-02-04 19:44:54.0 +0100
+++ linux-2.6.20-c/include/net/tcp.h	2007-02-19 10:54:10.0 +0100
@@ -669,6 +669,7 @@ extern void tcp_get_allowed_congestion_c
 extern int tcp_set_allowed_congestion_control(char *allowed);
 extern int tcp_set_congestion_control(struct sock *sk, const char *name);
 extern void tcp_slow_start(struct tcp_sock *tp);
+extern void tcp_limited_slow_start(struct tcp_sock *tp);
 extern struct tcp_congestion_ops tcp_init_congestion_ops;
 extern u32 tcp_reno_ssthresh(struct sock *sk);

diff -uprN linux-2.6.20-a/net/ipv4/tcp_cong.c linux-2.6.20-c/net/ipv4/tcp_cong.c
--- linux-2.6.20-a/net/ipv4/tcp_cong.c	2007-02-04 19:44:54.0 +0100
+++ linux-2.6.20-c/net/ipv4/tcp_cong.c	2007-02-19 10:54:10.0 +0100
@@ -297,6 +297,29 @@ void tcp_slow_start(struct tcp_sock *tp)
 }
 EXPORT_SYMBOL_GPL(tcp_slow_start);

+void tcp_limited_slow_start(struct tcp_sock *tp)
+{
+	/* RFC3742: limited slow start
+	 * the window is increased by 1/K MSS for each arriving ACK,
+	 * for K = int(cwnd/(0.5 * max_ssthresh))
+	 */
+
+	const int max_ssthresh = 100;
+
+	if (max_ssthresh > 0 && tp->snd_cwnd > max_ssthresh) {
+		u32 k = max(tp->snd_cwnd / (max_ssthresh >> 1), 1U);
+		if (++tp->snd_cwnd_cnt >= k) {
+			if (tp->snd_cwnd < tp->snd_cwnd_clamp)
+				tp->snd_cwnd++;
+			tp->snd_cwnd_cnt = 0;
+		}
+	} else {
+		if (tp->snd_cwnd < tp->snd_cwnd_clamp)
+			tp->snd_cwnd++;
+	}
+}
+EXPORT_SYMBOL_GPL(tcp_limited_slow_start);
+
 /*
  * TCP Reno congestion control
  * This is special case used for fallback as well.
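The RFC 3742 logic in the patch can be exercised outside the kernel. The following is a minimal user-space sketch of the same per-ACK update; the struct, the function name, and the LSS_MAX_SSTHRESH constant are illustrative stand-ins, not kernel types.

```c
/* User-space sketch of RFC 3742 limited slow start, mirroring the
 * tcp_limited_slow_start() logic above.  All names are illustrative. */
struct lss_state {
    unsigned int snd_cwnd;       /* congestion window, in segments */
    unsigned int snd_cwnd_cnt;   /* ACKs counted toward the next increase */
    unsigned int snd_cwnd_clamp; /* hard upper bound on snd_cwnd */
};

#define LSS_MAX_SSTHRESH 100U    /* beyond this cwnd, growth is limited */

/* Called once per arriving ACK while in slow start. */
static void limited_slow_start(struct lss_state *tp)
{
    if (tp->snd_cwnd > LSS_MAX_SSTHRESH) {
        /* RFC 3742: grow by 1/K MSS per ACK, K = int(cwnd / (0.5 * max_ssthresh)) */
        unsigned int k = tp->snd_cwnd / (LSS_MAX_SSTHRESH / 2);
        if (k < 1)
            k = 1;  /* mirrors the kernel's max(..., 1U) */
        if (++tp->snd_cwnd_cnt >= k) {
            if (tp->snd_cwnd < tp->snd_cwnd_clamp)
                tp->snd_cwnd++;
            tp->snd_cwnd_cnt = 0;
        }
    } else {
        /* At small cwnd this is identical to canonical slow start. */
        if (tp->snd_cwnd < tp->snd_cwnd_clamp)
            tp->snd_cwnd++;
    }
}
```

Below max_ssthresh the window still doubles per RTT; above it, cwnd grows by at most max_ssthresh/2 segments per RTT, which is the point of the RFC.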
[PATCH 1/2][TCP] YeAH-TCP: algorithm implementation
From: Angelo P. Castellani [EMAIL PROTECTED]

YeAH-TCP is a sender-side, high-speed-enabled TCP congestion control algorithm that uses a mixed loss/delay approach to compute the congestion window. Its design goals target high efficiency; internal, RTT, and Reno fairness; and resilience to link loss, while keeping the load on network elements as low as possible.

For further details look here:
http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf

Signed-off-by: Angelo P. Castellani [EMAIL PROTECTED]
---
This is the YeAH-TCP implementation of the algorithm presented at PFLDnet2007 (http://wil.cs.caltech.edu/pfldnet2007/).

Regards,
Angelo P. Castellani

 Kconfig    |  14 ++
 Makefile   |   1 
 tcp_yeah.c | 288 +
 tcp_yeah.h | 134 
 4 files changed, 437 insertions(+)

diff -uprN linux-2.6.20-a/net/ipv4/Kconfig linux-2.6.20-b/net/ipv4/Kconfig
--- linux-2.6.20-a/net/ipv4/Kconfig	2007-02-04 19:44:54.0 +0100
+++ linux-2.6.20-b/net/ipv4/Kconfig	2007-02-19 10:52:46.0 +0100
@@ -574,6 +574,20 @@ config TCP_CONG_VENO
 	loss packets.
 	See http://www.ntu.edu.sg/home5/ZHOU0022/papers/CPFu03a.pdf

+config TCP_CONG_YEAH
+	tristate "YeAH TCP"
+	depends on EXPERIMENTAL
+	default n
+	---help---
+	YeAH-TCP is a sender-side high-speed enabled TCP congestion control
+	algorithm, which uses a mixed loss/delay approach to compute the
+	congestion window. Its design goals target high efficiency,
+	internal, RTT and Reno fairness, resilience to link loss while
+	keeping network elements load as low as possible.
+
+	For further details look here:
+	http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf
+
 choice
 	prompt "Default TCP congestion control"
 	default DEFAULT_CUBIC

diff -uprN linux-2.6.20-a/net/ipv4/Makefile linux-2.6.20-b/net/ipv4/Makefile
--- linux-2.6.20-a/net/ipv4/Makefile	2007-02-04 19:44:54.0 +0100
+++ linux-2.6.20-b/net/ipv4/Makefile	2007-02-19 10:52:46.0 +0100
@@ -49,6 +49,7 @@ obj-$(CONFIG_TCP_CONG_VEGAS) += tcp_vega
 obj-$(CONFIG_TCP_CONG_VENO) += tcp_veno.o
 obj-$(CONFIG_TCP_CONG_SCALABLE) += tcp_scalable.o
 obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o
+obj-$(CONFIG_TCP_CONG_YEAH) += tcp_yeah.o
 obj-$(CONFIG_NETLABEL) += cipso_ipv4.o
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \

diff -uprN linux-2.6.20-a/net/ipv4/tcp_yeah.c linux-2.6.20-b/net/ipv4/tcp_yeah.c
--- linux-2.6.20-a/net/ipv4/tcp_yeah.c	1970-01-01 01:00:00.0 +0100
+++ linux-2.6.20-b/net/ipv4/tcp_yeah.c	2007-02-19 10:52:46.0 +0100
@@ -0,0 +1,288 @@
+/*
+ *
+ *   YeAH TCP
+ *
+ * For further details look at:
+ *   http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf
+ *
+ */
+
+#include "tcp_yeah.h"
+
+/* Default values of the Vegas variables, in fixed-point representation
+ * with V_PARAM_SHIFT bits to the right of the binary point.
+ */
+#define V_PARAM_SHIFT 1
+
+#define TCP_YEAH_ALPHA       80 //lin number of packets queued at the bottleneck
+#define TCP_YEAH_GAMMA        1 //lin fraction of queue to be removed per rtt
+#define TCP_YEAH_DELTA        3 //log minimum fraction of cwnd to be removed on loss
+#define TCP_YEAH_EPSILON      1 //log maximum fraction to be removed on early decongestion
+#define TCP_YEAH_PHY          8 //lin maximum delta from base
+#define TCP_YEAH_RHO         16 //lin minimum number of consecutive rtts to consider competition on loss
+#define TCP_YEAH_ZETA        50 //lin minimum number of state switches to reset reno_count
+
+#define TCP_SCALABLE_AI_CNT 100U
+
+/* YeAH variables */
+struct yeah {
+	/* Vegas */
+	u32 beg_snd_nxt;	/* right edge during last RTT */
+	u32 beg_snd_una;	/* left edge during last RTT */
+	u32 beg_snd_cwnd;	/* saves the size of the cwnd */
+	u8  doing_vegas_now;	/* if true, do vegas for this RTT */
+	u16 cntRTT;		/* # of RTTs measured within last RTT */
+	u32 minRTT;		/* min of RTTs measured within last RTT (in usec) */
+	u32 baseRTT;		/* the min of all Vegas RTT measurements seen (in usec) */
+
+	/* YeAH */
+	u32 lastQ;
+	u32 doing_reno_now;
+
+	u32 reno_count;
+	u32 fast_count;
+
+	u32 pkts_acked;
+};
+
+static void tcp_yeah_init(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct yeah *yeah = inet_csk_ca(sk);
+
+	tcp_vegas_init(sk);
+
+	yeah->doing_reno_now = 0;
+	yeah->lastQ = 0;
+
+	yeah->reno_count = 2;
+
+	/* Ensure the MD arithmetic works.  This is somewhat pedantic,
+	 * since I don't think we will see a cwnd this large. :) */
+	tp->snd_cwnd_clamp = min_t(u32, tp->snd_cwnd_clamp, 0xffffffff/128);
+
+}
+
+
+static void tcp_yeah_pkts_acked(struct sock *sk, u32 pkts_acked)
+{
+	const struct inet_connection_sock *icsk = inet_csk(sk);
+	struct yeah *yeah = inet_csk_ca(sk);
+
+	if (icsk->icsk_ca_state == TCP_CA_Open)
+		yeah->pkts_acked = pkts_acked;
+}
+
+/* 64bit divisor, dividend and result. dynamic precision */
+static inline u64 div64_64(u64 dividend, u64 divisor)
+{
+	u32 d = divisor;
+
+	if (divisor > 0xffffffffULL
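The core of YeAH's delay-based side is an estimate of how many packets the flow has queued at the bottleneck, compared against TCP_YEAH_ALPHA to trigger early decongestion. A rough user-space sketch of that estimate follows; the function name and parameter names are illustrative, and the formula (backlog ≈ cwnd · queueing delay / RTT) is a paraphrase of the paper's rule, not the kernel code verbatim.

```c
/* Sketch of the delay-based backlog estimate YeAH relies on: with cwnd
 * packets in flight, a last-RTT minimum of rtt_us, and a propagation
 * delay estimate of base_rtt_us, roughly
 *
 *     Q = cwnd * (rtt_us - base_rtt_us) / rtt_us
 *
 * packets are sitting in the bottleneck queue.  Names are illustrative. */
static unsigned int yeah_queue_estimate(unsigned int cwnd,
                                        unsigned int rtt_us,
                                        unsigned int base_rtt_us)
{
    unsigned long long backlog;

    if (rtt_us <= base_rtt_us)
        return 0;  /* no measurable queueing delay */

    /* widen to 64 bits before multiplying, as the kernel's div64_64
     * above does, to avoid overflow at large cwnd */
    backlog = (unsigned long long)cwnd * (rtt_us - base_rtt_us);
    return (unsigned int)(backlog / rtt_us);
}
```

When this estimate stays below the alpha threshold the algorithm behaves aggressively (the "fast" mode); when it exceeds it, YeAH drains the excess from cwnd, which is why the queue, not loss, drives its steady-state behaviour.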
[PATCH 1/2][TCP] YeAH-TCP: algorithm implementation
From: Angelo P. Castellani [EMAIL PROTECTED]

YeAH-TCP is a sender-side, high-speed-enabled TCP congestion control algorithm that uses a mixed loss/delay approach to compute the congestion window. Its design goals target high efficiency; internal, RTT, and Reno fairness; and resilience to link loss, while keeping the load on network elements as low as possible.

For further details look here:
http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf

Signed-off-by: Angelo P. Castellani [EMAIL PROTECTED]
---
This is the YeAH-TCP implementation of the algorithm presented at PFLDnet2007 (http://wil.cs.caltech.edu/pfldnet2007/).

Regards,
Angelo P. Castellani

 Kconfig    |  14 ++
 Makefile   |   1 
 tcp_yeah.c | 288 +
 tcp_yeah.h | 134 
 4 files changed, 437 insertions(+)
Re: [PATCH 1/2][TCP] YeAH-TCP: algorithm implementation
The patch.

Angelo P. Castellani ha scritto:
From: Angelo P. Castellani [EMAIL PROTECTED]

YeAH-TCP is a sender-side, high-speed-enabled TCP congestion control algorithm that uses a mixed loss/delay approach to compute the congestion window. Its design goals target high efficiency; internal, RTT, and Reno fairness; and resilience to link loss, while keeping the load on network elements as low as possible.

For further details look here:
http://wil.cs.caltech.edu/pfldnet2007/paper/YeAH_TCP.pdf

Signed-off-by: Angelo P. Castellani [EMAIL PROTECTED]
---
This is the YeAH-TCP implementation of the algorithm presented at PFLDnet2007 (http://wil.cs.caltech.edu/pfldnet2007/).

Regards,
Angelo P. Castellani

 Kconfig    |  14 ++
 Makefile   |   1 
 tcp_yeah.c | 288 +
 tcp_yeah.h | 134 
 4 files changed, 437 insertions(+)
Re: [PATCH 2/2][TCP] YeAH-TCP: limited slow start exported function
John Heffner ha scritto:
Note the patch is compile-tested only! I can do some real testing if you'd like to apply this, Dave.

The date you see on the patch is due to the fact that I split this patchset into two diff files. This isn't compile-tested only: I've used this piece of code for about 3 months. However, more testing is good and welcome.

Regards,
Angelo P. Castellani
[TCP] window update during recovery (continuing on window reduction)
During a recovery, we should always reduce the send window if the host is advertising a window reduction.

This is needed because during the recovery phase the host has to buffer the packets between the beginning of the recovery and the data we keep sending forward with a window of ssthresh plus the sacked_out count. So it is not buffering just the in_flight packets, but a number of packets that can be much higher.

If the host asks for a window reduction, its buffer is filling up; if we ignore the reduction, once the buffer is full every packet arriving at the host will be dropped. When the first packet is dropped the host will start advertising a zero window; that request will eventually be honoured, but in the meantime we will have lost a full window of packets.

However, we cannot set FLAG_WIN_UPDATE either, otherwise the ack would not be considered a DUPACK and, when using Reno, the sacked_out count would not be updated.

Regards,
Angelo P. Castellani

diff -urd linux-2.6.16-orig/net/ipv4/tcp_input.c linux-2.6.16-winupdate/net/ipv4/tcp_input.c
--- linux-2.6.16-orig/net/ipv4/tcp_input.c	2006-05-16 14:53:02.0 +0200
+++ linux-2.6.16-winupdate/net/ipv4/tcp_input.c	2006-07-05 15:38:08.0 +0200
@@ -2365,12 +2365,44 @@
 {
 	int flag = 0;
 	u32 nwin = ntohs(skb->h.th->window);
+	struct inet_connection_sock *icsk = inet_csk(sk);
+	int silent_update = 0;

 	if (likely(!skb->h.th->syn))
 		nwin <<= tp->rx_opt.snd_wscale;

-	if (tcp_may_update_window(tp, ack, ack_seq, nwin)) {
-		flag |= FLAG_WIN_UPDATE;
+	/*
+	 * During a recovery, we should always reduce the send window if
+	 * the host is advertising a window reduction.
+	 *
+	 * This is needed because in the recovery phase the host has to
+	 * buffer the packets between the beginning of the recovery
+	 * and the data we're sending forward with ssthresh window and
+	 * sacked_out count.
+	 *
+	 * So it isn't buffering the in_flight packets, but a number of packets
+	 * that could be much higher.
+	 *
+	 * If the host asks for a window reduction its buffer is filling up,
+	 * and if we ignore the reduction, when the buffer is full all the
+	 * packets arriving at the host will be dropped.
+	 *
+	 * When the first packet is dropped the host will begin asking for a
+	 * zero window; this request will eventually be granted, however in
+	 * the meantime we will have lost a full window of packets.
+	 *
+	 * However, we cannot set FLAG_WIN_UPDATE either, otherwise the
+	 * ack will not be considered a DUPACK and when using Reno the
+	 * sacked_out count will not be updated.
+	 *
+	 */
+	if (icsk->icsk_ca_state == TCP_CA_Recovery &&
+	    nwin < tp->snd_wnd)
+		silent_update = 1;
+
+	if (silent_update || tcp_may_update_window(tp, ack, ack_seq, nwin)) {
+		if (!silent_update)
+			flag |= FLAG_WIN_UPDATE;
 		tcp_update_wl(tp, ack, ack_seq);

 		if (tp->snd_wnd != nwin) {
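The three-way decision the patch introduces (normal window update, ignored advertisement, or a "silent" shrink during recovery) can be isolated as a pure function. This is a user-space sketch only; the enum and function names are illustrative, and `may_update` stands in for the result of the kernel's tcp_may_update_window() check.

```c
/* Sketch of the window-update decision described above: during recovery,
 * accept a shrinking advertised window without raising FLAG_WIN_UPDATE,
 * so the ACK still counts as a DUPACK for Reno accounting. */
enum win_action { WIN_IGNORE, WIN_UPDATE, WIN_SILENT_UPDATE };

static enum win_action classify_window(int in_recovery,
                                       unsigned int nwin,     /* newly advertised window */
                                       unsigned int snd_wnd,  /* current send window */
                                       int may_update)        /* tcp_may_update_window() result */
{
    if (in_recovery && nwin < snd_wnd)
        return WIN_SILENT_UPDATE;  /* shrink the window, but not a "real" update */
    if (may_update)
        return WIN_UPDATE;         /* normal path: set FLAG_WIN_UPDATE */
    return WIN_IGNORE;             /* stale or non-updating advertisement */
}
```

The key design point is that WIN_SILENT_UPDATE still calls for tcp_update_wl() and snd_wnd assignment in the kernel code, only the flag is withheld.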
[TCP] rfc strict recovery
In this patch there is a collection of changes useful for bringing Linux TCP recovery close to the RFC standard. A kernel with this patch still defaults to the standard Linux recovery; when net.ipv4.tcp_rfcstrict_recovery=1 the changes in this patch are enabled.

I've already discussed something like this here, and I don't think the point is to debate whether Linux should or should not stay close to the RFCs. I'm sending this patch only because in my tests these changes have helped me obtain the expected (and better-performing) results. The rfcstrict recovery performs far better during Reno recovery from large network drops.

Regards,
Angelo P. Castellani

diff -urd linux-2.6.16-orig/include/linux/sysctl.h linux-2.6.16-stdrecovery/include/linux/sysctl.h
--- linux-2.6.16-orig/include/linux/sysctl.h	2006-05-16 14:53:02.0 +0200
+++ linux-2.6.16-stdrecovery/include/linux/sysctl.h	2006-07-05 17:05:24.0 +0200
@@ -397,6 +397,7 @@
 	NET_TCP_CONG_CONTROL=110,
 	NET_TCP_ABC=111,
 	NET_IPV4_IPFRAG_MAX_DIST=112,
+	NET_TCP_RFCSTRICT_RECOVERY,
 };

 enum {
diff -urd linux-2.6.16-orig/include/net/tcp.h linux-2.6.16-stdrecovery/include/net/tcp.h
--- linux-2.6.16-orig/include/net/tcp.h	2006-05-16 14:53:02.0 +0200
+++ linux-2.6.16-stdrecovery/include/net/tcp.h	2006-07-05 17:06:41.0 +0200
@@ -219,6 +219,7 @@
 extern int sysctl_tcp_moderate_rcvbuf;
 extern int sysctl_tcp_tso_win_divisor;
 extern int sysctl_tcp_abc;
+extern int sysctl_tcp_rfcstrict_recovery;

 extern atomic_t tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
diff -urd linux-2.6.16-orig/net/ipv4/sysctl_net_ipv4.c linux-2.6.16-stdrecovery/net/ipv4/sysctl_net_ipv4.c
--- linux-2.6.16-orig/net/ipv4/sysctl_net_ipv4.c	2006-05-16 14:53:02.0 +0200
+++ linux-2.6.16-stdrecovery/net/ipv4/sysctl_net_ipv4.c	2006-07-05 17:08:31.0 +0200
@@ -664,6 +664,14 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= NET_TCP_RFCSTRICT_RECOVERY,
+		.procname	= "tcp_rfcstrict_recovery",
+		.data		= &sysctl_tcp_rfcstrict_recovery,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 	{ .ctl_name = 0 }
 };

diff -urd linux-2.6.16-orig/net/ipv4/tcp_input.c linux-2.6.16-stdrecovery/net/ipv4/tcp_input.c
--- linux-2.6.16-orig/net/ipv4/tcp_input.c	2006-05-16 14:53:02.0 +0200
+++ linux-2.6.16-stdrecovery/net/ipv4/tcp_input.c	2006-07-05 17:26:41.0 +0200
@@ -91,6 +91,8 @@
 int sysctl_tcp_moderate_rcvbuf = 1;
 int sysctl_tcp_abc = 1;

+int sysctl_tcp_rfcstrict_recovery = 0;
+
 #define FLAG_DATA		0x01 /* Incoming frame contained data. */
 #define FLAG_WIN_UPDATE		0x02 /* Incoming ACK was a window update. */
 #define FLAG_DATA_ACKED		0x04 /* This ACK acknowledged new data. */
@@ -854,7 +856,8 @@
 			   const int ts)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	if (metric > tp->reordering) {
+	/* rfcstrict: no dynamic reordering metric */
+	if (!sysctl_tcp_rfcstrict_recovery && metric > tp->reordering) {
 		tp->reordering = min(TCP_MAX_REORDERING, metric);

 		/* This exciting event is worth to be remembered. 8) */
@@ -1784,7 +1787,10 @@
 			/* Hold old state until something *above* high_seq
 			 * is ACKed. For Reno it is MUST to prevent false
 			 * fast retransmits (RFC2582). SACK TCP is safe. */
-			tcp_moderate_cwnd(tp);
+			/* rfcstrict: a tcp_moderate_cwnd at the end of the recovery
+			 * already solves any kind of burstiness issue */
+			if (!sysctl_tcp_rfcstrict_recovery)
+				tcp_moderate_cwnd(tp);
 			return 1;
 		}
 		tcp_set_ca_state(sk, TCP_CA_Open);
@@ -2039,6 +2045,10 @@
 		if (!(flag & FLAG_ECE))
 			tp->prior_ssthresh = tcp_current_ssthresh(sk);
 		tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
+		/* rfcstrict: standard rule cwnd = ssthresh + 3
+		 * note: tp->reordering segments have been already added to sacked_out */
+		if (sysctl_tcp_rfcstrict_recovery)
+			tp->snd_cwnd = tp->snd_ssthresh;
 		TCP_ECN_queue_cwr(tp);
 	}

@@ -2049,7 +2059,9 @@
 	if (is_dupack || tcp_head_timedout(sk, tp))
 		tcp_update_scoreboard(sk, tp);
-	tcp_cwnd_down(sk);
+	/* rfcstrict: no further reduction other than cwnd = ssthresh + 3 */
+	if (!sysctl_tcp_rfcstrict_recovery || icsk->icsk_ca_state == TCP_CA_CWR)
+		tcp_cwnd_down(sk);
 	tcp_xmit_retransmit_queue(sk);
 }
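For reference, the "standard rule cwnd = ssthresh + 3" that the rfcstrict mode targets is the RFC 2582 fast-recovery entry step. A small sketch of that rule follows; the function name is illustrative, and units are segments. Note that in the patch itself only cwnd = ssthresh is assigned, because Linux already accounts the three duplicate-ACK'd segments in sacked_out.

```c
/* Sketch of the RFC 2582 fast-recovery entry rule:
 *   ssthresh = max(FlightSize / 2, 2)   (in segments)
 *   cwnd     = ssthresh + 3             (inflated by the 3 dupacks)
 * Name and scaling are illustrative, not kernel code. */
static unsigned int rfc2582_entry_cwnd(unsigned int flight_size)
{
    unsigned int ssthresh = flight_size / 2;

    if (ssthresh < 2)
        ssthresh = 2;      /* RFC lower bound of two segments */
    return ssthresh + 3;   /* account for the three duplicate ACKs */
}
```

With the sysctl enabled, recovery pins the window at this value instead of letting rate halving walk it down further on each ACK, which is where the performance difference on large drops comes from.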
[PATCH] TCP Compound: dwnd=0 on ssthresh
In the TCP Compound article used as a reference for this implementation, we read: "If a retransmission timeout occurs, dwnd should be reset to zero and the delay-based component is disabled", at page 5 of ftp://ftp.research.microsoft.com/pub/tr/TR-2005-86.pdf

The attached patch implements this requirement.

Regards,
Angelo P. Castellani

diff -urd a/net/ipv4/tcp_compound.c b/net/ipv4/tcp_compound.c
--- a/net/ipv4/tcp_compound.c	2006-07-05 17:19:28.0 +0200
+++ b/net/ipv4/tcp_compound.c	2006-07-05 17:20:42.0 +0200
@@ -221,12 +221,9 @@
 	tcp_compound_init(sk);
 }

-static void tcp_compound_cong_avoid(struct sock *sk, u32 ack,
-				    u32 seq_rtt, u32 in_flight, int flag)
-{
+static inline void tcp_compound_synch(struct sock *sk) {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct compound *vegas = inet_csk_ca(sk);
-	u8 inc = 0;

 	if (vegas->cwnd + vegas->dwnd > tp->snd_cwnd) {
 		if (vegas->cwnd > tp->snd_cwnd || vegas->dwnd > tp->snd_cwnd) {
@@ -234,9 +231,19 @@
 			vegas->dwnd = 0;
 		} else
 			vegas->cwnd = tp->snd_cwnd - vegas->dwnd;
-	}
+}
+
+static void tcp_compound_cong_avoid(struct sock *sk, u32 ack,
+				    u32 seq_rtt, u32 in_flight, int flag)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct compound *vegas = inet_csk_ca(sk);
+	u8 inc = 0;
+
+	tcp_compound_synch(sk);
+
 	if (!tcp_is_cwnd_limited(sk, in_flight))
 		return;

@@ -415,9 +422,21 @@
 	}
 }

+static u32 tcp_compound_ssthresh(struct sock *sk) {
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct compound *vegas = inet_csk_ca(sk);
+
+	tcp_compound_synch(sk);
+
+	vegas->dwnd = 0;
+	tp->snd_cwnd = vegas->cwnd;
+
+	return tcp_reno_ssthresh(sk);
+}
+
 static struct tcp_congestion_ops tcp_compound = {
 	.init		= tcp_compound_init,
-	.ssthresh	= tcp_reno_ssthresh,
+	.ssthresh	= tcp_compound_ssthresh,
 	.cong_avoid	= tcp_compound_cong_avoid,
 	.min_cwnd	= tcp_reno_min_cwnd,
 	.rtt_sample	= tcp_compound_rtt_calc,
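The invariant the patch enforces is simple: Compound's send window is the sum of a loss-based component (cwnd) and a delay-based component (dwnd), and on loss the delay-based part is zeroed. A user-space sketch of that rule follows; the struct and function names are illustrative, not the kernel's.

```c
/* Sketch of the Compound window rule enforced by the patch above:
 * window = cwnd (loss-based, Reno) + dwnd (delay-based), and on a
 * retransmission timeout dwnd is reset to zero so only the Reno
 * component survives.  All names are illustrative. */
struct compound_win {
    unsigned int cwnd;  /* loss-based (Reno) component */
    unsigned int dwnd;  /* delay-based component */
};

static unsigned int compound_snd_cwnd(const struct compound_win *w)
{
    return w->cwnd + w->dwnd;
}

static unsigned int compound_on_loss(struct compound_win *w)
{
    w->dwnd = 0;                  /* disable the delay-based component */
    return compound_snd_cwnd(w);  /* window collapses to cwnd alone */
}
```

This also motivates the tcp_compound_synch() helper in the patch: the two components must be re-derived from snd_cwnd whenever the rest of the stack has changed it behind the algorithm's back, or the sum invariant silently breaks.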
[PATCH] TCP Compound
From: Angelo P. Castellani [EMAIL PROTECTED]

TCP Compound is a sender-side-only change to TCP that uses a mixed Reno/Vegas approach to calculate the cwnd.

For further details look here:
ftp://ftp.research.microsoft.com/pub/tr/TR-2005-86.pdf

Signed-off-by: Angelo P. Castellani [EMAIL PROTECTED]
---
This new revision of the TCP Compound implementation fixes some issues present in the previous patch and has been reverted to a stand-alone file (thanks to Stephen's suggestion).

Regards,
Angelo P. Castellani

diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 326676b..e577eb8 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -542,6 +542,16 @@ config TCP_CONG_LP
 	``fair share`` of bandwidth as targeted by TCP.
 	See http://www-ece.rice.edu/networks/TCP-LP/

+config TCP_CONG_COMPOUND
+	tristate "TCP Compound"
+	depends on EXPERIMENTAL
+	default n
+	---help---
+	TCP Compound is a sender-side only change to TCP that uses
+	a mixed Reno/Vegas approach to calculate the cwnd.
+	For further details look here:
+	ftp://ftp.research.microsoft.com/pub/tr/TR-2005-86.pdf
+
 endmenu

 config TCP_CONG_BIC
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 5c65487..f0697c4 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_TCP_CONG_HTCP) += tcp_htcp.
 obj-$(CONFIG_TCP_CONG_VEGAS) += tcp_vegas.o
 obj-$(CONFIG_TCP_CONG_SCALABLE) += tcp_scalable.o
 obj-$(CONFIG_TCP_CONG_LP) += tcp_lp.o
+obj-$(CONFIG_TCP_CONG_COMPOUND) += tcp_compound.o
 obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \
 	 xfrm4_output.o

/*
 * TCP Vegas congestion control
 *
 * This is based on the congestion detection/avoidance scheme described in
 *    Lawrence S. Brakmo and Larry L. Peterson.
 *    "TCP Vegas: End to end congestion avoidance on a global internet."
 *    IEEE Journal on Selected Areas in Communication, 13(8):1465--1480,
 *    October 1995. Available from:
 *       ftp://ftp.cs.arizona.edu/xkernel/Papers/jsac.ps
 *
 * See http://www.cs.arizona.edu/xkernel/ for their implementation.
 * The main aspects that distinguish this implementation from the
 * Arizona Vegas implementation are:
 *   o We do not change the loss detection or recovery mechanisms of
 *     Linux in any way. Linux already recovers from losses quite well,
 *     using fine-grained timers, NewReno, and FACK.
 *   o To avoid the performance penalty imposed by increasing cwnd
 *     only every-other RTT during slow start, we increase during
 *     every RTT during slow start, just like Reno.
 *   o Largely to allow continuous cwnd growth during slow start,
 *     we use the rate at which ACKs come back as the "actual"
 *     rate, rather than the rate at which data is sent.
 *   o To speed convergence to the right rate, we set the cwnd
 *     to achieve the right ("actual") rate when we exit slow start.
 *   o To filter out the noise caused by delayed ACKs, we use the
 *     minimum RTT sample observed during the last RTT to calculate
 *     the actual rate.
 *   o When the sender re-starts from idle, it waits until it has
 *     received ACKs for an entire flight of new data before making
 *     a cwnd adjustment decision. The original Vegas implementation
 *     assumed senders never went idle.
 *
 *
 *   TCP Compound based on TCP Vegas
 *
 *   further details can be found here:
 *      ftp://ftp.research.microsoft.com/pub/tr/TR-2005-86.pdf
 */

#include <linux/config.h>
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/skbuff.h>
#include <linux/inet_diag.h>

#include <net/tcp.h>

/* Default values of the Vegas variables, in fixed-point representation
 * with V_PARAM_SHIFT bits to the right of the binary point.
 */
#define V_PARAM_SHIFT 1

#define TCP_COMPOUND_ALPHA          3U
#define TCP_COMPOUND_BETA           1U
#define TCP_COMPOUND_KAPPA_POW      3
#define TCP_COMPOUND_KAPPA_NSQRT    2
#define TCP_COMPOUND_GAMMA         30
#define TCP_COMPOUND_ZETA           1

/* TCP compound variables */
struct compound {
	u32 beg_snd_nxt;	/* right edge during last RTT */
	u32 beg_snd_una;	/* left edge during last RTT */
	u32 beg_snd_cwnd;	/* saves the size of the cwnd */

	u8 doing_vegas_now;	/* if true, do vegas for this RTT */
	u16 cntRTT;		/* # of RTTs measured within last RTT */
	u32 minRTT;		/* min of RTTs measured within last RTT (in usec) */
	u32 baseRTT;		/* the min of all Vegas RTT measurements seen (in usec) */

	u32 cwnd;
	u32 dwnd;
};

/* There are several situations when we must "re-start" Vegas:
 *
 *  o when a connection is established
 *  o after an RTO
 *  o after fast recovery
 *  o when we send a packet and there is no outstanding
 *    unacknowledged data (restarting an idle connection)
 *
 * In these circumstances we cannot do a Vegas calculation at the
 * end of the first RTT, because any calculation we do is using
 * stale info -- both the saved cwnd and congestion feedback are
 * stale.
 *
 * Instead we must wait until the completion of an RTT during
 * which we actually receive ACKs
[PATCH] reno sacked_out count fix
Using NewReno, if a sk_buff is timed out and is accounted as lost_out, it should also be removed from sacked_out. This is necessary because recovery using NewReno fast retransmit can take many RTTs, and the sk_buff's RTO can expire without the segment actually being lost.

left_out = sacked_out + lost_out
in_flight = packets_out - left_out + retrans_out

Using NewReno without this patch, on very large network losses, left_out becomes bigger than packets_out + retrans_out (!!). For this reason the unsigned integer in_flight wraps around to nearly 2^32.

Regards,
Angelo P. Castellani

diff -urd ../linux-2.6.16-orig/net/ipv4/tcp_input.c ./net/ipv4/tcp_input.c
--- ../linux-2.6.16-orig/net/ipv4/tcp_input.c	2006-05-15 15:42:39.0 +0200
+++ ./net/ipv4/tcp_input.c	2006-05-16 11:18:21.0 +0200
@@ -1676,6 +1676,8 @@
 		if (!(TCP_SKB_CB(skb)->sacked & TCPCB_TAGBITS)) {
 			TCP_SKB_CB(skb)->sacked |= TCPCB_LOST;
 			tp->lost_out += tcp_skb_pcount(skb);
+			if (IsReno(tp))
+				tcp_remove_reno_sacks(sk, tp, tcp_skb_pcount(skb) + 1);

 			/* clear xmit_retrans hint */
 			if (tp->retransmit_skb_hint
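The accounting bug is easy to demonstrate with the two formulas above. The following user-space sketch reproduces the unsigned wrap-around; the function name is illustrative and the arithmetic assumes the usual 32-bit unsigned int.

```c
/* Sketch of the in_flight accounting described above:
 *   left_out  = sacked_out + lost_out
 *   in_flight = packets_out - left_out + retrans_out
 * If a timed-out skb stays in sacked_out while also entering lost_out,
 * left_out can exceed packets_out + retrans_out and the unsigned
 * subtraction wraps around to a value near 2^32. */
static unsigned int tcp_in_flight(unsigned int packets_out,
                                  unsigned int sacked_out,
                                  unsigned int lost_out,
                                  unsigned int retrans_out)
{
    unsigned int left_out = sacked_out + lost_out;

    /* evaluated in modular unsigned arithmetic: no negative results */
    return packets_out - left_out + retrans_out;
}
```

With sane inputs the result is the real number of unacknowledged segments in the network; with double-counted segments it becomes astronomically large, and every cwnd-limited send decision based on it then misbehaves, which is what the patch prevents.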
[PATCH] Enabling standard compliant behaviour in the Linux TCP implementation
Hi all,
I'm a student doing a thesis about TCP performance over high-BDP links, and so about congestion control in TCP. To do this work I've built a testbed using the latest Linux release (2.6.16).

Along the way I've come across the fact that the Linux TCP implementation isn't fully standards-compliant. Even though the choices to diverge from the standards were made deliberately, I think it should be possible to disable these Linuxisms. Surely this can help people using Linux to evaluate a standards-compliant environment. Moreover, it permits comparing the pros and cons of the Linux implementation against the standard one.

So I've disabled the first two Linux-specific mechanisms I've found:
- rate halving
- dynamic reordering metric (dynamic DupThresh)

These are disabled as long as net.ipv4.tcp_standard_compliant=1 (default: 0). However, I don't exclude that there are more non-standard details, so I hope somebody can point out some more differences between Linux and the RFCs.

Moreover, NewReno is implemented in the Impatient variant (resets the retransmit timer only on the first partial ack); with net.ipv4.tcp_slow_but_steady=1 (default: 0) you can enable the Slow-but-Steady variant (resets the retransmit timer on every partial ack).

Hoping that this can be useful, I attach the patch.

Regards,
Angelo P. Castellani

diff -urd ../linux-2.6.16-orig/include/linux/sysctl.h ./include/linux/sysctl.h
--- ../linux-2.6.16-orig/include/linux/sysctl.h	2006-05-16 14:53:02.0 +0200
+++ ./include/linux/sysctl.h	2006-05-16 14:54:50.0 +0200
@@ -397,6 +397,8 @@
 	NET_TCP_CONG_CONTROL=110,
 	NET_TCP_ABC=111,
 	NET_IPV4_IPFRAG_MAX_DIST=112,
+	NET_TCP_STANDARD_COMPLIANT,
+	NET_TCP_SLOW_BUT_STEADY,
 };

 enum {
diff -urd ../linux-2.6.16-orig/include/net/tcp.h ./include/net/tcp.h
--- ../linux-2.6.16-orig/include/net/tcp.h	2006-05-16 14:53:02.0 +0200
+++ ./include/net/tcp.h	2006-05-16 14:55:43.0 +0200
@@ -219,6 +219,8 @@
 extern int sysctl_tcp_moderate_rcvbuf;
 extern int sysctl_tcp_tso_win_divisor;
 extern int sysctl_tcp_abc;
+extern int sysctl_tcp_standard_compliant;
+extern int sysctl_tcp_slow_but_steady;

 extern atomic_t tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
diff -urd ../linux-2.6.16-orig/net/ipv4/sysctl_net_ipv4.c ./net/ipv4/sysctl_net_ipv4.c
--- ../linux-2.6.16-orig/net/ipv4/sysctl_net_ipv4.c	2006-05-16 14:53:02.0 +0200
+++ ./net/ipv4/sysctl_net_ipv4.c	2006-05-16 14:57:23.0 +0200
@@ -664,6 +664,22 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
 	},
+	{
+		.ctl_name	= NET_TCP_STANDARD_COMPLIANT,
+		.procname	= "tcp_standard_compliant",
+		.data		= &sysctl_tcp_standard_compliant,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
+		.ctl_name	= NET_TCP_SLOW_BUT_STEADY,
+		.procname	= "tcp_slow_but_steady",
+		.data		= &sysctl_tcp_slow_but_steady,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
 	{ .ctl_name = 0 }
 };

diff -urd ../linux-2.6.16-orig/net/ipv4/tcp_input.c ./net/ipv4/tcp_input.c
--- ../linux-2.6.16-orig/net/ipv4/tcp_input.c	2006-05-16 14:53:02.0 +0200
+++ ./net/ipv4/tcp_input.c	2006-05-16 14:52:43.0 +0200
@@ -81,6 +81,7 @@
 int sysctl_tcp_dsack = 1;
 int sysctl_tcp_app_win = 31;
 int sysctl_tcp_adv_win_scale = 2;
+int sysctl_tcp_standard_compliant = 0;

 int sysctl_tcp_stdurg;
 int sysctl_tcp_rfc1337;
@@ -854,7 +855,7 @@
 			   const int ts)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	if (metric > tp->reordering) {
+	if (!sysctl_tcp_standard_compliant && metric > tp->reordering) {
 		tp->reordering = min(TCP_MAX_REORDERING, metric);

 		/* This exciting event is worth to be remembered. 8) */
@@ -2039,6 +2040,8 @@
 		if (!(flag & FLAG_ECE))
 			tp->prior_ssthresh = tcp_current_ssthresh(sk);
 		tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
+		if (sysctl_tcp_standard_compliant)
+			tp->snd_cwnd = tp->snd_ssthresh; /* tp->reordering segments should've been already added to sacked_out */
 		TCP_ECN_queue_cwr(tp);
 	}

@@ -2049,7 +2052,8 @@
 	if (is_dupack || tcp_head_timedout(sk, tp))
 		tcp_update_scoreboard(sk, tp);
-	tcp_cwnd_down(sk);
+	if (!sysctl_tcp_standard_compliant || icsk->icsk_ca_state == TCP_CA_CWR)
+		tcp_cwnd_down(sk);
 	tcp_xmit_retransmit_queue(sk);
 }

diff -urd ../linux-2.6.16-orig/net/ipv4/tcp_output.c ./net/ipv4/tcp_output.c
--- ../linux-2.6.16-orig/net/ipv4/tcp_output.c	2006-05-16 14:53:02.0 +0200
+++ ./net/ipv4/tcp_output.c	2006-05-16 14:52:43.0 +0200
@@ -51,6 +51,9 @@
  */
 int sysctl_tcp_tso_win_divisor = 3;

+/* Enables the Slow-but-Steady variant of NewReno (cfr. RFC2582 Ch.4) */
+int sysctl_tcp_slow_but_steady = 0;
+
 static void update_send_head(struct sock *sk, struct tcp_sock *tp,
 			     struct sk_buff *skb)
 {
@@ -1604,7 +1607,7 @@
 		else
 			NET_INC_STATS_BH(LINUX_MIB_TCPSLOWSTARTRETRANS);

-		if (skb ==
+		if (sysctl_tcp_slow_but_steady || skb ==
 		    skb_peek(sk->sk_write_queue
Re: tcp compound
On 5/9/06, Stephen Hemminger [EMAIL PROTECTED] wrote:
Moved discussion over to netdev mailing list.
Could you export symbols in tcp_vegas (and change config dependencies) to allow code reuse, rather than having to copy/paste everything from vegas?

I hope I've done that properly.

tcp_compound.patch.gz
Description: GNU Zip compressed data