Re: [PATCHSET 0/18] FRTO: fixes and small changes + SACK enhanced version
On Thu, 22 Feb 2007, David Miller wrote: > If you have any further bug fixes or enhancements to FRTO we > can put it there too. I tried the rate-halving approach in the spurious RTO response and it looks really nice indeed. I'll also try to run a couple of tests with the undo_cwr(,1) response in the near future to ensure that it doesn't cause immediate breakage with something. Before preparing the submittable spurious RTO responses patch, I would like to know what the preferred style is for the response selector code (I'm controlling the response through the frto_response sysctl)? switch? array of function pointers? If an array, where to place it (considering the need for forward declarations and the mixing of code and data in some approaches)? -- i. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
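For illustration, the two dispatch styles being asked about might look roughly like this in user-space C (the handler and array names below are made up for the sketch, not kernel code; later messages in this thread show that a switch was what ended up being used):

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real tcp_*_spur_to_response()
 * handlers; each just returns an id so the two dispatch styles
 * can be compared side by side. */
static int response_ratehalving(void)  { return 0; }
static int response_conservative(void) { return 1; }
static int response_undo_cwr(void)     { return 2; }

/* Style 1: switch on the sysctl value. */
static int dispatch_switch(int frto_response)
{
	switch (frto_response) {
	case 2:  return response_undo_cwr();
	case 1:  return response_conservative();
	default: return response_ratehalving();
	}
}

/* Style 2: an array of function pointers indexed by the sysctl
 * value; this needs the forward declarations above it, plus
 * explicit bounds checking for out-of-range sysctl values. */
static int (*const frto_responses[])(void) = {
	response_ratehalving,
	response_conservative,
	response_undo_cwr,
};

static int dispatch_array(int frto_response)
{
	size_t n = sizeof(frto_responses) / sizeof(frto_responses[0]);

	if (frto_response < 0 || (size_t)frto_response >= n)
		frto_response = 0;	/* unknown value: default response */
	return frto_responses[frto_response]();
}
```

Either form behaves the same for valid sysctl values; the switch avoids the bounds check and keeps code and data together, which is presumably why it won out.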
Re: [PATCH] fix limited slow start bug
On Thu, 22 Feb 2007, John Heffner wrote: > diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c > index 7fd2910..a0c894f 100644 > --- a/net/ipv4/tcp_cong.c > +++ b/net/ipv4/tcp_cong.c > @@ -303,9 +303,9 @@ void tcp_slow_start(struct tcp_sock *tp) > > tp->snd_cwnd_cnt += cnt; > while (tp->snd_cwnd_cnt >= tp->snd_cwnd) { > + tp->snd_cwnd_cnt -= tp->snd_cwnd; > if (tp->snd_cwnd < tp->snd_cwnd_clamp) > tp->snd_cwnd++; > - tp->snd_cwnd_cnt -= tp->snd_cwnd; > } > } > EXPORT_SYMBOL_GPL(tcp_slow_start); ACK. I was going to track this down tomorrow as I noticed strange behavior during slow start while testing tcp-2.6 today, but I don't have to anymore :-). The problem I saw is now obvious: congestion control as a whole was practically disabled due to the underflow of the unsigned type that occurs without this patch, causing TCP to be limited only by the receiver-advertised window during slow start. BTW, while looking at this patch, I noticed that snd_cwnd_clamp is only u16 while snd_cwnd is u32, which seems rather strange since snd_cwnd is being limited by the clamp value here and there?!?! And tcp_highspeed.c is clearly assuming even more than this (but the problem is hidden as snd_cwnd_clamp is fed back to the min_t and the 32-bit constant used could safely be cut to 16 bits anyway): tp->snd_cwnd_clamp = min_t(u32, tp->snd_cwnd_clamp, 0xffffffff/128); Has the type been changed somewhere in the past, or why is this so? -- i.
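The underflow described in the reply can be reproduced in user space. The sketch below (helper names made up for illustration, not the kernel functions) contrasts the pre-patch subtraction order, which subtracts the already-incremented cwnd and wraps the unsigned counter, with the fixed order from the patch:

```c
typedef unsigned int u32;

/* Pre-patch order: cwnd is incremented first, so the counter is
 * reduced by the *new* cwnd and can wrap below zero.  The walk is
 * capped so the wraparound is observable instead of iterating for
 * millions of rounds. */
static u32 slow_start_old(u32 *cwnd, u32 cnt_acked, u32 clamp)
{
	u32 cnt = cnt_acked;
	int guard = 16;

	while (cnt >= *cwnd && guard--) {
		if (*cwnd < clamp)
			(*cwnd)++;
		cnt -= *cwnd;	/* 10 - 11 wraps to ~4 billion */
	}
	return cnt;
}

/* Fixed order from the patch: subtract the old cwnd first, then
 * increment, so the counter can never go below zero. */
static u32 slow_start_new(u32 *cwnd, u32 cnt_acked, u32 clamp)
{
	u32 cnt = cnt_acked;

	while (cnt >= *cwnd) {
		cnt -= *cwnd;
		if (*cwnd < clamp)
			(*cwnd)++;
	}
	return cnt;
}
```

With cwnd = 10 and exactly 10 newly acked packets (the common case), the old order leaves the counter wrapped to a huge value, so every subsequent call keeps pushing cwnd toward the clamp, which matches the "congestion control practically disabled" symptom; the fixed order yields one clean increment and a zero counter.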
[PATCH] [TCP]: Correct reordering detection change (no FRTO case)
The reordering detection must also work when FRTO has not been used at all, which was my original intention; just the expression of the idea was flawed. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- Applies on top of the tcp-2.6 branch, as you would probably have guessed. diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index bb3f234..f6ba07f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1240,7 +1240,7 @@ tcp_sacktag_write_queue(struct sock *sk, tp->left_out = tp->sacked_out + tp->lost_out; if ((reord < tp->fackets_out) && icsk->icsk_ca_state != TCP_CA_Loss && - (tp->frto_highmark && after(tp->snd_una, tp->frto_highmark))) + (!tp->frto_highmark || after(tp->snd_una, tp->frto_highmark))) tcp_update_reordering(sk, ((tp->fackets_out + 1) - reord), 0); #if FASTRETRANS_DEBUG > 0 -- 1.4.2
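A user-space sketch of the one-operator change (simplified after(), predicate function names hypothetical) shows that the old and fixed guards differ only in the no-FRTO case, where frto_highmark is still zero:

```c
typedef unsigned int u32;

/* Simplified stand-in for the kernel's after(): fine away from
 * pathological wraparound distances. */
static int after(u32 a, u32 b) { return (int)(a - b) > 0; }

/* Old, flawed guard: silently disabled reordering detection
 * whenever frto_highmark == 0, i.e. when FRTO was never used. */
static int may_update_reordering_old(u32 snd_una, u32 frto_highmark)
{
	return frto_highmark && after(snd_una, frto_highmark);
}

/* Fixed guard from the patch: allow the update when FRTO has not
 * been used at all, or once snd_una has advanced past the mark. */
static int may_update_reordering_new(u32 snd_una, u32 frto_highmark)
{
	return !frto_highmark || after(snd_una, frto_highmark);
}
```

Inside the FRTO-marked region both versions still suppress the reordering update, which is the behavior the guard was added for in the first place.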
[PATCH 1/2] [TCP]: Add two new spurious RTO responses to FRTO
New sysctl tcp_frto_response is added to select amongst these responses: - Rate halving based; reuses CA_CWR state (default) - Very conservative; used to be the only one available (=1) - Undo cwr; undoes ssthresh and cwnd reductions (=2) The response with rate halving requires a new parameter to tcp_enter_cwr because FRTO has already reduced ssthresh and doing a second reduction there has to be prevented. In addition, to keep things nice on 80 cols screen, a local variable was added. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- include/linux/sysctl.h |1 + include/net/tcp.h |3 ++- net/ipv4/sysctl_net_ipv4.c |8 net/ipv4/tcp_input.c | 30 ++ net/ipv4/tcp_output.c |2 +- 5 files changed, 38 insertions(+), 6 deletions(-) diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index a2dce72..80f11e0 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -439,6 +439,7 @@ enum NET_TCP_AVAIL_CONG_CONTROL=122, NET_TCP_ALLOWED_CONG_CONTROL=123, NET_TCP_MAX_SSTHRESH=124, + NET_TCP_FRTO_RESPONSE=125, }; enum { diff --git a/include/net/tcp.h b/include/net/tcp.h index 6d09f50..f0c9e34 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -220,6 +220,7 @@ extern int sysctl_tcp_app_win; extern int sysctl_tcp_adv_win_scale; extern int sysctl_tcp_tw_reuse; extern int sysctl_tcp_frto; +extern int sysctl_tcp_frto_response; extern int sysctl_tcp_low_latency; extern int sysctl_tcp_dma_copybreak; extern int sysctl_tcp_nometrics_save; @@ -738,7 +739,7 @@ static inline void tcp_sync_left_out(str tp->left_out = tp->sacked_out + tp->lost_out; } -extern void tcp_enter_cwr(struct sock *sk); +extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh); extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst); /* Slow start with delack produces 3 packets of burst, so that diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index d68effe..6817d64 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -647,6 
+647,14 @@ #endif .proc_handler = &proc_dointvec }, { + .ctl_name = NET_TCP_FRTO_RESPONSE, + .procname = "tcp_frto_response", + .data = &sysctl_tcp_frto_response, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec + }, + { .ctl_name = NET_TCP_LOW_LATENCY, .procname = "tcp_low_latency", .data = &sysctl_tcp_low_latency, diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index f6ba07f..d6e1776 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -86,6 +86,7 @@ int sysctl_tcp_stdurg __read_mostly; int sysctl_tcp_rfc1337 __read_mostly; int sysctl_tcp_max_orphans __read_mostly = NR_FILE; int sysctl_tcp_frto __read_mostly; +int sysctl_tcp_frto_response __read_mostly; int sysctl_tcp_nometrics_save __read_mostly; int sysctl_tcp_moderate_rcvbuf __read_mostly = 1; @@ -762,15 +763,17 @@ __u32 tcp_init_cwnd(struct tcp_sock *tp, } /* Set slow start threshold and cwnd not falling to slow start */ -void tcp_enter_cwr(struct sock *sk) +void tcp_enter_cwr(struct sock *sk, const int set_ssthresh) { struct tcp_sock *tp = tcp_sk(sk); + const struct inet_connection_sock *icsk = inet_csk(sk); tp->prior_ssthresh = 0; tp->bytes_acked = 0; if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) { tp->undo_marker = 0; - tp->snd_ssthresh = inet_csk(sk)->icsk_ca_ops->ssthresh(sk); + if (set_ssthresh) + tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp) + 1U); tp->snd_cwnd_cnt = 0; @@ -2003,7 +2006,7 @@ static void tcp_try_to_open(struct sock tp->retrans_stamp = 0; if (flag&FLAG_ECE) - tcp_enter_cwr(sk); + tcp_enter_cwr(sk, 1); if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) { int state = TCP_CA_Open; @@ -2579,6 +2582,21 @@ static void tcp_conservative_spur_to_res tcp_moderate_cwnd(tp); } +/* A conservative spurious RTO response algorithm: reduce cwnd using + * rate halving and continue in congestion avoidance. 
+ */ +static void tcp_ratehalving_spur_to_response(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + tcp_enter_cwr(sk, 0); + tp->high_seq = tp->frto_highmark; /* Smoother w/o this? - ij */ +} + +static void tcp_undo_spur_to_response(struct sock *sk) +{ + tcp_undo_cwr(sk, 1); +}
[PATCH 2/2] [TCP] Sysctl documentation: tcp_frto_response
In addition, fixed minor things in the tcp_frto sysctl. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- Documentation/networking/ip-sysctl.txt | 21 +++-- 1 files changed, 19 insertions(+), 2 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index d66777b..95e2a9f 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -174,15 +174,32 @@ tcp_fin_timeout - INTEGER because they eat maximum 1.5K of memory, but they tend to live longer. Cf. tcp_max_orphans. -tcp_frto - BOOLEAN +tcp_frto - INTEGER Enables F-RTO, an enhanced recovery algorithm for TCP retransmission timeouts. It is particularly beneficial in wireless environments where packet loss is typically due to random radio interference rather than intermediate router congestion. If set to 1, basic - version is enabled. 2 enables SACK enhanced FRTO, which is + version is enabled. 2 enables SACK enhanced F-RTO, which is EXPERIMENTAL. The basic version can be used also when SACK is enabled for a flow through tcp_sack sysctl. +tcp_frto_response - INTEGER + When F-RTO has detected that a TCP retransmission timeout was + spurious (i.e., the timeout would have been avoided had TCP set a + longer retransmission timeout), TCP has several options on what to do + next. 
Possible values are: + 0 Rate halving based; a smooth and conservative response, + results in halved cwnd and ssthresh after one RTT + 1 Very conservative response; not recommended because even + though it is valid, it interacts poorly with the rest of + Linux TCP, halves cwnd and ssthresh immediately + 2 Aggressive response; undoes congestion control measures + that are now known to be unnecessary (ignoring the + possibility of a lost retransmission that would require + TCP to be more cautious), cwnd and ssthresh are restored + to the values prior to the timeout + Default: 0 (rate halving based) + tcp_keepalive_time - INTEGER How often TCP sends out keepalive messages when keepalive is enabled. Default: 2hours. -- 1.4.2
Re: [PATCH 1/2] [TCP]: Add two new spurious RTO responses to FRTO
On Wed, 28 Feb 2007, Jarek Poplawski wrote: > On 27-02-2007 16:50, Ilpo Järvinen wrote: > > New sysctl tcp_frto_response is added to select amongst these > ... > > Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> > > @@ -762,15 +763,17 @@ __u32 tcp_init_cwnd(struct tcp_sock *tp, > > } > > > > /* Set slow start threshold and cwnd not falling to slow start */ > > -void tcp_enter_cwr(struct sock *sk) > > +void tcp_enter_cwr(struct sock *sk, const int set_ssthresh) > > { > > struct tcp_sock *tp = tcp_sk(sk); > > + const struct inet_connection_sock *icsk = inet_csk(sk); > > > > tp->prior_ssthresh = 0; > > tp->bytes_acked = 0; > > if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) { > > - if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) { > + if (icsk->icsk_ca_state < TCP_CA_CWR) { > > Probably something for the next "BTW". These are going to 2.6.22, not to 2.6.21, see: http://marc.theaimsgroup.com/?l=linux-netdev&m=117213215924406&w=2 ...or do you mean something else? Since DaveM has already applied this, here is a patch with your correction alone on top of tcp-2.6. -- [PATCH] [TCP]: Complete icsk-to-local-variable change (in tcp_enter_cwr) A local variable for icsk was created but this change was missing. Spotted by Jarek Poplawski. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index d6e1776..dc221a3 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -770,7 +770,7 @@ void tcp_enter_cwr(struct sock *sk, cons tp->prior_ssthresh = 0; tp->bytes_acked = 0; - if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) { + if (icsk->icsk_ca_state < TCP_CA_CWR) { tp->undo_marker = 0; if (set_ssthresh) tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); -- 1.4.2
[RFC PATCH] [TCP]: Move clearing of the prior_ssthresh due to ECE earlier
Previously the prior_ssthresh was only used in the undo functions that were called by fastretrans_alert. FRTO processing, however, happens before that, and one of the responses does an undo too. I think that after this patch, FRTO should be ECN-safe; only the undo_cwr response is a suspect anyway. The possible cases: 1) Pre-RTO ECE: FRTO is not setting prior_ssthresh in CA_CWR at all, therefore it correctly remains cleared and ssthresh is the already reduced one 2a) Post-RTO ECE, before deciding ACK: prior_ssthresh is cleared, so no undoing for ssthresh will occur 2b) Post-RTO ECE, deciding ACK: this patch makes this case equal to 2a 3) Between two RTOs: prior_ssthresh is cleared like in 2a or the reduced value is used like in 1 The deciding ACK is the one after which FRTO declares the RTO spurious. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- This problem could be solved locally in the undo_cwr response of FRTO by falling back to the ratehalving response when FLAG_ECE is enabled. If that's the preferable way, just ping me to get a patch. I'm not absolutely sure that this does not break anything, even though I couldn't find anything using prior_ssthresh between the new and the old place (I've seen more than once over the years that things are not always as simple as they seem). net/ipv4/tcp_input.c | 19 +-- 1 files changed, 9 insertions(+), 10 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index dc221a3..b0c9dfb 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2080,15 +2080,11 @@ tcp_fastretrans_alert(struct sock *sk, u tp->fackets_out = 0; /* Now state machine starts. -* A. ECE, hence prohibit cwnd undoing, the reduction is required. */ - if (flag&FLAG_ECE) - tp->prior_ssthresh = 0; - - /* B. In all the states check for reneging SACKs. */ +* A. In all the states check for reneging SACKs. */ if (tp->sacked_out && tcp_check_sack_reneging(sk)) return; - /* C. Process data loss notification, provided it is valid. */ + /* B. 
Process data loss notification, provided it is valid. */ if ((flag&FLAG_DATA_LOST) && before(tp->snd_una, tp->high_seq) && icsk->icsk_ca_state != TCP_CA_Open && @@ -2097,10 +2093,10 @@ tcp_fastretrans_alert(struct sock *sk, u NET_INC_STATS_BH(LINUX_MIB_TCPLOSS); } - /* D. Synchronize left_out to current state. */ + /* C. Synchronize left_out to current state. */ tcp_sync_left_out(tp); - /* E. Check state exit conditions. State can be terminated + /* D. Check state exit conditions. State can be terminated *when high_seq is ACKed. */ if (icsk->icsk_ca_state == TCP_CA_Open) { BUG_TRAP(tp->retrans_out == 0); @@ -2143,7 +2139,7 @@ tcp_fastretrans_alert(struct sock *sk, u } } - /* F. Process state. */ + /* E. Process state. */ switch (icsk->icsk_ca_state) { case TCP_CA_Recovery: if (prior_snd_una == tp->snd_una) { @@ -2742,8 +2738,11 @@ static int tcp_ack(struct sock *sk, stru if (TCP_SKB_CB(skb)->sacked) flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una); - if (TCP_ECN_rcv_ecn_echo(tp, skb->h.th)) + if (TCP_ECN_rcv_ecn_echo(tp, skb->h.th)) { flag |= FLAG_ECE; + /* ECE, prohibit cwnd undoing, reduction is required */ + tp->prior_ssthresh = 0; + } tcp_ca_event(sk, CA_EVENT_SLOW_ACK); } -- 1.4.2
Re: [RFC PATCH] [TCP]: Move clearing of the prior_ssthresh due to ECE earlier
On Thu, 1 Mar 2007, Ilpo Järvinen wrote: > Previously the prior_ssthresh was only used in the undo functions > that were called by fastretrans_alert. FRTO processing, however, > happens before that and one of the responses is doing undo too. > > I think that after this patch, FRTO should be ECN-safe; only the > undo_cwr response is a suspect anyway. The possible cases: > [...snip...] > > This problem could be solved locally in the undo_cwr response of > FRTO by falling back to ratehalving response when FLAG_ECE is > enabled. I think that doing it in the response is better than this approach, since it knows that the ssthresh has been halved already within that round-trip, so there is no need to do that again... I'll submit the patch tomorrow... With this prior_ssthresh clearing move alone, the ssthresh ends up being halved twice if I've thought it right (first in tcp_enter_frto and then again in tcp_enter_cwr, which is called from fastretrans_alert)... So please, drop this patch. -- i.
[PATCH] [TCP]: FRTO undo response falls back to ratehalving one if ECEd
Undoing ssthresh is disabled in fastretrans_alert whenever FLAG_ECE is set by clearing prior_ssthresh. This clearing does not protect FRTO because FRTO operates before fastretrans_alert. Moving the clearing of prior_ssthresh earlier seems to be a suboptimal solution to the FRTO case because then FLAG_ECE will cause a second ssthresh reduction in try_to_open (the first occurred when FRTO was entered). So instead, FRTO falls back immediately to the rate halving response, which switches TCP to CA_CWR state preventing the latter reduction of ssthresh. If the first ECE arrived before the ACK after which FRTO is able to decide RTO as spurious, prior_ssthresh is already cleared. Thus no undoing for ssthresh occurs. Besides, FLAG_ECE should be set also in the following ACKs resulting in rate halving response that sees TCP already in CA_CWR, which again prevents an extra ssthresh reduction on that round-trip. If the first ECE arrived before RTO, ssthresh has already been adapted and prior_ssthresh remains cleared on entry because TCP is in CA_CWR (the same applies also to a case where FRTO is entered more than once and ECE comes in the middle). I believe that after this patch, FRTO should be ECN-safe and even able to take advantage of synergy benefits. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c |9 ++--- 1 files changed, 6 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index dc221a3..bdd6172 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2592,9 +2592,12 @@ static void tcp_ratehalving_spur_to_resp tp->high_seq = tp->frto_highmark; /* Smoother w/o this? 
- ij */ } -static void tcp_undo_spur_to_response(struct sock *sk) +static void tcp_undo_spur_to_response(struct sock *sk, int flag) { - tcp_undo_cwr(sk, 1); + if (flag&FLAG_ECE) + tcp_ratehalving_spur_to_response(sk); + else + tcp_undo_cwr(sk, 1); } /* F-RTO spurious RTO detection algorithm (RFC4138) @@ -2680,7 +2683,7 @@ static int tcp_process_frto(struct sock return 1; } else /* frto_counter == 2 */ { switch (sysctl_tcp_frto_response) { - case 2: tcp_undo_spur_to_response(sk); break; + case 2: tcp_undo_spur_to_response(sk, flag); break; case 1: tcp_conservative_spur_to_response(tp); break; default: tcp_ratehalving_spur_to_response(sk); break; } -- 1.4.2
[PATCH v2] [TCP]: FRTO undo response falls back to ratehalving one if ECEd
Undoing ssthresh is disabled in fastretrans_alert whenever FLAG_ECE is set by clearing prior_ssthresh. The clearing does not protect FRTO because FRTO operates before fastretrans_alert. Moving the clearing of prior_ssthresh earlier seems to be a suboptimal solution to the FRTO case because then FLAG_ECE will cause a second ssthresh reduction in try_to_open (the first occurred when FRTO was entered). So instead, FRTO falls back immediately to the rate halving response, which switches TCP to CA_CWR state preventing the latter reduction of ssthresh. If the first ECE arrived before the ACK after which FRTO is able to decide RTO as spurious, prior_ssthresh is already cleared. Thus no undoing for ssthresh occurs. Besides, FLAG_ECE should be set also in the following ACKs resulting in rate halving response that sees TCP is already in CA_CWR, which again prevents an extra ssthresh reduction on that round-trip. If the first ECE arrived before RTO, ssthresh has already been adapted and prior_ssthresh remains cleared on entry because TCP is in CA_CWR (the same applies also to a case where FRTO is entered more than once and ECE comes in the middle). High_seq must not be touched after tcp_enter_cwr because CWR round-trip calculation depends on it. I believe that after this patch, FRTO should be ECN-safe and even able to take advantage of synergy benefits. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- Of course I forgot to fix also the high_seq thing I had in mind last evening, so here is this again now with it too. diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index dc221a3..6b268dc 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2587,14 +2587,15 @@ static void tcp_conservative_spur_to_res */ static void tcp_ratehalving_spur_to_response(struct sock *sk) { - struct tcp_sock *tp = tcp_sk(sk); tcp_enter_cwr(sk, 0); - tp->high_seq = tp->frto_highmark; /* Smoother w/o this? 
- ij */ } -static void tcp_undo_spur_to_response(struct sock *sk) +static void tcp_undo_spur_to_response(struct sock *sk, int flag) { - tcp_undo_cwr(sk, 1); + if (flag&FLAG_ECE) + tcp_ratehalving_spur_to_response(sk); + else + tcp_undo_cwr(sk, 1); } /* F-RTO spurious RTO detection algorithm (RFC4138) @@ -2680,7 +2681,7 @@ static int tcp_process_frto(struct sock return 1; } else /* frto_counter == 2 */ { switch (sysctl_tcp_frto_response) { - case 2: tcp_undo_spur_to_response(sk); break; + case 2: tcp_undo_spur_to_response(sk, flag); break; case 1: tcp_conservative_spur_to_response(tp); break; default: tcp_ratehalving_spur_to_response(sk); break; } -- 1.4.2
Re: [PATCH 4/4]: Kill fastpath_{skb,cnt}_hint.
On Fri, 2 Mar 2007, David Miller wrote: > From: Baruch Even <[EMAIL PROTECTED]> > Date: Thu, 1 Mar 2007 20:13:40 +0200 > > One drawback for this approach is that you now walk the entire sack > > block when you advance one packet. If you consider a 10,000 packet queue > > which had several losses at the beginning and a large sack block that > > advances from the middle to the end you'll walk a lot of packets for > > that one last stretch of a sack block. Maybe this information in the last ACK could even be used to our advantage to speed up the start-point lookup, since the last ACK very likely contains the highest SACK block (this might not hold under attack, though), no matter how big it is, and we never need to process it if we know for sure that it was the global highest rather than a local one. Using that, it's trivial to find the starting point below the SACK block, and that could be passed to the head_lost marking on the fly on the stack rather than in the structs. Sort of a fastpath too :-)... Maybe even some auto-tuning thingie which conditionally enables skipping of large SACK blocks to reuse all of this last-ACK information in mark_head, but that would get rather complicated I think. > > One way to handle that is to use the still existing sack fast path to > > detect this case and calculate what is the sequence number to search > > for. Since you know what was the end_seq that was handled last, you can > > search for it as the start_seq and go on from there. Does it make sense? > > BTW, I think I figured out a way to get rid of > lost_{skb,cnt}_hint. The fact of the matter in this case is that > the setting of the tag bits always propagates from front of the queue > onward. We don't get holes mid-way. > > So what we can do is search the RB-tree for high_seq and walk > backwards. Once we hit something with TCPCB_TAGBITS set, we > stop processing as there are no earlier SKBs which we'd need > to do anything with. 
If TCP knows the highest SACK block skb (or its seq) and high_seq, finding the starting point is really trivial as you can always take the min of them without any walking. Then walk backwards, skipping reordering first, until the first LOST one is encountered. ...The only overhead then comes from the skipped skbs, which depends on the current reordering (of course that was not overhead if they were timed out). This would not even require knowledge about per-skb fack_count because the skipping serves its purpose and can be done on the fly. > scoreboard_skb_hint is a little bit trickier, but it is a similar > case to the tcp_lost_skb_hint case. Except here the termination > condition is a relative timeout instead of a sequence number and > packet count test. > > Perhaps for that we can remember some state from the > tcp_mark_head_lost() we do first. In fact, we can start > the queue walk from the latest packet which tcp_mark_head_lost() > marked with a tag bit. Yes, the problem with this case compared to head_lost seems to be that we don't know whether the first skb (in a backwards walk) must be marked until we have walked past it (actually to the point where the first SACKED_RETRANS is encountered; timestamps should be in order except for the discontinuity that occurs at the SACKED_RETRANS edge). So it seems to me that any backwards walk could be a bit problematic in this case exactly because of this discontinuity? Armed with this knowledge, however, a backwards walk could start checking for timedout marking right at the first SACKED_RETRANS skb, and continue later from that point forward if the skb at the edge was also marked due to timeout. ...Actually, I think another discontinuity can exist at the EVER_RETRANS edge, but that suffers from the same problem as the non-RETRANS skbs before the SACKED_RETRANS edge when first encountered, and is therefore probably pretty useless. > Basically these two algorithms are saying: > > 1) Mark up to smallest of 'lost' or tp->high_seq. 
> 2) Mark packets after those processed in #1 which have >timed out. > > Right? So I would take another angle to this problem (basically combine the lost calculation from tcp_update_scoreboard and mark_head lost stuff and ignore this lost altogether). Basically what I propose is something like this (hopefully I don't stomp your feet by coding it this much as pseudish approach seemed to get almost as complex :-)): void tcp_update_scoreboard_fack(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); int reord_count = 0; int had_retrans = 0; struct sk_buff *timedout_reentry_skb = NULL; /* How to get this highest_sack? */ skb = get_min_skb_wrapsafe(tp->high_seq, highest_sack); walk_backwards_from(sk, skb) { if (TCP_CB_SKB(skb)->sacked & TCPCB_LOST) break; if (TCP_CB_SKB(skb)->sacked & TCPCB_SACKED_RETRANS) { had_retrans = 1;
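The wrap-safe minimum that the pseudocode's hypothetical get_min_skb_wrapsafe() assumes can be built from the standard TCP sequence-space comparison; a minimal user-space sketch:

```c
typedef unsigned int u32;

/* Standard TCP sequence-space comparison: correct across the
 * 32-bit wraparound as long as the two values are within 2^31
 * of each other. */
static int before(u32 seq1, u32 seq2)
{
	return (int)(seq1 - seq2) < 0;
}

/* Wrap-safe minimum of two sequence numbers, as assumed by the
 * hypothetical get_min_skb_wrapsafe() in the sketch above. */
static u32 seq_min(u32 a, u32 b)
{
	return before(a, b) ? a : b;
}
```

A plain `a < b` would pick the wrong starting point whenever high_seq or the highest SACKed seq has wrapped past zero; the signed-difference trick keeps the comparison meaningful.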
Re: [PATCH 4/4]: Kill fastpath_{skb,cnt}_hint.
On Sat, 3 Mar 2007, Ilpo Järvinen wrote: > On Fri, 2 Mar 2007, David Miller wrote: > > From: Baruch Even <[EMAIL PROTECTED]> > > Date: Thu, 1 Mar 2007 20:13:40 +0200 > > > > One drawback for this approach is that you now walk the entire sack > > > block when you advance one packet. If you consider a 10,000 packet queue > > > which had several losses at the beginning and a large sack block that > > > advances from the middle to the end you'll walk a lot of packets for > > > that one last stretch of a sack block. > > Maybe this information in the last ACK could be even used as advantage to > speed up the startpoint lookup, since the last ACK very likely contains > the highest SACK block (could not hold under attack though), no matter how > big it is, and it's not necessary to process it ever if we know for sure > that it was the global highest rather than a local one. Using that, it's > trivial to find the starting point below the SACK block and that could be > passed to the head_lost marking on-the-fly using stack rather than in the > structs. Sort of fastpath too :-)... Maybe even some auto-tuning thingie > which enables conditionally skipping of large SACK-blocks to reuse all of > this last ACK information in mark_head but that would get rather > complicated I think. Even better, adding more to the globally highest SACKed block range that is already larger than tp->reordering does nothing in mark_head_lost unless ca_state or high_seq also change (just a quick thought; it might be possible to exclude them as well with careful analysis of the state), isn't that so? But taking advantage of this might require inter-ACK state and would then be a less useful optimization. So the only thing that would remain is the check for timedout skbs above the highest SACKed block... -- i.
SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one)
On Tue, 11 Dec 2007, David Miller wrote: > Interesting approach, but I think there is limited value to this > (arguably) complex form. > > The core issue is that the data and the SACK state are maintained in > the same datastructure. The complexity in all the state management > and fixups in your patch is purely because of this. Yeah, I just had too much time while waiting for a person who never arrived... :-) It would have covered the typical case quite well, though; for sure it was very intrusive. > If we maintain SACK scoreboard information seperately, outside of > the SKB, then there are only two changes to make: > > 1) Every access to TCP_SKB_CB() SACK scoreboard is adjusted to >new data structure. Not sure if it is going to be all that straightforward, because you seem to ignore in this simple description all the details of performing the linking between the structures, which becomes tricky when we add those retransmission adjustments. > 2) Retransmit is adjusted so that it can retransmit an SKB >constructed as a portion of an existing SKB. Since TSO >implies SG, this can be handled with simple offset and >length arguments and suitable creation of a clone referencing >the pages in the SG vector that contain the desired data. ...I.e., how to count retrans_out origins in such a case, because the presented sack_ref knows nothing about them, nor do we have anything but one bit in ->sacked per skb. We need the origins when a retransmission gets sacked/acked to reduce retrans_out correctly. To solve this, I think the sack_ref must have ->sacked as well, and the structure should store the R-bits there too, and maybe the L-bits as well, though their definition is more obvious because it's mostly a complement of sacked (except for a small part near fackets_out and timedout marking that probably still makes bookkeeping of L necessary). 
The pcount boundaries are no longer that well defined after we break skb boundaries during retransmitting, so doing this makes the fackets_out calculation even trickier to track, unless pcount as a whole is replaced by a byte-based fackets_out :-/, which is trivial to determine from seqnos alone (TCP_NODELAYers might get unhappy, though, if we do that in a straightforward way...). This would clearly be a good step from one perspective. Nowadays GSO'ed skbs may get fragmented when mss_now changes; for two or more consecutive non-SACKed skbs that means one or more extra (small) packets when retransmitting. > I would envision this SACK state thing to reference into the > retransmit queue SKB's somehow. Each SACK block could perhaps > look something like: > > struct sack_ref { > struct sk_buff *start_skb; > struct sk_buff *end_skb; > unsigned int start_off; > unsigned int len; > }; ...I assume that these are quickly searchable? And the skbs as well (because the linking of start/end_skb at minimum is a costly operation otherwise)? -- i.
Re: net-2.6.25 rebased...
On Sun, 16 Dec 2007, David Miller wrote: > I needed to rebase for two reasons: > > 1) Ilpo asked me to revert a lot of TCP stuff and the >easiest way to do that was during a rebase. > > 2) Patrick McHardy needs some of the pending net-2.6 >bug fixes in there in order to send me patches >for some netfilter compat stuff. > > It's all there in the usual spot: > > kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.25.git > > I'm still doing build tests with various configurations on my > Niagara-2 box. > > Let me know if I screwed up anything, thanks! Thanks, at least the end result for TCP was exactly what I expected to find there... -- i.
Re: [PATCH 2/2] [TCP]: Include __tcp_reset_fack_counts to non-__ version
On Tue, 11 Dec 2007, Christoph Hellwig wrote: > On Tue, Dec 11, 2007 at 01:50:39PM +0200, Ilpo Järvinen wrote: > > + BUG_ON((prev != NULL) && !tcp_skb_adjacent(sk, prev, > > skb[queue])); > > + > > + tcp_for_write_queue_from(skb[queue], sk, queue) { > > + if ((prev != NULL) && !tcp_skb_adjacent(sk, prev, > > skb[queue])) > > + break; > > + > > + if (!before(TCP_SKB_CB(skb[queue])->seq, > > tcp_sk(sk)->snd_nxt) || > > + TCP_SKB_CB(skb[queue])->fack_count == fc) > > + return; > There's quite a few overflows of the normal 80 char limit here. Because > you're current style is a little on the verbose side that's trivially > fixable, though: This part got removed along with the rest of that TCP code during the net-2.6.25 rebase... Thanks anyway for the reminder. I'll try to be more careful during code moves in the future, but I'll probably continue to allow exceptions in cases where the offenders consist only of a closing parenthesis, block-opening brace and terminating semicolon (as was the case in one of these lines as well). -- i.
Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
On Tue, 18 Dec 2007, Gavin McCullagh wrote: > The last attempt didn't take account of the situation where a timestamp > wasn't available and tcp_clean_rtx_queue() has to feed both the RTO and the > congestion avoidance. This updated patch stores both RTTs, making the > delayed one available for the RTO and the other (ca_seq_rtt) available for > congestion control. > > > Signed-off-by: Gavin McCullagh <[EMAIL PROTECTED]> > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 889c893..6fb7989 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -2686,13 +2687,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, > if (sacked & TCPCB_SACKED_RETRANS) > tp->retrans_out -= packets_acked; > flag |= FLAG_RETRANS_DATA_ACKED; > + ca_seq_rtt = -1; > seq_rtt = -1; > if ((flag & FLAG_DATA_ACKED) || > (packets_acked > 1)) > flag |= FLAG_NONHEAD_RETRANS_ACKED; > } else { > + ca_seq_rtt = now - scb->when; Isn't it also much better this way in a case where ACK losses happened, taking the longest RTT in that case is clearly questionable as it may over-estimate considerably. However, another thing to consider is the possibility of this value being used in "timeout-like" fashion in ca modules (I haven't read enough ca modules code to know if any of them does that), on contrary to determinating just rtt or packet's delay in which case this change seems appropriate (most modules do the latter). Therefore, if timeout-like module exists one should also add TCP_CONG_RTT_STAMP_LONGEST for that particular module and keep using seq_rtt for it like previously and use ca_seq_rtt only for others. 
> if (seq_rtt < 0) { > - seq_rtt = now - scb->when; > + seq_rtt = ca_seq_rtt; > if (fully_acked) > last_ackt = skb->tstamp; > } > @@ -2709,8 +2712,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, > !before(end_seq, tp->snd_up)) > tp->urg_mode = 0; > } else { > + ca_seq_rtt = now - scb->when; > if (seq_rtt < 0) { > - seq_rtt = now - scb->when; > + seq_rtt = ca_seq_rtt; > if (fully_acked) > last_ackt = skb->tstamp; > } This part doesn't exists anymore in development tree. Please base this patch (and anything in future) you intend to get included to mainline onto net-2.6.25 unless there's a very good reason to not do so or whatever 2.6.xx is the correct net development tree at that time (if one exists). Thanks. -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: After many hours all outbound connections get stuck in SYN_SENT
On Sun, 16 Dec 2007, James Nichols wrote: > I have a Java application that makes a large number of outbound > webservice calls over HTTP/TCP. The hosts contacted are a fixed set > of about 2000 hosts and a web service call is made to each of them > approximately every 5 mintues by a pool of 200 Java threads. Over > time, on average a percentage of these hosts are unreachable for one > reason or another, usually because they are on wireless cell phone > NICs, so there is a persistent count of sockets in the SYN_SENT state > in the range of about 60-80. This is fine, as these failed connection > attempts eventually time out. > > However, after approximately 38 hours of operation, all outbound > connection attempts get stuck in the SYN_SENT state. It happens > instantaneously, where I go from the baseline of about 60-80 sockets > in SYN_SENT to a count of 200 (corresponding to the # of java threads > that make these calls). > > When I stop and start the Java application, all the new outbound > connections still get stuck in SYN_SENT state. Is it so that they don't timeout at all? You can collect some of their state from /proc/net/tcp (shows at least timers and attempt counters) > During this time, I am > still able to SSH to the box and run wget to Google, cnn, etc, so the > problem appears to be specific to the hosts that I'm accessing via the > webservices. Are you sure that you just don't get unlucky at some point of time and all 200 available threads are just temporarily stuck and your application is just very slowly progressing then? > For a long time, the only thing that would resolve this was rebooting > the entire machine. Once I did this, the outbound connections could > be made succesfully. To the very same hosts? Or to another set of hosts? 
> However, very recently when I had once of these incidents I disabled > tcp_sack via: > > echo "0" > /proc/sys/net/ipv4/tcp_sack > > And the problem almost instanteaously resolved itself and outbound > connection attempts were succesful. New or the pending ones? > I hadn't attempted this before because I assumed that if any of my > network > equipment or remote hosts had a problem with SACK, that it would never > work. Many bugs just are not like that at all... Usually people who coded things had at least some clue :-), so things work "almost correctly"... > In my case, it worked fine for about 38 hours before hitting a > wall where no outbound connections could be made. How accurate number? Is the lockup somehow related to daytime cycle? > Is there a kernel buffer or some data structure that tcp_sack uses > that gets filled up after an extended period of operation? SACK has pretty little meaning in context of SYNs, there's only the sackperm(itted) TCP option which is sent along with the SYN/SYN-ACK. The SACK scoreboard is currently included to the skbs (has been like this for very long time), so no additional data structures should be there because of SACK... > How can I debug this problem in the kernel to find out what the root cause is? > > I emailed linux-kernel and they asked for output of netstat -s, I can > get this the next > time it occurs- any other usefull data to collect? /proc/net/tcp couple of times in a row, try something something like this: for i in (seq 1 40); do cat /proc/net/tcp; echo "-"; sleep 10; done > I'm running kernel 2.6.18 on RedHat, but have had this problem occur > on earlier kernel versions (all 2.4 and 2.6). I've done some fixes to SACK processing since 2.6.18 (not sure if RedHat has backported them). Though they're not that critical nor anything in them should affect in SYN_SENT state. -- i. 
Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
On Wed, 19 Dec 2007, Gavin McCullagh wrote: > On Wed, 19 Dec 2007, Ilpo Järvinen wrote: > > > Isn't it also much better this way in a case where ACK losses happened, > > taking the longest RTT in that case is clearly questionable as it > > may over-estimate considerably. > > Quite so. > > > However, another thing to consider is the possibility of this value being > > used in "timeout-like" fashion in ca modules (I haven't read enough ca > > modules code to know if any of them does that), on contrary to > > determinating just rtt or packet's delay in which case this change seems > > appropriate (most modules do the latter). > > I'm not aware of any, but I haven't read them all either. I would have > thought tp->srtt was the value to use in this instance, Very likely so... > > > Therefore, if timeout-like module exists one should also add > > TCP_CONG_RTT_STAMP_LONGEST for that particular module and keep using > > seq_rtt for it like previously and use ca_seq_rtt only for others. > > Seems reasonable. I'll add this. ...therefore I said "if". I'm not sure what they all do, haven't read them all that carefully... :-) Please check first if ..._LONGEST is necessary at all by quickly going through how the ca modules use it; I guess most of them won't be that complicated, and one can probably figure out the intended usage with a couple of minutes' review. If there aren't any modules that need delayed-ACK and other path-caused delays included, ..._LONGEST would just end up being unnecessary cruft :-). > > This part doesn't exists anymore in development tree. Please base this > > patch (and anything in future) you intend to get included to mainline > > onto net-2.6.25 unless there's a very good reason to not do so or > > whatever 2.6.xx is the correct net development tree at that time (if > > one exists). Thanks. > > Will do. I gather I should use the latest net- tree in future when > submitting patches. Doh, I owe you an apology as I was probably too hasty to point you towards net-2.6.25.
I suppose this could be considered a fix as well and could therefore probably be accepted into net-2.6, which is for bug fixes only after the merge window is closed. But it's Dave who will make such decisions, not me :-), and it's he who gets to deal with all the resulting conflicts ;-) (I added a Cc to him). -- i.
Re: After many hours all outbound connections get stuck in SYN_SENT
On Wed, 19 Dec 2007, James Nichols wrote: > > > And the problem almost instanteaously resolved itself and outbound > > > connection attempts were succesful. > > > > New or the pending ones? > > I'm fairly sure that sockets that were already open in SYN_SENT state > when I turned tcp_sack off started to work as the count of sockets in > SYN_SENT state drops very rapidly. Heh, "very rapidly" :-/, considering that you have 200 x SYN_SENT flows and if tcp_syn_retries is set to default, you will see something happening quite quickly for sure and the whole situation is over in ~3 mins if I counted correctly. > > > Is there a kernel buffer or some data structure that tcp_sack uses > > > that gets filled up after an extended period of operation? > > > > SACK has pretty little meaning in context of SYNs, there's only the > > sackperm(itted) TCP option which is sent along with the SYN/SYN-ACK. > > > > The SACK scoreboard is currently included to the skbs (has been like > > this for very long time), so no additional data structures should be > > there because of SACK... > > I've been seeing this problem for about 4 years, so could it be > related to the scoreboard implementation somehow? Scoreboard has no meaning in this context, it is used while _input_ packets are processed. If you don't get SYN-ACKs at all, it doesn't have any meaning. > > /proc/net/tcp couple of times in a row, try something something like > > this: > > > > for i in (seq 1 40); do cat /proc/net/tcp; echo "-"; sleep 10; done > > I can set up to do this the next time the problem occurs. for i in $(seq 1 40); ... is the right way to do the loop. :-) > > > I'm running kernel 2.6.18 on RedHat, but have had this problem occur > > > on earlier kernel versions (all 2.4 and 2.6). > > > > I've done some fixes to SACK processing since 2.6.18 (not sure if RedHat > > has backported them). Though they're not that critical nor anything in > > them should affect in SYN_SENT state. 
> > Ok, unless there is direct evidence that there is a fix to this > problem in a later kernel I won't be able to upgrade. If there is a > redhat provided patch I can probably do that. ...They won't affect the SYN_SENT state. -- i.
Re: After many hours all outbound connections get stuck in SYN_SENT
On Wed, 19 Dec 2007, Eric Dumazet wrote: > James Nichols wrote: > > On 12/19/07, Eric Dumazet <[EMAIL PROTECTED]> wrote: > > > James Nichols wrote: > > > > > So you see outgoing SYN packets, but no SYN replies coming from the > > > > > remote > > > > > peer ? (you mention ACKS, but the first packet received from the > > > > > remote > > > > > peer should be a SYN+ACK), > > > > Right, I meant to say SYN+ACK. I don't see them coming back. > > > So... Really unlikely a linux problem, but ... > > > > > > > > > I don't know how you can be so sure. Turning tcp_sack off instantly > > resovles the problem and all connections are succesful. I can't > > imagine even the most far-fetched scenario where a router or every > > single remote endpoints would suddenly stop causing the problem just > > by removing a single TCP option. You could also check if you can pull the same trick off by touching tcp_timestamps. It affects the TCP header as well. -- i.
TSO trimming question
Hi all, I'm not fully sure what's purpose of this code in tcp_write_xmit: if (skb->len < limit) { unsigned int trim = skb->len % mss_now; if (trim) limit = skb->len - trim; } Is it used to make sure we send only multiples of mss_now here and leave the left-over into another skb? Or does it try to make sure that tso_fragment result honors multiple of mss_now boundaries when snd_wnd is the limitting factor? For latter IMHO this would be necessary: if (skb->len > limit) limit -= limit % mss_now; -- i.
Re: TSO trimming question
On Wed, 19 Dec 2007, Ilpo Järvinen wrote: > I'm not fully sure what's purpose of this code in tcp_write_xmit: > >if (skb->len < limit) { >unsigned int trim = skb->len % mss_now; > >if (trim) >limit = skb->len - trim; > } > > Is it used to make sure we send only multiples of mss_now here and leave > the left-over into another skb? Or does it try to make sure that > tso_fragment result honors multiple of mss_now boundaries when snd_wnd > is the limitting factor? For latter IMHO this would be necessary: > > if (skb->len > limit) > limit -= limit % mss_now; ...Anyway, here's an untested patch to just do it :-): -- i. [PATCH] [TCP]: Force tso skb split to mss_now boundary (if snd_wnd limits) If snd_wnd was limitting factor, the tso_fragment might not create full-sized skbs but would include the window left-overs as well. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1c7ef17..8dafda9 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1453,6 +1453,8 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle) if (trim) limit = skb->len - trim; + } else if (skb->len > limit) { + limit -= limit % mss_now; } } @@ -1525,6 +1527,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now) if (trim) limit = skb->len - trim; + } else if (skb->len > limit) { + limit -= limit % mss_now; } } -- 1.5.0.6
Re: A short question about net git tree and patches
On Thu, 20 Dec 2007, Sam Ravnborg wrote: > On Thu, Dec 20, 2007 at 11:20:26AM +0200, David Shwatrz wrote: > > Hello, > > I have a short question regarding the net git tree and patches: > > I want to write and send patches against the most recent and > > bleeding edge kernel networking code. > > I see in: > > http://kernel.org/pub/scm/linux/kernel/git/davem/?C=M;O=A > > that there are 3 git trees which can be candidates for git-clone and > > making patches against; > > these are: > > netdev-2.6.git, net-2.6.25.git and net-2.6.git. > > > > It seems to me that net-2.6.git is the most suitable one to work against; > > am I right ? > > what is the difference, in short, between the three repositories? > > IIRC the usage is: > netdev-2.6.git <= old stuff, 4 weeks since last update. Not in use > net-2.6.25.git <= patches for current kernel release (only fixes) Nope, we don't even have 2.6.24 yet. :-) > net-2.6.git <= patches for next kernel relase and planned to be > applied in next merge window > > So net-2.6.git is the correct choice for bleeding edge. net-2.6 is for fixes only; net-2.6.25 will become net-2.6 once 2.6.24 gets released, and eventually net-2.6.26 gets opened (not necessarily at the same time as the merge window closes, but a bit later). The cycle then repeats with similar transitions when 2.6.25 gets released. The netdev trees are for network drivers and are usually managed by Jeff Garzik, but there was recently an arrangement between Dave and Jeff due to vacations, so that netdev was also managed by Dave. -- i.
Re: TSO trimming question
On Wed, 19 Dec 2007, David Miller wrote: > From: "Ilpo_Järvinen" <[EMAIL PROTECTED]> > Date: Wed, 19 Dec 2007 23:46:33 +0200 (EET) > > > I'm not fully sure what's purpose of this code in tcp_write_xmit: > > > >if (skb->len < limit) { > >unsigned int trim = skb->len % mss_now; > > > >if (trim) > >limit = skb->len - trim; > > } > > > > Is it used to make sure we send only multiples of mss_now here and leave > > the left-over into another skb? Yeah, I now understand that this part is correct. I somehow got such impression while trying to figure this out that it ends up being dead code but that wasn't correct thought from my side. However, it caught my attention and after some thinking I'd say there's more to handle here (covered by the second question). Also note that patch I sent earlier is not right either but needs some refining to do the right thing. > > Or does it try to make sure that > > tso_fragment result honors multiple of mss_now boundaries when snd_wnd > > is the limitting factor? For latter IMHO this would be necessary: > > > > if (skb->len > limit) > > limit -= limit % mss_now; > > The purpose of the test is to make sure we process tail sub-mss chunks > correctly wrt. Nagle, which most closely matches the first purpose > you've listed. > > So I think the calculation really does belong where it is. > > Because of the way that the sendmsg() super-skb formation logic > works, we always will tack on more data and grow the tail > SKB before creating a new one. So any sub-mss chunk at the > end of a TSO frame really is at the end of the write queue > and really should get nagle processing. Yes, I now agree this is fully correct for this task. > Actually, there is an exception, which is when we run out of > skb_frag_list slots. In that case we'll potentially have breaks at > odd boundaries in the middle of the queue. 
But this can only happen > in exceptional cases (user does tons of 1-byte sendfile()'s over > random non-consequetive locations of a file) or outright bugs > (MAX_SKB_FRAGS is defined incorrectly, for example) and thus this > situation is not worth coding for. That's not the only case, IMHO if there's odd boundary due to snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't consider it as odd but break the skb at arbitary point resulting two small segments to the network, and what's worse, when the later skb resulting from the first split is matching skb->len < limit check as well causing an unnecessary small skb to be created for nagle purpose too, solving it fully requires some thought in case the mss_now != mss_cache even if non-odd boundaries are honored in the middle of skb. Though whether we get there is depending on what tcp_tso_should_defer() decided. Hmm, there seems to be an unrelated bug in it as well :-/. A patch below. Please consider the fact that enabling TSO deferring may have some unpleasant effect to TCP dynamics, considering that I don't find stable mandatory for this to avoid breaking, besides things have been working quite well without it too... Only compile tested. -- i. -- [PATCH] [TCP]: Fix TSO deferring I'd say that most of what tcp_tso_should_defer had in between there was dead code because of this. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 8dafda9..693b9f6 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1217,7 +1217,8 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb) goto send_now; /* Defer for less than two clock ticks. 
*/ - if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1) + if (tp->tso_deferred && + ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1) goto send_now; in_flight = tcp_packets_in_flight(tp); -- 1.5.0.6
Re: TSO trimming question
On Thu, 20 Dec 2007, David Miller wrote: > From: "Ilpo_Järvinen" <[EMAIL PROTECTED]> > Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET) > > > That's not the only case, IMHO if there's odd boundary due to > > snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't > > consider it as odd but break the skb at arbitary point resulting > > two small segments to the network, and what's worse, when the later skb > > resulting from the first split is matching skb->len < limit check as well > > causing an unnecessary small skb to be created for nagle purpose too, > > solving it fully requires some thought in case the mss_now != mss_cache > > even if non-odd boundaries are honored in the middle of skb. > > In the most ideal sense, tcp_window_allows() should probably > be changed to only return MSS multiples. That's what Herbert suggested already, I'll send a patch later on... :-) > Unfortunately this would add an expensive modulo operation, > however I think it would elimiate this problem case. Yes. Should we still call tcp_minshall_update() if split in the middle of wq results in smaller than MSS tail (occurs only if mss_now != mss_cache)? -- i.
Re: After many hours all outbound connections get stuck in SYN_SENT
On Thu, 20 Dec 2007, James Nichols wrote: > > I still dont understand. > > > > "tcpdump -p -n -s 1600 -c 1" doesnt reveal User data at all. > > > > Without any exact data from you, I am afraid nobody can help. > > Oh, I didn't see that you specified specific options. I'll still have > to anonymize 2000+ IP addresses, but I think there is an open source > tool that will do this for you. Even a simple for loop in the shell can do that. It's not that hard and there's very little need for manual work! Ingredients: for, cut, grep and sed. -- i.
Re: After many hours all outbound connections get stuck in SYN_SENT
On Thu, 20 Dec 2007, James Nichols wrote: > > You'd probably should also investigate the Linux kernel, > > especially the size and locks of the components of the Sack data > > structures and what happens to those data structures after Sack is > > disabled (presumably the Sack data structure is in some unhappy > > circumstance, and disabling Sack allows the data to be discarded, > > magically unclaging the box). ...Not sure you'd want to invent such a structure now. Yes, we have the per-skb ->sacked, but again, in SYN_SENT there are very few things that touch it at all, and they just set it to zero (though that would not even be mandatory for tcp_transmit_skb, IIRC; I checked that just a couple of days ago due to other things). Another thing is rx_opt.sack_ok, which is just a couple of flag bits that tell the TCP variant in use (and it's mostly used only after the SYN handshake completes). The rest (the actual SACK blocks) is in the ack_skb, but again it has very little meaning in the SYN_SENT state unless somebody is crazy enough to add SACK blocks to SYN-ACKs :-). > > In the absence of the reporter wanting to dump the kernel's > > core, how about a patch to print the Sack datastructure when > > the command to disable Sack is received by the kernel? > > Maybe just print the last 16b of the IP address? > > Given the fact that I've had this problem for so long, over a variety > of networking hardware vendors and colo-facilities, this really sounds > good to me. It will be challenging for me to justify a kernel core > dump, but a simple patch to dump the Sack data would be do-able.
If your symptoms really are SYNs leaving (if they show up in tcpdump, they have for sure already left the TCP code) and SYN-ACKs not showing up even in something as early as tcpdump (so the TCP-side code didn't execute at that point yet), then there's very little chance that Linux' TCP code has a bug in it; the only things that do anything in such a scenario are the SYN generation and the SYN retransmissions (and those are trivially verifiable from tcpdump). -- i.
Re: [TCP] IPV6 : Change a divide into a right shift in tcp_v6_send_ack()
On Fri, 21 Dec 2007, Eric Dumazet wrote: > Because tot_len is signed in tcp_v6_send_ack(), tot_len/4 forces compiler to emit an integer divide, while we can help it to use a right shift, less expensive. Can't you just change tot_len to unsigned here? It's just sizeof and some positive constants added... -- i. diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 0268e11..92f0fda 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1124,7 +1124,7 @@ static void tcp_v6_send_ack(struct tcp_timewait_sock *tw, memset(t1, 0, sizeof(*t1)); t1->dest = th->source; t1->source = th->dest; - t1->doff = tot_len/4; + t1->doff = tot_len >> 2; t1->seq = htonl(seq); t1->ack_seq = htonl(ack); t1->ack = 1;
Re: TSO trimming question
On Fri, 21 Dec 2007, Bill Fink wrote: > I meant to ask about this a while back but then got distracted by > other things. But now since the subject has come up, I had a couple > of more questions about this code. > > What's with all the shifting back and forth? Here with: > > ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > > and later with: > > /* Ok, it looks like it is advisable to defer. */ > tp->tso_deferred = 1 | (jiffies<<1); > > Is this just done to try and avoid the special case of jiffies==0 > when the jiffies wrap? I thought that it must be the reason. I couldn't figure out any other explanation while thinking the same thing but since I saw no problem in that rather weird approach, I didn't want to touch that in a patch which had net-2.6 (or stable) potential. > If so it seems like a lot of unnecessary > work just to avoid a 1 in 4 billion event, since it's my understanding > that the whole tcp_tso_should_defer function is just an optimization > and not a criticality to the proper functioning of TCP, especially > considering it hasn't even been executing at all up to now. It would still be good to not return 1 in that case we didn't flag the deferral, how about adding one additional tick (+comment) in the zero case making tso_deferred == 0 again unambiguously defined, e.g.: tp->tso_deferred = min_t(u32, jiffies, 1); ...I'm relatively sure that nobody would ever notice that tick :-) and we kept return value consistent with tso_deferred state invariant. > My second question is more basic and if I'm not mistaken actually > relates to a remaining bug in the (corrected) test: > > /* Defer for less than two clock ticks. */ > if (tp->tso_deferred && > ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1) > > Since jiffies is an unsigned long, which is 64-bits on a 64-bit system, > whereas tp->tso_deferred is a u32, once jiffies exceeds 31-bits, which > will happen in about 25 days if HZ=1000, won't the second part of the > test always be true after that?
> Or am I missing something obvious? I didn't notice that earlier, but I think you're right, though my knowledge of jiffies and such is quite limited. ...Feel free to submit a patch to correct these. -- i.
Re: TSO trimming question
On Fri, 21 Dec 2007, Ilpo Järvinen wrote: > On Fri, 21 Dec 2007, Bill Fink wrote: > > > If so it seems like a lot of unnecessary > > work just to avoid a 1 in 4 billion event, since it's my understanding > > that the whole tcp_tso_should_defer function is just an optimization > > and not a criticality to the proper functioning of TCP, especially > > considering it hasn't even been executing at all up to now. > > It would still be good to not return 1 in that case we didn't flag the > deferral, how about adding one additional tick (+comment) in the zero > case making tso_deferred == 0 again unambiguously defined, e.g.: > > tp->tso_deferred = min_t(u32, jiffies, 1); Blah, max_t of course :-). -- i.
[PATCH] [TCP]: Remove seq_rtt ptr from clean_rtx_queue args
While checking Gavin's patch I noticed that the returned seq_rtt is not used by the caller. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- I think it shouldn't introduce new conflicts between net-2.6.25 and net-2.6 (tested with git-am -3 to both). net/ipv4/tcp_input.c |7 ++- 1 files changed, 2 insertions(+), 5 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 0f0c1c9..a020e83 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2638,8 +2638,7 @@ static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb) * is before the ack sequence we can discard it as it's confirmed to have * arrived at the other end. */ -static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p, - int prior_fackets) +static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets) { struct tcp_sock *tp = tcp_sk(sk); const struct inet_connection_sock *icsk = inet_csk(sk); @@ -2803,7 +2802,6 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p, } } #endif - *seq_rtt_p = seq_rtt; return flag; } @@ -3044,7 +3042,6 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) u32 ack = TCP_SKB_CB(skb)->ack_seq; u32 prior_in_flight; u32 prior_fackets; - s32 seq_rtt; int prior_packets; int frto_cwnd = 0; @@ -3111,7 +3108,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) prior_in_flight = tcp_packets_in_flight(tp); /* See if we can take anything off of the retransmit queue. */ - flag |= tcp_clean_rtx_queue(sk, &seq_rtt, prior_fackets); + flag |= tcp_clean_rtx_queue(sk, prior_fackets); if (tp->frto_counter) frto_cwnd = tcp_process_frto(sk, flag); -- 1.5.0.6
Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
On Fri, 21 Dec 2007, Gavin McCullagh wrote: > On Fri, 21 Dec 2007, David Miller wrote: > > > When Gavin respins the patch I'll look at in the context of submitting > > it as a bug fix. So Gavin please generate the patch against Linus's > > vanilla GIT tree or net-2.6, your choise. > > The existing patch was against Linus' linux-2.6.git from a few days ago so > I've updated my tree and regenerated the patch (below). Is that the right > one? > > I'm just checking through the existing CA modules. I don't see the rtt > used for RTO anywhere. This is what I gather they're each using rtt for. I meant more timeout like fashion (e.g., to "timeout" some internal phase but I don't find that too likely)... Thanks for checking them. > So as far as I can tell, timeout stuff is not ever altered using > pkts_acked() so I guess this fix only affects westwood, htcp and cubic just > now. > > I need to re-read properly, but I think the same problem affects the > microsecond values where TCP_CONG_RTT_STAMP is set (used by vegas, veno, > yeah, illinois). I might follow up with another patch which changes the > behaviour where TCP_CONG_RTT_STAMP when I'm more sure of that. Please do, you might have to remove fully_acked checks to do that right though so it won't be as straight-forward change as this one and requires some amount of thinking to result in a right thing. 
> Signed-off-by: Gavin McCullagh <[EMAIL PROTECTED]> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 889c893..6fb7989 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -2651,6 +2651,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, > u32 cnt = 0; > u32 reord = tp->packets_out; > s32 seq_rtt = -1; > + s32 ca_seq_rtt = -1; > ktime_t last_ackt = net_invalid_timestamp(); > > while ((skb = tcp_write_queue_head(sk)) && skb != tcp_send_head(sk)) { > @@ -2686,13 +2687,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, > if (sacked & TCPCB_SACKED_RETRANS) > tp->retrans_out -= packets_acked; > flag |= FLAG_RETRANS_DATA_ACKED; > + ca_seq_rtt = -1; > seq_rtt = -1; > if ((flag & FLAG_DATA_ACKED) || > (packets_acked > 1)) > flag |= FLAG_NONHEAD_RETRANS_ACKED; > } else { > + ca_seq_rtt = now - scb->when; > if (seq_rtt < 0) { > - seq_rtt = now - scb->when; > + seq_rtt = ca_seq_rtt; > if (fully_acked) > last_ackt = skb->tstamp; > } > @@ -2709,8 +2712,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, > !before(end_seq, tp->snd_up)) > tp->urg_mode = 0; > } else { > + ca_seq_rtt = now - scb->when; > if (seq_rtt < 0) { > - seq_rtt = now - scb->when; > + seq_rtt = ca_seq_rtt; > if (fully_acked) > last_ackt = skb->tstamp; > } > @@ -2772,8 +2776,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, >net_invalid_timestamp())) > rtt_us = > ktime_us_delta(ktime_get_real(), > last_ackt); > - else if (seq_rtt > 0) > - rtt_us = jiffies_to_usecs(seq_rtt); > + else if (ca_seq_rtt > 0) > + rtt_us = jiffies_to_usecs(ca_seq_rtt); > } > > ca_ops->pkts_acked(sk, pkts_acked, rtt_us); > Acked-by: Ilpo Järvinen <[EMAIL PROTECTED]> / Reviewed-by... whatever, I don't know if they really started to use it or not... :-) -- i.
Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
On Fri, 21 Dec 2007, Ilpo Järvinen wrote:

> On Fri, 21 Dec 2007, Gavin McCullagh wrote:
>
> > I'm just checking through the existing CA modules. I don't see the rtt
> > used for RTO anywhere. This is what I gather they're each using rtt for.
>
> I meant more timeout like fashion (e.g., to "timeout" some internal
> phase but I don't find that too likely)...

I meant "didn't"; no need to recheck them because of that... :-)

-- i.
[PATCH] [TCP]: Force TSO splits to MSS boundaries
On Thu, 20 Dec 2007, David Miller wrote:

> From: Herbert Xu <[EMAIL PROTECTED]>
> Date: Thu, 20 Dec 2007 22:00:12 +0800
>
> > On Thu, Dec 20, 2007 at 04:00:37AM -0800, David Miller wrote:
> > >
> > > In the most ideal sense, tcp_window_allows() should probably
> > > be changed to only return MSS multiples.
> > >
> > > Unfortunately this would add an expensive modulo operation,
> > > however I think it would elimiate this problem case.
> >
> > Well you only have to divide in the unlikely case of us being
> > limited by the receiver window. In that case speed is probably
> > not of the essence anyway.
>
> Agreed, to some extent.
>
> I say "to some extent" because it might be realistic, with
> lots (millions) of sockets to hit this case a lot.
>
> There are so many things that are a "don't care" performance
> wise until you have a lot of stinky connections over crappy
> links.

How about this? I had to use another approach for the reasons outlined in
the commit message:

--
[PATCH] [TCP]: Force TSO splits to MSS boundaries

If snd_wnd - snd_nxt wasn't a multiple of MSS, the skb was split on an
odd boundary by the callers of tcp_window_allows.

We try really hard to avoid unnecessary modulos. Therefore the old
caller-side check "if (skb->len < limit)" was too wide as well, because
limit is not bound in any way to skb->len and could cause spurious
testing for trimming in the middle of the queue while we only wanted
that to happen at the tail of the queue. A simple additional caller-side
check for tcp_write_queue_tail would likely have resulted in two modulo
operations, because the limit would have to be calculated from the
window first; however, doing that unnecessary modulo is not mandatory.
After a minor change to the algorithm, simply determine first whether
the modulo is needed at all, and at that point immediately decide from
which value it should be calculated.

This approach also kills some duplicated code.
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 51 - 1 files changed, 25 insertions(+), 26 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1852698..ea92a1b 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1008,13 +1008,29 @@ static void tcp_cwnd_validate(struct sock *sk) } } -static unsigned int tcp_window_allows(struct tcp_sock *tp, struct sk_buff *skb, unsigned int mss_now, unsigned int cwnd) +/* Returns the portion of skb which can be sent right away without + * introducing MSS oddities to segment boundaries. In rare cases where + * mss_now != mss_cache, we will request caller to create a small skb + * per input skb which could be mostly avoided here (if desired). + */ +static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb, + unsigned int mss_now, + unsigned int cwnd) { - u32 window, cwnd_len; + struct tcp_sock *tp = tcp_sk(sk); + u32 needed, window, cwnd_len; window = (tp->snd_una + tp->snd_wnd - TCP_SKB_CB(skb)->seq); cwnd_len = mss_now * cwnd; - return min(window, cwnd_len); + + if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk))) + return cwnd_len; + + if (skb == tcp_write_queue_tail(sk) && cwnd_len <= skb->len) + return cwnd_len; + + needed = min(skb->len, window); + return needed - needed % mss_now; } /* Can at least one segment of SKB be sent right now, according to the @@ -1445,17 +1461,9 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle) } limit = mss_now; - if (tso_segs > 1) { - limit = tcp_window_allows(tp, skb, - mss_now, cwnd_quota); - - if (skb->len < limit) { - unsigned int trim = skb->len % mss_now; - - if (trim) - limit = skb->len - trim; - } - } + if (tso_segs > 1) + limit = tcp_mss_split_point(sk, skb, mss_now, + cwnd_quota); if (skb->len > limit && unlikely(tso_fragment(sk, skb, limit, mss_now))) @@ -1502,7 +1510,6 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss, */ void 
tcp_push_one(struct sock *sk, unsigned int mss_now) { - struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb = tcp_send_head(sk); unsigned int tso_segs, cwnd_qu
Re: TSO trimming question
On Fri, 21 Dec 2007, Bill Fink wrote:

> On Fri, 21 Dec 2007, Bill Fink wrote:
>
> > Or perhaps even:
> >
> > 	/* Ok, it looks like it is advisable to defer. */
> > 	tp->tso_deferred = jiffies;
> >
> > 	/* need to return a non-zero value to defer, which means won't
> > 	 * defer if jiffies == 0 but it's only a 1 in 4 billion event
> > 	 * (and avoids a compare/branch by not checking jiffies)
> > 	 */
> > 	return jiffies;
>
> Ack. I introduced my own 64-bit to 32-bit issue (too late at night).
> How about:
>
> 	/* Ok, it looks like it is advisable to defer. */
> 	tp->tso_deferred = jiffies;
>
> 	/* this won't defer if jiffies == 0 but it's only a 1 in
> 	 * 4 billion event (and avoids a branch)
> 	 */
> 	return (jiffies != 0);

I'm not sure how jiffies works, but is this racy as well? A simple

	return tp->tso_deferred;

should work, shouldn't it? :-)

-- i.
Re: [PATCH] [TCP]: Force TSO splits to MSS boundaries
On Fri, 21 Dec 2007, Ilpo Järvinen wrote: > How about this, I had to use another approach due to reasons > outlined in the commit message: > > -- > [PATCH] [TCP]: Force TSO splits to MSS boundaries > > If snd_wnd - snd_nxt wasn't multiple of MSS, skb was split on > odd boundary by the callers of tcp_window_allows. > > We try really hard to avoid unnecessary modulos. Therefore the > old caller side check "if (skb->len < limit)" was too wide as > well because limit is not bound in any way to skb->len and can > cause spurious testing for trimming in the middle of the queue > while we only wanted that to happen at the tail of the queue. > A simple additional caller side check for tcp_write_queue_tail > would likely have resulted 2 x modulos because the limit would > have to be first calculated from window, however, doing that > unnecessary modulo is not mandatory. After a minor change to > the algorithm, simply determine first if the modulo is needed > at all and at that point immediately decide also from which > value it should be calculated from. > > This approach also kills some duplicated code. > > Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> > --- > net/ipv4/tcp_output.c | 51 > - > 1 files changed, 25 insertions(+), 26 deletions(-) > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 1852698..ea92a1b 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -1008,13 +1008,29 @@ static void tcp_cwnd_validate(struct sock *sk) > } > } > > -static unsigned int tcp_window_allows(struct tcp_sock *tp, struct sk_buff > *skb, unsigned int mss_now, unsigned int cwnd) > +/* Returns the portion of skb which can be sent right away without > + * introducing MSS oddities to segment boundaries. In rare cases where > + * mss_now != mss_cache, we will request caller to create a small skb > + * per input skb which could be mostly avoided here (if desired). 
> + */
> +static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb,
> +					unsigned int mss_now,
> +					unsigned int cwnd)
>  {
> -	u32 window, cwnd_len;
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	u32 needed, window, cwnd_len;
>
>  	window = (tp->snd_una + tp->snd_wnd - TCP_SKB_CB(skb)->seq);
>  	cwnd_len = mss_now * cwnd;
> -	return min(window, cwnd_len);
> +
> +	if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk)))
> +		return cwnd_len;
> +
> +	if (skb == tcp_write_queue_tail(sk) && cwnd_len <= skb->len)
> +		return cwnd_len;

...If it is desired that this not cause small skbs in the middle of the
queue (except when just the sub-MSS portion is left out above the
window), something like this could be added here (again avoiding more
modulos):

	if (skb != tcp_write_queue_tail(sk) && window > skb->len)
		return skb->len;

> +	needed = min(skb->len, window);
> +	return needed - needed % mss_now;
>  }
>
>  /* Can at least one segment of SKB be sent right now, according to the

-- i.
Re: [PATCH/RFC] [v3] TCP: use non-delayed ACK for congestion control RTT
On Sun, 30 Dec 2007, Gavin McCullagh wrote:

> Hi,
>
> On Fri, 21 Dec 2007, Ilpo Järvinen wrote:
>
> > > I need to re-read properly, but I think the same problem affects the
> > > microsecond values where TCP_CONG_RTT_STAMP is set (used by vegas, veno,
> > > yeah, illinois). I might follow up with another patch which changes the
> > > behaviour where TCP_CONG_RTT_STAMP when I'm more sure of that.
> >
> > Please do, you might have to remove fully_acked checks to do that right
> > though so it won't be as straight-forward change as this one and requires
> > some amount of thinking to result in a right thing.
>
> The TCP_CONG_RTT_STAMP code does need to be fixed similarly. A combined
> patch will follow this mail. As I understand it, the fully_acked checks
> kick in where the ACK is a portion of a TSO chunk and doesn't completely
> ACK that chunk. Leaving the checks in place means you get one rtt for each
> TSO chunk, based on the ACK for the last byte of the chunk. This rtt
> therefore is the maximum available and includes the time-lag between the
> first and last chunk being acked. Removing the tests gives you an RTT
> value for each ACK in a tso chunk, including the minimum and maximum. It
> seems the minimum rtt is the best indicator of congestion. On the other
> hand having all available RTTs gives the congestion avoidance greater
> knowledge of how the RTT is evolving (albeit somewhat coloured by TSO
> delays which don't seem too severe in my tests).

I guess the non-minimum TSO delays are only significant when something
exceptional is happening, and in such a case we definitely want to have
some measurements taken.

-- i.
[PATCH net-2.6.25 0/9]: TCP cleanups & minor changes.
Hi Dave,

The first one is restored after getting removed in the straight-forward
revert we did. Please check that the TCPCB_URG removal is indeed valid; I
couldn't find any use for it, but there might be some non-obviously named
things I've missed.

I did a larger cleanup with indent for tcp_input and tcp_output because I
kept finding unnecessarily split lines & missing spaces here and there.
There's still a lot to do because not every case has a trivial solution,
but at least something got cleaner :-).

These should apply cleanly to the rebased net-2.6.25. I did some trivial
tests with them before the rebase.

-- i.
[PATCH 1/9] [TCP]: Make invariant check complain about invalid sacked_out
Earlier resolution for NewReno's sacked_out should now keep it small
enough for this to become an invariant-like check.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/ipv4/tcp_input.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 722c9cb..41f4b86 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2504,11 +2504,8 @@ tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 		    (tcp_fackets_out(tp) > tp->reordering));
 	int fast_rexmit = 0;
 
-	/* Some technical things:
-	 * 1. Reno does not count dupacks (sacked_out) automatically. */
-	if (!tp->packets_out)
+	if (WARN_ON(!tp->packets_out && tp->sacked_out))
 		tp->sacked_out = 0;
-
 	if (WARN_ON(!tp->sacked_out && tp->fackets_out))
 		tp->fackets_out = 0;
-- 
1.5.0.6
[PATCH 3/9] [TCP]: Remove unnecessary local variables
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 11 +++ 1 files changed, 3 insertions(+), 8 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1ca638b..025dddf 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -995,9 +995,8 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed) static void tcp_cwnd_validate(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); - __u32 packets_out = tp->packets_out; - if (packets_out >= tp->snd_cwnd) { + if (tp->packets_out >= tp->snd_cwnd) { /* Network is feed fully. */ tp->snd_cwnd_used = 0; tp->snd_cwnd_stamp = tcp_time_stamp; @@ -1042,17 +1041,13 @@ static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb, */ static inline unsigned int tcp_cwnd_test(struct tcp_sock *tp, struct sk_buff *skb) { - u32 in_flight, cwnd; - /* Don't be strict about the congestion window for the final FIN. */ if ((TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) && tcp_skb_pcount(skb) == 1) return 1; - in_flight = tcp_packets_in_flight(tp); - cwnd = tp->snd_cwnd; - if (in_flight < cwnd) - return (cwnd - in_flight); + if (tcp_packets_in_flight(tp) < tp->snd_cwnd) + return tp->snd_cwnd - tcp_packets_in_flight(tp); return 0; } -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/9] [TCP]: Rename update_send_head & include related increment to it
There's very little need to have the packets_out incrementing in a separate function. Also name the combined function appropriately. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 32 1 files changed, 12 insertions(+), 20 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 7a4834a..1ca638b 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -61,29 +61,22 @@ int sysctl_tcp_base_mss __read_mostly = 512; /* By default, RFC2861 behavior. */ int sysctl_tcp_slow_start_after_idle __read_mostly = 1; -static inline void tcp_packets_out_inc(struct sock *sk, - const struct sk_buff *skb) -{ - struct tcp_sock *tp = tcp_sk(sk); - int orig = tp->packets_out; - - tp->packets_out += tcp_skb_pcount(skb); - if (!orig) - inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, - inet_csk(sk)->icsk_rto, TCP_RTO_MAX); -} - -static void update_send_head(struct sock *sk, struct sk_buff *skb) +static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); + unsigned int prior_packets = tp->packets_out; tcp_advance_send_head(sk, skb); tp->snd_nxt = TCP_SKB_CB(skb)->end_seq; - tcp_packets_out_inc(sk, skb); /* Don't override Nagle indefinately with F-RTO */ if (tp->frto_counter == 2) tp->frto_counter = 3; + + tp->packets_out += tcp_skb_pcount(skb); + if (!prior_packets) + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, + inet_csk(sk)->icsk_rto, TCP_RTO_MAX); } /* SND.NXT, if window was not shrunk. @@ -1410,7 +1403,7 @@ static int tcp_mtu_probe(struct sock *sk) /* Decrement cwnd here because we are sending * effectively two packets. */ tp->snd_cwnd--; - update_send_head(sk, nskb); + tcp_event_new_data_sent(sk, nskb); icsk->icsk_mtup.probe_size = tcp_mss_to_mtu(sk, nskb->len); tp->mtu_probe.probe_seq_start = TCP_SKB_CB(nskb)->seq; @@ -1494,7 +1487,7 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle) /* Advance the send_head. This one is sent out. 
* This call will increment packets_out. */ - update_send_head(sk, skb); + tcp_event_new_data_sent(sk, skb); tcp_minshall_update(tp, mss_now, skb); sent_pkts++; @@ -1553,7 +1546,7 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now) TCP_SKB_CB(skb)->when = tcp_time_stamp; if (likely(!tcp_transmit_skb(sk, skb, 1, sk->sk_allocation))) { - update_send_head(sk, skb); + tcp_event_new_data_sent(sk, skb); tcp_cwnd_validate(sk); return; } @@ -2528,9 +2521,8 @@ int tcp_write_wakeup(struct sock *sk) TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH; TCP_SKB_CB(skb)->when = tcp_time_stamp; err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC); - if (!err) { - update_send_head(sk, skb); - } + if (!err) + tcp_event_new_data_sent(sk, skb); return err; } else { if (tp->urg_mode && -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/9] [TCP]: Introduce tcp_wnd_end() to reduce line lengths
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- include/net/tcp.h |6 ++ net/ipv4/tcp_input.c |3 +-- net/ipv4/tcp_output.c | 20 ++-- 3 files changed, 17 insertions(+), 12 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 76286e8..6a732d4 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -785,6 +785,12 @@ static __inline__ __u32 tcp_max_burst(const struct tcp_sock *tp) return 3; } +/* Returns end sequence number of the receiver's advertised window */ +static inline u32 tcp_wnd_end(const struct tcp_sock *tp) +{ + return tp->snd_una + tp->snd_wnd; +} + /* RFC2861 Check whether we are limited by application or congestion window * This is the inverse of cwnd check in tcp_tso_should_defer */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 41f4b86..366f63a 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2921,8 +2921,7 @@ static void tcp_ack_probe(struct sock *sk) /* Was it a usable window open? */ - if (!after(TCP_SKB_CB(tcp_send_head(sk))->end_seq, - tp->snd_una + tp->snd_wnd)) { + if (!after(TCP_SKB_CB(tcp_send_head(sk))->end_seq, tcp_wnd_end(tp))) { icsk->icsk_backoff = 0; inet_csk_clear_xmit_timer(sk, ICSK_TIME_PROBE0); /* Socket must be waked up by subsequent tcp_data_snd_check(). diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 025dddf..202d60c 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -89,10 +89,10 @@ static inline __u32 tcp_acceptable_seq(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); - if (!before(tp->snd_una+tp->snd_wnd, tp->snd_nxt)) + if (!before(tcp_wnd_end(tp), tp->snd_nxt)) return tp->snd_nxt; else - return tp->snd_una+tp->snd_wnd; + return tcp_wnd_end(tp); } /* Calculate mss to advertise in SYN segment. 
@@ -1023,7 +1023,7 @@ static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb, struct tcp_sock *tp = tcp_sk(sk); u32 needed, window, cwnd_len; - window = (tp->snd_una + tp->snd_wnd - TCP_SKB_CB(skb)->seq); + window = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq; cwnd_len = mss_now * cwnd; if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk))) @@ -1129,7 +1129,7 @@ static inline int tcp_snd_wnd_test(struct tcp_sock *tp, struct sk_buff *skb, uns if (skb->len > cur_mss) end_seq = TCP_SKB_CB(skb)->seq + cur_mss; - return !after(end_seq, tp->snd_una + tp->snd_wnd); + return !after(end_seq, tcp_wnd_end(tp)); } /* This checks if the data bearing packet SKB (usually tcp_send_head(sk)) @@ -1246,7 +1246,7 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb) BUG_ON(tcp_skb_pcount(skb) <= 1 || (tp->snd_cwnd <= in_flight)); - send_win = (tp->snd_una + tp->snd_wnd) - TCP_SKB_CB(skb)->seq; + send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq; /* From in_flight test above, we know that cwnd > in_flight. */ cong_win = (tp->snd_cwnd - in_flight) * tp->mss_cache; @@ -1327,7 +1327,7 @@ static int tcp_mtu_probe(struct sock *sk) if (tp->snd_wnd < size_needed) return -1; - if (after(tp->snd_nxt + size_needed, tp->snd_una + tp->snd_wnd)) + if (after(tp->snd_nxt + size_needed, tcp_wnd_end(tp))) return 0; /* Do we need to wait to drain cwnd? With none in flight, don't stall */ @@ -1682,7 +1682,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m return; /* Next skb is out of window. */ - if (after(TCP_SKB_CB(next_skb)->end_seq, tp->snd_una+tp->snd_wnd)) + if (after(TCP_SKB_CB(next_skb)->end_seq, tcp_wnd_end(tp))) return; /* Punt if not enough space exists in the first SKB for @@ -1826,7 +1826,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) * case, when window is shrunk to zero. In this case * our retransmit serves as a zero window probe. 
*/ - if (!before(TCP_SKB_CB(skb)->seq, tp->snd_una+tp->snd_wnd) + if (!before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(tp)) && TCP_SKB_CB(skb)->seq != tp->snd_una) return -EAGAIN; @@ -2492,10 +2492,10 @@ int tcp_write_wakeup(struct sock *sk) struct sk_buff *skb; if ((skb = tcp_send_head(sk)) != NULL && - before(TCP_SKB_CB(skb)->seq, tp->snd_una+tp->snd_wnd)) { + before(TCP_SKB_CB(skb)->seq, tcp_wnd_end
[PATCH 5/9] [TCP]: Dropped unnecessary skb/sacked accessing in reneging
SACK reneging can be precalculated to a FLAG in clean_rtx_queue which has the right skb looked up. This will help a bit in future because skb->sacked access will be changed eventually, changing it already won't hurt any. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 25 + 1 files changed, 13 insertions(+), 12 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 366f63a..7bac1fa 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -105,6 +105,7 @@ int sysctl_tcp_abc __read_mostly; #define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */ #define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */ #define FLAG_NONHEAD_RETRANS_ACKED 0x1000 /* Non-head rexmitted data was ACKed */ +#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */ #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED) #define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED) @@ -1918,18 +1919,15 @@ void tcp_enter_loss(struct sock *sk, int how) tp->frto_counter = 0; } -static int tcp_check_sack_reneging(struct sock *sk) +/* If ACK arrived pointing to a remembered SACK, it means that our + * remembered SACKs do not reflect real state of receiver i.e. + * receiver _host_ is heavily congested (or buggy). + * + * Do processing similar to RTO timeout. + */ +static int tcp_check_sack_reneging(struct sock *sk, int flag) { - struct sk_buff *skb; - - /* If ACK arrived pointing to a remembered SACK, -* it means that our remembered SACKs do not reflect -* real state of receiver i.e. -* receiver _host_ is heavily congested (or buggy). -* Do processing similar to RTO timeout. 
-*/ - if ((skb = tcp_write_queue_head(sk)) != NULL && - (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) { + if (flag & FLAG_SACK_RENEGING) { struct inet_connection_sock *icsk = inet_csk(sk); NET_INC_STATS_BH(LINUX_MIB_TCPSACKRENEGING); @@ -2515,7 +2513,7 @@ tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag) tp->prior_ssthresh = 0; /* B. In all the states check for reneging SACKs. */ - if (tp->sacked_out && tcp_check_sack_reneging(sk)) + if (tcp_check_sack_reneging(sk, flag)) return; /* C. Process data loss notification, provided it is valid. */ @@ -2852,6 +2850,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets) tcp_clear_all_retrans_hints(tp); } + if (skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) + flag |= FLAG_SACK_RENEGING; + if (flag & FLAG_ACKED) { const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops; -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/9] [TCP]: reduce tcp_output's indentation levels a bit
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 239 + 1 files changed, 121 insertions(+), 118 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 182ae21..199253c 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1668,75 +1668,77 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *next_skb = tcp_write_queue_next(sk, skb); + int skb_size, next_skb_size; + u16 flags; /* The first test we must make is that neither of these two * SKB's are still referenced by someone else. */ - if (!skb_cloned(skb) && !skb_cloned(next_skb)) { - int skb_size = skb->len, next_skb_size = next_skb->len; - u16 flags = TCP_SKB_CB(skb)->flags; + if (skb_cloned(skb) || skb_cloned(next_skb)) + return; - /* Also punt if next skb has been SACK'd. */ - if (TCP_SKB_CB(next_skb)->sacked & TCPCB_SACKED_ACKED) - return; + skb_size = skb->len; + next_skb_size = next_skb->len; + flags = TCP_SKB_CB(skb)->flags; - /* Next skb is out of window. */ - if (after(TCP_SKB_CB(next_skb)->end_seq, tcp_wnd_end(tp))) - return; + /* Also punt if next skb has been SACK'd. */ + if (TCP_SKB_CB(next_skb)->sacked & TCPCB_SACKED_ACKED) + return; - /* Punt if not enough space exists in the first SKB for -* the data in the second, or the total combined payload -* would exceed the MSS. -*/ - if ((next_skb_size > skb_tailroom(skb)) || - ((skb_size + next_skb_size) > mss_now)) - return; + /* Next skb is out of window. */ + if (after(TCP_SKB_CB(next_skb)->end_seq, tcp_wnd_end(tp))) + return; - BUG_ON(tcp_skb_pcount(skb) != 1 || - tcp_skb_pcount(next_skb) != 1); + /* Punt if not enough space exists in the first SKB for +* the data in the second, or the total combined payload +* would exceed the MSS. 
+*/ + if ((next_skb_size > skb_tailroom(skb)) || + ((skb_size + next_skb_size) > mss_now)) + return; - tcp_highest_sack_combine(sk, next_skb, skb); + BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1); - /* Ok. We will be able to collapse the packet. */ - tcp_unlink_write_queue(next_skb, sk); + tcp_highest_sack_combine(sk, next_skb, skb); - skb_copy_from_linear_data(next_skb, - skb_put(skb, next_skb_size), - next_skb_size); + /* Ok. We will be able to collapse the packet. */ + tcp_unlink_write_queue(next_skb, sk); - if (next_skb->ip_summed == CHECKSUM_PARTIAL) - skb->ip_summed = CHECKSUM_PARTIAL; + skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size), + next_skb_size); - if (skb->ip_summed != CHECKSUM_PARTIAL) - skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size); + if (next_skb->ip_summed == CHECKSUM_PARTIAL) + skb->ip_summed = CHECKSUM_PARTIAL; - /* Update sequence range on original skb. */ - TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq; + if (skb->ip_summed != CHECKSUM_PARTIAL) + skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size); - /* Merge over control information. */ - flags |= TCP_SKB_CB(next_skb)->flags; /* This moves PSH/FIN etc. over */ - TCP_SKB_CB(skb)->flags = flags; + /* Update sequence range on original skb. */ + TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq; - /* All done, get rid of second SKB and account for it so -* packet counting does not break. -*/ - TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked & TCPCB_EVER_RETRANS; - if (TCP_SKB_CB(next_skb)->sacked&TCPCB_SACKED_RETRANS) - tp->retrans_out -= tcp_skb_pcount(next_skb); - if (TCP_SKB_CB(next_skb)->sacked&TCPCB_LOST) - tp->lost_out -= tcp_skb_pcount(next_skb); - /* Reno case is special. Sigh... */ - if (tcp_is_reno(tp) && tp->sacked_out) - tcp_dec_pcount_a
[PATCH 6/9] [TCP]: Remove TCPCB_URG & TCPCB_AT_TAIL as unnecessary
The snd_up check should be enough. I suspect this has been there to provide a minor optimization in clean_rtx_queue which used to have a small if (!->sacked) block which could skip snd_up check among the other work. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- include/net/tcp.h |4 net/ipv4/tcp.c|1 - net/ipv4/tcp_input.c |3 +-- net/ipv4/tcp_output.c |7 +++ 4 files changed, 4 insertions(+), 11 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 6a732d4..48081ad 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -578,10 +578,6 @@ struct tcp_skb_cb { #define TCPCB_EVER_RETRANS 0x80/* Ever retransmitted frame */ #define TCPCB_RETRANS (TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS) -#define TCPCB_URG 0x20/* Urgent pointer advanced here */ - -#define TCPCB_AT_TAIL (TCPCB_URG) - __u16 urg_ptr;/* Valid w/URG flags is set.*/ __u32 ack_seq;/* Sequence number ACK'd*/ }; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 2cbfa6d..34085e3 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -497,7 +497,6 @@ static inline void tcp_mark_urg(struct tcp_sock *tp, int flags, if (flags & MSG_OOB) { tp->urg_mode = 1; tp->snd_up = tp->write_seq; - TCP_SKB_CB(skb)->sacked |= TCPCB_URG; } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 7bac1fa..1e7fd81 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2821,8 +2821,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets) if (sacked & TCPCB_LOST) tp->lost_out -= acked_pcount; - if (unlikely((sacked & TCPCB_URG) && tp->urg_mode && -!before(end_seq, tp->snd_up))) + if (unlikely(tp->urg_mode && !before(end_seq, tp->snd_up))) tp->urg_mode = 0; tp->packets_out -= acked_pcount; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 202d60c..182ae21 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -711,7 +711,6 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss TCP_SKB_CB(skb)->flags = flags & 
~(TCPCB_FLAG_FIN|TCPCB_FLAG_PSH); TCP_SKB_CB(buff)->flags = flags; TCP_SKB_CB(buff)->sacked = TCP_SKB_CB(skb)->sacked; - TCP_SKB_CB(skb)->sacked &= ~TCPCB_AT_TAIL; if (!skb_shinfo(skb)->nr_frags && skb->ip_summed != CHECKSUM_PARTIAL) { /* Copy and checksum data tail into the new buffer. */ @@ -1721,7 +1720,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m /* All done, get rid of second SKB and account for it so * packet counting does not break. */ - TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked&(TCPCB_EVER_RETRANS|TCPCB_AT_TAIL); + TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked & TCPCB_EVER_RETRANS; if (TCP_SKB_CB(next_skb)->sacked&TCPCB_SACKED_RETRANS) tp->retrans_out -= tcp_skb_pcount(next_skb); if (TCP_SKB_CB(next_skb)->sacked&TCPCB_LOST) @@ -2470,7 +2469,7 @@ static int tcp_xmit_probe_skb(struct sock *sk, int urgent) skb_reserve(skb, MAX_TCP_HEADER); skb->csum = 0; TCP_SKB_CB(skb)->flags = TCPCB_FLAG_ACK; - TCP_SKB_CB(skb)->sacked = urgent; + TCP_SKB_CB(skb)->sacked = 0; skb_shinfo(skb)->gso_segs = 1; skb_shinfo(skb)->gso_size = 0; skb_shinfo(skb)->gso_type = 0; @@ -2522,7 +2521,7 @@ int tcp_write_wakeup(struct sock *sk) } else { if (tp->urg_mode && between(tp->snd_up, tp->snd_una+1, tp->snd_una+0x)) - tcp_xmit_probe_skb(sk, TCPCB_URG); + tcp_xmit_probe_skb(sk, 1); return tcp_xmit_probe_skb(sk, 0); } } -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 9/9] [TCP]: Code duplication removal, added tcp_bound_to_half_wnd()
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 19 +++ 1 files changed, 11 insertions(+), 8 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index eb8c61a..c70f1d0 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -888,6 +888,15 @@ void tcp_mtup_init(struct sock *sk) icsk->icsk_mtup.probe_size = 0; } +/* Bound MSS / TSO packet size with the half of the window */ +static int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize) +{ + if (tp->max_window && pktsize > (tp->max_window >> 1)) + return max(tp->max_window >> 1, 68U - tp->tcp_header_len); + else + return pktsize; +} + /* This function synchronize snd mss to current pmtu/exthdr set. tp->rx_opt.user_mss is mss set by user by TCP_MAXSEG. It does NOT counts @@ -920,10 +929,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu) icsk->icsk_mtup.search_high = pmtu; mss_now = tcp_mtu_to_mss(sk, pmtu); - - /* Bound mss with half of window */ - if (tp->max_window && mss_now > (tp->max_window >> 1)) - mss_now = max((tp->max_window >> 1), 68U - tp->tcp_header_len); + mss_now = tcp_bound_to_half_wnd(tp, mss_now); /* And store cached results */ icsk->icsk_pmtu_cookie = pmtu; @@ -977,10 +983,7 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed) inet_csk(sk)->icsk_ext_hdr_len - tp->tcp_header_len); - if (tp->max_window && (xmit_size_goal > (tp->max_window >> 1))) - xmit_size_goal = max((tp->max_window >> 1), -68U - tp->tcp_header_len); - + xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal); xmit_size_goal -= (xmit_size_goal % mss_now); } tp->xmit_size_goal = xmit_size_goal; -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel 2.6.23.8: KERNEL: assertion in net/ipv4/tcp_input.c
On Fri, 14 Dec 2007, Ilpo Järvinen wrote: > On Thu, 13 Dec 2007, Wolfgang Walter wrote: > > > it happened again with your patch applied: > > > > WARNING: at net/ipv4/tcp_input.c:1018 tcp_sacktag_write_queue() > > > > Call Trace: > > [] tcp_sacktag_write_queue+0x7d0/0xa60 > > [] add_partial+0x19/0x60 > > [] tcp_ack+0x5a4/0x1d70 > > [] tcp_rcv_established+0x485/0x7b0 > > [] tcp_v4_do_rcv+0xed/0x3e0 > > [] tcp_v4_rcv+0x947/0x970 > > [] ip_local_deliver+0xac/0x290 > > [] ip_rcv+0x362/0x6c0 > > [] netif_receive_skb+0x323/0x420 > > [] tg3_poll+0x630/0xa50 > > [] net_rx_action+0x8a/0x140 > > [] __do_softirq+0x69/0xe0 > > [] call_softirq+0x1c/0x30 > > [] do_softirq+0x35/0x90 > > [] irq_exit+0x55/0x60 > > [] do_IRQ+0x80/0x100 > > [] ret_from_intr+0x0/0xa > > > > ...Yeah, as I suspected, left_out != 0 when sacked_out and lost_out are > zero. I'll try to read the code again to see how that could happen (in > any case this is just annoying at best, no other harm but the > message is being done). ...If nothing comes up I might ask you to run > with another test patch but it might take a week or so until I've enough > time to dig into this fully because I must also become familiar with > something as pre-historic as 2.6.23 (there are already a large number > of related changes since then, both in upcoming 2.6.24 and some in > net-2.6.25)... :-) ...I checked and found this one place from which the left_out inconsistency could develop in stable-2.6.23 or earlier. No idea though how we end up there and do not sync afterwards. Adding WARN_ON to that branch might help but it will give false positives too from sacktag_write_queue due to DSACKs. ...If either of you wants to, you could add WARN_ON(1) to that modified block too and just ignore the occasional stack traces showing tcp_sacktag_write_queue calling tcp_fragment (they are not suspicious because they occur due to DSACKs and sync before the assertion). -- i. 
-- [PATCH] [TCP]: Try to fix left_out inconsistency This spot does obviously incorrect counting for left_out (ie., it fails to adjust it), though I know of very few ways to call tcp_fragment() for SACKED_ACKED skbs (DSACK seems to be the only case, and perhaps some split-after-first-ACK SACK blocks could cause this as well, and both fill sync after the loop). Those usually shouldn't lead to any adjustments of pcount though (and tcp_sync_left_out()s are all over the code, so it's a very narrow maze to not "hit" one in between, which would remove all temporal inconsistencies). Therefore this might not be the actual cause of the detected inconsistencies, but due to the complexity of the corner-case scenarios it still seems worth a try (I might have missed some path :-)). This is only valid for stable-2.6.23 or earlier because anything after that drops left_out, calculates it on the fly from lost_out and sacked_out, and is therefore always consistently in sync. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 666d8a5..9fc08cb 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -677,8 +677,10 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss tp->packets_out -= diff; - if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) + if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) { tp->sacked_out -= diff; + tp->left_out -= diff; + } if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) tp->retrans_out -= diff; -- 1.5.0.6
Re: [PATCH 3/9] [TCP]: Remove unnecessary local variables
On Mon, 31 Dec 2007, David Miller wrote: > From: "Ilpo_Järvinen" <[EMAIL PROTECTED]> > Date: Mon, 31 Dec 2007 12:47:51 +0200 > > > Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> > ... > > > > - in_flight = tcp_packets_in_flight(tp); > > - cwnd = tp->snd_cwnd; > > - if (in_flight < cwnd) > > - return (cwnd - in_flight); > > + if (tcp_packets_in_flight(tp) < tp->snd_cwnd) > > + return tp->snd_cwnd - tcp_packets_in_flight(tp); > > > > I don't know about this one. > > Although tcp_packets_in_flight() is inline and the compiler > should CSE the first call into a local register and not > redo the calculation: > > 1) That isn't something to rely upon. The compiler might look >at a function or set of functions in this file and decide >to not inline tcp_packets_in_flight() or not see the CSE >opportunity and that the second call is redundant. How about doing something really straight-forward then: return max_t(s32, tp->snd_cwnd - tcp_packets_in_flight(tp), 0); ...I was a bit unsure about adding such type trickery and thus didn't go for it right away. > So best to keep the local vars here. > > I also think the code is clearer that way too. Fair enough. -- i.
[PATCH 3/3] [TCP]: Remove unnecessary local variable
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c |3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index b3110fc..f6d279a 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -995,9 +995,8 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed) static void tcp_cwnd_validate(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); - __u32 packets_out = tp->packets_out; - if (packets_out >= tp->snd_cwnd) { + if (tp->packets_out >= tp->snd_cwnd) { /* Network is feed fully. */ tp->snd_cwnd_used = 0; tp->snd_cwnd_stamp = tcp_time_stamp; -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] [TCP]: Code duplication removal, added tcp_bound_to_half_wnd()
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 19 +++ 1 files changed, 11 insertions(+), 8 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index bb7e80a..b3110fc 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -888,6 +888,15 @@ void tcp_mtup_init(struct sock *sk) icsk->icsk_mtup.probe_size = 0; } +/* Bound MSS / TSO packet size with the half of the window */ +static int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize) +{ + if (tp->max_window && pktsize > (tp->max_window >> 1)) + return max(tp->max_window >> 1, 68U - tp->tcp_header_len); + else + return pktsize; +} + /* This function synchronize snd mss to current pmtu/exthdr set. tp->rx_opt.user_mss is mss set by user by TCP_MAXSEG. It does NOT counts @@ -920,10 +929,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu) icsk->icsk_mtup.search_high = pmtu; mss_now = tcp_mtu_to_mss(sk, pmtu); - - /* Bound mss with half of window */ - if (tp->max_window && mss_now > (tp->max_window >> 1)) - mss_now = max((tp->max_window >> 1), 68U - tp->tcp_header_len); + mss_now = tcp_bound_to_half_wnd(tp, mss_now); /* And store cached results */ icsk->icsk_pmtu_cookie = pmtu; @@ -977,10 +983,7 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed) inet_csk(sk)->icsk_ext_hdr_len - tp->tcp_header_len); - if (tp->max_window && (xmit_size_goal > (tp->max_window >> 1))) - xmit_size_goal = max((tp->max_window >> 1), -68U - tp->tcp_header_len); - + xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal); xmit_size_goal -= (xmit_size_goal % mss_now); } tp->xmit_size_goal = xmit_size_goal; -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TCP event tracking via netlink...
On Wed, 2 Jan 2008, David Miller wrote: > From: Stephen Hemminger <[EMAIL PROTECTED]> > Date: Thu, 6 Dec 2007 09:23:12 -0800 > > > Tools and scripts for testing that generate graphs are at: > > git://git.kernel.org/pub/scm/tcptest/tcptest > > Did you move it somewhere else? > > [EMAIL PROTECTED]:~/src/GIT$ git clone > git://git.kernel.org/pub/scm/tcptest/tcptest > Initialized empty Git repository in /home/davem/src/GIT/tcptest/.git/ > fatal: The remote end hung up unexpectedly > fetch-pack from 'git://git.kernel.org/pub/scm/tcptest/tcptest' failed. .../network/ was missing from the path :-). $ git-remote show origin * remote origin URL: git://git.kernel.org/pub/scm/network/tcptest/tcptest.git Remote branch(es) merged with 'git pull' while on branch master master Tracked remote branches master -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.25 2/3] [TCP]: Urgent parameter effect can be simplified.
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index f6d279a..6c7cd0a 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2488,7 +2488,7 @@ static int tcp_xmit_probe_skb(struct sock *sk, int urgent) * end to send an ack. Don't queue or clone SKB, just * send it. */ - TCP_SKB_CB(skb)->seq = urgent ? tp->snd_una : tp->snd_una - 1; + TCP_SKB_CB(skb)->seq = tp->snd_una - !urgent; TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq; TCP_SKB_CB(skb)->when = tcp_time_stamp; return tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC); -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.25 3/3] [TCP]: Perform setting of common control fields in one place
In case of segments which are purely for control without any data (SYN/ACK/FIN/RST), many fields are set to common values in multiple places. i386 results: $ gcc --version gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13) $ codiff tcp_output.o.old tcp_output.o.new net/ipv4/tcp_output.c: tcp_xmit_probe_skb| -48 tcp_send_ack | -56 tcp_retransmit_skb| -79 tcp_connect | -43 tcp_send_active_reset | -35 tcp_make_synack | -42 tcp_send_fin | -48 7 functions changed, 351 bytes removed net/ipv4/tcp_output.c: tcp_init_nondata_skb | +90 1 function changed, 90 bytes added tcp_output.o.mid: 8 functions changed, 90 bytes added, 351 bytes removed, diff: -261 Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 91 +++- 1 files changed, 36 insertions(+), 55 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 6c7cd0a..89f0188 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -327,6 +327,26 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb, } } +/* Constructs common control bits of non-data skb. If SYN/FIN is present, + * auto increment end seqno. 
+ */ +static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags) +{ + skb->csum = 0; + + TCP_SKB_CB(skb)->flags = flags; + TCP_SKB_CB(skb)->sacked = 0; + + skb_shinfo(skb)->gso_segs = 1; + skb_shinfo(skb)->gso_size = 0; + skb_shinfo(skb)->gso_type = 0; + + TCP_SKB_CB(skb)->seq = seq; + if (flags & (TCPCB_FLAG_SYN | TCPCB_FLAG_FIN)) + seq++; + TCP_SKB_CB(skb)->end_seq = seq; +} + static void tcp_build_and_update_options(__be32 *ptr, struct tcp_sock *tp, __u32 tstamp, __u8 **md5_hash) { @@ -1864,12 +1884,10 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) && tp->snd_una == (TCP_SKB_CB(skb)->end_seq - 1)) { if (!pskb_trim(skb, 0)) { - TCP_SKB_CB(skb)->seq = TCP_SKB_CB(skb)->end_seq - 1; - skb_shinfo(skb)->gso_segs = 1; - skb_shinfo(skb)->gso_size = 0; - skb_shinfo(skb)->gso_type = 0; + /* Reuse, even though it does some unnecessary work */ + tcp_init_nondata_skb(skb, TCP_SKB_CB(skb)->end_seq - 1, +TCP_SKB_CB(skb)->flags); skb->ip_summed = CHECKSUM_NONE; - skb->csum = 0; } } @@ -2068,16 +2086,9 @@ void tcp_send_fin(struct sock *sk) /* Reserve space for headers and prepare control bits. */ skb_reserve(skb, MAX_TCP_HEADER); - skb->csum = 0; - TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_FIN); - TCP_SKB_CB(skb)->sacked = 0; - skb_shinfo(skb)->gso_segs = 1; - skb_shinfo(skb)->gso_size = 0; - skb_shinfo(skb)->gso_type = 0; - /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */ - TCP_SKB_CB(skb)->seq = tp->write_seq; - TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + 1; + tcp_init_nondata_skb(skb, tp->write_seq, +TCPCB_FLAG_ACK | TCPCB_FLAG_FIN); tcp_queue_skb(sk, skb); } __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF); @@ -2101,16 +2112,9 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority) /* Reserve space for headers and prepare control bits. 
*/ skb_reserve(skb, MAX_TCP_HEADER); - skb->csum = 0; - TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_RST); - TCP_SKB_CB(skb)->sacked = 0; - skb_shinfo(skb)->gso_segs = 1; - skb_shinfo(skb)->gso_size = 0; - skb_shinfo(skb)->gso_type = 0; - + tcp_init_nondata_skb(skb, tcp_acceptable_seq(sk), +TCPCB_FLAG_ACK | TCPCB_FLAG_RST); /* Send it off. */ - TCP_SKB_CB(skb)->seq = tcp_acceptable_seq(sk); - TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq; TCP_SKB_CB(skb)->when = tcp_time_stamp; if (tcp_transmit_skb(sk, skb, 0, priority)) NET_INC_STATS(LINUX_MIB_TCPABORTFAILED); @@ -2198,12 +2202,11 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst, TCP_ECN_make_synack(req, th); th->source = inet_sk(sk)->sport; th->dest = ireq->rmt_port; - TCP_SKB_CB(skb)->seq = tcp_rsk(req)->snt_isn; - TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(s
[PATCH net-2.6.25 1/3] [TCP]: cleanup tcp_parse_options deep indented switch
Removed case indentation level & combined some nested ifs, mostly within 80 lines now. This is a leftover from indent patch, it just had to be done manually to avoid messing it up completely. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 135 +- 1 files changed, 67 insertions(+), 68 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 18e099c..d5b6adf 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3278,81 +3278,80 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx, int opsize; switch (opcode) { - case TCPOPT_EOL: + case TCPOPT_EOL: + return; + case TCPOPT_NOP:/* Ref: RFC 793 section 3.1 */ + length--; + continue; + default: + opsize = *ptr++; + if (opsize < 2) /* "silly options" */ return; - case TCPOPT_NOP:/* Ref: RFC 793 section 3.1 */ - length--; - continue; - default: - opsize=*ptr++; - if (opsize < 2) /* "silly options" */ - return; - if (opsize > length) - return; /* don't parse partial options */ - switch (opcode) { - case TCPOPT_MSS: - if (opsize==TCPOLEN_MSS && th->syn && !estab) { - u16 in_mss = ntohs(get_unaligned((__be16 *)ptr)); - if (in_mss) { - if (opt_rx->user_mss && opt_rx->user_mss < in_mss) - in_mss = opt_rx->user_mss; - opt_rx->mss_clamp = in_mss; - } - } - break; - case TCPOPT_WINDOW: - if (opsize==TCPOLEN_WINDOW && th->syn && !estab) - if (sysctl_tcp_window_scaling) { - __u8 snd_wscale = *(__u8 *) ptr; - opt_rx->wscale_ok = 1; - if (snd_wscale > 14) { - if (net_ratelimit()) - printk(KERN_INFO "tcp_parse_options: Illegal window " - "scaling value %d >14 received.\n", - snd_wscale); - snd_wscale = 14; - } - opt_rx->snd_wscale = snd_wscale; - } - break; - case TCPOPT_TIMESTAMP: - if (opsize==TCPOLEN_TIMESTAMP) { - if ((estab && opt_rx->tstamp_ok) || - (!estab && sysctl_tcp_timestamps)) { - opt_rx->saw_tstamp = 1; - opt_rx->rcv_tsval = ntohl(get_unaligned((__be32 *)ptr)); - opt_rx->rcv_tsecr = ntohl(get_unaligned((__be32 *)(ptr+4))); - } + if 
(opsize > length) + return; /* don't parse partial options */ + switch (opcode) { + case TCPOPT_MSS: + if (opsize == TCPOLEN_MSS && th->syn && !estab) { + u16 in_mss = ntohs(get_unaligned((__be16 *)ptr)); + if (in_mss) { + if (opt_rx->user_mss && +
[PATCH net-2.6.25] [NET]: Remove obsolete comment
[PATCH] [NET]: Remove obsolete comment It seems that ip_build_xmit is no longer used in here and ip_append_data is used. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- Hi Dave, While reading some nearby code, I noticed this. I'm not 100% sure if the removal is valid or not nor have interest to dig it up from history but I suppose you know it better without having to look it up from history :-). net/ipv4/ip_output.c |2 -- 1 files changed, 0 insertions(+), 2 deletions(-) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 3dc0c12..e57de0f 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1339,8 +1339,6 @@ static int ip_reply_glue_bits(void *dptr, char *to, int offset, * * Should run single threaded per socket because it uses the sock * structure to pass arguments. - * - * LATER: switch from ip_build_xmit to ip_append_* */ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *arg, unsigned int len) -- 1.5.0.6
[PATCH net-2.6.25 0/4] [NET]: Bloat, bloat and more bloat
Hi Dave, After Arnaldo got codiff's inline instrumentation bugs fixed (thanks! :-)), I got my .c-inline-bloat-o-meter to power up reliably after some tweaking and bug fixing on my part... It shows some very high readings every now and then in the code under net/. ...Aand... we've a sovereign winner, though it was only fifth on a kernel wide list (arch/ excluded due to a number of reasons) :-/. Hall of (unquestionable) fame (measured per inline, top 10 under net/): -4496 ctnetlink_parse_tuple netfilter/nf_conntrack_netlink.c -2165 ctnetlink_dump_tuples netfilter/nf_conntrack_netlink.c -2115 __ip_vs_get_out_rt ipv4/ipvs/ip_vs_xmit.c -1924 xfrm_audit_helper_pktinfo xfrm/xfrm_state.c -1799 ctnetlink_parse_tuple_proto netfilter/nf_conntrack_netlink.c -1268 ctnetlink_parse_tuple_ip netfilter/nf_conntrack_netlink.c -1093 ctnetlink_exp_dump_expect netfilter/nf_conntrack_netlink.c -1060 ccid3_update_send_interval dccp/ccids/ccid3.c -983 ctnetlink_dump_tuples_proto netfilter/nf_conntrack_netlink.c -827 ctnetlink_exp_dump_tuple netfilter/nf_conntrack_netlink.c Removing inlines is done iteratively because e.g., uninlining ctnetlink_parse_tuple affected the ..._{proto,ip} prices as well. i386/gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13)/allyesconfig except CONFIG_FORCED_INLINING. Tried without some CONFIG.*DEBUG as well and got slightly better numbers for some functions, yet the numbers don't differ enough to be that meaningful, ie., if there's bloat somewhere, removing DEBUGs won't make it go away. ...After this first aid we're at least below 1k :-). -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] [IPVS]: Kill some bloat
From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <[EMAIL PROTECTED]> net/ipv4/ipvs/ip_vs_xmit.c: ip_vs_icmp_xmit | -638 ip_vs_tunnel_xmit | -674 ip_vs_nat_xmit| -716 ip_vs_dr_xmit | -682 4 functions changed, 2710 bytes removed, diff: -2710 net/ipv4/ipvs/ip_vs_xmit.c: __ip_vs_get_out_rt | +595 1 function changed, 595 bytes added, diff: +595 net/ipv4/ipvs/ip_vs_xmit.o: 5 functions changed, 595 bytes added, 2710 bytes removed, diff: -2115 Without some CONFIG.*DEBUGs: net/ipv4/ipvs/ip_vs_xmit.o: 5 functions changed, 383 bytes added, 1513 bytes removed, diff: -1130 Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/ipvs/ip_vs_xmit.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/ipvs/ip_vs_xmit.c b/net/ipv4/ipvs/ip_vs_xmit.c index 1e96bf8..8436bf8 100644 --- a/net/ipv4/ipvs/ip_vs_xmit.c +++ b/net/ipv4/ipvs/ip_vs_xmit.c @@ -59,7 +59,7 @@ __ip_vs_dst_check(struct ip_vs_dest *dest, u32 rtos, u32 cookie) return dst; } -static inline struct rtable * +static struct rtable * __ip_vs_get_out_rt(struct ip_vs_conn *cp, u32 rtos) { struct rtable *rt; /* Route to the other host */ -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] [NETFILTER]: Kill some super duper bloatry
From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <[EMAIL PROTECTED]> /me awards the bloatiest-of-all-net/-.c-code award to nf_conntrack_netlink.c, congratulations to all the authors :-/! Hall of (unquestionable) fame (measured per inline, top 10 under net/): -4496 ctnetlink_parse_tuplenetfilter/nf_conntrack_netlink.c -2165 ctnetlink_dump_tuplesnetfilter/nf_conntrack_netlink.c -2115 __ip_vs_get_out_rt ipv4/ipvs/ip_vs_xmit.c -1924 xfrm_audit_helper_pktinfoxfrm/xfrm_state.c -1799 ctnetlink_parse_tuple_proto netfilter/nf_conntrack_netlink.c -1268 ctnetlink_parse_tuple_ip netfilter/nf_conntrack_netlink.c -1093 ctnetlink_exp_dump_expectnetfilter/nf_conntrack_netlink.c -1060 void ccid3_update_send_interval dccp/ccids/ccid3.c -983 ctnetlink_dump_tuples_proto netfilter/nf_conntrack_netlink.c -827 ctnetlink_exp_dump_tuple netfilter/nf_conntrack_netlink.c (i386 / gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13) / allyesconfig except CONFIG_FORCED_INLINING) ...and I left < 200 byte gains as future work item. After iterative inline removal, I finally have this: net/netfilter/nf_conntrack_netlink.c: ctnetlink_exp_fill_info | -1104 ctnetlink_new_expect | -1572 ctnetlink_fill_info | -1303 ctnetlink_new_conntrack | -2230 ctnetlink_get_expect | -341 ctnetlink_del_expect | -352 ctnetlink_expect_event| -1110 ctnetlink_conntrack_event | -1548 ctnetlink_del_conntrack | -729 ctnetlink_get_conntrack | -728 10 functions changed, 11017 bytes removed, diff: -11017 net/netfilter/nf_conntrack_netlink.c: ctnetlink_parse_tuple | +419 dump_nat_seq_adj | +183 ctnetlink_dump_counters | +166 ctnetlink_dump_tuples | +261 ctnetlink_exp_dump_expect | +633 ctnetlink_change_status | +460 6 functions changed, 2122 bytes added, diff: +2122 net/netfilter/nf_conntrack_netlink.o: 16 functions changed, 2122 bytes added, 11017 bytes removed, diff: -8895 Without a number of CONFIG.*DEBUGs, I got this: net/netfilter/nf_conntrack_netlink.o: 16 functions changed, 2122 bytes added, 11029 bytes removed, diff: -8907 Signed-off-by: 
Ilpo Järvinen <[EMAIL PROTECTED]> --- net/netfilter/nf_conntrack_netlink.c | 12 ++-- 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index d93d58d..38141f1 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -95,7 +95,7 @@ nla_put_failure: return -1; } -static inline int +static int ctnetlink_dump_tuples(struct sk_buff *skb, const struct nf_conntrack_tuple *tuple) { @@ -205,7 +205,7 @@ nla_put_failure: } #ifdef CONFIG_NF_CT_ACCT -static inline int +static int ctnetlink_dump_counters(struct sk_buff *skb, const struct nf_conn *ct, enum ip_conntrack_dir dir) { @@ -284,7 +284,7 @@ nla_put_failure: } #ifdef CONFIG_NF_NAT_NEEDED -static inline int +static int dump_nat_seq_adj(struct sk_buff *skb, const struct nf_nat_seq *natseq, int type) { struct nlattr *nest_parms; @@ -648,7 +648,7 @@ ctnetlink_parse_tuple_proto(struct nlattr *attr, return ret; } -static inline int +static int ctnetlink_parse_tuple(struct nlattr *cda[], struct nf_conntrack_tuple *tuple, enum ctattr_tuple type, u_int8_t l3num) { @@ -888,7 +888,7 @@ out: return err; } -static inline int +static int ctnetlink_change_status(struct nf_conn *ct, struct nlattr *cda[]) { unsigned long d; @@ -1349,7 +1349,7 @@ nla_put_failure: return -1; } -static inline int +static int ctnetlink_exp_dump_expect(struct sk_buff *skb, const struct nf_conntrack_expect *exp) { -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] [XFRM]: Kill some bloat
From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <[EMAIL PROTECTED]> net/xfrm/xfrm_state.c: xfrm_audit_state_delete | -589 xfrm_replay_check| -542 xfrm_audit_state_icvfail | -520 xfrm_audit_state_add | -589 xfrm_audit_state_replay_overflow | -523 xfrm_audit_state_notfound_simple | -509 xfrm_audit_state_notfound| -521 7 functions changed, 3793 bytes removed, diff: -3793 net/xfrm/xfrm_state.c: xfrm_audit_helper_pktinfo | +522 xfrm_audit_helper_sainfo | +598 2 functions changed, 1120 bytes added, diff: +1120 net/xfrm/xfrm_state.o: 9 functions changed, 1120 bytes added, 3793 bytes removed, diff: -2673 Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/xfrm/xfrm_state.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 4dec655..81b0fce 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -2008,8 +2008,8 @@ void __init xfrm_state_init(void) } #ifdef CONFIG_AUDITSYSCALL -static inline void xfrm_audit_helper_sainfo(struct xfrm_state *x, - struct audit_buffer *audit_buf) +static void xfrm_audit_helper_sainfo(struct xfrm_state *x, +struct audit_buffer *audit_buf) { struct xfrm_sec_ctx *ctx = x->security; u32 spi = ntohl(x->id.spi); @@ -2036,8 +2036,8 @@ static inline void xfrm_audit_helper_sainfo(struct xfrm_state *x, audit_log_format(audit_buf, " spi=%u(0x%x)", spi, spi); } -static inline void xfrm_audit_helper_pktinfo(struct sk_buff *skb, u16 family, -struct audit_buffer *audit_buf) +static void xfrm_audit_helper_pktinfo(struct sk_buff *skb, u16 family, + struct audit_buffer *audit_buf) { struct iphdr *iph4; struct ipv6hdr *iph6; -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] [CCID3]: Kill some bloat
From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <[EMAIL PROTECTED]> Without a number of CONFIG.*DEBUG: net/dccp/ccids/ccid3.c: ccid3_hc_tx_update_x | -170 ccid3_hc_tx_packet_sent | -175 ccid3_hc_tx_packet_recv | -169 ccid3_hc_tx_no_feedback_timer | -192 ccid3_hc_tx_send_packet | -144 5 functions changed, 850 bytes removed, diff: -850 net/dccp/ccids/ccid3.c: ccid3_update_send_interval | +191 1 function changed, 191 bytes added, diff: +191 net/dccp/ccids/ccid3.o: 6 functions changed, 191 bytes added, 850 bytes removed, diff: -659 Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/dccp/ccids/ccid3.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c index 00b5f11..8a817dd 100644 --- a/net/dccp/ccids/ccid3.c +++ b/net/dccp/ccids/ccid3.c @@ -97,7 +97,7 @@ static inline u64 rfc3390_initial_rate(struct sock *sk) /* * Recalculate t_ipi and delta (should be called whenever X changes) */ -static inline void ccid3_update_send_interval(struct ccid3_hc_tx_sock *hctx) +static void ccid3_update_send_interval(struct ccid3_hc_tx_sock *hctx) { /* Calculate new t_ipi = s / X_inst (X_inst is in 64 * bytes/second) */ hctx->ccid3hctx_t_ipi = scaled_div32(((u64)hctx->ccid3hctx_s) << 6, -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Sat, 5 Jan 2008, David Miller wrote: > From: Herbert Xu <[EMAIL PROTECTED]> > Date: Sun, 06 Jan 2008 11:29:35 +1100 > > > We should never use inline except when it's on the fast path and this > > is definitely not a fast path. If a function ends up being called > > just once the compiler will most likely inline it anyway, making the > > use of the keyword inline redundant. > > Similarly I question just about any inline usage at all in *.c files > these days. > > I even would discourage it's use for fast-path cases as well. There still seem to be good candidates for inline in *.c files; in the worst case I had +172 due to an inline removed, and ~60 are in the +10-+90 range with my gcc. Later gccs might do better, but I definitely wouldn't just blindly remove them all. Here's the other end of the list: +35 static inline hci_encrypt_change_evt net/bluetooth/hci_event.c +36 static __inline__ tcp_in_window net/ipv4/tcp_minisocks.c +38 static inline hci_conn_complete_evt net/bluetooth/hci_event.c +38 static inline hci_conn_request_evt net/bluetooth/hci_event.c +42 static inline gred_wred_mode net/sched/sch_gred.c +45 static inline secpath_has_nontransport net/xfrm/xfrm_policy.c +52 static inline bool port_match net/netfilter/xt_tcpudp.c +67 static inline dn_check_idf net/decnet/dn_nsp_in.c +90 static inline __ieee80211_queue_stopped net/mac80211/tx.c +172 static inline sctp_chunk_length_valid net/sctp/sm_statefuns.c -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Sun, 6 Jan 2008, Herbert Xu wrote: > is definitely not a fast path. If a function ends up being called > just once the compiler will most likely inline it anyway, making the > use of the keyword inline redundant. Unexpectedly enough, even this logic seems to fail in a way with my gcc. I've yet to study it closer, but it seems to me that e.g. uninlining the only-once-called tcp_fastretrans_alert is worth at least 100 bytes (note that it's not inlined by us, gcc did it all by itself)! Otherwise I'd fail to understand why I got -270 bytes from uninlining tcp_moderate_cwnd, which is only 57 bytes as uninlined with three call sites. net/ipv4/tcp_input.c: tcp_undo_cwr | -48 tcp_try_undo_recovery | -55 tcp_ack | -2941 3 functions changed, 3044 bytes removed, diff: -3044 net/ipv4/tcp_input.c: tcp_moderate_cwnd | +57 tcp_fastretrans_alert | +2717 2 functions changed, 2774 bytes added, diff: +2774 net/ipv4/tcp_input.o: 5 functions changed, 2774 bytes added, 3044 bytes removed, diff: -270 I'll probably force uninlining of it without the tcp_moderate_cwnd noise and try a number of gcc versions. -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Mon, 7 Jan 2008, David Miller wrote: > From: Andi Kleen <[EMAIL PROTECTED]> > Date: Tue, 8 Jan 2008 06:00:07 +0100 > > > On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote: > > > The vast majority of them are one, two, and three liners. > > > > % awk ' { line++ } ; /^{/ { total++; start = line } ; /^}/ { > > len=line-start-3; if (len > 4) l++; if (len >= 10) k++; } ; END { print > > total, l, l/total, k, k/total }' < include/net/tcp.h > > 68 28 0.411765 20 0.294118 > > > > 41% are over 4 lines, 29% are >= 10 lines. > > Take out the comments and whitespace lines, your script is > too simplistic. He should also remove these, which involve just syntactic casting to please the compiler: struct tcp_sock *tp = tcp_sk(sk); struct inet_connection_sock *icsk = inet_csk(sk); ...and maybe some other similar ones. I'm sure that's going to make his statistics even more questionable than they already are, because we need tp all around. But about his concerns: I spent a couple of minutes looking at it by hand. These are likely to be a win: tcp_set_state tcp_prequeue tcp_prequeue_init tcp_openreq_init ...about the rest I'm very unsure, but there are some that are probably a minor win as well. tcp_dec_quickack_mode has only a single caller anyway, so I doubt it's going to be significant; I thought about moving it to .c earlier but just haven't done that yet :-). -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Tue, 8 Jan 2008, Andi Kleen wrote: > David Miller <[EMAIL PROTECTED]> writes: > > > Similarly I question just about any inline usage at all in *.c files > > Don't forget the .h files. Especially a lot of stuff in tcp.h should > be probably in some .c file and not be inline. I'm not forgetting them... :-) My intention is to do the same for all headers as well, but I'd really like to get some actual number of bytes by measuring rather than pulling them out of a hat. ...It still requires some work to get the uninlining to actually do the right thing, and actually testing it involves orders of magnitude more recompilation than the .o-only checking required, so after finishing the script I either have to set up n distcc masters with a number of slaves each or wait some time for the results :-). However, I suspect that nothing in tcp.h will come close to the ~5-10k I've seen in three .c files already, + one with debugging that has even more than 10k (even if everything in tcp.h is summed up :-)). -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SACK scoreboard
On Mon, 7 Jan 2008, David Miller wrote:
> Did you happen to read a recent blog posting of mine?
>
> http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead
>
> I've been thinking more and more and I think we might be able
> to get away with enforcing that SACKs are always increasing in
> coverage.
>
> I doubt there are any real systems out there that drop out of order
> packets that are properly formed and are in window, even though the
> SACK specification (foolishly, in my opinion) allows this.

Luckily we can already see that from the MIBs, so querying people who have large servers under their supervision (or can otherwise access them), servers which are continuously "testing" the internet :-), and asking if they see any might help. I checked my dept's interactive servers and all had zero renegings, but I don't have access to a www server, which would have much wider exposure.

> If we could free packets as SACK blocks cover them, all the problems
> go away.

I thought about it a bit yesterday after reading your blog and came to the conclusion that they won't: we can still get those nasty ACKs regardless of received SACK info (here, missing SACK info). There are even some valid cases which include ACK losses besides the actual data loss. Not that this is the most common case, but I just wanted to point out that the cleanup work is at least partially independent of the SACK problem. So not "all" problems would really go away.

> For one thing, this will allow the retransmit queue liberation during
> loss recovery to be spread out over the event, instead of batched up
> like crazy to the point where the cumulative ACK finally moves and
> releases an entire window's worth of data.

Two key cases for a real loss pattern are:

1. Losses once per n packets, where n is something small, like 2-20 or so; this usually happens at slow start overshoot or when competing traffic slow starts. Cumulative ACKs will cover only a small part of the window once the rexmits make it through, thus this is not a problem.

2.
Single loss (or a few at the beginning of the window), rest SACKed. The cumulative ACK will cover the original window when the last necessary rexmit gets through. Case 1 becomes nasty ACKy only if a rexmit is lost as well, but in that case the arriving SACK blocks make the rest of the window equivalent to case 2 :-). So I'm now trying to solve just case 2.

What if we could somehow "combine" adjacent skbs (or whatever they're called in that model) if SACK covers them both, so that we still hold them but can drop them in a very efficient way? That would split the combining effort per ACK. And if reneging did occur, we can think of a way to put the necessary fuzz into a form which cannot hurt the rest of the system (relatively easy & fast if we add CA_Reneging and allow retransmitting a portion of an skb, similar to what you suggested earlier). And it might even then be possible to offer the admin a control, so that the admin can choose between recovery and a plain reset if he thinks reneging is always an indication of an attack. This is somewhat similar to what UTO (under IETF evaluation) does, as the purpose of both is to violate RFC TCP in order to avoid malicious traps, while the control over it is left to the user.

> Next, it would simplify all of this scanning code trying to figure out
> which holes to fill during recovery.
>
> And for SACK scoreboard marking, the RB trie would become very nearly
> unecessary as far as I can tell.

I've been contacted by a person who was interested in reaching 500k packet windows, so your 4000 sounded like a joke :-/. Having, let's say, every 20th packet dropped means 25k skbs remaining; can we scan through that in any sensible time without RBs and friends :-)? However, allowing the queue walk to begin from either direction would solve most of the common cases well enough for it to be nearly manageable.

> I would not even entertain this kind of crazy idea unless I thought
> the fundamental complexity simplification payback was enormous. And
> in this case I think it is.
> What we could do is put some experimental hack in there for developers
> to start playing with, which would enforce that SACKs always increase
> in coverage. If violated the connection reset and a verbose log
> message is logged so we can analyze any cases that occur.

We have an initial number already, in the MIBs.

> Sounds crazy, but maybe has potential. What do you think?

If I hinted to my boss that I'm involved in something like this, I'd bet he would also get quite crazy... ;-) I'm partially paid for making TCP more RFCish :-), or at least for making sure that the places where things diverge are known and controllable for research purposes.

--
i.

ps. If the others Cced would like to get dropped in case there are followups, just let me know :-). Otherwise, no need to do anything.
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6.25 0/4] [NET]: Bloat, bloat and more bloat
On Sat, 5 Jan 2008, Arnaldo Carvalho de Melo wrote:
> Em Sat, Jan 05, 2008 at 03:39:04PM +0200, Ilpo Järvinen escreveu:
> > Hi Dave,
> >
> > After Arnaldo got codiff's inline instrumentation bugs fixed
> > (thanks! :-)), I got my .c-inline-bloat-o-meter to power up
> > reliably after some tweaking and bug fixing on my behalf...
> > It shows some very high readings every now and then in the
> > code under net/.
>
> Thank you for the reports and for showing how these tools can be put to
> good use!
>
> If you have any further suggestions on how to make codiff and the
> dwarves to be of more help or find any other bug, please let me know.

It could use a bit less memory, because my header inline checking attempt on a machine with 2G+2G ends up like this:

$ codiff vmlinux.o.base vmlinux.o
libclasses: out of memory(inline_expansion__new)
$ ls -al vmlinux.o{,.base}
-rw-r--r-- 1 ijjarvin tkol 633132586 Jan  9 13:11 vmlinux.o
-rw-r--r-- 1 ijjarvin tkol 633132572 Jan  9 00:58 vmlinux.o.base
$

Considering that's only 0.6G+0.6G, I have a problem understanding why codiff's number crunching eats up so much memory. I just hope there aren't any O(n^2) or worse algos in it either, once the memory consumption gets resolved. :-)

--
i.
Re: SACK scoreboard
On Tue, 8 Jan 2008, John Heffner wrote:
> Andi Kleen wrote:
> > David Miller <[EMAIL PROTECTED]> writes:
> > > The big problem is that recovery from even a single packet loss in a
> > > window makes us run kfree_skb() for a all the packets in a full
> > > window's worth of data when recovery completes.
> >
> > Why exactly is it a problem to free them all at once? Are you worried
> > about kernel preemption latencies?
>
> I also wonder how much of a problem this is (for now, with window sizes of
> order 1 packets. My understanding is that the biggest problems arise from
> O(N^2) time for recovery because every ack was expensive. Have current
> tests shown the final ack to be a major source of problems?

This thread got started because I tried to solve the other latencies, but realized that it helps very little, because this latency spike would have remained unsolved, and it happens in one of the most common cases.

--
i.
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Tue, 8 Jan 2008, Ilpo Järvinen wrote:
> On Mon, 7 Jan 2008, David Miller wrote:
> > From: Andi Kleen <[EMAIL PROTECTED]>
> > Date: Tue, 8 Jan 2008 06:00:07 +0100
> >
> > > On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote:
> > > > The vast majority of them are one, two, and three liners.
> > >
> > > % awk ' { line++ } ; /^{/ { total++; start = line } ; /^}/ {
> > > len=line-start-3; if (len > 4) l++; if (len >= 10) k++; } ; END { print
> > > total, l, l/total, k, k/total }' < include/net/tcp.h
> > > 68 28 0.411765 20 0.294118
> > >
> > > 41% are over 4 lines, 29% are >= 10 lines.
> >
> > Take out the comments and whitespace lines, your script is
> > too simplistic.

In addition, it triggered spuriously on every struct/enum end brace :-) and was using the last known function-starting brace there, so no wonder the numbers were that high... Counting with the corrected length (len=line-start-1) & the spurious matches removed:

74 19 0.256757 7 0.0945946

Here are (finally) the measured bytes (a couple of the functions are missing because I had a couple of bugs in the regexps, and the #if trickery at the inline resulted in failed compiles):

12 funcs, 242+, 1697-, diff: -1455	tcp_set_state
13 funcs, 92+, 632-, diff: -540		tcp_is_cwnd_limited
12 funcs, 2836+, 3225-, diff: -389	tcp_current_ssthresh
5 funcs, 261+, 556-, diff: -295		tcp_prequeue
7 funcs, 2777+, 3049-, diff: -272	tcp_clear_retrans_hints_partial
11 funcs, 64+, 275-, diff: -211		tcp_win_from_space
6 funcs, 128+, 320-, diff: -192		tcp_prequeue_init
12 funcs, 45+, 209-, diff: -164		tcp_set_ca_state
7 funcs, 106+, 237-, diff: -131		tcp_fast_path_check
5 funcs, 167+, 291-, diff: -124		tcp_write_queue_purge
6 funcs, 43+, 160-, diff: -117		tcp_push_pending_frames
9 funcs, 55+, 159-, diff: -104		tcp_v4_check
6 funcs, 4+, 97-, diff: -93		tcp_packets_in_flight
7 funcs, 58+, 150-, diff: -92		tcp_fast_path_on
4 funcs, 4+, 91-, diff: -87		tcp_clear_options
6 funcs, 141+, 217-, diff: -76		tcp_openreq_init
8 funcs, 38+, 111-, diff: -73		tcp_unlink_write_queue
7 funcs, 32+, 103-, diff: -71		tcp_checksum_complete
7 funcs, 35+, 101-, diff: -66		__tcp_fast_path_on
5 funcs, 4+, 66-, diff: -62		tcp_receive_window
6 funcs, 67+, 128-, diff: -61		tcp_add_write_queue_tail
7 funcs, 30+, 86-, diff: -56		tcp_ca_event
6 funcs, 73+, 106-, diff: -33		tcp_paws_check
4 funcs, 4+, 36-, diff: -32		tcp_highest_sack_seq
6 funcs, 46+, 78-, diff: -32		tcp_fin_time
3 funcs, 4+, 35-, diff: -31		tcp_clear_all_retrans_hints
7 funcs, 30+, 51-, diff: -21		__tcp_add_write_queue_tail
3 funcs, 4+, 14-, diff: -10		tcp_enable_fack
4 funcs, 4+, 14-, diff: -10		keepalive_time_when
8 funcs, 66+, 73-, diff: -7		tcp_full_space
3 funcs, 4+, 5-, diff: -1		tcp_wnd_end
4 funcs, 97+, 97-, diff: +0		tcp_mib_init
3 funcs, 4+, 3-, diff: +1		tcp_skb_is_last
2 funcs, 4+, 2-, diff: +2		keepalive_intvl_when
2 funcs, 4+, 2-, diff: +2		tcp_is_fack
2 funcs, 4+, 2-, diff: +2		tcp_skb_mss
2 funcs, 4+, 2-, diff: +2		tcp_write_queue_empty
2 funcs, 4+, 2-, diff: +2		tcp_advance_highest_sack
2 funcs, 4+, 2-, diff: +2		tcp_advance_send_head
2 funcs, 4+, 2-, diff: +2		tcp_check_send_head
2 funcs, 4+, 2-, diff: +2		tcp_highest_sack_reset
2 funcs, 4+, 2-, diff: +2		tcp_init_send_head
2 funcs, 4+, 2-, diff: +2		tcp_sack_reset
6 funcs, 47+, 44-, diff: +3		tcp_space
5 funcs, 55+, 50-, diff: +5		tcp_too_many_orphans
3 funcs, 8+, 2-, diff: +6		tcp_minshall_update
3 funcs, 8+, 2-, diff: +6		tcp_update_wl
8 funcs, 25+, 14-, diff: +11		between
3 funcs, 14+, 2-, diff: +12		tcp_put_md5sig_pool
3 funcs, 14+, 2-, diff: +12		tcp_clear_xmit_timers
5 funcs, 30+, 17-, diff: +13		tcp_dec_pcount_approx_int
6 funcs, 33+, 20-, diff: +13		tcp_insert_write_queue_after
3 funcs, 17+, 2-, diff: +15		__tcp_checksum_complete
5 funcs, 17+, 2-, diff: +15		tcp_init_wl
4 funcs, 57+, 41-, diff: +16		tcp_dec_quickack_mode
4 funcs, 40+, 22-, diff: +18		__tcp_add_write_queue_head
5 funcs, 36+, 16-, diff: +20		tcp_highest_sack_combine
4 funcs, 40+, 18-, diff: +22		tcp_dec_pcount_approx
6 funcs, 29+, 5-, diff: +24		tcp_is_sack
4 funcs, 28+, 2-, diff: +26		tcp_is_reno
5 funcs, 50+, 24-, diff: +26		tcp_insert_write_queue_before
4 funcs, 83+, 56-, diff: +27		tcp_check_probe_timer
8 funcs, 69+, 14-, diff: +55		tcp_left_out
11 funcs, 2995+, 2893-, diff: +102	tcp_skb_pcount
30 funcs, 930+, 2-, diff: +928		before

--
i.
TCP & hints
Hi Stephen,

Do you still remember what this is for (it got added along with the other TCP hint stuff)? What kind of problem did you see back then (or who saw problems)?

@@ -1605,6 +1711,10 @@ static void tcp_undo_cwr(struct sock *sk, const int undo)
 	}
 	tcp_moderate_cwnd(tp);
 	tp->snd_cwnd_stamp = tcp_time_stamp;
+
+	/* There is something screwy going on with the retrans hints after
+	   an undo */
+	clear_all_retrans_hints(tp);
 }
 
 static inline int tcp_may_undo(struct tcp_sock *tp)

--
i.
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.25 0/8] [NET]: More uninlining & related
Hi Dave,

First of all, I changed the output encoding of git to utf-8, so I guess the encoding should not cause the same trouble for you.

Here are a couple more patches to uninline things. They are pretty straightforward, except for the EXPORT_SYMBOLs; I've no idea which is the right variant (feel free to fix them while applying :-)). Also, the pktgen uninlining is somewhat questionable because it's just a testing tool, so feel free to drop it if it feels unnecessary (I could have asked first, but it's just as easy to do it this way, if not easier)...

There were more dead static inlines found once the inlines were removed (gcc didn't report them otherwise) than in the pktgen patch now included, but I'm not sure if I should send patches removing or #if 0'ing them "by default"?

--
i.
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.25 0/8] [NET]: More uninlining & related
Hi Dave,

First of all, I changed the output encoding to utf-8, so I guess the encoding should not cause trouble for you.

Here are a couple more patches to uninline things. They are pretty straightforward, except for the EXPORT_SYMBOLs; I've no idea which is the right variant (feel free to fix them while applying :-)). Also, the pktgen uninlining is somewhat questionable because it's just a testing tool, so feel free to drop it if it feels unnecessary (I could have asked first, but it's just as easy to do it this way, if not easier)...

There were more dead static inlines found once the inlines were removed (gcc didn't report them otherwise) than in the pktgen patch now included, but I'm not sure if I should send patches removing or #if 0'ing them "by default"?

--
i.

ps. I apologize that I had to resend to get these to netdev as well: git-send-email on this system (not sure if a later version could do better) still seems to lack proper encoding of my name when it decides to add it to the Cc list all by itself, and those 8-bit chars in the address got rejected.
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 2/8] [TCP]: Uninline tcp_is_cwnd_limited
net/ipv4/tcp_cong.c:
 tcp_reno_cong_avoid | -65
 1 function changed, 65 bytes removed, diff: -65

net/ipv4/arp.c:
 arp_ignore | -5
 1 function changed, 5 bytes removed, diff: -5

net/ipv4/tcp_bic.c:
 bictcp_cong_avoid | -57
 1 function changed, 57 bytes removed, diff: -57

net/ipv4/tcp_cubic.c:
 bictcp_cong_avoid | -61
 1 function changed, 61 bytes removed, diff: -61

net/ipv4/tcp_highspeed.c:
 hstcp_cong_avoid | -63
 1 function changed, 63 bytes removed, diff: -63

net/ipv4/tcp_hybla.c:
 hybla_cong_avoid | -85
 1 function changed, 85 bytes removed, diff: -85

net/ipv4/tcp_htcp.c:
 htcp_cong_avoid | -57
 1 function changed, 57 bytes removed, diff: -57

net/ipv4/tcp_veno.c:
 tcp_veno_cong_avoid | -52
 1 function changed, 52 bytes removed, diff: -52

net/ipv4/tcp_scalable.c:
 tcp_scalable_cong_avoid | -61
 1 function changed, 61 bytes removed, diff: -61

net/ipv4/tcp_yeah.c:
 tcp_yeah_cong_avoid | -75
 1 function changed, 75 bytes removed, diff: -75

net/ipv4/tcp_illinois.c:
 tcp_illinois_cong_avoid | -54
 1 function changed, 54 bytes removed, diff: -54

net/dccp/ccids/ccid3.c:
 ccid3_update_send_interval | -7
 ccid3_hc_tx_packet_recv    | +7
 2 functions changed, 7 bytes added, 7 bytes removed, diff: +0

net/ipv4/tcp_cong.c:
 tcp_is_cwnd_limited | +88
 1 function changed, 88 bytes added, diff: +88

built-in.o:
 14 functions changed, 95 bytes added, 642 bytes removed, diff: -547

...Again some gcc artifacts visible as well.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 include/net/tcp.h   |   22 +-
 net/ipv4/tcp_cong.c |   21 +
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 306580c..7de4ea3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -786,27 +786,7 @@ static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
 {
 	return tp->snd_una + tp->snd_wnd;
 }
-
-/* RFC2861 Check whether we are limited by application or congestion window
- * This is the inverse of cwnd check in tcp_tso_should_defer
- */
-static inline int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
-{
-	const struct tcp_sock *tp = tcp_sk(sk);
-	u32 left;
-
-	if (in_flight >= tp->snd_cwnd)
-		return 1;
-
-	if (!sk_can_gso(sk))
-		return 0;
-
-	left = tp->snd_cwnd - in_flight;
-	if (sysctl_tcp_tso_win_divisor)
-		return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
-	else
-		return left <= tcp_max_burst(tp);
-}
+extern int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
 
 static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss,
 				       const struct sk_buff *skb)
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 4451750..3a6be23 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -274,6 +274,27 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
 	return err;
 }
 
+/* RFC2861 Check whether we are limited by application or congestion window
+ * This is the inverse of cwnd check in tcp_tso_should_defer
+ */
+int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	u32 left;
+
+	if (in_flight >= tp->snd_cwnd)
+		return 1;
+
+	if (!sk_can_gso(sk))
+		return 0;
+
+	left = tp->snd_cwnd - in_flight;
+	if (sysctl_tcp_tso_win_divisor)
+		return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
+	else
+		return left <= tcp_max_burst(tp);
+}
+EXPORT_SYMBOL_GPL(tcp_is_cwnd_limited);
 
 /*
  * Slow start is used when congestion window is less than slow start
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
net/core/pktgen.c:
 pktgen_stop_device   |  -50
 pktgen_run           | -105
 pktgen_if_show       |  -37
 pktgen_thread_worker | -702
 4 functions changed, 894 bytes removed, diff: -894

net/core/pktgen.c:
 getCurUs | +36
 1 function changed, 36 bytes added, diff: +36

net/core/pktgen.o:
 5 functions changed, 36 bytes added, 894 bytes removed, diff: -858

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/core/pktgen.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index ebfb126..d18fdb1 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -405,7 +405,7 @@ static inline __u64 tv_to_us(const struct timeval *tv)
 	return us;
 }
 
-static inline __u64 getCurUs(void)
+static __u64 getCurUs(void)
 {
 	struct timeval tv;
 	do_gettimeofday(&tv);
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 1/8] [TCP]: Uninline tcp_set_state
net/ipv4/tcp.c:
 tcp_close_state | -226
 tcp_done        | -145
 tcp_close       | -564
 tcp_disconnect  | -141
 4 functions changed, 1076 bytes removed, diff: -1076

net/ipv4/tcp_input.c:
 tcp_fin               |  -86
 tcp_rcv_state_process | -164
 2 functions changed, 250 bytes removed, diff: -250

net/ipv4/tcp_ipv4.c:
 tcp_v4_connect | -209
 1 function changed, 209 bytes removed, diff: -209

net/ipv4/arp.c:
 arp_ignore | +5
 1 function changed, 5 bytes added, diff: +5

net/ipv6/tcp_ipv6.c:
 tcp_v6_connect | -158
 1 function changed, 158 bytes removed, diff: -158

net/sunrpc/xprtsock.c:
 xs_sendpages | -2
 1 function changed, 2 bytes removed, diff: -2

net/dccp/ccids/ccid3.c:
 ccid3_update_send_interval | +7
 1 function changed, 7 bytes added, diff: +7

net/ipv4/tcp.c:
 tcp_set_state | +238
 1 function changed, 238 bytes added, diff: +238

built-in.o:
 12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445

I've no explanation why some unrelated changes seem to occur consistently as well (arp_ignore, ccid3_update_send_interval; I checked the arp_ignore asm and it seems to be due to some reordering of operations causing some extra opcodes to be generated). Still, the benefits are pretty obvious from the codiff results.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
---
 include/net/tcp.h |   35 +--
 net/ipv4/tcp.c    |   35 +++
 2 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 48081ad..306580c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -926,40 +926,7 @@ static const char *statename[]={
 	"Close Wait","Last ACK","Listen","Closing"
 };
 #endif
-
-static inline void tcp_set_state(struct sock *sk, int state)
-{
-	int oldstate = sk->sk_state;
-
-	switch (state) {
-	case TCP_ESTABLISHED:
-		if (oldstate != TCP_ESTABLISHED)
-			TCP_INC_STATS(TCP_MIB_CURRESTAB);
-		break;
-
-	case TCP_CLOSE:
-		if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED)
-			TCP_INC_STATS(TCP_MIB_ESTABRESETS);
-
-		sk->sk_prot->unhash(sk);
-		if (inet_csk(sk)->icsk_bind_hash &&
-		    !(sk->sk_userlocks & SOCK_BINDPORT_LOCK))
-			inet_put_port(&tcp_hashinfo, sk);
-		/* fall through */
-	default:
-		if (oldstate==TCP_ESTABLISHED)
-			TCP_DEC_STATS(TCP_MIB_CURRESTAB);
-	}
-
-	/* Change state AFTER socket is unhashed to avoid closed
-	 * socket sitting in hash tables.
-	 */
-	sk->sk_state = state;
-
-#ifdef STATE_TRACE
-	SOCK_DEBUG(sk, "TCP sk=%p, State %s -> %s\n",sk, statename[oldstate],statename[state]);
-#endif
-}
+extern void tcp_set_state(struct sock *sk, int state);
 
 extern void tcp_done(struct sock *sk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 34085e3..7d7b958 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1652,6 +1652,41 @@ recv_urg:
 	goto out;
 }
 
+void tcp_set_state(struct sock *sk, int state)
+{
+	int oldstate = sk->sk_state;
+
+	switch (state) {
+	case TCP_ESTABLISHED:
+		if (oldstate != TCP_ESTABLISHED)
+			TCP_INC_STATS(TCP_MIB_CURRESTAB);
+		break;
+
+	case TCP_CLOSE:
+		if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED)
+			TCP_INC_STATS(TCP_MIB_ESTABRESETS);
+
+		sk->sk_prot->unhash(sk);
+		if (inet_csk(sk)->icsk_bind_hash &&
+		    !(sk->sk_userlocks & SOCK_BINDPORT_LOCK))
+			inet_put_port(&tcp_hashinfo, sk);
+		/* fall through */
+	default:
+		if (oldstate==TCP_ESTABLISHED)
+			TCP_DEC_STATS(TCP_MIB_CURRESTAB);
+	}
+
+	/* Change state AFTER socket is unhashed to avoid closed
+	 * socket sitting in hash tables.
+	 */
+	sk->sk_state = state;
+
+#ifdef STATE_TRACE
+	SOCK_DEBUG(sk, "TCP sk=%p, State %s -> %s\n",sk, statename[oldstate],statename[state]);
+#endif
+}
+EXPORT_SYMBOL_GPL(tcp_set_state);
+
 /*
  * State processing on a close. This implements the state shift for
  * sending our FIN frame. Note that we only send a FIN for some
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 4/8] [IPV6] route: kill some bloat
net/ipv6/route.c:
 ip6_pkt_prohibit_out | -130
 ip6_pkt_discard      | -261
 ip6_pkt_discard_out  | -130
 ip6_pkt_prohibit     | -261
 4 functions changed, 782 bytes removed, diff: -782

net/ipv6/route.c:
 ip6_pkt_drop | +300
 1 function changed, 300 bytes added, diff: +300

net/ipv6/route.o:
 5 functions changed, 300 bytes added, 782 bytes removed, diff: -482

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/ipv6/route.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 02354a7..d3b5811 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1762,8 +1762,7 @@ int ipv6_route_ioctl(unsigned int cmd, void __user *arg)
  *	Drop the packet on the floor
  */
 
-static inline int ip6_pkt_drop(struct sk_buff *skb, int code,
-			       int ipstats_mib_noroutes)
+static int ip6_pkt_drop(struct sk_buff *skb, int code, int ipstats_mib_noroutes)
 {
 	int type;
 	switch (ipstats_mib_noroutes) {
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 5/8] [NETLINK] af_netlink: kill some bloat
net/netlink/af_netlink.c:
 netlink_realloc_groups        | -46
 netlink_insert                | -49
 netlink_autobind              | -94
 netlink_clear_multicast_users | -48
 netlink_bind                  | -55
 netlink_setsockopt            | -54
 netlink_release               | -86
 netlink_kernel_create         | -47
 netlink_change_ngroups        | -56
 9 functions changed, 535 bytes removed, diff: -535

net/netlink/af_netlink.c:
 netlink_table_ungrab | +53
 1 function changed, 53 bytes added, diff: +53

net/netlink/af_netlink.o:
 10 functions changed, 53 bytes added, 535 bytes removed, diff: -482

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/netlink/af_netlink.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index be07f1b..21f9e30 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -193,7 +193,7 @@ static void netlink_table_grab(void)
 	}
 }
 
-static inline void netlink_table_ungrab(void)
+static void netlink_table_ungrab(void)
 	__releases(nl_table_lock)
 {
 	write_unlock_irq(&nl_table_lock);
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 7/8] [PKTGEN]: Kill dead static inlines
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/core/pktgen.c |   94 -
 1 files changed, 0 insertions(+), 94 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index ede1fea..ebfb126 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -397,62 +397,6 @@ struct pktgen_thread {
 #define REMOVE 1
 #define FIND 0
 
-/* This code works around the fact that do_div cannot handle two 64-bit
-   numbers, and regular 64-bit division doesn't work on x86 kernels.
-   --Ben
-*/
-
-#define PG_DIV 0
-
-/* This was emailed to LMKL by: Chris Caputo <[EMAIL PROTECTED]>
- * Function copied/adapted/optimized from:
- *
- * nemesis.sourceforge.net/browse/lib/static/intmath/ix86/intmath.c.html
- *
- * Copyright 1994, University of Cambridge Computer Laboratory
- * All Rights Reserved.
- *
- */
-static inline s64 divremdi3(s64 x, s64 y, int type)
-{
-	u64 a = (x < 0) ? -x : x;
-	u64 b = (y < 0) ? -y : y;
-	u64 res = 0, d = 1;
-
-	if (b > 0) {
-		while (b < a) {
-			b <<= 1;
-			d <<= 1;
-		}
-	}
-
-	do {
-		if (a >= b) {
-			a -= b;
-			res += d;
-		}
-		b >>= 1;
-		d >>= 1;
-	}
-	while (d);
-
-	if (PG_DIV == type) {
-		return (((x ^ y) & (1ll << 63)) == 0) ? res : -(s64) res;
-	} else {
-		return ((x & (1ll << 63)) == 0) ? a : -(s64) a;
-	}
-}
-
-/* End of hacks to deal with 64-bit math on x86 */
-
-/** Convert to milliseconds */
-static inline __u64 tv_to_ms(const struct timeval *tv)
-{
-	__u64 ms = tv->tv_usec / 1000;
-	ms += (__u64) tv->tv_sec * (__u64) 1000;
-	return ms;
-}
-
 /** Convert to micro-seconds */
 static inline __u64 tv_to_us(const struct timeval *tv)
 {
@@ -461,39 +405,6 @@ static inline __u64 tv_to_us(const struct timeval *tv)
 	return us;
 }
 
-static inline __u64 pg_div(__u64 n, __u32 base)
-{
-	__u64 tmp = n;
-	do_div(tmp, base);
-	/* printk("pktgen: pg_div, n: %llu  base: %d  rv: %llu\n",
-	   n, base, tmp); */
-	return tmp;
-}
-
-static inline __u64 pg_div64(__u64 n, __u64 base)
-{
-	__u64 tmp = n;
-/*
- * How do we know if the architecture we are running on
- * supports division with 64 bit base?
- *
- */
-#if defined(__sparc_v9__) || defined(__powerpc64__) || defined(__alpha__) || defined(__x86_64__) || defined(__ia64__)
-
-	do_div(tmp, base);
-#else
-	tmp = divremdi3(n, base, PG_DIV);
-#endif
-	return tmp;
-}
-
-static inline __u64 getCurMs(void)
-{
-	struct timeval tv;
-	do_gettimeofday(&tv);
-	return tv_to_ms(&tv);
-}
-
 static inline __u64 getCurUs(void)
 {
 	struct timeval tv;
@@ -501,11 +412,6 @@ static inline __u64 getCurUs(void)
 	return tv_to_us(&tv);
 }
 
-static inline __u64 tv_diff(const struct timeval *a, const struct timeval *b)
-{
-	return tv_to_us(a) - tv_to_us(b);
-}
-
 /* old include end */
 
 static char version[] __initdata = VERSION;
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 6/8] [NETFILTER] xt_policy.c: kill some bloat
net/netfilter/xt_policy.c:
 policy_mt | -906
 1 function changed, 906 bytes removed, diff: -906

net/netfilter/xt_policy.c:
 match_xfrm_state | +427
 1 function changed, 427 bytes added, diff: +427

net/netfilter/xt_policy.o:
 2 functions changed, 427 bytes added, 906 bytes removed, diff: -479

Alternatively, this could be done by combining the identical parts of match_policy_in/out().

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/netfilter/xt_policy.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/netfilter/xt_policy.c b/net/netfilter/xt_policy.c
index 46ee7e8..45731ca 100644
--- a/net/netfilter/xt_policy.c
+++ b/net/netfilter/xt_policy.c
@@ -33,7 +33,7 @@ xt_addr_cmp(const union xt_policy_addr *a1, const union xt_policy_addr *m,
 	return false;
 }
 
-static inline bool
+static bool
 match_xfrm_state(const struct xfrm_state *x, const struct xt_policy_elem *e,
 		 unsigned short family)
 {
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 3/8] [XFRM] xfrm_policy: kill some bloat
net/xfrm/xfrm_policy.c:
 xfrm_audit_policy_delete | -692
 xfrm_audit_policy_add    | -692
 2 functions changed, 1384 bytes removed, diff: -1384

net/xfrm/xfrm_policy.c:
 xfrm_audit_common_policyinfo | +704
 1 function changed, 704 bytes added, diff: +704

net/xfrm/xfrm_policy.o:
 3 functions changed, 704 bytes added, 1384 bytes removed, diff: -680

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/xfrm/xfrm_policy.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 534b29e..47219f9 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2367,8 +2367,8 @@ void __init xfrm_init(void)
 }
 
 #ifdef CONFIG_AUDITSYSCALL
-static inline void xfrm_audit_common_policyinfo(struct xfrm_policy *xp,
-						struct audit_buffer *audit_buf)
+static void xfrm_audit_common_policyinfo(struct xfrm_policy *xp,
+					 struct audit_buffer *audit_buf)
 {
 	struct xfrm_sec_ctx *ctx = xp->security;
 	struct xfrm_selector *sel = &xp->selector;
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] [HTB]: htb_classid is dead static inline
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/sched/sch_htb.c |    4 -
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 72beb66..6a2352c 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -214,10 +214,6 @@ static inline struct htb_class *htb_find(u32 handle, struct Qdisc *sch)
  * then finish and return direct queue.
  */
 #define HTB_DIRECT (struct htb_class*)-1
-static inline u32 htb_classid(struct htb_class *cl)
-{
-	return (cl && cl != HTB_DIRECT) ? cl->classid : TC_H_UNSPEC;
-}
 
 static struct htb_class *htb_classify(struct sk_buff *skb, struct Qdisc *sch,
 				      int *qerr)
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] [NET] core/utils.c: digit2bin is dead static inline
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/core/utils.c |   11 ---
 1 files changed, 0 insertions(+), 11 deletions(-)

diff --git a/net/core/utils.c b/net/core/utils.c
index 34459c4..8031eb5 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -91,17 +91,6 @@ EXPORT_SYMBOL(in_aton);
 #define IN6PTON_NULL		0x20000000	/* first/tail */
 #define IN6PTON_UNKNOWN		0x40000000
 
-static inline int digit2bin(char c, int delim)
-{
-	if (c == delim || c == '\0')
-		return IN6PTON_DELIM;
-	if (c == '.')
-		return IN6PTON_DOT;
-	if (c >= '0' && c <= '9')
-		return (IN6PTON_DIGIT | (c - '0'));
-	return IN6PTON_UNKNOWN;
-}
-
 static inline int xdigit2bin(char c, int delim)
 {
 	if (c == delim || c == '\0')
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
On Sat, 12 Jan 2008, Herbert Xu wrote: > On Sat, Jan 12, 2008 at 09:40:17AM +0000, Ilpo Järvinen wrote: > > Your emails are now using UTF-8 encoding but it's still declaring > ISO-8859-1 as the charset. Thanks for trying to help, but my situation is such that I think it got you confused as well (this kind of mixed encoding is beyond my skills, really)... :-) Besides, I wouldn't mind having incorrect characters in my name, I'm just used to that, but somebody else wasn't that happy about it. Here's one example... From: "=?utf-8?q?Ilpo_J=C3=A4rvinen?=" <[EMAIL PROTECTED]> ... Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Is something still needed besides these to declare the email utf-8? > So you probably want to fix that up or your name may show up as Jävinen > on the reader's screen. Did you actually see this? I'd expect to see that as well, but no, the From is correctly decoded by my ISO-8859-1'ish MUA; aha, it seems that I still have something to do to deal with the Signed-off line. ...Maybe I'll just fall back to changing my last name, it's the only fool-proof solution... ;-) -- i.
Re: [PATCH 1/8] [TCP]: Uninline tcp_set_state
On Sat, 12 Jan 2008, Stephen Hemminger wrote: > On Sat, 12 Jan 2008 11:40:10 +0200 > "Ilpo Järvinen" <[EMAIL PROTECTED]> wrote: ...snip... > > built-in.o: > > 12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445 ...snip... > > include/net/tcp.h | 35 +-- > > net/ipv4/tcp.c| 35 +++ > > 2 files changed, 36 insertions(+), 34 deletions(-) > > > > diff --git a/include/net/tcp.h b/include/net/tcp.h > > index 48081ad..306580c 100644 > > --- a/include/net/tcp.h > > +++ b/include/net/tcp.h > > @@ -926,40 +926,7 @@ static const char *statename[]={ > > "Close Wait","Last ACK","Listen","Closing" > > }; > > #endif > > - > > -static inline void tcp_set_state(struct sock *sk, int state) > > -{ > > - int oldstate = sk->sk_state; > > - > > - switch (state) { > > - case TCP_ESTABLISHED: > > - if (oldstate != TCP_ESTABLISHED) > > - TCP_INC_STATS(TCP_MIB_CURRESTAB); > > - break; > > - > > - case TCP_CLOSE: > > - if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED) > > - TCP_INC_STATS(TCP_MIB_ESTABRESETS); > > - > > - sk->sk_prot->unhash(sk); > > - if (inet_csk(sk)->icsk_bind_hash && > > - !(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) > > - inet_put_port(&tcp_hashinfo, sk); > > - /* fall through */ > > - default: > > - if (oldstate==TCP_ESTABLISHED) > > - TCP_DEC_STATS(TCP_MIB_CURRESTAB); > > - } > > - > > - /* Change state AFTER socket is unhashed to avoid closed > > -* socket sitting in hash tables. > > -*/ > > - sk->sk_state = state; > > - > > -#ifdef STATE_TRACE > > - SOCK_DEBUG(sk, "TCP sk=%p, State %s -> %s\n",sk, > > statename[oldstate],statename[state]); > > -#endif > > -} > > > > > Since the function is called with a constant state, I guess the assumption > was that gcc would be smart enough to only include the code needed, it looks > like > either code was bigger or the compiler was dumber than expected I'd guess that compiler was just dumber... 
:-) It might be an interesting experiment to convert it to if's and see if it would make a difference; maybe it just gets confused by the switch or something. Besides, it's not always that obvious that gcc is able to determine "the constant state", considering e.g. the complexity in the cases with tcp_rcv_synsent_state_process, tcp_close_state, tcp_done. In such cases uninlining should be done, and gcc is probably not able to mix both cases nicely for a single function? However, after looking a bit, I'm partially leaning towards the other option too: > > tcp_done| -145 > > tcp_disconnect | -141 ...These call tcp_set_state just _once_, while this calls it twice: > > tcp_fin | -86 ...Obviously the compiler was able to perform some reductions, but 43 bytes per inlining still seems a rather high number. -- i.
Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
On Sat, 12 Jan 2008, David Miller wrote: > From: "Ilpo_Järvinen" <[EMAIL PROTECTED]> > Date: Sat, 12 Jan 2008 14:59:50 +0200 (EET) > > > ...Maybe I just fall-back to changing my last name, it's the only > > full-proof solution... ;-) > > Don't do this! Otherwise I won't have a frequent test case to make > sure my patch applying scripts are working properly. :- So which test case do you prefer? :-) Is iso-8859-1 from+content ok? Or should I keep trying to live with mixed utf-8, which I didn't even get fully working last time because git-send-email is probably either too dumb or too intelligent (I'm not even sure which), but you were able to correct it with your tools so the flawed signed-off never entered the git logs as incorrectly formatted :-). I'd prefer sending them as iso-8859-1 compliant (and I guess you are able to test your fix-to-utf-8 machinery with it as well :-)), as it would also make my mails compatible with other people's git apply tools you're not using (otherwise I'd probably forget to change it occasionally when interacting with people other than you). -- i.
Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
On Sun, 13 Jan 2008, David Miller wrote: > From: "Ilpo_Järvinen" <[EMAIL PROTECTED]> > Date: Mon, 14 Jan 2008 09:43:08 +0200 (EET) > > > I'd prefer sending them as iso-8859-1 compliant (and I guess you are able > > to test your fix-to-utf-8 machinery with it as well :-)), as it would also > > make my mails compatible with other people's git apply tools you're not > > using (otherwise I'd probably forget to change it occassionally when > > interacting with others than you). > > For now either way is fine with me. If the situation changes I'll > let you know. Ok, I'll stay with iso-8859-1; it's something that is known to work from my end. Thanks anyway for fixing it; it wasn't any big deal for me at any point, but people started asking me privately to correct it, which I of course couldn't... :-) > I'm surprised git-send-email can't get it purely utf8 correctly. The problem is that my system is ISO natively, so git-send-email might encode an ISO-native .patch file's content while sending, which in this case was intentionally already utf-8. It might surprise you, but it wasn't long ago that git-send-email couldn't care less e.g. about header encoding, and I got rejects from netdev due to my name not being encoded properly. I have 1.5.0.6 currently and it still seemed to fail to encode the Cc addresses it adds from signed-offs unless I explicitly ask it not to do that (I explicitly ask for a specific, encoded, From header anyway because it was broken at some point, and my sending template is a copy-paste originating from that time). There were some recent fixes in git's logs regarding that encoding, so I intend to check whether a later g-s-e is more able, and if it isn't I'll report it to the git folks. > I wonder if there is some issue with how it gets your name > string for the commit author etc. I've had it working well since the encoding header got added relatively recently (it wasn't available in the early dawn of the git era); before that it was just a mess locally.
Funnily enough, you were able to magically mangle my emails to utf-8'ed commits nicely back then, so I got a "fixed" commit back after an RTT :-). > I wonder if getting it into your global GIT config file in proper > UTF8 encoding would fix things. > > Put something like this into ~/.gitconfig > > > [user] > name = Ilpo Järvinen > email = [EMAIL PROTECTED] > I have this. In addition I have this (required to make my local system consistent): [i18n] commitencoding = ISO-8859-1 The problem was just that the API (or better, ABI) between us wasn't properly working :-)). While Herbert was working as the replacement-Dave in November, correct commit entries were created, so git has been working fine (I guess he used git applying tools instead of handmade scripts, and they handle email correctly based on its encoding). I tried logOutputEncoding = utf-8 in the last patch sets I sent (I could now remove it again), but the git-send-email problem appeared with it because the system is ISO natively. > The GIT maintainer is Finnish which makes this situation even > more perplexing to me, you might want to discuss it with him :-) Junio? Never heard that as a Finnish name... ;-) Perhaps git-send-email wasn't written by that Finnish guy... :-) ...Besides, that Finnish git maintainer doesn't have any funny characters in his name... ;-) Thanks anyway for the tips & all, I think we have it working now and I can return to inlines and rexmit_skb_hint things & other TCP stuff rather than this hindrance. I've got some interesting results from the net header inline checks I ran overnight :-).
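For reference, the pieces discussed in this exchange would combine into a ~/.gitconfig along these lines. This is a hypothetical sketch for a natively ISO-8859-1 system, with the address as a placeholder:

```ini
[user]
	name = Ilpo Järvinen
	email = [EMAIL PROTECTED]

[i18n]
	# Store commit messages in the system's native encoding; git records
	# the encoding in each commit so later tools can convert correctly.
	commitencoding = ISO-8859-1
	# Optionally re-encode log/format-patch output for local display.
	logoutputencoding = ISO-8859-1
```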
Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22
On Fri, 11 Jan 2008, Zhang, Yanmin wrote: > On Wed, 2008-01-09 at 17:35 +0800, Zhang, Yanmin wrote: > > The regression is: > > 1)stoakley with 2 qual-core processors: 11%; > > 2)Tulsa with 4 dual-core(+hyperThread) processors:13%; > I have new update on this issue and also cc to netdev maillist. > Thank David Miller for pointing me the netdev maillist. > > > > > The test command is: > > #sudo taskset -c 7 ./netserver > > #sudo taskset -c 0 ./netperf -t TCP_RR -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 > > -- -r 1,1 > > > > As a matter of fact, 2.6.23 has about 6% regression and 2.6.24-rc's > > regression is between 16%~11%. > > > > I tried to use bisect to locate the bad patch between 2.6.22 and 2.6.23-rc1, > > but the bisected kernel wasn't stable and went crazy. TCP work in that range is very much non-existent. Using git-reset to select a nearby merge point instead of the default commit where bisection lands might help in case the bisected kernel breaks. Also, limiting bisection to a subsystem might reduce the probability of breakage (and might at least be able to narrow it down quite a lot), e.g. git bisect start net/ -- i.
Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22
On Mon, 14 Jan 2008, Ilpo Järvinen wrote: > On Fri, 11 Jan 2008, Zhang, Yanmin wrote: > > > On Wed, 2008-01-09 at 17:35 +0800, Zhang, Yanmin wrote: > > > > > > As a matter of fact, 2.6.23 has about 6% regression and 2.6.24-rc's > > > regression is between 16%~11%. > > > > > > I tried to use bisect to locate the bad patch between 2.6.22 and > > > 2.6.23-rc1, > > > but the bisected kernel wasn't stable and went crazy. > > TCP work between that is very much non-existing. I _really_ meant 2.6.22 - 2.6.23-rc1, not 2.6.24-rc1, in case you had a typo there; that's not uncommon when typing kernel versions... :-) -- i.
Re: [PATCH 2/4] dsmark: get rid of trivial function
On Mon, 21 Jan 2008, Patrick McHardy wrote: > Stephen Hemminger wrote: > > Replace loop in dsmark_valid_indices with equivalent bit math. > > > > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> > > > > --- a/net/sched/sch_dsmark.c2008-01-20 13:07:58.0 -0800 > > +++ b/net/sched/sch_dsmark.c2008-01-20 13:22:54.0 -0800 > > @@ -45,13 +45,8 @@ struct dsmark_qdisc_data { > > > > static inline int dsmark_valid_indices(u16 indices) > > { > > - while (indices != 1) { > > - if (indices & 1) > > - return 0; > > - indices >>= 1; > > - } > > - > > - return 1; > > + /* Must have only one bit set */ > > + return (indices & (indices - 1)) == 0; Isn't there some magic under include/linux to do that, btw? I suppose that if the caller-side zero check is pushed down there too, is_power_of_2() is a 100% match? :-) > hweight seems easier to understand, it took me a bit > to realize that the comment matches the code :) In addition, the original seems to loop forever when given zero indices, but luckily that was checked at the caller site already... -- i.
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Mon, 21 Jan 2008, Dave Young wrote: > Please see the kernel messages following,(trigged while using some qemu > session) > BTW, seems there's some e100 error message as well. > > PCI: Setting latency timer of device :00:1b.0 to 64 > e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI > e100: Copyright(c) 1999-2006 Intel Corporation > ACPI: PCI Interrupt :03:08.0[A] -> GSI 20 (level, low) -> IRQ 20 > modprobe:2331 conflicting cache attribute efaff000-efb0 uncached<->default > e100: :03:08.0: e100_probe: Cannot map device registers, aborting. > ACPI: PCI interrupt for device :03:08.0 disabled > e100: probe of :03:08.0 failed with error -12 > eth0: setting full-duplex. > [ cut here ] > WARNING: at net/ipv4/tcp_input.c:2169 tcp_mark_head_lost+0x121/0x150() > Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss eeprom e100 psmouse snd_hda_intel snd_pcm snd_timer btusb rtc_cmos thermal bluetooth rtc_core serio_raw intel_agp button processor sg snd rtc_lib i2c_i801 evdev agpgart soundcore dcdbas 3c59x pcspkr snd_page_alloc > Pid: 0, comm: swapper Not tainted 2.6.24-rc8-mm1 #4 > [] ? printk+0x0/0x20 > [] warn_on_slowpath+0x54/0x80 > [] ? ip_finish_output+0x128/0x2e0 > [] ? ip_output+0xe7/0x100 > [] ? ip_local_out+0x18/0x20 > [] ? ip_queue_xmit+0x3dc/0x470 > [] ? _spin_unlock_irqrestore+0x5e/0x70 > [] ? check_pad_bytes+0x61/0x80 > [] tcp_mark_head_lost+0x121/0x150 > [] tcp_update_scoreboard+0x4c/0x170 > [] tcp_fastretrans_alert+0x48a/0x6b0 > [] tcp_ack+0x1b3/0x3a0 > [] tcp_rcv_established+0x3eb/0x710 > [] tcp_v4_do_rcv+0xe5/0x100 > [] tcp_v4_rcv+0x5db/0x660 Doh, once more these S+L things...; the rest are symptoms of the first problem. What is strange is that it didn't show up until now; the last TCP changes that could have some significance are from early Dec/Nov. Is there some reason why you haven't seen this before now (e.g., not tested with a similar cfg or so)?
I'm a bit worried about its reproducibility if it takes this long to show up... -- i.
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Tue, 22 Jan 2008, Dave Young wrote: > On Jan 22, 2008 12:37 PM, Dave Young <[EMAIL PROTECTED]> wrote: > > > > On Jan 22, 2008 5:14 AM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote: > > > > > > On Mon, 21 Jan 2008, Dave Young wrote: > > > > > > > Please see the kernel messages following,(trigged while using some qemu > > > > session) > > > > BTW, seems there's some e100 error message as well. > > > > > > > > PCI: Setting latency timer of device :00:1b.0 to 64 > > > > e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI > > > > e100: Copyright(c) 1999-2006 Intel Corporation > > > > ACPI: PCI Interrupt :03:08.0[A] -> GSI 20 (level, low) -> IRQ 20 > > > > modprobe:2331 conflicting cache attribute efaff000-efb0 > > > > uncached<->default > > > > e100: :03:08.0: e100_probe: Cannot map device registers, aborting. > > > > ACPI: PCI interrupt for device :03:08.0 disabled > > > > e100: probe of :03:08.0 failed with error -12 > > > > eth0: setting full-duplex. > > > > [ cut here ] > > > > WARNING: at net/ipv4/tcp_input.c:2169 tcp_mark_head_lost+0x121/0x150() > > > > Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq > > > > snd_seq_device snd_pcm_oss snd_mixer_oss eeprom e100 psmouse > > > > snd_hda_intel snd_pcm snd_timer btusb rtc_cmos thermal bluetooth > > > > rtc_core serio_raw intel_agp button processor sg snd rtc_lib i2c_i801 > > > > evdev agpgart soundcore dcdbas 3c59x pcspkr snd_page_alloc > > > > Pid: 0, comm: swapper Not tainted 2.6.24-rc8-mm1 #4 > > > > [] ? printk+0x0/0x20 > > > > [] warn_on_slowpath+0x54/0x80 > > > > [] ? ip_finish_output+0x128/0x2e0 > > > > [] ? ip_output+0xe7/0x100 > > > > [] ? ip_local_out+0x18/0x20 > > > > [] ? ip_queue_xmit+0x3dc/0x470 > > > > [] ? _spin_unlock_irqrestore+0x5e/0x70 > > > > [] ? 
check_pad_bytes+0x61/0x80 > > > > [] tcp_mark_head_lost+0x121/0x150 > > > > [] tcp_update_scoreboard+0x4c/0x170 > > > > [] tcp_fastretrans_alert+0x48a/0x6b0 > > > > [] tcp_ack+0x1b3/0x3a0 > > > > [] tcp_rcv_established+0x3eb/0x710 > > > > [] tcp_v4_do_rcv+0xe5/0x100 > > > > [] tcp_v4_rcv+0x5db/0x660 > > > > > > Doh, once more these S+L things..., the rest are symptom of the first > > > problem. > > What is the S+L thing? Could you explain a bit? It means that one of the skbs is both SACKed and marked as LOST at the same time in the counters (it might be due to a miscount of lost/sacked_out too, not necessarily in the ->sacked bits). Such a state is logically invalid because it would mean that the sender thinks that the same packet both reached the receiver and is lost in the network. Traditionally TCP has just silently "corrected" over-estimates (sacked_out+lost_out > packets_out). I changed this a couple of releases ago because those over-estimates often are due to bugs that should be fixed (there have been a couple of them, but it has been very quiet on this front for a long time, months or even half a year already; but I might have broken something with the early Dec changes). These problems may originate from a bug that occurred a number of ACKs before the WARN_ON triggered; therefore they are a bit tricky to track, and those WARN_ONs serve just for alerting purposes and usually do not point out where the bug actually occurred. I usually just ask people to include an exhaustive verifier which compares the ->sacked bitmaps with the sacked/lost_out counters and reports immediately when the problem shows up, rather than waiting for the cheaper S+L check we do in the WARN_ON to trigger. I tried to collect a tracking patch from the previous efforts (hopefully got it right after modifications). > > I'm a bit worried about its > > > reproducability if it takes this far to see it... > > > > > It's trigged again in my pc, just while using firefox. ...Good, then there's some chance to catch it. -- i.
[PATCH] [TCP]: debug S+L --- include/net/tcp.h |8 +++- net/ipv4/tcp_input.c |6 +++ net/ipv4/tcp_ipv4.c | 101 + net/ipv4/tcp_output.c | 21 +++--- 4 files changed, 129 insertions(+), 7 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 7de4ea3..0685035 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics); #define TCP_ADD_STATS_BH(field, val
Re: WARNING, tcp_fastretrans_alert, rc6-git11
On Tue, 22 Jan 2008, Denys Fedoryshchenko wrote: > Just got on one of proxies, under high load. > It is a bit old rc, so probably my report not interesting, but since it is > production machines, i cannot change too often. > Kernel is 2.6.24-rc6-git11 It's not at all useless; there haven't been any TCP changes lately, so this is still very much valid. > Some sysctl adjustments done. Please tell me if need more information. The TCP-related sysctl settings would be nice to know (I don't care much about mem sizes, but other tweaks affecting TCP features are nice to know). ...snip... > [9561199.893090] WARNING: at net/ipv4/tcp_input.c:2391 tcp_fastretrans_alert() This check includes fixing things up if there's a miscount (a couple of releases ago we didn't even care to print that but had inaccurate fackets_out as a feature; there may still be some left-overs, though I've tried to track them down). To find out the cause more accurately (this WARN_ON is just for alerting), I'd need to add rather exhaustive searching per ACK, which is unlikely to be acceptable for you. Luckily, Dave Young reported one problem lately (though he was running a brand new mm) which could be caused by the same problem as this, so it's not that bad even though you couldn't run the skb state verifying. Thanks anyway for the report. -- i.
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Tue, 22 Jan 2008, David Miller wrote: > From: "Dave Young" <[EMAIL PROTECTED]> > Date: Wed, 23 Jan 2008 09:44:30 +0800 > > > On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote: > > > [PATCH] [TCP]: debug S+L > > > > Thanks, If there's new findings I will let you know. > > Thanks for helping with this bug Dave. I noticed btw that the thing might (is likely to) spuriously trigger at WARN_ON(sacked != tp->sacked_out); because those won't be equal when SACK is not enabled. If that does happen too often, I'll send a fixed patch for it; still, the fact that I print tp->rx_opt.sack_ok already allows identification of those cases, as it's zero when SACK is not enabled. Just ask if you need the updated debug patch. -- i.
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Wed, 23 Jan 2008, Dave Young wrote: > On Jan 23, 2008 3:41 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote: > > > > On Tue, 22 Jan 2008, David Miller wrote: > > > > > From: "Dave Young" <[EMAIL PROTECTED]> > > > Date: Wed, 23 Jan 2008 09:44:30 +0800 > > > > > > > On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote: > > > > > [PATCH] [TCP]: debug S+L > > > > > > > > Thanks, If there's new findings I will let you know. > > > > > > Thanks for helping with this bug Dave. > > > > I noticed btw that there thing might (is likely to) spuriously trigger at > > WARN_ON(sacked != tp->sacked_out); because those won't be equal when SACK > > is not enabled. If that does happen too often, I send a fixed patch for > > it, yet, the fact that I print print tp->rx_opt.sack_ok allows > > identification of those cases already as it's zero when SACK is not > > enabled. > > > > Just ask if you need the updated debug patch. > > Thanks, please send, I would like to get it. There you go. I fixed the non-SACK case by adding tcp_is_sack checks there, and also added two verify calls to tcp_ack to see if there's corruption outside of TCP. -- i.
[PATCH] [TCP]: debug S+L --- include/net/tcp.h |8 +++- net/ipv4/tcp_input.c | 10 + net/ipv4/tcp_ipv4.c | 101 + net/ipv4/tcp_output.c | 21 +++--- 4 files changed, 133 insertions(+), 7 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 7de4ea3..0685035 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics); #define TCP_ADD_STATS_BH(field, val) SNMP_ADD_STATS_BH(tcp_statistics, field, val) #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, field, val) +extern voidtcp_verify_wq(struct sock *sk); + extern voidtcp_v4_err(struct sk_buff *skb, u32); extern voidtcp_shutdown (struct sock *sk, int how); @@ -768,7 +770,11 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk) } /* Use define here intentionally to get WARN_ON location shown at the caller */ -#define tcp_verify_left_out(tp)WARN_ON(tcp_left_out(tp) > tp->packets_out) +#define tcp_verify_left_out(tp)\ + do { \ + WARN_ON(tcp_left_out(tp) > tp->packets_out); \ + tcp_verify_wq((struct sock *)tp); \ + } while(0) extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh); extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index fa2c85c..cdacf70 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2645,6 +2645,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag) if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk))) tcp_update_scoreboard(sk, fast_rexmit); tcp_cwnd_down(sk, flag); + + WARN_ON(tcp_write_queue_head(sk) == NULL); + WARN_ON(!tp->packets_out); + tcp_xmit_retransmit_queue(sk); } @@ -2848,6 +2852,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets) tcp_clear_all_retrans_hints(tp); } + tcp_verify_left_out(tp); + if (skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) flag |= FLAG_SACK_RENEGING; @@ -3175,6 +3181,8 @@ static int 
tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) prior_fackets = tp->fackets_out; prior_in_flight = tcp_packets_in_flight(tp); + tcp_verify_left_out(tp); + if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) { /* Window is constant, pure forward advance. * No more checks are required. @@ -3237,6 +3245,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) dst_confirm(sk->sk_dst_cache); + tcp_verify_left_out(tp); + return 1; no_queue: diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 9aea88b..7e8ab40 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -108,6 +108,107 @@ struct inet_hashinfo __cacheline_aligned tcp_hashinfo = { .lhash_wait = __WAIT_QUEUE_HEAD_INITIALIZER(tcp_hashinfo.lhash_wait), }; +void tcp_print_queue(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct sk_buff *skb; + char s[50+1]; + char h[50+1]; + int idx = 0; + int i; + + tcp_for_write_queue(skb, sk) { + if (skb == tcp_send_he
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Krishna Kumar2 wrote: > David Miller <[EMAIL PROTECTED]> wrote on 01/23/2008 01:27:23 PM: > > > > iperf with multiple threads almost always gets these 4, *especially* > when I > > > do some batching :). > > > > > > static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int > flag) > > > { > > > ... > > > if (WARN_ON(!tp->sacked_out && tp->fackets_out)) > > > tp->fackets_out = 0; > > > ... > > > } > > > > Does this assertion show up first or do you get the other TCP > > ones first? It might be important, in that if you get the > > others ones first that corrupted state might be what leads to > > this one. > > Hi Dave, > > I looked at my *old* messages file and found this assert (2506) was first > to hit (atleast in > two messages file). It hit 5 times, then I got a different one that I had > not reported earlier: > > "KERNEL: assertion (packets <= tp->packets_out) failed at > net/ipv4/tcp_input.c (2139)" > > (though this was hidden in my report under the panic for tcp_input.c:2528. > > Then another two thousand times of the 2506 asserts. 
> > Today I installed the latest untouched kernel, rebooted system and got the > following errors > in sequence, but no 2506 errors (which I have always got when running > batching in the last > 2-3 weeks): > > Jan 22 02:07:55 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:56 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:56 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:57 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:58 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:59 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:59 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:00 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:01 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:01 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:02 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:03 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:03 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:04 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:05 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:06 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:06 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:07 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 > Jan 22 02:08:07 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:08 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:09 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:10 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:10 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:11 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:12 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 
02:08:12 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:13 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:14 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:15 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:15 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 > Jan 22 02:08:16 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 > Jan 22 02:08:16 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 > Jan 22 02:08:17 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:18 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:18 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:19 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:19 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > > and so on for another 700 counts. > > The unique asserts are: > 1767: tcp_verify_left_out (from tcp_entry_frto) > 2169: tcp_verify_left_out (from tcp_mark_head_lost) > 2528: tcp_verify_left_out (from tcp_fastretrans_alert) > 3063: tcp_verify_left_out (from tcp_process_frto) > (where 2169 seems to preceed any other asserts) Once you get one of these, you'll get a large number of them; maybe I should just change it to WARN_ON_ONCE to stop confusing people with the rest. > The other two asserts that I got only with batching are: > 2139: BUG_TRAP(packets <= tp->packets_out); (in tcp_mark_head_lost) > 2506: WARN_ON(!tp->sacked_out && tp->fackets_out) (in > tcp_fastretrans_alert) > (where 2506 always seems to preceed any other asserts). It's almost impossible to know which of these is the main cause and the first occurring, due to reasons I'll not copy here. What a strange thing that it has been super quiet on this front until now, when everybody is seeing it; could there be something unr
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Krishna Kumar2 wrote: > Hi Ilpo, > > > It's almost impossible to know which of these is the main cause and the > > first occuring due to reasons I'll not copy here. What a strange thing > > that it has been super quiet on this front until now everybody is seeing > > it, could there be something unrelated to TCP which has broken it all > > recently? > > I have been getting this for atleast 3 weeks but I was quiet since those > were kernels that I had modified. Since you can easily reproduce it, let's just figure out what's causing it the hard way, rather than digging through the endless git logs... :-) > > Good thing is that you seem to be able to reproduce it were nicely. > > Please try with this beauty below... Hopefully got it correctly matching > > to mainline, in contrast to version I sent Dave Y., there's some added > > candy which catches highest_sack corruption as well, at least it compiles > > > already :-). > > There were couple of patch apply failures in .c files which I fixed by > hand. > But when compiling, I got these errors (I am using DM's 2.6.24-rc7 kernel, > net-2.6.25.git): Well, that's annoying, you didn't mention net-2.6.25 back then; it sure is incompatible with it, like the patch title already said... :-) > net/ipv4/tcp_output.c: In function 'tcp_push_one': > net/ipv4/tcp_output.c:1573: error: 'tp' undeclared (first use in this > function) > net/ipv4/tcp_output.c:1573: error: (Each undeclared identifier is reported > only once > net/ipv4/tcp_output.c:1573: error: for each function it appears in.) ...This is either due to a mismerge or due to your modifications. Let's re-iterate; tcp_{ipv4,input,output}.o compiled ok for me... -- i.
[PATCH] [TCP]: debug S+L (for net-2.5.26 / mm, incompatible with mainline)
---
 include/net/tcp.h     |    8 +++-
 net/ipv4/tcp_input.c  |   10 +
 net/ipv4/tcp_ipv4.c   |  107 +
 net/ipv4/tcp_output.c |   21 +++---
 4 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7de4ea3..0685035 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 #define TCP_ADD_STATS_BH(field, val)	SNMP_ADD_STATS_BH(tcp_statistics, field, val)
 #define TCP_ADD_STATS_USER(field, val)	SNMP_ADD_STATS_USER(tcp_statistics, field, val)
 
+extern void	tcp_verify_wq(struct sock *sk);
+
 extern void	tcp_v4_err(struct sk_buff *skb, u32);
 
 extern void	tcp_shutdown (struct sock *sk, int how);
@@ -768,7 +770,11 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk)
 }
 
 /* Use define here intentionally to get WARN_ON location shown at the caller */
-#define tcp_verify_left_out(tp)	WARN_ON(tcp_left_out(tp) > tp->packets_out)
+#define tcp_verify_left_out(tp)	\
+	do { \
+		WARN_ON(tcp_left_out(tp) > tp->packets_out); \
+		tcp_verify_wq((struct sock *)tp); \
+	} while(0)
 
 extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
 extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fa2c85c..cdacf70 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2645,6 +2645,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 	if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
 		tcp_update_scoreboard(sk, fast_rexmit);
 	tcp_cwnd_down(sk, flag);
+
+	WARN_ON(tcp_write_queue_head(sk) == NULL);
+	WARN_ON(!tp->packets_out);
+
 	tcp_xmit_retransmit_queue(sk);
 }
 
@@ -2848,6 +2852,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets)
 		tcp_clear_all_retrans_hints(tp);
 	}
 
+	tcp_verify_left_out(tp);
+
 	if (skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
 		flag |= FLAG_SACK_RENEGING;
 
@@ -3175,6 +3181,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	prior_fackets = tp->fackets_out;
 	prior_in_flight = tcp_packets_in_flight(tp);
 
+	tcp_verify_left_out(tp);
+
 	if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
 		/* Window is constant, pure forward advance.
 		 * No more checks are required.
@@ -3237,6 +3245,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
 		dst_confirm(sk->sk_dst_cache);
 
+	tcp_verify_left_out(tp);
+
 	return 1;
 
 no_queue:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9aea88b..c95682e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -108,6 +108,113 @@ struct inet_hashinfo __cacheline_aligned tcp_hashinfo = {
 	.lhash_wait = __WAIT_QUEUE_HEAD_INITIALIZER(tcp_hashinfo.lhas
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Wed, 23 Jan 2008, Ilpo Järvinen wrote:

> On Wed, 23 Jan 2008, Dave Young wrote:
>
> > On Jan 23, 2008 3:41 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> > >
> > > On Tue, 22 Jan 2008, David Miller wrote:
> > >
> > > > From: "Dave Young" <[EMAIL PROTECTED]>
> > > > Date: Wed, 23 Jan 2008 09:44:30 +0800
> > > >
> > > > > On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> > > > > > [PATCH] [TCP]: debug S+L
> > > > >
> > > > > Thanks, if there are new findings I will let you know.
> > > >
> > > > Thanks for helping with this bug, Dave.
> > >
> > > I noticed btw that the thing might (is likely to) spuriously trigger at
> > > WARN_ON(sacked != tp->sacked_out); because those won't be equal when
> > > SACK is not enabled. If that happens too often, I'll send a fixed patch
> > > for it; yet the fact that I print tp->rx_opt.sack_ok already allows
> > > identification of those cases, as it's zero when SACK is not enabled.
> > >
> > > Just ask if you need the updated debug patch.
> >
> > Thanks, please send, I would like to get it.
>
> There you go. I fixed the non-SACK case by adding tcp_is_sack checks there
> and also added two verifys to tcp_ack to see if there's corruption outside
> of TCP.

There's some discussion about a problem that is very likely the same as
this one (sorry for not remembering to cc you there due to the rapid
progress):

http://marc.info/?t=12010717423&r=1&w=2

-- 
 i.
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Ilpo Järvinen wrote:

> On Wed, 23 Jan 2008, Krishna Kumar2 wrote:
>
> > Hi Ilpo,
> >
> > > It's almost impossible to know which of these is the main cause and the
> > > first occurring due to reasons I'll not copy here. What a strange thing
> > > that it has been super quiet on this front until now everybody is
> > > seeing it; could there be something unrelated to TCP which has broken
> > > it all recently?
> >
> > I have been getting this for at least 3 weeks but I was quiet since those
> > were kernels that I had modified.
>
> Since you can easily reproduce it, let's just figure out what's causing it
> the hard way, rather than digging through the endless git logs... :-)

Hmm, perhaps it could be something related to this (and some untested path
somewhere which is now exposed):

commit 4a55b553f691abadaa63570dfc714e20913561c1
Author: Ilpo Järvinen <[EMAIL PROTECTED]>
Date:   Thu Dec 20 20:36:03 2007 -0800

    [TCP]: Fix TSO deferring

Dave, what do you think? That wouldn't explain the one -rc-only report from
Denys, though.

Another one I'm a bit unsure of is this:

commit 757c32944b80fd95542bd66f06032ab773034d53
Author: Ilpo Järvinen <[EMAIL PROTECTED]>
Date:   Thu Jan 3 20:39:01 2008 -0800

    [TCP]: Perform setting of common control fields in one place

The ->sacked field is cleared in tcp_retransmit_skb due to a subtle change
there, which might be buggy. However, I find it rather unlikely that this
would explain Kumar's case. Anyway, here's the one that reverts the
problematic part of it. ...And this is net-2.6.25 as well, so it won't
solve Denys' case either.

-- 
 i.

--
[PATCH] [TCP]: Revert part of "[TCP]: Perform setting of common control..."

This commit reverts part of 757c32944b80fd95542bd66f06032ab773034d53
([TCP]: Perform setting of common control fields in one place) because it's
not valid to clear the ->sacked field like that without additional
precautions with the counters; mostly lost_out, as sacked_out should not be
set due to reneging, which kicks in much earlier than this.
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/ipv4/tcp_output.c |    8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 648340f..f6cbc1f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1887,10 +1887,12 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	    (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) &&
 	    tp->snd_una == (TCP_SKB_CB(skb)->end_seq - 1)) {
 		if (!pskb_trim(skb, 0)) {
-			/* Reuse, even though it does some unnecessary work */
-			tcp_init_nondata_skb(skb, TCP_SKB_CB(skb)->end_seq - 1,
-					     TCP_SKB_CB(skb)->flags);
+			TCP_SKB_CB(skb)->seq = TCP_SKB_CB(skb)->end_seq - 1;
+			skb_shinfo(skb)->gso_segs = 1;
+			skb_shinfo(skb)->gso_size = 0;
+			skb_shinfo(skb)->gso_type = 0;
 			skb->ip_summed = CHECKSUM_NONE;
+			skb->csum = 0;
 		}
 	}
-- 
1.5.2.2
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Krishna Kumar2 wrote:

> While running with this patch, I got these errors (pasted at the end
> of this mail).

I don't have a clue why it didn't get to the checking func (or why it
didn't print anything) but just hit those WARN_ONs... Hopefully this gives
somewhat better input (applies on top of the other debug patch).

-- 
 i.

[PATCH] [TCP]: more debug
---
 include/net/tcp.h    |    3 ++-
 net/ipv4/tcp_input.c |    9 -
 net/ipv4/tcp_ipv4.c  |   19 ++-
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0685035..129c3b1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,6 +272,7 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 #define TCP_ADD_STATS_BH(field, val)	SNMP_ADD_STATS_BH(tcp_statistics, field, val)
 #define TCP_ADD_STATS_USER(field, val)	SNMP_ADD_STATS_USER(tcp_statistics, field, val)
 
+extern void	tcp_print_queue(struct sock *sk);
 extern void	tcp_verify_wq(struct sock *sk);
 
 extern void	tcp_v4_err(struct sk_buff *skb, u32);
@@ -772,8 +773,8 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk)
 /* Use define here intentionally to get WARN_ON location shown at the caller */
 #define tcp_verify_left_out(tp)	\
 	do { \
-		WARN_ON(tcp_left_out(tp) > tp->packets_out); \
 		tcp_verify_wq((struct sock *)tp); \
+		WARN_ON(tcp_left_out(tp) > tp->packets_out); \
 	} while(0)
 
 extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cdacf70..295490e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2133,12 +2133,15 @@ static void tcp_verify_retransmit_hint(struct tcp_sock *tp, struct sk_buff *skb)
 static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb;
+	struct sk_buff *skb, *prev = NULL;
 	int cnt;
 
+	tcp_verify_left_out(tp);
+
 	BUG_TRAP(packets <= tp->packets_out);
 	if (tp->lost_skb_hint) {
 		skb = tp->lost_skb_hint;
+		prev = skb;
 		cnt = tp->lost_cnt_hint;
 	} else {
 		skb = tcp_write_queue_head(sk);
@@ -2166,6 +2169,10 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
 			tcp_verify_retransmit_hint(tp, skb);
 		}
 	}
+	if (tcp_left_out(tp) > tp->packets_out) {
+		printk(KERN_ERR "Prev hint: %p, exit %p\n", prev, skb);
+		tcp_print_queue(sk);
+	}
 	tcp_verify_left_out(tp);
 }
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c95682e..c2a88c5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -117,6 +117,15 @@ void tcp_print_queue(struct sock *sk)
 	int idx = 0;
 	int i;
 
+	i = 0;
+	tcp_for_write_queue(skb, sk) {
+		if (skb == tcp_send_head(sk))
+			printk(KERN_ERR "head %u %p\n", i, skb);
+		else
+			printk(KERN_ERR "skb %u %p\n", i, skb);
+		i++;
+	}
+
 	tcp_for_write_queue(skb, sk) {
 		if (skb == tcp_send_head(sk))
 			break;
@@ -195,11 +204,6 @@ void tcp_verify_wq(struct sock *sk)
 		packets += tcp_skb_pcount(skb);
 	}
 
-	WARN_ON(lost != tp->lost_out);
-	WARN_ON(tcp_is_sack(tp) && (sacked != tp->sacked_out));
-	WARN_ON(packets != tp->packets_out);
-	WARN_ON(fackets != tp->fackets_out);
-
 	if ((lost != tp->lost_out) ||
 	    (tcp_is_sack(tp) && (sacked != tp->sacked_out)) ||
 	    (packets != tp->packets_out) ||
@@ -213,6 +217,11 @@ void tcp_verify_wq(struct sock *sk)
 			tp->rx_opt.sack_ok);
 		tcp_print_queue(sk);
 	}
+
+	WARN_ON(lost != tp->lost_out);
+	WARN_ON(tcp_is_sack(tp) && (sacked != tp->sacked_out));
+	WARN_ON(packets != tp->packets_out);
+	WARN_ON(fackets != tp->fackets_out);
 }
 
 static int tcp_v4_get_port(struct sock *sk, unsigned short snum)
-- 
1.5.2.2
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
10
> [] rest_init+0x5d/0x60
> [] start_kernel+0x1fa/0x260
> [] ? unknown_bootoption+0x0/0x130
> ===
> ---[ end trace 14b601818e6903ac ]---

...But this no longer is, and even more, L: 5 is not a valid state at this
point at all (it should only happen if we went to RTO, but that would reset
S to zero with NewReno):

> P: 5 L: 5 vs 5 S: 0 vs 3 w: 2044790889-2044796616 (0)
> TCP wq(s) l<
> TCP wq(h) +++h+<
> l5 s3 f0 p5 seq: su2044790889 hs2044795029 sn2044796616

Surprisingly, it was the first time the WARN_ON for left_out returned the
correct location. This also explains why the patch I sent to Krishna didn't
print anything (it didn't end up printing because I forgot to add the
L+S > P check into the state-checking if).

...So please, could you (others than Denys) try this patch; it should solve
the issue. And Denys, could you confirm (and if necessary double check)
that the kernel you saw this similar problem with is the pure Linus
mainline, i.e., without any net-2.6.25 or mm bits; if so, that problem
persists.

And anyway, there were some fackets_out related problems reported as well,
and this doesn't help with that, but I think I've lost track of who was
seeing it due to the large number of reports :-). Could somebody refresh my
memory, because I currently don't have time to dig it up from the archives
(at least this week).

-- 
 i.

--
[PATCH] [TCP]: NewReno must count every skb while marking losses

NewReno should add cnt per skb (as with FACK) instead of depending on
SACKED_ACKED bits which won't be set with it at all. Effectively, with
NewReno the loop should always exit after the first iteration anyway (or
immediately if there's already a head in lost_out).

This was fixed earlier in net-2.6.25 but got reverted among other stuff,
and I didn't notice that this is still necessary (actually I wasn't even
considering this case while trying to figure out the reports, because I
lived with a different kind of code than it in reality was).
This should solve the WARN_ONs in TCP code that, as a result of this,
triggered multiple times in every place we check for this invariant.

Special thanks to Dave Young <[EMAIL PROTECTED]> and Krishna Kumar2
<[EMAIL PROTECTED]> for trying with my debug patches.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/ipv4/tcp_input.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 295490e..aa409a5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2156,7 +2156,7 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
 		tp->lost_skb_hint = skb;
 		tp->lost_cnt_hint = cnt;
 
-		if (tcp_is_fack(tp) ||
+		if (tcp_is_fack(tp) || tcp_is_reno(tp) ||
 		    (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
 			cnt += tcp_skb_pcount(skb);
-- 
1.5.2.2
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Thu, 24 Jan 2008, Ilpo Järvinen wrote:

> And anyway, there were some fackets_out related problems reported as
> well and this doesn't help for that but I think I've lost track of who
> was seeing it due to large number of reports :-), could somebody
> refresh my memory because I currently don't have time to dig it up
> from archives (at least on this week).

Here's the updated debug patch for net-2.6.25/mm for tracking fackets_out
inconsistencies (it won't work for trees which don't include net-2.6.25;
mm does, of course :-)). I hope I got it into good shape this time, to
avoid spurious stacktraces while still maintaining 100% accuracy, but it's
not a simple oneliner so I might have missed something... :-)

I'd suggest that people trying with this first apply the NewReno fix from
the previous mail, to avoid the already-fixed case from triggering.

-- 
 i.

--
[PATCH] [TCP]: debug S+L (for net-2.5.26 / mm, incompatible with mainline)
---
 include/net/tcp.h     |    5 ++-
 net/ipv4/tcp_input.c  |   18 +++-
 net/ipv4/tcp_ipv4.c   |  127 +
 net/ipv4/tcp_output.c |   23 +++--
 4 files changed, 165 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7de4ea3..552aa71 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,6 +272,9 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 #define TCP_ADD_STATS_BH(field, val)	SNMP_ADD_STATS_BH(tcp_statistics, field, val)
 #define TCP_ADD_STATS_USER(field, val)	SNMP_ADD_STATS_USER(tcp_statistics, field, val)
 
+extern void	tcp_print_queue(struct sock *sk);
+extern void	tcp_verify_wq(struct sock *sk);
+
 extern void	tcp_v4_err(struct sk_buff *skb, u32);
 
 extern void	tcp_shutdown (struct sock *sk, int how);
@@ -768,7 +771,7 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk)
 }
 
 /* Use define here intentionally to get WARN_ON location shown at the caller */
-#define tcp_verify_left_out(tp)	WARN_ON(tcp_left_out(tp) > tp->packets_out)
+#define tcp_verify_left_out(tp)	tcp_verify_wq((struct sock *)tp)
 
 extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
 extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 19c449f..c897c93 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1426,8 +1426,10 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb,
 	int first_sack_index;
 
 	if (!tp->sacked_out) {
-		if (WARN_ON(tp->fackets_out))
+		if (WARN_ON(tp->fackets_out)) {
+			tcp_verify_left_out(tp);
 			tp->fackets_out = 0;
+		}
 		tcp_highest_sack_reset(sk);
 	}
 
@@ -2136,6 +2138,8 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
 	struct sk_buff *skb;
 	int cnt;
 
+	tcp_verify_left_out(tp);
+
 	BUG_TRAP(packets <= tp->packets_out);
 	if (tp->lost_skb_hint) {
 		skb = tp->lost_skb_hint;
@@ -2501,6 +2505,8 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 		    (tcp_fackets_out(tp) > tp->reordering));
 	int fast_rexmit = 0;
 
+	tcp_verify_left_out(tp);
+
 	if (WARN_ON(!tp->packets_out && tp->sacked_out))
 		tp->sacked_out = 0;
 	if (WARN_ON(!tp->sacked_out && tp->fackets_out))
@@ -2645,6 +2651,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 	if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
 		tcp_update_scoreboard(sk, fast_rexmit);
 	tcp_cwnd_down(sk, flag);
+
+	WARN_ON(tcp_write_queue_head(sk) == NULL);
+	WARN_ON(!tp->packets_out);
+
 	tcp_xmit_retransmit_queue(sk);
 }
 
@@ -2848,6 +2858,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets)
 		tcp_clear_all_retrans_hints(tp);
 	}
 
+	tcp_verify_left_out(tp);
+
 	if (skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
 		flag |= FLAG_SACK_RENEGING;
 
@@ -3175,6 +3187,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	prior_fackets = tp->fackets_out;
 	prior_in_flight = tcp_packets_in_flight(tp);
 
+	tcp_verify_left_out(tp);
+
 	if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
 		/* Window is constant, pure forward advance.
 		 * No more checks are required.
@@ -3237,6 +3251,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))