Re: [PATCHSET 0/18] FRTO: fixes and small changes + SACK enhanced version
On Thu, 22 Feb 2007, David Miller wrote: > If you have any further bug fixes or enhancements to FRTO we > can put it there too. I tried the rate-halving approach in the spurious RTO response and it looks really nice indeed. I'll also try to run a couple of tests with the undo_cwr(,1) response in the near future to ensure that it doesn't cause immediate breakage with something. Before preparing the submittable spurious RTO responses patch, I would like to know what the preferred style is for the response selector code (I'm controlling the response through the frto_response sysctl)? switch? array of function pointers? If an array, where to place it (considering the need for forward declarations and the mixing of code and data in some approaches)? -- i. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
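For illustration, the two dispatch styles being asked about might look roughly like this in user-space C (the handler and array names below are made up for the sketch, not kernel code; later messages in this thread show that a switch was what ended up being used):

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real tcp_*_spur_to_response()
 * handlers; each just returns an id so the two dispatch styles
 * can be compared side by side. */
static int response_ratehalving(void)  { return 0; }
static int response_conservative(void) { return 1; }
static int response_undo_cwr(void)     { return 2; }

/* Style 1: switch on the sysctl value. */
static int dispatch_switch(int frto_response)
{
	switch (frto_response) {
	case 2:  return response_undo_cwr();
	case 1:  return response_conservative();
	default: return response_ratehalving();
	}
}

/* Style 2: an array of function pointers indexed by the sysctl
 * value; this needs the forward declarations above it, plus
 * explicit bounds checking for out-of-range sysctl values. */
static int (*const frto_responses[])(void) = {
	response_ratehalving,
	response_conservative,
	response_undo_cwr,
};

static int dispatch_array(int frto_response)
{
	size_t n = sizeof(frto_responses) / sizeof(frto_responses[0]);

	if (frto_response < 0 || (size_t)frto_response >= n)
		frto_response = 0;	/* unknown value: default response */
	return frto_responses[frto_response]();
}
```

Either form behaves the same for valid sysctl values; the switch avoids the bounds check and keeps code and data together, which is presumably why it won out.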
Re: [PATCH] fix limited slow start bug
On Thu, 22 Feb 2007, John Heffner wrote: > diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c > index 7fd2910..a0c894f 100644 > --- a/net/ipv4/tcp_cong.c > +++ b/net/ipv4/tcp_cong.c > @@ -303,9 +303,9 @@ void tcp_slow_start(struct tcp_sock *tp) > > tp->snd_cwnd_cnt += cnt; > while (tp->snd_cwnd_cnt >= tp->snd_cwnd) { > + tp->snd_cwnd_cnt -= tp->snd_cwnd; > if (tp->snd_cwnd < tp->snd_cwnd_clamp) > tp->snd_cwnd++; > - tp->snd_cwnd_cnt -= tp->snd_cwnd; > } > } > EXPORT_SYMBOL_GPL(tcp_slow_start); ACK. I was going to track this down tomorrow as I noticed strange behavior during slow start while testing tcp-2.6 today, but I don't have to anymore :-). The problem I saw is now obvious: congestion control as a whole was practically disabled due to the underflow of the unsigned type that occurs without this patch, causing TCP to be limited only by the receiver-advertised window during slow start. BTW, while looking at this patch, I noticed that snd_cwnd_clamp is only u16 while snd_cwnd is u32, which seems rather strange since snd_cwnd is being limited by the clamp value here and there?!?! And tcp_highspeed.c is clearly assuming even more than this (but the problem is hidden as snd_cwnd_clamp is fed back to the min_t and the 32-bit constant used could safely be cut to 16 bits anyway): tp->snd_cwnd_clamp = min_t(u32, tp->snd_cwnd_clamp, 0xffffffff/128); Has the type been changed somewhere in the past, or why is this so? -- i.
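The underflow described in the reply can be reproduced in user space. The sketch below (helper names made up for illustration, not the kernel functions) contrasts the pre-patch subtraction order, which subtracts the already-incremented cwnd and wraps the unsigned counter, with the fixed order from the patch:

```c
typedef unsigned int u32;

/* Pre-patch order: cwnd is incremented first, so the counter is
 * reduced by the *new* cwnd and can wrap below zero.  The walk is
 * capped so the wraparound is observable instead of iterating for
 * millions of rounds. */
static u32 slow_start_old(u32 *cwnd, u32 cnt_acked, u32 clamp)
{
	u32 cnt = cnt_acked;
	int guard = 16;

	while (cnt >= *cwnd && guard--) {
		if (*cwnd < clamp)
			(*cwnd)++;
		cnt -= *cwnd;	/* 10 - 11 wraps to ~4 billion */
	}
	return cnt;
}

/* Fixed order from the patch: subtract the old cwnd first, then
 * increment, so the counter can never go below zero. */
static u32 slow_start_new(u32 *cwnd, u32 cnt_acked, u32 clamp)
{
	u32 cnt = cnt_acked;

	while (cnt >= *cwnd) {
		cnt -= *cwnd;
		if (*cwnd < clamp)
			(*cwnd)++;
	}
	return cnt;
}
```

With cwnd = 10 and exactly 10 newly acked packets (the common case), the old order leaves the counter wrapped to a huge value, so every subsequent call keeps pushing cwnd toward the clamp, which matches the "congestion control practically disabled" symptom; the fixed order yields one clean increment and a zero counter.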
[PATCH] [TCP]: Correct reordering detection change (no FRTO case)
The reordering detection must also work when FRTO has not been used at all, which was my original intention; just the expression of the idea was flawed. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- Applies on top of the tcp-2.6 branch, as you would probably have guessed. diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index bb3f234..f6ba07f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1240,7 +1240,7 @@ tcp_sacktag_write_queue(struct sock *sk, tp->left_out = tp->sacked_out + tp->lost_out; if ((reord < tp->fackets_out) && icsk->icsk_ca_state != TCP_CA_Loss && - (tp->frto_highmark && after(tp->snd_una, tp->frto_highmark))) + (!tp->frto_highmark || after(tp->snd_una, tp->frto_highmark))) tcp_update_reordering(sk, ((tp->fackets_out + 1) - reord), 0); #if FASTRETRANS_DEBUG > 0 -- 1.4.2
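A user-space sketch of the one-operator change (simplified after(), predicate function names hypothetical) shows that the old and fixed guards differ only in the no-FRTO case, where frto_highmark is still zero:

```c
typedef unsigned int u32;

/* Simplified stand-in for the kernel's after(): fine away from
 * pathological wraparound distances. */
static int after(u32 a, u32 b) { return (int)(a - b) > 0; }

/* Old, flawed guard: silently disabled reordering detection
 * whenever frto_highmark == 0, i.e. when FRTO was never used. */
static int may_update_reordering_old(u32 snd_una, u32 frto_highmark)
{
	return frto_highmark && after(snd_una, frto_highmark);
}

/* Fixed guard from the patch: allow the update when FRTO has not
 * been used at all, or once snd_una has advanced past the mark. */
static int may_update_reordering_new(u32 snd_una, u32 frto_highmark)
{
	return !frto_highmark || after(snd_una, frto_highmark);
}
```

Inside the FRTO-marked region both versions still suppress the reordering update, which is the behavior the guard was added for in the first place.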
[PATCH 1/2] [TCP]: Add two new spurious RTO responses to FRTO
New sysctl tcp_frto_response is added to select amongst these responses: - Rate halving based; reuses CA_CWR state (default) - Very conservative; used to be the only one available (=1) - Undo cwr; undoes ssthresh and cwnd reductions (=2) The response with rate halving requires a new parameter to tcp_enter_cwr because FRTO has already reduced ssthresh and doing a second reduction there has to be prevented. In addition, to keep things nice on 80 cols screen, a local variable was added. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- include/linux/sysctl.h |1 + include/net/tcp.h |3 ++- net/ipv4/sysctl_net_ipv4.c |8 net/ipv4/tcp_input.c | 30 ++ net/ipv4/tcp_output.c |2 +- 5 files changed, 38 insertions(+), 6 deletions(-) diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h index a2dce72..80f11e0 100644 --- a/include/linux/sysctl.h +++ b/include/linux/sysctl.h @@ -439,6 +439,7 @@ enum NET_TCP_AVAIL_CONG_CONTROL=122, NET_TCP_ALLOWED_CONG_CONTROL=123, NET_TCP_MAX_SSTHRESH=124, + NET_TCP_FRTO_RESPONSE=125, }; enum { diff --git a/include/net/tcp.h b/include/net/tcp.h index 6d09f50..f0c9e34 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -220,6 +220,7 @@ extern int sysctl_tcp_app_win; extern int sysctl_tcp_adv_win_scale; extern int sysctl_tcp_tw_reuse; extern int sysctl_tcp_frto; +extern int sysctl_tcp_frto_response; extern int sysctl_tcp_low_latency; extern int sysctl_tcp_dma_copybreak; extern int sysctl_tcp_nometrics_save; @@ -738,7 +739,7 @@ static inline void tcp_sync_left_out(str tp->left_out = tp->sacked_out + tp->lost_out; } -extern void tcp_enter_cwr(struct sock *sk); +extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh); extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst); /* Slow start with delack produces 3 packets of burst, so that diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index d68effe..6817d64 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -647,6 
+647,14 @@ #endif .proc_handler = &proc_dointvec }, { + .ctl_name = NET_TCP_FRTO_RESPONSE, + .procname = "tcp_frto_response", + .data = &sysctl_tcp_frto_response, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = &proc_dointvec + }, + { .ctl_name = NET_TCP_LOW_LATENCY, .procname = "tcp_low_latency", .data = &sysctl_tcp_low_latency, diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index f6ba07f..d6e1776 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -86,6 +86,7 @@ int sysctl_tcp_stdurg __read_mostly; int sysctl_tcp_rfc1337 __read_mostly; int sysctl_tcp_max_orphans __read_mostly = NR_FILE; int sysctl_tcp_frto __read_mostly; +int sysctl_tcp_frto_response __read_mostly; int sysctl_tcp_nometrics_save __read_mostly; int sysctl_tcp_moderate_rcvbuf __read_mostly = 1; @@ -762,15 +763,17 @@ __u32 tcp_init_cwnd(struct tcp_sock *tp, } /* Set slow start threshold and cwnd not falling to slow start */ -void tcp_enter_cwr(struct sock *sk) +void tcp_enter_cwr(struct sock *sk, const int set_ssthresh) { struct tcp_sock *tp = tcp_sk(sk); + const struct inet_connection_sock *icsk = inet_csk(sk); tp->prior_ssthresh = 0; tp->bytes_acked = 0; if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) { tp->undo_marker = 0; - tp->snd_ssthresh = inet_csk(sk)->icsk_ca_ops->ssthresh(sk); + if (set_ssthresh) + tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); tp->snd_cwnd = min(tp->snd_cwnd, tcp_packets_in_flight(tp) + 1U); tp->snd_cwnd_cnt = 0; @@ -2003,7 +2006,7 @@ static void tcp_try_to_open(struct sock tp->retrans_stamp = 0; if (flag&FLAG_ECE) - tcp_enter_cwr(sk); + tcp_enter_cwr(sk, 1); if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) { int state = TCP_CA_Open; @@ -2579,6 +2582,21 @@ static void tcp_conservative_spur_to_res tcp_moderate_cwnd(tp); } +/* A conservative spurious RTO response algorithm: reduce cwnd using + * rate halving and continue in congestion avoidance. 
+ */ +static void tcp_ratehalving_spur_to_response(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + tcp_enter_cwr(sk, 0); + tp->high_seq = tp->frto_highmark; /* Smoother w/o this? - ij */ +} + +static void tcp_undo_spur_to_response(struct sock *sk) +{ + tcp_undo_cwr(sk, 1); +}
[PATCH 2/2] [TCP] Sysctl documentation: tcp_frto_response
In addition, fixed minor things in the tcp_frto sysctl. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- Documentation/networking/ip-sysctl.txt | 21 +++-- 1 files changed, 19 insertions(+), 2 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index d66777b..95e2a9f 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -174,15 +174,32 @@ tcp_fin_timeout - INTEGER because they eat maximum 1.5K of memory, but they tend to live longer. Cf. tcp_max_orphans. -tcp_frto - BOOLEAN +tcp_frto - INTEGER Enables F-RTO, an enhanced recovery algorithm for TCP retransmission timeouts. It is particularly beneficial in wireless environments where packet loss is typically due to random radio interference rather than intermediate router congestion. If set to 1, basic - version is enabled. 2 enables SACK enhanced FRTO, which is + version is enabled. 2 enables SACK enhanced F-RTO, which is EXPERIMENTAL. The basic version can be used also when SACK is enabled for a flow through tcp_sack sysctl. +tcp_frto_response - INTEGER + When F-RTO has detected that a TCP retransmission timeout was + spurious (i.e., the timeout would have been avoided had TCP set a + longer retransmission timeout), TCP has several options on what to do + next. 
Possible values are: + 0 Rate halving based; a smooth and conservative response, + results in halved cwnd and ssthresh after one RTT + 1 Very conservative response; not recommended because even + though it is valid, it interacts poorly with the rest of + Linux TCP, halves cwnd and ssthresh immediately + 2 Aggressive response; undoes congestion control measures + that are now known to be unnecessary (ignoring the + possibility of a lost retransmission that would require + TCP to be more cautious), cwnd and ssthresh are restored + to the values prior to the timeout + Default: 0 (rate halving based) + tcp_keepalive_time - INTEGER How often TCP sends out keepalive messages when keepalive is enabled. Default: 2hours. -- 1.4.2
Re: [PATCH 1/2] [TCP]: Add two new spurious RTO responses to FRTO
On Wed, 28 Feb 2007, Jarek Poplawski wrote: > On 27-02-2007 16:50, Ilpo Järvinen wrote: > > New sysctl tcp_frto_response is added to select amongst these > ... > > Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> > > @@ -762,15 +763,17 @@ __u32 tcp_init_cwnd(struct tcp_sock *tp, > > } > > > > /* Set slow start threshold and cwnd not falling to slow start */ > > -void tcp_enter_cwr(struct sock *sk) > > +void tcp_enter_cwr(struct sock *sk, const int set_ssthresh) > > { > > struct tcp_sock *tp = tcp_sk(sk); > > + const struct inet_connection_sock *icsk = inet_csk(sk); > > > > tp->prior_ssthresh = 0; > > tp->bytes_acked = 0; > > if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) { > > - if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) { > + if (icsk->icsk_ca_state < TCP_CA_CWR) { > > Probably something for the next "BTW". These are going to 2.6.22, not to 2.6.21, see: http://marc.theaimsgroup.com/?l=linux-netdev&m=117213215924406&w=2 ...or do you mean something else? Since DaveM has already applied this, here is a patch with your correction alone on top of tcp-2.6. -- [PATCH] [TCP]: Complete icsk-to-local-variable change (in tcp_enter_cwr) A local variable for icsk was created but this change was missing. Spotted by Jarek Poplawski. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index d6e1776..dc221a3 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -770,7 +770,7 @@ void tcp_enter_cwr(struct sock *sk, cons tp->prior_ssthresh = 0; tp->bytes_acked = 0; - if (inet_csk(sk)->icsk_ca_state < TCP_CA_CWR) { + if (icsk->icsk_ca_state < TCP_CA_CWR) { tp->undo_marker = 0; if (set_ssthresh) tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); -- 1.4.2
[RFC PATCH] [TCP]: Move clearing of the prior_ssthresh due to ECE earlier
Previously the prior_ssthresh was only used in the undo functions that were called by fastretrans_alert. FRTO processing, however, happens before that, and one of the responses does an undo too. I think that after this patch, FRTO should be ECN-safe; only the undo_cwr response is a suspect anyway. The possible cases: 1) Pre-RTO ECE: FRTO is not setting prior_ssthresh in CA_CWR at all, therefore it correctly remains cleared and ssthresh is the already reduced one 2a) Post-RTO ECE, before deciding ACK: prior_ssthresh is cleared, so no undoing for ssthresh will occur 2b) Post-RTO ECE, deciding ACK: this patch makes this case equal to 2a 3) Between two RTOs: prior_ssthresh is cleared like in 2a or the reduced value is used like in 1 The deciding ACK is the one after which FRTO declares the RTO spurious. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- This problem could be solved locally in the undo_cwr response of FRTO by falling back to the ratehalving response when FLAG_ECE is enabled. If that's the preferable way, just ping me to get a patch. I'm not absolutely sure that this does not break anything, even though I couldn't find anything using prior_ssthresh between the new and the old place (I've seen more than once over the years that things are not always as simple as they seem). net/ipv4/tcp_input.c | 19 +-- 1 files changed, 9 insertions(+), 10 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index dc221a3..b0c9dfb 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2080,15 +2080,11 @@ tcp_fastretrans_alert(struct sock *sk, u tp->fackets_out = 0; /* Now state machine starts. -* A. ECE, hence prohibit cwnd undoing, the reduction is required. */ - if (flag&FLAG_ECE) - tp->prior_ssthresh = 0; - - /* B. In all the states check for reneging SACKs. */ +* A. In all the states check for reneging SACKs. */ if (tp->sacked_out && tcp_check_sack_reneging(sk)) return; - /* C. Process data loss notification, provided it is valid. */ + /* B. 
Process data loss notification, provided it is valid. */ if ((flag&FLAG_DATA_LOST) && before(tp->snd_una, tp->high_seq) && icsk->icsk_ca_state != TCP_CA_Open && @@ -2097,10 +2093,10 @@ tcp_fastretrans_alert(struct sock *sk, u NET_INC_STATS_BH(LINUX_MIB_TCPLOSS); } - /* D. Synchronize left_out to current state. */ + /* C. Synchronize left_out to current state. */ tcp_sync_left_out(tp); - /* E. Check state exit conditions. State can be terminated + /* D. Check state exit conditions. State can be terminated *when high_seq is ACKed. */ if (icsk->icsk_ca_state == TCP_CA_Open) { BUG_TRAP(tp->retrans_out == 0); @@ -2143,7 +2139,7 @@ tcp_fastretrans_alert(struct sock *sk, u } } - /* F. Process state. */ + /* E. Process state. */ switch (icsk->icsk_ca_state) { case TCP_CA_Recovery: if (prior_snd_una == tp->snd_una) { @@ -2742,8 +2738,11 @@ static int tcp_ack(struct sock *sk, stru if (TCP_SKB_CB(skb)->sacked) flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una); - if (TCP_ECN_rcv_ecn_echo(tp, skb->h.th)) + if (TCP_ECN_rcv_ecn_echo(tp, skb->h.th)) { flag |= FLAG_ECE; + /* ECE, prohibit cwnd undoing, reduction is required */ + tp->prior_ssthresh = 0; + } tcp_ca_event(sk, CA_EVENT_SLOW_ACK); } -- 1.4.2
Re: [RFC PATCH] [TCP]: Move clearing of the prior_ssthresh due to ECE earlier
On Thu, 1 Mar 2007, Ilpo Järvinen wrote: > Previously the prior_ssthresh was only used in the undo functions > that were called by fastretrans_alert. FRTO processing, however, > happens before that and one of the responses is doing undo too. > > I think that after this patch, FRTO should be ECN-safe; only the > undo_cwr response is a suspect anyway. The possible cases: > [...snip...] > > This problem could be solved locally in the undo_cwr response of > FRTO by falling back to ratehalving response when FLAG_ECE is > enabled. I think that doing it in the response is better than this approach, since it knows that the ssthresh has been halved already within that round-trip, so there is no need to do that again... I'll submit the patch tomorrow... With this prior_ssthresh clearing move alone, the ssthresh ends up being halved twice if I've thought it right (first in tcp_enter_frto and then again in tcp_enter_cwr, which is called from fastretrans_alert)... So please, drop this patch. -- i.
[PATCH] [TCP]: FRTO undo response falls back to ratehalving one if ECEd
Undoing ssthresh is disabled in fastretrans_alert whenever FLAG_ECE is set by clearing prior_ssthresh. This clearing does not protect FRTO because FRTO operates before fastretrans_alert. Moving the clearing of prior_ssthresh earlier seems to be a suboptimal solution to the FRTO case because then FLAG_ECE will cause a second ssthresh reduction in try_to_open (the first occurred when FRTO was entered). So instead, FRTO falls back immediately to the rate halving response, which switches TCP to CA_CWR state preventing the latter reduction of ssthresh. If the first ECE arrived before the ACK after which FRTO is able to decide RTO as spurious, prior_ssthresh is already cleared. Thus no undoing for ssthresh occurs. Besides, FLAG_ECE should be set also in the following ACKs resulting in rate halving response that sees TCP already in CA_CWR, which again prevents an extra ssthresh reduction on that round-trip. If the first ECE arrived before RTO, ssthresh has already been adapted and prior_ssthresh remains cleared on entry because TCP is in CA_CWR (the same applies also to a case where FRTO is entered more than once and ECE comes in the middle). I believe that after this patch, FRTO should be ECN-safe and even able to take advantage of synergy benefits. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c |9 ++--- 1 files changed, 6 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index dc221a3..bdd6172 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2592,9 +2592,12 @@ static void tcp_ratehalving_spur_to_resp tp->high_seq = tp->frto_highmark; /* Smoother w/o this? 
- ij */ } -static void tcp_undo_spur_to_response(struct sock *sk) +static void tcp_undo_spur_to_response(struct sock *sk, int flag) { - tcp_undo_cwr(sk, 1); + if (flag&FLAG_ECE) + tcp_ratehalving_spur_to_response(sk); + else + tcp_undo_cwr(sk, 1); } /* F-RTO spurious RTO detection algorithm (RFC4138) @@ -2680,7 +2683,7 @@ static int tcp_process_frto(struct sock return 1; } else /* frto_counter == 2 */ { switch (sysctl_tcp_frto_response) { - case 2: tcp_undo_spur_to_response(sk); break; + case 2: tcp_undo_spur_to_response(sk, flag); break; case 1: tcp_conservative_spur_to_response(tp); break; default: tcp_ratehalving_spur_to_response(sk); break; } -- 1.4.2
[PATCH v2] [TCP]: FRTO undo response falls back to ratehalving one if ECEd
Undoing ssthresh is disabled in fastretrans_alert whenever FLAG_ECE is set by clearing prior_ssthresh. The clearing does not protect FRTO because FRTO operates before fastretrans_alert. Moving the clearing of prior_ssthresh earlier seems to be a suboptimal solution to the FRTO case because then FLAG_ECE will cause a second ssthresh reduction in try_to_open (the first occurred when FRTO was entered). So instead, FRTO falls back immediately to the rate halving response, which switches TCP to CA_CWR state preventing the latter reduction of ssthresh. If the first ECE arrived before the ACK after which FRTO is able to decide RTO as spurious, prior_ssthresh is already cleared. Thus no undoing for ssthresh occurs. Besides, FLAG_ECE should be set also in the following ACKs resulting in rate halving response that sees TCP is already in CA_CWR, which again prevents an extra ssthresh reduction on that round-trip. If the first ECE arrived before RTO, ssthresh has already been adapted and prior_ssthresh remains cleared on entry because TCP is in CA_CWR (the same applies also to a case where FRTO is entered more than once and ECE comes in the middle). High_seq must not be touched after tcp_enter_cwr because CWR round-trip calculation depends on it. I believe that after this patch, FRTO should be ECN-safe and even able to take advantage of synergy benefits. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- Of course I forgot to fix also the high_seq thing I had in mind last evening, so here is this again now with it too. diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index dc221a3..6b268dc 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2587,14 +2587,15 @@ static void tcp_conservative_spur_to_res */ static void tcp_ratehalving_spur_to_response(struct sock *sk) { - struct tcp_sock *tp = tcp_sk(sk); tcp_enter_cwr(sk, 0); - tp->high_seq = tp->frto_highmark; /* Smoother w/o this? 
- ij */ } -static void tcp_undo_spur_to_response(struct sock *sk) +static void tcp_undo_spur_to_response(struct sock *sk, int flag) { - tcp_undo_cwr(sk, 1); + if (flag&FLAG_ECE) + tcp_ratehalving_spur_to_response(sk); + else + tcp_undo_cwr(sk, 1); } /* F-RTO spurious RTO detection algorithm (RFC4138) @@ -2680,7 +2681,7 @@ static int tcp_process_frto(struct sock return 1; } else /* frto_counter == 2 */ { switch (sysctl_tcp_frto_response) { - case 2: tcp_undo_spur_to_response(sk); break; + case 2: tcp_undo_spur_to_response(sk, flag); break; case 1: tcp_conservative_spur_to_response(tp); break; default: tcp_ratehalving_spur_to_response(sk); break; } -- 1.4.2
Re: [PATCH 4/4]: Kill fastpath_{skb,cnt}_hint.
On Fri, 2 Mar 2007, David Miller wrote: > From: Baruch Even <[EMAIL PROTECTED]> > Date: Thu, 1 Mar 2007 20:13:40 +0200 > > One drawback for this approach is that you now walk the entire sack > > block when you advance one packet. If you consider a 10,000 packet queue > > which had several losses at the beginning and a large sack block that > > advances from the middle to the end you'll walk a lot of packets for > > that one last stretch of a sack block. Maybe this information in the last ACK could even be used to our advantage to speed up the start-point lookup, since the last ACK very likely contains the highest SACK block (this might not hold under attack, though), no matter how big it is, and we never need to process it if we know for sure that it was the global highest rather than a local one. Using that, it's trivial to find the starting point below the SACK block, and that could be passed to the head_lost marking on the fly on the stack rather than in the structs. Sort of a fastpath too :-)... Maybe even some auto-tuning thingie which conditionally enables skipping of large SACK blocks to reuse all of this last-ACK information in mark_head, but that would get rather complicated I think. > > One way to handle that is to use the still existing sack fast path to > > detect this case and calculate what is the sequence number to search > > for. Since you know what was the end_seq that was handled last, you can > > search for it as the start_seq and go on from there. Does it make sense? > > BTW, I think I figured out a way to get rid of > lost_{skb,cnt}_hint. The fact of the matter in this case is that > the setting of the tag bits always propagates from front of the queue > onward. We don't get holes mid-way. > > So what we can do is search the RB-tree for high_seq and walk > backwards. Once we hit something with TCPCB_TAGBITS set, we > stop processing as there are no earlier SKBs which we'd need > to do anything with. 
If TCP knows the highest SACK block skb (or its seq) and high_seq, finding the starting point is really trivial as you can always take the min of them without any walking. Then walk backwards, skipping reordering first, until the first LOST one is encountered. ...The only overhead then comes from the skipped skbs, which depends on the current reordering (of course that was not overhead if they were timed out). This would not even require knowledge about per-skb fack_count because the skipping serves its purpose and can be done on the fly. > scoreboard_skb_hint is a little bit trickier, but it is a similar > case to the tcp_lost_skb_hint case. Except here the termination > condition is a relative timeout instead of a sequence number and > packet count test. > > Perhaps for that we can remember some state from the > tcp_mark_head_lost() we do first. In fact, we can start > the queue walk from the latest packet which tcp_mark_head_lost() > marked with a tag bit. Yes, the problem with this case compared to head_lost seems to be that we don't know whether the first skb (in a backwards walk) must be marked until we have walked past it (actually to the point where the first SACKED_RETRANS is encountered; timestamps should be in order except for the discontinuity that occurs at the SACKED_RETRANS edge). So it seems to me that any backwards walk could be a bit problematic in this case exactly because of this discontinuity? Armed with this knowledge, however, a backwards walk could start checking for timedout marking right at the first SACKED_RETRANS skb, and continue later from that point forward if the skb at the edge was also marked due to timeout. ...Actually, I think another discontinuity can exist at the EVER_RETRANS edge, but that suffers from the same problem as the non-RETRANS skbs before the SACKED_RETRANS edge when first encountered, and is therefore probably pretty useless. > Basically these two algorithms are saying: > > 1) Mark up to smallest of 'lost' or tp->high_seq. 
> 2) Mark packets after those processed in #1 which have >timed out. > > Right? So I would take another angle to this problem (basically combine the lost calculation from tcp_update_scoreboard and mark_head lost stuff and ignore this lost altogether). Basically what I propose is something like this (hopefully I don't stomp your feet by coding it this much as pseudish approach seemed to get almost as complex :-)): void tcp_update_scoreboard_fack(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); int reord_count = 0; int had_retrans = 0; struct sk_buff *timedout_reentry_skb = NULL; /* How to get this highest_sack? */ skb = get_min_skb_wrapsafe(tp->high_seq, highest_sack); walk_backwards_from(sk, skb) { if (TCP_CB_SKB(skb)->sacked & TCPCB_LOST) break; if (TCP_CB_SKB(skb)->sacked & TCPCB_SACKED_RETRANS) { had_retrans = 1;
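The wrap-safe minimum that the pseudocode's hypothetical get_min_skb_wrapsafe() assumes can be built from the standard TCP sequence-space comparison; a minimal user-space sketch:

```c
typedef unsigned int u32;

/* Standard TCP sequence-space comparison: correct across the
 * 32-bit wraparound as long as the two values are within 2^31
 * of each other. */
static int before(u32 seq1, u32 seq2)
{
	return (int)(seq1 - seq2) < 0;
}

/* Wrap-safe minimum of two sequence numbers, as assumed by the
 * hypothetical get_min_skb_wrapsafe() in the sketch above. */
static u32 seq_min(u32 a, u32 b)
{
	return before(a, b) ? a : b;
}
```

A plain `a < b` would pick the wrong starting point whenever high_seq or the highest SACKed seq has wrapped past zero; the signed-difference trick keeps the comparison meaningful.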
Re: [PATCH 4/4]: Kill fastpath_{skb,cnt}_hint.
On Sat, 3 Mar 2007, Ilpo Järvinen wrote: > On Fri, 2 Mar 2007, David Miller wrote: > > From: Baruch Even <[EMAIL PROTECTED]> > > Date: Thu, 1 Mar 2007 20:13:40 +0200 > > > > One drawback for this approach is that you now walk the entire sack > > > block when you advance one packet. If you consider a 10,000 packet queue > > > which had several losses at the beginning and a large sack block that > > > advances from the middle to the end you'll walk a lot of packets for > > > that one last stretch of a sack block. > > Maybe this information in the last ACK could be even used as advantage to > speed up the startpoint lookup, since the last ACK very likely contains > the highest SACK block (could not hold under attack though), no matter how > big it is, and it's not necessary to process it ever if we know for sure > that it was the global highest rather than a local one. Using that, it's > trivial to find the starting point below the SACK block and that could be > passed to the head_lost marking on-the-fly using stack rather than in the > structs. Sort of fastpath too :-)... Maybe even some auto-tuning thingie > which enables conditionally skipping of large SACK-blocks to reuse all of > this last ACK information in mark_head but that would get rather > complicated I think. Even better, adding more to the globally highest SACKed block range that is already larger than tp->reordering does nothing in mark_head_lost unless ca_state or high_seq also change (just a quick thought; it might be possible to exclude them as well with careful analysis of the state), isn't that so? But taking advantage of this might require inter-ACK state and would then be a less useful optimization. So the only thing that would remain is the check for timedout skbs above the highest SACKed block... -- i.
SACK scoreboard (Was: Re: [RFC PATCH net-2.6.25 uncompilable] [TCP]: Avoid breaking GSOed skbs when SACKed one-by-one)
On Tue, 11 Dec 2007, David Miller wrote: > Interesting approach, but I think there is limited value to this > (arguably) complex form. > > The core issue is that the data and the SACK state are maintained in > the same datastructure. The complexity in all the state management > and fixups in your patch is purely because of this. Yeah, I just had too much time while waiting for a person who never arrived... :-) It would have covered the typical case quite well, though; for sure it was very intrusive. > If we maintain SACK scoreboard information seperately, outside of > the SKB, then there are only two changes to make: > > 1) Every access to TCP_SKB_CB() SACK scoreboard is adjusted to >new data structure. Not sure if it is going to be all that straightforward, because you seem to ignore in this simple description all the details of performing the linking between the structures, which becomes tricky when we add those retransmission adjustments. > 2) Retransmit is adjusted so that it can retransmit an SKB >constructed as a portion of an existing SKB. Since TSO >implies SG, this can be handled with simple offset and >length arguments and suitable creation of a clone referencing >the pages in the SG vector that contain the desired data. ...I.e., how to count retrans_out origins in such a case, because the presented sack_ref knows nothing about them, nor do we have anything but one bit in ->sacked per skb. We need the origins when a retransmission gets sacked/acked to reduce retrans_out correctly. To solve this, I think the sack_ref must have ->sacked as well, and the structure should store the R-bits there too, and maybe the L-bits as well, though their definition is more obvious because it's mostly a complement of sacked (except for a small part near fackets_out and timedout marking that probably still makes bookkeeping of L necessary). 
The pcount boundaries are no longer that well defined after we break skb boundaries during retransmitting, so doing this makes the fackets_out calculation even trickier to track, unless pcount as a whole is replaced by a byte-based fackets_out :-/, which is trivial to determine from seqnos alone (TCP_NODELAYers might get unhappy, though, if we do that in a straightforward way...). This would clearly be a good step from one perspective. Nowadays GSO'ed skbs may get fragmented when mss_now changes; for two or more consecutive non-SACKed skbs that means one or more extra (small) packets when retransmitting. > I would envision this SACK state thing to reference into the > retransmit queue SKB's somehow. Each SACK block could perhaps > look something like: > > struct sack_ref { > struct sk_buff *start_skb; > struct sk_buff *end_skb; > unsigned int start_off; > unsigned int len; > }; ...I assume that these are quickly searchable? And the skbs as well (because the linking of start/end_skb at minimum is a costly operation otherwise)? -- i.
Re: net-2.6.25 rebased...
On Sun, 16 Dec 2007, David Miller wrote: > I needed to rebase for two reasons: > > 1) Ilpo asked me to revert a lot of TCP stuff and the >easiest way to do that was during a rebase. > > 2) Patrick McHardy needs some of the pending net-2.6 >bug fixes in there in order to send me patches >for some netfilter compat stuff. > > It's all there in the usual spot: > > kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.25.git > > I'm still doing build tests with various configurations on my > Niagara-2 box. > > Let me know if I screwed up anything, thanks! Thanks, at least the end result for TCP was exactly what I expected to find there... -- i.
Re: [PATCH 2/2] [TCP]: Include __tcp_reset_fack_counts to non-__ version
On Tue, 11 Dec 2007, Christoph Hellwig wrote: > On Tue, Dec 11, 2007 at 01:50:39PM +0200, Ilpo Järvinen wrote: > > + BUG_ON((prev != NULL) && !tcp_skb_adjacent(sk, prev, > > skb[queue])); > > + > > + tcp_for_write_queue_from(skb[queue], sk, queue) { > > + if ((prev != NULL) && !tcp_skb_adjacent(sk, prev, > > skb[queue])) > > + break; > > + > > + if (!before(TCP_SKB_CB(skb[queue])->seq, > > tcp_sk(sk)->snd_nxt) || > > + TCP_SKB_CB(skb[queue])->fack_count == fc) > > + return; > There's quite a few overflows of the normal 80 char limit here. Because > you're current style is a little on the verbose side that's trivially > fixable, though: This part got removed along with the rest of that TCP code during the net-2.6.25 rebase... Thanks anyway for the reminder. I'll try to be more careful during code moves in the future, but I'll probably continue to allow exceptions in cases where the offenders consist only of a closing parenthesis, block-opening brace and terminating semicolon (as was the case in one of these lines as well). -- i.
Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
On Tue, 18 Dec 2007, Gavin McCullagh wrote: > The last attempt didn't take account of the situation where a timestamp > wasn't available and tcp_clean_rtx_queue() has to feed both the RTO and the > congestion avoidance. This updated patch stores both RTTs, making the > delayed one available for the RTO and the other (ca_seq_rtt) available for > congestion control. > > > Signed-off-by: Gavin McCullagh <[EMAIL PROTECTED]> > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 889c893..6fb7989 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -2686,13 +2687,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, > if (sacked & TCPCB_SACKED_RETRANS) > tp->retrans_out -= packets_acked; > flag |= FLAG_RETRANS_DATA_ACKED; > + ca_seq_rtt = -1; > seq_rtt = -1; > if ((flag & FLAG_DATA_ACKED) || > (packets_acked > 1)) > flag |= FLAG_NONHEAD_RETRANS_ACKED; > } else { > + ca_seq_rtt = now - scb->when; Isn't it also much better this way in a case where ACK losses happened, taking the longest RTT in that case is clearly questionable as it may over-estimate considerably. However, another thing to consider is the possibility of this value being used in "timeout-like" fashion in ca modules (I haven't read enough ca modules code to know if any of them does that), on contrary to determinating just rtt or packet's delay in which case this change seems appropriate (most modules do the latter). Therefore, if timeout-like module exists one should also add TCP_CONG_RTT_STAMP_LONGEST for that particular module and keep using seq_rtt for it like previously and use ca_seq_rtt only for others. 
> if (seq_rtt < 0) { > - seq_rtt = now - scb->when; > + seq_rtt = ca_seq_rtt; > if (fully_acked) > last_ackt = skb->tstamp; > } > @@ -2709,8 +2712,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, > !before(end_seq, tp->snd_up)) > tp->urg_mode = 0; > } else { > + ca_seq_rtt = now - scb->when; > if (seq_rtt < 0) { > - seq_rtt = now - scb->when; > + seq_rtt = ca_seq_rtt; > if (fully_acked) > last_ackt = skb->tstamp; > } This part doesn't exists anymore in development tree. Please base this patch (and anything in future) you intend to get included to mainline onto net-2.6.25 unless there's a very good reason to not do so or whatever 2.6.xx is the correct net development tree at that time (if one exists). Thanks. -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: After many hours all outbound connections get stuck in SYN_SENT
On Sun, 16 Dec 2007, James Nichols wrote: > I have a Java application that makes a large number of outbound > webservice calls over HTTP/TCP. The hosts contacted are a fixed set > of about 2000 hosts and a web service call is made to each of them > approximately every 5 mintues by a pool of 200 Java threads. Over > time, on average a percentage of these hosts are unreachable for one > reason or another, usually because they are on wireless cell phone > NICs, so there is a persistent count of sockets in the SYN_SENT state > in the range of about 60-80. This is fine, as these failed connection > attempts eventually time out. > > However, after approximately 38 hours of operation, all outbound > connection attempts get stuck in the SYN_SENT state. It happens > instantaneously, where I go from the baseline of about 60-80 sockets > in SYN_SENT to a count of 200 (corresponding to the # of java threads > that make these calls). > > When I stop and start the Java application, all the new outbound > connections still get stuck in SYN_SENT state. Is it so that they don't timeout at all? You can collect some of their state from /proc/net/tcp (shows at least timers and attempt counters) > During this time, I am > still able to SSH to the box and run wget to Google, cnn, etc, so the > problem appears to be specific to the hosts that I'm accessing via the > webservices. Are you sure that you just don't get unlucky at some point of time and all 200 available threads are just temporarily stuck and your application is just very slowly progressing then? > For a long time, the only thing that would resolve this was rebooting > the entire machine. Once I did this, the outbound connections could > be made succesfully. To the very same hosts? Or to another set of hosts? 
> However, very recently when I had once of these incidents I disabled > tcp_sack via: > > echo "0" > /proc/sys/net/ipv4/tcp_sack > > And the problem almost instanteaously resolved itself and outbound > connection attempts were succesful. New or the pending ones? > I hadn't attempted this before because I assumed that if any of my > network > equipment or remote hosts had a problem with SACK, that it would never > work. Many bugs just are not like that at all... Usually people who coded things had at least some clue :-), so things work "almost correctly"... > In my case, it worked fine for about 38 hours before hitting a > wall where no outbound connections could be made. How accurate number? Is the lockup somehow related to daytime cycle? > Is there a kernel buffer or some data structure that tcp_sack uses > that gets filled up after an extended period of operation? SACK has pretty little meaning in context of SYNs, there's only the sackperm(itted) TCP option which is sent along with the SYN/SYN-ACK. The SACK scoreboard is currently included to the skbs (has been like this for very long time), so no additional data structures should be there because of SACK... > How can I debug this problem in the kernel to find out what the root cause is? > > I emailed linux-kernel and they asked for output of netstat -s, I can > get this the next > time it occurs- any other usefull data to collect? /proc/net/tcp couple of times in a row, try something something like this: for i in (seq 1 40); do cat /proc/net/tcp; echo "-"; sleep 10; done > I'm running kernel 2.6.18 on RedHat, but have had this problem occur > on earlier kernel versions (all 2.4 and 2.6). I've done some fixes to SACK processing since 2.6.18 (not sure if RedHat has backported them). Though they're not that critical nor anything in them should affect in SYN_SENT state. -- i. 
Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
On Wed, 19 Dec 2007, Gavin McCullagh wrote: > On Wed, 19 Dec 2007, Ilpo Järvinen wrote: > > > Isn't it also much better this way in a case where ACK losses happened, > > taking the longest RTT in that case is clearly questionable as it > > may over-estimate considerably. > > Quite so. > > > However, another thing to consider is the possibility of this value being > > used in "timeout-like" fashion in ca modules (I haven't read enough ca > > modules code to know if any of them does that), on contrary to > > determinating just rtt or packet's delay in which case this change seems > > appropriate (most modules do the latter). > > I'm not aware of any, but I haven't read them all either. I would have > thought tp->srtt was the value to use in this instance, Very likely so... > > > Therefore, if timeout-like module exists one should also add > > TCP_CONG_RTT_STAMP_LONGEST for that particular module and keep using > > seq_rtt for it like previously and use ca_seq_rtt only for others. > > Seems reasonable. I'll add this. ...therefore I said "if". I'm not sure what they all do, haven't read them all that carefully... :-) Please check first if ..._LONGEST is necessary at all by quickly going through how the ca modules use it; I guess most of them won't be that complicated, and one can probably figure out the intended usage with a couple of minutes' review. If there aren't any modules that need delayed-ACK and other path-caused delays included, ..._LONGEST would just end up being unnecessary cruft :-). > > This part doesn't exists anymore in development tree. Please base this > > patch (and anything in future) you intend to get included to mainline > > onto net-2.6.25 unless there's a very good reason to not do so or > > whatever 2.6.xx is the correct net development tree at that time (if > > one exists). Thanks. > > Will do. I gather I should use the latest net- tree in future when > submitting patches. Doh, I owe you an apology as I was probably too hasty to point you towards net-2.6.25.
I suppose this could be considered a fix as well and could therefore probably be accepted into net-2.6, which is for bug fixes only after the merge window is closed. But it's Dave who will make such decisions, not me :-), and it's he who gets to deal with all the resulting conflicts ;-) (I added a Cc to him). -- i.
Re: After many hours all outbound connections get stuck in SYN_SENT
On Wed, 19 Dec 2007, James Nichols wrote: > > > And the problem almost instanteaously resolved itself and outbound > > > connection attempts were succesful. > > > > New or the pending ones? > > I'm fairly sure that sockets that were already open in SYN_SENT state > when I turned tcp_sack off started to work as the count of sockets in > SYN_SENT state drops very rapidly. Heh, "very rapidly" :-/, considering that you have 200 x SYN_SENT flows and if tcp_syn_retries is set to default, you will see something happening quite quickly for sure and the whole situation is over in ~3 mins if I counted correctly. > > > Is there a kernel buffer or some data structure that tcp_sack uses > > > that gets filled up after an extended period of operation? > > > > SACK has pretty little meaning in context of SYNs, there's only the > > sackperm(itted) TCP option which is sent along with the SYN/SYN-ACK. > > > > The SACK scoreboard is currently included to the skbs (has been like > > this for very long time), so no additional data structures should be > > there because of SACK... > > I've been seeing this problem for about 4 years, so could it be > related to the scoreboard implementation somehow? Scoreboard has no meaning in this context, it is used while _input_ packets are processed. If you don't get SYN-ACKs at all, it doesn't have any meaning. > > /proc/net/tcp couple of times in a row, try something something like > > this: > > > > for i in (seq 1 40); do cat /proc/net/tcp; echo "-"; sleep 10; done > > I can set up to do this the next time the problem occurs. for i in $(seq 1 40); ... is the right way to do the loop. :-) > > > I'm running kernel 2.6.18 on RedHat, but have had this problem occur > > > on earlier kernel versions (all 2.4 and 2.6). > > > > I've done some fixes to SACK processing since 2.6.18 (not sure if RedHat > > has backported them). Though they're not that critical nor anything in > > them should affect in SYN_SENT state. 
> > Ok, unless there is direct evidence that there is a fix to this > problem in a later kernel I won't be able to upgrade. If there is a > redhat provided patch I can probably do that. ...They won't affect the SYN_SENT state. -- i.
Re: After many hours all outbound connections get stuck in SYN_SENT
On Wed, 19 Dec 2007, Eric Dumazet wrote: > James Nichols wrote: > > On 12/19/07, Eric Dumazet <[EMAIL PROTECTED]> wrote: > > > James Nichols wrote: > > > > > So you see outgoing SYN packets, but no SYN replies coming from the > > > > > remote > > > > > peer ? (you mention ACKS, but the first packet received from the > > > > > remote > > > > > peer should be a SYN+ACK), > > > > Right, I meant to say SYN+ACK. I don't see them coming back. > > > So... Really unlikely a linux problem, but ... > > > > > > > > > I don't know how you can be so sure. Turning tcp_sack off instantly > > resovles the problem and all connections are succesful. I can't > > imagine even the most far-fetched scenario where a router or every > > single remote endpoints would suddenly stop causing the problem just > > by removing a single TCP option. You could also check if you can pull the same trick off by touching tcp_timestamps. It affects the TCP header as well. -- i.
TSO trimming question
Hi all, I'm not fully sure what's purpose of this code in tcp_write_xmit: if (skb->len < limit) { unsigned int trim = skb->len % mss_now; if (trim) limit = skb->len - trim; } Is it used to make sure we send only multiples of mss_now here and leave the left-over into another skb? Or does it try to make sure that tso_fragment result honors multiple of mss_now boundaries when snd_wnd is the limitting factor? For latter IMHO this would be necessary: if (skb->len > limit) limit -= limit % mss_now; -- i.
Re: TSO trimming question
On Wed, 19 Dec 2007, Ilpo Järvinen wrote: > I'm not fully sure what's purpose of this code in tcp_write_xmit: > >if (skb->len < limit) { >unsigned int trim = skb->len % mss_now; > >if (trim) >limit = skb->len - trim; > } > > Is it used to make sure we send only multiples of mss_now here and leave > the left-over into another skb? Or does it try to make sure that > tso_fragment result honors multiple of mss_now boundaries when snd_wnd > is the limitting factor? For latter IMHO this would be necessary: > > if (skb->len > limit) > limit -= limit % mss_now; ...Anyway, here's an untested patch to just do it :-): -- i. [PATCH] [TCP]: Force tso skb split to mss_now boundary (if snd_wnd limits) If snd_wnd was limitting factor, the tso_fragment might not create full-sized skbs but would include the window left-overs as well. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c |4 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1c7ef17..8dafda9 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1453,6 +1453,8 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle) if (trim) limit = skb->len - trim; + } else if (skb->len > limit) { + limit -= limit % mss_now; } } @@ -1525,6 +1527,8 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now) if (trim) limit = skb->len - trim; + } else if (skb->len > limit) { + limit -= limit % mss_now; } } -- 1.5.0.6
Re: A short question about net git tree and patches
On Thu, 20 Dec 2007, Sam Ravnborg wrote: > On Thu, Dec 20, 2007 at 11:20:26AM +0200, David Shwatrz wrote: > > Hello, > > I have a short question regarding the net git tree and patches: > > I want to write and send patches against the most recent and > > bleeding edge kernel networking code. > > I see in: > > http://kernel.org/pub/scm/linux/kernel/git/davem/?C=M;O=A > > that there are 3 git trees which can be candidates for git-clone and > > making patches against; > > these are: > > netdev-2.6.git, net-2.6.25.git and net-2.6.git. > > > > It seems to me that net-2.6.git is the most suitable one to work against; > > am I right ? > > what is the difference, in short, between the three repositories? > > IIRC the usage is: > netdev-2.6.git <= old stuff, 4 weeks since last update. Not in use > net-2.6.25.git <= patches for current kernel release (only fixes) Nope, we don't even have 2.6.24 yet. :-) > net-2.6.git <= patches for next kernel relase and planned to be > applied in next merge window > > So net-2.6.git is the correct choice for bleeding edge. net-2.6 is for fixes only; net-2.6.25 will become net-2.6 once 2.6.24 gets released, and eventually net-2.6.26 gets opened (not necessarily at the same time as the merge window closes, but a bit later). The cycle then repeats with similar transitions when 2.6.25 gets released. The netdev trees are for network drivers and are usually managed by Jeff Garzik, but there was recently an arrangement between Dave and Jeff due to vacations, so that netdev was also managed by Dave. -- i.
Re: TSO trimming question
On Wed, 19 Dec 2007, David Miller wrote: > From: "Ilpo_Järvinen" <[EMAIL PROTECTED]> > Date: Wed, 19 Dec 2007 23:46:33 +0200 (EET) > > > I'm not fully sure what's purpose of this code in tcp_write_xmit: > > > >if (skb->len < limit) { > >unsigned int trim = skb->len % mss_now; > > > >if (trim) > >limit = skb->len - trim; > > } > > > > Is it used to make sure we send only multiples of mss_now here and leave > > the left-over into another skb? Yeah, I now understand that this part is correct. I somehow got such impression while trying to figure this out that it ends up being dead code but that wasn't correct thought from my side. However, it caught my attention and after some thinking I'd say there's more to handle here (covered by the second question). Also note that patch I sent earlier is not right either but needs some refining to do the right thing. > > Or does it try to make sure that > > tso_fragment result honors multiple of mss_now boundaries when snd_wnd > > is the limitting factor? For latter IMHO this would be necessary: > > > > if (skb->len > limit) > > limit -= limit % mss_now; > > The purpose of the test is to make sure we process tail sub-mss chunks > correctly wrt. Nagle, which most closely matches the first purpose > you've listed. > > So I think the calculation really does belong where it is. > > Because of the way that the sendmsg() super-skb formation logic > works, we always will tack on more data and grow the tail > SKB before creating a new one. So any sub-mss chunk at the > end of a TSO frame really is at the end of the write queue > and really should get nagle processing. Yes, I now agree this is fully correct for this task. > Actually, there is an exception, which is when we run out of > skb_frag_list slots. In that case we'll potentially have breaks at > odd boundaries in the middle of the queue. 
But this can only happen > in exceptional cases (user does tons of 1-byte sendfile()'s over > random non-consequetive locations of a file) or outright bugs > (MAX_SKB_FRAGS is defined incorrectly, for example) and thus this > situation is not worth coding for. That's not the only case, IMHO if there's odd boundary due to snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't consider it as odd but break the skb at arbitary point resulting two small segments to the network, and what's worse, when the later skb resulting from the first split is matching skb->len < limit check as well causing an unnecessary small skb to be created for nagle purpose too, solving it fully requires some thought in case the mss_now != mss_cache even if non-odd boundaries are honored in the middle of skb. Though whether we get there is depending on what tcp_tso_should_defer() decided. Hmm, there seems to be an unrelated bug in it as well :-/. A patch below. Please consider the fact that enabling TSO deferring may have some unpleasant effect to TCP dynamics, considering that I don't find stable mandatory for this to avoid breaking, besides things have been working quite well without it too... Only compile tested. -- i. -- [PATCH] [TCP]: Fix TSO deferring I'd say that most of what tcp_tso_should_defer had in between there was dead code because of this. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 8dafda9..693b9f6 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1217,7 +1217,8 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb) goto send_now; /* Defer for less than two clock ticks. 
*/ - if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1) + if (tp->tso_deferred && + ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1) goto send_now; in_flight = tcp_packets_in_flight(tp); -- 1.5.0.6
Re: TSO trimming question
On Thu, 20 Dec 2007, David Miller wrote: > From: "Ilpo_Järvinen" <[EMAIL PROTECTED]> > Date: Thu, 20 Dec 2007 13:40:51 +0200 (EET) > > > That's not the only case, IMHO if there's odd boundary due to > > snd_una+snd_wnd - skb->seq limit (done in tcp_window_allows()), we don't > > consider it as odd but break the skb at arbitary point resulting > > two small segments to the network, and what's worse, when the later skb > > resulting from the first split is matching skb->len < limit check as well > > causing an unnecessary small skb to be created for nagle purpose too, > > solving it fully requires some thought in case the mss_now != mss_cache > > even if non-odd boundaries are honored in the middle of skb. > > In the most ideal sense, tcp_window_allows() should probably > be changed to only return MSS multiples. That's what Herbert suggested already, I'll send a patch later on... :-) > Unfortunately this would add an expensive modulo operation, > however I think it would elimiate this problem case. Yes. Should we still call tcp_minshall_update() if split in the middle of wq results in smaller than MSS tail (occurs only if mss_now != mss_cache)? -- i.
Re: After many hours all outbound connections get stuck in SYN_SENT
On Thu, 20 Dec 2007, James Nichols wrote: > > I still dont understand. > > > > "tcpdump -p -n -s 1600 -c 1" doesnt reveal User data at all. > > > > Without any exact data from you, I am afraid nobody can help. > > Oh, I didn't see that you specified specific options. I'll still have > to anonymize 2000+ IP addresses, but I think there is an open source > tool that will do this for you. Even a simple for loop in the shell can do that. It's not that hard and there's very little need for manual work! Ingredients: for, cut, grep and sed. -- i.
Re: After many hours all outbound connections get stuck in SYN_SENT
On Thu, 20 Dec 2007, James Nichols wrote: > > You'd probably should also investigate the Linux kernel, > > especially the size and locks of the components of the Sack data > > structures and what happens to those data structures after Sack is > > disabled (presumably the Sack data structure is in some unhappy > > circumstance, and disabling Sack allows the data to be discarded, > > magically unclaging the box). ...Not sure you'd want to invent such a structure now. Yes, we have the per-skb ->sacked, but again, in SYN_SENT there are very few things that touch it at all, and they just set it to zero (though that would not even be mandatory for tcp_transmit_skb, IIRC; I checked that just a couple of days ago due to other things). Another thing is rx_opt.sack_ok, which is just a couple of flag bits that tell the TCP variant in use (and it's mostly used only after the SYN handshake completes). The rest (the actual SACK blocks) is in the ack_skb, but again it has very little meaning in the SYN_SENT state unless somebody is crazy enough to add SACK blocks to SYN-ACKs :-). > > In the absence of the reporter wanting to dump the kernel's > > core, how about a patch to print the Sack datastructure when > > the command to disable Sack is received by the kernel? > > Maybe just print the last 16b of the IP address? > > Given the fact that I've had this problem for so long, over a variety > of networking hardware vendors and colo-facilities, this really sounds > good to me. It will be challenging for me to justify a kernel core > dump, but a simple patch to dump the Sack data would be do-able.
If your symptoms really are SYNs leaving (if they show up in tcpdump, they have for sure already left the TCP code) and SYN-ACKs not showing up even in something as early as tcpdump (so the TCP-side code didn't execute at that point yet), then there's very little chance that Linux' TCP code has a bug in it; the only things that do anything in such a scenario are the SYN generation and the SYN retransmissions (and those are trivially verifiable from tcpdump). -- i.
Re: [TCP] IPV6 : Change a divide into a right shift in tcp_v6_send_ack()
On Fri, 21 Dec 2007, Eric Dumazet wrote: > Because tot_len is signed in tcp_v6_send_ack(), tot_len/4 forces compiler to emit an integer divide, while we can help it to use a right shift, less expensive. Can't you just change tot_len to unsigned here? It's just sizeof and some positive constants added... -- i. diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 0268e11..92f0fda 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1124,7 +1124,7 @@ static void tcp_v6_send_ack(struct tcp_timewait_sock *tw, memset(t1, 0, sizeof(*t1)); t1->dest = th->source; t1->source = th->dest; - t1->doff = tot_len/4; + t1->doff = tot_len >> 2; t1->seq = htonl(seq); t1->ack_seq = htonl(ack); t1->ack = 1;
Re: TSO trimming question
On Fri, 21 Dec 2007, Bill Fink wrote: > I meant to ask about this a while back but then got distracted by > other things. But now since the subject has come up, I had a couple > of more questions about this code. > > What's with all the shifting back and forth? Here with: > > ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > > and later with: > > /* Ok, it looks like it is advisable to defer. */ > tp->tso_deferred = 1 | (jiffies<<1); > > Is this just done to try and avoid the special case of jiffies==0 > when the jiffies wrap? I thought that it must be the reason. I couldn't figure out any other explanation while thinking the same thing but since I saw no problem in that rather weird approach, I didn't want to touch that in a patch which had net-2.6 (or stable) potential. > If so it seems like a lot of unnecessary > work just to avoid a 1 in 4 billion event, since it's my understanding > that the whole tcp_tso_should_defer function is just an optimization > and not a criticality to the proper functioning of TCP, especially > considering it hasn't even been executing at all up to now. It would still be good to not return 1 in that case we didn't flag the deferral, how about adding one additional tick (+comment) in the zero case making tso_deferred == 0 again unambiguously defined, e.g.: tp->tso_deferred = min_t(u32, jiffies, 1); ...I'm relatively sure that nobody would ever notice that tick :-) and we kept return value consistent with tso_deferred state invariant. > My second question is more basic and if I'm not mistaken actually > relates to a remaining bug in the (corrected) test: > > /* Defer for less than two clock ticks. */ > if (tp->tso_deferred && > ((jiffies << 1) >> 1) - (tp->tso_deferred >> 1) > 1) > > Since jiffies is an unsigned long, which is 64-bits on a 64-bit system, > whereas tp->tso_deferred is a u32, once jiffies exceeds 31-bits, which > will happen in about 25 days if HZ=1000, won't the second part of the > test always be true after that?
> Or am I missing something obvious? I didn't notice that earlier, but I think you're right, though my knowledge of jiffies and such is quite limited. ...Feel free to submit a patch to correct these. -- i.
Re: TSO trimming question
On Fri, 21 Dec 2007, Ilpo Järvinen wrote: > On Fri, 21 Dec 2007, Bill Fink wrote: > > > If so it seems like a lot of unnecessary > > work just to avoid a 1 in 4 billion event, since it's my understanding > > that the whole tcp_tso_should_defer function is just an optimization > > and not a criticality to the proper functioning of TCP, especially > > considering it hasn't even been executing at all up to now. > > It would still be good to not return 1 in that case we didn't flag the > deferral, how about adding one additional tick (+comment) in the zero > case making tso_deferred == 0 again unambiguously defined, e.g.: > > tp->tso_deferred = min_t(u32, jiffies, 1); Blah, max_t of course :-). -- i.
[PATCH] [TCP]: Remove seq_rtt ptr from clean_rtx_queue args
While checking Gavin's patch I noticed that the returned seq_rtt is not used by the caller. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- I think it shouldn't introduce new conflicts between net-2.6.25 and net-2.6 (tested with git-am -3 to both). net/ipv4/tcp_input.c |7 ++- 1 files changed, 2 insertions(+), 5 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 0f0c1c9..a020e83 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2638,8 +2638,7 @@ static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb) * is before the ack sequence we can discard it as it's confirmed to have * arrived at the other end. */ -static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p, - int prior_fackets) +static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets) { struct tcp_sock *tp = tcp_sk(sk); const struct inet_connection_sock *icsk = inet_csk(sk); @@ -2803,7 +2802,6 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 *seq_rtt_p, } } #endif - *seq_rtt_p = seq_rtt; return flag; } @@ -3044,7 +3042,6 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) u32 ack = TCP_SKB_CB(skb)->ack_seq; u32 prior_in_flight; u32 prior_fackets; - s32 seq_rtt; int prior_packets; int frto_cwnd = 0; @@ -3111,7 +3108,7 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) prior_in_flight = tcp_packets_in_flight(tp); /* See if we can take anything off of the retransmit queue. */ - flag |= tcp_clean_rtx_queue(sk, &seq_rtt, prior_fackets); + flag |= tcp_clean_rtx_queue(sk, prior_fackets); if (tp->frto_counter) frto_cwnd = tcp_process_frto(sk, flag); -- 1.5.0.6
Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
On Fri, 21 Dec 2007, Gavin McCullagh wrote: > On Fri, 21 Dec 2007, David Miller wrote: > > > When Gavin respins the patch I'll look at in the context of submitting > > it as a bug fix. So Gavin please generate the patch against Linus's > > vanilla GIT tree or net-2.6, your choise. > > The existing patch was against Linus' linux-2.6.git from a few days ago so > I've updated my tree and regenerated the patch (below). Is that the right > one? > > I'm just checking through the existing CA modules. I don't see the rtt > used for RTO anywhere. This is what I gather they're each using rtt for. I meant more timeout like fashion (e.g., to "timeout" some internal phase but I don't find that too likely)... Thanks for checking them. > So as far as I can tell, timeout stuff is not ever altered using > pkts_acked() so I guess this fix only affects westwood, htcp and cubic just > now. > > I need to re-read properly, but I think the same problem affects the > microsecond values where TCP_CONG_RTT_STAMP is set (used by vegas, veno, > yeah, illinois). I might follow up with another patch which changes the > behaviour where TCP_CONG_RTT_STAMP when I'm more sure of that. Please do, you might have to remove fully_acked checks to do that right though so it won't be as straight-forward change as this one and requires some amount of thinking to result in a right thing. 
> Signed-off-by: Gavin McCullagh <[EMAIL PROTECTED]> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 889c893..6fb7989 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -2651,6 +2651,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, > u32 cnt = 0; > u32 reord = tp->packets_out; > s32 seq_rtt = -1; > + s32 ca_seq_rtt = -1; > ktime_t last_ackt = net_invalid_timestamp(); > > while ((skb = tcp_write_queue_head(sk)) && skb != tcp_send_head(sk)) { > @@ -2686,13 +2687,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, > if (sacked & TCPCB_SACKED_RETRANS) > tp->retrans_out -= packets_acked; > flag |= FLAG_RETRANS_DATA_ACKED; > + ca_seq_rtt = -1; > seq_rtt = -1; > if ((flag & FLAG_DATA_ACKED) || > (packets_acked > 1)) > flag |= FLAG_NONHEAD_RETRANS_ACKED; > } else { > + ca_seq_rtt = now - scb->when; > if (seq_rtt < 0) { > - seq_rtt = now - scb->when; > + seq_rtt = ca_seq_rtt; > if (fully_acked) > last_ackt = skb->tstamp; > } > @@ -2709,8 +2712,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, > !before(end_seq, tp->snd_up)) > tp->urg_mode = 0; > } else { > + ca_seq_rtt = now - scb->when; > if (seq_rtt < 0) { > - seq_rtt = now - scb->when; > + seq_rtt = ca_seq_rtt; > if (fully_acked) > last_ackt = skb->tstamp; > } > @@ -2772,8 +2776,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, s32 > *seq_rtt_p, >net_invalid_timestamp())) > rtt_us = > ktime_us_delta(ktime_get_real(), > last_ackt); > - else if (seq_rtt > 0) > - rtt_us = jiffies_to_usecs(seq_rtt); > + else if (ca_seq_rtt > 0) > + rtt_us = jiffies_to_usecs(ca_seq_rtt); > } > > ca_ops->pkts_acked(sk, pkts_acked, rtt_us); > Acked-by: Ilpo Järvinen <[EMAIL PROTECTED]> / Reviewed-by... whatever, I don't know if they really started to use it or not... :-) -- i.
Re: [PATCH/RFC] [v2] TCP: use non-delayed ACK for congestion control RTT
On Fri, 21 Dec 2007, Ilpo Järvinen wrote:

> On Fri, 21 Dec 2007, Gavin McCullagh wrote:
>
> > I'm just checking through the existing CA modules. I don't see the rtt
> > used for RTO anywhere. This is what I gather they're each using rtt for.
>
> I meant more timeout like fashion (e.g., to "timeout" some internal
> phase but I don't find that too likely)...

I meant "didn't"; no need to recheck them because of that... :-)

-- i.
[PATCH] [TCP]: Force TSO splits to MSS boundaries
On Thu, 20 Dec 2007, David Miller wrote:

> From: Herbert Xu <[EMAIL PROTECTED]>
> Date: Thu, 20 Dec 2007 22:00:12 +0800
>
> > On Thu, Dec 20, 2007 at 04:00:37AM -0800, David Miller wrote:
> > >
> > > In the most ideal sense, tcp_window_allows() should probably
> > > be changed to only return MSS multiples.
> > >
> > > Unfortunately this would add an expensive modulo operation,
> > > however I think it would elimiate this problem case.
> >
> > Well you only have to divide in the unlikely case of us being
> > limited by the receiver window. In that case speed is probably
> > not of the essence anyway.
>
> Agreed, to some extent.
>
> I say "to some extent" because it might be realistic, with
> lots (millions) of sockets to hit this case a lot.
>
> There are so many things that are a "don't care" performance
> wise until you have a lot of stinky connections over crappy
> links.

How about this? I had to use another approach for the reasons outlined in
the commit message:

--
[PATCH] [TCP]: Force TSO splits to MSS boundaries

If snd_wnd - snd_nxt wasn't a multiple of MSS, the skb was split on an
odd boundary by the callers of tcp_window_allows.

We try really hard to avoid unnecessary modulos. Therefore the old
caller-side check "if (skb->len < limit)" was too wide as well, because
limit is not bound in any way to skb->len and could cause spurious
testing for trimming in the middle of the queue while we only wanted
that to happen at the tail of the queue. A simple additional caller-side
check for tcp_write_queue_tail would likely have resulted in two modulo
operations, because the limit would have to be calculated from the
window first; however, doing that unnecessary modulo is not mandatory.
After a minor change to the algorithm, simply determine first whether
the modulo is needed at all, and at that point immediately decide from
which value it should be calculated.

This approach also kills some duplicated code.
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 51 - 1 files changed, 25 insertions(+), 26 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1852698..ea92a1b 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1008,13 +1008,29 @@ static void tcp_cwnd_validate(struct sock *sk) } } -static unsigned int tcp_window_allows(struct tcp_sock *tp, struct sk_buff *skb, unsigned int mss_now, unsigned int cwnd) +/* Returns the portion of skb which can be sent right away without + * introducing MSS oddities to segment boundaries. In rare cases where + * mss_now != mss_cache, we will request caller to create a small skb + * per input skb which could be mostly avoided here (if desired). + */ +static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb, + unsigned int mss_now, + unsigned int cwnd) { - u32 window, cwnd_len; + struct tcp_sock *tp = tcp_sk(sk); + u32 needed, window, cwnd_len; window = (tp->snd_una + tp->snd_wnd - TCP_SKB_CB(skb)->seq); cwnd_len = mss_now * cwnd; - return min(window, cwnd_len); + + if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk))) + return cwnd_len; + + if (skb == tcp_write_queue_tail(sk) && cwnd_len <= skb->len) + return cwnd_len; + + needed = min(skb->len, window); + return needed - needed % mss_now; } /* Can at least one segment of SKB be sent right now, according to the @@ -1445,17 +1461,9 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle) } limit = mss_now; - if (tso_segs > 1) { - limit = tcp_window_allows(tp, skb, - mss_now, cwnd_quota); - - if (skb->len < limit) { - unsigned int trim = skb->len % mss_now; - - if (trim) - limit = skb->len - trim; - } - } + if (tso_segs > 1) + limit = tcp_mss_split_point(sk, skb, mss_now, + cwnd_quota); if (skb->len > limit && unlikely(tso_fragment(sk, skb, limit, mss_now))) @@ -1502,7 +1510,6 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss, */ void 
tcp_push_one(struct sock *sk, unsigned int mss_now) { - struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb = tcp_send_head(sk); unsigned int tso_segs, cwnd_qu
Re: TSO trimming question
On Fri, 21 Dec 2007, Bill Fink wrote:

> On Fri, 21 Dec 2007, Bill Fink wrote:
>
> > Or perhaps even:
> >
> > 	/* Ok, it looks like it is advisable to defer. */
> > 	tp->tso_deferred = jiffies;
> >
> > 	/* need to return a non-zero value to defer, which means won't
> > 	 * defer if jiffies == 0 but it's only a 1 in 4 billion event
> > 	 * (and avoids a compare/branch by not checking jiffies)
> > 	 */
> > 	return jiffies;
>
> Ack. I introduced my own 64-bit to 32-bit issue (too late at night).
> How about:
>
> 	/* Ok, it looks like it is advisable to defer. */
> 	tp->tso_deferred = jiffies;
>
> 	/* this won't defer if jiffies == 0 but it's only a 1 in
> 	 * 4 billion event (and avoids a branch)
> 	 */
> 	return (jiffies != 0);

I'm not sure how jiffies works, but is this racy as well? A simple

	return tp->tso_deferred;

should work, shouldn't it? :-)

-- i.
Re: [PATCH] [TCP]: Force TSO splits to MSS boundaries
On Fri, 21 Dec 2007, Ilpo Järvinen wrote: > How about this, I had to use another approach due to reasons > outlined in the commit message: > > -- > [PATCH] [TCP]: Force TSO splits to MSS boundaries > > If snd_wnd - snd_nxt wasn't multiple of MSS, skb was split on > odd boundary by the callers of tcp_window_allows. > > We try really hard to avoid unnecessary modulos. Therefore the > old caller side check "if (skb->len < limit)" was too wide as > well because limit is not bound in any way to skb->len and can > cause spurious testing for trimming in the middle of the queue > while we only wanted that to happen at the tail of the queue. > A simple additional caller side check for tcp_write_queue_tail > would likely have resulted 2 x modulos because the limit would > have to be first calculated from window, however, doing that > unnecessary modulo is not mandatory. After a minor change to > the algorithm, simply determine first if the modulo is needed > at all and at that point immediately decide also from which > value it should be calculated from. > > This approach also kills some duplicated code. > > Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> > --- > net/ipv4/tcp_output.c | 51 > - > 1 files changed, 25 insertions(+), 26 deletions(-) > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c > index 1852698..ea92a1b 100644 > --- a/net/ipv4/tcp_output.c > +++ b/net/ipv4/tcp_output.c > @@ -1008,13 +1008,29 @@ static void tcp_cwnd_validate(struct sock *sk) > } > } > > -static unsigned int tcp_window_allows(struct tcp_sock *tp, struct sk_buff > *skb, unsigned int mss_now, unsigned int cwnd) > +/* Returns the portion of skb which can be sent right away without > + * introducing MSS oddities to segment boundaries. In rare cases where > + * mss_now != mss_cache, we will request caller to create a small skb > + * per input skb which could be mostly avoided here (if desired). 
> + */
> +static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb,
> +					unsigned int mss_now,
> +					unsigned int cwnd)
>  {
> -	u32 window, cwnd_len;
> +	struct tcp_sock *tp = tcp_sk(sk);
> +	u32 needed, window, cwnd_len;
>
>  	window = (tp->snd_una + tp->snd_wnd - TCP_SKB_CB(skb)->seq);
>  	cwnd_len = mss_now * cwnd;
> -	return min(window, cwnd_len);
> +
> +	if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk)))
> +		return cwnd_len;
> +
> +	if (skb == tcp_write_queue_tail(sk) && cwnd_len <= skb->len)
> +		return cwnd_len;

...If it is desired that this not cause small skbs in the middle of the
queue (except when just the sub-MSS portion is left out above the
window), something like this could be added here (again avoiding more
modulos):

	if (skb != tcp_write_queue_tail(sk) && window > skb->len)
		return skb->len;

> +	needed = min(skb->len, window);
> +	return needed - needed % mss_now;
>  }
>
>  /* Can at least one segment of SKB be sent right now, according to the

-- i.
Re: [PATCH/RFC] [v3] TCP: use non-delayed ACK for congestion control RTT
On Sun, 30 Dec 2007, Gavin McCullagh wrote:

> Hi,
>
> On Fri, 21 Dec 2007, Ilpo Järvinen wrote:
>
> > > I need to re-read properly, but I think the same problem affects the
> > > microsecond values where TCP_CONG_RTT_STAMP is set (used by vegas, veno,
> > > yeah, illinois). I might follow up with another patch which changes the
> > > behaviour where TCP_CONG_RTT_STAMP when I'm more sure of that.
> >
> > Please do, you might have to remove fully_acked checks to do that right
> > though so it won't be as straight-forward change as this one and requires
> > some amount of thinking to result in a right thing.
>
> The TCP_CONG_RTT_STAMP code does need to be fixed similarly. A combined
> patch will follow this mail. As I understand it, the fully_acked checks
> kick in where the ACK is a portion of a TSO chunk and doesn't completely
> ACK that chunk. Leaving the checks in place means you get one rtt for each
> TSO chunk, based on the ACK for the last byte of the chunk. This rtt
> therefore is the maximum available and includes the time-lag between the
> first and last chunk being acked. Removing the tests gives you an RTT
> value for each ACK in a tso chunk, including the minimum and maximum. It
> seems the minimum rtt is the best indicator of congestion. On the other
> hand having all available RTTs gives the congestion avoidance greater
> knowledge of how the RTT is evolving (albeit somewhat coloured by TSO
> delays which don't seem too severe in my tests).

I guess the non-minimum TSO delays are only significant when something
exceptional is happening, and in such a case we definitely want to have
some measurements taken.

-- i.
[PATCH net-2.6.25 0/9]: TCP cleanups & minor changes.
Hi Dave,

The first one is restored after getting removed in the straight-forward
revert we did. Please check that the TCPCB_URG removal is indeed valid; I
couldn't find any use for it, but there might be some non-obviously named
things I've missed.

I did a larger cleanup with indent for tcp_input and tcp_output because I
kept finding unnecessarily split lines & missing spaces here and there.
There's still a lot to do because not every case has a trivial solution,
but at least something got cleaner :-).

These should apply cleanly to the rebased net-2.6.25. I did some trivial
tests with them before the rebase.

-- i.
[PATCH 1/9] [TCP]: Make invariant check complain about invalid sacked_out
Earlier resolution for NewReno's sacked_out should now keep it small
enough for this to become an invariant-like check.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/ipv4/tcp_input.c |    5 +----
 1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 722c9cb..41f4b86 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2504,11 +2504,8 @@ tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 		    (tcp_fackets_out(tp) > tp->reordering));
 	int fast_rexmit = 0;
 
-	/* Some technical things:
-	 * 1. Reno does not count dupacks (sacked_out) automatically. */
-	if (!tp->packets_out)
+	if (WARN_ON(!tp->packets_out && tp->sacked_out))
 		tp->sacked_out = 0;
-
 	if (WARN_ON(!tp->sacked_out && tp->fackets_out))
 		tp->fackets_out = 0;
-- 
1.5.0.6
[PATCH 3/9] [TCP]: Remove unnecessary local variables
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 11 +++ 1 files changed, 3 insertions(+), 8 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 1ca638b..025dddf 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -995,9 +995,8 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed) static void tcp_cwnd_validate(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); - __u32 packets_out = tp->packets_out; - if (packets_out >= tp->snd_cwnd) { + if (tp->packets_out >= tp->snd_cwnd) { /* Network is feed fully. */ tp->snd_cwnd_used = 0; tp->snd_cwnd_stamp = tcp_time_stamp; @@ -1042,17 +1041,13 @@ static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb, */ static inline unsigned int tcp_cwnd_test(struct tcp_sock *tp, struct sk_buff *skb) { - u32 in_flight, cwnd; - /* Don't be strict about the congestion window for the final FIN. */ if ((TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) && tcp_skb_pcount(skb) == 1) return 1; - in_flight = tcp_packets_in_flight(tp); - cwnd = tp->snd_cwnd; - if (in_flight < cwnd) - return (cwnd - in_flight); + if (tcp_packets_in_flight(tp) < tp->snd_cwnd) + return tp->snd_cwnd - tcp_packets_in_flight(tp); return 0; } -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/9] [TCP]: Rename update_send_head & include related increment to it
There's very little need to have the packets_out incrementing in a separate function. Also name the combined function appropriately. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 32 1 files changed, 12 insertions(+), 20 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 7a4834a..1ca638b 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -61,29 +61,22 @@ int sysctl_tcp_base_mss __read_mostly = 512; /* By default, RFC2861 behavior. */ int sysctl_tcp_slow_start_after_idle __read_mostly = 1; -static inline void tcp_packets_out_inc(struct sock *sk, - const struct sk_buff *skb) -{ - struct tcp_sock *tp = tcp_sk(sk); - int orig = tp->packets_out; - - tp->packets_out += tcp_skb_pcount(skb); - if (!orig) - inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, - inet_csk(sk)->icsk_rto, TCP_RTO_MAX); -} - -static void update_send_head(struct sock *sk, struct sk_buff *skb) +static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb) { struct tcp_sock *tp = tcp_sk(sk); + unsigned int prior_packets = tp->packets_out; tcp_advance_send_head(sk, skb); tp->snd_nxt = TCP_SKB_CB(skb)->end_seq; - tcp_packets_out_inc(sk, skb); /* Don't override Nagle indefinately with F-RTO */ if (tp->frto_counter == 2) tp->frto_counter = 3; + + tp->packets_out += tcp_skb_pcount(skb); + if (!prior_packets) + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, + inet_csk(sk)->icsk_rto, TCP_RTO_MAX); } /* SND.NXT, if window was not shrunk. @@ -1410,7 +1403,7 @@ static int tcp_mtu_probe(struct sock *sk) /* Decrement cwnd here because we are sending * effectively two packets. */ tp->snd_cwnd--; - update_send_head(sk, nskb); + tcp_event_new_data_sent(sk, nskb); icsk->icsk_mtup.probe_size = tcp_mss_to_mtu(sk, nskb->len); tp->mtu_probe.probe_seq_start = TCP_SKB_CB(nskb)->seq; @@ -1494,7 +1487,7 @@ static int tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle) /* Advance the send_head. This one is sent out. 
* This call will increment packets_out. */ - update_send_head(sk, skb); + tcp_event_new_data_sent(sk, skb); tcp_minshall_update(tp, mss_now, skb); sent_pkts++; @@ -1553,7 +1546,7 @@ void tcp_push_one(struct sock *sk, unsigned int mss_now) TCP_SKB_CB(skb)->when = tcp_time_stamp; if (likely(!tcp_transmit_skb(sk, skb, 1, sk->sk_allocation))) { - update_send_head(sk, skb); + tcp_event_new_data_sent(sk, skb); tcp_cwnd_validate(sk); return; } @@ -2528,9 +2521,8 @@ int tcp_write_wakeup(struct sock *sk) TCP_SKB_CB(skb)->flags |= TCPCB_FLAG_PSH; TCP_SKB_CB(skb)->when = tcp_time_stamp; err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC); - if (!err) { - update_send_head(sk, skb); - } + if (!err) + tcp_event_new_data_sent(sk, skb); return err; } else { if (tp->urg_mode && -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/9] [TCP]: Introduce tcp_wnd_end() to reduce line lengths
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- include/net/tcp.h |6 ++ net/ipv4/tcp_input.c |3 +-- net/ipv4/tcp_output.c | 20 ++-- 3 files changed, 17 insertions(+), 12 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 76286e8..6a732d4 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -785,6 +785,12 @@ static __inline__ __u32 tcp_max_burst(const struct tcp_sock *tp) return 3; } +/* Returns end sequence number of the receiver's advertised window */ +static inline u32 tcp_wnd_end(const struct tcp_sock *tp) +{ + return tp->snd_una + tp->snd_wnd; +} + /* RFC2861 Check whether we are limited by application or congestion window * This is the inverse of cwnd check in tcp_tso_should_defer */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 41f4b86..366f63a 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2921,8 +2921,7 @@ static void tcp_ack_probe(struct sock *sk) /* Was it a usable window open? */ - if (!after(TCP_SKB_CB(tcp_send_head(sk))->end_seq, - tp->snd_una + tp->snd_wnd)) { + if (!after(TCP_SKB_CB(tcp_send_head(sk))->end_seq, tcp_wnd_end(tp))) { icsk->icsk_backoff = 0; inet_csk_clear_xmit_timer(sk, ICSK_TIME_PROBE0); /* Socket must be waked up by subsequent tcp_data_snd_check(). diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 025dddf..202d60c 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -89,10 +89,10 @@ static inline __u32 tcp_acceptable_seq(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); - if (!before(tp->snd_una+tp->snd_wnd, tp->snd_nxt)) + if (!before(tcp_wnd_end(tp), tp->snd_nxt)) return tp->snd_nxt; else - return tp->snd_una+tp->snd_wnd; + return tcp_wnd_end(tp); } /* Calculate mss to advertise in SYN segment. 
@@ -1023,7 +1023,7 @@ static unsigned int tcp_mss_split_point(struct sock *sk, struct sk_buff *skb, struct tcp_sock *tp = tcp_sk(sk); u32 needed, window, cwnd_len; - window = (tp->snd_una + tp->snd_wnd - TCP_SKB_CB(skb)->seq); + window = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq; cwnd_len = mss_now * cwnd; if (likely(cwnd_len <= window && skb != tcp_write_queue_tail(sk))) @@ -1129,7 +1129,7 @@ static inline int tcp_snd_wnd_test(struct tcp_sock *tp, struct sk_buff *skb, uns if (skb->len > cur_mss) end_seq = TCP_SKB_CB(skb)->seq + cur_mss; - return !after(end_seq, tp->snd_una + tp->snd_wnd); + return !after(end_seq, tcp_wnd_end(tp)); } /* This checks if the data bearing packet SKB (usually tcp_send_head(sk)) @@ -1246,7 +1246,7 @@ static int tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb) BUG_ON(tcp_skb_pcount(skb) <= 1 || (tp->snd_cwnd <= in_flight)); - send_win = (tp->snd_una + tp->snd_wnd) - TCP_SKB_CB(skb)->seq; + send_win = tcp_wnd_end(tp) - TCP_SKB_CB(skb)->seq; /* From in_flight test above, we know that cwnd > in_flight. */ cong_win = (tp->snd_cwnd - in_flight) * tp->mss_cache; @@ -1327,7 +1327,7 @@ static int tcp_mtu_probe(struct sock *sk) if (tp->snd_wnd < size_needed) return -1; - if (after(tp->snd_nxt + size_needed, tp->snd_una + tp->snd_wnd)) + if (after(tp->snd_nxt + size_needed, tcp_wnd_end(tp))) return 0; /* Do we need to wait to drain cwnd? With none in flight, don't stall */ @@ -1682,7 +1682,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m return; /* Next skb is out of window. */ - if (after(TCP_SKB_CB(next_skb)->end_seq, tp->snd_una+tp->snd_wnd)) + if (after(TCP_SKB_CB(next_skb)->end_seq, tcp_wnd_end(tp))) return; /* Punt if not enough space exists in the first SKB for @@ -1826,7 +1826,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) * case, when window is shrunk to zero. In this case * our retransmit serves as a zero window probe. 
*/ - if (!before(TCP_SKB_CB(skb)->seq, tp->snd_una+tp->snd_wnd) + if (!before(TCP_SKB_CB(skb)->seq, tcp_wnd_end(tp)) && TCP_SKB_CB(skb)->seq != tp->snd_una) return -EAGAIN; @@ -2492,10 +2492,10 @@ int tcp_write_wakeup(struct sock *sk) struct sk_buff *skb; if ((skb = tcp_send_head(sk)) != NULL && - before(TCP_SKB_CB(skb)->seq, tp->snd_una+tp->snd_wnd)) { + before(TCP_SKB_CB(skb)->seq, tcp_wnd_end
[PATCH 5/9] [TCP]: Dropped unnecessary skb/sacked accessing in reneging
SACK reneging can be precalculated to a FLAG in clean_rtx_queue which has the right skb looked up. This will help a bit in future because skb->sacked access will be changed eventually, changing it already won't hurt any. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 25 + 1 files changed, 13 insertions(+), 12 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 366f63a..7bac1fa 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -105,6 +105,7 @@ int sysctl_tcp_abc __read_mostly; #define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */ #define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */ #define FLAG_NONHEAD_RETRANS_ACKED 0x1000 /* Non-head rexmitted data was ACKed */ +#define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */ #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED) #define FLAG_NOT_DUP (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED) @@ -1918,18 +1919,15 @@ void tcp_enter_loss(struct sock *sk, int how) tp->frto_counter = 0; } -static int tcp_check_sack_reneging(struct sock *sk) +/* If ACK arrived pointing to a remembered SACK, it means that our + * remembered SACKs do not reflect real state of receiver i.e. + * receiver _host_ is heavily congested (or buggy). + * + * Do processing similar to RTO timeout. + */ +static int tcp_check_sack_reneging(struct sock *sk, int flag) { - struct sk_buff *skb; - - /* If ACK arrived pointing to a remembered SACK, -* it means that our remembered SACKs do not reflect -* real state of receiver i.e. -* receiver _host_ is heavily congested (or buggy). -* Do processing similar to RTO timeout. 
-*/ - if ((skb = tcp_write_queue_head(sk)) != NULL && - (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) { + if (flag & FLAG_SACK_RENEGING) { struct inet_connection_sock *icsk = inet_csk(sk); NET_INC_STATS_BH(LINUX_MIB_TCPSACKRENEGING); @@ -2515,7 +2513,7 @@ tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag) tp->prior_ssthresh = 0; /* B. In all the states check for reneging SACKs. */ - if (tp->sacked_out && tcp_check_sack_reneging(sk)) + if (tcp_check_sack_reneging(sk, flag)) return; /* C. Process data loss notification, provided it is valid. */ @@ -2852,6 +2850,9 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets) tcp_clear_all_retrans_hints(tp); } + if (skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) + flag |= FLAG_SACK_RENEGING; + if (flag & FLAG_ACKED) { const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops; -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 7/9] [TCP]: reduce tcp_output's indentation levels a bit
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 239 + 1 files changed, 121 insertions(+), 118 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 182ae21..199253c 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1668,75 +1668,77 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m { struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *next_skb = tcp_write_queue_next(sk, skb); + int skb_size, next_skb_size; + u16 flags; /* The first test we must make is that neither of these two * SKB's are still referenced by someone else. */ - if (!skb_cloned(skb) && !skb_cloned(next_skb)) { - int skb_size = skb->len, next_skb_size = next_skb->len; - u16 flags = TCP_SKB_CB(skb)->flags; + if (skb_cloned(skb) || skb_cloned(next_skb)) + return; - /* Also punt if next skb has been SACK'd. */ - if (TCP_SKB_CB(next_skb)->sacked & TCPCB_SACKED_ACKED) - return; + skb_size = skb->len; + next_skb_size = next_skb->len; + flags = TCP_SKB_CB(skb)->flags; - /* Next skb is out of window. */ - if (after(TCP_SKB_CB(next_skb)->end_seq, tcp_wnd_end(tp))) - return; + /* Also punt if next skb has been SACK'd. */ + if (TCP_SKB_CB(next_skb)->sacked & TCPCB_SACKED_ACKED) + return; - /* Punt if not enough space exists in the first SKB for -* the data in the second, or the total combined payload -* would exceed the MSS. -*/ - if ((next_skb_size > skb_tailroom(skb)) || - ((skb_size + next_skb_size) > mss_now)) - return; + /* Next skb is out of window. */ + if (after(TCP_SKB_CB(next_skb)->end_seq, tcp_wnd_end(tp))) + return; - BUG_ON(tcp_skb_pcount(skb) != 1 || - tcp_skb_pcount(next_skb) != 1); + /* Punt if not enough space exists in the first SKB for +* the data in the second, or the total combined payload +* would exceed the MSS. 
+*/ + if ((next_skb_size > skb_tailroom(skb)) || + ((skb_size + next_skb_size) > mss_now)) + return; - tcp_highest_sack_combine(sk, next_skb, skb); + BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1); - /* Ok. We will be able to collapse the packet. */ - tcp_unlink_write_queue(next_skb, sk); + tcp_highest_sack_combine(sk, next_skb, skb); - skb_copy_from_linear_data(next_skb, - skb_put(skb, next_skb_size), - next_skb_size); + /* Ok. We will be able to collapse the packet. */ + tcp_unlink_write_queue(next_skb, sk); - if (next_skb->ip_summed == CHECKSUM_PARTIAL) - skb->ip_summed = CHECKSUM_PARTIAL; + skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size), + next_skb_size); - if (skb->ip_summed != CHECKSUM_PARTIAL) - skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size); + if (next_skb->ip_summed == CHECKSUM_PARTIAL) + skb->ip_summed = CHECKSUM_PARTIAL; - /* Update sequence range on original skb. */ - TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq; + if (skb->ip_summed != CHECKSUM_PARTIAL) + skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size); - /* Merge over control information. */ - flags |= TCP_SKB_CB(next_skb)->flags; /* This moves PSH/FIN etc. over */ - TCP_SKB_CB(skb)->flags = flags; + /* Update sequence range on original skb. */ + TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq; - /* All done, get rid of second SKB and account for it so -* packet counting does not break. -*/ - TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked & TCPCB_EVER_RETRANS; - if (TCP_SKB_CB(next_skb)->sacked&TCPCB_SACKED_RETRANS) - tp->retrans_out -= tcp_skb_pcount(next_skb); - if (TCP_SKB_CB(next_skb)->sacked&TCPCB_LOST) - tp->lost_out -= tcp_skb_pcount(next_skb); - /* Reno case is special. Sigh... */ - if (tcp_is_reno(tp) && tp->sacked_out) - tcp_dec_pcount_a
[PATCH 6/9] [TCP]: Remove TCPCB_URG & TCPCB_AT_TAIL as unnecessary
The snd_up check should be enough. I suspect this has been there to provide a minor optimization in clean_rtx_queue which used to have a small if (!->sacked) block which could skip snd_up check among the other work. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- include/net/tcp.h |4 net/ipv4/tcp.c|1 - net/ipv4/tcp_input.c |3 +-- net/ipv4/tcp_output.c |7 +++ 4 files changed, 4 insertions(+), 11 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 6a732d4..48081ad 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -578,10 +578,6 @@ struct tcp_skb_cb { #define TCPCB_EVER_RETRANS 0x80/* Ever retransmitted frame */ #define TCPCB_RETRANS (TCPCB_SACKED_RETRANS|TCPCB_EVER_RETRANS) -#define TCPCB_URG 0x20/* Urgent pointer advanced here */ - -#define TCPCB_AT_TAIL (TCPCB_URG) - __u16 urg_ptr;/* Valid w/URG flags is set.*/ __u32 ack_seq;/* Sequence number ACK'd*/ }; diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 2cbfa6d..34085e3 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -497,7 +497,6 @@ static inline void tcp_mark_urg(struct tcp_sock *tp, int flags, if (flags & MSG_OOB) { tp->urg_mode = 1; tp->snd_up = tp->write_seq; - TCP_SKB_CB(skb)->sacked |= TCPCB_URG; } } diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 7bac1fa..1e7fd81 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2821,8 +2821,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets) if (sacked & TCPCB_LOST) tp->lost_out -= acked_pcount; - if (unlikely((sacked & TCPCB_URG) && tp->urg_mode && -!before(end_seq, tp->snd_up))) + if (unlikely(tp->urg_mode && !before(end_seq, tp->snd_up))) tp->urg_mode = 0; tp->packets_out -= acked_pcount; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 202d60c..182ae21 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -711,7 +711,6 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss TCP_SKB_CB(skb)->flags = flags & 
~(TCPCB_FLAG_FIN|TCPCB_FLAG_PSH); TCP_SKB_CB(buff)->flags = flags; TCP_SKB_CB(buff)->sacked = TCP_SKB_CB(skb)->sacked; - TCP_SKB_CB(skb)->sacked &= ~TCPCB_AT_TAIL; if (!skb_shinfo(skb)->nr_frags && skb->ip_summed != CHECKSUM_PARTIAL) { /* Copy and checksum data tail into the new buffer. */ @@ -1721,7 +1720,7 @@ static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb, int m /* All done, get rid of second SKB and account for it so * packet counting does not break. */ - TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked&(TCPCB_EVER_RETRANS|TCPCB_AT_TAIL); + TCP_SKB_CB(skb)->sacked |= TCP_SKB_CB(next_skb)->sacked & TCPCB_EVER_RETRANS; if (TCP_SKB_CB(next_skb)->sacked&TCPCB_SACKED_RETRANS) tp->retrans_out -= tcp_skb_pcount(next_skb); if (TCP_SKB_CB(next_skb)->sacked&TCPCB_LOST) @@ -2470,7 +2469,7 @@ static int tcp_xmit_probe_skb(struct sock *sk, int urgent) skb_reserve(skb, MAX_TCP_HEADER); skb->csum = 0; TCP_SKB_CB(skb)->flags = TCPCB_FLAG_ACK; - TCP_SKB_CB(skb)->sacked = urgent; + TCP_SKB_CB(skb)->sacked = 0; skb_shinfo(skb)->gso_segs = 1; skb_shinfo(skb)->gso_size = 0; skb_shinfo(skb)->gso_type = 0; @@ -2522,7 +2521,7 @@ int tcp_write_wakeup(struct sock *sk) } else { if (tp->urg_mode && between(tp->snd_up, tp->snd_una+1, tp->snd_una+0x)) - tcp_xmit_probe_skb(sk, TCPCB_URG); + tcp_xmit_probe_skb(sk, 1); return tcp_xmit_probe_skb(sk, 0); } } -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 9/9] [TCP]: Code duplication removal, added tcp_bound_to_half_wnd()
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 19 +++ 1 files changed, 11 insertions(+), 8 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index eb8c61a..c70f1d0 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -888,6 +888,15 @@ void tcp_mtup_init(struct sock *sk) icsk->icsk_mtup.probe_size = 0; } +/* Bound MSS / TSO packet size with the half of the window */ +static int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize) +{ + if (tp->max_window && pktsize > (tp->max_window >> 1)) + return max(tp->max_window >> 1, 68U - tp->tcp_header_len); + else + return pktsize; +} + /* This function synchronize snd mss to current pmtu/exthdr set. tp->rx_opt.user_mss is mss set by user by TCP_MAXSEG. It does NOT counts @@ -920,10 +929,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu) icsk->icsk_mtup.search_high = pmtu; mss_now = tcp_mtu_to_mss(sk, pmtu); - - /* Bound mss with half of window */ - if (tp->max_window && mss_now > (tp->max_window >> 1)) - mss_now = max((tp->max_window >> 1), 68U - tp->tcp_header_len); + mss_now = tcp_bound_to_half_wnd(tp, mss_now); /* And store cached results */ icsk->icsk_pmtu_cookie = pmtu; @@ -977,10 +983,7 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed) inet_csk(sk)->icsk_ext_hdr_len - tp->tcp_header_len); - if (tp->max_window && (xmit_size_goal > (tp->max_window >> 1))) - xmit_size_goal = max((tp->max_window >> 1), -68U - tp->tcp_header_len); - + xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal); xmit_size_goal -= (xmit_size_goal % mss_now); } tp->xmit_size_goal = xmit_size_goal; -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel 2.6.23.8: KERNEL: assertion in net/ipv4/tcp_input.c
On Fri, 14 Dec 2007, Ilpo Järvinen wrote: > On Thu, 13 Dec 2007, Wolfgang Walter wrote: > > > it happened again with your patch applied: > > > > WARNING: at net/ipv4/tcp_input.c:1018 tcp_sacktag_write_queue() > > > > Call Trace: > > [] tcp_sacktag_write_queue+0x7d0/0xa60 > > [] add_partial+0x19/0x60 > > [] tcp_ack+0x5a4/0x1d70 > > [] tcp_rcv_established+0x485/0x7b0 > > [] tcp_v4_do_rcv+0xed/0x3e0 > > [] tcp_v4_rcv+0x947/0x970 > > [] ip_local_deliver+0xac/0x290 > > [] ip_rcv+0x362/0x6c0 > > [] netif_receive_skb+0x323/0x420 > > [] tg3_poll+0x630/0xa50 > > [] net_rx_action+0x8a/0x140 > > [] __do_softirq+0x69/0xe0 > > [] call_softirq+0x1c/0x30 > > [] do_softirq+0x35/0x90 > > [] irq_exit+0x55/0x60 > > [] do_IRQ+0x80/0x100 > > [] ret_from_intr+0x0/0xa > > > > ...Yeah, as I suspected, left_out != 0 when sacked_out and lost_out are > zero. I'll try to read the code again to see how that could happen (in > any case this is just annoying at best, no other harm but the > message is being done). ...If nothing comes up I might ask you to run > with another test patch but it might take a week or so until I've enough > time to dig into this fully because I must also become familiar with > something as pre-historic as 2.6.23 (there are already a large number > of related changes since then, both in upcoming 2.6.24 and some in > net-2.6.25)... :-) ...I checked and found this one place from which the left_out inconsistency could develop in stable-2.6.23 or earlier. No idea though how we end up there and do not sync afterwards. Adding WARN_ON to that branch might help but it will give false positives too from sacktag_write_queue due to DSACKs. ...If either of you wants to, you could add WARN_ON(1) to that modified block too and just ignore the occasional stack traces showing tcp_sacktag_write_queue calling tcp_fragment (they are not suspicious because they occur due to DSACKs and sync before the assertion). -- i. 
-- [PATCH] [TCP]: Try to fix left_out inconsistency This spot does obviously incorrect counting for left_out (ie., it fails to adjust it), though I know of very few ways to call tcp_fragment() for SACKED_ACKED skbs (DSACK seems to be the only case, and perhaps some split-after-first-ACK SACK blocks could cause this as well, and both fill sync after the loop). Those usually shouldn't lead to any adjustments of pcount though (and tcp_sync_left_out()s are all over the code, so it's a very narrow maze to not "hit" one in between, which would remove all temporal inconsistencies). Therefore this might not be the actual cause of the detected inconsistencies, but due to the complexity of the corner-case scenarios it still seems worth a try (I might have missed some path :-)). This is only valid for stable-2.6.23 or earlier because anything after that drops left_out, calculates it on the fly from lost_out and sacked_out, and is therefore always consistently in sync. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 666d8a5..9fc08cb 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -677,8 +677,10 @@ int tcp_fragment(struct sock *sk, struct sk_buff *skb, u32 len, unsigned int mss tp->packets_out -= diff; - if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) + if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) { tp->sacked_out -= diff; + tp->left_out -= diff; + } if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS) tp->retrans_out -= diff; -- 1.5.0.6
Re: [PATCH 3/9] [TCP]: Remove unnecessary local variables
On Mon, 31 Dec 2007, David Miller wrote: > From: "Ilpo_Järvinen" <[EMAIL PROTECTED]> > Date: Mon, 31 Dec 2007 12:47:51 +0200 > > > Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> > ... > > > > - in_flight = tcp_packets_in_flight(tp); > > - cwnd = tp->snd_cwnd; > > - if (in_flight < cwnd) > > - return (cwnd - in_flight); > > + if (tcp_packets_in_flight(tp) < tp->snd_cwnd) > > + return tp->snd_cwnd - tcp_packets_in_flight(tp); > > > > I don't know about this one. > > Although tcp_packets_in_flight() is inline and the compiler > should CSE the first call into a local register and not > redo the calculation: > > 1) That isn't something to rely upon. The compiler might look >at a function or set of functions in this file and decide >to not inline tcp_packets_in_flight() or not see the CSE >opportunity and that the second call is redundant. How about doing something really straight-forward then: return max_t(s32, tp->snd_cwnd - tcp_packets_in_flight(tp), 0); ...I was a bit unsure about adding such type trickery and thus didn't go for it right away. > So best to keep the local vars here. > > I also think the code is clearer that way too. Fair enough. -- i.
[PATCH 3/3] [TCP]: Remove unnecessary local variable
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c |3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index b3110fc..f6d279a 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -995,9 +995,8 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed) static void tcp_cwnd_validate(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); - __u32 packets_out = tp->packets_out; - if (packets_out >= tp->snd_cwnd) { + if (tp->packets_out >= tp->snd_cwnd) { /* Network is feed fully. */ tp->snd_cwnd_used = 0; tp->snd_cwnd_stamp = tcp_time_stamp; -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] [TCP]: Code duplication removal, added tcp_bound_to_half_wnd()
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 19 +++ 1 files changed, 11 insertions(+), 8 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index bb7e80a..b3110fc 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -888,6 +888,15 @@ void tcp_mtup_init(struct sock *sk) icsk->icsk_mtup.probe_size = 0; } +/* Bound MSS / TSO packet size with the half of the window */ +static int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize) +{ + if (tp->max_window && pktsize > (tp->max_window >> 1)) + return max(tp->max_window >> 1, 68U - tp->tcp_header_len); + else + return pktsize; +} + /* This function synchronize snd mss to current pmtu/exthdr set. tp->rx_opt.user_mss is mss set by user by TCP_MAXSEG. It does NOT counts @@ -920,10 +929,7 @@ unsigned int tcp_sync_mss(struct sock *sk, u32 pmtu) icsk->icsk_mtup.search_high = pmtu; mss_now = tcp_mtu_to_mss(sk, pmtu); - - /* Bound mss with half of window */ - if (tp->max_window && mss_now > (tp->max_window >> 1)) - mss_now = max((tp->max_window >> 1), 68U - tp->tcp_header_len); + mss_now = tcp_bound_to_half_wnd(tp, mss_now); /* And store cached results */ icsk->icsk_pmtu_cookie = pmtu; @@ -977,10 +983,7 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed) inet_csk(sk)->icsk_ext_hdr_len - tp->tcp_header_len); - if (tp->max_window && (xmit_size_goal > (tp->max_window >> 1))) - xmit_size_goal = max((tp->max_window >> 1), -68U - tp->tcp_header_len); - + xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal); xmit_size_goal -= (xmit_size_goal % mss_now); } tp->xmit_size_goal = xmit_size_goal; -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: TCP event tracking via netlink...
On Wed, 2 Jan 2008, David Miller wrote: > From: Stephen Hemminger <[EMAIL PROTECTED]> > Date: Thu, 6 Dec 2007 09:23:12 -0800 > > > Tools and scripts for testing that generate graphs are at: > > git://git.kernel.org/pub/scm/tcptest/tcptest > > Did you move it somewhere else? > > [EMAIL PROTECTED]:~/src/GIT$ git clone > git://git.kernel.org/pub/scm/tcptest/tcptest > Initialized empty Git repository in /home/davem/src/GIT/tcptest/.git/ > fatal: The remote end hung up unexpectedly > fetch-pack from 'git://git.kernel.org/pub/scm/tcptest/tcptest' failed. .../network/ was missing from the path :-). $ git-remote show origin * remote origin URL: git://git.kernel.org/pub/scm/network/tcptest/tcptest.git Remote branch(es) merged with 'git pull' while on branch master master Tracked remote branches master -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.25 2/3] [TCP]: Urgent parameter effect can be simplified.
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index f6d279a..6c7cd0a 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2488,7 +2488,7 @@ static int tcp_xmit_probe_skb(struct sock *sk, int urgent) * end to send an ack. Don't queue or clone SKB, just * send it. */ - TCP_SKB_CB(skb)->seq = urgent ? tp->snd_una : tp->snd_una - 1; + TCP_SKB_CB(skb)->seq = tp->snd_una - !urgent; TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq; TCP_SKB_CB(skb)->when = tcp_time_stamp; return tcp_transmit_skb(sk, skb, 0, GFP_ATOMIC); -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.25 3/3] [TCP]: Perform setting of common control fields in one place
In case of segments which are purely for control without any data (SYN/ACK/FIN/RST), many fields are set to common values in multiple places. i386 results: $ gcc --version gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13) $ codiff tcp_output.o.old tcp_output.o.new net/ipv4/tcp_output.c: tcp_xmit_probe_skb| -48 tcp_send_ack | -56 tcp_retransmit_skb| -79 tcp_connect | -43 tcp_send_active_reset | -35 tcp_make_synack | -42 tcp_send_fin | -48 7 functions changed, 351 bytes removed net/ipv4/tcp_output.c: tcp_init_nondata_skb | +90 1 function changed, 90 bytes added tcp_output.o.mid: 8 functions changed, 90 bytes added, 351 bytes removed, diff: -261 Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_output.c | 91 +++- 1 files changed, 36 insertions(+), 55 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 6c7cd0a..89f0188 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -327,6 +327,26 @@ static inline void TCP_ECN_send(struct sock *sk, struct sk_buff *skb, } } +/* Constructs common control bits of non-data skb. If SYN/FIN is present, + * auto increment end seqno. 
+ */ +static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags) +{ + skb->csum = 0; + + TCP_SKB_CB(skb)->flags = flags; + TCP_SKB_CB(skb)->sacked = 0; + + skb_shinfo(skb)->gso_segs = 1; + skb_shinfo(skb)->gso_size = 0; + skb_shinfo(skb)->gso_type = 0; + + TCP_SKB_CB(skb)->seq = seq; + if (flags & (TCPCB_FLAG_SYN | TCPCB_FLAG_FIN)) + seq++; + TCP_SKB_CB(skb)->end_seq = seq; +} + static void tcp_build_and_update_options(__be32 *ptr, struct tcp_sock *tp, __u32 tstamp, __u8 **md5_hash) { @@ -1864,12 +1884,10 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb) (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) && tp->snd_una == (TCP_SKB_CB(skb)->end_seq - 1)) { if (!pskb_trim(skb, 0)) { - TCP_SKB_CB(skb)->seq = TCP_SKB_CB(skb)->end_seq - 1; - skb_shinfo(skb)->gso_segs = 1; - skb_shinfo(skb)->gso_size = 0; - skb_shinfo(skb)->gso_type = 0; + /* Reuse, even though it does some unnecessary work */ + tcp_init_nondata_skb(skb, TCP_SKB_CB(skb)->end_seq - 1, +TCP_SKB_CB(skb)->flags); skb->ip_summed = CHECKSUM_NONE; - skb->csum = 0; } } @@ -2068,16 +2086,9 @@ void tcp_send_fin(struct sock *sk) /* Reserve space for headers and prepare control bits. */ skb_reserve(skb, MAX_TCP_HEADER); - skb->csum = 0; - TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_FIN); - TCP_SKB_CB(skb)->sacked = 0; - skb_shinfo(skb)->gso_segs = 1; - skb_shinfo(skb)->gso_size = 0; - skb_shinfo(skb)->gso_type = 0; - /* FIN eats a sequence byte, write_seq advanced by tcp_queue_skb(). */ - TCP_SKB_CB(skb)->seq = tp->write_seq; - TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + 1; + tcp_init_nondata_skb(skb, tp->write_seq, +TCPCB_FLAG_ACK | TCPCB_FLAG_FIN); tcp_queue_skb(sk, skb); } __tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_OFF); @@ -2101,16 +2112,9 @@ void tcp_send_active_reset(struct sock *sk, gfp_t priority) /* Reserve space for headers and prepare control bits. 
*/ skb_reserve(skb, MAX_TCP_HEADER); - skb->csum = 0; - TCP_SKB_CB(skb)->flags = (TCPCB_FLAG_ACK | TCPCB_FLAG_RST); - TCP_SKB_CB(skb)->sacked = 0; - skb_shinfo(skb)->gso_segs = 1; - skb_shinfo(skb)->gso_size = 0; - skb_shinfo(skb)->gso_type = 0; - + tcp_init_nondata_skb(skb, tcp_acceptable_seq(sk), +TCPCB_FLAG_ACK | TCPCB_FLAG_RST); /* Send it off. */ - TCP_SKB_CB(skb)->seq = tcp_acceptable_seq(sk); - TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq; TCP_SKB_CB(skb)->when = tcp_time_stamp; if (tcp_transmit_skb(sk, skb, 0, priority)) NET_INC_STATS(LINUX_MIB_TCPABORTFAILED); @@ -2198,12 +2202,11 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst, TCP_ECN_make_synack(req, th); th->source = inet_sk(sk)->sport; th->dest = ireq->rmt_port; - TCP_SKB_CB(skb)->seq = tcp_rsk(req)->snt_isn; - TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(s
[PATCH net-2.6.25 1/3] [TCP]: cleanup tcp_parse_options deep indented switch
Removed case indentation level & combined some nested ifs, mostly within 80 lines now. This is a leftover from indent patch, it just had to be done manually to avoid messing it up completely. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/tcp_input.c | 135 +- 1 files changed, 67 insertions(+), 68 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 18e099c..d5b6adf 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -3278,81 +3278,80 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx, int opsize; switch (opcode) { - case TCPOPT_EOL: + case TCPOPT_EOL: + return; + case TCPOPT_NOP:/* Ref: RFC 793 section 3.1 */ + length--; + continue; + default: + opsize = *ptr++; + if (opsize < 2) /* "silly options" */ return; - case TCPOPT_NOP:/* Ref: RFC 793 section 3.1 */ - length--; - continue; - default: - opsize=*ptr++; - if (opsize < 2) /* "silly options" */ - return; - if (opsize > length) - return; /* don't parse partial options */ - switch (opcode) { - case TCPOPT_MSS: - if (opsize==TCPOLEN_MSS && th->syn && !estab) { - u16 in_mss = ntohs(get_unaligned((__be16 *)ptr)); - if (in_mss) { - if (opt_rx->user_mss && opt_rx->user_mss < in_mss) - in_mss = opt_rx->user_mss; - opt_rx->mss_clamp = in_mss; - } - } - break; - case TCPOPT_WINDOW: - if (opsize==TCPOLEN_WINDOW && th->syn && !estab) - if (sysctl_tcp_window_scaling) { - __u8 snd_wscale = *(__u8 *) ptr; - opt_rx->wscale_ok = 1; - if (snd_wscale > 14) { - if (net_ratelimit()) - printk(KERN_INFO "tcp_parse_options: Illegal window " - "scaling value %d >14 received.\n", - snd_wscale); - snd_wscale = 14; - } - opt_rx->snd_wscale = snd_wscale; - } - break; - case TCPOPT_TIMESTAMP: - if (opsize==TCPOLEN_TIMESTAMP) { - if ((estab && opt_rx->tstamp_ok) || - (!estab && sysctl_tcp_timestamps)) { - opt_rx->saw_tstamp = 1; - opt_rx->rcv_tsval = ntohl(get_unaligned((__be32 *)ptr)); - opt_rx->rcv_tsecr = ntohl(get_unaligned((__be32 *)(ptr+4))); - } + if 
(opsize > length) + return; /* don't parse partial options */ + switch (opcode) { + case TCPOPT_MSS: + if (opsize == TCPOLEN_MSS && th->syn && !estab) { + u16 in_mss = ntohs(get_unaligned((__be16 *)ptr)); + if (in_mss) { + if (opt_rx->user_mss && +
[PATCH net-2.6.25] [NET]: Remove obsolete comment
[PATCH] [NET]: Remove obsolete comment It seems that ip_build_xmit is no longer used in here and ip_append_data is used. Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- Hi Dave, While reading some nearby code, I noticed this. I'm not 100% sure if the removal is valid or not nor have interest to dig it up from history but I suppose you know it better without having to look it up from history :-). net/ipv4/ip_output.c |2 -- 1 files changed, 0 insertions(+), 2 deletions(-) diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 3dc0c12..e57de0f 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -1339,8 +1339,6 @@ static int ip_reply_glue_bits(void *dptr, char *to, int offset, * * Should run single threaded per socket because it uses the sock * structure to pass arguments. - * - * LATER: switch from ip_build_xmit to ip_append_* */ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *arg, unsigned int len) -- 1.5.0.6
[PATCH net-2.6.25 0/4] [NET]: Bloat, bloat and more bloat
Hi Dave, After Arnaldo got codiff's inline instrumentation bugs fixed (thanks! :-)), I got my .c-inline-bloat-o-meter to power up reliably after some tweaking and bug fixing on my part... It shows some very high readings every now and then in the code under net/. ...Aand... we've a sovereign winner, though it was only fifth on a kernel wide list (arch/ excluded due to a number of reasons) :-/. Hall of (unquestionable) fame (measured per inline, top 10 under net/): -4496 ctnetlink_parse_tuple netfilter/nf_conntrack_netlink.c -2165 ctnetlink_dump_tuples netfilter/nf_conntrack_netlink.c -2115 __ip_vs_get_out_rt ipv4/ipvs/ip_vs_xmit.c -1924 xfrm_audit_helper_pktinfo xfrm/xfrm_state.c -1799 ctnetlink_parse_tuple_proto netfilter/nf_conntrack_netlink.c -1268 ctnetlink_parse_tuple_ip netfilter/nf_conntrack_netlink.c -1093 ctnetlink_exp_dump_expect netfilter/nf_conntrack_netlink.c -1060 ccid3_update_send_interval dccp/ccids/ccid3.c -983 ctnetlink_dump_tuples_proto netfilter/nf_conntrack_netlink.c -827 ctnetlink_exp_dump_tuple netfilter/nf_conntrack_netlink.c Removing inlines is done iteratively because e.g., uninlining ctnetlink_parse_tuple affected the ..._{proto,ip} prices as well. i386/gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13)/allyesconfig except CONFIG_FORCED_INLINING. Tried without some CONFIG.*DEBUG as well and got slightly better numbers for some functions, yet the numbers don't differ enough to be that meaningful, ie., if there's bloat somewhere, removing DEBUGs won't make it go away. ...After this first aid we're at least below 1k :-). -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] [IPVS]: Kill some bloat
From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <[EMAIL PROTECTED]> net/ipv4/ipvs/ip_vs_xmit.c: ip_vs_icmp_xmit | -638 ip_vs_tunnel_xmit | -674 ip_vs_nat_xmit| -716 ip_vs_dr_xmit | -682 4 functions changed, 2710 bytes removed, diff: -2710 net/ipv4/ipvs/ip_vs_xmit.c: __ip_vs_get_out_rt | +595 1 function changed, 595 bytes added, diff: +595 net/ipv4/ipvs/ip_vs_xmit.o: 5 functions changed, 595 bytes added, 2710 bytes removed, diff: -2115 Without some CONFIG.*DEBUGs: net/ipv4/ipvs/ip_vs_xmit.o: 5 functions changed, 383 bytes added, 1513 bytes removed, diff: -1130 Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/ipv4/ipvs/ip_vs_xmit.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/ipv4/ipvs/ip_vs_xmit.c b/net/ipv4/ipvs/ip_vs_xmit.c index 1e96bf8..8436bf8 100644 --- a/net/ipv4/ipvs/ip_vs_xmit.c +++ b/net/ipv4/ipvs/ip_vs_xmit.c @@ -59,7 +59,7 @@ __ip_vs_dst_check(struct ip_vs_dest *dest, u32 rtos, u32 cookie) return dst; } -static inline struct rtable * +static struct rtable * __ip_vs_get_out_rt(struct ip_vs_conn *cp, u32 rtos) { struct rtable *rt; /* Route to the other host */ -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] [NETFILTER]: Kill some super duper bloatry
From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <[EMAIL PROTECTED]> /me awards the bloatiest-of-all-net/-.c-code award to nf_conntrack_netlink.c, congratulations to all the authors :-/! Hall of (unquestionable) fame (measured per inline, top 10 under net/): -4496 ctnetlink_parse_tuplenetfilter/nf_conntrack_netlink.c -2165 ctnetlink_dump_tuplesnetfilter/nf_conntrack_netlink.c -2115 __ip_vs_get_out_rt ipv4/ipvs/ip_vs_xmit.c -1924 xfrm_audit_helper_pktinfoxfrm/xfrm_state.c -1799 ctnetlink_parse_tuple_proto netfilter/nf_conntrack_netlink.c -1268 ctnetlink_parse_tuple_ip netfilter/nf_conntrack_netlink.c -1093 ctnetlink_exp_dump_expectnetfilter/nf_conntrack_netlink.c -1060 void ccid3_update_send_interval dccp/ccids/ccid3.c -983 ctnetlink_dump_tuples_proto netfilter/nf_conntrack_netlink.c -827 ctnetlink_exp_dump_tuple netfilter/nf_conntrack_netlink.c (i386 / gcc (GCC) 4.1.2 20070626 (Red Hat 4.1.2-13) / allyesconfig except CONFIG_FORCED_INLINING) ...and I left < 200 byte gains as future work item. After iterative inline removal, I finally have this: net/netfilter/nf_conntrack_netlink.c: ctnetlink_exp_fill_info | -1104 ctnetlink_new_expect | -1572 ctnetlink_fill_info | -1303 ctnetlink_new_conntrack | -2230 ctnetlink_get_expect | -341 ctnetlink_del_expect | -352 ctnetlink_expect_event| -1110 ctnetlink_conntrack_event | -1548 ctnetlink_del_conntrack | -729 ctnetlink_get_conntrack | -728 10 functions changed, 11017 bytes removed, diff: -11017 net/netfilter/nf_conntrack_netlink.c: ctnetlink_parse_tuple | +419 dump_nat_seq_adj | +183 ctnetlink_dump_counters | +166 ctnetlink_dump_tuples | +261 ctnetlink_exp_dump_expect | +633 ctnetlink_change_status | +460 6 functions changed, 2122 bytes added, diff: +2122 net/netfilter/nf_conntrack_netlink.o: 16 functions changed, 2122 bytes added, 11017 bytes removed, diff: -8895 Without a number of CONFIG.*DEBUGs, I got this: net/netfilter/nf_conntrack_netlink.o: 16 functions changed, 2122 bytes added, 11029 bytes removed, diff: -8907 Signed-off-by: 
Ilpo Järvinen <[EMAIL PROTECTED]> --- net/netfilter/nf_conntrack_netlink.c | 12 ++-- 1 files changed, 6 insertions(+), 6 deletions(-) diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c index d93d58d..38141f1 100644 --- a/net/netfilter/nf_conntrack_netlink.c +++ b/net/netfilter/nf_conntrack_netlink.c @@ -95,7 +95,7 @@ nla_put_failure: return -1; } -static inline int +static int ctnetlink_dump_tuples(struct sk_buff *skb, const struct nf_conntrack_tuple *tuple) { @@ -205,7 +205,7 @@ nla_put_failure: } #ifdef CONFIG_NF_CT_ACCT -static inline int +static int ctnetlink_dump_counters(struct sk_buff *skb, const struct nf_conn *ct, enum ip_conntrack_dir dir) { @@ -284,7 +284,7 @@ nla_put_failure: } #ifdef CONFIG_NF_NAT_NEEDED -static inline int +static int dump_nat_seq_adj(struct sk_buff *skb, const struct nf_nat_seq *natseq, int type) { struct nlattr *nest_parms; @@ -648,7 +648,7 @@ ctnetlink_parse_tuple_proto(struct nlattr *attr, return ret; } -static inline int +static int ctnetlink_parse_tuple(struct nlattr *cda[], struct nf_conntrack_tuple *tuple, enum ctattr_tuple type, u_int8_t l3num) { @@ -888,7 +888,7 @@ out: return err; } -static inline int +static int ctnetlink_change_status(struct nf_conn *ct, struct nlattr *cda[]) { unsigned long d; @@ -1349,7 +1349,7 @@ nla_put_failure: return -1; } -static inline int +static int ctnetlink_exp_dump_expect(struct sk_buff *skb, const struct nf_conntrack_expect *exp) { -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] [XFRM]: Kill some bloat
From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <[EMAIL PROTECTED]> net/xfrm/xfrm_state.c: xfrm_audit_state_delete | -589 xfrm_replay_check| -542 xfrm_audit_state_icvfail | -520 xfrm_audit_state_add | -589 xfrm_audit_state_replay_overflow | -523 xfrm_audit_state_notfound_simple | -509 xfrm_audit_state_notfound| -521 7 functions changed, 3793 bytes removed, diff: -3793 net/xfrm/xfrm_state.c: xfrm_audit_helper_pktinfo | +522 xfrm_audit_helper_sainfo | +598 2 functions changed, 1120 bytes added, diff: +1120 net/xfrm/xfrm_state.o: 9 functions changed, 1120 bytes added, 3793 bytes removed, diff: -2673 Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/xfrm/xfrm_state.c |8 1 files changed, 4 insertions(+), 4 deletions(-) diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c index 4dec655..81b0fce 100644 --- a/net/xfrm/xfrm_state.c +++ b/net/xfrm/xfrm_state.c @@ -2008,8 +2008,8 @@ void __init xfrm_state_init(void) } #ifdef CONFIG_AUDITSYSCALL -static inline void xfrm_audit_helper_sainfo(struct xfrm_state *x, - struct audit_buffer *audit_buf) +static void xfrm_audit_helper_sainfo(struct xfrm_state *x, +struct audit_buffer *audit_buf) { struct xfrm_sec_ctx *ctx = x->security; u32 spi = ntohl(x->id.spi); @@ -2036,8 +2036,8 @@ static inline void xfrm_audit_helper_sainfo(struct xfrm_state *x, audit_log_format(audit_buf, " spi=%u(0x%x)", spi, spi); } -static inline void xfrm_audit_helper_pktinfo(struct sk_buff *skb, u16 family, -struct audit_buffer *audit_buf) +static void xfrm_audit_helper_pktinfo(struct sk_buff *skb, u16 family, + struct audit_buffer *audit_buf) { struct iphdr *iph4; struct ipv6hdr *iph6; -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] [CCID3]: Kill some bloat
From: =?ISO-8859-1?q?Ilpo_J=E4rvinen?= <[EMAIL PROTECTED]> Without a number of CONFIG.*DEBUG: net/dccp/ccids/ccid3.c: ccid3_hc_tx_update_x | -170 ccid3_hc_tx_packet_sent | -175 ccid3_hc_tx_packet_recv | -169 ccid3_hc_tx_no_feedback_timer | -192 ccid3_hc_tx_send_packet | -144 5 functions changed, 850 bytes removed, diff: -850 net/dccp/ccids/ccid3.c: ccid3_update_send_interval | +191 1 function changed, 191 bytes added, diff: +191 net/dccp/ccids/ccid3.o: 6 functions changed, 191 bytes added, 850 bytes removed, diff: -659 Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]> --- net/dccp/ccids/ccid3.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/net/dccp/ccids/ccid3.c b/net/dccp/ccids/ccid3.c index 00b5f11..8a817dd 100644 --- a/net/dccp/ccids/ccid3.c +++ b/net/dccp/ccids/ccid3.c @@ -97,7 +97,7 @@ static inline u64 rfc3390_initial_rate(struct sock *sk) /* * Recalculate t_ipi and delta (should be called whenever X changes) */ -static inline void ccid3_update_send_interval(struct ccid3_hc_tx_sock *hctx) +static void ccid3_update_send_interval(struct ccid3_hc_tx_sock *hctx) { /* Calculate new t_ipi = s / X_inst (X_inst is in 64 * bytes/second) */ hctx->ccid3hctx_t_ipi = scaled_div32(((u64)hctx->ccid3hctx_s) << 6, -- 1.5.0.6 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Sat, 5 Jan 2008, David Miller wrote: > From: Herbert Xu <[EMAIL PROTECTED]> > Date: Sun, 06 Jan 2008 11:29:35 +1100 > > > We should never use inline except when it's on the fast path and this > > is definitely not a fast path. If a function ends up being called > > just once the compiler will most likely inline it anyway, making the > > use of the keyword inline redundant. > > Similarly I question just about any inline usage at all in *.c files > these days. > > I even would discourage it's use for fast-path cases as well. There still seem to be good candidates for inline in *.c files; in the worst case I had +172 due to an inline removed, and ~60 are in the +10-+90 range with my gcc. Later gccs might do better, but I definitely wouldn't just blindly remove them all. Here's the other end of the list: +35 static inline hci_encrypt_change_evt net/bluetooth/hci_event.c +36 static __inline__ tcp_in_window net/ipv4/tcp_minisocks.c +38 static inline hci_conn_complete_evt net/bluetooth/hci_event.c +38 static inline hci_conn_request_evt net/bluetooth/hci_event.c +42 static inline gred_wred_mode net/sched/sch_gred.c +45 static inline secpath_has_nontransport net/xfrm/xfrm_policy.c +52 static inline bool port_match net/netfilter/xt_tcpudp.c +67 static inline dn_check_idf net/decnet/dn_nsp_in.c +90 static inline __ieee80211_queue_stopped net/mac80211/tx.c +172 static inline sctp_chunk_length_valid net/sctp/sm_statefuns.c -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Sun, 6 Jan 2008, Herbert Xu wrote: > is definitely not a fast path. If a function ends up being called > just once the compiler will most likely inline it anyway, making the > use of the keyword inline redundant. Unexpectedly enough, even this logic seems to fail in a way with my gcc. I've yet to study it closer, but it seems to me that e.g. uninlining the only-once-called tcp_fastretrans_alert is worth at least 100 bytes (note that it's not inlined by us, gcc did it all by itself)! Otherwise I'd fail to understand why I got -270 bytes from uninlining tcp_moderate_cwnd, which is only 57 bytes as uninlined with three call sites. net/ipv4/tcp_input.c: tcp_undo_cwr | -48 tcp_try_undo_recovery | -55 tcp_ack | -2941 3 functions changed, 3044 bytes removed, diff: -3044 net/ipv4/tcp_input.c: tcp_moderate_cwnd | +57 tcp_fastretrans_alert | +2717 2 functions changed, 2774 bytes added, diff: +2774 net/ipv4/tcp_input.o: 5 functions changed, 2774 bytes added, 3044 bytes removed, diff: -270 I'll probably force uninlining of it without the tcp_moderate_cwnd noise and try a number of gcc versions. -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Mon, 7 Jan 2008, David Miller wrote: > From: Andi Kleen <[EMAIL PROTECTED]> > Date: Tue, 8 Jan 2008 06:00:07 +0100 > > > On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote: > > > The vast majority of them are one, two, and three liners. > > > > % awk ' { line++ } ; /^{/ { total++; start = line } ; /^}/ { > > len=line-start-3; if (len > 4) l++; if (len >= 10) k++; } ; END { print > > total, l, l/total, k, k/total }' < include/net/tcp.h > > 68 28 0.411765 20 0.294118 > > > > 41% are over 4 lines, 29% are >= 10 lines. > > Take out the comments and whitespace lines, your script is > too simplistic. He should also remove these, which involve just syntactic casting to please the compiler: struct tcp_sock *tp = tcp_sk(sk); struct inet_connection_sock *icsk = inet_csk(sk); ...and maybe some other similar ones. I'm sure that's going to make his statistics even more questionable than they already are, because we need tp all around. But about his concerns: I spent a couple of minutes looking at it by hand. These are likely to be a win: tcp_set_state tcp_prequeue tcp_prequeue_init tcp_openreq_init ...about the rest I'm very unsure, but there are some that are probably a minor win as well. tcp_dec_quickack_mode has only a single caller anyway, so I doubt it's going to be significant; I thought about moving it to .c earlier but just haven't done that yet :-). -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Tue, 8 Jan 2008, Andi Kleen wrote: > David Miller <[EMAIL PROTECTED]> writes: > > > Similarly I question just about any inline usage at all in *.c files > > Don't forget the .h files. Especially a lot of stuff in tcp.h should > be probably in some .c file and not be inline. I'm not forgetting them... :-) My intention is to do the same for all headers as well, but I'd really like to get some actual number of bytes by measuring rather than pulling them out of a hat. ...It still requires some work to get the uninlining to actually do the right thing, and actually testing it involves orders of magnitude more recompilation than the .o-only checking required, so after finishing the script I either have to set up n distcc masters with a number of slaves each or wait some time for the results :-). However, I suspect that nothing in tcp.h will come close to the ~5-10k I've seen in three .c files already, + one with debugging that has even more than 10k (even if everything in tcp.h is summed up :-)). -- i. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SACK scoreboard
On Mon, 7 Jan 2008, David Miller wrote:
> Did you happen to read a recent blog posting of mine?
>
> http://vger.kernel.org/~davem/cgi-bin/blog.cgi/2007/12/31#tcp_overhead
>
> I've been thinking more and more and I think we might be able
> to get away with enforcing that SACKs are always increasing in
> coverage.
>
> I doubt there are any real systems out there that drop out of order
> packets that are properly formed and are in window, even though the
> SACK specification (foolishly, in my opinion) allows this.

Luckily we can already see that from the MIBs, so querying people who have large servers under their supervision (or can otherwise access them), servers which are continuously "testing" the internet :-), and asking if they see any might help. I checked my dept's interactive servers and all had zero renegings, but I don't have access to a www server, which would have much wider exposure.

> If we could free packets as SACK blocks cover them, all the problems
> go away.

I thought about it a bit yesterday after reading your blog and came to the conclusion that they won't: we can still get those nasty ACKs regardless of received SACK info (here, missing SACK info). There are even some valid cases which include ACK losses besides the actual data loss. Not that this is the most common case, but I just wanted to point out that the cleanup work is at least partially independent of the SACK problem. So not "all" problems would really go away.

> For one thing, this will allow the retransmit queue liberation during
> loss recovery to be spread out over the event, instead of batched up
> like crazy to the point where the cumulative ACK finally moves and
> releases an entire window's worth of data.

Two key cases for a real loss pattern are:

1. Losses once per n packets, where n is something small, like 2-20 or so; this usually happens at slow start overshoot or when competing traffic slow starts. Cumulative ACKs will cover only a small part of the window once the rexmits make it through, thus this is not a problem.

2.
Single loss (or a few at the beginning of the window), rest SACKed. The cumulative ACK will cover the original window when the last necessary rexmit gets through. Case 1 becomes nasty ACKy only if a rexmit is lost as well, but in that case the arriving SACK blocks make the rest of the window equivalent to case 2 :-). So I'm now trying to solve just case 2.

What if we could somehow "combine" adjacent skbs (or whatever they're called in that model) if SACK covers them both, so that we still hold them but can drop them in a very efficient way? That would split the combining effort per ACK. And if reneging did occur, we can think of a way to put the necessary fuzz into a form which cannot hurt the rest of the system (relatively easy & fast if we add CA_Reneging and allow retransmitting a portion of an skb, similar to what you suggested earlier). And it might even then be possible to offer the admin a control, so that the admin can choose between recovery and a plain reset if he thinks reneging is always an indication of an attack. This is somewhat similar to what UTO (under IETF evaluation) does, as the purpose of both is to violate RFC TCP in order to avoid malicious traps, while the control over it is left to the user.

> Next, it would simplify all of this scanning code trying to figure out
> which holes to fill during recovery.
>
> And for SACK scoreboard marking, the RB trie would become very nearly
> unecessary as far as I can tell.

I've been contacted by a person who was interested in reaching 500k packet windows, so your 4000 sounded like a joke :-/. Having, let's say, every 20th packet dropped means 25k skbs remaining; can we scan through that in any sensible time without RBs and friends :-)? However, allowing the queue walk to begin from either direction would solve most of the common cases well enough for it to be nearly manageable.

> I would not even entertain this kind of crazy idea unless I thought
> the fundamental complexity simplification payback was enormous. And
> in this case I think it is.
> What we could do is put some experimental hack in there for developers
> to start playing with, which would enforce that SACKs always increase
> in coverage. If violated the connection reset and a verbose log
> message is logged so we can analyze any cases that occur.

We have an initial number already, in the MIBs.

> Sounds crazy, but maybe has potential. What do you think?

If I hinted to my boss that I'm involved in something like this, I'd bet he would also get quite crazy... ;-) I'm partially paid for making TCP more RFCish :-), or at least for making sure that the places where things diverge are known and controllable for research purposes.

--
i.

ps. If the others Cced would like to get dropped in case there are followups, just let me know :-). Otherwise, no need to do anything.
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-2.6.25 0/4] [NET]: Bloat, bloat and more bloat
On Sat, 5 Jan 2008, Arnaldo Carvalho de Melo wrote:
> Em Sat, Jan 05, 2008 at 03:39:04PM +0200, Ilpo Järvinen escreveu:
> > Hi Dave,
> >
> > After Arnaldo got codiff's inline instrumentation bugs fixed
> > (thanks! :-)), I got my .c-inline-bloat-o-meter to power up
> > reliably after some tweaking and bug fixing on my behalf...
> > It shows some very high readings every now and then in the
> > code under net/.
>
> Thank you for the reports and for showing how these tools can be put to
> good use!
>
> If you have any further suggestions on how to make codiff and the
> dwarves to be of more help or find any other bug, please let me know.

It could use a bit less memory, because my header inline checking attempt on a machine with 2G+2G ends up like this:

$ codiff vmlinux.o.base vmlinux.o
libclasses: out of memory(inline_expansion__new)
$ ls -al vmlinux.o{,.base}
-rw-r--r-- 1 ijjarvin tkol 633132586 Jan  9 13:11 vmlinux.o
-rw-r--r-- 1 ijjarvin tkol 633132572 Jan  9 00:58 vmlinux.o.base
$

Considering that's only 0.6G+0.6G, I have a problem understanding why codiff's number crunching eats up so much memory. I just hope there aren't any O(n^2) or worse algos in it either, once the memory consumption gets resolved. :-)

--
i.
Re: SACK scoreboard
On Tue, 8 Jan 2008, John Heffner wrote:
> Andi Kleen wrote:
> > David Miller <[EMAIL PROTECTED]> writes:
> > > The big problem is that recovery from even a single packet loss in a
> > > window makes us run kfree_skb() for a all the packets in a full
> > > window's worth of data when recovery completes.
> >
> > Why exactly is it a problem to free them all at once? Are you worried
> > about kernel preemption latencies?
>
> I also wonder how much of a problem this is (for now, with window sizes of
> order 1 packets. My understanding is that the biggest problems arise from
> O(N^2) time for recovery because every ack was expensive. Have current
> tests shown the final ack to be a major source of problems?

This thread got started because I tried to solve the other latencies, but realized that it helps very little, because this latency spike would have remained unsolved, and it happens in one of the most common cases.

--
i.
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] [XFRM]: Kill some bloat
On Tue, 8 Jan 2008, Ilpo Järvinen wrote:
> On Mon, 7 Jan 2008, David Miller wrote:
> > From: Andi Kleen <[EMAIL PROTECTED]>
> > Date: Tue, 8 Jan 2008 06:00:07 +0100
> >
> > > On Mon, Jan 07, 2008 at 07:37:00PM -0800, David Miller wrote:
> > > > The vast majority of them are one, two, and three liners.
> > >
> > > % awk ' { line++ } ; /^{/ { total++; start = line } ; /^}/ {
> > > len=line-start-3; if (len > 4) l++; if (len >= 10) k++; } ; END { print
> > > total, l, l/total, k, k/total }' < include/net/tcp.h
> > > 68 28 0.411765 20 0.294118
> > >
> > > 41% are over 4 lines, 29% are >= 10 lines.
> >
> > Take out the comments and whitespace lines, your script is
> > too simplistic.

In addition, it triggered spuriously on every struct/enum end brace :-) and was using the last known function-starting brace there, so no wonder the numbers were that high... Counting with the corrected length (len=line-start-1) & the spurious matches removed:

74 19 0.256757 7 0.0945946

Here are (finally) the measured bytes (a couple of the functions are missing because I had a couple of bugs in the regexps, and the #if trickery at the inline resulted in failed compiles):

12 funcs, 242+, 1697-, diff: -1455	tcp_set_state
13 funcs, 92+, 632-, diff: -540		tcp_is_cwnd_limited
12 funcs, 2836+, 3225-, diff: -389	tcp_current_ssthresh
5 funcs, 261+, 556-, diff: -295		tcp_prequeue
7 funcs, 2777+, 3049-, diff: -272	tcp_clear_retrans_hints_partial
11 funcs, 64+, 275-, diff: -211		tcp_win_from_space
6 funcs, 128+, 320-, diff: -192		tcp_prequeue_init
12 funcs, 45+, 209-, diff: -164		tcp_set_ca_state
7 funcs, 106+, 237-, diff: -131		tcp_fast_path_check
5 funcs, 167+, 291-, diff: -124		tcp_write_queue_purge
6 funcs, 43+, 160-, diff: -117		tcp_push_pending_frames
9 funcs, 55+, 159-, diff: -104		tcp_v4_check
6 funcs, 4+, 97-, diff: -93		tcp_packets_in_flight
7 funcs, 58+, 150-, diff: -92		tcp_fast_path_on
4 funcs, 4+, 91-, diff: -87		tcp_clear_options
6 funcs, 141+, 217-, diff: -76		tcp_openreq_init
8 funcs, 38+, 111-, diff: -73		tcp_unlink_write_queue
7 funcs, 32+, 103-, diff: -71		tcp_checksum_complete
7 funcs, 35+, 101-, diff: -66		__tcp_fast_path_on
5 funcs, 4+, 66-, diff: -62		tcp_receive_window
6 funcs, 67+, 128-, diff: -61		tcp_add_write_queue_tail
7 funcs, 30+, 86-, diff: -56		tcp_ca_event
6 funcs, 73+, 106-, diff: -33		tcp_paws_check
4 funcs, 4+, 36-, diff: -32		tcp_highest_sack_seq
6 funcs, 46+, 78-, diff: -32		tcp_fin_time
3 funcs, 4+, 35-, diff: -31		tcp_clear_all_retrans_hints
7 funcs, 30+, 51-, diff: -21		__tcp_add_write_queue_tail
3 funcs, 4+, 14-, diff: -10		tcp_enable_fack
4 funcs, 4+, 14-, diff: -10		keepalive_time_when
8 funcs, 66+, 73-, diff: -7		tcp_full_space
3 funcs, 4+, 5-, diff: -1		tcp_wnd_end
4 funcs, 97+, 97-, diff: +0		tcp_mib_init
3 funcs, 4+, 3-, diff: +1		tcp_skb_is_last
2 funcs, 4+, 2-, diff: +2		keepalive_intvl_when
2 funcs, 4+, 2-, diff: +2		tcp_is_fack
2 funcs, 4+, 2-, diff: +2		tcp_skb_mss
2 funcs, 4+, 2-, diff: +2		tcp_write_queue_empty
2 funcs, 4+, 2-, diff: +2		tcp_advance_highest_sack
2 funcs, 4+, 2-, diff: +2		tcp_advance_send_head
2 funcs, 4+, 2-, diff: +2		tcp_check_send_head
2 funcs, 4+, 2-, diff: +2		tcp_highest_sack_reset
2 funcs, 4+, 2-, diff: +2		tcp_init_send_head
2 funcs, 4+, 2-, diff: +2		tcp_sack_reset
6 funcs, 47+, 44-, diff: +3		tcp_space
5 funcs, 55+, 50-, diff: +5		tcp_too_many_orphans
3 funcs, 8+, 2-, diff: +6		tcp_minshall_update
3 funcs, 8+, 2-, diff: +6		tcp_update_wl
8 funcs, 25+, 14-, diff: +11		between
3 funcs, 14+, 2-, diff: +12		tcp_put_md5sig_pool
3 funcs, 14+, 2-, diff: +12		tcp_clear_xmit_timers
5 funcs, 30+, 17-, diff: +13		tcp_dec_pcount_approx_int
6 funcs, 33+, 20-, diff: +13		tcp_insert_write_queue_after
3 funcs, 17+, 2-, diff: +15		__tcp_checksum_complete
5 funcs, 17+, 2-, diff: +15		tcp_init_wl
4 funcs, 57+, 41-, diff: +16		tcp_dec_quickack_mode
4 funcs, 40+, 22-, diff: +18		__tcp_add_write_queue_head
5 funcs, 36+, 16-, diff: +20		tcp_highest_sack_combine
4 funcs, 40+, 18-, diff: +22		tcp_dec_pcount_approx
6 funcs, 29+, 5-, diff: +24		tcp_is_sack
4 funcs, 28+, 2-, diff: +26		tcp_is_reno
5 funcs, 50+, 24-, diff: +26		tcp_insert_write_queue_before
4 funcs, 83+, 56-, diff: +27		tcp_check_probe_timer
8 funcs, 69+, 14-, diff: +55		tcp_left_out
11 funcs, 2995+, 2893-, diff: +102	tcp_skb_pcount
30 funcs, 930+, 2-, diff: +928		before

--
i.
TCP & hints
Hi Stephen,

Do you still remember what this is for (it got added along with the other TCP hint stuff)? What kind of problem did you see back then (or who saw problems)?

@@ -1605,6 +1711,10 @@ static void tcp_undo_cwr(struct sock *sk, const int undo)
 	}
 	tcp_moderate_cwnd(tp);
 	tp->snd_cwnd_stamp = tcp_time_stamp;
+
+	/* There is something screwy going on with the retrans hints after
+	   an undo */
+	clear_all_retrans_hints(tp);
 }
 
 static inline int tcp_may_undo(struct tcp_sock *tp)

--
i.
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.25 0/8] [NET]: More uninlining & related
Hi Dave,

First of all, I changed the output encoding of git to utf-8, so I guess the encoding should not cause the same trouble for you.

Here are a couple more patches to uninline things. They are pretty straightforward, except for the EXPORT_SYMBOLs; I've no idea which is the right variant (feel free to fix them while applying :-)). Also, the pktgen uninlining is somewhat questionable because it's just a testing tool, so feel free to drop it if it feels unnecessary (I could have asked first, but it's just as easy to do it this way, if not easier)...

There were more dead static inlines found once the inlines were removed (gcc didn't report them otherwise) than in the pktgen patch now included, but I'm not sure if I should send patches removing or #if 0'ing them "by default"?

--
i.
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH net-2.6.25 0/8] [NET]: More uninlining & related
Hi Dave,

First of all, I changed the output encoding to utf-8, so I guess the encoding should not cause trouble for you.

Here are a couple more patches to uninline things. They are pretty straightforward, except for the EXPORT_SYMBOLs; I've no idea which is the right variant (feel free to fix them while applying :-)). Also, the pktgen uninlining is somewhat questionable because it's just a testing tool, so feel free to drop it if it feels unnecessary (I could have asked first, but it's just as easy to do it this way, if not easier)...

There were more dead static inlines found once the inlines were removed (gcc didn't report them otherwise) than in the pktgen patch now included, but I'm not sure if I should send patches removing or #if 0'ing them "by default"?

--
i.

ps. I apologize that I had to resend to get these to netdev as well: git-send-email on this system (not sure if a later version could do better) still seems to lack proper encoding of my name when it decides to add it to the Cc list all by itself, and those 8-bit chars in the address got rejected.
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 2/8] [TCP]: Uninline tcp_is_cwnd_limited
net/ipv4/tcp_cong.c:
 tcp_reno_cong_avoid | -65
 1 function changed, 65 bytes removed, diff: -65

net/ipv4/arp.c:
 arp_ignore | -5
 1 function changed, 5 bytes removed, diff: -5

net/ipv4/tcp_bic.c:
 bictcp_cong_avoid | -57
 1 function changed, 57 bytes removed, diff: -57

net/ipv4/tcp_cubic.c:
 bictcp_cong_avoid | -61
 1 function changed, 61 bytes removed, diff: -61

net/ipv4/tcp_highspeed.c:
 hstcp_cong_avoid | -63
 1 function changed, 63 bytes removed, diff: -63

net/ipv4/tcp_hybla.c:
 hybla_cong_avoid | -85
 1 function changed, 85 bytes removed, diff: -85

net/ipv4/tcp_htcp.c:
 htcp_cong_avoid | -57
 1 function changed, 57 bytes removed, diff: -57

net/ipv4/tcp_veno.c:
 tcp_veno_cong_avoid | -52
 1 function changed, 52 bytes removed, diff: -52

net/ipv4/tcp_scalable.c:
 tcp_scalable_cong_avoid | -61
 1 function changed, 61 bytes removed, diff: -61

net/ipv4/tcp_yeah.c:
 tcp_yeah_cong_avoid | -75
 1 function changed, 75 bytes removed, diff: -75

net/ipv4/tcp_illinois.c:
 tcp_illinois_cong_avoid | -54
 1 function changed, 54 bytes removed, diff: -54

net/dccp/ccids/ccid3.c:
 ccid3_update_send_interval | -7
 ccid3_hc_tx_packet_recv    | +7
 2 functions changed, 7 bytes added, 7 bytes removed, diff: +0

net/ipv4/tcp_cong.c:
 tcp_is_cwnd_limited | +88
 1 function changed, 88 bytes added, diff: +88

built-in.o:
 14 functions changed, 95 bytes added, 642 bytes removed, diff: -547

...Again some gcc artifacts visible as well.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 include/net/tcp.h   |   22 +-
 net/ipv4/tcp_cong.c |   21 +
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 306580c..7de4ea3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -786,27 +786,7 @@ static inline u32 tcp_wnd_end(const struct tcp_sock *tp)
 {
 	return tp->snd_una + tp->snd_wnd;
 }
-
-/* RFC2861 Check whether we are limited by application or congestion window
- * This is the inverse of cwnd check in tcp_tso_should_defer
- */
-static inline int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
-{
-	const struct tcp_sock *tp = tcp_sk(sk);
-	u32 left;
-
-	if (in_flight >= tp->snd_cwnd)
-		return 1;
-
-	if (!sk_can_gso(sk))
-		return 0;
-
-	left = tp->snd_cwnd - in_flight;
-	if (sysctl_tcp_tso_win_divisor)
-		return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
-	else
-		return left <= tcp_max_burst(tp);
-}
+extern int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight);
 
 static inline void tcp_minshall_update(struct tcp_sock *tp, unsigned int mss,
 				       const struct sk_buff *skb)
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 4451750..3a6be23 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -274,6 +274,27 @@ int tcp_set_congestion_control(struct sock *sk, const char *name)
 	return err;
 }
 
+/* RFC2861 Check whether we are limited by application or congestion window
+ * This is the inverse of cwnd check in tcp_tso_should_defer
+ */
+int tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
+{
+	const struct tcp_sock *tp = tcp_sk(sk);
+	u32 left;
+
+	if (in_flight >= tp->snd_cwnd)
+		return 1;
+
+	if (!sk_can_gso(sk))
+		return 0;
+
+	left = tp->snd_cwnd - in_flight;
+	if (sysctl_tcp_tso_win_divisor)
+		return left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd;
+	else
+		return left <= tcp_max_burst(tp);
+}
+EXPORT_SYMBOL_GPL(tcp_is_cwnd_limited);
 
 /*
  * Slow start is used when congestion window is less than slow start
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
net/core/pktgen.c:
 pktgen_stop_device   |  -50
 pktgen_run           | -105
 pktgen_if_show       |  -37
 pktgen_thread_worker | -702
 4 functions changed, 894 bytes removed, diff: -894

net/core/pktgen.c:
 getCurUs | +36
 1 function changed, 36 bytes added, diff: +36

net/core/pktgen.o:
 5 functions changed, 36 bytes added, 894 bytes removed, diff: -858

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/core/pktgen.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index ebfb126..d18fdb1 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -405,7 +405,7 @@ static inline __u64 tv_to_us(const struct timeval *tv)
 	return us;
 }
 
-static inline __u64 getCurUs(void)
+static __u64 getCurUs(void)
 {
 	struct timeval tv;
 	do_gettimeofday(&tv);
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 1/8] [TCP]: Uninline tcp_set_state
net/ipv4/tcp.c:
 tcp_close_state | -226
 tcp_done        | -145
 tcp_close       | -564
 tcp_disconnect  | -141
 4 functions changed, 1076 bytes removed, diff: -1076

net/ipv4/tcp_input.c:
 tcp_fin               |  -86
 tcp_rcv_state_process | -164
 2 functions changed, 250 bytes removed, diff: -250

net/ipv4/tcp_ipv4.c:
 tcp_v4_connect | -209
 1 function changed, 209 bytes removed, diff: -209

net/ipv4/arp.c:
 arp_ignore | +5
 1 function changed, 5 bytes added, diff: +5

net/ipv6/tcp_ipv6.c:
 tcp_v6_connect | -158
 1 function changed, 158 bytes removed, diff: -158

net/sunrpc/xprtsock.c:
 xs_sendpages | -2
 1 function changed, 2 bytes removed, diff: -2

net/dccp/ccids/ccid3.c:
 ccid3_update_send_interval | +7
 1 function changed, 7 bytes added, diff: +7

net/ipv4/tcp.c:
 tcp_set_state | +238
 1 function changed, 238 bytes added, diff: +238

built-in.o:
 12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445

I've no explanation why some unrelated changes seem to occur consistently as well (arp_ignore, ccid3_update_send_interval; I checked the arp_ignore asm and it seems to be due to some reordering of operations causing some extra opcodes to be generated). Still, the benefits are pretty obvious from the codiff results.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
---
 include/net/tcp.h |   35 +--
 net/ipv4/tcp.c    |   35 +++
 2 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 48081ad..306580c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -926,40 +926,7 @@ static const char *statename[]={
 	"Close Wait","Last ACK","Listen","Closing"
 };
 #endif
-
-static inline void tcp_set_state(struct sock *sk, int state)
-{
-	int oldstate = sk->sk_state;
-
-	switch (state) {
-	case TCP_ESTABLISHED:
-		if (oldstate != TCP_ESTABLISHED)
-			TCP_INC_STATS(TCP_MIB_CURRESTAB);
-		break;
-
-	case TCP_CLOSE:
-		if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED)
-			TCP_INC_STATS(TCP_MIB_ESTABRESETS);
-
-		sk->sk_prot->unhash(sk);
-		if (inet_csk(sk)->icsk_bind_hash &&
-		    !(sk->sk_userlocks & SOCK_BINDPORT_LOCK))
-			inet_put_port(&tcp_hashinfo, sk);
-		/* fall through */
-	default:
-		if (oldstate==TCP_ESTABLISHED)
-			TCP_DEC_STATS(TCP_MIB_CURRESTAB);
-	}
-
-	/* Change state AFTER socket is unhashed to avoid closed
-	 * socket sitting in hash tables.
-	 */
-	sk->sk_state = state;
-
-#ifdef STATE_TRACE
-	SOCK_DEBUG(sk, "TCP sk=%p, State %s -> %s\n",sk, statename[oldstate],statename[state]);
-#endif
-}
+extern void tcp_set_state(struct sock *sk, int state);
 
 extern void tcp_done(struct sock *sk);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 34085e3..7d7b958 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1652,6 +1652,41 @@ recv_urg:
 	goto out;
 }
 
+void tcp_set_state(struct sock *sk, int state)
+{
+	int oldstate = sk->sk_state;
+
+	switch (state) {
+	case TCP_ESTABLISHED:
+		if (oldstate != TCP_ESTABLISHED)
+			TCP_INC_STATS(TCP_MIB_CURRESTAB);
+		break;
+
+	case TCP_CLOSE:
+		if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED)
+			TCP_INC_STATS(TCP_MIB_ESTABRESETS);
+
+		sk->sk_prot->unhash(sk);
+		if (inet_csk(sk)->icsk_bind_hash &&
+		    !(sk->sk_userlocks & SOCK_BINDPORT_LOCK))
+			inet_put_port(&tcp_hashinfo, sk);
+		/* fall through */
+	default:
+		if (oldstate==TCP_ESTABLISHED)
+			TCP_DEC_STATS(TCP_MIB_CURRESTAB);
+	}
+
+	/* Change state AFTER socket is unhashed to avoid closed
+	 * socket sitting in hash tables.
+	 */
+	sk->sk_state = state;
+
+#ifdef STATE_TRACE
+	SOCK_DEBUG(sk, "TCP sk=%p, State %s -> %s\n",sk, statename[oldstate],statename[state]);
+#endif
+}
+EXPORT_SYMBOL_GPL(tcp_set_state);
+
 /*
  * State processing on a close. This implements the state shift for
  * sending our FIN frame. Note that we only send a FIN for some
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 4/8] [IPV6] route: kill some bloat
net/ipv6/route.c:
 ip6_pkt_prohibit_out | -130
 ip6_pkt_discard      | -261
 ip6_pkt_discard_out  | -130
 ip6_pkt_prohibit     | -261
 4 functions changed, 782 bytes removed, diff: -782

net/ipv6/route.c:
 ip6_pkt_drop | +300
 1 function changed, 300 bytes added, diff: +300

net/ipv6/route.o:
 5 functions changed, 300 bytes added, 782 bytes removed, diff: -482

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/ipv6/route.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 02354a7..d3b5811 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1762,8 +1762,7 @@ int ipv6_route_ioctl(unsigned int cmd, void __user *arg)
  *	Drop the packet on the floor
  */
 
-static inline int ip6_pkt_drop(struct sk_buff *skb, int code,
-			       int ipstats_mib_noroutes)
+static int ip6_pkt_drop(struct sk_buff *skb, int code, int ipstats_mib_noroutes)
 {
 	int type;
 	switch (ipstats_mib_noroutes) {
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 5/8] [NETLINK] af_netlink: kill some bloat
net/netlink/af_netlink.c:
 netlink_realloc_groups        | -46
 netlink_insert                | -49
 netlink_autobind              | -94
 netlink_clear_multicast_users | -48
 netlink_bind                  | -55
 netlink_setsockopt            | -54
 netlink_release               | -86
 netlink_kernel_create         | -47
 netlink_change_ngroups        | -56
 9 functions changed, 535 bytes removed, diff: -535

net/netlink/af_netlink.c:
 netlink_table_ungrab | +53
 1 function changed, 53 bytes added, diff: +53

net/netlink/af_netlink.o:
 10 functions changed, 53 bytes added, 535 bytes removed, diff: -482

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/netlink/af_netlink.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index be07f1b..21f9e30 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -193,7 +193,7 @@ static void netlink_table_grab(void)
 	}
 }
 
-static inline void netlink_table_ungrab(void)
+static void netlink_table_ungrab(void)
 	__releases(nl_table_lock)
 {
 	write_unlock_irq(&nl_table_lock);
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 7/8] [PKTGEN]: Kill dead static inlines
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/core/pktgen.c |   94 -
 1 files changed, 0 insertions(+), 94 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index ede1fea..ebfb126 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -397,62 +397,6 @@ struct pktgen_thread {
 #define REMOVE 1
 #define FIND 0
 
-/* This code works around the fact that do_div cannot handle two 64-bit
-   numbers, and regular 64-bit division doesn't work on x86 kernels.
-   --Ben
-*/
-
-#define PG_DIV 0
-
-/* This was emailed to LMKL by: Chris Caputo <[EMAIL PROTECTED]>
- * Function copied/adapted/optimized from:
- *
- * nemesis.sourceforge.net/browse/lib/static/intmath/ix86/intmath.c.html
- *
- * Copyright 1994, University of Cambridge Computer Laboratory
- * All Rights Reserved.
- *
- */
-static inline s64 divremdi3(s64 x, s64 y, int type)
-{
-	u64 a = (x < 0) ? -x : x;
-	u64 b = (y < 0) ? -y : y;
-	u64 res = 0, d = 1;
-
-	if (b > 0) {
-		while (b < a) {
-			b <<= 1;
-			d <<= 1;
-		}
-	}
-
-	do {
-		if (a >= b) {
-			a -= b;
-			res += d;
-		}
-		b >>= 1;
-		d >>= 1;
-	}
-	while (d);
-
-	if (PG_DIV == type) {
-		return (((x ^ y) & (1ll << 63)) == 0) ? res : -(s64) res;
-	} else {
-		return ((x & (1ll << 63)) == 0) ? a : -(s64) a;
-	}
-}
-
-/* End of hacks to deal with 64-bit math on x86 */
-
-/** Convert to milliseconds */
-static inline __u64 tv_to_ms(const struct timeval *tv)
-{
-	__u64 ms = tv->tv_usec / 1000;
-	ms += (__u64) tv->tv_sec * (__u64) 1000;
-	return ms;
-}
-
 /** Convert to micro-seconds */
 static inline __u64 tv_to_us(const struct timeval *tv)
 {
@@ -461,39 +405,6 @@ static inline __u64 tv_to_us(const struct timeval *tv)
 	return us;
 }
 
-static inline __u64 pg_div(__u64 n, __u32 base)
-{
-	__u64 tmp = n;
-	do_div(tmp, base);
-	/* printk("pktgen: pg_div, n: %llu  base: %d  rv: %llu\n",
-	   n, base, tmp); */
-	return tmp;
-}
-
-static inline __u64 pg_div64(__u64 n, __u64 base)
-{
-	__u64 tmp = n;
-/*
- * How do we know if the architecture we are running on
- * supports division with 64 bit base?
- *
- */
-#if defined(__sparc_v9__) || defined(__powerpc64__) || defined(__alpha__) || defined(__x86_64__) || defined(__ia64__)
-
-	do_div(tmp, base);
-#else
-	tmp = divremdi3(n, base, PG_DIV);
-#endif
-	return tmp;
-}
-
-static inline __u64 getCurMs(void)
-{
-	struct timeval tv;
-	do_gettimeofday(&tv);
-	return tv_to_ms(&tv);
-}
-
 static inline __u64 getCurUs(void)
 {
 	struct timeval tv;
@@ -501,11 +412,6 @@ static inline __u64 getCurUs(void)
 	return tv_to_us(&tv);
 }
 
-static inline __u64 tv_diff(const struct timeval *a, const struct timeval *b)
-{
-	return tv_to_us(a) - tv_to_us(b);
-}
-
 /* old include end */
 
 static char version[] __initdata = VERSION;
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 6/8] [NETFILTER] xt_policy.c: kill some bloat
net/netfilter/xt_policy.c:
 policy_mt | -906
 1 function changed, 906 bytes removed, diff: -906

net/netfilter/xt_policy.c:
 match_xfrm_state | +427
 1 function changed, 427 bytes added, diff: +427

net/netfilter/xt_policy.o:
 2 functions changed, 427 bytes added, 906 bytes removed, diff: -479

Alternatively, this could be done by combining the identical parts of match_policy_in/out().

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/netfilter/xt_policy.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/netfilter/xt_policy.c b/net/netfilter/xt_policy.c
index 46ee7e8..45731ca 100644
--- a/net/netfilter/xt_policy.c
+++ b/net/netfilter/xt_policy.c
@@ -33,7 +33,7 @@ xt_addr_cmp(const union xt_policy_addr *a1, const union xt_policy_addr *m,
 	return false;
 }
 
-static inline bool
+static bool
 match_xfrm_state(const struct xfrm_state *x, const struct xt_policy_elem *e,
 		 unsigned short family)
 {
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 3/8] [XFRM] xfrm_policy: kill some bloat
net/xfrm/xfrm_policy.c:
 xfrm_audit_policy_delete | -692
 xfrm_audit_policy_add    | -692
 2 functions changed, 1384 bytes removed, diff: -1384

net/xfrm/xfrm_policy.c:
 xfrm_audit_common_policyinfo | +704
 1 function changed, 704 bytes added, diff: +704

net/xfrm/xfrm_policy.o:
 3 functions changed, 704 bytes added, 1384 bytes removed, diff: -680

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/xfrm/xfrm_policy.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 534b29e..47219f9 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2367,8 +2367,8 @@ void __init xfrm_init(void)
 }
 
 #ifdef CONFIG_AUDITSYSCALL
-static inline void xfrm_audit_common_policyinfo(struct xfrm_policy *xp,
-						struct audit_buffer *audit_buf)
+static void xfrm_audit_common_policyinfo(struct xfrm_policy *xp,
+					 struct audit_buffer *audit_buf)
 {
 	struct xfrm_sec_ctx *ctx = xp->security;
 	struct xfrm_selector *sel = &xp->selector;
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 2/2] [HTB]: htb_classid is dead static inline
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/sched/sch_htb.c |    4 -
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 72beb66..6a2352c 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -214,10 +214,6 @@ static inline struct htb_class *htb_find(u32 handle, struct Qdisc *sch)
  * then finish and return direct queue.
  */
 #define HTB_DIRECT (struct htb_class*)-1
-static inline u32 htb_classid(struct htb_class *cl)
-{
-	return (cl && cl != HTB_DIRECT) ? cl->classid : TC_H_UNSPEC;
-}
 
 static struct htb_class *htb_classify(struct sk_buff *skb, struct Qdisc *sch,
 				      int *qerr)
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] [NET] core/utils.c: digit2bin is dead static inline
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/core/utils.c |   11 ---
 1 files changed, 0 insertions(+), 11 deletions(-)

diff --git a/net/core/utils.c b/net/core/utils.c
index 34459c4..8031eb5 100644
--- a/net/core/utils.c
+++ b/net/core/utils.c
@@ -91,17 +91,6 @@ EXPORT_SYMBOL(in_aton);
 #define IN6PTON_NULL		0x20000000	/* first/tail */
 #define IN6PTON_UNKNOWN		0x40000000
 
-static inline int digit2bin(char c, int delim)
-{
-	if (c == delim || c == '\0')
-		return IN6PTON_DELIM;
-	if (c == '.')
-		return IN6PTON_DOT;
-	if (c >= '0' && c <= '9')
-		return (IN6PTON_DIGIT | (c - '0'));
-	return IN6PTON_UNKNOWN;
-}
-
 static inline int xdigit2bin(char c, int delim)
 {
 	if (c == delim || c == '\0')
-- 
1.5.0.6
--
To unsubscribe from this list: send the line "unsubscribe netdev"
in the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
On Sat, 12 Jan 2008, Herbert Xu wrote: > On Sat, Jan 12, 2008 at 09:40:17AM +0000, Ilpo Järvinen wrote: > > Your emails are now using UTF-8 encoding but it's still declaring > ISO-8859-1 as the charset. Thanks for trying to help, but my situation is such that I think it got you confused as well (this kind of mixed encoding is beyond my skills, really)... :-) Besides, I wouldn't mind having incorrect characters in my name, I'm just used to that, but somebody else wasn't that happy about it. Here's one example... From: "=?utf-8?q?Ilpo_J=C3=A4rvinen?=" <[EMAIL PROTECTED]> ... Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Is something still needed besides these to declare the email utf-8? > So you probably want to fix that up or your name may show up as Jävinen > on the reader's screen. Did you actually see this? I'd expect to see that as well, but no, the From is correctly decoded by my ISO-8859-1'ish MUA; aha, it seems that I still have something to do to deal with the Signed-off line. ...Maybe I'll just fall back to changing my last name, it's the only fool-proof solution... ;-) -- i.
Re: [PATCH 1/8] [TCP]: Uninline tcp_set_state
On Sat, 12 Jan 2008, Stephen Hemminger wrote: > On Sat, 12 Jan 2008 11:40:10 +0200 > "Ilpo Järvinen" <[EMAIL PROTECTED]> wrote: ...snip... > > built-in.o: > > 12 functions changed, 250 bytes added, 1695 bytes removed, diff: -1445 ...snip... > > include/net/tcp.h | 35 +-- > > net/ipv4/tcp.c| 35 +++ > > 2 files changed, 36 insertions(+), 34 deletions(-) > > > > diff --git a/include/net/tcp.h b/include/net/tcp.h > > index 48081ad..306580c 100644 > > --- a/include/net/tcp.h > > +++ b/include/net/tcp.h > > @@ -926,40 +926,7 @@ static const char *statename[]={ > > "Close Wait","Last ACK","Listen","Closing" > > }; > > #endif > > - > > -static inline void tcp_set_state(struct sock *sk, int state) > > -{ > > - int oldstate = sk->sk_state; > > - > > - switch (state) { > > - case TCP_ESTABLISHED: > > - if (oldstate != TCP_ESTABLISHED) > > - TCP_INC_STATS(TCP_MIB_CURRESTAB); > > - break; > > - > > - case TCP_CLOSE: > > - if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED) > > - TCP_INC_STATS(TCP_MIB_ESTABRESETS); > > - > > - sk->sk_prot->unhash(sk); > > - if (inet_csk(sk)->icsk_bind_hash && > > - !(sk->sk_userlocks & SOCK_BINDPORT_LOCK)) > > - inet_put_port(&tcp_hashinfo, sk); > > - /* fall through */ > > - default: > > - if (oldstate==TCP_ESTABLISHED) > > - TCP_DEC_STATS(TCP_MIB_CURRESTAB); > > - } > > - > > - /* Change state AFTER socket is unhashed to avoid closed > > -* socket sitting in hash tables. > > -*/ > > - sk->sk_state = state; > > - > > -#ifdef STATE_TRACE > > - SOCK_DEBUG(sk, "TCP sk=%p, State %s -> %s\n",sk, > > statename[oldstate],statename[state]); > > -#endif > > -} > > > > > Since the function is called with a constant state, I guess the assumption > was that gcc would be smart enough to only include the code needed, it looks > like > either code was bigger or the compiler was dumber than expected I'd guess that compiler was just dumber... 
:-) It might be an interesting experiment to convert it to if's and see if it would make a difference; maybe it just gets confused by the switch or something. Besides, it's not always that obvious that gcc is able to determine "the constant state", considering e.g. the complexity in the cases with tcp_rcv_synsent_state_process, tcp_close_state, tcp_done. In such cases uninlining should be done, and gcc is probably not able to mix both cases nicely for a single function? However, after looking a bit, I'm partially leaning towards the other option too: > > tcp_done| -145 > > tcp_disconnect | -141 ...These call tcp_set_state just _once_, while this calls it twice: > > tcp_fin | -86 ...Obviously the compiler was able to perform some reductions, but 43 bytes per inlining still seems a rather high number. -- i.
Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
On Sat, 12 Jan 2008, David Miller wrote: > From: "Ilpo_Järvinen" <[EMAIL PROTECTED]> > Date: Sat, 12 Jan 2008 14:59:50 +0200 (EET) > > > ...Maybe I just fall-back to changing my last name, it's the only > > full-proof solution... ;-) > > Don't do this! Otherwise I won't have a frequent test case to make > sure my patch applying scripts are working properly. :- So which test case do you prefer? :-) Is iso-8859-1 from+content ok? Or should I keep trying to live with mixed utf-8, which I didn't even get fully working last time because git-send-email is probably either too dumb or too intelligent (I'm not even sure which), but you were able to correct it with your tools so the flawed signed-off never entered the git logs as incorrectly formatted :-). I'd prefer sending them as iso-8859-1 compliant (and I guess you are able to test your fix-to-utf-8 machinery with it as well :-)), as it would also make my mails compatible with other people's git apply tools you're not using (otherwise I'd probably forget to change it occasionally when interacting with people other than you). -- i.
Re: [RFC PATCH 8/8] [PKTGEN]: uninline getCurUs
On Sun, 13 Jan 2008, David Miller wrote: > From: "Ilpo_Järvinen" <[EMAIL PROTECTED]> > Date: Mon, 14 Jan 2008 09:43:08 +0200 (EET) > > > I'd prefer sending them as iso-8859-1 compliant (and I guess you are able > > to test your fix-to-utf-8 machinery with it as well :-)), as it would also > > make my mails compatible with other people's git apply tools you're not > > using (otherwise I'd probably forget to change it occassionally when > > interacting with others than you). > > For now either way is fine with me. If the situation changes I'll > let you know. Ok, I'll stay with iso-8859-1; it's something that is known to work from my end. Thanks anyway for fixing it; it wasn't any big deal for me at any point, but people started asking me privately to correct it, which I of course couldn't... :-) > I'm surprised git-send-email can't get it purely utf8 correctly. The problem is that my system is ISO natively, so git-send-email might encode an ISO-native .patch file's content while sending, which in this case was intentionally already utf-8. It might surprise you, but it wasn't long ago that git-send-email couldn't care less e.g. about header encoding, and I got rejects from netdev due to my name not being encoded properly. I have 1.5.0.6 currently and it still seemed to fail to encode the Cc addresses it adds from signed-offs unless I explicitly ask it not to do that (I explicitly ask for a specific, encoded, From header anyway because it was broken at some point, and my sending template is a copy-paste originating from that time). There were some recent fixes in git's logs regarding that encoding, so I intend to check whether a later g-s-e is more able, and if it isn't I'll report it to the git folks. > I wonder if there is some issue with how it gets your name > string for the commit author etc. I've had it working well since the encoding header got added relatively recently (it wasn't available in the early dawn of the git era); before that it was just a mess locally.
Funnily enough, you were able to magically mangle my emails to utf-8'ed commits nicely back then, so I got a "fixed" commit back after an RTT :-). > I wonder if getting it into your global GIT config file in proper > UTF8 encoding would fix things. > > Put something like this into ~/.gitconfig > > > [user] > name = Ilpo Järvinen > email = [EMAIL PROTECTED] > I have this. In addition I have this (required to make my local system consistent): [i18n] commitencoding = ISO-8859-1 The problem was just that the API (or better, ABI) between us wasn't properly working :-)). While Herbert was working as the replacement-Dave in November, correct commit entries were created, so git has been working fine (I guess he used git applying tools instead of handmade scripts, and they handle email correctly based on its encoding). I tried logOutputEncoding = utf-8 in the last patch sets I sent (I could now remove it again), but the git-send-email problem appeared with it because the system is ISO natively. > The GIT maintainer is Finnish which makes this situation even > more perplexing to me, you might want to discuss it with him :-) Junio? Never heard that as a Finnish name... ;-) Perhaps git-send-email wasn't written by that Finnish guy... :-) ...Besides, that Finnish git maintainer doesn't have any funny characters in his name... ;-) Thanks anyway for the tips & all, I think we have it working now and I can return to inlines and rexmit_skb_hint things & other TCP stuff rather than this hindrance. I've got some interesting results from the net header inline checks I ran overnight :-).
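For reference, the pieces discussed in this exchange would combine into a ~/.gitconfig along these lines. This is a hypothetical sketch for a natively ISO-8859-1 system, with the address as a placeholder:

```ini
[user]
	name = Ilpo Järvinen
	email = [EMAIL PROTECTED]

[i18n]
	# Store commit messages in the system's native encoding; git records
	# the encoding in each commit so later tools can convert correctly.
	commitencoding = ISO-8859-1
	# Optionally re-encode log/format-patch output for local display.
	logoutputencoding = ISO-8859-1
```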
Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22
On Fri, 11 Jan 2008, Zhang, Yanmin wrote: > On Wed, 2008-01-09 at 17:35 +0800, Zhang, Yanmin wrote: > > The regression is: > > 1)stoakley with 2 qual-core processors: 11%; > > 2)Tulsa with 4 dual-core(+hyperThread) processors:13%; > I have new update on this issue and also cc to netdev maillist. > Thank David Miller for pointing me the netdev maillist. > > > > > The test command is: > > #sudo taskset -c 7 ./netserver > > #sudo taskset -c 0 ./netperf -t TCP_RR -l 60 -H 127.0.0.1 -i 50,3 -I 99,5 > > -- -r 1,1 > > > > As a matter of fact, 2.6.23 has about 6% regression and 2.6.24-rc's > > regression is between 16%~11%. > > > > I tried to use bisect to locate the bad patch between 2.6.22 and 2.6.23-rc1, > > but the bisected kernel wasn't stable and went crazy. TCP work in that range is very much non-existent. Using git-reset to select a nearby merge point instead of the default commit where bisection lands might help in case the bisected kernel breaks. Also, limiting bisection to a subsystem might reduce the probability of breakage (and might at least be able to narrow it down quite a lot), e.g. git bisect start net/ -- i.
Re: Netperf TCP_RR(loopback) 10% regression in 2.6.24-rc6, comparing with 2.6.22
On Mon, 14 Jan 2008, Ilpo Järvinen wrote: > On Fri, 11 Jan 2008, Zhang, Yanmin wrote: > > > On Wed, 2008-01-09 at 17:35 +0800, Zhang, Yanmin wrote: > > > > > > As a matter of fact, 2.6.23 has about 6% regression and 2.6.24-rc's > > > regression is between 16%~11%. > > > > > > I tried to use bisect to locate the bad patch between 2.6.22 and > > > 2.6.23-rc1, > > > but the bisected kernel wasn't stable and went crazy. > > TCP work between that is very much non-existing. I _really_ meant 2.6.22 - 2.6.23-rc1, not 2.6.24-rc1, in case you had a typo there; that's not uncommon when typing kernel versions... :-) -- i.
Re: [PATCH 2/4] dsmark: get rid of trivial function
On Mon, 21 Jan 2008, Patrick McHardy wrote: > Stephen Hemminger wrote: > > Replace loop in dsmark_valid_indices with equivalent bit math. > > > > Signed-off-by: Stephen Hemminger <[EMAIL PROTECTED]> > > > > --- a/net/sched/sch_dsmark.c2008-01-20 13:07:58.0 -0800 > > +++ b/net/sched/sch_dsmark.c2008-01-20 13:22:54.0 -0800 > > @@ -45,13 +45,8 @@ struct dsmark_qdisc_data { > > > > static inline int dsmark_valid_indices(u16 indices) > > { > > - while (indices != 1) { > > - if (indices & 1) > > - return 0; > > - indices >>= 1; > > - } > > - > > - return 1; > > + /* Must have only one bit set */ > > + return (indices & (indices - 1)) == 0; Isn't there some magic under include/linux to do that, btw? I suppose that if the caller-side zero check is pushed down there too, is_power_of_2() is a 100% match? :-) > hweight seems easier to understand, it took me a bit > to realize that the comment matches the code :) In addition, the original seems to loop forever when given zero indices, but luckily that was checked at the caller site already... -- i.
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Mon, 21 Jan 2008, Dave Young wrote: > Please see the kernel messages following,(trigged while using some qemu > session) > BTW, seems there's some e100 error message as well. > > PCI: Setting latency timer of device :00:1b.0 to 64 > e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI > e100: Copyright(c) 1999-2006 Intel Corporation > ACPI: PCI Interrupt :03:08.0[A] -> GSI 20 (level, low) -> IRQ 20 > modprobe:2331 conflicting cache attribute efaff000-efb0 uncached<->default > e100: :03:08.0: e100_probe: Cannot map device registers, aborting. > ACPI: PCI interrupt for device :03:08.0 disabled > e100: probe of :03:08.0 failed with error -12 > eth0: setting full-duplex. > [ cut here ] > WARNING: at net/ipv4/tcp_input.c:2169 tcp_mark_head_lost+0x121/0x150() > Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss eeprom e100 psmouse snd_hda_intel snd_pcm snd_timer btusb rtc_cmos thermal bluetooth rtc_core serio_raw intel_agp button processor sg snd rtc_lib i2c_i801 evdev agpgart soundcore dcdbas 3c59x pcspkr snd_page_alloc > Pid: 0, comm: swapper Not tainted 2.6.24-rc8-mm1 #4 > [] ? printk+0x0/0x20 > [] warn_on_slowpath+0x54/0x80 > [] ? ip_finish_output+0x128/0x2e0 > [] ? ip_output+0xe7/0x100 > [] ? ip_local_out+0x18/0x20 > [] ? ip_queue_xmit+0x3dc/0x470 > [] ? _spin_unlock_irqrestore+0x5e/0x70 > [] ? check_pad_bytes+0x61/0x80 > [] tcp_mark_head_lost+0x121/0x150 > [] tcp_update_scoreboard+0x4c/0x170 > [] tcp_fastretrans_alert+0x48a/0x6b0 > [] tcp_ack+0x1b3/0x3a0 > [] tcp_rcv_established+0x3eb/0x710 > [] tcp_v4_do_rcv+0xe5/0x100 > [] tcp_v4_rcv+0x5db/0x660 Doh, once more these S+L things...; the rest are symptoms of the first problem. What is strange is that it didn't show up until now; the last TCP changes that could have some significance are from early Dec/Nov. Is there some reason why you haven't seen this before now (e.g., not tested with a similar cfg or so)?
I'm a bit worried about its reproducibility if it takes this long to show up... -- i.
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Tue, 22 Jan 2008, Dave Young wrote: > On Jan 22, 2008 12:37 PM, Dave Young <[EMAIL PROTECTED]> wrote: > > > > On Jan 22, 2008 5:14 AM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote: > > > > > > On Mon, 21 Jan 2008, Dave Young wrote: > > > > > > > Please see the kernel messages following,(trigged while using some qemu > > > > session) > > > > BTW, seems there's some e100 error message as well. > > > > > > > > PCI: Setting latency timer of device :00:1b.0 to 64 > > > > e100: Intel(R) PRO/100 Network Driver, 3.5.23-k4-NAPI > > > > e100: Copyright(c) 1999-2006 Intel Corporation > > > > ACPI: PCI Interrupt :03:08.0[A] -> GSI 20 (level, low) -> IRQ 20 > > > > modprobe:2331 conflicting cache attribute efaff000-efb0 > > > > uncached<->default > > > > e100: :03:08.0: e100_probe: Cannot map device registers, aborting. > > > > ACPI: PCI interrupt for device :03:08.0 disabled > > > > e100: probe of :03:08.0 failed with error -12 > > > > eth0: setting full-duplex. > > > > [ cut here ] > > > > WARNING: at net/ipv4/tcp_input.c:2169 tcp_mark_head_lost+0x121/0x150() > > > > Modules linked in: snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq > > > > snd_seq_device snd_pcm_oss snd_mixer_oss eeprom e100 psmouse > > > > snd_hda_intel snd_pcm snd_timer btusb rtc_cmos thermal bluetooth > > > > rtc_core serio_raw intel_agp button processor sg snd rtc_lib i2c_i801 > > > > evdev agpgart soundcore dcdbas 3c59x pcspkr snd_page_alloc > > > > Pid: 0, comm: swapper Not tainted 2.6.24-rc8-mm1 #4 > > > > [] ? printk+0x0/0x20 > > > > [] warn_on_slowpath+0x54/0x80 > > > > [] ? ip_finish_output+0x128/0x2e0 > > > > [] ? ip_output+0xe7/0x100 > > > > [] ? ip_local_out+0x18/0x20 > > > > [] ? ip_queue_xmit+0x3dc/0x470 > > > > [] ? _spin_unlock_irqrestore+0x5e/0x70 > > > > [] ? 
check_pad_bytes+0x61/0x80 > > > > [] tcp_mark_head_lost+0x121/0x150 > > > > [] tcp_update_scoreboard+0x4c/0x170 > > > > [] tcp_fastretrans_alert+0x48a/0x6b0 > > > > [] tcp_ack+0x1b3/0x3a0 > > > > [] tcp_rcv_established+0x3eb/0x710 > > > > [] tcp_v4_do_rcv+0xe5/0x100 > > > > [] tcp_v4_rcv+0x5db/0x660 > > > > > > Doh, once more these S+L things..., the rest are symptom of the first > > > problem. > > What is the S+L thing? Could you explain a bit? It means that one of the skbs is both SACKed and marked as LOST at the same time in the counters (it might be due to a miscount of lost/sacked_out too, not necessarily in the ->sacked bits). Such a state is logically invalid because it would mean that the sender thinks that the same packet both reached the receiver and is lost in the network. Traditionally TCP has just silently "corrected" over-estimates (sacked_out+lost_out > packets_out). I changed this a couple of releases ago because those over-estimates often are due to bugs that should be fixed (there have been a couple of them, but it has been very quiet on this front for a long time, months or even half a year already; but I might have broken something with the early Dec changes). These problems may originate from a bug that occurred a number of ACKs before the WARN_ON triggered; therefore they are a bit tricky to track, and those WARN_ONs serve just for alerting purposes and usually do not point out where the bug actually occurred. I usually just ask people to include an exhaustive verifier which compares the ->sacked bitmaps with the sacked/lost_out counters and reports immediately when the problem shows up, rather than waiting for the cheaper S+L check we do in the WARN_ON to trigger. I tried to collect a tracking patch from the previous efforts (hopefully got it right after modifications). > > I'm a bit worried about its > > > reproducability if it takes this far to see it... > > > > > It's trigged again in my pc, just while using firefox. ...Good, then there's some chance to catch it. -- i.
[PATCH] [TCP]: debug S+L --- include/net/tcp.h |8 +++- net/ipv4/tcp_input.c |6 +++ net/ipv4/tcp_ipv4.c | 101 + net/ipv4/tcp_output.c | 21 +++--- 4 files changed, 129 insertions(+), 7 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 7de4ea3..0685035 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics); #define TCP_ADD_STATS_BH(field, val
Re: WARNING, tcp_fastretrans_alert, rc6-git11
On Tue, 22 Jan 2008, Denys Fedoryshchenko wrote: > Just got on one of proxies, under high load. > It is a bit old rc, so probably my report not interesting, but since it is > production machines, i cannot change too often. > Kernel is 2.6.24-rc6-git11 It's not at all useless; there haven't been any TCP changes lately, so this is still very much valid. > Some sysctl adjustments done. Please tell me if need more information. The TCP-related sysctl settings would be nice to know (I don't care much about mem sizes, but other tweaks affecting TCP features are nice to know). ...snip... > [9561199.893090] WARNING: at net/ipv4/tcp_input.c:2391 tcp_fastretrans_alert() This check includes fixing things up if there's a miscount (a couple of releases ago we didn't even care to print that but had inaccurate fackets_out as a feature; there may still be some left-overs, though I've tried to track them down). To find out the cause more accurately (this WARN_ON is just for alerting), I'd need to add rather exhaustive searching per ACK, which is unlikely to be acceptable for you. Luckily, Dave Young reported one problem lately (though he was running a brand new mm) which could be caused by the same problem as this, so it's not that bad even though you couldn't run the skb state verifying. Thanks anyway for the report. -- i.
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Tue, 22 Jan 2008, David Miller wrote: > From: "Dave Young" <[EMAIL PROTECTED]> > Date: Wed, 23 Jan 2008 09:44:30 +0800 > > > On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote: > > > [PATCH] [TCP]: debug S+L > > > > Thanks, If there's new findings I will let you know. > > Thanks for helping with this bug Dave. I noticed btw that the thing might (is likely to) spuriously trigger at WARN_ON(sacked != tp->sacked_out); because those won't be equal when SACK is not enabled. If that does happen too often, I'll send a fixed patch for it; still, the fact that I print tp->rx_opt.sack_ok already allows identification of those cases, as it's zero when SACK is not enabled. Just ask if you need the updated debug patch. -- i.
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Wed, 23 Jan 2008, Dave Young wrote: > On Jan 23, 2008 3:41 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote: > > > > On Tue, 22 Jan 2008, David Miller wrote: > > > > > From: "Dave Young" <[EMAIL PROTECTED]> > > > Date: Wed, 23 Jan 2008 09:44:30 +0800 > > > > > > > On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote: > > > > > [PATCH] [TCP]: debug S+L > > > > > > > > Thanks, If there's new findings I will let you know. > > > > > > Thanks for helping with this bug Dave. > > > > I noticed btw that there thing might (is likely to) spuriously trigger at > > WARN_ON(sacked != tp->sacked_out); because those won't be equal when SACK > > is not enabled. If that does happen too often, I send a fixed patch for > > it, yet, the fact that I print print tp->rx_opt.sack_ok allows > > identification of those cases already as it's zero when SACK is not > > enabled. > > > > Just ask if you need the updated debug patch. > > Thanks, please send, I would like to get it. There you go. I fixed the non-SACK case by adding tcp_is_sack checks there, and also added two verify calls to tcp_ack to see if there's corruption outside of TCP. -- i.
[PATCH] [TCP]: debug S+L --- include/net/tcp.h |8 +++- net/ipv4/tcp_input.c | 10 + net/ipv4/tcp_ipv4.c | 101 + net/ipv4/tcp_output.c | 21 +++--- 4 files changed, 133 insertions(+), 7 deletions(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 7de4ea3..0685035 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics); #define TCP_ADD_STATS_BH(field, val) SNMP_ADD_STATS_BH(tcp_statistics, field, val) #define TCP_ADD_STATS_USER(field, val) SNMP_ADD_STATS_USER(tcp_statistics, field, val) +extern voidtcp_verify_wq(struct sock *sk); + extern voidtcp_v4_err(struct sk_buff *skb, u32); extern voidtcp_shutdown (struct sock *sk, int how); @@ -768,7 +770,11 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk) } /* Use define here intentionally to get WARN_ON location shown at the caller */ -#define tcp_verify_left_out(tp)WARN_ON(tcp_left_out(tp) > tp->packets_out) +#define tcp_verify_left_out(tp)\ + do { \ + WARN_ON(tcp_left_out(tp) > tp->packets_out); \ + tcp_verify_wq((struct sock *)tp); \ + } while(0) extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh); extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index fa2c85c..cdacf70 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2645,6 +2645,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag) if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk))) tcp_update_scoreboard(sk, fast_rexmit); tcp_cwnd_down(sk, flag); + + WARN_ON(tcp_write_queue_head(sk) == NULL); + WARN_ON(!tp->packets_out); + tcp_xmit_retransmit_queue(sk); } @@ -2848,6 +2852,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets) tcp_clear_all_retrans_hints(tp); } + tcp_verify_left_out(tp); + if (skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)) flag |= FLAG_SACK_RENEGING; @@ -3175,6 +3181,8 @@ static int 
tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) prior_fackets = tp->fackets_out; prior_in_flight = tcp_packets_in_flight(tp); + tcp_verify_left_out(tp); + if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) { /* Window is constant, pure forward advance. * No more checks are required. @@ -3237,6 +3245,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag) if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP)) dst_confirm(sk->sk_dst_cache); + tcp_verify_left_out(tp); + return 1; no_queue: diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 9aea88b..7e8ab40 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -108,6 +108,107 @@ struct inet_hashinfo __cacheline_aligned tcp_hashinfo = { .lhash_wait = __WAIT_QUEUE_HEAD_INITIALIZER(tcp_hashinfo.lhash_wait), }; +void tcp_print_queue(struct sock *sk) +{ + struct tcp_sock *tp = tcp_sk(sk); + struct sk_buff *skb; + char s[50+1]; + char h[50+1]; + int idx = 0; + int i; + + tcp_for_write_queue(skb, sk) { + if (skb == tcp_send_he
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Krishna Kumar2 wrote: > David Miller <[EMAIL PROTECTED]> wrote on 01/23/2008 01:27:23 PM: > > > > iperf with multiple threads almost always gets these 4, *especially* > when I > > > do some batching :). > > > > > > static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int > flag) > > > { > > > ... > > > if (WARN_ON(!tp->sacked_out && tp->fackets_out)) > > > tp->fackets_out = 0; > > > ... > > > } > > > > Does this assertion show up first or do you get the other TCP > > ones first? It might be important, in that if you get the > > others ones first that corrupted state might be what leads to > > this one. > > Hi Dave, > > I looked at my *old* messages file and found this assert (2506) was first > to hit (atleast in > two messages file). It hit 5 times, then I got a different one that I had > not reported earlier: > > "KERNEL: assertion (packets <= tp->packets_out) failed at > net/ipv4/tcp_input.c (2139)" > > (though this was hidden in my report under the panic for tcp_input.c:2528. > > Then another two thousand times of the 2506 asserts. 
> > Today I installed the latest untouched kernel, rebooted system and got the > following errors > in sequence, but no 2506 errors (which I have always got when running > batching in the last > 2-3 weeks): > > Jan 22 02:07:55 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:56 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:56 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:57 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:58 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:59 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:07:59 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:00 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:01 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:01 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:02 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:03 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:03 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:04 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:05 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:06 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:06 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:07 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 > Jan 22 02:08:07 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:08 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:09 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:10 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:10 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:11 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:12 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 
02:08:12 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:13 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:14 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:15 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:15 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 > Jan 22 02:08:16 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 > Jan 22 02:08:16 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:1767 > Jan 22 02:08:17 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:18 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:18 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > Jan 22 02:08:19 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2528 > Jan 22 02:08:19 elm3b39 kernel: Badness at net/ipv4/tcp_input.c:2169 > > and so on for another 700 counts. > > The unique asserts are: > 1767: tcp_verify_left_out (from tcp_entry_frto) > 2169: tcp_verify_left_out (from tcp_mark_head_lost) > 2528: tcp_verify_left_out (from tcp_fastretrans_alert) > 3063: tcp_verify_left_out (from tcp_process_frto) > (where 2169 seems to preceed any other asserts) Once you get one of these, you'll get a large number of them; maybe I should just change it to WARN_ON_ONCE to stop confusing people with the rest. > The other two asserts that I got only with batching are: > 2139: BUG_TRAP(packets <= tp->packets_out); (in tcp_mark_head_lost) > 2506: WARN_ON(!tp->sacked_out && tp->fackets_out) (in > tcp_fastretrans_alert) > (where 2506 always seems to preceed any other asserts). It's almost impossible to know which of these is the main cause and the first occurring, due to reasons I'll not copy here. What a strange thing that it has been super quiet on this front until now, when everybody is seeing it; could there be something unr
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Krishna Kumar2 wrote: > Hi Ilpo, > > > It's almost impossible to know which of these is the main cause and the > > first occuring due to reasons I'll not copy here. What a strange thing > > that it has been super quiet on this front until now everybody is seeing > > it, could there be something unrelated to TCP which has broken it all > > recently? > > I have been getting this for atleast 3 weeks but I was quiet since those > were kernels that I had modified. Since you can easily reproduce it, let's just figure out what's causing it the hard way, rather than digging through the endless git logs... :-) > > Good thing is that you seem to be able to reproduce it were nicely. > > Please try with this beauty below... Hopefully got it correctly matching > > to mainline, in contrast to version I sent Dave Y., there's some added > > candy which catches highest_sack corruption as well, at least it compiles > > > already :-). > > There were couple of patch apply failures in .c files which I fixed by > hand. > But when compiling, I got these errors (I am using DM's 2.6.24-rc7 kernel, > net-2.6.25.git): Well, that's annoying, you didn't mention net-2.6.25 back then; it sure is incompatible with it, like the patch title already said... :-) > net/ipv4/tcp_output.c: In function 'tcp_push_one': > net/ipv4/tcp_output.c:1573: error: 'tp' undeclared (first use in this > function) > net/ipv4/tcp_output.c:1573: error: (Each undeclared identifier is reported > only once > net/ipv4/tcp_output.c:1573: error: for each function it appears in.) ...This is either due to a mismerge or due to your modifications. Let's re-iterate; tcp_{ipv4,input,output}.o compiled ok for me... -- i.
[PATCH] [TCP]: debug S+L (for net-2.5.26 / mm, incompatible with mainline)
---
 include/net/tcp.h     |    8 +++-
 net/ipv4/tcp_input.c  |   10 +
 net/ipv4/tcp_ipv4.c   |  107 +
 net/ipv4/tcp_output.c |   21 +++---
 4 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7de4ea3..0685035 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,6 +272,8 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 #define TCP_ADD_STATS_BH(field, val)	SNMP_ADD_STATS_BH(tcp_statistics, field, val)
 #define TCP_ADD_STATS_USER(field, val)	SNMP_ADD_STATS_USER(tcp_statistics, field, val)
 
+extern void	tcp_verify_wq(struct sock *sk);
+
 extern void	tcp_v4_err(struct sk_buff *skb, u32);
 
 extern void	tcp_shutdown (struct sock *sk, int how);
@@ -768,7 +770,11 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk)
 }
 
 /* Use define here intentionally to get WARN_ON location shown at the caller */
-#define tcp_verify_left_out(tp)	WARN_ON(tcp_left_out(tp) > tp->packets_out)
+#define tcp_verify_left_out(tp)	\
+	do { \
+		WARN_ON(tcp_left_out(tp) > tp->packets_out); \
+		tcp_verify_wq((struct sock *)tp); \
+	} while(0)
 
 extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
 extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fa2c85c..cdacf70 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2645,6 +2645,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 	if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
 		tcp_update_scoreboard(sk, fast_rexmit);
 	tcp_cwnd_down(sk, flag);
+
+	WARN_ON(tcp_write_queue_head(sk) == NULL);
+	WARN_ON(!tp->packets_out);
+
 	tcp_xmit_retransmit_queue(sk);
 }
 
@@ -2848,6 +2852,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets)
 		tcp_clear_all_retrans_hints(tp);
 	}
 
+	tcp_verify_left_out(tp);
+
 	if (skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
 		flag |= FLAG_SACK_RENEGING;
 
@@ -3175,6 +3181,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	prior_fackets = tp->fackets_out;
 	prior_in_flight = tcp_packets_in_flight(tp);
 
+	tcp_verify_left_out(tp);
+
 	if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
 		/* Window is constant, pure forward advance.
 		 * No more checks are required.
@@ -3237,6 +3245,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))
 		dst_confirm(sk->sk_dst_cache);
 
+	tcp_verify_left_out(tp);
+
 	return 1;
 
 no_queue:
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9aea88b..c95682e 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -108,6 +108,113 @@ struct inet_hashinfo __cacheline_aligned tcp_hashinfo = {
 	.lhash_wait = __WAIT_QUEUE_HEAD_INITIALIZER(tcp_hashinfo.lhas
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Wed, 23 Jan 2008, Ilpo Järvinen wrote:

> On Wed, 23 Jan 2008, Dave Young wrote:
>
> > On Jan 23, 2008 3:41 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> > >
> > > On Tue, 22 Jan 2008, David Miller wrote:
> > >
> > > > From: "Dave Young" <[EMAIL PROTECTED]>
> > > > Date: Wed, 23 Jan 2008 09:44:30 +0800
> > > >
> > > > > On Jan 22, 2008 6:47 PM, Ilpo Järvinen <[EMAIL PROTECTED]> wrote:
> > > > > > [PATCH] [TCP]: debug S+L
> > > > >
> > > > > Thanks, if there are new findings I will let you know.
> > > >
> > > > Thanks for helping with this bug, Dave.
> > >
> > > I noticed btw that the thing might (is likely to) spuriously trigger at
> > > WARN_ON(sacked != tp->sacked_out); because those won't be equal when
> > > SACK is not enabled. If that happens too often, I'll send a fixed patch
> > > for it; yet the fact that I print tp->rx_opt.sack_ok already allows
> > > identification of those cases, as it's zero when SACK is not enabled.
> > >
> > > Just ask if you need the updated debug patch.
> >
> > Thanks, please send, I would like to get it.
>
> There you go. I fixed the non-SACK case by adding tcp_is_sack checks there
> and also added two verifys to tcp_ack to see if there's corruption outside
> of TCP.

There's some discussion about a problem that is very likely the same as
this one (sorry for not remembering to cc you there due to the rapid
progress):

http://marc.info/?t=12010717423&r=1&w=2

-- 
 i.
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Ilpo Järvinen wrote:

> On Wed, 23 Jan 2008, Krishna Kumar2 wrote:
>
> > Hi Ilpo,
> >
> > > It's almost impossible to know which of these is the main cause and the
> > > first occurring due to reasons I'll not copy here. What a strange thing
> > > that it has been super quiet on this front until now everybody is
> > > seeing it; could there be something unrelated to TCP which has broken
> > > it all recently?
> >
> > I have been getting this for at least 3 weeks but I was quiet since those
> > were kernels that I had modified.
>
> Since you can easily reproduce it, let's just figure out what's causing it
> the hard way, rather than digging through the endless git logs... :-)

Hmm, perhaps it could be something related to this (and some untested path
somewhere which is now exposed):

commit 4a55b553f691abadaa63570dfc714e20913561c1
Author: Ilpo Järvinen <[EMAIL PROTECTED]>
Date:   Thu Dec 20 20:36:03 2007 -0800

    [TCP]: Fix TSO deferring

Dave, what do you think? That wouldn't explain the one -rc-only report from
Denys, though.

Another one I'm a bit unsure of is this:

commit 757c32944b80fd95542bd66f06032ab773034d53
Author: Ilpo Järvinen <[EMAIL PROTECTED]>
Date:   Thu Jan 3 20:39:01 2008 -0800

    [TCP]: Perform setting of common control fields in one place

The ->sacked field is cleared in tcp_retransmit_skb due to a subtle change
there, which might be buggy. However, I find it rather unlikely that this
would explain Kumar's case. Anyway, here's the one that reverts the
problematic part of it. ...And this is net-2.6.25 as well, so it won't
solve Denys' case either.

-- 
 i.

--
[PATCH] [TCP]: Revert part of "[TCP]: Perform setting of common control..."

This commit reverts part of 757c32944b80fd95542bd66f06032ab773034d53
([TCP]: Perform setting of common control fields in one place) because it's
not valid to clear the ->sacked field like that without additional
precautions with the counters; mostly lost_out, as sacked_out should not be
set due to reneging, which kicks in much earlier than this.
Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/ipv4/tcp_output.c |    8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 648340f..f6cbc1f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1887,10 +1887,12 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 	    (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) &&
 	    tp->snd_una == (TCP_SKB_CB(skb)->end_seq - 1)) {
 		if (!pskb_trim(skb, 0)) {
-			/* Reuse, even though it does some unnecessary work */
-			tcp_init_nondata_skb(skb, TCP_SKB_CB(skb)->end_seq - 1,
-					     TCP_SKB_CB(skb)->flags);
+			TCP_SKB_CB(skb)->seq = TCP_SKB_CB(skb)->end_seq - 1;
+			skb_shinfo(skb)->gso_segs = 1;
+			skb_shinfo(skb)->gso_size = 0;
+			skb_shinfo(skb)->gso_type = 0;
 			skb->ip_summed = CHECKSUM_NONE;
+			skb->csum = 0;
 		}
 	}
-- 
1.5.2.2
Re: Assertions in latest kernels
On Wed, 23 Jan 2008, Krishna Kumar2 wrote:

> While running with this patch, I got these errors (pasted at the end
> of this mail).

I don't have a clue why it didn't get to the checking func (or why it
didn't print anything) but just hit those WARN_ONs... Hopefully this gives
somewhat better input (applies on top of the other debug patch).

-- 
 i.

[PATCH] [TCP]: more debug
---
 include/net/tcp.h    |    3 ++-
 net/ipv4/tcp_input.c |    9 -
 net/ipv4/tcp_ipv4.c  |   19 ++-
 3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0685035..129c3b1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,6 +272,7 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 #define TCP_ADD_STATS_BH(field, val)	SNMP_ADD_STATS_BH(tcp_statistics, field, val)
 #define TCP_ADD_STATS_USER(field, val)	SNMP_ADD_STATS_USER(tcp_statistics, field, val)
 
+extern void	tcp_print_queue(struct sock *sk);
 extern void	tcp_verify_wq(struct sock *sk);
 
 extern void	tcp_v4_err(struct sk_buff *skb, u32);
@@ -772,8 +773,8 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk)
 /* Use define here intentionally to get WARN_ON location shown at the caller */
 #define tcp_verify_left_out(tp)	\
 	do { \
-		WARN_ON(tcp_left_out(tp) > tp->packets_out); \
 		tcp_verify_wq((struct sock *)tp); \
+		WARN_ON(tcp_left_out(tp) > tp->packets_out); \
 	} while(0)
 
 extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index cdacf70..295490e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2133,12 +2133,15 @@ static void tcp_verify_retransmit_hint(struct tcp_sock *tp, struct sk_buff *skb)
 static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb;
+	struct sk_buff *skb, *prev = NULL;
 	int cnt;
 
+	tcp_verify_left_out(tp);
+
 	BUG_TRAP(packets <= tp->packets_out);
 	if (tp->lost_skb_hint) {
 		skb = tp->lost_skb_hint;
+		prev = skb;
 		cnt = tp->lost_cnt_hint;
 	} else {
 		skb = tcp_write_queue_head(sk);
@@ -2166,6 +2169,10 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
 			tcp_verify_retransmit_hint(tp, skb);
 		}
 	}
+	if (tcp_left_out(tp) > tp->packets_out) {
+		printk(KERN_ERR "Prev hint: %p, exit %p\n", prev, skb);
+		tcp_print_queue(sk);
+	}
 	tcp_verify_left_out(tp);
 }
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index c95682e..c2a88c5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -117,6 +117,15 @@ void tcp_print_queue(struct sock *sk)
 	int idx = 0;
 	int i;
 
+	i = 0;
+	tcp_for_write_queue(skb, sk) {
+		if (skb == tcp_send_head(sk))
+			printk(KERN_ERR "head %u %p\n", i, skb);
+		else
+			printk(KERN_ERR "skb %u %p\n", i, skb);
+		i++;
+	}
+
 	tcp_for_write_queue(skb, sk) {
 		if (skb == tcp_send_head(sk))
 			break;
@@ -195,11 +204,6 @@ void tcp_verify_wq(struct sock *sk)
 		packets += tcp_skb_pcount(skb);
 	}
 
-	WARN_ON(lost != tp->lost_out);
-	WARN_ON(tcp_is_sack(tp) && (sacked != tp->sacked_out));
-	WARN_ON(packets != tp->packets_out);
-	WARN_ON(fackets != tp->fackets_out);
-
 	if ((lost != tp->lost_out) ||
 	    (tcp_is_sack(tp) && (sacked != tp->sacked_out)) ||
 	    (packets != tp->packets_out) ||
@@ -213,6 +217,11 @@ void tcp_verify_wq(struct sock *sk)
 			tp->rx_opt.sack_ok);
 		tcp_print_queue(sk);
 	}
+
+	WARN_ON(lost != tp->lost_out);
+	WARN_ON(tcp_is_sack(tp) && (sacked != tp->sacked_out));
+	WARN_ON(packets != tp->packets_out);
+	WARN_ON(fackets != tp->fackets_out);
 }
 
 static int tcp_v4_get_port(struct sock *sk, unsigned short snum)
-- 
1.5.2.2
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
10
> [] rest_init+0x5d/0x60
> [] start_kernel+0x1fa/0x260
> [] ? unknown_bootoption+0x0/0x130
> ===
> ---[ end trace 14b601818e6903ac ]---

...But this no longer is, and even more, L: 5 is not a valid state at this
point at all (it should only happen if we went to RTO, but that would reset
S to zero with NewReno):

> P: 5 L: 5 vs 5 S: 0 vs 3 w: 2044790889-2044796616 (0)
> TCP wq(s) l<
> TCP wq(h) +++h+<
> l5 s3 f0 p5 seq: su2044790889 hs2044795029 sn2044796616

Surprisingly, it was the first time the WARN_ON for left_out returned the
correct location. This also explains why the patch I sent to Krishna didn't
print anything (it didn't end up printing because I forgot to add the
L+S > P check into the state-checking if).

...So please, could you (others than Denys) try this patch; it should solve
the issue. And Denys, could you confirm (and if necessary double check)
that the kernel you saw this similar problem with is the pure Linus
mainline, i.e., without any net-2.6.25 or mm bits; if so, that problem
persists.

And anyway, there were some fackets_out related problems reported as well,
and this doesn't help with that, but I think I've lost track of who was
seeing it due to the large number of reports :-). Could somebody refresh my
memory, because I currently don't have time to dig it up from the archives
(at least this week).

-- 
 i.

--
[PATCH] [TCP]: NewReno must count every skb while marking losses

NewReno should add cnt per skb (as with FACK) instead of depending on
SACKED_ACKED bits which won't be set with it at all. Effectively, with
NewReno the loop should always exit after the first iteration anyway (or
immediately if there's already a head in lost_out).

This was fixed earlier in net-2.6.25 but got reverted among other stuff,
and I didn't notice that this is still necessary (actually I wasn't even
considering this case while trying to figure out the reports, because I
lived with a different kind of code than it in reality was).
This should solve the WARN_ONs in TCP code that, as a result of this,
triggered multiple times in every place we check for this invariant.

Special thanks to Dave Young <[EMAIL PROTECTED]> and Krishna Kumar2
<[EMAIL PROTECTED]> for trying with my debug patches.

Signed-off-by: Ilpo Järvinen <[EMAIL PROTECTED]>
---
 net/ipv4/tcp_input.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 295490e..aa409a5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2156,7 +2156,7 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
 		tp->lost_skb_hint = skb;
 		tp->lost_cnt_hint = cnt;
 
-		if (tcp_is_fack(tp) ||
+		if (tcp_is_fack(tp) || tcp_is_reno(tp) ||
 		    (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
 			cnt += tcp_skb_pcount(skb);
-- 
1.5.2.2
Re: 2.6.24-rc8-mm1 : net tcp_input.c warnings
On Thu, 24 Jan 2008, Ilpo Järvinen wrote:

> And anyway, there were some fackets_out related problems reported as
> well and this doesn't help for that but I think I've lost track of who
> was seeing it due to large number of reports :-), could somebody
> refresh my memory because I currently don't have time to dig it up
> from archives (at least on this week).

Here's the updated debug patch for net-2.6.25/mm for tracking fackets_out
inconsistencies (it won't work for trees which don't include net-2.6.25;
mm does, of course :-)). I hope I got it into good shape this time, to
avoid spurious stacktraces while still maintaining 100% accuracy, but it's
not a simple oneliner so I might have missed something... :-)

I'd suggest that people trying with this first apply the NewReno fix from
the previous mail, to avoid the already-fixed case from triggering.

-- 
 i.

--
[PATCH] [TCP]: debug S+L (for net-2.5.26 / mm, incompatible with mainline)
---
 include/net/tcp.h     |    5 ++-
 net/ipv4/tcp_input.c  |   18 +++-
 net/ipv4/tcp_ipv4.c   |  127 +
 net/ipv4/tcp_output.c |   23 +++--
 4 files changed, 165 insertions(+), 8 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7de4ea3..552aa71 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -272,6 +272,9 @@ DECLARE_SNMP_STAT(struct tcp_mib, tcp_statistics);
 #define TCP_ADD_STATS_BH(field, val)	SNMP_ADD_STATS_BH(tcp_statistics, field, val)
 #define TCP_ADD_STATS_USER(field, val)	SNMP_ADD_STATS_USER(tcp_statistics, field, val)
 
+extern void	tcp_print_queue(struct sock *sk);
+extern void	tcp_verify_wq(struct sock *sk);
+
 extern void	tcp_v4_err(struct sk_buff *skb, u32);
 
 extern void	tcp_shutdown (struct sock *sk, int how);
@@ -768,7 +771,7 @@ static inline __u32 tcp_current_ssthresh(const struct sock *sk)
 }
 
 /* Use define here intentionally to get WARN_ON location shown at the caller */
-#define tcp_verify_left_out(tp)	WARN_ON(tcp_left_out(tp) > tp->packets_out)
+#define tcp_verify_left_out(tp)	tcp_verify_wq((struct sock *)tp)
 
 extern void tcp_enter_cwr(struct sock *sk, const int set_ssthresh);
 extern __u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 19c449f..c897c93 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1426,8 +1426,10 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb,
 	int first_sack_index;
 
 	if (!tp->sacked_out) {
-		if (WARN_ON(tp->fackets_out))
+		if (WARN_ON(tp->fackets_out)) {
+			tcp_verify_left_out(tp);
 			tp->fackets_out = 0;
+		}
 		tcp_highest_sack_reset(sk);
 	}
 
@@ -2136,6 +2138,8 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
 	struct sk_buff *skb;
 	int cnt;
 
+	tcp_verify_left_out(tp);
+
 	BUG_TRAP(packets <= tp->packets_out);
 	if (tp->lost_skb_hint) {
 		skb = tp->lost_skb_hint;
@@ -2501,6 +2505,8 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 		    (tcp_fackets_out(tp) > tp->reordering));
 	int fast_rexmit = 0;
 
+	tcp_verify_left_out(tp);
+
 	if (WARN_ON(!tp->packets_out && tp->sacked_out))
 		tp->sacked_out = 0;
 	if (WARN_ON(!tp->sacked_out && tp->fackets_out))
@@ -2645,6 +2651,10 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, int flag)
 	if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
 		tcp_update_scoreboard(sk, fast_rexmit);
 	tcp_cwnd_down(sk, flag);
+
+	WARN_ON(tcp_write_queue_head(sk) == NULL);
+	WARN_ON(!tp->packets_out);
+
 	tcp_xmit_retransmit_queue(sk);
 }
 
@@ -2848,6 +2858,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets)
 		tcp_clear_all_retrans_hints(tp);
 	}
 
+	tcp_verify_left_out(tp);
+
 	if (skb && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
 		flag |= FLAG_SACK_RENEGING;
 
@@ -3175,6 +3187,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	prior_fackets = tp->fackets_out;
 	prior_in_flight = tcp_packets_in_flight(tp);
 
+	tcp_verify_left_out(tp);
+
 	if (!(flag & FLAG_SLOWPATH) && after(ack, prior_snd_una)) {
 		/* Window is constant, pure forward advance.
 		 * No more checks are required.
@@ -3237,6 +3251,8 @@ static int tcp_ack(struct sock *sk, struct sk_buff *skb, int flag)
 	if ((flag & FLAG_FORWARD_PROGRESS) || !(flag & FLAG_NOT_DUP))