Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-20 Thread Andrea Arcangeli

On Wed, Sep 20, 2000 at 03:24:28PM +0200, Andi Kleen wrote:
> That would just break the whole idea behind softnet. When you're juggling

I agree ;(.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-20 Thread Andi Kleen

On Wed, Sep 20, 2000 at 03:29:39PM +0200, Andrea Arcangeli wrote:
> On Wed, Sep 20, 2000 at 04:38:24AM +0200, Andi Kleen wrote:
> > We must be talking about different things. It of course detects it on
> > ACK input, but only for data it did send itself. Every TCP detects 
> > reordering automatically on the input with the sequence number check,
> > but all we still do is to send an ACK immediately and go into
> > quickack mode.
> 
> Agreed. I couldn't see how to "detect" the reordering in the receiver without
> additional informations from the dev.c layer (I don't think it's possible to
> detect it from the TCP layer without adding something like described by
> Andi in the below section that looks not very nice because it would
> hurt links with real packet loss).

Assuming you have outgoing traffic on the NIC as well it isn't that bad,
because there will be likely some queueing delay for the ACK in which
you could in theory still cancel it of you know better (it would hurt the
abstraction between device drivers and dev.c a lot though) 

> The solution consists in a sequence number protected by the spinlock in the NIC
> device driver (as DaveM suggested) and to increase this sequence number for
> each packet received and to set the sequence number value of the moment in each
> skb. Then we'd need a secondary (shared :( ) sequence numbers in the
> net_rx_action layer that will tell us the sequence number of the last packet
> that entered the higher level layer (IP in our case). With this per packet
> sequence number information we'll know if we're generating reordering by
> pushing the next packet in the softnet queue from net_rx_action to the IP
> layer.  In that case we should only set up a wake-up-me-later logic and giveup.

That would just break the whole idea behind softnet. When you're juggling
the cache line for the sequence counter you could as well juggle the 
cache line with a shared queue head, e.g. we would be at 2.2 level again.

> The other almost zero cost way to fix the reordering is of course to set the
> irq affinity of each nic only to one CPU :)) (this applies to the less smart
> protocol as well)

That's the real solution long term, right.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-20 Thread Andrea Arcangeli

On Wed, Sep 20, 2000 at 04:38:24AM +0200, Andi Kleen wrote:
> We must be talking about different things. It of course detects it on
> ACK input, but only for data it did send itself. Every TCP detects 
> reordering automatically on the input with the sequence number check,
> but all we still do is to send an ACK immediately and go into
> quickack mode.

Agreed. I couldn't see how to "detect" the reordering in the receiver without
additional informations from the dev.c layer (I don't think it's possible to
detect it from the TCP layer without adding something like described by
Andi in the below section that looks not very nice because it would
hurt links with real packet loss).

> Or did I miss something ? 
> 
> The only way to do true reordering handling on the receiver I can think of
> would be to use something like the soft-timers to do very very fast delayed 
> acks even for rcv_mss sized packets and hope that you collect packets from 
> all CPUs in the delay, but overall it could still cost you a lot by 
> disturbing the ACK clock 
> 
> [I talked a lot with Andrea about this during OLS, and we couldn't
> figure out a good way] 

I couldn't figure out a good way to fix that internally to TCP, but I have a
quite simple solution at the dev.c layer (that would address also the problem
with the not smart protocols that needs to retransmit everything once an hole
in the sequence space is found).

The solution consists in a sequence number protected by the spinlock in the NIC
device driver (as DaveM suggested) and to increase this sequence number for
each packet received and to set the sequence number value of the moment in each
skb. Then we'd need a secondary (shared :( ) sequence numbers in the
net_rx_action layer that will tell us the sequence number of the last packet
that entered the higher level layer (IP in our case). With this per packet
sequence number information we'll know if we're generating reordering by
pushing the next packet in the softnet queue from net_rx_action to the IP
layer.  In that case we should only set up a wake-up-me-later logic and giveup.
No idea how much this additional logic hurts. Certainly it could be enabled
per-device/per-protocol basis.

The other almost zero cost way to fix the reordering is of course to set the
irq affinity of each nic only to one CPU :)) (this applies to the less smart
protocol as well)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-20 Thread Andi Kleen

On Tue, Sep 19, 2000 at 08:56:57PM -0700, David S. Miller wrote:
>Just use __cacheline_aligned instead, like I did with the
>ip_local_data in the patch you just rejected. There is still the
>problem that generic SMP x86 kernels use a 32byte cacheline. Not a
>problem currently because the only x86 SMP is pII/pIII which has
>32byte, but with Williamette/Athlon SMP it'll be a problem because
>these have 64byte and 128byte cache lines. With __cacheline_aligned
>it'll be easy though to pick up such changes.
> 
> Ok, feel free to send me a patch which does only this.  I may doctor
> it up a bit, be warned :-)

Here is the complete patch I'm using, pick what you want from it.

This time I documented my intentions properly, see the source code
comments. I removed all power-of-two paddings because the shift/add
sequences seem to be always short enough.

I also documented a bigger potential win on machines with immediate 
memory access and sane interrupt model (and an even bigger -- 
Dimitris Michailidis's PDA patches) which both need more extensive 
modifications 

>On the other hand, the inetpeer code is only really exercised on
>machines that talk to lots and lots of destinations (=real
>servers), and 2.4 testing on such machines still has to begin.
> 
>Given that 2.4 testing has not really begun yet I would guess that
>it is safer to remove it now @)
> 
> True, but the following is what I'm really thinking about.  Suppose a
> few weeks from now Alexey greets both of our mailboxes with a fabulous
> solution to the tw recycle masqeurading problem, wouldn't it be quite
> a pain in the ass to put back in and retest all the inetpeer code?
> 
> Let's sit on this until Alexey gives a forecast about the
> possibilities of a solution in the near future ok?

Ok. I just don't want the code uselessly wasting cycles and cache lines in 
2.4.0 if possible.

-Andi

Index: include/net/snmp.h
===
RCS file: /cvs/linux/include/net/snmp.h,v
retrieving revision 1.16
diff -u -u -r1.16 snmp.h
--- include/net/snmp.h  2000/08/09 11:59:03 1.16
+++ include/net/snmp.h  2000/09/20 11:19:25
@@ -14,17 +14,34 @@
  * as published by the Free Software Foundation; either version
  * 2 of the License, or (at your option) any later version.
  *
+ * $Id$
+ *
  */
  
 #ifndef _SNMP_H
 #define _SNMP_H
+
+#include 
  
 /*
  * We use all unsigned longs. Linux will soon be so reliable that even these
  * will rapidly get too small 8-). Seriously consider the IpInReceives count
  * on the 20Gb/s + networks people expect in a few years time!
  */
-  
+
+/* 
+ * The rule for padding: 
+ * Best is power of two because then the right structure can be found by a simple
+ * shift. The structure should be always cache line aligned.
+ * gcc needs n=alignto(cachelinesize, popcnt(sizeof(bla_mib))) shift/add instructions
+ * to emulate multiply in case it is not power-of-two. Currently n is always <=3 for
+ * all sizes so simple cache line alignment is enough. 
+ * 
+ * The best solution would be a global CPU local area , especially on 64 and 128byte 
+ * cacheline machine it makes a *lot* of sense -AK
+ */ 
+
+ 
 struct ip_mib
 {
unsigned long   IpInReceives;
@@ -44,8 +61,8 @@
unsigned long   IpFragOKs;
unsigned long   IpFragFails;
unsigned long   IpFragCreates;
-   unsigned long   __pad[32-19];
-};
+   unsigned long   __pad[0]; 
+} cacheline_aligned;
  
 struct ipv6_mib
 {
@@ -71,8 +88,8 @@
unsigned long   Ip6FragCreates;
unsigned long   Ip6InMcastPkts;
unsigned long   Ip6OutMcastPkts;
-   unsigned long   __pad[32-22];
-};
+   unsigned long   __pad[0]; 
+} cacheline_aligned;
  
 struct icmp_mib
 {
@@ -102,8 +119,8 @@
unsigned long   IcmpOutTimestampReps;
unsigned long   IcmpOutAddrMasks;
unsigned long   IcmpOutAddrMaskReps;
-   unsigned long   __pad[32-26];
-};
+   unsigned long   __pad[0]; 
+} cacheline_aligned;
 
 struct icmpv6_mib
 {
@@ -140,8 +157,8 @@
unsigned long   Icmp6OutRedirects;
unsigned long   Icmp6OutGroupMembResponses;
unsigned long   Icmp6OutGroupMembReductions;
-   unsigned long   __pad[32-28];
-};
+   unsigned long   __pad[0]; 
+} cacheline_aligned;
  
 struct tcp_mib
 {
@@ -159,8 +176,8 @@
unsigned long   TcpRetransSegs;
unsigned long   TcpInErrs;
unsigned long   TcpOutRsts;
-   unsigned long   __pad[16-14];
-};
+   unsigned long   __pad[0]; 
+} cacheline_aligned;
  
 struct udp_mib
 {
@@ -168,8 +185,8 @@
unsigned long   UdpNoPorts;
unsigned long   UdpInErrors;
unsigned long   UdpOutDatagrams;
-   unsigned long   __pad[0];
-};
+   unsigned long   __pad[0];
+} cacheline_aligned; 
 
 struct linux_mib 
 {
@@ -237,9 +254,15 @@
unsigned long   TCPAbortOnLinger;

Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-20 Thread Andi Kleen

On Tue, Sep 19, 2000 at 08:56:57PM -0700, David S. Miller wrote:
Just use __cacheline_aligned instead, like I did with the
ip_local_data in the patch you just rejected. There is still the
problem that generic SMP x86 kernels use a 32byte cacheline. Not a
problem currently because the only x86 SMP is pII/pIII which has
32byte, but with Williamette/Athlon SMP it'll be a problem because
these have 64byte and 128byte cache lines. With __cacheline_aligned
it'll be easy though to pick up such changes.
 
 Ok, feel free to send me a patch which does only this.  I may doctor
 it up a bit, be warned :-)

Here is the complete patch I'm using, pick what you want from it.

This time I documented my intentions properly, see the source code
comments. I removed all power-of-two paddings because the shift/add
sequences seem to be always short enough.

I also documented a bigger potential win on machines with immediate 
memory access and sane interrupt model (and an even bigger -- 
Dimitris Michailidis's PDA patches) which both need more extensive 
modifications 

On the other hand, the inetpeer code is only really exercised on
machines that talk to lots and lots of destinations (=real
servers), and 2.4 testing on such machines still has to begin.
 
Given that 2.4 testing has not really begun yet I would guess that
it is safer to remove it now @)
 
 True, but the following is what I'm really thinking about.  Suppose a
 few weeks from now Alexey greets both of our mailboxes with a fabulous
 solution to the tw recycle masqeurading problem, wouldn't it be quite
 a pain in the ass to put back in and retest all the inetpeer code?
 
 Let's sit on this until Alexey gives a forecast about the
 possibilities of a solution in the near future ok?

Ok. I just don't want the code uselessly wasting cycles and cache lines in 
2.4.0 if possible.

-Andi

Index: include/net/snmp.h
===
RCS file: /cvs/linux/include/net/snmp.h,v
retrieving revision 1.16
diff -u -u -r1.16 snmp.h
--- include/net/snmp.h  2000/08/09 11:59:03 1.16
+++ include/net/snmp.h  2000/09/20 11:19:25
@@ -14,17 +14,34 @@
  * as published by the Free Software Foundation; either version
  * 2 of the License, or (at your option) any later version.
  *
+ * $Id$
+ *
  */
  
 #ifndef _SNMP_H
 #define _SNMP_H
+
+#include linux/cache.h
  
 /*
  * We use all unsigned longs. Linux will soon be so reliable that even these
  * will rapidly get too small 8-). Seriously consider the IpInReceives count
  * on the 20Gb/s + networks people expect in a few years time!
  */
-  
+
+/* 
+ * The rule for padding: 
+ * Best is power of two because then the right structure can be found by a simple
+ * shift. The structure should be always cache line aligned.
+ * gcc needs n=alignto(cachelinesize, popcnt(sizeof(bla_mib))) shift/add instructions
+ * to emulate multiply in case it is not power-of-two. Currently n is always =3 for
+ * all sizes so simple cache line alignment is enough. 
+ * 
+ * The best solution would be a global CPU local area , especially on 64 and 128byte 
+ * cacheline machine it makes a *lot* of sense -AK
+ */ 
+
+ 
 struct ip_mib
 {
unsigned long   IpInReceives;
@@ -44,8 +61,8 @@
unsigned long   IpFragOKs;
unsigned long   IpFragFails;
unsigned long   IpFragCreates;
-   unsigned long   __pad[32-19];
-};
+   unsigned long   __pad[0]; 
+} cacheline_aligned;
  
 struct ipv6_mib
 {
@@ -71,8 +88,8 @@
unsigned long   Ip6FragCreates;
unsigned long   Ip6InMcastPkts;
unsigned long   Ip6OutMcastPkts;
-   unsigned long   __pad[32-22];
-};
+   unsigned long   __pad[0]; 
+} cacheline_aligned;
  
 struct icmp_mib
 {
@@ -102,8 +119,8 @@
unsigned long   IcmpOutTimestampReps;
unsigned long   IcmpOutAddrMasks;
unsigned long   IcmpOutAddrMaskReps;
-   unsigned long   __pad[32-26];
-};
+   unsigned long   __pad[0]; 
+} cacheline_aligned;
 
 struct icmpv6_mib
 {
@@ -140,8 +157,8 @@
unsigned long   Icmp6OutRedirects;
unsigned long   Icmp6OutGroupMembResponses;
unsigned long   Icmp6OutGroupMembReductions;
-   unsigned long   __pad[32-28];
-};
+   unsigned long   __pad[0]; 
+} cacheline_aligned;
  
 struct tcp_mib
 {
@@ -159,8 +176,8 @@
unsigned long   TcpRetransSegs;
unsigned long   TcpInErrs;
unsigned long   TcpOutRsts;
-   unsigned long   __pad[16-14];
-};
+   unsigned long   __pad[0]; 
+} cacheline_aligned;
  
 struct udp_mib
 {
@@ -168,8 +185,8 @@
unsigned long   UdpNoPorts;
unsigned long   UdpInErrors;
unsigned long   UdpOutDatagrams;
-   unsigned long   __pad[0];
-};
+   unsigned long   __pad[0];
+} cacheline_aligned; 
 
 struct linux_mib 
 {
@@ -237,9 +254,15 @@
unsigned long   TCPAbortOnLinger;

Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-20 Thread Andrea Arcangeli

On Wed, Sep 20, 2000 at 04:38:24AM +0200, Andi Kleen wrote:
 We must be talking about different things. It of course detects it on
 ACK input, but only for data it did send itself. Every TCP detects 
 reordering automatically on the input with the sequence number check,
 but all we still do is to send an ACK immediately and go into
 quickack mode.

Agreed. I couldn't see how to "detect" the reordering in the receiver without
additional informations from the dev.c layer (I don't think it's possible to
detect it from the TCP layer without adding something like described by
Andi in the below section that looks not very nice because it would
hurt links with real packet loss).

 Or did I miss something ? 
 
 The only way to do true reordering handling on the receiver I can think of
 would be to use something like the soft-timers to do very very fast delayed 
 acks even for rcv_mss sized packets and hope that you collect packets from 
 all CPUs in the delay, but overall it could still cost you a lot by 
 disturbing the ACK clock 
 
 [I talked a lot with Andrea about this during OLS, and we couldn't
 figure out a good way] 

I couldn't figure out a good way to fix that internally to TCP, but I have a
quite simple solution at the dev.c layer (that would address also the problem
with the not smart protocols that needs to retransmit everything once an hole
in the sequence space is found).

The solution consists in a sequence number protected by the spinlock in the NIC
device driver (as DaveM suggested) and to increase this sequence number for
each packet received and to set the sequence number value of the moment in each
skb. Then we'd need a secondary (shared :( ) sequence numbers in the
net_rx_action layer that will tell us the sequence number of the last packet
that entered the higher level layer (IP in our case). With this per packet
sequence number information we'll know if we're generating reordering by
pushing the next packet in the softnet queue from net_rx_action to the IP
layer.  In that case we should only set up a wake-up-me-later logic and giveup.
No idea how much this additional logic hurts. Certainly it could be enabled
per-device/per-protocol basis.

The other almost zero cost way to fix the reordering is of course to set the
irq affinity of each nic only to one CPU :)) (this applies to the less smart
protocol as well)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-20 Thread Andi Kleen

On Wed, Sep 20, 2000 at 03:29:39PM +0200, Andrea Arcangeli wrote:
 On Wed, Sep 20, 2000 at 04:38:24AM +0200, Andi Kleen wrote:
  We must be talking about different things. It of course detects it on
  ACK input, but only for data it did send itself. Every TCP detects 
  reordering automatically on the input with the sequence number check,
  but all we still do is to send an ACK immediately and go into
  quickack mode.
 
 Agreed. I couldn't see how to "detect" the reordering in the receiver without
 additional informations from the dev.c layer (I don't think it's possible to
 detect it from the TCP layer without adding something like described by
 Andi in the below section that looks not very nice because it would
 hurt links with real packet loss).

Assuming you have outgoing traffic on the NIC as well it isn't that bad,
because there will be likely some queueing delay for the ACK in which
you could in theory still cancel it of you know better (it would hurt the
abstraction between device drivers and dev.c a lot though) 

 The solution consists in a sequence number protected by the spinlock in the NIC
 device driver (as DaveM suggested) and to increase this sequence number for
 each packet received and to set the sequence number value of the moment in each
 skb. Then we'd need a secondary (shared :( ) sequence numbers in the
 net_rx_action layer that will tell us the sequence number of the last packet
 that entered the higher level layer (IP in our case). With this per packet
 sequence number information we'll know if we're generating reordering by
 pushing the next packet in the softnet queue from net_rx_action to the IP
 layer.  In that case we should only set up a wake-up-me-later logic and giveup.

That would just break the whole idea behind softnet. When you're juggling
the cache line for the sequence counter you could as well juggle the 
cache line with a shared queue head, e.g. we would be at 2.2 level again.

 The other almost zero cost way to fix the reordering is of course to set the
 irq affinity of each nic only to one CPU :)) (this applies to the less smart
 protocol as well)

That's the real solution long term, right.


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-20 Thread Andrea Arcangeli

On Wed, Sep 20, 2000 at 03:24:28PM +0200, Andi Kleen wrote:
 That would just break the whole idea behind softnet. When you're juggling

I agree ;(.

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-19 Thread David S. Miller

   Date:Wed, 20 Sep 2000 04:38:24 +0200
   From: "Andi Kleen" <[EMAIL PROTECTED]>

   We must be talking about different things. It of course detects it
   on ACK input, but only for data it did send itself. Every TCP detects 
   reordering automatically on the input with the sequence number check,
   but all we still do is to send an ACK immediately and go into
   quickack mode.

   Or did I miss something ? 

 ...

   [I talked a lot with Andrea about this during OLS, and we couldn't
   figure out a good way] 

Nevermind, I've apparently got my directions reversed. :-)
I'll think about this a bit.

   Just use __cacheline_aligned instead, like I did with the
   ip_local_data in the patch you just rejected. There is still the
   problem that generic SMP x86 kernels use a 32byte cacheline. Not a
   problem currently because the only x86 SMP is pII/pIII which has
   32byte, but with Williamette/Athlon SMP it'll be a problem because
   these have 64byte and 128byte cache lines. With __cacheline_aligned
   it'll be easy though to pick up such changes.

Ok, feel free to send me a patch which does only this.  I may doctor
it up a bit, be warned :-)

   On the other hand, the inetpeer code is only really exercised on
   machines that talk to lots and lots of destinations (=real
   servers), and 2.4 testing on such machines still has to begin.

   Given that 2.4 testing has not really begun yet I would guess that
   it is safer to remove it now @)

True, but the following is what I'm really thinking about.  Suppose a
few weeks from now Alexey greets both of our mailboxes with a fabulous
solution to the tw recycle masqeurading problem, wouldn't it be quite
a pain in the ass to put back in and retest all the inetpeer code?

Let's sit on this until Alexey gives a forecast about the
possibilities of a solution in the near future ok?

Later,
David S. Miller
[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-19 Thread Andi Kleen

On Tue, Sep 19, 2000 at 06:54:30PM -0700, David S. Miller wrote:
>Date: Wed, 20 Sep 2000 03:51:37 +0200
>From: "Andi Kleen" <[EMAIL PROTECTED]>
> 
>>Receiver side SMP reordering is still there, but I'm not sure if it is
>>fixable (but it'll surely hit people that cannot use Linux senders, I 
>>just see the reports) 
>> 
>> Reordering is a non-issue for ipv4/ipv6, all of the massive TCP input
>> rewrite was strictly about dealing with this.  No SMP reordering
>> should cause any bogus fast retransmits etc. for example.
> 
>You mean TCP output rewrite ? The fix was strictly sender side only 
>(I first thought one of the ack changes alexey did was an attempt at a 
>hackish receiver side solution, but he told me that was false) 
> 
>> What is the problem?  Are you referering to the LAPB/X25 stuff?
> 
>When you have a non linux 2.4 sender you lose.
> 
> Alexey's changes detect reordering on the input side, regardless of
> whether it is speaking to a Linux senders or not, to avoid false
> retransmits.

We must be talking about different things. It of course detects it on
ACK input, but only for data it did send itself. Every TCP detects 
reordering automatically on the input with the sequence number check,
but all we still do is to send an ACK immediately and go into
quickack mode.

Or did I miss something ? 

The only way to do true reordering handling on the receiver I can think of
would be to use something like the soft-timers to do very very fast delayed 
acks even for rcv_mss sized packets and hope that you collect packets from 
all CPUs in the delay, but overall it could still cost you a lot by 
disturbing the ACK clock 

[I talked a lot with Andrea about this during OLS, and we couldn't
figure out a good way] 


> Please show me (and even more importantly Alexey) an example of where
> receive reordering detection is dependant upon Linux TCP sender
> behavior, his code it is as generic as I can imagine it to be.  In
> fact, his code got lots of the testing on a web server serving almost
> exclusively clients running windows :-)

Web clients probably do not send enough data to make reordering a problem
because the request fits into 1-2 packets and the 3way handshake is not
reordering sensitive.
(I haven't looked at SpecWeb that closely, but has it really any client
sent data that is >packet size?) 

> Ok, linux_mib is obviously not exact but in that case I would argue
> that the extra size needed (to get to the next a power of 2) would
> outweight whatever instruction performance gain we'd get.
> 
> As for the udp etc. case, how do we pick a "number" to make these
> arrays as you say they should be?

Just use __cacheline_aligned instead, like I did with the ip_local_data
in the patch you just rejected. There is still the problem that
generic SMP x86 kernels use a 32byte cacheline. Not a problem
currently because the only x86 SMP is pII/pIII which has 32byte, but
with Williamette/Athlon SMP it'll be a problem because these have 64byte
and 128byte cache lines. With __cacheline_aligned it'll be easy though
to pick up such changes.

> I would only make these changes for the snmp mibs which are "smaller"
> than this "number" we pick, the larger ones won't see much false
> sharing at all.
> 
> I think this is really a small and trite issue actually.

I don't think so. There was so much pain to avoid cache line bouncing for 
fast paths, it would be a shame to add it again for silly statistics keeping.

> Removing a lot of code also means undoing all the testing done so far
> with that code present :-)))

True. 

On the other hand, the inetpeer code is only really exercised on machines
that talk to lots and lots of destinations (=real servers), and 2.4 testing
on such machines still has to begin.

Given that 2.4 testing has not really begun yet I would guess that it is 
safer to remove it now @)

-Andi (who is off to bed now) 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-19 Thread David S. Miller

   Date: Wed, 20 Sep 2000 03:51:37 +0200
   From: "Andi Kleen" <[EMAIL PROTECTED]>

   >Receiver side SMP reordering is still there, but I'm not sure if it is
   >fixable (but it'll surely hit people that cannot use Linux senders, I 
   >just see the reports) 
   > 
   > Reordering is a non-issue for ipv4/ipv6, all of the massive TCP input
   > rewrite was strictly about dealing with this.  No SMP reordering
   > should cause any bogus fast retransmits etc. for example.

   You mean TCP output rewrite ? The fix was strictly sender side only 
   (I first thought one of the ack changes alexey did was an attempt at a 
   hackish receiver side solution, but he told me that was false) 

   > What is the problem?  Are you referering to the LAPB/X25 stuff?

   When you have a non linux 2.4 sender you lose.

Alexey's changes detect reordering on the input side, regardless of
whether it is speaking to a Linux senders or not, to avoid false
retransmits.

Please show me (and even more importantly Alexey) an example of where
receive reordering detection is dependant upon Linux TCP sender
behavior, his code it is as generic as I can imagine it to be.  In
fact, his code got lots of the testing on a web server serving almost
exclusively clients running windows :-)

   When I count correctly sizeof(udp_mib) is 16 bytes currently (lots of
   false sharing on 32/64byte cacheline sized architectures) and linux_mib
   is 288 bytes currently (does not make any sense at all)  

Ok, linux_mib is obviously not exact but in that case I would argue
that the extra size needed (to get to the next a power of 2) would
outweight whatever instruction performance gain we'd get.

As for the udp etc. case, how do we pick a "number" to make these
arrays as you say they should be?

I would only make these changes for the snmp mibs which are "smaller"
than this "number" we pick, the larger ones won't see much false
sharing at all.

I think this is really a small and trite issue actually.

   >UDP recvmsg error handling for csum errors is bogus (fix pending) 
   > 
   > Ok, send me a patch so I can see what the problem is.

   Appended.

Applied.

   The "unknown gain" is just removing a lot of complexity => speed
   and less bugs.

Removing a lot of code also means undoing all the testing done so far
with that code present :-)))

   The only advantage I know left is saving pmtus for a bit longer,
   but I doubt it is worth the complexity.

Neither of us are experts in this area, Alexey is.

   Do you have any specific plan for salvaging tw recycling ? Just
   keeping such an intrusive piece of code for benchmarks around looks
   wasteful to me.

Nope, but this also requires Alexey's input.

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-19 Thread Andi Kleen

On Tue, Sep 19, 2000 at 06:13:38PM -0700, David S. Miller wrote:
>Date: Wed, 20 Sep 2000 03:14:10 +0200
>From: "Andi Kleen" <[EMAIL PROTECTED]>
> 
>The ipid handling is still fishy, it will break when you talk to
>more destinations than the inetpeer cache can take (I fixed it in
>my local tree with the appended patch) 
> 
> I don't like this change, please work with Alexey on this one.

What do you dislike ?  Stateless ipid looks much nicer to me than stateful
one (and the "security" lost by not refetching the secret is minimal, 
especially given the state of the random device on machines without diskio
and keyboard input) 

>Receiver side SMP reordering is still there, but I'm not sure if it is
>fixable (but it'll surely hit people that cannot use Linux senders, I 
>just see the reports) 
> 
> Reordering is a non-issue for ipv4/ipv6, all of the massive TCP input
> rewrite was strictly about dealing with this.  No SMP reordering
> should cause any bogus fast retransmits etc. for example.

You mean TCP output rewrite ? The fix was strictly sender side only 
(I first thought one of the ack changes alexey did was an attempt at a 
hackish receiver side solution, but he told me that was false) 

> What is the problem?  Are you referering to the LAPB/X25 stuff?

When you have a non linux 2.4 sender you lose.

>The TCP connect running out of ports problem is still there (I
>fixed it locally, but the changes are probably far too extensive
>for 2.4.x now)
> 
> I think people can change their ip_local_port_range and I really
> consider this a non-issue, at least, it's not a 2.4.x show stopper.

I agree.

> 
>The TCP ISN computation is not SMP safe. 
> 
> "Big deal".  So how much effort will you go to to get how much more
> protection and how much of it do we really need?  How less secure are
> our TCP ISN's because of this?  Not a show-stopper.
> 
> I think you're reading too much tcp-impl :-)

No, it has nothing to do with tcp-impl ;), I discovered it while playing
with the ipids much earlier. With some bad luck could could get very similar
ISNs, which is probably bad (an easy fix is to do the secret hash with a 
stack copy instead of the static) 

(but it is not a show stopper I agree)


> 
>include/linux/snmp.h is probably still not properly aligned for all
>cache line sizes.
> 
> Read, and reread what Alexey tries to tell you over and over about
> this.  It is not meant to be cache line aligned, it is meant to be a
> power of two and nothing more.

It has to be at least cache line aligned, otherwise the arrays wouldn't
make any sense at all, the power of two would just be a (imho misguided but
anyways) optimization.

When I count correctly sizeof(udp_mib) is 16 bytes currently (lots of
false sharing on 32/64byte cacheline sized architectures) and linux_mib
is 288 bytes currently (does not make any sense at all)  


>UDP recvmsg error handling for csum errors is bogus (fix pending) 
> 
> Ok, send me a patch so I can see what the problem is.

Appended.

> 
>I also would like to remove the inetpeer cache code before 2.4.0, now
>that nobody managed to salvage tw recycling and ipid can do without it.
> 
> I would prefer not to, it seems to be too potentially destabilizing
> for questionable and unknown gain (ie. we don't know if tw recycling
> can be salvaged, so lets not take any changes).

The "unknown gain" is just removing a lot of complexity => speed and less
bugs.

The only advantage I know left is saving pmtus for a bit longer, but
I doubt it is worth the complexity.

Do you have any specific plan for salvaging tw recycling ? Just keeping
such an intrusive piece of code for benchmarks around looks wasteful
to me.

-Andi


Index: net/ipv4/udp.c
===
RCS file: /cvs/linux/net/ipv4/udp.c,v
retrieving revision 1.85
diff -u -u -d -r1.85 udp.c
--- net/ipv4/udp.c  2000/08/09 11:59:04 1.85
+++ net/ipv4/udp.c  2000/09/17 11:55:29
@@ -493,8 +493,6 @@
if (usin->sin_family != AF_INET) {
if (usin->sin_family != AF_UNSPEC)
return -EINVAL;
-   if (net_ratelimit())
-   printk("Remind Kuznetsov, he has to repair %s 
eventually\n", current->comm);
}
 
ufh.daddr = usin->sin_addr.s_addr;
@@ -678,6 +676,8 @@
if (flags & MSG_ERRQUEUE)
return ip_recv_error(sk, msg, len);
 
+
+ retry:
/*
 *  From here the generic datagram does a lot of the work. Come
 *  the finished NET3, it will do _ALL_ the work!
@@ -733,26 +733,21 @@
 csum_copy_err:
UDP_INC_STATS_BH(UdpInErrors);
 
-   /* Clear queue. */
-   if (flags_PEEK) {
-   int clear = 0;
+   if (flags&(MSG_PEEK|MSG_DONTWAIT)) {
+   struct sk_buff *skb2; 
+

Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-19 Thread David S. Miller

   Date: Wed, 20 Sep 2000 03:14:10 +0200
   From: "Andi Kleen" <[EMAIL PROTECTED]>

   The ipid handling is still fishy, it will break when you talk to
   more destinations than the inetpeer cache can take (I fixed it in
   my local tree with the appended patch) 

I don't like this change, please work with Alexey on this one.
Do not assume we are removing the inetpeer cache also, see below.

   Receiver side SMP reordering is still there, but I'm not sure if it is
   fixable (but it'll surely hit people that cannot use Linux senders, I 
   just see the reports) 

Reordering is a non-issue for ipv4/ipv6, all of the massive TCP input
rewrite was strictly about dealing with this.  No SMP reordering
should cause any bogus fast retransmits etc. for example.

What is the problem?  Are you referering to the LAPB/X25 stuff?

   The TCP connect running out of ports problem is still there (I
   fixed it locally, but the changes are probably far too extensive
   for 2.4.x now)

I think people can change their ip_local_port_range and I really
consider this a non-issue, at least, it's not a 2.4.x show stopper.

   The TCP ISN computation is not SMP safe. 

"Big deal".  So how much effort will you go to to get how much more
protection and how much of it do we really need?  How less secure are
our TCP ISN's because of this?  Not a show-stopper.

I think you're reading too much tcp-impl :-)

   include/linux/snmp.h is probably still not properly aligned for all
   cache line sizes.

Read, and reread what Alexey tries to tell you over and over about
this.  It is not meant to be cache line aligned, it is meant to be a
power of two and nothing more.

Not a bug.

   UDP recvmsg error handling for csum errors is bogus (fix pending) 

Ok, send me a patch so I can see what the problem is.

   I also would like to remove the inetpeer cache code before 2.4.0, now
   that nobody managed to salvage tw recycling and ipid can do without it.

I would prefer not to, it seems to be too potentially destabilizing
for questionable and unknown gain (ie. we don't know if tw recycling
can be salvaged, so lets not take any changes).

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-19 Thread David S. Miller

   Date: Wed, 20 Sep 2000 03:51:37 +0200
   From: "Andi Kleen" [EMAIL PROTECTED]

   Receiver side SMP reordering is still there, but I'm not sure if it is
   fixable (but it'll surely hit people that cannot use Linux senders, I 
   just see the reports) 

Reordering is a non-issue for ipv4/ipv6, all of the massive TCP input
rewrite was strictly about dealing with this.  No SMP reordering
should cause any bogus fast retransmits etc. for example.

   You mean TCP output rewrite ? The fix was strictly sender side only 
   (I first thought one of the ack changes alexey did was an attempt at a 
   hackish receiver side solution, but he told me that was false) 

What is the problem?  Are you referering to the LAPB/X25 stuff?

   When you have a non linux 2.4 sender you lose.

Alexey's changes detect reordering on the input side, regardless of
whether it is speaking to a Linux senders or not, to avoid false
retransmits.

Please show me (and even more importantly Alexey) an example of where
receive reordering detection is dependant upon Linux TCP sender
behavior, his code it is as generic as I can imagine it to be.  In
fact, his code got lots of the testing on a web server serving almost
exclusively clients running windows :-)

   When I count correctly sizeof(udp_mib) is 16 bytes currently (lots of
   false sharing on 32/64byte cacheline sized architectures) and linux_mib
   is 288 bytes currently (does not make any sense at all)  

Ok, linux_mib is obviously not exact but in that case I would argue
that the extra size needed (to get to the next a power of 2) would
outweight whatever instruction performance gain we'd get.

As for the udp etc. case, how do we pick a "number" to make these
arrays as you say they should be?

I would only make these changes for the snmp mibs which are "smaller"
than this "number" we pick, the larger ones won't see much false
sharing at all.

I think this is really a small and trite issue actually.

   UDP recvmsg error handling for csum errors is bogus (fix pending) 

Ok, send me a patch so I can see what the problem is.

   Appended.

Applied.

   The "unknown gain" is just removing a lot of complexity = speed
   and less bugs.

Removing a lot of code also means undoing all the testing done so far
with that code present :-)))

   The only advantage I know left is saving pmtus for a bit longer,
   but I doubt it is worth the complexity.

Neither of us are experts in this area, Alexey is.

   Do you have any specific plan for salvaging tw recycling ? Just
   keeping such an intrusive piece of code for benchmarks around looks
   wasteful to me.

Nope, but this also requires Alexey's input.

Later,
David S. Miller
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-19 Thread Andi Kleen

On Tue, Sep 19, 2000 at 06:13:38PM -0700, David S. Miller wrote:
Date: Wed, 20 Sep 2000 03:14:10 +0200
From: "Andi Kleen" [EMAIL PROTECTED]
 
The ipid handling is still fishy, it will break when you talk to
more destinations than the inetpeer cache can take (I fixed it in
my local tree with the appended patch) 
 
 I don't like this change, please work with Alexey on this one.

What do you dislike ?  Stateless ipid looks much nicer to me than stateful
one (and the "security" lost by not refetching the secret is minimal, 
especially given the state of the random device on machines without diskio
and keyboard input) 

Receiver side SMP reordering is still there, but I'm not sure if it is
fixable (but it'll surely hit people that cannot use Linux senders, I 
just see the reports) 
 
 Reordering is a non-issue for ipv4/ipv6, all of the massive TCP input
 rewrite was strictly about dealing with this.  No SMP reordering
 should cause any bogus fast retransmits etc. for example.

You mean TCP output rewrite ? The fix was strictly sender side only 
(I first thought one of the ack changes alexey did was an attempt at a 
hackish receiver side solution, but he told me that was false) 

 What is the problem?  Are you referering to the LAPB/X25 stuff?

When you have a non linux 2.4 sender you lose.

The TCP connect running out of ports problem is still there (I
fixed it locally, but the changes are probably far too extensive
for 2.4.x now)
 
 I think people can change their ip_local_port_range and I really
 consider this a non-issue, at least, it's not a 2.4.x show stopper.

I agree.

 
The TCP ISN computation is not SMP safe. 
 
 "Big deal".  So how much effort will you go to to get how much more
 protection and how much of it do we really need?  How less secure are
 our TCP ISN's because of this?  Not a show-stopper.
 
 I think you're reading too much tcp-impl :-)

No, it has nothing to do with tcp-impl ;), I discovered it while playing
with the ipids much earlier. With some bad luck could could get very similar
ISNs, which is probably bad (an easy fix is to do the secret hash with a 
stack copy instead of the static) 

(but it is not a show stopper I agree)


 
include/linux/snmp.h is probably still not properly aligned for all
cache line sizes.
 
 Read, and reread what Alexey tries to tell you over and over about
 this.  It is not meant to be cache line aligned, it is meant to be a
 power of two and nothing more.

It has to be at least cache line aligned, otherwise the arrays wouldn't
make any sense at all, the power of two would just be a (imho misguided but
anyways) optimization.

When I count correctly sizeof(udp_mib) is 16 bytes currently (lots of
false sharing on 32/64byte cacheline sized architectures) and linux_mib
is 288 bytes currently (does not make any sense at all)  


UDP recvmsg error handling for csum errors is bogus (fix pending) 
 
 Ok, send me a patch so I can see what the problem is.

Appended.

 
I also would like to remove the inetpeer cache code before 2.4.0, now
that nobody managed to salvage tw recycling and ipid can do without it.
 
 I would prefer not to, it seems to be too potentially destabilizing
 for questionable and unknown gain (ie. we don't know if tw recycling
 can be salvaged, so lets not take any changes).

The "unknown gain" is just removing a lot of complexity = speed and less
bugs.

The only advantage I know left is saving pmtus for a bit longer, but
I doubt it is worth the complexity.

Do you have any specific plan for salvaging tw recycling ? Just keeping
such an intrusive piece of code for benchmarks around looks wasteful
to me.

-Andi


Index: net/ipv4/udp.c
===
RCS file: /cvs/linux/net/ipv4/udp.c,v
retrieving revision 1.85
diff -u -u -d -r1.85 udp.c
--- net/ipv4/udp.c  2000/08/09 11:59:04 1.85
+++ net/ipv4/udp.c  2000/09/17 11:55:29
@@ -493,8 +493,6 @@
if (usin-sin_family != AF_INET) {
if (usin-sin_family != AF_UNSPEC)
return -EINVAL;
-   if (net_ratelimit())
-   printk("Remind Kuznetsov, he has to repair %s 
eventually\n", current-comm);
}
 
ufh.daddr = usin-sin_addr.s_addr;
@@ -678,6 +676,8 @@
if (flags  MSG_ERRQUEUE)
return ip_recv_error(sk, msg, len);
 
+
+ retry:
/*
 *  From here the generic datagram does a lot of the work. Come
 *  the finished NET3, it will do _ALL_ the work!
@@ -733,26 +733,21 @@
 csum_copy_err:
UDP_INC_STATS_BH(UdpInErrors);
 
-   /* Clear queue. */
-   if (flagsMSG_PEEK) {
-   int clear = 0;
+   if (flags(MSG_PEEK|MSG_DONTWAIT)) {
+   struct sk_buff *skb2; 
+
spin_lock_irq(sk-receive_queue.lock);
-   if (skb == 

Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-19 Thread Andi Kleen

On Tue, Sep 19, 2000 at 06:54:30PM -0700, David S. Miller wrote:
Date: Wed, 20 Sep 2000 03:51:37 +0200
From: "Andi Kleen" [EMAIL PROTECTED]
 
Receiver side SMP reordering is still there, but I'm not sure if it is
fixable (but it'll surely hit people that cannot use Linux senders, I 
just see the reports) 
 
 Reordering is a non-issue for ipv4/ipv6, all of the massive TCP input
 rewrite was strictly about dealing with this.  No SMP reordering
 should cause any bogus fast retransmits etc. for example.
 
You mean TCP output rewrite ? The fix was strictly sender side only 
(I first thought one of the ack changes alexey did was an attempt at a 
hackish receiver side solution, but he told me that was false) 
 
 What is the problem?  Are you referering to the LAPB/X25 stuff?
 
When you have a non linux 2.4 sender you lose.
 
 Alexey's changes detect reordering on the input side, regardless of
 whether it is speaking to a Linux senders or not, to avoid false
 retransmits.

We must be talking about different things. It of course detects it on
ACK input, but only for data it did send itself. Every TCP detects 
reordering automatically on the input with the sequence number check,
but all we still do is to send an ACK immediately and go into
quickack mode.

Or did I miss something ? 

The only way to do true reordering handling on the receiver I can think of
would be to use something like the soft-timers to do very very fast delayed 
acks even for rcv_mss sized packets and hope that you collect packets from 
all CPUs in the delay, but overall it could still cost you a lot by 
disturbing the ACK clock 

[I talked a lot with Andrea about this during OLS, and we couldn't
figure out a good way] 


 Please show me (and even more importantly Alexey) an example of where
 receive reordering detection is dependant upon Linux TCP sender
 behavior, his code it is as generic as I can imagine it to be.  In
 fact, his code got lots of the testing on a web server serving almost
 exclusively clients running windows :-)

Web clients probably do not send enough data to make reordering a problem
because the request fits into 1-2 packets and the 3way handshake is not
reordering sensitive.
(I haven't looked at SpecWeb that closely, but has it really any client
sent data that is packet size?) 

 Ok, linux_mib is obviously not exact but in that case I would argue
 that the extra size needed (to get to the next a power of 2) would
 outweight whatever instruction performance gain we'd get.
 
 As for the udp etc. case, how do we pick a "number" to make these
 arrays as you say they should be?

Just use __cacheline_aligned instead, like I did with the ip_local_data
in the patch you just rejected. There is still the problem that
generic SMP x86 kernels use a 32byte cacheline. Not a problem
currently because the only x86 SMP is pII/pIII which has 32byte, but
with Williamette/Athlon SMP it'll be a problem because these have 64byte
and 128byte cache lines. With __cacheline_aligned it'll be easy though
to pick up such changes.

 I would only make these changes for the snmp mibs which are "smaller"
 than this "number" we pick, the larger ones won't see much false
 sharing at all.
 
 I think this is really a small and trite issue actually.

I don't think so. There was so much pain to avoid cache line bouncing for 
fast paths, it would be a shame to add it again for silly statistics keeping.

 Removing a lot of code also means undoing all the testing done so far
 with that code present :-)))

True. 

On the other hand, the inetpeer code is only really exercised on machines
that talk to lots and lots of destinations (=real servers), and 2.4 testing
on such machines still has to begin.

Given that 2.4 testing has not really begun yet I would guess that it is 
safer to remove it now @)

-Andi (who is off to bed now) 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: networking todo, was Re: Linux-2.4.0-test9-pre2

2000-09-19 Thread David S. Miller

   Date:Wed, 20 Sep 2000 04:38:24 +0200
   From: "Andi Kleen" [EMAIL PROTECTED]

   We must be talking about different things. It of course detects it
   on ACK input, but only for data it did send itself. Every TCP detects 
   reordering automatically on the input with the sequence number check,
   but all we still do is to send an ACK immediately and go into
   quickack mode.

   Or did I miss something ? 

 ...

   [I talked a lot with Andrea about this during OLS, and we couldn't
   figure out a good way] 

Nevermind, I've apparently got my directions reversed. :-)
I'll think about this a bit.

   Just use __cacheline_aligned instead, like I did with the
   ip_local_data in the patch you just rejected. There is still the
   problem that generic SMP x86 kernels use a 32byte cacheline. Not a
   problem currently because the only x86 SMP is pII/pIII which has
   32byte, but with Williamette/Athlon SMP it'll be a problem because
   these have 64byte and 128byte cache lines. With __cacheline_aligned
   it'll be easy though to pick up such changes.

Ok, feel free to send me a patch which does only this.  I may doctor
it up a bit, be warned :-)

   On the other hand, the inetpeer code is only really exercised on
   machines that talk to lots and lots of destinations (=real
   servers), and 2.4 testing on such machines still has to begin.

   Given that 2.4 testing has not really begun yet I would guess that
   it is safer to remove it now @)

True, but the following is what I'm really thinking about.  Suppose a
few weeks from now Alexey greets both of our mailboxes with a fabulous
solution to the tw recycle masqeurading problem, wouldn't it be quite
a pain in the ass to put back in and retest all the inetpeer code?

Let's sit on this until Alexey gives a forecast about the
possibilities of a solution in the near future ok?

Later,
David S. Miller
[EMAIL PROTECTED]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/