Re: packet reordering at exchange points

2002-04-13 Thread Peter Galbavy


> Gotta go with the old head scratch on that one...
>
> Imagine the packet zipping down a wire.  It hits a router. It slows
> down. Why? Because (until very recently) wire-speed != processing
> speed != backplane speed... This is called 'blocking'.  The packet
> has to wait somewhere while the prior packet is being processed
> and fed through the backplane.
>
> Very recently vendors started offloading tasks from a single processor
> to sub-processors and ASICs in a more distributed manner. This has led
> to routers that are faster than the wire (in contrast to the above
> stated routers that were slower than the wire).  In this situation,
> buffering allows a greater port density: that is to say, more
> channels at a slower speed (the speed of the wire) rather than
> less channels at a higher speed (faster than the wire).

The fact that I have been building these (PC/Sparc based) routers
since the early '90s is neither here nor there. My point in general was to
actually try to make people explain why there is a 'marketing' relationship
between the size of buffers and a magical bandwidth x delay product of
(normally) line rate x one second.

I fully understand (in case it appeared otherwise) the issues of interface
contention when aggregating traffic and also the need for buffering in QoS
implementations, but the perceived wisdom of 'you must have at least one
second's buffer on each *interface*' still seems very shoddy and marketing
driven.

I still haven't heard one (a good reason), but then again I haven't gone and
read Van Jacobson's paper that was referenced in an earlier reply. And that
paper probably makes too good a point and bears no relation to the marketing
drivel we have been getting for many years.

BTW, the faster than line rate / slower than line rate forwarding engine
stuff is an ongoing and iterative thing that has been the simple result of
capacity growth and has in turn spurred the development of each new
generation of router - just like any growth industry - and this will
continue until we run out of electrons and photons and god-knows-what-else
to move around.

Peter




Re: packet reordering at exchange points

2002-04-11 Thread Craig Partridge



In message <00ae01c1e125$ba6b5380$[EMAIL PROTECTED]>, "Jim Forster" writes:

>Sure, see the original Van Jacobson-Mike Karels paper "Congestion Avoidance
>and Control", at http://www-nrg.ee.lbl.gov/papers/congavoid.pdf.  Briefly,
>TCP end systems start pumping packets into the path until they've gotten
>about RTT*BW worth of packets "in the pipe".  Ideally these packets are
>somewhat evenly spaced out, but in practice in various circumtances they can
>get clumped together at a bottleneck link.  If the bottleneck link router
>can't handle the burst then some get dumped.

Actually, it is even stronger than that -- in a perfect world (without
jitter, etc), the packets *will* get clumped together at the bottleneck
link.  The reason is that for every ack, TCP's pumping out two back to back
packets -- but the acks are coming back at approximately the spacing
at which full-sized data packets get through the bottleneck link... So you're
sending two segments (or 1.5 if you ack every other segment) in the time
the bottleneck can only handle one.

[Side note, this works because during slow start, you're not sending during
the entire RTT -- you're sending bursts at the start of the RTT, and with
each round of slow start you fill more of the RTT]
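
A rough sketch of that arithmetic in Python (all numbers invented for
illustration): each returning ack releases two back-to-back segments while
the bottleneck drains only about one segment per ack interval, so the
bottleneck queue grows by roughly one segment per ack.

    # Illustrative sketch: two segments sent per ack versus one segment
    # drained per ack interval at the bottleneck.  Numbers are made up.
    def bottleneck_queue_growth(acks, per_ack=2, drained=1):
        queue = 0
        for _ in range(acks):
            queue += per_ack                   # burst arrives back to back
            queue -= min(queue, drained)       # bottleneck clocks one out
        return queue

    # After 20 acks the bottleneck is holding 20 extra segments.
    print(bottleneck_queue_growth(20))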

Craig



RE: packet reordering at exchange points

2002-04-10 Thread Jim Forster



> > To transfer 1Gb/s across 100ms I need to be prepared to buffer at least
> > 25MB of data. According to pricewatch, I can pick up a high density 512MB
>
> Why ?
>
> I am still waiting (after many years) for anyone to explain to me the issue
> of buffering. It appears to be completely unneccesary in a router.
>
> Everyone seems to answer me with 'bandwidth x delay product' and similar,
> but think about IP routeing. The intermediate points are not doing any form
> of per-packet ack etc. and so do not need to have large windows of data etc.
>
> I can understand the need in end-points and networks (like X.25) that do
> per-hop clever things...
>
> Will someone please point me to references that actually demonstrate why an
> IP router needs big buffers (as opposed to lots of 'downstream' ports) ?

Sure, see the original Van Jacobson-Mike Karels paper "Congestion Avoidance
and Control", at http://www-nrg.ee.lbl.gov/papers/congavoid.pdf.  Briefly,
TCP end systems start pumping packets into the path until they've gotten
about RTT*BW worth of packets "in the pipe".  Ideally these packets are
somewhat evenly spaced out, but in practice in various circumstances they can
get clumped together at a bottleneck link.  If the bottleneck link router
can't handle the burst then some get dumped.

  -- Jim




Re: packet reordering at exchange points

2002-04-10 Thread Paul Vixie


> Routers are not non-blocking devices.  When an output port is blocked,
> packets going to that port must be either buffered or dropped.  While it's
> obviously possible to drop them, like ATM/FR carriers do, ISPs have found
> they have much happier customers when they do a reasonable amount of
> buffering.

the important thing to remember about bits per second (or packets per second
or anything else per second) is that it's not particularly granular compared
to modern link speeds.

if at the nanosecond when a packet has finished arriving (assuming that we're
not doing cut-through, which is a good assumption in the case of routers) the
selected output interface is still busy clocking out another packet, that by
itself does not indicate "congestion".  it might turn out, way later on --
like at the end of the current second -- that only 1% of the link output space
was filled.  packet arrival times aren't fractal, or anything like fractal.

so we buffer.  and because we measure "bits per second" we like to have enough
buffering to handle an output pipe's worth of inconveniently-timed output events.
if the inconveniently-timed packet arrival times are such that more of them
arrive than could fit in the pipeline between this router and the next one down
the line, then we need to (passively) signal the sender that they are trying
to put 11 gallons of gasoline into a 10 gallon hat.  so we drop in that case.
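
As a toy illustration of that granularity point (made-up numbers in Python,
not a model of any real link): even a link that averages about 1% utilisation
over a second usually sees at least one packet arrive while the output port
is still mid-packet, and with zero buffering that packet would have to be
dropped.

    import random

    # Toy model, made-up numbers: each packet takes 100 ticks to clock out
    # of the port, the "second" is 1,000,000 ticks long, and 100 packets
    # arrive at random instants -- i.e. only ~1% average utilisation.
    random.seed(1)

    def packets_finding_port_busy(n_packets=100, second=1_000_000,
                                  serialize=100):
        busy_until, collisions = 0, 0
        for t in sorted(random.randrange(second) for _ in range(n_packets)):
            if t < busy_until:           # port mid-packet: buffer (or drop)
                collisions += 1
                busy_until += serialize  # it waits its turn in the queue
            else:
                busy_until = t + serialize
        return collisions

    congested = sum(packets_finding_port_busy() > 0 for _ in range(1000))
    print(f"{congested} of 1000 one-percent-loaded seconds still had at "
          f"least one arrival that needed a buffer slot")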

(smd, didn't you write this up for nanog several years ago?  encore?  encore?)



Re: packet reordering at exchange points

2002-04-10 Thread Stephen Sprunk


Thus spake "Peter Galbavy" <[EMAIL PROTECTED]>
> Why ?
>
> I am still waiting (after many years) for anyone to explain to me
> the issue of buffering. It appears to be completely unneccesary in
> a router.

Routers are not non-blocking devices.  When an output port is blocked,
packets going to that port must be either buffered or dropped.  While it's
obviously possible to drop them, like ATM/FR carriers do, ISPs have found
they have much happier customers when they do a reasonable amount of
buffering.

S




Re: packet reordering at exchange points

2002-04-10 Thread Stephen Sprunk


Thus spake "Mathew Lodge" <[EMAIL PROTECTED]>
> At 03:48 PM 4/10/2002 +0100, Peter Galbavy wrote:
> >Why ?
> >
> >I am still waiting (after many years) for anyone to explain to me
> >the issue of buffering. It appears to be completely unneccesary
> >in a router.
>
> Well, that's some challenge but I'll have a go :-/
>
> As far as I can tell, the use of buffering has to do with traffic
> shaping vs. rate limiting. If you have a buffer on the interface,
> you are doing traffic shaping -- whether or not your vendor calls
> it that. ... If you have no queue or a very small queue ... This is
> rate limiting.

Well, that's implicit shaping/policing if you wish to call it that.  It's
only common to use those terms with explicit shaping/policing, i.e. when you
need to shape/police at something other than line rate.

> except for the owner of the routers who wanted to know why
> they had to buy the more expensive ATM card  (i.e. why
> couldn't the ATM core people couldn't put more buffering on
> their ATM access ports).

The answer here lies in ATM switches being designed primarily for carriers
(and by people with a carrier mindset).  Carriers, by and large, do not want
to carry unfunded traffic across their networks and then be forced to buffer
it; it's much easier (and cheaper) to police at ingress and buffer nothing.

It would have been nice to see a parallel line of switches (or cards) with
more buffers.  However, anyone wise enough to buy those was wise enough to
ditch ATM altogether :)

S




Re: packet reordering at exchange points

2002-04-10 Thread Peter Galbavy


> Note that the previous example was about end to end systems achieving line
> rate across a continent, nothing about routers was mentioned.

Fair enough - for that I can see the point. Maybe I need to read more though
:)

Peter





Re: packet reordering at exchange points

2002-04-10 Thread Mathew Lodge


At 03:48 PM 4/10/2002 +0100, Peter Galbavy wrote:
>Why ?
>
>I am still waiting (after many years) for anyone to explain to me the issue
>of buffering. It appears to be completely unneccesary in a router.

Well, that's some challenge but I'll have a go :-/

As far as I can tell, the use of buffering has to do with traffic shaping 
vs. rate limiting. If you have a buffer on the interface, you are doing 
traffic shaping -- whether or not your vendor calls it that. That's because 
when the rate at which traffic arrives at the queue exceeds the rate that 
it leaves the queue, the packets get buffered for transmission some time 
later. In effect, the queue buffers traffic bursts and then spreads 
transmission of the buffered packets over time.

If you have no queue or a very small queue (relative to the Rate x Average 
packet size) and the arrival rate exceeds transmission rate, you can't 
buffer the packet to transmit later, and so simply drop it. This is rate 
limiting.
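
A minimal sketch of that distinction in Python (hypothetical numbers): the
same back-to-back burst is offered to an interface that can transmit one
packet per tick, once with a queue (shaping: packets are delayed and spread
out) and once with essentially no queue (policing: the excess is dropped).

    # Hypothetical sketch: the same 10-packet burst hits an interface that
    # can transmit one packet per tick.  With a queue the burst is absorbed
    # and spread out over later ticks (shaping); with no real queue the
    # excess is simply dropped (policing / rate limiting).
    def offer_burst(burst, queue_limit):
        queue, sent, dropped = 0, 0, 0
        for _ in range(burst):             # whole burst arrives back to back
            if queue < queue_limit:
                queue += 1
            else:
                dropped += 1
        while queue:                       # drain one packet per tick
            queue -= 1
            sent += 1
        return sent, dropped

    print("with buffer   :", offer_burst(10, queue_limit=16))  # (10, 0) shaped
    print("without buffer:", offer_burst(10, queue_limit=1))   # (1, 9) policed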

That's my theory, but what's the effect?

I have seen the difference in effect on a real network running IP over ATM. 
The ATM core at this large European service provider was running equipment 
from "Vendor N". N's ATM access switches have very small cell buffers -- 
practically none, in fact.

When we connected routers to this core from "vendor C" that didn't have 
much buffering on the ATM interfaces, users saw very poor e-mail and HTTP 
throughput. We discovered that this was happening because during bursts of 
traffic, there were long trains of sequential packet loss -- including many 
TCP ACKs. This caused the TCP senders to rapidly back off their transmit
windows. That and the packet loss were the major causes of poor throughput.
Although we didn't figure this out until much later, a side effect of the
sequential packet loss (i.e. no drop policy) was to synchronize all of the
TCP senders -- i.e. the "burstiness" of the traffic got worse because now
all of the TCP senders were trying to increase their send windows at the
same time.

To fix the problem, we replaced the ATM interface cards on the routers -- 
it turns out Vendor C has an ATM interface with lots of buffering, 
configurable drop policy (we used WRED) and a cell-level traffic shaper, 
presumably to address this very issue. The users saw much improved e-mail 
and web performance and everyone was happy, except for the owner of the 
routers who wanted to know why they had to buy the more expensive ATM card 
(i.e. why couldn't the ATM core people put more buffering on their
ATM access ports).

Hope this helps,

Mathew




>Everyone seems to answer me with 'bandwidth x delay product' and similar,
>but think about IP routeing. The intermediate points are not doing any form
>of per-packet ack etc. and so do not need to have large windows of data etc.
>
>I can understand the need in end-points and networks (like X.25) that do
>per-hop clever things...
>
>Will someone please point me to references that actually demonstrate why an
>IP router needs big buffers (as opposed to lots of 'downstream' ports) ?
>
>Peter

| Mathew Lodge | [EMAIL PROTECTED] |
| Director, Product Management | Ph: +1 408 789 4068   |
| CPLANE, Inc. | http://www.cplane.com | 




Re: packet reordering at exchange points

2002-04-10 Thread Richard A Steenbergen


On Wed, Apr 10, 2002 at 03:48:36PM +0100, Peter Galbavy wrote:
> > To transfer 1Gb/s across 100ms I need to be prepared to buffer at least
> > 25MB of data. According to pricewatch, I can pick up a high density 512MB
> 
> Why ?
> 
> I am still waiting (after many years) for anyone to explain to me the
> issue of buffering. It appears to be completely unneccesary in a router.

Note that the previous example was about end to end systems achieving line 
rate across a continent, nothing about routers was mentioned.

-- 
Richard A Steenbergen <[EMAIL PROTECTED]>   http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177  (67 29 D7 BC E8 18 3E DA  B2 46 B3 D8 14 36 FE B6)



Re: packet reordering at exchange points

2002-04-10 Thread John Kristoff


On Wed, 10 Apr 2002 15:48:36 +0100
"Peter Galbavy" <[EMAIL PROTECTED]> wrote:

> I am still waiting (after many years) for anyone to explain to me the
> issue of buffering. It appears to be completely unneccesary in a
> router.

OK, what am I missing?  Unless I'm misunderstanding your question, this
seems relatively simplistic and the need for buffers on routers is
actually quite obvious.

Imagine a router with more than 2 interfaces, each interface being of
the same speed.  Packets arrive on 2 or more interfaces and each needs to
be forwarded onto the same outbound interface.  Imagine packets arrive
at exactly or roughly the same time.  Since the bits are going out
serially, you're gonna need to buffer packets one behind the others on
the egress interface.

Similar scenarios occur when the egress interface capacity is less than the
rate, or aggregate rate, of traffic wanting to exit via that interface.
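
A back-of-the-envelope sketch of that fan-in case (made-up numbers, Python):
while two equal-speed ingress ports overlap, the egress can only serialize
one packet at a time, so the queue grows for the whole duration of the
overlap.

    # Made-up numbers: two ingress ports each deliver packets at line rate
    # into one egress port of the same speed.  The egress drains one packet
    # per tick while two keep arriving, so the queue grows a packet per
    # tick for as long as the bursts overlap.
    BURST_TICKS = 50           # how long the two ingress bursts overlap
    INGRESS_PORTS = 2

    queue = 0
    for _ in range(BURST_TICKS):
        queue += INGRESS_PORTS             # simultaneous arrivals
        queue -= 1                         # egress serializes one packet
    print(f"buffer needed at the egress for the overlap: {queue} packets")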

John



Re: packet reordering at exchange points

2002-04-10 Thread Neil J. McRae


Peter,
For basic Internet style routeing you are probably
correct [and possibly even more true for MPLS style
switching/routeing], but these days customers
demand different classes of service and managed data and bandwidth
services over IP. These require lots of packet hacking, and for
that you need buffers. If you need any type of filtering done
that's remotely intelligent, you need somewhere to process those
packets, especially when you are running 10G interfaces.

Regards,
Neil.
--
Neil J. McRae - Alive and Kicking
[EMAIL PROTECTED]



Re: packet reordering at exchange points

2002-04-10 Thread Peter Galbavy


> To transfer 1Gb/s across 100ms I need to be prepared to buffer at least
> 25MB of data. According to pricewatch, I can pick up a high density 512MB

Why ?

I am still waiting (after many years) for anyone to explain to me the issue
of buffering. It appears to be completely unnecessary in a router.

Everyone seems to answer me with 'bandwidth x delay product' and similar,
but think about IP routeing. The intermediate points are not doing any form
of per-packet ack etc. and so do not need to have large windows of data etc.

I can understand the need in end-points and networks (like X.25) that do
per-hop clever things...

Will someone please point me to references that actually demonstrate why an
IP router needs big buffers (as opposed to lots of 'downstream' ports) ?

Peter




Re: fixing TCP buffers (Re: packet reordering at exchange points)

2002-04-09 Thread E.B. Dreger


> Date: Tue, 9 Apr 2002 21:12:30 -0400
> From: Richard A Steenbergen <[EMAIL PROTECTED]>


> That doesn't prevent an intentional local DoS though.

And the current stacks do?  (Note that my 64 kB figure was an
example, for an example system that had 512 current connections.)

Okay, how about new sockets split "excess" buffer space, subject
to certain minimum size restrictions?  New sockets do not impact
established streams, unless we have way too many sockets or too
little buffer space.

If way too many sockets, it's just like current stacks, although
hopefully ulimit would prevent this scenario.

If we're out of buffer space, then we're going to have even more
problems when the sockets are actually passing data.

Yes, I'm still thinking about carving up a 32 MB chunk of RAM,
shrinking window sizes when we need more buffers.

Of course, we probably should consider AIO, too... if we can
have buffers in userspace instead of copying from kernel to
user via read(), that makes memory issues a bit more pleasant.
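
One way that 'new sockets split the excess, subject to a minimum' idea might
look, purely as a sketch with invented numbers and function names (nothing
here is from any real stack):

    # Sketch with invented numbers and names: give a new socket a slice of
    # whatever is left over after established streams, but never less than
    # a fixed floor.
    POOL_TOTAL = 32 * 1024 * 1024          # hypothetical 32 MB global pool
    MIN_PER_SOCKET = 64 * 1024             # hypothetical 64 kB floor

    def new_socket_budget(in_use_by_established, expected_new_sockets):
        excess = max(POOL_TOTAL - in_use_by_established, 0)
        share = excess // max(expected_new_sockets, 1)
        # The floor wins over a tiny share; established streams only get
        # squeezed when the pool is genuinely oversubscribed.
        return max(share, MIN_PER_SOCKET)

    # Plenty of spare space: the newcomer gets a generous slice (6 MB).
    print(new_socket_budget(8 * 1024 * 1024, expected_new_sockets=4))
    # Pool nearly exhausted: the newcomer is held to the 64 kB floor.
    print(new_socket_budget(POOL_TOTAL - 128 * 1024, expected_new_sockets=4))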


--
Eddy

Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence





Re: fixing TCP buffers (Re: packet reordering at exchange points)

2002-04-09 Thread E.B. Dreger


Rough attempt at processing rules:

 1. If "enough" buffer space, let each stream have its fill.
DONE.

 2. Not "enough" buffer space, so we must invoke limits.

 3. If new connection, impose low limit until socket proves its
intentions... much like not allocating an entire socket
struct until TCP handshake is complete, or TCP slow start.
DONE.

 4. It's an existing connection.

 5. Does it act like it could use a smaller window?  If so,
shrink the window.  DONE.

 6. Stream might be able to use a larger window.

 7. Is it "tuning time" for this stream according to round robin
or random robin?  If so, use BIG buffer for a few packets,
measuring the stream's desires.

 8. Does the stream want more buffer space?  If not, DONE.

 9. Is it fair to other streams to adjust window?  If not, DONE.

10. Adjust appropriately.

I guess this shoots my "split into friendly fractions" approach
out of the water... and we're back to "standard" autotuning (for
sending) once we enforce minimum buffer size.

Major differences:

+ We're saying to approach memory usage macroscopically instead
  of microscopically.  i.e., per system instead of per stream.

+ We're removing upper bounds when bandwidth is plentiful.

+ Receive like you suggested, save for the "low memory" start
  phase.
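
A sketch of how rules 1-10 above might be wired together; every threshold
and field name below is invented for illustration, and none of it is real
stack code:

    # Invented-threshold sketch of the rule list above: decide what to do
    # with one stream's window given the state of the global buffer pool.
    def adjust_stream(stream, pool_free, pool_total, tuning_turn):
        if pool_free > pool_total // 4:              # 1. "enough" space
            return "leave alone"
        if stream["age_packets"] < 10:               # 3. brand-new stream
            return "hold at low floor"
        if stream["used"] < stream["window"] // 2:   # 5. over-provisioned
            return "shrink window"
        if not tuning_turn:                          # 7. not its turn to tune
            return "leave alone"
        if not stream["wants_more"]:                 # 8. content as-is
            return "leave alone"
        if stream["window"] * 2 > pool_free:         # 9. unfair to the others
            return "leave alone"
        return "grow window"                         # 10. adjust upward

    example = {"age_packets": 500, "used": 900_000,
               "window": 1_000_000, "wants_more": True}
    print(adjust_stream(example, pool_free=4_000_000,
                        pool_total=32_000_000, tuning_turn=True))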


--
Eddy

Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence





Re: fixing TCP buffers (Re: packet reordering at exchange points)

2002-04-09 Thread Richard A Steenbergen


On Wed, Apr 10, 2002 at 12:57:19AM +, E.B. Dreger wrote:
> 
> Unless, again, there's some sort of limit.  32 MB total, 512
> connections, each socket gets 64 kB until it proves its worth.
> Sockets don't get to play the RED-ish game until they _prove_
> that they're serious about sucking down data.
> 
> Once a socket proves its intentions (and periodically after
> that), it gets to use a BIG buffer, so we find out just how fast
> the connection can go.

That doesn't prevent an intentional local DoS though.

-- 
Richard A Steenbergen <[EMAIL PROTECTED]>   http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177  (67 29 D7 BC E8 18 3E DA  B2 46 B3 D8 14 36 FE B6)



Re: fixing TCP buffers (Re: packet reordering at exchange points)

2002-04-09 Thread E.B. Dreger


> Date: Tue, 9 Apr 2002 20:39:34 -0400
> From: Richard A Steenbergen <[EMAIL PROTECTED]>


> My suggestion was to cut out all that non-sense by simply removing the 
> received window limits all together. Actually you could accomplish this 
> goal by just advertising the maximum possible window size and rely on 
> packet drops to shrink the congestion window on the sending side as 
> necessary, but this would be slightly less efficient in the case of a 
> sender overrunning the receiver.
> 
> But alas we're both forgetting the sender side, which controls how quickly 
> data moves from userland into the kernel. This part must be set by looking 
> at the sending congestion window. And I thought of another problem as 

Actually, I was thinking more in terms of sending than receiving.
Yes, your approach sounds quite slick for the RECV side, and I
see your point.  But WND info will be negotiated for sending...
so why not base it on "splitting the total pie" instead of
"arbitrary maximum"?


> well. If you had a receiver which made a connection, requested as much 
> data as possible, and then never did a read() on the socket buffer, all 
> the data would pile up in the kernel and consume the total buffer space 
> for the entire system.

Unless, again, there's some sort of limit.  32 MB total, 512
connections, each socket gets 64 kB until it proves its worth.
Sockets don't get to play the RED-ish game until they _prove_
that they're serious about sucking down data.

Once a socket proves its intentions (and periodically after
that), it gets to use a BIG buffer, so we find out just how fast
the connection can go.


> You're missing the point, you don't allocate ANYTHING until you have a
> packet to fill that buffer, and then when you're done buffering it, it is
> free'd. The limits are just there to prevent you from running away with a 
> socket buffer.

No, I understand your point perfectly, and that's how it's
currently done.

But why even bother with constant malloc(9)/free(9) when the
overall buffer size remains reasonably constant?  i.e., kernel
allocation to IP stack changes slowly if at all.  IP stack alloc
to individual streams changes regularly.


--
Eddy

Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence





Re: fixing TCP buffers (Re: packet reordering at exchange points)

2002-04-09 Thread Richard A Steenbergen


On Wed, Apr 10, 2002 at 12:22:57AM +, E.B. Dreger wrote:
> 
> My static buffer presumed that one would regularly see line rate;
> that's probably an invalid assumption.

Indeed. But that's why it's not an actual allocation.

> Why bother advertising space remaining?  Simply take the total
> space -- which is tuned to line rate -- and divide equitably.
> Equal division is the primitive way.  Monitoring actual buffer
> use, a la PSC window-tuning code, is more efficient.

Because then you haven't accomplished your goal. If you have 32MB of buffer
memory available, and you open 32 connections and share it equally for
1MB/ea, you could have 1 connection that is doing no bandwidth and one
connection that wants to scale to more than 1MB of packets in flight. Then
you have to start scanning all your connections on a periodic basis
adjusting the socket buffers to reflect the actual congestion window, a
la PSC.

My suggestion was to cut out all that nonsense by simply removing the
receive window limits altogether. Actually you could accomplish this
goal by just advertising the maximum possible window size and rely on
packet drops to shrink the congestion window on the sending side as
necessary, but this would be slightly less efficient in the case of a
sender overrunning the receiver.

But alas we're both forgetting the sender side, which controls how quickly 
data moves from userland into the kernel. This part must be set by looking 
at the sending congestion window. And I thought of another problem as 
well. If you had a receiver which made a connection, requested as much 
data as possible, and then never did a read() on the socket buffer, all 
the data would pile up in the kernel and consume the total buffer space 
for the entire system.

> To respect memory, sure, you could impose a global limit and
> alloc as needed.  But on a "busy enough" server/client, how much
> would that save?  Perhaps one could allocate 8MB chunks at a
> time... but fragmentation could prevent the ability to have a
> contiguous 32MB in the future.  (Yes, I'm assuming high memory
> usage and simplistic paging.  But I think that's plausible.)

You're missing the point, you don't allocate ANYTHING until you have a
packet to fill that buffer, and then when you're done buffering it, it is
free'd. The limits are just there to prevent you from running away with a 
socket buffer.

-- 
Richard A Steenbergen <[EMAIL PROTECTED]>   http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177  (67 29 D7 BC E8 18 3E DA  B2 46 B3 D8 14 36 FE B6)



Re: fixing TCP buffers (Re: packet reordering at exchange points)

2002-04-09 Thread E.B. Dreger


> Date: Tue, 9 Apr 2002 19:17:44 -0400
> From: Richard A Steenbergen <[EMAIL PROTECTED]>

[ snip beginning ]


> Actually here's an even simpler one. Define a global limit for
> this, something like 32MB would be more then reasonable. Then
> instead of advertising the space "remaining" in individual


My static buffer presumed that one would regularly see line rate;
that's probably an invalid assumption.


> socket buffers, advertise the total space remaining in this


Why bother advertising space remaining?  Simply take the total
space -- which is tuned to line rate -- and divide equitably.
Equal division is the primitive way.  Monitoring actual buffer
use, a la PSC window-tuning code, is more efficient.

To respect memory, sure, you could impose a global limit and
alloc as needed.  But on a "busy enough" server/client, how much
would that save?  Perhaps one could allocate 8MB chunks at a
time... but fragmentation could prevent the ability to have a
contiguous 32MB in the future.  (Yes, I'm assuming high memory
usage and simplistic paging.  But I think that's plausible.)

Honestly... memory is so plentiful these days that I'd gladly
devote "line rate"-sized buffers to the cause on each and every
server that I run.


> virtual memory pool. If you overrun your buffer, you might have
> the other side send you a few unnecessary bytes that you just
> have to drop, but the situation should correct itself very


By allocating 32MB, one stream could achieve line rate with no
wasted space (assuming latency is exactly what we predict, which
we all know won't happen).  When another stream or two are
opened, we split the buffer into four.  Maybe we drop, like you
suggest, in a RED-like manner.  Maybe we flush the queue if it's
not "too full".

Now we have up to four streams, each with an 8MB queue.  Need
more streams?  Fine, split { one | some | all } of the 8MB
windows into 2MB segments.  Simple enough, until we hit the
variable bw*delay times... then we should use intelligence when
splitting, probably via mechanisms similar to the PSC stack.

Granularity of 4 is for example only.  I know that would be
non-ideal.  One could split 32 MB into 6.0 MB + 7.0 MB + 8.5 MB +
10.5 MB, which would then be halved as needed.  Long-running
sessions could be moved between buffer clumps as needed.  (i.e.,
if 1.5 MB is too small and 2.0 MB is too large, 1.75 MB fits
nicely into the 7.0 MB area.)


> quickly. I don't think this would be "unfair" to any particular
> flow, since you've eliminated the concept of one flow
> "hogging" the socket buffer and leave it to TCP to work out the
> sharing of the link. Second  opinions?


Smells to me like ALTQ's TBR (token bucket regulator).

Perhaps also have a dynamically-allocated "tuning" buffer:
Imagine 2000 dialups and 10 DSL connections transferring over a
DS3... use a single "big enough" buffer (few buffers?) to sniff
out each stream's capability, to determine which stream can use
how much more space.


--
Eddy

Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence





Re: fixing TCP buffers (Re: packet reordering at exchange points)

2002-04-09 Thread Richard A Steenbergen


On Tue, Apr 09, 2002 at 10:51:27PM +, E.B. Dreger wrote:
> 
> But how many simultaneous connections?  Until TCP stacks start
> using window autotuning (of which I know you're well aware), we
> must either use suboptimal windows or chew up ridiculous amounts
> of memory.  Yes, bad software, but still a limit...

That's precisely what I meant by bad software, as well as the server code
that pushes the data out in the first place. And for that matter, the
receiver side is just as important.

> It would be nice to allocate a 32MB chunk of RAM for buffers,
> then dynamically split it between streams.  Fragmentation makes
> that pretty much impossible.
> 
> OTOH... perhaps that's a reasonable start:
> 
> 1. Alloc buffer of size X
> 2. Let it be used for Y streams
> 3. When we have Y streams, split each stream "sub-buffer" into Y
>parts, giving capacity for Y^2, streams.

You don't actually allocate the buffers until you have something to put in
them, you're just fixing a limit on the maximum you're willing to 
allocate. The problem comes from the fact that you're fixing the limits on 
a "per-socket" basis, not on a "total system" basis.

> Aggregate transmission can't exceed line rate.  So instead of
> fixed-size buffers for each stream, perhaps our TOTAL buffer size
> should remain constant.
> 
> Use PSC-style autotuning to eek out more capacity/performance,
> instead of using fixed value of "Y" or splitting each and every
> last buffer.  (Actually, I need to reread/reexamine the PSC code
> in case it actually _does_ use a fixed total buffer size.)
> 
> This shouldn't be terribly hard to hack into an IP stack...

Actually here's an even simpler one. Define a global limit for this,
something like 32MB would be more than reasonable. Then instead of
advertising the space "remaining" in individual socket buffers, advertise
the total space remaining in this virtual memory pool. If you overrun your
buffer, you might have the other side send you a few unnecessary bytes
that you just have to drop, but the situation should correct itself very
quickly. I don't think this would be "unfair" to any particular flow, 
since you've eliminated the concept of one flow "hogging" the socket 
buffer and leave it to TCP to work out the sharing of the link. Second 
opinions?
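
A sketch of the 'advertise the pool, not the socket' idea with invented
numbers (not how any real kernel does it): every receiver advertises
whatever is left in one shared pool, capped by the largest window TCP can
express, and the pool shrinks as data sits in the kernel and grows back as
applications read() it.

    # Invented-number sketch: advertise what is left of one shared receive
    # pool instead of a fixed per-socket buffer.  MAX_TCP_WINDOW is roughly
    # the largest window the TCP header can express with window scaling.
    MAX_TCP_WINDOW = 1 << 30
    POOL_SIZE = 32 * 1024 * 1024
    pool_free = POOL_SIZE

    def advertise_window():
        return min(pool_free, MAX_TCP_WINDOW)

    def on_data_buffered(nbytes):       # segment arrives, app hasn't read yet
        global pool_free
        pool_free = max(pool_free - nbytes, 0)

    def on_application_read(nbytes):    # read() drains the kernel buffer
        global pool_free
        pool_free = min(pool_free + nbytes, POOL_SIZE)

    print(advertise_window())           # 33554432: every socket sees the pool
    on_data_buffered(30 * 1024 * 1024)  # one flow parks 30 MB in the kernel
    print(advertise_window())           # 2097152: the pool shrinks for everyone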

-- 
Richard A Steenbergen <[EMAIL PROTECTED]>   http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177  (67 29 D7 BC E8 18 3E DA  B2 46 B3 D8 14 36 FE B6)



fixing TCP buffers (Re: packet reordering at exchange points)

2002-04-09 Thread E.B. Dreger


> Date: Tue, 9 Apr 2002 16:03:53 -0400
> From: Richard A Steenbergen <[EMAIL PROTECTED]>


> To transfer 1Gb/s across 100ms I need to be prepared to buffer at least
> 25MB of data. According to pricewatch, I can pick up a high density 512MB

[ snip ]


> The problem isn't the lack of hardware, it's a lack of good software (both

[ snip ]

But how many simultaneous connections?  Until TCP stacks start
using window autotuning (of which I know you're well aware), we
must either use suboptimal windows or chew up ridiculous amounts
of memory.  Yes, bad software, but still a limit...

It would be nice to allocate a 32MB chunk of RAM for buffers,
then dynamically split it between streams.  Fragmentation makes
that pretty much impossible.

OTOH... perhaps that's a reasonable start:

1. Alloc buffer of size X
2. Let it be used for Y streams
3. When we have Y streams, split each stream "sub-buffer" into Y
   parts, giving capacity for Y^2 streams.

Aggregate transmission can't exceed line rate.  So instead of
fixed-size buffers for each stream, perhaps our TOTAL buffer size
should remain constant.

Use PSC-style autotuning to eke out more capacity/performance,
instead of using fixed value of "Y" or splitting each and every
last buffer.  (Actually, I need to reread/reexamine the PSC code
in case it actually _does_ use a fixed total buffer size.)

This shouldn't be terribly hard to hack into an IP stack...
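
A sketch of that split-by-Y geometry (X and Y below are arbitrary example
values, not a real allocator):

    # X and Y are arbitrary example values; this only shows the geometry
    # of the split, not a real allocator.
    X = 32 * 1024 * 1024        # total buffer
    Y = 4                       # streams per split level

    def sub_buffer_size(streams):
        size, capacity = X // Y, Y       # start: Y sub-buffers of X/Y each
        while streams > capacity:        # all slots in use: split Y ways again
            size //= Y
            capacity *= Y
        return size

    for n in (1, 4, 5, 16, 17):
        print(f"{n:3d} streams -> {sub_buffer_size(n)} bytes per sub-buffer")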


--
Eddy

Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence





Re: packet reordering at exchange points

2002-04-09 Thread Jim Hickstein


> beware: you're probably looking at benchmarks

(I can't resist passing this along for posterity.)

One of our guys described it thus:  Engineering wants to see how fast they 
can get the wheels to spin on a car.  Operations wants to know how fast the 
car will go.  These are different.



Re: packet reordering at exchange points

2002-04-09 Thread Jesper Skriver


On Tue, Apr 09, 2002 at 06:00:31PM +, E.B. Dreger wrote:
> > A large IX in Europe have this exact problem on their Foundry swiches,
> > which doesn't support round robin, and is currently forced to moving for
> 
> Can you state how many participants?

100+

> With N x GigE, what sort of [im]balance is there over the N lines?

a few links are overloaded, while others carry practically no traffic.

> Of course, I'd hope that individual heavy pairs would establish
> private interconnects instead of using public switch fabric, but I
> know that's not always { an option | done | ... }.

If A and B exchange say 200 Mbps of traffic, moving to a PNI is for sure
an option, but if both have GigE connections to the shared infrastructure
with spare capacity, both can expect the IX to handle that traffic.

/Jesper

-- 
Jesper Skriver, jesper(at)skriver(dot)dk  -  CCIE #5456
Work:Network manager   @ AS3292 (Tele Danmark DataNetworks)
Private: FreeBSD committer @ AS2109 (A much smaller network ;-)

One Unix to rule them all, One Resolver to find them,
One IP to bring them all and in the zone to bind them.



Re: packet reordering at exchange points

2002-04-09 Thread Joe St Sauver


>Date: Tue, 09 Apr 2002 16:03:53 -0400
>From: Richard A Steenbergen <[EMAIL PROTECTED]>
>Subject: Re: packet reordering at exchange points
>
>To transfer 1Gb/s across 100ms I need to be prepared to buffer at least
>25MB of data. According to pricewatch, I can pick up a high density 512MB
>PC133 DIMM for $70, and use $3.50 of it to catch that TCP stream. Throw in
>$36 for a GigE NIC, and we're ready to go for under $40. Yeah I know thats
>cheapest garbage you can get, but this is just to prove a point. :) I 
>might only be able to get 800Mbit across a 32bit/33mhz PCI bus, but 
>whatever.

Of course, in reality, things like choice of NIC can matter tremendously
when it comes to going even moderately fast, which is why people continue
to pay a premium for high performance NICs such as those by Syskonnect. (When 
you see vendors touting near-gigabit throughput for inexpensive gig NICs, 
beware: you're probably looking at benchmarks consisting of multiple streams 
sent with jumbo frames between two machines connected virtually back to back 
rather than "real world" performance associated with a single wide area tcp 
flow across a 1500 byte MTU link).

>The problem isn't the lack of hardware, it's a lack of good software (both
>on the receiving side and probably more importantly the sending side), a
>lot of bad standards coming back to bite us (1500 byte packets is about as
>far from efficient as you can get), a lack of people with enough know-how
>to actually build a network that can transport it all (heck they can't
>even build decent networks to deliver 10Mbit/s, @Home was the closest),
>and just a general lack of things for end users to do with that much
>bandwidth even if they got it.

In the university community, it is routine for students in residence halls
to have access to switched 10 (or even switched 100 Mbps) ethernet; of 
course, at that point, the issue isn't a lack of things for end users to do 
with that much (potential) bandwidth, it is the *cost* of provisioning wide 
area commodity bandwidth to support the demand that that speedy local 
infrastructure can generate that becomes the binding constraint. 

And speaking of University users, if you look at Internet2's excellent weekly 
reports (see, for example: 
http://netflow.internet2.edu/weekly/20020401/#fastest ) you'll see that wide 
area TCP non-measurement single flows *are* occurring at and above the 100Mbps
mark, at least across that (admittedly rather atypical) network infrastructure.
[Maybe not as commonly as we'd all like to hope, but they are happening.]

Regards,

Joe



Re: packet reordering at exchange points

2002-04-09 Thread Richard A Steenbergen


On Tue, Apr 09, 2002 at 07:18:35PM +, E.B. Dreger wrote:
> 
> > Date: Tue, 09 Apr 2002 11:16:24 -0700
> > From: Paul Vixie <[EMAIL PROTECTED]>
> 
> > my expectation is that when the last mile goes to 622Mb/s or 1000Mb/s,
> > exchange points will all be operating at 10Gb/s, and interswitch trunks
> > at exchange points will be multiples of 10Gb/s.
> 
> I guess Moore's Law comes into play again.  One will need some
> pretty hefty TCP buffers for a single stream to hit those rates,
> unless latency _really_ drops.  (Distributed CDNs, anyone?  Speed
> of light ain't getting faster any time soon...)

To transfer 1Gb/s across 100ms I need to be prepared to buffer at least
25MB of data. According to pricewatch, I can pick up a high density 512MB
PC133 DIMM for $70, and use $3.50 of it to catch that TCP stream. Throw in
$36 for a GigE NIC, and we're ready to go for under $40. Yeah I know that's
the cheapest garbage you can get, but this is just to prove a point. :) I
might only be able to get 800Mbit across a 32bit/33mhz PCI bus, but 
whatever.
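
For reference, the arithmetic behind that figure: 1 Gb/s across 100 ms is
12.5 MB of unacknowledged data in flight, and the commonly quoted rule of
thumb of sizing the socket buffer at about twice the bandwidth-delay product
would give the 25 MB mentioned above (my reading of where that number comes
from):

    # Bandwidth-delay product for the example above: 1 Gb/s across 100 ms.
    bandwidth_bps = 1_000_000_000
    rtt = 0.100

    bdp_bytes = bandwidth_bps * rtt / 8
    print(f"data in flight (bandwidth x delay): {bdp_bytes / 1e6:.1f} MB")
    print(f"2 x BDP rule-of-thumb buffer:       {2 * bdp_bytes / 1e6:.1f} MB")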

The problem isn't the lack of hardware, it's a lack of good software (both
on the receiving side and probably more importantly the sending side), a
lot of bad standards coming back to bite us (1500 byte packets is about as
far from efficient as you can get), a lack of people with enough know-how
to actually build a network that can transport it all (heck they can't
even build decent networks to deliver 10Mbit/s, @Home was the closest),
and just a general lack of things for end users to do with that much
bandwidth even if they got it.

-- 
Richard A Steenbergen <[EMAIL PROTECTED]>   http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177  (67 29 D7 BC E8 18 3E DA  B2 46 B3 D8 14 36 FE B6)



Re: packet reordering at exchange points

2002-04-09 Thread E.B. Dreger


> Date: Tue, 09 Apr 2002 11:16:24 -0700
> From: Paul Vixie <[EMAIL PROTECTED]>


> my expectation is that when the last mile goes to 622Mb/s or 1000Mb/s,
> exchange points will all be operating at 10Gb/s, and interswitch trunks
> at exchange points will be multiples of 10Gb/s.

I guess Moore's Law comes into play again.  One will need some
pretty hefty TCP buffers for a single stream to hit those rates,
unless latency _really_ drops.  (Distributed CDNs, anyone?  Speed
of light ain't getting faster any time soon...)

Of course, IMHO I expect DCDNs to become increasingly common...
but that topic would warrant a thread fork.

Looks like RR ISLs are feasible between GigE+ core switches...


--
Eddy

Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence





Re: packet reordering at exchange points

2002-04-09 Thread Paul Vixie


> H.  You're right.  I lost sight of the original thread...
> GigE inter-switch trunking at PAIX.  In that case, congestion
> _should_ be low, and there shouldn't be much queue depth.

indeed, this is the case.  we keep a lot of headroom on those trunks.

> But this _does_ bank on current "real world" behavior.  If
> endpoints ever approach GigE speeds (of course requiring "low
> enough" latency and "big enough" windows)...
> 
> Then again, last mile is so slow that we're probably a ways away
> from that happening.

my expectation is that when the last mile goes to 622Mb/s or 1000Mb/s,
exchange points will all be operating at 10Gb/s, and interswitch trunks
at exchange points will be multiples of 10Gb/s.

> Of course, I'd hope that individual heavy pairs would establish
> private interconnects instead of using public switch fabric, but
> I know that's not always { an option | done | ... }.

individual heavy pairs do this, but as a long term response to growth,
not as a short term response to congestion.  in the short term, the
exchange point switch can't present congestion.  it's just not on the
table at all.



Re: packet reordering at exchange points

2002-04-09 Thread E.B. Dreger


> Date: Tue, 9 Apr 2002 07:13:38 +0200
> From: Jesper Skriver <[EMAIL PROTECTED]>


> We're talking parallel GigE links between switches which are located
> close to each other.
> 
> And we're talking real life applications, which perhaps sends 100 pps
> in one stream, which means that you need to have ~ 10 ms different
> transmission delay on the individual links, before the risk of out of
> order packets for a given stream arise.

H.  You're right.  I lost sight of the original thread...
GigE inter-switch trunking at PAIX.  In that case, congestion
_should_ be low, and there shouldn't be much queue depth.

But this _does_ bank on current "real world" behavior.  If
endpoints ever approach GigE speeds (of course requiring "low
enough" latency and "big enough" windows)...

Then again, last mile is so slow that we're probably a ways away
from that happening.


> > IIRC, 802.3ad DOES NOT allow round robin distribution;
> 
> That is not what we're talking about, we're talking about the impact of
> doing it.

Yes, I was incomplete in that part.  Intended point was that IEEE
at least [seemingly] found round robin inappropriate for the general
case.


> > it uses hashes.  Sure, hashed distribution isn't perfect.
> 
> It's broken in a IX environment where you have few src/dst pairs, and
> where a single src/dst pair can easily use several hundreds of Mbps,
> if you have a few of those going of the same link due to the hashing
> algorithm, you will have problems.

In the [extreme] degenerate case, yes, one goes from N links to 1
effective link.


> A large IX in Europe have this exact problem on their Foundry swiches,
> which doesn't support round robin, and is currently forced to moving for

Can you state how many participants?  With N x GigE, what sort of
[im]balance is there over the N lines?

Of course, I'd hope that individual heavy pairs would establish
private interconnects instead of using public switch fabric, but
I know that's not always { an option | done | ... }.


> 10 GigE due to this very fact.

I'm going to have to play with ISL RR...


> /Jesper


--
Eddy

Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence





RE: packet reordering at exchange points

2002-04-09 Thread Kavi, Prabhu


An interesting historical observation,

Many years ago when I used to create discrete event simulation 
network models for a living, I had one project which was to model 
(what was then) a widely implemented PC TCP stack.  I remember that 
one wart of this implementation was that when packet reordering 
occurred it collapsed the window size to 1!  

Anyone know if strange warts like this still exist in desktop 
systems?

Prabhu
--
Prabhu Kavi Phone:  1-978-264-4900 x125 
Director, Adv. Prod. Planning   Fax:1-978-264-0671
Tenor Networks  Email:  [EMAIL PROTECTED]
100 Nagog Park  WWW:www.tenornetworks.com
Acton, MA 01720


> -Original Message-
> From: Iljitsch van Beijnum [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 09, 2002 12:36 PM
> To: Stephen Sprunk
> Cc: [EMAIL PROTECTED]
> Subject: Re: packet reordering at exchange points
> 
> 
> 
> On Mon, 8 Apr 2002, Stephen Sprunk wrote:
> 
> > Thus spake "Iljitsch van Beijnum" <[EMAIL PROTECTED]>
> > > But how is packet reordering on two parallell gigabit interfaces
> > > ever going to translate into reordered packets for individual
> > > streams?
> 
> > Think of a large FTP between two well-connected machines.  
> Such flows tend
> > to generate periodic clumps of packets; split one of these 
> clumps across two
> > pipes and the clump will arrive out of order at the other end.  The
> > resulting mess will create a clump of retransmissions, then 
> another bigger
> > clump of new data, ...
> 
> I don't think it will be this bad, even if hosts are 
> connected at GigE and
> the trunk is 2 x GigE. In this case, a (delayed) ACK will usually
> acknowledge 2 segments so it will trigger transmission of two new
> segments. These will arrive back to back at the router/switch 
> doing the
> load balancing. Since there is obviously need for more than 1 
> Gbit worth
> of bandwidth, it is likely the average queue size is at least 
> close to 1
> (= ~65% line use) or even higher. If this is the case, there 
> is a _chance_
> the second packet gains a full packet time over the first and arrives
> first at the destination.  However, this is NOT especially 
> likely if both
> packets are the same size:  the _average_ queue sizes will be 
> the same so
> in half the cases the first packet gains an even bigger 
> advance over the
> second, and only in a fraction of half the cases the second 
> packet gains
> enough over the first to pass it. And then, the destination host still
> only sees a single packet coming in out of order, which isn't 
> enough to
> trigger fast retransmit.
> 
> You need to load balance over more than two connections to trigger
> unnecessary fast retransmit (over two lines, packet #3 isn't 
> going to pass
> by packet #1), AND you need to send more than two packets 
> back to back.
> Also, you need to be at the same speed as the load balanced lines,
> otherwise your packet train gets split up by traffic from 
> other interfaces
> or idle time on the line.
> 
> And _then_, if all of this happens, all the retransmitted 
> data is left of
> window. I'm not even sure if those packets generate an ACK, 
> and if they
> do, if the sender takes any action on this ACK. If this 
> triggers another
> round of fast retransmit, the FR implementation should be considered
> broken, IMO.
> 
> > > Packets for streams that are subject to header compression or
> > > for voice over IP or even Mbone are nearly always transmitted
> > > at relatively large intervals, so they can't travel down parallell
> > > paths simultaneously.
> 
> > RTP reordering isn't a problem in my experience, probably 
> since RTP has an
> > inherent resequencing mechanism.
> 
> My point is real time protocols will not see reordering 
> unless they are
> using up nearly the full line speed or there is congestion, 
> because these
> protocols don't send out packets back to back like TCP 
> sometimes does. How
> big are VoIP packets? Even with an 80 byte payload you get 
> 100 packets per
> second = 10 ms between packets, which is more than 80 packet times for
> GigE = congestion. And if there is congestion, all 
> performance bets are
> off.
> 
> It seems to me spending (CPU) time and money to do more complex load
> balancing than per packet round robing in order to avoid 
> reordering only
> helps some people with GigE connected hosts some of the time. 
> Using this
> time or money to overcome congestion is probably a better investment.
> 
> PS. For everyone looking at their netstat -p tcp output: 
> packet loss also
> counts towards the out of order packets, it is hard to 
> get the real
> out of order figures.
> 
> PS2. Isn't it annoying to have to think about layer 4 to 
> build layer 2 stuff?
> 
> 



Re: packet reordering at exchange points

2002-04-09 Thread Iljitsch van Beijnum


On Mon, 8 Apr 2002, Stephen Sprunk wrote:

> Thus spake "Iljitsch van Beijnum" <[EMAIL PROTECTED]>
> > But how is packet reordering on two parallell gigabit interfaces
> > ever going to translate into reordered packets for individual
> > streams?

> Think of a large FTP between two well-connected machines.  Such flows tend
> to generate periodic clumps of packets; split one of these clumps across two
> pipes and the clump will arrive out of order at the other end.  The
> resulting mess will create a clump of retransmissions, then another bigger
> clump of new data, ...

I don't think it will be this bad, even if hosts are connected at GigE and
the trunk is 2 x GigE. In this case, a (delayed) ACK will usually
acknowledge 2 segments so it will trigger transmission of two new
segments. These will arrive back to back at the router/switch doing the
load balancing. Since there is obviously need for more than 1 Gbit worth
of bandwidth, it is likely the average queue size is at least close to 1
(= ~65% line use) or even higher. If this is the case, there is a _chance_
the second packet gains a full packet time over the first and arrives
first at the destination.  However, this is NOT especially likely if both
packets are the same size:  the _average_ queue sizes will be the same so
in half the cases the first packet gains an even bigger advance over the
second, and only in a fraction of half the cases the second packet gains
enough over the first to pass it. And then, the destination host still
only sees a single packet coming in out of order, which isn't enough to
trigger fast retransmit.

You need to load balance over more than two connections to trigger
unnecessary fast retransmit (over two lines, packet #3 isn't going to pass
by packet #1), AND you need to send more than two packets back to back.
Also, you need to be at the same speed as the load balanced lines,
otherwise your packet train gets split up by traffic from other interfaces
or idle time on the line.

And _then_, if all of this happens, all the retransmitted data is left of
window. I'm not even sure if those packets generate an ACK, and if they
do, if the sender takes any action on this ACK. If this triggers another
round of fast retransmit, the FR implementation should be considered
broken, IMO.

> > Packets for streams that are subject to header compression or
> > for voice over IP or even Mbone are nearly always transmitted
> > at relatively large intervals, so they can't travel down parallell
> > paths simultaneously.

> RTP reordering isn't a problem in my experience, probably since RTP has an
> inherent resequencing mechanism.

My point is real time protocols will not see reordering unless they are
using up nearly the full line speed or there is congestion, because these
protocols don't send out packets back to back like TCP sometimes does. How
big are VoIP packets? Even with an 80 byte payload you get 100 packets per
second = 10 ms between packets, which is more than 80 packet times for
GigE = congestion. And if there is congestion, all performance bets are
off.
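
Spelling that arithmetic out (illustrative sketch; 1500 bytes is just the
usual Ethernet MTU): the inter-packet gap of such a stream is hundreds of
full-size GigE packet times, so those packets essentially never travel back
to back.

    # Illustrative arithmetic for the VoIP example above.
    GIGE_BPS = 1_000_000_000
    voip_pps = 100                          # small packets, 100 per second
    gap = 1 / voip_pps                      # 10 ms between packets

    full_size_packet = 1500 * 8 / GIGE_BPS  # ~12 us to serialize on GigE
    print(f"inter-packet gap:              {gap * 1e3:.0f} ms")
    print(f"1500-byte packet time on GigE: {full_size_packet * 1e6:.0f} us")
    print(f"gap measured in packet times:  {gap / full_size_packet:.0f}")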

It seems to me spending (CPU) time and money to do more complex load
balancing than per packet round robin in order to avoid reordering only
helps some people with GigE connected hosts some of the time. Using this
time or money to overcome congestion is probably a better investment.

PS. For everyone looking at their netstat -p tcp output: packet loss also
counts towards the out of order packets, so it is hard to get the real
out of order figures.

PS2. Isn't it annoying to have to think about layer 4 to build layer 2 stuff?




Re: packet reordering at exchange points

2002-04-08 Thread Jesper Skriver


On Mon, Apr 08, 2002 at 11:19:56PM +, E.B. Dreger wrote:
> 
> > Date: Tue, 9 Apr 2002 00:32:50 +0200 (CEST)
> > From: Iljitsch van Beijnum <[EMAIL PROTECTED]>
> 
> > But how is packet reordering on two parallell gigabit interfaces
> > ever going to translate into reordered packets for individual
> > streams? Packets
>
> Queue depths.  Varying paths.

We're talking parallel GigE links between switches which are located
close to each other.

And we're talking real life applications, which perhaps send 100 pps
in one stream, which means that you need ~ 10 ms difference in
transmission delay on the individual links before the risk of out of
order packets for a given stream arises.

> IIRC, 802.3ad DOES NOT allow round robin distribution;

That is not what we're talking about, we're talking about the impact of
doing it.

> it uses hashes.  Sure, hashed distribution isn't perfect.

It's broken in an IX environment where you have few src/dst pairs, and
where a single src/dst pair can easily use several hundreds of Mbps;
if you have a few of those going over the same link due to the hashing
algorithm, you will have problems.

A large IX in Europe has this exact problem on their Foundry switches,
which don't support round robin, and is currently forced to move to
10 GigE due to this very fact.

/Jesper

-- 
Jesper Skriver, jesper(at)skriver(dot)dk  -  CCIE #5456
Work:Network manager   @ AS3292 (Tele Danmark DataNetworks)
Private: FreeBSD committer @ AS2109 (A much smaller network ;-)

One Unix to rule them all, One Resolver to find them,
One IP to bring them all and in the zone to bind them.



Re: packet reordering at exchange points

2002-04-08 Thread Jesper Skriver


On Mon, Apr 08, 2002 at 02:18:52PM -0700, Paul Vixie wrote:

> > packet reordering at MAE East was extremely common a few years
> > ago. Does anyone have information whether this is still happening?
>
> more to the point, does anybody still care about packet reordering at
> exchange points? we (paix) go through significant effort to prevent
> it, and interswitch trunking with round robin would be a lot easier.
> are we chasing an urban legend here, or would reordering still cause
> pain?

LINX uses Extreme switches with round robin load-sharing among 4*GigE and
8*GigE trunks, and no problems have been noted.

/Jesper

-- 
Jesper Skriver, jesper(at)skriver(dot)dk  -  CCIE #5456
Work:Network manager   @ AS3292 (Tele Danmark DataNetworks)
Private: FreeBSD committer @ AS2109 (A much smaller network ;-)

One Unix to rule them all, One Resolver to find them,
One IP to bring them all and in the zone to bind them.



Re: packet reordering at exchange points

2002-04-08 Thread Mark Allman



Paul-

> more to the point, does anybody still care about packet reordering
> at exchange points?  we (paix) go through significant effort to
> prevent it, and interswitch trunking with round robin would be a
> lot easier.  are we chasing an urban legend here, or would
> reordering still cause pain?

Yep.  Reordering causes pain to TCP performance.  The basic idea is
that if packets get jumbled up they trigger duplicate ACKs from the
receiver.  If things are badly enough reordered we end up with >= 3
duplicate ACKs arriving at the sender.  According to the fast
retransmit algorithm 3 duplicate ACKs are taken by the TCP as an
indication of packet loss -- and, hence, congestion.  So, we end up
needlessly cutting our congestion window in half.
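
A toy illustration of that mechanism in Python (a deliberately simplified
receiver and sender, not the paper's model): one reordered segment produces
a run of duplicate ACKs, and three of them trip fast retransmit and halve
the congestion window even though nothing was actually lost.

    # Deliberately simplified sketch: segments 1..6 are sent in order, but
    # the network delivers segment 2 late.  The receiver acks the highest
    # in-order segment it has, so the reordering produces duplicate ACKs;
    # three dup ACKs make the sender halve cwnd as if a loss had occurred.
    arrival_order = [1, 3, 4, 5, 2, 6]     # segment 2 arrives out of order

    expected, acks, buffered = 1, [], set()
    for seg in arrival_order:
        buffered.add(seg)
        while expected in buffered:        # deliver any now-contiguous data
            expected += 1
        acks.append(expected - 1)          # cumulative ack

    cwnd, dup = 10, 0
    for prev, ack in zip(acks, acks[1:]):
        dup = dup + 1 if ack == prev else 0
        if dup == 3:                       # fast retransmit threshold
            cwnd //= 2
    print("acks seen:", acks, "-> cwnd afterwards:", cwnd)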

Ethan Blanton and I just published a paper on what might be done
to make TCP more robust to paths that reorder segments (which would,
in turn, make such paths be less problematic).  The paper is:

Ethan Blanton, Mark Allman. On Making TCP More Robust to Packet
Reordering.  ACM Computer Communication Review, 32(1), January
2002. 
http://roland.grc.nasa.gov/~mallman/papers/tcp-reorder-ccr.ps

You might not necessarily be interested in the entire paper.  But,
the first part shows the performance problems caused by packet
reordering. 

allman


--
Mark Allman -- NASA GRC/BBN -- http://roland.grc.nasa.gov/~mallman/



Re: packet reordering at exchange points

2002-04-08 Thread Valdis . Kletnieks

On Tue, 09 Apr 2002 00:32:50 +0200, Iljitsch van Beijnum said:
> Obviously some applications care. In addition to the examples mentioned
> earlier: out of order packets aren't really good for TCP header
> compression, so they will slow down data transfers over slow links.

On the other hand, wouldn't this sort of slow link tend to close down
the TCP window and thus minimize the effect?  A quick back-of-envelope
calculation gives me a 56K modem line only opening the window up to 10K
or so, meaning there should only be 5-6 1500-byte packets in flight at a
given time, so the chances of *that flow* getting out of order at a core
router that's flipping 200K packets/sec are fairly low.
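
The arithmetic behind that estimate, as a quick Python check (the ~1.4 s
round-trip time -- modem latency plus queueing -- is my own assumption, not a
figure from the post above):

link_bps = 56_000          # 56K modem
rtt_s = 1.4                # assumed RTT: modem latency plus queueing delay
mss = 1500

bdp_bytes = (link_bps / 8) * rtt_s     # bandwidth-delay product
print(f"window ~{bdp_bytes / 1000:.1f} KB, ~{bdp_bytes / mss:.0f} packets in flight")
# -> roughly 10 KB and 6-7 full-sized packets, in line with the estimate above.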

Not saying it doesn't happen, or that it isn't a problem when it does - but
I'm going to wait till somebody posts a 'netstat' output showing that
it is in fact an issue for some environments...

-- 
Valdis Kletnieks
Computer Systems Senior Engineer
Virginia Tech






Re: packet reordering at exchange points

2002-04-08 Thread Stephen Sprunk


Thus spake "Iljitsch van Beijnum" <[EMAIL PROTECTED]>
> But how is packet reordering on two parallel gigabit interfaces
> ever going to translate into reordered packets for individual
> streams?

Think of a large FTP between two well-connected machines.  Such flows tend
to generate periodic clumps of packets; split one of these clumps across two
pipes and the clump will arrive out of order at the other end.  The
resulting mess will create a clump of retransmissions, then another bigger
clump of new data, ...
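
A toy Python model of that effect (the 1 ms latency difference between the
two pipes is an assumed figure, just to make the point visible):

per_pkt_ms = 0.012                  # serialization time of a 1500-byte packet at 1 Gb/s
extra_latency_ms = [0.0, 1.0]       # link 1 assumed to be 1 ms "longer" than link 0

arrivals = []
for seq in range(8):                # an 8-packet clump sent back to back
    link = seq % 2                  # striped round robin across the two pipes
    sent = seq * per_pkt_ms
    arrivals.append((sent + extra_latency_ms[link] + per_pkt_ms, seq))

print([seq for _, seq in sorted(arrivals)])
# -> [0, 2, 4, 6, 1, 3, 5, 7]: half the clump shows up "late", so the receiver
#    duplicate-ACKs and the sender retransmits needlessly.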

> Packets for streams that are subject to header compression or
> for voice over IP or even Mbone are nearly always transmitted
> at relatively large intervals, so they can't travel down parallel
> paths simultaneously.

RTP reordering isn't a problem in my experience, probably since RTP has an
inherent resequencing mechanism.  The problem with RTP is that if the
packets don't follow a deterministic path, the header compression scheme is
severely trashed.  Also, non-deterministic paths tend to increase jitter,
requiring more buffering at endpoints.

S




Re: packet reordering at exchange points

2002-04-08 Thread E.B. Dreger


> Date: Mon, 8 Apr 2002 19:45:16 -0400
> From: Richard A Steenbergen <[EMAIL PROTECTED]>


> > Queue depths.  Varying paths.  IIRC, 802.3ad DOES NOT allow round
> > robin distribution; it uses hashes.  Sure, hashed distribution
> > isn't perfect.  But it's better than "perfect" distribution with
> > added latency and/or retransmits out the wazoo.
> 
> You don't even need varying paths to create a desynch; all you need is
> varying packet sizes.

Quite true.  My list wasn't meant to be all-inclusive... bad
wording on my part.


--
Eddy

Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence

~
Date: Mon, 21 May 2001 11:23:58 + (GMT)
From: A Trap <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Please ignore this portion of my mail signature.

These last few lines are a trap for address-harvesting spambots.
Do NOT send mail to <[EMAIL PROTECTED]>, or you are likely to
be blocked.




Re: packet reordering at exchange points

2002-04-08 Thread Richard A Steenbergen


On Mon, Apr 08, 2002 at 11:19:56PM +, E.B. Dreger wrote:
> 
> > But how is packet reordering on two parallel gigabit interfaces ever
> > going to translate into reordered packets for individual streams? Packets
> 
> Queue depths.  Varying paths.  IIRC, 802.3ad DOES NOT allow round
> robin distribution; it uses hashes.  Sure, hashed distribution
> isn't perfect.  But it's better than "perfect" distribution with
> added latency and/or retransmits out the wazoo.

You don't even need varying paths to create a desynch; all you need is
varying packet sizes.
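
A quick Python sketch of that point: two identical links, strict round robin,
and a mix of 1500-byte and 64-byte packets (the sizes are just illustrative)
is already enough to reorder.

LINK_BPS = 1_000_000_000                        # two identical 1 Gb/s members

def serialize_ms(size_bytes):
    return size_bytes * 8 / LINK_BPS * 1000

sizes = [1500, 64, 1500, 64]                    # big, small, big, small
link_free_at = [0.0, 0.0]                       # when each member finishes sending
arrivals = []
for seq, size in enumerate(sizes):
    link = seq % 2                              # strict round robin, no hashing
    done = link_free_at[link] + serialize_ms(size)
    link_free_at[link] = done
    arrivals.append((done, seq))

print([seq for _, seq in sorted(arrivals)])
# -> [1, 3, 0, 2]: the small packets finish serializing before the big ones
#    sent just ahead of them, even though both paths are identical.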

-- 
Richard A Steenbergen <[EMAIL PROTECTED]>   http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177  (67 29 D7 BC E8 18 3E DA  B2 46 B3 D8 14 36 FE B6)



Re: packet reordering at exchange points

2002-04-08 Thread E.B. Dreger


> Date: Tue, 9 Apr 2002 00:32:50 +0200 (CEST)
> From: Iljitsch van Beijnum <[EMAIL PROTECTED]>


> Obviously some applications care. In addition to the examples mentioned
> earlier: out of order packets aren't really good for TCP header
> compression, so they will slow down data transfers over slow links.

How about ACK?  I think that's the point that Richard was
making... even with SACK, out-of-order packets can be an issue.


> But how is packet reordering on two parallel gigabit interfaces ever
> going to translate into reordered packets for individual streams? Packets

Queue depths.  Varying paths.  IIRC, 802.3ad DOES NOT allow round
robin distribution; it uses hashes.  Sure, hashed distribution
isn't perfect.  But it's better than "perfect" distribution with
added latency and/or retransmits out the wazoo.


> for streams that are subject to header compression or for voice over IP or
> even Mbone are nearly always transmitted at relatively large intervals, so
> they can't travel down parallel paths simultaneously.

What MTU?  Compare to jitter multiplied by line rate.


--
Eddy

Brotsman & Dreger, Inc. - EverQuick Internet Division
Phone: +1 (316) 794-8922 Wichita/(Inter)national
Phone: +1 (785) 865-5885 Lawrence

~
Date: Mon, 21 May 2001 11:23:58 + (GMT)
From: A Trap <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Please ignore this portion of my mail signature.

These last few lines are a trap for address-harvesting spambots.
Do NOT send mail to <[EMAIL PROTECTED]>, or you are likely to
be blocked.




Re: packet reordering at exchange points

2002-04-08 Thread Iljitsch van Beijnum


On Mon, 8 Apr 2002, Paul Vixie wrote:

> > packet reordering at MAE East was extremely common a few years ago. Does
> > anyone have information whether this is still happening?

> more to the point, does anybody still care about packet reordering at
> exchange points?  we (paix) go through significant effort to prevent it,
> and interswitch trunking with round robin would be a lot easier.  are
> we chasing an urban legend here, or would reordering still cause pain?

Obviously some applications care. In addition to the examples mentioned
earlier: out of order packets aren't really good for TCP header
compression, so they will slow down data transfers over slow links.

But how is packet reordering on two parallel gigabit interfaces ever
going to translate into reordered packets for individual streams? Packets
for streams that are subject to header compression or for voice over IP or
even Mbone are nearly always transmitted at relatively large intervals, so
they can't travel down parallel paths simultaneously.




Re: packet reordering at exchange points

2002-04-08 Thread Jake Khuon


### On Mon, 08 Apr 2002 14:18:52 -0700, Paul Vixie <[EMAIL PROTECTED]> casually
### decided to expound upon [EMAIL PROTECTED] the following thoughts about
### "packet reordering at exchange points":

PV> > packet reordering at MAE East was extremely common a few years ago. Does
PV> > anyone have information whether this is still happening?
PV> 
PV> more to the point, does anybody still care about packet reordering at
PV> exchange points?  we (paix) go through significant effort to prevent it,
PV> and interswitch trunking with round robin would be a lot easier.  are
PV> we chasing an urban legend here, or would reordering still cause pain?

I'd imagine that anyone passing realtime streams, Mbone or VOIP (anyone out
there routing their VOIP traffic across an IXP?) would start having issues
with the resulting jitter.


--
/*===[ Jake Khuon <[EMAIL PROTECTED]> ]==+
 | Packet Plumber, Network Engineers /| / [~ [~ |) | | --- |
 | for Effective Bandwidth Utilisation  / |/  [_ [_ |) |_| N E T W O R K S |
 +=*/



Re: packet reordering at exchange points

2002-04-08 Thread Richard A Steenbergen


On Mon, Apr 08, 2002 at 02:18:52PM -0700, Paul Vixie wrote:
> 
> > packet reordering at MAE East was extremely common a few years ago. Does
> > anyone have information whether this is still happening?
> 
> more to the point, does anybody still care about packet reordering at
> exchange points?  we (paix) go through significant effort to prevent it,
> and interswitch trunking with round robin would be a lot easier.  are
> we chasing an urban legend here, or would reordering still cause pain?

Set up a FreeBSD system with a dummynet pipe, do a probability match on 50%
of the packets, and send them through a pipe with a few more bytes of
queueing and 1 ms more delay than the rest.  Then test the performance of
TCP across that link.
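
If you'd rather not build the testbed first, a throwaway Python model of the
same experiment (the 0.1 ms sender spacing is an assumption) already shows how
much reordering a 50/50 split with 1 ms of extra delay produces:

import random
random.seed(1)

spacing_ms = 0.1            # assumed gap between packets at the sender
extra_ms = 1.0              # added delay on the "slow" pipe
n = 10_000

arrivals = []
for seq in range(n):
    delay = extra_ms if random.random() < 0.5 else 0.0   # 50% probability match
    arrivals.append((seq * spacing_ms + delay, seq))

late, seen_max = 0, -1
for _, seq in sorted(arrivals):
    if seq < seen_max:
        late += 1           # packet arrived after a later-sent packet
    else:
        seen_max = seq
print(f"{late} of {n} packets arrived out of order")
# With these numbers roughly half the packets show up late.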

There is a good paper on the subject that was published by ACM in
January: http://citeseer.nj.nec.com/450712.html

So just how common is packet reordering today? Well, I did a quick peek at
a few machines which I don't have any reason to believe are out of the
ordinary, and they all pretty much come out about the same:

32896155 packets received
9961197 acks (for 2309956346 bytes)
96322 duplicate acks
0 acks for unsent data
17328137 packets (2667939981 bytes) received in-sequence
10755 completely duplicate packets (1803069 bytes)
19 old duplicate packets
375 packets with some dup. data (38297 bytes duped)
53862 out-of-order packets (75435307 bytes)

About 0.3% of non-ACK packets (by packet count) were received out of order,
or 2.8% by bytes.
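
As a sanity check, the same percentages recomputed in a couple of lines of
Python from the counters above:

in_seq_pkts, in_seq_bytes = 17_328_137, 2_667_939_981
ooo_pkts, ooo_bytes = 53_862, 75_435_307

print(f"{ooo_pkts / (in_seq_pkts + ooo_pkts):.2%} of data packets out of order")
print(f"{ooo_bytes / (in_seq_bytes + ooo_bytes):.2%} by bytes")
# Prints about 0.31% and 2.75%; the 2.8% figure comes out if the out-of-order
# bytes are divided by the in-sequence bytes rather than the total.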

-- 
Richard A Steenbergen <[EMAIL PROTECTED]>   http://www.e-gerbil.net/ras
PGP Key ID: 0x138EA177  (67 29 D7 BC E8 18 3E DA  B2 46 B3 D8 14 36 FE B6)



Re: packet reordering at exchange points

2002-04-08 Thread Sean Donelan


On Mon, 8 Apr 2002, Paul Vixie wrote:
> > packet reordering at MAE East was extremely common a few years ago. Does
> > anyone have information whether this is still happening?
>
> more to the point, does anybody still care about packet reordering at
> exchange points?  we (paix) go through significant effort to prevent it,
> and interswitch trunking with round robin would be a lot easier.  are
> we chasing an urban legend here, or would reordering still cause pain?

Packet re-ordering would still cause pain if it started re-appearing
at high levels.





packet reordering at exchange points

2002-04-08 Thread Paul Vixie


> packet reordering at MAE East was extremely common a few years ago. Does
> anyone have information whether this is still happening?

more to the point, does anybody still care about packet reordering at
exchange points?  we (paix) go through significant effort to prevent it,
and interswitch trunking with round robin would be a lot easier.  are
we chasing an urban legend here, or would reordering still cause pain?