Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-15 Thread Fred Baker



On Aug 15, 2007, at 8:35 AM, Sean Donelan wrote:

Or should IP backbones have methods to predictably control which IP  
applications receive the remaining IP bandwidth?  Similar to the  
telephone network special information tone -- All Circuits are  
Busy.  Maybe we've found a new use for ICMP Source Quench.


Source Quench wouldn't be my favored solution here. What I might  
suggest is taking TCP SYN and SCTP INIT (or new sessions if they are  
encrypted or UDP) and putting them into a lower priority/rate queue.  
Delaying the start of new work would have a pretty strong effect on  
the congestive collapse of the existing work, I should think.
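A minimal sketch of that classification, assuming raw IPv4 packets and
treating "TCP SYN without ACK" and "SCTP INIT" as the session-start test
(the queue names and the test itself are illustrative, not a router
implementation):

    LOW_PRIORITY, NORMAL = "low-priority", "normal"

    def classify(ip_packet: bytes) -> str:
        """Pick a queue for a raw IPv4 packet (ignores IP options beyond IHL)."""
        ihl = (ip_packet[0] & 0x0F) * 4      # IP header length in bytes
        proto = ip_packet[9]                 # protocol field
        l4 = ip_packet[ihl:]
        if proto == 6:                       # TCP
            flags = l4[13]
            if (flags & 0x02) and not (flags & 0x10):   # SYN set, ACK clear
                return LOW_PRIORITY          # new session attempt
        elif proto == 132:                   # SCTP
            if l4[12] == 1:                  # first chunk type == INIT
                return LOW_PRIORITY
        return NORMAL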


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-15 Thread Sean Donelan


On Wed, 15 Aug 2007, Fred Baker wrote:

On Aug 15, 2007, at 8:35 AM, Sean Donelan wrote:
Or should IP backbones have methods to predictably control which IP 
applications receive the remaining IP bandwidth?  Similar to the telephone 
network special information tone -- All Circuits are Busy.  Maybe we've 
found a new use for ICMP Source Quench.


Source Quench wouldn't be my favored solution here. What I might suggest is 
taking TCP SYN and SCTP INIT (or new sessions if they are encrypted or UDP) 
and putting them into a lower priority/rate queue. Delaying the start of new 
work would have a pretty strong effect on the congestive collapse of the 
existing work, I should think.


I was joking about Source Quench (missing :-); it's got a lot of problems.

But I think the fundamental issue is who is responsible for controlling 
the back-off process?  The edge or the middle?


Using different queues implies the middle (i.e. routers).  At best it 
might be the "near-edge," and creating some type of shared knowledge 
between past, current and new sessions in the host stacks (and maybe 
middle-boxes like NAT gateways).

How fast do you need to signal large-scale back-off over what time period?
Since major events in the real-world also result in a lot of "new" 
traffic, how do you signal new sessions before they reach the affected 
region of the network?  Can you use BGP to signal the far-reaches of
the Internet that I'm having problems, and other ASNs should start slowing
things down before they reach my region (security can-o-worms being 
opened).




Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-15 Thread Stephen Wilcox

Hey Sean,

On Wed, Aug 15, 2007 at 11:35:43AM -0400, Sean Donelan wrote:
> On Wed, 15 Aug 2007, Stephen Wilcox wrote:
> >(Check slide 4) - the simple fact was that with something like 7 of 9 
> >cables down the redundancy is useless .. even if operators maintained 
> >N+1 redundancy which is unlikely for many operators that would imply 
> >50% of capacity was actually used with 50% spare.. however we see 
> >around 78% of capacity is lost. There was simply too much traffic and 
> >not enough capacity.. IP backbones fail pretty badly when faced with 
> >extreme congestion.
> 
> Remember the end-to-end principle.  IP backbones don't fail with extreme 
> congestion, IP applications fail with extreme congestion.

Hmm I'm not sure about that... a 100% full link dropping packets causes many 
problems:
L7: Applications stop working, humans get angry
L4: TCP/UDP drops cause retransmits, connection drops, retries etc
L3: BGP sessions drop, OSPF hellos are lost.. routing fails
L2: STP packets dropped.. switching fails

I believe any or all of the above could occur on a backbone which has just 
failed massively and now has 20% capacity available, such as occurred in 
SE Asia.

> Should IP applications respond to extreme congestion conditions better?
alert('Connection dropped')
"Ping timed out"

kinda icky but it's not the application's job to manage the network

> Or should IP backbones have methods to predictably control which IP 
> applications receive the remaining IP bandwidth?  Similar to the telephone
> network special information tone -- All Circuits are Busy.  Maybe we've
> found a new use for ICMP Source Quench.

yes and no.. for a private network perhaps, but for the Internet backbone where 
all traffic is important (right?), differentiation is difficult unless applied 
at the edge. When you have a major failure and congestion, I don't see what you 
can do that will have any reasonable effect. Perhaps you are a government 
contractor and you reserve some capacity for them and drop everything else, but 
what is really out there as a solution?

FYI I have seen telephone networks fail badly under extreme congestion. COs 
have small CPUs that don't do a whole lot - set up calls, send busy signals .. 
once a call is in place it doesn't occupy CPU time as the path is locked in 
place elsewhere. However, if something occurs to cause a serious number of busy 
circuits then CPU usage goes through the roof and you can cause cascade 
failures of whole COs

telcos look to solutions such as call gapping to intervene when they anticipate 
major congestion, rather than relying on the network to handle it
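A rough model of call gapping, with made-up gap interval and numbers: once a
destination code is gapped, at most one attempt per interval is set up and
the rest get an immediate busy treatment, which protects the CO's CPU:

    import time

    class CallGap:
        def __init__(self, gap_seconds):
            self.gap = gap_seconds
            self.last_admitted = {}          # destination code -> last admit time

        def admit(self, dest_code, now=None):
            now = time.monotonic() if now is None else now
            last = self.last_admitted.get(dest_code)
            if last is None or now - last >= self.gap:
                self.last_admitted[dest_code] = now
                return True                  # set the call up
            return False                     # return "all circuits busy" at once

    gap = CallGap(gap_seconds=5.0)
    print([gap.admit("0800-555-0100", now=t) for t in (0, 1, 2, 6, 7, 12)])
    # [True, False, False, True, False, True]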

> Even if the IP protocols recover "as designed," does human impatience mean 
> there is a maximum recovery timeout period before humans start making the 
> problem worse?

I'm not sure they were designed to do this.. the ARPANET wasn't intended to be 
massively congested.. the redundant links were in place to cope with loss of a 
node and usage was manageable.

Steve


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-15 Thread Fred Baker


let me answer at least twice.

As you say, remember the end-2-end principle. The end-2-end  
principle, in my precis, says "in deciding where functionality should  
be placed, do so in the simplest, cheapest, and most reliable manner  
when considered in the context of the entire network. That is usually  
close to the edge." Note the presence of advice and absence of mandate.


Parekh and Gallager in their 1993 papers on the topic proved using  
control theory that if we can specify the amount of data that each  
session keeps in the network (for some definition of "session") and  
for each link the session crosses define exactly what the link will  
do with it, we can mathematically predict the delay the session will  
experience. TCP congestion control as presently defined tries to  
manage delay by adjusting the window; some algorithms literally  
measure delay, while most measure loss, which is the extreme case of  
delay. The math tells me that the place to control the rate of a  
session is in the end system. Funny thing, that is found "close to  
the edge".
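For reference, the bound from those papers has roughly this shape (paraphrased
from memory; check the originals for the exact assumptions): for a session i
constrained by a leaky bucket with burst sigma_i and rate rho_i, crossing K
weighted-fair-queueing hops that each guarantee it a rate g_i >= rho_i, the
end-to-end delay satisfies

    D_i \le \frac{\sigma_i}{g_i} + \frac{(K-1)\,L_i}{g_i}
            + \sum_{k=1}^{K} \frac{L_{\max,k}}{r_k}

where L_i is the session's maximum packet size, L_{max,k} the maximum packet
size seen at hop k, and r_k that hop's link rate. The knobs are the ones named
above: what the end system puts in (sigma_i) and what each link guarantees
(g_i).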


What ISPs routinely try to do is adjust routing in order to maximize  
their ability to carry customer sessions without increasing their  
outlay for bandwidth. It's called "load sharing", and we have a list  
of ways we do that, notably in recent years using BGP advertisements.  
Where Parekh and Gallager calculated what the delay was, the ISP has  
the option of minimizing it through appropriate use of routing.


i.e., edge and middle both have valid options, and the totality works  
best when they work together. That may be heresy, but it's true. When  
I hear my company's marketing line on intelligence in the network  
(which makes me cringe), I try to remind my marketing folks that the  
best use of intelligence in the network is to offer intelligent  
services to the intelligent edge that enable the intelligent edge to  
do something intelligent. But there is a place for intelligence in  
the network, and routing is its poster child.


In your summary of the problem, the assumption is that both of these  
are operative and have done what they can - several links are down,  
the remaining links (including any rerouting that may have occurred)  
are full to the gills, TCP is backing off as far as it can back off,  
and even so due to high loss little if anything productive is in fact  
happening. You're looking for a third "thing that can be done" to  
avoid congestive collapse, which is the case in which the network or  
some part of it is fully utilized and yet accomplishing no useful work.


So I would suggest that a third thing that can be done, after the  
other two avenues have been exhausted, is to decide to not start new  
sessions unless there is some reasonable chance that they will be  
able to accomplish their work. This is a burden I would not want to  
put on the host, because the probability is vanishingly small - any  
competent network operator is going to solve the problem with money  
if it is other than transient. But from where I sit, it looks like  
the "simplest, cheapest, and most reliable" place to detect  
overwhelming congestion is at the congested link, and given that  
sessions tend to be of finite duration and present semi-predictable  
loads, if you want to allow established sessions to complete, you  
want to run the established sessions in preference to new ones. The  
thing to do is delay the initiation of new sessions.


If I had an ICMP that went to the application, and if I trusted the  
application to obey me, I might very well say "dear browser or p2p  
application, I know you want to open 4-7 TCP sessions at a time, but  
for the coming 60 seconds could I convince you to open only one at a  
time?". I suspect that would go a long way. But there is a trust  
issue - would enterprise firewalls let it get to the host, would the  
host be able to get it to the application, would the application  
honor it, and would the ISP trust the enterprise/host/application to  
do so? Is DDoS possible?
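The client-side behaviour being asked for might look like the sketch below,
under the hypothetical assumption that some "please slow down for N seconds"
signal actually reaches the application; no such ICMP exists today, and the
connection limits are only examples:

    import time

    class PoliteDialer:
        """Caps concurrent new connections; lowers the cap while asked to back off."""
        def __init__(self, normal_limit=6, reduced_limit=1):
            self.normal_limit = normal_limit
            self.reduced_limit = reduced_limit
            self.slow_until = 0.0
            self.open = 0

        def on_slowdown_hint(self, seconds):      # hypothetical hook
            self.slow_until = max(self.slow_until, time.monotonic() + seconds)

        def may_open_connection(self):
            limit = (self.reduced_limit if time.monotonic() < self.slow_until
                     else self.normal_limit)
            if self.open < limit:
                self.open += 1
                return True
            return False                          # defer this attempt, retry later

        def on_close(self):
            self.open = max(0, self.open - 1)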


So plan B would be to in some way rate limit the passage of TCP SYN/ 
SYN-ACK and SCTP INIT in such a way that the hosed links remain fully  
utilized but sessions that have become established get acceptable  
service (maybe not great service, but they eventually complete  
without failing).
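One way to picture that is a token bucket in front of the session-start
traffic: new-session packets only pass at a small configured rate while
everything else is forwarded normally. The numbers below are illustrative
only:

    class TokenBucket:
        def __init__(self, rate_pps, burst):
            self.rate = rate_pps                 # tokens (packets) per second
            self.burst = burst                   # bucket depth
            self.tokens = burst
            self.last = 0.0

        def allow(self, now):
            self.tokens = min(self.burst,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

    syn_bucket = TokenBucket(rate_pps=100, burst=50)

    def forward(packet, is_session_start, now):
        if is_session_start(packet):             # TCP SYN/SYN-ACK, SCTP INIT
            return syn_bucket.allow(now)         # excess is delayed or dropped
        return True                              # established traffic passes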



Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-15 Thread Valdis . Kletnieks
On Wed, 15 Aug 2007 11:59:54 EDT, Sean Donelan said:
> Since major events in the real-world also result in a lot of "new" 
> traffic, how do you signal new sessions before they reach the affected
> region of the network?  Can you use BGP to signal the far-reaches of
> the Internet that I'm having problems, and other ASNs should start slowing
> things down before they reach my region (security can-o-worms being 
> opened).

I'm more worried about state getting "stuck", kind of like the total inability
of the DHS worry-o-meter to move lower than yellow.





Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-15 Thread ChiloƩ Temuco
Congestion and applications...

My opinion:

A tier 1 provider does not care what traffic it carries.  That is all a
function of the application not the network.

A tier 2 provider may do traffic shaping, etc.

A tier 3 provider may decide to block traffic patterns.

 --

More or less...  The network was intended to move data from one machine to
another...  The less manipulation in the middle the better...  No
manipulation of the payload is the name of the game.

That being said, it's entirely a function of the application to time out and
drop out-of-order packets, etc.

ONS is designed around this principle.

In streaming data... often it is better to get bad or missing data than to
try and put out of order or bad data in the buffer...

A good example is digital over-the-air tv...  If you didn't build in enough
error correction... then you'll have digital breakup, etc.   It is
impossible to recover any of that data.

If reliable transport of data is required... That is a function of the
application.

ONS is an Optical Networking Standard in the development stage.

-Chiloe Temuco
On 8/15/07, Stephen Wilcox <[EMAIL PROTECTED]> wrote:
> [...Stephen's earlier message quoted in full; snipped...]


RE: Extreme congestion (was Re: inter-domain link recovery)

2007-08-15 Thread Rod Beck
Is this a declaration of principles? There is no reason why 'Tier 1' means that 
the carrier will not have an incentive to shape or even block traffic, 
particularly if they have a lot of eyeballs. 

Roderick S. Beck
Director of EMEA Sales
Hibernia Atlantic
1, Passage du Chantier, 75012 Paris
http://www.hiberniaatlantic.com
Wireless: 1-212-444-8829. 
Landline: 33-1-4346-3209
AOL Messenger: GlobalBandwidth
[EMAIL PROTECTED]
[EMAIL PROTECTED]
``Unthinking respect for authority is the greatest enemy of truth.'' Albert 
Einstein. 



-Original Message-
From: [EMAIL PROTECTED] on behalf of ChiloƩ Temuco
Sent: Wed 8/15/2007 6:06 PM
To: nanog@merit.edu
Subject: Re: Extreme congestion (was Re: inter-domain link recovery)
 
[...ChiloƩ Temuco's message, including the quoted Stephen Wilcox message, 
quoted in full; snipped...]

Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-15 Thread Fred Baker



On Aug 15, 2007, at 8:39 PM, Sean Donelan wrote:

Or would it be better to let the datagram protocols fight it out  
with the session oriented protocols, just like normal Internet  
operations?


  Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc)  
1% queue

  Everything else (UDP, ICMP, GRE, TCP ACK/FIN, etc) normal queue

And finally why only do this during extreme congestion?  Why not  
always do it?


I think I would always do it, and expect it to take effect only under  
extreme congestion.



On Aug 15, 2007, at 8:39 PM, Sean Donelan wrote:

On Wed, 15 Aug 2007, Fred Baker wrote:
So I would suggest that a third thing that can be done, after the  
other two avenues have been exhausted, is to decide to not start  
new sessions unless there is some reasonable chance that they will  
be able to accomplish their work.


I view this as part of the flash crowd family of congestion  
problems, a combination of a rapid increase in demand and a rapid  
decrease in capacity.


In many cases, yes. I know of a certain network that ran with 30%  
loss for a matter of years because the option didn't exist to  
increase the bandwidth. When it became reality, guess what they did.


That's when I got to thinking about this.



Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-15 Thread Sean Donelan



[...Lots of good stuff deleted to get to this point...]

On Wed, 15 Aug 2007, Fred Baker wrote:
So I would suggest that a third thing that can be done, after the other two 
avenues have been exhausted, is to decide to not start new sessions unless 
there is some reasonable chance that they will be able to accomplish their 
work. This is a burden I would not want to put on the host, because the 
probability is vanishingly small - any competent network operator is going to 
solve the problem with money if it is other than transient. But from where I 
sit, it looks like the "simplest, cheapest, and most reliable" place to 
detect overwhelming congestion is at the congested link, and given that 
sessions tend to be of finite duration and present semi-predictable loads, if 
you want to allow established sessions to complete, you want to run the 
established sessions in preference to new ones. The thing to do is delay the 
initiation of new sessions.


I view this as part of the flash crowd family of congestion 
problems, a combination of a rapid increase in demand and a rapid 
decrease in capacity.  But instead of targeting a single destination, 
the impact is across multiple networks in the region.

In the flash crowd cases (including DDoS variations), the place to respond 
(Note: the word change from "detect" to "respond") to extreme congestion 
does not seem to be at the congested link but several hops upstream of 
the congested link. Current "effective practice" seems to be 1-2 ASNs 
away from the congested/failure point, but that may just also be the 
distance to reach "effective" ISP backbone engineer response.




If I had an ICMP that went to the application, and if I trusted the 
application to obey me, I might very well say "dear browser or p2p 
application, I know you want to open 4-7 TCP sessions at a time, but for the 
coming 60 seconds could I convince you to open only one at a time?". I 
suspect that would go a long way. But there is a trust issue - would 
enterprise firewalls let it get to the host, would the host be able to get it 
to the application, would the application honor it, and would the ISP trust 
the enterprise/host/application to do so? is ddos possible? 


For the malicious DDoS, of course we don't expect the hosts to obey. 
However, in the more general flash crowd case, I think the expectation of 
hosts following the RFC is pretty strong, although it may take years for 
new things to make it into the stacks.  It won't slow down all the 
elephants, but maybe it can turn the stampede into just a rampage.  And 
the advantage of doing it in the edge host is that their scale grows with 
the Internet.


But even if the hosts don't respond to the back-off, it would give the 
edge more in-band trouble-shooting information. For example, ICMP 
"Destination Unreachable - Load shedding in effect. Retry after "N" 
seconds" (where N is stored like the Next-Hop MTU). Sending more packets 
to signal congestion just makes congestion worse.  However, having an 
explicit Internet "busy signal" is mostly to help network operators, 
because firewalls will probably drop those ICMP messages just like PMTU.
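Purely to make the idea concrete, here is what such a "busy signal" might
look like on the wire: an ICMP destination-unreachable variant carrying a
retry-after value where Path MTU Discovery carries the Next-Hop MTU. No such
code point exists; type 3 / code 99 below is made up:

    import struct

    def icmp_checksum(data):
        if len(data) % 2:
            data += b"\x00"
        s = sum(struct.unpack("!%dH" % (len(data) // 2), data))
        s = (s & 0xFFFF) + (s >> 16)
        s = (s & 0xFFFF) + (s >> 16)
        return (~s) & 0xFFFF

    def build_busy_signal(retry_after_seconds, original_datagram):
        icmp_type, icmp_code = 3, 99                 # hypothetical "load shedding"
        body = original_datagram[:64]                # echo the start of the offender
        header = struct.pack("!BBHHH", icmp_type, icmp_code, 0,
                             0, retry_after_seconds) # "unused" field reused, like PMTU
        csum = icmp_checksum(header + body)
        header = struct.pack("!BBHHH", icmp_type, icmp_code, csum,
                             0, retry_after_seconds)
        return header + body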



So plan B would be to in some way rate limit the passage of TCP SYN/SYN-ACK 
and SCTP INIT in such a way that the hosed links remain fully utilized but 
sessions that have become established get acceptable service (maybe not great 
service, but they eventually complete without failing).


This would be a useful plan B (or plan F - when things are really 
FUBARed), but I still think you need a way to signal it upstream 1 or 2 
ASNs from the Extreme Congestion to be effective. For example, BGP says 
for all packets for network w.x.y.z with community a, implement back-off 
queue plan B.  Probably not a queue per network in backbone routers, just 
one alternate queue plan B for all networks with that community.  Once 
the origin ASN feels things are back to "normal," they can remove the
community from their BGP announcements.

But what should the alternate queue plan B be?

Probably not fixed capacity numbers, but a distributed percentage across
different upstreams.

  Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc) 1% queue
  Datagram protocol packets (UDP, ICMP, GRE, etc) 20% queue
  Session protocol established/finish packets (TCP ACK/FIN, etc) normal queue

That values session oriented protocols more than datagram oriented 
protocols during extreme congestion.


Or would it be better to let the datagram protocols fight it out with the 
session oriented protocols, just like normal Internet operations?


  Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc) 1% queue
  Everything else (UDP, ICMP, GRE, TCP ACK/FIN, etc) normal queue

And finally why only do this during extreme congestion?  Why not always
do it?
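A sketch of how the community-triggered part of this could hang together,
with a made-up community value (64500:911) and the example percentages from
the plans above, which are illustrations rather than recommendations:

    from ipaddress import ip_address, ip_network

    PLAN_B = {                    # share of link capacity per traffic class
        "session-start": 0.01,    # TCP SYN/SYN-ACK, SCTP INIT, ...
        "datagram":      0.20,    # UDP, ICMP, GRE, ...
        "established":   None,    # remainder: normal queue
    }
    SLOW_DOWN_COMMUNITY = "64500:911"      # hypothetical value

    def plan_b_prefixes(bgp_routes):
        """bgp_routes: iterable of (prefix_string, set_of_communities)."""
        return [ip_network(p) for p, comms in bgp_routes
                if SLOW_DOWN_COMMUNITY in comms]

    def queue_plan_for(dst_ip, prefixes):
        dst = ip_address(dst_ip)
        return PLAN_B if any(dst in p for p in prefixes) else None   # None: default

    prefixes = plan_b_prefixes([("203.0.113.0/24", {"64500:911"}),
                                ("198.51.100.0/24", set())])
    print(queue_plan_for("203.0.113.7", prefixes) is PLAN_B)         # True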


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-15 Thread Adrian Chadd

On Wed, Aug 15, 2007, Fred Baker wrote:

> >And finally why only do this during extreme congestion?  Why not  
> >always
> >do it?
> 
> I think I would always do it, and expect it to take effect only under  
> extreme congestion.

Well, empirically (on multi-megabit customer-facing links) it takes
effect immediately and results in congestion being "avoided" (for
values of avoided.) You don't hit a "hm, this is fine" and "hm,
this is congested"; you actually notice a much smoother performance
degradation right up to 95% constant link use.

Another thing that I've done on DSL links (and this was spawned by
some of Tony Kapela's NANOG stuff) is to actually rate limit TCP SYN,
UDP DNS, ICMP, etc., but what I noticed was that during periods of
90+% load TCP connections could still be established and slowly
progress forward; what really busted things up was various P2P stuff.

By also rate-limiting per-user TCP connection establishment (doing per-IP
NAT maximum session counts, all in 12.4 on little Cisco 800s) the impact
on bandwidth-hoggy applications was immediate. People were also
very happy that their links were suddenly magically usable.
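The per-user cap reduces to something like this (the limit of 32 is
illustrative, not the value actually used on those boxes):

    from collections import defaultdict

    class PerIPSessionLimiter:
        def __init__(self, limit=32):
            self.limit = limit
            self.sessions = defaultdict(set)   # src ip -> set of flow tuples

        def admit(self, src, flow):
            if len(self.sessions[src]) >= self.limit:
                return False                   # drop the SYN; hoggy apps feel it first
            self.sessions[src].add(flow)
            return True

        def close(self, src, flow):
            self.sessions[src].discard(flow)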

I know a lot of these tricks can't be played on fat trunks (fair queueing
on 10Gig?) as I just haven't touched the equipment, but my experience
in enterprise switching environments with the Cisco QoS koolaid
really does show congestion doesn't have to destroy performance.

(Hm, an Ixia or two and a 7600 would be useful right about now.)



Adrian



Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Fred Baker


On Aug 15, 2007, at 10:13 PM, Adrian Chadd wrote:
Well, emprically (on multi-megabit customer-facing links) it takes  
effect immediately and results in congestion being "avoided" (for  
values of avoided.) You don't hit a "hm, this is fine" and "hm,  
this is congested"; you actually notice a much smoother performance  
degredation right up to 95% constant link use.


yes, theory says the same thing. It's really convenient when theory  
and practice happen to agree :-)


There is also a pretty good paper by Sue Moon et al in INFOCOM 2004  
that looks at the Sprint network (they had special access), examines  
variation in delay PoP-to-PoP at a microsecond granularity, and finds  
some fairly interesting behavior long before that point.


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Sean Donelan


On Wed, 15 Aug 2007, Fred Baker wrote:

On Aug 15, 2007, at 8:39 PM, Sean Donelan wrote:

On Wed, 15 Aug 2007, Fred Baker wrote:
So I would suggest that a third thing that can be done, after the other 
two avenues have been exhausted, is to decide to not start new sessions 
unless there is some reasonable chance that they will be able to 
accomplish their work.


I view this as part of the flash crowd family of congestion problems, a 
combination of a rapid increase in demand and a rapid decrease in capacity.


In many cases, yes. I know of a certain network that ran with 30% loss for a 
matter of years because the option didn't exist to increase the bandwidth. 
When it became reality, guess what they did.

That's when I got to thinking about this.


Yeah, necessity is always the mother of invention.  I first tried rate
limiting the TCP SYNs with the Starr/Clinton report.  It worked great 
for a while, but then the SYN flood started backing up not only on the 
"congested" link, but also started congesting the other peering 
networks (those were the days of OC3 backbones and head-of-line blocking 
NAP switches).  And then the server choked.

So that's why I keep returning to the need to pushback traffic a couple
of ASNs back.  If it's going to get dropped anyway, drop it sooner.

It's also why I would really like to try to do something about the 
woodpecker hosts that think congestion means try more.  If the back-off 
slows down the host re-trying, it's even further pushback.



Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Randy Bush

> So that's why I keep returning to the need to pushback traffic a couple
> of ASNs back.  If its going to get dropped anyway, drop it sooner.

ECN


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Alexander Harrowell
An "Internet variable speed limit" is a nice idea, but there are some
serious trust issues; applications have to trust the network implicitly not
to issue gratuitous slow down messages, and certainly not to use them for
evil purposes (not that I want to start a network neutrality flamewar...but
what with the AT&T/Pearl Jam row, it's not hard to see
rightsholders/telcos/government/alien space bats leaning on your upstream to
spoil your access to content X).

Further, you're going to need *very good* filtration; necessary to verify
the source of any such packets closely due to the major DOS potential.
Scenario: Bad Guy controls some hacked machines on AS666 DubiousNet, who
peer at AMS-IX. Bad Guy has his bots inject a mass of "slow down!" packets
with a faked source address taken from the IX's netblock...and everything
starts moving Very Slowly. Especially if the suggestion upthread that the
slowdown ought to be implemented 1-2 AS away from the problem is
implemented, which would require forwarding the slowdowns between networks.

It has some similarities with the Chinese firewall's use of quick TCP RSTs
to keep users from seeing Bad Things; in that you could tell your machine to
ignore'em. There's a sort of tragedy of the commons problem - if everyone
agrees to listen to the slowdown requests, it will work, but all you need is
a significant minority of the irresponsible, and there'll be no gain in
listening to them.


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Stephen Wilcox

On Wed, Aug 15, 2007 at 12:58:48PM -0700, Tony Li wrote:
> 
> On Aug 15, 2007, at 9:12 AM, Stephen Wilcox wrote:
> 
> >>Remember the end-to-end principle.  IP backbones don't fail with  
> >>extreme
> >>congestion, IP applications fail with extreme congestion.
> >
> >Hmm I'm not sure about that... a 100% full link dropping packets  
> >causes many problems:
> >[...]
> >L3: BGP sessions drop, OSPF hellos are lost.. routing fails
> >L2: STP packets dropped.. switching fails
> 
> 
> It should be noted that well designed equipment will prioritize  
> control data both on xmit and receive so that under extreme  
> congestion situations, these symptoms do not occur.

Hi Tony,
 s/will/should/

The various bits of kit I've played with have tended not to cope under a 
massively maxed out circuit (I don't mean just full, I mean trying to get double 
the capacity into a link). This includes Cisco and Foundry kit.. not sure about 
other vendors such as Extreme or Juniper.

Often the congestion/flaps cause high CPU load, which also can cause failure of 
protocols on the control plane.

Also, if you have something like router-switch-router it may be that the 
intermediate device looks after its control plane (ie STP) but for example sees 
BGP as just another TCP stream which it cannot differentiate.

Whilst it may be that control plane priority and CPU protection are features 
now available.. they have not always been, and I'm fairly sure they are not 
available across all platforms and software now. And as we know, the majority 
of the Internet does not run the latest high end kit and software..

Steve


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Stephen Wilcox

On Thu, Aug 16, 2007 at 10:55:34AM +0100, Alexander Harrowell wrote:
>An "Internet variable speed limit" is a nice idea, but there are some
>serious trust issues; applications have to trust the network implicitly
>not to issue gratuitous slow down messages, and certainly not to use them
>for evil purposes (not that I want to start a network neutrality
>flamewar...but what with the AT&T/Pearl Jam row, it's not hard to see
>rightsholders/telcos/government/alien space bats leaning on your upstream
>to spoil your access to content X).
> 
>Further, you're going to need *very good* filtration; necessary to verify
>the source of any such packets closely due to the major DOS potential.
>Scenario: Bad Guy controls some hacked machines on AS666 DubiousNet, who
>peer at AMS-IX. Bad Guy has his bots inject a mass of "slow down!" packets
>with a faked source address taken from the IX's netblock...and everything
>starts moving Very Slowly. Especially if the suggestion upthread that the
>slowdown ought to be implemented 1-2 AS away from the problem is
>implemented, which would require forwarding the slowdowns between
>networks.
> 
>It has some similarities with the Chinese firewall's use of quick TCP RSTs
>to keep users from seeing Bad Things; in that you could tell your machine
>to ignore'em. There's a sort of tragedy of the commons problem - if
>everyone agrees to listen to the slowdown requests, it will work, but all
>you need is a significant minority of the irresponsible, and there'll be
>no gain in listening to them.

sounds a lot like MEDs - something you have to trust an unknown upstream to 
send you, of dubious origin, making unknown changes to performance on your 
network

and also like MEDs, whilst it may work for some it won't for others.. a DSL 
provider may try to control input but a CDN will want to ignore them to 
maximise throughput and revenue

Steve


RE: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread michael.dillon

> In many cases, yes. I know of a certain network that ran with 
> 30% loss for a matter of years because the option didn't 
> exist to increase the bandwidth. When it became reality, 
> guess what they did.

How many people have noticed that when you replace a circuit with a
higher capacity one, the traffic on the new circuit is suddenly greater
than 100% of the old one. Obviously this doesn't happen all the time,
such as when you have a 40% threshold for initiating a circuit upgrade,
but if you do your upgrades when they are 80% or 90% full, this does
happen.

--Michael Dillon


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Sean Donelan


On Thu, 16 Aug 2007, Alexander Harrowell wrote:

An "Internet variable speed limit" is a nice idea, but there are some
serious trust issues; applications have to trust the network implicitly not
to issue gratuitous slow down messages, and certainly not to use them for


Yeah, that's why I was limiting the need (requirement) to only 1-few ASN 
hops upstream.  I view this as similar to some backbones offering a 
special blackhole everything BGP community that usually is not transitive. 
This is the Oh Crap, Don't Blackhole Everything but Slow Stuff Down 
BGP community.



Further, you're going to need *very good* filtration; necessary to verify
the source of any such packets closely due to the major DOS potential.
Scenario: Bad Guy controls some hacked machines on AS666 DubiousNet, who
peer at AMS-IX. Bad Guy has his bots inject a mass of "slow down!" packets
with a faked source address taken from the IX's netblock...and everything
starts moving Very Slowly. Especially if the suggestion upthread that the
slowdown ought to be implemented 1-2 AS away from the problem is
implemented, which would require forwarding the slowdowns between networks.


For the ICMP packet, man-in-the-middle attacks are really no different, 
and neither is the validation required, than for any other protocol.  For 
most protocols, you "should" get at least 64 bytes back of the original 
packet in the ICMP error message. You "should" be validating everything 
against what you sent.  Be conservative in what you send, be suspicious 
in what you receive.


It has some similarities with the Chinese firewall's use of quick TCP RSTs
to keep users from seeing Bad Things; in that you could tell your machine to
ignore'em. There's a sort of tragedy of the commons problem - if everyone
agrees to listen to the slowdown requests, it will work, but all you need is
a significant minority of the irresponsible, and there'll be no gain in
listening to them.


Penalty box, penalty box.  Yeah, this is always the argument.  But as 
we've seen with TCP, most host stacks try (more or less) to follow the 
RFCs.  Why implement any TCP congestion management?


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Hex Star
How does Akamai handle traffic congestion so seamlessly? Perhaps we should
look at existing setups implemented by companies such as Akamai for
guidelines regarding how to resolve this kind of issue...


RE: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Mikael Abrahamsson


On Thu, 16 Aug 2007, [EMAIL PROTECTED] wrote:

How many people have noticed that when you replace a circuit with a 
higher capacity one, the traffic on the new circuit is suddenly greater 
than 100% of the old one. Obviously this doesn't happen all the time, 
such as when you have a 40% threshold for initiating a circuit upgrade, 
but if you do your upgrades when they are 80% or 90% full, this does 
happen.


I'd say this might happen on links connected to devices with small buffers, 
such as a 7600 with LAN cards, a Foundry device, or the like. If you look 
at the same behaviour on a deep packet buffer device such as a Juniper or 
Cisco GSR/CRS-1, the behaviour you're describing doesn't exist (at least 
not that I have noticed).


--
Mikael Abrahamsson    email: [EMAIL PROTECTED]


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Sean Donelan


On Wed, 15 Aug 2007, Randy Bush wrote:

So that's why I keep returning to the need to pushback traffic a couple
of ASNs back.  If its going to get dropped anyway, drop it sooner.


ECN


Oh goody, the whole RED, BLUE, WRED, AQM, etc menagerie.

Connections already in progress (i.e. the ones with ECN) we want to keep 
working and finish.  We don't want those connections to abort in the 
middle, and then add to the congestion when they retry.


The phrase everyone is trying to avoid saying is "Admission Control."  The 
Internet doesn't do admission control well (or even badly).


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Fred Baker


yes.

On Aug 16, 2007, at 12:29 AM, Randy Bush wrote:



So that's why I keep returning to the need to pushback traffic a couple 
of ASNs back.  If its going to get dropped anyway, drop it sooner.


ECN


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Randy Bush

>>> So that's why I keep returning to the need to pushback traffic a couple
>>> of ASNs back.  If its going to get dropped anyway, drop it sooner.
>> ECN
> Oh goody, the whole RED, BLUE, WRED, AQM, etc menagerie.

wow!  is that what ECN stands for?  somehow, in all this time, i missed
that.  live and learn.

> Connections already in progress (i.e. the ones with ECN) we want to keep
> working and finish.  We don't want those connections to abort in the
> middle, and then add to the congestion when they retry.

so the latest version of ECN aborts connections?  wow!  i am really
learning a lot, and it's only the first cup of coffee today.  thanks!

> The phrase everyone is trying to avoid saying is "Admission Control." 

you want "pushback traffic a couple of ASNs back," the actual question i
was answering, you are talking admission control.

randy


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Randy Bush

> Yeah, that's why I was limiting the need (requirement) to only 1-few
> ASN hops upstream.  I view this as similar to some backbones offering
> a special blackhole everything BGP community that usually is not 
> transitive. This is the Oh Crap, Don't Blackhole Everything but Slow 
> Stuff Down BGP community.

and the two hops upstream but not the source router spools the packets
to the hard drive?

randy


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Randy Bush

Alexander Harrowell wrote:
>> Yeah, that's why I was limiting the need (requirement) to only 1-few
>> ASN hops upstream.  I view this as similar to some backbones offering
>> a special blackhole everything BGP community that usually is not
>> transitive. This is the Oh Crap, Don't Blackhole Everything but Slow
>> Stuff Down BGP community.
> and the two hops upstream but not the source router spools the packets
> to the hard drive?
> Ideally you'd want to influence the endpoint protocol stack, right?

ECN

sally floyd ain't stoopid


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Alexander Harrowell
On 8/16/07, Randy Bush <[EMAIL PROTECTED]> wrote:
>
> > Yeah, that's why I was limiting the need (requirement) to only 1-few
> > ASN hops upstream.  I view this as similar to some backbones offering
> > a special blackhole everything BGP community that usually is not
> > transitive. This is the Oh Crap, Don't Blackhole Everything but Slow
> > Stuff Down BGP community.
>
> and the two hops upstream but not the source router spools the packets
> to the hard drive?


Ideally you'd want to influence the endpoint protocol stack, right? (Which
brings us to the user trust thing.)


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Fred Baker



On Aug 16, 2007, at 7:46 AM, <[EMAIL PROTECTED]> wrote:
In many cases, yes. I know of a certain network that ran with 30%  
loss for a matter of years because the option didn't exist to  
increase the bandwidth. When it became reality, guess what they did.


How many people have noticed that when you replace a circuit with a  
higher capacity one, the traffic on the new circuit is suddenly  
greater than 100% of the old one. Obviously this doesn't happen all  
the time, such as when you have a 40% threshold for initiating a  
circuit upgrade, but if you do your upgrades when they are 80% or  
90% full, this does happen.


well, so let's do a thought experiment.

First, that INFOCOM paper I mentioned says that they measured the  
variation in delay PoP-to-PoP at microsecond granularity with hyper- 
synchronized clocks, and found that with 90% confidence the variation  
in delay in their particular optical network was less than 1 ms. Also  
with 90% confidence, they noted "frequent" (frequency not specified,  
but apparently pretty frequent, enough that one of the authors later  
worried in my presence about offering VoIP services on it) variations  
on the order of 10 ms. For completeness, I'll note that they had six  
cases in a five hour sample where the delay changed by 100 ms and  
stayed there for a period of time, but we'll leave that observation  
for now.


Such spikes are not difficult to explain. If you think of TCP as an  
on-off function, a wave function with some similarities to a sine  
wave, you might ask yourself what the sum of a bunch of sine waves  
with slightly different periods is. It is also a wave function, and  
occasionally has a very tall peak. The study says that TCP  
synchronization happens in the backbone. Surprise.
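A quick numeric illustration of that picture (toy numbers, nothing measured):
forty "flows", each a bounded sine wave with a slightly different period, sum
to an aggregate whose occasional peaks sit far above its average:

    import math

    periods = [1.0 + 0.01 * k for k in range(40)]     # 40 slightly detuned flows

    def aggregate(t):
        return sum(1.0 + math.sin(2 * math.pi * t / p) for p in periods)

    samples = [aggregate(t / 10.0) for t in range(20000)]
    mean = sum(samples) / len(samples)
    print("mean %.1f, max %.1f" % (mean, max(samples)))
    # roughly: mean ~40, max ~75+; capacity sized near the mean clips the peaks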


Now, let's say you're running your favorite link at 90% and get such  
a spike. What happens? The tip of it gets clipped off - a few packets  
get dropped. Those TCPs slow down momentarily. The more that happens,  
the more frequently TCPs get clipped and back off.


Now you upgrade the circuit and the TCPs stop getting clipped. What  
happens?


The TCPs don't slow down. They use the bandwidth you have made  
available instead.


in your words, "the traffic on the new circuit is suddenly greater  
than 100% of the old one".


In 1995 at the NGN conference, I found myself on a stage with Phill  
Gross, then a VP at MCI. He was basically reporting on this  
phenomenon and apologizing to his audience. MCI had put in an OC-3  
network - gee-whiz stuff then - and had some of the links run too  
close to full before starting to upgrade. By the time they had two  
OC-3's in parallel on every path, there were some paths with a  
standing 20% loss rate. Phill figured that doubling the bandwidth  
again (622 everywhere) on every path throughout the network should  
solve the problem for that remaining 20% of load, and started with  
the hottest links. To his surprise, with the standing load > 95% and  
experiencing 20% loss at 311 Mbps, doubling the rate to 622 Mbps  
resulted in links with a standing load > 90% and 4% loss. He still  
needed more bandwidth. After we walked offstage, I explained TCP to  
him...


Yup. That's what happens.

Several folks have commented on p2p as a major issue here.  
Personally, I don't think of p2p as the problem in this context, but  
it is an application that exacerbates the problem. Bottom line, the  
common p2p applications like to keep lots of TCP sessions flowing,  
and have lots of data to move. Also (and to my small mind this is  
egregious), they make no use of locality - if the content they are  
looking for is both next door and half-way around the world, they're  
perfectly happy to move it around the world. Hence, moving a file  
into a campus doesn't mean that the campus has the file and will stop  
bothering you. I'm pushing an agenda in the open source world to add  
some concept of locality, with the purpose of moving traffic off ISP  
networks when I can. I think the user will be just as happy or  
happier, and folks pushing large optics will certainly be.


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Sean Donelan


On Thu, 16 Aug 2007, Randy Bush wrote:

Alexander Harrowell wrote:

Yeah, that's why I was limiting the need (requirement) to only 1-few
ASN hops upstream.  I view this as similar to some backbones offering
a special blackhole everything BGP community that usually is not
transitive. This is the Oh Crap, Don't Blackhole Everything but Slow
Stuff Down BGP community.

and the two hops upstream but not the source router spools the packets
to the hard drive?
Ideally you'd want to influence the endpoint protocol stack, right?


ECN

sally floyd ain't stoopid


ECN doesn't affect the initial SYN packets.

I agree, sally floyd ain't stoopid.


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Deepak Jain




Mikael Abrahamsson wrote:


On Thu, 16 Aug 2007, [EMAIL PROTECTED] wrote:

How many people have noticed that when you replace a circuit with a 
higher capacity one, the traffic on the new circuit is suddenly 
greater than 100% of the old one. Obviously this doesn't happen all 
the time, such as when you have a 40% threshold for initiating a 
circuit upgrade, but if you do your upgrades when they are 80% or 90% 
full, this does happen.


I'd say this might happen on links connected to devices with small 
buffers such as with a 7600 with lan cards, foundry device or alike. If 
you look at the same behaviour of a deep packet buffer device such as 
juniper or cisco GSR/CRS-1 the behaviour you're describing doesn't exist 
(at least not that I have noticed).


Depends on your traffic type, and I think this really depends on the 
granularity of your study set (when you are calculating 80-90% usage). 
If you upgrade early, or your (shallow) packet buffers convince you to 
upgrade late, the effects might be different.


If you do upgrades assuming the same amount of latency and packet loss 
on any circuit, you should see the same effect irrespective of buffer 
depth. (for any production equipment by a main vendor).


Deeper buffers allow you to run closer to 100% (longer) with fewer 
packet drops at the cost of higher latency. The assumption being that 
more congested devices with smaller buffers are dropping some packets 
here and there and causing those sessions to back off in a way the 
deeper buffer systems don't.


It's a business case whether it's better to upgrade early or buy gear that 
lets you upgrade later.


DJ




Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Mikael Abrahamsson


On Thu, 16 Aug 2007, Deepak Jain wrote:

Depends on your traffic type and I think this really depends on the 
granularity of your study set (when you are calculating 80-90% usage). If you 
upgrade early, or your (shallow) packet buffers convince to upgrade late, the 
effects might be different.


My guess is that the value comes from mrtg or alike, 5 minute average 
utilization.


If you do upgrades assuming the same amount of latency and packet loss on any 
circuit, you should see the same effect irrespective of buffer depth. (for 
any production equipment by a main vendor).


I do not agree. A shallow buffer device will give you packet loss without 
any major latency increase, whereas a deep buffer device will give you 
latency without packet loss (as most users out there will not have 
sufficient TCP window size to utilize a 300+ ms latency due to buffering, 
they will throttle back their usage of the link, and it can stay at 100% 
utilization without packet loss for quite some time).
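Back-of-the-envelope for that point, using an assumed 64 KB receive window
(a common default-ish value, not a measurement):

    window_bytes = 64 * 1024
    for rtt in (0.020, 0.300):                   # seconds
        mbps = window_bytes * 8 / rtt / 1e6
        print("RTT %3.0f ms -> at most %4.1f Mbit/s per flow" % (rtt * 1e3, mbps))
    # RTT  20 ms -> at most 26.2 Mbit/s per flow
    # RTT 300 ms -> at most  1.7 Mbit/s per flow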


Yes, these two cases will both enable link utilization to get to 100% on 
average, and in most cases users will actually complain less as the packet 
loss will most likely be less noticeable to them in traceroute than the 
latency increase due to buffering.


Anyhow, I still consider a congested backbone an operational failure as 
one is failing to provide adequate service to the customers. Congestion 
should happen on the access line to the customer, nowhere else.


Deeper buffers allow you to run closer to 100% (longer) with fewer packet 
drops at the cost of higher latency. The assumption being that more congested 
devices with smaller buffers are dropping some packets here and there and 
causing those sessions to back off in a way the deeper buffer systems don't.


Correct.

It's a business case whether it's better to upgrade early or buy gear that lets 
you upgrade later.


It depends on your bandwidth cost. If your link is very expensive then it might 
make sense to use manpower opex and equipment capex to prolong the usage 
of that link by trying to cram everything you can out of it. In the long 
run there is of course no way to avoid an upgrade, as users will notice it 
anyhow.


--
Mikael Abrahamsson    email: [EMAIL PROTECTED]


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Mikael Abrahamsson


On Thu, 16 Aug 2007, Fred Baker wrote:

world, they're perfectly happy to move it around the world. Hence, moving a 
file into a campus doesn't mean that the campus has the file and will stop 
bothering you. I'm pushing an agenda in the open source world to add some 
concept of locality, with the purpose of moving traffic off ISP networks when 
I can. I think the user will be just as happy or happier, and folks pushing 
large optics will certainly be.


With the regular user's small TCP window size, you still get a sense of 
locality, as more data during the same time will flow from a source that is 
closer to you RTT-wise than from one that is far away.


We've been pitching the idea to bittorrent tracker authors to include a 
BGP feed and prioritize peers that are in the same ASN as the user 
himself, but they're having performance problems already so they're not so 
keen on adding complexity. If it could be solved better at the client 
level that might help, but the end user who pays flat rate has little 
incentive to help the ISP in this case.
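If it were done client-side, the core of it is small; the prefix-to-ASN table
below is a hard-coded toy stand-in for whatever BGP or routing data the
tracker or client actually had access to:

    from ipaddress import ip_address, ip_network

    PREFIX_TO_ASN = {                        # illustrative data only
        ip_network("192.0.2.0/24"): 64500,
        ip_network("198.51.100.0/24"): 64501,
    }

    def origin_asn(ip):
        addr = ip_address(ip)
        for prefix, asn in PREFIX_TO_ASN.items():
            if addr in prefix:
                return asn
        return None

    def prefer_local(client_ip, peer_ips):
        mine = origin_asn(client_ip)
        return sorted(peer_ips, key=lambda p: origin_asn(p) != mine)  # same ASN first

    print(prefer_local("192.0.2.10", ["198.51.100.5", "192.0.2.99"]))
    # ['192.0.2.99', '198.51.100.5']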


--
Mikael Abrahamsson    email: [EMAIL PROTECTED]


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Ted Hardie

Fred Baker writes:

>Hence, moving a file into a campus doesn't mean that the campus has the 
>file and will stop bothering you. I'm pushing an agenda in the open source 
>world to add some concept of locality, with the purpose of moving traffic 
>off ISP networks when I can. I think the user will be just as happy or 
>happier, and folks pushing large optics will certainly be.

As I mentioned to Fred in a bar once, there is at least one case where you have
to be a bit careful with how you push locality.  In the wired campus case, he's 
certainly
right:  if you have the file topologically close to other potentially 
interested users,
delivering it from that "nearer" source is a win for pretty much everyone.
This is partly the case because the local wired network is unlikely to be 
resource
constrained, especially in comparison to the upstream network links.

In some wireless cases, though, it can be a bad thing.  Imagine for a moment 
that Fred and I are using a p2p protocol while stuck in an airport.  We're 
both looking for the same file.  The p2p network pushes it first to Fred and 
then directs me to get it from him.  If he and I are doing this while we're 
both connected to the same resource-constrained base station, we may actually 
be worse off, as the same base station has to allocate data channels for two 
high data traffic flows while the file passes from him to me.  If I, the 
second user, get the file from outside the pool of devices connected to that 
base station, in other words, then the base station, I, and its other users 
may well be better off.

To put this another way, the end user sees the campus case as a win primarily
because the campus is not resource constrained (at least as compared to its
upstream links).  You can only really expect their cooperation when this is
true.  In cases where their performance is degraded by this strategy, their
interests will run counter to the backbone's interest in removing congestive
flows.  Accounting for that is a good thing.

regards,
Ted


RE: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread michael.dillon

> The TCPs don't slow down. They use the bandwidth you have 
> made available instead.
> 
> in your words, "the traffic on the new circuit is suddenly 
> greater than 100% of the old one".

Exactly!

To be honest, I first encountered this when Avi Freedman upgraded one of
his upstream connections from T1 to DS3, and either Avi or one of his
employees mentioned this on inet-access or NANOG. So I did a bit of
digging and discovered that other people had noticed that TCP traffic
tends to be fractal (or multi-fractal) in nature. That means that the
peaks which cause this effect are hard to get rid of entirely.

> To his surprise, with the standing load > 95% and 
> experiencing 20% loss at 311 MBPS, doubling the rate to 622 
> MBPS resulted in links with a standing load > 90% and 4% 
> loss. He still needed more bandwidth. After we walked 
> offstage, I explained TCP to him...

That is something that an awful lot of operations and capacity planning
people do not understand. They still think in terms of pipes with TCP
flavoured water flowing in them. But this is exactly the behavior that
you would expect from fractal traffic. The doubled capacity gave enough
headroom for some of the peaks to get through, but not enough for all of
them. On Ebone in Europe we used to have 40% as our threshold for
upgrading core circuits. 
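
A toy way to see why headroom rules like that 40% threshold exist: feed a
finite buffer with heavy-tailed (bursty) arrivals and compare the loss at the
old and at the doubled service rate. Everything in this little Python sketch
is invented for illustration; the only point is the qualitative one, that
doubling capacity shrinks the loss a lot but does not eliminate it:

# Toy model: bursty (heavy-tailed) offered load into a finite queue.
# All parameters are invented; only the shape of the result matters.
import random

def loss_rate(service_per_tick, ticks=200_000, queue_limit=100, seed=1):
    random.seed(seed)
    queue, offered, dropped = 0.0, 0.0, 0.0
    for _ in range(ticks):
        arrivals = random.paretovariate(1.5)   # mean ~3 units/tick, heavy tail
        offered += arrivals
        queue += arrivals
        if queue > queue_limit:                # buffer overflows: tail drop
            dropped += queue - queue_limit
            queue = queue_limit
        queue = max(0.0, queue - service_per_tick)
    return dropped / offered

if __name__ == "__main__":
    for rate in (3.5, 7.0):                    # "old" circuit vs. doubled one
        print("service rate %.1f -> loss %.1f%%" % (rate, 100 * loss_rate(rate)))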

> I'm pushing an agenda in the open source world to add  
> some concept of locality, with the purpose of moving traffic off ISP  
> networks when I can. I think the user will be just as happy or  
> happier, and folks pushing large optics will certainly be.

When you hear stories like the Icelandic ISP that discovered P2P was 80% of
its submarine bandwidth and promptly implemented P2P throttling, I think the
open source P2P developers will be driven to add locality by user demand.

--Michael Dillon
 


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-16 Thread Adrian Chadd

On Thu, Aug 16, 2007, [EMAIL PROTECTED] wrote:

> > I'm pushing an agenda in the open source world to add  
> > some concept of locality, with the purpose of moving traffic off ISP  
> > networks when I can. I think the user will be just as happy or  
> > happier, and folks pushing large optics will certainly be.
> 
> When you hear stories like the Icelandic ISP who discovered that P2P was
> 80% of their submarine bandwidth and promptly implemented P2P
> throttling, I think that the open source P2P will be driven to it by
> their user demand. 

.. or we could start talking about how Australian ISPs are madly throttling
P2P traffic. Not just because of its impact on international trunks, but
because their POP/wholesale DSL infrastructure makes P2P performance mostly
horrible, even between clients on the same ISP.




Adrian




Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-17 Thread Alexander Harrowell
On 8/17/07, Adrian Chadd <[EMAIL PROTECTED]> wrote:
>
>
> On Thu, Aug 16, 2007, [EMAIL PROTECTED] wrote:
>
> > > I'm pushing an agenda in the open source world to add
> > > some concept of locality, with the purpose of moving traffic off ISP
> > > networks when I can. I think the user will be just as happy or
> > > happier, and folks pushing large optics will certainly be.


This is badly needed, in my humble opinion. Regarding the wireless LAN case
described, it's true that this behaviour would be technically suboptimal, but
interestingly the real reason for implementing it, economics, would still
hold: the network operator (the owner of the wireless LAN) isn't consuming
any more upstream as a result.

>
> > When you hear stories like the Icelandic ISP who discovered that P2P was
> > 80% of their submarine bandwidth and promptly implemented P2P
> > throttling, I think that the open source P2P will be driven to it by
> > their user demand.


Yes. An important factor in future design will be "network
friendliness/responsibility".

> .. or we could start talking about how Australian ISPs are madly throttling
> P2P traffic. Not just because of its impact on international trunks,
> but their POP/wholesale DSL infrastructure method just makes P2P even
> between clients on the same ISP mostly horrible.


Similar to the pre-LLU BT IPStream operations in the UK. Charging flat rates
to customers while paying per-bit to wholesalers is an obvious economic
problem; it can even be more expensive to localise the p2p traffic, if the
price of wholesale access bits is greater than that of peering/transit ones!


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-17 Thread Sam Stickland


Ted Hardie wrote:

> [Fred Baker's locality proposal and Ted's wired-campus vs. wireless
> base-station example snipped]

A similar (and far more common) issue exists in the UK, where ISPs buy their
DSL 'last mile' connectivity via a BT Central pipe. Essentially, in this
setup BT owns all the exchange equipment and the connectivity back to a
central hand-off location, implemented as an L2TP VPDN. When a DSL customer
connects, their realm is used to route their connection over the VPDN to the
ISP. The physical hand-off point between BT and the ISP is what BT terms a
BT Central Pipe, which is many orders of magnitude more expensive than IP
transit.
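
The steering step itself is conceptually tiny; a rough sketch of realm-based
hand-off in Python, with invented realm and gateway names standing in for the
real wholesale configuration:

# Rough sketch of realm-based steering at the wholesale aggregation point:
# the realm in the PPP username decides which ISP's LNS the L2TP session is
# tunnelled to.  Realm and gateway names are invented for illustration.
HOME_GATEWAYS = {
    "exampleisp.net": "lns1.exampleisp.net",
    "otherisp.example": "lns.otherisp.example",
}

def pick_lns(ppp_username):
    realm = ppp_username.rsplit("@", 1)[-1].lower()
    return HOME_GATEWAYS.get(realm)            # None -> reject or use a default

print(pick_lns("alice@exampleisp.net"))        # -> lns1.exampleisp.net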


In this scenario it's more expensive for the ISP to have a customer retrieve
the file from another customer on their network than it is to go off-net for
the file.


(LLU, where the ISP has installed their own equipment in the exchange,
obviously changes this dynamic.)


S


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-17 Thread Leigh Porter

Sam Stickland wrote:
> [Ted Hardie's wireless base-station example and Sam Stickland's BT Central
> Pipe point snipped]

Also bear in mind that many wireless systems have constrained uplink
capacity and anything P2P can quite happily kill a wireless network by
using up too much uplink resource.

--
Leigh Porter



Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-17 Thread Stephen Wilcox

On Fri, Aug 17, 2007 at 10:54:47AM +0100, Sam Stickland wrote:
> [Ted Hardie's wireless base-station example snipped]
> A similar (and far more common) issue exists in the UK where ISPs are 
> buying their DSL 'last mile' connectivity via a BT central pipe. 
> Essentially in this setup BT owns all the exchange equipment and the 
> connectivity back to a central hand-off location - implemented as a L2TP 
> VPDN. When the DSL customers connects, their realm is used to route 
> their connection over the VPDN to the ISP. The physical hand-off point 
> between BT and the ISP is what BT term a BT Central Pipe, which is many 
> orders of magnitude more expensive than IP transit.
> 
> In this scenario it's more expensive for the ISP to have a customer 
> retrieve the file from another customer on their network, then it is to 
> go off net for the file.

Hey Sam,
 that's an excellent point..

Although I don't think it's unique to UK/BT .. since the last mile is
recognised in most places as the big cost (in the UK it's roughly 100x the
cost of the backbone) .. anything traversing the last mile is undesirable,
especially if it does so twice.

Steve


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-17 Thread Stephen Wilcox

On Thu, Aug 16, 2007 at 09:07:31AM -0700, Hex Star wrote:
>How does akamai handle traffic congestion so seamlessly? Perhaps we should
>look at existing setups implemented by companies such as akamai for
>guidelines regarding how to resolve this kind of issue...

and if you are a Content Delivery Network wishing to use a cache deployment
architecture, you should do just that ... but for networks with big
backbones, as per this discussion, we need to do something else

Steve


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-17 Thread Patrick W. Gilmore


On Aug 17, 2007, at 6:57 AM, Stephen Wilcox wrote:

> On Thu, Aug 16, 2007 at 09:07:31AM -0700, Hex Star wrote:
>> How does akamai handle traffic congestion so seamlessly? Perhaps we
>> should look at existing setups implemented by companies such as akamai
>> for guidelines regarding how to resolve this kind of issue...
>
> and if you are a Content Delivery Network wishing to use a cache
> deployment architecture you should do just that ... but for networks
> with big backbones as per this discussion we need to do something else


Ignoring "Akamai" and looking at just content providers (CDN or  
otherwise) in general, there is a huge difference between telling a  
web server "do not serve more than 900 Mbps on your GigE port", and a  
router which simply gets bits from random sources to be forwarded to  
random destinations.


IOW: Steve is right, those are two different topics.

--
TTFN,
patrick



Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-18 Thread Perry Lorier





Mikael Abrahamsson wrote:
> We've been pitching the idea to bittorrent tracker authors to include a
> BGP feed and prioritize peers that are in the same ASN as the user
> himself, but they're having performance problems already so they're not
> so keen on adding complexity. If it could be solved better at the client
> level that might help, but the end user who pays flat rate has little
> incentive to help the ISP in this case.

Many networking stacks have a "TCP_INFO" ioctl that can be used to query
for more accurate statistics on how the TCP connection is faring (number of
retransmits, TCP's current estimate of the RTT (and jitter), etc).  I've
always pondered whether bittorrent clients could make use of this to better
choose which connections to prefer and which ones to avoid.  I'm
unfortunately unsure if Windows has anything similar.
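
(On Linux those counters are exposed via getsockopt() rather than an ioctl.)
A rough Python sketch of pulling a couple of the fields mentioned above; the
unpack offsets assume the long-standing struct tcp_info layout from
<linux/tcp.h>, so check them against the kernel headers you actually run:

# Rough sketch: read retransmit and RTT estimates for a connected TCP socket
# on Linux via getsockopt(TCP_INFO).  Field offsets assume the classic
# struct tcp_info layout (7 x u8, 1 pad byte, then a run of u32 fields);
# verify against your kernel before relying on them.
import socket
import struct

TCP_INFO = getattr(socket, "TCP_INFO", 11)     # option value 11 on Linux

def tcp_stats(sock):
    raw = sock.getsockopt(socket.IPPROTO_TCP, TCP_INFO, 192)
    fields = struct.unpack("7Bx24I", raw[:104])
    return {
        "retransmits": fields[2],              # tcpi_retransmits (u8)
        "lost":        fields[13],             # tcpi_lost
        "rtt_us":      fields[22],             # tcpi_rtt, microseconds
        "rttvar_us":   fields[23],             # tcpi_rttvar ("jitter")
    }

if __name__ == "__main__":
    s = socket.create_connection(("example.com", 80))
    s.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    s.recv(4096)
    print(tcp_stats(s))
    s.close()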


One problem with clients only being told about clients that are near to them
is that the network starts forming "cliques".  Each clique works as a
separate network, and you can end up with silly things like one clique being
full of seeders and another clique not having any seeders at all.  Obviously
this means that a tracker has to send a handful of addresses of clients
outside the "clique" that the current client belongs to.


You want hosts to talk to peers that are close to them, you want to make
sure that hosts don't form cliques, and you want something that a tracker
can figure out very quickly from information that is easily available to
people who run trackers.  My thought here was to sort all the IP addresses
and send the next 'n' IP addresses after the client's IP, as well as some
random ones.  If we assume that IPs are generally allocated in contiguous
blocks, then clients should generally at least be told about people nearby,
and hopefully those hosts aren't too far apart (at least likely to be within
the same LIR or RIR).  The lookup can be done in O(log n), which should be
fairly efficient.
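
A minimal Python sketch of that sort-and-slice idea, assuming the tracker
keeps its peers as a sorted list of integer-encoded IPv4 addresses; the names
and the counts are illustrative:

# Sketch of "next n addresses after the client, plus some random ones":
# binary search into a sorted list of integer-encoded peer IPs gives the
# O(log n) lookup, and the random tail keeps the swarm from forming cliques.
import bisect
import random
import socket
import struct

def ip_to_int(ip):
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def int_to_ip(n):
    return socket.inet_ntoa(struct.pack("!I", n))

def pick_peers(sorted_peers, client_ip, near=25, far=25):
    """sorted_peers: ascending list of integer-encoded IPv4 peer addresses."""
    client = ip_to_int(client_ip)
    i = bisect.bisect_right(sorted_peers, client)      # O(log n)
    nearby = [p for p in sorted_peers[i:i + near] if p != client]
    rest = [p for p in sorted_peers if p not in nearby and p != client]
    extra = random.sample(rest, min(far, len(rest)))
    return [int_to_ip(p) for p in nearby + extra]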


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-19 Thread Joe Provo

On Thu, Aug 16, 2007 at 10:55:59PM +0200, Mikael Abrahamsson wrote:
[snip]
> We've been pitching the idea to bittorrent tracker authors to include a 
> BGP feed and prioritize peers that are in the same ASN as the user 
> himself, but they're having performance problems already so they're not so 
> keen on adding complexity. If it could be solved better at the client 
> level that might help, but the end user who pays flat rate has little 
> incentive to help the ISP in this case.
 
Some of those maligned middleboxes deployed in last-mile networks don't
just throttle, but also use available topology data to optimize locality
per the ISP's policies.

-- 
 RSUC / GweepNet / Spunk / FnB / Usenix / SAGE


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-19 Thread Mikael Abrahamsson


On Sun, 19 Aug 2007, Perry Lorier wrote:

> Many networking stacks have a "TCP_INFO" ioctl that can be used to query
> for more accurate statistics on how the TCP connection is fairing (number
> of retransmits, TCP's current estimate of the RTT (and jitter), etc).
> I've always pondered if bittorrent clients made use of this to better
> choose which connections to prefer and which ones to avoid.  I'm
> unfortunately unsure if windows has anything similar.


Well, by design bittorrent will try to get everything as fast as possible
from all peers, so any TCP session giving good performance (often meaning
low packet loss and low latency) will end up transmitting a lot of the data
in the torrent. So bittorrent is already somewhat localised by design, at
least in the sense that it will utilize fast peers more than slower ones,
and those are normally closer to you.


> One problem with having clients only getting told about clients that are
> near to them is that the network starts forming "cliques".  Each clique
> works as a separate network and you can end up with silly things like one
> clique being full of seeders, and another clique not even having any
> seeders at all.  Obviously this means that a tracker has to send a handful
> of addresses of clients outside the "clique" network that the current
> client belongs to.


The idea we pitched was that, of the 50 addresses the tracker returns to the
client, 25 (if possible) should be from the same ASN as the client itself,
or a nearby ASN (by some definition). If there are a lot of peers (more than
50) the tracker normally returns a random set of clients; we wanted 25 of
them to be chosen by network proximity (by some definition) rather than at
random.
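
The selection policy itself is easy to sketch. Below, same_asn() stands in
for whatever BGP-derived proximity test the tracker would use (a
prefix-to-ASN lookup, say); the function and the 50/25 defaults are
illustrative assumptions, not existing tracker code:

# Sketch of the tracker's peer-return policy: of (up to) `total` returned
# peers, prefer `local_target` that pass a "nearby" test and fill the rest
# randomly.  same_asn(client_ip, peer_ip) is assumed to exist elsewhere.
import random

def peers_for(client_ip, all_peers, same_asn, total=50, local_target=25):
    local  = [p for p in all_peers if same_asn(client_ip, p)]
    remote = [p for p in all_peers if not same_asn(client_ip, p)]
    random.shuffle(local)
    random.shuffle(remote)
    chosen = local[:local_target]
    chosen += remote[:total - len(chosen)]
    if len(chosen) < total:                    # not enough remote peers: pad
        chosen += local[local_target:local_target + (total - len(chosen))]
    return chosen[:total]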


> You want to make hosts talk to people that are close to you, you want to
> make sure that hosts don't form cliques, and you want something that a
> tracker can very quickly figure out from information that is easily
> available to people who run trackers.  My thought here was to sort all the
> IP addresses, and send the next 'n' IP addresses after the client IP as
> well as some random ones.  If we assume that IP's are generally allocated
> in contiguous groups then this means that clients should be generally at
> least told about people nearby, and hopefully that these hosts aren't too
> far apart (at least likely to be within a LIR or RIR).  This should be
> able to be done in O(log n) which should be fairly efficient.


Yeah, we discussed keeping the list of IPs sorted (using insertion sort) in
the tracker's data structures already, so what you're describing is one way
of defining proximity that, as you say, would probably be quite efficient.


--
Mikael Abrahamsson    email: [EMAIL PROTECTED]


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-21 Thread Mikael Abrahamsson


On Tue, 21 Aug 2007, Alexander Harrowell wrote:


> This is what I eventually upshot..
>
> http://www.telco2.net/blog/2007/08/variable_speed_limits_for_the.html


You wrote in your blog:

"The problem is that if there is a major problem, very large numbers of 
users applications will all try to resend; generating a packet storm and 
creating even more congestion."


Do you have any data/facts to back up this statement? I'd be very interested
to hear them, as I have heard this statement a few times before but it
contradicts the way I understand things to work.


--
Mikael Abrahamsson    email: [EMAIL PROTECTED]


Re: Extreme congestion (was Re: inter-domain link recovery)

2007-08-21 Thread Alexander Harrowell
This is what I eventually upshot..

http://www.telco2.net/blog/2007/08/variable_speed_limits_for_the.html

On 8/19/07, Mikael Abrahamsson <[EMAIL PROTECTED]> wrote:
> [full quote of Mikael Abrahamsson's 2007-08-19 message snipped]