[...Lots of good stuff deleted to get to this point...]
On Wed, 15 Aug 2007, Fred Baker wrote:
So I would suggest that a third thing that can be done, after the other two
avenues have been exhausted, is to decide to not start new sessions unless
there is some reasonable chance that they will be able to accomplish their
work. This is a burden I would not want to put on the host, because the
probability is vanishingly small - any competent network operator is going to
solve the problem with money if it is other than transient. But from where I
sit, it looks like the "simplest, cheapest, and most reliable" place to
detect overwhelming congestion is at the congested link, and given that
sessions tend to be of finite duration and present semi-predictable loads, if
you want to allow established sessions to complete, you want to run the
established sessions in preference to new ones. The thing to do is delay the
initiation of new sessions.
I view this as part of the flash crowd family of congestion problems, a
combination of a rapid increase in demand and a rapid decrease in
capacity. But instead of targeting a single destination, the impact is
across multiple networks in the region.
In the flash crowd cases (including DDOS variations), the place to respond
(note the change in wording from "detect" to "respond") to extreme
congestion does not seem to be at the congested link but several hops
upstream of it. Current "effective practice" seems to be 1-2 ASNs away
from the congestion/failure point, but that may simply be the distance
needed to reach an "effective" ISP backbone engineer response.
If I had an ICMP that went to the application, and if I trusted the
application to obey me, I might very well say "dear browser or p2p
application, I know you want to open 4-7 TCP sessions at a time, but for the
coming 60 seconds could I convince you to open only one at a time?". I
suspect that would go a long way. But there is a trust issue - would
enterprise firewalls let it get to the host, would the host be able to get it
to the application, would the application honor it, and would the ISP trust
the enterprise/host/application to do so? is ddos possible? <mumble>
For the malicious DDOS, of course we don't expect the hosts to obey.
However, in the more general flash crowd case, I think the expectation of
hosts following the RFC is pretty strong, although it may take years for
new things to make it into the stacks. It won't slow down all the
elephants, but maybe it can turn the stampede into just a rampage. And
the advantage of doing it in the edge hosts is that their scale grows
with the Internet.
But even if the hosts don't respond to the back-off, it would give the
edge more in-band trouble-shooting information. For example, ICMP
"Destination Unreachable - Load shedding in effect. Retry after 'N'
seconds" (where N is stored like the Next-Hop MTU). Sending more packets
to signal congestion just makes congestion worse. However, an explicit
Internet "busy signal" is mostly there to help network operators,
because firewalls will probably drop those ICMP messages just like PMTU.
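To make the "busy signal" idea concrete, here is a sketch of what such a
message might look like on the wire, reusing the Destination Unreachable
format where RFC 1191 stores the Next-Hop MTU (bytes 6-7 of the ICMP
header). The code value 16 is invented for illustration; no such code is
assigned.

```python
import struct

def icmp_checksum(data: bytes) -> int:
    """Standard Internet checksum (RFC 1071) over an ICMP message."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data)//2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_busy_signal(retry_after_secs: int, original_datagram: bytes) -> bytes:
    """Build a hypothetical ICMP 'load shedding' message.

    Type 3 (Destination Unreachable) with an invented code 16, carrying
    the Retry-After seconds in the 16-bit field that RFC 1191 uses for
    the Next-Hop MTU (bytes 6-7), followed by the leading bytes of the
    offending datagram, as usual for ICMP errors (RFC 792).
    """
    icmp_type, icmp_code = 3, 16   # code 16 is NOT assigned; illustration only
    unused = 0
    header = struct.pack("!BBHHH", icmp_type, icmp_code, 0,
                         unused, retry_after_secs)
    body = original_datagram[:28]  # IP header + first 8 bytes of payload
    cksum = icmp_checksum(header + body)
    return struct.pack("!BBHHH", icmp_type, icmp_code, cksum,
                       unused, retry_after_secs) + body
```

As with PMTU, the receiver would read N out of the same fixed position in
the header, so no new message length or option parsing is needed.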
So plan B would be to in some way rate limit the passage of TCP SYN/SYN-ACK
and SCTP INIT in such a way that the hosed links remain fully utilized but
sessions that have become established get acceptable service (maybe not great
service, but they eventually complete without failing).
This would be a useful plan B (or plan F - when things are really
FUBARed), but I still think you need a way to signal it upstream 1 or 2
ASNs from the extreme congestion to be effective. For example, a BGP
announcement could say: for all packets toward network w.x.y.z carrying
community a, implement back-off queue plan B. Probably not a queue per
network in backbone routers, just
one alternate queue plan B for all networks with that community. Once
the origin ASN feels things are back to "normal," they can remove the
community from their BGP announcements.
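The control-plane half of that reduces to a very small decision, sketched
below. The community value 65000:911 is invented for illustration; a real
deployment would express this in router policy configuration, not code.

```python
# Hypothetical community meaning "apply back-off queue plan B to traffic
# toward this prefix" -- the value 65000:911 is invented for illustration.
BACKOFF_COMMUNITY = "65000:911"

def prefixes_under_plan_b(received_routes):
    """Given BGP routes as (prefix, [communities]) pairs, return the set
    of destination prefixes whose traffic should use the single shared
    alternate queue plan B -- one plan for all tagged prefixes, not a
    queue per network."""
    return {prefix for prefix, communities in received_routes
            if BACKOFF_COMMUNITY in communities}

routes = [
    ("192.0.2.0/24", ["65000:911", "65000:100"]),  # origin ASN signals distress
    ("198.51.100.0/24", ["65000:100"]),            # normal announcement
]
```

Once the origin ASN withdraws the community, the prefix simply drops out
of the set and its traffic returns to the normal queues.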
But what should the alternate queue plan B be?
Probably not fixed capacity numbers, but a distributed percentage across
different upstreams.
Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc) 1% queue
Datagram protocol packets (UDP, ICMP, GRE, etc) 20% queue
Session protocol established/finish packets (TCP ACK/FIN, etc) normal queue
That values session-oriented protocols more than datagram-oriented
protocols during extreme congestion.
Or would it be better to let the datagram protocols fight it out with the
session-oriented protocols, just like normal Internet operations?
Session protocol start packets (TCP SYN/SYN-ACK, SCTP INIT, etc) 1% queue
Everything else (UDP, ICMP, GRE, TCP ACK/FIN, etc) normal queue
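The two candidate plans could be prototyped with a classifier along these
lines. The protocol numbers and TCP flag bits are the real IANA/RFC 793
values; the function name, the separate_datagram_queue switch, and the
queue shares themselves are assumptions matching the percentages above.

```python
TCP, UDP, ICMP, GRE, SCTP = 6, 17, 1, 47, 132  # IANA protocol numbers
SYN = 0x02                                      # TCP SYN flag bit

def classify(proto, tcp_flags=0, sctp_init=False, separate_datagram_queue=True):
    """Map a packet to a queue under the two plans discussed above.

    With separate_datagram_queue=True: session-start packets -> 1% queue,
    datagram protocols -> 20% queue, established-session packets -> normal.
    With separate_datagram_queue=False: only session-start packets are
    squeezed to 1%; everything else shares the normal queue.
    """
    # SYN bit is set in both SYN and SYN-ACK, so one test covers both.
    is_start = (proto == TCP and tcp_flags & SYN) or (proto == SCTP and sctp_init)
    if is_start:
        return "1%"
    if separate_datagram_queue and proto in (UDP, ICMP, GRE):
        return "20%"
    return "normal"
```

Note that established-session packets land in the normal queue under
either plan; the two plans differ only in whether datagram traffic gets
its own 20% share or competes as usual.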
And finally why only do this during extreme congestion? Why not always
do it?