Re: jumbo frame of GbE and IPv6 -- A proposal

Perry Lorier Wed, 27 Jul 2005 03:59:42 -0700

[Bugger! Lost the reply I was writing for this! ]

>> Packet rate, however, starts becoming a problem at faster speeds; at GigE
>> it starts becoming a problem for hosts to deal with unless they are
>> careful.  And not all networks are fast; 3G networks are becoming more
>> prevalent.  We should not waste resources needlessly :)
> 
> 
> Well, the places where jumboframes are worth the trouble are also the
> places where a handful of packets won't make a difference. I'm not sure
> how fast 3G is, but I believe not more than a few Mbps, so jumboframes
> really aren't very useful there because they occupy the channel for too
> long. Doubly so on radio networks with their high bit error rates.


Good point :)

>>>> What happens on l2's where not every node can see every other node?
>>> Neighbor discovery fails?
>> Host A can talk to Host B ok.
>> Host A can talk to Host C ok.
>> Host B can't talk to Host C.
> 
>> This happens in ad hoc wireless networks.  With your system I'm not
>> entirely sure how you deal with whose "turn" it is next if not all nodes
>> can see all other nodes.  Host A should still be able to talk to Host B.
> 
> Well, a simple way to decide could be a log of the difference in MAC
> address. So after host 20 sends its packet, host 28 would wait for 3
> seconds and host 36 for 4 seconds. But host 36 hears host 28 and resets
> its timer to 3 seconds. If hosts 28 and 36 can't hear each other, host
> 36 will send its packet 1 second after host 28 rather than 3 seconds.
> No big deal.

Yeah, but if this happens a lot you will end up with lots of hosts
transmitting close together.  Maybe it's not a problem, as you suggest,
but I can't help feeling that this is a bit too complicated.
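
To make the timer arithmetic in the quoted example concrete, here is a rough Python sketch. The function name is made up, and base 2 for the log is my assumption; it is the base that matches the 3 and 4 second figures quoted above.

```python
import math


def backoff_seconds(last_sender: int, me: int) -> float:
    """Wait before sending: log2 of the address difference (base 2 is assumed)."""
    diff = abs(me - last_sender)
    # A difference of zero would mean we were the last sender; no wait is defined.
    return math.log2(diff) if diff > 0 else 0.0


# Host 20 has just sent its packet.
wait_28 = backoff_seconds(20, 28)    # log2(8)
wait_36 = backoff_seconds(20, 36)    # log2(16)

# If host 36 hears host 28 send, it restarts its timer relative to host 28.
rewait_36 = backoff_seconds(28, 36)  # log2(8)

# If host 36 cannot hear host 28, it keeps its original timer, so the two
# packets end up this many seconds apart.
gap_when_hidden = wait_36 - wait_28

print(wait_28, wait_36, rewait_36, gap_when_hidden)
```

The hidden-node case is the last line: the gap shrinks from the intended reset value to the difference of the two original timers.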

>>>> If A and B are talking to each other and C and D are talking to each
>>>> other, why do (A and B) need to talk to C and D?
> 
>>> Ah, but how do you know that A doesn't talk to D, and is never going
>>> to?
> 
> 
>> How do you know it will in the time before the topology of the network
>> changes?  Given that the topology of the network changes every time a
>> host comes and goes, the chance that you'll want to talk to most of the
>> users during that time is rather low.
> 
> Look at it this way: if two routers send out RAs every 10 seconds, 
> that's one packet every 5 seconds. If 60 hosts all send one packet 
> every five minutes, that's also one packet every 5 seconds.

If they all send in the same 5 s, though, you'll melt your queues :)
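
The arithmetic behind both points, as a quick Python sketch (the names are mine; the numbers are the ones quoted above, and "all sixty in one 5 s window" is simply the worst case I have in mind):

```python
def mean_interval(senders: int, period_s: float) -> float:
    """Average gap between packets when each sender sends once per period."""
    return period_s / senders


routers = mean_interval(2, 10)      # two routers, one RA every 10 s each
hosts = mean_interval(60, 5 * 60)   # sixty hosts, one packet every 5 min each

# Worst case: all sixty hosts fire inside the same 5 s window.
burst_rate = 60 / 5                 # packets per second during that window
burst_factor = burst_rate * hosts   # how many times the evenly spread rate

print(routers, hosts, burst_rate, burst_factor)
```

The averages match, but the average says nothing about how the packets are spread over the five minutes.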

How long will this take to converge?  Remember that you have to start
again as soon as the topology changes (e.g. a host is added or removed).

The advantage of this part of your scheme is that it does give you
MTUs that you can use for multicast.

>> I'd start at the minimum "MTU" size.
> 
> Yes, I thought about this and first trying a 1508 byte packet makes 
> sense: if jumboframes don't work, you've wasted as little time and 
> bandwidth as possible. If they do work, you've only wasted 1508 bytes.

I'd start with 1500; check your assumptions.  In fact, it might be worth
starting at 1280 and, if that fails, logging a message that the L2 is broken.

Enough networks out there use tunnels of one kind or another, leaving the
L2 MTU slightly smaller than you would expect, that it is worthwhile to
at least check.

>> A colleague of mine (Matthew Luckie) has done some research into path
>> MTUs.  He has a work-in-progress paper (
>> http://www.wand.net.nz/~mjl12/debugging-pmtud.pdf ) where he enumerates
>> all the common MTUs he's seen on the Internet.
> 
> And reaches a very interesting conclusion! Exchanging per-neighbor MTUs
> would really help here.

Exactly.  :)

>> I'd start with a similar table, trying the lowest size and sending that;
>> if it's received, try the next lowest size and so on until you don't get
>> a reply.  When you don't get a reply, try the previous-mtu-that-worked+1;
>> if that succeeds, start a binary search between previous-mtu-that-worked
>> and the one that didn't.
> 
> I partially agree. If you're at a well known boundary and want to
> search upward, it makes sense to try that well known boundary + minimum
> increment (I say: 4) first. That way, if you can't go beyond the
> current boundary, you know so immediately. Next is the highest possible
> value. If you can use that one, you're done.
> 
> But if previous low + minimum works but maximum doesn't, a mostly
> binary search still makes sense. However, it could be a "hinted" binary
> search. For instance, if you're searching between 1508 and 9000 (with
> the target being 4464) a strict binary search would do:
> 
> 1  1508 yes
> 2  9000 no
> 3  5252 no
> 4  3380 yes
> 5  4316 yes
> 6  4784 no
> 7  4548 no
> 8  4432 yes
> 9  4488 no
> 10 4460 yes
> 11 4472 no
> 12 4464 yes
> 13 4468 no
> 
> A hinted binary search could be:
> 
> 1  1508 yes
> 2  9000 no
> 3  4470 no (closest value to binary 5252 target)
> 4  2048 yes (closest value to binary 2988 target)
> 5  2052 yes (see if 2048 was our limit)
> 6  4352 yes (closest value to binary 3260 target)
> 7  4356 yes (see if 4352 was our limit)
> 8  4464 yes (closest value to binary 4412 target)
> 9  4468 no (4464 was our limit)
> 
> Note that although the second variant is faster overall, the first one
> finds a reasonable candidate (that can already be used at that point)
> at try 5, and the second one at try 6.
> 
> In this case your serial bottom-to-top search would probably be a bit 
> faster, but it has two disadvantages: it takes a long time to find a 
> high MTU, and it's not good at finding non-standard MTUs.
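
As a sanity check on the strict variant quoted above, here is a minimal Python sketch. The function name, the `works` callback (a stand-in for sending a probe and waiting for the reply) and the rounding of each midpoint down to a multiple of the 4-byte increment are my assumptions, not part of the proposal. It also assumes both bounds are already multiples of the increment.

```python
def strict_binary_search(low, high, works, step=4):
    """Probe low, then high, then bisect; return (largest working size, probes)."""
    probes = []

    def probe(size):
        ok = works(size)
        probes.append((size, ok))
        return ok

    if not probe(low):
        return None, probes      # even the lower bound fails
    if probe(high):
        return high, probes      # the maximum works, nothing to search
    while high - low > step:
        mid = (low + high) // 2
        mid -= mid % step        # keep probes on the 4-byte grid (assumed)
        if probe(mid):
            low = mid
        else:
            high = mid
    return low, probes


# Simulated link that passes anything up to 4464 bytes.
found, probes = strict_binary_search(1508, 9000, lambda size: size <= 4464)
for number, (size, ok) in enumerate(probes, start=1):
    print(number, size, "yes" if ok else "no")
```

With those assumptions the probe order comes out the same as the thirteen tries in the quoted list.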

Mine would do:

1  1514 yes
2  1536 yes
3  2002 yes
4  2048 yes
5  4352 yes
6  4464 yes
7  4470 no
8  4465 no
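
Roughly, in Python. This is only a sketch: the function name and the `works` callback (a stand-in for "send a probe, wait for the reply") are made up, and the table holds just the sizes from the list above, not the full table from Matthew's paper.

```python
# Sizes taken from the example above only; a real table would be longer.
COMMON_MTUS = [1514, 1536, 2002, 2048, 4352, 4464, 4470]


def table_search(table, works):
    """Ramp up through the table, then test +1, then bisect as a last resort."""
    probes = []

    def probe(size):
        ok = works(size)
        probes.append((size, ok))
        return ok

    best = failed = None
    for size in table:
        if probe(size):
            best = size
        else:
            failed = size
            break
    if best is None or failed is None:
        return best, probes      # nothing worked, or the whole table worked
    if failed - best == 1 or not probe(best + 1):
        return best, probes      # a common MTU: two timeouts at most
    low, high = best + 1, failed
    while high - low > 1:        # non-standard MTU: plain binary search
        mid = (low + high) // 2
        if probe(mid):
            low = mid
        else:
            high = mid
    return low, probes


found, probes = table_search(COMMON_MTUS, lambda size: size <= 4464)
for number, (size, ok) in enumerate(probes, start=1):
    print(number, size, "yes" if ok else "no")
```

The binary search branch only runs when the +1 probe succeeds, which is the non-standard MTU case.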

Now, if you assume a timeout is at least 2 RTTs (and I suspect it
should probably be more), mine takes 6 RTTs to find the right MTU and
10 RTTs to prove it.

Your first one takes 18 RTTs to find it and 20 RTTs to prove it.
Your second (hinted) one takes 10 RTTs to find it and 12 RTTs to
prove it.
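
The counting rule, as a small Python sketch so anyone can redo the sums. The names are mine; the rule is the one stated above (a reply costs 1 RTT, a timeout costs 2 RTTs), and "found" is taken to mean the moment the last successful probe returns, which is the 4464 probe in all three sequences.

```python
def rtt_cost(results, timeout_rtts=2):
    """Total cost in RTTs: 1 per answered probe ('Y'), timeout_rtts per 'N'."""
    return sum(1 if r == "Y" else timeout_rtts for r in results)


def find_and_prove(results, timeout_rtts=2):
    """Return (RTTs until the last success, RTTs until the search ends)."""
    last_yes = results.rindex("Y")
    return (rtt_cost(results[:last_yes + 1], timeout_rtts),
            rtt_cost(results, timeout_rtts))


# Yes/no outcomes copied from the three probe lists, in order.
sequences = {
    "strict binary": "YNNYYNNYNYNYN",
    "hinted binary": "YNNYYYYYN",
    "table ramp-up": "YYYYYYNN",
}

for name, results in sequences.items():
    find, prove = find_and_prove(results)
    print(f"{name}: find in {find} RTTs, prove in {prove} RTTs")
```

Raising `timeout_rtts` penalises every "no", so the searches that hit more timeouts suffer most.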

As the MTUs get larger yours gets faster (since it is less likely to
hit a "no") and mine gets slower.

Mine is slow if there is a non-standard MTU (which I suspect would be
rare, but anyway).  But once it has found a non-standard MTU, that value
can be added to the table and it'll be found quickly next time.

>> For a "common" MTU, you only have to endure two timeouts (the next
>> highest common MTU, and the +1 test). For an uncommon MTU you can
>> quickly increase the MTU to the maximum "common" MTU that's lower than
>> your MTU, and endure the timeouts from then on.
> 
> Note that with a 100 ms timeout (more than enough) you're done in less
> than 2 seconds worst case.

Yup, but it makes more sense to talk about RTT's than seconds :)


>> If instead of using special "ICMP MTU Probes" we use "ICMP Echo Request"
>> / "ICMP Echo Reply" messages, no changes to any packet formats are
>> needed; all that needs to be done is to implement it in a TCP/IP stack,
>> and the concept is even reusable for IPv4.  Other hosts don't even have
>> to be upgraded to support this either.  Magic!
> 
> You mean, rely just on ICMP and not announce a bigger MTU in RAs?

yeah pretty much.

> I guess you're right, but I wouldn't want to be a 10 Mbps host in an
> otherwise 64k jumbo-enabled network, because all those probes would eat
> up my bandwidth even though I can't successfully receive them.

You'd get one probe when you talked to a new host, which would
immediately fail.

> Also, I think we want to be nicer to on-link probers than off-link 
> ones, especially with these large packets.

Indeed.  The problem with probing for large MTU's is that the probes
themselves are large :)

>> Stacks would be free to do as you suggest (doing a binary search) or as
>> I suggest (ramp up and do a binary search only as a last resort).
> 
> Yes, this can be left up to the implementers.
> 
>> So the general approach would be:
>> * If a packet arrives from a host and is larger than the cached MTU for
>> that neighbour, increase the cached MTU to the size of that packet.
> 
> Not sure if we want to do this check for every packet. 

It's a simple test, and besides, you don't have to do it for every
packet, just the packets that matter :)

> Also, an
> attacker could fake the packet in order to do an "MTU attack" on a 
> non-jumbo enabled host.

Yes, this is a problem, but then again, presumably they can forge the NS
reply anyway.  All that this means is that you need to sign your packets.

>> * When receiving a ND (but not a NS!), and you have no cached MTU for
>> that neighbour, you start the MTU discovery process, using any mechanism
>> for selecting the packet sizes the implementation deems appropriate (i.e.
>> either yours, or mine, or if someone can come up with a method that's
>> even better than ours, they could use that!)
> 
> With an MTU option in it. 

Yeah, that's probably wise, although not necessary.  You could always
probe and pray.

> And why not NS?

Because when A talks to B, you want A to do the MTU discovery and for B
to "learn" the MTU too, but you don't want both sending MTU probes, only
one of them needs to do so.
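
A rough Python sketch of the two rules, to show who probes and who merely learns. Everything here is hypothetical: the class and method names, the `start_discovery` callback (a stand-in for whichever probing strategy the stack uses), and my reading that the "ND" in the rule means the reply to a solicitation. The 1280 floor is the IPv6 minimum MTU, so small packets never create a cache entry.

```python
IPV6_MIN_MTU = 1280


class NeighbourMtuCache:
    def __init__(self, start_discovery):
        self.mtu = {}                         # neighbour -> cached MTU
        self.start_discovery = start_discovery

    def on_packet(self, neighbour, size):
        """Rule 1: a larger packet from a neighbour raises its cached MTU."""
        if size > self.mtu.get(neighbour, IPV6_MIN_MTU):
            self.mtu[neighbour] = size

    def on_solicitation(self, neighbour):
        """Receiving a solicitation never triggers probing."""

    def on_reply(self, neighbour):
        """Rule 2: the side that gets the reply probes, if nothing is cached."""
        if neighbour not in self.mtu:
            self.mtu[neighbour] = self.start_discovery(neighbour)


# A wants to talk to B; the link between them passes 4464 bytes.
probes_sent = []


def fake_discovery(neighbour):
    probes_sent.append(neighbour)
    return 4464                               # pretend the search ended here


a = NeighbourMtuCache(fake_discovery)
b = NeighbourMtuCache(fake_discovery)

b.on_solicitation("A")                        # B is solicited: no probing
a.on_reply("B")                               # A gets the reply: A probes
b.on_packet("A", 4464)                        # B sees A's largest probe arrive

print(a.mtu, b.mtu, probes_sent)
```

Only one discovery run happens, and both ends finish with the same cached value.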

> 
>>> No, the announcement "the switch can handle 4500 bytes" wouldn't have
>>> anything to do with "I can handle 1500".
> 
> 
>> Which switch?  I live in a flat with 3 other people, we have at least 4
>> devices that act like switches on one segment (2 switches, a VoIP
>> phone (you can daisy chain a PC off it), and an AP).  I have no idea
>> what the maximum MTU of all those switches is.
> 
> 
> If all of those switches announce their MTU, we're in business.

If we're making fanciful wishes, can I have a million dollars? and a pony?

While having switches announce MTUs is possible, I don't think you'll
get switch manufacturers to agree, and even if they do, the old and/or
cheap switches won't announce it, leaving you to probe anyway.  And even
if all the switches announce their MTU, how do you know which subset of
switches you're going through to get to the other end?

> On the other hand, if we do an MTU search we don't need this 
> information because we'll find out ourselves.

Exactly.

> If we don't do an MTU search and the switches don't announce their MTU,
> you're probably not going to use jumboframes on such a network...

And you fall back to 1280 (or some other link-defined minimum such as 1500).

>>> It would be even better if we could ask the switch what our port
>>> supports, but I'm not sure how to do this without a switch that
>>> doesn't support this protocol flooding the request and making the
>>> results meaningless.
> 
>> Hrm, so Ethernet has capability negotiation (which is how speed, duplex,
>> pause frame support etc. are negotiated).  I have no idea whether it says
>> if the switch supports jumbo frames; IEEE specs make my head hurt.
> 
> 
> Autonegotiation only does 16 bits or something like that, no room to 
> include the MTU there. 

Plenty of room for a "1500, small, medium or large jumbo frames" tho :)

> Gigabit does have some in-band stuff like flow
> control, maybe that can be reused. 

PAUSE frames yeah, not particularly useful but perhaps doable :)

> But you always run the risk that a
> dumb switch just forwards those packets and screws up the negotiation.

Yup.  You want to probe to find out.  I'm beginning to think you want to
probe point-to-point links too, to discover whether they have any weird
limitations that they aren't announcing.


--------------------------------------------------------------------
IETF IPv6 working group mailing list
ipv6@ietf.org
Administrative Requests: https://www1.ietf.org/mailman/listinfo/ipv6
--------------------------------------------------------------------
