James Carlson wrote:
>> We are introducing a new RTF_INDIRECT in <net/route.h>. This flag is useful
>> for routing daemons that do BGP plus OSPF/IS-IS since it can make handling
>> routing changes a lot more efficient.
>
> Great to hear, but can we have some detail on how to use this?  Is it
> the network routes from BGP that must use this flag (to make the
> specified next hop address be "indirect"), or is it a flag set by the
> BGP (or even the IGP) to specify that a given entry is a special "host
> route" intended as a target for BGP routes?
The use of RTF_INDIRECT is optional. No changes to routing daemons are
required. However, routing daemons can be modified to take advantage of
indirect routes.

> What exactly does the flag do?  How is a route with this flag
> different?

Two things differ in the externally visible behavior:
- When the route is added, the kernel does not require that the gateway
  be directly reachable. If RTF_INDIRECT is not set, the gateway must be
  directly reachable, as is the case in Solaris today.
- When we look up a route and first find an indirect route, we iterate
  to look up the gateway in that route. (Today, if we first find an
  offlink/offsubnet route, i.e., one with RTF_GATEWAY set, we iterate by
  looking up its gateway. The indirect route in essence adds one more
  step to this iterative process.)

From an implementation perspective there is logic to ensure that the
caching we do for performance preserves the above visible behavior as
routes are added and deleted; that is, the caching stays consistent with
the collection of forwarding table entries.

Do you think we should document this somewhere? If so, where? route(1m)?

>> We are adding an informational RTF_KERNEL flag for routes, for instance
>> interface routes, that are added by the kernel as part of configuring an IP
>> interface. Such routes cannot be accidentally deleted by applications.
>
> RTF_KERNEL doesn't appear for ICMP redirect, even though that's
> processed by the kernel, right?

Correct. The "as part of configuring an IP interface" is key. I don't
know if a different name for the flag would make this clearer.

>
>> | EXPER_IP_DCE | Uncommitted | <inet/mib2.h> |
>
> There's probably a new MIB data structure that goes with this symbol.

Yes, that structure is named dest_cache_entry_t and is also uncommitted.

>> No IRE_CACHE (UHA) entries will appear in netstat -ra, since the
>> implementation no longer has IRE_CACHE entries.
>> This project adds a new IRE_IF_CLONE type of routes.
>> Those routes appear in
>> netstat -ra (but not without the 'a' option) with the new 'C' flag.
>
> Are these like the BSD clone entries (i.e., effectively ARP cache
> entries, with *only* on-link addresses represented), or are they
> something else?

I do not know the details of the BSD code, but the IRE_IF_CLONE entries
do not contain any ARP information. They are merely a place to hang a
reference to the ARP information.

For example, if we have an interface route for a /23
	129.146.228.0	129.146.228.81	U	1	5	bge0
and we are sending packets to 129.146.228.1, the kernel would create an
IRE_IF_CLONE for 129.146.228.1/32, and as the ARP entry gets created
there would be a direct pointer from that IRE to the ARP information.
This is beneficial for performance since we avoid repeated lookups of
the ARP information; we can always place a direct pointer to the ARP
information in the IRE that matches.

>> The new implementation no longer has an ire_max_frag field, hence the
>> Maxfrg/PMTU column in the netstat -rv output is no longer useful. We are
>> removing that output. (Note that the details of the netstat output are
>> not a stable interface.)
>>
>> Currently Solaris handles the IP interface MTU in odd ways in that it can
>> be set differently per local IP address prefix; this leaves it quite
>> undefined what MTU is applied to multicast packets.
>> This project fixes that by applying the IP interface MTU per interface.
>> As a result ifconfig bge0:N mtu 1400 will fail with EINVAL.
>
> There's certainly been some customer confusion around routes and
> addresses with specified MTU.  It's sometimes the case that users must
> deal with remote networks that have restricted MTUs *but* that don't
> support PMTU properly.
>
> What should those users be doing?  (Reporting bugs against our PMTU
> implementation?  Asking for an RFE?  Yelling at the admin for that
> remote network?)
I think it is overdue that we implement RFC 4821 for TCP and SCTP, which
would make PMTU discovery more robust overall.

If there are remote networks with different PMTUs, then presumably the
customers are either
1) clamping down the MTU for their interfaces, i.e., reducing the MTU
   for everybody, or
2) using route(1m) to set a specific mtu for a particular route.
Both of those are unaffected by this project.

>> ENXIO will be returned to the sendto() system call. And received packets
>> will be dropped since they can't possibly match the interface index
>> specified in the IP_BOUND_IF when the interface has been unplumbed.
>> However, when the IP address (or interface index) which was used by the
>> application reappears, then the application's settings will be fully
>> functional again.
>
> Interface indices can't normally disappear and then reappear, can
> they?  (If they can, then I'd call that a "bug" rather than a
> "feature."  SNMP requires that ifIndex is unique for a given engine
> invocation.)

This is outside the scope of this case; this case merely makes sure the
system is at least as robust as today should the existing SIOCSLIFINDEX
be used to change the ifindex.

Until relatively recently I thought we could actually remove the ability
to set the interface index (essentially removing SIOCSLIFINDEX). But
I've seen a legitimate case where a customer was using it (I can't
recall the details).

>> The project extends the kernel's ability to handle multiple routes for
>> the same prefix; currently the kernel only does some form of round robin
>> for default routes, and the project extends that to all off-link routes
>> (default, prefix, and host routes). We are adding an undocumented knob
>> should there be a reason to switch back to the old behavior in the field.
>
> Yay!
>
> What happens on forwarding?  Round-robin is a bad answer for
> forwarding, as it ends up reordering packets.  ECMP is a better answer
> for that case.
> (I agree that round-robin is best _if_ you can
> preserve state ... local connections can do that, but forwarding cases
> generally cannot.)

For forwarding, things are very simplistic in the current project gate.
A given route will select a single outgoing nexthop/interface from the
ECMP routes. Thus the only way to get ECMP behavior in the forwarding
path is to have separate routes which use the same gateway, where ECMP
routes are found when looking up that gateway. A contrived example of
this is the following set of routes:
	A/n		10.0.0.1	UG
	B/m		10.0.0.1	UG
	10.0.0.0/24			U	bge0
	10.0.0.0/24			U	bge1
In the example, A might pick the first U route, and B might pick the
second. (Similar examples can be constructed with indirect routes.)
Thus we do not do anything special should there be multiple routes for
A/n. It wouldn't be hard to add that on top of what we are building,
e.g., using a hash of the source and destination IP addresses to select
a route from the bucket in which we find the matching A/n.

>> The project removes the usage of multidata from TCP/IP, but the
>> interfaces specified in PSARC/2004/594 and PSARC/2002/276 remain in the
>> system.
>
> This part is confusing; could you please elaborate?
>
> Does the stack still generate MDT messages?  If not, then how are
> those two previous projects not affected?  (Doesn't this project just
> obsolete those two ... ?)

The TCP/IP stack no longer sends down MDT messages. But all the support
in PSARC/2004/594 and PSARC/2002/276 remains (the definition of
M_MULTIDATA, the support in copymsg, the mmd_* functions). Thus device
drivers that support M_MULTIDATA and link against those functions will
continue to link/load into the kernel. Other code which sends
M_MULTIDATA messages will continue to work.

    Erik