
James Carlson Wed, 3 Jun 2009 14:28:50 -0400

Erik Nordmark writes:
> James Carlson wrote:
> 
> >> We are introducing a new RTF_INDIRECT in <net/route.h>. This flag is useful
> >> for routing daemons that do BGP plus OSPF/IS-IS since it can make handling
> >> routing changes a lot more efficient.
> > 
> > Great to hear, but can we have some detail on how to use this?  Is it
> > the network routes from BGP that must use this flag (to make the
> > specified next hop address be "indirect"), or is it a flag set by the
> > BGP (or even the IGP) to specify that a given entry is a special "host
> > route" intended as a target for BGP routes?
> 
> The use of RTF_INDIRECT is optional. No changes to routing 
> daemons are required. However, routing daemons can be modified to take 
> advantage of the indirect routes.


I realize that it's optional.  I'm looking for "how to" information
so that someone can take advantage of the option.  The materials
provided don't make the usage clear.

> > What exactly does the flag do?  How is a route with this flag
> > different?
> 
> Two things are different for externally visible behavior:
>   - when the route is added the kernel does not require that the gateway 
> is directly reachable. If RTF_INDIRECT is not set the gateway must be 
> directly reachable as is the case in Solaris today.
>   - when we lookup a route and first find an indirect route, we iterate 
> to lookup the gateway in that route. (Today if we first find an 
> offlink/offsubnet route i.e., one with RTF_GATEWAY set, we iterate by 
> looking up its gateway. The indirect route in essence adds one more step 
> in this iterative process.)

Got it.  So this is a flag that I-BGP should use when adding routes
that depend on an IGP.  Right?
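
To make the lookup behavior described above concrete, here is a minimal
user-space sketch in C of the iteration: find the best route, and if it
is an off-link or indirect route, look up its gateway in turn until an
interface route is reached. Every name in it (the sk_ prefix, the flag
values, the table layout, the depth limit of 4) is an assumption made
for illustration; it is not the kernel implementation and does not use
the real <net/route.h> definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag values for this sketch; not the <net/route.h> values. */
#define SK_GATEWAY   0x1  /* next hop is off-link (plays the role of RTF_GATEWAY) */
#define SK_INDIRECT  0x2  /* gateway need not be on-link (role of RTF_INDIRECT) */

struct sk_route {
    uint32_t dst;      /* destination prefix, host byte order */
    uint32_t mask;     /* contiguous netmask, host byte order */
    uint32_t gateway;  /* next hop; unused for interface routes */
    int flags;
};

/* Longest-prefix match over a small table; NULL if nothing matches. */
static const struct sk_route *
sk_lookup(const struct sk_route *tbl, size_t n, uint32_t addr)
{
    const struct sk_route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if ((addr & tbl[i].mask) != tbl[i].dst)
            continue;
        /* A longer contiguous mask is numerically larger. */
        if (best == NULL || tbl[i].mask > best->mask)
            best = &tbl[i];
    }
    return best;
}

/*
 * Resolve addr to the on-link address a packet would be handed to.
 * Returns 0 on success, -1 if there is no route or the chain of
 * gateways is deeper than the (arbitrary) limit used here.
 */
static int
sk_resolve(const struct sk_route *tbl, size_t n, uint32_t addr,
    uint32_t *onlink)
{
    for (int depth = 0; depth < 4; depth++) {
        const struct sk_route *r = sk_lookup(tbl, n, addr);

        if (r == NULL)
            return -1;
        if (!(r->flags & (SK_GATEWAY | SK_INDIRECT))) {
            *onlink = addr;  /* interface route: addr is on-link */
            return 0;
        }
        addr = r->gateway;   /* iterate by looking up the gateway */
    }
    return -1;
}
```

With an interface route for 10.0.0.0/24, an IGP-style route for
192.168.5.0/24 via 10.0.0.1, and a BGP-style indirect route for
172.16.0.0/16 via 192.168.5.9, resolving 172.16.1.1 takes one extra
step through the indirect route and ends at 10.0.0.1.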

>  From an implementation perspective there is logic to ensure that the 
> caching we do for performance preserves the above visible behavior as 
> routes are added and deleted; just ensuring that the caching stays 
> consistent with the collection of forwarding table entries.
> 
> Do you think we should document this somewhere? If so, where? route(1m)?

By definition, to be a public interface, it must be documented.

I'd suggest both route(1M) and route(7P).  The latter is what routing
protocol authors are supposed to be reading.

> >> We are adding an informational RTF_KERNEL flag for routes, for instance
> >> interface routes, that are added by the kernel as part of configuring an IP
> >> interface. Such routes can not be accidentally deleted by applications.
> > 
> > RTF_KERNEL doesn't appear for ICMP redirect, even though that's
> > processed by the kernel, right?
> 
> Correct. The "as part of configuring an IP interface" is key. I don't 
> know if a different name for the flag would make this more clear.

No, that's ok.  I was just checking that there wasn't some other
meaning behind "kernel."

(At one point, Linux seems to have had RTF_INTERFACE, and it seems to
have had a similar meaning.  But it doesn't have that flag now, and I
don't know of an equivalent flag in use elsewhere on BSD or Linux.)

> >> No IRE_CACHE (UHA) entries will appear in netstat -ra, since the
> >> implementation no longer has IRE_CACHE entries.
> >> This project adds a new IRE_IF_CLONE type of routes. Those routes appear in
> >> netstat -ra (but not without the 'a' option) with the new 'C' flag.
> > 
> > Are these like the BSD clone entries (i.e., effectively ARP cache
> > entries, with *only* on-link addresses represented), or are they
> > something else?
> 
> I do not know the details of the BSD code. But the IRE_IF_CLONE entries 
> do not contain any ARP information. They are merely a place to hang a 
> reference to the ARP information.
> For example, if we have an interface route for a /23
> 129.146.228.0        129.146.228.81       U         1          5 bge0
> and we are sending packets to 129.146.228.1 the kernel would create an 
> IRE_IF_CLONE for 129.146.228.1/32, and as the ARP entry gets created 
> there would be a direct pointer from that entry to the arp information.

Yep; that's how BSD works.

It has RTF_CLONING for the former route, to indicate that when you
match it, you need to create a cloned route, and RTF_CLONED to mark
the entries that were created by the cloning process.

  http://www.daemon-systems.org/man/route.8.html

> This is beneficial for performance since we avoid doing a lookup of the 
> ARP information; we can always place a direct pointer to the ARP 
> information in the IRE that matches.

Yes; that's the same reason BSD does it.
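
As a rough illustration of the clone-on-match idea, the C sketch below
creates a /32 entry the first time an on-link destination matches the
interface route and hangs a direct pointer to the neighbor (ARP)
information off that entry, so later sends need no separate ARP lookup.
The structures, the sk_ names, and the fixed-size clone array are all
hypothetical; neither the Solaris IRE code nor the BSD routing code is
being reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical neighbor entry standing in for the ARP information. */
struct sk_arp {
    uint32_t ip;
    uint8_t mac[6];
    int resolved;
};

/* A cloned /32 entry: just a place to hang a pointer to the ARP entry. */
struct sk_clone {
    uint32_t host;       /* the /32 destination */
    struct sk_arp *arp;  /* NULL until an ARP entry is attached */
    int in_use;
};

#define SK_MAX_CLONES 8  /* arbitrary limit for this sketch */

/* Interface route (for example a /23) plus the clones created from it. */
struct sk_ifroute {
    uint32_t prefix;
    uint32_t mask;
    struct sk_clone clones[SK_MAX_CLONES];
};

/*
 * Return the clone for dst, creating it on first use.  Returns NULL if
 * dst is not covered by the interface route or no slot is free.
 */
static struct sk_clone *
sk_clone_get(struct sk_ifroute *ifr, uint32_t dst)
{
    struct sk_clone *free_slot = NULL;

    if ((dst & ifr->mask) != ifr->prefix)
        return NULL;
    for (size_t i = 0; i < SK_MAX_CLONES; i++) {
        struct sk_clone *c = &ifr->clones[i];

        if (c->in_use && c->host == dst)
            return c;            /* existing clone: reuse it */
        if (!c->in_use && free_slot == NULL)
            free_slot = c;
    }
    if (free_slot != NULL) {
        free_slot->host = dst;
        free_slot->arp = NULL;
        free_slot->in_use = 1;
    }
    return free_slot;
}
```

In this model the second packet to 129.146.228.1 finds the same clone
and follows its arp pointer directly.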

> > What should those users be doing?  (Reporting bugs against our PMTU
> > implementation?  Asking for an RFE?  Yelling at the admin for that
> > remote network?)
> 
> I think it is due that we implement RFC 4821 for TCP and SCTP which 
> would make PMTU discovery more robust overall.
> 
> If there are remote networks with different PMTU then presumably the 
> customers are either 1) clamping down the MTU for their interfaces i.e., 
> reducing MTU for everybody, or 2) using route(1m) to set a specific mtu 
> for a particular route.
> Both of those are unaffected by this project.

OK; it sounded to me like the latter one (specific MTU for a route)
would be removed.  If it's not, then no issue here.

> > Interface indices can't normally disappear and then reappear, can
> > they?  (If they can, then I'd call that a "bug" rather than a
> > "feature."  SNMP requires that ifIndex is unique for a given engine
> > invocation.)
> 
> This is outside of the scope of this case; this case merely makes sure 
> the system is at least as robust as today should the existing 
> SIOCSLIFINDEX be used to change the ifindex.
> 
> Until relatively recently I thought we could actually remove the ability 
> to set the interface index (essentially removing SIOCSLIFINDEX). But 
> I've seen a legitimate case where a customer was using it (can't recall 
> the details).

*sigh*

OK.

> In the example A might pick the first U route, and B might pick 
> the second. (Similar examples can be constructed with indirect routes.)
> Thus we do not do anything special should there be multiple routes for A/n.

The multiple A/n case is the one I was interested in.  By "do not do
anything special" do you mean this?

      Forwarding always just picks one of these routes (in some
      unspecified way) and uses it; there is no load spreading.  The
      fact that you have multiple routes gives you no benefit -- other
      than, perhaps, some robustness if one were deleted.  Local
      connections, on the other hand, are able to use all of the A/n
      routes, and each connection will be assigned to one route in a
      round-robin when connecting (or otherwise when establishing or
      using the remote address).

> It wouldn't be hard to add on top of what we are building, e.g., using a 
> hash of src+dest ip to select a route from the bucket in which we find 
> the matching A/n.

OK.
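
For reference, the kind of selection suggested above could look roughly
like the following C sketch: hash the source and destination addresses
and use the result to index into the set of equal-prefix routes, so a
given address pair always maps to the same route. The mixing function
and the sk_ names are assumptions chosen for illustration; the project
does not specify any particular hash.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mixing function; any reasonable hash would do here. */
static uint32_t
sk_mix(uint32_t src, uint32_t dst)
{
    uint32_t h = src ^ (dst * 2654435761u);

    h ^= h >> 16;
    return h;
}

/*
 * Pick one of nroutes equal-prefix routes for a src/dst pair.  The
 * caller is expected to pass nroutes > 0; 0 is returned otherwise.
 */
static size_t
sk_pick_route(uint32_t src, uint32_t dst, size_t nroutes)
{
    if (nroutes == 0)
        return 0;
    return sk_mix(src, dst) % nroutes;
}
```

Because the choice depends only on the address pair, packets of one flow
stay on one route, which avoids reordering while still spreading
different flows across the A/n routes.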

> >> The project removes the usage of multidata from TCP/IP, but the interfaces
> >> specified in PSARC/2004/594 and PSARC/2002/276 remain in the system.
> > 
> > This part is confusing; could you please elaborate?
> > 
> > Does the stack still generate MDT messages?  If not, then how are
> > those two previous projects not affected?  (Doesn't this project just
> > obsolete those two ... ?)
> 
> The TCP/IP stack no longer sends down MDT messages.
> 
> But all the support in PSARC/2004/594 and PSARC/2002/276 remains (the 
> definition of M_MULTIDATA, the support in copymsg, the mmd_* functions).
> 
> Thus device drivers that support M_MULTIDATA and link against those 
> functions will continue to link/load into the kernel. Other code which 
> sends M_MULTIDATA messages will continue to work.

What other code (besides the TCP/IP stack) sends M_MULTIDATA?  It was
a private interface, and I don't know of any other users.

If we're disabling multidata, then why not start the process of
removing this old stuff?  We could at least start the notification
process on the MDT contracts.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677

IP Datapath Refactoring [PSARC/2009/331 FastTrack timeout 06/09/2009]
