I'm sponsoring this fast-track for Erik Nordmark. The timer expires on
June 9th.

IP Datapath Refactoring
=======================

The project changes a lot of the IP datapath code paths, but has rather
minor changes to documented interfaces. Those changes are the subject of
this case.

The IP datapaths are extremely hard to follow both at the micro level 
(ip_output_options and ip_wput_ire, and ip_input) and at the macro level 
(an outbound packet needing IPsec and ARP resolution goes through a large
number of steps).

The nature of the changes in IP derives from the classical quote "ip_newroute
delenda est"; a key root cause of the complexity is the introduction
of asynchronous behavior near the top of the IP output path. Moving the
asynchrony to the bottom of the IP output path, where other implementations
handle ARP resolution, makes things more sane.

Also, partly due to the ip_newroute asynchrony, IP has grown a large number
of internal mechanisms to remember state associated with a packet that needs
to be queued. We are replacing that with a single internal mechanism in
the form of ip_xmit_attr_t and ip_recv_attr_t data structures. All those
mechanisms and interfaces are project private.

Current prototyping indicates that about 30,000 lines of code can be removed
as a result of these changes (combined with the ARP/IP merge pieces).


Imported interfaces:
--------------------

No changes.


Exported interfaces:
--------------------

We are introducing a new RTF_INDIRECT in <net/route.h>. This flag is useful
for routing daemons that do BGP plus OSPF/IS-IS since it can make handling
routing changes a lot more efficient.

route(1m) has a new -indirect flag to set RTF_INDIRECT.

The indirect routes are represented by a new 'I' flag in the netstat -r
output. Old cases classify the output of netstat as Unstable. This case
doesn't change that; it just adds the 'I' flag.

We are adding an informational RTF_KERNEL flag for routes, for instance
interface routes, that are added by the kernel as part of configuring an IP
interface. Such routes cannot be accidentally deleted by applications.
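
As an illustration of how the two new flags could look to a consumer, here is
a minimal user-space sketch in C. It is not taken from the project. The
DEMO_RTF_* names and all bit values are assumptions made for this example
(real code should use the definitions in <net/route.h>), and both functions
are hypothetical helpers. The sketch only covers what is stated above: the
'I' letter for indirect routes, and that applications cannot delete routes
marked RTF_KERNEL.

```c
#include <stddef.h>

/*
 * Illustrative bit values only (assumptions for this sketch). Real code
 * should use the RTF_* definitions from <net/route.h>.
 */
#define	DEMO_RTF_UP		0x1u
#define	DEMO_RTF_GATEWAY	0x2u
#define	DEMO_RTF_HOST		0x4u
#define	DEMO_RTF_INDIRECT	0x40000u	/* hypothetical value */
#define	DEMO_RTF_KERNEL		0x80000u	/* hypothetical value */

/*
 * Write netstat -r style flag letters for `flags` into `buf` (which must
 * hold at least 5 bytes) and return `buf`. Only U, G, H and the new I
 * letter are handled here.
 */
char *
demo_route_flag_string(unsigned int flags, char *buf)
{
	size_t n = 0;

	if (flags & DEMO_RTF_UP)
		buf[n++] = 'U';
	if (flags & DEMO_RTF_GATEWAY)
		buf[n++] = 'G';
	if (flags & DEMO_RTF_HOST)
		buf[n++] = 'H';
	if (flags & DEMO_RTF_INDIRECT)
		buf[n++] = 'I';
	buf[n] = '\0';
	return (buf);
}

/*
 * Toy model of the RTF_KERNEL rule: returns 1 if an application may
 * delete the route, 0 if the route was added by the kernel.
 */
int
demo_app_may_delete(unsigned int flags)
{
	return ((flags & DEMO_RTF_KERNEL) == 0);
}
```

With these assumed values, a route that is up, uses a gateway, and is
indirect would be rendered with the letters U, G and I.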

The implementation introduces a new Destination Cache Entry in the kernel
(patterned after the description in RFC 4861). For debugging purposes it is
useful to be able to display the DCE, in particular the Path MTU which is
recorded in it. We are adding the -d option to netstat(1m) for this purpose.
(Note that netstat -d is currently an undocumented option for debugging netstat
itself. We rename that undocumented debug option to -x.)
The way netstat extracts the DCE table from the kernel is using the new
EXPER_IP_DCE in <inet/mib2.h>.

The implementation of multirt/CGTP changes, and in all but one detail the
interfaces remain unchanged. The change is that the Evolving and undocumented
tunable ip_multirt_resolution_interval has been removed.
Multirt/CGTP routes use the same timer as ARP/ND does for other routes.
Note that PSARC/2003/041 replaced PSARC/2000/539 and the newer contract
didn't mention the ndd tunable; this might have been an omission.

_________________________________________________________________________
|                   Interfaces Added by This Case                       |
|_______________________________|_______________________|_______________|
|Interface                      | Classification        | Comments      |
|_______________________________|_______________________|_______________|
| RTF_INDIRECT                  | Committed             | <net/route.h> |
| route(1m) -indirect flag      | Committed             |               |
| netstat(1m) -r output         |Uncommitted (unchanged)|               |
| RTF_KERNEL                    | Committed             | <net/route.h> |
| netstat(1m) -d option         | Committed             |               |
| netstat(1m) -d output         | Uncommitted           |               |
| EXPER_IP_DCE                  | Uncommitted           | <inet/mib2.h> |
|_______________________________|_______________________|_______________|


_________________________________________________________________________
|                   Interfaces Removed by This Case                     |
|_______________________________|_______________________|_______________|
|Interface                      | Classification        | Comments      |
|_______________________________|_______________________|_______________|
| ip_multirt_resolution_interval| Evolving              | PSARC/2000/539|
|_______________________________|_______________________|_______________|



Implementation changes:
-----------------------

Due to the ARP/IP merge and uniform application of Neighbor Unreachability
Detection (RFC 4861) the undocumented ndd tunables for /dev/arp are replaced by
undocumented ndd tunables for /dev/ip.

Instead of relying on timers (with undocumented but well-known tunables like
arp_cleanup_interval and ip_ire_arp_interval) this project makes ARP function
the same way as Neighbor Discovery in using the RFC 4861 NUD state machine.
Thus those known, but undocumented, tunables are removed.

The implementation now tracks IPv4 group membership per ill_t instead
of per ipif_t, but we keep the IP address around so we can preserve the output
of netstat -g (which reports the logical interface name, e.g., bge0:1, when an
IPv4 group is joined using the IP address assigned to bge0:1).

netstat -ia continues to show input counters for each local address. However,
the output counters never made any sense on a per-local-address basis (IP
packets are sent out of an IP interface, not out of an IP address), and this
project reports them as zero.

The Solaris 'Use' count in netstat -r has been an unpredictable and
undocumented number since Solaris 2.0 (the implementation counts the number of
times ip_newroute has used the route to try to create an IRE_CACHE entry). We
restore the use count to actually count the number of packets that are sent
out using the route in question.

No IRE_CACHE (UHA) entries will appear in netstat -ra, since the
implementation no longer has IRE_CACHE entries.
This project adds a new IRE_IF_CLONE type of route. Those routes appear in
netstat -ra (but not without the 'a' option) with the new 'C' flag.

While the kernel no longer uses any IRE_CACHE entries, we are keeping
the #define of IRE_CACHE in the header file so that applications which
use the common, but undocumented, mibget approach for retrieving the kernel
routing table will still compile.

The new implementation no longer has an ire_max_frag field, hence the
Maxfrg/PMTU column in the netstat -rv output is no longer useful. We are
removing that output. (Note that the details of the netstat output are not a
stable interface.)

Currently Solaris handles the IP interface MTU in odd ways, in that it can be
set differently for each local IP address prefix; this leaves it undefined
which MTU is applied to multicast packets.
This project fixes that by applying the IP interface MTU per interface. As a
result ifconfig bge0:N mtu 1400 will fail with EINVAL.
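
The per-interface MTU rule can be pictured with a small user-space model in
C. This is a sketch under stated assumptions, not the kernel code path:
demo_set_mtu() is a hypothetical helper, and treating any name that contains
a ':' as a logical interface is a simplification made for this example.

```c
#include <errno.h>
#include <string.h>

/*
 * Toy model of the new rule. The MTU belongs to the IP interface as a
 * whole, so a request that names a logical interface (assumed here to be
 * any name containing ':') is rejected with EINVAL and the interface MTU
 * is left untouched. Returns 0 on success or an errno value.
 */
int
demo_set_mtu(const char *ifname, unsigned int mtu, unsigned int *ill_mtu)
{
	if (strchr(ifname, ':') != NULL)
		return (EINVAL);
	*ill_mtu = mtu;
	return (0);
}
```

In this model a request for "bge0:1" fails and leaves the stored MTU alone,
while the same request for "bge0" updates the one MTU shared by every
address on the interface.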

No mapping entry will appear in arp -a/netstat -p output, since the
implementation of the multicast mapping has changed.
Individual multicast and broadcast addresses might appear in netstat -p/arp -a.

API calls which refer to interface indices and interface addresses
(IP_MULTICAST_IF, IP_BOUND_IF, IP_ADD_MEMBERSHIP, etc.) currently have
odd behaviors when interfaces and/or IP addresses are unplumbed and removed.
To preserve kernel sanity (no stray ill and ipif pointers) the application's
setting is forgotten without telling the application. From the application's
point of view the behavior looks very odd. E.g., an IP_ADD_MEMBERSHIP followed
by a correct IP_DROP_MEMBERSHIP can see an EADDRNOTAVAIL error because the
kernel might have removed all memory of the IP_ADD_MEMBERSHIP when the IP
address was removed.

This project will instead preserve what the application has set until the
application explicitly removes it. For instance, an IP_BOUND_IF will remain
in effect even if the IP interface is unplumbed. Packets will be dropped and
ENXIO will be returned to the sendto() system call. And received packets will
be dropped since they can't possibly match the interface index specified in the
IP_BOUND_IF when the interface has been unplumbed. However, when the IP address 
(or interface index) which was used by the application reappears, then the
application's setting will be fully functional again.
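
The IP_BOUND_IF case described above can be modelled in a few lines of
user-space C. This is a sketch only: struct demo_sock and demo_send() are
hypothetical names invented for the example, and the list of plumbed
interface indices stands in for kernel state. The point it shows is that the
socket's setting is never cleared behind the application's back; sends fail
with ENXIO while the interface is absent and work again once it returns.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-socket state: the interface index set via IP_BOUND_IF. */
struct demo_sock {
	unsigned int bound_ifindex;	/* 0 means "not bound" */
};

/*
 * Toy model of a send on a bound socket. `plumbed` lists the interface
 * indices that currently exist. Returns 0 on success, or ENXIO when the
 * bound interface is not plumbed. The socket's setting is never modified.
 */
int
demo_send(const struct demo_sock *so, const unsigned int *plumbed,
    size_t nplumbed)
{
	size_t i;

	if (so->bound_ifindex == 0)
		return (0);
	for (i = 0; i < nplumbed; i++) {
		if (plumbed[i] == so->bound_ifindex)
			return (0);
	}
	return (ENXIO);
}
```

An application written against this behavior can treat ENXIO from sendto()
as a condition that may clear on its own, rather than re-issuing the socket
option.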


The project extends the kernel's ability to handle multiple routes for the same
prefix; currently the kernel only does some form of round robin for default 
routes and the project extends that to all off-link routes (default, prefix, and
host routes). We are adding an undocumented knob should there be a reason to
switch back to the old behavior in the field.
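
For readers unfamiliar with the idea, round robin over several routes to the
same prefix can be sketched as below. The case only says "some form of round
robin", so this is an illustration of the general technique in C, not the
kernel's actual selection logic; struct demo_route_set and demo_pick_route()
are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical set of routes that share one prefix. */
struct demo_route_set {
	size_t nroutes;	/* number of routes for the prefix */
	size_t next;	/* index of the route to hand out next */
};

/*
 * Return the index of the route to use for the next packet and advance
 * the cursor, wrapping around at the end. Returns (size_t)-1 if the set
 * is empty.
 */
size_t
demo_pick_route(struct demo_route_set *rs)
{
	size_t pick;

	if (rs->nroutes == 0)
		return ((size_t)-1);
	pick = rs->next % rs->nroutes;
	rs->next = (pick + 1) % rs->nroutes;
	return (pick);
}
```

With three routes for a prefix, successive calls hand out indices 0, 1, 2
and then 0 again. Under the project the same kind of rotation would apply to
default, prefix, and host routes alike.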

The project removes the usage of multidata from TCP/IP, but the interfaces
specified in PSARC/2004/594 and PSARC/2002/276 remain in the system.


The project changes the *use* of DL_NOTE_REPLUMB (introduced by PSARC 2008/242 
as a private interface); it is only used on the IP stream(s) and not on the 
ARP stream.

Notes:
------

This project removes the AR_* message set used by IP and ARP.
Earlier there was a contract private interface with SunATM on those interfaces
(established in LSARC/1993/101/ and extended in PSARC/1999/446 and
PSARC/2001/023). That contract was cancelled by 
        PSARC/2006/272   EOL of ATM device driver

The integration of this project is likely to also deliver the changes
related to
        PSARC/2008/522 EOF of 2001/070 IPsec HW Acceleration support


