I'm sponsoring this fast-track for Erik Nordmark; the timer expires on June 9th.

IP Datapath Refactoring
=======================

The project changes a lot of the IP datapath code paths, but makes only
minor changes to documented interfaces. Those changes are the subject
of this case.

The IP datapaths are extremely hard to follow both at the micro level
(ip_output_options, ip_wput_ire, and ip_input) and at the macro level
(an outbound packet needing IPsec and ARP resolution goes through a
large number of steps).

The nature of the changes in IP derives from the classical quote
"ip_newroute delenda est"; a key root cause of the complexity is the
introduction of asynchronous behavior near the top of the IP output
path. Moving the asynchrony to the bottom of the IP output path, where
other implementations handle ARP resolution, makes things more sane.

Also, partly due to the ip_newroute asynchrony, IP has grown a large
number of internal mechanisms to remember state associated with a
packet that needs to be queued. We are replacing those with a single
internal mechanism in the form of the ip_xmit_attr_t and ip_recv_attr_t
data structures. All those mechanisms and interfaces are project
private.

Current prototyping indicates that about 30,000 lines of code can be
removed as a result of these changes (combined with the ARP/IP merge
pieces).

Imported interfaces:
--------------------

No changes.

Exported interfaces:
--------------------

We are introducing a new RTF_INDIRECT flag in <net/route.h>. This flag
is useful for routing daemons that do BGP plus OSPF/IS-IS since it can
make handling routing changes a lot more efficient. route(1m) has a new
-indirect flag to set RTF_INDIRECT. The indirect routes are represented
by a new 'I' flag in the netstat -r output.

Old cases state the classification of the output of netstat as
Unstable. This case doesn't change that; it just adds the 'I' flag.

We are adding an informational RTF_KERNEL flag for routes, for instance
interface routes, that are added by the kernel as part of configuring
an IP interface.
Such routes cannot be accidentally deleted by applications.

The implementation introduces a new Destination Cache Entry (DCE) in
the kernel (patterned after the description in RFC 4861). For debugging
purposes it is useful to be able to display the DCE, in particular the
Path MTU which is recorded in it. We are adding the -d option to
netstat(1m) for this purpose. (Note that netstat -d is currently an
undocumented option for debugging netstat itself. We rename that
undocumented debug option to -x.) netstat extracts the DCE table from
the kernel using the new EXPER_IP_DCE in <inet/mib2.h>.

The implementation of multirt/CGTP changes, but in all but one detail
the interfaces remain unchanged. The change is that the evolving and
undocumented tunable ip_multirt_resolution_interval has been removed.
Multirt/CGTP routes use the same timer as ARP/ND does for other routes.
Note that PSARC/2003/041 replaced PSARC/2000/539 and the newer contract
didn't mention the ndd tunable; this might have been an omission.

_________________________________________________________________________
|                     Interfaces Added by This Case                     |
|_______________________________|_______________________|_______________|
|Interface                      | Classification        | Comments      |
|_______________________________|_______________________|_______________|
| RTF_INDIRECT                  | Committed             | <net/route.h> |
| route(1m) -indirect flag      | Committed             |               |
| netstat(1m) -r output         | Uncommitted (unchanged)|              |
| RTF_KERNEL                    | Committed             | <net/route.h> |
| netstat(1m) -d option         | Committed             |               |
| netstat(1m) -d output         | Uncommitted           |               |
| EXPER_IP_DCE                  | Uncommitted           | <inet/mib2.h> |
|_______________________________|_______________________|_______________|

_________________________________________________________________________
|                    Interfaces Removed by This Case                    |
|_______________________________|_______________________|_______________|
|Interface                      | Classification        | Comments      |
|_______________________________|_______________________|_______________|
| ip_multirt_resolution_interval| Evolving              | PSARC/2000/539|
|_______________________________|_______________________|_______________|

Implementation changes:
-----------------------

Due to the ARP/IP merge and uniform application of Neighbor
Unreachability Detection (NUD, RFC 4861), the undocumented ndd tunables
for /dev/arp are replaced by undocumented ndd tunables for /dev/ip.
Instead of relying on timers (with undocumented but well-known tunables
like arp_cleanup_interval and ip_ire_arp_interval), this project makes
ARP function the same way as Neighbor Discovery by using the RFC 4861
NUD state machine. Thus those known, but undocumented, tunables are
removed.

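For readers unfamiliar with NUD, the following is a minimal sketch of
the neighbor cache states named in RFC 4861 (INCOMPLETE, REACHABLE,
STALE, DELAY, PROBE) and a simplified subset of their transitions. It
is an illustration only and not the project's code: the event names,
the table, and the helper functions are hypothetical, and several
RFC 4861 transitions (for example probe retry exhaustion and link-layer
address changes) are left out.

```python
from enum import Enum


class NudState(Enum):
    INCOMPLETE = "INCOMPLETE"  # resolution in progress, no link-layer address yet
    REACHABLE = "REACHABLE"    # reachability confirmed recently
    STALE = "STALE"            # address known, reachability no longer confirmed
    DELAY = "DELAY"            # traffic sent, waiting briefly for confirmation
    PROBE = "PROBE"            # actively probing the neighbor


# Event names are hypothetical labels for this sketch, not kernel identifiers.
TRANSITIONS = {
    (NudState.INCOMPLETE, "reply_received"): NudState.REACHABLE,
    (NudState.REACHABLE, "reachable_time_expired"): NudState.STALE,
    (NudState.STALE, "packet_sent"): NudState.DELAY,
    (NudState.DELAY, "reachability_confirmed"): NudState.REACHABLE,
    (NudState.DELAY, "delay_timer_expired"): NudState.PROBE,
    (NudState.PROBE, "reply_received"): NudState.REACHABLE,
}


def next_state(state, event):
    """Return the state after `event`; unlisted events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(events, state=NudState.INCOMPLETE):
    """Apply a sequence of events, starting from INCOMPLETE by default."""
    for event in events:
        state = next_state(state, event)
    return state
```

The point of the sketch is that an entry is revalidated when traffic
actually uses it (STALE to DELAY to PROBE), rather than being aged out
by a fixed cleanup timer.
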
The implementation changes to track IPv4 group membership per ill_t
instead of per ipif_t, but we keep the IP address around so we can
preserve the output of netstat -g (which reports the logical interface
name, e.g., bge0:1, when an IPv4 group is joined using the IP address
assigned to bge0:1).

netstat -ia continues to show input counters for each local address.
However, the output counters never made any sense on a per-local-address
basis (IP packets are sent out of an IP interface and not out of an IP
address), and this project makes them be reported as zero.

The Solaris 'Use' count in netstat -r has been an unpredictable and
undocumented number since Solaris 2.0 (the implementation counts the
number of times ip_newroute has used the route to try to create an
IRE_CACHE entry). We restore the use count to actually count the number
of packets that are sent out using the route in question.

No IRE_CACHE (UHA) entries will appear in netstat -ra, since the
implementation no longer has IRE_CACHE entries. This project adds a new
IRE_IF_CLONE type of route. Those routes appear in netstat -ra (but not
without the 'a' option) with the new 'C' flag.

While the kernel no longer uses any IRE_CACHE entries, we are keeping
the #define of IRE_CACHE in the header file so that applications which
use the common, but undocumented, mibget approach for retrieving the
kernel routing table will still compile.

The new implementation no longer has an ire_max_frag field, hence the
output of Maxfrg/PMTU in the netstat -rv output is no longer useful. We
are removing that output. (Note that the details of the netstat output
are not a stable interface.)

Currently Solaris handles the IP interface MTU in odd ways in that it
can be set differently for each local IP address prefix; this leaves it
quite undefined which MTU is applied to multicast packets. This project
fixes that by applying the IP interface MTU per interface. As a result
ifconfig bge0:N mtu 1400 will fail with EINVAL.

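The per-interface MTU rule above can be illustrated with a small model.
This is a sketch under stated assumptions, not the kernel logic: the
mtu_table dictionary and both helper functions are hypothetical, and
the only facts taken from this case are that the MTU applies per IP
interface and that setting it on a logical interface such as bge0:N
fails with EINVAL.

```python
import errno


def set_mtu(mtu_table, name, mtu):
    """Set the MTU of a physical IP interface in this model.

    A name containing ':' (e.g. "bge0:1") denotes a logical interface;
    the model rejects such a request with EINVAL.
    """
    if ":" in name:
        raise OSError(errno.EINVAL, "MTU is per IP interface, not per address")
    mtu_table[name] = mtu


def effective_mtu(mtu_table, name):
    """Return the MTU that applies to `name`.

    Every logical interface shares the MTU of its physical interface,
    so there is exactly one answer for multicast packets as well.
    """
    physical = name.split(":", 1)[0]
    return mtu_table[physical]
```

For example, after set_mtu(table, "bge0", 1400), both
effective_mtu(table, "bge0") and effective_mtu(table, "bge0:1") return
1400 in this model.
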
No mapping entry will appear in arp -a/netstat -p output, since the
implementation of the multicast mapping has changed. Individual
multicast and broadcast addresses might appear in netstat -p/arp -a.

API calls which refer to interface indices and interface addresses
(IP_MULTICAST_IF, IP_BOUND_IF, IP_ADD_MEMBERSHIP, etc.) currently have
odd behaviors when interfaces and/or IP addresses are unplumbed and
removed. To preserve kernel sanity (no stray ill and ipif pointers) the
application's setting is forgotten without telling the application.
From the application's point of view the behavior looks very odd. E.g.,
an IP_ADD_MEMBERSHIP followed by a correct IP_DROP_MEMBERSHIP will see
an EADDRNOTAVAIL error because the kernel might have removed all memory
of the IP_ADD_MEMBERSHIP when the IP address was removed.

This project will instead preserve what the application has set until
the application explicitly removes it. For instance, an IP_BOUND_IF
will remain in effect even if the IP interface is unplumbed. Packets
will be dropped and ENXIO will be returned to the sendto() system call.
Received packets will also be dropped, since they can't possibly match
the interface index specified in the IP_BOUND_IF when the interface has
been unplumbed. However, when the IP address (or interface index) which
was used by the application reappears, the application's setting will
be fully functional again.

The project extends the kernel's ability to handle multiple routes for
the same prefix; currently the kernel only does some form of round
robin for default routes and the project extends that to all off-link
routes (default, prefix, and host routes). We are adding an undocumented
knob should there be a reason to switch back to the old behavior in the
field.

The project removes the usage of multidata from TCP/IP, but the
interfaces specified in PSARC/2004/594 and PSARC/2002/276 remain in the
system.

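The preserved-setting behavior described above for IP_BOUND_IF can be
modelled roughly as follows. This is a sketch only: the BoundSocket
class, its method names, and the set of plumbed interface indices are
hypothetical stand-ins, not the socket API or the kernel data
structures. What it takes from this case is that the setting survives
an unplumb, that sends fail with ENXIO while the interface is gone, and
that the setting works again once the interface index reappears.

```python
import errno


class BoundSocket:
    """Hypothetical model of a socket carrying an IP_BOUND_IF-style setting."""

    def __init__(self, plumbed):
        # Shared set of interface indices that are currently plumbed.
        self.plumbed = plumbed
        self.bound_ifindex = None

    def bind_to_interface(self, ifindex):
        """Record the application's setting; only the application clears it."""
        self.bound_ifindex = ifindex

    def clear_binding(self):
        self.bound_ifindex = None

    def send(self, payload):
        """Return the number of bytes sent, or raise ENXIO if the bound
        interface is not currently plumbed."""
        if self.bound_ifindex is not None and self.bound_ifindex not in self.plumbed:
            raise OSError(errno.ENXIO, "bound interface is not plumbed")
        return len(payload)
```

In this model, removing index 2 from the plumbed set makes send() raise
ENXIO while bound_ifindex stays 2; adding index 2 back makes send()
succeed again without the application re-issuing its setting.
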
The project changes the *use* of DL_NOTE_REPLUMB (introduced by PSARC
2008/242 as a private interface); it is only used on the IP stream(s)
and not on the ARP stream.

Notes:
------

This project removes the AR_* message set used by IP and ARP. Earlier
there was a contract private interface with SunATM on those interfaces
(established in LSARC/1993/101/ and extended in PSARC/1999/446 and
PSARC/2001/023). That contract was cancelled by PSARC/2006/272 EOL of
ATM device driver.

The integration of this project is likely to also deliver the changes
related to PSARC/2008/522 EOF of 2001/070 IPsec HW Acceleration support.