On Thu, Jul 9, 2015 at 1:28 PM, Dennis Ferguson <dennis.c.fergu...@gmail.com> wrote:
>
> On 7 Jul, 2015, at 21:25 , Ryota Ozaki <ozak...@netbsd.org> wrote:
>
>> BTW how do you think of separating L2 tables (ARP/NDP) from the L3 routing tables? The separation gets rid of cloning/cloned route features and that makes it easy to introduce locks in route.c. (Currently rtrequest1 can be called recursively to remove cloned routes and that makes it hard to use locks.) I read your paper (BSDNetworking.pdf) and it seems to suggest to maintain L2 routes in the common routing table (I may misunderstand your opinion).
>
> I think it is worth stepping back and thinking about what the end result of the most common type of access to the route table (a forwarding operation, done by a reader who wants to know what to do with a packet it has) is going to be, since this is the operation you want to optimize. If the packet is to be sent out an interface then the result of the work you are doing is that an L2 header will be prepended to the packet and the packet will be queued to an interface for transmission.
>
> To make this direct and fast what you want is for the result of the route lookup to point directly at the thing that knows what L2 header needs to be added and which interface the packet needs to be delivered to. If you have that then all that remains to be done after the route lookup is to make space at the front of the packet for the L2 header, memcpy() it in and give the resulting frame to the interface. So you want the route lookup organized to get you from the addresses in the packet you are processing to the L2 header and interface you need to use to send a packet like that as directly as possible.
>
> While we could talk about how the route lookup might be structured to better get directly to the point (this involves splitting the rtentry into a "route" part and a "nexthop" part, the latter being the result of a lookup and having the data needed to deliver the packet with minimal extra work), this probably isn't relevant to your question. What I did want to point out, however, is that knowledge of the next hop IP address is (generally) entirely unnecessary to forward a packet. All forwarding operations want to know is the L2 header to add to the packet. Of course ARP or ND will have used the next hop IP address to determine the L2 header to attach to the packet, but once this is known all packet forwarding wants is the result, the L2 header, and doesn't care how that was arrived at. What this means is that your proposed use of the next hop IP address is a gratuitous indirection; you would be taking something which would be best done as
>
>     <route lookup> -> <L2 header>
>
> and instead turning this into
>
>     <route lookup> -> <next hop IP address> -> <next hop address lookup> -> <L2 header>
>
> This will likely always be significantly more expensive than the direct alternative. The indirection is also easy to resolve up front, when a route is added, so there's no need to do it over and over again for each forwarded packet, and failing to do it when routes are installed moves yet another data structure (per-interface) into the forwarding path that will need to be dealt with if you eventually want to eliminate the locks. I think you shouldn't do this, or anything else that requires if_output() to look at the next hop IP address, since that indirection should go away.
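To check my own understanding of the route/nexthop split and the direct <route lookup> -> <L2 header> path described above, something like the following is what I picture. The struct and function names are invented for illustration; this is not the existing rtentry layout, the if_output() interface, or a proposed patch:

/*
 * Sketch only: a nexthop object produced by the route lookup that
 * already carries the prebuilt L2 header and the outgoing interface,
 * so per-packet forwarding never touches the next hop IP address.
 */
#include <sys/param.h>
#include <sys/errno.h>
#include <sys/mbuf.h>
#include <sys/systm.h>
#include <net/if.h>
#include <net/if_ether.h>

struct nexthop {
	struct ifnet	*nh_ifp;			/* interface to queue the packet to */
	uint8_t		 nh_l2hdr[ETHER_HDR_LEN];	/* filled in by ARP/ND when resolved */
	uint8_t		 nh_l2hdrlen;
};

struct route_part {
	/* prefix, mask, flags, ... */
	struct nexthop	*rp_nexthop;			/* resolved once, when the route is installed */
};

/* Fast path after the route lookup: prepend the header and hand off. */
static int
nexthop_forward(struct mbuf *m, const struct nexthop *nh)
{
	M_PREPEND(m, nh->nh_l2hdrlen, M_DONTWAIT);
	if (m == NULL)
		return ENOBUFS;
	memcpy(mtod(m, void *), nh->nh_l2hdr, nh->nh_l2hdrlen);
	/* ...then queue the finished frame directly to nh->nh_ifp... */
	return 0;
}

That is, ARP/ND would fill in nh_l2hdr when the next hop resolves (or when the route is installed), and the per-packet path only copies it; nothing in the fast path ever looks at the next hop IP address.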
>
> The neat thing about this is that the internal arrangement that makes one think that the next hop IP address is an important result of a route lookup (it is listed as one in the rtentry structure, and if_output() takes it as an argument) is actually a historical artifact. I think this code was written in about 1980. Then, as now, the point of the route lookup was to determine the L2 header to prepend to the packet and the interface to queue it to, but what was different was the networks that existed then. Almost all of them did <IP address> -> <L2 header> mapping by storing the variable bits of the L2 header directly in the local bits of the IP address; see RFC796 and RFC895 for a whole bunch of examples (the all-zeros-host-part directed broadcast address that 4.2BSD used came from the mapping for experimental ethernet). This meant that the next hop IP address wasn't an indirection at all, it was directly the data you needed to construct the L2 header to add to the packet. The original exception to this was DIX Ethernet, with its 48 bit MAC addresses that were too big to store that way, so the idea of implementing an ARP cache in the interface code and using the next hop IP address as a less efficient indirection to the L2 header data for that type of interface, was invented to make DIX Ethernet look like a "normal" interface where the next hop IP address directly and efficiently provided the L2 bits you needed to know to send the packet.
>
> The thing is that pretty much all the networks that were "normal" in 1980 had disappeared by about 1990, leaving only networks that worked like DIX ethernet. You would think the code would have been restructured for the new "normal" since then, but I guess old code dies hard.
Thank you for the explanation! I see your point now, but I still have some concerns.

One was already raised by joerg: I'm not sure the possible performance degradation would really be that serious. Another concern is that keeping the next hop caches in a common datastore, with dependencies between its entries (rtentry), makes the code complex and makes the locking hard to manage. I once tried to introduce locks into the current implementation, but failed and felt I was heading in the wrong direction. Separating the next hop caches apparently solves the problem (I've looked at FreeBSD's code).

Do you have any ideas for tackling the problem while keeping the current structure? (To be concrete about what I mean by separating the caches, a rough sketch follows at the end of this mail.)

Thanks,
  ozaki-r

>
> Dennis Ferguson
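Here is the rough sketch mentioned above of what I mean by the separation. The routing table would keep only the next hop IP address, and a separate neighbour cache with its own lock would map that address to the L2 header. All names are invented for illustration; this is not FreeBSD's actual lltable code, just the shape of the idea:

/*
 * Sketch only: an L2 (neighbour) cache kept outside the routing table,
 * keyed by the next hop IP address and protected by its own lock, so
 * route.c never has to manipulate cloned rtentries.
 */
#include <sys/param.h>
#include <sys/queue.h>
#include <sys/mutex.h>
#include <sys/systm.h>
#include <netinet/in.h>
#include <net/if_ether.h>

struct l2cache_entry {
	LIST_ENTRY(l2cache_entry) le_link;
	struct in_addr	le_nexthop;			/* key: next hop IP address */
	uint8_t		le_l2hdr[ETHER_HDR_LEN];	/* filled in by ARP/ND */
	bool		le_valid;
};

struct l2cache {
	kmutex_t	lc_lock;			/* independent of any routing table lock */
	LIST_HEAD(, l2cache_entry) lc_entries;
};

/*
 * Output path: the route lookup yields the next hop IP address, then
 * this second lookup yields the L2 header.  It is exactly the extra
 * indirection you describe, but each data structure can be locked on
 * its own.
 */
static bool
l2cache_lookup(struct l2cache *lc, struct in_addr nh, uint8_t *hdr)
{
	struct l2cache_entry *le;
	bool found = false;

	mutex_enter(&lc->lc_lock);
	LIST_FOREACH(le, &lc->lc_entries, le_link) {
		if (le->le_valid && le->le_nexthop.s_addr == nh.s_addr) {
			memcpy(hdr, le->le_l2hdr, ETHER_HDR_LEN);
			found = true;
			break;
		}
	}
	mutex_exit(&lc->lc_lock);
	return found;
}

ARP/ND would insert and update entries under lc_lock alone, so route.c never has to recurse through cloned rtentries; the price is the second per-packet lookup you describe.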