On Thu, Jul 9, 2015 at 1:28 PM, Dennis Ferguson <dennis.c.fergu...@gmail.com> wrote:
>
> On 7 Jul, 2015, at 21:25 , Ryota Ozaki <ozak...@netbsd.org> wrote:
>
>> BTW how do you think of separating L2 tables (ARP/NDP) from the L3 routing tables? The separation gets rid of cloning/cloned route features and that makes it easy to introduce locks in route.c. (Currently rtrequest1 can be called recursively to remove cloned routes and that makes it hard to use locks.) I read your paper (BSDNetworking.pdf) and it seems to suggest to maintain L2 routes in the common routing table (I may misunderstand your opinion).
>
> I think it is worth stepping back and thinking about what the end result of the most common type of access to the route table (a forwarding operation, done by a reader who wants to know what to do with a packet it has) is going to be, since this is the operation you want to optimize. If the packet is to be sent out an interface then the result of the work you are doing is that an L2 header will be prepended to the packet and the packet will be queued to an interface for transmission.
>
> To make this direct and fast what you want is for the result of the route lookup to point directly at the thing that knows what L2 header needs to be added and which interface the packet needs to be delivered to. If you have that then all that remains to be done after the route lookup is to make space at the front of the packet for the L2 header, memcpy() it in and give the resulting frame to the interface. So you want the route lookup organized to get you from the addresses in the packet you are processing to the L2 header and interface you need to use to send a packet like that as directly as possible.
>
> While we could talk about how the route lookup might be structured to better get directly to the point (this involves splitting the rtentry into a "route" part and a "nexthop" part, the latter being the result of a lookup and having the data needed to deliver the packet with minimal extra work), this probably isn't relevant to your question. What I did want to point out, however, is that knowledge of the next hop IP address is (generally) entirely unnecessary to forward a packet. All forwarding operations want to know is the L2 header to add to the packet. Of course ARP or ND will have used the next hop IP address to determine the L2 header to attach to the packet, but once this is known all packet forwarding wants is the result, the L2 header, and doesn't care how that was arrived at. What this means is that your proposed use of the next hop IP address is a gratuitous indirection; you would be taking something which would be best done as
>
>     <route lookup> -> <L2 header>
>
> and instead turning this into
>
>     <route lookup> -> <next hop IP address> -> <next hop address lookup> -> <L2 header>
>
> This will likely always be significantly more expensive than the direct alternative. The indirection is also easy to resolve up front, when a route is added, so there's no need to do it over and over again for each forwarded packet, and failing to do it when routes are installed moves yet another data structure (per-interface) into the forwarding path that will need to be dealt with if you eventually want to eliminate the locks. I think you shouldn't do this, or anything else that requires if_output() to look at the next hop IP address, since that indirection should go away.
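To check my own understanding of the route/nexthop split and the direct <route lookup> -> <L2 header> path described above, something like the following is what I picture. The struct and function names are invented for illustration; this is not the existing rtentry layout, the if_output() interface, or a proposed patch:

/*
 * Sketch only: a nexthop object produced by the route lookup that
 * already carries the prebuilt L2 header and the outgoing interface,
 * so per-packet forwarding never touches the next hop IP address.
 */
#include <sys/param.h>
#include <sys/errno.h>
#include <sys/mbuf.h>
#include <sys/systm.h>
#include <net/if.h>
#include <net/if_ether.h>

struct nexthop {
	struct ifnet	*nh_ifp;			/* interface to queue the packet to */
	uint8_t		 nh_l2hdr[ETHER_HDR_LEN];	/* filled in by ARP/ND when resolved */
	uint8_t		 nh_l2hdrlen;
};

struct route_part {
	/* prefix, mask, flags, ... */
	struct nexthop	*rp_nexthop;			/* resolved once, when the route is installed */
};

/* Fast path after the route lookup: prepend the header and hand off. */
static int
nexthop_forward(struct mbuf *m, const struct nexthop *nh)
{
	M_PREPEND(m, nh->nh_l2hdrlen, M_DONTWAIT);
	if (m == NULL)
		return ENOBUFS;
	memcpy(mtod(m, void *), nh->nh_l2hdr, nh->nh_l2hdrlen);
	/* ...then queue the finished frame directly to nh->nh_ifp... */
	return 0;
}

That is, ARP/ND would fill in nh_l2hdr when the next hop resolves (or when the route is installed), and the per-packet path only copies it; nothing in the fast path ever looks at the next hop IP address.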
>
> The neat thing about this is that the internal arrangement that makes one think that the next hop IP address is an important result of a route lookup (it is listed as one in the rtentry structure, and if_output() takes it as an argument) is actually a historical artifact. I think this code was written in about 1980. Then, as now, the point of the route lookup was to determine the L2 header to prepend to the packet and the interface to queue it to, but what was different was the networks that existed then. Almost all of them did <IP address> -> <L2 header> mapping by storing the variable bits of the L2 header directly in the local bits of the IP address; see RFC796 and RFC895 for a whole bunch of examples (the all-zeros-host-part directed broadcast address that 4.2BSD used came from the mapping for experimental ethernet). This meant that the next hop IP address wasn't an indirection at all, it was directly the data you needed to construct the L2 header to add to the packet. The original exception to this was DIX Ethernet, with its 48 bit MAC addresses that were too big to store that way, so the idea of implementing an ARP cache in the interface code and using the next hop IP address as a less efficient indirection to the L2 header data for that type of interface, was invented to make DIX Ethernet look like a "normal" interface where the next hop IP address directly and efficiently provided the L2 bits you needed to know to send the packet.
>
> The thing is that pretty much all the networks that were "normal" in 1980 had disappeared by about 1990, leaving only networks that worked like DIX ethernet. You would think the code would have been restructured for the new "normal" since then, but I guess old code dies hard.
Thank you for the explanation! I see your point now, but I still have some concerns.

One was already raised by joerg: I'm not sure the possible performance degradation would really be that serious. Another concern is that keeping the next hop caches in a common datastore, with dependencies between its entries (rtentry), makes the code complex and makes the locking hard to manage. I once tried to introduce locks into the current implementation, but failed and felt I was heading in the wrong direction. Separating the next hop caches apparently solves the problem (I've looked at FreeBSD's code).

Do you have any ideas for tackling the problem while keeping the current structure? (To be concrete about what I mean by separating the caches, a rough sketch follows at the end of this mail.)

Thanks,
  ozaki-r

>
> Dennis Ferguson
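Here is the rough sketch mentioned above of what I mean by the separation. The routing table would keep only the next hop IP address, and a separate neighbour cache with its own lock would map that address to the L2 header. All names are invented for illustration; this is not FreeBSD's actual lltable code, just the shape of the idea:

/*
 * Sketch only: an L2 (neighbour) cache kept outside the routing table,
 * keyed by the next hop IP address and protected by its own lock, so
 * route.c never has to manipulate cloned rtentries.
 */
#include <sys/param.h>
#include <sys/queue.h>
#include <sys/mutex.h>
#include <sys/systm.h>
#include <netinet/in.h>
#include <net/if_ether.h>

struct l2cache_entry {
	LIST_ENTRY(l2cache_entry) le_link;
	struct in_addr	le_nexthop;			/* key: next hop IP address */
	uint8_t		le_l2hdr[ETHER_HDR_LEN];	/* filled in by ARP/ND */
	bool		le_valid;
};

struct l2cache {
	kmutex_t	lc_lock;			/* independent of any routing table lock */
	LIST_HEAD(, l2cache_entry) lc_entries;
};

/*
 * Output path: the route lookup yields the next hop IP address, then
 * this second lookup yields the L2 header.  It is exactly the extra
 * indirection you describe, but each data structure can be locked on
 * its own.
 */
static bool
l2cache_lookup(struct l2cache *lc, struct in_addr nh, uint8_t *hdr)
{
	struct l2cache_entry *le;
	bool found = false;

	mutex_enter(&lc->lc_lock);
	LIST_FOREACH(le, &lc->lc_entries, le_link) {
		if (le->le_valid && le->le_nexthop.s_addr == nh.s_addr) {
			memcpy(hdr, le->le_l2hdr, ETHER_HDR_LEN);
			found = true;
			break;
		}
	}
	mutex_exit(&lc->lc_lock);
	return found;
}

ARP/ND would insert and update entries under lc_lock alone, so route.c never has to recurse through cloned rtentries; the price is the second per-packet lookup you describe.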