On 12/08/15(Wed) 17:03, Martin Pieuchot wrote:
> I'm currently working on the routing table interface to make is safe
> to use by multiple CPUs at the same time. The diff below is a big
> step in this direction and I'd really appreciate if people could test
> it with their usual network setup and report back.
Updated version to match recent changes. I'm still looking for test
reports and reviews.
> The goal of this diff is to "cache" the route corresponding to "your"
> next hop as early as possible. Let's assume you're using a common
> dhcp-based network:
>
> mpi@goiaba $ netstat -rnf inet|egrep "(default|Dest)"
> Destination Gateway Flags Refs Use Mtu Prio Iface
> default 192.168.0.1 UGS 5 508 - 8 em0
>
> Here my default route points to a gateway (G) whose address is
> 192.168.0.1. In such setup your computer generally sends most of the
> packets to the internet through this gateway. But to do that it needs
> more informations:
>
> mpi@goiaba $ netstat -rnf inet|egrep "(192.168.0.1.*L|Dest)"
> Destination Gateway Flags Refs Use Mtu Prio Iface
> 192.168.0.1 bc:05:43:bd:3e:29 UHLc 1 149 - 8 em0
>
> Yes this is another route. This one contains link-layer informations
> (L) and has been cloned (c). This route is what I described before as
> "your" next hop. In this case, "your" is a shortcut for "the next hop
> of your default route" but all of this is valid for any route pointing
> to a gateway (G).
>
> In order to send packets via my default route, the kernel needs to know
> the link-layer address corresponding to the IP address of the gateway.
> This is called "Address Resolution" in network jargon. In OpenBSD,
> resolved addresses appear in the routing table with a link-layer address
> in the "Gateway" field, as shown previously.
>
> This resolution is done in the kernel by calling rtalloc(9) with the
> RT_RESOLVE flag for the wanted destination, in my case 192.168.0.1.
> Once the resolution is complete, a corresponding entry appears in the
> routing table and there's no need to redo it for a certain period of
> time. That is what I meant with "cache".
>
> Currently this resolution is done "late" in the journey of a packet and
> that's fine since it is not done often. Late means that it is done when
> the packet reaches a L2 output function, nd6_output() or arpresolve().
>
> The problem is that having a proper reference count on route entries in
> these functions is really complicated because you can end up using 3
> different routes. So this diff starts the resolution early: as soon as
> a gateway route is returned by rtalloc(9).
>
> It also makes sense to do the resolution as soon as possible since we
> need the link-layer address to send the packet.
>
> One important point: gateway routes (rt_gwroute) are only returned to
> the stack in L2 functions and when that happens, their reference
> counter is not incremented. That's why the reference count for such
> routes is almost always 1. They are the simplest example of working
> route caching in our kernel*. That means that when you purge your
> cloned route, rt_gwroute will still be valid but marked as RTP_DOWN
> until a new resolution is started.
>
> This diff changes rt_checkgate() to only do sanity checks (finally!).
>
> Do not hesitate to ask questions if something is not clear, I believe
> it's important that more people understand this.
>
> Note that this diff includes other bits to be committed separately:
>
> - Deprecate the use of RTF_XRESOLVE in rtalloc(9)
> - Remove PF_KEY-specific code & comments now that SPD lookups no
> longer use rtalloc(9).
> - Make rtfree(9) accept NULL
>
>
> * That's why I'm slowly killing "struct route" & friends to use the
> simplest route caching mechanism everywhere.
Index: net/route.c
===================================================================
RCS file: /cvs/src/sys/net/route.c,v
retrieving revision 1.225
diff -u -p -r1.225 route.c
--- net/route.c 24 Aug 2015 22:11:33 -0000 1.225
+++ net/route.c 25 Aug 2015 10:23:02 -0000
@@ -153,6 +153,7 @@ int rtable_alloc(void ***, u_int);
int rtflushclone1(struct rtentry *, void *, u_int);
void rtflushclone(unsigned int, struct rtentry *);
int rt_if_remove_rtdelete(struct rtentry *, void *, u_int);
+struct rtentry *rt_match(struct sockaddr *, int, unsigned int);
struct ifaddr *ifa_ifwithroute(int, struct sockaddr *, struct sockaddr *,
u_int);
@@ -297,19 +298,32 @@ rtable_exists(u_int id) /* verify table
return (1);
}
+/*
+ * Do the actual lookup for rtalloc(9), do not use directly!
+ *
+ * Return the best matching entry for the destination ``dst''.
+ *
+ * "RT_RESOLVE" means that a corresponding L2 entry should
+ * be added to the routing table and resolved (via ARP or
+ * NDP), if it does not exist.
+ *
+ * "RT_REPORT" indicates that a message should be sent to
+ * userland if no matching route has been found or if an
+ * error occured while adding a L2 entry.
+ */
struct rtentry *
-rtalloc(struct sockaddr *dst, int flags, unsigned int tableid)
+rt_match(struct sockaddr *dst, int flags, unsigned int tableid)
{
struct rtentry *rt;
struct rtentry *newrt = NULL;
struct rt_addrinfo info;
- int s, error = 0, msgtype = RTM_MISS;
+ int s, error = 0;
- s = splsoftnet();
bzero(&info, sizeof(info));
info.rti_info[RTAX_DST] = dst;
+ s = splsoftnet();
rt = rtable_match(tableid, dst);
if (rt != NULL) {
newrt = rt;
@@ -322,10 +336,6 @@ rtalloc(struct sockaddr *dst, int flags,
goto miss;
}
rt = newrt;
- if (rt->rt_flags & RTF_XRESOLVE) {
- msgtype = RTM_RESOLVE;
- goto miss;
- }
/* Inform listeners of the new route */
rt_sendmsg(rt, RTM_ADD, tableid);
} else
@@ -333,11 +343,8 @@ rtalloc(struct sockaddr *dst, int flags,
} else {
rtstat.rts_unreach++;
miss:
- if (ISSET(flags, RT_REPORT)) {
- bzero((caddr_t)&info, sizeof(info));
- info.rti_info[RTAX_DST] = dst;
- rt_missmsg(msgtype, &info, 0, NULL, error, tableid);
- }
+ if (ISSET(flags, RT_REPORT))
+ rt_missmsg(RTM_MISS, &info, 0, NULL, error, tableid);
}
splx(s);
return (newrt);
@@ -371,6 +378,75 @@ rtalloc_mpath(struct sockaddr *dst, uint
}
#endif /* SMALL_KERNEL */
+/*
+ * Look in the routing table for the best matching entry for
+ * ``dst''.
+ *
+ * If a route with a gateway is found and its next hop is no
+ * longer valid, try to cache it.
+ */
+struct rtentry *
+rtalloc(struct sockaddr *dst, int flags, unsigned int rtableid)
+{
+ struct rtentry *rt, *nhrt;
+
+ rt = rt_match(dst, flags, rtableid);
+
+ /* No match or route to host? We're done. */
+ if (rt == NULL || (rt->rt_flags & RTF_GATEWAY) == 0)
+ return (rt);
+
+ nhrt = rt->rt_gwroute;
+
+ /* Nothing to do if the next hop is valid. */
+ if (nhrt != NULL && (nhrt->rt_flags & RTF_UP))
+ return (rt);
+
+ rtfree(rt->rt_gwroute);
+ rt->rt_gwroute = NULL;
+
+ /*
+ * If we cannot find a valid next hop, return the route
+ * with a gateway.
+ *
+ * Some dragons hiding in the tree certainly depends on
+ * this behavior.
+ */
+ nhrt = rt_match(rt->rt_gateway, flags | RT_RESOLVE, rtableid);
+ if (nhrt == NULL)
+ return (rt);
+
+ /*
+ * Next hop must be reachable, this also prevents rtentry
+ * loops for example when rt->rt_gwroute points to rt.
+ */
+ if ((nhrt->rt_flags & (RTF_UP|RTF_CLONING|RTF_GATEWAY)) != RTF_UP) {
+ rtfree(nhrt);
+ return (rt);
+ }
+
+ /*
+ * Next hop entry MUST be on the same interface.
+ *
+ * XXX We could use a KASSERT() here if routes with dangling
+ * ``ifa'' pointers were dropped.
+ */
+ if (nhrt->rt_ifp != rt->rt_ifp) {
+ rtfree(nhrt);
+ return (rt);
+ }
+
+ /*
+ * If the MTU of next hop is 0, this will reset the MTU of the
+ * route to run PMTUD again from scratch.
+ */
+ if (!ISSET(rt->rt_locks, RTV_MTU) && (rt->rt_mtu > nhrt->rt_mtu))
+ rt->rt_mtu = nhrt->rt_mtu;
+
+ rt->rt_gwroute = nhrt;
+ return (rt);
+}
+
void
rtfree(struct rtentry *rt)
{
@@ -524,7 +600,7 @@ create:
rt->rt_flags |= RTF_MODIFIED;
flags |= RTF_MODIFIED;
stat = &rtstat.rts_newgateway;
- rt_setgate(rt, gateway, rdomain);
+ rt_setgate(rt, gateway);
}
} else
error = EHOSTUNREACH;
@@ -979,8 +1055,7 @@ rtrequest1(int req, struct rt_addrinfo *
* the routing table because the radix MPATH code use
* it to (re)order routes.
*/
- if ((error = rt_setgate(rt, info->rti_info[RTAX_GATEWAY],
- tableid))) {
+ if ((error = rt_setgate(rt, info->rti_info[RTAX_GATEWAY]))) {
free(ndst, M_RTABLE, dlen);
pool_put(&rtentry_pool, rt);
return (error);
@@ -1031,7 +1106,7 @@ rtrequest1(int req, struct rt_addrinfo *
}
int
-rt_setgate(struct rtentry *rt, struct sockaddr *gate, unsigned int tableid)
+rt_setgate(struct rtentry *rt, struct sockaddr *gate)
{
int glen = ROUNDUP(gate->sa_len);
struct sockaddr *sa;
@@ -1049,22 +1124,7 @@ rt_setgate(struct rtentry *rt, struct so
rtfree(rt->rt_gwroute);
rt->rt_gwroute = NULL;
}
- if (rt->rt_flags & RTF_GATEWAY) {
- /* XXX is this actually valid to cross tables here? */
- rt->rt_gwroute = rtalloc(gate, RT_REPORT|RT_RESOLVE, tableid);
- /*
- * If we switched gateways, grab the MTU from the new
- * gateway route if the current MTU is 0 or greater
- * than the MTU of gateway.
- * Note that, if the MTU of gateway is 0, we will reset the
- * MTU of the route to run PMTUD again from scratch. XXX
- */
- if (rt->rt_gwroute && !(rt->rt_rmx.rmx_locks & RTV_MTU) &&
- rt->rt_rmx.rmx_mtu &&
- rt->rt_rmx.rmx_mtu > rt->rt_gwroute->rt_rmx.rmx_mtu) {
- rt->rt_rmx.rmx_mtu = rt->rt_gwroute->rt_rmx.rmx_mtu;
- }
- }
+
return (0);
}
@@ -1076,28 +1136,21 @@ rt_checkgate(struct ifnet *ifp, struct r
KASSERT(rt != NULL);
- if ((rt->rt_flags & RTF_UP) == 0) {
- rt = rtalloc(dst, RT_REPORT|RT_RESOLVE, rtableid);
- if (rt == NULL)
- return (EHOSTUNREACH);
- rt->rt_refcnt--;
- if (rt->rt_ifp != ifp)
- return (EHOSTUNREACH);
- }
+ if ((rt->rt_flags & RTF_UP) == 0)
+ return (EHOSTUNREACH);
rt0 = rt;
if (rt->rt_flags & RTF_GATEWAY) {
- if (rt->rt_gwroute && !(rt->rt_gwroute->rt_flags & RTF_UP)) {
+ if (rt->rt_gwroute == NULL)
+ return (EHOSTUNREACH);
+
+ if ((rt->rt_gwroute->rt_flags & RTF_UP) == 0) {
rtfree(rt->rt_gwroute);
rt->rt_gwroute = NULL;
+ return (EHOSTUNREACH);
}
- if (rt->rt_gwroute == NULL) {
- rt->rt_gwroute = rtalloc(rt->rt_gateway,
- RT_REPORT|RT_RESOLVE, rtableid);
- if (rt->rt_gwroute == NULL)
- return (EHOSTUNREACH);
- }
+
/*
* Next hop must be reachable, this also prevents rtentry
* loops, for example when rt->rt_gwroute points to rt.
Index: net/route.h
===================================================================
RCS file: /cvs/src/sys/net/route.h,v
retrieving revision 1.110
diff -u -p -r1.110 route.h
--- net/route.h 20 Aug 2015 12:39:43 -0000 1.110
+++ net/route.h 25 Aug 2015 10:23:02 -0000
@@ -118,6 +118,8 @@ struct rtentry {
};
#define rt_use rt_rmx.rmx_pksent
#define rt_expire rt_rmx.rmx_expire
+#define rt_locks rt_rmx.rmx_locks
+#define rt_mtu rt_rmx.rmx_mtu
#define RTF_UP 0x1 /* route usable */
#define RTF_GATEWAY 0x2 /* destination is a gateway */
@@ -361,7 +363,7 @@ void rt_sendmsg(struct rtentry *, int,
void rt_sendaddrmsg(struct rtentry *, int);
void rt_missmsg(int, struct rt_addrinfo *, int, struct ifnet *, int,
u_int);
-int rt_setgate(struct rtentry *, struct sockaddr *, unsigned int);
+int rt_setgate(struct rtentry *, struct sockaddr *);
int rt_checkgate(struct ifnet *, struct rtentry *, struct sockaddr *,
unsigned int, struct rtentry **);
void rt_setmetrics(u_long, struct rt_metrics *, struct rt_kmetrics *);
Index: net/rtsock.c
===================================================================
RCS file: /cvs/src/sys/net/rtsock.c,v
retrieving revision 1.169
diff -u -p -r1.169 rtsock.c
--- net/rtsock.c 24 Aug 2015 22:11:33 -0000 1.169
+++ net/rtsock.c 25 Aug 2015 10:23:02 -0000
@@ -744,9 +744,8 @@ report:
info.rti_info[RTAX_GATEWAY]->sa_len)) {
newgate = 1;
}
- if (info.rti_info[RTAX_GATEWAY] != NULL &&
- (error = rt_setgate(rt, info.rti_info[RTAX_GATEWAY],
- tableid)))
+ if (info.rti_info[RTAX_GATEWAY] != NULL && (error =
+ rt_setgate(rt, info.rti_info[RTAX_GATEWAY])))
goto flush;
/*
* new gateway could require new ifaddr, ifp;