[PATCH net v2] ipv4: Avoid caching l3mdev dst on mismatched local route
David reported that doing the following: ip li add red type vrf table 10 ip link set dev eth1 vrf red ip addr add 127.0.0.1/8 dev red ip link set dev eth1 up ip li set red up ping -c1 -w1 -I red 127.0.0.1 ip li del red when either policy routing IP rules are present or the local table lookup ip rule is before the l3mdev lookup results in a hang with these messages: unregister_netdevice: waiting for red to become free. Usage count = 1 The problem is caused by caching the dst used for sending the packet out of the specified interface on a local route with a different nexthop interface. Thus the dst could stay around until the route in the table the lookup was done is deleted which may be never. Address the problem by not forcing output device to be the l3mdev in the flow's output interface if the lookup didn't use the l3mdev. This then results in the dst using the right device according to the route. Changes in v2: - make the dev_out passed in by __ip_route_output_key_hash correct instead of checking the nh dev if FLOWI_FLAG_SKIP_NH_OIF is set as suggested by David. Fixes: 5f02ce24c2696 ("net: l3mdev: Allow the l3mdev to be a loopback") Reported-by: David Ahern Suggested-by: David Ahern Signed-off-by: Robert Shearman --- net/ipv4/route.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index acd69cfe2951..d9724889ff09 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2359,7 +2359,8 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, } /* L3 master device is the loopback for that domain */ - dev_out = l3mdev_master_dev_rcu(dev_out) ? : net->loopback_dev; + dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? : + net->loopback_dev; fl4->flowi4_oif = dev_out->ifindex; flags |= RTCF_LOCAL; goto make_route; -- 2.1.4
Re: [PATCH net-next 0/3] l3mdev: Improve use with main table
On 20/04/17 23:36, David Ahern wrote: On 4/10/17 8:21 AM, Robert Shearman wrote: Attempting to create a TCP socket not bound to a VRF device when a TCP socket bound to a VRF device with the same port exists (and vice versa) fails with EADDRINUSE. This limits the ability to use programs in selected mixed VRF/non-VRF contexts. This patch series solves the problem by extending the l3mdev be aware of the special semantics of the main table and fixing issues arising from the split local/main tables. A VRF master device created linking to the main table and used for these programs in the same way as those created for VRF tables can. Robert Shearman (3): ipv6: Fix route handling when using l3mdev set to main table ipv4: Fix route handling when using l3mdev set to main table l3mdev: Fix lookup in local table when using main table net/ipv4/af_inet.c | 4 +++- net/ipv4/fib_frontend.c | 14 +- net/ipv4/raw.c | 5 - net/ipv6/addrconf.c | 12 +--- net/ipv6/route.c| 23 ++- net/l3mdev/l3mdev.c | 26 -- 6 files changed, 63 insertions(+), 21 deletions(-) With the change I mentioned earlier to fix the refcnt issue on top of this patch set I see a number of failures: - local IPv4 with 127.0.0.1 address - ping, tcp, udp tests fail - all of the IPv4 multicast tests fail - IPv6 linklocal and mcast addresses generally fail - IPv6 global address on vrf device Can you send me some more details of your testing? These work for me: $ ping -c1 -I vrf-default 127.0.0.1 PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 vrf-default: 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.141 ms --- 127.0.0.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.141/0.141/0.141/0.000 ms $ ping6 -c1 -I vrf-default 2001::1 PING 2001::1(2001::1) from 2001::1 vrf-default: 56 data bytes 64 bytes from 2001::1: icmp_seq=1 ttl=64 time=0.069 ms --- 2001::1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.069/0.069/0.069/0.000 ms $ sudo ip vrf exec vrf-default nc -l -p 4013 TEST $ sudo ip vrf exec vrf-default nc -l -u -p 4013 TEST ^C (with a neighbouring host using nc to send the TEST string for the udp and tcp cases) Thanks, Rob
Re: [PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check
On 20/04/17 23:18, David Ahern wrote: On 4/20/17 6:58 AM, Robert Shearman wrote: diff --git a/net/ipv4/route.c b/net/ipv4/route.c index acd69cfe2951..f667783ffd19 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2125,6 +2125,14 @@ static struct rtable *__mkroute_output(const struct fib_result *res, fi = NULL; } + /* If the flag to skip the nh oif check is set then the output +* device may not match the nh device, so cannot use or add to +* cache in that case. +*/ + if (unlikely(fl4->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF && +FIB_RES_NH(*res).nh_dev != dev_out)) + do_cache = false; + fnhe = NULL; do_cache &= fi != NULL; if (do_cache) { I believe this is a better fix: diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 5e1e60546fce..fb74a16958af 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2407,7 +2407,7 @@ struct rtable *__ip_route_output_key_hash(struct net *net, struct flowi4 *fl4, } /* L3 master device is the loopback for that domain */ - dev_out = l3mdev_master_dev_rcu(dev_out) ? : net->loopback_dev; + dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? : net->loopback_dev; fl4->flowi4_oif = dev_out->ifindex; flags |= RTCF_LOCAL; goto make_route; Fixes: 5f02ce24c2696 ("net: l3mdev: Allow the l3mdev to be a loopback") With your change above, references to vrf devices are still taken (dev_out is the vrf device based on the flow struct) even though the route's nexthop is in another domain. And the commit log should reference the use case which is policy routing overriding the VRF rule. That is indeed a nicer fix - it survives all of my local testing. Thanks for correcting the fixes tag too. I had included this text in the commit message to capture the condition of the rules ordering: "when the rule for the lookup in the local table is ordered before the rule for lookups using l3mdevs". However, I'll try to make it more prominent and expand it to note the policy routing use case too. Thanks, Rob
Re: [PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check
On 20/04/17 16:16, David Ahern wrote: On 4/20/17 9:05 AM, Robert Shearman wrote: The key thing I think is the ip rules: $ ip rule 0:from all lookup local 1000:from all lookup [l3mdev-table] 32766:from all lookup main 32767:from all lookup default Maybe you have the local rule moved down at startup? yes that would be a problem. With this test the 127.0.0.1 lookup is matching on the wrong table: perf probe fib_table_lookup%return ret=%ax perf record -e fib:*,probe:* -a -- ping -c1 -w1 -I red 127.0.0.1 ... perf script ... Table 255 is the wrong table. That is the ultimate problem here. The table used for the lookup could be retrieved from the oif, but the check I introduced would still be required to cope with the unusual, but not disallowed, case of two VRF master devices referring to the same table. Thanks, Rob
Re: [PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check
On 20/04/17 15:59, David Ahern wrote: On 4/20/17 8:39 AM, Robert Shearman wrote: On 20/04/17 15:21, David Ahern wrote: On 4/20/17 6:58 AM, Robert Shearman wrote: David reported that doing the following: ip li add red type vrf table 10 ip link set dev eth1 vrf red ip addr add 127.0.0.1/8 dev red ip link set dev eth1 up ip li set red up ping -c1 -w1 -I red 127.0.0.1 ip li del red results in a hang with this message: unregister_netdevice: waiting for red to become free. Usage count = 1 I think you misunderstood my comment. The above works fine today. There is no bug with refcnts. It breaks with your patches wanting to use a VRF device with the main table (254). That doesn't seem to match with my experience. I can reproduce this on the net tree with the listed commands and the behaviour matches what I see in the code. Our mileage varies: root@kenny-jessie3:~# ip li add red type vrf table 10 root@kenny-jessie3:~# ip link set dev eth1 vrf red root@kenny-jessie3:~# ip addr add 127.0.0.1/8 dev red root@kenny-jessie3:~# ip link set dev eth1 up root@kenny-jessie3:~# ip li set red up root@kenny-jessie3:~# ping -c1 -w1 -I red 127.0.0.1 PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data. 64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.059 ms --- 127.0.0.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.059/0.059/0.059/0.000 ms root@kenny-jessie3:~# ip li del red root@kenny-jessie3:~# uname -a Linux kenny-jessie3 4.11.0-rc6+ #8 SMP Wed Apr 19 11:53:48 PDT 2017 x86_64 GNU/Linux The above is one of many tests I run and never hit a problem deleting a VRF device. dst's attached to fnhe_rth_output and fnhe_rth_input appear to be properly flushed and device references released when the device is deleted (NETDEV_DOWN and then NETDEV_UNREGISTER). Can you send me your kernel config and "sysctl -a --pattern 'net.ipv4'"? Perhaps you have something enabled I don't. The key thing I think is the ip rules: $ ip rule 0: from all lookup local 1000: from all lookup [l3mdev-table] 32766: from all lookup main 32767: from all lookup default Maybe you have the local rule moved down at startup? I'll send you the requested information if not. Thanks, Rob
Re: [PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check
On 20/04/17 15:21, David Ahern wrote: On 4/20/17 6:58 AM, Robert Shearman wrote: David reported that doing the following: ip li add red type vrf table 10 ip link set dev eth1 vrf red ip addr add 127.0.0.1/8 dev red ip link set dev eth1 up ip li set red up ping -c1 -w1 -I red 127.0.0.1 ip li del red results in a hang with this message: unregister_netdevice: waiting for red to become free. Usage count = 1 I think you misunderstood my comment. The above works fine today. There is no bug with refcnts. It breaks with your patches wanting to use a VRF device with the main table (254). That doesn't seem to match with my experience. I can reproduce this on the net tree with the listed commands and the behaviour matches what I see in the code. Thanks, Rob
Re: [PATCH net-next 0/3] l3mdev: Improve use with main table
On 13/04/17 15:36, David Ahern wrote: On 4/13/17 6:48 AM, Robert Shearman wrote: the patches look ok to me, but in testing them I see I refcnt problem. simple reproducer: ip li add red type vrf table 254 ip link set dev eth1 vrf red ip addr add 127.0.0.1/8 dev red ip link set dev eth1 up ip li set red up ping -c1 -w1 -I red 127.0.0.1 ip li del red --> hangs with 'uregister_netdevice: waiting for red to become free.' Right, I've reproduced the same and it occurs even without using the main table and I believe it has been introduced within the last week. The cached dst on sockets is one known place that causes a hang on a delete. Basically the delete stalls until the sockets are closed. I have a patch for sk_rx_dst which is the one I chased down last week. sk_dst_cache is another possibility. Neither of those should be at play with the above example because the ping command runs and then exits. Thanks for the pointers. My earlier assessment that this was something recent turned out to be wrong. I've sent a patch against net that fixes the issue. Thanks, Rob
[PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check
David reported that doing the following: ip li add red type vrf table 10 ip link set dev eth1 vrf red ip addr add 127.0.0.1/8 dev red ip link set dev eth1 up ip li set red up ping -c1 -w1 -I red 127.0.0.1 ip li del red results in a hang with this message: unregister_netdevice: waiting for red to become free. Usage count = 1 The problem is caused by caching the dst used for sending the packet out of the specified interface on the route that the lookup returned from the local table when the rule for the lookup in the local table is ordered before the rule for lookups using l3mdevs. Thus the dst could stay around until the route in the local table is deleted which may be never. Address the problem by not allocating a cacheable output dst if FLOWI_FLAG_SKIP_NH_OIF is set and the nh device differs from the device used for the dst. Fixes: ebfc102c566d ("net: vrf: Flip IPv4 output path from FIB lookup hook to out hook") Reported-by: David Ahern Signed-off-by: Robert Shearman --- net/ipv4/route.c | 8 1 file changed, 8 insertions(+) diff --git a/net/ipv4/route.c b/net/ipv4/route.c index acd69cfe2951..f667783ffd19 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -2125,6 +2125,14 @@ static struct rtable *__mkroute_output(const struct fib_result *res, fi = NULL; } + /* If the flag to skip the nh oif check is set then the output +* device may not match the nh device, so cannot use or add to +* cache in that case. +*/ + if (unlikely(fl4->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF && +FIB_RES_NH(*res).nh_dev != dev_out)) + do_cache = false; + fnhe = NULL; do_cache &= fi != NULL; if (do_cache) { -- 2.1.4
Re: [PATCH net-next 0/3] l3mdev: Improve use with main table
On 12/04/17 17:51, David Ahern wrote: On 4/10/17 8:21 AM, Robert Shearman wrote: Attempting to create a TCP socket not bound to a VRF device when a TCP socket bound to a VRF device with the same port exists (and vice versa) fails with EADDRINUSE. This limits the ability to use programs in selected mixed VRF/non-VRF contexts. This patch series solves the problem by extending the l3mdev be aware of the special semantics of the main table and fixing issues arising from the split local/main tables. A VRF master device created linking to the main table and used for these programs in the same way as those created for VRF tables can. Robert Shearman (3): ipv6: Fix route handling when using l3mdev set to main table ipv4: Fix route handling when using l3mdev set to main table l3mdev: Fix lookup in local table when using main table net/ipv4/af_inet.c | 4 +++- net/ipv4/fib_frontend.c | 14 +- net/ipv4/raw.c | 5 - net/ipv6/addrconf.c | 12 +--- net/ipv6/route.c| 23 ++- net/l3mdev/l3mdev.c | 26 -- 6 files changed, 63 insertions(+), 21 deletions(-) the patches look ok to me, but in testing them I see I refcnt problem. simple reproducer: ip li add red type vrf table 254 ip link set dev eth1 vrf red ip addr add 127.0.0.1/8 dev red ip link set dev eth1 up ip li set red up ping -c1 -w1 -I red 127.0.0.1 ip li del red --> hangs with 'uregister_netdevice: waiting for red to become free.' Right, I've reproduced the same and it occurs even without using the main table and I believe it has been introduced within the last week. I'll bisect to find out the cause. Thanks for reviewing and testing, Rob
[PATCH v2 iproute2 net-next 0/2] ip: Allow TTL propagation to/from IP packets to be configured
This patch series adds support for per-MPLS-lightweight-tunnel ttl values and per route ttl-propagation for the purposes of MPLS to be specified. Changes in v2: - use matches instead of strcmp for enabled/disabled keywords to avoid excessive typing Robert Shearman (2): iproute: Add support for ttl-propagation attribute iproute: Add support for MPLS LWT ttl attribute ip/iproute.c | 22 ++ ip/iproute_lwtunnel.c | 31 +-- man/man8/ip-route.8.in | 19 +-- 3 files changed, 68 insertions(+), 4 deletions(-) -- 2.1.4
[PATCH v2 iproute2 net-next 1/2] iproute: Add support for ttl-propagation attribute
Add support for setting and displaying the ttl-propagation attribute initially used by MPLS to control propagation of MPLS TTL to IPv4/IPv6 TTL/hop-limit on popping final label on a per-route basis. Signed-off-by: Robert Shearman --- ip/iproute.c | 22 ++ man/man8/ip-route.8.in | 10 +- 2 files changed, 31 insertions(+), 1 deletion(-) diff --git a/ip/iproute.c b/ip/iproute.c index 7cdf0726feb3..6f75319c944a 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -77,6 +77,7 @@ static void usage(void) fprintf(stderr, "NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]\n"); fprintf(stderr, " [ table TABLE_ID ] [ proto RTPROTO ]\n"); fprintf(stderr, " [ scope SCOPE ] [ metric METRIC ]\n"); + fprintf(stderr, " [ ttl-propagate { enabled | disabled } ]\n"); fprintf(stderr, "INFO_SPEC := NH OPTIONS FLAGS [ nexthop NH ]...\n"); fprintf(stderr, "NH := [ encap ENCAPTYPE ENCAPHDR ] [ via [ FAMILY ] ADDRESS ]\n"); fprintf(stderr, " [ dev STRING ] [ weight NUMBER ] NHFLAGS\n"); @@ -714,6 +715,13 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) fprintf(fp, "%u", pref); } } + if (tb[RTA_TTL_PROPAGATE]) { + fprintf(fp, "ttl-propagate "); + if (rta_getattr_u8(tb[RTA_TTL_PROPAGATE])) + fprintf(fp, "enabled"); + else + fprintf(fp, "disabled"); + } fprintf(fp, "\n"); fflush(fp); return 0; @@ -1184,6 +1192,20 @@ static int iproute_modify(int cmd, unsigned int flags, int argc, char **argv) if (rta->rta_len > RTA_LENGTH(0)) addraw_l(&req.n, 1024, RTA_DATA(rta), RTA_PAYLOAD(rta)); + } else if (strcmp(*argv, "ttl-propagate") == 0) { + __u8 ttl_prop; + + NEXT_ARG(); + if (matches(*argv, "enabled") == 0) + ttl_prop = 1; + else if (matches(*argv, "disabled") == 0) + ttl_prop = 0; + else + invarg("\"ttl-propagate\" value is invalid\n", + *argv); + + addattr8(&req.n, sizeof(req), RTA_TTL_PROPAGATE, +ttl_prop); } else { int type; inet_prefix dst; diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in index d6e06649a464..fbe2711a4301 100644 --- a/man/man8/ip-route.8.in +++ b/man/man8/ip-route.8.in @@ -75,7 +75,9 @@ replace " } " .B scope .IR SCOPE " ] [ " .B metric -.IR METRIC " ]" +.IR METRIC " ] [ " +.B ttl-propagate +.RB "{ " enabled " | " disabled " } ]" .ti -8 .IR INFO_SPEC " := " "NH OPTIONS FLAGS" " [" @@ -710,6 +712,12 @@ is a set of encapsulation attributes specific to the the route will be deleted after the expires time. .B Only support IPv6 at present. + +.TP +.BR ttl-propagate " { " enabled " | " disabled " } " +Control whether TTL should be propagated from any encap into the +un-encapsulated packet, overriding any global configuration. Only +supported for MPLS at present. .RE .TP -- 2.1.4
[PATCH v2 iproute2 net-next 2/2] iproute: Add support for MPLS LWT ttl attribute
Add support for setting and displaying the ttl attribute for MPLS IP lighweight tunnels. Signed-off-by: Robert Shearman --- ip/iproute_lwtunnel.c | 31 +-- man/man8/ip-route.8.in | 9 - 2 files changed, 37 insertions(+), 3 deletions(-) diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c index 0fa1cab0a790..845a115e9e41 100644 --- a/ip/iproute_lwtunnel.c +++ b/ip/iproute_lwtunnel.c @@ -84,6 +84,9 @@ static void print_encap_mpls(FILE *fp, struct rtattr *encap) if (tb[MPLS_IPTUNNEL_DST]) fprintf(fp, " %s ", format_host_rta(AF_MPLS, tb[MPLS_IPTUNNEL_DST])); + if (tb[MPLS_IPTUNNEL_TTL]) + fprintf(fp, "ttl %u ", + rta_getattr_u8(tb[MPLS_IPTUNNEL_TTL])); } static void print_encap_ip(FILE *fp, struct rtattr *encap) @@ -247,6 +250,7 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len, inet_prefix addr; int argc = *argcp; char **argv = *argvp; + int ttl_ok = 0; if (get_addr(&addr, *argv, AF_MPLS)) { fprintf(stderr, @@ -258,8 +262,31 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len, rta_addattr_l(rta, len, MPLS_IPTUNNEL_DST, &addr.data, addr.bytelen); - *argcp = argc; - *argvp = argv; + argc--; + argv++; + + while (argc > 0) { + if (strcmp(*argv, "ttl") == 0) { + __u8 ttl; + + NEXT_ARG(); + if (ttl_ok++) + duparg2("ttl", *argv); + if (get_u8(&ttl, *argv, 0)) + invarg("\"ttl\" value is invalid\n", *argv); + rta_addattr8(rta, len, MPLS_IPTUNNEL_TTL, ttl); + } else { + break; + } + argc--; argv++; + } + + /* argv is currently the first unparsed argument, +* but the lwt_parse_encap() caller will move to the next, +* so step back +*/ + *argcp = argc + 1; + *argvp = argv - 1; return 0; } diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in index fbe2711a4301..d2a44acf2793 100644 --- a/man/man8/ip-route.8.in +++ b/man/man8/ip-route.8.in @@ -181,7 +181,9 @@ throw " | " unreachable " | " prohibit " | " blackhole " | " nat " ]" .ti -8 .IR ENCAP_MPLS " := " .BR mpls " [ " -.IR LABEL " ]" +.IR LABEL " ] [" +.B ttl +.IR TTL " ]" .ti -8 .IR ENCAP_IP " := " @@ -666,6 +668,11 @@ is a set of encapsulation attributes specific to the .I MPLSLABEL - mpls label stack with labels separated by .I "/" +.sp + +.B ttl +.I TTL +- TTL to use for MPLS header or 0 to inherit from IP header .in -2 .sp -- 2.1.4
Re: [PATCH iproute2 net-next 1/2] iproute: Add support for ttl-propagation attribute
On 11/04/17 04:43, David Ahern wrote: On 4/10/17 8:36 AM, Robert Shearman wrote: @@ -1184,6 +1192,20 @@ static int iproute_modify(int cmd, unsigned int flags, int argc, char **argv) if (rta->rta_len > RTA_LENGTH(0)) addraw_l(&req.n, 1024, RTA_DATA(rta), RTA_PAYLOAD(rta)); + } else if (strcmp(*argv, "ttl-propagate") == 0) { + __u8 ttl_prop; + + NEXT_ARG(); + if (strcmp(*argv, "enabled") == 0) + ttl_prop = 1; + else if (strcmp(*argv, "disabled") == 0) + ttl_prop = 0; + else + invarg("\"ttl-propagate\" value is invalid\n", + *argv); + matches() instead of strcmp() is more user friendly. 'enabled' is a lot to type. ;-) Ok, will change as suggested in v2. Thanks for reviewing, Rob
[PATCH iproute2 net-next 1/2] iproute: Add support for ttl-propagation attribute
Add support for setting and displaying the ttl-propagation attribute initially used by MPLS to control propagation of MPLS TTL to IPv4/IPv6 TTL/hop-limit on popping final label on a per-route basis. Signed-off-by: Robert Shearman --- ip/iproute.c | 22 ++ man/man8/ip-route.8.in | 10 +- 2 files changed, 31 insertions(+), 1 deletion(-) diff --git a/ip/iproute.c b/ip/iproute.c index 7cdf0726feb3..c2b681fc9730 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -77,6 +77,7 @@ static void usage(void) fprintf(stderr, "NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]\n"); fprintf(stderr, " [ table TABLE_ID ] [ proto RTPROTO ]\n"); fprintf(stderr, " [ scope SCOPE ] [ metric METRIC ]\n"); + fprintf(stderr, " [ ttl-propagate { enabled | disabled } ]\n"); fprintf(stderr, "INFO_SPEC := NH OPTIONS FLAGS [ nexthop NH ]...\n"); fprintf(stderr, "NH := [ encap ENCAPTYPE ENCAPHDR ] [ via [ FAMILY ] ADDRESS ]\n"); fprintf(stderr, " [ dev STRING ] [ weight NUMBER ] NHFLAGS\n"); @@ -714,6 +715,13 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) fprintf(fp, "%u", pref); } } + if (tb[RTA_TTL_PROPAGATE]) { + fprintf(fp, "ttl-propagate "); + if (rta_getattr_u8(tb[RTA_TTL_PROPAGATE])) + fprintf(fp, "enabled"); + else + fprintf(fp, "disabled"); + } fprintf(fp, "\n"); fflush(fp); return 0; @@ -1184,6 +1192,20 @@ static int iproute_modify(int cmd, unsigned int flags, int argc, char **argv) if (rta->rta_len > RTA_LENGTH(0)) addraw_l(&req.n, 1024, RTA_DATA(rta), RTA_PAYLOAD(rta)); + } else if (strcmp(*argv, "ttl-propagate") == 0) { + __u8 ttl_prop; + + NEXT_ARG(); + if (strcmp(*argv, "enabled") == 0) + ttl_prop = 1; + else if (strcmp(*argv, "disabled") == 0) + ttl_prop = 0; + else + invarg("\"ttl-propagate\" value is invalid\n", + *argv); + + addattr8(&req.n, sizeof(req), RTA_TTL_PROPAGATE, +ttl_prop); } else { int type; inet_prefix dst; diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in index d6e06649a464..fbe2711a4301 100644 --- a/man/man8/ip-route.8.in +++ b/man/man8/ip-route.8.in @@ -75,7 +75,9 @@ replace " } " .B scope .IR SCOPE " ] [ " .B metric -.IR METRIC " ]" +.IR METRIC " ] [ " +.B ttl-propagate +.RB "{ " enabled " | " disabled " } ]" .ti -8 .IR INFO_SPEC " := " "NH OPTIONS FLAGS" " [" @@ -710,6 +712,12 @@ is a set of encapsulation attributes specific to the the route will be deleted after the expires time. .B Only support IPv6 at present. + +.TP +.BR ttl-propagate " { " enabled " | " disabled " } " +Control whether TTL should be propagated from any encap into the +un-encapsulated packet, overriding any global configuration. Only +supported for MPLS at present. .RE .TP -- 2.1.4
[PATCH iproute2 net-next 2/2] iproute: Add support for MPLS LWT ttl attribute
Add support for setting and displaying the ttl attribute for MPLS IP lighweight tunnels. Signed-off-by: Robert Shearman --- ip/iproute_lwtunnel.c | 31 +-- man/man8/ip-route.8.in | 9 - 2 files changed, 37 insertions(+), 3 deletions(-) diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c index 0fa1cab0a790..845a115e9e41 100644 --- a/ip/iproute_lwtunnel.c +++ b/ip/iproute_lwtunnel.c @@ -84,6 +84,9 @@ static void print_encap_mpls(FILE *fp, struct rtattr *encap) if (tb[MPLS_IPTUNNEL_DST]) fprintf(fp, " %s ", format_host_rta(AF_MPLS, tb[MPLS_IPTUNNEL_DST])); + if (tb[MPLS_IPTUNNEL_TTL]) + fprintf(fp, "ttl %u ", + rta_getattr_u8(tb[MPLS_IPTUNNEL_TTL])); } static void print_encap_ip(FILE *fp, struct rtattr *encap) @@ -247,6 +250,7 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len, inet_prefix addr; int argc = *argcp; char **argv = *argvp; + int ttl_ok = 0; if (get_addr(&addr, *argv, AF_MPLS)) { fprintf(stderr, @@ -258,8 +262,31 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len, rta_addattr_l(rta, len, MPLS_IPTUNNEL_DST, &addr.data, addr.bytelen); - *argcp = argc; - *argvp = argv; + argc--; + argv++; + + while (argc > 0) { + if (strcmp(*argv, "ttl") == 0) { + __u8 ttl; + + NEXT_ARG(); + if (ttl_ok++) + duparg2("ttl", *argv); + if (get_u8(&ttl, *argv, 0)) + invarg("\"ttl\" value is invalid\n", *argv); + rta_addattr8(rta, len, MPLS_IPTUNNEL_TTL, ttl); + } else { + break; + } + argc--; argv++; + } + + /* argv is currently the first unparsed argument, +* but the lwt_parse_encap() caller will move to the next, +* so step back +*/ + *argcp = argc + 1; + *argvp = argv - 1; return 0; } diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in index fbe2711a4301..d2a44acf2793 100644 --- a/man/man8/ip-route.8.in +++ b/man/man8/ip-route.8.in @@ -181,7 +181,9 @@ throw " | " unreachable " | " prohibit " | " blackhole " | " nat " ]" .ti -8 .IR ENCAP_MPLS " := " .BR mpls " [ " -.IR LABEL " ]" +.IR LABEL " ] [" +.B ttl +.IR TTL " ]" .ti -8 .IR ENCAP_IP " := " @@ -666,6 +668,11 @@ is a set of encapsulation attributes specific to the .I MPLSLABEL - mpls label stack with labels separated by .I "/" +.sp + +.B ttl +.I TTL +- TTL to use for MPLS header or 0 to inherit from IP header .in -2 .sp -- 2.1.4
[PATCH iproute2 net-next 0/2] ip: Allow TTL propagation to/from IP packets to be configured
This patch series adds support for per-MPLS-lightweight-tunnel ttl values and per route ttl-propagation for the purposes of MPLS to be specified. Robert Shearman (2): iproute: Add support for ttl-propagation attribute iproute: Add support for MPLS LWT ttl attribute ip/iproute.c | 22 ++ ip/iproute_lwtunnel.c | 31 +-- man/man8/ip-route.8.in | 19 +-- 3 files changed, 68 insertions(+), 4 deletions(-) -- 2.1.4
[PATCH net-next 3/3] l3mdev: Fix lookup in local table when using main table
If an l3mdev is set to use the main table then the l3mdev rules will return this. This means that no lookup will be done in the local table if split local/main table is in effect and the local table lookup rule has been reordered to after the l3mdev rule. Related to this is that the order of the rule for looking up in the main table isn't respected. Fix these two aspects by not returning a match if the l3mdev's table id is the main table. Processing will then proceed normally to the default rules and do the appropriate lookups. Signed-off-by: Robert Shearman --- net/l3mdev/l3mdev.c | 26 -- 1 file changed, 20 insertions(+), 6 deletions(-) diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c index 8da86ceca33d..5147959243c5 100644 --- a/net/l3mdev/l3mdev.c +++ b/net/l3mdev/l3mdev.c @@ -12,6 +12,7 @@ #include #include #include +#include /** * l3mdev_master_ifindex - get index of L3 master device @@ -142,23 +143,36 @@ int l3mdev_fib_rule_match(struct net *net, struct flowi *fl, { struct net_device *dev; int rc = 0; + u32 tb_id; rcu_read_lock(); dev = dev_get_by_index_rcu(net, fl->flowi_oif); if (dev && netif_is_l3_master(dev) && dev->l3mdev_ops->l3mdev_fib_table) { - arg->table = dev->l3mdev_ops->l3mdev_fib_table(dev); - rc = 1; - goto out; + tb_id = dev->l3mdev_ops->l3mdev_fib_table(dev); + /* This requires the main table id to be consistent +* between IPv4 and IPv6. +*/ + BUILD_BUG_ON(RT_TABLE_MAIN != RT6_TABLE_MAIN); + /* main table is handled by default rules */ + if (tb_id != RT_TABLE_MAIN) { + arg->table = tb_id; + rc = 1; + goto out; + } } dev = dev_get_by_index_rcu(net, fl->flowi_iif); if (dev && netif_is_l3_master(dev) && dev->l3mdev_ops->l3mdev_fib_table) { - arg->table = dev->l3mdev_ops->l3mdev_fib_table(dev); - rc = 1; - goto out; + tb_id = dev->l3mdev_ops->l3mdev_fib_table(dev); + /* main table is handled by default rules */ + if (tb_id != RT_TABLE_MAIN) { + arg->table = tb_id; + rc = 1; + goto out; + } } out: -- 2.1.4
[PATCH net-next 1/3] ipv6: Fix route handling when using l3mdev set to main table
If an l3mdev is set to use the main table then the use of the local table is overridden. This means that when split local/main table is in effect then local routes aren't added to the local table and so don't respect the order of ip rules. Fix this by assuming that no if no l3mdev is present then defaulting to RT6_TABLE_MAIN and then subsequently doing a translation from RT6_TABLE_MAIN to RT6_TABLE_LOCAL. Do the same translations for RT6_TABLE_INFO, RT6_TABLE_DFLT and RT6_TABLE_PREFIX even though they are just defined to RT6_TABLE_MAIN in case someone decides to change that in the future. Signed-off-by: Robert Shearman --- net/ipv6/addrconf.c | 12 +--- net/ipv6/route.c| 23 ++- 2 files changed, 27 insertions(+), 8 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index 67ec87ea5fb6..937c35581a28 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -2244,7 +2244,6 @@ addrconf_prefix_route(struct in6_addr *pfx, int plen, struct net_device *dev, unsigned long expires, u32 flags) { struct fib6_config cfg = { - .fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_PREFIX, .fc_metric = IP6_RT_PRIO_ADDRCONF, .fc_ifindex = dev->ifindex, .fc_expires = expires, @@ -2254,6 +2253,9 @@ addrconf_prefix_route(struct in6_addr *pfx, int plen, struct net_device *dev, .fc_protocol = RTPROT_KERNEL, }; + cfg.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN; + if (cfg.fc_table == RT6_TABLE_MAIN) + cfg.fc_table = RT6_TABLE_PREFIX; cfg.fc_dst = *pfx; /* Prevent useless cloning on PtP SIT. @@ -2277,8 +2279,10 @@ static struct rt6_info *addrconf_get_prefix_route(const struct in6_addr *pfx, struct fib6_node *fn; struct rt6_info *rt = NULL; struct fib6_table *table; - u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_PREFIX; + u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN; + if (tb_id == RT6_TABLE_MAIN) + tb_id = RT6_TABLE_PREFIX; table = fib6_get_table(dev_net(dev), tb_id); if (!table) return NULL; @@ -2310,7 +2314,6 @@ static struct rt6_info *addrconf_get_prefix_route(const struct in6_addr *pfx, static void addrconf_add_mroute(struct net_device *dev) { struct fib6_config cfg = { - .fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_LOCAL, .fc_metric = IP6_RT_PRIO_ADDRCONF, .fc_ifindex = dev->ifindex, .fc_dst_len = 8, @@ -2318,6 +2321,9 @@ static void addrconf_add_mroute(struct net_device *dev) .fc_nlinfo.nl_net = dev_net(dev), }; + cfg.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN; + if (cfg.fc_table == RT6_TABLE_MAIN) + cfg.fc_table = RT6_TABLE_LOCAL; ipv6_addr_set(&cfg.fc_dst, htonl(0xFF00), 0, 0, 0); ip6_route_add(&cfg); diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 9db1418993f2..490c74ed6a78 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -2424,12 +2424,15 @@ static struct rt6_info *rt6_get_route_info(struct net *net, const struct in6_addr *gwaddr, struct net_device *dev) { - u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_INFO; + u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN; int ifindex = dev->ifindex; struct fib6_node *fn; struct rt6_info *rt = NULL; struct fib6_table *table; + if (tb_id == RT6_TABLE_MAIN) + tb_id = RT6_TABLE_INFO; + table = fib6_get_table(net, tb_id); if (!table) return NULL; @@ -2471,7 +2474,9 @@ static struct rt6_info *rt6_add_route_info(struct net *net, .fc_nlinfo.nl_net = net, }; - cfg.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_INFO, + cfg.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN; + if (cfg.fc_table == RT6_TABLE_MAIN) + cfg.fc_table = RT6_TABLE_INFO; cfg.fc_dst = *prefix; cfg.fc_gateway = *gwaddr; @@ -2487,10 +2492,13 @@ static struct rt6_info *rt6_add_route_info(struct net *net, struct rt6_info *rt6_get_dflt_router(const struct in6_addr *addr, struct net_device *dev) { - u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_DFLT; + u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN; struct rt6_info *rt; struct fib6_table *table; + if (tb_id == RT6_TABLE_MAIN) + tb_id = RT6_TABLE_DFLT; + table = fib6_get_table(dev_net(dev), tb_id); if (!table) return NULL; @@ -2513,7 +2521,6 @@ struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr, unsigned int pref) {
[PATCH net-next 2/3] ipv4: Fix route handling when using l3mdev set to main table
If an l3mdev is set to use the main table then the use of the local table is overridden. This means that when split local/main table is in effect then local routes aren't added to the local table and so don't respect the order of ip rules. Fix this by assuming that no if no l3mdev is present then defaulting to RT_TABLE_MAIN and then subsequently doing a translation from RT_TABLE_MAIN to RT_TABLE_LOCAL. Signed-off-by: Robert Shearman --- net/ipv4/af_inet.c | 4 +++- net/ipv4/fib_frontend.c | 14 +- net/ipv4/raw.c | 5 - 3 files changed, 16 insertions(+), 7 deletions(-) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index d1a11707a126..83d54fab03f0 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -436,7 +436,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) struct net *net = sock_net(sk); unsigned short snum; int chk_addr_ret; - u32 tb_id = RT_TABLE_LOCAL; + u32 tb_id = RT_TABLE_MAIN; int err; /* If the socket has its own bind function then use it. (RAW) */ @@ -459,6 +459,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, int addr_len) } tb_id = l3mdev_fib_table_by_index(net, sk->sk_bound_dev_if) ? : tb_id; + if (tb_id == RT_TABLE_MAIN) + tb_id = RT_TABLE_LOCAL; chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id); /* Not specified by any standard per-se, however it breaks too diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 8f2133ffc2ff..1782c35dbac0 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -248,8 +248,10 @@ EXPORT_SYMBOL(inet_addr_type); unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev, __be32 addr) { - u32 rt_table = l3mdev_fib_table(dev) ? : RT_TABLE_LOCAL; + u32 rt_table = l3mdev_fib_table(dev) ? : RT_TABLE_MAIN; + if (rt_table == RT_TABLE_MAIN) + rt_table = RT_TABLE_LOCAL; return __inet_dev_addr_type(net, dev, addr, rt_table); } EXPORT_SYMBOL(inet_dev_addr_type); @@ -261,8 +263,10 @@ unsigned int inet_addr_type_dev_table(struct net *net, const struct net_device *dev, __be32 addr) { - u32 rt_table = l3mdev_fib_table(dev) ? : RT_TABLE_LOCAL; + u32 rt_table = l3mdev_fib_table(dev) ? : RT_TABLE_MAIN; + if (rt_table == RT_TABLE_MAIN) + rt_table = RT_TABLE_LOCAL; return __inet_dev_addr_type(net, NULL, addr, rt_table); } EXPORT_SYMBOL(inet_addr_type_dev_table); @@ -805,7 +809,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifaddr *ifa) { struct net *net = dev_net(ifa->ifa_dev->dev); - u32 tb_id = l3mdev_fib_table(ifa->ifa_dev->dev); + u32 tb_id = l3mdev_fib_table(ifa->ifa_dev->dev) ? : RT_TABLE_MAIN; struct fib_table *tb; struct fib_config cfg = { .fc_protocol = RTPROT_KERNEL, @@ -820,8 +824,8 @@ static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct in_ifad }, }; - if (!tb_id) - tb_id = (type == RTN_UNICAST) ? RT_TABLE_MAIN : RT_TABLE_LOCAL; + if (tb_id == RT_TABLE_MAIN && type != RTN_UNICAST) + tb_id = RT_TABLE_LOCAL; tb = fib_new_table(net, tb_id); if (!tb) diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 8119e1f66e03..2dd7022681e6 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -699,7 +699,7 @@ static int raw_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len) { struct inet_sock *inet = inet_sk(sk); struct sockaddr_in *addr = (struct sockaddr_in *) uaddr; - u32 tb_id = RT_TABLE_LOCAL; + u32 tb_id = RT_TABLE_MAIN; int ret = -EINVAL; int chk_addr_ret; @@ -710,6 +710,9 @@ static int raw_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len) tb_id = l3mdev_fib_table_by_index(sock_net(sk), sk->sk_bound_dev_if) ? : tb_id; + if (tb_id == RT_TABLE_MAIN) + tb_id = RT_TABLE_LOCAL; + chk_addr_ret = inet_addr_type_table(sock_net(sk), addr->sin_addr.s_addr, tb_id); -- 2.1.4
[PATCH net-next 0/3] l3mdev: Improve use with main table
Attempting to create a TCP socket not bound to a VRF device when a TCP socket bound to a VRF device with the same port exists (and vice versa) fails with EADDRINUSE. This limits the ability to use programs in selected mixed VRF/non-VRF contexts. This patch series solves the problem by extending the l3mdev be aware of the special semantics of the main table and fixing issues arising from the split local/main tables. A VRF master device created linking to the main table and used for these programs in the same way as those created for VRF tables can. Robert Shearman (3): ipv6: Fix route handling when using l3mdev set to main table ipv4: Fix route handling when using l3mdev set to main table l3mdev: Fix lookup in local table when using main table net/ipv4/af_inet.c | 4 +++- net/ipv4/fib_frontend.c | 14 +- net/ipv4/raw.c | 5 - net/ipv6/addrconf.c | 12 +--- net/ipv6/route.c| 23 ++- net/l3mdev/l3mdev.c | 26 -- 6 files changed, 63 insertions(+), 21 deletions(-) -- 2.1.4
Re: [PATCH net-next v3 0/6] net: mpls: Allow users to configure more labels per route
On 31/03/17 15:13, David Ahern wrote: Increase the maximum number of new labels for MPLS routes from 2 to 30. To keep memory consumption in check, the labels array is moved to the end of mpls_nh and mpls_iptunnel_encap structs as a 0-sized array. Allocations use the maximum number of labels across all nexthops in a route for LSR and the number of labels configured for LWT. The mpls_route layout is changed to: ... Meaning the via follows its mpls_nh providing better locality as the number of labels increases. UDP_RR tests with namespaces shows no impact to a modest performance increase with this layout for 1 or 2 labels and 1 or 2 nexthops. mpls_route allocation size is limited to 4096 bytes allowing on the order of 30 nexthops with 30 labels (or more nexthops with fewer labels). LWT encap shares same maximum number of labels as mpls routing. v3 - initialize n_labels to 0 in case RTA_NEWDST is not defined; detected by the kbuild test robot v2 - updates per Eric's comments + added patch to ensure all reads of rt_nhn_alive and nh_flags in the packet path use READ_ONCE and all writes via event handlers use WRITE_ONCE + limit mpls_route size to 4096 (PAGE_SIZE for most arch) + mostly killed use of MAX_NEW_LABELS; it exists only for common limit between lwt and routing paths David Ahern (6): net: mpls: rt_nhn_alive and nh_flags should be accessed using READ_ONCE net: mpls: Convert number of nexthops to u8 net: mpls: change mpls_route layout net:mpls: Limit memory allocation for mpls_route net: mpls: bump maximum number of labels net: mpls: Increase max number of labels for lwt encap include/net/mpls_iptunnel.h | 5 +- net/mpls/af_mpls.c | 210 +--- net/mpls/internal.h | 61 + net/mpls/mpls_iptunnel.c| 13 ++- 4 files changed, 196 insertions(+), 93 deletions(-) Acked-by: Robert Shearman
Re: [PATCH net-next] net: mpls: Update lfib_nlmsg_size to skip deleted nexthops
On 28/03/17 23:19, David Ahern wrote: A recent commit skips nexthops in a route if the device has been deleted. Update lfib_nlmsg_size accordingly. Reported-by: Roopa Prabhu Signed-off-by: David Ahern Acked-by: Robert Shearman
Re: [PATCH net-next 0/4] net: mpls: Allow users to configure more labels per route
On 25/03/17 17:03, David Ahern wrote: Bump the maximum number of labels for MPLS routes from 2 to 12. To keep memory consumption in check the labels array is moved to the end of mpls_nh and mpls_iptunnel_encap structs as a 0-sized array. Allocations use the maximum number of labels across all nexthops in a route for LSR and the number of labels configured for LWT. The mpls_route layout is changed to: ... Meaning the via follows its mpls_nh providing better locality as the number of labels increases. UDP_RR tests with namespaces shows no impact to a modest performance increase with this layout for 1 or 2 labels and 1 or 2 nexthops. The new limit is set to 12 to cover all currently known segment routing use cases. David Ahern (4): mpls: Convert number of nexthops to u8 net: mpls: change mpls_route layout net: mpls: bump maximum number of labels net: mpls: Increase max number of labels for lwt encap include/net/mpls_iptunnel.h | 4 +- net/mpls/af_mpls.c | 108 ++-- net/mpls/internal.h | 52 ++--- net/mpls/mpls_iptunnel.c| 13 -- 4 files changed, 112 insertions(+), 65 deletions(-) Acked-by: Robert Shearman
Re: [PATCH net-next 0/4] net: mpls: Allow users to configure more labels per route
On 28/03/17 04:08, Eric W. Biederman wrote: David Ahern writes: On 3/27/17 4:39 AM, Robert Shearman wrote: On 25/03/17 19:15, Eric W. Biederman wrote: David Ahern writes: Bump the maximum number of labels for MPLS routes from 2 to 12. To keep memory consumption in check the labels array is moved to the end of mpls_nh and mpls_iptunnel_encap structs as a 0-sized array. Allocations use the maximum number of labels across all nexthops in a route for LSR and the number of labels configured for LWT. The mpls_route layout is changed to: +--+ | mpls_route | +--+ | mpls_nh 0| +--+ | alignment padding| 4 bytes for odd number of labels; 0 for even +--+ | via[rt_max_alen] 0 | +--+ | alignment padding| via's aligned on sizeof(unsigned long) +--+ | ... | Meaning the via follows its mpls_nh providing better locality as the number of labels increases. UDP_RR tests with namespaces shows no impact to a modest performance increase with this layout for 1 or 2 labels and 1 or 2 nexthops. The new limit is set to 12 to cover all currently known segment routing use cases. How does this compare with running the packet a couple of times through the mpls table to get all of the desired labels applied? At the moment (i.e setting output interface for a route to the loopback interface) the TTL would currently be calculated incorrectly since it'll be decremented each time the packet is run through the input processing. If that was avoided, then the only issue would be the lower performance. We have the infrastructure to add all the labels on one pass. It does not make sense to recirculate the packet to get the same effect. I was really asking what are the advantages and disadvantages of this change rather than suggesting it was a bad idea. The information about ttl is useful. Adding that this will route packets with more labels more quickly than the recirculation method is also useful to know. I should also add that not recirculating also avoids having to allocate extra local labels, which may be limited in supply in some deployments, and avoids the extra control plane/user complexity associated with managing the routes associated with the recirculation. Thanks, Rob
Re: [PATCH net-next 0/4] net: mpls: Allow users to configure more labels per route
On 25/03/17 19:15, Eric W. Biederman wrote: David Ahern writes: Bump the maximum number of labels for MPLS routes from 2 to 12. To keep memory consumption in check the labels array is moved to the end of mpls_nh and mpls_iptunnel_encap structs as a 0-sized array. Allocations use the maximum number of labels across all nexthops in a route for LSR and the number of labels configured for LWT. The mpls_route layout is changed to: +--+ | mpls_route | +--+ | mpls_nh 0| +--+ | alignment padding| 4 bytes for odd number of labels; 0 for even +--+ | via[rt_max_alen] 0 | +--+ | alignment padding| via's aligned on sizeof(unsigned long) +--+ | ... | Meaning the via follows its mpls_nh providing better locality as the number of labels increases. UDP_RR tests with namespaces shows no impact to a modest performance increase with this layout for 1 or 2 labels and 1 or 2 nexthops. The new limit is set to 12 to cover all currently known segment routing use cases. How does this compare with running the packet a couple of times through the mpls table to get all of the desired labels applied? At the moment (i.e setting output interface for a route to the loopback interface) the TTL would currently be calculated incorrectly since it'll be decremented each time the packet is run through the input processing. If that was avoided, then the only issue would be the lower performance. I can certainly see the case in an mpls tunnel ingress where this might could be desirable.Which is something you implement in your last patch. However is it at all common to push lots of labels at once during routing? I am probably a bit naive but it seems absurd to push more than a handful of labels onto a packet as you are routing it. From draft-ietf-spring-segment-routing-mpls-07: Note that the kind of deployment of Segment Routing may affect the depth of the MPLS label stack. As every segment in the list is represented by an additional MPLS label, the length of the segment list directly correlates to the depth of the label stack. Implementing a long path with many explicit hops as a segment list may thus yield a deep label stack that would need to be pushed at the head of the SR tunnel. However, many use cases would need very few segments in the list. This is especially true when taking good advantage of the ECMP aware routing within each segment. In fact most use cases need just one additional segment and thus lead to a similar label stack depth as e.g. RSVP-based routing. The summary is that when using short-path routing then the number of labels needed to be pushed on will be small (2 or 3). However, if using SR to implement traffic engineering through a list of explicit hops then the number of labels pushed could be much greater and up to the diameter of the IGP network. Traffic engineering like this is not unusual. Thanks, Rob
Re: [PATCH net-next 0/2] net: mpls: multipath route cleanups
On 24/03/17 22:21, David Ahern wrote: When a device associated with a nexthop is deleted, the nexthop in the route is effectively removed, so remove it from the route dump. Further, when all nexhops have been deleted the route is effectively done, so remove the route. David Ahern (2): mpls: Don't show nexthop if device has been deleted mpls: Delete route when all nexthops have been deleted net/mpls/af_mpls.c | 17 + 1 file changed, 13 insertions(+), 4 deletions(-) Acked-by: Robert Shearman
Re: [PATCH net-next] net: mpls: Fix setting ttl_propagate for rt2
On 24/03/17 01:02, David Ahern wrote: Fix copy and paste error setting rt_ttl_propagate. Fixes: 5b441ac8784c1 ("mpls: allow TTL propagation to IP packets to be configured") Signed-off-by: David Ahern Good catch. Acked-by: Robert Shearman
Re: [PATCH] net: mpls: Fix nexthop alive tracking on down events
On 13/03/17 23:49, David Ahern wrote: Alive tracking of nexthops can account for a link twice if the carrier goes down followed by an admin down of the same link rendering multipath routes useless. This is similar to 79099aab38c8 for UNREGISTER events and DOWN events. Fix by tracking number of alive nexthops in mpls_ifdown similar to the logic in mpls_ifup. Checking the flags per nexthop once after all events have been processed is simpler than trying to maintian a running count through all event combinations. Also, WRITE_ONCE is used instead of ACCESS_ONCE to set rt_nhn_alive per a comment from checkpatch: WARNING: Prefer WRITE_ONCE(, ) over ACCESS_ONCE() = Fixes: c89359a42e2a4 ("mpls: support for dead routes") Signed-off-by: David Ahern Acked-by: Robert Shearman
Re: [PATCH] mpls: Do not decrement alive counter for unregister events
On 10/03/17 22:11, David Ahern wrote: Multipath routes can be rendered usesless when a device in one of the paths is deleted. For example: $ ip -f mpls ro ls 100 nexthop as to 200 via inet 172.16.2.2 dev virt12 nexthop as to 300 via inet 172.16.3.2 dev br0 101 nexthop as to 201 via inet6 2000:2::2 dev virt12 nexthop as to 301 via inet6 2000:3::2 dev br0 $ ip li del br0 When br0 is deleted the other hop is not considered in mpls_select_multipath because of the alive check -- rt_nhn_alive is 0. rt_nhn_alive is decremented once in mpls_ifdown when the device is taken down (NETDEV_DOWN) and again when it is deleted (NETDEV_UNREGISTER). For a 2 hop route, deleting one device drops the alive count to 0. Since devices are taken down before unregistering, the decrement on NETDEV_UNREGISTER is redundant. Fixes: c89359a42e2a4 ("mpls: support for dead routes") Signed-off-by: David Ahern --- net/mpls/af_mpls.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index ccdac9c44fdc..22a9971aa484 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -1288,7 +1288,8 @@ static void mpls_ifdown(struct net_device *dev, int event) /* fall through */ case NETDEV_CHANGE: nh->nh_flags |= RTNH_F_LINKDOWN; - ACCESS_ONCE(rt->rt_nhn_alive) = rt->rt_nhn_alive - 1; + if (event != NETDEV_UNREGISTER) + ACCESS_ONCE(rt->rt_nhn_alive) = rt->rt_nhn_alive - 1; break; } if (event == NETDEV_UNREGISTER) Doesn't this leave the problem that if the device's link goes down and then the device gets deleted the alive count will be decremented twice for the same path? Perhaps it would be better to change the condition for decrementing the alive count to be "!(nh->nh_flags & (RTNH_F_LINKDOWN | RTNH_F_DEAD))"? Thanks, Rob
[PATCH net-next v3 2/2] mpls: allow TTL propagation from IP packets to be configured
Allow TTL propagation from IP packets to MPLS packets to be configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which allows the TTL to be set in the resulting MPLS packet, with the value of 0 having the semantics of enabling propagation of the TTL from the IP header (i.e. non-zero values disable propagation). Also allow the configuration to be overridden globally by reusing the same sysctl to control whether the TTL is propagated from IP packets into the MPLS header. If the per-LWT attribute is set then it overrides the global configuration. If the TTL isn't propagated then a default TTL value is used which can be configured via a new sysctl, "net.mpls.default_ttl". This is kept separate from the configuration of whether IP TTL propagation is enabled as it can be used in the future when non-IP payloads are supported (i.e. where there is no payload TTL that can be propagated). Signed-off-by: Robert Shearman --- Documentation/networking/mpls-sysctl.txt | 8 include/net/mpls_iptunnel.h | 2 + include/net/netns/mpls.h | 1 + include/uapi/linux/mpls_iptunnel.h | 2 + net/mpls/af_mpls.c | 11 + net/mpls/mpls_iptunnel.c | 73 ++-- 6 files changed, 84 insertions(+), 13 deletions(-) diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt index 9badd1d6685f..2f24a1912a48 100644 --- a/Documentation/networking/mpls-sysctl.txt +++ b/Documentation/networking/mpls-sysctl.txt @@ -30,6 +30,14 @@ ip_ttl_propagate - BOOL 0 - disabled / RFC 3443 [Short] Pipe Model 1 - enabled / RFC 3443 Uniform Model (default) +default_ttl - BOOL + Default TTL value to use for MPLS packets where it cannot be + propagated from an IP header, either because one isn't present + or ip_ttl_propagate has been disabled. + + Possible values: 1 - 255 + Default: 255 + conf//input - BOOL Control whether packets can be input on this interface. diff --git a/include/net/mpls_iptunnel.h b/include/net/mpls_iptunnel.h index 179253f9dcfd..a18af6a16eb5 100644 --- a/include/net/mpls_iptunnel.h +++ b/include/net/mpls_iptunnel.h @@ -19,6 +19,8 @@ struct mpls_iptunnel_encap { u32 label[MAX_NEW_LABELS]; u8 labels; + u8 ttl_propagate; + u8 default_ttl; }; static inline struct mpls_iptunnel_encap *mpls_lwtunnel_encap(struct lwtunnel_state *lwtstate) diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h index 08652eedabb2..6608b3693385 100644 --- a/include/net/netns/mpls.h +++ b/include/net/netns/mpls.h @@ -10,6 +10,7 @@ struct ctl_table_header; struct netns_mpls { int ip_ttl_propagate; + int default_ttl; size_t platform_labels; struct mpls_route __rcu * __rcu *platform_label; diff --git a/include/uapi/linux/mpls_iptunnel.h b/include/uapi/linux/mpls_iptunnel.h index d80a0498f77e..f5e45095b0bb 100644 --- a/include/uapi/linux/mpls_iptunnel.h +++ b/include/uapi/linux/mpls_iptunnel.h @@ -16,11 +16,13 @@ /* MPLS tunnel attributes * [RTA_ENCAP] = { * [MPLS_IPTUNNEL_DST] + * [MPLS_IPTUNNEL_TTL] * } */ enum { MPLS_IPTUNNEL_UNSPEC, MPLS_IPTUNNEL_DST, + MPLS_IPTUNNEL_TTL, __MPLS_IPTUNNEL_MAX, }; #define MPLS_IPTUNNEL_MAX (__MPLS_IPTUNNEL_MAX - 1) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 0e1046f21af8..0c5d111abe36 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -34,6 +34,7 @@ static int zero = 0; static int one = 1; static int label_limit = (1 << 20) - 1; +static int ttl_max = 255; static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt, struct nlmsghdr *nlh, struct net *net, u32 portid, @@ -2042,6 +2043,15 @@ static const struct ctl_table mpls_table[] = { .extra1 = &zero, .extra2 = &one, }, + { + .procname = "default_ttl", + .data = MPLS_NS_SYSCTL_OFFSET(mpls.default_ttl), + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &one, + .extra2 = &ttl_max, + }, { } }; @@ -2053,6 +2063,7 @@ static int mpls_net_init(struct net *net) net->mpls.platform_labels = 0; net->mpls.platform_label = NULL; net->mpls.ip_ttl_propagate = 1; + net->mpls.default_ttl = 255; table = kmemdup(mpls_table, sizeof(mpls_table), GFP_KERNEL); if (table == NULL) diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c index e4e4424f9eb1..22f71fce0bfb 100644 --- a/net/mpls/mpls_iptunnel.c +++ b/net/mpls/mpls_iptunnel.c @@ -29,6 +29,7 @@ static const struct nla_policy mpls_iptun
[PATCH net-next v3 1/2] mpls: allow TTL propagation to IP packets to be configured
Provide the ability to control on a per-route basis whether the TTL value from an MPLS packet is propagated to an IPv4/IPv6 packet when the last label is popped as per the theoretical model in RFC 3443 through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to mean disable propagation and 1 to mean enable propagation. In order to provide the ability to change the behaviour for packets arriving with IPv4/IPv6 Explicit Null labels and to provide an easy way for a user to change the behaviour for all existing routes without having to reprogram them, a global knob is provided. This is done through the addition of a new per-namespace sysctl, "net.mpls.ip_ttl_propagate", which defaults to enabled. If the per-route attribute is set (either enabled or disabled) then it overrides the global configuration. Signed-off-by: Robert Shearman --- Documentation/networking/mpls-sysctl.txt | 11 include/net/netns/mpls.h | 2 + include/uapi/linux/rtnetlink.h | 1 + net/mpls/af_mpls.c | 87 +--- net/mpls/internal.h | 7 +++ 5 files changed, 100 insertions(+), 8 deletions(-) diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt index 15d8d16934fd..9badd1d6685f 100644 --- a/Documentation/networking/mpls-sysctl.txt +++ b/Documentation/networking/mpls-sysctl.txt @@ -19,6 +19,17 @@ platform_labels - INTEGER Possible values: 0 - 1048575 Default: 0 +ip_ttl_propagate - BOOL + Control whether TTL is propagated from the IPv4/IPv6 header to + the MPLS header on imposing labels and propagated from the + MPLS header to the IPv4/IPv6 header on popping the last label. + + If disabled, the MPLS transport network will appear as a + single hop to transit traffic. + + 0 - disabled / RFC 3443 [Short] Pipe Model + 1 - enabled / RFC 3443 Uniform Model (default) + conf//input - BOOL Control whether packets can be input on this interface. diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h index d29203651c01..08652eedabb2 100644 --- a/include/net/netns/mpls.h +++ b/include/net/netns/mpls.h @@ -9,8 +9,10 @@ struct mpls_route; struct ctl_table_header; struct netns_mpls { + int ip_ttl_propagate; size_t platform_labels; struct mpls_route __rcu * __rcu *platform_label; + struct ctl_table_header *ctl; }; diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index 6546917d605a..30fb25e851db 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -319,6 +319,7 @@ enum rtattr_type_t { RTA_EXPIRES, RTA_PAD, RTA_UID, + RTA_TTL_PROPAGATE, __RTA_MAX }; diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 3818686182b2..0e1046f21af8 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -32,6 +32,7 @@ #define MPLS_NEIGH_TABLE_UNSPEC (NEIGH_LINK_TABLE + 1) static int zero = 0; +static int one = 1; static int label_limit = (1 << 20) - 1; static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt, @@ -220,8 +221,8 @@ static struct mpls_nh *mpls_select_multipath(struct mpls_route *rt, return &rt->rt_nh[nh_index]; } -static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, - struct mpls_entry_decoded dec) +static bool mpls_egress(struct net *net, struct mpls_route *rt, + struct sk_buff *skb, struct mpls_entry_decoded dec) { enum mpls_payload_type payload_type; bool success = false; @@ -246,22 +247,46 @@ static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, switch (payload_type) { case MPT_IPV4: { struct iphdr *hdr4 = ip_hdr(skb); + u8 new_ttl; skb->protocol = htons(ETH_P_IP); + + /* If propagating TTL, take the decremented TTL from +* the incoming MPLS header, otherwise decrement the +* TTL, but only if not 0 to avoid underflow. +*/ + if (rt->rt_ttl_propagate == MPLS_TTL_PROP_ENABLED || + (rt->rt_ttl_propagate == MPLS_TTL_PROP_DEFAULT && +net->mpls.ip_ttl_propagate)) + new_ttl = dec.ttl; + else + new_ttl = hdr4->ttl ? hdr4->ttl - 1 : 0; + csum_replace2(&hdr4->check, htons(hdr4->ttl << 8), - htons(dec.ttl << 8)); - hdr4->ttl = dec.ttl; + htons(new_ttl << 8)); + hdr4->ttl = new_ttl; success = true; break; } case MPT_IPV6: { struct ipv6hdr *hdr6 = ipv6_hdr(skb);
[PATCH net-next v3 0/2] mpls: allow TTL propagation to/from IP packets to be configured
It is sometimes desirable to present an MPLS transport network as a single hop to traffic transiting it because it prevents confusion when diagnosing failures. An example of where confusion can be generated is when addresses used in the provider network overlap with addresses in the overlay network and the addresses get exposed through ICMP errors generated as packets transit the provider network. In addition, RFC 3443 defines two methods of deriving TTL for an outgoing packet: Uniform Model where the TTL is propagated to/from the MPLS header and both Pipe Models and Short Pipe Models (with and without PHP) where the TTL is not propagated to/from the MPLS header. Changes in v3: - decrement ttl on popping last label when not doing ttl propagation, as suggested by David Ahern. - add comment to describe what the somewhat complex conditionals are doing to work out what ttl to use in mpls_iptunnel.c. - rearrange fields fields in struct netns_mpls to keep the platform label fields together, as suggested by David Ahern. Changes in v2: - add references to RFC 3443 as suggested by David Ahern - fix setting of skb->protocol as noticed by David Ahern - implement per-route/per-LWT configurability as suggested by Eric Biederman - split into two patches for ease of review Robert Shearman (2): mpls: allow TTL propagation to IP packets to be configured mpls: allow TTL propagation from IP packets to be configured Documentation/networking/mpls-sysctl.txt | 19 +++ include/net/mpls_iptunnel.h | 2 + include/net/netns/mpls.h | 3 + include/uapi/linux/mpls_iptunnel.h | 2 + include/uapi/linux/rtnetlink.h | 1 + net/mpls/af_mpls.c | 98 +--- net/mpls/internal.h | 7 +++ net/mpls/mpls_iptunnel.c | 73 +++- 8 files changed, 184 insertions(+), 21 deletions(-) -- 2.1.4
Re: [PATCH net-next v2 2/2] mpls: allow TTL propagation from IP packets to be configured
On 10/03/17 02:54, David Ahern wrote: On 3/7/17 5:46 PM, Robert Shearman wrote: @@ -78,6 +70,29 @@ static int mpls_xmit(struct sk_buff *skb) tun_encap_info = mpls_lwtunnel_encap(dst->lwtstate); + /* Obtain the ttl */ + if (dst->ops->family == AF_INET) { + if (tun_encap_info->ttl_propagate == MPLS_TTL_PROP_DISABLED) + ttl = tun_encap_info->default_ttl; + else if (tun_encap_info->ttl_propagate == MPLS_TTL_PROP_DEFAULT && +!net->mpls.ip_ttl_propagate) + ttl = net->mpls.default_ttl; + else + ttl = ip_hdr(skb)->ttl; After staring at that for a while, an explanation above this if {} else {} section on the ttl selection will be very helpful. Ok, would a comment like the following improve things sufficiently? /* Obtain the ttl using the following set of rules. * * LWT ttl propagation setting: * - disabled => use default TTL value from LWT * - enabled => use TTL value from IPv4/IPv6 header * - default => * Global ttl propagation setting: *- disabled => use default TTL value from global setting *- enabled => use TTL value from IPv4/IPv6 header */ Thanks, Rob
Re: [PATCH net-next v2 1/2] mpls: allow TTL propagation to IP packets to be configured
On 10/03/17 02:00, David Ahern wrote: On 3/7/17 5:46 PM, Robert Shearman wrote: @@ -244,24 +245,33 @@ static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, payload_type = ip_hdr(skb)->version; switch (payload_type) { - case MPT_IPV4: { - struct iphdr *hdr4 = ip_hdr(skb); + case MPT_IPV4: + if (rt->rt_ttl_propagate == MPLS_TTL_PROP_ENABLED || + (rt->rt_ttl_propagate == MPLS_TTL_PROP_DEFAULT && +net->mpls.ip_ttl_propagate)) { + struct iphdr *hdr4 = ip_hdr(skb); + + csum_replace2(&hdr4->check, + htons(hdr4->ttl << 8), + htons(dec.ttl << 8)); + hdr4->ttl = dec.ttl; + } skb->protocol = htons(ETH_P_IP); - csum_replace2(&hdr4->check, - htons(hdr4->ttl << 8), - htons(dec.ttl << 8)); - hdr4->ttl = dec.ttl; success = true; break; - } - case MPT_IPV6: { - struct ipv6hdr *hdr6 = ipv6_hdr(skb); + case MPT_IPV6: + if (rt->rt_ttl_propagate == MPLS_TTL_PROP_ENABLED || + (rt->rt_ttl_propagate == MPLS_TTL_PROP_DEFAULT && +net->mpls.ip_ttl_propagate)) { + struct ipv6hdr *hdr6 = ipv6_hdr(skb); + + hdr6->hop_limit = dec.ttl; + } skb->protocol = htons(ETH_P_IPV6); - hdr6->hop_limit = dec.ttl; success = true; break; - } What decrements the TTL if it is not propagated from MPLS to IP? Good point. Will address in v3. Thanks, Rob
Re: [PATCH net-next v2 1/2] mpls: allow TTL propagation to IP packets to be configured
On 10/03/17 02:40, David Ahern wrote: On 3/7/17 5:46 PM, Robert Shearman wrote: diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h index d29203651c01..58e0e46c4a5c 100644 --- a/include/net/netns/mpls.h +++ b/include/net/netns/mpls.h @@ -10,7 +10,9 @@ struct ctl_table_header; struct netns_mpls { size_t platform_labels; + int ip_ttl_propagate; struct mpls_route __rcu * __rcu *platform_label; + struct ctl_table_header *ctl; }; I'd prefer the platform_labels stay with platform_label. ie., put the new ip_ttl_propagate above platform_labels. Ok, will do in v3. Thanks, Rob
[PATCH iproute2] man: Fix formatting of vrf parameter of ip-link show command
Add missing opening " [" for the vrf parameter. Signed-off-by: Robert Shearman --- man/man8/ip-link.8.in | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in index 83ffb6357f7d..3f5d57c28885 100644 --- a/man/man8/ip-link.8.in +++ b/man/man8/ip-link.8.in @@ -164,7 +164,7 @@ ip-link \- network device configuration .B master .IR DEVICE " ] [" .B type -.IR ETYPE " ]" +.IR ETYPE " ] [" .B vrf .IR NAME " ]" -- 2.1.4
[PATCH iproute2] iplink: add support for afstats subcommand
Add support for new afstats subcommand. This uses the new IFLA_STATS_AF_SPEC attribute of RTM_GETSTATS messages to show per-device, AF-specific stats. At the moment the kernel only supports MPLS AF stats, so that is all that's implemented here. The print_num function is exposed from ipaddress.c to be used for printing the new stats so that the human-readable option, if set, can be respected. Example of use: $ ./ip/ip -f mpls link afstats dev eth1 3: eth1 mpls: RX: bytes packets errors dropped noroute 9016 98 0 00 TX: bytes packets errors dropped 7232 113 0 0 Signed-off-by: Robert Shearman --- ip/ip_common.h| 2 + ip/ipaddress.c| 2 +- ip/iplink.c | 151 ++ man/man8/ip-link.8.in | 12 4 files changed, 166 insertions(+), 1 deletion(-) diff --git a/ip/ip_common.h b/ip/ip_common.h index abb2b8d5d537..5a39623aa21d 100644 --- a/ip/ip_common.h +++ b/ip/ip_common.h @@ -112,3 +112,5 @@ int name_is_vrf(const char *name); #ifndef LABEL_MAX_MASK #define LABEL_MAX_MASK 0xFU #endif + +void print_num(FILE *fp, unsigned int width, uint64_t count); diff --git a/ip/ipaddress.c b/ip/ipaddress.c index 242c6ea0b4b3..b8d9c7d917fe 100644 --- a/ip/ipaddress.c +++ b/ip/ipaddress.c @@ -418,7 +418,7 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo) print_vf_stats64(fp, vf[IFLA_VF_STATS]); } -static void print_num(FILE *fp, unsigned int width, uint64_t count) +void print_num(FILE *fp, unsigned int width, uint64_t count) { const char *prefix = "kMGTPE"; const unsigned int base = use_iec ? 1024 : 1000; diff --git a/ip/iplink.c b/ip/iplink.c index 00fed9006ea6..866ad723f4a2 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -28,6 +28,7 @@ #include #include #include +#include #include "rt_names.h" #include "utils.h" @@ -99,6 +100,7 @@ void iplink_usage(void) " ip link show [ DEVICE | group GROUP ] [up] [master DEV] [vrf NAME] [type TYPE]\n"); fprintf(stderr, "\n ip link xstats type TYPE [ ARGS ]\n"); + fprintf(stderr, "\n ip link afstats [ dev DEVICE ]\n"); if (iplink_have_newlink()) { fprintf(stderr, @@ -1364,6 +1366,150 @@ static int do_set(int argc, char **argv) } #endif /* IPLINK_IOCTL_COMPAT */ +static void print_mpls_stats(FILE *fp, struct rtattr *attr) +{ + struct rtattr *mrtb[MPLS_STATS_MAX+1]; + struct mpls_link_stats *stats; + + parse_rtattr(mrtb, MPLS_STATS_MAX, RTA_DATA(attr), +RTA_PAYLOAD(attr)); + if (!mrtb[MPLS_STATS_LINK]) + return; + + stats = RTA_DATA(mrtb[MPLS_STATS_LINK]); + + fprintf(fp, "mpls:\n"); + fprintf(fp, "RX: bytes packets errors dropped noroute\n"); + fprintf(fp, ""); + print_num(fp, 10, stats->rx_bytes); + print_num(fp, 8, stats->rx_packets); + print_num(fp, 7, stats->rx_errors); + print_num(fp, 8, stats->rx_dropped); + print_num(fp, 7, stats->rx_noroute); + fprintf(fp, "\n"); + fprintf(fp, "TX: bytes packets errors dropped\n"); + fprintf(fp, ""); + print_num(fp, 10, stats->tx_bytes); + print_num(fp, 8, stats->tx_packets); + print_num(fp, 7, stats->tx_errors); + print_num(fp, 7, stats->tx_dropped); + fprintf(fp, "\n"); +} + +static void print_af_stats_attr(FILE *fp, int ifindex, struct rtattr *attr) +{ + bool if_printed = false; + struct rtattr *i; + int rem; + + rem = RTA_PAYLOAD(attr); + for (i = RTA_DATA(attr); RTA_OK(i, rem); i = RTA_NEXT(i, rem)) { + if (preferred_family != AF_UNSPEC && + i->rta_type != preferred_family) + continue; + + if (!if_printed) { + fprintf(fp, "%u: %s\n", ifindex, + ll_index_to_name(ifindex)); + if_printed = true; + } + + switch (i->rta_type) { + case AF_MPLS: + print_mpls_stats(fp, i); + break; + default: + fprintf(fp, "unknown af(%d)\n", i->rta_type); + break; + } + } +} + +struct af_stats_ctx { + FILE *fp; + int ifindex; +}; + +static int print_af_stats(const struct sockaddr_nl *who, + struct nlmsghdr *n, + void *arg) +{ + struct if_stats_msg *ifsm = NLMSG_DATA(n); + struct rtattr *tb[IFLA_STATS_MAX+1]; + int l
[PATCH net-next v2 2/2] mpls: allow TTL propagation from IP packets to be configured
Allow TTL propagation from IP packets to MPLS packets to be configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which allows the TTL to be set in the resulting MPLS packet, with the value of 0 having the semantics of enabling propagation of the TTL from the IP header (i.e. non-zero values disable propagation). Also allow the configuration to be overridden globally by reusing the same sysctl to control whether the TTL is propagated from IP packets into the MPLS header. If the per-LWT attribute is set then it overrides the global configuration. If the TTL isn't propagated then a default TTL value is used which can be configured via a new sysctl, "net.mpls.default_ttl". This is kept separate from the configuration of whether IP TTL propagation is enabled as it can be used in the future when non-IP payloads are supported (i.e. where there is no payload TTL that can be propagated). Signed-off-by: Robert Shearman --- Documentation/networking/mpls-sysctl.txt | 8 include/net/mpls_iptunnel.h | 2 + include/net/netns/mpls.h | 1 + include/uapi/linux/mpls_iptunnel.h | 2 + net/mpls/af_mpls.c | 11 ++ net/mpls/mpls_iptunnel.c | 64 +--- 6 files changed, 75 insertions(+), 13 deletions(-) diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt index 9badd1d6685f..2f24a1912a48 100644 --- a/Documentation/networking/mpls-sysctl.txt +++ b/Documentation/networking/mpls-sysctl.txt @@ -30,6 +30,14 @@ ip_ttl_propagate - BOOL 0 - disabled / RFC 3443 [Short] Pipe Model 1 - enabled / RFC 3443 Uniform Model (default) +default_ttl - BOOL + Default TTL value to use for MPLS packets where it cannot be + propagated from an IP header, either because one isn't present + or ip_ttl_propagate has been disabled. + + Possible values: 1 - 255 + Default: 255 + conf//input - BOOL Control whether packets can be input on this interface. diff --git a/include/net/mpls_iptunnel.h b/include/net/mpls_iptunnel.h index 179253f9dcfd..a18af6a16eb5 100644 --- a/include/net/mpls_iptunnel.h +++ b/include/net/mpls_iptunnel.h @@ -19,6 +19,8 @@ struct mpls_iptunnel_encap { u32 label[MAX_NEW_LABELS]; u8 labels; + u8 ttl_propagate; + u8 default_ttl; }; static inline struct mpls_iptunnel_encap *mpls_lwtunnel_encap(struct lwtunnel_state *lwtstate) diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h index 58e0e46c4a5c..1b68aed6e1b9 100644 --- a/include/net/netns/mpls.h +++ b/include/net/netns/mpls.h @@ -11,6 +11,7 @@ struct ctl_table_header; struct netns_mpls { size_t platform_labels; int ip_ttl_propagate; + int default_ttl; struct mpls_route __rcu * __rcu *platform_label; struct ctl_table_header *ctl; diff --git a/include/uapi/linux/mpls_iptunnel.h b/include/uapi/linux/mpls_iptunnel.h index d80a0498f77e..f5e45095b0bb 100644 --- a/include/uapi/linux/mpls_iptunnel.h +++ b/include/uapi/linux/mpls_iptunnel.h @@ -16,11 +16,13 @@ /* MPLS tunnel attributes * [RTA_ENCAP] = { * [MPLS_IPTUNNEL_DST] + * [MPLS_IPTUNNEL_TTL] * } */ enum { MPLS_IPTUNNEL_UNSPEC, MPLS_IPTUNNEL_DST, + MPLS_IPTUNNEL_TTL, __MPLS_IPTUNNEL_MAX, }; #define MPLS_IPTUNNEL_MAX (__MPLS_IPTUNNEL_MAX - 1) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index d4a51da8a0ce..a8710d334a60 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -34,6 +34,7 @@ static int zero = 0; static int one = 1; static int label_limit = (1 << 20) - 1; +static int ttl_max = 255; static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt, struct nlmsghdr *nlh, struct net *net, u32 portid, @@ -2027,6 +2028,15 @@ static const struct ctl_table mpls_table[] = { .extra1 = &zero, .extra2 = &one, }, + { + .procname = "default_ttl", + .data = MPLS_NS_SYSCTL_OFFSET(mpls.default_ttl), + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &one, + .extra2 = &ttl_max, + }, { } }; @@ -2038,6 +2048,7 @@ static int mpls_net_init(struct net *net) net->mpls.platform_labels = 0; net->mpls.platform_label = NULL; net->mpls.ip_ttl_propagate = 1; + net->mpls.default_ttl = 255; table = kmemdup(mpls_table, sizeof(mpls_table), GFP_KERNEL); if (table == NULL) diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c index e4e4424f9eb1..da2fb02e0f27 100644 --- a/net/mpls/mpls_iptunnel.c +++ b/net/mpls/mpls_iptunnel.c @@ -29,6 +29,7 @@ static co
[PATCH net-next v2 1/2] mpls: allow TTL propagation to IP packets to be configured
Provide the ability to control on a per-route basis whether the TTL value from an MPLS packet is propagated to an IPv4/IPv6 packet when the last label is popped as per the theoretical model in RFC 3443 through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to mean disable propagation and 1 to mean enable propagation. In order to provide the ability to change the behaviour for packets arriving with IPv4/IPv6 Explicit Null labels and to provide an easy way for a user to change the behaviour for all existing routes without having to reprogram them, a global knob is provided. This is done through the addition of a new per-namespace sysctl, "net.mpls.ip_ttl_propagate", which defaults to enabled. If the per-route attribute is set (either enabled or disabled) then it overrides the global configuration. Signed-off-by: Robert Shearman --- Documentation/networking/mpls-sysctl.txt | 11 include/net/netns/mpls.h | 2 + include/uapi/linux/rtnetlink.h | 1 + net/mpls/af_mpls.c | 88 ++-- net/mpls/internal.h | 7 +++ 5 files changed, 93 insertions(+), 16 deletions(-) diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt index 15d8d16934fd..9badd1d6685f 100644 --- a/Documentation/networking/mpls-sysctl.txt +++ b/Documentation/networking/mpls-sysctl.txt @@ -19,6 +19,17 @@ platform_labels - INTEGER Possible values: 0 - 1048575 Default: 0 +ip_ttl_propagate - BOOL + Control whether TTL is propagated from the IPv4/IPv6 header to + the MPLS header on imposing labels and propagated from the + MPLS header to the IPv4/IPv6 header on popping the last label. + + If disabled, the MPLS transport network will appear as a + single hop to transit traffic. + + 0 - disabled / RFC 3443 [Short] Pipe Model + 1 - enabled / RFC 3443 Uniform Model (default) + conf//input - BOOL Control whether packets can be input on this interface. diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h index d29203651c01..58e0e46c4a5c 100644 --- a/include/net/netns/mpls.h +++ b/include/net/netns/mpls.h @@ -10,7 +10,9 @@ struct ctl_table_header; struct netns_mpls { size_t platform_labels; + int ip_ttl_propagate; struct mpls_route __rcu * __rcu *platform_label; + struct ctl_table_header *ctl; }; diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index 6546917d605a..30fb25e851db 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -319,6 +319,7 @@ enum rtattr_type_t { RTA_EXPIRES, RTA_PAD, RTA_UID, + RTA_TTL_PROPAGATE, __RTA_MAX }; diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 3818686182b2..d4a51da8a0ce 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -32,6 +32,7 @@ #define MPLS_NEIGH_TABLE_UNSPEC (NEIGH_LINK_TABLE + 1) static int zero = 0; +static int one = 1; static int label_limit = (1 << 20) - 1; static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt, @@ -220,8 +221,8 @@ static struct mpls_nh *mpls_select_multipath(struct mpls_route *rt, return &rt->rt_nh[nh_index]; } -static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, - struct mpls_entry_decoded dec) +static bool mpls_egress(struct net *net, struct mpls_route *rt, + struct sk_buff *skb, struct mpls_entry_decoded dec) { enum mpls_payload_type payload_type; bool success = false; @@ -244,24 +245,33 @@ static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, payload_type = ip_hdr(skb)->version; switch (payload_type) { - case MPT_IPV4: { - struct iphdr *hdr4 = ip_hdr(skb); + case MPT_IPV4: + if (rt->rt_ttl_propagate == MPLS_TTL_PROP_ENABLED || + (rt->rt_ttl_propagate == MPLS_TTL_PROP_DEFAULT && +net->mpls.ip_ttl_propagate)) { + struct iphdr *hdr4 = ip_hdr(skb); + + csum_replace2(&hdr4->check, + htons(hdr4->ttl << 8), + htons(dec.ttl << 8)); + hdr4->ttl = dec.ttl; + } skb->protocol = htons(ETH_P_IP); - csum_replace2(&hdr4->check, - htons(hdr4->ttl << 8), - htons(dec.ttl << 8)); - hdr4->ttl = dec.ttl; success = true; break; - } - case MPT_IPV6: { - struct ipv6hdr *hdr6 = ipv6_hdr(skb); + case MPT_IPV6: + if (rt->rt_ttl_propagate == MPLS_TTL_PROP_ENABLED || +
[PATCH net-next v2 0/2] mpls: allow TTL propagation to/from IP packets to be configured
It is sometimes desirable to present an MPLS transport network as a single hop to traffic transiting it because it prevents confusion when diagnosing failures. An example of where confusion can be generated is when addresses used in the provider network overlap with addresses in the overlay network and the addresses get exposed through ICMP errors generated as packets transit the provider network. In addition, RFC 3443 defines two methods of deriving TTL for an outgoing packet: Uniform Model where the TTL is propagated to/from the MPLS header and both Pipe Models and Short Pipe Models (with and without PHP) where the TTL is not propagated to/from the MPLS header. Changes in v2: - add references to RFC 3443 as suggested by David Ahern - fix setting of skb->protocol as noticed by David Ahern - implement per-route/per-LWT configurability as suggested by Eric Biederman - split into two patches for ease of review Robert Shearman (2): mpls: allow TTL propagation to IP packets to be configured mpls: allow TTL propagation from IP packets to be configured Documentation/networking/mpls-sysctl.txt | 19 ++ include/net/mpls_iptunnel.h | 2 + include/net/netns/mpls.h | 3 + include/uapi/linux/mpls_iptunnel.h | 2 + include/uapi/linux/rtnetlink.h | 1 + net/mpls/af_mpls.c | 99 ++-- net/mpls/internal.h | 7 +++ net/mpls/mpls_iptunnel.c | 64 - 8 files changed, 168 insertions(+), 29 deletions(-) -- 2.1.4
Re: [PATCH net-next v2] net: mpls: Add support for netconf
On 20/02/17 16:03, David Ahern wrote: Add netconf support to MPLS. Allows userpsace to learn and be notified of changes to 'input' enable setting per interface. Acked-by: Robert Shearman Acked-by: Nicolas Dichtel Signed-off-by: David Ahern
Re: [PATCH net-next] net: mpls: Add support for netconf
On 20/02/17 14:53, David Ahern wrote: On 2/20/17 3:01 AM, Robert Shearman wrote: On 18/02/17 01:36, David Ahern wrote: Add netconf support to MPLS. Allows userpsace to learn and be notified of changes to 'input' enable setting per interface. Signed-off-by: David Ahern --- include/uapi/linux/netconf.h | 1 + include/uapi/linux/rtnetlink.h | 2 + net/mpls/af_mpls.c | 212 - net/mpls/internal.h| 2 +- 4 files changed, 214 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h index 45dfad509c4d..159d91c9c2a3 100644 --- a/include/uapi/linux/netconf.h +++ b/include/uapi/linux/netconf.h @@ -16,6 +16,7 @@ enum { NETCONFA_MC_FORWARDING, NETCONFA_PROXY_NEIGH, NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN, +NETCONFA_MPLS_INPUT_ENABLED, How about removing "MPLS_" the name? That way it would be usable by other AFs too, e.g. for the disable_ipv6 sysctl. Perhaps "ENABLED" is redundant as well. I could drop the MPLS to NETCONFA_INPUT_ENABLED; just NETCONFA_INPUT is not very descriptive. disable_ipv6 has a completely different meaning, so re-use there does not seem appropriate. That's a fair point - disable_ipv6 does more than just disabling the input of IPv6 packets. However, the point still stands in the general sense - why tie the attribute name to an AF if it's not necessary? Thanks, Rob Thanks for the review.
Re: [PATCH net-next] net: mpls: Add support for netconf
On 18/02/17 01:36, David Ahern wrote: Add netconf support to MPLS. Allows userpsace to learn and be notified of changes to 'input' enable setting per interface. Signed-off-by: David Ahern --- include/uapi/linux/netconf.h | 1 + include/uapi/linux/rtnetlink.h | 2 + net/mpls/af_mpls.c | 212 - net/mpls/internal.h| 2 +- 4 files changed, 214 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h index 45dfad509c4d..159d91c9c2a3 100644 --- a/include/uapi/linux/netconf.h +++ b/include/uapi/linux/netconf.h @@ -16,6 +16,7 @@ enum { NETCONFA_MC_FORWARDING, NETCONFA_PROXY_NEIGH, NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN, + NETCONFA_MPLS_INPUT_ENABLED, How about removing "MPLS_" the name? That way it would be usable by other AFs too, e.g. for the disable_ipv6 sysctl. Perhaps "ENABLED" is redundant as well. Thanks, Rob
Re: [PATCH net-next] mpls: allow TTL propagation to/from IP packets to be configured
On 31/01/17 01:09, David Ahern wrote: On 1/30/17 1:36 PM, Robert Shearman wrote: @@ -243,24 +245,29 @@ static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, payload_type = ip_hdr(skb)->version; switch (payload_type) { - case MPT_IPV4: { - struct iphdr *hdr4 = ip_hdr(skb); - skb->protocol = htons(ETH_P_IP); - csum_replace2(&hdr4->check, - htons(hdr4->ttl << 8), - htons(dec.ttl << 8)); - hdr4->ttl = dec.ttl; + case MPT_IPV4: + if (net->mpls.ip_ttl_propagate) { + struct iphdr *hdr4 = ip_hdr(skb); + + skb->protocol = htons(ETH_P_IP); The protocol setting here and ... + csum_replace2(&hdr4->check, + htons(hdr4->ttl << 8), + htons(dec.ttl << 8)); + hdr4->ttl = dec.ttl; + } success = true; break; - } - case MPT_IPV6: { - struct ipv6hdr *hdr6 = ipv6_hdr(skb); - skb->protocol = htons(ETH_P_IPV6); - hdr6->hop_limit = dec.ttl; + case MPT_IPV6: + if (net->mpls.ip_ttl_propagate) { + struct ipv6hdr *hdr6 = ipv6_hdr(skb); + + skb->protocol = htons(ETH_P_IPV6); here need to be done outside of net->mpls.ip_ttl_propagate otherwise ... + hdr6->hop_limit = dec.ttl; + } success = true; break; - } case MPT_UNSPEC: + /* Should have decided which protocol it is by now */ break; } disabling ip_ttl_propagate causes a corrupted packet to show up at the end host (after the LSP): Oops, good catch. Will fix in v2. Thanks, Rob
Re: [PATCH net-next] mpls: allow TTL propagation to/from IP packets to be configured
On 31/01/17 00:41, David Ahern wrote: On 1/30/17 1:36 PM, Robert Shearman wrote: It is sometimes desirable to present an MPLS transport network as a single hop to traffic transiting it because it prevents confusion when diagnosing failures. An example of where confusion can be generated is when addresses used in the provider network overlap with addresses in the overlay network and the addresses get exposed through ICMP errors generated as packets transit the provider network. Therefore, provide the ability to control whether the TTL value from an MPLS packet is propagated to an IPv4/IPv6 packet when the last label is popped through the addition of a new per-namespace sysctl: "net.mpls.ip_ttl_propagate" which defaults to enabled. Use the same sysctl to control whether the TTL is propagated from IP packets into the MPLS header. If the TTL isn't propagated then a default TTL value is used which can be configured via a new sysctl: "net.mpls.default_ttl". Signed-off-by: Robert Shearman --- Documentation/networking/mpls-sysctl.txt | 19 + include/net/netns/mpls.h | 3 ++ net/mpls/af_mpls.c | 70 net/mpls/mpls_iptunnel.c | 12 +- 4 files changed, 85 insertions(+), 19 deletions(-) diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt index 15d8d16934fd..b8f0725ff09e 100644 --- a/Documentation/networking/mpls-sysctl.txt +++ b/Documentation/networking/mpls-sysctl.txt @@ -19,6 +19,25 @@ platform_labels - INTEGER Possible values: 0 - 1048575 Default: 0 +ip_ttl_propagate - BOOL + Control whether TTL is propagated from the IPv4/IPv6 header to + the MPLS header on imposing labels and propagated from the + MPLS header to the IPv4/IPv6 header on popping the last label. + + If disabled, the MPLS transport network will appear as a + single hop to transit traffic. + + 0 - disabled + 1 - enabled (default) + It seems like you are going after RFC 3443 with this change. Can you add comment to that effect? i.e., ip_ttl_propagate enabled is the Uniform Model and ip_ttl_propagate disabled is the Short Pipe Model. Good idea, will add it in the appropriate place depending on the chosen API. Thanks, Rob
Re: [PATCH net-next] mpls: allow TTL propagation to/from IP packets to be configured
On 31/01/17 00:17, Eric W. Biederman wrote: Robert Shearman writes: It is sometimes desirable to present an MPLS transport network as a single hop to traffic transiting it because it prevents confusion when diagnosing failures. An example of where confusion can be generated is when addresses used in the provider network overlap with addresses in the overlay network and the addresses get exposed through ICMP errors generated as packets transit the provider network. Therefore, provide the ability to control whether the TTL value from an MPLS packet is propagated to an IPv4/IPv6 packet when the last label is popped through the addition of a new per-namespace sysctl: "net.mpls.ip_ttl_propagate" which defaults to enabled. Use the same sysctl to control whether the TTL is propagated from IP packets into the MPLS header. If the TTL isn't propagated then a default TTL value is used which can be configured via a new sysctl: "net.mpls.default_ttl". Instead of having a global sysctl can we please have a different way to configure the ingress/egress? My general memory is that this makes sense for a slightly different tunnel type. Making it a per mpls tunnel property instead of global property feels like it should be much more maintainable. RFC 3443 that David Ahern referenced does indeed infer that this should be a per-LSP property. However, it says: We also note here that signaling the LSP type (Pipe, Short Pipe or Uniform Model) is out of the scope of this document, and that is also not addressed in the current versions of the label distribution protocols, e.g. LDP [MPLS-LDP] and RSVP-TE [MPLS-RSVP]. Currently, the LSP type is configured by the network operator manually by means of either a command line or network management interface. AIUI, the situation of label distribution protocols not signaling this property hasn't changed from when this RFC has written, which limits the usefulness of a per-LSP property, and perhaps also indicates a lack of desire from users of this. Do you still feel it's worth implementing on a per-LSP basis? If so, any opinion on how it should be done for the pop case? Either a new per-path RTA attribute can be added, e.g. RTA_TTL_PROPAGATE, or a new rtnh flag could be added, e.g. RTNH_F_TTL_PROPAGATE. Similarly with the related behavior of what to do if the mpls ttl is exhausted during the trip through the tunnel. Drop or dig through the packet and send an ICMP error message at the ip layer. That's an interesting suggestion, but I don't think it will be useful when carrying another LSP over the LSP in question, since the LSR will have no idea what the label is being used for (i.e. the payload). If there is only one label in the packet then the router should know what the payload is of the label and since this is implicitly IPv4 or IPv6 at the moment (since those are the only types of traffic for which the labels can be used) then surely the ICMP should always be generated in that case? Thanks, Rob
[PATCH net-next] mpls: allow TTL propagation to/from IP packets to be configured
It is sometimes desirable to present an MPLS transport network as a single hop to traffic transiting it because it prevents confusion when diagnosing failures. An example of where confusion can be generated is when addresses used in the provider network overlap with addresses in the overlay network and the addresses get exposed through ICMP errors generated as packets transit the provider network. Therefore, provide the ability to control whether the TTL value from an MPLS packet is propagated to an IPv4/IPv6 packet when the last label is popped through the addition of a new per-namespace sysctl: "net.mpls.ip_ttl_propagate" which defaults to enabled. Use the same sysctl to control whether the TTL is propagated from IP packets into the MPLS header. If the TTL isn't propagated then a default TTL value is used which can be configured via a new sysctl: "net.mpls.default_ttl". Signed-off-by: Robert Shearman --- Documentation/networking/mpls-sysctl.txt | 19 + include/net/netns/mpls.h | 3 ++ net/mpls/af_mpls.c | 70 net/mpls/mpls_iptunnel.c | 12 +- 4 files changed, 85 insertions(+), 19 deletions(-) diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt index 15d8d16934fd..b8f0725ff09e 100644 --- a/Documentation/networking/mpls-sysctl.txt +++ b/Documentation/networking/mpls-sysctl.txt @@ -19,6 +19,25 @@ platform_labels - INTEGER Possible values: 0 - 1048575 Default: 0 +ip_ttl_propagate - BOOL + Control whether TTL is propagated from the IPv4/IPv6 header to + the MPLS header on imposing labels and propagated from the + MPLS header to the IPv4/IPv6 header on popping the last label. + + If disabled, the MPLS transport network will appear as a + single hop to transit traffic. + + 0 - disabled + 1 - enabled (default) + +default_ttl - BOOL + Default TTL value to use for MPLS packets where it cannot be + propagated from an IP header, either because one isn't present + or ip_ttl_propagate has been disabled. + + Possible values: 1 - 255 + Default: 255 + conf//input - BOOL Control whether packets can be input on this interface. diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h index d29203651c01..1b68aed6e1b9 100644 --- a/include/net/netns/mpls.h +++ b/include/net/netns/mpls.h @@ -10,7 +10,10 @@ struct ctl_table_header; struct netns_mpls { size_t platform_labels; + int ip_ttl_propagate; + int default_ttl; struct mpls_route __rcu * __rcu *platform_label; + struct ctl_table_header *ctl; }; diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 64d3bf269a26..bf5f0792e8a2 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -31,7 +31,9 @@ #define MPLS_NEIGH_TABLE_UNSPEC (NEIGH_LINK_TABLE + 1) static int zero = 0; +static int one = 1; static int label_limit = (1 << 20) - 1; +static int ttl_max = 255; static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt, struct nlmsghdr *nlh, struct net *net, u32 portid, @@ -219,8 +221,8 @@ static struct mpls_nh *mpls_select_multipath(struct mpls_route *rt, return &rt->rt_nh[nh_index]; } -static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, - struct mpls_entry_decoded dec) +static bool mpls_egress(struct net *net, struct mpls_route *rt, + struct sk_buff *skb, struct mpls_entry_decoded dec) { enum mpls_payload_type payload_type; bool success = false; @@ -243,24 +245,29 @@ static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, payload_type = ip_hdr(skb)->version; switch (payload_type) { - case MPT_IPV4: { - struct iphdr *hdr4 = ip_hdr(skb); - skb->protocol = htons(ETH_P_IP); - csum_replace2(&hdr4->check, - htons(hdr4->ttl << 8), - htons(dec.ttl << 8)); - hdr4->ttl = dec.ttl; + case MPT_IPV4: + if (net->mpls.ip_ttl_propagate) { + struct iphdr *hdr4 = ip_hdr(skb); + + skb->protocol = htons(ETH_P_IP); + csum_replace2(&hdr4->check, + htons(hdr4->ttl << 8), + htons(dec.ttl << 8)); + hdr4->ttl = dec.ttl; + } success = true; break; - } - case MPT_IPV6: { - struct ipv6hdr *hdr6 = ipv6_hdr(skb); - skb->protocol = htons(ETH_P_IPV6); - hdr6->hop_limit = dec.ttl; + case MPT_IPV6: + if (net->mpls.ip_ttl_
Re: [PATCH net] net: Avoid receiving packets with an l3mdev on unbound UDP sockets
On 30/01/17 20:01, David Miller wrote: From: David Ahern Date: Mon, 30 Jan 2017 11:52:01 -0700 On 1/26/17 11:02 AM, Robert Shearman wrote: Packets arriving in a VRF currently are delivered to UDP sockets that aren't bound to any interface. TCP defaults to not delivering packets arriving in a VRF to unbound sockets. IP route lookup and socket transmit both assume that unbound means using the default table and UDP applications that haven't been changed to be aware of VRFs may not function correctly in this case since they may not be able to handle overlapping IP address ranges, or be able to send packets back to the original sender if required. So add a sysctl, udp_l3mdev_accept, to control this behaviour with it being analgous to the existing tcp_l3mdev_accept, namely to allow a process to have a VRF-global listen socket. Have this default to off as this is the behaviour that users will expect, given that there is no explicit mechanism to set unmodified VRF-unaware application into a default VRF. Signed-off-by: Robert Shearman --- I've targetted this for the net tree because I believe the expected behaviour is different enough from the current behaviour to be considered a bug. However, this should also apply to the net-next tree as-is if this not deemed a bug. Does not apply to net-next; collision in sysctl_net_ipv4.c As for the code change, I have updated my unit tests and they all pass with this patch. Not sure why I marked my version as not working last November, but it is all good now. Acked-by: David Ahern Tested-by: David Ahern The conflict was easy enough to fix up, so I did it myself. Applied to net-next, thanks. Great, thanks.
[PATCH net] net: Avoid receiving packets with an l3mdev on unbound UDP sockets
Packets arriving in a VRF currently are delivered to UDP sockets that aren't bound to any interface. TCP defaults to not delivering packets arriving in a VRF to unbound sockets. IP route lookup and socket transmit both assume that unbound means using the default table and UDP applications that haven't been changed to be aware of VRFs may not function correctly in this case since they may not be able to handle overlapping IP address ranges, or be able to send packets back to the original sender if required. So add a sysctl, udp_l3mdev_accept, to control this behaviour with it being analgous to the existing tcp_l3mdev_accept, namely to allow a process to have a VRF-global listen socket. Have this default to off as this is the behaviour that users will expect, given that there is no explicit mechanism to set unmodified VRF-unaware application into a default VRF. Signed-off-by: Robert Shearman --- I've targetted this for the net tree because I believe the expected behaviour is different enough from the current behaviour to be considered a bug. However, this should also apply to the net-next tree as-is if this not deemed a bug. Documentation/networking/ip-sysctl.txt | 7 +++ Documentation/networking/vrf.txt | 7 --- include/net/netns/ipv4.h | 4 net/ipv4/sysctl_net_ipv4.c | 11 +++ net/ipv4/udp.c | 27 --- net/ipv6/udp.c | 27 --- 6 files changed, 66 insertions(+), 17 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 7dd65c9cf707..fa1f14977a0c 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -742,6 +742,13 @@ tcp_challenge_ack_limit - INTEGER UDP variables: +udp_l3mdev_accept - BOOLEAN + Enabling this option allows a "global" bound socket to work + across L3 master domains (e.g., VRFs) with packets capable of + being received regardless of the L3 domain in which they + originated. Only valid when the kernel was compiled with + CONFIG_NET_L3_MASTER_DEV. + udp_mem - vector of 3 INTEGERs: min, pressure, max Number of pages allowed for queueing by all UDP sockets. diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt index 755dab856392..3918dae964d4 100644 --- a/Documentation/networking/vrf.txt +++ b/Documentation/networking/vrf.txt @@ -98,10 +98,11 @@ VRF device: or to specify the output device using cmsg and IP_PKTINFO. -TCP services running in the default VRF context (ie., not bound to any VRF -device) can work across all VRF domains by enabling the tcp_l3mdev_accept -sysctl option: +TCP & UDP services running in the default VRF context (ie., not bound +to any VRF device) can work across all VRF domains by enabling the +tcp_l3mdev_accept and udp_l3mdev_accept sysctl options: sysctl -w net.ipv4.tcp_l3mdev_accept=1 +sysctl -w net.ipv4.udp_l3mdev_accept=1 netfilter rules on the VRF device can be used to limit access to services running in the default VRF context as well. diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h index 0378e88f6fd3..0822dced1b68 100644 --- a/include/net/netns/ipv4.h +++ b/include/net/netns/ipv4.h @@ -112,6 +112,10 @@ struct netns_ipv4 { unsigned int sysctl_tcp_notsent_lowat; int sysctl_tcp_tw_reuse; +#ifdef CONFIG_NET_L3_MASTER_DEV + int sysctl_udp_l3mdev_accept; +#endif + int sysctl_igmp_max_memberships; int sysctl_igmp_max_msf; int sysctl_igmp_llm_reports; diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c index b2fa498b15d1..a2ebbe6211ba 100644 --- a/net/ipv4/sysctl_net_ipv4.c +++ b/net/ipv4/sysctl_net_ipv4.c @@ -971,6 +971,17 @@ static struct ctl_table ipv4_net_table[] = { .extra2 = &one, }, #endif +#ifdef CONFIG_NET_L3_MASTER_DEV + { + .procname = "udp_l3mdev_accept", + .data = &init_net.ipv4.sysctl_udp_l3mdev_accept, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, +#endif { } }; diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 1307a7c2e544..c7fcb7395ccf 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -134,6 +134,17 @@ EXPORT_SYMBOL(udp_memory_allocated); #define MAX_UDP_PORTS 65536 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN) +/* IPCB reference means this can not be used from early demux */ +static bool udp_lib_exact_dif_match(struct net *net, struct sk_buff *skb) +{ +#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV) + if (!net->ipv4.sysctl_udp_l3mdev_accept && + skb &&am
[PATCH net v3 1/2] net: Specify the owning module for lwtunnel ops
Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so specify the owning module for all lwtunnel ops. Signed-off-by: Robert Shearman --- include/net/lwtunnel.h| 2 ++ net/core/lwt_bpf.c| 1 + net/ipv4/ip_tunnel_core.c | 2 ++ net/ipv6/ila/ila_lwt.c| 1 + net/ipv6/seg6_iptunnel.c | 1 + net/mpls/mpls_iptunnel.c | 1 + 6 files changed, 8 insertions(+) diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index 0b585f1fd340..73dd87647460 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -44,6 +44,8 @@ struct lwtunnel_encap_ops { int (*get_encap_size)(struct lwtunnel_state *lwtstate); int (*cmp_encap)(struct lwtunnel_state *a, struct lwtunnel_state *b); int (*xmit)(struct sk_buff *skb); + + struct module *owner; }; #ifdef CONFIG_LWTUNNEL diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c index 71bb3e2eca08..b3eef90b2df9 100644 --- a/net/core/lwt_bpf.c +++ b/net/core/lwt_bpf.c @@ -386,6 +386,7 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = { .fill_encap = bpf_fill_encap_info, .get_encap_size = bpf_encap_nlsize, .cmp_encap = bpf_encap_cmp, + .owner = THIS_MODULE, }; static int __init bpf_lwt_init(void) diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c index fed3d29f9eb3..0fd1976ab63b 100644 --- a/net/ipv4/ip_tunnel_core.c +++ b/net/ipv4/ip_tunnel_core.c @@ -313,6 +313,7 @@ static const struct lwtunnel_encap_ops ip_tun_lwt_ops = { .fill_encap = ip_tun_fill_encap_info, .get_encap_size = ip_tun_encap_nlsize, .cmp_encap = ip_tun_cmp_encap, + .owner = THIS_MODULE, }; static const struct nla_policy ip6_tun_policy[LWTUNNEL_IP6_MAX + 1] = { @@ -403,6 +404,7 @@ static const struct lwtunnel_encap_ops ip6_tun_lwt_ops = { .fill_encap = ip6_tun_fill_encap_info, .get_encap_size = ip6_tun_encap_nlsize, .cmp_encap = ip_tun_cmp_encap, + .owner = THIS_MODULE, }; void __init ip_tunnel_core_init(void) diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c index a7bc54ab46e2..13b5e85fe0d5 100644 --- a/net/ipv6/ila/ila_lwt.c +++ b/net/ipv6/ila/ila_lwt.c @@ -238,6 +238,7 @@ static const struct lwtunnel_encap_ops ila_encap_ops = { .fill_encap = ila_fill_encap_info, .get_encap_size = ila_encap_nlsize, .cmp_encap = ila_encap_cmp, + .owner = THIS_MODULE, }; int ila_lwt_init(void) diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c index 1d60cb132835..c46f8cbf5ab5 100644 --- a/net/ipv6/seg6_iptunnel.c +++ b/net/ipv6/seg6_iptunnel.c @@ -422,6 +422,7 @@ static const struct lwtunnel_encap_ops seg6_iptun_ops = { .fill_encap = seg6_fill_encap_info, .get_encap_size = seg6_encap_nlsize, .cmp_encap = seg6_encap_cmp, + .owner = THIS_MODULE, }; int __init seg6_iptunnel_init(void) diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c index 2f7ccd934416..1d281c1ff7c1 100644 --- a/net/mpls/mpls_iptunnel.c +++ b/net/mpls/mpls_iptunnel.c @@ -215,6 +215,7 @@ static const struct lwtunnel_encap_ops mpls_iptun_ops = { .fill_encap = mpls_fill_encap_info, .get_encap_size = mpls_encap_nlsize, .cmp_encap = mpls_encap_cmp, + .owner = THIS_MODULE, }; static int __init mpls_iptunnel_init(void) -- 2.1.4
[PATCH net v3 0/2] net: Fix oops on state free after lwt module unload
An oops is seen in lwtstate_free after an lwt ops module has been unloaded. This patchset fixes this by preventing modules implementing lwtunnel ops from being unloaded whilst there's state alive using those ops. The first patch adds fills in a new owner field in all lwt ops and the second patch makes use of this to reference count the modules as state is built and destroyed using them. Changes in v3: - don't put module reference if try_module_get fails on building state Changes in v2: - specify module owner for all modules as suggested by DaveM - reference count all modules building lwt state, not just those ops implementing destroy_state, as also suggested by DaveM. - rebased on top of David Ahern's lwtunnel changes Robert Shearman (2): net: Specify the owning module for lwtunnel ops lwtunnel: Fix oops on state free after encap module unload include/net/lwtunnel.h| 2 ++ net/core/lwt_bpf.c| 1 + net/core/lwtunnel.c | 6 +- net/ipv4/ip_tunnel_core.c | 2 ++ net/ipv6/ila/ila_lwt.c| 1 + net/ipv6/seg6_iptunnel.c | 1 + net/mpls/mpls_iptunnel.c | 1 + 7 files changed, 13 insertions(+), 1 deletion(-) -- 2.1.4
[PATCH net v3 2/2] lwtunnel: Fix oops on state free after encap module unload
When attempting to free lwtunnel state after the module for the encap has been unloaded an oops occurs: BUG: unable to handle kernel NULL pointer dereference at 0008 IP: lwtstate_free+0x18/0x40 [..] task: 88003e372380 task.stack: c91fc000 RIP: 0010:lwtstate_free+0x18/0x40 RSP: 0018:88003fd83e88 EFLAGS: 00010246 RAX: RBX: 88002bbb3380 RCX: 88000c91a300 [..] Call Trace: free_fib_info_rcu+0x195/0x1a0 ? rt_fibinfo_free+0x50/0x50 rcu_process_callbacks+0x2d3/0x850 ? rcu_process_callbacks+0x296/0x850 __do_softirq+0xe4/0x4cb irq_exit+0xb0/0xc0 smp_apic_timer_interrupt+0x3d/0x50 apic_timer_interrupt+0x93/0xa0 [..] Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8 The problem is after the module for the encap can be unloaded the corresponding ops is removed and is thus NULL here. Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so grab the module reference for the ops on creating lwtunnel state and of course release the reference when freeing the state. Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation") Signed-off-by: Robert Shearman --- net/core/lwtunnel.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c index 47b1dd65947b..c23465005f2f 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -115,8 +115,11 @@ int lwtunnel_build_state(struct net_device *dev, u16 encap_type, ret = -EOPNOTSUPP; rcu_read_lock(); ops = rcu_dereference(lwtun_encaps[encap_type]); - if (likely(ops && ops->build_state)) + if (likely(ops && ops->build_state && try_module_get(ops->owner))) { ret = ops->build_state(dev, encap, family, cfg, lws); + if (ret) + module_put(ops->owner); + } rcu_read_unlock(); return ret; @@ -194,6 +197,7 @@ void lwtstate_free(struct lwtunnel_state *lws) } else { kfree(lws); } + module_put(ops->owner); } EXPORT_SYMBOL(lwtstate_free); -- 2.1.4
Re: [PATCH net v2 2/2] lwtunnel: Fix oops on state free after encap module unload
On 23/01/17 20:42, David Miller wrote: From: Robert Shearman Date: Sat, 21 Jan 2017 00:21:26 + @@ -115,8 +115,12 @@ int lwtunnel_build_state(struct net_device *dev, u16 encap_type, ret = -EOPNOTSUPP; rcu_read_lock(); Here 'ret' equals -EOPNOTSUPP ops = rcu_dereference(lwtun_encaps[encap_type]); - if (likely(ops && ops->build_state)) - ret = ops->build_state(dev, encap, family, cfg, lws); + if (likely(ops)) { + if (likely(try_module_get(ops->owner) && ops->build_state)) + ret = ops->build_state(dev, encap, family, cfg, lws); + if (ret) + module_put(ops->owner); If try_module_get() fails, 'ret' will still be -EOPNOTSUPP and we will module_put() on a module we did not grab a reference to. I think you need to adjust the logic here. You only want to 'put' if try_module_get() succeeds and ->build_state() returns an error. Indeed, good point. Will address in a v3 shortly. Thanks, Rob
[PATCH net v2 1/2] net: Specify the owning module for lwtunnel ops
Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so specify the owning module for all lwtunnel ops. Signed-off-by: Robert Shearman --- include/net/lwtunnel.h| 2 ++ net/core/lwt_bpf.c| 1 + net/ipv4/ip_tunnel_core.c | 2 ++ net/ipv6/ila/ila_lwt.c| 1 + net/ipv6/seg6_iptunnel.c | 1 + net/mpls/mpls_iptunnel.c | 1 + 6 files changed, 8 insertions(+) diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index 0b585f1fd340..73dd87647460 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -44,6 +44,8 @@ struct lwtunnel_encap_ops { int (*get_encap_size)(struct lwtunnel_state *lwtstate); int (*cmp_encap)(struct lwtunnel_state *a, struct lwtunnel_state *b); int (*xmit)(struct sk_buff *skb); + + struct module *owner; }; #ifdef CONFIG_LWTUNNEL diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c index 71bb3e2eca08..b3eef90b2df9 100644 --- a/net/core/lwt_bpf.c +++ b/net/core/lwt_bpf.c @@ -386,6 +386,7 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = { .fill_encap = bpf_fill_encap_info, .get_encap_size = bpf_encap_nlsize, .cmp_encap = bpf_encap_cmp, + .owner = THIS_MODULE, }; static int __init bpf_lwt_init(void) diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c index fed3d29f9eb3..0fd1976ab63b 100644 --- a/net/ipv4/ip_tunnel_core.c +++ b/net/ipv4/ip_tunnel_core.c @@ -313,6 +313,7 @@ static const struct lwtunnel_encap_ops ip_tun_lwt_ops = { .fill_encap = ip_tun_fill_encap_info, .get_encap_size = ip_tun_encap_nlsize, .cmp_encap = ip_tun_cmp_encap, + .owner = THIS_MODULE, }; static const struct nla_policy ip6_tun_policy[LWTUNNEL_IP6_MAX + 1] = { @@ -403,6 +404,7 @@ static const struct lwtunnel_encap_ops ip6_tun_lwt_ops = { .fill_encap = ip6_tun_fill_encap_info, .get_encap_size = ip6_tun_encap_nlsize, .cmp_encap = ip_tun_cmp_encap, + .owner = THIS_MODULE, }; void __init ip_tunnel_core_init(void) diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c index a7bc54ab46e2..13b5e85fe0d5 100644 --- a/net/ipv6/ila/ila_lwt.c +++ b/net/ipv6/ila/ila_lwt.c @@ -238,6 +238,7 @@ static const struct lwtunnel_encap_ops ila_encap_ops = { .fill_encap = ila_fill_encap_info, .get_encap_size = ila_encap_nlsize, .cmp_encap = ila_encap_cmp, + .owner = THIS_MODULE, }; int ila_lwt_init(void) diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c index 1d60cb132835..c46f8cbf5ab5 100644 --- a/net/ipv6/seg6_iptunnel.c +++ b/net/ipv6/seg6_iptunnel.c @@ -422,6 +422,7 @@ static const struct lwtunnel_encap_ops seg6_iptun_ops = { .fill_encap = seg6_fill_encap_info, .get_encap_size = seg6_encap_nlsize, .cmp_encap = seg6_encap_cmp, + .owner = THIS_MODULE, }; int __init seg6_iptunnel_init(void) diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c index 2f7ccd934416..1d281c1ff7c1 100644 --- a/net/mpls/mpls_iptunnel.c +++ b/net/mpls/mpls_iptunnel.c @@ -215,6 +215,7 @@ static const struct lwtunnel_encap_ops mpls_iptun_ops = { .fill_encap = mpls_fill_encap_info, .get_encap_size = mpls_encap_nlsize, .cmp_encap = mpls_encap_cmp, + .owner = THIS_MODULE, }; static int __init mpls_iptunnel_init(void) -- 2.1.4
[PATCH net v2 0/2] net: Fix oops on state free after lwt module unload
An oops is seen in lwtstate_free after an lwt ops module has been unloaded. This patchset fixes this by preventing modules implementing lwtunnel ops from being unloaded whilst there's state alive using those ops. The first patch adds fills in a new owner field in all lwt ops and the second patch makes use of this to reference count the modules as state is built and destroyed using them. Changes in v2: - specify module owner for all modules as suggested by DaveM - reference count all modules building lwt state, not just those ops implementing destroy_state, as also suggested by DaveM. - rebased on top of David Ahern's lwtunnel changes Robert Shearman (2): net: Specify the owning module for lwtunnel ops lwtunnel: Fix oops on state free after encap module unload include/net/lwtunnel.h| 2 ++ net/core/lwt_bpf.c| 1 + net/core/lwtunnel.c | 9 +++-- net/ipv4/ip_tunnel_core.c | 2 ++ net/ipv6/ila/ila_lwt.c| 1 + net/ipv6/seg6_iptunnel.c | 1 + net/mpls/mpls_iptunnel.c | 1 + 7 files changed, 15 insertions(+), 2 deletions(-) -- 2.1.4
[PATCH net v2 2/2] lwtunnel: Fix oops on state free after encap module unload
When attempting to free lwtunnel state after the module for the encap has been unloaded an oops occurs: BUG: unable to handle kernel NULL pointer dereference at 0008 IP: lwtstate_free+0x18/0x40 [..] task: 88003e372380 task.stack: c91fc000 RIP: 0010:lwtstate_free+0x18/0x40 RSP: 0018:88003fd83e88 EFLAGS: 00010246 RAX: RBX: 88002bbb3380 RCX: 88000c91a300 [..] Call Trace: free_fib_info_rcu+0x195/0x1a0 ? rt_fibinfo_free+0x50/0x50 rcu_process_callbacks+0x2d3/0x850 ? rcu_process_callbacks+0x296/0x850 __do_softirq+0xe4/0x4cb irq_exit+0xb0/0xc0 smp_apic_timer_interrupt+0x3d/0x50 apic_timer_interrupt+0x93/0xa0 [..] Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8 The problem is after the module for the encap is unloaded the corresponding ops is removed and thus is NULL here. Modules implementing lwtunnel ops should not be allowed to unload while there is state alive using those ops, so grab the module reference for the ops on creating lwtunnel state and of course release the reference when freeing the state. Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation") Signed-off-by: Robert Shearman --- net/core/lwtunnel.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c index 47b1dd65947b..ebfaffaa777b 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -115,8 +115,12 @@ int lwtunnel_build_state(struct net_device *dev, u16 encap_type, ret = -EOPNOTSUPP; rcu_read_lock(); ops = rcu_dereference(lwtun_encaps[encap_type]); - if (likely(ops && ops->build_state)) - ret = ops->build_state(dev, encap, family, cfg, lws); + if (likely(ops)) { + if (likely(try_module_get(ops->owner) && ops->build_state)) + ret = ops->build_state(dev, encap, family, cfg, lws); + if (ret) + module_put(ops->owner); + } rcu_read_unlock(); return ret; @@ -194,6 +198,7 @@ void lwtstate_free(struct lwtunnel_state *lws) } else { kfree(lws); } + module_put(ops->owner); } EXPORT_SYMBOL(lwtstate_free); -- 2.1.4
Re: [PATCH net 0/2] net: Fix oops on state free after lwt module unload
On 20/01/17 17:03, David Miller wrote: From: Robert Shearman Date: Wed, 18 Jan 2017 15:32:01 + This patchset fixes an oops in lwtstate_free and a memory leak that would otherwise be exposed by ensuring that references are taken on modules that need to stay around to clean up lwt state. To faciliate this all ops that implement destroy_state and that can be configured to build as a module are changed specify the owner module in the ops. The intersection of those two sets is just ila at the moment. Two things: 1) Under no circumstances should we allow a lwtunnel ops implementing module to unload while there is a rule using those ops which is alive. Therefore, we should not special case the destroy op. We should unconditionally grab the module reference. 2) Please add the new 'owner' field and add an appropriate assignment for ops->owner to _every_ lwtunnel implementation, and do so in your first patch. Please do not only do this for ILA. Thanks. Very clear, makes sense, will do. Thanks, Rob
Re: [PATCH net] net: mpls: Fix multipath selection for LSR use case
On 20/01/17 00:51, David Ahern wrote: MPLS multipath for LSR is broken -- always selecting the first nexthop in the one label case. For example: $ ip netns exec ns1 ip -f mpls ro ls 100 nexthop as to 200 via inet 172.16.2.2 dev virt12 nexthop as to 300 via inet 172.16.3.2 dev virt13 101 nexthop as to 201 via inet6 2000:2::2 dev virt12 nexthop as to 301 via inet6 2000:3::2 dev virt13 In this example incoming packets have a single MPLS labels which means BOS bit is set. The BOS bit is passed from mpls_forward down to mpls_multipath_hash which never processes the hash loop because BOS is 1. Removing the bos arg from mpls_multipath_hash uncovers a number of other problems with the hash loop that processes the MPLS label stack -- from incorrect assumptions on the skb (skb has already pulled the first mpls label in mpls_forward yet loop assumes it is there) This was intentional because it doesn't really add anything to include the top-most label in the entropy since all traffic for the mpls_route will have the same top-most label, until support for sharing of mpls_routes is added. Having said that, it costs very little to do this, makes the code simpler and avoids the need to remember to change this if sharing is added, so it's fine with me. to incorrect pskb_may_pull checks (label_index starts at 0 and pskb_may_pull checks all use sizeof() * label_index). This patch addresses all problems by moving the skb_pull in mpls_forward after mpls_select_multipath. This allows mpls_multipath_hash to see the skb with the entire label stack as it arrived. From there mpls_multipath_hash is modified to additively compute the total mpls header length on each pass (on pass N mpls_hdr_len is N * sizeof(mpls_shim_hdr)). When the label is found with the BOS set it verifies the skb has sufficient header for ipv4 or ipv6, and find the IPv4 and IPv6 header by using the last mpls_hdr pointer and adding 1 to advance past it. With these changes I have verified the code correctly sees the label, BOS, IPv4 and IPv6 addresses in the network header and icmp/tcp/udp traffic for ipv4 and ipv6 are distributed across the nexthops. Fixes: 1c78efa8319ca ("mpls: flow-based multipath selection") Signed-off-by: David Ahern Acked-by: Robert Shearman Good catch, thanks for fixing.
[PATCH net 0/2] net: Fix oops on state free after lwt module unload
This patchset fixes an oops in lwtstate_free and a memory leak that would otherwise be exposed by ensuring that references are taken on modules that need to stay around to clean up lwt state. To faciliate this all ops that implement destroy_state and that can be configured to build as a module are changed specify the owner module in the ops. The intersection of those two sets is just ila at the moment. Robert Shearman (2): lwtunnel: Fix oops on state free after encap module unload ila: Fix memory leak of lwt dst cache on module unload include/net/lwtunnel.h | 2 ++ net/core/lwtunnel.c| 11 +-- net/ipv6/ila/ila_lwt.c | 1 + 3 files changed, 12 insertions(+), 2 deletions(-) -- 2.1.4
[PATCH net 1/2] lwtunnel: Fix oops on state free after encap module unload
When attempting to free lwtunnel state after the module for the encap has been unloaded an oops occurs: BUG: unable to handle kernel NULL pointer dereference at 0008 IP: lwtstate_free+0x18/0x40 [..] task: 88003e372380 task.stack: c91fc000 RIP: 0010:lwtstate_free+0x18/0x40 RSP: 0018:88003fd83e88 EFLAGS: 00010246 RAX: RBX: 88002bbb3380 RCX: 88000c91a300 [..] Call Trace: free_fib_info_rcu+0x195/0x1a0 ? rt_fibinfo_free+0x50/0x50 rcu_process_callbacks+0x2d3/0x850 ? rcu_process_callbacks+0x296/0x850 __do_softirq+0xe4/0x4cb irq_exit+0xb0/0xc0 smp_apic_timer_interrupt+0x3d/0x50 apic_timer_interrupt+0x93/0xa0 [..] Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8 The problem is that we don't check for NULL ops in lwtstate_free. Adding the check fixes the immediate problem but will then won't properly clean up for ops that implement the ->destroy_state function if the implementing module has been unloaded, resulting in memory leaks or other problems. So in addition, refcount the module when the ops implements ->destroy_state so it can't be unloaded while there is still state around. Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation") Signed-off-by: Robert Shearman --- include/net/lwtunnel.h | 2 ++ net/core/lwtunnel.c| 11 +-- 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index d4c1c75b8862..2b9993f33198 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -44,6 +44,8 @@ struct lwtunnel_encap_ops { int (*get_encap_size)(struct lwtunnel_state *lwtstate); int (*cmp_encap)(struct lwtunnel_state *a, struct lwtunnel_state *b); int (*xmit)(struct sk_buff *skb); + + struct module *owner; }; #ifdef CONFIG_LWTUNNEL diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c index a5d4e866ce88..3dc3cc3b38ec 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -126,8 +126,14 @@ int lwtunnel_build_state(struct net_device *dev, u16 encap_type, } } #endif - if (likely(ops && ops->build_state)) + /* take module reference if destroy_state is in use */ + if (unlikely(ops && ops->destroy_state && !try_module_get(ops->owner))) + ops = NULL; + if (likely(ops && ops->build_state)) { ret = ops->build_state(dev, encap, family, cfg, lws); + if (ret && ops->destroy_state) + module_put(ops->owner); + } rcu_read_unlock(); return ret; @@ -138,9 +144,10 @@ void lwtstate_free(struct lwtunnel_state *lws) { const struct lwtunnel_encap_ops *ops = lwtun_encaps[lws->type]; - if (ops->destroy_state) { + if (ops && ops->destroy_state) { ops->destroy_state(lws); kfree_rcu(lws, rcu); + module_put(ops->owner); } else { kfree(lws); } -- 2.1.4
[PATCH net 2/2] ila: Fix memory leak of lwt dst cache on module unload
If routes with lwt state are present when the ila module is unloaded and then subsequently deleted, the dst cache entry in the state will be leaked. Fix this by specifying the owning module in the lwt ops to allow lwt to take a reference for each route and to keep the module around until the last ila route is deleted. Fixes: 79ff2fc31e0f ("ila: Cache a route to translated address") Signed-off-by: Robert Shearman --- net/ipv6/ila/ila_lwt.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c index a7bc54ab46e2..13b5e85fe0d5 100644 --- a/net/ipv6/ila/ila_lwt.c +++ b/net/ipv6/ila/ila_lwt.c @@ -238,6 +238,7 @@ static const struct lwtunnel_encap_ops ila_encap_ops = { .fill_encap = ila_fill_encap_info, .get_encap_size = ila_encap_nlsize, .cmp_encap = ila_encap_cmp, + .owner = THIS_MODULE, }; int ila_lwt_init(void) -- 2.1.4
Re: [PATCH net] lwtunnel: fix autoload of lwt modules
On 17/01/17 17:07, David Ahern wrote: On 1/17/17 3:04 AM, Robert Shearman wrote: Is it safe to release the rtnl lock here? E.g. neither fib_get_nhs nor fib_create_info take a reference on the device and further up inet_rtm_newroute has a pointer to the struct fib_table without taking a reference. fib tables can not be deleted so that reference is safe. Looks like we can free a table on unmerging. Admittedly this is the local table so is unlikely to be one that lwtunnels are used on, but does it not need to be safe against races anyway? You are right about the dev reference in the IPv4 path but the thing is build_state does not need the device reference. Roopa: do you recall why dev is passed to build_state? Nothing about the encap should require a device reference and none of the current encaps use it. Unfortunately, I think more invasive changes are required to solve this. I'm happy to work on solving this. I have a patch that removes passing dev to build_state. Walking the remainder of the code I do not see any more references that would be affected by dropping rtnl here on the add path. Still looking at the delete path. Ok, I'll continue looking too and let you know if there's anything else that pops up. Having said that, even if we eliminate all the unreferenced objects in the current code, what are the chances that we'll be able to keep it this way going forward? Perhaps it would be safer to retry the insert operation from as close to the start as possible? Thanks, Rob
Re: [PATCH net] lwtunnel: fix autoload of lwt modules
On 17/01/17 05:33, David Ahern wrote: Trying to add an mpls encap route when the MPLS modules are not loaded hangs. For example: CONFIG_MPLS=y CONFIG_NET_MPLS_GSO=m CONFIG_MPLS_ROUTING=m CONFIG_MPLS_IPTUNNEL=m $ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2 The ip command hangs: root 880 826 0 21:25 pts/000:00:00 ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2 $ cat /proc/880/stack [] call_usermodehelper_exec+0xd6/0x134 [] __request_module+0x27b/0x30a [] lwtunnel_build_state+0xe4/0x178 [] fib_create_info+0x47f/0xdd4 [] fib_table_insert+0x90/0x41f [] inet_rtm_newroute+0x4b/0x52 ... modprobe is trying to load rtnl-lwt-MPLS: root 881 5 0 21:25 ?00:00:00 /sbin/modprobe -q -- rtnl-lwt-MPLS and it hangs after loading mpls_router: $ cat /proc/881/stack [] rtnl_lock+0x12/0x14 [] register_netdevice_notifier+0x16/0x179 [] mpls_init+0x25/0x1000 [mpls_router] [] do_one_initcall+0x8e/0x13f [] do_init_module+0x5a/0x1e5 [] load_module+0x13bd/0x17d6 ... The problem is that lwtunnel_build_state is called with rtnl lock held preventing mpls_init from registering. Fix by dropping the lock before invoking request_module. Fixes: 745041e2aaf1 ("lwtunnel: autoload of lwt modules") Signed-off-by: David Ahern --- This is a best guess at the Fixes. net/core/lwtunnel.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c index a5d4e866ce88..c14ee4d62a8a 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -120,7 +120,9 @@ int lwtunnel_build_state(struct net_device *dev, u16 encap_type, if (encap_type_str) { rcu_read_unlock(); + __rtnl_unlock(); request_module("rtnl-lwt-%s", encap_type_str); + rtnl_lock(); rcu_read_lock(); ops = rcu_dereference(lwtun_encaps[encap_type]); } Is it safe to release the rtnl lock here? E.g. neither fib_get_nhs nor fib_create_info take a reference on the device and further up inet_rtm_newroute has a pointer to the struct fib_table without taking a reference. Unfortunately, I think more invasive changes are required to solve this. I'm happy to work on solving this. Thanksm Rob
[PATCH net-next v2 1/2] net: AF-specific RTM_GETSTATS attributes
Add the functionality for including address-family-specific per-link stats in RTM_GETSTATS messages. This is done through adding a new IFLA_STATS_AF_SPEC attribute under which address family attributes are nested and then the AF-specific attributes can be further nested. This follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the advantage of presenting an easily extended hierarchy. The rtnl_af_ops structure is extended to provide AFs with the opportunity to fill and provide the size of their stats attributes. One alternative would have been to provide AFs with the ability to add attributes directly into the RTM_GETSTATS message without a nested hierarchy. I discounted this approach as it increases the rate at which the 32 attribute number space is used up and it makes implementation a little more tricky for stats dump resuming (at the moment the order in which attributes are added to the message has to match the numeric order of the attributes). Another alternative would have been to register per-AF RTM_GETSTATS handlers. I discounted this approach as I perceived a common use-case to be getting all the stats for an interface and this approach would necessitate multiple requests/dumps to retrieve them all. Signed-off-by: Robert Shearman Acked-by: Roopa Prabhu --- include/net/rtnetlink.h | 4 include/uapi/linux/if_link.h | 1 + net/core/rtnetlink.c | 50 3 files changed, 55 insertions(+) diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h index 4113916cc1bb..106de5f7bf06 100644 --- a/include/net/rtnetlink.h +++ b/include/net/rtnetlink.h @@ -139,6 +139,10 @@ struct rtnl_af_ops { const struct nlattr *attr); int (*set_link_af)(struct net_device *dev, const struct nlattr *attr); + + int (*fill_stats_af)(struct sk_buff *skb, +const struct net_device *dev); + size_t (*get_stats_af_size)(const struct net_device *dev); }; void __rtnl_af_unregister(struct rtnl_af_ops *ops); diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 6b13e591abc9..184b16ed2b84 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -847,6 +847,7 @@ enum { IFLA_STATS_LINK_XSTATS, IFLA_STATS_LINK_XSTATS_SLAVE, IFLA_STATS_LINK_OFFLOAD_XSTATS, + IFLA_STATS_AF_SPEC, __IFLA_STATS_MAX, }; diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 18b5aae99bec..4edc1bd7a735 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -3829,6 +3829,39 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, *idxattr = 0; } + if (stats_attr_valid(filter_mask, IFLA_STATS_AF_SPEC, *idxattr)) { + struct rtnl_af_ops *af_ops; + + *idxattr = IFLA_STATS_AF_SPEC; + attr = nla_nest_start(skb, IFLA_STATS_AF_SPEC); + if (!attr) + goto nla_put_failure; + + list_for_each_entry(af_ops, &rtnl_af_ops, list) { + if (af_ops->fill_stats_af) { + struct nlattr *af; + int err; + + af = nla_nest_start(skb, af_ops->family); + if (!af) + goto nla_put_failure; + + err = af_ops->fill_stats_af(skb, dev); + + if (err == -ENODATA) + nla_nest_cancel(skb, af); + else if (err < 0) + goto nla_put_failure; + + nla_nest_end(skb, af); + } + } + + nla_nest_end(skb, attr); + + *idxattr = 0; + } + nlmsg_end(skb, nlh); return 0; @@ -3885,6 +3918,23 @@ static size_t if_nlmsg_stats_size(const struct net_device *dev, if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_OFFLOAD_XSTATS, 0)) size += rtnl_get_offload_stats_size(dev); + if (stats_attr_valid(filter_mask, IFLA_STATS_AF_SPEC, 0)) { + struct rtnl_af_ops *af_ops; + + /* for IFLA_STATS_AF_SPEC */ + size += nla_total_size(0); + + list_for_each_entry(af_ops, &rtnl_af_ops, list) { + if (af_ops->get_stats_af_size) { + size += nla_total_size( + af_ops->get_stats_af_size(dev)); + + /* for AF_* */ + size += nla_total_size(0); + } + }
[PATCH net-next v2 0/2] mpls: Packet stats
This patchset records per-interface packet stats in the MPLS forwarding path and exports them using a nest of attributes root at a new IFLA_STATS_AF_SPEC attribute as part of RTM_GETSTATS messages: [IFLA_STATS_AF_SPEC] -> [AF_MPLS] -> [MPLS_STATS_LINK] -> struct mpls_link_stats The first patch adds the rtnl infrastructure for this, including a new callbacks to per-AF ops of fill_stats_af and get_stats_af_size. The second patch records MPLS stats and makes use of the infrastructure to export them. The rtnl infrastructure could also be used to export IPv6 stats in the future. Changes in v2: - make incrementing IPv6 stats in mpls_stats_inc_outucastpkts conditional on CONFIG_IPV6 to fix build with CONFIG_IPV6=n Robert Shearman (2): net: AF-specific RTM_GETSTATS attributes mpls: Packet stats include/net/rtnetlink.h | 4 + include/uapi/linux/if_link.h | 1 + include/uapi/linux/mpls.h| 30 +++ net/core/rtnetlink.c | 50 net/mpls/af_mpls.c | 181 +-- net/mpls/internal.h | 58 +- net/mpls/mpls_iptunnel.c | 11 ++- 7 files changed, 307 insertions(+), 28 deletions(-) -- 2.1.4
[PATCH net-next v2 2/2] mpls: Packet stats
Having MPLS packet stats is useful for observing network operation and for diagnosing network problems. In the absence of anything better, RFC2863 and RFC3813 are used for guidance for which stats to expose and the semantics of them. In particular rx_noroutes maps to in unknown protos in RFC2863. The stats are exposed to userspace via AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of RTM_GETSTATS messages. All the introduced fields are 64-bit, even error ones, to ensure no overflow with long uptimes. Per-CPU counters are used to avoid cache-line contention on the commonly used fields. The other fields have also been made per-CPU for code to avoid performance problems in error conditions on the assumption that on some platforms the cost of atomic operations could be more expensive than sending the packet (which is what would be done in the success case). If that's not the case, we could instead not use per-CPU counters for these fields. Only unicast and non-fragment are exposed at the moment, but other counters can be exposed in the future either by adding to the end of struct mpls_link_stats or by additional netlink attributes in the AF_MPLS IFLA_STATS_AF_SPEC nested attribute. Signed-off-by: Robert Shearman --- include/uapi/linux/mpls.h | 30 net/mpls/af_mpls.c| 181 -- net/mpls/internal.h | 58 ++- net/mpls/mpls_iptunnel.c | 11 ++- 4 files changed, 252 insertions(+), 28 deletions(-) diff --git a/include/uapi/linux/mpls.h b/include/uapi/linux/mpls.h index 24a6cb1aec86..77a19dfe3990 100644 --- a/include/uapi/linux/mpls.h +++ b/include/uapi/linux/mpls.h @@ -43,4 +43,34 @@ struct mpls_label { #define MPLS_LABEL_FIRST_UNRESERVED16 /* RFC3032 */ +/* These are embedded into IFLA_STATS_AF_SPEC: + * [IFLA_STATS_AF_SPEC] + * -> [AF_MPLS] + *-> [MPLS_STATS_xxx] + * + * Attributes: + * [MPLS_STATS_LINK] = { + * struct mpls_link_stats + * } + */ +enum { + MPLS_STATS_UNSPEC, /* also used as 64bit pad attribute */ + MPLS_STATS_LINK, + __MPLS_STATS_MAX, +}; + +#define MPLS_STATS_MAX (__MPLS_STATS_MAX - 1) + +struct mpls_link_stats { + __u64 rx_packets; /* total packets received */ + __u64 tx_packets; /* total packets transmitted*/ + __u64 rx_bytes; /* total bytes received */ + __u64 tx_bytes; /* total bytes transmitted */ + __u64 rx_errors; /* bad packets received */ + __u64 tx_errors; /* packet transmit problems */ + __u64 rx_dropped; /* packet dropped on receive*/ + __u64 tx_dropped; /* packet dropped on transmit */ + __u64 rx_noroute; /* no route for packet dest */ +}; + #endif /* _UAPI_MPLS_H */ diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 15fe97644ffe..4dc81963af8f 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -17,8 +18,8 @@ #include #if IS_ENABLED(CONFIG_IPV6) #include -#include #endif +#include #include #include "internal.h" @@ -48,11 +49,6 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index) return rt; } -static inline struct mpls_dev *mpls_dev_get(const struct net_device *dev) -{ - return rcu_dereference_rtnl(dev->mpls_ptr); -} - bool mpls_output_possible(const struct net_device *dev) { return dev && (dev->flags & IFF_UP) && netif_carrier_ok(dev); @@ -98,6 +94,31 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) } EXPORT_SYMBOL_GPL(mpls_pkt_too_big); +void mpls_stats_inc_outucastpkts(struct net_device *dev, +const struct sk_buff *skb) +{ + struct mpls_dev *mdev; + + if (skb->protocol == htons(ETH_P_MPLS_UC)) { + mdev = mpls_dev_get(dev); + if (mdev) + MPLS_INC_STATS_LEN(mdev, skb->len, + tx_packets, + tx_bytes); + } else if (skb->protocol == htons(ETH_P_IP)) { + IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len); +#if IS_ENABLED(CONFIG_IPV6) + } else if (skb->protocol == htons(ETH_P_IPV6)) { + struct inet6_dev *in6dev = __in6_dev_get(dev); + + if (in6dev) + IP6_UPD_PO_STATS(dev_net(dev), in6dev, +IPSTATS_MIB_OUT, skb->len); +#endif + } +} +EXPORT_SYMBOL_GPL(mpls_stats_inc_outucastpkts); + static u32 mpls_multipath_hash(struct mpls_route *rt, struct sk_buff *skb, bool bos) { @@ -253,6 +274,7 @@ static int mpls_forward(struct
Re: [PATCH net-next 2/2] mpls: Packet stats
On 14/01/17 06:41, kbuild test robot wrote: Hi Robert, [auto build test ERROR on net-next/master] url: https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_0day-2Dci_linux_commits_Robert-2DShearman_net-2DAF-2Dspecific-2DRTM-5FGETSTATS-2Dattributes_20170114-2D095819&d=DwIBAg&c=IL_XqQWOjubgfqINi2jTzg&r=HbzsGgI7IidV3zRVGbZOYKAd0w1ch0IRBV2pqme4k9w&m=-FIZAjvyj_cLBHBTW14fNKp5IiJpkr1laQrxgxKnZ3g&s=_d4vequnI_QXJkUb0X3X8AUHdi6qD4R86g-mTBsEJ7g&e= config: x86_64-randconfig-u0-01141334 (attached as .config) compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901 reproduce: # save the attached .config to linux build tree make ARCH=x86_64 All error/warnings (new ones prefixed by >>): In file included from include/net/netns/mib.h:4:0, from include/net/net_namespace.h:14, from include/linux/netdevice.h:43, from include/uapi/linux/if_arp.h:26, from include/linux/if_arp.h:27, from net/mpls/af_mpls.c:7: net/mpls/af_mpls.c: In function 'mpls_stats_inc_outucastpkts': include/net/ipv6.h:163:35: error: 'struct netns_mib' has no member named 'ipv6_statistics'; did you mean 'ip_statistics'? mod##SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val));\ Good catch kbuild test robot. Expect to see a v2 shortly with this CONFIG_IPV6=n build failure fixed. Thanks, Rob
[PATCH net-next 1/2] net: AF-specific RTM_GETSTATS attributes
Add the functionality for including address-family-specific per-link stats in RTM_GETSTATS messages. This is done through adding a new IFLA_STATS_AF_SPEC attribute under which address family attributes are nested and then the AF-specific attributes can be further nested. This follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the advantage of presenting an easily extended hierarchy. The rtnl_af_ops structure is extended to provide AFs with the opportunity to fill and provide the size of their stats attributes. One alternative would have been to provide AFs with the ability to add attributes directly into the RTM_GETSTATS message without a nested hierarchy. I discounted this approach as it increases the rate at which the 32 attribute number space is used up and it makes implementation a little more tricky for stats dump resuming (at the moment the order in which attributes are added to the message has to match the numeric order of the attributes). Another alternative would have been to register per-AF RTM_GETSTATS handlers. I discounted this approach as I perceived a common use-case to be getting all the stats for an interface and this approach would necessitate multiple requests/dumps to retrieve them all. Signed-off-by: Robert Shearman --- include/net/rtnetlink.h | 4 include/uapi/linux/if_link.h | 1 + net/core/rtnetlink.c | 50 3 files changed, 55 insertions(+) diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h index 4113916cc1bb..106de5f7bf06 100644 --- a/include/net/rtnetlink.h +++ b/include/net/rtnetlink.h @@ -139,6 +139,10 @@ struct rtnl_af_ops { const struct nlattr *attr); int (*set_link_af)(struct net_device *dev, const struct nlattr *attr); + + int (*fill_stats_af)(struct sk_buff *skb, +const struct net_device *dev); + size_t (*get_stats_af_size)(const struct net_device *dev); }; void __rtnl_af_unregister(struct rtnl_af_ops *ops); diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 6b13e591abc9..184b16ed2b84 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -847,6 +847,7 @@ enum { IFLA_STATS_LINK_XSTATS, IFLA_STATS_LINK_XSTATS_SLAVE, IFLA_STATS_LINK_OFFLOAD_XSTATS, + IFLA_STATS_AF_SPEC, __IFLA_STATS_MAX, }; diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c index 18b5aae99bec..4edc1bd7a735 100644 --- a/net/core/rtnetlink.c +++ b/net/core/rtnetlink.c @@ -3829,6 +3829,39 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, struct net_device *dev, *idxattr = 0; } + if (stats_attr_valid(filter_mask, IFLA_STATS_AF_SPEC, *idxattr)) { + struct rtnl_af_ops *af_ops; + + *idxattr = IFLA_STATS_AF_SPEC; + attr = nla_nest_start(skb, IFLA_STATS_AF_SPEC); + if (!attr) + goto nla_put_failure; + + list_for_each_entry(af_ops, &rtnl_af_ops, list) { + if (af_ops->fill_stats_af) { + struct nlattr *af; + int err; + + af = nla_nest_start(skb, af_ops->family); + if (!af) + goto nla_put_failure; + + err = af_ops->fill_stats_af(skb, dev); + + if (err == -ENODATA) + nla_nest_cancel(skb, af); + else if (err < 0) + goto nla_put_failure; + + nla_nest_end(skb, af); + } + } + + nla_nest_end(skb, attr); + + *idxattr = 0; + } + nlmsg_end(skb, nlh); return 0; @@ -3885,6 +3918,23 @@ static size_t if_nlmsg_stats_size(const struct net_device *dev, if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_OFFLOAD_XSTATS, 0)) size += rtnl_get_offload_stats_size(dev); + if (stats_attr_valid(filter_mask, IFLA_STATS_AF_SPEC, 0)) { + struct rtnl_af_ops *af_ops; + + /* for IFLA_STATS_AF_SPEC */ + size += nla_total_size(0); + + list_for_each_entry(af_ops, &rtnl_af_ops, list) { + if (af_ops->get_stats_af_size) { + size += nla_total_size( + af_ops->get_stats_af_size(dev)); + + /* for AF_* */ + size += nla_total_size(0); + } + } + } + return size; } -- 2.1.4
[PATCH net-next 0/2] mpls: Packet stats
This patchset records per-interface packet stats in the MPLS forwarding path and exports them using a nest of attributes root at a new IFLA_STATS_AF_SPEC attribute as part of RTM_GETSTATS messages: [IFLA_STATS_AF_SPEC] -> [AF_MPLS] -> [MPLS_STATS_LINK] -> struct mpls_link_stats The first patch adds the rtnl infrastructure for this, including a new callbacks to per-AF ops of fill_stats_af and get_stats_af_size. The second patch records MPLS stats and makes use of the infrastructure to export them. The rtnl infrastructure could also be used to export IPv6 stats in the future. Robert Shearman (2): net: AF-specific RTM_GETSTATS attributes mpls: Packet stats include/net/rtnetlink.h | 4 + include/uapi/linux/if_link.h | 1 + include/uapi/linux/mpls.h| 30 net/core/rtnetlink.c | 50 net/mpls/af_mpls.c | 179 +-- net/mpls/internal.h | 58 +- net/mpls/mpls_iptunnel.c | 11 ++- 7 files changed, 305 insertions(+), 28 deletions(-) -- 2.1.4
[PATCH net-next 2/2] mpls: Packet stats
Having MPLS packet stats is useful for observing network operation and for diagnosing network problems. In the absence of anything better, RFC2863 and RFC3813 are used for guidance for which stats to expose and the semantics of them. In particular rx_noroutes maps to in unknown protos in RFC2863. The stats are exposed to userspace via AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of RTM_GETSTATS messages. All the introduced fields are 64-bit, even error ones, to ensure no overflow with long uptimes. Per-CPU counters are used to avoid cache-line contention on the commonly used fields. The other fields have also been made per-CPU for code to avoid performance problems in error conditions on the assumption that on some platforms the cost of atomic operations could be more expensive than sending the packet (which is what would be done in the success case). If that's not the case, we could instead not use per-CPU counters for these fields. Only unicast and non-fragment are exposed at the moment, but other counters can be exposed in the future either by adding to the end of struct mpls_link_stats or by additional netlink attributes in the AF_MPLS IFLA_STATS_AF_SPEC nested attribute. Signed-off-by: Robert Shearman --- include/uapi/linux/mpls.h | 30 net/mpls/af_mpls.c| 179 -- net/mpls/internal.h | 58 ++- net/mpls/mpls_iptunnel.c | 11 ++- 4 files changed, 250 insertions(+), 28 deletions(-) diff --git a/include/uapi/linux/mpls.h b/include/uapi/linux/mpls.h index 24a6cb1aec86..77a19dfe3990 100644 --- a/include/uapi/linux/mpls.h +++ b/include/uapi/linux/mpls.h @@ -43,4 +43,34 @@ struct mpls_label { #define MPLS_LABEL_FIRST_UNRESERVED16 /* RFC3032 */ +/* These are embedded into IFLA_STATS_AF_SPEC: + * [IFLA_STATS_AF_SPEC] + * -> [AF_MPLS] + *-> [MPLS_STATS_xxx] + * + * Attributes: + * [MPLS_STATS_LINK] = { + * struct mpls_link_stats + * } + */ +enum { + MPLS_STATS_UNSPEC, /* also used as 64bit pad attribute */ + MPLS_STATS_LINK, + __MPLS_STATS_MAX, +}; + +#define MPLS_STATS_MAX (__MPLS_STATS_MAX - 1) + +struct mpls_link_stats { + __u64 rx_packets; /* total packets received */ + __u64 tx_packets; /* total packets transmitted*/ + __u64 rx_bytes; /* total bytes received */ + __u64 tx_bytes; /* total bytes transmitted */ + __u64 rx_errors; /* bad packets received */ + __u64 tx_errors; /* packet transmit problems */ + __u64 rx_dropped; /* packet dropped on receive*/ + __u64 tx_dropped; /* packet dropped on transmit */ + __u64 rx_noroute; /* no route for packet dest */ +}; + #endif /* _UAPI_MPLS_H */ diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 15fe97644ffe..fb20941cdda2 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -17,8 +18,8 @@ #include #if IS_ENABLED(CONFIG_IPV6) #include -#include #endif +#include #include #include "internal.h" @@ -48,11 +49,6 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index) return rt; } -static inline struct mpls_dev *mpls_dev_get(const struct net_device *dev) -{ - return rcu_dereference_rtnl(dev->mpls_ptr); -} - bool mpls_output_possible(const struct net_device *dev) { return dev && (dev->flags & IFF_UP) && netif_carrier_ok(dev); @@ -98,6 +94,29 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) } EXPORT_SYMBOL_GPL(mpls_pkt_too_big); +void mpls_stats_inc_outucastpkts(struct net_device *dev, +const struct sk_buff *skb) +{ + struct mpls_dev *mdev; + struct inet6_dev *in6dev; + + if (skb->protocol == htons(ETH_P_MPLS_UC)) { + mdev = mpls_dev_get(dev); + if (mdev) + MPLS_INC_STATS_LEN(mdev, skb->len, + tx_packets, + tx_bytes); + } else if (skb->protocol == htons(ETH_P_IP)) { + IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len); + } else if (skb->protocol == htons(ETH_P_IPV6)) { + in6dev = __in6_dev_get(dev); + if (in6dev) + IP6_UPD_PO_STATS(dev_net(dev), in6dev, +IPSTATS_MIB_OUT, skb->len); + } +} +EXPORT_SYMBOL_GPL(mpls_stats_inc_outucastpkts); + static u32 mpls_multipath_hash(struct mpls_route *rt, struct sk_buff *skb, bool bos) { @@ -253,6 +272,7 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *d
Re: [net PATCH 2/2] ipv4: Drop suffix update from resize code
On 05/12/16 17:28, David Miller wrote: From: Robert Shearman Date: Mon, 5 Dec 2016 15:05:18 + On 01/12/16 12:27, Alexander Duyck wrote: It has been reported that update_suffix can be expensive when it is called on a large node in which most of the suffix lengths are the same. The time required to add 200K entries had increased from around 3 seconds to almost 49 seconds. In order to address this we need to move the code for updating the suffix out of resize and instead just have it handled in the cases where we are pushing a node that increases the suffix length, or will decrease the suffix length. Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length") Reported-by: Robert Shearman Signed-off-by: Alexander Duyck $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists What are these errors all about? These are just routes that are already added by the system but are present in the dump: $ ip route showdump < ~/allroutes | grep -v 110.110.110.2 default via 192.168.100.1 dev eth0 proto static metric 1024 10.37.96.0/20 dev eth2 proto kernel scope link src 10.37.96.204 110.110.110.0/24 dev eth1 proto kernel scope link src 110.110.110.1 192.168.100.0/24 dev eth0 proto kernel scope link src 192.168.100.153 So the errors are expected and are seen both with and without these patches. Thanks, Rob
Re: [net PATCH 2/2] ipv4: Drop suffix update from resize code
On 01/12/16 12:27, Alexander Duyck wrote: It has been reported that update_suffix can be expensive when it is called on a large node in which most of the suffix lengths are the same. The time required to add 200K entries had increased from around 3 seconds to almost 49 seconds. In order to address this we need to move the code for updating the suffix out of resize and instead just have it handled in the cases where we are pushing a node that increases the suffix length, or will decrease the suffix length. Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length") Reported-by: Robert Shearman Signed-off-by: Alexander Duyck $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists real0m4.026s user0m0.156s sys 0m2.500s $ time sudo ip route flush via 110.110.110.2 real0m5.423s user0m0.064s sys 0m1.096s This reduces the time taken to both add and delete the 200k routes back down to levels comparable before commit 5405afd1a306. The changes look good to me too. Nicely done Alex! Reviewed-by: Robert Shearman Tested-by: Robert Shearman Thanks, Rob
Re: [net PATCH 1/2] ipv4: Drop leaf from suffix pull/push functions
On 01/12/16 12:27, Alexander Duyck wrote: It wasn't necessary to pass a leaf in when doing the suffix updates so just drop it. Instead just pass the suffix and work with that. Since we dropped the leaf there is no need to include that in the name so the names are updated to node_push_suffix and node_pull_suffix. Finally I noticed that the logic for pulling the suffix length back actually had some issues. Specifically it would stop prematurely if there was a longer suffix, but it was not as long as the original suffix. I updated the code to address that in node_pull_suffix. Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length") Suggested-by: Robert Shearman Signed-off-by: Alexander Duyck This addresses an issue in the current code whereby when a fib alias is removed from a leaf when there are other aliases remaining then the suffix length may not be updated in the conditions described above. This is because fib_remove_alias doesn't call trie_rebalance (or resize) in this case, as there's no need since no nodes have been added/removed. This can be reproduced with this structure of the fib trie and by adding and then deleting 10.37.96.0/19: +-- 10.37.0.0/16 4 1 8 suffix/19 |-- 10.37.0.0 /24 universe UNICAST |-- 10.37.32.0 /24 universe UNICAST |-- 10.37.64.0 /20 universe UNICAST |-- 10.37.95.0 /24 universe UNICAST +-- 10.37.96.0/20 2 1 2 suffix/21 +-- 10.37.96.0/22 2 1 2 suffix/24 +-- 10.37.96.0/24 2 1 2 suffix/24 |-- 10.37.96.0 /32 link BROADCAST /24 link UNICAST +-- 10.37.96.192/26 2 0 2 suffix/28 |-- 10.37.96.204 /32 host LOCAL |-- 10.37.96.255 /32 link BROADCAST |-- 10.37.98.0 /25 universe UNICAST |-- 10.37.104.0 /21 universe UNICAST +-- 10.37.112.0/23 2 0 2 suffix/24 |-- 10.37.112.0 /24 universe UNICAST |-- 10.37.113.0 /24 universe UNICAST |-- 10.37.223.0 /24 universe UNICAST |-- 10.37.224.0 /24 universe UNICAST I've verified that with the fix included here that the suffix length on the 10.37.0.0/16 node now gets updated correctly to be /20. Reviewed-by: Robert Shearman Tested-by: Robert Shearman Thanks, Rob
Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
On 01/12/16 18:29, Alexander Duyck wrote: On Wed, Nov 30, 2016 at 4:27 PM, Robert Shearman wrote: On 29/11/16 23:14, Alexander Duyck wrote: On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length") Signed-off-by: Robert Shearman So I am not entirely convinced this is a regression, I was wondering what your numbers looked like before the patch you are saying this fixes? I recall I fixed a number of issues leading up to this so I am just curious. At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie: Remove checks for index >= tnode_child_length from tnode_get_child") which is the parent of 5405afd1a306: $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists real0m3.451s user0m0.184s sys 0m2.004s At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add tracking value for suffix length"): $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists real0m48.624s user0m0.360s sys 0m46.960s So does that warrant a fixes tag for this patch? Yes, please include it in the patch description next time. I've been giving it thought and the fact is the patch series has merit. I just think it was being a bit too aggressive in terms of some of the changes. So placing any changes in put_child seem to be the one piece in this that make this a bit ugly. Does that imply the current updating of the parent's slen value should be removed from put_child then? I started out without doing the changes in put_child, but that meant the changes to push the slen up through the trie were spread all over the place. This seemed like the cleaner solution. + } +} + /* Add a child at position i overwriting the old value. * Update the value of full_children and empty_children. + * + * The suffix length of the parent node and the rest of the tree is + * updated (if required) when adding/replacing a node, but is caller's + * responsibility on removal. */ static void put_child(struct key_vector *tn, unsigned long i, struct key_vector *n) @@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned long i, else if (!wasfull && isfull) tn_info(tn)->full_children++; - if (n && (tn->slen < n->slen)) - tn->slen = n->slen; + if (n) + node_push_suffix(tn, n); This is just creating redundant work if we have to perform a resize. The only real redundant work is the assignment of slen in tn, since the propagation up the trie has to happen regardless and if a subsequent resize results in changes to the trie then the propagation will stop at tn's parent since the suffix length of the tn's parent will not be less than tn's suffix length. This isn't going to work. We want to avoid trying to push the suffix when we place a child. There are scenarios where we are placing children in nodes that don't have parents yet because we are doing rebalances and such. I suspect this could be pretty expensive on a resize call. It's not expensive because the new parents that are being built up are orphaned from the rest of the trie, so the push up won't proceed into the rest of the trie until the new node is inserted into the trie. We want to pull the suffix pushing out of the resize completely. The problem is this is going to cause us to start pushing suffixes for every node moved as a part of resize. This will mean that nodes will have to be visited multiple times, which could well be more expensive than doing it during the resize. rcu_assign_pointer(tn->tnode[i], n); } ... -static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l) +static void node_set_suffix(struct key_vector *tp, unsigned char slen) { - while ((tp->slen > tp->pos) && (tp->slen > l->slen)) { - if (update_suffix(tp) > l->slen) + if (slen > tp->slen) + tp->slen = slen; +} + +static void node_pull_suffix(struct key_vector *tn) +{ + struct key_vector *tp; + unsigned char slen; + + slen = update_suffix(tn); + tp = node_parent(tn); + while ((tp->slen > tp->pos) && (tp->slen > slen)) { + if (update_suffix(tp) > slen) break; tp = node_parent(tp); } } Actually I realized that there is a bug here. The check for update_suffix(tp) > slen can cause us to stop prematurely if I am not mistaken. What we should probably be doing is pulling out tp->slen and if the
Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required
On 29/11/16 23:14, Alexander Duyck wrote: On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman wrote: With certain distributions of routes it can take a long time to add and delete further routes at scale. For example, with a route for 10.37.96.0/20 present it takes 47s to add ~200k contiguous /24 routes from 8.0.0.0/24 through to 11.138.207.0/24. Perf shows the bottleneck is update_suffix: 40.39% [kernel] [k] update_suffix 8.02% libnl-3.so.200.19.0 [.] nl_hash_table_lookup Well yeah, it will be expensive when you have over 512K entries in a single node. I'm assuming that is the case based on the fact that 200K routes will double up in the trie node due to broadcast and the route ending up in adjacent key vectors. The example scenario was in fact using a large scale of just routes rather than addresses so there are no broadcast leafs (nor local leafs): +-- 8.0.0.0/6 18 2 52436 suffix/20 |-- 8.0.0.0 /24 universe UNICAST |-- 8.0.1.0 /24 universe UNICAST |-- 8.0.2.0 /24 universe UNICAST (the "suffix/20" is for test purposes as per below). In this case the 8.0.0.0/6 node has a child array of size 2^18 = 262144. With these changes, the time is reduced to 4s for the same scale and distribution of routes. The issue is that update_suffix does an O(n) walk on the children of a node and the with a dense distribtion of routes the number of children in a node tends towards the number of nodes in the tree. Other than the performance I was just wondering if you did any other validation to confirm that nothing else differs. Basically it would just be a matter of verifying that /proc/net/fib_trie is the same before and after your patch. Yes, to verify these changes I applied some local changes to dump the slen field of trie nodes from /proc/net/fib_trie. I presumed that the format of /proc/net/fib_trie is a public API that we can't break so I intend to submit this as a patch. I verified by inspection that the suffix length listed in the nodes was as expected when exercising the insert and remove functions for routes with both larger and smaller suffixes than in the current nodes, and then for the inflate, halve and flush cases. I've now verified that a diff of /proc/net/fib_trie before and after my patch with all the routes added of my example scenario shows no changes. ... Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length") Signed-off-by: Robert Shearman So I am not entirely convinced this is a regression, I was wondering what your numbers looked like before the patch you are saying this fixes? I recall I fixed a number of issues leading up to this so I am just curious. At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie: Remove checks for index >= tnode_child_length from tnode_get_child") which is the parent of 5405afd1a306: $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists real0m3.451s user0m0.184s sys 0m2.004s At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add tracking value for suffix length"): $ time sudo ip route restore < ~/allroutes RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists real0m48.624s user0m0.360s sys 0m46.960s So does that warrant a fixes tag for this patch? My thought is this seems more like a performance optimization than a fix. If that is the case this probably belongs in net-next. It seems to me we might be able to simplify update_suffix rather than going through all this change. That might be something that is more acceptable for net. Have you looked at simply adding code so that there is a break inside update_suffix if (slen == tn->slen)? We don't have to call it for if a leaf is added so it might make sense to add that check. That doesn't really help since the search always starts from the smallest suffix length and thus could potentially require visiting a large number of children before finding the node that makes slen == tn->slen. In the example above, 10.37.96.0/20 would be child number 140640 in node 8.0.0.0/6 18. Since the loop starts out with stride = 2 this means that it requires 70320 iterations round to find 10.37.96.0/20 which contributes the largest suffix length to the node. Now it may be possible to improve the algorithm by starting the search from the largest suffix length towards the smallest suffix length instead of the current smallest to largest, but there would still be distributions of routes that would lead to needing to visit a large number of nodes only to find that nothing has changed. --- net/ipv4/fib_trie.c | 86
[PATCH net] fib_trie: Avoid expensive update of suffix length when not required
With certain distributions of routes it can take a long time to add and delete further routes at scale. For example, with a route for 10.37.96.0/20 present it takes 47s to add ~200k contiguous /24 routes from 8.0.0.0/24 through to 11.138.207.0/24. Perf shows the bottleneck is update_suffix: 40.39% [kernel] [k] update_suffix 8.02% libnl-3.so.200.19.0 [.] nl_hash_table_lookup With these changes, the time is reduced to 4s for the same scale and distribution of routes. The issue is that update_suffix does an O(n) walk on the children of a node and the with a dense distribtion of routes the number of children in a node tends towards the number of nodes in the tree. In the add case it isn't necessary to walk all the other children to update the largest suffix length of the node (which is what update_suffix is doing) since we already know what the largest suffix length of all the other children is and we only need to update it if the new node/leaf has a larger suffix. However, it is currently called from resize which is called on any update to rebalance the trie. Therefore improve the scaling by moving the responsibility of recalculating the node's largest suffix length out of the resize function into its callers. For fib_insert_node this can be taken care of by extending put_child to not only update the largest suffix length in the parent node, but to propagate it all the way up the trie as required, using a new function, node_push_suffix, based on leaf_push_suffix, but renamed to reflect its purpose and made safe if the node has no parent. No changes are required to inflate/halve since the maximum suffix length of the sub-trie starting from the node's parent isn't changed, and the suffix lengths of the halved/inflated nodes are updated through the creation of the new nodes and put_child called to add the children to them. In both fib_table_flush and fib_table_flush_external, given that a walk of the entire trie is being done there's a choice of optimising either for the case of a small number of leafs being flushed by updating the suffix on a change and propagating it up, or optimising for large numbers of leafs being flushed by deferring the updating of the largest suffix length to the walk up to the parent node. I opted for the latter to keep the algorithm linear in complexity in the case of flushing all leafs and because it is close to the status quo. Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length") Signed-off-by: Robert Shearman --- net/ipv4/fib_trie.c | 86 ++--- 1 file changed, 62 insertions(+), 24 deletions(-) diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 026f309c51e9..701cae8af44a 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -421,8 +421,22 @@ static inline int tnode_full(struct key_vector *tn, struct key_vector *n) return n && ((n->pos + n->bits) == tn->pos) && IS_TNODE(n); } +static void node_push_suffix(struct key_vector *tn, struct key_vector *l) +{ + while (tn->slen < l->slen) { + tn->slen = l->slen; + tn = node_parent(tn); + if (!tn) + break; + } +} + /* Add a child at position i overwriting the old value. * Update the value of full_children and empty_children. + * + * The suffix length of the parent node and the rest of the tree is + * updated (if required) when adding/replacing a node, but is caller's + * responsibility on removal. */ static void put_child(struct key_vector *tn, unsigned long i, struct key_vector *n) @@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned long i, else if (!wasfull && isfull) tn_info(tn)->full_children++; - if (n && (tn->slen < n->slen)) - tn->slen = n->slen; + if (n) + node_push_suffix(tn, n); rcu_assign_pointer(tn->tnode[i], n); } @@ -919,34 +933,35 @@ static struct key_vector *resize(struct trie *t, struct key_vector *tn) if (max_work != MAX_WORK) return tp; - /* push the suffix length to the parent node */ - if (tn->slen > tn->pos) { - unsigned char slen = update_suffix(tn); - - if (slen > tp->slen) - tp->slen = slen; - } - return tp; } -static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l) +static void node_set_suffix(struct key_vector *tp, unsigned char slen) { - while ((tp->slen > tp->pos) && (tp->slen > l->slen)) { - if (update_suffix(tp) > l->slen) + if (slen > tp->slen) + tp->slen = slen; +} + +static void node_pull_suffix(struct key_vector *tn) +{ +
Re: [PATCH] mpls: Add missing RCU-bh read side critical section locking in output path
On 20/06/16 19:05, Lennert Buytenhek wrote: From: David Barroso When locally originated IP traffic hits a route that says to push MPLS labels, we'll get a call chain dst_output() -> lwtunnel_output() -> mpls_output() -> neigh_xmit() -> ___neigh_lookup_noref() where the last function in this chain accesses a RCU-bh protected struct neigh_table pointer without us ever having declared an RCU-bh read side critical section. As in case of locally originated IP traffic we'll be running in process context, with softirqs enabled, we can be preempted by a softirq at any time, and RCU-bh considers the completion of a softirq as signaling the end of any pending read-side critical sections, so if we do get a softirq here, we can end up with an unexpected RCU grace period and all the nastiness that that comes with. This patch makes neigh_xmit() take rcu_read_{,un}lock_bh() around the code that expects to be treated as an RCU-bh read side critical section. Signed-off-by: David Barroso Signed-off-by: Lennert Buytenhek LGTM too. Acked-by: Robert Shearman
Re: [PATCH net-next] mpls: allow routes on ipgre devices
On 16/06/16 09:09, Simon Horman wrote: This appears to be necessary and sufficient to provide MPLS in GRE (RFC4023) support. This can be used by establishing an ipgre tunnel device and then routing MPLS over it. The following example will forward MPLS frames received with an outermost MPLS label 100 over tun1, a GRE tunnel. The forwarded packet will have the outermost MPLS LSE removed and two new LSEs added with labels 200 (outermost) and 300 (next). ip link add name tun1 type gre remote 10.0.99.193 local 10.0.99.192 ttl 225 ip link set up dev tun1 ip addr add 10.0.98.192/24 dev tun1 ip route sh echo 1 > /proc/sys/net/mpls/conf/eth0/input echo 101 > /proc/sys/net/mpls/platform_labels ip -f mpls route add 100 as 200/300 via inet 10.0.98.193 ip -f mpls route sh Also remove unnecessary braces. Reviewed-by: Dinan Gunawardena Thanks for testing this and making the necessary change. Acked-by: Robert Shearman Signed-off-by: Simon Horman
Re: [PATCH net-next 1/2] mpls: packet stats
On 16/02/16 20:26, roopa wrote: On 2/16/16, 7:41 AM, David Miller wrote: Statistics not provided via netlink are useless in real installations. In fact I would say to forego the proc interface entirely, it's a second class citizen for statistics gathering and has a non-triviel per-device cost for instantiation. +1 I agree with the cost too. Robert, I was thinking of responding to your series to add netlink stats for AF_MPLS in rtnl_af_ops (similar to IFLA_INET6_STATS). But, soon realized that it is currently used by ipv6 alone and it also ended up with a skip filter (RTEXT_FILTER_SKIP_STATS). So, extending that interface for mpls is not the right thing to do either. ipv4 doesn't have per-interface stats, so the fact that it doesn't use fill_link_af to export stats doesn't really add much to the argument. The real issue with the IFLA_INET6_STATS mechanism is that they are included in netlink broadcasts, not just netlink unicasts and there is no way of filtering for broadcasts at the moment. There is work being done for a separate netlink infrastructure for stats. I would wait for that infrastructure to be ready to add mpls stats. It should be available soon. Great, any details on what it would look like? Can I assist in any way in the development? In the mean time, I'll rebase and resubmit the ip ttl propagation patch separately. Thanks, Rob
[PATCH net-next v2 1/3] lwtunnel: autoload of lwt modules
The lwt implementations using net devices can autoload using the existing mechanism using IFLA_INFO_KIND. However, there's no mechanism that lwt modules not using net devices can use. Therefore, add the ability to autoload modules registering lwt operations for lwt implementations not using a net device so that users don't have to manually load the modules. Only users with the CAP_NET_ADMIN capability can cause modules to be loaded, which is ensured by rtnetlink_rcv_msg rejecting non-RTM_GETxxx messages for users without this capability, and by lwtunnel_build_state not being called in response to RTM_GETxxx messages. Signed-off-by: Robert Shearman --- include/net/lwtunnel.h | 4 +++- net/core/lwtunnel.c| 37 + 2 files changed, 40 insertions(+), 1 deletion(-) diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index 66350ce3e955..e9f116e29c22 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -170,6 +170,8 @@ static inline int lwtunnel_input(struct sk_buff *skb) return -EOPNOTSUPP; } -#endif +#endif /* CONFIG_LWTUNNEL */ + +#define MODULE_ALIAS_RTNL_LWT(encap_type) MODULE_ALIAS("rtnl-lwt-" __stringify(encap_type)) #endif /* __NET_LWTUNNEL_H */ diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c index 299cfc24d888..669ecc9f884e 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -27,6 +27,31 @@ #include #include +#ifdef CONFIG_MODULES + +static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type) +{ + /* Only lwt encaps implemented without using an interface for +* the encap need to return a string here. +*/ + switch (encap_type) { + case LWTUNNEL_ENCAP_MPLS: + return "MPLS"; + case LWTUNNEL_ENCAP_ILA: + return "ILA"; + case LWTUNNEL_ENCAP_IP6: + case LWTUNNEL_ENCAP_IP: + case LWTUNNEL_ENCAP_NONE: + case __LWTUNNEL_ENCAP_MAX: + /* should not have got here */ + WARN_ON(1); + break; + } + return NULL; +} + +#endif /* CONFIG_MODULES */ + struct lwtunnel_state *lwtunnel_state_alloc(int encap_len) { struct lwtunnel_state *lws; @@ -85,6 +110,18 @@ int lwtunnel_build_state(struct net_device *dev, u16 encap_type, ret = -EOPNOTSUPP; rcu_read_lock(); ops = rcu_dereference(lwtun_encaps[encap_type]); +#ifdef CONFIG_MODULES + if (!ops) { + const char *encap_type_str = lwtunnel_encap_str(encap_type); + + if (encap_type_str) { + rcu_read_unlock(); + request_module("rtnl-lwt-%s", encap_type_str); + rcu_read_lock(); + ops = rcu_dereference(lwtun_encaps[encap_type]); + } + } +#endif if (likely(ops && ops->build_state)) ret = ops->build_state(dev, encap, family, cfg, lws); rcu_read_unlock(); -- 2.1.4
[PATCH net-next v2 3/3] ila: autoload module
Avoid users having to manually load the module by adding a module alias allowing it to be autoloaded by the lwt infra. Signed-off-by: Robert Shearman --- net/ipv6/ila/ila_common.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv6/ila/ila_common.c b/net/ipv6/ila/ila_common.c index 32dc9aab7297..30613050e4ca 100644 --- a/net/ipv6/ila/ila_common.c +++ b/net/ipv6/ila/ila_common.c @@ -99,5 +99,6 @@ static void __exit ila_fini(void) module_init(ila_init); module_exit(ila_fini); +MODULE_ALIAS_RTNL_LWT(ILA); MODULE_AUTHOR("Tom Herbert "); MODULE_LICENSE("GPL"); -- 2.1.4
[PATCH net-next v2 2/3] mpls: autoload lwt module
Avoid users having to manually load the module by adding a module alias allowing it to be autoloaded by the lwt infra. Signed-off-by: Robert Shearman --- net/mpls/mpls_iptunnel.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c index fb31aa87de81..644a8da6d4bd 100644 --- a/net/mpls/mpls_iptunnel.c +++ b/net/mpls/mpls_iptunnel.c @@ -227,5 +227,6 @@ static void __exit mpls_iptunnel_exit(void) } module_exit(mpls_iptunnel_exit); +MODULE_ALIAS_RTNL_LWT(MPLS); MODULE_DESCRIPTION("MultiProtocol Label Switching IP Tunnels"); MODULE_LICENSE("GPL v2"); -- 2.1.4
[PATCH net-next v2 0/3] lwtunnel: autoload of lwt modules
Changes since v1: - remove "LWTUNNEL_ENCAP_" prefix for the string form of the encaps used when requesting the module to reduce duplication, and don't bother returning strings for lwt modules using netdevices, both suggested by Jiri. - update commit message of first patch to clarify security implications, in response to Eric's comments. The lwt implementations using net devices can autoload using the existing mechanism using IFLA_INFO_KIND. However, there's no mechanism that lwt modules not using net devices can use. Therefore, these patches add the ability to autoload modules registering lwt operations for lwt implementations not using a net device so that users don't have to manually load the modules. Robert Shearman (3): lwtunnel: autoload of lwt modules mpls: autoload lwt module ila: autoload module include/net/lwtunnel.h| 4 +++- net/core/lwtunnel.c | 37 + net/ipv6/ila/ila_common.c | 1 + net/mpls/mpls_iptunnel.c | 1 + 4 files changed, 42 insertions(+), 1 deletion(-) -- 2.1.4
Re: [PATCH net-next 1/3] lwtunnel: autoload of lwt modules
On 15/02/16 21:33, Eric W. Biederman wrote: Robert Shearman writes: @@ -85,6 +109,14 @@ int lwtunnel_build_state(struct net_device *dev, u16 encap_type, ret = -EOPNOTSUPP; rcu_read_lock(); ops = rcu_dereference(lwtun_encaps[encap_type]); +#ifdef CONFIG_MODULES + if (!ops) { + rcu_read_unlock(); + request_module("rtnl-lwt-%s", lwtunnel_encap_str(encap_type)); + rcu_read_lock(); + ops = rcu_dereference(lwtun_encaps[encap_type]); + } +#endif if (likely(ops && ops->build_state)) ret = ops->build_state(dev, encap, family, cfg, lws); rcu_read_unlock(); My memory is fuzzy on how this is done elsewhere but this looks like it needs a capability check to ensure that non-root user's can't trigger this. It tends to be problematic if a non-root user can trigger an autoload of a known-buggy module. With a combination of user namespaces and network namespaces unprivileged users can cause just about every corner of the network stack to be exercised. The same protections apply to this as to the IFLA_INFO_KIND module autoloading, namely by rtnetlink_rcv_msg ensuring that no messages other than gets can be done by an unprivileged user: type = nlh->nlmsg_type; ... type -= RTM_BASE; ... kind = type&3; if (kind != 2 && !netlink_net_capable(skb, CAP_NET_ADMIN)) return -EPERM; The lwtunnel_build_state function is only called by the processing of non-get message types. Is this sufficient or are you looking for something in addition? Thanks, Rob
Re: [PATCH net-next 1/3] lwtunnel: autoload of lwt modules
On 15/02/16 16:32, Jiri Benc wrote: On Mon, 15 Feb 2016 16:22:08 +, Robert Shearman wrote: Yeah, it's the C preprocessor. MODULE_ALIAS_RTNL_LWT includes the string for the encap type in the module alias, and since the LWT encap types are defined as enums this is symbolic name. I can't see any way of getting the preprocessor to convert MODULE_ALIAS_RTNL_LWT(LWTUNNEL_ENCAP_MPLS) into "rtnl-lwt-MPLS", but I'm open to suggestions. MODULE_ALIAS_RTNL_LWT(MPLS)? But whatever, as I said, no strong preference. I was so hung up on the making the string match the name of the enum that I'd discounted that, but you're right that doing that would reduce duplication in the alias string. True, but I figured that it was cleaner for the lwtunnel infra to not assume whether how those modules are implemented. If you disagree, then I can change to doing as you suggest. It's not completely transparent to the infrastructure anyway, the tunnel type needs to be added to lwtunnel_encap_str for new tunnels. The way I suggested, it's only added for those tunnels having the module alias set. Just trying to get rid of the unnecessary strings in lwtunnel_encap_str. There's no need to bloat kernel with them if they're never used. Ok, will resubmit without the unnecessary strings in that function as well as with your suggestion above. Thanks for the review, Rob
Re: [PATCH net-next 1/3] lwtunnel: autoload of lwt modules
On 15/02/16 16:02, Jiri Benc wrote: On Mon, 15 Feb 2016 15:42:01 +, Robert Shearman wrote: +static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type) +{ + switch (encap_type) { + case LWTUNNEL_ENCAP_MPLS: + return "LWTUNNEL_ENCAP_MPLS"; + case LWTUNNEL_ENCAP_IP: + return "LWTUNNEL_ENCAP_IP"; + case LWTUNNEL_ENCAP_ILA: + return "LWTUNNEL_ENCAP_ILA"; + case LWTUNNEL_ENCAP_IP6: + return "LWTUNNEL_ENCAP_IP6"; + case LWTUNNEL_ENCAP_NONE: + case __LWTUNNEL_ENCAP_MAX: + /* should not have got here */ + break; + } + WARN_ON(1); + return "LWTUNNEL_ENCAP_NONE"; +} + +#endif /* CONFIG_MODULES */ + struct lwtunnel_state *lwtunnel_state_alloc(int encap_len) { struct lwtunnel_state *lws; @@ -85,6 +109,14 @@ int lwtunnel_build_state(struct net_device *dev, u16 encap_type, ret = -EOPNOTSUPP; rcu_read_lock(); ops = rcu_dereference(lwtun_encaps[encap_type]); +#ifdef CONFIG_MODULES + if (!ops) { + rcu_read_unlock(); + request_module("rtnl-lwt-%s", lwtunnel_encap_str(encap_type)); Why the repeating of "lwt"/"LWTUNNEL" and the unnecessary "ENCAP"? Wouldn't be lwtunnel_encap_str returning just "MPLS" or "ILA" enough? I don't have any strong preference here, it just looks weird to me. Maybe there's a reason. Yeah, it's the C preprocessor. MODULE_ALIAS_RTNL_LWT includes the string for the encap type in the module alias, and since the LWT encap types are defined as enums this is symbolic name. I can't see any way of getting the preprocessor to convert MODULE_ALIAS_RTNL_LWT(LWTUNNEL_ENCAP_MPLS) into "rtnl-lwt-MPLS", but I'm open to suggestions. I could just drop the "lwt-" bit of the alias string given that it's included in the name of the enum values. Also, this doesn't affect IP lwtunnels, i.e. LWTUNNEL_ENCAP_IP and LWTUNNEL_ENCAP_IP6. Should we just return NULL from lwtunnel_encap_str in such cases (plus unknown encap_type) and WARN on the NULL here? True, but I figured that it was cleaner for the lwtunnel infra to not assume whether how those modules are implemented. If you disagree, then I can change to doing as you suggest. Thanks, Rob
[PATCH net-next 2/3] mpls: autoload lwt module
Avoid users having to manually load the module by adding a module alias allowing it to be autoloaded by the lwt infra. Signed-off-by: Robert Shearman --- net/mpls/mpls_iptunnel.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c index fb31aa87de81..1b4536960f79 100644 --- a/net/mpls/mpls_iptunnel.c +++ b/net/mpls/mpls_iptunnel.c @@ -227,5 +227,6 @@ static void __exit mpls_iptunnel_exit(void) } module_exit(mpls_iptunnel_exit); +MODULE_ALIAS_RTNL_LWT(LWTUNNEL_ENCAP_MPLS); MODULE_DESCRIPTION("MultiProtocol Label Switching IP Tunnels"); MODULE_LICENSE("GPL v2"); -- 2.1.4
[PATCH net-next 1/3] lwtunnel: autoload of lwt modules
The lwt implementations using net devices can autoload using the existing mechanism using IFLA_INFO_KIND. However, there's no mechanism that lwt modules not using net devices can use. Therefore, add the ability to autoload modules registering lwt operations for lwt implementations not using a net device so that users don't have to manually load the modules. Signed-off-by: Robert Shearman --- include/net/lwtunnel.h | 4 +++- net/core/lwtunnel.c| 32 2 files changed, 35 insertions(+), 1 deletion(-) diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index 66350ce3e955..e9f116e29c22 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -170,6 +170,8 @@ static inline int lwtunnel_input(struct sk_buff *skb) return -EOPNOTSUPP; } -#endif +#endif /* CONFIG_LWTUNNEL */ + +#define MODULE_ALIAS_RTNL_LWT(encap_type) MODULE_ALIAS("rtnl-lwt-" __stringify(encap_type)) #endif /* __NET_LWTUNNEL_H */ diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c index 299cfc24d888..8ef5e5cec03e 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -27,6 +27,30 @@ #include #include +#ifdef CONFIG_MODULES + +static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type) +{ + switch (encap_type) { + case LWTUNNEL_ENCAP_MPLS: + return "LWTUNNEL_ENCAP_MPLS"; + case LWTUNNEL_ENCAP_IP: + return "LWTUNNEL_ENCAP_IP"; + case LWTUNNEL_ENCAP_ILA: + return "LWTUNNEL_ENCAP_ILA"; + case LWTUNNEL_ENCAP_IP6: + return "LWTUNNEL_ENCAP_IP6"; + case LWTUNNEL_ENCAP_NONE: + case __LWTUNNEL_ENCAP_MAX: + /* should not have got here */ + break; + } + WARN_ON(1); + return "LWTUNNEL_ENCAP_NONE"; +} + +#endif /* CONFIG_MODULES */ + struct lwtunnel_state *lwtunnel_state_alloc(int encap_len) { struct lwtunnel_state *lws; @@ -85,6 +109,14 @@ int lwtunnel_build_state(struct net_device *dev, u16 encap_type, ret = -EOPNOTSUPP; rcu_read_lock(); ops = rcu_dereference(lwtun_encaps[encap_type]); +#ifdef CONFIG_MODULES + if (!ops) { + rcu_read_unlock(); + request_module("rtnl-lwt-%s", lwtunnel_encap_str(encap_type)); + rcu_read_lock(); + ops = rcu_dereference(lwtun_encaps[encap_type]); + } +#endif if (likely(ops && ops->build_state)) ret = ops->build_state(dev, encap, family, cfg, lws); rcu_read_unlock(); -- 2.1.4
[PATCH net-next 3/3] ila: autoload module
Avoid users having to manually load the module by adding a module alias allowing it to be autoloaded by the lwt infra. Signed-off-by: Robert Shearman --- net/ipv6/ila/ila_common.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv6/ila/ila_common.c b/net/ipv6/ila/ila_common.c index 32dc9aab7297..8ecf2482560e 100644 --- a/net/ipv6/ila/ila_common.c +++ b/net/ipv6/ila/ila_common.c @@ -99,5 +99,6 @@ static void __exit ila_fini(void) module_init(ila_init); module_exit(ila_fini); +MODULE_ALIAS_RTNL_LWT(LWTUNNEL_ENCAP_ILA); MODULE_AUTHOR("Tom Herbert "); MODULE_LICENSE("GPL"); -- 2.1.4
[PATCH net-next 0/3] lwtunnel: autoload of lwt modules
The lwt implementations using net devices can autoload using the existing mechanism using IFLA_INFO_KIND. However, there's no mechanism that lwt modules not using net devices can use. Therefore, add the ability to autoload modules registering lwt operations for lwt implementations not using a net device so that users don't have to manually load the modules. Robert Shearman (3): lwtunnel: autoload of lwt modules mpls: autoload lwt module ila: autoload module include/net/lwtunnel.h| 4 +++- net/core/lwtunnel.c | 32 net/ipv6/ila/ila_common.c | 1 + net/mpls/mpls_iptunnel.c | 1 + 4 files changed, 37 insertions(+), 1 deletion(-) -- 2.1.4
Re: [PATCH net-next 1/2] nsh: encapsulation module
On 11/02/16 11:35, Brian Russell wrote: ... diff --git a/include/net/nsh.h b/include/net/nsh.h new file mode 100644 index 000..7a5fb95 --- /dev/null +++ b/include/net/nsh.h @@ -0,0 +1,158 @@ +/* + * Network Service Header (NSH) inserted onto encapsulated packets + * or frames to realize service function paths. + * NSH also provides a mechanism for metadata exchange along the + * instantiated service path. + * + * https://tools.ietf.org/html/draft-ietf-sfc-nsh-01 + * + * Copyright (c) 2015 by Brocade Communications Systems, Inc. + * All rights reserved. + */ All rights reserved isn't really compatible with the GPL :-) ... diff --git a/net/ipv4/nsh.c b/net/ipv4/nsh.c new file mode 100644 index 000..70e5ef0 --- /dev/null +++ b/net/ipv4/nsh.c @@ -0,0 +1,362 @@ +/* + * Network Service Header (NSH) inserted onto encapsulated packets + * or frames to realize service function paths. + * NSH also provides a mechanism for metadata exchange along the + * instantiated service path. + * + * https://tools.ietf.org/html/draft-ietf-sfc-nsh-01 + * + * Copyright (c) 2015 by Brocade Communications Systems, Inc. + * All rights reserved. + */ Ditto. Thanks, Rob
Re: [PATCH net-next 2/2] mpls: allow TTL propagation to/from IP packets to be configured
On 06/02/16 18:36, Eric W. Biederman wrote: Robert Shearman writes: It is sometimes desirable to present an MPLS transport network as a single hop to traffic transiting it because it prevents confusion when diagnosing failures. An example of where confusion can be generated is when addresses used in the provider network overlap with addresses in the overlay network and the addresses get exposed through ICMP errors generated as packets transit the provider network. The configuration you are talking about is a bug. ICMP errors can be handled without confusion simplify by forwarding the packets out to the end of the tunnel. Which is how the standards require that situation to be handled if an ICMP error is generated. You're absolutely right that the standards say how the ICMP errors should be handled in order for them to be forwarded correctly back to the sender, but I'm referring to what source addresses customers of service provider see in those ICMP errors generated when e.g. doing a traceroute. Furthermore, the mechanism that you mention adds for scope for mis-diagnosis since a traceroute won't show any information for hops PE1, P1 and P2 if PE2 is dropping the traffic for that LSP (because the mechanism you describe relies on PE2 or even a further CE to hairpin the ICMP error back to the originator of the error-causing traffic). If you need further evidence that this is something that network operators might want to do, then see RFC 3032, s2.4.3 where it states: It is recognized that there may be situations where a network administration prefers to decrement the IPv4 TTL by one as it traverses an MPLS domain, instead of decrementing the IPv4 TTL by the number of LSP hops within the domain. And one more reference is that this behaviour is codified in RFC 3443. For the purposes of clarity, Uniform Model in RFC 3443 corresponds to ip_ttl_propagate = 1 (default) and (Short) Pipe Model corresponds to ip_ttl_propagate = 0. Therefore, provide the ability to control whether the TTL value from an MPLS packet is propagated to an IPv4/IPv6 packet when the last label is popped through the addition of a new per-namespace sysctl: "net.mpls.ip_ttl_propagate" which defaults to enabled. Use the same sysctl to control whether the TTL is propagated from IP packets into the MPLS header. If the TTL isn't propagated then a default TTL value is used which can be configured via a new sysctl: "net.mpls.default_ttl". Ugh. There is a case for this, but this feels much more like a per tunnel/label/route egress property not a per network interface property. I don't recall all of the gory details but some flavors of mpls labels always require ttl propogation (the ip over mpls default) and some flavors of mpls labels always require no propagation (pseudo wires). Clearly, if the label isn't used for the purposes of encapsulating L3 traffic, then you can't propagate the L3 TTL into it and you have to put some other value in there instead. I envisaged that the value of default_ttl would be used in these cases and this is why I worded the documentation for the default_ttl sysctl like so: Default TTL value to use for MPLS packets where it cannot be propagated from an IP header, either because one isn't present or ip_ttl_propagate has been disabled. Given that traffic arriving with a pseudo-wire label will have to be forwarded differently from traffic arriving for labels with L3 traffic, you will know that the label is associated with L2 traffic and that the TTL cannot be propagated. There may be something cute in between. For something that is a per tunnel property I don't feel comfortable with a sysctl. I cannot think of a use-case where it would make sense to have a mix of TTL being propagated and not being propagated on a per-LSP basis. I note that all of the most widely used proprietary MPLS implementations support global IP TTL propagation configuration and I'm not aware of any MPLS implementation that implements a per-LSP control for IP TTL propagation. Especially when it is something as potentially dangerous as enabling packets to loop in a network. As I recall most IP over IP tunnels also propogate the ttl between the inner and outer ip packets to prevent loops. There is no possibility of packets looping in a network as the TTL is always decremented when a label is pushed, whether the packet came in as IP or MPLS, and when swapping a label egress TTL must be one less than the ingress TTL, as defined by the MPLS RFC. When popping the last label we have to ensure that the MPLS TTL is not propagated to IP TTL so that there's no possibility of set the IP TTL beyond the value it entered the LSP (after the TTL decrement done as part of IP switching) with, but that is what this code does. Note that this is only the case if all routers are configured to
Re: [PATCH net-next 1/2] mpls: packet stats
On 06/02/16 10:58, Francois Romieu wrote: Robert Shearman : [...] diff --git a/net/mpls/Makefile b/net/mpls/Makefile index 9ca923625016..6fdd61b9eae3 100644 --- a/net/mpls/Makefile +++ b/net/mpls/Makefile [...] @@ -98,6 +94,29 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) } EXPORT_SYMBOL_GPL(mpls_pkt_too_big); +void mpls_stats_inc_outucastpkts(struct net_device *dev, +const struct sk_buff *skb) +{ + struct mpls_dev *mdev; + struct inet6_dev *in6dev; Nit: the scope can be reduced for both variables. I'm happy to change this if this is the recommended style, but David Laight's reply suggests some doubt. + + if (skb->protocol == htons(ETH_P_MPLS_UC)) { + mdev = mpls_dev_get(dev); + if (mdev) + MPLS_INC_STATS_LEN(mdev, skb->len, + MPLS_IFSTATS_MIB_OUTUCASTPKTS, + MPLS_IFSTATS_MIB_OUTOCTETS); + } else if (skb->protocol == htons(ETH_P_IP)) { + IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len); + } else if (skb->protocol == htons(ETH_P_IPV6)) { + in6dev = __in6_dev_get(dev); + if (in6dev) + IP6_UPD_PO_STATS(dev_net(dev), in6dev, +IPSTATS_MIB_OUT, skb->len); + } +} [...] diff --git a/net/mpls/internal.h b/net/mpls/internal.h index 732a5c17e986..b39770ff2307 100644 --- a/net/mpls/internal.h +++ b/net/mpls/internal.h [...] +#define MPLS_INC_STATS(mdev, field)\ + do {\ + __typeof__(*(mdev)->stats) *ptr =\ + raw_cpu_ptr((mdev)->stats); \ + local_bh_disable(); \ + u64_stats_update_begin(&ptr->syncp); \ + ptr->mib[field]++; \ + u64_stats_update_end(&ptr->syncp); \ + local_bh_enable(); \ + } while (0) I don't get the point of the local_bh_{disable / enable}. Which kind of locally enabled bh code section do you anticipate these helpers to run under ? When a user process sends an IPv4/IPv6 packet destined to a route with mpls lwt encap. Thanks, Rob
[PATCH net-next 0/2] mpls: packet stats and ttl propagation config
This patch series implements new bits of mpls functionality: keeping statistics on packets as they pass through the box and allowing ttl propagation to be configured. Robert Shearman (2): mpls: packet stats mpls: allow TTL propagation to/from IP packets to be configured Documentation/networking/mpls-sysctl.txt | 19 +++ include/net/netns/mpls.h | 4 + net/mpls/Makefile| 1 + net/mpls/af_mpls.c | 205 --- net/mpls/internal.h | 93 +- net/mpls/mpls_iptunnel.c | 21 +++- net/mpls/proc.c | 128 +++ 7 files changed, 417 insertions(+), 54 deletions(-) create mode 100644 net/mpls/proc.c -- 2.1.4
[PATCH net-next 1/2] mpls: packet stats
Having MPLS packet stats is useful for observing network operation and for diagnosing network problems. In the absence of anything better, use RFCs for MIBs defining MPLS stats for guidance on the semantics of the stats to expose. RFC3813 details two per-interface packet stats that should be provided (label lookup failures and fragmented packets) and also provides interpretation of RFC2863 for other per-interface stats (in/out ucast, mcast and bcast, in/out discards and errors and in unknown protos). Multicast, fragment and broadcast packet counters are printed, but not stored to allow for future implementation of current standards or future standards without user-space having to change. All the introduced fields are 64-bit, even error ones, to ensure no overflow with long uptimes. Per-CPU counters are used to avoid cache-line contention on the commonly used fields. The other fields have also been made per-CPU for code to avoid performance problems in error conditions on the assumption that on some platforms the cost of atomic operations could be more pexpensive than sending the packet (which is what would be done in the success case). If that's not the case, we could instead not use per-CPU counters for these fields. The IPv6 proc code was used as an inspiration for the proc code here, both in terms of the implementation as well as the location of the per-device stats proc files: /proc/net/dev_snmp_mpls/. Signed-off-by: Robert Shearman --- include/net/netns/mpls.h | 1 + net/mpls/Makefile| 1 + net/mpls/af_mpls.c | 135 --- net/mpls/internal.h | 93 ++-- net/mpls/mpls_iptunnel.c | 11 +++- net/mpls/proc.c | 128 6 files changed, 334 insertions(+), 35 deletions(-) create mode 100644 net/mpls/proc.c diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h index d29203651c01..3062b0aa3a08 100644 --- a/include/net/netns/mpls.h +++ b/include/net/netns/mpls.h @@ -12,6 +12,7 @@ struct netns_mpls { size_t platform_labels; struct mpls_route __rcu * __rcu *platform_label; struct ctl_table_header *ctl; + struct proc_dir_entry *proc_net_devsnmp; }; #endif /* __NETNS_MPLS_H__ */ diff --git a/net/mpls/Makefile b/net/mpls/Makefile index 9ca923625016..6fdd61b9eae3 100644 --- a/net/mpls/Makefile +++ b/net/mpls/Makefile @@ -6,3 +6,4 @@ obj-$(CONFIG_MPLS_ROUTING) += mpls_router.o obj-$(CONFIG_MPLS_IPTUNNEL) += mpls_iptunnel.o mpls_router-y := af_mpls.o +mpls_router-$(CONFIG_PROC_FS) += proc.o diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index b18c5ed42d95..6b3c96e2b21f 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -17,8 +18,8 @@ #include #if IS_ENABLED(CONFIG_IPV6) #include -#include #endif +#include #include #include "internal.h" @@ -48,11 +49,6 @@ static struct mpls_route *mpls_route_input_rcu(struct net *net, unsigned index) return rt; } -static inline struct mpls_dev *mpls_dev_get(const struct net_device *dev) -{ - return rcu_dereference_rtnl(dev->mpls_ptr); -} - bool mpls_output_possible(const struct net_device *dev) { return dev && (dev->flags & IFF_UP) && netif_carrier_ok(dev); @@ -98,6 +94,29 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned int mtu) } EXPORT_SYMBOL_GPL(mpls_pkt_too_big); +void mpls_stats_inc_outucastpkts(struct net_device *dev, +const struct sk_buff *skb) +{ + struct mpls_dev *mdev; + struct inet6_dev *in6dev; + + if (skb->protocol == htons(ETH_P_MPLS_UC)) { + mdev = mpls_dev_get(dev); + if (mdev) + MPLS_INC_STATS_LEN(mdev, skb->len, + MPLS_IFSTATS_MIB_OUTUCASTPKTS, + MPLS_IFSTATS_MIB_OUTOCTETS); + } else if (skb->protocol == htons(ETH_P_IP)) { + IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len); + } else if (skb->protocol == htons(ETH_P_IPV6)) { + in6dev = __in6_dev_get(dev); + if (in6dev) + IP6_UPD_PO_STATS(dev_net(dev), in6dev, +IPSTATS_MIB_OUT, skb->len); + } +} +EXPORT_SYMBOL_GPL(mpls_stats_inc_outucastpkts); + static u32 mpls_multipath_hash(struct mpls_route *rt, struct sk_buff *skb, bool bos) { @@ -253,6 +272,7 @@ static int mpls_forward(struct sk_buff *skb, struct net_device *dev, struct mpls_nh *nh; struct mpls_entry_decoded dec; struct net_device *out_dev; + struct mpls_dev *out_mdev; struct mpls_dev *mdev; unsigned int hh_len; unsigned int new_header_size; @@ -26
[PATCH net-next 2/2] mpls: allow TTL propagation to/from IP packets to be configured
It is sometimes desirable to present an MPLS transport network as a single hop to traffic transiting it because it prevents confusion when diagnosing failures. An example of where confusion can be generated is when addresses used in the provider network overlap with addresses in the overlay network and the addresses get exposed through ICMP errors generated as packets transit the provider network. Therefore, provide the ability to control whether the TTL value from an MPLS packet is propagated to an IPv4/IPv6 packet when the last label is popped through the addition of a new per-namespace sysctl: "net.mpls.ip_ttl_propagate" which defaults to enabled. Use the same sysctl to control whether the TTL is propagated from IP packets into the MPLS header. If the TTL isn't propagated then a default TTL value is used which can be configured via a new sysctl: "net.mpls.default_ttl". Signed-off-by: Robert Shearman --- Documentation/networking/mpls-sysctl.txt | 19 + include/net/netns/mpls.h | 3 ++ net/mpls/af_mpls.c | 70 net/mpls/mpls_iptunnel.c | 10 - 4 files changed, 83 insertions(+), 19 deletions(-) diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt index 9ed15f86c17c..9e8cfa6d48d1 100644 --- a/Documentation/networking/mpls-sysctl.txt +++ b/Documentation/networking/mpls-sysctl.txt @@ -19,6 +19,25 @@ platform_labels - INTEGER Possible values: 0 - 1048575 Default: 0 +ip_ttl_propagate - BOOL + Control whether TTL is propagated from the IPv4/IPv6 header to + the MPLS header on imposing labels and propagated from the + MPLS header to the IPv4/IPv6 header on popping the last label. + + If disabled, the MPLS transport network will appear as a + single hop to transit traffic. + + 0 - disabled + 1 - enabled (default) + +default_ttl - BOOL + Default TTL value to use for MPLS packets where it cannot be + propagated from an IP header, either because one isn't present + or ip_ttl_propagate has been disabled. + + Possible values: 1 - 255 + Default: 255 + conf//input - BOOL Control whether packets can be input on this interface. diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h index 3062b0aa3a08..9bdc2bd8fcb8 100644 --- a/include/net/netns/mpls.h +++ b/include/net/netns/mpls.h @@ -10,7 +10,10 @@ struct ctl_table_header; struct netns_mpls { size_t platform_labels; + int ip_ttl_propagate; + int default_ttl; struct mpls_route __rcu * __rcu *platform_label; + struct ctl_table_header *ctl; struct proc_dir_entry *proc_net_devsnmp; }; diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 6b3c96e2b21f..a2a4f0a884a3 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -31,7 +31,9 @@ #define MPLS_NEIGH_TABLE_UNSPEC (NEIGH_LINK_TABLE + 1) static int zero = 0; +static int one = 1; static int label_limit = (1 << 20) - 1; +static int ttl_max = 255; static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt, struct nlmsghdr *nlh, struct net *net, u32 portid, @@ -215,8 +217,8 @@ out: return &rt->rt_nh[nh_index]; } -static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, - struct mpls_entry_decoded dec) +static bool mpls_egress(struct net *net, struct mpls_route *rt, + struct sk_buff *skb, struct mpls_entry_decoded dec) { enum mpls_payload_type payload_type; bool success = false; @@ -239,24 +241,29 @@ static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, payload_type = ip_hdr(skb)->version; switch (payload_type) { - case MPT_IPV4: { - struct iphdr *hdr4 = ip_hdr(skb); - skb->protocol = htons(ETH_P_IP); - csum_replace2(&hdr4->check, - htons(hdr4->ttl << 8), - htons(dec.ttl << 8)); - hdr4->ttl = dec.ttl; + case MPT_IPV4: + if (net->mpls.ip_ttl_propagate) { + struct iphdr *hdr4 = ip_hdr(skb); + + skb->protocol = htons(ETH_P_IP); + csum_replace2(&hdr4->check, + htons(hdr4->ttl << 8), + htons(dec.ttl << 8)); + hdr4->ttl = dec.ttl; + } success = true; break; - } - case MPT_IPV6: { - struct ipv6hdr *hdr6 = ipv6_hdr(skb); - skb->protocol = htons(ETH_P_IPV6); - hdr6->hop_limit = dec.ttl; + case MPT_IPV6: + if (net->mpls.ip_ttl_propagat
[PATCH net 2/4] mpls: don't dump RTA_VIA attribute if not specified
The problem seen is that when adding a route with a nexthop with no via address specified, iproute2 generates bogus output: # ip -f mpls route add 100 dev lo # ip -f mpls route list 100 via inet 0.0.8.0 dev lo The reason for this is that the kernel generates an RTA_VIA attribute with the family set to AF_INET, but the via address data having zero length. The cause of family being AF_INET is that on route insert cfg->rc_via_table is left set to 0, which just happens to be NEIGH_ARP_TABLE which is then translated into AF_INET. iproute2 doesn't validate the length prior to printing and so prints garbage. Although it could be fixed to do the validation, I would argue that AF_INET addresses should always be exactly 4 bytes so the kernel is really giving userspace bogus data. Therefore, avoid generating the RTA_VIA attribute when dumping the route if the via address wasn't specified on add/modify. This is indicated by NEIGH_ARP_TABLE and a zero via address length - if the user specified a via address the address length would have been validated such that it was 4 bytes. Although this is a change in behaviour that is visible to userspace, I believe that what was generated before was invalid and as such userspace wouldn't be expecting it. Signed-off-by: Robert Shearman --- net/mpls/af_mpls.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 3be29cb1f658..ac1c116abaac 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -1235,7 +1235,9 @@ static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event, nla_put_labels(skb, RTA_NEWDST, nh->nh_labels, nh->nh_label)) goto nla_put_failure; - if (nla_put_via(skb, nh->nh_via_table, mpls_nh_via(rt, nh), + if ((nh->nh_via_table != NEIGH_ARP_TABLE || +nh->nh_via_alen != 0) && + nla_put_via(skb, nh->nh_via_table, mpls_nh_via(rt, nh), nh->nh_via_alen)) goto nla_put_failure; dev = rtnl_dereference(nh->nh_dev); @@ -1323,7 +1325,9 @@ static inline size_t lfib_nlmsg_size(struct mpls_route *rt) if (nh->nh_dev) payload += nla_total_size(4); /* RTA_OIF */ - payload += nla_total_size(2 + nh->nh_via_alen); /* RTA_VIA */ + if (nh->nh_via_table != NEIGH_ARP_TABLE || + nh->nh_via_alen != 0) /* RTA_VIA */ + payload += nla_total_size(2 + nh->nh_via_alen); if (nh->nh_labels) /* RTA_NEWDST */ payload += nla_total_size(nh->nh_labels * 4); } else { -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 1/4] mpls: validate L2 via address length
If an L2 via address for an mpls nexthop is specified, the length of the L2 address must match that expected by the output device, otherwise it could access memory beyond the end of the via address buffer in the route. This check was present prior to commit f8efb73c97e2 ("mpls: multipath route support"), but got lost in the refactoring, so add it back, applying it to all nexthops in multipath routes. Fixes: f8efb73c97e2 ("mpls: multipath route support") Signed-off-by: Robert Shearman --- net/mpls/af_mpls.c | 4 1 file changed, 4 insertions(+) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index c70d750148b6..3be29cb1f658 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -534,6 +534,10 @@ static int mpls_nh_assign_dev(struct net *net, struct mpls_route *rt, if (!mpls_dev_get(dev)) goto errout; + if ((nh->nh_via_table == NEIGH_LINK_TABLE) && + (dev->addr_len != nh->nh_via_alen)) + goto errout; + RCU_INIT_POINTER(nh->nh_dev, dev); return 0; -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 0/4] mpls: fixes for nexthops without via addresses
These four fixes all apply to the case of having an mpls route with an output device, but without a nexthop. Patches 2 and 3 could really have been combined in one patch, but I wanted to separate the fix for some recent breakage from the fix for a day-1 issue. Robert Shearman (4): mpls: validate L2 via address length mpls: don't dump RTA_VIA attribute if not specified mpls: fix out-of-bounds access when via address not specified mpls: make via address optional for multipath routes net/mpls/af_mpls.c | 43 +++ 1 file changed, 31 insertions(+), 12 deletions(-) -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net 4/4] mpls: make via address optional for multipath routes
The via address is optional for a single path route, yet is mandatory when the multipath attribute is used: # ip -f mpls route add 100 dev lo # ip -f mpls route add 101 nexthop dev lo RTNETLINK answers: Invalid argument Make them consistent by making the via address optional when the RTA_MULTIPATH attribute is being parsed so that both forms of specifying the route work. Signed-off-by: Robert Shearman --- net/mpls/af_mpls.c | 22 +- 1 file changed, 13 insertions(+), 9 deletions(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 7bfc85f52ca8..c32fc411a911 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -604,10 +604,14 @@ static int mpls_nh_build(struct net *net, struct mpls_route *rt, goto errout; } - err = nla_get_via(via, &nh->nh_via_alen, &nh->nh_via_table, - __mpls_nh_via(rt, nh)); - if (err) - goto errout; + if (via) { + err = nla_get_via(via, &nh->nh_via_alen, &nh->nh_via_table, + __mpls_nh_via(rt, nh)); + if (err) + goto errout; + } else { + nh->nh_via_table = MPLS_NEIGH_TABLE_UNSPEC; + } err = mpls_nh_assign_dev(net, rt, nh, oif); if (err) @@ -689,9 +693,6 @@ static int mpls_nh_build_multi(struct mpls_route_config *cfg, nla_newdst = nla_find(attrs, attrlen, RTA_NEWDST); } - if (!nla_via) - goto errout; - err = mpls_nh_build(cfg->rc_nlinfo.nl_net, rt, nh, rtnh->rtnh_ifindex, nla_via, nla_newdst); @@ -1271,7 +1272,8 @@ static int mpls_dump_route(struct sk_buff *skb, u32 portid, u32 seq, int event, nh->nh_labels, nh->nh_label)) goto nla_put_failure; - if (nla_put_via(skb, nh->nh_via_table, + if (nh->nh_via_table != MPLS_NEIGH_TABLE_UNSPEC && + nla_put_via(skb, nh->nh_via_table, mpls_nh_via(rt, nh), nh->nh_via_alen)) goto nla_put_failure; @@ -1343,7 +1345,9 @@ static inline size_t lfib_nlmsg_size(struct mpls_route *rt) for_nexthops(rt) { nhsize += nla_total_size(sizeof(struct rtnexthop)); - nhsize += nla_total_size(2 + nh->nh_via_alen); + /* RTA_VIA */ + if (nh->nh_via_table != MPLS_NEIGH_TABLE_UNSPEC) + nhsize += nla_total_size(2 + nh->nh_via_alen); if (nh->nh_labels) nhsize += nla_total_size(nh->nh_labels * 4); } endfor_nexthops(rt); -- 2.1.4 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html