[PATCH net v2] ipv4: Avoid caching l3mdev dst on mismatched local route

2017-04-21 Thread Robert Shearman
David reported that doing the following:

ip li add red type vrf table 10
ip link set dev eth1 vrf red
ip addr add 127.0.0.1/8 dev red
ip link set dev eth1 up
ip li set red up
ping -c1 -w1 -I red 127.0.0.1
ip li del red

when either policy routing IP rules are present or the local table
lookup ip rule is before the l3mdev lookup results in a hang with
these messages:

unregister_netdevice: waiting for red to become free. Usage count = 1

The problem is caused by caching the dst used for sending the packet
out of the specified interface on a local route with a different
nexthop interface. Thus the dst could stay around until the route in
the table the lookup was done is deleted which may be never.

Address the problem by not forcing output device to be the l3mdev in
the flow's output interface if the lookup didn't use the l3mdev. This
then results in the dst using the right device according to the route.

Changes in v2:
 - make the dev_out passed in by __ip_route_output_key_hash correct
   instead of checking the nh dev if FLOWI_FLAG_SKIP_NH_OIF is set as
   suggested by David.

Fixes: 5f02ce24c2696 ("net: l3mdev: Allow the l3mdev to be a loopback")
Reported-by: David Ahern 
Suggested-by: David Ahern 
Signed-off-by: Robert Shearman 
---
 net/ipv4/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index acd69cfe2951..d9724889ff09 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2359,7 +2359,8 @@ struct rtable *__ip_route_output_key_hash(struct net 
*net, struct flowi4 *fl4,
}
 
/* L3 master device is the loopback for that domain */
-   dev_out = l3mdev_master_dev_rcu(dev_out) ? : net->loopback_dev;
+   dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? :
+   net->loopback_dev;
fl4->flowi4_oif = dev_out->ifindex;
flags |= RTCF_LOCAL;
goto make_route;
-- 
2.1.4



Re: [PATCH net-next 0/3] l3mdev: Improve use with main table

2017-04-21 Thread Robert Shearman

On 20/04/17 23:36, David Ahern wrote:

On 4/10/17 8:21 AM, Robert Shearman wrote:

Attempting to create a TCP socket not bound to a VRF device when a TCP
socket bound to a VRF device with the same port exists (and vice
versa) fails with EADDRINUSE. This limits the ability to use programs
in selected mixed VRF/non-VRF contexts.

This patch series solves the problem by extending the l3mdev be aware
of the special semantics of the main table and fixing issues arising
from the split local/main tables. A VRF master device created linking
to the main table and used for these programs in the same way as those
created for VRF tables can.

Robert Shearman (3):
  ipv6: Fix route handling when using l3mdev set to main table
  ipv4: Fix route handling when using l3mdev set to main table
  l3mdev: Fix lookup in local table when using main table

 net/ipv4/af_inet.c  |  4 +++-
 net/ipv4/fib_frontend.c | 14 +-
 net/ipv4/raw.c  |  5 -
 net/ipv6/addrconf.c | 12 +---
 net/ipv6/route.c| 23 ++-
 net/l3mdev/l3mdev.c | 26 --
 6 files changed, 63 insertions(+), 21 deletions(-)



With the change I mentioned earlier to fix the refcnt issue on top of
this patch set I see a number of failures:
- local IPv4 with 127.0.0.1 address - ping, tcp, udp tests fail
- all of the IPv4 multicast tests fail
- IPv6 linklocal and mcast addresses generally fail
- IPv6 global address on vrf device


Can you send me some more details of your testing?

These work for me:

$ ping -c1 -I vrf-default 127.0.0.1
PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 vrf-default: 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.141 ms

--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.141/0.141/0.141/0.000 ms
$ ping6 -c1 -I vrf-default 2001::1
PING 2001::1(2001::1) from 2001::1 vrf-default: 56 data bytes
64 bytes from 2001::1: icmp_seq=1 ttl=64 time=0.069 ms

--- 2001::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.069/0.069/0.069/0.000 ms
$ sudo ip vrf exec vrf-default nc -l -p 4013
TEST
$ sudo ip vrf exec vrf-default nc -l -u -p 4013
TEST
^C

(with a neighbouring host using nc to send the TEST string for the udp 
and tcp cases)


Thanks,
Rob


Re: [PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check

2017-04-21 Thread Robert Shearman

On 20/04/17 23:18, David Ahern wrote:

On 4/20/17 6:58 AM, Robert Shearman wrote:

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index acd69cfe2951..f667783ffd19 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2125,6 +2125,14 @@ static struct rtable *__mkroute_output(const struct 
fib_result *res,
fi = NULL;
}

+   /* If the flag to skip the nh oif check is set then the output
+* device may not match the nh device, so cannot use or add to
+* cache in that case.
+*/
+   if (unlikely(fl4->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF &&
+FIB_RES_NH(*res).nh_dev != dev_out))
+   do_cache = false;
+
fnhe = NULL;
do_cache &= fi != NULL;
if (do_cache) {



I believe this is a better fix:

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 5e1e60546fce..fb74a16958af 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2407,7 +2407,7 @@ struct rtable *__ip_route_output_key_hash(struct
net *net, struct flowi4 *fl4,
}

/* L3 master device is the loopback for that domain */
-   dev_out = l3mdev_master_dev_rcu(dev_out) ? :
net->loopback_dev;
+   dev_out = l3mdev_master_dev_rcu(FIB_RES_DEV(res)) ? :
net->loopback_dev;
fl4->flowi4_oif = dev_out->ifindex;
flags |= RTCF_LOCAL;
goto make_route;

Fixes: 5f02ce24c2696 ("net: l3mdev: Allow the l3mdev to be a loopback")

With your change above, references to vrf devices are still taken
(dev_out is the vrf device based on the flow struct) even though the
route's nexthop is in another domain. And the commit log should
reference the use case which is policy routing overriding the VRF rule.


That is indeed a nicer fix - it survives all of my local testing. Thanks 
for correcting the fixes tag too.


I had included this text in the commit message to capture the condition 
of the rules ordering: "when the rule for the lookup in the local table
is ordered before the rule for lookups using l3mdevs". However, I'll try 
to make it more prominent and expand it to note the policy routing use 
case too.


Thanks,
Rob


Re: [PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check

2017-04-20 Thread Robert Shearman

On 20/04/17 16:16, David Ahern wrote:

On 4/20/17 9:05 AM, Robert Shearman wrote:

The key thing I think is the ip rules:

$ ip rule
0:from all lookup local
1000:from all lookup [l3mdev-table]
32766:from all lookup main
32767:from all lookup default

Maybe you have the local rule moved down at startup?


yes that would be a problem. With this test the 127.0.0.1 lookup is
matching on the wrong table:

perf probe fib_table_lookup%return ret=%ax
perf record -e fib:*,probe:* -a -- ping -c1 -w1 -I red 127.0.0.1
...
perf script


...

Table 255 is the wrong table. That is the ultimate problem here.



The table used for the lookup could be retrieved from the oif, but the 
check I introduced would still be required to cope with the unusual, but 
not disallowed, case of two VRF master devices referring to the same table.


Thanks,
Rob


Re: [PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check

2017-04-20 Thread Robert Shearman

On 20/04/17 15:59, David Ahern wrote:

On 4/20/17 8:39 AM, Robert Shearman wrote:

On 20/04/17 15:21, David Ahern wrote:

On 4/20/17 6:58 AM, Robert Shearman wrote:

David reported that doing the following:

ip li add red type vrf table 10
ip link set dev eth1 vrf red
ip addr add 127.0.0.1/8 dev red
ip link set dev eth1 up
ip li set red up
ping -c1 -w1 -I red 127.0.0.1
ip li del red

results in a hang with this message:

unregister_netdevice: waiting for red to become free. Usage count
= 1


I think you misunderstood my comment. The above works fine today. There
is no bug with refcnts.

It breaks with your patches wanting to use a VRF device with the main
table (254).


That doesn't seem to match with my experience. I can reproduce this on
the net tree with the listed commands and the behaviour matches what I
see in the code.


Our mileage varies:

root@kenny-jessie3:~# ip li add red type vrf table 10
root@kenny-jessie3:~# ip link set dev eth1 vrf red
root@kenny-jessie3:~# ip addr add 127.0.0.1/8 dev red
root@kenny-jessie3:~# ip link set dev eth1 up
root@kenny-jessie3:~# ip li set red up
root@kenny-jessie3:~# ping -c1 -w1 -I red 127.0.0.1
PING 127.0.0.1 (127.0.0.1) from 127.0.0.1 red: 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.059 ms

--- 127.0.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.059/0.059/0.059/0.000 ms
root@kenny-jessie3:~# ip li del red
root@kenny-jessie3:~# uname -a
Linux kenny-jessie3 4.11.0-rc6+ #8 SMP Wed Apr 19 11:53:48 PDT 2017
x86_64 GNU/Linux

The above is one of many tests I run and never hit a problem deleting a
VRF device. dst's attached to fnhe_rth_output and fnhe_rth_input appear
to be properly flushed and device references released when the device is
deleted (NETDEV_DOWN and then NETDEV_UNREGISTER).

Can you send me your kernel config and "sysctl -a --pattern 'net.ipv4'"?
Perhaps you have something enabled I don't.


The key thing I think is the ip rules:

$ ip rule
0:  from all lookup local
1000:   from all lookup [l3mdev-table]
32766:  from all lookup main
32767:  from all lookup default

Maybe you have the local rule moved down at startup?

I'll send you the requested information if not.

Thanks,
Rob


Re: [PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check

2017-04-20 Thread Robert Shearman

On 20/04/17 15:21, David Ahern wrote:

On 4/20/17 6:58 AM, Robert Shearman wrote:

David reported that doing the following:

ip li add red type vrf table 10
ip link set dev eth1 vrf red
ip addr add 127.0.0.1/8 dev red
ip link set dev eth1 up
ip li set red up
ping -c1 -w1 -I red 127.0.0.1
ip li del red

results in a hang with this message:

unregister_netdevice: waiting for red to become free. Usage count = 1


I think you misunderstood my comment. The above works fine today. There
is no bug with refcnts.

It breaks with your patches wanting to use a VRF device with the main
table (254).


That doesn't seem to match with my experience. I can reproduce this on 
the net tree with the listed commands and the behaviour matches what I 
see in the code.


Thanks,
Rob


Re: [PATCH net-next 0/3] l3mdev: Improve use with main table

2017-04-20 Thread Robert Shearman

On 13/04/17 15:36, David Ahern wrote:

On 4/13/17 6:48 AM, Robert Shearman wrote:

the patches look ok to me, but in testing them I see I refcnt problem.
simple reproducer:

ip li add red type vrf table 254
ip link set dev eth1 vrf red
ip addr add 127.0.0.1/8 dev red
ip link set dev eth1 up
ip li set red up
ping -c1 -w1 -I red 127.0.0.1
ip li del red
--> hangs with 'uregister_netdevice: waiting for red to become free.'


Right, I've reproduced the same and it occurs even without using the
main table and I believe it has been introduced within the last week.


The cached dst on sockets is one known place that causes a hang on a
delete. Basically the delete stalls until the sockets are closed. I have
a patch for sk_rx_dst which is the one I chased down last week.
sk_dst_cache is another possibility.

Neither of those should be at play with the above example because the
ping command runs and then exits.


Thanks for the pointers. My earlier assessment that this was something 
recent turned out to be wrong. I've sent a patch against net that fixes 
the issue.


Thanks,
Rob


[PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check

2017-04-20 Thread Robert Shearman
David reported that doing the following:

ip li add red type vrf table 10
ip link set dev eth1 vrf red
ip addr add 127.0.0.1/8 dev red
ip link set dev eth1 up
ip li set red up
ping -c1 -w1 -I red 127.0.0.1
ip li del red

results in a hang with this message:

unregister_netdevice: waiting for red to become free. Usage count = 1

The problem is caused by caching the dst used for sending the packet
out of the specified interface on the route that the lookup returned
from the local table when the rule for the lookup in the local table
is ordered before the rule for lookups using l3mdevs. Thus the dst
could stay around until the route in the local table is deleted which
may be never.

Address the problem by not allocating a cacheable output dst if
FLOWI_FLAG_SKIP_NH_OIF is set and the nh device differs from the
device used for the dst.

Fixes: ebfc102c566d ("net: vrf: Flip IPv4 output path from FIB lookup hook to 
out hook")
Reported-by: David Ahern 
Signed-off-by: Robert Shearman 
---
 net/ipv4/route.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index acd69cfe2951..f667783ffd19 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2125,6 +2125,14 @@ static struct rtable *__mkroute_output(const struct 
fib_result *res,
fi = NULL;
}
 
+   /* If the flag to skip the nh oif check is set then the output
+* device may not match the nh device, so cannot use or add to
+* cache in that case.
+*/
+   if (unlikely(fl4->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF &&
+FIB_RES_NH(*res).nh_dev != dev_out))
+   do_cache = false;
+
fnhe = NULL;
do_cache &= fi != NULL;
if (do_cache) {
-- 
2.1.4



Re: [PATCH net-next 0/3] l3mdev: Improve use with main table

2017-04-13 Thread Robert Shearman

On 12/04/17 17:51, David Ahern wrote:

On 4/10/17 8:21 AM, Robert Shearman wrote:

Attempting to create a TCP socket not bound to a VRF device when a TCP
socket bound to a VRF device with the same port exists (and vice
versa) fails with EADDRINUSE. This limits the ability to use programs
in selected mixed VRF/non-VRF contexts.

This patch series solves the problem by extending the l3mdev be aware
of the special semantics of the main table and fixing issues arising
from the split local/main tables. A VRF master device created linking
to the main table and used for these programs in the same way as those
created for VRF tables can.

Robert Shearman (3):
  ipv6: Fix route handling when using l3mdev set to main table
  ipv4: Fix route handling when using l3mdev set to main table
  l3mdev: Fix lookup in local table when using main table

 net/ipv4/af_inet.c  |  4 +++-
 net/ipv4/fib_frontend.c | 14 +-
 net/ipv4/raw.c  |  5 -
 net/ipv6/addrconf.c | 12 +---
 net/ipv6/route.c| 23 ++-
 net/l3mdev/l3mdev.c | 26 --
 6 files changed, 63 insertions(+), 21 deletions(-)



the patches look ok to me, but in testing them I see I refcnt problem.
simple reproducer:

ip li add red type vrf table 254
ip link set dev eth1 vrf red
ip addr add 127.0.0.1/8 dev red
ip link set dev eth1 up
ip li set red up
ping -c1 -w1 -I red 127.0.0.1
ip li del red
--> hangs with 'uregister_netdevice: waiting for red to become free.'


Right, I've reproduced the same and it occurs even without using the 
main table and I believe it has been introduced within the last week.


I'll bisect to find out the cause.

Thanks for reviewing and testing,
Rob


[PATCH v2 iproute2 net-next 0/2] ip: Allow TTL propagation to/from IP packets to be configured

2017-04-11 Thread Robert Shearman
This patch series adds support for per-MPLS-lightweight-tunnel ttl
values and per route ttl-propagation for the purposes of MPLS to be
specified.

Changes in v2:
 - use matches instead of strcmp for enabled/disabled keywords to
   avoid excessive typing

Robert Shearman (2):
  iproute: Add support for ttl-propagation attribute
  iproute: Add support for MPLS LWT ttl attribute

 ip/iproute.c   | 22 ++
 ip/iproute_lwtunnel.c  | 31 +--
 man/man8/ip-route.8.in | 19 +--
 3 files changed, 68 insertions(+), 4 deletions(-)

-- 
2.1.4



[PATCH v2 iproute2 net-next 1/2] iproute: Add support for ttl-propagation attribute

2017-04-11 Thread Robert Shearman
Add support for setting and displaying the ttl-propagation attribute
initially used by MPLS to control propagation of MPLS TTL to IPv4/IPv6
TTL/hop-limit on popping final label on a per-route basis.

Signed-off-by: Robert Shearman 
---
 ip/iproute.c   | 22 ++
 man/man8/ip-route.8.in | 10 +-
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 7cdf0726feb3..6f75319c944a 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -77,6 +77,7 @@ static void usage(void)
fprintf(stderr, "NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]\n");
fprintf(stderr, " [ table TABLE_ID ] [ proto RTPROTO ]\n");
fprintf(stderr, " [ scope SCOPE ] [ metric METRIC ]\n");
+   fprintf(stderr, " [ ttl-propagate { enabled | disabled } 
]\n");
fprintf(stderr, "INFO_SPEC := NH OPTIONS FLAGS [ nexthop NH ]...\n");
fprintf(stderr, "NH := [ encap ENCAPTYPE ENCAPHDR ] [ via [ FAMILY ] 
ADDRESS ]\n");
fprintf(stderr, "   [ dev STRING ] [ weight NUMBER ] 
NHFLAGS\n");
@@ -714,6 +715,13 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
fprintf(fp, "%u", pref);
}
}
+   if (tb[RTA_TTL_PROPAGATE]) {
+   fprintf(fp, "ttl-propagate ");
+   if (rta_getattr_u8(tb[RTA_TTL_PROPAGATE]))
+   fprintf(fp, "enabled");
+   else
+   fprintf(fp, "disabled");
+   }
fprintf(fp, "\n");
fflush(fp);
return 0;
@@ -1184,6 +1192,20 @@ static int iproute_modify(int cmd, unsigned int flags, 
int argc, char **argv)
 
if (rta->rta_len > RTA_LENGTH(0))
addraw_l(&req.n, 1024, RTA_DATA(rta), 
RTA_PAYLOAD(rta));
+   } else if (strcmp(*argv, "ttl-propagate") == 0) {
+   __u8 ttl_prop;
+
+   NEXT_ARG();
+   if (matches(*argv, "enabled") == 0)
+   ttl_prop = 1;
+   else if (matches(*argv, "disabled") == 0)
+   ttl_prop = 0;
+   else
+   invarg("\"ttl-propagate\" value is invalid\n",
+  *argv);
+
+   addattr8(&req.n, sizeof(req), RTA_TTL_PROPAGATE,
+ttl_prop);
} else {
int type;
inet_prefix dst;
diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
index d6e06649a464..fbe2711a4301 100644
--- a/man/man8/ip-route.8.in
+++ b/man/man8/ip-route.8.in
@@ -75,7 +75,9 @@ replace " } "
 .B  scope
 .IR SCOPE " ] [ "
 .B  metric
-.IR METRIC " ]"
+.IR METRIC " ] [ "
+.B  ttl-propagate
+.RB "{ " enabled " | " disabled " } ]"
 
 .ti -8
 .IR INFO_SPEC " := " "NH OPTIONS FLAGS" " ["
@@ -710,6 +712,12 @@ is a set of encapsulation attributes specific to the
 the route will be deleted after the expires time.
 .B Only
 support IPv6 at present.
+
+.TP
+.BR ttl-propagate " { " enabled " | " disabled " } "
+Control whether TTL should be propagated from any encap into the
+un-encapsulated packet, overriding any global configuration. Only
+supported for MPLS at present.
 .RE
 
 .TP
-- 
2.1.4



[PATCH v2 iproute2 net-next 2/2] iproute: Add support for MPLS LWT ttl attribute

2017-04-11 Thread Robert Shearman
Add support for setting and displaying the ttl attribute
for MPLS IP lighweight tunnels.

Signed-off-by: Robert Shearman 
---
 ip/iproute_lwtunnel.c  | 31 +--
 man/man8/ip-route.8.in |  9 -
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c
index 0fa1cab0a790..845a115e9e41 100644
--- a/ip/iproute_lwtunnel.c
+++ b/ip/iproute_lwtunnel.c
@@ -84,6 +84,9 @@ static void print_encap_mpls(FILE *fp, struct rtattr *encap)
if (tb[MPLS_IPTUNNEL_DST])
fprintf(fp, " %s ",
format_host_rta(AF_MPLS, tb[MPLS_IPTUNNEL_DST]));
+   if (tb[MPLS_IPTUNNEL_TTL])
+   fprintf(fp, "ttl %u ",
+   rta_getattr_u8(tb[MPLS_IPTUNNEL_TTL]));
 }
 
 static void print_encap_ip(FILE *fp, struct rtattr *encap)
@@ -247,6 +250,7 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len,
inet_prefix addr;
int argc = *argcp;
char **argv = *argvp;
+   int ttl_ok = 0;
 
if (get_addr(&addr, *argv, AF_MPLS)) {
fprintf(stderr,
@@ -258,8 +262,31 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len,
rta_addattr_l(rta, len, MPLS_IPTUNNEL_DST, &addr.data,
  addr.bytelen);
 
-   *argcp = argc;
-   *argvp = argv;
+   argc--;
+   argv++;
+
+   while (argc > 0) {
+   if (strcmp(*argv, "ttl") == 0) {
+   __u8 ttl;
+
+   NEXT_ARG();
+   if (ttl_ok++)
+   duparg2("ttl", *argv);
+   if (get_u8(&ttl, *argv, 0))
+   invarg("\"ttl\" value is invalid\n", *argv);
+   rta_addattr8(rta, len, MPLS_IPTUNNEL_TTL, ttl);
+   } else {
+   break;
+   }
+   argc--; argv++;
+   }
+
+   /* argv is currently the first unparsed argument,
+* but the lwt_parse_encap() caller will move to the next,
+* so step back
+*/
+   *argcp = argc + 1;
+   *argvp = argv - 1;
 
return 0;
 }
diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
index fbe2711a4301..d2a44acf2793 100644
--- a/man/man8/ip-route.8.in
+++ b/man/man8/ip-route.8.in
@@ -181,7 +181,9 @@ throw " | " unreachable " | " prohibit " | " blackhole " | 
" nat " ]"
 .ti -8
 .IR ENCAP_MPLS " := "
 .BR mpls " [ "
-.IR LABEL " ]"
+.IR LABEL " ] ["
+.B  ttl
+.IR TTL " ]"
 
 .ti -8
 .IR ENCAP_IP " := "
@@ -666,6 +668,11 @@ is a set of encapsulation attributes specific to the
 .I MPLSLABEL
 - mpls label stack with labels separated by
 .I "/"
+.sp
+
+.B ttl
+.I TTL
+- TTL to use for MPLS header or 0 to inherit from IP header
 .in -2
 .sp
 
-- 
2.1.4



Re: [PATCH iproute2 net-next 1/2] iproute: Add support for ttl-propagation attribute

2017-04-11 Thread Robert Shearman

On 11/04/17 04:43, David Ahern wrote:

On 4/10/17 8:36 AM, Robert Shearman wrote:

@@ -1184,6 +1192,20 @@ static int iproute_modify(int cmd, unsigned int flags, 
int argc, char **argv)

if (rta->rta_len > RTA_LENGTH(0))
addraw_l(&req.n, 1024, RTA_DATA(rta), 
RTA_PAYLOAD(rta));
+   } else if (strcmp(*argv, "ttl-propagate") == 0) {
+   __u8 ttl_prop;
+
+   NEXT_ARG();
+   if (strcmp(*argv, "enabled") == 0)
+   ttl_prop = 1;
+   else if (strcmp(*argv, "disabled") == 0)
+   ttl_prop = 0;
+   else
+   invarg("\"ttl-propagate\" value is invalid\n",
+  *argv);
+


matches() instead of strcmp() is more user friendly. 'enabled' is a lot
to type. ;-)



Ok, will change as suggested in v2.

Thanks for reviewing,
Rob


[PATCH iproute2 net-next 1/2] iproute: Add support for ttl-propagation attribute

2017-04-10 Thread Robert Shearman
Add support for setting and displaying the ttl-propagation attribute
initially used by MPLS to control propagation of MPLS TTL to IPv4/IPv6
TTL/hop-limit on popping final label on a per-route basis.

Signed-off-by: Robert Shearman 
---
 ip/iproute.c   | 22 ++
 man/man8/ip-route.8.in | 10 +-
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 7cdf0726feb3..c2b681fc9730 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -77,6 +77,7 @@ static void usage(void)
fprintf(stderr, "NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]\n");
fprintf(stderr, " [ table TABLE_ID ] [ proto RTPROTO ]\n");
fprintf(stderr, " [ scope SCOPE ] [ metric METRIC ]\n");
+   fprintf(stderr, " [ ttl-propagate { enabled | disabled } 
]\n");
fprintf(stderr, "INFO_SPEC := NH OPTIONS FLAGS [ nexthop NH ]...\n");
fprintf(stderr, "NH := [ encap ENCAPTYPE ENCAPHDR ] [ via [ FAMILY ] 
ADDRESS ]\n");
fprintf(stderr, "   [ dev STRING ] [ weight NUMBER ] 
NHFLAGS\n");
@@ -714,6 +715,13 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
fprintf(fp, "%u", pref);
}
}
+   if (tb[RTA_TTL_PROPAGATE]) {
+   fprintf(fp, "ttl-propagate ");
+   if (rta_getattr_u8(tb[RTA_TTL_PROPAGATE]))
+   fprintf(fp, "enabled");
+   else
+   fprintf(fp, "disabled");
+   }
fprintf(fp, "\n");
fflush(fp);
return 0;
@@ -1184,6 +1192,20 @@ static int iproute_modify(int cmd, unsigned int flags, 
int argc, char **argv)
 
if (rta->rta_len > RTA_LENGTH(0))
addraw_l(&req.n, 1024, RTA_DATA(rta), 
RTA_PAYLOAD(rta));
+   } else if (strcmp(*argv, "ttl-propagate") == 0) {
+   __u8 ttl_prop;
+
+   NEXT_ARG();
+   if (strcmp(*argv, "enabled") == 0)
+   ttl_prop = 1;
+   else if (strcmp(*argv, "disabled") == 0)
+   ttl_prop = 0;
+   else
+   invarg("\"ttl-propagate\" value is invalid\n",
+  *argv);
+
+   addattr8(&req.n, sizeof(req), RTA_TTL_PROPAGATE,
+ttl_prop);
} else {
int type;
inet_prefix dst;
diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
index d6e06649a464..fbe2711a4301 100644
--- a/man/man8/ip-route.8.in
+++ b/man/man8/ip-route.8.in
@@ -75,7 +75,9 @@ replace " } "
 .B  scope
 .IR SCOPE " ] [ "
 .B  metric
-.IR METRIC " ]"
+.IR METRIC " ] [ "
+.B  ttl-propagate
+.RB "{ " enabled " | " disabled " } ]"
 
 .ti -8
 .IR INFO_SPEC " := " "NH OPTIONS FLAGS" " ["
@@ -710,6 +712,12 @@ is a set of encapsulation attributes specific to the
 the route will be deleted after the expires time.
 .B Only
 support IPv6 at present.
+
+.TP
+.BR ttl-propagate " { " enabled " | " disabled " } "
+Control whether TTL should be propagated from any encap into the
+un-encapsulated packet, overriding any global configuration. Only
+supported for MPLS at present.
 .RE
 
 .TP
-- 
2.1.4



[PATCH iproute2 net-next 2/2] iproute: Add support for MPLS LWT ttl attribute

2017-04-10 Thread Robert Shearman
Add support for setting and displaying the ttl attribute
for MPLS IP lighweight tunnels.

Signed-off-by: Robert Shearman 
---
 ip/iproute_lwtunnel.c  | 31 +--
 man/man8/ip-route.8.in |  9 -
 2 files changed, 37 insertions(+), 3 deletions(-)

diff --git a/ip/iproute_lwtunnel.c b/ip/iproute_lwtunnel.c
index 0fa1cab0a790..845a115e9e41 100644
--- a/ip/iproute_lwtunnel.c
+++ b/ip/iproute_lwtunnel.c
@@ -84,6 +84,9 @@ static void print_encap_mpls(FILE *fp, struct rtattr *encap)
if (tb[MPLS_IPTUNNEL_DST])
fprintf(fp, " %s ",
format_host_rta(AF_MPLS, tb[MPLS_IPTUNNEL_DST]));
+   if (tb[MPLS_IPTUNNEL_TTL])
+   fprintf(fp, "ttl %u ",
+   rta_getattr_u8(tb[MPLS_IPTUNNEL_TTL]));
 }
 
 static void print_encap_ip(FILE *fp, struct rtattr *encap)
@@ -247,6 +250,7 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len,
inet_prefix addr;
int argc = *argcp;
char **argv = *argvp;
+   int ttl_ok = 0;
 
if (get_addr(&addr, *argv, AF_MPLS)) {
fprintf(stderr,
@@ -258,8 +262,31 @@ static int parse_encap_mpls(struct rtattr *rta, size_t len,
rta_addattr_l(rta, len, MPLS_IPTUNNEL_DST, &addr.data,
  addr.bytelen);
 
-   *argcp = argc;
-   *argvp = argv;
+   argc--;
+   argv++;
+
+   while (argc > 0) {
+   if (strcmp(*argv, "ttl") == 0) {
+   __u8 ttl;
+
+   NEXT_ARG();
+   if (ttl_ok++)
+   duparg2("ttl", *argv);
+   if (get_u8(&ttl, *argv, 0))
+   invarg("\"ttl\" value is invalid\n", *argv);
+   rta_addattr8(rta, len, MPLS_IPTUNNEL_TTL, ttl);
+   } else {
+   break;
+   }
+   argc--; argv++;
+   }
+
+   /* argv is currently the first unparsed argument,
+* but the lwt_parse_encap() caller will move to the next,
+* so step back
+*/
+   *argcp = argc + 1;
+   *argvp = argv - 1;
 
return 0;
 }
diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
index fbe2711a4301..d2a44acf2793 100644
--- a/man/man8/ip-route.8.in
+++ b/man/man8/ip-route.8.in
@@ -181,7 +181,9 @@ throw " | " unreachable " | " prohibit " | " blackhole " | 
" nat " ]"
 .ti -8
 .IR ENCAP_MPLS " := "
 .BR mpls " [ "
-.IR LABEL " ]"
+.IR LABEL " ] ["
+.B  ttl
+.IR TTL " ]"
 
 .ti -8
 .IR ENCAP_IP " := "
@@ -666,6 +668,11 @@ is a set of encapsulation attributes specific to the
 .I MPLSLABEL
 - mpls label stack with labels separated by
 .I "/"
+.sp
+
+.B ttl
+.I TTL
+- TTL to use for MPLS header or 0 to inherit from IP header
 .in -2
 .sp
 
-- 
2.1.4



[PATCH iproute2 net-next 0/2] ip: Allow TTL propagation to/from IP packets to be configured

2017-04-10 Thread Robert Shearman
This patch series adds support for per-MPLS-lightweight-tunnel ttl
values and per route ttl-propagation for the purposes of MPLS to be
specified.

Robert Shearman (2):
  iproute: Add support for ttl-propagation attribute
  iproute: Add support for MPLS LWT ttl attribute

 ip/iproute.c   | 22 ++
 ip/iproute_lwtunnel.c  | 31 +--
 man/man8/ip-route.8.in | 19 +--
 3 files changed, 68 insertions(+), 4 deletions(-)

-- 
2.1.4



[PATCH net-next 3/3] l3mdev: Fix lookup in local table when using main table

2017-04-10 Thread Robert Shearman
If an l3mdev is set to use the main table then the l3mdev rules will
return this. This means that no lookup will be done in the local table
if split local/main table is in effect and the local table lookup rule
has been reordered to after the l3mdev rule. Related to this is that
the order of the rule for looking up in the main table isn't
respected.

Fix these two aspects by not returning a match if the l3mdev's table
id is the main table. Processing will then proceed normally to the
default rules and do the appropriate lookups.

Signed-off-by: Robert Shearman 
---
 net/l3mdev/l3mdev.c | 26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/net/l3mdev/l3mdev.c b/net/l3mdev/l3mdev.c
index 8da86ceca33d..5147959243c5 100644
--- a/net/l3mdev/l3mdev.c
+++ b/net/l3mdev/l3mdev.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * l3mdev_master_ifindex - get index of L3 master device
@@ -142,23 +143,36 @@ int l3mdev_fib_rule_match(struct net *net, struct flowi 
*fl,
 {
struct net_device *dev;
int rc = 0;
+   u32 tb_id;
 
rcu_read_lock();
 
dev = dev_get_by_index_rcu(net, fl->flowi_oif);
if (dev && netif_is_l3_master(dev) &&
dev->l3mdev_ops->l3mdev_fib_table) {
-   arg->table = dev->l3mdev_ops->l3mdev_fib_table(dev);
-   rc = 1;
-   goto out;
+   tb_id = dev->l3mdev_ops->l3mdev_fib_table(dev);
+   /* This requires the main table id to be consistent
+* between IPv4 and IPv6.
+*/
+   BUILD_BUG_ON(RT_TABLE_MAIN != RT6_TABLE_MAIN);
+   /* main table is handled by default rules */
+   if (tb_id != RT_TABLE_MAIN) {
+   arg->table = tb_id;
+   rc = 1;
+   goto out;
+   }
}
 
dev = dev_get_by_index_rcu(net, fl->flowi_iif);
if (dev && netif_is_l3_master(dev) &&
dev->l3mdev_ops->l3mdev_fib_table) {
-   arg->table = dev->l3mdev_ops->l3mdev_fib_table(dev);
-   rc = 1;
-   goto out;
+   tb_id = dev->l3mdev_ops->l3mdev_fib_table(dev);
+   /* main table is handled by default rules */
+   if (tb_id != RT_TABLE_MAIN) {
+   arg->table = tb_id;
+   rc = 1;
+   goto out;
+   }
}
 
 out:
-- 
2.1.4



[PATCH net-next 1/3] ipv6: Fix route handling when using l3mdev set to main table

2017-04-10 Thread Robert Shearman
If an l3mdev is set to use the main table then the use of the local
table is overridden. This means that when split local/main table is in
effect then local routes aren't added to the local table and so don't
respect the order of ip rules.

Fix this by assuming that no if no l3mdev is present then defaulting
to RT6_TABLE_MAIN and then subsequently doing a translation from
RT6_TABLE_MAIN to RT6_TABLE_LOCAL.

Do the same translations for RT6_TABLE_INFO, RT6_TABLE_DFLT and
RT6_TABLE_PREFIX even though they are just defined to RT6_TABLE_MAIN
in case someone decides to change that in the future.

Signed-off-by: Robert Shearman 
---
 net/ipv6/addrconf.c | 12 +---
 net/ipv6/route.c| 23 ++-
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 67ec87ea5fb6..937c35581a28 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -2244,7 +2244,6 @@ addrconf_prefix_route(struct in6_addr *pfx, int plen, 
struct net_device *dev,
  unsigned long expires, u32 flags)
 {
struct fib6_config cfg = {
-   .fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_PREFIX,
.fc_metric = IP6_RT_PRIO_ADDRCONF,
.fc_ifindex = dev->ifindex,
.fc_expires = expires,
@@ -2254,6 +2253,9 @@ addrconf_prefix_route(struct in6_addr *pfx, int plen, 
struct net_device *dev,
.fc_protocol = RTPROT_KERNEL,
};
 
+   cfg.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN;
+   if (cfg.fc_table == RT6_TABLE_MAIN)
+   cfg.fc_table = RT6_TABLE_PREFIX;
cfg.fc_dst = *pfx;
 
/* Prevent useless cloning on PtP SIT.
@@ -2277,8 +2279,10 @@ static struct rt6_info *addrconf_get_prefix_route(const 
struct in6_addr *pfx,
struct fib6_node *fn;
struct rt6_info *rt = NULL;
struct fib6_table *table;
-   u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_PREFIX;
+   u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN;
 
+   if (tb_id == RT6_TABLE_MAIN)
+   tb_id = RT6_TABLE_PREFIX;
table = fib6_get_table(dev_net(dev), tb_id);
if (!table)
return NULL;
@@ -2310,7 +2314,6 @@ static struct rt6_info *addrconf_get_prefix_route(const 
struct in6_addr *pfx,
 static void addrconf_add_mroute(struct net_device *dev)
 {
struct fib6_config cfg = {
-   .fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_LOCAL,
.fc_metric = IP6_RT_PRIO_ADDRCONF,
.fc_ifindex = dev->ifindex,
.fc_dst_len = 8,
@@ -2318,6 +2321,9 @@ static void addrconf_add_mroute(struct net_device *dev)
.fc_nlinfo.nl_net = dev_net(dev),
};
 
+   cfg.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN;
+   if (cfg.fc_table == RT6_TABLE_MAIN)
+   cfg.fc_table = RT6_TABLE_LOCAL;
ipv6_addr_set(&cfg.fc_dst, htonl(0xFF00), 0, 0, 0);
 
ip6_route_add(&cfg);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 9db1418993f2..490c74ed6a78 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2424,12 +2424,15 @@ static struct rt6_info *rt6_get_route_info(struct net 
*net,
   const struct in6_addr *gwaddr,
   struct net_device *dev)
 {
-   u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_INFO;
+   u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN;
int ifindex = dev->ifindex;
struct fib6_node *fn;
struct rt6_info *rt = NULL;
struct fib6_table *table;
 
+   if (tb_id == RT6_TABLE_MAIN)
+   tb_id = RT6_TABLE_INFO;
+
table = fib6_get_table(net, tb_id);
if (!table)
return NULL;
@@ -2471,7 +2474,9 @@ static struct rt6_info *rt6_add_route_info(struct net 
*net,
.fc_nlinfo.nl_net = net,
};
 
-   cfg.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_INFO,
+   cfg.fc_table = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN;
+   if (cfg.fc_table == RT6_TABLE_MAIN)
+   cfg.fc_table = RT6_TABLE_INFO;
cfg.fc_dst = *prefix;
cfg.fc_gateway = *gwaddr;
 
@@ -2487,10 +2492,13 @@ static struct rt6_info *rt6_add_route_info(struct net 
*net,
 
 struct rt6_info *rt6_get_dflt_router(const struct in6_addr *addr, struct 
net_device *dev)
 {
-   u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_DFLT;
+   u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_MAIN;
struct rt6_info *rt;
struct fib6_table *table;
 
+   if (tb_id == RT6_TABLE_MAIN)
+   tb_id = RT6_TABLE_DFLT;
+
table = fib6_get_table(dev_net(dev), tb_id);
if (!table)
return NULL;
@@ -2513,7 +2521,6 @@ struct rt6_info *rt6_add_dflt_router(const struct 
in6_addr *gwaddr,
 unsigned int pref)
 {
   

[PATCH net-next 2/3] ipv4: Fix route handling when using l3mdev set to main table

2017-04-10 Thread Robert Shearman
If an l3mdev is set to use the main table then the use of the local
table is overridden. This means that when split local/main table is in
effect then local routes aren't added to the local table and so don't
respect the order of ip rules.

Fix this by assuming that no if no l3mdev is present then defaulting
to RT_TABLE_MAIN and then subsequently doing a translation from
RT_TABLE_MAIN to RT_TABLE_LOCAL.

Signed-off-by: Robert Shearman 
---
 net/ipv4/af_inet.c  |  4 +++-
 net/ipv4/fib_frontend.c | 14 +-
 net/ipv4/raw.c  |  5 -
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index d1a11707a126..83d54fab03f0 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -436,7 +436,7 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
struct net *net = sock_net(sk);
unsigned short snum;
int chk_addr_ret;
-   u32 tb_id = RT_TABLE_LOCAL;
+   u32 tb_id = RT_TABLE_MAIN;
int err;
 
/* If the socket has its own bind function then use it. (RAW) */
@@ -459,6 +459,8 @@ int inet_bind(struct socket *sock, struct sockaddr *uaddr, 
int addr_len)
}
 
tb_id = l3mdev_fib_table_by_index(net, sk->sk_bound_dev_if) ? : tb_id;
+   if (tb_id == RT_TABLE_MAIN)
+   tb_id = RT_TABLE_LOCAL;
chk_addr_ret = inet_addr_type_table(net, addr->sin_addr.s_addr, tb_id);
 
/* Not specified by any standard per-se, however it breaks too
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 8f2133ffc2ff..1782c35dbac0 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -248,8 +248,10 @@ EXPORT_SYMBOL(inet_addr_type);
 unsigned int inet_dev_addr_type(struct net *net, const struct net_device *dev,
__be32 addr)
 {
-   u32 rt_table = l3mdev_fib_table(dev) ? : RT_TABLE_LOCAL;
+   u32 rt_table = l3mdev_fib_table(dev) ? : RT_TABLE_MAIN;
 
+   if (rt_table == RT_TABLE_MAIN)
+   rt_table = RT_TABLE_LOCAL;
return __inet_dev_addr_type(net, dev, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_dev_addr_type);
@@ -261,8 +263,10 @@ unsigned int inet_addr_type_dev_table(struct net *net,
  const struct net_device *dev,
  __be32 addr)
 {
-   u32 rt_table = l3mdev_fib_table(dev) ? : RT_TABLE_LOCAL;
+   u32 rt_table = l3mdev_fib_table(dev) ? : RT_TABLE_MAIN;
 
+   if (rt_table == RT_TABLE_MAIN)
+   rt_table = RT_TABLE_LOCAL;
return __inet_dev_addr_type(net, NULL, addr, rt_table);
 }
 EXPORT_SYMBOL(inet_addr_type_dev_table);
@@ -805,7 +809,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct 
netlink_callback *cb)
 static void fib_magic(int cmd, int type, __be32 dst, int dst_len, struct 
in_ifaddr *ifa)
 {
struct net *net = dev_net(ifa->ifa_dev->dev);
-   u32 tb_id = l3mdev_fib_table(ifa->ifa_dev->dev);
+   u32 tb_id = l3mdev_fib_table(ifa->ifa_dev->dev) ? : RT_TABLE_MAIN;
struct fib_table *tb;
struct fib_config cfg = {
.fc_protocol = RTPROT_KERNEL,
@@ -820,8 +824,8 @@ static void fib_magic(int cmd, int type, __be32 dst, int 
dst_len, struct in_ifad
},
};
 
-   if (!tb_id)
-   tb_id = (type == RTN_UNICAST) ? RT_TABLE_MAIN : RT_TABLE_LOCAL;
+   if (tb_id == RT_TABLE_MAIN && type != RTN_UNICAST)
+   tb_id = RT_TABLE_LOCAL;
 
tb = fib_new_table(net, tb_id);
if (!tb)
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8119e1f66e03..2dd7022681e6 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -699,7 +699,7 @@ static int raw_bind(struct sock *sk, struct sockaddr 
*uaddr, int addr_len)
 {
struct inet_sock *inet = inet_sk(sk);
struct sockaddr_in *addr = (struct sockaddr_in *) uaddr;
-   u32 tb_id = RT_TABLE_LOCAL;
+   u32 tb_id = RT_TABLE_MAIN;
int ret = -EINVAL;
int chk_addr_ret;
 
@@ -710,6 +710,9 @@ static int raw_bind(struct sock *sk, struct sockaddr 
*uaddr, int addr_len)
tb_id = l3mdev_fib_table_by_index(sock_net(sk),
 sk->sk_bound_dev_if) ? : tb_id;
 
+   if (tb_id == RT_TABLE_MAIN)
+   tb_id = RT_TABLE_LOCAL;
+
chk_addr_ret = inet_addr_type_table(sock_net(sk), addr->sin_addr.s_addr,
tb_id);
 
-- 
2.1.4



[PATCH net-next 0/3] l3mdev: Improve use with main table

2017-04-10 Thread Robert Shearman
Attempting to create a TCP socket not bound to a VRF device when a TCP
socket bound to a VRF device with the same port exists (and vice
versa) fails with EADDRINUSE. This limits the ability to use programs
in selected mixed VRF/non-VRF contexts.

This patch series solves the problem by extending the l3mdev be aware
of the special semantics of the main table and fixing issues arising
from the split local/main tables. A VRF master device created linking
to the main table and used for these programs in the same way as those
created for VRF tables can.

Robert Shearman (3):
  ipv6: Fix route handling when using l3mdev set to main table
  ipv4: Fix route handling when using l3mdev set to main table
  l3mdev: Fix lookup in local table when using main table

 net/ipv4/af_inet.c  |  4 +++-
 net/ipv4/fib_frontend.c | 14 +-
 net/ipv4/raw.c  |  5 -
 net/ipv6/addrconf.c | 12 +---
 net/ipv6/route.c| 23 ++-
 net/l3mdev/l3mdev.c | 26 --
 6 files changed, 63 insertions(+), 21 deletions(-)

-- 
2.1.4



Re: [PATCH net-next v3 0/6] net: mpls: Allow users to configure more labels per route

2017-04-03 Thread Robert Shearman

On 31/03/17 15:13, David Ahern wrote:

Increase the maximum number of new labels for MPLS routes from 2 to 30.

To keep memory consumption in check, the labels array is moved to the end
of mpls_nh and mpls_iptunnel_encap structs as a 0-sized array. Allocations
use the maximum number of labels across all nexthops in a route for LSR
and the number of labels configured for LWT.

The mpls_route layout is changed to:

...


Meaning the via follows its mpls_nh providing better locality as the
number of labels increases. UDP_RR tests with namespaces shows no impact
to a modest performance increase with this layout for 1 or 2 labels and
1 or 2 nexthops.

mpls_route allocation size is limited to 4096 bytes allowing on the
order of 30 nexthops with 30 labels (or more nexthops with fewer
labels). LWT encap shares same maximum number of labels as mpls routing.

v3
- initialize n_labels to 0 in case RTA_NEWDST is not defined; detected
  by the kbuild test robot

v2
- updates per Eric's comments
  + added patch to ensure all reads of rt_nhn_alive and nh_flags in
the packet path use READ_ONCE and all writes via event handlers
use WRITE_ONCE

  + limit mpls_route size to 4096 (PAGE_SIZE for most arch)

  + mostly killed use of MAX_NEW_LABELS; it exists only for common
limit between lwt and routing paths

David Ahern (6):
  net: mpls: rt_nhn_alive and nh_flags should be accessed using
READ_ONCE
  net: mpls: Convert number of nexthops to u8
  net: mpls: change mpls_route layout
  net:mpls: Limit memory allocation for mpls_route
  net: mpls: bump maximum number of labels
  net: mpls: Increase max number of labels for lwt encap

 include/net/mpls_iptunnel.h |   5 +-
 net/mpls/af_mpls.c  | 210 +---
 net/mpls/internal.h |  61 +
 net/mpls/mpls_iptunnel.c|  13 ++-
 4 files changed, 196 insertions(+), 93 deletions(-)



Acked-by: Robert Shearman 


Re: [PATCH net-next] net: mpls: Update lfib_nlmsg_size to skip deleted nexthops

2017-03-29 Thread Robert Shearman

On 28/03/17 23:19, David Ahern wrote:

A recent commit skips nexthops in a route if the device has been
deleted. Update lfib_nlmsg_size accordingly.

Reported-by: Roopa Prabhu 
Signed-off-by: David Ahern 


Acked-by: Robert Shearman 


Re: [PATCH net-next 0/4] net: mpls: Allow users to configure more labels per route

2017-03-28 Thread Robert Shearman

On 25/03/17 17:03, David Ahern wrote:

Bump the maximum number of labels for MPLS routes from 2 to 12. To keep
memory consumption in check the labels array is moved to the end of mpls_nh
and mpls_iptunnel_encap structs as a 0-sized array. Allocations use the
maximum number of labels across all nexthops in a route for LSR and the
number of labels configured for LWT.

The mpls_route layout is changed to:

...

Meaning the via follows its mpls_nh providing better locality as the
number of labels increases. UDP_RR tests with namespaces shows no impact
to a modest performance increase with this layout for 1 or 2 labels and
1 or 2 nexthops.

The new limit is set to 12 to cover all currently known segment
routing use cases.

David Ahern (4):
  mpls: Convert number of nexthops to u8
  net: mpls: change mpls_route layout
  net: mpls: bump maximum number of labels
  net: mpls: Increase max number of labels for lwt encap

 include/net/mpls_iptunnel.h |   4 +-
 net/mpls/af_mpls.c  | 108 ++--
 net/mpls/internal.h |  52 ++---
 net/mpls/mpls_iptunnel.c|  13 --
 4 files changed, 112 insertions(+), 65 deletions(-)



Acked-by: Robert Shearman 


Re: [PATCH net-next 0/4] net: mpls: Allow users to configure more labels per route

2017-03-28 Thread Robert Shearman

On 28/03/17 04:08, Eric W. Biederman wrote:

David Ahern  writes:


On 3/27/17 4:39 AM, Robert Shearman wrote:

On 25/03/17 19:15, Eric W. Biederman wrote:

David Ahern  writes:


Bump the maximum number of labels for MPLS routes from 2 to 12. To keep
memory consumption in check the labels array is moved to the end of
mpls_nh
and mpls_iptunnel_encap structs as a 0-sized array. Allocations use the
maximum number of labels across all nexthops in a route for LSR and the
number of labels configured for LWT.

The mpls_route layout is changed to:

   +--+
   | mpls_route   |
   +--+
   | mpls_nh 0|
   +--+
   | alignment padding|   4 bytes for odd number of labels; 0 for
even
   +--+
   | via[rt_max_alen] 0   |
   +--+
   | alignment padding|   via's aligned on sizeof(unsigned long)
   +--+
   | ...  |

Meaning the via follows its mpls_nh providing better locality as the
number of labels increases. UDP_RR tests with namespaces shows no impact
to a modest performance increase with this layout for 1 or 2 labels and
1 or 2 nexthops.

The new limit is set to 12 to cover all currently known segment
routing use cases.


How does this compare with running the packet a couple of times through
the mpls table to get all of the desired labels applied?


At the moment (i.e setting output interface for a route to the loopback
interface) the TTL would currently be calculated incorrectly since it'll
be decremented each time the packet is run through the input processing.
If that was avoided, then the only issue would be the lower performance.


We have the infrastructure to add all the labels on one pass. It does
not make sense to recirculate the packet to get the same effect.


I was really asking what are the advantages and disadvantages of this
change rather than suggesting it was a bad idea.  The information about
ttl is useful.

Adding that this will route packets with more labels more quickly than
the recirculation method is also useful to know.


I should also add that not recirculating also avoids having to allocate 
extra local labels, which may be limited in supply in some deployments, 
and avoids the extra control plane/user complexity associated with 
managing the routes associated with the recirculation.


Thanks,
Rob


Re: [PATCH net-next 0/4] net: mpls: Allow users to configure more labels per route

2017-03-27 Thread Robert Shearman

On 25/03/17 19:15, Eric W. Biederman wrote:

David Ahern  writes:


Bump the maximum number of labels for MPLS routes from 2 to 12. To keep
memory consumption in check the labels array is moved to the end of mpls_nh
and mpls_iptunnel_encap structs as a 0-sized array. Allocations use the
maximum number of labels across all nexthops in a route for LSR and the
number of labels configured for LWT.

The mpls_route layout is changed to:

   +--+
   | mpls_route   |
   +--+
   | mpls_nh 0|
   +--+
   | alignment padding|   4 bytes for odd number of labels; 0 for even
   +--+
   | via[rt_max_alen] 0   |
   +--+
   | alignment padding|   via's aligned on sizeof(unsigned long)
   +--+
   | ...  |

Meaning the via follows its mpls_nh providing better locality as the
number of labels increases. UDP_RR tests with namespaces shows no impact
to a modest performance increase with this layout for 1 or 2 labels and
1 or 2 nexthops.

The new limit is set to 12 to cover all currently known segment
routing use cases.


How does this compare with running the packet a couple of times through
the mpls table to get all of the desired labels applied?


At the moment (i.e setting output interface for a route to the loopback 
interface) the TTL would currently be calculated incorrectly since it'll 
be decremented each time the packet is run through the input processing. 
If that was avoided, then the only issue would be the lower performance.




I can certainly see the case in an mpls tunnel ingress where this might
could be desirable.Which is something you implement in your last
patch.  However is it at all common to push lots of labels at once
during routing?

I am probably a bit naive but it seems absurd to push more
than a handful of labels onto a packet as you are routing it.


From draft-ietf-spring-segment-routing-mpls-07:

   Note that the kind of deployment of Segment Routing may affect the
   depth of the MPLS label stack.  As every segment in the list is
   represented by an additional MPLS label, the length of the segment
   list directly correlates to the depth of the label stack.
   Implementing a long path with many explicit hops as a segment list
   may thus yield a deep label stack that would need to be pushed at the
   head of the SR tunnel.

   However, many use cases would need very few segments in the list.
   This is especially true when taking good advantage of the ECMP aware
   routing within each segment.  In fact most use cases need just one
   additional segment and thus lead to a similar label stack depth as
   e.g.  RSVP-based routing.

The summary is that when using short-path routing then the number of 
labels needed to be pushed on will be small (2 or 3). However, if using 
SR to implement traffic engineering through a list of explicit hops then 
the number of labels pushed could be much greater and up to the diameter 
of the IGP network. Traffic engineering like this is not unusual.


Thanks,
Rob


Re: [PATCH net-next 0/2] net: mpls: multipath route cleanups

2017-03-27 Thread Robert Shearman

On 24/03/17 22:21, David Ahern wrote:

When a device associated with a nexthop is deleted, the nexthop in
the route is effectively removed, so remove it from the route dump.

Further, when all nexhops have been deleted the route is effectively
done, so remove the route.

David Ahern (2):
  mpls: Don't show nexthop if device has been deleted
  mpls: Delete route when all nexthops have been deleted

 net/mpls/af_mpls.c | 17 +
 1 file changed, 13 insertions(+), 4 deletions(-)



Acked-by: Robert Shearman 


Re: [PATCH net-next] net: mpls: Fix setting ttl_propagate for rt2

2017-03-24 Thread Robert Shearman

On 24/03/17 01:02, David Ahern wrote:

Fix copy and paste error setting rt_ttl_propagate.

Fixes: 5b441ac8784c1 ("mpls: allow TTL propagation to IP packets to be 
configured")
Signed-off-by: David Ahern 


Good catch.

Acked-by: Robert Shearman 


Re: [PATCH] net: mpls: Fix nexthop alive tracking on down events

2017-03-14 Thread Robert Shearman

On 13/03/17 23:49, David Ahern wrote:

Alive tracking of nexthops can account for a link twice if the carrier
goes down followed by an admin down of the same link rendering multipath
routes useless. This is similar to 79099aab38c8 for UNREGISTER events and
DOWN events.

Fix by tracking number of alive nexthops in mpls_ifdown similar to the
logic in mpls_ifup. Checking the flags per nexthop once after all events
have been processed is simpler than trying to maintian a running count
through all event combinations.

Also, WRITE_ONCE is used instead of ACCESS_ONCE to set rt_nhn_alive
per a comment from checkpatch:
WARNING: Prefer WRITE_ONCE(, ) over ACCESS_ONCE() = 

Fixes: c89359a42e2a4 ("mpls: support for dead routes")
Signed-off-by: David Ahern 


Acked-by: Robert Shearman 


Re: [PATCH] mpls: Do not decrement alive counter for unregister events

2017-03-13 Thread Robert Shearman

On 10/03/17 22:11, David Ahern wrote:

Multipath routes can be rendered usesless when a device in one of the
paths is deleted. For example:

$ ip -f mpls ro ls
100
nexthop as to 200 via inet 172.16.2.2  dev virt12
nexthop as to 300 via inet 172.16.3.2  dev br0
101
nexthop as to 201 via inet6 2000:2::2  dev virt12
nexthop as to 301 via inet6 2000:3::2  dev br0

$ ip li del br0

When br0 is deleted the other hop is not considered in
mpls_select_multipath because of the alive check -- rt_nhn_alive
is 0.

rt_nhn_alive is decremented once in mpls_ifdown when the device is taken
down (NETDEV_DOWN) and again when it is deleted (NETDEV_UNREGISTER). For
a 2 hop route, deleting one device drops the alive count to 0. Since
devices are taken down before unregistering, the decrement on
NETDEV_UNREGISTER is redundant.

Fixes: c89359a42e2a4 ("mpls: support for dead routes")
Signed-off-by: David Ahern 
---
 net/mpls/af_mpls.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index ccdac9c44fdc..22a9971aa484 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1288,7 +1288,8 @@ static void mpls_ifdown(struct net_device *dev, int event)
/* fall through */
case NETDEV_CHANGE:
nh->nh_flags |= RTNH_F_LINKDOWN;
-   ACCESS_ONCE(rt->rt_nhn_alive) = 
rt->rt_nhn_alive - 1;
+   if (event != NETDEV_UNREGISTER)
+   ACCESS_ONCE(rt->rt_nhn_alive) = 
rt->rt_nhn_alive - 1;
break;
}
if (event == NETDEV_UNREGISTER)



Doesn't this leave the problem that if the device's link goes down and 
then the device gets deleted the alive count will be decremented twice 
for the same path?


Perhaps it would be better to change the condition for decrementing the 
alive count to be "!(nh->nh_flags & (RTNH_F_LINKDOWN | RTNH_F_DEAD))"?


Thanks,
Rob


[PATCH net-next v3 2/2] mpls: allow TTL propagation from IP packets to be configured

2017-03-10 Thread Robert Shearman
Allow TTL propagation from IP packets to MPLS packets to be
configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which
allows the TTL to be set in the resulting MPLS packet, with the value
of 0 having the semantics of enabling propagation of the TTL from the
IP header (i.e. non-zero values disable propagation).

Also allow the configuration to be overridden globally by reusing the
same sysctl to control whether the TTL is propagated from IP packets
into the MPLS header. If the per-LWT attribute is set then it
overrides the global configuration. If the TTL isn't propagated then a
default TTL value is used which can be configured via a new sysctl,
"net.mpls.default_ttl". This is kept separate from the configuration
of whether IP TTL propagation is enabled as it can be used in the
future when non-IP payloads are supported (i.e. where there is no
payload TTL that can be propagated).

Signed-off-by: Robert Shearman 
---
 Documentation/networking/mpls-sysctl.txt |  8 
 include/net/mpls_iptunnel.h  |  2 +
 include/net/netns/mpls.h |  1 +
 include/uapi/linux/mpls_iptunnel.h   |  2 +
 net/mpls/af_mpls.c   | 11 +
 net/mpls/mpls_iptunnel.c | 73 ++--
 6 files changed, 84 insertions(+), 13 deletions(-)

diff --git a/Documentation/networking/mpls-sysctl.txt 
b/Documentation/networking/mpls-sysctl.txt
index 9badd1d6685f..2f24a1912a48 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -30,6 +30,14 @@ ip_ttl_propagate - BOOL
0 - disabled / RFC 3443 [Short] Pipe Model
1 - enabled / RFC 3443 Uniform Model (default)
 
+default_ttl - BOOL
+   Default TTL value to use for MPLS packets where it cannot be
+   propagated from an IP header, either because one isn't present
+   or ip_ttl_propagate has been disabled.
+
+   Possible values: 1 - 255
+   Default: 255
+
 conf//input - BOOL
Control whether packets can be input on this interface.
 
diff --git a/include/net/mpls_iptunnel.h b/include/net/mpls_iptunnel.h
index 179253f9dcfd..a18af6a16eb5 100644
--- a/include/net/mpls_iptunnel.h
+++ b/include/net/mpls_iptunnel.h
@@ -19,6 +19,8 @@
 struct mpls_iptunnel_encap {
u32 label[MAX_NEW_LABELS];
u8  labels;
+   u8  ttl_propagate;
+   u8  default_ttl;
 };
 
 static inline struct mpls_iptunnel_encap *mpls_lwtunnel_encap(struct 
lwtunnel_state *lwtstate)
diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
index 08652eedabb2..6608b3693385 100644
--- a/include/net/netns/mpls.h
+++ b/include/net/netns/mpls.h
@@ -10,6 +10,7 @@ struct ctl_table_header;
 
 struct netns_mpls {
int ip_ttl_propagate;
+   int default_ttl;
size_t platform_labels;
struct mpls_route __rcu * __rcu *platform_label;
 
diff --git a/include/uapi/linux/mpls_iptunnel.h 
b/include/uapi/linux/mpls_iptunnel.h
index d80a0498f77e..f5e45095b0bb 100644
--- a/include/uapi/linux/mpls_iptunnel.h
+++ b/include/uapi/linux/mpls_iptunnel.h
@@ -16,11 +16,13 @@
 /* MPLS tunnel attributes
  * [RTA_ENCAP] = {
  * [MPLS_IPTUNNEL_DST]
+ * [MPLS_IPTUNNEL_TTL]
  * }
  */
 enum {
MPLS_IPTUNNEL_UNSPEC,
MPLS_IPTUNNEL_DST,
+   MPLS_IPTUNNEL_TTL,
__MPLS_IPTUNNEL_MAX,
 };
 #define MPLS_IPTUNNEL_MAX (__MPLS_IPTUNNEL_MAX - 1)
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 0e1046f21af8..0c5d111abe36 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -34,6 +34,7 @@
 static int zero = 0;
 static int one = 1;
 static int label_limit = (1 << 20) - 1;
+static int ttl_max = 255;
 
 static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
   struct nlmsghdr *nlh, struct net *net, u32 portid,
@@ -2042,6 +2043,15 @@ static const struct ctl_table mpls_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
+   {
+   .procname   = "default_ttl",
+   .data   = MPLS_NS_SYSCTL_OFFSET(mpls.default_ttl),
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = &one,
+   .extra2 = &ttl_max,
+   },
{ }
 };
 
@@ -2053,6 +2063,7 @@ static int mpls_net_init(struct net *net)
net->mpls.platform_labels = 0;
net->mpls.platform_label = NULL;
net->mpls.ip_ttl_propagate = 1;
+   net->mpls.default_ttl = 255;
 
table = kmemdup(mpls_table, sizeof(mpls_table), GFP_KERNEL);
if (table == NULL)
diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index e4e4424f9eb1..22f71fce0bfb 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -29,6 +29,7 @@
 
 static const struct nla_policy mpls_iptun

[PATCH net-next v3 1/2] mpls: allow TTL propagation to IP packets to be configured

2017-03-10 Thread Robert Shearman
Provide the ability to control on a per-route basis whether the TTL
value from an MPLS packet is propagated to an IPv4/IPv6 packet when
the last label is popped as per the theoretical model in RFC 3443
through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to
mean disable propagation and 1 to mean enable propagation.

In order to provide the ability to change the behaviour for packets
arriving with IPv4/IPv6 Explicit Null labels and to provide an easy
way for a user to change the behaviour for all existing routes without
having to reprogram them, a global knob is provided. This is done
through the addition of a new per-namespace sysctl,
"net.mpls.ip_ttl_propagate", which defaults to enabled. If the
per-route attribute is set (either enabled or disabled) then it
overrides the global configuration.

Signed-off-by: Robert Shearman 
---
 Documentation/networking/mpls-sysctl.txt | 11 
 include/net/netns/mpls.h |  2 +
 include/uapi/linux/rtnetlink.h   |  1 +
 net/mpls/af_mpls.c   | 87 +---
 net/mpls/internal.h  |  7 +++
 5 files changed, 100 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/mpls-sysctl.txt 
b/Documentation/networking/mpls-sysctl.txt
index 15d8d16934fd..9badd1d6685f 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -19,6 +19,17 @@ platform_labels - INTEGER
Possible values: 0 - 1048575
Default: 0
 
+ip_ttl_propagate - BOOL
+   Control whether TTL is propagated from the IPv4/IPv6 header to
+   the MPLS header on imposing labels and propagated from the
+   MPLS header to the IPv4/IPv6 header on popping the last label.
+
+   If disabled, the MPLS transport network will appear as a
+   single hop to transit traffic.
+
+   0 - disabled / RFC 3443 [Short] Pipe Model
+   1 - enabled / RFC 3443 Uniform Model (default)
+
 conf//input - BOOL
Control whether packets can be input on this interface.
 
diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
index d29203651c01..08652eedabb2 100644
--- a/include/net/netns/mpls.h
+++ b/include/net/netns/mpls.h
@@ -9,8 +9,10 @@ struct mpls_route;
 struct ctl_table_header;
 
 struct netns_mpls {
+   int ip_ttl_propagate;
size_t platform_labels;
struct mpls_route __rcu * __rcu *platform_label;
+
struct ctl_table_header *ctl;
 };
 
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 6546917d605a..30fb25e851db 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -319,6 +319,7 @@ enum rtattr_type_t {
RTA_EXPIRES,
RTA_PAD,
RTA_UID,
+   RTA_TTL_PROPAGATE,
__RTA_MAX
 };
 
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 3818686182b2..0e1046f21af8 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -32,6 +32,7 @@
 #define MPLS_NEIGH_TABLE_UNSPEC (NEIGH_LINK_TABLE + 1)
 
 static int zero = 0;
+static int one = 1;
 static int label_limit = (1 << 20) - 1;
 
 static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
@@ -220,8 +221,8 @@ static struct mpls_nh *mpls_select_multipath(struct 
mpls_route *rt,
return &rt->rt_nh[nh_index];
 }
 
-static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb,
-   struct mpls_entry_decoded dec)
+static bool mpls_egress(struct net *net, struct mpls_route *rt,
+   struct sk_buff *skb, struct mpls_entry_decoded dec)
 {
enum mpls_payload_type payload_type;
bool success = false;
@@ -246,22 +247,46 @@ static bool mpls_egress(struct mpls_route *rt, struct 
sk_buff *skb,
switch (payload_type) {
case MPT_IPV4: {
struct iphdr *hdr4 = ip_hdr(skb);
+   u8 new_ttl;
skb->protocol = htons(ETH_P_IP);
+
+   /* If propagating TTL, take the decremented TTL from
+* the incoming MPLS header, otherwise decrement the
+* TTL, but only if not 0 to avoid underflow.
+*/
+   if (rt->rt_ttl_propagate == MPLS_TTL_PROP_ENABLED ||
+   (rt->rt_ttl_propagate == MPLS_TTL_PROP_DEFAULT &&
+net->mpls.ip_ttl_propagate))
+   new_ttl = dec.ttl;
+   else
+   new_ttl = hdr4->ttl ? hdr4->ttl - 1 : 0;
+
csum_replace2(&hdr4->check,
  htons(hdr4->ttl << 8),
- htons(dec.ttl << 8));
-   hdr4->ttl = dec.ttl;
+ htons(new_ttl << 8));
+   hdr4->ttl = new_ttl;
success = true;
break;
}
case MPT_IPV6: {
struct ipv6hdr *hdr6 = ipv6_hdr(skb);
 

[PATCH net-next v3 0/2] mpls: allow TTL propagation to/from IP packets to be configured

2017-03-10 Thread Robert Shearman
It is sometimes desirable to present an MPLS transport network as a
single hop to traffic transiting it because it prevents confusion when
diagnosing failures. An example of where confusion can be generated is
when addresses used in the provider network overlap with addresses in
the overlay network and the addresses get exposed through ICMP errors
generated as packets transit the provider network.

In addition, RFC 3443 defines two methods of deriving TTL for an
outgoing packet: Uniform Model where the TTL is propagated to/from the
MPLS header and both Pipe Models and Short Pipe Models (with and
without PHP) where the TTL is not propagated to/from the MPLS header.

Changes in v3:
 - decrement ttl on popping last label when not doing ttl propagation,
   as suggested by David Ahern.
 - add comment to describe what the somewhat complex conditionals are
   doing to work out what ttl to use in mpls_iptunnel.c.
 - rearrange fields fields in struct netns_mpls to keep the platform
   label fields together, as suggested by David Ahern.

Changes in v2:
 - add references to RFC 3443 as suggested by David Ahern
 - fix setting of skb->protocol as noticed by David Ahern
 - implement per-route/per-LWT configurability as suggested by Eric
   Biederman
 - split into two patches for ease of review

Robert Shearman (2):
  mpls: allow TTL propagation to IP packets to be configured
  mpls: allow TTL propagation from IP packets to be configured

 Documentation/networking/mpls-sysctl.txt | 19 +++
 include/net/mpls_iptunnel.h  |  2 +
 include/net/netns/mpls.h |  3 +
 include/uapi/linux/mpls_iptunnel.h   |  2 +
 include/uapi/linux/rtnetlink.h   |  1 +
 net/mpls/af_mpls.c   | 98 +---
 net/mpls/internal.h  |  7 +++
 net/mpls/mpls_iptunnel.c | 73 +++-
 8 files changed, 184 insertions(+), 21 deletions(-)

-- 
2.1.4



Re: [PATCH net-next v2 2/2] mpls: allow TTL propagation from IP packets to be configured

2017-03-10 Thread Robert Shearman

On 10/03/17 02:54, David Ahern wrote:

On 3/7/17 5:46 PM, Robert Shearman wrote:

@@ -78,6 +70,29 @@ static int mpls_xmit(struct sk_buff *skb)

tun_encap_info = mpls_lwtunnel_encap(dst->lwtstate);

+   /* Obtain the ttl */
+   if (dst->ops->family == AF_INET) {
+   if (tun_encap_info->ttl_propagate == MPLS_TTL_PROP_DISABLED)
+   ttl = tun_encap_info->default_ttl;
+   else if (tun_encap_info->ttl_propagate == MPLS_TTL_PROP_DEFAULT 
&&
+!net->mpls.ip_ttl_propagate)
+   ttl = net->mpls.default_ttl;
+   else
+   ttl = ip_hdr(skb)->ttl;


After staring at that for a while, an explanation above this if {} else
{} section on the ttl selection will be very helpful.



Ok, would a comment like the following improve things sufficiently?

/* Obtain the ttl using the following set of rules.
 *
 * LWT ttl propagation setting:
 *  - disabled => use default TTL value from LWT
 *  - enabled  => use TTL value from IPv4/IPv6 header
 *  - default  =>
 *   Global ttl propagation setting:
 *- disabled => use default TTL value from global setting
 *- enabled => use TTL value from IPv4/IPv6 header
 */

Thanks,
Rob


Re: [PATCH net-next v2 1/2] mpls: allow TTL propagation to IP packets to be configured

2017-03-10 Thread Robert Shearman

On 10/03/17 02:00, David Ahern wrote:

On 3/7/17 5:46 PM, Robert Shearman wrote:

@@ -244,24 +245,33 @@ static bool mpls_egress(struct mpls_route *rt, struct 
sk_buff *skb,
payload_type = ip_hdr(skb)->version;

switch (payload_type) {
-   case MPT_IPV4: {
-   struct iphdr *hdr4 = ip_hdr(skb);
+   case MPT_IPV4:
+   if (rt->rt_ttl_propagate == MPLS_TTL_PROP_ENABLED ||
+   (rt->rt_ttl_propagate == MPLS_TTL_PROP_DEFAULT &&
+net->mpls.ip_ttl_propagate)) {
+   struct iphdr *hdr4 = ip_hdr(skb);
+
+   csum_replace2(&hdr4->check,
+ htons(hdr4->ttl << 8),
+ htons(dec.ttl << 8));
+   hdr4->ttl = dec.ttl;
+   }
skb->protocol = htons(ETH_P_IP);
-   csum_replace2(&hdr4->check,
- htons(hdr4->ttl << 8),
- htons(dec.ttl << 8));
-   hdr4->ttl = dec.ttl;
success = true;
break;
-   }
-   case MPT_IPV6: {
-   struct ipv6hdr *hdr6 = ipv6_hdr(skb);
+   case MPT_IPV6:
+   if (rt->rt_ttl_propagate == MPLS_TTL_PROP_ENABLED ||
+   (rt->rt_ttl_propagate == MPLS_TTL_PROP_DEFAULT &&
+net->mpls.ip_ttl_propagate)) {
+   struct ipv6hdr *hdr6 = ipv6_hdr(skb);
+
+   hdr6->hop_limit = dec.ttl;
+   }
skb->protocol = htons(ETH_P_IPV6);
-   hdr6->hop_limit = dec.ttl;
success = true;
break;
-   }


What decrements the TTL if it is not propagated from MPLS to IP?



Good point. Will address in v3.

Thanks,
Rob


Re: [PATCH net-next v2 1/2] mpls: allow TTL propagation to IP packets to be configured

2017-03-10 Thread Robert Shearman

On 10/03/17 02:40, David Ahern wrote:

On 3/7/17 5:46 PM, Robert Shearman wrote:

diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
index d29203651c01..58e0e46c4a5c 100644
--- a/include/net/netns/mpls.h
+++ b/include/net/netns/mpls.h
@@ -10,7 +10,9 @@ struct ctl_table_header;

 struct netns_mpls {
size_t platform_labels;
+   int ip_ttl_propagate;
struct mpls_route __rcu * __rcu *platform_label;
+
struct ctl_table_header *ctl;
 };



I'd prefer the platform_labels stay with platform_label. ie., put the
new ip_ttl_propagate above platform_labels.



Ok, will do in v3.

Thanks,
Rob


[PATCH iproute2] man: Fix formatting of vrf parameter of ip-link show command

2017-03-09 Thread Robert Shearman
Add missing opening " [" for the vrf parameter.

Signed-off-by: Robert Shearman 
---
 man/man8/ip-link.8.in | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/ip-link.8.in b/man/man8/ip-link.8.in
index 83ffb6357f7d..3f5d57c28885 100644
--- a/man/man8/ip-link.8.in
+++ b/man/man8/ip-link.8.in
@@ -164,7 +164,7 @@ ip-link \- network device configuration
 .B master
 .IR DEVICE " ] ["
 .B type
-.IR ETYPE " ]"
+.IR ETYPE " ] ["
 .B vrf
 .IR NAME " ]"
 
-- 
2.1.4



[PATCH iproute2] iplink: add support for afstats subcommand

2017-03-09 Thread Robert Shearman
Add support for new afstats subcommand. This uses the new
IFLA_STATS_AF_SPEC attribute of RTM_GETSTATS messages to show
per-device, AF-specific stats. At the moment the kernel only supports
MPLS AF stats, so that is all that's implemented here.

The print_num function is exposed from ipaddress.c to be used for
printing the new stats so that the human-readable option, if set, can
be respected.

Example of use:

$ ./ip/ip -f mpls link afstats dev eth1
3: eth1
mpls:
RX: bytes  packets  errors  dropped  noroute
9016   98   0   00
TX: bytes  packets  errors  dropped
7232   113  0   0

Signed-off-by: Robert Shearman 
---
 ip/ip_common.h|   2 +
 ip/ipaddress.c|   2 +-
 ip/iplink.c   | 151 ++
 man/man8/ip-link.8.in |  12 
 4 files changed, 166 insertions(+), 1 deletion(-)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index abb2b8d5d537..5a39623aa21d 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -112,3 +112,5 @@ int name_is_vrf(const char *name);
 #ifndef LABEL_MAX_MASK
 #define LABEL_MAX_MASK  0xFU
 #endif
+
+void print_num(FILE *fp, unsigned int width, uint64_t count);
diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index 242c6ea0b4b3..b8d9c7d917fe 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -418,7 +418,7 @@ static void print_vfinfo(FILE *fp, struct rtattr *vfinfo)
print_vf_stats64(fp, vf[IFLA_VF_STATS]);
 }
 
-static void print_num(FILE *fp, unsigned int width, uint64_t count)
+void print_num(FILE *fp, unsigned int width, uint64_t count)
 {
const char *prefix = "kMGTPE";
const unsigned int base = use_iec ? 1024 : 1000;
diff --git a/ip/iplink.c b/ip/iplink.c
index 00fed9006ea6..866ad723f4a2 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "rt_names.h"
 #include "utils.h"
@@ -99,6 +100,7 @@ void iplink_usage(void)
"   ip link show [ DEVICE | group GROUP ] [up] [master DEV] 
[vrf NAME] [type TYPE]\n");
 
fprintf(stderr, "\n   ip link xstats type TYPE [ ARGS ]\n");
+   fprintf(stderr, "\n   ip link afstats [ dev DEVICE ]\n");
 
if (iplink_have_newlink()) {
fprintf(stderr,
@@ -1364,6 +1366,150 @@ static int do_set(int argc, char **argv)
 }
 #endif /* IPLINK_IOCTL_COMPAT */
 
+static void print_mpls_stats(FILE *fp, struct rtattr *attr)
+{
+   struct rtattr *mrtb[MPLS_STATS_MAX+1];
+   struct mpls_link_stats *stats;
+
+   parse_rtattr(mrtb, MPLS_STATS_MAX, RTA_DATA(attr),
+RTA_PAYLOAD(attr));
+   if (!mrtb[MPLS_STATS_LINK])
+   return;
+
+   stats = RTA_DATA(mrtb[MPLS_STATS_LINK]);
+
+   fprintf(fp, "mpls:\n");
+   fprintf(fp, "RX: bytes  packets  errors  dropped  noroute\n");
+   fprintf(fp, "");
+   print_num(fp, 10, stats->rx_bytes);
+   print_num(fp, 8, stats->rx_packets);
+   print_num(fp, 7, stats->rx_errors);
+   print_num(fp, 8, stats->rx_dropped);
+   print_num(fp, 7, stats->rx_noroute);
+   fprintf(fp, "\n");
+   fprintf(fp, "TX: bytes  packets  errors  dropped\n");
+   fprintf(fp, "");
+   print_num(fp, 10, stats->tx_bytes);
+   print_num(fp, 8, stats->tx_packets);
+   print_num(fp, 7, stats->tx_errors);
+   print_num(fp, 7, stats->tx_dropped);
+   fprintf(fp, "\n");
+}
+
+static void print_af_stats_attr(FILE *fp, int ifindex, struct rtattr *attr)
+{
+   bool if_printed = false;
+   struct rtattr *i;
+   int rem;
+
+   rem = RTA_PAYLOAD(attr);
+   for (i = RTA_DATA(attr); RTA_OK(i, rem); i = RTA_NEXT(i, rem)) {
+   if (preferred_family != AF_UNSPEC &&
+   i->rta_type != preferred_family)
+   continue;
+
+   if (!if_printed) {
+   fprintf(fp, "%u: %s\n", ifindex,
+   ll_index_to_name(ifindex));
+   if_printed = true;
+   }
+
+   switch (i->rta_type) {
+   case AF_MPLS:
+   print_mpls_stats(fp, i);
+   break;
+   default:
+   fprintf(fp, "unknown af(%d)\n", i->rta_type);
+   break;
+   }
+   }
+}
+
+struct af_stats_ctx {
+   FILE *fp;
+   int ifindex;
+};
+
+static int print_af_stats(const struct sockaddr_nl *who,
+ struct nlmsghdr *n,
+ void *arg)
+{
+   struct if_stats_msg *ifsm = NLMSG_DATA(n);
+   struct rtattr *tb[IFLA_STATS_MAX+1];
+   int l

[PATCH net-next v2 2/2] mpls: allow TTL propagation from IP packets to be configured

2017-03-07 Thread Robert Shearman
Allow TTL propagation from IP packets to MPLS packets to be
configured. Add a new optional LWT attribute, MPLS_IPTUNNEL_TTL, which
allows the TTL to be set in the resulting MPLS packet, with the value
of 0 having the semantics of enabling propagation of the TTL from the
IP header (i.e. non-zero values disable propagation).

Also allow the configuration to be overridden globally by reusing the
same sysctl to control whether the TTL is propagated from IP packets
into the MPLS header. If the per-LWT attribute is set then it
overrides the global configuration. If the TTL isn't propagated then a
default TTL value is used which can be configured via a new sysctl,
"net.mpls.default_ttl". This is kept separate from the configuration
of whether IP TTL propagation is enabled as it can be used in the
future when non-IP payloads are supported (i.e. where there is no
payload TTL that can be propagated).

Signed-off-by: Robert Shearman 
---
 Documentation/networking/mpls-sysctl.txt |  8 
 include/net/mpls_iptunnel.h  |  2 +
 include/net/netns/mpls.h |  1 +
 include/uapi/linux/mpls_iptunnel.h   |  2 +
 net/mpls/af_mpls.c   | 11 ++
 net/mpls/mpls_iptunnel.c | 64 +---
 6 files changed, 75 insertions(+), 13 deletions(-)

diff --git a/Documentation/networking/mpls-sysctl.txt 
b/Documentation/networking/mpls-sysctl.txt
index 9badd1d6685f..2f24a1912a48 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -30,6 +30,14 @@ ip_ttl_propagate - BOOL
0 - disabled / RFC 3443 [Short] Pipe Model
1 - enabled / RFC 3443 Uniform Model (default)
 
+default_ttl - BOOL
+   Default TTL value to use for MPLS packets where it cannot be
+   propagated from an IP header, either because one isn't present
+   or ip_ttl_propagate has been disabled.
+
+   Possible values: 1 - 255
+   Default: 255
+
 conf//input - BOOL
Control whether packets can be input on this interface.
 
diff --git a/include/net/mpls_iptunnel.h b/include/net/mpls_iptunnel.h
index 179253f9dcfd..a18af6a16eb5 100644
--- a/include/net/mpls_iptunnel.h
+++ b/include/net/mpls_iptunnel.h
@@ -19,6 +19,8 @@
 struct mpls_iptunnel_encap {
u32 label[MAX_NEW_LABELS];
u8  labels;
+   u8  ttl_propagate;
+   u8  default_ttl;
 };
 
 static inline struct mpls_iptunnel_encap *mpls_lwtunnel_encap(struct 
lwtunnel_state *lwtstate)
diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
index 58e0e46c4a5c..1b68aed6e1b9 100644
--- a/include/net/netns/mpls.h
+++ b/include/net/netns/mpls.h
@@ -11,6 +11,7 @@ struct ctl_table_header;
 struct netns_mpls {
size_t platform_labels;
int ip_ttl_propagate;
+   int default_ttl;
struct mpls_route __rcu * __rcu *platform_label;
 
struct ctl_table_header *ctl;
diff --git a/include/uapi/linux/mpls_iptunnel.h 
b/include/uapi/linux/mpls_iptunnel.h
index d80a0498f77e..f5e45095b0bb 100644
--- a/include/uapi/linux/mpls_iptunnel.h
+++ b/include/uapi/linux/mpls_iptunnel.h
@@ -16,11 +16,13 @@
 /* MPLS tunnel attributes
  * [RTA_ENCAP] = {
  * [MPLS_IPTUNNEL_DST]
+ * [MPLS_IPTUNNEL_TTL]
  * }
  */
 enum {
MPLS_IPTUNNEL_UNSPEC,
MPLS_IPTUNNEL_DST,
+   MPLS_IPTUNNEL_TTL,
__MPLS_IPTUNNEL_MAX,
 };
 #define MPLS_IPTUNNEL_MAX (__MPLS_IPTUNNEL_MAX - 1)
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index d4a51da8a0ce..a8710d334a60 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -34,6 +34,7 @@
 static int zero = 0;
 static int one = 1;
 static int label_limit = (1 << 20) - 1;
+static int ttl_max = 255;
 
 static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
   struct nlmsghdr *nlh, struct net *net, u32 portid,
@@ -2027,6 +2028,15 @@ static const struct ctl_table mpls_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
+   {
+   .procname   = "default_ttl",
+   .data   = MPLS_NS_SYSCTL_OFFSET(mpls.default_ttl),
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = &one,
+   .extra2 = &ttl_max,
+   },
{ }
 };
 
@@ -2038,6 +2048,7 @@ static int mpls_net_init(struct net *net)
net->mpls.platform_labels = 0;
net->mpls.platform_label = NULL;
net->mpls.ip_ttl_propagate = 1;
+   net->mpls.default_ttl = 255;
 
table = kmemdup(mpls_table, sizeof(mpls_table), GFP_KERNEL);
if (table == NULL)
diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index e4e4424f9eb1..da2fb02e0f27 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -29,6 +29,7 @@
 
 static co

[PATCH net-next v2 1/2] mpls: allow TTL propagation to IP packets to be configured

2017-03-07 Thread Robert Shearman
Provide the ability to control on a per-route basis whether the TTL
value from an MPLS packet is propagated to an IPv4/IPv6 packet when
the last label is popped as per the theoretical model in RFC 3443
through a new route attribute, RTA_TTL_PROPAGATE which can be 0 to
mean disable propagation and 1 to mean enable propagation.

In order to provide the ability to change the behaviour for packets
arriving with IPv4/IPv6 Explicit Null labels and to provide an easy
way for a user to change the behaviour for all existing routes without
having to reprogram them, a global knob is provided. This is done
through the addition of a new per-namespace sysctl,
"net.mpls.ip_ttl_propagate", which defaults to enabled. If the
per-route attribute is set (either enabled or disabled) then it
overrides the global configuration.

Signed-off-by: Robert Shearman 
---
 Documentation/networking/mpls-sysctl.txt | 11 
 include/net/netns/mpls.h |  2 +
 include/uapi/linux/rtnetlink.h   |  1 +
 net/mpls/af_mpls.c   | 88 ++--
 net/mpls/internal.h  |  7 +++
 5 files changed, 93 insertions(+), 16 deletions(-)

diff --git a/Documentation/networking/mpls-sysctl.txt 
b/Documentation/networking/mpls-sysctl.txt
index 15d8d16934fd..9badd1d6685f 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -19,6 +19,17 @@ platform_labels - INTEGER
Possible values: 0 - 1048575
Default: 0
 
+ip_ttl_propagate - BOOL
+   Control whether TTL is propagated from the IPv4/IPv6 header to
+   the MPLS header on imposing labels and propagated from the
+   MPLS header to the IPv4/IPv6 header on popping the last label.
+
+   If disabled, the MPLS transport network will appear as a
+   single hop to transit traffic.
+
+   0 - disabled / RFC 3443 [Short] Pipe Model
+   1 - enabled / RFC 3443 Uniform Model (default)
+
 conf//input - BOOL
Control whether packets can be input on this interface.
 
diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
index d29203651c01..58e0e46c4a5c 100644
--- a/include/net/netns/mpls.h
+++ b/include/net/netns/mpls.h
@@ -10,7 +10,9 @@ struct ctl_table_header;
 
 struct netns_mpls {
size_t platform_labels;
+   int ip_ttl_propagate;
struct mpls_route __rcu * __rcu *platform_label;
+
struct ctl_table_header *ctl;
 };
 
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 6546917d605a..30fb25e851db 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -319,6 +319,7 @@ enum rtattr_type_t {
RTA_EXPIRES,
RTA_PAD,
RTA_UID,
+   RTA_TTL_PROPAGATE,
__RTA_MAX
 };
 
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 3818686182b2..d4a51da8a0ce 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -32,6 +32,7 @@
 #define MPLS_NEIGH_TABLE_UNSPEC (NEIGH_LINK_TABLE + 1)
 
 static int zero = 0;
+static int one = 1;
 static int label_limit = (1 << 20) - 1;
 
 static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
@@ -220,8 +221,8 @@ static struct mpls_nh *mpls_select_multipath(struct 
mpls_route *rt,
return &rt->rt_nh[nh_index];
 }
 
-static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb,
-   struct mpls_entry_decoded dec)
+static bool mpls_egress(struct net *net, struct mpls_route *rt,
+   struct sk_buff *skb, struct mpls_entry_decoded dec)
 {
enum mpls_payload_type payload_type;
bool success = false;
@@ -244,24 +245,33 @@ static bool mpls_egress(struct mpls_route *rt, struct 
sk_buff *skb,
payload_type = ip_hdr(skb)->version;
 
switch (payload_type) {
-   case MPT_IPV4: {
-   struct iphdr *hdr4 = ip_hdr(skb);
+   case MPT_IPV4:
+   if (rt->rt_ttl_propagate == MPLS_TTL_PROP_ENABLED ||
+   (rt->rt_ttl_propagate == MPLS_TTL_PROP_DEFAULT &&
+net->mpls.ip_ttl_propagate)) {
+   struct iphdr *hdr4 = ip_hdr(skb);
+
+   csum_replace2(&hdr4->check,
+ htons(hdr4->ttl << 8),
+ htons(dec.ttl << 8));
+   hdr4->ttl = dec.ttl;
+   }
skb->protocol = htons(ETH_P_IP);
-   csum_replace2(&hdr4->check,
- htons(hdr4->ttl << 8),
- htons(dec.ttl << 8));
-   hdr4->ttl = dec.ttl;
success = true;
break;
-   }
-   case MPT_IPV6: {
-   struct ipv6hdr *hdr6 = ipv6_hdr(skb);
+   case MPT_IPV6:
+   if (rt->rt_ttl_propagate == MPLS_TTL_PROP_ENABLED ||
+ 

[PATCH net-next v2 0/2] mpls: allow TTL propagation to/from IP packets to be configured

2017-03-07 Thread Robert Shearman
It is sometimes desirable to present an MPLS transport network as a
single hop to traffic transiting it because it prevents confusion when
diagnosing failures. An example of where confusion can be generated is
when addresses used in the provider network overlap with addresses in
the overlay network and the addresses get exposed through ICMP errors
generated as packets transit the provider network.

In addition, RFC 3443 defines two methods of deriving TTL for an
outgoing packet: Uniform Model where the TTL is propagated to/from the
MPLS header and both Pipe Models and Short Pipe Models (with and
without PHP) where the TTL is not propagated to/from the MPLS header.

Changes in v2:
 - add references to RFC 3443 as suggested by David Ahern
 - fix setting of skb->protocol as noticed by David Ahern
 - implement per-route/per-LWT configurability as suggested by Eric
   Biederman
 - split into two patches for ease of review

Robert Shearman (2):
  mpls: allow TTL propagation to IP packets to be configured
  mpls: allow TTL propagation from IP packets to be configured

 Documentation/networking/mpls-sysctl.txt | 19 ++
 include/net/mpls_iptunnel.h  |  2 +
 include/net/netns/mpls.h |  3 +
 include/uapi/linux/mpls_iptunnel.h   |  2 +
 include/uapi/linux/rtnetlink.h   |  1 +
 net/mpls/af_mpls.c   | 99 ++--
 net/mpls/internal.h  |  7 +++
 net/mpls/mpls_iptunnel.c | 64 -
 8 files changed, 168 insertions(+), 29 deletions(-)

-- 
2.1.4



Re: [PATCH net-next v2] net: mpls: Add support for netconf

2017-02-20 Thread Robert Shearman

On 20/02/17 16:03, David Ahern wrote:

Add netconf support to MPLS. Allows userpsace to learn and be notified
of changes to 'input' enable setting per interface.



Acked-by: Robert Shearman 


Acked-by: Nicolas Dichtel 
Signed-off-by: David Ahern 


Re: [PATCH net-next] net: mpls: Add support for netconf

2017-02-20 Thread Robert Shearman



On 20/02/17 14:53, David Ahern wrote:

On 2/20/17 3:01 AM, Robert Shearman wrote:

On 18/02/17 01:36, David Ahern wrote:

Add netconf support to MPLS. Allows userpsace to learn and be notified
of changes to 'input' enable setting per interface.

Signed-off-by: David Ahern 
---
 include/uapi/linux/netconf.h   |   1 +
 include/uapi/linux/rtnetlink.h |   2 +
 net/mpls/af_mpls.c | 212
-
 net/mpls/internal.h|   2 +-
 4 files changed, 214 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h
index 45dfad509c4d..159d91c9c2a3 100644
--- a/include/uapi/linux/netconf.h
+++ b/include/uapi/linux/netconf.h
@@ -16,6 +16,7 @@ enum {
 NETCONFA_MC_FORWARDING,
 NETCONFA_PROXY_NEIGH,
 NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
+NETCONFA_MPLS_INPUT_ENABLED,


How about removing "MPLS_" the name? That way it would be usable by
other AFs too, e.g. for the disable_ipv6 sysctl. Perhaps "ENABLED" is
redundant as well.



I could drop the MPLS to NETCONFA_INPUT_ENABLED; just NETCONFA_INPUT is
not very descriptive.

disable_ipv6 has a completely different meaning, so re-use there does
not seem appropriate.


That's a fair point - disable_ipv6 does more than just disabling the 
input of IPv6 packets. However, the point still stands in the general 
sense - why tie the attribute name to an AF if it's not necessary?


Thanks,
Rob


Thanks for the review.



Re: [PATCH net-next] net: mpls: Add support for netconf

2017-02-20 Thread Robert Shearman

On 18/02/17 01:36, David Ahern wrote:

Add netconf support to MPLS. Allows userpsace to learn and be notified
of changes to 'input' enable setting per interface.

Signed-off-by: David Ahern 
---
 include/uapi/linux/netconf.h   |   1 +
 include/uapi/linux/rtnetlink.h |   2 +
 net/mpls/af_mpls.c | 212 -
 net/mpls/internal.h|   2 +-
 4 files changed, 214 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/netconf.h b/include/uapi/linux/netconf.h
index 45dfad509c4d..159d91c9c2a3 100644
--- a/include/uapi/linux/netconf.h
+++ b/include/uapi/linux/netconf.h
@@ -16,6 +16,7 @@ enum {
NETCONFA_MC_FORWARDING,
NETCONFA_PROXY_NEIGH,
NETCONFA_IGNORE_ROUTES_WITH_LINKDOWN,
+   NETCONFA_MPLS_INPUT_ENABLED,


How about removing "MPLS_" the name? That way it would be usable by 
other AFs too, e.g. for the disable_ipv6 sysctl. Perhaps "ENABLED" is 
redundant as well.


Thanks,
Rob


Re: [PATCH net-next] mpls: allow TTL propagation to/from IP packets to be configured

2017-01-31 Thread Robert Shearman

On 31/01/17 01:09, David Ahern wrote:

On 1/30/17 1:36 PM, Robert Shearman wrote:

@@ -243,24 +245,29 @@ static bool mpls_egress(struct mpls_route *rt, struct 
sk_buff *skb,
payload_type = ip_hdr(skb)->version;

switch (payload_type) {
-   case MPT_IPV4: {
-   struct iphdr *hdr4 = ip_hdr(skb);
-   skb->protocol = htons(ETH_P_IP);
-   csum_replace2(&hdr4->check,
- htons(hdr4->ttl << 8),
- htons(dec.ttl << 8));
-   hdr4->ttl = dec.ttl;
+   case MPT_IPV4:
+   if (net->mpls.ip_ttl_propagate) {
+   struct iphdr *hdr4 = ip_hdr(skb);
+
+   skb->protocol = htons(ETH_P_IP);


The protocol setting here and ...


+   csum_replace2(&hdr4->check,
+ htons(hdr4->ttl << 8),
+ htons(dec.ttl << 8));
+   hdr4->ttl = dec.ttl;
+   }
success = true;
break;
-   }
-   case MPT_IPV6: {
-   struct ipv6hdr *hdr6 = ipv6_hdr(skb);
-   skb->protocol = htons(ETH_P_IPV6);
-   hdr6->hop_limit = dec.ttl;
+   case MPT_IPV6:
+   if (net->mpls.ip_ttl_propagate) {
+   struct ipv6hdr *hdr6 = ipv6_hdr(skb);
+
+   skb->protocol = htons(ETH_P_IPV6);


here need to be done outside of net->mpls.ip_ttl_propagate otherwise ...


+   hdr6->hop_limit = dec.ttl;
+   }
success = true;
break;
-   }
case MPT_UNSPEC:
+   /* Should have decided which protocol it is by now */
break;
}



disabling ip_ttl_propagate causes a corrupted packet to show up at the end host 
(after the LSP):


Oops, good catch. Will fix in v2.

Thanks,
Rob


Re: [PATCH net-next] mpls: allow TTL propagation to/from IP packets to be configured

2017-01-31 Thread Robert Shearman

On 31/01/17 00:41, David Ahern wrote:

On 1/30/17 1:36 PM, Robert Shearman wrote:

It is sometimes desirable to present an MPLS transport network as a
single hop to traffic transiting it because it prevents confusion when
diagnosing failures. An example of where confusion can be generated is
when addresses used in the provider network overlap with addresses in
the overlay network and the addresses get exposed through ICMP errors
generated as packets transit the provider network.

Therefore, provide the ability to control whether the TTL value from
an MPLS packet is propagated to an IPv4/IPv6 packet when the last
label is popped through the addition of a new per-namespace sysctl:
"net.mpls.ip_ttl_propagate" which defaults to enabled.

Use the same sysctl to control whether the TTL is propagated from IP
packets into the MPLS header. If the TTL isn't propagated then a
default TTL value is used which can be configured via a new sysctl:
"net.mpls.default_ttl".

Signed-off-by: Robert Shearman 
---
 Documentation/networking/mpls-sysctl.txt | 19 +
 include/net/netns/mpls.h |  3 ++
 net/mpls/af_mpls.c   | 70 
 net/mpls/mpls_iptunnel.c | 12 +-
 4 files changed, 85 insertions(+), 19 deletions(-)

diff --git a/Documentation/networking/mpls-sysctl.txt 
b/Documentation/networking/mpls-sysctl.txt
index 15d8d16934fd..b8f0725ff09e 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -19,6 +19,25 @@ platform_labels - INTEGER
Possible values: 0 - 1048575
Default: 0

+ip_ttl_propagate - BOOL
+   Control whether TTL is propagated from the IPv4/IPv6 header to
+   the MPLS header on imposing labels and propagated from the
+   MPLS header to the IPv4/IPv6 header on popping the last label.
+
+   If disabled, the MPLS transport network will appear as a
+   single hop to transit traffic.
+
+   0 - disabled
+   1 - enabled (default)
+


It seems like you are going after RFC 3443 with this change. Can you add 
comment to that effect? i.e.,  ip_ttl_propagate enabled is the Uniform Model 
and ip_ttl_propagate disabled is the Short Pipe Model.



Good idea, will add it in the appropriate place depending on the chosen API.

Thanks,
Rob


Re: [PATCH net-next] mpls: allow TTL propagation to/from IP packets to be configured

2017-01-31 Thread Robert Shearman

On 31/01/17 00:17, Eric W. Biederman wrote:

Robert Shearman  writes:


It is sometimes desirable to present an MPLS transport network as a
single hop to traffic transiting it because it prevents confusion when
diagnosing failures. An example of where confusion can be generated is
when addresses used in the provider network overlap with addresses in
the overlay network and the addresses get exposed through ICMP errors
generated as packets transit the provider network.

Therefore, provide the ability to control whether the TTL value from
an MPLS packet is propagated to an IPv4/IPv6 packet when the last
label is popped through the addition of a new per-namespace sysctl:
"net.mpls.ip_ttl_propagate" which defaults to enabled.

Use the same sysctl to control whether the TTL is propagated from IP
packets into the MPLS header. If the TTL isn't propagated then a
default TTL value is used which can be configured via a new sysctl:
"net.mpls.default_ttl".


Instead of having a global sysctl can we please have a different way
to configure the ingress/egress?

My general memory is that this makes sense for a slightly different
tunnel type.   Making it a per mpls tunnel property instead of global
property feels like it should be much more maintainable.


RFC 3443 that David Ahern referenced does indeed infer that this should 
be a per-LSP property. However, it says:



   We also note here that signaling the LSP type (Pipe, Short Pipe or
   Uniform Model) is out of the scope of this document, and that is also
   not addressed in the current versions of the label distribution
   protocols, e.g. LDP [MPLS-LDP] and RSVP-TE [MPLS-RSVP].  Currently,
   the LSP type is configured by the network operator manually by means
   of either a command line or network management interface.


AIUI, the situation of label distribution protocols not signaling this 
property hasn't changed from when this RFC has written, which limits the 
usefulness of a per-LSP property, and perhaps also indicates a lack of 
desire from users of this.


Do you still feel it's worth implementing on a per-LSP basis? If so, any 
opinion on how it should be done for the pop case? Either a new per-path 
RTA attribute can be added, e.g. RTA_TTL_PROPAGATE, or a new rtnh flag 
could be added, e.g. RTNH_F_TTL_PROPAGATE.



Similarly with the related behavior of what to do if the mpls ttl is
exhausted during the trip through the tunnel.  Drop or dig through the
packet and send an ICMP error message at the ip layer.


That's an interesting suggestion, but I don't think it will be useful 
when carrying another LSP over the LSP in question, since the LSR will 
have no idea what the label is being used for (i.e. the payload). If 
there is only one label in the packet then the router should know what 
the payload is of the label and since this is implicitly IPv4 or IPv6 at 
the moment (since those are the only types of traffic for which the 
labels can be used) then surely the ICMP should always be generated in 
that case?


Thanks,
Rob


[PATCH net-next] mpls: allow TTL propagation to/from IP packets to be configured

2017-01-30 Thread Robert Shearman
It is sometimes desirable to present an MPLS transport network as a
single hop to traffic transiting it because it prevents confusion when
diagnosing failures. An example of where confusion can be generated is
when addresses used in the provider network overlap with addresses in
the overlay network and the addresses get exposed through ICMP errors
generated as packets transit the provider network.

Therefore, provide the ability to control whether the TTL value from
an MPLS packet is propagated to an IPv4/IPv6 packet when the last
label is popped through the addition of a new per-namespace sysctl:
"net.mpls.ip_ttl_propagate" which defaults to enabled.

Use the same sysctl to control whether the TTL is propagated from IP
packets into the MPLS header. If the TTL isn't propagated then a
default TTL value is used which can be configured via a new sysctl:
"net.mpls.default_ttl".

Signed-off-by: Robert Shearman 
---
 Documentation/networking/mpls-sysctl.txt | 19 +
 include/net/netns/mpls.h |  3 ++
 net/mpls/af_mpls.c   | 70 
 net/mpls/mpls_iptunnel.c | 12 +-
 4 files changed, 85 insertions(+), 19 deletions(-)

diff --git a/Documentation/networking/mpls-sysctl.txt 
b/Documentation/networking/mpls-sysctl.txt
index 15d8d16934fd..b8f0725ff09e 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -19,6 +19,25 @@ platform_labels - INTEGER
Possible values: 0 - 1048575
Default: 0
 
+ip_ttl_propagate - BOOL
+   Control whether TTL is propagated from the IPv4/IPv6 header to
+   the MPLS header on imposing labels and propagated from the
+   MPLS header to the IPv4/IPv6 header on popping the last label.
+
+   If disabled, the MPLS transport network will appear as a
+   single hop to transit traffic.
+
+   0 - disabled
+   1 - enabled (default)
+
+default_ttl - BOOL
+   Default TTL value to use for MPLS packets where it cannot be
+   propagated from an IP header, either because one isn't present
+   or ip_ttl_propagate has been disabled.
+
+   Possible values: 1 - 255
+   Default: 255
+
 conf//input - BOOL
Control whether packets can be input on this interface.
 
diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
index d29203651c01..1b68aed6e1b9 100644
--- a/include/net/netns/mpls.h
+++ b/include/net/netns/mpls.h
@@ -10,7 +10,10 @@ struct ctl_table_header;
 
 struct netns_mpls {
size_t platform_labels;
+   int ip_ttl_propagate;
+   int default_ttl;
struct mpls_route __rcu * __rcu *platform_label;
+
struct ctl_table_header *ctl;
 };
 
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 64d3bf269a26..bf5f0792e8a2 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -31,7 +31,9 @@
 #define MPLS_NEIGH_TABLE_UNSPEC (NEIGH_LINK_TABLE + 1)
 
 static int zero = 0;
+static int one = 1;
 static int label_limit = (1 << 20) - 1;
+static int ttl_max = 255;
 
 static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
   struct nlmsghdr *nlh, struct net *net, u32 portid,
@@ -219,8 +221,8 @@ static struct mpls_nh *mpls_select_multipath(struct 
mpls_route *rt,
return &rt->rt_nh[nh_index];
 }
 
-static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb,
-   struct mpls_entry_decoded dec)
+static bool mpls_egress(struct net *net, struct mpls_route *rt,
+   struct sk_buff *skb, struct mpls_entry_decoded dec)
 {
enum mpls_payload_type payload_type;
bool success = false;
@@ -243,24 +245,29 @@ static bool mpls_egress(struct mpls_route *rt, struct 
sk_buff *skb,
payload_type = ip_hdr(skb)->version;
 
switch (payload_type) {
-   case MPT_IPV4: {
-   struct iphdr *hdr4 = ip_hdr(skb);
-   skb->protocol = htons(ETH_P_IP);
-   csum_replace2(&hdr4->check,
- htons(hdr4->ttl << 8),
- htons(dec.ttl << 8));
-   hdr4->ttl = dec.ttl;
+   case MPT_IPV4:
+   if (net->mpls.ip_ttl_propagate) {
+   struct iphdr *hdr4 = ip_hdr(skb);
+
+   skb->protocol = htons(ETH_P_IP);
+   csum_replace2(&hdr4->check,
+ htons(hdr4->ttl << 8),
+ htons(dec.ttl << 8));
+   hdr4->ttl = dec.ttl;
+   }
success = true;
break;
-   }
-   case MPT_IPV6: {
-   struct ipv6hdr *hdr6 = ipv6_hdr(skb);
-   skb->protocol = htons(ETH_P_IPV6);
-   hdr6->hop_limit = dec.ttl;
+   case MPT_IPV6:
+   if (net->mpls.ip_ttl_

Re: [PATCH net] net: Avoid receiving packets with an l3mdev on unbound UDP sockets

2017-01-30 Thread Robert Shearman

On 30/01/17 20:01, David Miller wrote:

From: David Ahern 
Date: Mon, 30 Jan 2017 11:52:01 -0700


On 1/26/17 11:02 AM, Robert Shearman wrote:

Packets arriving in a VRF currently are delivered to UDP sockets that
aren't bound to any interface. TCP defaults to not delivering packets
arriving in a VRF to unbound sockets. IP route lookup and socket
transmit both assume that unbound means using the default table and
UDP applications that haven't been changed to be aware of VRFs may not
function correctly in this case since they may not be able to handle
overlapping IP address ranges, or be able to send packets back to the
original sender if required.

So add a sysctl, udp_l3mdev_accept, to control this behaviour with it
being analgous to the existing tcp_l3mdev_accept, namely to allow a
process to have a VRF-global listen socket. Have this default to off
as this is the behaviour that users will expect, given that there is
no explicit mechanism to set unmodified VRF-unaware application into a
default VRF.

Signed-off-by: Robert Shearman 
---
I've targetted this for the net tree because I believe the expected
behaviour is different enough from the current behaviour to be
considered a bug. However, this should also apply to the net-next tree
as-is if this not deemed a bug.


Does not apply to net-next; collision in sysctl_net_ipv4.c

As for the code change, I have updated my unit tests and they all pass with 
this patch. Not sure why I marked my version as not working last November, but 
it is all good now.

Acked-by: David Ahern 
Tested-by: David Ahern 


The conflict was easy enough to fix up, so I did it myself.

Applied to net-next, thanks.


Great, thanks.


[PATCH net] net: Avoid receiving packets with an l3mdev on unbound UDP sockets

2017-01-26 Thread Robert Shearman
Packets arriving in a VRF currently are delivered to UDP sockets that
aren't bound to any interface. TCP defaults to not delivering packets
arriving in a VRF to unbound sockets. IP route lookup and socket
transmit both assume that unbound means using the default table and
UDP applications that haven't been changed to be aware of VRFs may not
function correctly in this case since they may not be able to handle
overlapping IP address ranges, or be able to send packets back to the
original sender if required.

So add a sysctl, udp_l3mdev_accept, to control this behaviour with it
being analgous to the existing tcp_l3mdev_accept, namely to allow a
process to have a VRF-global listen socket. Have this default to off
as this is the behaviour that users will expect, given that there is
no explicit mechanism to set unmodified VRF-unaware application into a
default VRF.

Signed-off-by: Robert Shearman 
---
I've targetted this for the net tree because I believe the expected
behaviour is different enough from the current behaviour to be
considered a bug. However, this should also apply to the net-next tree
as-is if this not deemed a bug.

 Documentation/networking/ip-sysctl.txt |  7 +++
 Documentation/networking/vrf.txt   |  7 ---
 include/net/netns/ipv4.h   |  4 
 net/ipv4/sysctl_net_ipv4.c | 11 +++
 net/ipv4/udp.c | 27 ---
 net/ipv6/udp.c | 27 ---
 6 files changed, 66 insertions(+), 17 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 7dd65c9cf707..fa1f14977a0c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -742,6 +742,13 @@ tcp_challenge_ack_limit - INTEGER
 
 UDP variables:
 
+udp_l3mdev_accept - BOOLEAN
+   Enabling this option allows a "global" bound socket to work
+   across L3 master domains (e.g., VRFs) with packets capable of
+   being received regardless of the L3 domain in which they
+   originated. Only valid when the kernel was compiled with
+   CONFIG_NET_L3_MASTER_DEV.
+
 udp_mem - vector of 3 INTEGERs: min, pressure, max
Number of pages allowed for queueing by all UDP sockets.
 
diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt
index 755dab856392..3918dae964d4 100644
--- a/Documentation/networking/vrf.txt
+++ b/Documentation/networking/vrf.txt
@@ -98,10 +98,11 @@ VRF device:
 
 or to specify the output device using cmsg and IP_PKTINFO.
 
-TCP services running in the default VRF context (ie., not bound to any VRF
-device) can work across all VRF domains by enabling the tcp_l3mdev_accept
-sysctl option:
+TCP & UDP services running in the default VRF context (ie., not bound
+to any VRF device) can work across all VRF domains by enabling the
+tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
 sysctl -w net.ipv4.tcp_l3mdev_accept=1
+sysctl -w net.ipv4.udp_l3mdev_accept=1
 
 netfilter rules on the VRF device can be used to limit access to services
 running in the default VRF context as well.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 0378e88f6fd3..0822dced1b68 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -112,6 +112,10 @@ struct netns_ipv4 {
unsigned int sysctl_tcp_notsent_lowat;
int sysctl_tcp_tw_reuse;
 
+#ifdef CONFIG_NET_L3_MASTER_DEV
+   int sysctl_udp_l3mdev_accept;
+#endif
+
int sysctl_igmp_max_memberships;
int sysctl_igmp_max_msf;
int sysctl_igmp_llm_reports;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index b2fa498b15d1..a2ebbe6211ba 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -971,6 +971,17 @@ static struct ctl_table ipv4_net_table[] = {
.extra2 = &one,
},
 #endif
+#ifdef CONFIG_NET_L3_MASTER_DEV
+   {
+   .procname   = "udp_l3mdev_accept",
+   .data   = &init_net.ipv4.sysctl_udp_l3mdev_accept,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec_minmax,
+   .extra1 = &zero,
+   .extra2 = &one,
+   },
+#endif
{ }
 };
 
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 1307a7c2e544..c7fcb7395ccf 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -134,6 +134,17 @@ EXPORT_SYMBOL(udp_memory_allocated);
 #define MAX_UDP_PORTS 65536
 #define PORTS_PER_CHAIN (MAX_UDP_PORTS / UDP_HTABLE_SIZE_MIN)
 
+/* IPCB reference means this can not be used from early demux */
+static bool udp_lib_exact_dif_match(struct net *net, struct sk_buff *skb)
+{
+#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
+   if (!net->ipv4.sysctl_udp_l3mdev_accept &&
+   skb &&am

[PATCH net v3 1/2] net: Specify the owning module for lwtunnel ops

2017-01-24 Thread Robert Shearman
Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so specify the owning
module for all lwtunnel ops.

Signed-off-by: Robert Shearman 
---
 include/net/lwtunnel.h| 2 ++
 net/core/lwt_bpf.c| 1 +
 net/ipv4/ip_tunnel_core.c | 2 ++
 net/ipv6/ila/ila_lwt.c| 1 +
 net/ipv6/seg6_iptunnel.c  | 1 +
 net/mpls/mpls_iptunnel.c  | 1 +
 6 files changed, 8 insertions(+)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 0b585f1fd340..73dd87647460 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -44,6 +44,8 @@ struct lwtunnel_encap_ops {
int (*get_encap_size)(struct lwtunnel_state *lwtstate);
int (*cmp_encap)(struct lwtunnel_state *a, struct lwtunnel_state *b);
int (*xmit)(struct sk_buff *skb);
+
+   struct module *owner;
 };
 
 #ifdef CONFIG_LWTUNNEL
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 71bb3e2eca08..b3eef90b2df9 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -386,6 +386,7 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
.fill_encap = bpf_fill_encap_info,
.get_encap_size = bpf_encap_nlsize,
.cmp_encap  = bpf_encap_cmp,
+   .owner  = THIS_MODULE,
 };
 
 static int __init bpf_lwt_init(void)
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index fed3d29f9eb3..0fd1976ab63b 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -313,6 +313,7 @@ static const struct lwtunnel_encap_ops ip_tun_lwt_ops = {
.fill_encap = ip_tun_fill_encap_info,
.get_encap_size = ip_tun_encap_nlsize,
.cmp_encap = ip_tun_cmp_encap,
+   .owner = THIS_MODULE,
 };
 
 static const struct nla_policy ip6_tun_policy[LWTUNNEL_IP6_MAX + 1] = {
@@ -403,6 +404,7 @@ static const struct lwtunnel_encap_ops ip6_tun_lwt_ops = {
.fill_encap = ip6_tun_fill_encap_info,
.get_encap_size = ip6_tun_encap_nlsize,
.cmp_encap = ip_tun_cmp_encap,
+   .owner = THIS_MODULE,
 };
 
 void __init ip_tunnel_core_init(void)
diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c
index a7bc54ab46e2..13b5e85fe0d5 100644
--- a/net/ipv6/ila/ila_lwt.c
+++ b/net/ipv6/ila/ila_lwt.c
@@ -238,6 +238,7 @@ static const struct lwtunnel_encap_ops ila_encap_ops = {
.fill_encap = ila_fill_encap_info,
.get_encap_size = ila_encap_nlsize,
.cmp_encap = ila_encap_cmp,
+   .owner = THIS_MODULE,
 };
 
 int ila_lwt_init(void)
diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index 1d60cb132835..c46f8cbf5ab5 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -422,6 +422,7 @@ static const struct lwtunnel_encap_ops seg6_iptun_ops = {
.fill_encap = seg6_fill_encap_info,
.get_encap_size = seg6_encap_nlsize,
.cmp_encap = seg6_encap_cmp,
+   .owner = THIS_MODULE,
 };
 
 int __init seg6_iptunnel_init(void)
diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index 2f7ccd934416..1d281c1ff7c1 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -215,6 +215,7 @@ static const struct lwtunnel_encap_ops mpls_iptun_ops = {
.fill_encap = mpls_fill_encap_info,
.get_encap_size = mpls_encap_nlsize,
.cmp_encap = mpls_encap_cmp,
+   .owner = THIS_MODULE,
 };
 
 static int __init mpls_iptunnel_init(void)
-- 
2.1.4



[PATCH net v3 0/2] net: Fix oops on state free after lwt module unload

2017-01-24 Thread Robert Shearman
An oops is seen in lwtstate_free after an lwt ops module has been
unloaded. This patchset fixes this by preventing modules implementing
lwtunnel ops from being unloaded whilst there's state alive using
those ops. The first patch adds fills in a new owner field in all lwt
ops and the second patch makes use of this to reference count the
modules as state is built and destroyed using them.

Changes in v3:
 - don't put module reference if try_module_get fails on building state

Changes in v2:
 - specify module owner for all modules as suggested by DaveM
 - reference count all modules building lwt state, not just those ops
   implementing destroy_state, as also suggested by DaveM.
 - rebased on top of David Ahern's lwtunnel changes

Robert Shearman (2):
  net: Specify the owning module for lwtunnel ops
  lwtunnel: Fix oops on state free after encap module unload

 include/net/lwtunnel.h| 2 ++
 net/core/lwt_bpf.c| 1 +
 net/core/lwtunnel.c   | 6 +-
 net/ipv4/ip_tunnel_core.c | 2 ++
 net/ipv6/ila/ila_lwt.c| 1 +
 net/ipv6/seg6_iptunnel.c  | 1 +
 net/mpls/mpls_iptunnel.c  | 1 +
 7 files changed, 13 insertions(+), 1 deletion(-)

-- 
2.1.4



[PATCH net v3 2/2] lwtunnel: Fix oops on state free after encap module unload

2017-01-24 Thread Robert Shearman
When attempting to free lwtunnel state after the module for the encap
has been unloaded an oops occurs:

BUG: unable to handle kernel NULL pointer dereference at 0008
IP: lwtstate_free+0x18/0x40
[..]
task: 88003e372380 task.stack: c91fc000
RIP: 0010:lwtstate_free+0x18/0x40
RSP: 0018:88003fd83e88 EFLAGS: 00010246
RAX:  RBX: 88002bbb3380 RCX: 88000c91a300
[..]
Call Trace:
 
 free_fib_info_rcu+0x195/0x1a0
 ? rt_fibinfo_free+0x50/0x50
 rcu_process_callbacks+0x2d3/0x850
 ? rcu_process_callbacks+0x296/0x850
 __do_softirq+0xe4/0x4cb
 irq_exit+0xb0/0xc0
 smp_apic_timer_interrupt+0x3d/0x50
 apic_timer_interrupt+0x93/0xa0
[..]
Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 
55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 
74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8

The problem is after the module for the encap can be unloaded the
corresponding ops is removed and is thus NULL here.

Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so grab the module
reference for the ops on creating lwtunnel state and of course release
the reference when freeing the state.

Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation")
Signed-off-by: Robert Shearman 
---
 net/core/lwtunnel.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 47b1dd65947b..c23465005f2f 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -115,8 +115,11 @@ int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
ret = -EOPNOTSUPP;
rcu_read_lock();
ops = rcu_dereference(lwtun_encaps[encap_type]);
-   if (likely(ops && ops->build_state))
+   if (likely(ops && ops->build_state && try_module_get(ops->owner))) {
ret = ops->build_state(dev, encap, family, cfg, lws);
+   if (ret)
+   module_put(ops->owner);
+   }
rcu_read_unlock();
 
return ret;
@@ -194,6 +197,7 @@ void lwtstate_free(struct lwtunnel_state *lws)
} else {
kfree(lws);
}
+   module_put(ops->owner);
 }
 EXPORT_SYMBOL(lwtstate_free);
 
-- 
2.1.4



Re: [PATCH net v2 2/2] lwtunnel: Fix oops on state free after encap module unload

2017-01-24 Thread Robert Shearman

On 23/01/17 20:42, David Miller wrote:

From: Robert Shearman 
Date: Sat, 21 Jan 2017 00:21:26 +


@@ -115,8 +115,12 @@ int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
ret = -EOPNOTSUPP;
rcu_read_lock();


Here 'ret' equals -EOPNOTSUPP


ops = rcu_dereference(lwtun_encaps[encap_type]);
-   if (likely(ops && ops->build_state))
-   ret = ops->build_state(dev, encap, family, cfg, lws);
+   if (likely(ops)) {
+   if (likely(try_module_get(ops->owner) && ops->build_state))
+   ret = ops->build_state(dev, encap, family, cfg, lws);
+   if (ret)
+   module_put(ops->owner);


If try_module_get() fails, 'ret' will still be -EOPNOTSUPP and we will
module_put() on a module we did not grab a reference to.

I think you need to adjust the logic here.  You only want to 'put' if
try_module_get() succeeds and ->build_state() returns an error.


Indeed, good point. Will address in a v3 shortly.

Thanks,
Rob


[PATCH net v2 1/2] net: Specify the owning module for lwtunnel ops

2017-01-20 Thread Robert Shearman
Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so specify the owning
module for all lwtunnel ops.

Signed-off-by: Robert Shearman 
---
 include/net/lwtunnel.h| 2 ++
 net/core/lwt_bpf.c| 1 +
 net/ipv4/ip_tunnel_core.c | 2 ++
 net/ipv6/ila/ila_lwt.c| 1 +
 net/ipv6/seg6_iptunnel.c  | 1 +
 net/mpls/mpls_iptunnel.c  | 1 +
 6 files changed, 8 insertions(+)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 0b585f1fd340..73dd87647460 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -44,6 +44,8 @@ struct lwtunnel_encap_ops {
int (*get_encap_size)(struct lwtunnel_state *lwtstate);
int (*cmp_encap)(struct lwtunnel_state *a, struct lwtunnel_state *b);
int (*xmit)(struct sk_buff *skb);
+
+   struct module *owner;
 };
 
 #ifdef CONFIG_LWTUNNEL
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index 71bb3e2eca08..b3eef90b2df9 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -386,6 +386,7 @@ static const struct lwtunnel_encap_ops bpf_encap_ops = {
.fill_encap = bpf_fill_encap_info,
.get_encap_size = bpf_encap_nlsize,
.cmp_encap  = bpf_encap_cmp,
+   .owner  = THIS_MODULE,
 };
 
 static int __init bpf_lwt_init(void)
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c
index fed3d29f9eb3..0fd1976ab63b 100644
--- a/net/ipv4/ip_tunnel_core.c
+++ b/net/ipv4/ip_tunnel_core.c
@@ -313,6 +313,7 @@ static const struct lwtunnel_encap_ops ip_tun_lwt_ops = {
.fill_encap = ip_tun_fill_encap_info,
.get_encap_size = ip_tun_encap_nlsize,
.cmp_encap = ip_tun_cmp_encap,
+   .owner = THIS_MODULE,
 };
 
 static const struct nla_policy ip6_tun_policy[LWTUNNEL_IP6_MAX + 1] = {
@@ -403,6 +404,7 @@ static const struct lwtunnel_encap_ops ip6_tun_lwt_ops = {
.fill_encap = ip6_tun_fill_encap_info,
.get_encap_size = ip6_tun_encap_nlsize,
.cmp_encap = ip_tun_cmp_encap,
+   .owner = THIS_MODULE,
 };
 
 void __init ip_tunnel_core_init(void)
diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c
index a7bc54ab46e2..13b5e85fe0d5 100644
--- a/net/ipv6/ila/ila_lwt.c
+++ b/net/ipv6/ila/ila_lwt.c
@@ -238,6 +238,7 @@ static const struct lwtunnel_encap_ops ila_encap_ops = {
.fill_encap = ila_fill_encap_info,
.get_encap_size = ila_encap_nlsize,
.cmp_encap = ila_encap_cmp,
+   .owner = THIS_MODULE,
 };
 
 int ila_lwt_init(void)
diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index 1d60cb132835..c46f8cbf5ab5 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -422,6 +422,7 @@ static const struct lwtunnel_encap_ops seg6_iptun_ops = {
.fill_encap = seg6_fill_encap_info,
.get_encap_size = seg6_encap_nlsize,
.cmp_encap = seg6_encap_cmp,
+   .owner = THIS_MODULE,
 };
 
 int __init seg6_iptunnel_init(void)
diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index 2f7ccd934416..1d281c1ff7c1 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -215,6 +215,7 @@ static const struct lwtunnel_encap_ops mpls_iptun_ops = {
.fill_encap = mpls_fill_encap_info,
.get_encap_size = mpls_encap_nlsize,
.cmp_encap = mpls_encap_cmp,
+   .owner = THIS_MODULE,
 };
 
 static int __init mpls_iptunnel_init(void)
-- 
2.1.4



[PATCH net v2 0/2] net: Fix oops on state free after lwt module unload

2017-01-20 Thread Robert Shearman
An oops is seen in lwtstate_free after an lwt ops module has been
unloaded. This patchset fixes this by preventing modules implementing
lwtunnel ops from being unloaded whilst there's state alive using
those ops. The first patch adds fills in a new owner field in all lwt
ops and the second patch makes use of this to reference count the
modules as state is built and destroyed using them.

Changes in v2:
 - specify module owner for all modules as suggested by DaveM
 - reference count all modules building lwt state, not just those ops
   implementing destroy_state, as also suggested by DaveM.
 - rebased on top of David Ahern's lwtunnel changes

Robert Shearman (2):
  net: Specify the owning module for lwtunnel ops
  lwtunnel: Fix oops on state free after encap module unload

 include/net/lwtunnel.h| 2 ++
 net/core/lwt_bpf.c| 1 +
 net/core/lwtunnel.c   | 9 +++--
 net/ipv4/ip_tunnel_core.c | 2 ++
 net/ipv6/ila/ila_lwt.c| 1 +
 net/ipv6/seg6_iptunnel.c  | 1 +
 net/mpls/mpls_iptunnel.c  | 1 +
 7 files changed, 15 insertions(+), 2 deletions(-)

-- 
2.1.4



[PATCH net v2 2/2] lwtunnel: Fix oops on state free after encap module unload

2017-01-20 Thread Robert Shearman
When attempting to free lwtunnel state after the module for the encap
has been unloaded an oops occurs:

BUG: unable to handle kernel NULL pointer dereference at 0008
IP: lwtstate_free+0x18/0x40
[..]
task: 88003e372380 task.stack: c91fc000
RIP: 0010:lwtstate_free+0x18/0x40
RSP: 0018:88003fd83e88 EFLAGS: 00010246
RAX:  RBX: 88002bbb3380 RCX: 88000c91a300
[..]
Call Trace:
 
 free_fib_info_rcu+0x195/0x1a0
 ? rt_fibinfo_free+0x50/0x50
 rcu_process_callbacks+0x2d3/0x850
 ? rcu_process_callbacks+0x296/0x850
 __do_softirq+0xe4/0x4cb
 irq_exit+0xb0/0xc0
 smp_apic_timer_interrupt+0x3d/0x50
 apic_timer_interrupt+0x93/0xa0
[..]
Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 
55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 
74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8

The problem is after the module for the encap is unloaded the
corresponding ops is removed and thus is NULL here.

Modules implementing lwtunnel ops should not be allowed to unload
while there is state alive using those ops, so grab the module
reference for the ops on creating lwtunnel state and of course release
the reference when freeing the state.

Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation")
Signed-off-by: Robert Shearman 
---
 net/core/lwtunnel.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 47b1dd65947b..ebfaffaa777b 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -115,8 +115,12 @@ int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
ret = -EOPNOTSUPP;
rcu_read_lock();
ops = rcu_dereference(lwtun_encaps[encap_type]);
-   if (likely(ops && ops->build_state))
-   ret = ops->build_state(dev, encap, family, cfg, lws);
+   if (likely(ops)) {
+   if (likely(try_module_get(ops->owner) && ops->build_state))
+   ret = ops->build_state(dev, encap, family, cfg, lws);
+   if (ret)
+   module_put(ops->owner);
+   }
rcu_read_unlock();
 
return ret;
@@ -194,6 +198,7 @@ void lwtstate_free(struct lwtunnel_state *lws)
} else {
kfree(lws);
}
+   module_put(ops->owner);
 }
 EXPORT_SYMBOL(lwtstate_free);
 
-- 
2.1.4



Re: [PATCH net 0/2] net: Fix oops on state free after lwt module unload

2017-01-20 Thread Robert Shearman

On 20/01/17 17:03, David Miller wrote:

From: Robert Shearman 
Date: Wed, 18 Jan 2017 15:32:01 +


This patchset fixes an oops in lwtstate_free and a memory leak that
would otherwise be exposed by ensuring that references are taken on
modules that need to stay around to clean up lwt state. To faciliate
this all ops that implement destroy_state and that can be configured
to build as a module are changed specify the owner module in the
ops. The intersection of those two sets is just ila at the moment.


Two things:

1) Under no circumstances should we allow a lwtunnel ops implementing
   module to unload while there is a rule using those ops which is
   alive.

   Therefore, we should not special case the destroy op.  We should
   unconditionally grab the module reference.

2) Please add the new 'owner' field and add an appropriate assignment
   for ops->owner to _every_ lwtunnel implementation, and do so in
   your first patch.  Please do not only do this for ILA.

Thanks.


Very clear, makes sense, will do.

Thanks,
Rob


Re: [PATCH net] net: mpls: Fix multipath selection for LSR use case

2017-01-20 Thread Robert Shearman

On 20/01/17 00:51, David Ahern wrote:

MPLS multipath for LSR is broken -- always selecting the first nexthop
in the one label case. For example:

$ ip netns exec ns1 ip -f mpls ro ls
100
nexthop as to 200 via inet 172.16.2.2  dev virt12
nexthop as to 300 via inet 172.16.3.2  dev virt13
101
nexthop as to 201 via inet6 2000:2::2  dev virt12
nexthop as to 301 via inet6 2000:3::2  dev virt13

In this example incoming packets have a single MPLS labels which means
BOS bit is set. The BOS bit is passed from mpls_forward down to
mpls_multipath_hash which never processes the hash loop because BOS is 1.

Removing the bos arg from mpls_multipath_hash uncovers a number of other
problems with the hash loop that processes the MPLS label stack -- from
incorrect assumptions on the skb (skb has already pulled the first mpls
label in mpls_forward yet loop assumes it is there)


This was intentional because it doesn't really add anything to include 
the top-most label in the entropy since all traffic for the mpls_route 
will have the same top-most label, until support for sharing of 
mpls_routes is added.


Having said that, it costs very little to do this, makes the code 
simpler and avoids the need to remember to change this if sharing is 
added, so it's fine with me.



to incorrect
pskb_may_pull checks (label_index starts at 0 and pskb_may_pull checks
all use sizeof() * label_index).

This patch addresses all problems by moving the skb_pull in mpls_forward
after mpls_select_multipath. This allows mpls_multipath_hash to see the
skb with the entire label stack as it arrived.

From there mpls_multipath_hash is modified to additively compute the
total mpls header length on each pass (on pass N mpls_hdr_len is
N * sizeof(mpls_shim_hdr)). When the label is found with the BOS set it
verifies the skb has sufficient header for ipv4 or ipv6, and find the
IPv4 and IPv6 header by using the last mpls_hdr pointer and adding 1 to
advance past it.

With these changes I have verified the code correctly sees the label,
BOS, IPv4 and IPv6 addresses in the network header and icmp/tcp/udp
traffic for ipv4 and ipv6 are distributed across the nexthops.

Fixes: 1c78efa8319ca ("mpls: flow-based multipath selection")
Signed-off-by: David Ahern 


Acked-by: Robert Shearman 

Good catch, thanks for fixing.


[PATCH net 0/2] net: Fix oops on state free after lwt module unload

2017-01-18 Thread Robert Shearman
This patchset fixes an oops in lwtstate_free and a memory leak that
would otherwise be exposed by ensuring that references are taken on
modules that need to stay around to clean up lwt state. To faciliate
this all ops that implement destroy_state and that can be configured
to build as a module are changed specify the owner module in the
ops. The intersection of those two sets is just ila at the moment.

Robert Shearman (2):
  lwtunnel: Fix oops on state free after encap module unload
  ila: Fix memory leak of lwt dst cache on module unload

 include/net/lwtunnel.h |  2 ++
 net/core/lwtunnel.c| 11 +--
 net/ipv6/ila/ila_lwt.c |  1 +
 3 files changed, 12 insertions(+), 2 deletions(-)

-- 
2.1.4



[PATCH net 1/2] lwtunnel: Fix oops on state free after encap module unload

2017-01-18 Thread Robert Shearman
When attempting to free lwtunnel state after the module for the encap
has been unloaded an oops occurs:

BUG: unable to handle kernel NULL pointer dereference at 0008
IP: lwtstate_free+0x18/0x40
[..]
task: 88003e372380 task.stack: c91fc000
RIP: 0010:lwtstate_free+0x18/0x40
RSP: 0018:88003fd83e88 EFLAGS: 00010246
RAX:  RBX: 88002bbb3380 RCX: 88000c91a300
[..]
Call Trace:
 
 free_fib_info_rcu+0x195/0x1a0
 ? rt_fibinfo_free+0x50/0x50
 rcu_process_callbacks+0x2d3/0x850
 ? rcu_process_callbacks+0x296/0x850
 __do_softirq+0xe4/0x4cb
 irq_exit+0xb0/0xc0
 smp_apic_timer_interrupt+0x3d/0x50
 apic_timer_interrupt+0x93/0xa0
[..]
Code: e8 6e c6 fc ff 89 d8 5b 5d c3 bb de ff ff ff eb f4 66 90 66 66 66 66 90 
55 48 89 e5 53 0f b7 07 48 89 fb 48 8b 04 c5 00 81 d5 81 <48> 8b 40 08 48 85 c0 
74 13 ff d0 48 8d 7b 20 be 20 00 00 00 e8

The problem is that we don't check for NULL ops in
lwtstate_free. Adding the check fixes the immediate problem but will
then won't properly clean up for ops that implement the
->destroy_state function if the implementing module has been unloaded,
resulting in memory leaks or other problems. So in addition, refcount
the module when the ops implements ->destroy_state so it can't be
unloaded while there is still state around.

Fixes: 1104d9ba443a ("lwtunnel: Add destroy state operation")
Signed-off-by: Robert Shearman 
---
 include/net/lwtunnel.h |  2 ++
 net/core/lwtunnel.c| 11 +--
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index d4c1c75b8862..2b9993f33198 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -44,6 +44,8 @@ struct lwtunnel_encap_ops {
int (*get_encap_size)(struct lwtunnel_state *lwtstate);
int (*cmp_encap)(struct lwtunnel_state *a, struct lwtunnel_state *b);
int (*xmit)(struct sk_buff *skb);
+
+   struct module *owner;
 };
 
 #ifdef CONFIG_LWTUNNEL
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index a5d4e866ce88..3dc3cc3b38ec 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -126,8 +126,14 @@ int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
}
}
 #endif
-   if (likely(ops && ops->build_state))
+   /* take module reference if destroy_state is in use */
+   if (unlikely(ops && ops->destroy_state && !try_module_get(ops->owner)))
+   ops = NULL;
+   if (likely(ops && ops->build_state)) {
ret = ops->build_state(dev, encap, family, cfg, lws);
+   if (ret && ops->destroy_state)
+   module_put(ops->owner);
+   }
rcu_read_unlock();
 
return ret;
@@ -138,9 +144,10 @@ void lwtstate_free(struct lwtunnel_state *lws)
 {
const struct lwtunnel_encap_ops *ops = lwtun_encaps[lws->type];
 
-   if (ops->destroy_state) {
+   if (ops && ops->destroy_state) {
ops->destroy_state(lws);
kfree_rcu(lws, rcu);
+   module_put(ops->owner);
} else {
kfree(lws);
}
-- 
2.1.4



[PATCH net 2/2] ila: Fix memory leak of lwt dst cache on module unload

2017-01-18 Thread Robert Shearman
If routes with lwt state are present when the ila module is unloaded
and then subsequently deleted, the dst cache entry in the state will
be leaked.

Fix this by specifying the owning module in the lwt ops to allow lwt
to take a reference for each route and to keep the module around until
the last ila route is deleted.

Fixes: 79ff2fc31e0f ("ila: Cache a route to translated address")
Signed-off-by: Robert Shearman 
---
 net/ipv6/ila/ila_lwt.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ila/ila_lwt.c b/net/ipv6/ila/ila_lwt.c
index a7bc54ab46e2..13b5e85fe0d5 100644
--- a/net/ipv6/ila/ila_lwt.c
+++ b/net/ipv6/ila/ila_lwt.c
@@ -238,6 +238,7 @@ static const struct lwtunnel_encap_ops ila_encap_ops = {
.fill_encap = ila_fill_encap_info,
.get_encap_size = ila_encap_nlsize,
.cmp_encap = ila_encap_cmp,
+   .owner = THIS_MODULE,
 };
 
 int ila_lwt_init(void)
-- 
2.1.4



Re: [PATCH net] lwtunnel: fix autoload of lwt modules

2017-01-17 Thread Robert Shearman

On 17/01/17 17:07, David Ahern wrote:

On 1/17/17 3:04 AM, Robert Shearman wrote:

Is it safe to release the rtnl lock here? E.g. neither fib_get_nhs nor 
fib_create_info take a reference on the device and further up inet_rtm_newroute 
has a pointer to the struct fib_table without taking a reference.


fib tables can not be deleted so that reference is safe.


Looks like we can free a table on unmerging. Admittedly this is the 
local table so is unlikely to be one that lwtunnels are used on, but 
does it not need to be safe against races anyway?



You are right about the dev reference in the IPv4 path but the thing is 
build_state does not need the device reference.

Roopa: do you recall why dev is passed to build_state? Nothing about the encap 
should require a device reference and none of the current encaps use it.



Unfortunately, I think more invasive changes are required to solve this. I'm 
happy to work on solving this.


I have a patch that removes passing dev to build_state. Walking the remainder 
of the code I do not see any more references that would be affected by dropping 
rtnl here on the add path. Still looking at the delete path.



Ok, I'll continue looking too and let you know if there's anything else 
that pops up.


Having said that, even if we eliminate all the unreferenced objects in 
the current code, what are the chances that we'll be able to keep it 
this way going forward? Perhaps it would be safer to retry the insert 
operation from as close to the start as possible?


Thanks,
Rob


Re: [PATCH net] lwtunnel: fix autoload of lwt modules

2017-01-17 Thread Robert Shearman

On 17/01/17 05:33, David Ahern wrote:

Trying to add an mpls encap route when the MPLS modules are not loaded
hangs. For example:

CONFIG_MPLS=y
CONFIG_NET_MPLS_GSO=m
CONFIG_MPLS_ROUTING=m
CONFIG_MPLS_IPTUNNEL=m

$ ip route add 10.10.10.10/32 encap mpls 100 via inet 10.100.1.2

The ip command hangs:
root   880   826  0 21:25 pts/000:00:00 ip route add 10.10.10.10/32 
encap mpls 100 via inet 10.100.1.2

$ cat /proc/880/stack
[] call_usermodehelper_exec+0xd6/0x134
[] __request_module+0x27b/0x30a
[] lwtunnel_build_state+0xe4/0x178
[] fib_create_info+0x47f/0xdd4
[] fib_table_insert+0x90/0x41f
[] inet_rtm_newroute+0x4b/0x52
...

modprobe is trying to load rtnl-lwt-MPLS:

root   881 5  0 21:25 ?00:00:00 /sbin/modprobe -q -- 
rtnl-lwt-MPLS

and it hangs after loading mpls_router:

$ cat /proc/881/stack
[] rtnl_lock+0x12/0x14
[] register_netdevice_notifier+0x16/0x179
[] mpls_init+0x25/0x1000 [mpls_router]
[] do_one_initcall+0x8e/0x13f
[] do_init_module+0x5a/0x1e5
[] load_module+0x13bd/0x17d6
...

The problem is that lwtunnel_build_state is called with rtnl lock
held preventing mpls_init from registering. Fix by dropping the
lock before invoking request_module.

Fixes: 745041e2aaf1 ("lwtunnel: autoload of lwt modules")
Signed-off-by: David Ahern 
---
This is a best guess at the Fixes.

 net/core/lwtunnel.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index a5d4e866ce88..c14ee4d62a8a 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -120,7 +120,9 @@ int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,

if (encap_type_str) {
rcu_read_unlock();
+   __rtnl_unlock();
request_module("rtnl-lwt-%s", encap_type_str);
+   rtnl_lock();
rcu_read_lock();
ops = rcu_dereference(lwtun_encaps[encap_type]);
}



Is it safe to release the rtnl lock here? E.g. neither fib_get_nhs nor 
fib_create_info take a reference on the device and further up 
inet_rtm_newroute has a pointer to the struct fib_table without taking a 
reference.


Unfortunately, I think more invasive changes are required to solve this. 
I'm happy to work on solving this.


Thanksm
Rob


[PATCH net-next v2 1/2] net: AF-specific RTM_GETSTATS attributes

2017-01-16 Thread Robert Shearman
Add the functionality for including address-family-specific per-link
stats in RTM_GETSTATS messages. This is done through adding a new
IFLA_STATS_AF_SPEC attribute under which address family attributes are
nested and then the AF-specific attributes can be further nested. This
follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the
advantage of presenting an easily extended hierarchy. The rtnl_af_ops
structure is extended to provide AFs with the opportunity to fill and
provide the size of their stats attributes.

One alternative would have been to provide AFs with the ability to add
attributes directly into the RTM_GETSTATS message without a nested
hierarchy. I discounted this approach as it increases the rate at
which the 32 attribute number space is used up and it makes
implementation a little more tricky for stats dump resuming (at the
moment the order in which attributes are added to the message has to
match the numeric order of the attributes).

Another alternative would have been to register per-AF RTM_GETSTATS
handlers. I discounted this approach as I perceived a common use-case
to be getting all the stats for an interface and this approach would
necessitate multiple requests/dumps to retrieve them all.

Signed-off-by: Robert Shearman 
Acked-by: Roopa Prabhu 
---
 include/net/rtnetlink.h  |  4 
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c | 50 
 3 files changed, 55 insertions(+)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 4113916cc1bb..106de5f7bf06 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -139,6 +139,10 @@ struct rtnl_af_ops {
const struct nlattr *attr);
int (*set_link_af)(struct net_device *dev,
   const struct nlattr *attr);
+
+   int (*fill_stats_af)(struct sk_buff *skb,
+const struct net_device *dev);
+   size_t  (*get_stats_af_size)(const struct net_device 
*dev);
 };
 
 void __rtnl_af_unregister(struct rtnl_af_ops *ops);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 6b13e591abc9..184b16ed2b84 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -847,6 +847,7 @@ enum {
IFLA_STATS_LINK_XSTATS,
IFLA_STATS_LINK_XSTATS_SLAVE,
IFLA_STATS_LINK_OFFLOAD_XSTATS,
+   IFLA_STATS_AF_SPEC,
__IFLA_STATS_MAX,
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 18b5aae99bec..4edc1bd7a735 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3829,6 +3829,39 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, 
struct net_device *dev,
*idxattr = 0;
}
 
+   if (stats_attr_valid(filter_mask, IFLA_STATS_AF_SPEC, *idxattr)) {
+   struct rtnl_af_ops *af_ops;
+
+   *idxattr = IFLA_STATS_AF_SPEC;
+   attr = nla_nest_start(skb, IFLA_STATS_AF_SPEC);
+   if (!attr)
+   goto nla_put_failure;
+
+   list_for_each_entry(af_ops, &rtnl_af_ops, list) {
+   if (af_ops->fill_stats_af) {
+   struct nlattr *af;
+   int err;
+
+   af = nla_nest_start(skb, af_ops->family);
+   if (!af)
+   goto nla_put_failure;
+
+   err = af_ops->fill_stats_af(skb, dev);
+
+   if (err == -ENODATA)
+   nla_nest_cancel(skb, af);
+   else if (err < 0)
+   goto nla_put_failure;
+
+   nla_nest_end(skb, af);
+   }
+   }
+
+   nla_nest_end(skb, attr);
+
+   *idxattr = 0;
+   }
+
nlmsg_end(skb, nlh);
 
return 0;
@@ -3885,6 +3918,23 @@ static size_t if_nlmsg_stats_size(const struct 
net_device *dev,
if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_OFFLOAD_XSTATS, 0))
size += rtnl_get_offload_stats_size(dev);
 
+   if (stats_attr_valid(filter_mask, IFLA_STATS_AF_SPEC, 0)) {
+   struct rtnl_af_ops *af_ops;
+
+   /* for IFLA_STATS_AF_SPEC */
+   size += nla_total_size(0);
+
+   list_for_each_entry(af_ops, &rtnl_af_ops, list) {
+   if (af_ops->get_stats_af_size) {
+   size += nla_total_size(
+   af_ops->get_stats_af_size(dev));
+
+   /* for AF_* */
+   size += nla_total_size(0);
+   }
+   }

[PATCH net-next v2 0/2] mpls: Packet stats

2017-01-16 Thread Robert Shearman
This patchset records per-interface packet stats in the MPLS
forwarding path and exports them using a nest of attributes root at a
new IFLA_STATS_AF_SPEC attribute as part of RTM_GETSTATS messages:

[IFLA_STATS_AF_SPEC]
 -> [AF_MPLS]
  -> [MPLS_STATS_LINK]
   -> struct mpls_link_stats

The first patch adds the rtnl infrastructure for this, including a new
callbacks to per-AF ops of fill_stats_af and get_stats_af_size. The
second patch records MPLS stats and makes use of the infrastructure to
export them. The rtnl infrastructure could also be used to export IPv6
stats in the future.

Changes in v2:
 - make incrementing IPv6 stats in mpls_stats_inc_outucastpkts
   conditional on CONFIG_IPV6 to fix build with CONFIG_IPV6=n

Robert Shearman (2):
  net: AF-specific RTM_GETSTATS attributes
  mpls: Packet stats

 include/net/rtnetlink.h  |   4 +
 include/uapi/linux/if_link.h |   1 +
 include/uapi/linux/mpls.h|  30 +++
 net/core/rtnetlink.c |  50 
 net/mpls/af_mpls.c   | 181 +--
 net/mpls/internal.h  |  58 +-
 net/mpls/mpls_iptunnel.c |  11 ++-
 7 files changed, 307 insertions(+), 28 deletions(-)

-- 
2.1.4



[PATCH net-next v2 2/2] mpls: Packet stats

2017-01-16 Thread Robert Shearman
Having MPLS packet stats is useful for observing network operation and
for diagnosing network problems. In the absence of anything better,
RFC2863 and RFC3813 are used for guidance for which stats to expose
and the semantics of them. In particular rx_noroutes maps to in
unknown protos in RFC2863. The stats are exposed to userspace via
AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of
RTM_GETSTATS messages.

All the introduced fields are 64-bit, even error ones, to ensure no
overflow with long uptimes. Per-CPU counters are used to avoid
cache-line contention on the commonly used fields. The other fields
have also been made per-CPU for code to avoid performance problems in
error conditions on the assumption that on some platforms the cost of
atomic operations could be more expensive than sending the packet
(which is what would be done in the success case). If that's not the
case, we could instead not use per-CPU counters for these fields.

Only unicast and non-fragment are exposed at the moment, but other
counters can be exposed in the future either by adding to the end of
struct mpls_link_stats or by additional netlink attributes in the
AF_MPLS IFLA_STATS_AF_SPEC nested attribute.

Signed-off-by: Robert Shearman 
---
 include/uapi/linux/mpls.h |  30 
 net/mpls/af_mpls.c| 181 --
 net/mpls/internal.h   |  58 ++-
 net/mpls/mpls_iptunnel.c  |  11 ++-
 4 files changed, 252 insertions(+), 28 deletions(-)

diff --git a/include/uapi/linux/mpls.h b/include/uapi/linux/mpls.h
index 24a6cb1aec86..77a19dfe3990 100644
--- a/include/uapi/linux/mpls.h
+++ b/include/uapi/linux/mpls.h
@@ -43,4 +43,34 @@ struct mpls_label {
 
 #define MPLS_LABEL_FIRST_UNRESERVED16 /* RFC3032 */
 
+/* These are embedded into IFLA_STATS_AF_SPEC:
+ * [IFLA_STATS_AF_SPEC]
+ * -> [AF_MPLS]
+ *-> [MPLS_STATS_xxx]
+ *
+ * Attributes:
+ * [MPLS_STATS_LINK] = {
+ * struct mpls_link_stats
+ * }
+ */
+enum {
+   MPLS_STATS_UNSPEC, /* also used as 64bit pad attribute */
+   MPLS_STATS_LINK,
+   __MPLS_STATS_MAX,
+};
+
+#define MPLS_STATS_MAX (__MPLS_STATS_MAX - 1)
+
+struct mpls_link_stats {
+   __u64   rx_packets; /* total packets received   */
+   __u64   tx_packets; /* total packets transmitted*/
+   __u64   rx_bytes;   /* total bytes received */
+   __u64   tx_bytes;   /* total bytes transmitted  */
+   __u64   rx_errors;  /* bad packets received */
+   __u64   tx_errors;  /* packet transmit problems */
+   __u64   rx_dropped; /* packet dropped on receive*/
+   __u64   tx_dropped; /* packet dropped on transmit   */
+   __u64   rx_noroute; /* no route for packet dest */
+};
+
 #endif /* _UAPI_MPLS_H */
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 15fe97644ffe..4dc81963af8f 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -17,8 +18,8 @@
 #include 
 #if IS_ENABLED(CONFIG_IPV6)
 #include 
-#include 
 #endif
+#include 
 #include 
 #include "internal.h"
 
@@ -48,11 +49,6 @@ static struct mpls_route *mpls_route_input_rcu(struct net 
*net, unsigned index)
return rt;
 }
 
-static inline struct mpls_dev *mpls_dev_get(const struct net_device *dev)
-{
-   return rcu_dereference_rtnl(dev->mpls_ptr);
-}
-
 bool mpls_output_possible(const struct net_device *dev)
 {
return dev && (dev->flags & IFF_UP) && netif_carrier_ok(dev);
@@ -98,6 +94,31 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned 
int mtu)
 }
 EXPORT_SYMBOL_GPL(mpls_pkt_too_big);
 
+void mpls_stats_inc_outucastpkts(struct net_device *dev,
+const struct sk_buff *skb)
+{
+   struct mpls_dev *mdev;
+
+   if (skb->protocol == htons(ETH_P_MPLS_UC)) {
+   mdev = mpls_dev_get(dev);
+   if (mdev)
+   MPLS_INC_STATS_LEN(mdev, skb->len,
+  tx_packets,
+  tx_bytes);
+   } else if (skb->protocol == htons(ETH_P_IP)) {
+   IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (skb->protocol == htons(ETH_P_IPV6)) {
+   struct inet6_dev *in6dev = __in6_dev_get(dev);
+
+   if (in6dev)
+   IP6_UPD_PO_STATS(dev_net(dev), in6dev,
+IPSTATS_MIB_OUT, skb->len);
+#endif
+   }
+}
+EXPORT_SYMBOL_GPL(mpls_stats_inc_outucastpkts);
+
 static u32 mpls_multipath_hash(struct mpls_route *rt,
   struct sk_buff *skb, bool bos)
 {
@@ -253,6 +274,7 @@ static int mpls_forward(struct 

Re: [PATCH net-next 2/2] mpls: Packet stats

2017-01-16 Thread Robert Shearman

On 14/01/17 06:41, kbuild test robot wrote:

Hi Robert,

[auto build test ERROR on net-next/master]

url:
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_0day-2Dci_linux_commits_Robert-2DShearman_net-2DAF-2Dspecific-2DRTM-5FGETSTATS-2Dattributes_20170114-2D095819&d=DwIBAg&c=IL_XqQWOjubgfqINi2jTzg&r=HbzsGgI7IidV3zRVGbZOYKAd0w1ch0IRBV2pqme4k9w&m=-FIZAjvyj_cLBHBTW14fNKp5IiJpkr1laQrxgxKnZ3g&s=_d4vequnI_QXJkUb0X3X8AUHdi6qD4R86g-mTBsEJ7g&e=
config: x86_64-randconfig-u0-01141334 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All error/warnings (new ones prefixed by >>):

   In file included from include/net/netns/mib.h:4:0,
from include/net/net_namespace.h:14,
from include/linux/netdevice.h:43,
from include/uapi/linux/if_arp.h:26,
from include/linux/if_arp.h:27,
from net/mpls/af_mpls.c:7:
   net/mpls/af_mpls.c: In function 'mpls_stats_inc_outucastpkts':

include/net/ipv6.h:163:35: error: 'struct netns_mib' has no member named 
'ipv6_statistics'; did you mean 'ip_statistics'?

 mod##SNMP_UPD_PO_STATS((net)->mib.statname##_statistics, field, (val));\


Good catch kbuild test robot.

Expect to see a v2 shortly with this CONFIG_IPV6=n build failure fixed.

Thanks,
Rob


[PATCH net-next 1/2] net: AF-specific RTM_GETSTATS attributes

2017-01-13 Thread Robert Shearman
Add the functionality for including address-family-specific per-link
stats in RTM_GETSTATS messages. This is done through adding a new
IFLA_STATS_AF_SPEC attribute under which address family attributes are
nested and then the AF-specific attributes can be further nested. This
follows the model of IFLA_AF_SPEC on RTM_*LINK messages and it has the
advantage of presenting an easily extended hierarchy. The rtnl_af_ops
structure is extended to provide AFs with the opportunity to fill and
provide the size of their stats attributes.

One alternative would have been to provide AFs with the ability to add
attributes directly into the RTM_GETSTATS message without a nested
hierarchy. I discounted this approach as it increases the rate at
which the 32 attribute number space is used up and it makes
implementation a little more tricky for stats dump resuming (at the
moment the order in which attributes are added to the message has to
match the numeric order of the attributes).

Another alternative would have been to register per-AF RTM_GETSTATS
handlers. I discounted this approach as I perceived a common use-case
to be getting all the stats for an interface and this approach would
necessitate multiple requests/dumps to retrieve them all.

Signed-off-by: Robert Shearman 
---
 include/net/rtnetlink.h  |  4 
 include/uapi/linux/if_link.h |  1 +
 net/core/rtnetlink.c | 50 
 3 files changed, 55 insertions(+)

diff --git a/include/net/rtnetlink.h b/include/net/rtnetlink.h
index 4113916cc1bb..106de5f7bf06 100644
--- a/include/net/rtnetlink.h
+++ b/include/net/rtnetlink.h
@@ -139,6 +139,10 @@ struct rtnl_af_ops {
const struct nlattr *attr);
int (*set_link_af)(struct net_device *dev,
   const struct nlattr *attr);
+
+   int (*fill_stats_af)(struct sk_buff *skb,
+const struct net_device *dev);
+   size_t  (*get_stats_af_size)(const struct net_device 
*dev);
 };
 
 void __rtnl_af_unregister(struct rtnl_af_ops *ops);
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 6b13e591abc9..184b16ed2b84 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -847,6 +847,7 @@ enum {
IFLA_STATS_LINK_XSTATS,
IFLA_STATS_LINK_XSTATS_SLAVE,
IFLA_STATS_LINK_OFFLOAD_XSTATS,
+   IFLA_STATS_AF_SPEC,
__IFLA_STATS_MAX,
 };
 
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 18b5aae99bec..4edc1bd7a735 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3829,6 +3829,39 @@ static int rtnl_fill_statsinfo(struct sk_buff *skb, 
struct net_device *dev,
*idxattr = 0;
}
 
+   if (stats_attr_valid(filter_mask, IFLA_STATS_AF_SPEC, *idxattr)) {
+   struct rtnl_af_ops *af_ops;
+
+   *idxattr = IFLA_STATS_AF_SPEC;
+   attr = nla_nest_start(skb, IFLA_STATS_AF_SPEC);
+   if (!attr)
+   goto nla_put_failure;
+
+   list_for_each_entry(af_ops, &rtnl_af_ops, list) {
+   if (af_ops->fill_stats_af) {
+   struct nlattr *af;
+   int err;
+
+   af = nla_nest_start(skb, af_ops->family);
+   if (!af)
+   goto nla_put_failure;
+
+   err = af_ops->fill_stats_af(skb, dev);
+
+   if (err == -ENODATA)
+   nla_nest_cancel(skb, af);
+   else if (err < 0)
+   goto nla_put_failure;
+
+   nla_nest_end(skb, af);
+   }
+   }
+
+   nla_nest_end(skb, attr);
+
+   *idxattr = 0;
+   }
+
nlmsg_end(skb, nlh);
 
return 0;
@@ -3885,6 +3918,23 @@ static size_t if_nlmsg_stats_size(const struct 
net_device *dev,
if (stats_attr_valid(filter_mask, IFLA_STATS_LINK_OFFLOAD_XSTATS, 0))
size += rtnl_get_offload_stats_size(dev);
 
+   if (stats_attr_valid(filter_mask, IFLA_STATS_AF_SPEC, 0)) {
+   struct rtnl_af_ops *af_ops;
+
+   /* for IFLA_STATS_AF_SPEC */
+   size += nla_total_size(0);
+
+   list_for_each_entry(af_ops, &rtnl_af_ops, list) {
+   if (af_ops->get_stats_af_size) {
+   size += nla_total_size(
+   af_ops->get_stats_af_size(dev));
+
+   /* for AF_* */
+   size += nla_total_size(0);
+   }
+   }
+   }
+
return size;
 }
 
-- 
2.1.4



[PATCH net-next 0/2] mpls: Packet stats

2017-01-13 Thread Robert Shearman
This patchset records per-interface packet stats in the MPLS
forwarding path and exports them using a nest of attributes root at a
new IFLA_STATS_AF_SPEC attribute as part of RTM_GETSTATS messages:

[IFLA_STATS_AF_SPEC]
 -> [AF_MPLS]
  -> [MPLS_STATS_LINK]
   -> struct mpls_link_stats

The first patch adds the rtnl infrastructure for this, including a new
callbacks to per-AF ops of fill_stats_af and get_stats_af_size. The
second patch records MPLS stats and makes use of the infrastructure to
export them. The rtnl infrastructure could also be used to export IPv6
stats in the future.

Robert Shearman (2):
  net: AF-specific RTM_GETSTATS attributes
  mpls: Packet stats

 include/net/rtnetlink.h  |   4 +
 include/uapi/linux/if_link.h |   1 +
 include/uapi/linux/mpls.h|  30 
 net/core/rtnetlink.c |  50 
 net/mpls/af_mpls.c   | 179 +--
 net/mpls/internal.h  |  58 +-
 net/mpls/mpls_iptunnel.c |  11 ++-
 7 files changed, 305 insertions(+), 28 deletions(-)

-- 
2.1.4



[PATCH net-next 2/2] mpls: Packet stats

2017-01-13 Thread Robert Shearman
Having MPLS packet stats is useful for observing network operation and
for diagnosing network problems. In the absence of anything better,
RFC2863 and RFC3813 are used for guidance for which stats to expose
and the semantics of them. In particular rx_noroutes maps to in
unknown protos in RFC2863. The stats are exposed to userspace via
AF_MPLS attributes embedded in the IFLA_STATS_AF_SPEC attribute of
RTM_GETSTATS messages.

All the introduced fields are 64-bit, even error ones, to ensure no
overflow with long uptimes. Per-CPU counters are used to avoid
cache-line contention on the commonly used fields. The other fields
have also been made per-CPU for code to avoid performance problems in
error conditions on the assumption that on some platforms the cost of
atomic operations could be more expensive than sending the packet
(which is what would be done in the success case). If that's not the
case, we could instead not use per-CPU counters for these fields.

Only unicast and non-fragment are exposed at the moment, but other
counters can be exposed in the future either by adding to the end of
struct mpls_link_stats or by additional netlink attributes in the
AF_MPLS IFLA_STATS_AF_SPEC nested attribute.

Signed-off-by: Robert Shearman 
---
 include/uapi/linux/mpls.h |  30 
 net/mpls/af_mpls.c| 179 --
 net/mpls/internal.h   |  58 ++-
 net/mpls/mpls_iptunnel.c  |  11 ++-
 4 files changed, 250 insertions(+), 28 deletions(-)

diff --git a/include/uapi/linux/mpls.h b/include/uapi/linux/mpls.h
index 24a6cb1aec86..77a19dfe3990 100644
--- a/include/uapi/linux/mpls.h
+++ b/include/uapi/linux/mpls.h
@@ -43,4 +43,34 @@ struct mpls_label {
 
 #define MPLS_LABEL_FIRST_UNRESERVED16 /* RFC3032 */
 
+/* These are embedded into IFLA_STATS_AF_SPEC:
+ * [IFLA_STATS_AF_SPEC]
+ * -> [AF_MPLS]
+ *-> [MPLS_STATS_xxx]
+ *
+ * Attributes:
+ * [MPLS_STATS_LINK] = {
+ * struct mpls_link_stats
+ * }
+ */
+enum {
+   MPLS_STATS_UNSPEC, /* also used as 64bit pad attribute */
+   MPLS_STATS_LINK,
+   __MPLS_STATS_MAX,
+};
+
+#define MPLS_STATS_MAX (__MPLS_STATS_MAX - 1)
+
+struct mpls_link_stats {
+   __u64   rx_packets; /* total packets received   */
+   __u64   tx_packets; /* total packets transmitted*/
+   __u64   rx_bytes;   /* total bytes received */
+   __u64   tx_bytes;   /* total bytes transmitted  */
+   __u64   rx_errors;  /* bad packets received */
+   __u64   tx_errors;  /* packet transmit problems */
+   __u64   rx_dropped; /* packet dropped on receive*/
+   __u64   tx_dropped; /* packet dropped on transmit   */
+   __u64   rx_noroute; /* no route for packet dest */
+};
+
 #endif /* _UAPI_MPLS_H */
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 15fe97644ffe..fb20941cdda2 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -17,8 +18,8 @@
 #include 
 #if IS_ENABLED(CONFIG_IPV6)
 #include 
-#include 
 #endif
+#include 
 #include 
 #include "internal.h"
 
@@ -48,11 +49,6 @@ static struct mpls_route *mpls_route_input_rcu(struct net 
*net, unsigned index)
return rt;
 }
 
-static inline struct mpls_dev *mpls_dev_get(const struct net_device *dev)
-{
-   return rcu_dereference_rtnl(dev->mpls_ptr);
-}
-
 bool mpls_output_possible(const struct net_device *dev)
 {
return dev && (dev->flags & IFF_UP) && netif_carrier_ok(dev);
@@ -98,6 +94,29 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned 
int mtu)
 }
 EXPORT_SYMBOL_GPL(mpls_pkt_too_big);
 
+void mpls_stats_inc_outucastpkts(struct net_device *dev,
+const struct sk_buff *skb)
+{
+   struct mpls_dev *mdev;
+   struct inet6_dev *in6dev;
+
+   if (skb->protocol == htons(ETH_P_MPLS_UC)) {
+   mdev = mpls_dev_get(dev);
+   if (mdev)
+   MPLS_INC_STATS_LEN(mdev, skb->len,
+  tx_packets,
+  tx_bytes);
+   } else if (skb->protocol == htons(ETH_P_IP)) {
+   IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+   } else if (skb->protocol == htons(ETH_P_IPV6)) {
+   in6dev = __in6_dev_get(dev);
+   if (in6dev)
+   IP6_UPD_PO_STATS(dev_net(dev), in6dev,
+IPSTATS_MIB_OUT, skb->len);
+   }
+}
+EXPORT_SYMBOL_GPL(mpls_stats_inc_outucastpkts);
+
 static u32 mpls_multipath_hash(struct mpls_route *rt,
   struct sk_buff *skb, bool bos)
 {
@@ -253,6 +272,7 @@ static int mpls_forward(struct sk_buff *skb, struct 
net_device *d

Re: [net PATCH 2/2] ipv4: Drop suffix update from resize code

2016-12-05 Thread Robert Shearman

On 05/12/16 17:28, David Miller wrote:

From: Robert Shearman 
Date: Mon, 5 Dec 2016 15:05:18 +


On 01/12/16 12:27, Alexander Duyck wrote:

It has been reported that update_suffix can be expensive when it is
called
on a large node in which most of the suffix lengths are the same.  The
time
required to add 200K entries had increased from around 3 seconds to
almost
49 seconds.

In order to address this we need to move the code for updating the
suffix
out of resize and instead just have it handled in the cases where we
are
pushing a node that increases the suffix length, or will decrease the
suffix length.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Reported-by: Robert Shearman 
Signed-off-by: Alexander Duyck 


$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists


What are these errors all about?


These are just routes that are already added by the system but are 
present in the dump:


$ ip route showdump < ~/allroutes | grep -v 110.110.110.2
default via 192.168.100.1 dev eth0  proto static  metric 1024
10.37.96.0/20 dev eth2  proto kernel  scope link  src 10.37.96.204
110.110.110.0/24 dev eth1  proto kernel  scope link  src 110.110.110.1
192.168.100.0/24 dev eth0  proto kernel  scope link  src 192.168.100.153

So the errors are expected and are seen both with and without these patches.

Thanks,
Rob


Re: [net PATCH 2/2] ipv4: Drop suffix update from resize code

2016-12-05 Thread Robert Shearman

On 01/12/16 12:27, Alexander Duyck wrote:

It has been reported that update_suffix can be expensive when it is called
on a large node in which most of the suffix lengths are the same.  The time
required to add 200K entries had increased from around 3 seconds to almost
49 seconds.

In order to address this we need to move the code for updating the suffix
out of resize and instead just have it handled in the cases where we are
pushing a node that increases the suffix length, or will decrease the
suffix length.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Reported-by: Robert Shearman 
Signed-off-by: Alexander Duyck 


$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real0m4.026s
user0m0.156s
sys 0m2.500s
$ time sudo ip route flush via 110.110.110.2

real0m5.423s
user0m0.064s
sys 0m1.096s

This reduces the time taken to both add and delete the 200k routes back 
down to levels comparable before commit 5405afd1a306. The changes look 
good to me too. Nicely done Alex!


Reviewed-by: Robert Shearman 
Tested-by: Robert Shearman 

Thanks,
Rob


Re: [net PATCH 1/2] ipv4: Drop leaf from suffix pull/push functions

2016-12-05 Thread Robert Shearman

On 01/12/16 12:27, Alexander Duyck wrote:

It wasn't necessary to pass a leaf in when doing the suffix updates so just
drop it.  Instead just pass the suffix and work with that.

Since we dropped the leaf there is no need to include that in the name so
the names are updated to node_push_suffix and node_pull_suffix.

Finally I noticed that the logic for pulling the suffix length back
actually had some issues.  Specifically it would stop prematurely if there
was a longer suffix, but it was not as long as the original suffix.  I
updated the code to address that in node_pull_suffix.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Suggested-by: Robert Shearman 
Signed-off-by: Alexander Duyck 


This addresses an issue in the current code whereby when a fib alias is 
removed from a leaf when there are other aliases remaining then the 
suffix length may not be updated in the conditions described above. This 
is because fib_remove_alias doesn't call trie_rebalance (or resize) in 
this case, as there's no need since no nodes have been added/removed.


This can be reproduced with this structure of the fib trie and by adding 
and then deleting 10.37.96.0/19:


+-- 10.37.0.0/16 4 1 8 suffix/19
   |-- 10.37.0.0
  /24 universe UNICAST
   |-- 10.37.32.0
  /24 universe UNICAST
   |-- 10.37.64.0
  /20 universe UNICAST
   |-- 10.37.95.0
  /24 universe UNICAST
   +-- 10.37.96.0/20 2 1 2 suffix/21
  +-- 10.37.96.0/22 2 1 2 suffix/24
 +-- 10.37.96.0/24 2 1 2 suffix/24
|-- 10.37.96.0
   /32 link BROADCAST
   /24 link UNICAST
+-- 10.37.96.192/26 2 0 2 suffix/28
   |-- 10.37.96.204
  /32 host LOCAL
   |-- 10.37.96.255
  /32 link BROADCAST
 |-- 10.37.98.0
/25 universe UNICAST
  |-- 10.37.104.0
 /21 universe UNICAST
   +-- 10.37.112.0/23 2 0 2 suffix/24
  |-- 10.37.112.0
 /24 universe UNICAST
  |-- 10.37.113.0
 /24 universe UNICAST
   |-- 10.37.223.0
  /24 universe UNICAST
   |-- 10.37.224.0
  /24 universe UNICAST

I've verified that with the fix included here that the suffix length on 
the 10.37.0.0/16 node now gets updated correctly to be /20.


Reviewed-by: Robert Shearman 
Tested-by: Robert Shearman 

Thanks,
Rob


Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required

2016-12-02 Thread Robert Shearman

On 01/12/16 18:29, Alexander Duyck wrote:

On Wed, Nov 30, 2016 at 4:27 PM, Robert Shearman  wrote:

On 29/11/16 23:14, Alexander Duyck wrote:

On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman 


Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Signed-off-by: Robert Shearman 



So I am not entirely convinced this is a regression, I was wondering
what your numbers looked like before the patch you are saying this
fixes?  I recall I fixed a number of issues leading up to this so I am
just curious.



At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie: Remove checks
for index >= tnode_child_length from tnode_get_child") which is the parent
of 5405afd1a306:

$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real0m3.451s
user0m0.184s
sys 0m2.004s


At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add tracking
value for suffix length"):

$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real0m48.624s
user0m0.360s
sys 0m46.960s

So does that warrant a fixes tag for this patch?


Yes, please include it in the patch description next time.

I've been giving it thought and the fact is the patch series has
merit.  I just think it was being a bit too aggressive in terms of
some of the changes.  So placing any changes in put_child seem to be
the one piece in this that make this a bit ugly.


Does that imply the current updating of the parent's slen value should 
be removed from put_child then?


I started out without doing the changes in put_child, but that meant the 
changes to push the slen up through the trie were spread all over the 
place. This seemed like the cleaner solution.



+   }
+}
+
 /* Add a child at position i overwriting the old value.
  * Update the value of full_children and empty_children.
+ *
+ * The suffix length of the parent node and the rest of the tree is
+ * updated (if required) when adding/replacing a node, but is caller's
+ * responsibility on removal.
  */
 static void put_child(struct key_vector *tn, unsigned long i,
  struct key_vector *n)
@@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned
long i,
else if (!wasfull && isfull)
tn_info(tn)->full_children++;

-   if (n && (tn->slen < n->slen))
-   tn->slen = n->slen;
+   if (n)
+   node_push_suffix(tn, n);



This is just creating redundant work if we have to perform a resize.



The only real redundant work is the assignment of slen in tn, since the
propagation up the trie has to happen regardless and if a subsequent resize
results in changes to the trie then the propagation will stop at tn's parent
since the suffix length of the tn's parent will not be less than tn's suffix
length.



This isn't going to work.  We want to avoid trying to push the suffix
when we place a child.  There are scenarios where we are placing
children in nodes that don't have parents yet because we are doing
rebalances and such.  I suspect this could be pretty expensive on a
resize call.


It's not expensive because the new parents that are being built up are 
orphaned from the rest of the trie, so the push up won't proceed into 
the rest of the trie until the new node is inserted into the trie.



We want to pull the suffix pushing out of the resize completely.  The
problem is this is going to cause us to start pushing suffixes for
every node moved as a part of resize.


This will mean that nodes will have to be visited multiple times, which 
could well be more expensive than doing it during the resize.







rcu_assign_pointer(tn->tnode[i], n);
 }


...


-static void leaf_pull_suffix(struct key_vector *tp, struct key_vector
*l)
+static void node_set_suffix(struct key_vector *tp, unsigned char slen)
 {
-   while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
-   if (update_suffix(tp) > l->slen)
+   if (slen > tp->slen)
+   tp->slen = slen;
+}
+
+static void node_pull_suffix(struct key_vector *tn)
+{
+   struct key_vector *tp;
+   unsigned char slen;
+
+   slen = update_suffix(tn);
+   tp = node_parent(tn);
+   while ((tp->slen > tp->pos) && (tp->slen > slen)) {
+   if (update_suffix(tp) > slen)
break;
tp = node_parent(tp);
}
 }





Actually I realized that there is a bug here.  The check for
update_suffix(tp) > slen can cause us to stop prematurely if I am not
mistaken.  What we should probably be doing is pulling out tp->slen
and if the

Re: [PATCH net] fib_trie: Avoid expensive update of suffix length when not required

2016-11-30 Thread Robert Shearman

On 29/11/16 23:14, Alexander Duyck wrote:

On Tue, Nov 29, 2016 at 9:46 AM, Robert Shearman  wrote:

With certain distributions of routes it can take a long time to add
and delete further routes at scale. For example, with a route for
10.37.96.0/20 present it takes 47s to add ~200k contiguous /24 routes
from 8.0.0.0/24 through to 11.138.207.0/24. Perf shows the bottleneck
is update_suffix:

  40.39%  [kernel]  [k] update_suffix
   8.02%  libnl-3.so.200.19.0   [.] nl_hash_table_lookup


Well yeah, it will be expensive when you have over 512K entries in a
single node.  I'm assuming that is the case based on the fact that
200K routes will double up in the trie node due to broadcast and the
route ending up in adjacent key vectors.


The example scenario was in fact using a large scale of just routes 
rather than addresses so there are no broadcast leafs (nor local leafs):


+-- 8.0.0.0/6 18 2 52436 suffix/20
   |-- 8.0.0.0
  /24 universe UNICAST
   |-- 8.0.1.0
  /24 universe UNICAST
   |-- 8.0.2.0
  /24 universe UNICAST

(the "suffix/20" is for test purposes as per below). In this case the 
8.0.0.0/6 node has a child array of size 2^18 = 262144.





With these changes, the time is reduced to 4s for the same scale and
distribution of routes.

The issue is that update_suffix does an O(n) walk on the children of a
node and the with a dense distribtion of routes the number of children
in a node tends towards the number of nodes in the tree.


Other than the performance I was just wondering if you did any other
validation to confirm that nothing else differs.  Basically it would
just be a matter of verifying that /proc/net/fib_trie is the same
before and after your patch.


Yes, to verify these changes I applied some local changes to dump the 
slen field of trie nodes from /proc/net/fib_trie. I presumed that the 
format of /proc/net/fib_trie is a public API that we can't break so I 
intend to submit this as a patch. I verified by inspection that the 
suffix length listed in the nodes was as expected when exercising the 
insert and remove functions for routes with both larger and smaller 
suffixes than in the current nodes, and then for the inflate, halve and 
flush cases.


I've now verified that a diff of /proc/net/fib_trie before and after my 
patch with all the routes added of my example scenario shows no changes.





...

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Signed-off-by: Robert Shearman 


So I am not entirely convinced this is a regression, I was wondering
what your numbers looked like before the patch you are saying this
fixes?  I recall I fixed a number of issues leading up to this so I am
just curious.


At commit 21d1f11db0e2f20a549ad8322879507850544670 ("fib_trie: Remove 
checks for index >= tnode_child_length from tnode_get_child") which is 
the parent of 5405afd1a306:


$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real0m3.451s
user0m0.184s
sys 0m2.004s


At commit 5405afd1a30620de8601436bae739c67c0bc7324 ("fib_trie: Add 
tracking value for suffix length"):


$ time sudo ip route restore < ~/allroutes
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists
RTNETLINK answers: File exists

real0m48.624s
user0m0.360s
sys 0m46.960s

So does that warrant a fixes tag for this patch?



My thought is this seems more like a performance optimization than a
fix.  If that is the case this probably belongs in net-next.

It seems to me we might be able to simplify update_suffix rather than
going through all this change.  That might be something that is more
acceptable for net.  Have you looked at simply adding code so that
there is a break inside update_suffix if (slen == tn->slen)?  We don't
have to call it for if a leaf is added so it might make sense to add
that check.


That doesn't really help since the search always starts from the 
smallest suffix length and thus could potentially require visiting a 
large number of children before finding the node that makes slen == 
tn->slen.


In the example above, 10.37.96.0/20 would be child number 140640 in node 
8.0.0.0/6 18. Since the loop starts out with stride = 2 this means that 
it requires 70320 iterations round to find 10.37.96.0/20 which 
contributes the largest suffix length to the node.


Now it may be possible to improve the algorithm by starting the search 
from the largest suffix length towards the smallest suffix length 
instead of the current smallest to largest, but there would still be 
distributions of routes that would lead to needing to visit a large 
number of nodes only to find that nothing has changed.





---
 net/ipv4/fib_trie.c | 86 

[PATCH net] fib_trie: Avoid expensive update of suffix length when not required

2016-11-29 Thread Robert Shearman
With certain distributions of routes it can take a long time to add
and delete further routes at scale. For example, with a route for
10.37.96.0/20 present it takes 47s to add ~200k contiguous /24 routes
from 8.0.0.0/24 through to 11.138.207.0/24. Perf shows the bottleneck
is update_suffix:

  40.39%  [kernel]  [k] update_suffix
   8.02%  libnl-3.so.200.19.0   [.] nl_hash_table_lookup

With these changes, the time is reduced to 4s for the same scale and
distribution of routes.

The issue is that update_suffix does an O(n) walk on the children of a
node and the with a dense distribtion of routes the number of children
in a node tends towards the number of nodes in the tree.

In the add case it isn't necessary to walk all the other children to
update the largest suffix length of the node (which is what
update_suffix is doing) since we already know what the largest suffix
length of all the other children is and we only need to update it if
the new node/leaf has a larger suffix. However, it is currently called
from resize which is called on any update to rebalance the trie.

Therefore improve the scaling by moving the responsibility of
recalculating the node's largest suffix length out of the resize
function into its callers. For fib_insert_node this can be taken care
of by extending put_child to not only update the largest suffix length
in the parent node, but to propagate it all the way up the trie as
required, using a new function, node_push_suffix, based on
leaf_push_suffix, but renamed to reflect its purpose and made safe if
the node has no parent.

No changes are required to inflate/halve since the maximum suffix
length of the sub-trie starting from the node's parent isn't changed,
and the suffix lengths of the halved/inflated nodes are updated
through the creation of the new nodes and put_child called to add the
children to them.

In both fib_table_flush and fib_table_flush_external, given that a
walk of the entire trie is being done there's a choice of optimising
either for the case of a small number of leafs being flushed by
updating the suffix on a change and propagating it up, or optimising
for large numbers of leafs being flushed by deferring the updating of
the largest suffix length to the walk up to the parent node. I opted
for the latter to keep the algorithm linear in complexity in the case
of flushing all leafs and because it is close to the status quo.

Fixes: 5405afd1a306 ("fib_trie: Add tracking value for suffix length")
Signed-off-by: Robert Shearman 
---
 net/ipv4/fib_trie.c | 86 ++---
 1 file changed, 62 insertions(+), 24 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 026f309c51e9..701cae8af44a 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -421,8 +421,22 @@ static inline int tnode_full(struct key_vector *tn, struct 
key_vector *n)
return n && ((n->pos + n->bits) == tn->pos) && IS_TNODE(n);
 }
 
+static void node_push_suffix(struct key_vector *tn, struct key_vector *l)
+{
+   while (tn->slen < l->slen) {
+   tn->slen = l->slen;
+   tn = node_parent(tn);
+   if (!tn)
+   break;
+   }
+}
+
 /* Add a child at position i overwriting the old value.
  * Update the value of full_children and empty_children.
+ *
+ * The suffix length of the parent node and the rest of the tree is
+ * updated (if required) when adding/replacing a node, but is caller's
+ * responsibility on removal.
  */
 static void put_child(struct key_vector *tn, unsigned long i,
  struct key_vector *n)
@@ -447,8 +461,8 @@ static void put_child(struct key_vector *tn, unsigned long 
i,
else if (!wasfull && isfull)
tn_info(tn)->full_children++;
 
-   if (n && (tn->slen < n->slen))
-   tn->slen = n->slen;
+   if (n)
+   node_push_suffix(tn, n);
 
rcu_assign_pointer(tn->tnode[i], n);
 }
@@ -919,34 +933,35 @@ static struct key_vector *resize(struct trie *t, struct 
key_vector *tn)
if (max_work != MAX_WORK)
return tp;
 
-   /* push the suffix length to the parent node */
-   if (tn->slen > tn->pos) {
-   unsigned char slen = update_suffix(tn);
-
-   if (slen > tp->slen)
-   tp->slen = slen;
-   }
-
return tp;
 }
 
-static void leaf_pull_suffix(struct key_vector *tp, struct key_vector *l)
+static void node_set_suffix(struct key_vector *tp, unsigned char slen)
 {
-   while ((tp->slen > tp->pos) && (tp->slen > l->slen)) {
-   if (update_suffix(tp) > l->slen)
+   if (slen > tp->slen)
+   tp->slen = slen;
+}
+
+static void node_pull_suffix(struct key_vector *tn)
+{
+   

Re: [PATCH] mpls: Add missing RCU-bh read side critical section locking in output path

2016-06-21 Thread Robert Shearman

On 20/06/16 19:05, Lennert Buytenhek wrote:

From: David Barroso 

When locally originated IP traffic hits a route that says to push
MPLS labels, we'll get a call chain dst_output() -> lwtunnel_output()
-> mpls_output() -> neigh_xmit() -> ___neigh_lookup_noref() where the
last function in this chain accesses a RCU-bh protected struct
neigh_table pointer without us ever having declared an RCU-bh read
side critical section.

As in case of locally originated IP traffic we'll be running in process
context, with softirqs enabled, we can be preempted by a softirq at any
time, and RCU-bh considers the completion of a softirq as signaling
the end of any pending read-side critical sections, so if we do get a
softirq here, we can end up with an unexpected RCU grace period and
all the nastiness that that comes with.

This patch makes neigh_xmit() take rcu_read_{,un}lock_bh() around the
code that expects to be treated as an RCU-bh read side critical section.

Signed-off-by: David Barroso 
Signed-off-by: Lennert Buytenhek 


LGTM too.

Acked-by: Robert Shearman 


Re: [PATCH net-next] mpls: allow routes on ipgre devices

2016-06-16 Thread Robert Shearman

On 16/06/16 09:09, Simon Horman wrote:

This appears to be necessary and sufficient to provide
MPLS in GRE (RFC4023) support.

This can be used by establishing an ipgre tunnel device
and then routing MPLS over it.

The following example will forward MPLS frames received with an outermost
MPLS label 100 over tun1, a GRE tunnel. The forwarded packet will have the
outermost MPLS LSE removed and two new LSEs added with labels 200
(outermost) and 300 (next).

ip link add name tun1 type gre remote 10.0.99.193 local 10.0.99.192 ttl 225
ip link set up dev tun1
ip addr add 10.0.98.192/24 dev tun1
ip route sh

echo 1 > /proc/sys/net/mpls/conf/eth0/input
echo 101 > /proc/sys/net/mpls/platform_labels
ip -f mpls route add 100 as 200/300 via inet 10.0.98.193
ip -f mpls route sh

Also remove unnecessary braces.

Reviewed-by: Dinan Gunawardena 


Thanks for testing this and making the necessary change.

Acked-by: Robert Shearman 


Signed-off-by: Simon Horman 


Re: [PATCH net-next 1/2] mpls: packet stats

2016-02-19 Thread Robert Shearman

On 16/02/16 20:26, roopa wrote:

On 2/16/16, 7:41 AM, David Miller wrote:

Statistics not provided via netlink are useless in real installations.

In fact I would say to forego the proc interface entirely, it's a second
class citizen for statistics gathering and has a non-triviel per-device
cost for instantiation.


+1

I agree with the cost too.

Robert, I was thinking of responding to your series to add netlink stats for 
AF_MPLS in
  rtnl_af_ops (similar to  IFLA_INET6_STATS). But, soon realized that it is 
currently used by ipv6 alone
and it also ended up with a skip filter (RTEXT_FILTER_SKIP_STATS). So, 
extending that interface for
mpls is not the right thing to do either.


ipv4 doesn't have per-interface stats, so the fact that it doesn't use 
fill_link_af to export stats doesn't really add much to the argument.


The real issue with the IFLA_INET6_STATS mechanism is that they are 
included in netlink broadcasts, not just netlink unicasts and there is 
no way of filtering for broadcasts at the moment.



There is work being done for a separate netlink infrastructure for stats.
I would wait for that infrastructure to be ready to add mpls stats. It should 
be available soon.


Great, any details on what it would look like? Can I assist in any way 
in the development?


In the mean time, I'll rebase and resubmit the ip ttl propagation patch 
separately.


Thanks,
Rob


[PATCH net-next v2 1/3] lwtunnel: autoload of lwt modules

2016-02-19 Thread Robert Shearman
The lwt implementations using net devices can autoload using the
existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
that lwt modules not using net devices can use.

Therefore, add the ability to autoload modules registering lwt
operations for lwt implementations not using a net device so that
users don't have to manually load the modules.

Only users with the CAP_NET_ADMIN capability can cause modules to be
loaded, which is ensured by rtnetlink_rcv_msg rejecting non-RTM_GETxxx
messages for users without this capability, and by
lwtunnel_build_state not being called in response to RTM_GETxxx
messages.

Signed-off-by: Robert Shearman 
---
 include/net/lwtunnel.h |  4 +++-
 net/core/lwtunnel.c| 37 +
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 66350ce3e955..e9f116e29c22 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -170,6 +170,8 @@ static inline int lwtunnel_input(struct sk_buff *skb)
return -EOPNOTSUPP;
 }
 
-#endif
+#endif /* CONFIG_LWTUNNEL */
+
+#define MODULE_ALIAS_RTNL_LWT(encap_type) MODULE_ALIAS("rtnl-lwt-" 
__stringify(encap_type))
 
 #endif /* __NET_LWTUNNEL_H */
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 299cfc24d888..669ecc9f884e 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -27,6 +27,31 @@
 #include 
 #include 
 
+#ifdef CONFIG_MODULES
+
+static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type)
+{
+   /* Only lwt encaps implemented without using an interface for
+* the encap need to return a string here.
+*/
+   switch (encap_type) {
+   case LWTUNNEL_ENCAP_MPLS:
+   return "MPLS";
+   case LWTUNNEL_ENCAP_ILA:
+   return "ILA";
+   case LWTUNNEL_ENCAP_IP6:
+   case LWTUNNEL_ENCAP_IP:
+   case LWTUNNEL_ENCAP_NONE:
+   case __LWTUNNEL_ENCAP_MAX:
+   /* should not have got here */
+   WARN_ON(1);
+   break;
+   }
+   return NULL;
+}
+
+#endif /* CONFIG_MODULES */
+
 struct lwtunnel_state *lwtunnel_state_alloc(int encap_len)
 {
struct lwtunnel_state *lws;
@@ -85,6 +110,18 @@ int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
ret = -EOPNOTSUPP;
rcu_read_lock();
ops = rcu_dereference(lwtun_encaps[encap_type]);
+#ifdef CONFIG_MODULES
+   if (!ops) {
+   const char *encap_type_str = lwtunnel_encap_str(encap_type);
+
+   if (encap_type_str) {
+   rcu_read_unlock();
+   request_module("rtnl-lwt-%s", encap_type_str);
+   rcu_read_lock();
+   ops = rcu_dereference(lwtun_encaps[encap_type]);
+   }
+   }
+#endif
if (likely(ops && ops->build_state))
ret = ops->build_state(dev, encap, family, cfg, lws);
rcu_read_unlock();
-- 
2.1.4



[PATCH net-next v2 3/3] ila: autoload module

2016-02-19 Thread Robert Shearman
Avoid users having to manually load the module by adding a module
alias allowing it to be autoloaded by the lwt infra.

Signed-off-by: Robert Shearman 
---
 net/ipv6/ila/ila_common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ila/ila_common.c b/net/ipv6/ila/ila_common.c
index 32dc9aab7297..30613050e4ca 100644
--- a/net/ipv6/ila/ila_common.c
+++ b/net/ipv6/ila/ila_common.c
@@ -99,5 +99,6 @@ static void __exit ila_fini(void)
 
 module_init(ila_init);
 module_exit(ila_fini);
+MODULE_ALIAS_RTNL_LWT(ILA);
 MODULE_AUTHOR("Tom Herbert ");
 MODULE_LICENSE("GPL");
-- 
2.1.4



[PATCH net-next v2 2/3] mpls: autoload lwt module

2016-02-19 Thread Robert Shearman
Avoid users having to manually load the module by adding a module
alias allowing it to be autoloaded by the lwt infra.

Signed-off-by: Robert Shearman 
---
 net/mpls/mpls_iptunnel.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index fb31aa87de81..644a8da6d4bd 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -227,5 +227,6 @@ static void __exit mpls_iptunnel_exit(void)
 }
 module_exit(mpls_iptunnel_exit);
 
+MODULE_ALIAS_RTNL_LWT(MPLS);
 MODULE_DESCRIPTION("MultiProtocol Label Switching IP Tunnels");
 MODULE_LICENSE("GPL v2");
-- 
2.1.4



[PATCH net-next v2 0/3] lwtunnel: autoload of lwt modules

2016-02-19 Thread Robert Shearman
Changes since v1:
 - remove "LWTUNNEL_ENCAP_" prefix for the string form of the encaps
   used when requesting the module to reduce duplication, and don't
   bother returning strings for lwt modules using netdevices, both
   suggested by Jiri.
 - update commit message of first patch to clarify security
   implications, in response to Eric's comments.

The lwt implementations using net devices can autoload using the
existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
that lwt modules not using net devices can use.

Therefore, these patches add the ability to autoload modules
registering lwt operations for lwt implementations not using a net
device so that users don't have to manually load the modules.

Robert Shearman (3):
  lwtunnel: autoload of lwt modules
  mpls: autoload lwt module
  ila: autoload module

 include/net/lwtunnel.h|  4 +++-
 net/core/lwtunnel.c   | 37 +
 net/ipv6/ila/ila_common.c |  1 +
 net/mpls/mpls_iptunnel.c  |  1 +
 4 files changed, 42 insertions(+), 1 deletion(-)

-- 
2.1.4



Re: [PATCH net-next 1/3] lwtunnel: autoload of lwt modules

2016-02-16 Thread Robert Shearman

On 15/02/16 21:33, Eric W. Biederman wrote:

Robert Shearman  writes:

@@ -85,6 +109,14 @@ int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
ret = -EOPNOTSUPP;
rcu_read_lock();
ops = rcu_dereference(lwtun_encaps[encap_type]);
+#ifdef CONFIG_MODULES
+   if (!ops) {
+   rcu_read_unlock();
+   request_module("rtnl-lwt-%s", lwtunnel_encap_str(encap_type));
+   rcu_read_lock();
+   ops = rcu_dereference(lwtun_encaps[encap_type]);
+   }
+#endif
if (likely(ops && ops->build_state))
ret = ops->build_state(dev, encap, family, cfg, lws);
rcu_read_unlock();


My memory is fuzzy on how this is done elsewhere but this looks like it
needs a capability check to ensure that non-root user's can't trigger
this.

It tends to be problematic if a non-root user can trigger an autoload of
a known-buggy module.  With a combination of user namespaces and network
namespaces unprivileged users can cause just about every corner of the
network stack to be exercised.


The same protections apply to this as to the IFLA_INFO_KIND module 
autoloading, namely by rtnetlink_rcv_msg ensuring that no messages other 
than gets can be done by an unprivileged user:


type = nlh->nlmsg_type;
...
type -= RTM_BASE;
...
kind = type&3;

if (kind != 2 && !netlink_net_capable(skb, CAP_NET_ADMIN))
return -EPERM;

The lwtunnel_build_state function is only called by the processing of 
non-get message types.


Is this sufficient or are you looking for something in addition?

Thanks,
Rob


Re: [PATCH net-next 1/3] lwtunnel: autoload of lwt modules

2016-02-15 Thread Robert Shearman

On 15/02/16 16:32, Jiri Benc wrote:

On Mon, 15 Feb 2016 16:22:08 +, Robert Shearman wrote:

Yeah, it's the C preprocessor. MODULE_ALIAS_RTNL_LWT includes the string
for the encap type in the module alias, and since the LWT encap types
are defined as enums this is symbolic name. I can't see any way of
getting the preprocessor to convert
MODULE_ALIAS_RTNL_LWT(LWTUNNEL_ENCAP_MPLS) into "rtnl-lwt-MPLS", but I'm
open to suggestions.


MODULE_ALIAS_RTNL_LWT(MPLS)?

But whatever, as I said, no strong preference.


I was so hung up on the making the string match the name of the enum 
that I'd discounted that, but you're right that doing that would reduce 
duplication in the alias string.



True, but I figured that it was cleaner for the lwtunnel infra to not
assume whether how those modules are implemented. If you disagree, then
I can change to doing as you suggest.


It's not completely transparent to the infrastructure anyway, the
tunnel type needs to be added to lwtunnel_encap_str for new tunnels.
The way I suggested, it's only added for those tunnels having the
module alias set.

Just trying to get rid of the unnecessary strings in
lwtunnel_encap_str. There's no need to bloat kernel with them if
they're never used.


Ok, will resubmit without the unnecessary strings in that function as 
well as with your suggestion above.


Thanks for the review,
Rob


Re: [PATCH net-next 1/3] lwtunnel: autoload of lwt modules

2016-02-15 Thread Robert Shearman

On 15/02/16 16:02, Jiri Benc wrote:

On Mon, 15 Feb 2016 15:42:01 +, Robert Shearman wrote:

+static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type)
+{
+   switch (encap_type) {
+   case LWTUNNEL_ENCAP_MPLS:
+   return "LWTUNNEL_ENCAP_MPLS";
+   case LWTUNNEL_ENCAP_IP:
+   return "LWTUNNEL_ENCAP_IP";
+   case LWTUNNEL_ENCAP_ILA:
+   return "LWTUNNEL_ENCAP_ILA";
+   case LWTUNNEL_ENCAP_IP6:
+   return "LWTUNNEL_ENCAP_IP6";
+   case LWTUNNEL_ENCAP_NONE:
+   case __LWTUNNEL_ENCAP_MAX:
+   /* should not have got here */
+   break;
+   }
+   WARN_ON(1);
+   return "LWTUNNEL_ENCAP_NONE";
+}
+
+#endif /* CONFIG_MODULES */
+
  struct lwtunnel_state *lwtunnel_state_alloc(int encap_len)
  {
struct lwtunnel_state *lws;
@@ -85,6 +109,14 @@ int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
ret = -EOPNOTSUPP;
rcu_read_lock();
ops = rcu_dereference(lwtun_encaps[encap_type]);
+#ifdef CONFIG_MODULES
+   if (!ops) {
+   rcu_read_unlock();
+   request_module("rtnl-lwt-%s", lwtunnel_encap_str(encap_type));


Why the repeating of "lwt"/"LWTUNNEL" and the unnecessary "ENCAP"?
Wouldn't be lwtunnel_encap_str returning just "MPLS" or "ILA" enough?
I don't have any strong preference here, it just looks weird to me.
Maybe there's a reason.


Yeah, it's the C preprocessor. MODULE_ALIAS_RTNL_LWT includes the string 
for the encap type in the module alias, and since the LWT encap types 
are defined as enums this is symbolic name. I can't see any way of 
getting the preprocessor to convert 
MODULE_ALIAS_RTNL_LWT(LWTUNNEL_ENCAP_MPLS) into "rtnl-lwt-MPLS", but I'm 
open to suggestions.


I could just drop the "lwt-" bit of the alias string given that it's 
included in the name of the enum values.



Also, this doesn't affect IP lwtunnels, i.e. LWTUNNEL_ENCAP_IP and
LWTUNNEL_ENCAP_IP6. Should we just return NULL from lwtunnel_encap_str
in such cases (plus unknown encap_type) and WARN on the NULL here?


True, but I figured that it was cleaner for the lwtunnel infra to not 
assume whether how those modules are implemented. If you disagree, then 
I can change to doing as you suggest.


Thanks,
Rob


[PATCH net-next 2/3] mpls: autoload lwt module

2016-02-15 Thread Robert Shearman
Avoid users having to manually load the module by adding a module
alias allowing it to be autoloaded by the lwt infra.

Signed-off-by: Robert Shearman 
---
 net/mpls/mpls_iptunnel.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/mpls/mpls_iptunnel.c b/net/mpls/mpls_iptunnel.c
index fb31aa87de81..1b4536960f79 100644
--- a/net/mpls/mpls_iptunnel.c
+++ b/net/mpls/mpls_iptunnel.c
@@ -227,5 +227,6 @@ static void __exit mpls_iptunnel_exit(void)
 }
 module_exit(mpls_iptunnel_exit);
 
+MODULE_ALIAS_RTNL_LWT(LWTUNNEL_ENCAP_MPLS);
 MODULE_DESCRIPTION("MultiProtocol Label Switching IP Tunnels");
 MODULE_LICENSE("GPL v2");
-- 
2.1.4



[PATCH net-next 1/3] lwtunnel: autoload of lwt modules

2016-02-15 Thread Robert Shearman
The lwt implementations using net devices can autoload using the
existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
that lwt modules not using net devices can use.

Therefore, add the ability to autoload modules registering lwt
operations for lwt implementations not using a net device so that
users don't have to manually load the modules.

Signed-off-by: Robert Shearman 
---
 include/net/lwtunnel.h |  4 +++-
 net/core/lwtunnel.c| 32 
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index 66350ce3e955..e9f116e29c22 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -170,6 +170,8 @@ static inline int lwtunnel_input(struct sk_buff *skb)
return -EOPNOTSUPP;
 }
 
-#endif
+#endif /* CONFIG_LWTUNNEL */
+
+#define MODULE_ALIAS_RTNL_LWT(encap_type) MODULE_ALIAS("rtnl-lwt-" 
__stringify(encap_type))
 
 #endif /* __NET_LWTUNNEL_H */
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 299cfc24d888..8ef5e5cec03e 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -27,6 +27,30 @@
 #include 
 #include 
 
+#ifdef CONFIG_MODULES
+
+static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type)
+{
+   switch (encap_type) {
+   case LWTUNNEL_ENCAP_MPLS:
+   return "LWTUNNEL_ENCAP_MPLS";
+   case LWTUNNEL_ENCAP_IP:
+   return "LWTUNNEL_ENCAP_IP";
+   case LWTUNNEL_ENCAP_ILA:
+   return "LWTUNNEL_ENCAP_ILA";
+   case LWTUNNEL_ENCAP_IP6:
+   return "LWTUNNEL_ENCAP_IP6";
+   case LWTUNNEL_ENCAP_NONE:
+   case __LWTUNNEL_ENCAP_MAX:
+   /* should not have got here */
+   break;
+   }
+   WARN_ON(1);
+   return "LWTUNNEL_ENCAP_NONE";
+}
+
+#endif /* CONFIG_MODULES */
+
 struct lwtunnel_state *lwtunnel_state_alloc(int encap_len)
 {
struct lwtunnel_state *lws;
@@ -85,6 +109,14 @@ int lwtunnel_build_state(struct net_device *dev, u16 
encap_type,
ret = -EOPNOTSUPP;
rcu_read_lock();
ops = rcu_dereference(lwtun_encaps[encap_type]);
+#ifdef CONFIG_MODULES
+   if (!ops) {
+   rcu_read_unlock();
+   request_module("rtnl-lwt-%s", lwtunnel_encap_str(encap_type));
+   rcu_read_lock();
+   ops = rcu_dereference(lwtun_encaps[encap_type]);
+   }
+#endif
if (likely(ops && ops->build_state))
ret = ops->build_state(dev, encap, family, cfg, lws);
rcu_read_unlock();
-- 
2.1.4



[PATCH net-next 3/3] ila: autoload module

2016-02-15 Thread Robert Shearman
Avoid users having to manually load the module by adding a module
alias allowing it to be autoloaded by the lwt infra.

Signed-off-by: Robert Shearman 
---
 net/ipv6/ila/ila_common.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ila/ila_common.c b/net/ipv6/ila/ila_common.c
index 32dc9aab7297..8ecf2482560e 100644
--- a/net/ipv6/ila/ila_common.c
+++ b/net/ipv6/ila/ila_common.c
@@ -99,5 +99,6 @@ static void __exit ila_fini(void)
 
 module_init(ila_init);
 module_exit(ila_fini);
+MODULE_ALIAS_RTNL_LWT(LWTUNNEL_ENCAP_ILA);
 MODULE_AUTHOR("Tom Herbert ");
 MODULE_LICENSE("GPL");
-- 
2.1.4



[PATCH net-next 0/3] lwtunnel: autoload of lwt modules

2016-02-15 Thread Robert Shearman
The lwt implementations using net devices can autoload using the
existing mechanism using IFLA_INFO_KIND. However, there's no mechanism
that lwt modules not using net devices can use.

Therefore, add the ability to autoload modules registering lwt
operations for lwt implementations not using a net device so that
users don't have to manually load the modules.

Robert Shearman (3):
  lwtunnel: autoload of lwt modules
  mpls: autoload lwt module
  ila: autoload module

 include/net/lwtunnel.h|  4 +++-
 net/core/lwtunnel.c   | 32 
 net/ipv6/ila/ila_common.c |  1 +
 net/mpls/mpls_iptunnel.c  |  1 +
 4 files changed, 37 insertions(+), 1 deletion(-)

-- 
2.1.4



Re: [PATCH net-next 1/2] nsh: encapsulation module

2016-02-11 Thread Robert Shearman

On 11/02/16 11:35, Brian Russell wrote:
...

diff --git a/include/net/nsh.h b/include/net/nsh.h
new file mode 100644
index 000..7a5fb95
--- /dev/null
+++ b/include/net/nsh.h
@@ -0,0 +1,158 @@
+/*
+ * Network Service Header (NSH) inserted onto encapsulated packets
+ * or frames to realize service function paths.
+ * NSH also provides a mechanism for metadata exchange along the
+ * instantiated service path.
+ *
+ * https://tools.ietf.org/html/draft-ietf-sfc-nsh-01
+ *
+ * Copyright (c) 2015 by Brocade Communications Systems, Inc.
+ * All rights reserved.
+ */


All rights reserved isn't really compatible with the GPL :-)

...

diff --git a/net/ipv4/nsh.c b/net/ipv4/nsh.c
new file mode 100644
index 000..70e5ef0
--- /dev/null
+++ b/net/ipv4/nsh.c
@@ -0,0 +1,362 @@
+/*
+ * Network Service Header (NSH) inserted onto encapsulated packets
+ * or frames to realize service function paths.
+ * NSH also provides a mechanism for metadata exchange along the
+ * instantiated service path.
+ *
+ * https://tools.ietf.org/html/draft-ietf-sfc-nsh-01
+ *
+ * Copyright (c) 2015 by Brocade Communications Systems, Inc.
+ * All rights reserved.
+ */


Ditto.

Thanks,
Rob


Re: [PATCH net-next 2/2] mpls: allow TTL propagation to/from IP packets to be configured

2016-02-09 Thread Robert Shearman

On 06/02/16 18:36, Eric W. Biederman wrote:

Robert Shearman  writes:


It is sometimes desirable to present an MPLS transport network as a
single hop to traffic transiting it because it prevents confusion when
diagnosing failures. An example of where confusion can be generated is
when addresses used in the provider network overlap with addresses in
the overlay network and the addresses get exposed through ICMP errors
generated as packets transit the provider network.


The configuration you are talking about is a bug.  ICMP errors can
be handled without confusion simplify by forwarding the packets out
to the end of the tunnel.  Which is how the standards require that
situation to be handled if an ICMP error is generated.


You're absolutely right that the standards say how the ICMP errors 
should be handled in order for them to be forwarded correctly back to 
the sender, but I'm referring to what source addresses customers of 
service provider see in those ICMP errors generated when e.g. doing a 
traceroute. Furthermore, the mechanism that you mention adds for scope 
for mis-diagnosis since a traceroute won't show any information for hops 
PE1, P1 and P2 if PE2 is dropping the traffic for that LSP (because the 
mechanism you describe relies on PE2 or even a further CE to hairpin the 
ICMP error back to the originator of the error-causing traffic).


If you need further evidence that this is something that network 
operators might want to do, then see RFC 3032, s2.4.3 where it states:


   It is recognized that there may be situations where a network
   administration prefers to decrement the IPv4 TTL by one as it
   traverses an MPLS domain, instead of decrementing the IPv4 TTL by the
   number of LSP hops within the domain.

And one more reference is that this behaviour is codified in RFC 3443. 
For the purposes of clarity, Uniform Model in RFC 3443 corresponds to 
ip_ttl_propagate = 1 (default) and (Short) Pipe Model corresponds to 
ip_ttl_propagate = 0.





Therefore, provide the ability to control whether the TTL value from
an MPLS packet is propagated to an IPv4/IPv6 packet when the last
label is popped through the addition of a new per-namespace sysctl:
"net.mpls.ip_ttl_propagate" which defaults to enabled.

Use the same sysctl to control whether the TTL is propagated from IP
packets into the MPLS header. If the TTL isn't propagated then a
default TTL value is used which can be configured via a new sysctl:
"net.mpls.default_ttl".


Ugh.  There is a case for this, but this feels much more like a per
tunnel/label/route egress property not a per network interface property.

I don't recall all of the gory details but some flavors of mpls labels
always require ttl propogation (the ip over mpls default) and some
flavors of mpls labels always require no propagation (pseudo wires).


Clearly, if the label isn't used for the purposes of encapsulating L3 
traffic, then you can't propagate the L3 TTL into it and you have to put 
some other value in there instead. I envisaged that the value of 
default_ttl would be used in these cases and this is why I worded the 
documentation for the default_ttl sysctl like so:


Default TTL value to use for MPLS packets where it cannot be
propagated from an IP header, either because one isn't present
or ip_ttl_propagate has been disabled.

Given that traffic arriving with a pseudo-wire label will have to be 
forwarded differently from traffic arriving for labels with L3 traffic, 
you will know that the label is associated with L2 traffic and that the 
TTL cannot be propagated.



There may be something cute in between.  For something that is a per
tunnel property I don't feel comfortable with a sysctl.


I cannot think of a use-case where it would make sense to have a mix of 
TTL being propagated and not being propagated on a per-LSP basis. I note 
that all of the most widely used proprietary MPLS implementations 
support global IP TTL propagation configuration and I'm not aware of any 
MPLS implementation that implements a per-LSP control for IP TTL 
propagation.



Especially when it is something as potentially dangerous as enabling
packets to loop in a network.  As I recall most IP over IP tunnels
also propogate the ttl between the inner and outer ip packets to prevent
loops.


There is no possibility of packets looping in a network as the TTL is 
always decremented when a label is pushed, whether the packet came in as 
IP or MPLS, and when swapping a label egress TTL must be one less than 
the ingress TTL, as defined by the MPLS RFC. When popping the last label 
we have to ensure that the MPLS TTL is not propagated to IP TTL so that 
there's no possibility of set the IP TTL beyond the value it entered the 
LSP (after the TTL decrement done as part of IP switching) with, but 
that is what this code does. Note that this is only the case if all 
routers are configured to

Re: [PATCH net-next 1/2] mpls: packet stats

2016-02-08 Thread Robert Shearman

On 06/02/16 10:58, Francois Romieu wrote:

Robert Shearman  :
[...]

diff --git a/net/mpls/Makefile b/net/mpls/Makefile
index 9ca923625016..6fdd61b9eae3 100644
--- a/net/mpls/Makefile
+++ b/net/mpls/Makefile

[...]

@@ -98,6 +94,29 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned 
int mtu)
  }
  EXPORT_SYMBOL_GPL(mpls_pkt_too_big);

+void mpls_stats_inc_outucastpkts(struct net_device *dev,
+const struct sk_buff *skb)
+{
+   struct mpls_dev *mdev;
+   struct inet6_dev *in6dev;


Nit: the scope can be reduced for both variables.


I'm happy to change this if this is the recommended style, but David 
Laight's reply suggests some doubt.





+
+   if (skb->protocol == htons(ETH_P_MPLS_UC)) {
+   mdev = mpls_dev_get(dev);
+   if (mdev)
+   MPLS_INC_STATS_LEN(mdev, skb->len,
+  MPLS_IFSTATS_MIB_OUTUCASTPKTS,
+  MPLS_IFSTATS_MIB_OUTOCTETS);
+   } else if (skb->protocol == htons(ETH_P_IP)) {
+   IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+   } else if (skb->protocol == htons(ETH_P_IPV6)) {
+   in6dev = __in6_dev_get(dev);
+   if (in6dev)
+   IP6_UPD_PO_STATS(dev_net(dev), in6dev,
+IPSTATS_MIB_OUT, skb->len);
+   }
+}

[...]

diff --git a/net/mpls/internal.h b/net/mpls/internal.h
index 732a5c17e986..b39770ff2307 100644
--- a/net/mpls/internal.h
+++ b/net/mpls/internal.h

[...]

+#define MPLS_INC_STATS(mdev, field)\
+   do {\
+   __typeof__(*(mdev)->stats) *ptr =\
+   raw_cpu_ptr((mdev)->stats);  \
+   local_bh_disable(); \
+   u64_stats_update_begin(&ptr->syncp); \
+   ptr->mib[field]++;   \
+   u64_stats_update_end(&ptr->syncp);   \
+   local_bh_enable();  \
+   } while (0)


I don't get the point of the local_bh_{disable / enable}.

Which kind of locally enabled bh code section do you anticipate these
helpers to run under ?


When a user process sends an IPv4/IPv6 packet destined to a route with 
mpls lwt encap.


Thanks,
Rob


[PATCH net-next 0/2] mpls: packet stats and ttl propagation config

2016-02-05 Thread Robert Shearman
This patch series implements new bits of mpls functionality: keeping
statistics on packets as they pass through the box and allowing ttl
propagation to be configured.

Robert Shearman (2):
  mpls: packet stats
  mpls: allow TTL propagation to/from IP packets to be configured

 Documentation/networking/mpls-sysctl.txt |  19 +++
 include/net/netns/mpls.h |   4 +
 net/mpls/Makefile|   1 +
 net/mpls/af_mpls.c   | 205 ---
 net/mpls/internal.h  |  93 +-
 net/mpls/mpls_iptunnel.c |  21 +++-
 net/mpls/proc.c  | 128 +++
 7 files changed, 417 insertions(+), 54 deletions(-)
 create mode 100644 net/mpls/proc.c

-- 
2.1.4



[PATCH net-next 1/2] mpls: packet stats

2016-02-05 Thread Robert Shearman
Having MPLS packet stats is useful for observing network operation and
for diagnosing network problems. In the absence of anything better,
use RFCs for MIBs defining MPLS stats for guidance on the semantics of
the stats to expose. RFC3813 details two per-interface packet stats
that should be provided (label lookup failures and fragmented packets)
and also provides interpretation of RFC2863 for other per-interface
stats (in/out ucast, mcast and bcast, in/out discards and errors and
in unknown protos).

Multicast, fragment and broadcast packet counters are printed, but not
stored to allow for future implementation of current standards or
future standards without user-space having to change.

All the introduced fields are 64-bit, even error ones, to ensure no
overflow with long uptimes. Per-CPU counters are used to avoid
cache-line contention on the commonly used fields. The other fields
have also been made per-CPU for code to avoid performance problems in
error conditions on the assumption that on some platforms the cost of
atomic operations could be more pexpensive than sending the packet
(which is what would be done in the success case). If that's not the
case, we could instead not use per-CPU counters for these fields.

The IPv6 proc code was used as an inspiration for the proc code
here, both in terms of the implementation as well as the location of
the per-device stats proc files: /proc/net/dev_snmp_mpls/.

Signed-off-by: Robert Shearman 
---
 include/net/netns/mpls.h |   1 +
 net/mpls/Makefile|   1 +
 net/mpls/af_mpls.c   | 135 ---
 net/mpls/internal.h  |  93 ++--
 net/mpls/mpls_iptunnel.c |  11 +++-
 net/mpls/proc.c  | 128 
 6 files changed, 334 insertions(+), 35 deletions(-)
 create mode 100644 net/mpls/proc.c

diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
index d29203651c01..3062b0aa3a08 100644
--- a/include/net/netns/mpls.h
+++ b/include/net/netns/mpls.h
@@ -12,6 +12,7 @@ struct netns_mpls {
size_t platform_labels;
struct mpls_route __rcu * __rcu *platform_label;
struct ctl_table_header *ctl;
+   struct proc_dir_entry *proc_net_devsnmp;
 };
 
 #endif /* __NETNS_MPLS_H__ */
diff --git a/net/mpls/Makefile b/net/mpls/Makefile
index 9ca923625016..6fdd61b9eae3 100644
--- a/net/mpls/Makefile
+++ b/net/mpls/Makefile
@@ -6,3 +6,4 @@ obj-$(CONFIG_MPLS_ROUTING) += mpls_router.o
 obj-$(CONFIG_MPLS_IPTUNNEL) += mpls_iptunnel.o
 
 mpls_router-y := af_mpls.o
+mpls_router-$(CONFIG_PROC_FS) += proc.o
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index b18c5ed42d95..6b3c96e2b21f 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -17,8 +18,8 @@
 #include 
 #if IS_ENABLED(CONFIG_IPV6)
 #include 
-#include 
 #endif
+#include 
 #include 
 #include "internal.h"
 
@@ -48,11 +49,6 @@ static struct mpls_route *mpls_route_input_rcu(struct net 
*net, unsigned index)
return rt;
 }
 
-static inline struct mpls_dev *mpls_dev_get(const struct net_device *dev)
-{
-   return rcu_dereference_rtnl(dev->mpls_ptr);
-}
-
 bool mpls_output_possible(const struct net_device *dev)
 {
return dev && (dev->flags & IFF_UP) && netif_carrier_ok(dev);
@@ -98,6 +94,29 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, unsigned 
int mtu)
 }
 EXPORT_SYMBOL_GPL(mpls_pkt_too_big);
 
+void mpls_stats_inc_outucastpkts(struct net_device *dev,
+const struct sk_buff *skb)
+{
+   struct mpls_dev *mdev;
+   struct inet6_dev *in6dev;
+
+   if (skb->protocol == htons(ETH_P_MPLS_UC)) {
+   mdev = mpls_dev_get(dev);
+   if (mdev)
+   MPLS_INC_STATS_LEN(mdev, skb->len,
+  MPLS_IFSTATS_MIB_OUTUCASTPKTS,
+  MPLS_IFSTATS_MIB_OUTOCTETS);
+   } else if (skb->protocol == htons(ETH_P_IP)) {
+   IP_UPD_PO_STATS(dev_net(dev), IPSTATS_MIB_OUT, skb->len);
+   } else if (skb->protocol == htons(ETH_P_IPV6)) {
+   in6dev = __in6_dev_get(dev);
+   if (in6dev)
+   IP6_UPD_PO_STATS(dev_net(dev), in6dev,
+IPSTATS_MIB_OUT, skb->len);
+   }
+}
+EXPORT_SYMBOL_GPL(mpls_stats_inc_outucastpkts);
+
 static u32 mpls_multipath_hash(struct mpls_route *rt,
   struct sk_buff *skb, bool bos)
 {
@@ -253,6 +272,7 @@ static int mpls_forward(struct sk_buff *skb, struct 
net_device *dev,
struct mpls_nh *nh;
struct mpls_entry_decoded dec;
struct net_device *out_dev;
+   struct mpls_dev *out_mdev;
struct mpls_dev *mdev;
unsigned int hh_len;
unsigned int new_header_size;
@@ -26

[PATCH net-next 2/2] mpls: allow TTL propagation to/from IP packets to be configured

2016-02-05 Thread Robert Shearman
It is sometimes desirable to present an MPLS transport network as a
single hop to traffic transiting it because it prevents confusion when
diagnosing failures. An example of where confusion can be generated is
when addresses used in the provider network overlap with addresses in
the overlay network and the addresses get exposed through ICMP errors
generated as packets transit the provider network.

Therefore, provide the ability to control whether the TTL value from
an MPLS packet is propagated to an IPv4/IPv6 packet when the last
label is popped through the addition of a new per-namespace sysctl:
"net.mpls.ip_ttl_propagate" which defaults to enabled.

Use the same sysctl to control whether the TTL is propagated from IP
packets into the MPLS header. If the TTL isn't propagated then a
default TTL value is used which can be configured via a new sysctl:
"net.mpls.default_ttl".

Signed-off-by: Robert Shearman 
---
 Documentation/networking/mpls-sysctl.txt | 19 +
 include/net/netns/mpls.h |  3 ++
 net/mpls/af_mpls.c   | 70 
 net/mpls/mpls_iptunnel.c | 10 -
 4 files changed, 83 insertions(+), 19 deletions(-)

diff --git a/Documentation/networking/mpls-sysctl.txt 
b/Documentation/networking/mpls-sysctl.txt
index 9ed15f86c17c..9e8cfa6d48d1 100644
--- a/Documentation/networking/mpls-sysctl.txt
+++ b/Documentation/networking/mpls-sysctl.txt
@@ -19,6 +19,25 @@ platform_labels - INTEGER
Possible values: 0 - 1048575
Default: 0
 
+ip_ttl_propagate - BOOL
+   Control whether TTL is propagated from the IPv4/IPv6 header to
+   the MPLS header on imposing labels and propagated from the
+   MPLS header to the IPv4/IPv6 header on popping the last label.
+
+   If disabled, the MPLS transport network will appear as a
+   single hop to transit traffic.
+
+   0 - disabled
+   1 - enabled (default)
+
+default_ttl - BOOL
+   Default TTL value to use for MPLS packets where it cannot be
+   propagated from an IP header, either because one isn't present
+   or ip_ttl_propagate has been disabled.
+
+   Possible values: 1 - 255
+   Default: 255
+
 conf//input - BOOL
Control whether packets can be input on this interface.
 
diff --git a/include/net/netns/mpls.h b/include/net/netns/mpls.h
index 3062b0aa3a08..9bdc2bd8fcb8 100644
--- a/include/net/netns/mpls.h
+++ b/include/net/netns/mpls.h
@@ -10,7 +10,10 @@ struct ctl_table_header;
 
 struct netns_mpls {
size_t platform_labels;
+   int ip_ttl_propagate;
+   int default_ttl;
struct mpls_route __rcu * __rcu *platform_label;
+
struct ctl_table_header *ctl;
struct proc_dir_entry *proc_net_devsnmp;
 };
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 6b3c96e2b21f..a2a4f0a884a3 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -31,7 +31,9 @@
 #define MPLS_NEIGH_TABLE_UNSPEC (NEIGH_LINK_TABLE + 1)
 
 static int zero = 0;
+static int one = 1;
 static int label_limit = (1 << 20) - 1;
+static int ttl_max = 255;
 
 static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
   struct nlmsghdr *nlh, struct net *net, u32 portid,
@@ -215,8 +217,8 @@ out:
return &rt->rt_nh[nh_index];
 }
 
-static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb,
-   struct mpls_entry_decoded dec)
+static bool mpls_egress(struct net *net, struct mpls_route *rt,
+   struct sk_buff *skb, struct mpls_entry_decoded dec)
 {
enum mpls_payload_type payload_type;
bool success = false;
@@ -239,24 +241,29 @@ static bool mpls_egress(struct mpls_route *rt, struct 
sk_buff *skb,
payload_type = ip_hdr(skb)->version;
 
switch (payload_type) {
-   case MPT_IPV4: {
-   struct iphdr *hdr4 = ip_hdr(skb);
-   skb->protocol = htons(ETH_P_IP);
-   csum_replace2(&hdr4->check,
- htons(hdr4->ttl << 8),
- htons(dec.ttl << 8));
-   hdr4->ttl = dec.ttl;
+   case MPT_IPV4:
+   if (net->mpls.ip_ttl_propagate) {
+   struct iphdr *hdr4 = ip_hdr(skb);
+
+   skb->protocol = htons(ETH_P_IP);
+   csum_replace2(&hdr4->check,
+ htons(hdr4->ttl << 8),
+ htons(dec.ttl << 8));
+   hdr4->ttl = dec.ttl;
+   }
success = true;
break;
-   }
-   case MPT_IPV6: {
-   struct ipv6hdr *hdr6 = ipv6_hdr(skb);
-   skb->protocol = htons(ETH_P_IPV6);
-   hdr6->hop_limit = dec.ttl;
+   case MPT_IPV6:
+   if (net->mpls.ip_ttl_propagat

[PATCH net 2/4] mpls: don't dump RTA_VIA attribute if not specified

2015-12-10 Thread Robert Shearman
The problem seen is that when adding a route with a nexthop with no
via address specified, iproute2 generates bogus output:

  # ip -f mpls route add 100 dev lo
  # ip -f mpls route list
  100 via inet 0.0.8.0 dev lo

The reason for this is that the kernel generates an RTA_VIA attribute
with the family set to AF_INET, but the via address data having zero
length. The cause of family being AF_INET is that on route insert
cfg->rc_via_table is left set to 0, which just happens to be
NEIGH_ARP_TABLE which is then translated into AF_INET.

iproute2 doesn't validate the length prior to printing and so prints
garbage. Although it could be fixed to do the validation, I would
argue that AF_INET addresses should always be exactly 4 bytes so the
kernel is really giving userspace bogus data.

Therefore, avoid generating the RTA_VIA attribute when dumping the
route if the via address wasn't specified on add/modify. This is
indicated by NEIGH_ARP_TABLE and a zero via address length - if the
user specified a via address the address length would have been
validated such that it was 4 bytes. Although this is a change in
behaviour that is visible to userspace, I believe that what was
generated before was invalid and as such userspace wouldn't be
expecting it.

Signed-off-by: Robert Shearman 
---
 net/mpls/af_mpls.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 3be29cb1f658..ac1c116abaac 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -1235,7 +1235,9 @@ static int mpls_dump_route(struct sk_buff *skb, u32 
portid, u32 seq, int event,
nla_put_labels(skb, RTA_NEWDST, nh->nh_labels,
   nh->nh_label))
goto nla_put_failure;
-   if (nla_put_via(skb, nh->nh_via_table, mpls_nh_via(rt, nh),
+   if ((nh->nh_via_table != NEIGH_ARP_TABLE ||
+nh->nh_via_alen != 0) &&
+   nla_put_via(skb, nh->nh_via_table, mpls_nh_via(rt, nh),
nh->nh_via_alen))
goto nla_put_failure;
dev = rtnl_dereference(nh->nh_dev);
@@ -1323,7 +1325,9 @@ static inline size_t lfib_nlmsg_size(struct mpls_route 
*rt)
 
if (nh->nh_dev)
payload += nla_total_size(4); /* RTA_OIF */
-   payload += nla_total_size(2 + nh->nh_via_alen); /* RTA_VIA */
+   if (nh->nh_via_table != NEIGH_ARP_TABLE ||
+   nh->nh_via_alen != 0) /* RTA_VIA */
+   payload += nla_total_size(2 + nh->nh_via_alen);
if (nh->nh_labels) /* RTA_NEWDST */
payload += nla_total_size(nh->nh_labels * 4);
} else {
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net 1/4] mpls: validate L2 via address length

2015-12-10 Thread Robert Shearman
If an L2 via address for an mpls nexthop is specified, the length of
the L2 address must match that expected by the output device,
otherwise it could access memory beyond the end of the via address
buffer in the route.

This check was present prior to commit f8efb73c97e2 ("mpls: multipath
route support"), but got lost in the refactoring, so add it back,
applying it to all nexthops in multipath routes.

Fixes: f8efb73c97e2 ("mpls: multipath route support")
Signed-off-by: Robert Shearman 
---
 net/mpls/af_mpls.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index c70d750148b6..3be29cb1f658 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -534,6 +534,10 @@ static int mpls_nh_assign_dev(struct net *net, struct 
mpls_route *rt,
if (!mpls_dev_get(dev))
goto errout;
 
+   if ((nh->nh_via_table == NEIGH_LINK_TABLE) &&
+   (dev->addr_len != nh->nh_via_alen))
+   goto errout;
+
RCU_INIT_POINTER(nh->nh_dev, dev);
 
return 0;
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net 0/4] mpls: fixes for nexthops without via addresses

2015-12-10 Thread Robert Shearman
These four fixes all apply to the case of having an mpls route with an
output device, but without a nexthop.

Patches 2 and 3 could really have been combined in one patch, but I
wanted to separate the fix for some recent breakage from the fix for a
day-1 issue.

Robert Shearman (4):
  mpls: validate L2 via address length
  mpls: don't dump RTA_VIA attribute if not specified
  mpls: fix out-of-bounds access when via address not specified
  mpls: make via address optional for multipath routes

 net/mpls/af_mpls.c | 43 +++
 1 file changed, 31 insertions(+), 12 deletions(-)

-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH net 4/4] mpls: make via address optional for multipath routes

2015-12-10 Thread Robert Shearman
The via address is optional for a single path route, yet is mandatory
when the multipath attribute is used:

  # ip -f mpls route add 100 dev lo
  # ip -f mpls route add 101 nexthop dev lo
  RTNETLINK answers: Invalid argument

Make them consistent by making the via address optional when the
RTA_MULTIPATH attribute is being parsed so that both forms of
specifying the route work.

Signed-off-by: Robert Shearman 
---
 net/mpls/af_mpls.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 7bfc85f52ca8..c32fc411a911 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -604,10 +604,14 @@ static int mpls_nh_build(struct net *net, struct 
mpls_route *rt,
goto errout;
}
 
-   err = nla_get_via(via, &nh->nh_via_alen, &nh->nh_via_table,
- __mpls_nh_via(rt, nh));
-   if (err)
-   goto errout;
+   if (via) {
+   err = nla_get_via(via, &nh->nh_via_alen, &nh->nh_via_table,
+ __mpls_nh_via(rt, nh));
+   if (err)
+   goto errout;
+   } else {
+   nh->nh_via_table = MPLS_NEIGH_TABLE_UNSPEC;
+   }
 
err = mpls_nh_assign_dev(net, rt, nh, oif);
if (err)
@@ -689,9 +693,6 @@ static int mpls_nh_build_multi(struct mpls_route_config 
*cfg,
nla_newdst = nla_find(attrs, attrlen, RTA_NEWDST);
}
 
-   if (!nla_via)
-   goto errout;
-
err = mpls_nh_build(cfg->rc_nlinfo.nl_net, rt, nh,
rtnh->rtnh_ifindex, nla_via,
nla_newdst);
@@ -1271,7 +1272,8 @@ static int mpls_dump_route(struct sk_buff *skb, u32 
portid, u32 seq, int event,
nh->nh_labels,
nh->nh_label))
goto nla_put_failure;
-   if (nla_put_via(skb, nh->nh_via_table,
+   if (nh->nh_via_table != MPLS_NEIGH_TABLE_UNSPEC &&
+   nla_put_via(skb, nh->nh_via_table,
mpls_nh_via(rt, nh),
nh->nh_via_alen))
goto nla_put_failure;
@@ -1343,7 +1345,9 @@ static inline size_t lfib_nlmsg_size(struct mpls_route 
*rt)
 
for_nexthops(rt) {
nhsize += nla_total_size(sizeof(struct rtnexthop));
-   nhsize += nla_total_size(2 + nh->nh_via_alen);
+   /* RTA_VIA */
+   if (nh->nh_via_table != MPLS_NEIGH_TABLE_UNSPEC)
+   nhsize += nla_total_size(2 + nh->nh_via_alen);
if (nh->nh_labels)
nhsize += nla_total_size(nh->nh_labels * 4);
} endfor_nexthops(rt);
-- 
2.1.4

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >