Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route

2017-01-16 Thread David Ahern
On 1/16/17 5:51 PM, David Miller wrote:
> From: David Ahern 
> Date: Sun, 15 Jan 2017 12:07:04 -0800
> 
>> @@ -2143,6 +2143,26 @@ int ip6_del_rt(struct rt6_info *rt)
>>  return __ip6_del_rt(rt, &info);
>>  }
>>  
>> +/* called with table lock held */
>  ...
>> @@ -2176,10 +2196,9 @@ static int ip6_route_del(struct fib6_config *cfg)
>>  continue;
>>  if (cfg->fc_protocol && cfg->fc_protocol != 
>> rt->rt6i_protocol)
>>  continue;
>> -dst_hold(&rt->dst);
>> -read_unlock_bh(&table->tb6_lock);
>>  
>> -return __ip6_del_rt(rt, &cfg->fc_nlinfo);
>> +err = __ip6_route_del(rt, cfg);
>> +break;
>>  }
> 
> fib6_del() (invoked by __ip6_route_del()) has to be invoked with the
> table lock held a sa writer, but here you are only holding it as a
> reader.

That table lock is still held. If you look up 2 lines I remove the line that 
releases the lock.

> 
> I also think some more thought has to be put into whether we can
> change behavior like this without using a flag, as suggested by Roopa.
> 

I'm all for changing the behavior by default, but AIUI that breaks the golden 
rule.

libnl as of libnl3_2_16-16-g29b71371e764  (a patch by Roopa) already handles 
both flavors -- if RTA_MULTIPATH is sent it handles it fine and if a series of 
routes are sent they are consolidated to 1 object with next hops.

iproute2 can handle it just fine as well. I made no changes to get the display 
format consistent. The problem here is scripts that rely on the old format.

quagga does *not* correctly handle the RTA_MULTIPATH attribute for IPv6 but 
then it does not properly handle IPv6 multipath routes it receives from the 
kernel at all. Same result with current kernel code or with this patch set - 
only the last nexthop is retained. (quagga routes pushed to kernel works, but 
kernel routes fed to quagga does not)

--

Notifications are still single route based. There is no user input to handle 
backwards compatibility in converting those to RTA_MULTIPATH. A global sysctl 
is the only real option I guess.


Anyways, I am open to either; just looking to further close the ipv4 / ipv6 gap.


Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route

2017-01-16 Thread David Ahern
On 1/16/17 6:37 PM, David Miller wrote:
> Is it clear now?

yes. time for a trip to the eye doctor


Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route

2017-01-16 Thread David Miller
From: David Ahern 
Date: Mon, 16 Jan 2017 18:27:36 -0700

> On 1/16/17 5:51 PM, David Miller wrote:
>> From: David Ahern 
>> Date: Sun, 15 Jan 2017 12:07:04 -0800
>> 
>>> @@ -2143,6 +2143,26 @@ int ip6_del_rt(struct rt6_info *rt)
>>> return __ip6_del_rt(rt, &info);
>>>  }
>>>  
>>> +/* called with table lock held */
>>  ...
>>> @@ -2176,10 +2196,9 @@ static int ip6_route_del(struct fib6_config *cfg)
>>> continue;
>>> if (cfg->fc_protocol && cfg->fc_protocol != 
>>> rt->rt6i_protocol)
>>> continue;
>>> -   dst_hold(&rt->dst);
>>> -   read_unlock_bh(&table->tb6_lock);
>>>  
>>> -   return __ip6_del_rt(rt, &cfg->fc_nlinfo);
>>> +   err = __ip6_route_del(rt, cfg);
>>> +   break;
>>> }
>> 
>> fib6_del() (invoked by __ip6_route_del()) has to be invoked with the
>> table lock held a sa writer, but here you are only holding it as a
>> reader.
> 
> That table lock is still held. If you look up 2 lines I remove the line that 
> releases the lock.

It's held in this function as a reader, it needs to be held as a writer.

That's why the lock is dropped in the current code and the existing
wrapper around fib6_del() takes it as a writer.

Is it clear now?

read_lock_bh(&table->lock);
fib6_del();

is invalid.

write_lock_bh(&table->lock);
fib6_del();

is required.


Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route

2017-01-16 Thread David Miller
From: David Ahern 
Date: Sun, 15 Jan 2017 12:07:04 -0800

> @@ -2143,6 +2143,26 @@ int ip6_del_rt(struct rt6_info *rt)
>   return __ip6_del_rt(rt, &info);
>  }
>  
> +/* called with table lock held */
 ...
> @@ -2176,10 +2196,9 @@ static int ip6_route_del(struct fib6_config *cfg)
>   continue;
>   if (cfg->fc_protocol && cfg->fc_protocol != 
> rt->rt6i_protocol)
>   continue;
> - dst_hold(&rt->dst);
> - read_unlock_bh(&table->tb6_lock);
>  
> - return __ip6_del_rt(rt, &cfg->fc_nlinfo);
> + err = __ip6_route_del(rt, cfg);
> + break;
>   }

fib6_del() (invoked by __ip6_route_del()) has to be invoked with the
table lock held a sa writer, but here you are only holding it as a
reader.

I also think some more thought has to be put into whether we can
change behavior like this without using a flag, as suggested by Roopa.


Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route

2017-01-16 Thread David Ahern
On 1/16/17 8:48 AM, Roopa Prabhu wrote:
> Do we really need the flag ?. It seems like delete with just prefix should 
> delete all the routes in a multipath
> route by default... (understand that you have it there to preserve existing 
> behavior...for people who maybe relying on it. But this seems more like a bug 
> fix. route replace went through a few such bug fixes "ipv6: fix ECMP route 
> replacement"). ok with either approach.

I'm fine dropping the flag for this patch and just deleting the entire route. 
The flag only exists because this patch changes behavior visible to the user.


Re: [PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route

2017-01-16 Thread Roopa Prabhu
On 1/15/17, 12:07 PM, David Ahern wrote:
> IPv4 allows multipath routes to be deleted using just the prefix and
> length. For example:
> $ ip ro ls vrf red
> unreachable default metric 8192
> 1.1.1.0/24
> nexthop via 10.100.1.254  dev eth1 weight 1
> nexthop via 10.11.200.2  dev eth11.200 weight 1
> 10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
> 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3
>
> $ ip ro del 1.1.1.0/24 vrf red
>
> $ ip ro ls vrf red
> unreachable default metric 8192
> 10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
> 10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3
>
> The same notation does not work with IPv6 because of how multipath routes
> are implemented for IPv6. For IPv6 only the first nexthop of a multipath
> route is deleted if the request contains only a prefix and length. This
> leads to unnecessary complexity in userspace dealing with IPv6 multipath
> routes.
>
> This patch allows all nexthops to be deleted without specifying each one
> in the delete request by passing a new flag, RTM_F_ALL_NEXTHOPS, in
> rtm_flags. Internally, this is done by walking the sibling list of the
> route matching the specifications given (prefix, length, metric, protocol,
> etc).
>
> With this patch (and an updated iproute2 command):
> $  ip -6 ro ls vrf red
> 2001:db8::/120 via 2001:db8:1::62 dev eth1 metric 256  pref medium
> 2001:db8::/120 via 2001:db8:1::61 dev eth1 metric 256  pref medium
> 2001:db8::/120 via 2001:db8:1::60 dev eth1 metric 256  pref medium
> 2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
> ...
>
> $ ip -6 ro del vrf red ::1/120
> $ ip -6 ro ls vrf red
> 2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
> ...
>
> The flag is added to fib6_config by converting fc_type to a u16 (as
> noted fc_type only uses 8 bits). The new u16 hole is a bitmap with
> fc_delete_all_nexthop as the first bit.
>
> Suggested-by: Dinesh Dutt 
> Signed-off-by: David Ahern 
> ---
> v2
> - switched example to rfc 3849 documentation address per request
> - changed delete loop to explicitly look at siblings list for
>   first route matching specs given (metric, protocol, etc)
>
>  include/net/ip6_fib.h  |  4 +++-
>  include/uapi/linux/rtnetlink.h |  1 +
>  net/ipv6/route.c   | 28 +---
>  3 files changed, 29 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index a74e2aa40ef4..11ab99e87c5f 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -37,7 +37,9 @@ struct fib6_config {
>   int fc_ifindex;
>   u32 fc_flags;
>   u32 fc_protocol;
> - u32 fc_type;/* only 8 bits are used */
> + u16 fc_type;/* only 8 bits are used */
> + u16 fc_delete_all_nexthop : 1,
> + __unused : 15;
>  
>   struct in6_addr fc_dst;
>   struct in6_addr fc_src;
> diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
> index 8c93ad1ef9ab..7fb206bc42f9 100644
> --- a/include/uapi/linux/rtnetlink.h
> +++ b/include/uapi/linux/rtnetlink.h
> @@ -276,6 +276,7 @@ enum rt_scope_t {
>  #define RTM_F_EQUALIZE   0x400   /* Multipath equalizer: NI  
> */
>  #define RTM_F_PREFIX 0x800   /* Prefix addresses */
>  #define RTM_F_LOOKUP_TABLE   0x1000  /* set rtm_table to FIB lookup result */
> +#define RTM_F_ALL_NEXTHOPS   0x2000  /* delete all nexthops (IPv6) */
>  
Do we really need the flag ?. It seems like delete with just prefix should 
delete all the routes in a multipath
route by default... (understand that you have it there to preserve existing 
behavior...for people who maybe relying on it. But this seems more like a bug 
fix. route replace went through a few such bug fixes "ipv6: fix ECMP route 
replacement"). ok with either approach.



[PATCH net-next v2 1/3] net: ipv6: Allow shorthand delete of all nexthops in multipath route

2017-01-15 Thread David Ahern
IPv4 allows multipath routes to be deleted using just the prefix and
length. For example:
$ ip ro ls vrf red
unreachable default metric 8192
1.1.1.0/24
nexthop via 10.100.1.254  dev eth1 weight 1
nexthop via 10.11.200.2  dev eth11.200 weight 1
10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3

$ ip ro del 1.1.1.0/24 vrf red

$ ip ro ls vrf red
unreachable default metric 8192
10.11.200.0/24 dev eth11.200 proto kernel scope link src 10.11.200.3
10.100.1.0/24 dev eth1 proto kernel scope link src 10.100.1.3

The same notation does not work with IPv6 because of how multipath routes
are implemented for IPv6. For IPv6 only the first nexthop of a multipath
route is deleted if the request contains only a prefix and length. This
leads to unnecessary complexity in userspace dealing with IPv6 multipath
routes.

This patch allows all nexthops to be deleted without specifying each one
in the delete request by passing a new flag, RTM_F_ALL_NEXTHOPS, in
rtm_flags. Internally, this is done by walking the sibling list of the
route matching the specifications given (prefix, length, metric, protocol,
etc).

With this patch (and an updated iproute2 command):
$  ip -6 ro ls vrf red
2001:db8::/120 via 2001:db8:1::62 dev eth1 metric 256  pref medium
2001:db8::/120 via 2001:db8:1::61 dev eth1 metric 256  pref medium
2001:db8::/120 via 2001:db8:1::60 dev eth1 metric 256  pref medium
2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
...

$ ip -6 ro del vrf red ::1/120
$ ip -6 ro ls vrf red
2001:db8:1::/120 dev eth1 proto kernel metric 256  pref medium
...

The flag is added to fib6_config by converting fc_type to a u16 (as
noted fc_type only uses 8 bits). The new u16 hole is a bitmap with
fc_delete_all_nexthop as the first bit.

Suggested-by: Dinesh Dutt 
Signed-off-by: David Ahern 
---
v2
- switched example to rfc 3849 documentation address per request
- changed delete loop to explicitly look at siblings list for
  first route matching specs given (metric, protocol, etc)

 include/net/ip6_fib.h  |  4 +++-
 include/uapi/linux/rtnetlink.h |  1 +
 net/ipv6/route.c   | 28 +---
 3 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index a74e2aa40ef4..11ab99e87c5f 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -37,7 +37,9 @@ struct fib6_config {
int fc_ifindex;
u32 fc_flags;
u32 fc_protocol;
-   u32 fc_type;/* only 8 bits are used */
+   u16 fc_type;/* only 8 bits are used */
+   u16 fc_delete_all_nexthop : 1,
+   __unused : 15;
 
struct in6_addr fc_dst;
struct in6_addr fc_src;
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 8c93ad1ef9ab..7fb206bc42f9 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -276,6 +276,7 @@ enum rt_scope_t {
 #define RTM_F_EQUALIZE 0x400   /* Multipath equalizer: NI  */
 #define RTM_F_PREFIX   0x800   /* Prefix addresses */
 #define RTM_F_LOOKUP_TABLE 0x1000  /* set rtm_table to FIB lookup result */
+#define RTM_F_ALL_NEXTHOPS 0x2000  /* delete all nexthops (IPv6) */
 
 /* Reserved table identifiers */
 
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ce5aaf448c54..c95e2f941468 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2143,6 +2143,26 @@ int ip6_del_rt(struct rt6_info *rt)
return __ip6_del_rt(rt, &info);
 }
 
+/* called with table lock held */
+static int __ip6_route_del(struct rt6_info *rt, struct fib6_config *cfg)
+{
+   int err;
+
+   if (rt->rt6i_nsiblings && cfg->fc_delete_all_nexthop) {
+   struct rt6_info *sibling, *next_sibling;
+
+   list_for_each_entry_safe(sibling, next_sibling,
+&rt->rt6i_siblings,
+rt6i_siblings) {
+   err = fib6_del(sibling, &cfg->fc_nlinfo);
+   if (err)
+   return err;
+   }
+   }
+
+   return fib6_del(rt, &cfg->fc_nlinfo);
+}
+
 static int ip6_route_del(struct fib6_config *cfg)
 {
struct fib6_table *table;
@@ -2176,10 +2196,9 @@ static int ip6_route_del(struct fib6_config *cfg)
continue;
if (cfg->fc_protocol && cfg->fc_protocol != 
rt->rt6i_protocol)
continue;
-   dst_hold(&rt->dst);
-   read_unlock_bh(&table->tb6_lock);
 
-   return __ip6_del_rt(rt, &cfg->fc_nlinfo);
+   err = __ip6_route_del(rt, cfg);
+