Re: [PATCH 2/2] neighbour: allow NUD_NOARP entries to be forced GCed

2021-04-19 Thread David Ahern
On 4/19/21 10:52 AM, Kasper Dupont wrote:
> On 19/04/21 10.10, David Ahern wrote:
>> On 4/19/21 9:44 AM, Kasper Dupont wrote:
>>>
>>> Is there any update regarding this change?
>>>
>>> I noticed this regression when it was used in a DoS attack on one of
>>> my servers which I had upgraded from Ubuntu 18.04 to 20.04.
>>>
>>> I have verified that Ubuntu 18.04 is not subject to this attack and
>>> Ubuntu 20.04 is vulnerable. I have also verified that the one-line
>>> change which Cascardo has provided fixes the vulnerability on Ubuntu
>>> 20.04.
>>>
>>
>> your testing included both patches or just this one?
> 
> I applied only this one line change on top of the kernel in Ubuntu
> 20.04. The behavior I observed was that without the patch the kernel
> was vulnerable and with that patch I was unable to reproduce the
> problem.

This patch should be re-submitted standalone for -net

> 
> The other longer patch is for a different issue which Cascardo
> discovered while working on the one I had reported. I don't have an
> environment set up where I can reproduce the issue addressed by that
> larger patch.
> 

The first patch is the one I have concerns about.


Re: [PATCH 2/2] neighbour: allow NUD_NOARP entries to be forced GCed

2021-04-19 Thread David Ahern
On 4/19/21 9:44 AM, Kasper Dupont wrote:
> On 17/03/21 15.53, Thadeu Lima de Souza Cascardo wrote:
>> IFF_POINTOPOINT interfaces use NUD_NOARP entries for IPv6. It's possible to
>> fill up the neighbour table with enough entries that it will overflow for
>> valid connections after that.
>>
>> This behaviour is more prevalent after commit 58956317c8de ("neighbor:
>> Improve garbage collection") is applied, as it prevents removal from
>> entries that are not NUD_FAILED, unless they are more than 5s old.
>>
>> Fixes: 58956317c8de (neighbor: Improve garbage collection)
>> Reported-by: Kasper Dupont 
>> Signed-off-by: Thadeu Lima de Souza Cascardo 
>> ---
>>  net/core/neighbour.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
>> index bbc89c7ffdfd..be5ca411b149 100644
>> --- a/net/core/neighbour.c
>> +++ b/net/core/neighbour.c
>> @@ -256,6 +256,7 @@ static int neigh_forced_gc(struct neigh_table *tbl)
>>  
>>  write_lock(&n->lock);
>>  if ((n->nud_state == NUD_FAILED) ||
>> +(n->nud_state == NUD_NOARP) ||
>>  (tbl->is_multicast &&
>>   tbl->is_multicast(n->primary_key)) ||
>>  time_after(tref, n->updated))
>> -- 
>> 2.27.0
>>
> 
> Is there any update regarding this change?
> 
> I noticed this regression when it was used in a DoS attack on one of
> my servers which I had upgraded from Ubuntu 18.04 to 20.04.
> 
> I have verified that Ubuntu 18.04 is not subject to this attack and
> Ubuntu 20.04 is vulnerable. I have also verified that the one-line
> change which Cascardo has provided fixes the vulnerability on Ubuntu
> 20.04.
> 

your testing included both patches or just this one?




Re: [PATCH net-next v3] virtio-net: page_to_skb() use build_skb when there's sufficient tailroom

2021-04-19 Thread David Ahern
On 4/16/21 2:16 AM, Xuan Zhuo wrote:
> In page_to_skb(), if we have enough tailroom to save skb_shared_info, we
> can use build_skb to create skb directly. No need to alloc for
> additional space. And it can save a 'frags slot', which is very friendly
> to GRO.
> 
> Here, if the payload of the received package is too small (less than
> GOOD_COPY_LEN), we still choose to copy it directly to the space got by
> napi_alloc_skb. So we can reuse these pages.
> 
> Testing Machine:
> The four queues of the network card are bound to the cpu1.
> 
> Test command:
> for ((i=0;i<5;++i)); do sockperf tp --ip 192.168.122.64 -m 1000 -t 150& 
> done
> 
> The size of the udp package is 1000, so in the case of this patch, there
> will always be enough tailroom to use build_skb. The sent udp packet
> will be discarded because there is no port to receive it. The irqsoftd
> of the machine is 100%, we observe the received quantity displayed by
> sar -n DEV 1:
> 
> no build_skb:  956864.00 rxpck/s
> build_skb:1158465.00 rxpck/s
> 

virtio_net is using napi_consume_skb, so napi_build_skb should show a
small increase from build_skb.



Re: Different behavior wrt VRF and no VRF - packet Tx

2021-04-18 Thread David Ahern
On 4/15/21 8:57 PM, Bala Sajja wrote:
>please find the ip link show output(for ifindex) and ping and
> its corresponding perf fib events output. OIF seems MGMT(ifindex 5)
> always, not enslaved  interfaces ?

that is the reason it does not work.


Re: [PATCH iproute2] ip: drop 2-char command assumption

2021-04-18 Thread David Ahern
On 4/17/21 8:49 PM, Tony Ambardar wrote:
> The 'ip' utility hardcodes the assumption of being a 2-char command, where
> any follow-on characters are passed as an argument:
> 
>   $ ./ip-full help
>   Object "-full" is unknown, try "ip help".
> 
> This confusing behaviour isn't seen with 'tc' for example, and was added in
> a 2005 commit without documentation. It was noticed during testing of 'ip'
> variants built/packaged with different feature sets (e.g. w/o BPF support).
> 
> Drop the related code.
> 
> Fixes: 351efcde4e62 ("Update header files to 2.6.14")
> Signed-off-by: Tony Ambardar 
> ---
>  ip/ip.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/ip/ip.c b/ip/ip.c
> index 4cf09fc3..631ce903 100644
> --- a/ip/ip.c
> +++ b/ip/ip.c
> @@ -313,9 +313,6 @@ int main(int argc, char **argv)
>  
>   rtnl_set_strict_dump(&rth);
>  
> - if (strlen(basename) > 2)
> - return do_cmd(basename+2, argc, argv);
> -
>   if (argc > 1)
>   return do_cmd(argv[1], argc-1, argv+1);
>  
> 

popular change request lately. As I responded to Matteo in January:

This has been around for too long to just remove it. How about adding an
option to do_cmd that this appears to be guess based on basename? If the
guess is wrong, fallback to the next do_cmd.



Re: [PATCH net-next 2/2] selftests: fib_nexthops: Test large scale nexthop flushing

2021-04-18 Thread David Ahern
On 4/16/21 8:55 AM, Ido Schimmel wrote:
> From: Ido Schimmel 
> 
> Test that all the nexthops are flushed when a multi-part nexthop dump is
> required for the flushing.
> 
> Without previous patch:
> 
>  # ./fib_nexthops.sh
>  TEST: Large scale nexthop flushing  [FAIL]
> 
> With previous patch:
> 
>  # ./fib_nexthops.sh
>  TEST: Large scale nexthop flushing  [ OK ]
> 
> Signed-off-by: Ido Schimmel 
> Reviewed-by: Petr Machata 
> ---
>  tools/testing/selftests/net/fib_nexthops.sh | 15 +++
>  1 file changed, 15 insertions(+)
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 1/2] nexthop: Restart nexthop dump based on last dumped nexthop identifier

2021-04-18 Thread David Ahern
On 4/16/21 8:55 AM, Ido Schimmel wrote:
> From: Ido Schimmel 
> 
> Currently, a multi-part nexthop dump is restarted based on the number of
> nexthops that have been dumped so far. This can result in a lot of
> nexthops not being dumped when nexthops are simultaneously deleted:
> 
>  # ip nexthop | wc -l
>  65536
>  # ip nexthop flush
>  Dump was interrupted and may be inconsistent.
>  Flushed 36040 nexthops
>  # ip nexthop | wc -l
>  29496
> 
> Instead, restart the dump based on the nexthop identifier (fixed number)
> of the last successfully dumped nexthop:
> 
>  # ip nexthop | wc -l
>  65536
>  # ip nexthop flush
>  Dump was interrupted and may be inconsistent.
>  Flushed 65536 nexthops
>  # ip nexthop | wc -l
>  0
> 
> Reported-by: Maksym Yaremchuk 
> Tested-by: Maksym Yaremchuk 
> Signed-off-by: Ido Schimmel 
> Reviewed-by: Petr Machata 
> ---
>  net/ipv4/nexthop.c | 14 ++
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 

Reviewed-by: David Ahern 

Any reason not to put this in -net with a Fixes tag?




Re: PROBLEM: DoS Attack on Fragment Cache

2021-04-16 Thread David Ahern
[ cc author of 648700f76b03b7e8149d13cc2bdb3355035258a9 ]

On 4/16/21 3:58 PM, Keyu Man wrote:
> Hi,
> 
>  
> 
>     My name is Keyu Man. We are a group of researchers from University
> of California, Riverside. Zhiyun Qian is my advisor. We found the code
> in processing IPv4/IPv6 fragments will potentially lead to DoS Attacks.
> Specifically, after the latest kernel receives an IPv4 fragment, it will
> try to fit it into a queue by calling function
> 
>  
> 
>     struct inet_frag_queue *inet_frag_find(struct fqdir *fqdir, void
> *key) in net/ipv4/inet_fragment.c.
> 
>  
> 
>     However, this function will first check if the existing fragment
> memory exceeds the fqdir->high_thresh. If it exceeds, then drop the
> fragment regardless whether it belongs to a new queue or an existing queue.
> 
> Chances are that an attacker can fill the cache with fragments that
> will never be assembled (i.e., only sends the first fragment with new
> IPIDs every time) to exceed the threshold so that all future incoming
> fragmented IPv4 traffic would be blocked and dropped. Since there is no
> GC mechanism, the victim host has to wait for 30s when the fragments are
> expired to continue receive incoming fragments normally.
> 
>     In practice, given the 4MB fragment cache, the attacker only needs
> to send 1766 fragments to exhaust the cache and DoS the victim for 30s,
> whose cost is pretty low. Besides, IPv6 would also be affected since the
> issue resides in inet part.
> 
> This issue is introduced in commit
> 648700f76b03b7e8149d13cc2bdb3355035258a9 (inet: frags: use rhashtables
> for reassembly units) which removes fqdir->low_thresh, and GC worker as
> well. We would gently request to bring GC worker back to the kernel to
> prevent the DoS attacks.
> 
> Looking forward to hear from you
> 
>  
> 
>     Thanks,
> 
> Keyu Man
> 



Re: Different behavior wrt VRF and no VRF - packet Tx

2021-04-15 Thread David Ahern
On 4/15/21 12:15 AM, Bala Sajja wrote:
> When interfaces are not part of VRF  and below ip address config is
> done on these interfaces, ping with -I (interface) option, we see
> packets transmitting out of the right interfaces.
> 
>  ip addr add 2.2.2.100 peer 1.1.1.100/32 dev enp0s3
>  ip addr add 2.2.2.100 peer 1.1.1.100/32  dev enp0s8
> 
>  ping 1.1.1.100-I  enp0s3 , packet always goes out of  enp0s3
>  ping 1.1.1.100-I   enp0s8, packet always goes out of  enp0s8
> 
> When interfaces are enslaved  to VRF  as below and ip are configured
> on these interfaces, packets go out of one  interface only.
> 
>  ip link add MGMT type vrf table 1
>  ip link set dev MGMT up
>  ip link set dev enp0s3 up
>  ip link set dev enp0s3 master MGMT
>  ip link set dev enp0s8 up
>  ip link set dev enp0s8 master MGMT
>  ip link set dev enp0s9 up
> 
>  ip addr add 2.2.2.100 peer 1.1.1.100/32 dev enp0s3
>  ip addr add 2.2.2.100 peer 1.1.1.100/32  dev enp0s8
> 
>  ping 1.1.1.100-I  enp0s3 , packet always goes out of  enp0s3
>  ping 1.1.1.100-I   enp0s8, packet always goes out of  enp0s3
> 
> 

I believe this use case will not work since the FIB lookup loses the
original device binding (the -I argument). take a look at the FIB lookup
argument and result:

perf record -e fib:*
perf script


Re: XFRM programming with VRF enslaved interfaces

2021-04-15 Thread David Ahern
[ cc Ben ]

On 4/15/21 9:51 AM, Rob Dover wrote:
> Hi there,
> 
> I'm working on an application that's programming IPSec connections via XFRM 
> on VRFs. I'm seeing some strange behaviour in cases where there is an 
> enslaved interface on the VRF - was wondering if anyone has seen something 
> like this before or perhaps knows how this is supposed to work? 

Ben was / is looking at ipsec and VRF. Maybe he has some thoughts.

> 
> In our setup, we have a VRF and an enslaved (sidebar: should I be using a 
> different term for this? I would prefer to use something with fewer negative 
> historic connotations if possible!) interface like so:

for the sidebar, you can just say that a netdev is a member of the L3
domain. iproute2 supports 'vrf' keyword for better user semantics than
'master'


Any chance you can create a shell script that creates your setup using
network namespaces?

tools/testing/selftests/net/fcnal-test.sh has some helpers --
create_vrf, create_ns and connect_ns -- which simplify the 'namespace as
a node' concept and configuring the interconnects.

A standalone script would allow runs across kernel versions and make it
easier for others (me) to debug.


> 
> ```
> # ip link show vrf-1
> 33: vrf-1:  mtu 65536 qdisc noqueue state UP mode 
> DEFAULT group default qlen 1000
> link/ether 2a:71:ba:bd:33:4d brd ff:ff:ff:ff:ff:ff 
> 
> # ip link show master vrf-1
> 32: serv1:  mtu 1500 qdisc fq_codel master vrf-1 state 
> UP mode DEFAULT group default qlen 1000
> link/ether 00:13:3e:00:16:68 brd ff:ff:ff:ff:ff:ff
> ```
> 
> The serv1 interface has some associated IPs but the vrf-1 interface does not:
> 
> ```
> # ip addr show dev vrf-1
> 33: vrf-1:  mtu 65536 qdisc noqueue state UP group 
> default qlen 1000
> link/ether 2a:71:ba:bd:33:4d brd ff:ff:ff:ff:ff:ff 
> 
> # ip addr show dev serv1
> 32: serv1:  mtu 1500 qdisc fq_codel master vrf-1 state 
> UP group default qlen 1000
> link/ether 00:13:3e:00:16:68 brd ff:ff:ff:ff:ff:ff
> inet 10.248.0.191/16 brd 10.248.255.255 scope global serv1
>valid_lft forever preferred_lft forever
> inet 10.248.0.250/16 brd 10.248.255.255 scope global secondary serv1
>valid_lft forever preferred_lft forever
> inet6 fd5f:5d21:845:1401:213:3eff:fe00:1668/64 scope global
>valid_lft forever preferred_lft forever
> inet6 fe80::213:3eff:fe00:1668/64 scope link
>valid_lft forever preferred_lft forever
> ```
> 
> We're trying to program XFRM using these addresses to send and receive IPSec 
> traffic in transport mode. The interesting question is which interface the 
> XFRM state should be programmed on. I started off by programming the 
> following policies and SAs on the VRF:
> 
> ```
> # ip xfrm policy show
> src 10.254.13.16/32 dst 10.248.0.191/32 sport 37409 dport 5080 dev vrf-1
> dir in priority 2147483648 ptype main
> tmpl src 0.0.0.0 dst 0.0.0.0
> proto esp reqid 0 mode transport 
> src 10.248.0.191/32 dst 10.254.13.16/32 sport 16381 dport 37409 dev vrf-1
> dir out priority 2147483648 ptype main
> tmpl src 0.0.0.0 dst 0.0.0.0
> proto esp reqid 0 mode transport 
> # ip xfrm state show 
> src 10.254.13.16 dst 10.248.0.191
> proto esp spi 0x03a0392c reqid 3892838400 mode transport
> replay-window 0
> auth-trunc hmac(md5) 0x00112233445566778899aabbccddeeff 96
> enc cbc(des3_ede) 0xfeefdccdbaab98897667544532231001feefdccdbaab9889
> anti-replay context: seq 0x0, oseq 0x0, bitmap 0x
> sel src 0.0.0.0/0 dst 0.0.0.0/0 sport 37409 dport 5080 dev vrf-1 
> src 10.248.0.191 dst 10.254.13.16
> proto esp spi 0x00124f80 reqid 0 mode transport
> replay-window 0
> auth-trunc hmac(md5) 0x00112233445566778899aabbccddeeff 96
> enc cbc(des3_ede) 0xfeefdccdbaab98897667544532231001feefdccdbaab9889
> anti-replay context: seq 0x0, oseq 0x0, bitmap 0x
> sel src 0.0.0.0/0 dst 0.0.0.0/0 sport 16381 dport 37409 dev vrf-1
> ```
> 
> Having programmed these, I can then send ESP packets from 10.254.13.16:37409 
> -> 10.248.0.191:5080 and they are successfully decoded and passed up to my 
> application. However, when I try to send UDP packets out again from 
> 10.248.0.191:16381 -> 10.254.13.16:37409, the packets are not encrypted but 
> sent out in the clear!
> 
> Now, I've done some experimentation and found that if I program the outbound 
> XFRM policy (eg. 10.248.0.191->10.254.13.16) to be on serv1 rather than 
> vrf-1, the packets are correctly encrypted. But if I program the inbound XFRM 
> policy (eg. 10.254.13.16->10.248.0.191) to be on serv1 rather than vrf-1, the 
> inbound packets are not passed up to my application! That leaves me in a 
> situation where I need to program the inbound and outbound XFRM policies 
> asymmetrically in order to get my traffic to be sent properly, like so:
> 
> ```
> # ip xfrm policy show
> src 10.254.13.16/32 dst 10.248.0.191/32

Re: [PATCH v2 bpf-next] cpumap: bulk skb using netif_receive_skb_list

2021-04-15 Thread David Ahern
On 4/15/21 9:03 AM, Lorenzo Bianconi wrote:
>> On 4/15/21 8:05 AM, Daniel Borkmann wrote:
> 
> [...]
 &stats);
>>>
>>> Given we stop counting drops with the netif_receive_skb_list(), we
>>> should then
>>> also remove drops from trace_xdp_cpumap_kthread(), imho, as otherwise it
>>> is rather
>>> misleading (as in: drops actually happening, but 0 are shown from the
>>> tracepoint).
>>> Given they are not considered stable API, I would just remove those to
>>> make it clear
>>> to users that they cannot rely on this counter anymore anyway.
>>>
>>
>> What's the visibility into drops then? Seems like it would be fairly
>> easy to have netif_receive_skb_list return number of drops.
>>
> 
> In order to return drops from netif_receive_skb_list() I guess we need to 
> introduce
> some extra checks in the hot path. Moreover packet drops are already accounted
> in the networking stack and this is currently the only consumer for this info.
> Does it worth to do so?

right - softnet_stat shows the drop. So the loss here is that the packet
is from a cpumap XDP redirect.

Better insights into drops is needed, but I guess in this case coming
from the cpumap does not really aid into why it is dropped - that is
more core to __netif_receive_skb_list_core. I guess this is ok to drop
the counter from the tracepoint.


Re: [PATCH v2 bpf-next] cpumap: bulk skb using netif_receive_skb_list

2021-04-15 Thread David Ahern
On 4/15/21 8:05 AM, Daniel Borkmann wrote:
> On 4/13/21 6:22 PM, Lorenzo Bianconi wrote:
>> Rely on netif_receive_skb_list routine to send skbs converted from
>> xdp_frames in cpu_map_kthread_run in order to improve i-cache usage.
>> The proposed patch has been tested running xdp_redirect_cpu bpf sample
>> available in the kernel tree that is used to redirect UDP frames from
>> ixgbe driver to a cpumap entry and then to the networking stack.
>> UDP frames are generated using pkt_gen.
>>
>> $xdp_redirect_cpu  --cpu  --progname xdp_cpu_map0 --dev 
>>
>> bpf-next: ~2.2Mpps
>> bpf-next + cpumap skb-list: ~3.15Mpps
>>
>> Signed-off-by: Lorenzo Bianconi 
>> ---
>> Changes since v1:
>> - fixed comment
>> - rebased on top of bpf-next tree
>> ---
>>   kernel/bpf/cpumap.c | 11 +--
>>   1 file changed, 5 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
>> index 0cf2791d5099..d89551a508b2 100644
>> --- a/kernel/bpf/cpumap.c
>> +++ b/kernel/bpf/cpumap.c
>> @@ -27,7 +27,7 @@
>>   #include 
>>   #include 
>>   -#include    /* netif_receive_skb_core */
>> +#include    /* netif_receive_skb_list */
>>   #include  /* eth_type_trans */
>>     /* General idea: XDP packets getting XDP redirected to another CPU,
>> @@ -257,6 +257,7 @@ static int cpu_map_kthread_run(void *data)
>>   void *frames[CPUMAP_BATCH];
>>   void *skbs[CPUMAP_BATCH];
>>   int i, n, m, nframes;
>> +    LIST_HEAD(list);
>>     /* Release CPU reschedule checks */
>>   if (__ptr_ring_empty(rcpu->queue)) {
>> @@ -305,7 +306,6 @@ static int cpu_map_kthread_run(void *data)
>>   for (i = 0; i < nframes; i++) {
>>   struct xdp_frame *xdpf = frames[i];
>>   struct sk_buff *skb = skbs[i];
>> -    int ret;
>>     skb = __xdp_build_skb_from_frame(xdpf, skb,
>>    xdpf->dev_rx);
>> @@ -314,11 +314,10 @@ static int cpu_map_kthread_run(void *data)
>>   continue;
>>   }
>>   -    /* Inject into network stack */
>> -    ret = netif_receive_skb_core(skb);
>> -    if (ret == NET_RX_DROP)
>> -    drops++;
>> +    list_add_tail(&skb->list, &list);
>>   }
>> +    netif_receive_skb_list(&list);
>> +
>>   /* Feedback loop via tracepoint */
>>   trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched,
>> &stats);
> 
> Given we stop counting drops with the netif_receive_skb_list(), we
> should then
> also remove drops from trace_xdp_cpumap_kthread(), imho, as otherwise it
> is rather
> misleading (as in: drops actually happening, but 0 are shown from the
> tracepoint).
> Given they are not considered stable API, I would just remove those to
> make it clear
> to users that they cannot rely on this counter anymore anyway.
> 

What's the visibility into drops then? Seems like it would be fairly
easy to have netif_receive_skb_list return number of drops.


Re: [PATCH v3 net-next] net: multipath routing: configurable seed

2021-04-14 Thread David Ahern
On 4/14/21 12:33 AM, Pavel Balaev wrote:
>>
>> This should work the same for IPv6.
> I wanted to add IPv6 support after IPv4 will be approved,
> anyway no problem, will add IPv6 in next version 
>> And please add test cases under tools/testing/selftests/net.
> This feature cannot be tested whithin one host instance, becasue the same seed
> will be used by default for all netns, so results will be the same
> anyway, should I use QEMU for this tests?
>  
> 

why not make the seed per namespace?


Re: [PATCH net] vrf: fix a comment about loopback device

2021-04-14 Thread David Ahern
On 4/14/21 3:03 AM, Nicolas Dichtel wrote:
> This is a leftover of the below commit.
> 
> Fixes: 4f04256c983a ("net: vrf: Drop local rtable and rt6_info")
> Signed-off-by: Nicolas Dichtel 
> ---
>  drivers/net/vrf.c | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 

Acked-by: David Ahern 

Thanks, Nicolas.



Re: [PATCH v3 net-next] net: multipath routing: configurable seed

2021-04-13 Thread David Ahern
On 4/13/21 4:55 AM, Balaev Pavel wrote:
> Ability for a user to assign seed value to multipath route hashes.
> Now kernel uses random seed value to prevent hash-flooding DoS attacks;
> however, it disables some use cases, f.e:
> 
> +---++--+++
> |   |-eth0---| FW0  |---eth0-||
> |   |+--+||
> |  GW0  |ECMPECMP|  GW1   |
> |   |+--+||
> |   |-eth1---| FW1  |---eth1-||
> +---++--+++
> 
> In this use case, two ECMP routers balance traffic between
> two firewalls. If some flow transmits a response over a different channel 
> than request,
> such flow will be dropped, because keep-state rules are created on
> the other firewall.
> 
> This patch adds sysctl variable: net.ipv4.fib_multipath_hash_seed.
> User can set the same seed value on GW0 and GW1 for traffic to be
> mirror-balanced. By default, random value is used.
> 
> Signed-off-by: Balaev Pavel 
> ---
>  Documentation/networking/ip-sysctl.rst |  14 
>  include/net/flow_dissector.h   |   4 +
>  include/net/netns/ipv4.h   |  20 +
>  net/core/flow_dissector.c  |   9 +++
>  net/ipv4/af_inet.c |   5 ++
>  net/ipv4/route.c   |  10 ++-
>  net/ipv4/sysctl_net_ipv4.c | 104 +
>  7 files changed, 165 insertions(+), 1 deletion(-)
> 

This should work the same for IPv6.

And please add test cases under tools/testing/selftests/net.




Re: [PATCH v3] ip-nexthop: support flush by id

2021-04-08 Thread David Ahern
On 4/5/21 7:33 PM, Chunmei Xu wrote:
> since id is unique for nexthop, it is heavy to dump all nexthops.
> use existing delete_nexthop to support flush by id
> 
> Signed-off-by: Chunmei Xu 
> ---
>  ip/ipnexthop.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 

applied to iproute2-next. Thanks,



Re: [PATCH v2] ipv6: report errors for iftoken via netlink extack

2021-04-08 Thread David Ahern
On 4/7/21 9:59 AM, Stephen Hemminger wrote:
> From: Stephen Hemminger 
> 
> Setting iftoken can fail for several different reasons but there
> and there was no report to user as to the cause. Add netlink
> extended errors to the processing of the request.
> 
> This requires adding additional argument through rtnl_af_ops
> set_link_af callback.
> 
> Reported-by: Hongren Zheng 
> Signed-off-by: Stephen Hemminger 
> ---
> v2 - fix typo that broke build
> 
>  include/net/rtnetlink.h |  4 ++--
>  net/core/rtnetlink.c|  2 +-
>  net/ipv4/devinet.c  |  3 ++-
>  net/ipv6/addrconf.c | 32 ++--
>  4 files changed, 31 insertions(+), 10 deletions(-)
> 

Reviewed-by: David Ahern 


Re: [RFC net-next 0/1] seg6: Counters for SRv6 Behaviors

2021-04-07 Thread David Ahern
Since this is a single patch set, just put this good cover letter
content as the message in the patch.


Re: [RFC net-next 1/1] seg6: add counters support for SRv6 Behaviors

2021-04-07 Thread David Ahern
On 4/7/21 12:03 PM, Andrea Mayer wrote:
> diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h
> index 3b39ef1dbb46..ae5e3fd12b73 100644
> --- a/include/uapi/linux/seg6_local.h
> +++ b/include/uapi/linux/seg6_local.h
> @@ -27,6 +27,7 @@ enum {
>   SEG6_LOCAL_OIF,
>   SEG6_LOCAL_BPF,
>   SEG6_LOCAL_VRFTABLE,
> + SEG6_LOCAL_COUNTERS,
>   __SEG6_LOCAL_MAX,
>  };
>  #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1)
> @@ -78,4 +79,11 @@ enum {
>  
>  #define SEG6_LOCAL_BPF_PROG_MAX (__SEG6_LOCAL_BPF_PROG_MAX - 1)
>  
> +/* SRv6 Behavior counters */
> +struct seg6_local_counters {
> + __u64 rx_packets;
> + __u64 rx_bytes;
> + __u64 rx_errors;
> +};
> +
>  #endif

It's highly likely that more stats would get added over time. It would
be good to document that here for interested parties and then make sure
iproute2 can handle different sized stats structs. e.g., commit support
to your repo, then add a new one (e.g, rx_drops) and verify the
combinations handle it. e.g., old kernel - new iproute2, new kernel -
old iproute, old - old and new-new.



Re: [PATCH v2] ip-nexthop: support flush by id

2021-04-02 Thread David Ahern
On 3/31/21 10:03 PM, Chunmei Xu wrote:
> since id is unique for nexthop, it is heavy to dump all nexthops.
> use existing delete_nexthop to support flush by id
> 
> Signed-off-by: Chunmei Xu 
> ---
>  ip/ipnexthop.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
> 

this version does not apply to iproute2-next.



Re: [iproute2-next] tipc: use the libmnl functions in lib/mnl_utils.c

2021-04-02 Thread David Ahern
On 3/31/21 8:34 PM, Hoang Le wrote:
> To avoid code duplication, tipc should be converted to use the helper
> functions for working with libmnl in lib/mnl_utils.c
> 
> Acked-by: Jon Maloy 
> Signed-off-by: Hoang Le 
> ---
>  tipc/bearer.c|  38 ++
>  tipc/cmdl.c  |   2 -
>  tipc/link.c  |  37 +
>  tipc/media.c |  15 +++---
>  tipc/msg.c   | 132 +++
>  tipc/msg.h   |   2 +-
>  tipc/nametable.c |   5 +-
>  tipc/node.c  |  33 +---
>  tipc/peer.c  |   8 ++-
>  tipc/socket.c|  10 ++--
>  tipc/tipc.c  |  21 +++-
>  11 files changed, 83 insertions(+), 220 deletions(-)
> 

applied to iproute2-next.



Re: [PATCH net] net: udp: Add support for getsockopt(..., ..., UDP_GRO, ..., ...);

2021-04-01 Thread David Ahern
On 4/1/21 12:59 AM, Norman Maurer wrote:
> From: Norman Maurer 
> 
> Support for UDP_GRO was added in the past but the implementation for
> getsockopt was missed which did lead to an error when we tried to
> retrieve the setting for UDP_GRO. This patch adds the missing switch
> case for UDP_GRO
> 
> Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.")
> Signed-off-by: Norman Maurer 
> ---
>  net/ipv4/udp.c | 4 
>  1 file changed, 4 insertions(+)
> 

Reviewed-by: David Ahern 




Re: [RFC] add extack errors for iptoken

2021-04-01 Thread David Ahern
On 3/31/21 9:49 PM, Stephen Hemminger wrote:
> @@ -5681,14 +5682,29 @@ static int inet6_set_iftoken(struct inet6_dev *idev, 
> struct in6_addr *token)
>  
>   ASSERT_RTNL();
>  
> - if (!token)
> + if (!token) {

You forgot to add a message here.

>   return -EINVAL;
> - if (dev->flags & (IFF_LOOPBACK | IFF_NOARP))
> + }
> +
> + if (dev->flags & IFF_LOOPBACK) {
> + NL_SET_ERR_MSG_MOD(extack, "Device is loopback");
>   return -EINVAL;
> - if (!ipv6_accept_ra(idev))
> + }
> +
> + if (dev->flags & IFF_NOARP) {
> + NL_SET_ERR_MSG_MOD(extack, "Device does not do discovery");

'Device does not do neighbor discovery'

>   return -EINVAL;
> - if (idev->cnf.rtr_solicits == 0)
> + }
> +
> + if (!ipv6_accept_ra(idev)) {
> + NL_SET_ERR_MSG_MOD(extack, "Device does accept route adverts");

How about
'Router advertisements are disabled for this device'


> + return -EINVAL;
> + }
> +
> + if (idev->cnf.rtr_solicits == 0) {
> + NL_SET_ERR_MSG(extack, "Device has disabled router 
> solicitation");

How about

'Router solicitation is disabled on device'



Re: [PATCH] udp: Add support for getsockopt(..., ..., UDP_GRO, ..., ...)

2021-03-31 Thread David Ahern
On 3/31/21 7:10 AM, Norman Maurer wrote:
> Friendly ping… 
> 
> As this missing change was most likely an oversight in the original commit I 
> do think it should go into 5.12 and subsequently stable as well. That’s also 
> the reason why I didn’t send a v2 and changed the commit message / subject 
> for the patch. For me it clearly is a bug and not a new feature.
> 
> 

I agree that it should be added to net

If you do not see it here:
  https://patchwork.kernel.org/project/netdevbpf/list/

you need to re-send and clearly mark it as [PATCH net]. Make sure it has
a Fixes tag.



Re: [PATCH] ip-nexthop: support flush by id

2021-03-31 Thread David Ahern
On 3/31/21 5:53 AM, Ido Schimmel wrote:
>> @@ -124,6 +125,9 @@ static int flush_nexthop(struct nlmsghdr *nlh, void *arg)
>>  if (tb[NHA_ID])
>>  id = rta_getattr_u32(tb[NHA_ID]);
>>  
>> +if (filter.id && filter.id != id)
>> +return 0;
>> +
>>  if (id && !delete_nexthop(id))
>>  filter.flushed++;
>>  
>> @@ -491,7 +495,10 @@ static int ipnh_list_flush(int argc, char **argv, int 
>> action)
>>  NEXT_ARG();
>>  if (get_unsigned(&id, *argv, 0))
>>  invarg("invalid id value", *argv);
>> -return ipnh_get_id(id);
>> +if (action == IPNH_FLUSH)
>> +filter.id = id;
>> +else
>> +return ipnh_get_id(id);
> 
> I think it's quite weird to ask for a dump of all nexthops only to
> delete a specific one. How about this:

+1

> 
> ```
> diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c
> index 0263307c49df..09a3076231aa 100644
> --- a/ip/ipnexthop.c
> +++ b/ip/ipnexthop.c
> @@ -765,8 +765,16 @@ static int ipnh_list_flush(int argc, char **argv, int 
> action)
> if (!filter.master)
> invarg("VRF does not exist\n", *argv);
> } else if (!strcmp(*argv, "id")) {
> -   NEXT_ARG();
> -   return ipnh_get_id(ipnh_parse_id(*argv));
> +   /* When 'id' is specified with 'flush' / 'list' we do
> +* not need to perform a dump.
> +*/
> +   if (action == IPNH_LIST) {
> +   NEXT_ARG();
> +   return ipnh_get_id(ipnh_parse_id(*argv));
> +   } else {
> +   return ipnh_modify(RTM_DELNEXTHOP, 0, argc,
> +  argv);
> +   }

since delete just needs the id, you could refactor ipnh_modify and
create a ipnh_delete_id.


Re: ip-nexthop does not support flush by id

2021-03-30 Thread David Ahern
On 3/30/21 7:05 AM, Ido Schimmel wrote:
> Looks like a bug. 'flush' does not really make sense with 'id', but
> 'list id' works, so I think 'flush id' should also work.

right, neither were intended. If you know the id, you don't need the
generic list / flush option.

> 
> Can you send a patch?

Agreed. At this point consistency is best.


Re: [PATCH iproute2-next] police: add support for packet-per-second rate limiting

2021-03-29 Thread David Ahern
On 3/26/21 6:50 AM, Simon Horman wrote:
> From: Baowen Zheng 
> 
> Allow a policer action to enforce a rate-limit based on packets-per-second,
> configurable using a packet-per-second rate and burst parameters.
> 
> e.g.
>  # $TC actions add action police pkts_rate 1000 pkts_burst 200 index 1
>  # $TC actions ls action police
>  total acts 1
> 
>   action order 0:  police 0x1 rate 0bit burst 0b mtu 4096Mb pkts_rate 
> 1000 pkts_burst 200
>   ref 1 bind 0
> 
> Signed-off-by: Baowen Zheng 
> Signed-off-by: Simon Horman 
> Signed-off-by: Louis Peens 
> ---
>  man/man8/tc-police.8 | 35 ---
>  tc/m_police.c| 50 +---
>  2 files changed, 75 insertions(+), 10 deletions(-)
> 

applied to iproute2-next.



Re: [PATCH] Add Open/R to rt_protos

2021-03-29 Thread David Ahern
On 3/26/21 9:05 AM, Cooper Ry Lees wrote:
> From: Cooper Lees 
> 
> - Open Routing is using ID 99 for it's installed routes
> - https://github.com/facebook/openr
> - Kernel has accepted 99 in `rtnetlink.h`
> 
> Signed-of-by: Cooper Lees 
> ---
>  etc/iproute2/rt_protos | 1 +
>  1 file changed, 1 insertion(+)
> 

applied to iproute2-next.


Re: [PATCH 8/9] ip6_tunnel:: Correct function name parse_tvl_tnl_enc_lim() in the kerneldoc comments

2021-03-27 Thread David Ahern
On 3/27/21 2:15 AM, Xiongfeng Wang wrote:
> Fix the following W=1 kernel build warning(s):
> 
>  net/ipv6/ip6_tunnel.c:401: warning: expecting prototype for 
> parse_tvl_tnl_enc_lim(). Prototype was for ip6_tnl_parse_tlv_enc_lim() instead
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Xiongfeng Wang 
> ---
>  net/ipv6/ip6_tunnel.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 

Reviewed-by: David Ahern 




Re: [PATCH 1/9] l3mdev: Correct function names in the kerneldoc comments

2021-03-27 Thread David Ahern
On 3/27/21 2:15 AM, Xiongfeng Wang wrote:
> Fix the following W=1 kernel build warning(s):
> 
>  net/l3mdev/l3mdev.c:111: warning: expecting prototype for 
> l3mdev_master_ifindex(). Prototype was for l3mdev_master_ifindex_rcu() instead
>  net/l3mdev/l3mdev.c:145: warning: expecting prototype for 
> l3mdev_master_upper_ifindex_by_index(). Prototype was for 
> l3mdev_master_upper_ifindex_by_index_rcu() instead
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Xiongfeng Wang 
> ---
>  net/l3mdev/l3mdev.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Reviewed-by: David Ahern 



Re: [PATCH] sit: use min

2021-03-27 Thread David Ahern
On 3/27/21 3:29 AM, Julia Lawall wrote:
> From: kernel test robot 
> 
> Opportunity for min()
> 
> Generated by: scripts/coccinelle/misc/minmax.cocci
> 
> CC: Denis Efremov 
> Reported-by: kernel test robot 
> Signed-off-by: kernel test robot 
> ---
> 
>  sit.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: David Ahern 




Re: [PATCH net-next] nexthop: Rename artifacts related to legacy multipath nexthop groups

2021-03-26 Thread David Ahern
On 3/26/21 7:20 AM, Petr Machata wrote:
> After resilient next-hop groups have been added recently, there are two
> types of multipath next-hop groups: the legacy "mpath", and the new
> "resilient". Calling the legacy next-hop group type "mpath" is unfortunate,
> because that describes the fact that a packet could be forwarded in one of
> several paths, which is also true for the resilient next-hop groups.
> 
> Therefore, to make the naming clearer, rename various artifacts to reflect
> the assumptions made. Therefore as of this patch:
> 
> - The flag for multipath groups is nh_grp_entry::is_multipath. This
>   includes the legacy and resilient groups, as well as any future group
>   types that behave as multipath groups.
>   Functions that assume this have "mpath" in the name.
> 
> - The flag for legacy multipath groups is nh_grp_entry::hash_threshold.
>   Functions that assume this have "hthr" in the name.
> 
> - The flag for resilient groups is nh_grp_entry::resilient.
>   Functions that assume this have "res" in the name.
> 
> Besides the above, struct nh_grp_entry::mpath was renamed to ::hthr as
> well.
> 
> UAPI artifacts were obviously left intact.
> 
> Suggested-by: David Ahern 
> Signed-off-by: Petr Machata 
> ---
>  include/net/nexthop.h |  4 ++--
>  net/ipv4/nexthop.c| 56 +++++--
>  2 files changed, 30 insertions(+), 30 deletions(-)
> 

Thanks for the followup.

Reviewed-by: David Ahern 



Re: [PATCH 2/2] net: ipv4: route.c: Remove unnecessary if()

2021-03-25 Thread David Ahern
On 3/23/21 9:10 PM, Yejune Deng wrote:
> negative_advice handler is only called when dst is non-NULL hence the
> 'if (rt)' check can be removed. 'if' and 'else if' can be merged together.
> And use container_of() instead of (struct rtable *).
> 
> Signed-off-by: Yejune Deng 
> ---
>  net/ipv4/route.c | 16 ++--
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 


Reviewed-by: David Ahern 


Re: rfc5837 and rfc8335

2021-03-24 Thread David Ahern
On 3/23/21 10:39 AM, Ron Bonica wrote:
> Hi Folks,
> 
>  
> 
> The rationale for RFC 8335 can be found in Section 5.0 of that document.
> Currently, ICMP ECHO and ECHO RESPONSE messages can be used to determine
> the liveness of some interfaces. However, they cannot determine the
> liveness of:
> 
>  
> 
>   * An unnumbered IPv4 interface
>   * An IPv6 interface that has only a link-local address
> 
>  
> 
> A router can have hundreds, or even thousands of interfaces that fall
> into these categories.
> 
>  
> 
> The rational for RFC 5837 can be found in the Introduction to that
> document. When a node sends an ICMP TTL Expired message, the node
> reports that a packet has expired on it. However, the source address of
> the ICMP TTL Expired message doesn’t necessarily identify the interface
> upon which the packet arrived. So, TRACEROUTE can be relied upon to
> identify the nodes that a packet traverses along its delivery path. But
> it cannot be relied upon to identify the interfaces that a packet
> traversed along its deliver path.
> 
>  

It's not a question of the rationale; the question is why add this
support to Linux now? RFC 5837 is 11 years old. Why has no one cared to
add support before now? What tooling supports it? What other NOS'es
support it to really make the feature meaningful? e.g., Do you know what
Juniper products support RFC 5837 today?

More than likely Linux is the end node of the traceroute chain, not the
transit or path nodes. With Linux, the ingress interface can lost in the
layers (NIC port, vlan, bond, bridge, vrf, macvlan), and to properly
support either you need to return information about the right one.
Unnumbered interfaces can make that more of a challenge.


Re: [iproute2-next] tipc: add support for the netlink extack

2021-03-24 Thread David Ahern
On 3/24/21 7:56 PM, Hoang Le wrote:
> Add support extack in tipc to dump the netlink extack error messages
> (i.e -EINVAL) sent from kernel.
> 
> Acked-by: Jon Maloy 
> Signed-off-by: Hoang Le 
> ---
>  tipc/msg.c | 29 ++---
>  1 file changed, 22 insertions(+), 7 deletions(-)
> 

tipc should be converted to use the library functions in lib/mnl_utils.c.



Re: [PATCH net-next 0/6] page_pool: recycle buffers

2021-03-23 Thread David Ahern
On 3/22/21 11:02 AM, Matteo Croce wrote:
> From: Matteo Croce 
> 
> This series enables recycling of the buffers allocated with the page_pool API.
> The first two patches are just prerequisite to save space in a struct and
> avoid recycling pages allocated with other API.
> Patch 2 was based on a previous idea from Jonathan Lemon.
> 
> The third one is the real recycling, 4 fixes the compilation of 
> __skb_frag_unref
> users, and 5,6 enable the recycling on two drivers.

patch 4 should be folded into 3; each patch should build without errors.

> 
> In the last two patches I reported the improvement I have with the series.
> 
> The recycling as is can't be used with drivers like mlx5 which do page split,
> but this is documented in a comment.
> In the future, a refcount can be used so to support mlx5 with no changes.

Is the end goal of the page_pool changes to remove driver private caches?




Re: [PATCH] net: ipv4: route.c: Remove unnecessary {else} if()

2021-03-23 Thread David Ahern
subject line should have net-next as the target branch


On 3/23/21 4:20 AM, Yejune Deng wrote:
> Put if and else if together, and remove unnecessary judgments, because
> it's caller can make sure it is true. And add likely() in
> ipv4_confirm_neigh().
> 
> Signed-off-by: Yejune Deng 
> ---
>  net/ipv4/route.c | 18 +++---
>  1 file changed, 7 insertions(+), 11 deletions(-)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index fa68c2612252..f4ba07c5c1b1 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -440,7 +440,7 @@ static void ipv4_confirm_neigh(const struct dst_entry 
> *dst, const void *daddr)
>   struct net_device *dev = dst->dev;
>   const __be32 *pkey = daddr;
>  
> - if (rt->rt_gw_family == AF_INET) {
> + if (likely(rt->rt_gw_family == AF_INET)) {
>   pkey = (const __be32 *)&rt->rt_gw4;
>   } else if (rt->rt_gw_family == AF_INET6) {
>   return __ipv6_confirm_neigh_stub(dev, &rt->rt_gw6);
> @@ -814,19 +814,15 @@ static void ip_do_redirect(struct dst_entry *dst, 
> struct sock *sk, struct sk_buf
>  
>  static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst)
>  {
> - struct rtable *rt = (struct rtable *)dst;
> + struct rtable *rt = container_of(dst, struct rtable, dst);
>   struct dst_entry *ret = dst;
>  
> - if (rt) {
> - if (dst->obsolete > 0) {
> - ip_rt_put(rt);
> - ret = NULL;
> - } else if ((rt->rt_flags & RTCF_REDIRECTED) ||
> -rt->dst.expires) {
> - ip_rt_put(rt);
> - ret = NULL;
> - }
> + if (dst->obsolete > 0 || rt->dst.expires ||
> + (rt->rt_flags & RTCF_REDIRECTED)) {
> + ip_rt_put(rt);
> + ret = NULL;
>   }
> +
>   return ret;
>  }
>  
> 

This should be 2 separate patches since they are unrelated changes.

For the ipv4_negative_advice, the changelog should note that
negative_advice handler is only called when dst is non-NULL hence the
'if (rt)' check can be removed.


Re: [PATCH v7 bpf-next 10/14] bpf: add new frame_length field to the XDP ctx

2021-03-22 Thread David Ahern
On 3/22/21 3:29 AM, Eelco Chaudron wrote:
> 
> 
> On 20 Mar 2021, at 4:42, David Ahern wrote:
> 
>> On 3/19/21 3:47 PM, Lorenzo Bianconi wrote:
>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>> index 19cd6642e087..e47d9e8da547 100644
>>> --- a/include/net/xdp.h
>>> +++ b/include/net/xdp.h
>>> @@ -75,6 +75,10 @@ struct xdp_buff {
>>>  struct xdp_txq_info *txq;
>>>  u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved
>>> tailroom*/
>>>  u32 mb:1; /* xdp non-linear buffer */
>>> +    u32 frame_length; /* Total frame length across all buffers. Only
>>> needs
>>> +   * to be updated by helper functions, as it will be
>>> +   * initialized at XDP program start.
>>> +   */
>>>  };
>>>
>>>  static __always_inline void
>>
>> If you do another version of this set ...
>>
>> I think you only need 17-bits for the frame length (size is always <=
>> 128kB). It would be helpful for extensions to xdp if you annotated how
>> many bits are really needed here.
> 
> Guess this can be done, but I did not too avoid the use of constants to
> do the BPF extraction.
> Here is an example of what might need to be added, as adding them before
> made people unhappy ;)
> 
> https://elixir.bootlin.com/linux/v5.12-rc4/source/include/linux/skbuff.h#L801
> 
> 

I was just referring to a code comment that bits can be taken from
frame_length for extensions - just like the mb bit takes from frame_sz
(and it too has bits available since 2GB is way larger than actually
needed).


Re: [PATCH iproute2-next] ip: xfrm: add support for tfcpad

2021-03-21 Thread David Ahern
On 3/19/21 10:57 AM, Sabrina Dubroca wrote:
> This patch adds support for setting and displaying the Traffic Flow
> Confidentiality attribute for an XFRM state, which allows padding ESP
> packets to a specified length.
> 
> Signed-off-by: Sabrina Dubroca 
> ---
>  ip/ipxfrm.c|  8 
>  ip/xfrm_state.c| 10 +-
>  man/man8/ip-xfrm.8 |  2 ++
>  3 files changed, 19 insertions(+), 1 deletion(-)
> 

applied to iproute2-next. Thanks,



Re: [PATCH] ipv4/raw: support binding to nonlocal addresses

2021-03-21 Thread David Ahern
On 3/20/21 6:20 PM, Riccardo Paolo Bestetti wrote:
> diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
> index 50a73178d63a..734c0332b54b 100644
> --- a/net/ipv4/raw.c
> +++ b/net/ipv4/raw.c
> @@ -717,6 +717,7 @@ static int raw_bind(struct sock *sk, struct sockaddr 
> *uaddr, int addr_len)
>  {
>   struct inet_sock *inet = inet_sk(sk);
>   struct sockaddr_in *addr = (struct sockaddr_in *) uaddr;
> + struct net *net = sock_net(sk);
>   u32 tb_id = RT_TABLE_LOCAL;
>   int ret = -EINVAL;
>   int chk_addr_ret;
> @@ -732,7 +733,8 @@ static int raw_bind(struct sock *sk, struct sockaddr 
> *uaddr, int addr_len)
>   tb_id);
>  
>   ret = -EADDRNOTAVAIL;
> - if (addr->sin_addr.s_addr && chk_addr_ret != RTN_LOCAL &&
> + if (!inet_can_nonlocal_bind(net, inet) &&
> + addr->sin_addr.s_addr && chk_addr_ret != RTN_LOCAL &&
>   chk_addr_ret != RTN_MULTICAST && chk_addr_ret != RTN_BROADCAST)
>   goto out;
>   inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr;
> 


Please add test cases to ipv4_addr_bind and ipv6_addr_bind in
tools/testing/selftests/net/fcnal-test.sh. The latter will verify if
IPv6 works the same or needs a change.

Also, this check duplicates the ones in __inet_bind and __inet6_bind; it
would be good to use an inline helper to reduce the duplication.


rfc5837 and rfc8335

2021-03-20 Thread David Ahern
On 3/19/21 10:24 PM, David Ahern wrote:
> At the end of the day, what is the value of this feature vs the other
> ICMP probing set?

Merging the conversations about both of these RFCs since my comments and
questions are the same for both.

What is the motivation for adding support for these RFCs? Is the push
from a company or academia (e.g., a CS project)?

Realistically, who is expected to use this feature and why given the
information it leaks about the networking configuration of the node. Why
is this tool expected to be more useful than a network operator using
existing protocols like lldp, collecting that data across nodes and
analyzing, or using tools like suzieq[1]?

RFC 5837 has been out for 11 years. Do any operating systems support it
— e.g., networking vendors like Cisco, Juniper, etc.? If not, why not?
This one seems to me the most dubious at this point in time.

Similarly for RFC 8335, what is the current support for it?

Linux does not need to support an RFC just because it exists. I am
really questioning the value of both of them

[1] https://github.com/netenglabs/suzieq


Re: [PATCH V4 net-next 5/5] icmp: add response to RFC 8335 PROBE messages

2021-03-20 Thread David Ahern
On 3/20/21 10:01 AM, Andreas Roeseler wrote:
> On Wed, 2021-03-17 at 21:24 -0600, David Ahern wrote:
>> On 3/17/21 9:19 PM, David Miller wrote:
>>> From: Andreas Roeseler 
>>> Date: Wed, 17 Mar 2021 22:11:47 -0500
>>>
>>>> On Mon, 2021-03-15 at 04:35 +0800, kernel test robot wrote:
>>>> Is there something that I'm not understanding about compiling
>>>> kernel
>>>> components modularly? How do I avoid this error?
>>>
>>>>
>>> You cannot reference module exported symbols from statically linked
>>> code.
>>> y
>>>
>>
>> Look at ipv6_stub to see how it exports IPv6 functions for v4 code.
>> There are a few examples under net/ipv4.
> 
> Thanks for the advice. I've been able to make some progress but I still
> have some questions that I have been unable to find online.
> 
> What steps are required to include a function into the ipv6_stub
> struct? I've added the declaration of the function to the struct, but
> when I attempt to call it using ipv6_dev_find()> the kernel
> locks up. Additionally, a typo in the declaration isn't flagged during
> compilation. Are there other places where I need to edit the ipv6_stub
> struct or include various headers? The examples I have looked at are
> , , and  in the  folder
> and they don't seem to do anything on the caller side of ipv6_stub, so
> I think I am not adding the function to ipv6_stub properly. I have been
> able to call other functions that currently exist in ipv6_stub, but not
> the one I  am attempting to add, so am I missing a step?

you probably did not add the default in net/ipv6/addrconf_core.c.

> 
> I've noticed that some functions such as  aren't
> exported using EXPORT_SYMBOL when it is defined in
> , but it is still loaded into ipv6_stub. How can
> this be? Is there a different way to include symbols into ipv6_stub
> based on whether or not they are explicitly exported using
> EXPORT_SYMBOL?
> 

take a look at 1aefd3de7bc6 as an example of how to add a new stub.


Re: [PATCH v3] icmp: support rfc5837

2021-03-19 Thread David Ahern
On 3/19/21 6:53 PM, Willem de Bruijn wrote:
> On Fri, Mar 19, 2021 at 7:54 PM David Ahern  wrote:
>>
>> On 3/19/21 10:11 AM, Ishaan Gandhi wrote:
>>> Thank you. Would it be better to do instead:
>>>
>>> + if_index = skb->skb_iif;
>>>
>>> or
>>>
>>> + if_index = ip_version == 4 ? inet_iif(skb) : skb->skb_iif;
>>>
>>
>> If the packet comes in via an interface assigned to a VRF, skb_iif is
>> most likely the VRF index which is not what you want.
>>
>> The general problem of relying on skb_iif was discussed on v1 and v2 of
>> your patch. Returning an iif that is a VRF, as an example, leaks
>> information about the networking configuration of the device which from
>> a quick reading of the RFC is not the intention.
>>
>> Further, the Security Considerations section recommends controls on what
>> information can be returned where you have added a single sysctl that
>> determines if all information or none is returned. Further, it is not a
>> a per-device control but a global one that applies to all net devices -
>> though multiple entries per netdevice has a noticeable cost in memory at
>> scale.
>>
>> In the end it seems to me the cost benefit is not there for a feature
>> like this.
> 
> The sysctl was my suggestion. The detailed filtering suggested in the
> RFC would add more complexity, not helping that cost benefit analysis.
> I cared mostly about being able to disable this feature outright as it has
> obvious risks.
> 
> But perhaps that is overly simplistic. The RFC suggests deciding trusted
> recipients based on destination address. With a sysctl this feature can be
> only deployed when all possible recipients are trusted, i.e., on an isolated
> network. That is highly limiting.
> 
> Perhaps a per-netns trusted subnet prefix?
> 
> The root admin should always be able to override and disable this outright.
> 

sure a sysctl is definitely required for a feature like this.

>From my perspective to be useful the control needs to be per interface
(e.g., management interface vs dataplane devices) and that has a higher
cost. Add in the amount of information returned and we know from other
examples that some users will want to limit which data is returned and
that increases the number of sysctls per device.

On top of that there is the logic of resolving what is the right device
and its information to return. There is are multiple layers - nic port,
bond, vlan, bridge, vrf, macvlan - each of which might be relevant. The
RFC referenced unnumbered devices as the ingress device. It seems like a
means for leaking information which comes back to the sysctl for proper
controls.

At the end of the day, what is the value of this feature vs the other
ICMP probing set?



Re: [PATCH v7 bpf-next 10/14] bpf: add new frame_length field to the XDP ctx

2021-03-19 Thread David Ahern
On 3/19/21 3:47 PM, Lorenzo Bianconi wrote:
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 19cd6642e087..e47d9e8da547 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -75,6 +75,10 @@ struct xdp_buff {
>   struct xdp_txq_info *txq;
>   u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved 
> tailroom*/
>   u32 mb:1; /* xdp non-linear buffer */
> + u32 frame_length; /* Total frame length across all buffers. Only needs
> +* to be updated by helper functions, as it will be
> +* initialized at XDP program start.
> +*/
>  };
>  
>  static __always_inline void

If you do another version of this set ...

I think you only need 17-bits for the frame length (size is always <=
128kB). It would be helpful for extensions to xdp if you annotated how
many bits are really needed here.


Re: [PATCH v3] icmp: support rfc5837

2021-03-19 Thread David Ahern
On 3/19/21 10:11 AM, Ishaan Gandhi wrote:
> Thank you. Would it be better to do instead:
> 
> + if_index = skb->skb_iif;
> 
> or
> 
> + if_index = ip_version == 4 ? inet_iif(skb) : skb->skb_iif;
> 

If the packet comes in via an interface assigned to a VRF, skb_iif is
most likely the VRF index which is not what you want.

The general problem of relying on skb_iif was discussed on v1 and v2 of
your patch. Returning an iif that is a VRF, as an example, leaks
information about the networking configuration of the device which from
a quick reading of the RFC is not the intention.

Further, the Security Considerations section recommends controls on what
information can be returned where you have added a single sysctl that
determines if all information or none is returned. Further, it is not a
a per-device control but a global one that applies to all net devices -
though multiple entries per netdevice has a noticeable cost in memory at
scale.

In the end it seems to me the cost benefit is not there for a feature
like this.


Re: [PATCH iproute2-next v4 0/6] ip: nexthop: Support resilient groups

2021-03-19 Thread David Ahern
On 3/17/21 6:54 AM, Petr Machata wrote:
> Support for resilient next-hop groups was recently accepted to Linux
> kernel[1]. Resilient next-hop groups add a layer of indirection between the
> SKB hash and the next hop. Thus the hash is used to reference a hash table
> bucket, which is then used to reference a particular next hop. This allows
> the system more flexibility when assigning SKB hash space to next hops.
> Previously, each next hop had to be assigned a continuous range of SKB hash
> space. With a hash table as an intermediate layer, it is possible to
> reassign next hops with a hash table bucket granularity. In turn, this
> mends issues with traffic flow redirection resulting from next hop removal
> or adjustments in next-hop weights.
> 
> In this patch set, introduce support for resilient next-hop groups to
> iproute2.
> 

applied to iproute2-next. Thanks,



Re: [PATCH v3] icmp: support rfc5837

2021-03-19 Thread David Ahern
On 3/17/21 4:19 PM, ishaangandhi wrote:
> +void icmp_identify_arrival_interface(struct sk_buff *skb, struct net *net, 
> int room,
> +  char *icmph, int ip_version)
> +{
> + unsigned int ext_len, orig_len, word_aligned_orig_len, offset, 
> extra_space_needed,
> +  if_index, mtu = 0, name_len = 0, name_subobj_len = 0;
> + struct interface_ipv4_addr_sub_obj ip4_addr_subobj = {.addr = 0};
> + struct interface_ipv6_addr_sub_obj ip6_addr_subobj;
> + struct icmp_extobj_hdr *iio_hdr;
> + struct inet6_ifaddr ip6_ifaddr;
> + struct inet6_dev *dev6 = NULL;
> + struct icmp_ext_hdr *ext_hdr;
> + char *name = NULL, ctype;
> + struct net_device *dev;
> + void *subobj_offset;
> +
> + skb_linearize(skb);
> + if_index = inet_iif(skb);

inet_iif is an IPv4 helper; it should not be used for v6 skb's.



Re: [PATCH V4 net-next 5/5] icmp: add response to RFC 8335 PROBE messages

2021-03-17 Thread David Ahern
On 3/17/21 9:19 PM, David Miller wrote:
> From: Andreas Roeseler 
> Date: Wed, 17 Mar 2021 22:11:47 -0500
> 
>> On Mon, 2021-03-15 at 04:35 +0800, kernel test robot wrote:
>> Is there something that I'm not understanding about compiling kernel
>> components modularly? How do I avoid this error?
> 
>>
> You cannot reference module exported symbols from statically linked code.
> y
> 

Look at ipv6_stub to see how it exports IPv6 functions for v4 code.
There are a few examples under net/ipv4.


Re: [PATCH 1/2] neighbour: allow referenced neighbours to be removed

2021-03-17 Thread David Ahern
On 3/17/21 12:53 PM, Thadeu Lima de Souza Cascardo wrote:
> During forced garbage collection, neighbours with more than a reference are
> not removed. It's possible to DoS the neighbour table by using ARP spoofing
> in such a way that there is always a timer pending for all neighbours,
> preventing any of them from being removed. That will cause any new
> neighbour creation to fail.
> 
> Use the same code as used by neigh_flush_dev, which deletes the timer and
> cleans the queue when there are still references left.
> 
> With the same ARP spoofing technique, it was still possible to reach a valid
> destination when this fix was applied, with no more table overflows.

And how fast are neighbor entries removed with this patch? The current
code gives a neighbor entry a minimum lifetime to allow it to exist long
enough to be confirmed. Removing the minimum lifetime means neighbor
entries are constantly churning which is just as bad as the arp spoofing
problem.

> 
> Signed-off-by: Thadeu Lima de Souza Cascardo 
> ---
>  net/core/neighbour.c | 117 +++
>  1 file changed, 51 insertions(+), 66 deletions(-)
> 
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index e2982b3970b8..bbc89c7ffdfd 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -173,25 +173,48 @@ static bool neigh_update_ext_learned(struct neighbour 
> *neigh, u32 flags,
>   return rc;
>  }
>  
> +static int neigh_del_timer(struct neighbour *n)
> +{
> + if ((n->nud_state & NUD_IN_TIMER) &&
> + del_timer(&n->timer)) {
> + neigh_release(n);
> + return 1;
> + }
> + return 0;
> +}
> +
>  static bool neigh_del(struct neighbour *n, struct neighbour __rcu **np,
> struct neigh_table *tbl)
>  {
> - bool retval = false;
> -
> + rcu_assign_pointer(*np,
> +rcu_dereference_protected(n->next,
> + lockdep_is_held(&tbl->lock)));
>   write_lock(&n->lock);
> - if (refcount_read(&n->refcnt) == 1) {
> - struct neighbour *neigh;
> -
> - neigh = rcu_dereference_protected(n->next,
> -   lockdep_is_held(&tbl->lock));
> - rcu_assign_pointer(*np, neigh);
> - neigh_mark_dead(n);
> - retval = true;
> + neigh_del_timer(n);
> + neigh_mark_dead(n);
> + if (refcount_read(&n->refcnt) != 1) {
> + /* The most unpleasant situation.
> +We must destroy neighbour entry,
> +but someone still uses it.
> +
> +The destroy will be delayed until
> +the last user releases us, but
> +we must kill timers etc. and move
> +it to safe state.
> +  */
> + __skb_queue_purge(&n->arp_queue);
> + n->arp_queue_len_bytes = 0;
> + n->output = neigh_blackhole;
> + if (n->nud_state & NUD_VALID)
> + n->nud_state = NUD_NOARP;
> + else
> + n->nud_state = NUD_NONE;
> + neigh_dbg(2, "neigh %p is stray\n", n);
>   }
>   write_unlock(&n->lock);
> - if (retval)
> - neigh_cleanup_and_release(n);
> - return retval;
> + neigh_cleanup_and_release(n);
> +
> + return true;
>  }
>  
>  bool neigh_remove_one(struct neighbour *ndel, struct neigh_table *tbl)
> @@ -229,22 +252,20 @@ static int neigh_forced_gc(struct neigh_table *tbl)
>   write_lock_bh(&tbl->lock);
>  
>   list_for_each_entry_safe(n, tmp, &tbl->gc_list, gc_list) {
> - if (refcount_read(&n->refcnt) == 1) {
> - bool remove = false;
> -
> - write_lock(&n->lock);
> - if ((n->nud_state == NUD_FAILED) ||
> - (tbl->is_multicast &&
> -  tbl->is_multicast(n->primary_key)) ||
> - time_after(tref, n->updated))
> - remove = true;
> - write_unlock(&n->lock);
> -
> - if (remove && neigh_remove_one(n, tbl))
> - shrunk++;
> - if (shrunk >= max_clean)
> - break;
> - }
> + bool remove = false;
> +
> + write_lock(&n->lock);
> + if ((n->nud_state == NUD_FAILED) ||
> + (tbl->is_multicast &&
> +  tbl->is_multicast(n->primary_key)) ||
> + time_after(tref, n->updated))
> + remove = true;
> + write_unlock(&n->lock);
> +
> + if (remove && neigh_remove_one(n, tbl))
> + shrunk++;
> + if (shrunk >= max_clean)
> + break;
>   }
>  
>   tbl->last_flush = jiffies;
> @@ -264,16 +285,6 @@ static void neigh_add_timer(struct neighbour *n, 
> un

Re: [PATCH] net: ipv4: Fixed some styling issues.

2021-03-17 Thread David Ahern
On 3/17/21 9:07 AM, Anish Udupa wrote:
> Ran checkpatch and found these warnings. Fixed some of them in this patch.
> a) Added a space before '='.
> b) Removed the space before the tab.
> 
> Signed-off-by: Anish Udupa H 
> ---
>  net/ipv4/route.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 02d81d79deeb..0b9024584fde 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2236,7 +2236,7 @@ out: return err;
>   if (!rth)
>   goto e_nobufs;
> 
> - rth->dst.output= ip_rt_bug;
> + rth->dst.output = ip_rt_bug;
>  #ifdef CONFIG_IP_ROUTE_CLASSID
>   rth->dst.tclassid = itag;
>  #endif
> @@ -2244,9 +2244,9 @@ out: return err;
> 
>   RT_CACHE_STAT_INC(in_slow_tot);
>   if (res->type == RTN_UNREACHABLE) {
> - rth->dst.input= ip_error;
> - rth->dst.error= -err;
> - rth->rt_flags &= ~RTCF_LOCAL;
> + rth->dst.input = ip_error;
> + rth->dst.error = -err;
> + rth->rt_flags &= ~RTCF_LOCAL;
>   }
> 
>   if (do_cache) {
> 

your patch seems to have lost one or more tabs at the beginning of each
line.


Re: [BUG] Iproute2 batch-mode fails to bring up veth

2021-03-16 Thread David Ahern
On 3/15/21 1:22 PM, Tim Rice wrote:
> Hey all,
> 
> Sorry if this isn't the right place to report Iproute2 bugs. It was
> implied by README.devel as well as a couple of entries I saw in bugzilla.
> 
> I use iproute2 batch mode to construct network namespaces. Example script:
> 
>   $ cat ~/bin/netns-test.sh
>   #! /bin/bash
> 
>   gw=192.168.5.1
>   ip=192.168.5.2
>   ns=netns-test
>   veth0=${ns}-0
>   veth1=${ns}-1
> 
>   /usr/local/sbin/ip -b - << EOF
>   link add $veth0 type veth peer name $veth1
>   addr add $gw peer $ip dev $veth0
>   link set dev $veth0 up
>   netns add $ns
>   link set $veth1 netns $ns
>   netns exec $ns ip link set dev lo up
>   netns exec $ns ip link set dev $veth1 up
>   netns exec $ns ip addr add $ip/24 dev $veth1
>   netns exec $ns ip addr add $ip peer $gw dev $veth1
>   netns exec $ns ip route add default via $gw dev $veth1
>   netns exec $ns ip route add 192.168.0.0/24 via $gw dev $veth1
>   EOF
> 
> 
> I noticed when version 5.11.0 dropped that this stops working. Batch
> mode fails to bring up the inner veth.
> 
> Expected usage (as produced by v5.10.0):
> 
>   $ sudo ./bin/netns-test.sh
>   $ sudo ip netns exec netns-test ip route
>   default via 192.168.5.1 dev netns-test-1
>   192.168.0.0/24 via 192.168.5.1 dev netns-test-1
>   192.168.5.0/24 dev netns-test-1 proto kernel scope link src 192.168.5.2
>   192.168.5.1 dev netns-test-1 proto kernel scope link src 192.168.5.2
> 
> Actual behaviour:
> 
>   $ sudo ./bin/netns-test.sh
>   $ sudo ip netns exec netns-test ip route  # Notice the empty output
>   $ sudo ip netns exec netns-test ip link
>   1: lo:  mtu 65536 qdisc noqueue state UNKNOWN
> mode DEFAULT group default qlen 1000
>   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>   39: netns-test-1@if40:  mtu 1500 qdisc noop state
> DOWN mode DEFAULT group default qlen 1000
>   link/ether 1a:96:4e:4f:84:31 brd ff:ff:ff:ff:ff:ff link-netnsid 0
> 
> System info:
> 
> * Distro: Void Linux
> * Kernel version: 5.10.23
> * CPU: AMD Ryzen 7 1800X and Ryzen 5 2600. (Reproduced on both.)
> 
> Git bisect pinpoints this commit:
> https://github.com/shemminger/iproute2/commit/1d9a81b8c9f30f9f4abeb875998262f61bf10577
> 

Petr, can you take a look at this regression?



Re: VRF leaking doesn't work

2021-03-15 Thread David Ahern
On 3/15/21 11:10 AM, Greesha Mikhalkin wrote:
>> That's the way the source address selection works -- it takes the fib
>> lookup result and finds the best source address match for it.
>>
>> Try adding 'src a.b.c.d' to the leaked route. e.g.,
>> ip ro add 172.16.1.0/24 dev red vrf blue src 172.16.2.1
>>
>> where red and blue are VRFs, 172.16.2.1 is a valid source address in VRF
>> blue and VRF red has the reverse route installed.
> 
> Tried to do that. Added reverse route to vrf red like that:
> ip ro add vrf red 172.16.2.1/32 dev blue
> 
> 172.16.2.1 is selected as source address when i ping. But now, when i
> look at `tcpdump icmp` i only see requests:
> 172.16.2.1 > 172.16.1.3: ICMP echo request, id 9, seq 10, length 64
> 
> And no replies and anything else. If i look into tcpdump on machine
> that's pinged -- it doesn't receive anything.
> 
> So it looks like it's not using nexthops from vrf red in that case.
> Maybe it has something to do with how address is setup. In routing
> table it looks like:
> local 172.16.2.1 dev vlanblue proto kernel scope host src 172.16.2.1
> 

VRF is implemented via policy routing. did you re-order the FIB rules?


http://schd.ws/hosted_files/ossna2017/fe/vrf-tutorial-oss.pdf


Re: [PATCH iproute2-next v2] dcb: Fix compilation warning about reallocarray

2021-03-15 Thread David Ahern
On 3/15/21 7:10 AM, Petr Machata wrote:
> Could this be merged, please?

done


Re: [PATCH iproute2-next 4/6] nexthop: Add ability to specify group type

2021-03-14 Thread David Ahern
On 3/12/21 10:23 AM, Petr Machata wrote:
> From: Petr Machata 
> 
> From: Ido Schimmel 

All of the patches have the above. If Ido is the author and you are
sending, AIUI you add your Signed-off-by below his.

> 
> Next patches are going to add a 'resilient' nexthop group type, so allow
> users to specify the type using the 'type' argument. Currently, only
> 'mpath' type is supported.
> 
> These two command are equivalent:
> 
> Signed-off-by: Ido Schimmel 
> ---
>  ip/ipnexthop.c| 32 +++-
>  man/man8/ip-nexthop.8 | 18 --
>  2 files changed, 47 insertions(+), 3 deletions(-)
> 

...

> diff --git a/man/man8/ip-nexthop.8 b/man/man8/ip-nexthop.8
> index 4d55f4dbcc75..f02e0555a000 100644
> --- a/man/man8/ip-nexthop.8
> +++ b/man/man8/ip-nexthop.8
> @@ -54,7 +54,9 @@ ip-nexthop \- nexthop object management
>  .BR fdb " ] | "
>  .B  group
>  .IR GROUP " [ "
> -.BR fdb " ] } "
> +.BR fdb " ] [ "
> +.B type
> +.IR TYPE " ] } "
>  
>  .ti -8
>  .IR ENCAP " := [ "
> @@ -71,6 +73,10 @@ ip-nexthop \- nexthop object management
>  .IR GROUP " := "
>  .BR id "[," weight "[/...]"
>  
> +.ti -8
> +.IR TYPE " := { "
> +.BR mpath " }"
> +
>  .SH DESCRIPTION
>  .B ip nexthop
>  is used to manipulate entries in the kernel's nexthop tables.
> @@ -122,9 +128,17 @@ is a set of encapsulation attributes specific to the
>  .in -2
>  
>  .TP
> -.BI group " GROUP"
> +.BI group " GROUP [ " type " TYPE ]"
>  create a nexthop group. Group specification is id with an optional
>  weight (id,weight) and a '/' as a separator between entries.
> +.sp
> +.I TYPE
> +is a string specifying the nexthop group type. Namely:
> +
> +.in +8
> +.BI mpath
> +- multipath nexthop group
> +

Add a comment that this is the default group type and refers to the
legacy hash-bashed multipath group.

The rest of the patches look ok to me.


Re: [PATCH net-next 07/10] selftests: fib_nexthops: Test resilient nexthop groups

2021-03-14 Thread David Ahern
On 3/12/21 9:50 AM, Petr Machata wrote:
> From: Ido Schimmel 
> 
> Add test cases for resilient nexthop groups. Exhaustive forwarding tests
> are added separately under net/forwarding/.
> 
...
> 
> Signed-off-by: Ido Schimmel 
> Co-developed-by: Petr Machata 
> Signed-off-by: Petr Machata 
> ---
>  tools/testing/selftests/net/fib_nexthops.sh | 517 
>  1 file changed, 517 insertions(+)
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 06/10] selftests: fib_nexthops: List each test case in a different line

2021-03-14 Thread David Ahern
On 3/12/21 9:50 AM, Petr Machata wrote:
> From: Ido Schimmel 
> 
> The lines with the IPv4 and IPv6 test cases are already very long and
> more test cases will be added in subsequent patches.
> 
> List each test case in a different line to make it easier to extend the
> test with more test cases.
> 
> Signed-off-by: Ido Schimmel 
> Reviewed-by: Petr Machata 
> Signed-off-by: Petr Machata 
> ---
>  tools/testing/selftests/net/fib_nexthops.sh | 30 ++---
>  1 file changed, 26 insertions(+), 4 deletions(-)
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 05/10] selftests: fib_nexthops: Declutter test output

2021-03-14 Thread David Ahern
On 3/12/21 9:50 AM, Petr Machata wrote:
> From: Ido Schimmel 
> 
> Before:
> 
>  # ./fib_nexthops.sh -t ipv4_torture
> 
> IPv4 runtime torture
> 
> TEST: IPv4 torture test [ OK ]
> ./fib_nexthops.sh: line 213: 19376 Killed  ipv4_del_add_loop1
> ./fib_nexthops.sh: line 213: 19377 Killed  
> ipv4_grp_replace_loop
> ./fib_nexthops.sh: line 213: 19378 Killed  ip netns exec me 
> ping -f 172.16.101.1 > /dev/null 2>&1
> ./fib_nexthops.sh: line 213: 19380 Killed  ip netns exec me 
> ping -f 172.16.101.2 > /dev/null 2>&1
> ./fib_nexthops.sh: line 213: 19381 Killed  ip netns exec me 
> mausezahn veth1 -B 172.16.101.2 -A 172.16.1.1 -c 0 -t tcp "dp=1-1023, 
> flags=syn" > /dev/null 2>&1
> 
> Tests passed:   1
> Tests failed:   0
> 
>  # ./fib_nexthops.sh -t ipv6_torture
> 
> IPv6 runtime torture
> 
> TEST: IPv6 torture test [ OK ]
> ./fib_nexthops.sh: line 213: 24453 Killed  ipv6_del_add_loop1
> ./fib_nexthops.sh: line 213: 24454 Killed  
> ipv6_grp_replace_loop
> ./fib_nexthops.sh: line 213: 24456 Killed  ip netns exec me 
> ping -f 2001:db8:101::1 > /dev/null 2>&1
> ./fib_nexthops.sh: line 213: 24457 Killed  ip netns exec me 
> ping -f 2001:db8:101::2 > /dev/null 2>&1
> ./fib_nexthops.sh: line 213: 24458 Killed  ip netns exec me 
> mausezahn -6 veth1 -B 2001:db8:101::2 -A 2001:db8:91::1 -c 0 -t tcp 
> "dp=1-1023, flags=syn" > /dev/null 2>&1
> 
> Tests passed:   1
> Tests failed:   0
> 
> After:
> 
>  # ./fib_nexthops.sh -t ipv4_torture
> 
> IPv4 runtime torture
> 
> TEST: IPv4 torture test [ OK ]
> 
> Tests passed:   1
> Tests failed:   0
> 
>  # ./fib_nexthops.sh -t ipv6_torture
> 
> IPv6 runtime torture
> 
> TEST: IPv6 torture test [ OK ]
> 
> Tests passed:   1
> Tests failed:   0
> 
> Signed-off-by: Ido Schimmel 
> Reviewed-by: Petr Machata 
> Signed-off-by: Petr Machata 
> ---
>  tools/testing/selftests/net/fib_nexthops.sh | 2 ++
>  1 file changed, 2 insertions(+)
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next] net: ipv6: addrconf: Add accept_ra_prefix_route.

2021-03-12 Thread David Ahern
On 3/11/21 7:22 PM, subas...@codeaurora.org wrote:
> 
> We are seeing that the interface itself doesn't get the address assigned
> via RA when setting accept_ra_pinfo = 0.
> 
> We would like to have the interface address assigned via SLAAC
> here while the route management would be handled via the userspace daemon.
> In that case, we do not want the kernel installed route to be present
> (behavior controlled via this proc entry).

sysctl's are not free and in this case you want to add a second one to
pick and choose which data in the message you want the kernel to act on.

Why can't the userspace daemon remove the route and add the one it
prefers? Or add another route with a metric that makes it the preferred
route making the kernel one effectively moot?


Re: VRF leaking doesn't work

2021-03-12 Thread David Ahern
On 3/10/21 1:34 AM, Greesha Mikhalkin wrote:
> I see. When i do `ping -I vrf2` to address that was leaked from vrf1
> it selects source address that's set as local in vrf1 routing table.
> Is this expected behavior? I guess, forwarding packets from vrf1 to
> vrf2 local address won't help here.
> 

That's the way the source address selection works -- it takes the fib
lookup result and finds the best source address match for it.

Try adding 'src a.b.c.d' to the leaked route. e.g.,
ip ro add 172.16.1.0/24 dev red vrf blue src 172.16.2.1

where red and blue are VRFs, 172.16.2.1 is a valid source address in VRF
blue and VRF red has the reverse route installed.


Re: [PATCH net-next 00/14] nexthop: Resilient next-hop groups

2021-03-11 Thread David Ahern
On 3/10/21 8:02 AM, Petr Machata wrote:
> At this moment, there is only one type of next-hop group: an mpath group.
> Mpath groups implement the hash-threshold algorithm, described in RFC
> 2992[1].
> 
> To select a next hop, hash-threshold algorithm first assigns a range of
> hashes to each next hop in the group, and then selects the next hop by
> comparing the SKB hash with the individual ranges. When a next hop is
> removed from the group, the ranges are recomputed, which leads to
> reassignment of parts of hash space from one next hop to another. RFC 2992
> illustrates it thus:
> 
>  +---+---+---+---+---+
>  |   1   |   2   |   3   |   4   |   5   |
>  +---+-+-+---+---+-+-+---+
>  |1|2|4|5|
>  +-+-+-+-+
> 
>   Before and after deletion of next hop 3
> under the hash-threshold algorithm.
> 
> Note how next hop 2 gave up part of the hash space in favor of next hop 1,
> and 4 in favor of 5. While there will usually be some overlap between the
> previous and the new distribution, some traffic flows change the next hop
> that they resolve to.
> 
> If a multipath group is used for load-balancing between multiple servers,
> this hash space reassignment causes an issue that packets from a single
> flow suddenly end up arriving at a server that does not expect them, which
> may lead to TCP reset.
> 
> If a multipath group is used for load-balancing among available paths to
> the same server, the issue is that different latencies and reordering along
> the way causes the packets to arrive in the wrong order.
> 
> Resilient hashing is a technique to address the above problem. Resilient
> next-hop group has another layer of indirection between the group itself
> and its constituent next hops: a hash table. The selection algorithm uses a
> straightforward modulo operation on the SKB hash to choose a hash table
> bucket, then reads the next hop that this bucket contains, and forwards
> traffic there.
> 
> This indirection brings an important feature. In the hash-threshold
> algorithm, the range of hashes associated with a next hop must be
> continuous. With a hash table, mapping between the hash table buckets and
> the individual next hops is arbitrary. Therefore when a next hop is deleted
> the buckets that held it are simply reassigned to other next hops:
> 
>  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5|
>  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> v v v v
>  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5|
>  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
> 
>   Before and after deletion of next hop 3
> under the resilient hashing algorithm.
> 
> When weights of next hops in a group are altered, it may be possible to
> choose a subset of buckets that are currently not used for forwarding
> traffic, and use those to satisfy the new next-hop distribution demands,
> keeping the "busy" buckets intact. This way, established flows are ideally
> kept being forwarded to the same endpoints through the same paths as before
> the next-hop group change.
> 
> This patch set adds the implementation of resilient next-hop groups.
> 
> In a nutshell, the algorithm works as follows. Each next hop has a number
> of buckets that it wants to have, according to its weight and the number of
> buckets in the hash table. In case of an event that might cause bucket
> allocation change, the numbers for individual next hops are updated,
> similarly to how ranges are updated for mpath group next hops. Following
> that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
> next hop that is currently occupying more buckets than it wants (it is
> "overweight"), it migrates the buckets to one of the next hops that has
> fewer buckets than it wants (it is "underweight"). If, after this, there
> are still underweight next hops, another upkeep run is scheduled to a
> future time.
> 
> Chances are there are not enough "idle" buckets to satisfy the new demands.
> The algorithm has knobs to select both what it means for a bucket to be
> idle, and for whether and when to forcefully migrate buckets if there keeps
> being an insufficient number of idle ones.
> 
> To illustrate the usage, consider the following commands:
> 
>  # ip nexthop add id 1 via 192.0.2.2 dev dummy1
>  # ip nexthop add id 2 via 192.0.2.3 dev dummy1
>  # ip nexthop add id 10 group 1/2 type resilient \
>   buckets 8 idle_timer 60 unbalanced_timer 300
> 
> The last command creates a resilient next-hop group. It will have 8
> buckets, each bucket will be considered idle when no traffic hits it for at
> least 60 seconds, and if the table remains out of balance for 300 seconds,
> it will 

Re: [PATCH net-next 14/14] nexthop: Enable resilient next-hop groups

2021-03-11 Thread David Ahern
On 3/10/21 8:03 AM, Petr Machata wrote:
> Now that all the code is in place, stop rejecting requests to create
> resilient next-hop groups.
> 
> Signed-off-by: Petr Machata 
> Reviewed-by: Ido Schimmel 
> ---
>  net/ipv4/nexthop.c | 4 
>  1 file changed, 4 deletions(-)
> 


Reviewed-by: David Ahern 



Re: [PATCH net-next 13/14] nexthop: Notify userspace about bucket migrations

2021-03-11 Thread David Ahern
On 3/10/21 8:03 AM, Petr Machata wrote:
> Nexthop replacements et.al. are notified through netlink, but if a delayed
> work migrates buckets on the background, userspace will stay oblivious.
> Notify these as RTM_NEWNEXTHOPBUCKET events.
> 
> Signed-off-by: Petr Machata 
> Reviewed-by: Ido Schimmel 
> ---
> 
> Notes:
> v1 (changes since RFC):
> - u32 -> u16 for bucket counts / indices
> 
>  net/ipv4/nexthop.c | 45 +++--
>  1 file changed, 39 insertions(+), 6 deletions(-)
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 12/14] nexthop: Add netlink handlers for bucket get

2021-03-11 Thread David Ahern
On 3/10/21 8:03 AM, Petr Machata wrote:
> Allow getting (but not setting) individual buckets to inspect the next hop
> mapped therein, idle time, and flags.
> 
> Signed-off-by: Petr Machata 
> Reviewed-by: Ido Schimmel 
> ---
> 
> Notes:
> v1 (changes since RFC):
> - u32 -> u16 for bucket counts / indices
> 
>  net/ipv4/nexthop.c | 110 -
>  1 file changed, 109 insertions(+), 1 deletion(-)
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 11/14] nexthop: Add netlink handlers for bucket dump

2021-03-11 Thread David Ahern
On 3/10/21 8:03 AM, Petr Machata wrote:
> Add a dump handler for resilient next hop buckets. When next-hop group ID
> is given, it walks buckets of that group, otherwise it walks buckets of all
> groups. It then dumps the buckets whose next hops match the given filtering
> criteria.
> 
> Signed-off-by: Petr Machata 
> Reviewed-by: Ido Schimmel 
> ---
> 
> Notes:
> v1 (changes since RFC):
> - u32 -> u16 for bucket counts / indices
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 10/14] nexthop: Add netlink handlers for resilient nexthop groups

2021-03-11 Thread David Ahern
On 3/10/21 8:03 AM, Petr Machata wrote:
> Implement the netlink messages that allow creation and dumping of resilient
> nexthop groups.
> 
> Signed-off-by: Petr Machata 
> Reviewed-by: Ido Schimmel 
> ---
> 
> Notes:
> v1 (changes since RFC):
> - u32 -> u16 for bucket counts / indices
> 
>  net/ipv4/nexthop.c | 150 +++--
>  1 file changed, 145 insertions(+), 5 deletions(-)
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 09/14] nexthop: Allow reporting activity of nexthop buckets

2021-03-11 Thread David Ahern
On 3/10/21 8:03 AM, Petr Machata wrote:
> From: Ido Schimmel 
> 
> The kernel periodically checks the idle time of nexthop buckets to
> determine if they are idle and can be re-populated with a new nexthop.
> 
> When the resilient nexthop group is offloaded to hardware, the kernel
> will not see activity on nexthop buckets unless it is reported from
> hardware.
> 
> Add a function that can be periodically called by device drivers to
> report activity on nexthop buckets after querying it from the underlying
> device.
> 
> Signed-off-by: Ido Schimmel 
> Reviewed-by: Petr Machata 
> Signed-off-by: Petr Machata 
> ---
> 
> Notes:
> v1 (changes since RFC):
> - u32 -> u16 for bucket counts / indices
> 
>  include/net/nexthop.h |  2 ++
>  net/ipv4/nexthop.c| 35 +++++++
>  2 files changed, 37 insertions(+)
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 08/14] nexthop: Allow setting "offload" and "trap" indication of nexthop buckets

2021-03-11 Thread David Ahern
On 3/10/21 8:02 AM, Petr Machata wrote:
> From: Ido Schimmel 
> 
> Add a function that can be called by device drivers to set "offload" or
> "trap" indication on nexthop buckets following nexthop notifications and
> other changes such as a neighbour becoming invalid.
> 
> Signed-off-by: Ido Schimmel 
> Reviewed-by: Petr Machata 
> Signed-off-by: Petr Machata 
> ---
> 
> Notes:
> v1 (changes since RFC):
> - u32 -> u16 for bucket counts / indices
> 
>  include/net/nexthop.h |  2 ++
>  net/ipv4/nexthop.c| 34 ++++++
>  2 files changed, 36 insertions(+)
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 07/14] nexthop: Implement notifiers for resilient nexthop groups

2021-03-11 Thread David Ahern
On 3/10/21 8:02 AM, Petr Machata wrote:
> Implement the following notifications towards drivers:
> 
> - NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created.
> 
> - NEXTHOP_EVENT_BUCKET_REPLACE any time there is a change in assignment of
>   next hops to hash table buckets. That includes replacements, deletions,
>   and delayed upkeep cycles. Some bucket notifications can be vetoed by the
>   driver, to make it possible to propagate bucket busy-ness flags from the
>   HW back to the algorithm. Some are however forced, e.g. if a next hop is
>   deleted, all buckets that use this next hop simply must be migrated,
>   whether the HW wishes so or not.
> 
> - NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is
>   replaced. Usually the driver will get the bucket notifications as well,
>   and could veto those. But in some cases, a bucket may not be migrated
>   immediately, but during delayed upkeep, and that is too late to roll the
>   transaction back. This notification allows the driver to take a look and
>   veto the new proposed group up front, before anything is committed.
> 
> Signed-off-by: Petr Machata 
> Reviewed-by: Ido Schimmel 
> ---
> 
> Notes:
> v1 (changes since RFC):
> - u32 -> u16 for bucket counts / indices
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 06/14] nexthop: Add data structures for resilient group notifications

2021-03-11 Thread David Ahern
On 3/10/21 8:02 AM, Petr Machata wrote:
> From: Ido Schimmel 
> 
> Add data structures that will be used for in-kernel notifications about
> addition / deletion of a resilient nexthop group and about changes to a
> hash bucket within a resilient group.
> 
> Signed-off-by: Ido Schimmel 
> Reviewed-by: Petr Machata 
> Signed-off-by: Petr Machata 
> ---
> 
> Notes:
> v1 (changes since RFC):
> - u32 -> u16 for bucket counts / indices
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 03/14] nexthop: Add a dedicated flag for multipath next-hop groups

2021-03-11 Thread David Ahern
On 3/11/21 8:39 AM, Petr Machata wrote:
> 
> David Ahern  writes:
> 
>>> diff --git a/include/net/nexthop.h b/include/net/nexthop.h
>>> index 7bc057aee40b..5062c2c08e2b 100644
>>> --- a/include/net/nexthop.h
>>> +++ b/include/net/nexthop.h
>>> @@ -80,6 +80,7 @@ struct nh_grp_entry {
>>>  struct nh_group {
>>> struct nh_group *spare; /* spare group for removals */
>>> u16 num_nh;
>>> +   boolis_multipath;
>>> boolmpath;
>>
>>
>> It would be good to rename the existing type 'mpath' to something else.
>> You have 'resilient' as a group type later, so maybe rename this one to
>> hash or hash_threshold.
> 
> All right, I'll send a follow-up with that.
> 

I'm fine with the rename being a followup after this patch set or as the
last patch in this set.


Re: [PATCH net-next 04/14] nexthop: Add netlink defines and enumerators for resilient NH groups

2021-03-11 Thread David Ahern
On 3/11/21 8:45 AM, Petr Machata wrote:
> 
> David Ahern  writes:
> 
>> On 3/10/21 8:02 AM, Petr Machata wrote:
>>> diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
>>> index 2d4a1e784cf0..8efebf3cb9c7 100644
>>> --- a/include/uapi/linux/nexthop.h
>>> +++ b/include/uapi/linux/nexthop.h
>>> @@ -22,6 +22,7 @@ struct nexthop_grp {
>>>  
>>>  enum {
>>> NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */
>>
>> Update the above comment that it is for legacy, hash based multipath.
> 
> Maybe this would make sense?
> 
>   NEXTHOP_GRP_TYPE_MPATH,  /* hash-threshold nexthop group */
> 

yes, the description is fine. keep the comment about 'default type'.


Re: [PATCH net-next 05/14] nexthop: Add implementation of resilient next-hop groups

2021-03-11 Thread David Ahern
during upkeep. The
> only times that the hash table is invalid is the very beginning and very
> end of its lifetime. Between those points, it is always kept valid.
> 
> This patch introduces the core support code itself. It does not handle
> notifications towards drivers, which are kept as if the group were an mpath
> one. It does not handle netlink either. The only bit currently exposed to
> user space is the new next-hop group type, and that is currently bounced.
> There is therefore no way to actually access this code.
> 
> Signed-off-by: Petr Machata 
> Reviewed-by: Ido Schimmel 
> ---
> 

Thanks for the detailed documentation around exclusion expectations.

Reviewed-by: David Ahern 




Re: [PATCH net-next 04/14] nexthop: Add netlink defines and enumerators for resilient NH groups

2021-03-11 Thread David Ahern
On 3/10/21 8:02 AM, Petr Machata wrote:
> diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h
> index 2d4a1e784cf0..8efebf3cb9c7 100644
> --- a/include/uapi/linux/nexthop.h
> +++ b/include/uapi/linux/nexthop.h
> @@ -22,6 +22,7 @@ struct nexthop_grp {
>  
>  enum {
>   NEXTHOP_GRP_TYPE_MPATH,  /* default type if not specified */

Update the above comment that it is for legacy, hash based multipath.

> + NEXTHOP_GRP_TYPE_RES,/* resilient nexthop group */
>   __NEXTHOP_GRP_TYPE_MAX,
>  };
>  

besides the request for an updated comment, this looks fine:
Reviewed-by: David Ahern 


Re: [PATCH net-next 03/14] nexthop: Add a dedicated flag for multipath next-hop groups

2021-03-11 Thread David Ahern
On 3/10/21 8:02 AM, Petr Machata wrote:
> With the introduction of resilient nexthop groups, there will be two types
> of multipath groups: the current hash-threshold "mpath" ones, and resilient
> groups. Both are multipath, but to determine the fact, the system needs to
> consider two flags. This might prove costly in the datapath. Therefore,
> introduce a new flag, that should be set for next-hop groups that have more
> than one nexthop, and should be considered multipath.
> 
> Signed-off-by: Petr Machata 
> Reviewed-by: Ido Schimmel 
> ---
> 
> Notes:
> v1 (changes since RFC):
> - This patch is new
> 
>  include/net/nexthop.h | 7 ---
>  net/ipv4/nexthop.c| 5 -
>  2 files changed, 8 insertions(+), 4 deletions(-)

This patch looks good:
Reviewed-by: David Ahern 

> 
> diff --git a/include/net/nexthop.h b/include/net/nexthop.h
> index 7bc057aee40b..5062c2c08e2b 100644
> --- a/include/net/nexthop.h
> +++ b/include/net/nexthop.h
> @@ -80,6 +80,7 @@ struct nh_grp_entry {
>  struct nh_group {
>   struct nh_group *spare; /* spare group for removals */
>   u16 num_nh;
> + boolis_multipath;
>   boolmpath;


It would be good to rename the existing type 'mpath' to something else.
You have 'resilient' as a group type later, so maybe rename this one to
hash or hash_threshold.



Re: [PATCH net-next 02/14] nexthop: __nh_notifier_single_info_init(): Make nh_info an argument

2021-03-11 Thread David Ahern
On 3/10/21 8:02 AM, Petr Machata wrote:
> The cited function currently uses rtnl_dereference() to get nh_info from a
> handed-in nexthop. However, under the resilient hashing scheme, this
> function will not always be called under RTNL, sometimes the mutual
> exclusion will be achieved differently. Therefore move the nh_info
> extraction from the function to its callers to make it possible to use a
> different synchronization guarantee.
> 
> Signed-off-by: Petr Machata 
> Reviewed-by: Ido Schimmel 
> ---
>  net/ipv4/nexthop.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 

Reviewed-by: David Ahern 




Re: [PATCH net-next 01/14] nexthop: Pass nh_config to replace_nexthop()

2021-03-11 Thread David Ahern
On 3/10/21 8:02 AM, Petr Machata wrote:
> Currently, replace assumes that the new group that is given is a
> fully-formed object. But mpath groups really only have one attribute, and
> that is the constituent next hop configuration. This may not be universally
> true. From the usability perspective, it is desirable to allow the replace
> operation to adjust just the constituent next hop configuration and leave
> the group attributes as such intact.
> 
> But the object that keeps track of whether an attribute was or was not
> given is the nh_config object, not the next hop or next-hop group. To allow
> (selective) attribute updates during NH group replacement, propagate `cfg'
> to replace_nexthop() and further to replace_nexthop_grp().
> 
> Signed-off-by: Petr Machata 
> Reviewed-by: Ido Schimmel 
> ---
>  net/ipv4/nexthop.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 

Reviewed-by: David Ahern 


Re: [PATCH net-next] net: ipv6: addrconf: Add accept_ra_prefix_route.

2021-03-10 Thread David Ahern
On 3/10/21 11:49 AM, Subash Abhinov Kasiviswanathan wrote:
> Added new procfs flag to toggle the automatic addition of prefix
> routes on a per device basis. The new flag is accept_ra_prefix_route.
> 
> A value of 0 for the flag maybe used in some forwarding scenarios
> when a userspace daemon is managing the routing.
> Manual deletion of the kernel installed route was not sufficient as
> kernel was adding back the route.
> 
> Defaults to 1 as to not break existing behavior.
> 
> Signed-off-by: Subash Abhinov Kasiviswanathan 
> ---
>  Documentation/networking/ip-sysctl.rst | 10 ++
>  include/linux/ipv6.h   |  1 +
>  include/uapi/linux/ipv6.h  |  1 +
>  net/ipv6/addrconf.c| 16 +---
>  4 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.rst 
> b/Documentation/networking/ip-sysctl.rst
> index c7952ac..9f0d92d 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -2022,6 +2022,16 @@ accept_ra_mtu - BOOLEAN
>   - enabled if accept_ra is enabled.
>   - disabled if accept_ra is disabled.
>  
> +accept_ra_prefix_route - BOOLEAN
> + Apply the prefix route based on the RA. If disabled, kernel
> + does not install the route. This can be used if a userspace
> + daemon is managing the routing.
> +
> + Functional default:
> +
> + - enabled if accept_ra_prefix_route is enabled
> + - disabled if accept_ra_prefix_route is disabled
> +
>  accept_redirects - BOOLEAN
>   Accept Redirects.
>  

this seems to duplicate accept_ra_pinfo



Re: [PATCH net v2] ipv6: fix suspecious RCU usage warning

2021-03-10 Thread David Ahern
On 3/9/21 7:20 PM, Wei Wang wrote:
> Syzbot reported the suspecious RCU usage in nexthop_fib6_nh() when
> called from ipv6_route_seq_show(). The reason is ipv6_route_seq_start()
> calls rcu_read_lock_bh(), while nexthop_fib6_nh() calls
> rcu_dereference_rtnl().
> The fix proposed is to add a variant of nexthop_fib6_nh() to use
> rcu_dereference_bh_rtnl() for ipv6_route_seq_show().
> 

...

> 
> Fixes: f88d8ea67fbdb ("ipv6: Plumb support for nexthop object in a fib6_info")
> Reported-by: syzbot 
> Signed-off-by: Wei Wang 
> Cc: David Ahern 
> Cc: Ido Schimmel 
> Cc: Petr Machata 
> Cc: Eric Dumazet 
> ---
>  include/net/nexthop.h | 24 
>  net/ipv6/ip6_fib.c|  2 +-
>  2 files changed, 25 insertions(+), 1 deletion(-)
> 


Reviewed-by: David Ahern 

Thanks, Wei.


Re: [PATCH] net: add net namespace inode for all net_dev events

2021-03-09 Thread David Ahern
On 3/9/21 1:02 PM, Steven Rostedt wrote:
> On Tue, 9 Mar 2021 12:53:37 -0700
> David Ahern  wrote:
> 
>> Changing the order of the fields will impact any bpf programs expecting
>> the existing format
> 
> I thought bpf programs were not API. And why are they not parsing this
> information? They have these offsets hard coded Why would they do that!
> The information to extract the data where ever it is has been there from
> day 1! Way before BPF ever had access to trace events.

BPF programs attached to a tracepoint are passed a context - a structure
based on the format for the tracepoint. To take an in-tree example, look
at samples/bpf/offwaketime_kern.c:

...

/* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
struct sched_switch_args {
unsigned long long pad;
char prev_comm[16];
int prev_pid;
int prev_prio;
long long prev_state;
char next_comm[16];
int next_pid;
int next_prio;
};
SEC("tracepoint/sched/sched_switch")
int oncpu(struct sched_switch_args *ctx)
{

...

Production systems do not typically have toolchains installed, so
dynamic generation of the program based on the 'format' file on the
running system is not realistic. That means creating the programs on a
development machine and installing on the production box. Further, there
is an expectation that a bpf program compiled against version X works on
version Y. Changing the order of the fields will break such programs in
non-obvious ways.


Re: [PATCH] net: add net namespace inode for all net_dev events

2021-03-09 Thread David Ahern
On 3/9/21 10:40 AM, Steven Rostedt wrote:
> The order of the fields is important. Don't worry about breaking API by
> fixing it. The parsing code uses this output to find where the binary data
> is.

Changing the order of the fields will impact any bpf programs expecting
the existing format.


Re: [PATCH net] ipv6: fix suspecious RCU usage warning

2021-03-09 Thread David Ahern
On 3/9/21 10:32 AM, Wei Wang wrote:
> Thanks David and Ido.
> To clarify, David, you suggest we add a separate function instead of
> adding an extra parameter, right?

for this case I think it is the better way to go.


Re: [PATCH net v3 2/2] net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0

2021-03-09 Thread David Ahern
On 3/9/21 4:31 AM, Balazs Nemeth wrote:
> A packet with skb_inner_network_header(skb) == skb_network_header(skb)
> and ETH_P_MPLS_UC will prevent mpls_gso_segment from pulling any headers
> from the packet. Subsequently, the call to skb_mac_gso_segment will
> again call mpls_gso_segment with the same packet leading to an infinite
> loop. In addition, ensure that the header length is a multiple of four,
> which should hold irrespective of the number of stacked labels.
> 
> Signed-off-by: Balazs Nemeth 
> ---
>  net/mpls/mpls_gso.c | 3 +++
>  1 file changed, 3 insertions(+)
> 


Reviewed-by: David Ahern 



Re: [PATCH net] ipv6: fix suspecious RCU usage warning

2021-03-08 Thread David Ahern
[ cc Ido and Petr ]

On 3/8/21 12:21 PM, Wei Wang wrote:
> diff --git a/include/net/nexthop.h b/include/net/nexthop.h
> index 7bc057aee40b..48956b144689 100644
> --- a/include/net/nexthop.h
> +++ b/include/net/nexthop.h
> @@ -410,31 +410,39 @@ static inline struct fib_nh *fib_info_nh(struct 
> fib_info *fi, int nhsel)
>  int fib6_check_nexthop(struct nexthop *nh, struct fib6_config *cfg,
>  struct netlink_ext_ack *extack);
>  
> -static inline struct fib6_nh *nexthop_fib6_nh(struct nexthop *nh)
> +static inline struct fib6_nh *nexthop_fib6_nh(struct nexthop *nh,
> +   bool bh_disabled)

Hi Wei: I would prefer not to have a second argument to nexthop_fib6_nh
for 1 code path, and a control path at that.

>  {
>   struct nh_info *nhi;
>  
>   if (nh->is_group) {
>   struct nh_group *nh_grp;
>  
> - nh_grp = rcu_dereference_rtnl(nh->nh_grp);
> + if (bh_disabled)
> + nh_grp = rcu_dereference_bh_rtnl(nh->nh_grp);
> + else
> + nh_grp = rcu_dereference_rtnl(nh->nh_grp);
>   nh = nexthop_mpath_select(nh_grp, 0);
>   if (!nh)
>   return NULL;
>   }
>  
> - nhi = rcu_dereference_rtnl(nh->nh_info);
> + if (bh_disabled)
> + nhi = rcu_dereference_bh_rtnl(nh->nh_info);
> + else
> + nhi = rcu_dereference_rtnl(nh->nh_info);
>   if (nhi->family == AF_INET6)
>   return &nhi->fib6_nh;
>  
>   return NULL;
>  }
>  

I am wary of duplicating code, but this helper is simple enough that it
should be ok with proper documentation.

Ido/Petr: I think your resilient hashing patch set touches this helper.
How ugly does it get to have a second version?


Re: [PATCH v2 2/2] net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0

2021-03-08 Thread David Ahern
On 3/8/21 9:26 AM, Balazs Nemeth wrote:
> On Mon, 2021-03-08 at 09:17 -0700, David Ahern wrote:
>> On 3/8/21 9:07 AM, Willem de Bruijn wrote:
>>>> diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
>>>> index b1690149b6fa..cc1b6457fc93 100644
>>>> --- a/net/mpls/mpls_gso.c
>>>> +++ b/net/mpls/mpls_gso.c
>>>> @@ -27,7 +27,7 @@ static struct sk_buff *mpls_gso_segment(struct
>>>> sk_buff *skb,
>>>>
>>>>     skb_reset_network_header(skb);
>>>>     mpls_hlen = skb_inner_network_header(skb) -
>>>> skb_network_header(skb);
>>>> -   if (unlikely(!pskb_may_pull(skb, mpls_hlen)))
>>>> +   if (unlikely(!mpls_hlen || !pskb_may_pull(skb,
>>>> mpls_hlen)))
>>>>     goto out;
>>>
>>> Good cathc. Besides length zero, this can be more strict: a label
>>> is
>>> 4B, so mpls_hlen needs to be >= 4B.
>>>
>>> Perhaps even aligned to 4B, too, but not if there may be other
>>> encap on top.
>>>
>>> Unfortunately there is no struct or type definition that we can use
>>> a
>>> sizeof instead of open coding the raw constant.
>>>
>>
>> MPLS_HLEN can be used here.
>>
> 
> What about sizeof(struct mpls_label), like in net/ipv4/tunnel4.c?
> 

I was thinking MPLS_HLEN because of its consistent use with skb
manipulations. net/mpls code uses mpls_shim_hdr over mpls_label.

Looks like the MPLS code could use some cleanups to make this consistent.


Re: [PATCH v2 2/2] net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0

2021-03-08 Thread David Ahern
On 3/8/21 9:07 AM, Willem de Bruijn wrote:
>> diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
>> index b1690149b6fa..cc1b6457fc93 100644
>> --- a/net/mpls/mpls_gso.c
>> +++ b/net/mpls/mpls_gso.c
>> @@ -27,7 +27,7 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff 
>> *skb,
>>
>> skb_reset_network_header(skb);
>> mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb);
>> -   if (unlikely(!pskb_may_pull(skb, mpls_hlen)))
>> +   if (unlikely(!mpls_hlen || !pskb_may_pull(skb, mpls_hlen)))
>> goto out;
> 
> Good cathc. Besides length zero, this can be more strict: a label is
> 4B, so mpls_hlen needs to be >= 4B.
> 
> Perhaps even aligned to 4B, too, but not if there may be other encap on top.
> 
> Unfortunately there is no struct or type definition that we can use a
> sizeof instead of open coding the raw constant.
> 

MPLS_HLEN can be used here.


Re: mlx5 sub function issue

2021-03-08 Thread David Ahern
On 3/8/21 12:21 AM, ze wang wrote:
> mlxconfig tool from mft tools version 4.16.52 or higher to set number of SF.
> 
> mlxconfig -d b3:00.0  PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_SF_BAR_SIZE=8
> mlxconfig -d b3:00.0  PER_PF_NUM_SF=1 PF_TOTAL_SF=192 PF_SF_BAR_SIZE=8
> mlxconfig -d b3:00.1  PER_PF_NUM_SF=1 PF_TOTAL_SF=192 PF_SF_BAR_SIZE=8
> 
> Cold reboot power cycle of the system as this changes the BAR size in device
> 

Is that capability going to be added to devlink?


Re: VRF leaking doesn't work

2021-03-06 Thread David Ahern
On 3/2/21 3:57 AM, Greesha Mikhalkin wrote:
> Main goal is that 100.255.254.3 should be reachable from vrf2. But
> after this setup it doesn’t work. When i run `ping -I vrf2
> 100.255.254.3` it sends packets from source address that belongs to
> vlan1 enslaved by vrf1. I can see in tcpdump that ICMP packets are
> sent and then returned to source address but they're not returned to
> ping command for some reason. To be clear `ping -I vrf1 …` works fine.

I remember this case now: VRF route leaking works for fowarding, but not
local traffic. If a packet arrives in vrf2, it should get forwarded to
vrf1 and on to its destination. If the reverse route exists then round
trip traffic works.


Re: VRF leaking doesn't work

2021-03-05 Thread David Ahern
On 3/2/21 3:57 AM, Greesha Mikhalkin wrote:
> Hi. I need a help to understand why VRF leaking doesn’t work in my situation.
> I want to set up leaking between 2 VRFs, that are set up by following 
> commands:
> 
>   # Setup bridge
>   sudo ip link add bridge type bridge
> 
>   # Setup VLANs
>   ip link add link bridge name vlan1 type vlan id 1
>   ip link add link bridge name vlan2 type vlan id 2
>   ip addr add 10.0.0.31/32 dev vlan1
>   ip addr add 10.0.0.32/32 dev vlan2
>   ip link set vlan1 up
>   ip link set vlan2 up
> 
>   # Setup VXLANs
>   ip link add vni1 type vxlan id 1 local 10.1.0.1 dev lan1 srcport
> 0 0 dstport 4789 nolearning
>   ip link add vni2 type vxlan id 2 local 10.1.0.1 dev lan1 srcport
> 0 0 dstport 4789 nolearning
>   ip link set vni1 master bridge
>   ip link set vni2 master bridge
>   bridge vlan add dev vni1 vid 1 pvid untagged
>   bridge vlan add dev vni2 vid 2 pvid untagged
>   ip link set vni1 up
>   ip link set vni2 up
> 
>   # Setup VRFs
>   ip link add vrf1 type vrf table 1000
>   ip link set dev vrf1 up
>   ip link add vrf2 type vrf table 1001
>   ip link set dev vrf2 up
> 
> Setting routes:
> 
>   # Unreachable default routes
>   ip route add table 1000 unreachable default metric 4278198272
>   ip route add table 1001 unreachable default metric 4278198272
> 
>   # Nexthop
>   ip route add table 1000 100.255.254.3 proto bgp metric 20
> nexthop via 10.0.0.11 dev vlan1 weight 1 onlink
> 
> I'm trying to setup VRF leaking in following way:
> 
>   ip r a vrf vrf2 100.255.254.3/32 dev vrf1
>   ip r a vrf vrf2 10.0.0.31/32 dev vrf1
>   ip r a vrf vrf1 10.0.0.32/32 dev vrf2
> 
> Main goal is that 100.255.254.3 should be reachable from vrf2. But
> after this setup it doesn’t work. When i run `ping -I vrf2
> 100.255.254.3` it sends packets from source address that belongs to
> vlan1 enslaved by vrf1. I can see in tcpdump that ICMP packets are
> sent and then returned to source address but they're not returned to
> ping command for some reason. To be clear `ping -I vrf1 …` works fine.
> 

What kernel version? If you have not tried 5.10 or 5.11, please do.


Re: [PATCH iproute2-next 0/4] devlink: Use utils helpers

2021-03-02 Thread David Ahern
On 3/1/21 3:56 AM, Parav Pandit wrote:
> This series uses utils helper for socket operations, string
> processing and print error messages.
> 
> Patch summary:
> Patch-1 uses utils library helper for string split and string search
> Patch-2 extends library for socket receive operation
> Patch-3 converts devlink to use socket helpers from utlis library
> Patch-4 print error when user provides invalid flavour or state
> 
> Parav Pandit (4):
>   devlink: Use library provided string processing APIs
>   utils: Introduce helper routines for generic socket recv
>   devlink: Use generic socket helpers from library
>   devlink: Add error print when unknown values specified
> 
>  devlink/devlink.c   | 365 
>  devlink/mnlg.c  | 121 ++-
>  devlink/mnlg.h  |  13 +-
>  include/mnl_utils.h |   6 +
>  lib/mnl_utils.c |  25 ++-
>  5 files changed, 203 insertions(+), 327 deletions(-)
> 

applied to iproute2-next. Thanks, Parav


Re: [PATCH] ipv6:delete duplicate code for reserved iid check

2021-03-02 Thread David Ahern
On 3/2/21 5:53 PM, zhang kai wrote:
> Using the ipv6_reserved_interfaceid for interface id checking.
> 
> Signed-off-by: zhang kai 
> ---
>  net/ipv6/addrconf.c | 45 +++--
>  1 file changed, 19 insertions(+), 26 deletions(-)
> 

Looks equivalent to me. Code wise:

Reviewed-by: David Ahern 

The commit message needs more words about the change.


Re: [PATCH] net:ipv4: Packet is not forwarded if bc_forwarding not configured on ingress interface

2021-02-28 Thread David Ahern
On 2/28/21 5:53 PM, Henry Shen wrote:
> When an IPv4 packet with a destination address of broadcast is received
> on an ingress interface, it will not be forwarded out of the egress
> interface if the ingress interface is not configured with bc_forwarding 
> but the egress interface is. If both the ingress and egress interfaces
> are configured with bc_forwarding, the packet can be forwarded
> successfully.
> 
> This patch is to be inline with Cisco's implementation that packet can be 
> forwarded if ingress interface is NOT configured with bc_forwarding, 
> but egress interface is.
> 

In Linux, forwarding decisions are made based on the ingress device, not
the egress device.


Re: [PATCH iproute2-next] mptcp: add support for port based endpoint

2021-02-28 Thread David Ahern
On 2/19/21 1:42 PM, Paolo Abeni wrote:
> The feature is supported by the kernel since 5.11-net-next,
> let's allow user-space to use it.
> 
> Just parse and dump an additional, per endpoint, u16 attribute
> 
> Signed-off-by: Paolo Abeni 
> ---
>  ip/ipmptcp.c| 16 ++--
>  man/man8/ip-mptcp.8 |  8 
>  2 files changed, 22 insertions(+), 2 deletions(-)
> 


applied to iproute2-next. Thanks




Re: [RFC PATCH net 2/2] selftests: fib_nexthops: Test blackhole nexthops when loopback goes down

2021-02-28 Thread David Ahern
On 2/28/21 7:26 AM, Ido Schimmel wrote:
> From: Ido Schimmel 
> 
> Test that blackhole nexthops are not flushed when the loopback device
> goes down.
> 


> 
> Signed-off-by: Ido Schimmel 
> ---
>  tools/testing/selftests/net/fib_nexthops.sh | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/tools/testing/selftests/net/fib_nexthops.sh 
> b/tools/testing/selftests/net/fib_nexthops.sh
> index 4c7d33618437..d98fb85e201c 100755
> --- a/tools/testing/selftests/net/fib_nexthops.sh
> +++ b/tools/testing/selftests/net/fib_nexthops.sh
> @@ -1524,6 +1524,14 @@ basic()
>   run_cmd "$IP nexthop replace id 2 blackhole dev veth1"
>   log_test $? 2 "Blackhole nexthop with other attributes"
>  
> + # blackhole nexthop should not be affected by the state of the loopback
> + # device
> + run_cmd "$IP link set dev lo down"
> + check_nexthop "id 2" "id 2 blackhole"
> + log_test $? 0 "Blackhole nexthop with loopback device down"
> +
> + run_cmd "$IP link set dev lo up"
> +
>   #
>   # groups
>   #
> 

Thanks for adding a test.

Reviewed-by: David Ahern 



Re: [RFC PATCH net 1/2] nexthop: Do not flush blackhole nexthops when loopback goes down

2021-02-28 Thread David Ahern
On 2/28/21 7:26 AM, Ido Schimmel wrote:
> From: Ido Schimmel 
> 
> As far as user space is concerned, blackhole nexthops do not have a
> nexthop device and therefore should not be affected by the
> administrative or carrier state of any netdev.
> 
> However, when the loopback netdev goes down all the blackhole nexthops
> are flushed. This happens because internally the kernel associates
> blackhole nexthops with the loopback netdev.

That was not intended, so definitely a bug.

> 
> This behavior is both confusing to those not familiar with kernel
> internals and also diverges from the legacy API where blackhole IPv4
> routes are not flushed when the loopback netdev goes down:
> 
>  # ip route add blackhole 198.51.100.0/24
>  # ip link set dev lo down
>  # ip route show 198.51.100.0/24
>  blackhole 198.51.100.0/24
> 
> Blackhole IPv6 routes are flushed, but at least user space knows that
> they are associated with the loopback netdev:
> 
>  # ip -6 route show 2001:db8:1::/64
>  blackhole 2001:db8:1::/64 dev lo metric 1024 pref medium
> 
> Fix this by only flushing blackhole nexthops when the loopback netdev is
> unregistered.
> 
> Fixes: ab84be7e54fc ("net: Initial nexthop code")
> Signed-off-by: Ido Schimmel 
> Reported-by: Donald Sharp 
> ---
>  net/ipv4/nexthop.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
> index f1c6cbdb9e43..743777bce179 100644
> --- a/net/ipv4/nexthop.c
> +++ b/net/ipv4/nexthop.c
> @@ -1399,7 +1399,7 @@ static int insert_nexthop(struct net *net, struct 
> nexthop *new_nh,
>  
>  /* rtnl */
>  /* remove all nexthops tied to a device being deleted */
> -static void nexthop_flush_dev(struct net_device *dev)
> +static void nexthop_flush_dev(struct net_device *dev, unsigned long event)
>  {
>   unsigned int hash = nh_dev_hashfn(dev->ifindex);
>   struct net *net = dev_net(dev);
> @@ -1411,6 +1411,10 @@ static void nexthop_flush_dev(struct net_device *dev)
>   if (nhi->fib_nhc.nhc_dev != dev)
>   continue;
>  
> + if (nhi->reject_nh &&
> + (event == NETDEV_DOWN || event == NETDEV_CHANGE))
> + continue;
> +
>   remove_nexthop(net, nhi->nh_parent, NULL);
>   }
>  }
> @@ -2189,11 +2193,11 @@ static int nh_netdev_event(struct notifier_block 
> *this,
>   switch (event) {
>   case NETDEV_DOWN:
>   case NETDEV_UNREGISTER:
> - nexthop_flush_dev(dev);
> + nexthop_flush_dev(dev, event);
>   break;
>   case NETDEV_CHANGE:
>   if (!(dev_get_flags(dev) & (IFF_RUNNING | IFF_LOWER_UP)))
> -     nexthop_flush_dev(dev);
> + nexthop_flush_dev(dev, event);
>   break;
>   case NETDEV_CHANGEMTU:
>   info_ext = ptr;
> 

LGTM. I suggest submitting without the RFC.

Reviewed-by: David Ahern 



Re: [PATCH] ipv6: Honor route mtu if it is within limit of dev mtu

2021-02-24 Thread David Ahern
On 2/22/21 9:32 AM, Kaustubh Pandey wrote:
> When netdevice MTU is increased via sysfs, NETDEV_CHANGEMTU is raised.
> 
> addrconf_notify -> rt6_mtu_change -> rt6_mtu_change_route ->
> fib6_nh_mtu_change
> 
> As part of handling NETDEV_CHANGEMTU notification we land up on a
> condition where if route mtu is less than dev mtu and route mtu equals
> ipv6_devconf mtu, route mtu gets updated.
> 
> Due to this v6 traffic end up using wrong MTU then configured earlier.
> This commit fixes this by removing comparison with ipv6_devconf
> and updating route mtu only when it is greater than incoming dev mtu.
> 
> This can be easily reproduced with below script:
> pre-condition:
> device up(mtu = 1500) and route mtu for both v4 and v6 is 1500
> 
> test-script:
> ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1400
> ip -6 route change 2001::/64 dev eth0 metric 256 mtu 1400
> echo 1400 > /sys/class/net/eth0/mtu
> ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1500
> echo 1500 > /sys/class/net/eth0/mtu
> 
> Signed-off-by: Kaustubh Pandey 
> ---
>  net/ipv6/route.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 1536f49..653b6c7 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -4813,8 +4813,7 @@ static int fib6_nh_mtu_change(struct fib6_nh *nh, void 
> *_arg)
>   struct inet6_dev *idev = __in6_dev_get(arg->dev);
>   u32 mtu = f6i->fib6_pmtu;
>  
> - if (mtu >= arg->mtu ||
> - (mtu < arg->mtu && mtu == idev->cnf.mtu6))
> + if (mtu >= arg->mtu)
>   fib6_metric_set(f6i, RTAX_MTU, arg->mtu);
>  
>   spin_lock_bh(&rt6_exception_lock);
> 

The existing logic mirrors what is done for exceptions, see
rt6_mtu_change_route_allowed and commit e9fa1495d738.

It seems right to me to drop the mtu == idev->cnf.mtu6 comparison in
which case the exceptions should do the same.

Added author of e9fa1495d738 in case I am overlooking something.

Test case should be added to tools/testing/selftests/net/pmtu.sh, and
did you run that script with the proposed change?


Re: [PATCH] arp: Remove the arp_hh_ops structure

2021-02-22 Thread David Ahern
On 2/22/21 1:37 AM, Eric Dumazet wrote:
> 
> 
> On 2/22/21 4:15 AM, Yejune Deng wrote:
>> The arp_hh_ops structure is similar to the arp_generic_ops structure.
>> but the latter is more general,so remove the arp_hh_ops structure.
>>
>> Fix when took out the neigh->ops assignment:
>> 8.973653] #PF: supervisor read access in kernel mode
>> [8.975027] #PF: error_code(0x) - not-present page
>> [8.976310] PGD 0 P4D 0
>> [8.977036] Oops:  [#1] SMP PTI
>> [8.977973] CPU: 1 PID: 210 Comm: sd-resolve Not tainted 
>> 5.11.0-rc7-02046-g4591591ab715 #1
>> [8.979998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.12.0-1 04/01/2014
>> [8.981996] RIP: 0010:neigh_probe 
>> (kbuild/src/consumer/net/core/neighbour.c:1009)
>>
> 
> I have a hard time understanding this patch submission.
> 
> This seems a mix of a net-next and net material ?

It is net-next at best.

> 
> 
> 
>> Reported-by: kernel test robot 
> 
> If this is a bug fix, we want a Fixes: tag
> 
> This will really help us. Please don't let us guess what is going on.
> 

This patch is a v2. v1 was clearly wrong and not tested; I responded as
such 12 hours before the robot test.


Re: [PATCH] arp: Remove the arp_hh_ops structure

2021-02-20 Thread David Ahern
On 2/19/21 9:32 PM, Yejune Deng wrote:
>  static const struct neigh_ops arp_direct_ops = {
>   .family =   AF_INET,
>   .output =   neigh_direct_output,
> @@ -277,15 +269,10 @@ static int arp_constructor(struct neighbour *neigh)
>   memcpy(neigh->ha, dev->broadcast, dev->addr_len);
>   }
>  
> - if (dev->header_ops->cache)
> - neigh->ops = &arp_hh_ops;
> - else
> - neigh->ops = &arp_generic_ops;

How did you test this?

you took out the neigh->ops assignment, so all of the neigh->ops in
net/core/neighbour.c are going to cause a NULL dereference.


> -
> - if (neigh->nud_state & NUD_VALID)
> - neigh->output = neigh->ops->connected_output;
> + if (!dev->header_ops->cache && (neigh->nud_state & NUD_VALID))
> + neigh->output = arp_generic_ops.connected_output;
>   else
> - neigh->output = neigh->ops->output;
> + neigh->output = arp_generic_ops.output;
>   }
>   return 0;
>  }
> 



Re: null terminating of IFLA_INFO_KIND/IFLA_IFNAME

2021-02-17 Thread David Ahern
On 2/17/21 9:06 AM, Муравьев Александр wrote:
> Hi
> 
> A noob question that I haven't found an answer.
> 
> Just wanted to clarify a piece of iproute2 code.
> 
> ip/iplink.c:
> 
>> 1058 addattr_l(&req.n, sizeof(req), IFLA_INFO_KIND, type,
>> 1059  strlen(type));
> 
> also ip/iplink.c:
> 
>> 1115 addattr_l(&req.n, sizeof(req),
>> 1116   !check_ifname(name) ? IFLA_IFNAME : IFLA_ALT_IFNAME,
>> 1117   name, strlen(name) + 1);
> My question is why we skip terminating null character for IFLA_INFO_KIND
> (the first case) and don't skip it for IFLA_IFNAME (the second case)? I
> mean "strlen(type)" and "strlen(name) + 1".
> 

I think it is just different coders at different points in time. Kernel
side both use nla_strscpy which handles the string terminator (or
missing terminator).


Re: [PATCH v4 net-next 07/21] nvme-tcp: Add DDP data-path

2021-02-17 Thread David Ahern
On 2/17/21 7:01 AM, Or Gerlitz wrote:
>>> @@ -1136,6 +1265,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct 
>>> nvme_tcp_request *req)
>>>   else
>>>   flags |= MSG_EOR;
>>>
>>> + if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags) &&
>>> + blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ)
>>> + nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq);
>>> +
>>
>> For consistency, shouldn't this be wrapped in the CONFIG_TCP_DDP check too?
> 
> We tried to avoid the wrapping in some places where it was
> possible to do without adding confusion, this one is a good
> example IMOH.
> 
> 

The above (and other locations like it) can easily be put into a helper
that has logic when the CONFIG is enabled and compiles out when not.
Consistency makes for simpler, cleaner code for optional features.


Re: [PATCH iproute2-rc] rdma: Fix statistics bind/unbing argument handling

2021-02-16 Thread David Ahern
On 2/15/21 11:16 PM, Leon Romanovsky wrote:
> On Mon, Feb 15, 2021 at 06:56:26PM -0700, David Ahern wrote:
>> On 2/14/21 10:40 PM, Leon Romanovsky wrote:
>>> On Sun, Feb 14, 2021 at 08:26:16PM -0700, David Ahern wrote:
>>>> what does iproute2-rc mean?
>>>
>>> Patch target is iproute2.git:
>>> https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/
>>
>> so you are asking them to be committed for the 5.11 release?
> 
> This is a Fix to an existing issue (not theoretical one), so I was under
> impression that it should go to -rc repo and not to -next.

It is assigned to Stephen for iproute2.

> 
> Personally, I don't care to which repo will this fix be applied as long
> as it is applied to one of the two iproute2 official repos.
> 
> Do you have clear guidance when should I send patches to 
> iproute2-rc/iproute2-next?
> 

It's the rc label that needs to be dropped: iproute2 or iproute2-next.
Just like there is net and net-next.


  1   2   3   4   5   6   7   8   9   10   >