Re: [PATCH 2/2] neighbour: allow NUD_NOARP entries to be forced GCed
On 4/19/21 10:52 AM, Kasper Dupont wrote: > On 19/04/21 10.10, David Ahern wrote: >> On 4/19/21 9:44 AM, Kasper Dupont wrote: >>> >>> Is there any update regarding this change? >>> >>> I noticed this regression when it was used in a DoS attack on one of >>> my servers which I had upgraded from Ubuntu 18.04 to 20.04. >>> >>> I have verified that Ubuntu 18.04 is not subject to this attack and >>> Ubuntu 20.04 is vulnerable. I have also verified that the one-line >>> change which Cascardo has provided fixes the vulnerability on Ubuntu >>> 20.04. >>> >> >> your testing included both patches or just this one? > > I applied only this one line change on top of the kernel in Ubuntu > 20.04. The behavior I observed was that without the patch the kernel > was vulnerable and with that patch I was unable to reproduce the > problem. This patch should be re-submitted standalone for -net > > The other longer patch is for a different issue which Cascardo > discovered while working on the one I had reported. I don't have an > environment set up where I can reproduce the issue addressed by that > larger patch. > The first patch is the one I have concerns about.
Re: [PATCH 2/2] neighbour: allow NUD_NOARP entries to be forced GCed
On 4/19/21 9:44 AM, Kasper Dupont wrote: > On 17/03/21 15.53, Thadeu Lima de Souza Cascardo wrote: >> IFF_POINTOPOINT interfaces use NUD_NOARP entries for IPv6. It's possible to >> fill up the neighbour table with enough entries that it will overflow for >> valid connections after that. >> >> This behaviour is more prevalent after commit 58956317c8de ("neighbor: >> Improve garbage collection") is applied, as it prevents removal from >> entries that are not NUD_FAILED, unless they are more than 5s old. >> >> Fixes: 58956317c8de (neighbor: Improve garbage collection) >> Reported-by: Kasper Dupont >> Signed-off-by: Thadeu Lima de Souza Cascardo >> --- >> net/core/neighbour.c | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/net/core/neighbour.c b/net/core/neighbour.c >> index bbc89c7ffdfd..be5ca411b149 100644 >> --- a/net/core/neighbour.c >> +++ b/net/core/neighbour.c >> @@ -256,6 +256,7 @@ static int neigh_forced_gc(struct neigh_table *tbl) >> >> write_lock(&n->lock); >> if ((n->nud_state == NUD_FAILED) || >> +(n->nud_state == NUD_NOARP) || >> (tbl->is_multicast && >> tbl->is_multicast(n->primary_key)) || >> time_after(tref, n->updated)) >> -- >> 2.27.0 >> > > Is there any update regarding this change? > > I noticed this regression when it was used in a DoS attack on one of > my servers which I had upgraded from Ubuntu 18.04 to 20.04. > > I have verified that Ubuntu 18.04 is not subject to this attack and > Ubuntu 20.04 is vulnerable. I have also verified that the one-line > change which Cascardo has provided fixes the vulnerability on Ubuntu > 20.04. > your testing included both patches or just this one?
Re: [PATCH net-next v3] virtio-net: page_to_skb() use build_skb when there's sufficient tailroom
On 4/16/21 2:16 AM, Xuan Zhuo wrote: > In page_to_skb(), if we have enough tailroom to save skb_shared_info, we > can use build_skb to create skb directly. No need to alloc for > additional space. And it can save a 'frags slot', which is very friendly > to GRO. > > Here, if the payload of the received package is too small (less than > GOOD_COPY_LEN), we still choose to copy it directly to the space got by > napi_alloc_skb. So we can reuse these pages. > > Testing Machine: > The four queues of the network card are bound to the cpu1. > > Test command: > for ((i=0;i<5;++i)); do sockperf tp --ip 192.168.122.64 -m 1000 -t 150& > done > > The size of the udp package is 1000, so in the case of this patch, there > will always be enough tailroom to use build_skb. The sent udp packet > will be discarded because there is no port to receive it. The irqsoftd > of the machine is 100%, we observe the received quantity displayed by > sar -n DEV 1: > > no build_skb: 956864.00 rxpck/s > build_skb:1158465.00 rxpck/s > virtio_net is using napi_consume_skb, so napi_build_skb should show a small increase from build_skb.
Re: Different behavior wrt VRF and no VRF - packet Tx
On 4/15/21 8:57 PM, Bala Sajja wrote: >please find the ip link show output(for ifindex) and ping and > its corresponding perf fib events output. OIF seems MGMT(ifindex 5) > always, not enslaved interfaces ? that is the reason it does not work.
Re: [PATCH iproute2] ip: drop 2-char command assumption
On 4/17/21 8:49 PM, Tony Ambardar wrote: > The 'ip' utility hardcodes the assumption of being a 2-char command, where > any follow-on characters are passed as an argument: > > $ ./ip-full help > Object "-full" is unknown, try "ip help". > > This confusing behaviour isn't seen with 'tc' for example, and was added in > a 2005 commit without documentation. It was noticed during testing of 'ip' > variants built/packaged with different feature sets (e.g. w/o BPF support). > > Drop the related code. > > Fixes: 351efcde4e62 ("Update header files to 2.6.14") > Signed-off-by: Tony Ambardar > --- > ip/ip.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/ip/ip.c b/ip/ip.c > index 4cf09fc3..631ce903 100644 > --- a/ip/ip.c > +++ b/ip/ip.c > @@ -313,9 +313,6 @@ int main(int argc, char **argv) > > rtnl_set_strict_dump(&rth); > > - if (strlen(basename) > 2) > - return do_cmd(basename+2, argc, argv); > - > if (argc > 1) > return do_cmd(argv[1], argc-1, argv+1); > > popular change request lately. As I responded to Matteo in January: This has been around for too long to just remove it. How about adding an option to do_cmd that this appears to be guess based on basename? If the guess is wrong, fallback to the next do_cmd.
Re: [PATCH net-next 2/2] selftests: fib_nexthops: Test large scale nexthop flushing
On 4/16/21 8:55 AM, Ido Schimmel wrote: > From: Ido Schimmel > > Test that all the nexthops are flushed when a multi-part nexthop dump is > required for the flushing. > > Without previous patch: > > # ./fib_nexthops.sh > TEST: Large scale nexthop flushing [FAIL] > > With previous patch: > > # ./fib_nexthops.sh > TEST: Large scale nexthop flushing [ OK ] > > Signed-off-by: Ido Schimmel > Reviewed-by: Petr Machata > --- > tools/testing/selftests/net/fib_nexthops.sh | 15 +++ > 1 file changed, 15 insertions(+) > Reviewed-by: David Ahern
Re: [PATCH net-next 1/2] nexthop: Restart nexthop dump based on last dumped nexthop identifier
On 4/16/21 8:55 AM, Ido Schimmel wrote: > From: Ido Schimmel > > Currently, a multi-part nexthop dump is restarted based on the number of > nexthops that have been dumped so far. This can result in a lot of > nexthops not being dumped when nexthops are simultaneously deleted: > > # ip nexthop | wc -l > 65536 > # ip nexthop flush > Dump was interrupted and may be inconsistent. > Flushed 36040 nexthops > # ip nexthop | wc -l > 29496 > > Instead, restart the dump based on the nexthop identifier (fixed number) > of the last successfully dumped nexthop: > > # ip nexthop | wc -l > 65536 > # ip nexthop flush > Dump was interrupted and may be inconsistent. > Flushed 65536 nexthops > # ip nexthop | wc -l > 0 > > Reported-by: Maksym Yaremchuk > Tested-by: Maksym Yaremchuk > Signed-off-by: Ido Schimmel > Reviewed-by: Petr Machata > --- > net/ipv4/nexthop.c | 14 ++ > 1 file changed, 6 insertions(+), 8 deletions(-) > Reviewed-by: David Ahern Any reason not to put this in -net with a Fixes tag?
Re: PROBLEM: DoS Attack on Fragment Cache
[ cc author of 648700f76b03b7e8149d13cc2bdb3355035258a9 ] On 4/16/21 3:58 PM, Keyu Man wrote: > Hi, > > > > My name is Keyu Man. We are a group of researchers from University > of California, Riverside. Zhiyun Qian is my advisor. We found the code > in processing IPv4/IPv6 fragments will potentially lead to DoS Attacks. > Specifically, after the latest kernel receives an IPv4 fragment, it will > try to fit it into a queue by calling function > > > > struct inet_frag_queue *inet_frag_find(struct fqdir *fqdir, void > *key) in net/ipv4/inet_fragment.c. > > > > However, this function will first check if the existing fragment > memory exceeds the fqdir->high_thresh. If it exceeds, then drop the > fragment regardless whether it belongs to a new queue or an existing queue. > > Chances are that an attacker can fill the cache with fragments that > will never be assembled (i.e., only sends the first fragment with new > IPIDs every time) to exceed the threshold so that all future incoming > fragmented IPv4 traffic would be blocked and dropped. Since there is no > GC mechanism, the victim host has to wait for 30s when the fragments are > expired to continue receive incoming fragments normally. > > In practice, given the 4MB fragment cache, the attacker only needs > to send 1766 fragments to exhaust the cache and DoS the victim for 30s, > whose cost is pretty low. Besides, IPv6 would also be affected since the > issue resides in inet part. > > This issue is introduced in commit > 648700f76b03b7e8149d13cc2bdb3355035258a9 (inet: frags: use rhashtables > for reassembly units) which removes fqdir->low_thresh, and GC worker as > well. We would gently request to bring GC worker back to the kernel to > prevent the DoS attacks. > > Looking forward to hear from you > > > > Thanks, > > Keyu Man >
Re: Different behavior wrt VRF and no VRF - packet Tx
On 4/15/21 12:15 AM, Bala Sajja wrote: > When interfaces are not part of VRF and below ip address config is > done on these interfaces, ping with -I (interface) option, we see > packets transmitting out of the right interfaces. > > ip addr add 2.2.2.100 peer 1.1.1.100/32 dev enp0s3 > ip addr add 2.2.2.100 peer 1.1.1.100/32 dev enp0s8 > > ping 1.1.1.100-I enp0s3 , packet always goes out of enp0s3 > ping 1.1.1.100-I enp0s8, packet always goes out of enp0s8 > > When interfaces are enslaved to VRF as below and ip are configured > on these interfaces, packets go out of one interface only. > > ip link add MGMT type vrf table 1 > ip link set dev MGMT up > ip link set dev enp0s3 up > ip link set dev enp0s3 master MGMT > ip link set dev enp0s8 up > ip link set dev enp0s8 master MGMT > ip link set dev enp0s9 up > > ip addr add 2.2.2.100 peer 1.1.1.100/32 dev enp0s3 > ip addr add 2.2.2.100 peer 1.1.1.100/32 dev enp0s8 > > ping 1.1.1.100-I enp0s3 , packet always goes out of enp0s3 > ping 1.1.1.100-I enp0s8, packet always goes out of enp0s3 > > I believe this use case will not work since the FIB lookup loses the original device binding (the -I argument). take a look at the FIB lookup argument and result: perf record -e fib:* perf script
Re: XFRM programming with VRF enslaved interfaces
[ cc Ben ] On 4/15/21 9:51 AM, Rob Dover wrote: > Hi there, > > I'm working on an application that's programming IPSec connections via XFRM > on VRFs. I'm seeing some strange behaviour in cases where there is an > enslaved interface on the VRF - was wondering if anyone has seen something > like this before or perhaps knows how this is supposed to work? Ben was / is looking at ipsec and VRF. Maybe he has some thoughts. > > In our setup, we have a VRF and an enslaved (sidebar: should I be using a > different term for this? I would prefer to use something with fewer negative > historic connotations if possible!) interface like so: for the sidebar, you can just say that a netdev is a member of the L3 domain. iproute2 supports 'vrf' keyword for better user semantics than 'master' Any chance you can create a shell script that creates your setup using network namespaces? tools/testing/selftests/net/fcnal-test.sh has some helpers -- create_vrf, create_ns and connect_ns -- which simplify the 'namespace as a node' concept and configuring the interconnects. A standalone script would allow runs across kernel versions and make it easier for others (me) to debug. > > ``` > # ip link show vrf-1 > 33: vrf-1: mtu 65536 qdisc noqueue state UP mode > DEFAULT group default qlen 1000 > link/ether 2a:71:ba:bd:33:4d brd ff:ff:ff:ff:ff:ff > > # ip link show master vrf-1 > 32: serv1: mtu 1500 qdisc fq_codel master vrf-1 state > UP mode DEFAULT group default qlen 1000 > link/ether 00:13:3e:00:16:68 brd ff:ff:ff:ff:ff:ff > ``` > > The serv1 interface has some associated IPs but the vrf-1 interface does not: > > ``` > # ip addr show dev vrf-1 > 33: vrf-1: mtu 65536 qdisc noqueue state UP group > default qlen 1000 > link/ether 2a:71:ba:bd:33:4d brd ff:ff:ff:ff:ff:ff > > # ip addr show dev serv1 > 32: serv1: mtu 1500 qdisc fq_codel master vrf-1 state > UP group default qlen 1000 > link/ether 00:13:3e:00:16:68 brd ff:ff:ff:ff:ff:ff > inet 10.248.0.191/16 brd 10.248.255.255 scope global serv1 >valid_lft forever preferred_lft forever > inet 10.248.0.250/16 brd 10.248.255.255 scope global secondary serv1 >valid_lft forever preferred_lft forever > inet6 fd5f:5d21:845:1401:213:3eff:fe00:1668/64 scope global >valid_lft forever preferred_lft forever > inet6 fe80::213:3eff:fe00:1668/64 scope link >valid_lft forever preferred_lft forever > ``` > > We're trying to program XFRM using these addresses to send and receive IPSec > traffic in transport mode. The interesting question is which interface the > XFRM state should be programmed on. I started off by programming the > following policies and SAs on the VRF: > > ``` > # ip xfrm policy show > src 10.254.13.16/32 dst 10.248.0.191/32 sport 37409 dport 5080 dev vrf-1 > dir in priority 2147483648 ptype main > tmpl src 0.0.0.0 dst 0.0.0.0 > proto esp reqid 0 mode transport > src 10.248.0.191/32 dst 10.254.13.16/32 sport 16381 dport 37409 dev vrf-1 > dir out priority 2147483648 ptype main > tmpl src 0.0.0.0 dst 0.0.0.0 > proto esp reqid 0 mode transport > # ip xfrm state show > src 10.254.13.16 dst 10.248.0.191 > proto esp spi 0x03a0392c reqid 3892838400 mode transport > replay-window 0 > auth-trunc hmac(md5) 0x00112233445566778899aabbccddeeff 96 > enc cbc(des3_ede) 0xfeefdccdbaab98897667544532231001feefdccdbaab9889 > anti-replay context: seq 0x0, oseq 0x0, bitmap 0x > sel src 0.0.0.0/0 dst 0.0.0.0/0 sport 37409 dport 5080 dev vrf-1 > src 10.248.0.191 dst 10.254.13.16 > proto esp spi 0x00124f80 reqid 0 mode transport > replay-window 0 > auth-trunc hmac(md5) 0x00112233445566778899aabbccddeeff 96 > enc cbc(des3_ede) 0xfeefdccdbaab98897667544532231001feefdccdbaab9889 > anti-replay context: seq 0x0, oseq 0x0, bitmap 0x > sel src 0.0.0.0/0 dst 0.0.0.0/0 sport 16381 dport 37409 dev vrf-1 > ``` > > Having programmed these, I can then send ESP packets from 10.254.13.16:37409 > -> 10.248.0.191:5080 and they are successfully decoded and passed up to my > application. However, when I try to send UDP packets out again from > 10.248.0.191:16381 -> 10.254.13.16:37409, the packets are not encrypted but > sent out in the clear! > > Now, I've done some experimentation and found that if I program the outbound > XFRM policy (eg. 10.248.0.191->10.254.13.16) to be on serv1 rather than > vrf-1, the packets are correctly encrypted. But if I program the inbound XFRM > policy (eg. 10.254.13.16->10.248.0.191) to be on serv1 rather than vrf-1, the > inbound packets are not passed up to my application! That leaves me in a > situation where I need to program the inbound and outbound XFRM policies > asymmetrically in order to get my traffic to be sent properly, like so: > > ``` > # ip xfrm policy show > src 10.254.13.16/32 dst 10.248.0.191/32
Re: [PATCH v2 bpf-next] cpumap: bulk skb using netif_receive_skb_list
On 4/15/21 9:03 AM, Lorenzo Bianconi wrote: >> On 4/15/21 8:05 AM, Daniel Borkmann wrote: > > [...] &stats); >>> >>> Given we stop counting drops with the netif_receive_skb_list(), we >>> should then >>> also remove drops from trace_xdp_cpumap_kthread(), imho, as otherwise it >>> is rather >>> misleading (as in: drops actually happening, but 0 are shown from the >>> tracepoint). >>> Given they are not considered stable API, I would just remove those to >>> make it clear >>> to users that they cannot rely on this counter anymore anyway. >>> >> >> What's the visibility into drops then? Seems like it would be fairly >> easy to have netif_receive_skb_list return number of drops. >> > > In order to return drops from netif_receive_skb_list() I guess we need to > introduce > some extra checks in the hot path. Moreover packet drops are already accounted > in the networking stack and this is currently the only consumer for this info. > Does it worth to do so? right - softnet_stat shows the drop. So the loss here is that the packet is from a cpumap XDP redirect. Better insights into drops is needed, but I guess in this case coming from the cpumap does not really aid into why it is dropped - that is more core to __netif_receive_skb_list_core. I guess this is ok to drop the counter from the tracepoint.
Re: [PATCH v2 bpf-next] cpumap: bulk skb using netif_receive_skb_list
On 4/15/21 8:05 AM, Daniel Borkmann wrote: > On 4/13/21 6:22 PM, Lorenzo Bianconi wrote: >> Rely on netif_receive_skb_list routine to send skbs converted from >> xdp_frames in cpu_map_kthread_run in order to improve i-cache usage. >> The proposed patch has been tested running xdp_redirect_cpu bpf sample >> available in the kernel tree that is used to redirect UDP frames from >> ixgbe driver to a cpumap entry and then to the networking stack. >> UDP frames are generated using pkt_gen. >> >> $xdp_redirect_cpu --cpu --progname xdp_cpu_map0 --dev >> >> bpf-next: ~2.2Mpps >> bpf-next + cpumap skb-list: ~3.15Mpps >> >> Signed-off-by: Lorenzo Bianconi >> --- >> Changes since v1: >> - fixed comment >> - rebased on top of bpf-next tree >> --- >> kernel/bpf/cpumap.c | 11 +-- >> 1 file changed, 5 insertions(+), 6 deletions(-) >> >> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c >> index 0cf2791d5099..d89551a508b2 100644 >> --- a/kernel/bpf/cpumap.c >> +++ b/kernel/bpf/cpumap.c >> @@ -27,7 +27,7 @@ >> #include >> #include >> -#include /* netif_receive_skb_core */ >> +#include /* netif_receive_skb_list */ >> #include /* eth_type_trans */ >> /* General idea: XDP packets getting XDP redirected to another CPU, >> @@ -257,6 +257,7 @@ static int cpu_map_kthread_run(void *data) >> void *frames[CPUMAP_BATCH]; >> void *skbs[CPUMAP_BATCH]; >> int i, n, m, nframes; >> + LIST_HEAD(list); >> /* Release CPU reschedule checks */ >> if (__ptr_ring_empty(rcpu->queue)) { >> @@ -305,7 +306,6 @@ static int cpu_map_kthread_run(void *data) >> for (i = 0; i < nframes; i++) { >> struct xdp_frame *xdpf = frames[i]; >> struct sk_buff *skb = skbs[i]; >> - int ret; >> skb = __xdp_build_skb_from_frame(xdpf, skb, >> xdpf->dev_rx); >> @@ -314,11 +314,10 @@ static int cpu_map_kthread_run(void *data) >> continue; >> } >> - /* Inject into network stack */ >> - ret = netif_receive_skb_core(skb); >> - if (ret == NET_RX_DROP) >> - drops++; >> + list_add_tail(&skb->list, &list); >> } >> + netif_receive_skb_list(&list); >> + >> /* Feedback loop via tracepoint */ >> trace_xdp_cpumap_kthread(rcpu->map_id, n, drops, sched, >> &stats); > > Given we stop counting drops with the netif_receive_skb_list(), we > should then > also remove drops from trace_xdp_cpumap_kthread(), imho, as otherwise it > is rather > misleading (as in: drops actually happening, but 0 are shown from the > tracepoint). > Given they are not considered stable API, I would just remove those to > make it clear > to users that they cannot rely on this counter anymore anyway. > What's the visibility into drops then? Seems like it would be fairly easy to have netif_receive_skb_list return number of drops.
Re: [PATCH v3 net-next] net: multipath routing: configurable seed
On 4/14/21 12:33 AM, Pavel Balaev wrote: >> >> This should work the same for IPv6. > I wanted to add IPv6 support after IPv4 will be approved, > anyway no problem, will add IPv6 in next version >> And please add test cases under tools/testing/selftests/net. > This feature cannot be tested whithin one host instance, becasue the same seed > will be used by default for all netns, so results will be the same > anyway, should I use QEMU for this tests? > > why not make the seed per namespace?
Re: [PATCH net] vrf: fix a comment about loopback device
On 4/14/21 3:03 AM, Nicolas Dichtel wrote: > This is a leftover of the below commit. > > Fixes: 4f04256c983a ("net: vrf: Drop local rtable and rt6_info") > Signed-off-by: Nicolas Dichtel > --- > drivers/net/vrf.c | 10 -- > 1 file changed, 4 insertions(+), 6 deletions(-) > Acked-by: David Ahern Thanks, Nicolas.
Re: [PATCH v3 net-next] net: multipath routing: configurable seed
On 4/13/21 4:55 AM, Balaev Pavel wrote: > Ability for a user to assign seed value to multipath route hashes. > Now kernel uses random seed value to prevent hash-flooding DoS attacks; > however, it disables some use cases, f.e: > > +---++--+++ > | |-eth0---| FW0 |---eth0-|| > | |+--+|| > | GW0 |ECMPECMP| GW1 | > | |+--+|| > | |-eth1---| FW1 |---eth1-|| > +---++--+++ > > In this use case, two ECMP routers balance traffic between > two firewalls. If some flow transmits a response over a different channel > than request, > such flow will be dropped, because keep-state rules are created on > the other firewall. > > This patch adds sysctl variable: net.ipv4.fib_multipath_hash_seed. > User can set the same seed value on GW0 and GW1 for traffic to be > mirror-balanced. By default, random value is used. > > Signed-off-by: Balaev Pavel > --- > Documentation/networking/ip-sysctl.rst | 14 > include/net/flow_dissector.h | 4 + > include/net/netns/ipv4.h | 20 + > net/core/flow_dissector.c | 9 +++ > net/ipv4/af_inet.c | 5 ++ > net/ipv4/route.c | 10 ++- > net/ipv4/sysctl_net_ipv4.c | 104 + > 7 files changed, 165 insertions(+), 1 deletion(-) > This should work the same for IPv6. And please add test cases under tools/testing/selftests/net.
Re: [PATCH v3] ip-nexthop: support flush by id
On 4/5/21 7:33 PM, Chunmei Xu wrote: > since id is unique for nexthop, it is heavy to dump all nexthops. > use existing delete_nexthop to support flush by id > > Signed-off-by: Chunmei Xu > --- > ip/ipnexthop.c | 20 +++- > 1 file changed, 19 insertions(+), 1 deletion(-) > applied to iproute2-next. Thanks,
Re: [PATCH v2] ipv6: report errors for iftoken via netlink extack
On 4/7/21 9:59 AM, Stephen Hemminger wrote: > From: Stephen Hemminger > > Setting iftoken can fail for several different reasons but there > and there was no report to user as to the cause. Add netlink > extended errors to the processing of the request. > > This requires adding additional argument through rtnl_af_ops > set_link_af callback. > > Reported-by: Hongren Zheng > Signed-off-by: Stephen Hemminger > --- > v2 - fix typo that broke build > > include/net/rtnetlink.h | 4 ++-- > net/core/rtnetlink.c| 2 +- > net/ipv4/devinet.c | 3 ++- > net/ipv6/addrconf.c | 32 ++-- > 4 files changed, 31 insertions(+), 10 deletions(-) > Reviewed-by: David Ahern
Re: [RFC net-next 0/1] seg6: Counters for SRv6 Behaviors
Since this is a single patch set, just put this good cover letter content as the message in the patch.
Re: [RFC net-next 1/1] seg6: add counters support for SRv6 Behaviors
On 4/7/21 12:03 PM, Andrea Mayer wrote: > diff --git a/include/uapi/linux/seg6_local.h b/include/uapi/linux/seg6_local.h > index 3b39ef1dbb46..ae5e3fd12b73 100644 > --- a/include/uapi/linux/seg6_local.h > +++ b/include/uapi/linux/seg6_local.h > @@ -27,6 +27,7 @@ enum { > SEG6_LOCAL_OIF, > SEG6_LOCAL_BPF, > SEG6_LOCAL_VRFTABLE, > + SEG6_LOCAL_COUNTERS, > __SEG6_LOCAL_MAX, > }; > #define SEG6_LOCAL_MAX (__SEG6_LOCAL_MAX - 1) > @@ -78,4 +79,11 @@ enum { > > #define SEG6_LOCAL_BPF_PROG_MAX (__SEG6_LOCAL_BPF_PROG_MAX - 1) > > +/* SRv6 Behavior counters */ > +struct seg6_local_counters { > + __u64 rx_packets; > + __u64 rx_bytes; > + __u64 rx_errors; > +}; > + > #endif It's highly likely that more stats would get added over time. It would be good to document that here for interested parties and then make sure iproute2 can handle different sized stats structs. e.g., commit support to your repo, then add a new one (e.g, rx_drops) and verify the combinations handle it. e.g., old kernel - new iproute2, new kernel - old iproute, old - old and new-new.
Re: [PATCH v2] ip-nexthop: support flush by id
On 3/31/21 10:03 PM, Chunmei Xu wrote: > since id is unique for nexthop, it is heavy to dump all nexthops. > use existing delete_nexthop to support flush by id > > Signed-off-by: Chunmei Xu > --- > ip/ipnexthop.c | 20 +++- > 1 file changed, 19 insertions(+), 1 deletion(-) > this version does not apply to iproute2-next.
Re: [iproute2-next] tipc: use the libmnl functions in lib/mnl_utils.c
On 3/31/21 8:34 PM, Hoang Le wrote: > To avoid code duplication, tipc should be converted to use the helper > functions for working with libmnl in lib/mnl_utils.c > > Acked-by: Jon Maloy > Signed-off-by: Hoang Le > --- > tipc/bearer.c| 38 ++ > tipc/cmdl.c | 2 - > tipc/link.c | 37 + > tipc/media.c | 15 +++--- > tipc/msg.c | 132 +++ > tipc/msg.h | 2 +- > tipc/nametable.c | 5 +- > tipc/node.c | 33 +--- > tipc/peer.c | 8 ++- > tipc/socket.c| 10 ++-- > tipc/tipc.c | 21 +++- > 11 files changed, 83 insertions(+), 220 deletions(-) > applied to iproute2-next.
Re: [PATCH net] net: udp: Add support for getsockopt(..., ..., UDP_GRO, ..., ...);
On 4/1/21 12:59 AM, Norman Maurer wrote: > From: Norman Maurer > > Support for UDP_GRO was added in the past but the implementation for > getsockopt was missed which did lead to an error when we tried to > retrieve the setting for UDP_GRO. This patch adds the missing switch > case for UDP_GRO > > Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.") > Signed-off-by: Norman Maurer > --- > net/ipv4/udp.c | 4 > 1 file changed, 4 insertions(+) > Reviewed-by: David Ahern
Re: [RFC] add extack errors for iptoken
On 3/31/21 9:49 PM, Stephen Hemminger wrote: > @@ -5681,14 +5682,29 @@ static int inet6_set_iftoken(struct inet6_dev *idev, > struct in6_addr *token) > > ASSERT_RTNL(); > > - if (!token) > + if (!token) { You forgot to add a message here. > return -EINVAL; > - if (dev->flags & (IFF_LOOPBACK | IFF_NOARP)) > + } > + > + if (dev->flags & IFF_LOOPBACK) { > + NL_SET_ERR_MSG_MOD(extack, "Device is loopback"); > return -EINVAL; > - if (!ipv6_accept_ra(idev)) > + } > + > + if (dev->flags & IFF_NOARP) { > + NL_SET_ERR_MSG_MOD(extack, "Device does not do discovery"); 'Device does not do neighbor discovery' > return -EINVAL; > - if (idev->cnf.rtr_solicits == 0) > + } > + > + if (!ipv6_accept_ra(idev)) { > + NL_SET_ERR_MSG_MOD(extack, "Device does accept route adverts"); How about 'Router advertisements are disabled for this device' > + return -EINVAL; > + } > + > + if (idev->cnf.rtr_solicits == 0) { > + NL_SET_ERR_MSG(extack, "Device has disabled router > solicitation"); How about 'Router solicitation is disabled on device'
Re: [PATCH] udp: Add support for getsockopt(..., ..., UDP_GRO, ..., ...)
On 3/31/21 7:10 AM, Norman Maurer wrote: > Friendly ping… > > As this missing change was most likely an oversight in the original commit I > do think it should go into 5.12 and subsequently stable as well. That’s also > the reason why I didn’t send a v2 and changed the commit message / subject > for the patch. For me it clearly is a bug and not a new feature. > > I agree that it should be added to net If you do not see it here: https://patchwork.kernel.org/project/netdevbpf/list/ you need to re-send and clearly mark it as [PATCH net]. Make sure it has a Fixes tag.
Re: [PATCH] ip-nexthop: support flush by id
On 3/31/21 5:53 AM, Ido Schimmel wrote: >> @@ -124,6 +125,9 @@ static int flush_nexthop(struct nlmsghdr *nlh, void *arg) >> if (tb[NHA_ID]) >> id = rta_getattr_u32(tb[NHA_ID]); >> >> +if (filter.id && filter.id != id) >> +return 0; >> + >> if (id && !delete_nexthop(id)) >> filter.flushed++; >> >> @@ -491,7 +495,10 @@ static int ipnh_list_flush(int argc, char **argv, int >> action) >> NEXT_ARG(); >> if (get_unsigned(&id, *argv, 0)) >> invarg("invalid id value", *argv); >> -return ipnh_get_id(id); >> +if (action == IPNH_FLUSH) >> +filter.id = id; >> +else >> +return ipnh_get_id(id); > > I think it's quite weird to ask for a dump of all nexthops only to > delete a specific one. How about this: +1 > > ``` > diff --git a/ip/ipnexthop.c b/ip/ipnexthop.c > index 0263307c49df..09a3076231aa 100644 > --- a/ip/ipnexthop.c > +++ b/ip/ipnexthop.c > @@ -765,8 +765,16 @@ static int ipnh_list_flush(int argc, char **argv, int > action) > if (!filter.master) > invarg("VRF does not exist\n", *argv); > } else if (!strcmp(*argv, "id")) { > - NEXT_ARG(); > - return ipnh_get_id(ipnh_parse_id(*argv)); > + /* When 'id' is specified with 'flush' / 'list' we do > +* not need to perform a dump. > +*/ > + if (action == IPNH_LIST) { > + NEXT_ARG(); > + return ipnh_get_id(ipnh_parse_id(*argv)); > + } else { > + return ipnh_modify(RTM_DELNEXTHOP, 0, argc, > + argv); > + } since delete just needs the id, you could refactor ipnh_modify and create a ipnh_delete_id.
Re: ip-nexthop does not support flush by id
On 3/30/21 7:05 AM, Ido Schimmel wrote: > Looks like a bug. 'flush' does not really make sense with 'id', but > 'list id' works, so I think 'flush id' should also work. right, neither were intended. If you know the id, you don't need the generic list / flush option. > > Can you send a patch? Agreed. At this point consistency is best.
Re: [PATCH iproute2-next] police: add support for packet-per-second rate limiting
On 3/26/21 6:50 AM, Simon Horman wrote: > From: Baowen Zheng > > Allow a policer action to enforce a rate-limit based on packets-per-second, > configurable using a packet-per-second rate and burst parameters. > > e.g. > # $TC actions add action police pkts_rate 1000 pkts_burst 200 index 1 > # $TC actions ls action police > total acts 1 > > action order 0: police 0x1 rate 0bit burst 0b mtu 4096Mb pkts_rate > 1000 pkts_burst 200 > ref 1 bind 0 > > Signed-off-by: Baowen Zheng > Signed-off-by: Simon Horman > Signed-off-by: Louis Peens > --- > man/man8/tc-police.8 | 35 --- > tc/m_police.c| 50 +--- > 2 files changed, 75 insertions(+), 10 deletions(-) > applied to iproute2-next.
Re: [PATCH] Add Open/R to rt_protos
On 3/26/21 9:05 AM, Cooper Ry Lees wrote: > From: Cooper Lees > > - Open Routing is using ID 99 for it's installed routes > - https://github.com/facebook/openr > - Kernel has accepted 99 in `rtnetlink.h` > > Signed-of-by: Cooper Lees > --- > etc/iproute2/rt_protos | 1 + > 1 file changed, 1 insertion(+) > applied to iproute2-next.
Re: [PATCH 8/9] ip6_tunnel:: Correct function name parse_tvl_tnl_enc_lim() in the kerneldoc comments
On 3/27/21 2:15 AM, Xiongfeng Wang wrote: > Fix the following W=1 kernel build warning(s): > > net/ipv6/ip6_tunnel.c:401: warning: expecting prototype for > parse_tvl_tnl_enc_lim(). Prototype was for ip6_tnl_parse_tlv_enc_lim() instead > > Reported-by: Hulk Robot > Signed-off-by: Xiongfeng Wang > --- > net/ipv6/ip6_tunnel.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > Reviewed-by: David Ahern
Re: [PATCH 1/9] l3mdev: Correct function names in the kerneldoc comments
On 3/27/21 2:15 AM, Xiongfeng Wang wrote: > Fix the following W=1 kernel build warning(s): > > net/l3mdev/l3mdev.c:111: warning: expecting prototype for > l3mdev_master_ifindex(). Prototype was for l3mdev_master_ifindex_rcu() instead > net/l3mdev/l3mdev.c:145: warning: expecting prototype for > l3mdev_master_upper_ifindex_by_index(). Prototype was for > l3mdev_master_upper_ifindex_by_index_rcu() instead > > Reported-by: Hulk Robot > Signed-off-by: Xiongfeng Wang > --- > net/l3mdev/l3mdev.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH] sit: use min
On 3/27/21 3:29 AM, Julia Lawall wrote: > From: kernel test robot > > Opportunity for min() > > Generated by: scripts/coccinelle/misc/minmax.cocci > > CC: Denis Efremov > Reported-by: kernel test robot > Signed-off-by: kernel test robot > --- > > sit.c |2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) Reviewed-by: David Ahern
Re: [PATCH net-next] nexthop: Rename artifacts related to legacy multipath nexthop groups
On 3/26/21 7:20 AM, Petr Machata wrote: > After resilient next-hop groups have been added recently, there are two > types of multipath next-hop groups: the legacy "mpath", and the new > "resilient". Calling the legacy next-hop group type "mpath" is unfortunate, > because that describes the fact that a packet could be forwarded in one of > several paths, which is also true for the resilient next-hop groups. > > Therefore, to make the naming clearer, rename various artifacts to reflect > the assumptions made. Therefore as of this patch: > > - The flag for multipath groups is nh_grp_entry::is_multipath. This > includes the legacy and resilient groups, as well as any future group > types that behave as multipath groups. > Functions that assume this have "mpath" in the name. > > - The flag for legacy multipath groups is nh_grp_entry::hash_threshold. > Functions that assume this have "hthr" in the name. > > - The flag for resilient groups is nh_grp_entry::resilient. > Functions that assume this have "res" in the name. > > Besides the above, struct nh_grp_entry::mpath was renamed to ::hthr as > well. > > UAPI artifacts were obviously left intact. > > Suggested-by: David Ahern > Signed-off-by: Petr Machata > --- > include/net/nexthop.h | 4 ++-- > net/ipv4/nexthop.c| 56 +++++-- > 2 files changed, 30 insertions(+), 30 deletions(-) > Thanks for the followup. Reviewed-by: David Ahern
Re: [PATCH 2/2] net: ipv4: route.c: Remove unnecessary if()
On 3/23/21 9:10 PM, Yejune Deng wrote: > negative_advice handler is only called when dst is non-NULL hence the > 'if (rt)' check can be removed. 'if' and 'else if' can be merged together. > And use container_of() instead of (struct rtable *). > > Signed-off-by: Yejune Deng > --- > net/ipv4/route.c | 16 ++-- > 1 file changed, 6 insertions(+), 10 deletions(-) > Reviewed-by: David Ahern
Re: rfc5837 and rfc8335
On 3/23/21 10:39 AM, Ron Bonica wrote: > Hi Folks, > > > > The rationale for RFC 8335 can be found in Section 5.0 of that document. > Currently, ICMP ECHO and ECHO RESPONSE messages can be used to determine > the liveness of some interfaces. However, they cannot determine the > liveness of: > > > > * An unnumbered IPv4 interface > * An IPv6 interface that has only a link-local address > > > > A router can have hundreds, or even thousands of interfaces that fall > into these categories. > > > > The rational for RFC 5837 can be found in the Introduction to that > document. When a node sends an ICMP TTL Expired message, the node > reports that a packet has expired on it. However, the source address of > the ICMP TTL Expired message doesn’t necessarily identify the interface > upon which the packet arrived. So, TRACEROUTE can be relied upon to > identify the nodes that a packet traverses along its delivery path. But > it cannot be relied upon to identify the interfaces that a packet > traversed along its deliver path. > > It's not a question of the rationale; the question is why add this support to Linux now? RFC 5837 is 11 years old. Why has no one cared to add support before now? What tooling supports it? What other NOS'es support it to really make the feature meaningful? e.g., Do you know what Juniper products support RFC 5837 today? More than likely Linux is the end node of the traceroute chain, not the transit or path nodes. With Linux, the ingress interface can lost in the layers (NIC port, vlan, bond, bridge, vrf, macvlan), and to properly support either you need to return information about the right one. Unnumbered interfaces can make that more of a challenge.
Re: [iproute2-next] tipc: add support for the netlink extack
On 3/24/21 7:56 PM, Hoang Le wrote: > Add support extack in tipc to dump the netlink extack error messages > (i.e -EINVAL) sent from kernel. > > Acked-by: Jon Maloy > Signed-off-by: Hoang Le > --- > tipc/msg.c | 29 ++--- > 1 file changed, 22 insertions(+), 7 deletions(-) > tipc should be converted to use the library functions in lib/mnl_utils.c.
Re: [PATCH net-next 0/6] page_pool: recycle buffers
On 3/22/21 11:02 AM, Matteo Croce wrote: > From: Matteo Croce > > This series enables recycling of the buffers allocated with the page_pool API. > The first two patches are just prerequisite to save space in a struct and > avoid recycling pages allocated with other API. > Patch 2 was based on a previous idea from Jonathan Lemon. > > The third one is the real recycling, 4 fixes the compilation of > __skb_frag_unref > users, and 5,6 enable the recycling on two drivers. patch 4 should be folded into 3; each patch should build without errors. > > In the last two patches I reported the improvement I have with the series. > > The recycling as is can't be used with drivers like mlx5 which do page split, > but this is documented in a comment. > In the future, a refcount can be used so to support mlx5 with no changes. Is the end goal of the page_pool changes to remove driver private caches?
Re: [PATCH] net: ipv4: route.c: Remove unnecessary {else} if()
subject line should have net-next as the target branch On 3/23/21 4:20 AM, Yejune Deng wrote: > Put if and else if together, and remove unnecessary judgments, because > it's caller can make sure it is true. And add likely() in > ipv4_confirm_neigh(). > > Signed-off-by: Yejune Deng > --- > net/ipv4/route.c | 18 +++--- > 1 file changed, 7 insertions(+), 11 deletions(-) > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > index fa68c2612252..f4ba07c5c1b1 100644 > --- a/net/ipv4/route.c > +++ b/net/ipv4/route.c > @@ -440,7 +440,7 @@ static void ipv4_confirm_neigh(const struct dst_entry > *dst, const void *daddr) > struct net_device *dev = dst->dev; > const __be32 *pkey = daddr; > > - if (rt->rt_gw_family == AF_INET) { > + if (likely(rt->rt_gw_family == AF_INET)) { > pkey = (const __be32 *)&rt->rt_gw4; > } else if (rt->rt_gw_family == AF_INET6) { > return __ipv6_confirm_neigh_stub(dev, &rt->rt_gw6); > @@ -814,19 +814,15 @@ static void ip_do_redirect(struct dst_entry *dst, > struct sock *sk, struct sk_buf > > static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst) > { > - struct rtable *rt = (struct rtable *)dst; > + struct rtable *rt = container_of(dst, struct rtable, dst); > struct dst_entry *ret = dst; > > - if (rt) { > - if (dst->obsolete > 0) { > - ip_rt_put(rt); > - ret = NULL; > - } else if ((rt->rt_flags & RTCF_REDIRECTED) || > -rt->dst.expires) { > - ip_rt_put(rt); > - ret = NULL; > - } > + if (dst->obsolete > 0 || rt->dst.expires || > + (rt->rt_flags & RTCF_REDIRECTED)) { > + ip_rt_put(rt); > + ret = NULL; > } > + > return ret; > } > > This should be 2 separate patches since they are unrelated changes. For the ipv4_negative_advice, the changelog should note that negative_advice handler is only called when dst is non-NULL hence the 'if (rt)' check can be removed.
Re: [PATCH v7 bpf-next 10/14] bpf: add new frame_length field to the XDP ctx
On 3/22/21 3:29 AM, Eelco Chaudron wrote: > > > On 20 Mar 2021, at 4:42, David Ahern wrote: > >> On 3/19/21 3:47 PM, Lorenzo Bianconi wrote: >>> diff --git a/include/net/xdp.h b/include/net/xdp.h >>> index 19cd6642e087..e47d9e8da547 100644 >>> --- a/include/net/xdp.h >>> +++ b/include/net/xdp.h >>> @@ -75,6 +75,10 @@ struct xdp_buff { >>> struct xdp_txq_info *txq; >>> u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved >>> tailroom*/ >>> u32 mb:1; /* xdp non-linear buffer */ >>> + u32 frame_length; /* Total frame length across all buffers. Only >>> needs >>> + * to be updated by helper functions, as it will be >>> + * initialized at XDP program start. >>> + */ >>> }; >>> >>> static __always_inline void >> >> If you do another version of this set ... >> >> I think you only need 17-bits for the frame length (size is always <= >> 128kB). It would be helpful for extensions to xdp if you annotated how >> many bits are really needed here. > > Guess this can be done, but I did not too avoid the use of constants to > do the BPF extraction. > Here is an example of what might need to be added, as adding them before > made people unhappy ;) > > https://elixir.bootlin.com/linux/v5.12-rc4/source/include/linux/skbuff.h#L801 > > I was just referring to a code comment that bits can be taken from frame_length for extensions - just like the mb bit takes from frame_sz (and it too has bits available since 2GB is way larger than actually needed).
Re: [PATCH iproute2-next] ip: xfrm: add support for tfcpad
On 3/19/21 10:57 AM, Sabrina Dubroca wrote: > This patch adds support for setting and displaying the Traffic Flow > Confidentiality attribute for an XFRM state, which allows padding ESP > packets to a specified length. > > Signed-off-by: Sabrina Dubroca > --- > ip/ipxfrm.c| 8 > ip/xfrm_state.c| 10 +- > man/man8/ip-xfrm.8 | 2 ++ > 3 files changed, 19 insertions(+), 1 deletion(-) > applied to iproute2-next. Thanks,
Re: [PATCH] ipv4/raw: support binding to nonlocal addresses
On 3/20/21 6:20 PM, Riccardo Paolo Bestetti wrote: > diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c > index 50a73178d63a..734c0332b54b 100644 > --- a/net/ipv4/raw.c > +++ b/net/ipv4/raw.c > @@ -717,6 +717,7 @@ static int raw_bind(struct sock *sk, struct sockaddr > *uaddr, int addr_len) > { > struct inet_sock *inet = inet_sk(sk); > struct sockaddr_in *addr = (struct sockaddr_in *) uaddr; > + struct net *net = sock_net(sk); > u32 tb_id = RT_TABLE_LOCAL; > int ret = -EINVAL; > int chk_addr_ret; > @@ -732,7 +733,8 @@ static int raw_bind(struct sock *sk, struct sockaddr > *uaddr, int addr_len) > tb_id); > > ret = -EADDRNOTAVAIL; > - if (addr->sin_addr.s_addr && chk_addr_ret != RTN_LOCAL && > + if (!inet_can_nonlocal_bind(net, inet) && > + addr->sin_addr.s_addr && chk_addr_ret != RTN_LOCAL && > chk_addr_ret != RTN_MULTICAST && chk_addr_ret != RTN_BROADCAST) > goto out; > inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr; > Please add test cases to ipv4_addr_bind and ipv6_addr_bind in tools/testing/selftests/net/fcnal-test.sh. The latter will verify if IPv6 works the same or needs a change. Also, this check duplicates the ones in __inet_bind and __inet6_bind; it would be good to use an inline helper to reduce the duplication.
rfc5837 and rfc8335
On 3/19/21 10:24 PM, David Ahern wrote: > At the end of the day, what is the value of this feature vs the other > ICMP probing set? Merging the conversations about both of these RFCs since my comments and questions are the same for both. What is the motivation for adding support for these RFCs? Is the push from a company or academia (e.g., a CS project)? Realistically, who is expected to use this feature and why given the information it leaks about the networking configuration of the node. Why is this tool expected to be more useful than a network operator using existing protocols like lldp, collecting that data across nodes and analyzing, or using tools like suzieq[1]? RFC 5837 has been out for 11 years. Do any operating systems support it — e.g., networking vendors like Cisco, Juniper, etc.? If not, why not? This one seems to me the most dubious at this point in time. Similarly for RFC 8335, what is the current support for it? Linux does not need to support an RFC just because it exists. I am really questioning the value of both of them [1] https://github.com/netenglabs/suzieq
Re: [PATCH V4 net-next 5/5] icmp: add response to RFC 8335 PROBE messages
On 3/20/21 10:01 AM, Andreas Roeseler wrote: > On Wed, 2021-03-17 at 21:24 -0600, David Ahern wrote: >> On 3/17/21 9:19 PM, David Miller wrote: >>> From: Andreas Roeseler >>> Date: Wed, 17 Mar 2021 22:11:47 -0500 >>> >>>> On Mon, 2021-03-15 at 04:35 +0800, kernel test robot wrote: >>>> Is there something that I'm not understanding about compiling >>>> kernel >>>> components modularly? How do I avoid this error? >>> >>>> >>> You cannot reference module exported symbols from statically linked >>> code. >>> y >>> >> >> Look at ipv6_stub to see how it exports IPv6 functions for v4 code. >> There are a few examples under net/ipv4. > > Thanks for the advice. I've been able to make some progress but I still > have some questions that I have been unable to find online. > > What steps are required to include a function into the ipv6_stub > struct? I've added the declaration of the function to the struct, but > when I attempt to call it using ipv6_dev_find()> the kernel > locks up. Additionally, a typo in the declaration isn't flagged during > compilation. Are there other places where I need to edit the ipv6_stub > struct or include various headers? The examples I have looked at are > , , and in the folder > and they don't seem to do anything on the caller side of ipv6_stub, so > I think I am not adding the function to ipv6_stub properly. I have been > able to call other functions that currently exist in ipv6_stub, but not > the one I am attempting to add, so am I missing a step? you probably did not add the default in net/ipv6/addrconf_core.c. > > I've noticed that some functions such as aren't > exported using EXPORT_SYMBOL when it is defined in > , but it is still loaded into ipv6_stub. How can > this be? Is there a different way to include symbols into ipv6_stub > based on whether or not they are explicitly exported using > EXPORT_SYMBOL? > take a look at 1aefd3de7bc6 as an example of how to add a new stub.
Re: [PATCH v3] icmp: support rfc5837
On 3/19/21 6:53 PM, Willem de Bruijn wrote: > On Fri, Mar 19, 2021 at 7:54 PM David Ahern wrote: >> >> On 3/19/21 10:11 AM, Ishaan Gandhi wrote: >>> Thank you. Would it be better to do instead: >>> >>> + if_index = skb->skb_iif; >>> >>> or >>> >>> + if_index = ip_version == 4 ? inet_iif(skb) : skb->skb_iif; >>> >> >> If the packet comes in via an interface assigned to a VRF, skb_iif is >> most likely the VRF index which is not what you want. >> >> The general problem of relying on skb_iif was discussed on v1 and v2 of >> your patch. Returning an iif that is a VRF, as an example, leaks >> information about the networking configuration of the device which from >> a quick reading of the RFC is not the intention. >> >> Further, the Security Considerations section recommends controls on what >> information can be returned where you have added a single sysctl that >> determines if all information or none is returned. Further, it is not a >> a per-device control but a global one that applies to all net devices - >> though multiple entries per netdevice has a noticeable cost in memory at >> scale. >> >> In the end it seems to me the cost benefit is not there for a feature >> like this. > > The sysctl was my suggestion. The detailed filtering suggested in the > RFC would add more complexity, not helping that cost benefit analysis. > I cared mostly about being able to disable this feature outright as it has > obvious risks. > > But perhaps that is overly simplistic. The RFC suggests deciding trusted > recipients based on destination address. With a sysctl this feature can be > only deployed when all possible recipients are trusted, i.e., on an isolated > network. That is highly limiting. > > Perhaps a per-netns trusted subnet prefix? > > The root admin should always be able to override and disable this outright. > sure a sysctl is definitely required for a feature like this. >From my perspective to be useful the control needs to be per interface (e.g., management interface vs dataplane devices) and that has a higher cost. Add in the amount of information returned and we know from other examples that some users will want to limit which data is returned and that increases the number of sysctls per device. On top of that there is the logic of resolving what is the right device and its information to return. There is are multiple layers - nic port, bond, vlan, bridge, vrf, macvlan - each of which might be relevant. The RFC referenced unnumbered devices as the ingress device. It seems like a means for leaking information which comes back to the sysctl for proper controls. At the end of the day, what is the value of this feature vs the other ICMP probing set?
Re: [PATCH v7 bpf-next 10/14] bpf: add new frame_length field to the XDP ctx
On 3/19/21 3:47 PM, Lorenzo Bianconi wrote: > diff --git a/include/net/xdp.h b/include/net/xdp.h > index 19cd6642e087..e47d9e8da547 100644 > --- a/include/net/xdp.h > +++ b/include/net/xdp.h > @@ -75,6 +75,10 @@ struct xdp_buff { > struct xdp_txq_info *txq; > u32 frame_sz:31; /* frame size to deduce data_hard_end/reserved > tailroom*/ > u32 mb:1; /* xdp non-linear buffer */ > + u32 frame_length; /* Total frame length across all buffers. Only needs > +* to be updated by helper functions, as it will be > +* initialized at XDP program start. > +*/ > }; > > static __always_inline void If you do another version of this set ... I think you only need 17-bits for the frame length (size is always <= 128kB). It would be helpful for extensions to xdp if you annotated how many bits are really needed here.
Re: [PATCH v3] icmp: support rfc5837
On 3/19/21 10:11 AM, Ishaan Gandhi wrote: > Thank you. Would it be better to do instead: > > + if_index = skb->skb_iif; > > or > > + if_index = ip_version == 4 ? inet_iif(skb) : skb->skb_iif; > If the packet comes in via an interface assigned to a VRF, skb_iif is most likely the VRF index which is not what you want. The general problem of relying on skb_iif was discussed on v1 and v2 of your patch. Returning an iif that is a VRF, as an example, leaks information about the networking configuration of the device which from a quick reading of the RFC is not the intention. Further, the Security Considerations section recommends controls on what information can be returned where you have added a single sysctl that determines if all information or none is returned. Further, it is not a a per-device control but a global one that applies to all net devices - though multiple entries per netdevice has a noticeable cost in memory at scale. In the end it seems to me the cost benefit is not there for a feature like this.
Re: [PATCH iproute2-next v4 0/6] ip: nexthop: Support resilient groups
On 3/17/21 6:54 AM, Petr Machata wrote: > Support for resilient next-hop groups was recently accepted to Linux > kernel[1]. Resilient next-hop groups add a layer of indirection between the > SKB hash and the next hop. Thus the hash is used to reference a hash table > bucket, which is then used to reference a particular next hop. This allows > the system more flexibility when assigning SKB hash space to next hops. > Previously, each next hop had to be assigned a continuous range of SKB hash > space. With a hash table as an intermediate layer, it is possible to > reassign next hops with a hash table bucket granularity. In turn, this > mends issues with traffic flow redirection resulting from next hop removal > or adjustments in next-hop weights. > > In this patch set, introduce support for resilient next-hop groups to > iproute2. > applied to iproute2-next. Thanks,
Re: [PATCH v3] icmp: support rfc5837
On 3/17/21 4:19 PM, ishaangandhi wrote: > +void icmp_identify_arrival_interface(struct sk_buff *skb, struct net *net, > int room, > + char *icmph, int ip_version) > +{ > + unsigned int ext_len, orig_len, word_aligned_orig_len, offset, > extra_space_needed, > + if_index, mtu = 0, name_len = 0, name_subobj_len = 0; > + struct interface_ipv4_addr_sub_obj ip4_addr_subobj = {.addr = 0}; > + struct interface_ipv6_addr_sub_obj ip6_addr_subobj; > + struct icmp_extobj_hdr *iio_hdr; > + struct inet6_ifaddr ip6_ifaddr; > + struct inet6_dev *dev6 = NULL; > + struct icmp_ext_hdr *ext_hdr; > + char *name = NULL, ctype; > + struct net_device *dev; > + void *subobj_offset; > + > + skb_linearize(skb); > + if_index = inet_iif(skb); inet_iif is an IPv4 helper; it should not be used for v6 skb's.
Re: [PATCH V4 net-next 5/5] icmp: add response to RFC 8335 PROBE messages
On 3/17/21 9:19 PM, David Miller wrote: > From: Andreas Roeseler > Date: Wed, 17 Mar 2021 22:11:47 -0500 > >> On Mon, 2021-03-15 at 04:35 +0800, kernel test robot wrote: >> Is there something that I'm not understanding about compiling kernel >> components modularly? How do I avoid this error? > >> > You cannot reference module exported symbols from statically linked code. > y > Look at ipv6_stub to see how it exports IPv6 functions for v4 code. There are a few examples under net/ipv4.
Re: [PATCH 1/2] neighbour: allow referenced neighbours to be removed
On 3/17/21 12:53 PM, Thadeu Lima de Souza Cascardo wrote: > During forced garbage collection, neighbours with more than a reference are > not removed. It's possible to DoS the neighbour table by using ARP spoofing > in such a way that there is always a timer pending for all neighbours, > preventing any of them from being removed. That will cause any new > neighbour creation to fail. > > Use the same code as used by neigh_flush_dev, which deletes the timer and > cleans the queue when there are still references left. > > With the same ARP spoofing technique, it was still possible to reach a valid > destination when this fix was applied, with no more table overflows. And how fast are neighbor entries removed with this patch? The current code gives a neighbor entry a minimum lifetime to allow it to exist long enough to be confirmed. Removing the minimum lifetime means neighbor entries are constantly churning which is just as bad as the arp spoofing problem. > > Signed-off-by: Thadeu Lima de Souza Cascardo > --- > net/core/neighbour.c | 117 +++ > 1 file changed, 51 insertions(+), 66 deletions(-) > > diff --git a/net/core/neighbour.c b/net/core/neighbour.c > index e2982b3970b8..bbc89c7ffdfd 100644 > --- a/net/core/neighbour.c > +++ b/net/core/neighbour.c > @@ -173,25 +173,48 @@ static bool neigh_update_ext_learned(struct neighbour > *neigh, u32 flags, > return rc; > } > > +static int neigh_del_timer(struct neighbour *n) > +{ > + if ((n->nud_state & NUD_IN_TIMER) && > + del_timer(&n->timer)) { > + neigh_release(n); > + return 1; > + } > + return 0; > +} > + > static bool neigh_del(struct neighbour *n, struct neighbour __rcu **np, > struct neigh_table *tbl) > { > - bool retval = false; > - > + rcu_assign_pointer(*np, > +rcu_dereference_protected(n->next, > + lockdep_is_held(&tbl->lock))); > write_lock(&n->lock); > - if (refcount_read(&n->refcnt) == 1) { > - struct neighbour *neigh; > - > - neigh = rcu_dereference_protected(n->next, > - lockdep_is_held(&tbl->lock)); > - rcu_assign_pointer(*np, neigh); > - neigh_mark_dead(n); > - retval = true; > + neigh_del_timer(n); > + neigh_mark_dead(n); > + if (refcount_read(&n->refcnt) != 1) { > + /* The most unpleasant situation. > +We must destroy neighbour entry, > +but someone still uses it. > + > +The destroy will be delayed until > +the last user releases us, but > +we must kill timers etc. and move > +it to safe state. > + */ > + __skb_queue_purge(&n->arp_queue); > + n->arp_queue_len_bytes = 0; > + n->output = neigh_blackhole; > + if (n->nud_state & NUD_VALID) > + n->nud_state = NUD_NOARP; > + else > + n->nud_state = NUD_NONE; > + neigh_dbg(2, "neigh %p is stray\n", n); > } > write_unlock(&n->lock); > - if (retval) > - neigh_cleanup_and_release(n); > - return retval; > + neigh_cleanup_and_release(n); > + > + return true; > } > > bool neigh_remove_one(struct neighbour *ndel, struct neigh_table *tbl) > @@ -229,22 +252,20 @@ static int neigh_forced_gc(struct neigh_table *tbl) > write_lock_bh(&tbl->lock); > > list_for_each_entry_safe(n, tmp, &tbl->gc_list, gc_list) { > - if (refcount_read(&n->refcnt) == 1) { > - bool remove = false; > - > - write_lock(&n->lock); > - if ((n->nud_state == NUD_FAILED) || > - (tbl->is_multicast && > - tbl->is_multicast(n->primary_key)) || > - time_after(tref, n->updated)) > - remove = true; > - write_unlock(&n->lock); > - > - if (remove && neigh_remove_one(n, tbl)) > - shrunk++; > - if (shrunk >= max_clean) > - break; > - } > + bool remove = false; > + > + write_lock(&n->lock); > + if ((n->nud_state == NUD_FAILED) || > + (tbl->is_multicast && > + tbl->is_multicast(n->primary_key)) || > + time_after(tref, n->updated)) > + remove = true; > + write_unlock(&n->lock); > + > + if (remove && neigh_remove_one(n, tbl)) > + shrunk++; > + if (shrunk >= max_clean) > + break; > } > > tbl->last_flush = jiffies; > @@ -264,16 +285,6 @@ static void neigh_add_timer(struct neighbour *n, > un
Re: [PATCH] net: ipv4: Fixed some styling issues.
On 3/17/21 9:07 AM, Anish Udupa wrote: > Ran checkpatch and found these warnings. Fixed some of them in this patch. > a) Added a space before '='. > b) Removed the space before the tab. > > Signed-off-by: Anish Udupa H > --- > net/ipv4/route.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > index 02d81d79deeb..0b9024584fde 100644 > --- a/net/ipv4/route.c > +++ b/net/ipv4/route.c > @@ -2236,7 +2236,7 @@ out: return err; > if (!rth) > goto e_nobufs; > > - rth->dst.output= ip_rt_bug; > + rth->dst.output = ip_rt_bug; > #ifdef CONFIG_IP_ROUTE_CLASSID > rth->dst.tclassid = itag; > #endif > @@ -2244,9 +2244,9 @@ out: return err; > > RT_CACHE_STAT_INC(in_slow_tot); > if (res->type == RTN_UNREACHABLE) { > - rth->dst.input= ip_error; > - rth->dst.error= -err; > - rth->rt_flags &= ~RTCF_LOCAL; > + rth->dst.input = ip_error; > + rth->dst.error = -err; > + rth->rt_flags &= ~RTCF_LOCAL; > } > > if (do_cache) { > your patch seems to have lost one or more tabs at the beginning of each line.
Re: [BUG] Iproute2 batch-mode fails to bring up veth
On 3/15/21 1:22 PM, Tim Rice wrote: > Hey all, > > Sorry if this isn't the right place to report Iproute2 bugs. It was > implied by README.devel as well as a couple of entries I saw in bugzilla. > > I use iproute2 batch mode to construct network namespaces. Example script: > > $ cat ~/bin/netns-test.sh > #! /bin/bash > > gw=192.168.5.1 > ip=192.168.5.2 > ns=netns-test > veth0=${ns}-0 > veth1=${ns}-1 > > /usr/local/sbin/ip -b - << EOF > link add $veth0 type veth peer name $veth1 > addr add $gw peer $ip dev $veth0 > link set dev $veth0 up > netns add $ns > link set $veth1 netns $ns > netns exec $ns ip link set dev lo up > netns exec $ns ip link set dev $veth1 up > netns exec $ns ip addr add $ip/24 dev $veth1 > netns exec $ns ip addr add $ip peer $gw dev $veth1 > netns exec $ns ip route add default via $gw dev $veth1 > netns exec $ns ip route add 192.168.0.0/24 via $gw dev $veth1 > EOF > > > I noticed when version 5.11.0 dropped that this stops working. Batch > mode fails to bring up the inner veth. > > Expected usage (as produced by v5.10.0): > > $ sudo ./bin/netns-test.sh > $ sudo ip netns exec netns-test ip route > default via 192.168.5.1 dev netns-test-1 > 192.168.0.0/24 via 192.168.5.1 dev netns-test-1 > 192.168.5.0/24 dev netns-test-1 proto kernel scope link src 192.168.5.2 > 192.168.5.1 dev netns-test-1 proto kernel scope link src 192.168.5.2 > > Actual behaviour: > > $ sudo ./bin/netns-test.sh > $ sudo ip netns exec netns-test ip route # Notice the empty output > $ sudo ip netns exec netns-test ip link > 1: lo: mtu 65536 qdisc noqueue state UNKNOWN > mode DEFAULT group default qlen 1000 > link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 > 39: netns-test-1@if40: mtu 1500 qdisc noop state > DOWN mode DEFAULT group default qlen 1000 > link/ether 1a:96:4e:4f:84:31 brd ff:ff:ff:ff:ff:ff link-netnsid 0 > > System info: > > * Distro: Void Linux > * Kernel version: 5.10.23 > * CPU: AMD Ryzen 7 1800X and Ryzen 5 2600. (Reproduced on both.) > > Git bisect pinpoints this commit: > https://github.com/shemminger/iproute2/commit/1d9a81b8c9f30f9f4abeb875998262f61bf10577 > Petr, can you take a look at this regression?
Re: VRF leaking doesn't work
On 3/15/21 11:10 AM, Greesha Mikhalkin wrote: >> That's the way the source address selection works -- it takes the fib >> lookup result and finds the best source address match for it. >> >> Try adding 'src a.b.c.d' to the leaked route. e.g., >> ip ro add 172.16.1.0/24 dev red vrf blue src 172.16.2.1 >> >> where red and blue are VRFs, 172.16.2.1 is a valid source address in VRF >> blue and VRF red has the reverse route installed. > > Tried to do that. Added reverse route to vrf red like that: > ip ro add vrf red 172.16.2.1/32 dev blue > > 172.16.2.1 is selected as source address when i ping. But now, when i > look at `tcpdump icmp` i only see requests: > 172.16.2.1 > 172.16.1.3: ICMP echo request, id 9, seq 10, length 64 > > And no replies and anything else. If i look into tcpdump on machine > that's pinged -- it doesn't receive anything. > > So it looks like it's not using nexthops from vrf red in that case. > Maybe it has something to do with how address is setup. In routing > table it looks like: > local 172.16.2.1 dev vlanblue proto kernel scope host src 172.16.2.1 > VRF is implemented via policy routing. did you re-order the FIB rules? http://schd.ws/hosted_files/ossna2017/fe/vrf-tutorial-oss.pdf
Re: [PATCH iproute2-next v2] dcb: Fix compilation warning about reallocarray
On 3/15/21 7:10 AM, Petr Machata wrote: > Could this be merged, please? done
Re: [PATCH iproute2-next 4/6] nexthop: Add ability to specify group type
On 3/12/21 10:23 AM, Petr Machata wrote: > From: Petr Machata > > From: Ido Schimmel All of the patches have the above. If Ido is the author and you are sending, AIUI you add your Signed-off-by below his. > > Next patches are going to add a 'resilient' nexthop group type, so allow > users to specify the type using the 'type' argument. Currently, only > 'mpath' type is supported. > > These two command are equivalent: > > Signed-off-by: Ido Schimmel > --- > ip/ipnexthop.c| 32 +++- > man/man8/ip-nexthop.8 | 18 -- > 2 files changed, 47 insertions(+), 3 deletions(-) > ... > diff --git a/man/man8/ip-nexthop.8 b/man/man8/ip-nexthop.8 > index 4d55f4dbcc75..f02e0555a000 100644 > --- a/man/man8/ip-nexthop.8 > +++ b/man/man8/ip-nexthop.8 > @@ -54,7 +54,9 @@ ip-nexthop \- nexthop object management > .BR fdb " ] | " > .B group > .IR GROUP " [ " > -.BR fdb " ] } " > +.BR fdb " ] [ " > +.B type > +.IR TYPE " ] } " > > .ti -8 > .IR ENCAP " := [ " > @@ -71,6 +73,10 @@ ip-nexthop \- nexthop object management > .IR GROUP " := " > .BR id "[," weight "[/...]" > > +.ti -8 > +.IR TYPE " := { " > +.BR mpath " }" > + > .SH DESCRIPTION > .B ip nexthop > is used to manipulate entries in the kernel's nexthop tables. > @@ -122,9 +128,17 @@ is a set of encapsulation attributes specific to the > .in -2 > > .TP > -.BI group " GROUP" > +.BI group " GROUP [ " type " TYPE ]" > create a nexthop group. Group specification is id with an optional > weight (id,weight) and a '/' as a separator between entries. > +.sp > +.I TYPE > +is a string specifying the nexthop group type. Namely: > + > +.in +8 > +.BI mpath > +- multipath nexthop group > + Add a comment that this is the default group type and refers to the legacy hash-bashed multipath group. The rest of the patches look ok to me.
Re: [PATCH net-next 07/10] selftests: fib_nexthops: Test resilient nexthop groups
On 3/12/21 9:50 AM, Petr Machata wrote: > From: Ido Schimmel > > Add test cases for resilient nexthop groups. Exhaustive forwarding tests > are added separately under net/forwarding/. > ... > > Signed-off-by: Ido Schimmel > Co-developed-by: Petr Machata > Signed-off-by: Petr Machata > --- > tools/testing/selftests/net/fib_nexthops.sh | 517 > 1 file changed, 517 insertions(+) > Reviewed-by: David Ahern
Re: [PATCH net-next 06/10] selftests: fib_nexthops: List each test case in a different line
On 3/12/21 9:50 AM, Petr Machata wrote: > From: Ido Schimmel > > The lines with the IPv4 and IPv6 test cases are already very long and > more test cases will be added in subsequent patches. > > List each test case in a different line to make it easier to extend the > test with more test cases. > > Signed-off-by: Ido Schimmel > Reviewed-by: Petr Machata > Signed-off-by: Petr Machata > --- > tools/testing/selftests/net/fib_nexthops.sh | 30 ++--- > 1 file changed, 26 insertions(+), 4 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH net-next 05/10] selftests: fib_nexthops: Declutter test output
On 3/12/21 9:50 AM, Petr Machata wrote: > From: Ido Schimmel > > Before: > > # ./fib_nexthops.sh -t ipv4_torture > > IPv4 runtime torture > > TEST: IPv4 torture test [ OK ] > ./fib_nexthops.sh: line 213: 19376 Killed ipv4_del_add_loop1 > ./fib_nexthops.sh: line 213: 19377 Killed > ipv4_grp_replace_loop > ./fib_nexthops.sh: line 213: 19378 Killed ip netns exec me > ping -f 172.16.101.1 > /dev/null 2>&1 > ./fib_nexthops.sh: line 213: 19380 Killed ip netns exec me > ping -f 172.16.101.2 > /dev/null 2>&1 > ./fib_nexthops.sh: line 213: 19381 Killed ip netns exec me > mausezahn veth1 -B 172.16.101.2 -A 172.16.1.1 -c 0 -t tcp "dp=1-1023, > flags=syn" > /dev/null 2>&1 > > Tests passed: 1 > Tests failed: 0 > > # ./fib_nexthops.sh -t ipv6_torture > > IPv6 runtime torture > > TEST: IPv6 torture test [ OK ] > ./fib_nexthops.sh: line 213: 24453 Killed ipv6_del_add_loop1 > ./fib_nexthops.sh: line 213: 24454 Killed > ipv6_grp_replace_loop > ./fib_nexthops.sh: line 213: 24456 Killed ip netns exec me > ping -f 2001:db8:101::1 > /dev/null 2>&1 > ./fib_nexthops.sh: line 213: 24457 Killed ip netns exec me > ping -f 2001:db8:101::2 > /dev/null 2>&1 > ./fib_nexthops.sh: line 213: 24458 Killed ip netns exec me > mausezahn -6 veth1 -B 2001:db8:101::2 -A 2001:db8:91::1 -c 0 -t tcp > "dp=1-1023, flags=syn" > /dev/null 2>&1 > > Tests passed: 1 > Tests failed: 0 > > After: > > # ./fib_nexthops.sh -t ipv4_torture > > IPv4 runtime torture > > TEST: IPv4 torture test [ OK ] > > Tests passed: 1 > Tests failed: 0 > > # ./fib_nexthops.sh -t ipv6_torture > > IPv6 runtime torture > > TEST: IPv6 torture test [ OK ] > > Tests passed: 1 > Tests failed: 0 > > Signed-off-by: Ido Schimmel > Reviewed-by: Petr Machata > Signed-off-by: Petr Machata > --- > tools/testing/selftests/net/fib_nexthops.sh | 2 ++ > 1 file changed, 2 insertions(+) > Reviewed-by: David Ahern
Re: [PATCH net-next] net: ipv6: addrconf: Add accept_ra_prefix_route.
On 3/11/21 7:22 PM, subas...@codeaurora.org wrote: > > We are seeing that the interface itself doesn't get the address assigned > via RA when setting accept_ra_pinfo = 0. > > We would like to have the interface address assigned via SLAAC > here while the route management would be handled via the userspace daemon. > In that case, we do not want the kernel installed route to be present > (behavior controlled via this proc entry). sysctl's are not free and in this case you want to add a second one to pick and choose which data in the message you want the kernel to act on. Why can't the userspace daemon remove the route and add the one it prefers? Or add another route with a metric that makes it the preferred route making the kernel one effectively moot?
Re: VRF leaking doesn't work
On 3/10/21 1:34 AM, Greesha Mikhalkin wrote: > I see. When i do `ping -I vrf2` to address that was leaked from vrf1 > it selects source address that's set as local in vrf1 routing table. > Is this expected behavior? I guess, forwarding packets from vrf1 to > vrf2 local address won't help here. > That's the way the source address selection works -- it takes the fib lookup result and finds the best source address match for it. Try adding 'src a.b.c.d' to the leaked route. e.g., ip ro add 172.16.1.0/24 dev red vrf blue src 172.16.2.1 where red and blue are VRFs, 172.16.2.1 is a valid source address in VRF blue and VRF red has the reverse route installed.
Re: [PATCH net-next 00/14] nexthop: Resilient next-hop groups
On 3/10/21 8:02 AM, Petr Machata wrote: > At this moment, there is only one type of next-hop group: an mpath group. > Mpath groups implement the hash-threshold algorithm, described in RFC > 2992[1]. > > To select a next hop, hash-threshold algorithm first assigns a range of > hashes to each next hop in the group, and then selects the next hop by > comparing the SKB hash with the individual ranges. When a next hop is > removed from the group, the ranges are recomputed, which leads to > reassignment of parts of hash space from one next hop to another. RFC 2992 > illustrates it thus: > > +---+---+---+---+---+ > | 1 | 2 | 3 | 4 | 5 | > +---+-+-+---+---+-+-+---+ > |1|2|4|5| > +-+-+-+-+ > > Before and after deletion of next hop 3 > under the hash-threshold algorithm. > > Note how next hop 2 gave up part of the hash space in favor of next hop 1, > and 4 in favor of 5. While there will usually be some overlap between the > previous and the new distribution, some traffic flows change the next hop > that they resolve to. > > If a multipath group is used for load-balancing between multiple servers, > this hash space reassignment causes an issue that packets from a single > flow suddenly end up arriving at a server that does not expect them, which > may lead to TCP reset. > > If a multipath group is used for load-balancing among available paths to > the same server, the issue is that different latencies and reordering along > the way causes the packets to arrive in the wrong order. > > Resilient hashing is a technique to address the above problem. Resilient > next-hop group has another layer of indirection between the group itself > and its constituent next hops: a hash table. The selection algorithm uses a > straightforward modulo operation on the SKB hash to choose a hash table > bucket, then reads the next hop that this bucket contains, and forwards > traffic there. > > This indirection brings an important feature. In the hash-threshold > algorithm, the range of hashes associated with a next hop must be > continuous. With a hash table, mapping between the hash table buckets and > the individual next hops is arbitrary. Therefore when a next hop is deleted > the buckets that held it are simply reassigned to other next hops: > > +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > |1|1|1|1|2|2|2|2|3|3|3|3|4|4|4|4|5|5|5|5| > +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > v v v v > +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > |1|1|1|1|2|2|2|2|1|2|4|5|4|4|4|4|5|5|5|5| > +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > > Before and after deletion of next hop 3 > under the resilient hashing algorithm. > > When weights of next hops in a group are altered, it may be possible to > choose a subset of buckets that are currently not used for forwarding > traffic, and use those to satisfy the new next-hop distribution demands, > keeping the "busy" buckets intact. This way, established flows are ideally > kept being forwarded to the same endpoints through the same paths as before > the next-hop group change. > > This patch set adds the implementation of resilient next-hop groups. > > In a nutshell, the algorithm works as follows. Each next hop has a number > of buckets that it wants to have, according to its weight and the number of > buckets in the hash table. In case of an event that might cause bucket > allocation change, the numbers for individual next hops are updated, > similarly to how ranges are updated for mpath group next hops. Following > that, a new "upkeep" algorithm runs, and for idle buckets that belong to a > next hop that is currently occupying more buckets than it wants (it is > "overweight"), it migrates the buckets to one of the next hops that has > fewer buckets than it wants (it is "underweight"). If, after this, there > are still underweight next hops, another upkeep run is scheduled to a > future time. > > Chances are there are not enough "idle" buckets to satisfy the new demands. > The algorithm has knobs to select both what it means for a bucket to be > idle, and for whether and when to forcefully migrate buckets if there keeps > being an insufficient number of idle ones. > > To illustrate the usage, consider the following commands: > > # ip nexthop add id 1 via 192.0.2.2 dev dummy1 > # ip nexthop add id 2 via 192.0.2.3 dev dummy1 > # ip nexthop add id 10 group 1/2 type resilient \ > buckets 8 idle_timer 60 unbalanced_timer 300 > > The last command creates a resilient next-hop group. It will have 8 > buckets, each bucket will be considered idle when no traffic hits it for at > least 60 seconds, and if the table remains out of balance for 300 seconds, > it will
Re: [PATCH net-next 14/14] nexthop: Enable resilient next-hop groups
On 3/10/21 8:03 AM, Petr Machata wrote: > Now that all the code is in place, stop rejecting requests to create > resilient next-hop groups. > > Signed-off-by: Petr Machata > Reviewed-by: Ido Schimmel > --- > net/ipv4/nexthop.c | 4 > 1 file changed, 4 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH net-next 13/14] nexthop: Notify userspace about bucket migrations
On 3/10/21 8:03 AM, Petr Machata wrote: > Nexthop replacements et.al. are notified through netlink, but if a delayed > work migrates buckets on the background, userspace will stay oblivious. > Notify these as RTM_NEWNEXTHOPBUCKET events. > > Signed-off-by: Petr Machata > Reviewed-by: Ido Schimmel > --- > > Notes: > v1 (changes since RFC): > - u32 -> u16 for bucket counts / indices > > net/ipv4/nexthop.c | 45 +++-- > 1 file changed, 39 insertions(+), 6 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH net-next 12/14] nexthop: Add netlink handlers for bucket get
On 3/10/21 8:03 AM, Petr Machata wrote: > Allow getting (but not setting) individual buckets to inspect the next hop > mapped therein, idle time, and flags. > > Signed-off-by: Petr Machata > Reviewed-by: Ido Schimmel > --- > > Notes: > v1 (changes since RFC): > - u32 -> u16 for bucket counts / indices > > net/ipv4/nexthop.c | 110 - > 1 file changed, 109 insertions(+), 1 deletion(-) > Reviewed-by: David Ahern
Re: [PATCH net-next 11/14] nexthop: Add netlink handlers for bucket dump
On 3/10/21 8:03 AM, Petr Machata wrote: > Add a dump handler for resilient next hop buckets. When next-hop group ID > is given, it walks buckets of that group, otherwise it walks buckets of all > groups. It then dumps the buckets whose next hops match the given filtering > criteria. > > Signed-off-by: Petr Machata > Reviewed-by: Ido Schimmel > --- > > Notes: > v1 (changes since RFC): > - u32 -> u16 for bucket counts / indices > Reviewed-by: David Ahern
Re: [PATCH net-next 10/14] nexthop: Add netlink handlers for resilient nexthop groups
On 3/10/21 8:03 AM, Petr Machata wrote: > Implement the netlink messages that allow creation and dumping of resilient > nexthop groups. > > Signed-off-by: Petr Machata > Reviewed-by: Ido Schimmel > --- > > Notes: > v1 (changes since RFC): > - u32 -> u16 for bucket counts / indices > > net/ipv4/nexthop.c | 150 +++-- > 1 file changed, 145 insertions(+), 5 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH net-next 09/14] nexthop: Allow reporting activity of nexthop buckets
On 3/10/21 8:03 AM, Petr Machata wrote: > From: Ido Schimmel > > The kernel periodically checks the idle time of nexthop buckets to > determine if they are idle and can be re-populated with a new nexthop. > > When the resilient nexthop group is offloaded to hardware, the kernel > will not see activity on nexthop buckets unless it is reported from > hardware. > > Add a function that can be periodically called by device drivers to > report activity on nexthop buckets after querying it from the underlying > device. > > Signed-off-by: Ido Schimmel > Reviewed-by: Petr Machata > Signed-off-by: Petr Machata > --- > > Notes: > v1 (changes since RFC): > - u32 -> u16 for bucket counts / indices > > include/net/nexthop.h | 2 ++ > net/ipv4/nexthop.c| 35 +++++++ > 2 files changed, 37 insertions(+) > Reviewed-by: David Ahern
Re: [PATCH net-next 08/14] nexthop: Allow setting "offload" and "trap" indication of nexthop buckets
On 3/10/21 8:02 AM, Petr Machata wrote: > From: Ido Schimmel > > Add a function that can be called by device drivers to set "offload" or > "trap" indication on nexthop buckets following nexthop notifications and > other changes such as a neighbour becoming invalid. > > Signed-off-by: Ido Schimmel > Reviewed-by: Petr Machata > Signed-off-by: Petr Machata > --- > > Notes: > v1 (changes since RFC): > - u32 -> u16 for bucket counts / indices > > include/net/nexthop.h | 2 ++ > net/ipv4/nexthop.c| 34 ++++++ > 2 files changed, 36 insertions(+) > Reviewed-by: David Ahern
Re: [PATCH net-next 07/14] nexthop: Implement notifiers for resilient nexthop groups
On 3/10/21 8:02 AM, Petr Machata wrote: > Implement the following notifications towards drivers: > > - NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created. > > - NEXTHOP_EVENT_BUCKET_REPLACE any time there is a change in assignment of > next hops to hash table buckets. That includes replacements, deletions, > and delayed upkeep cycles. Some bucket notifications can be vetoed by the > driver, to make it possible to propagate bucket busy-ness flags from the > HW back to the algorithm. Some are however forced, e.g. if a next hop is > deleted, all buckets that use this next hop simply must be migrated, > whether the HW wishes so or not. > > - NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is > replaced. Usually the driver will get the bucket notifications as well, > and could veto those. But in some cases, a bucket may not be migrated > immediately, but during delayed upkeep, and that is too late to roll the > transaction back. This notification allows the driver to take a look and > veto the new proposed group up front, before anything is committed. > > Signed-off-by: Petr Machata > Reviewed-by: Ido Schimmel > --- > > Notes: > v1 (changes since RFC): > - u32 -> u16 for bucket counts / indices > Reviewed-by: David Ahern
Re: [PATCH net-next 06/14] nexthop: Add data structures for resilient group notifications
On 3/10/21 8:02 AM, Petr Machata wrote: > From: Ido Schimmel > > Add data structures that will be used for in-kernel notifications about > addition / deletion of a resilient nexthop group and about changes to a > hash bucket within a resilient group. > > Signed-off-by: Ido Schimmel > Reviewed-by: Petr Machata > Signed-off-by: Petr Machata > --- > > Notes: > v1 (changes since RFC): > - u32 -> u16 for bucket counts / indices > Reviewed-by: David Ahern
Re: [PATCH net-next 03/14] nexthop: Add a dedicated flag for multipath next-hop groups
On 3/11/21 8:39 AM, Petr Machata wrote: > > David Ahern writes: > >>> diff --git a/include/net/nexthop.h b/include/net/nexthop.h >>> index 7bc057aee40b..5062c2c08e2b 100644 >>> --- a/include/net/nexthop.h >>> +++ b/include/net/nexthop.h >>> @@ -80,6 +80,7 @@ struct nh_grp_entry { >>> struct nh_group { >>> struct nh_group *spare; /* spare group for removals */ >>> u16 num_nh; >>> + boolis_multipath; >>> boolmpath; >> >> >> It would be good to rename the existing type 'mpath' to something else. >> You have 'resilient' as a group type later, so maybe rename this one to >> hash or hash_threshold. > > All right, I'll send a follow-up with that. > I'm fine with the rename being a followup after this patch set or as the last patch in this set.
Re: [PATCH net-next 04/14] nexthop: Add netlink defines and enumerators for resilient NH groups
On 3/11/21 8:45 AM, Petr Machata wrote: > > David Ahern writes: > >> On 3/10/21 8:02 AM, Petr Machata wrote: >>> diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h >>> index 2d4a1e784cf0..8efebf3cb9c7 100644 >>> --- a/include/uapi/linux/nexthop.h >>> +++ b/include/uapi/linux/nexthop.h >>> @@ -22,6 +22,7 @@ struct nexthop_grp { >>> >>> enum { >>> NEXTHOP_GRP_TYPE_MPATH, /* default type if not specified */ >> >> Update the above comment that it is for legacy, hash based multipath. > > Maybe this would make sense? > > NEXTHOP_GRP_TYPE_MPATH, /* hash-threshold nexthop group */ > yes, the description is fine. keep the comment about 'default type'.
Re: [PATCH net-next 05/14] nexthop: Add implementation of resilient next-hop groups
during upkeep. The > only times that the hash table is invalid is the very beginning and very > end of its lifetime. Between those points, it is always kept valid. > > This patch introduces the core support code itself. It does not handle > notifications towards drivers, which are kept as if the group were an mpath > one. It does not handle netlink either. The only bit currently exposed to > user space is the new next-hop group type, and that is currently bounced. > There is therefore no way to actually access this code. > > Signed-off-by: Petr Machata > Reviewed-by: Ido Schimmel > --- > Thanks for the detailed documentation around exclusion expectations. Reviewed-by: David Ahern
Re: [PATCH net-next 04/14] nexthop: Add netlink defines and enumerators for resilient NH groups
On 3/10/21 8:02 AM, Petr Machata wrote: > diff --git a/include/uapi/linux/nexthop.h b/include/uapi/linux/nexthop.h > index 2d4a1e784cf0..8efebf3cb9c7 100644 > --- a/include/uapi/linux/nexthop.h > +++ b/include/uapi/linux/nexthop.h > @@ -22,6 +22,7 @@ struct nexthop_grp { > > enum { > NEXTHOP_GRP_TYPE_MPATH, /* default type if not specified */ Update the above comment that it is for legacy, hash based multipath. > + NEXTHOP_GRP_TYPE_RES,/* resilient nexthop group */ > __NEXTHOP_GRP_TYPE_MAX, > }; > besides the request for an updated comment, this looks fine: Reviewed-by: David Ahern
Re: [PATCH net-next 03/14] nexthop: Add a dedicated flag for multipath next-hop groups
On 3/10/21 8:02 AM, Petr Machata wrote: > With the introduction of resilient nexthop groups, there will be two types > of multipath groups: the current hash-threshold "mpath" ones, and resilient > groups. Both are multipath, but to determine the fact, the system needs to > consider two flags. This might prove costly in the datapath. Therefore, > introduce a new flag, that should be set for next-hop groups that have more > than one nexthop, and should be considered multipath. > > Signed-off-by: Petr Machata > Reviewed-by: Ido Schimmel > --- > > Notes: > v1 (changes since RFC): > - This patch is new > > include/net/nexthop.h | 7 --- > net/ipv4/nexthop.c| 5 - > 2 files changed, 8 insertions(+), 4 deletions(-) This patch looks good: Reviewed-by: David Ahern > > diff --git a/include/net/nexthop.h b/include/net/nexthop.h > index 7bc057aee40b..5062c2c08e2b 100644 > --- a/include/net/nexthop.h > +++ b/include/net/nexthop.h > @@ -80,6 +80,7 @@ struct nh_grp_entry { > struct nh_group { > struct nh_group *spare; /* spare group for removals */ > u16 num_nh; > + boolis_multipath; > boolmpath; It would be good to rename the existing type 'mpath' to something else. You have 'resilient' as a group type later, so maybe rename this one to hash or hash_threshold.
Re: [PATCH net-next 02/14] nexthop: __nh_notifier_single_info_init(): Make nh_info an argument
On 3/10/21 8:02 AM, Petr Machata wrote: > The cited function currently uses rtnl_dereference() to get nh_info from a > handed-in nexthop. However, under the resilient hashing scheme, this > function will not always be called under RTNL, sometimes the mutual > exclusion will be achieved differently. Therefore move the nh_info > extraction from the function to its callers to make it possible to use a > different synchronization guarantee. > > Signed-off-by: Petr Machata > Reviewed-by: Ido Schimmel > --- > net/ipv4/nexthop.c | 12 +++- > 1 file changed, 7 insertions(+), 5 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH net-next 01/14] nexthop: Pass nh_config to replace_nexthop()
On 3/10/21 8:02 AM, Petr Machata wrote: > Currently, replace assumes that the new group that is given is a > fully-formed object. But mpath groups really only have one attribute, and > that is the constituent next hop configuration. This may not be universally > true. From the usability perspective, it is desirable to allow the replace > operation to adjust just the constituent next hop configuration and leave > the group attributes as such intact. > > But the object that keeps track of whether an attribute was or was not > given is the nh_config object, not the next hop or next-hop group. To allow > (selective) attribute updates during NH group replacement, propagate `cfg' > to replace_nexthop() and further to replace_nexthop_grp(). > > Signed-off-by: Petr Machata > Reviewed-by: Ido Schimmel > --- > net/ipv4/nexthop.c | 9 + > 1 file changed, 5 insertions(+), 4 deletions(-) > Reviewed-by: David Ahern
Re: [PATCH net-next] net: ipv6: addrconf: Add accept_ra_prefix_route.
On 3/10/21 11:49 AM, Subash Abhinov Kasiviswanathan wrote: > Added new procfs flag to toggle the automatic addition of prefix > routes on a per device basis. The new flag is accept_ra_prefix_route. > > A value of 0 for the flag maybe used in some forwarding scenarios > when a userspace daemon is managing the routing. > Manual deletion of the kernel installed route was not sufficient as > kernel was adding back the route. > > Defaults to 1 as to not break existing behavior. > > Signed-off-by: Subash Abhinov Kasiviswanathan > --- > Documentation/networking/ip-sysctl.rst | 10 ++ > include/linux/ipv6.h | 1 + > include/uapi/linux/ipv6.h | 1 + > net/ipv6/addrconf.c| 16 +--- > 4 files changed, 25 insertions(+), 3 deletions(-) > > diff --git a/Documentation/networking/ip-sysctl.rst > b/Documentation/networking/ip-sysctl.rst > index c7952ac..9f0d92d 100644 > --- a/Documentation/networking/ip-sysctl.rst > +++ b/Documentation/networking/ip-sysctl.rst > @@ -2022,6 +2022,16 @@ accept_ra_mtu - BOOLEAN > - enabled if accept_ra is enabled. > - disabled if accept_ra is disabled. > > +accept_ra_prefix_route - BOOLEAN > + Apply the prefix route based on the RA. If disabled, kernel > + does not install the route. This can be used if a userspace > + daemon is managing the routing. > + > + Functional default: > + > + - enabled if accept_ra_prefix_route is enabled > + - disabled if accept_ra_prefix_route is disabled > + > accept_redirects - BOOLEAN > Accept Redirects. > this seems to duplicate accept_ra_pinfo
Re: [PATCH net v2] ipv6: fix suspecious RCU usage warning
On 3/9/21 7:20 PM, Wei Wang wrote: > Syzbot reported the suspecious RCU usage in nexthop_fib6_nh() when > called from ipv6_route_seq_show(). The reason is ipv6_route_seq_start() > calls rcu_read_lock_bh(), while nexthop_fib6_nh() calls > rcu_dereference_rtnl(). > The fix proposed is to add a variant of nexthop_fib6_nh() to use > rcu_dereference_bh_rtnl() for ipv6_route_seq_show(). > ... > > Fixes: f88d8ea67fbdb ("ipv6: Plumb support for nexthop object in a fib6_info") > Reported-by: syzbot > Signed-off-by: Wei Wang > Cc: David Ahern > Cc: Ido Schimmel > Cc: Petr Machata > Cc: Eric Dumazet > --- > include/net/nexthop.h | 24 > net/ipv6/ip6_fib.c| 2 +- > 2 files changed, 25 insertions(+), 1 deletion(-) > Reviewed-by: David Ahern Thanks, Wei.
Re: [PATCH] net: add net namespace inode for all net_dev events
On 3/9/21 1:02 PM, Steven Rostedt wrote: > On Tue, 9 Mar 2021 12:53:37 -0700 > David Ahern wrote: > >> Changing the order of the fields will impact any bpf programs expecting >> the existing format > > I thought bpf programs were not API. And why are they not parsing this > information? They have these offsets hard coded Why would they do that! > The information to extract the data where ever it is has been there from > day 1! Way before BPF ever had access to trace events. BPF programs attached to a tracepoint are passed a context - a structure based on the format for the tracepoint. To take an in-tree example, look at samples/bpf/offwaketime_kern.c: ... /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */ struct sched_switch_args { unsigned long long pad; char prev_comm[16]; int prev_pid; int prev_prio; long long prev_state; char next_comm[16]; int next_pid; int next_prio; }; SEC("tracepoint/sched/sched_switch") int oncpu(struct sched_switch_args *ctx) { ... Production systems do not typically have toolchains installed, so dynamic generation of the program based on the 'format' file on the running system is not realistic. That means creating the programs on a development machine and installing on the production box. Further, there is an expectation that a bpf program compiled against version X works on version Y. Changing the order of the fields will break such programs in non-obvious ways.
Re: [PATCH] net: add net namespace inode for all net_dev events
On 3/9/21 10:40 AM, Steven Rostedt wrote: > The order of the fields is important. Don't worry about breaking API by > fixing it. The parsing code uses this output to find where the binary data > is. Changing the order of the fields will impact any bpf programs expecting the existing format.
Re: [PATCH net] ipv6: fix suspecious RCU usage warning
On 3/9/21 10:32 AM, Wei Wang wrote: > Thanks David and Ido. > To clarify, David, you suggest we add a separate function instead of > adding an extra parameter, right? for this case I think it is the better way to go.
Re: [PATCH net v3 2/2] net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0
On 3/9/21 4:31 AM, Balazs Nemeth wrote: > A packet with skb_inner_network_header(skb) == skb_network_header(skb) > and ETH_P_MPLS_UC will prevent mpls_gso_segment from pulling any headers > from the packet. Subsequently, the call to skb_mac_gso_segment will > again call mpls_gso_segment with the same packet leading to an infinite > loop. In addition, ensure that the header length is a multiple of four, > which should hold irrespective of the number of stacked labels. > > Signed-off-by: Balazs Nemeth > --- > net/mpls/mpls_gso.c | 3 +++ > 1 file changed, 3 insertions(+) > Reviewed-by: David Ahern
Re: [PATCH net] ipv6: fix suspecious RCU usage warning
[ cc Ido and Petr ] On 3/8/21 12:21 PM, Wei Wang wrote: > diff --git a/include/net/nexthop.h b/include/net/nexthop.h > index 7bc057aee40b..48956b144689 100644 > --- a/include/net/nexthop.h > +++ b/include/net/nexthop.h > @@ -410,31 +410,39 @@ static inline struct fib_nh *fib_info_nh(struct > fib_info *fi, int nhsel) > int fib6_check_nexthop(struct nexthop *nh, struct fib6_config *cfg, > struct netlink_ext_ack *extack); > > -static inline struct fib6_nh *nexthop_fib6_nh(struct nexthop *nh) > +static inline struct fib6_nh *nexthop_fib6_nh(struct nexthop *nh, > + bool bh_disabled) Hi Wei: I would prefer not to have a second argument to nexthop_fib6_nh for 1 code path, and a control path at that. > { > struct nh_info *nhi; > > if (nh->is_group) { > struct nh_group *nh_grp; > > - nh_grp = rcu_dereference_rtnl(nh->nh_grp); > + if (bh_disabled) > + nh_grp = rcu_dereference_bh_rtnl(nh->nh_grp); > + else > + nh_grp = rcu_dereference_rtnl(nh->nh_grp); > nh = nexthop_mpath_select(nh_grp, 0); > if (!nh) > return NULL; > } > > - nhi = rcu_dereference_rtnl(nh->nh_info); > + if (bh_disabled) > + nhi = rcu_dereference_bh_rtnl(nh->nh_info); > + else > + nhi = rcu_dereference_rtnl(nh->nh_info); > if (nhi->family == AF_INET6) > return &nhi->fib6_nh; > > return NULL; > } > I am wary of duplicating code, but this helper is simple enough that it should be ok with proper documentation. Ido/Petr: I think your resilient hashing patch set touches this helper. How ugly does it get to have a second version?
Re: [PATCH v2 2/2] net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0
On 3/8/21 9:26 AM, Balazs Nemeth wrote: > On Mon, 2021-03-08 at 09:17 -0700, David Ahern wrote: >> On 3/8/21 9:07 AM, Willem de Bruijn wrote: >>>> diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c >>>> index b1690149b6fa..cc1b6457fc93 100644 >>>> --- a/net/mpls/mpls_gso.c >>>> +++ b/net/mpls/mpls_gso.c >>>> @@ -27,7 +27,7 @@ static struct sk_buff *mpls_gso_segment(struct >>>> sk_buff *skb, >>>> >>>> skb_reset_network_header(skb); >>>> mpls_hlen = skb_inner_network_header(skb) - >>>> skb_network_header(skb); >>>> - if (unlikely(!pskb_may_pull(skb, mpls_hlen))) >>>> + if (unlikely(!mpls_hlen || !pskb_may_pull(skb, >>>> mpls_hlen))) >>>> goto out; >>> >>> Good cathc. Besides length zero, this can be more strict: a label >>> is >>> 4B, so mpls_hlen needs to be >= 4B. >>> >>> Perhaps even aligned to 4B, too, but not if there may be other >>> encap on top. >>> >>> Unfortunately there is no struct or type definition that we can use >>> a >>> sizeof instead of open coding the raw constant. >>> >> >> MPLS_HLEN can be used here. >> > > What about sizeof(struct mpls_label), like in net/ipv4/tunnel4.c? > I was thinking MPLS_HLEN because of its consistent use with skb manipulations. net/mpls code uses mpls_shim_hdr over mpls_label. Looks like the MPLS code could use some cleanups to make this consistent.
Re: [PATCH v2 2/2] net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0
On 3/8/21 9:07 AM, Willem de Bruijn wrote: >> diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c >> index b1690149b6fa..cc1b6457fc93 100644 >> --- a/net/mpls/mpls_gso.c >> +++ b/net/mpls/mpls_gso.c >> @@ -27,7 +27,7 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff >> *skb, >> >> skb_reset_network_header(skb); >> mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb); >> - if (unlikely(!pskb_may_pull(skb, mpls_hlen))) >> + if (unlikely(!mpls_hlen || !pskb_may_pull(skb, mpls_hlen))) >> goto out; > > Good cathc. Besides length zero, this can be more strict: a label is > 4B, so mpls_hlen needs to be >= 4B. > > Perhaps even aligned to 4B, too, but not if there may be other encap on top. > > Unfortunately there is no struct or type definition that we can use a > sizeof instead of open coding the raw constant. > MPLS_HLEN can be used here.
Re: mlx5 sub function issue
On 3/8/21 12:21 AM, ze wang wrote: > mlxconfig tool from mft tools version 4.16.52 or higher to set number of SF. > > mlxconfig -d b3:00.0 PF_BAR2_ENABLE=0 PER_PF_NUM_SF=1 PF_SF_BAR_SIZE=8 > mlxconfig -d b3:00.0 PER_PF_NUM_SF=1 PF_TOTAL_SF=192 PF_SF_BAR_SIZE=8 > mlxconfig -d b3:00.1 PER_PF_NUM_SF=1 PF_TOTAL_SF=192 PF_SF_BAR_SIZE=8 > > Cold reboot power cycle of the system as this changes the BAR size in device > Is that capability going to be added to devlink?
Re: VRF leaking doesn't work
On 3/2/21 3:57 AM, Greesha Mikhalkin wrote: > Main goal is that 100.255.254.3 should be reachable from vrf2. But > after this setup it doesn’t work. When i run `ping -I vrf2 > 100.255.254.3` it sends packets from source address that belongs to > vlan1 enslaved by vrf1. I can see in tcpdump that ICMP packets are > sent and then returned to source address but they're not returned to > ping command for some reason. To be clear `ping -I vrf1 …` works fine. I remember this case now: VRF route leaking works for fowarding, but not local traffic. If a packet arrives in vrf2, it should get forwarded to vrf1 and on to its destination. If the reverse route exists then round trip traffic works.
Re: VRF leaking doesn't work
On 3/2/21 3:57 AM, Greesha Mikhalkin wrote: > Hi. I need a help to understand why VRF leaking doesn’t work in my situation. > I want to set up leaking between 2 VRFs, that are set up by following > commands: > > # Setup bridge > sudo ip link add bridge type bridge > > # Setup VLANs > ip link add link bridge name vlan1 type vlan id 1 > ip link add link bridge name vlan2 type vlan id 2 > ip addr add 10.0.0.31/32 dev vlan1 > ip addr add 10.0.0.32/32 dev vlan2 > ip link set vlan1 up > ip link set vlan2 up > > # Setup VXLANs > ip link add vni1 type vxlan id 1 local 10.1.0.1 dev lan1 srcport > 0 0 dstport 4789 nolearning > ip link add vni2 type vxlan id 2 local 10.1.0.1 dev lan1 srcport > 0 0 dstport 4789 nolearning > ip link set vni1 master bridge > ip link set vni2 master bridge > bridge vlan add dev vni1 vid 1 pvid untagged > bridge vlan add dev vni2 vid 2 pvid untagged > ip link set vni1 up > ip link set vni2 up > > # Setup VRFs > ip link add vrf1 type vrf table 1000 > ip link set dev vrf1 up > ip link add vrf2 type vrf table 1001 > ip link set dev vrf2 up > > Setting routes: > > # Unreachable default routes > ip route add table 1000 unreachable default metric 4278198272 > ip route add table 1001 unreachable default metric 4278198272 > > # Nexthop > ip route add table 1000 100.255.254.3 proto bgp metric 20 > nexthop via 10.0.0.11 dev vlan1 weight 1 onlink > > I'm trying to setup VRF leaking in following way: > > ip r a vrf vrf2 100.255.254.3/32 dev vrf1 > ip r a vrf vrf2 10.0.0.31/32 dev vrf1 > ip r a vrf vrf1 10.0.0.32/32 dev vrf2 > > Main goal is that 100.255.254.3 should be reachable from vrf2. But > after this setup it doesn’t work. When i run `ping -I vrf2 > 100.255.254.3` it sends packets from source address that belongs to > vlan1 enslaved by vrf1. I can see in tcpdump that ICMP packets are > sent and then returned to source address but they're not returned to > ping command for some reason. To be clear `ping -I vrf1 …` works fine. > What kernel version? If you have not tried 5.10 or 5.11, please do.
Re: [PATCH iproute2-next 0/4] devlink: Use utils helpers
On 3/1/21 3:56 AM, Parav Pandit wrote: > This series uses utils helper for socket operations, string > processing and print error messages. > > Patch summary: > Patch-1 uses utils library helper for string split and string search > Patch-2 extends library for socket receive operation > Patch-3 converts devlink to use socket helpers from utlis library > Patch-4 print error when user provides invalid flavour or state > > Parav Pandit (4): > devlink: Use library provided string processing APIs > utils: Introduce helper routines for generic socket recv > devlink: Use generic socket helpers from library > devlink: Add error print when unknown values specified > > devlink/devlink.c | 365 > devlink/mnlg.c | 121 ++- > devlink/mnlg.h | 13 +- > include/mnl_utils.h | 6 + > lib/mnl_utils.c | 25 ++- > 5 files changed, 203 insertions(+), 327 deletions(-) > applied to iproute2-next. Thanks, Parav
Re: [PATCH] ipv6:delete duplicate code for reserved iid check
On 3/2/21 5:53 PM, zhang kai wrote: > Using the ipv6_reserved_interfaceid for interface id checking. > > Signed-off-by: zhang kai > --- > net/ipv6/addrconf.c | 45 +++-- > 1 file changed, 19 insertions(+), 26 deletions(-) > Looks equivalent to me. Code wise: Reviewed-by: David Ahern The commit message needs more words about the change.
Re: [PATCH] net:ipv4: Packet is not forwarded if bc_forwarding not configured on ingress interface
On 2/28/21 5:53 PM, Henry Shen wrote: > When an IPv4 packet with a destination address of broadcast is received > on an ingress interface, it will not be forwarded out of the egress > interface if the ingress interface is not configured with bc_forwarding > but the egress interface is. If both the ingress and egress interfaces > are configured with bc_forwarding, the packet can be forwarded > successfully. > > This patch is to be inline with Cisco's implementation that packet can be > forwarded if ingress interface is NOT configured with bc_forwarding, > but egress interface is. > In Linux, forwarding decisions are made based on the ingress device, not the egress device.
Re: [PATCH iproute2-next] mptcp: add support for port based endpoint
On 2/19/21 1:42 PM, Paolo Abeni wrote: > The feature is supported by the kernel since 5.11-net-next, > let's allow user-space to use it. > > Just parse and dump an additional, per endpoint, u16 attribute > > Signed-off-by: Paolo Abeni > --- > ip/ipmptcp.c| 16 ++-- > man/man8/ip-mptcp.8 | 8 > 2 files changed, 22 insertions(+), 2 deletions(-) > applied to iproute2-next. Thanks
Re: [RFC PATCH net 2/2] selftests: fib_nexthops: Test blackhole nexthops when loopback goes down
On 2/28/21 7:26 AM, Ido Schimmel wrote: > From: Ido Schimmel > > Test that blackhole nexthops are not flushed when the loopback device > goes down. > > > Signed-off-by: Ido Schimmel > --- > tools/testing/selftests/net/fib_nexthops.sh | 8 > 1 file changed, 8 insertions(+) > > diff --git a/tools/testing/selftests/net/fib_nexthops.sh > b/tools/testing/selftests/net/fib_nexthops.sh > index 4c7d33618437..d98fb85e201c 100755 > --- a/tools/testing/selftests/net/fib_nexthops.sh > +++ b/tools/testing/selftests/net/fib_nexthops.sh > @@ -1524,6 +1524,14 @@ basic() > run_cmd "$IP nexthop replace id 2 blackhole dev veth1" > log_test $? 2 "Blackhole nexthop with other attributes" > > + # blackhole nexthop should not be affected by the state of the loopback > + # device > + run_cmd "$IP link set dev lo down" > + check_nexthop "id 2" "id 2 blackhole" > + log_test $? 0 "Blackhole nexthop with loopback device down" > + > + run_cmd "$IP link set dev lo up" > + > # > # groups > # > Thanks for adding a test. Reviewed-by: David Ahern
Re: [RFC PATCH net 1/2] nexthop: Do not flush blackhole nexthops when loopback goes down
On 2/28/21 7:26 AM, Ido Schimmel wrote: > From: Ido Schimmel > > As far as user space is concerned, blackhole nexthops do not have a > nexthop device and therefore should not be affected by the > administrative or carrier state of any netdev. > > However, when the loopback netdev goes down all the blackhole nexthops > are flushed. This happens because internally the kernel associates > blackhole nexthops with the loopback netdev. That was not intended, so definitely a bug. > > This behavior is both confusing to those not familiar with kernel > internals and also diverges from the legacy API where blackhole IPv4 > routes are not flushed when the loopback netdev goes down: > > # ip route add blackhole 198.51.100.0/24 > # ip link set dev lo down > # ip route show 198.51.100.0/24 > blackhole 198.51.100.0/24 > > Blackhole IPv6 routes are flushed, but at least user space knows that > they are associated with the loopback netdev: > > # ip -6 route show 2001:db8:1::/64 > blackhole 2001:db8:1::/64 dev lo metric 1024 pref medium > > Fix this by only flushing blackhole nexthops when the loopback netdev is > unregistered. > > Fixes: ab84be7e54fc ("net: Initial nexthop code") > Signed-off-by: Ido Schimmel > Reported-by: Donald Sharp > --- > net/ipv4/nexthop.c | 10 +++--- > 1 file changed, 7 insertions(+), 3 deletions(-) > > diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c > index f1c6cbdb9e43..743777bce179 100644 > --- a/net/ipv4/nexthop.c > +++ b/net/ipv4/nexthop.c > @@ -1399,7 +1399,7 @@ static int insert_nexthop(struct net *net, struct > nexthop *new_nh, > > /* rtnl */ > /* remove all nexthops tied to a device being deleted */ > -static void nexthop_flush_dev(struct net_device *dev) > +static void nexthop_flush_dev(struct net_device *dev, unsigned long event) > { > unsigned int hash = nh_dev_hashfn(dev->ifindex); > struct net *net = dev_net(dev); > @@ -1411,6 +1411,10 @@ static void nexthop_flush_dev(struct net_device *dev) > if (nhi->fib_nhc.nhc_dev != dev) > continue; > > + if (nhi->reject_nh && > + (event == NETDEV_DOWN || event == NETDEV_CHANGE)) > + continue; > + > remove_nexthop(net, nhi->nh_parent, NULL); > } > } > @@ -2189,11 +2193,11 @@ static int nh_netdev_event(struct notifier_block > *this, > switch (event) { > case NETDEV_DOWN: > case NETDEV_UNREGISTER: > - nexthop_flush_dev(dev); > + nexthop_flush_dev(dev, event); > break; > case NETDEV_CHANGE: > if (!(dev_get_flags(dev) & (IFF_RUNNING | IFF_LOWER_UP))) > - nexthop_flush_dev(dev); > + nexthop_flush_dev(dev, event); > break; > case NETDEV_CHANGEMTU: > info_ext = ptr; > LGTM. I suggest submitting without the RFC. Reviewed-by: David Ahern
Re: [PATCH] ipv6: Honor route mtu if it is within limit of dev mtu
On 2/22/21 9:32 AM, Kaustubh Pandey wrote: > When netdevice MTU is increased via sysfs, NETDEV_CHANGEMTU is raised. > > addrconf_notify -> rt6_mtu_change -> rt6_mtu_change_route -> > fib6_nh_mtu_change > > As part of handling NETDEV_CHANGEMTU notification we land up on a > condition where if route mtu is less than dev mtu and route mtu equals > ipv6_devconf mtu, route mtu gets updated. > > Due to this v6 traffic end up using wrong MTU then configured earlier. > This commit fixes this by removing comparison with ipv6_devconf > and updating route mtu only when it is greater than incoming dev mtu. > > This can be easily reproduced with below script: > pre-condition: > device up(mtu = 1500) and route mtu for both v4 and v6 is 1500 > > test-script: > ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1400 > ip -6 route change 2001::/64 dev eth0 metric 256 mtu 1400 > echo 1400 > /sys/class/net/eth0/mtu > ip route change 192.168.0.0/24 dev eth0 src 192.168.0.1 mtu 1500 > echo 1500 > /sys/class/net/eth0/mtu > > Signed-off-by: Kaustubh Pandey > --- > net/ipv6/route.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c > index 1536f49..653b6c7 100644 > --- a/net/ipv6/route.c > +++ b/net/ipv6/route.c > @@ -4813,8 +4813,7 @@ static int fib6_nh_mtu_change(struct fib6_nh *nh, void > *_arg) > struct inet6_dev *idev = __in6_dev_get(arg->dev); > u32 mtu = f6i->fib6_pmtu; > > - if (mtu >= arg->mtu || > - (mtu < arg->mtu && mtu == idev->cnf.mtu6)) > + if (mtu >= arg->mtu) > fib6_metric_set(f6i, RTAX_MTU, arg->mtu); > > spin_lock_bh(&rt6_exception_lock); > The existing logic mirrors what is done for exceptions, see rt6_mtu_change_route_allowed and commit e9fa1495d738. It seems right to me to drop the mtu == idev->cnf.mtu6 comparison in which case the exceptions should do the same. Added author of e9fa1495d738 in case I am overlooking something. Test case should be added to tools/testing/selftests/net/pmtu.sh, and did you run that script with the proposed change?
Re: [PATCH] arp: Remove the arp_hh_ops structure
On 2/22/21 1:37 AM, Eric Dumazet wrote: > > > On 2/22/21 4:15 AM, Yejune Deng wrote: >> The arp_hh_ops structure is similar to the arp_generic_ops structure. >> but the latter is more general,so remove the arp_hh_ops structure. >> >> Fix when took out the neigh->ops assignment: >> 8.973653] #PF: supervisor read access in kernel mode >> [8.975027] #PF: error_code(0x) - not-present page >> [8.976310] PGD 0 P4D 0 >> [8.977036] Oops: [#1] SMP PTI >> [8.977973] CPU: 1 PID: 210 Comm: sd-resolve Not tainted >> 5.11.0-rc7-02046-g4591591ab715 #1 >> [8.979998] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS >> 1.12.0-1 04/01/2014 >> [8.981996] RIP: 0010:neigh_probe >> (kbuild/src/consumer/net/core/neighbour.c:1009) >> > > I have a hard time understanding this patch submission. > > This seems a mix of a net-next and net material ? It is net-next at best. > > > >> Reported-by: kernel test robot > > If this is a bug fix, we want a Fixes: tag > > This will really help us. Please don't let us guess what is going on. > This patch is a v2. v1 was clearly wrong and not tested; I responded as such 12 hours before the robot test.
Re: [PATCH] arp: Remove the arp_hh_ops structure
On 2/19/21 9:32 PM, Yejune Deng wrote: > static const struct neigh_ops arp_direct_ops = { > .family = AF_INET, > .output = neigh_direct_output, > @@ -277,15 +269,10 @@ static int arp_constructor(struct neighbour *neigh) > memcpy(neigh->ha, dev->broadcast, dev->addr_len); > } > > - if (dev->header_ops->cache) > - neigh->ops = &arp_hh_ops; > - else > - neigh->ops = &arp_generic_ops; How did you test this? you took out the neigh->ops assignment, so all of the neigh->ops in net/core/neighbour.c are going to cause a NULL dereference. > - > - if (neigh->nud_state & NUD_VALID) > - neigh->output = neigh->ops->connected_output; > + if (!dev->header_ops->cache && (neigh->nud_state & NUD_VALID)) > + neigh->output = arp_generic_ops.connected_output; > else > - neigh->output = neigh->ops->output; > + neigh->output = arp_generic_ops.output; > } > return 0; > } >
Re: null terminating of IFLA_INFO_KIND/IFLA_IFNAME
On 2/17/21 9:06 AM, Муравьев Александр wrote: > Hi > > A noob question that I haven't found an answer. > > Just wanted to clarify a piece of iproute2 code. > > ip/iplink.c: > >> 1058 addattr_l(&req.n, sizeof(req), IFLA_INFO_KIND, type, >> 1059 strlen(type)); > > also ip/iplink.c: > >> 1115 addattr_l(&req.n, sizeof(req), >> 1116 !check_ifname(name) ? IFLA_IFNAME : IFLA_ALT_IFNAME, >> 1117 name, strlen(name) + 1); > My question is why we skip terminating null character for IFLA_INFO_KIND > (the first case) and don't skip it for IFLA_IFNAME (the second case)? I > mean "strlen(type)" and "strlen(name) + 1". > I think it is just different coders at different points in time. Kernel side both use nla_strscpy which handles the string terminator (or missing terminator).
Re: [PATCH v4 net-next 07/21] nvme-tcp: Add DDP data-path
On 2/17/21 7:01 AM, Or Gerlitz wrote: >>> @@ -1136,6 +1265,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct >>> nvme_tcp_request *req) >>> else >>> flags |= MSG_EOR; >>> >>> + if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags) && >>> + blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ) >>> + nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq); >>> + >> >> For consistency, shouldn't this be wrapped in the CONFIG_TCP_DDP check too? > > We tried to avoid the wrapping in some places where it was > possible to do without adding confusion, this one is a good > example IMOH. > > The above (and other locations like it) can easily be put into a helper that has logic when the CONFIG is enabled and compiles out when not. Consistency makes for simpler, cleaner code for optional features.
Re: [PATCH iproute2-rc] rdma: Fix statistics bind/unbing argument handling
On 2/15/21 11:16 PM, Leon Romanovsky wrote: > On Mon, Feb 15, 2021 at 06:56:26PM -0700, David Ahern wrote: >> On 2/14/21 10:40 PM, Leon Romanovsky wrote: >>> On Sun, Feb 14, 2021 at 08:26:16PM -0700, David Ahern wrote: >>>> what does iproute2-rc mean? >>> >>> Patch target is iproute2.git: >>> https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/ >> >> so you are asking them to be committed for the 5.11 release? > > This is a Fix to an existing issue (not theoretical one), so I was under > impression that it should go to -rc repo and not to -next. It is assigned to Stephen for iproute2. > > Personally, I don't care to which repo will this fix be applied as long > as it is applied to one of the two iproute2 official repos. > > Do you have clear guidance when should I send patches to > iproute2-rc/iproute2-next? > It's the rc label that needs to be dropped: iproute2 or iproute2-next. Just like there is net and net-next.