Re: [PATCH bpf-next] libbpf: Add libbpf_version() function to get library version at runtime
On Wed, 18 Nov 2020 09:43:25 -0800, Alexei Starovoitov wrote: > Just like the kernel doesn't add features for out-of-tree modules > libbpf doesn't add features for projects where libbpf is optional. A more fitting comparison would be the kernel refusing to add a new uAPI call because some application refuses to bundle the kernel. A libbpf equivalent of a kernel module would be some kind of libbpf plugin (which does not exist), asking for an internal libbpf API to be added. Alexei, could you please start cooperating with others and actually listening to others' needs? I know you started eBPF but you're not the only user anymore and as much convinced you may be about your view, people have reasons for what they're doing. It would help greatly if you could listen to these reasons. Jiri
Re: [PATCHv3 iproute2-next 0/5] iproute2: add libbpf support
On Thu, 5 Nov 2020 12:45:39 -0800, Andrii Nakryiko wrote: > That's not true. If you need new functionality like BTF, CO-RE, > function-by-function verification, etc., then yes, you have to update > kernel, compiler, libbpf, sometimes pahole. But if you have an BPF > application that doesn't use and need any of the newer features, it > will keep working just fine with the old kernel, old libbpf, and old > compiler. I'm fine with this. It doesn't work that well in practice, we've found ourselves chasing problems caused by llvm update (problems for older bpf programs, not new ones), problems on non-x86_64 caused by kernel updates, etc. It can be attributed to living on the edge and it should stabilize over time, hopefully. But it's still what the users are experiencing and it's probably what David is referring to. I expect it to smooth itself over time. Add to that the fact that something that is in fact a new feature is perceived as a bug fix by some users. For example, a perfectly valid and simple C program, not using anything shiny but a basic simple loop, compiles just fine but is rejected by the kernel. A newer kernel and a newer compiler and a newer libbpf and a newer pahole will cause the same program to be accepted. Now, the user does not see that for this, a new load of BTF functionality had to be added and all those mentioned projects enhanced with substantial code. All they see is their simple hello world test program did not work and now it does. I'm not saying I have a solution nor I'm saying you should do something about it. Just trying to explain the perception. Jiri
Re: [PATCHv3 iproute2-next 0/5] iproute2: add libbpf support
On Thu, 5 Nov 2020 12:19:00 -0800, Andrii Nakryiko wrote: > I'll just quote myself here for your convenience. Sorry, I missed your original email for some reason. > Submodule is a way that I know of to make this better for end users. > If there are other ways to pull this off with shared library use, I'm > all for it, it will save the security angle that distros are arguing > for. E.g., if distributions will always have the latest libbpf > available almost as soon as it's cut upstream *and* new iproute2 > versions enforce the latest libbpf when they are packaged/released, > then this might work equivalently for end users. If Linux distros > would be willing to do this faithfully and promptly, I have no > objections whatsoever. Because all that matters is BPF end user > experience, as Daniel explained above. That's basically what we already do, for both Fedora and RHEL. Of course, it follows the distro release cycle, i.e. no version upgrades - or very limited ones - during lifetime of a particular release. But that would not be different if libbpf was bundled in individual projects. Jiri
Re: [PATCH v3 bpf-next] bpf: make verifier log more relevant by default
On Thu, 5 Nov 2020 13:22:12 -0800, Andrii Nakryiko wrote: > test_progs is the only test runner that's run continuously on every > patch. libbpf CI also runs test_maps and test_verifier. All the other > test binaries/scripts rely on humans to not forget about them. Which > works so-so, as you can see :) Did not know that, thanks for the explanation. Would be good to add full testing to some of the numerous CIs we have (some taker? :-)). Jiri
Re: [PATCH v3 bpf-next] bpf: make verifier log more relevant by default
On Thu, 5 Nov 2020 14:57:13 -0800, Jakub Kicinski wrote: > If you're saying the driver message would still be there if > verification or translation failed that's perfectly fine, we > can definitely adjust the test. But some check that driver > message reporting is working is needed, don't just remove it. Should we change the test to fail the verification? Sounds reasonable to me. Jiri
Re: [PATCH v3 bpf-next] bpf: make verifier log more relevant by default
On Thu, 23 Apr 2020 12:58:50 -0700, Andrii Nakryiko wrote: > To make BPF verifier verbose log more releavant and easier to use to debug > verification failures, "pop" parts of log that were successfully verified. > This has effect of leaving only verifier logs that correspond to code branches > that lead to verification failure, which in practice should result in much > shorter and more relevant verifier log dumps. This behavior is made the > default behavior and can be overriden to do exhaustive logging by specifying > BPF_LOG_LEVEL2 log level. This patch broke the test_offload.py selftest: [...] Test TC offloads work... FAIL: Missing or incorrect message from netdevsim in verifier log [...] The selftest expects to receive "[netdevsim] Hello from netdevsim!" in the log (coming from nsim_bpf_verify_insn) but that part of the log is cleared by bpf_vlog_reset added by this patch. How can this be fixed? The log level 1 comes from the "verbose" keyword passed to tc, I don't think it should be increased to 2. On a related note, the selftest had to start failing after this commit. It's a bit surprising it did not get caught, is there a bug somewhere in the test matrix? Thanks, Jiri
Re: [PATCHv3 iproute2-next 0/5] iproute2: add libbpf support
On Tue, 3 Nov 2020 19:11:45 -0800, Alexei Starovoitov wrote: > When we release new version of libbpf it goes through rigorous testing. > bpftool gets a lot of test coverage as well. > iproute2 with shared libbpf will get nothing. It's the same random roll of > dice. "Random roll of dice" would be true only if libbpf did incredibly bad job in keeping backward compatibility. In my experience it is not the case. Sure, a bug in retaining the compatibility may occasionally appear; after all, any software tends to contain bugs in various places. You are right that such bug may not be caught by your testing. I also believe that if there is a bug in backward compatibility reported by someone, it will be fixed (if possible). So this is really just a matter of testing, not a fundamental problem of ABI compatibility. Let the distros worry about the testing. Upstream may test (and even recommend!) certain combinations of iproute2 + libbpf, such as the latest of both at the time of testing. If distros want to use a different combination, they can and should do their own testing. If their testing reveals a bug in backward compatibility and a patch to fix it is accepted, everything will work smoothly for the distro users. Non-distro users (or small distros) may just rely on the upstream tested combination of iproute2 + libbpf. > Few years from now the situation could be different and shared libbpf would > be the most appropriate choice. But that day is not today. Interestingly, the major compatibility problems we had were with llvm updates. After llvm update while keeping the same kernel version, llvm started to emit code that the verifier did not accept. Meaning a bpf program that was previously accepted by the kernel was rejected after recompilation. This was solved by adding a translation code to libbpf (which nicely demonstrates that indeed libbpf cares about backward compatibility). Now, with dynamically linked libbpf, a single package update was able to solve the problem for everything on the system, including users' own programs. All that was needed was making the llvm package force update the libbpf package (which rpm can do easily with its Conflicts dependency). So, at least for us, there was so far no disadvantage (and no problem) with dynamic linking and a quite substantial advantage. Jiri
Re: [PATCHv3 iproute2-next 0/5] iproute2: add libbpf support
On Tue, 3 Nov 2020 18:45:59 -0800, Alexei Starovoitov wrote: > libbpf is the only library I know that is backward and forward compatible. This is great to hear. It means there will be no problem with iproute2 using the system libbpf. As libbpf is both backward and forward compatible, iproute2 will just work with whatever version it is used with. > All other libraries are backwards compatible only. Backward compatibility would be enough for iproute2 but forward compatibility does not hurt, of course. > The users can upgrade and downgrade libbpf version at any time. > They can upgrade and downgrade kernel while keeping libbpf version the same. > The users can upgrade llvm as well and libbpf has to expect unexpected > and deal with all combinations. This actually goes beyond what would be needed for iproute2 dynamically linked against system libbpf. > > How so? If libbpf is written against kernel APIs and properly versioned, > > it should just work. A new version of libbpf changes the .so version, so > > old commands will not load it. > > Please point out where do you see this happening in the patch set. > See tools/lib/bpf/README.rst to understand the versioning. If the iproute2 binaries are linked against a symbol of a newer version than is available in the system libbpf (which should not really happen unless the system is broken), the dynamic linker will refuse to load it. If the binary is linked against an old version of a particular symbol, that old version will be used, if it's still provided by the library. Otherwise, it will not load. I don't see a problem here? The only problem would be if a particular function changed its semantics while retaining ABI. But since libbpf is backward and forward compatible, this should not happen. Jiri
Re: [PATCHv3 iproute2-next 0/5] iproute2: add libbpf support
On Mon, 2 Nov 2020 22:58:06 -0800, Andrii Nakryiko wrote: > But I don't think I got a real answer as to what's the exact reason > against the submodule. Like what "inappropriate" even means in this > case? Jesper's security argument so far was the only objective > criteria, as far as I can tell. It's the fundamental objection. Distributions in general have the "no bundled libraries" policy. It is sometimes annoying but it helps to understand that the policy is not a whim of distros, it's coming from years of experience with package maintenance for security and stability. > But I also see that using libbpf through submodule gives iproute2 > exact control over which version of libbpf is being used. And that > does not depend at all on any specific Linux distribution, its > version, LTS vs non-LTS, etc. iproute2 will just work the same across > all of them. So matches your stated goals very directly and > explicitly. If you take this route, the end result would be all dependencies for all projects being included as submodules and bundled. At the first sight, this sounds easier for the developers. Why bother with dynamic linking at all? Everything can be linked statically. The result would be nightmare for both distros and users. No timely security updates possible, critical bugs not being fixed in some programs, etc. There is enough experience with this kind of setup to conclude it is not the right way to go. Yes, dynamic linking is initially more work for developers of both apps and libraries. However, it pays off over time - there's no need to keep track of security and other important fixes in the dependencies, it comes for free from the distro work. Btw, taking the bundling to the extreme, every app could bundle its own well tested and compatible kernel version and be run in a VM. This might sound far fetched but there were actual attempts to do that. It didn't take off; I think part of the reason was that the Linux kernel is very good in keeping its APIs stable. And I'm convinced this is the way to go for libraries, too: put an emphasis on API stability. Make it easy to get consumed and updated under the hood. Everybody wins this way. Jiri
[PATCH bpf] selftests: bpf: switch off timeout
Several bpf tests are interrupted by the default timeout of 45 seconds added by commit 852c8cbf34d3 ("selftests/kselftest/runner.sh: Add 45 second timeout per test"). In my case it was test_progs, test_tunnel.sh, test_lwt_ip_encap.sh and test_xdping.sh. There's not much value in having a timeout for bpf tests, switch it off. Signed-off-by: Jiri Benc --- tools/testing/selftests/bpf/settings | 1 + 1 file changed, 1 insertion(+) create mode 100644 tools/testing/selftests/bpf/settings diff --git a/tools/testing/selftests/bpf/settings b/tools/testing/selftests/bpf/settings new file mode 100644 index ..e7b9417537fb --- /dev/null +++ b/tools/testing/selftests/bpf/settings @@ -0,0 +1 @@ +timeout=0 -- 2.18.1
[PATCH net] geneve: change from tx_error to tx_dropped on missing metadata
If the geneve interface is in collect_md (external) mode, it can't send any packets submitted directly to its net interface, as such packets won't have metadata attached. This is expected. However, the kernel itself sends some packets to the interface, most notably, IPv6 DAD, IPv6 multicast listener reports, etc. This is not wrong, as tunnel metadata can be specified in routing table (although technically, that has never worked for IPv6, but hopefully will be fixed eventually) and then the interface must correctly participate in IPv6 housekeeping. The problem is that any such attempt increases the tx_error counter. Just bringing up a geneve interface with IPv6 enabled is enough to see a number of tx_errors. That causes confusion among users, prompting them to find a network error where there is none. Change the counter used to tx_dropped. That better conveys the meaning (there's nothing wrong going on, just some packets are getting dropped) and hopefully will make admins panic less. Signed-off-by: Jiri Benc --- drivers/net/geneve.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 6b461be1820b..75266580b586 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -987,9 +987,10 @@ static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev) if (geneve->collect_md) { info = skb_tunnel_info(skb); if (unlikely(!info || !(info->mode & IP_TUNNEL_INFO_TX))) { - err = -EINVAL; netdev_dbg(dev, "no tunnel metadata\n"); - goto tx_error; + dev_kfree_skb(skb); + dev->stats.tx_dropped++; + return NETDEV_TX_OK; } } else { info = &geneve->info; @@ -1006,7 +1007,7 @@ static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct net_device *dev) if (likely(!err)) return NETDEV_TX_OK; -tx_error: + dev_kfree_skb(skb); if (err == -ELOOP) -- 2.18.1
Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
On Fri, 18 Oct 2019 13:03:56 -0700, Tom Herbert wrote: > More specifically fou allows encapsulation of anything that has an IP > protocol number. That includes an L3 protocols that have been assigned > a number (e.g. MPLS, GRE, IPv6, IPv4, EtherIP). So the only need for > an alternate method to do L3 encapsulation would be for those > protocols that don't have an IP protocol number assignment. > Presumably, those just have an EtherType. In that case, it seems > simple enough to just extend fou to processed an encapsulated > EtherType. This should be little more than modifying the "struct fou" > to hold 16 bit EtherType (union with protocol), adding > FOU_ENCAP_ETHER, corresponding attribute, and then populate > appropriate receive functions for the socket. How do you suggest to plug that in? We need the received inner packets to be "encapsulated" in an Ethernet header; the current approach of "ip link add type ipip|sit encap fou" does not work here. Are you suggesting to add another rtnl link type? How would it look like? "ip link add type ethernet encap fou" does not sound appealing, especially since "type ethernet" would have no meaning without the "encap fou". Thanks, Jiri
[PATCH bpf] selftests/bpf: More compatible nc options in test_tc_edt
Out of the three nc implementations widely in use, at least two (BSD netcat and nmap-ncat) do not support -l combined with -s. Modify the nc invocation to be accepted by all of them. Fixes: 7df5e3db8f63 ("selftests: bpf: tc-bpf flow shaping with EDT") Cc: Peter Oskolkov Signed-off-by: Jiri Benc --- tools/testing/selftests/bpf/test_tc_edt.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/bpf/test_tc_edt.sh b/tools/testing/selftests/bpf/test_tc_edt.sh index f38567ef694b..daa7d1b8d309 100755 --- a/tools/testing/selftests/bpf/test_tc_edt.sh +++ b/tools/testing/selftests/bpf/test_tc_edt.sh @@ -59,7 +59,7 @@ ip netns exec ${NS_SRC} tc filter add dev veth_src egress \ # start the listener ip netns exec ${NS_DST} bash -c \ - "nc -4 -l -s ${IP_DST} -p 9000 >/dev/null &" + "nc -4 -l -p 9000 >/dev/null &" declare -i NC_PID=$! sleep 1 -- 2.18.1
Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.
On Wed, 9 Oct 2019 10:58:51 -0400, Willem de Bruijn wrote: > Yes, this needs a call to gro_cells_receive like geneve_udp_encap_recv > and vxlan_rcv, instead of returning a negative value back to > udp_queue_rcv_one_skb. Though that's not a major change. Willem, how would you do that? The whole fou code including its netlink API is built around the assumption that it's L4 encapsulation. I see no way to extend it to do L3 encapsulation without major hacks - in fact, you'll add all that Martin's patch adds, including a new netlink API, but just mix that into fou.c. > Transmit is easier. The example I shared already encapsulates packets > with MPLs and sends them through a tunnel device. Admittedly using > mpls routing. But the ip tunneling infrastructure supports adding > arbitrary inner headers using ip_tunnel_encap_ops.build_header. > net/ipv4/fou.c could be extended with a mpls_build_header alongside > fou_build_header and gue_build_header? The nice thing about the bareudp tunnel is that it's generic. It's just (outer) UDP header followed by (inner) L3 header. You can use it for NSH over UDP. Or for multitude of other purposes. What you're proposing is MPLS only. And it would require similar amount of code as the bareudp generic solution. It also doesn't fit fou design: MPLS is not an IP protocol. Trying to add it there is like forcing a round peg into a square hole. Just look at the code. Start with parse_nl_config. > Extending this existing code has more benefits than code reuse (and > thereby fewer bugs), it also allows reusing the existing GRO logic, > say. In this case, it's the opposite: trying to retrofit L3 encapsulation into fou is going to introduce bugs. Moreover, given the expected usage of bareudp, the nice benefit is it's lwtunnel only. No need to carry all the baggage of standalone configuration fou has. > At a high level, I think we should try hard to avoid duplicating > tunnel code for every combination of inner and outer protocol. Yet you're proposing to do exactly that with special casing MPLS in fou. I'd say bareudp is as generic as you could get it. It allows any L3 protocol inside UDP. You can hardly make it more generic than that. With ETH_P_TEB, it could also encapsulate Ethernet (that is, L2). > Geneve > and vxlan on the one hand and generic ip tunnel and FOU on the other > implement most of these features. Indeed, already implements the > IPxIPx, just not MPLS. It will be less code to add MPLS support based > on existing infrastructure. You're mixing apples with oranges. IP tunnels and FOU encapsulate L4 payload. VXLAN and Geneve encapsulate L2 payload (or L3 payload for VXLAN-GPE). I agree that VXLAN, Geneve and the proposed bareudp module have duplicated code. The solution is to factoring it out to a common location. > I think it will be preferable to work the other way around and extend > existing ip tunnel infra to add MPLS. You see, that's the problem: this is not IP tunneling. Jiri
Re: [PATCHv2 net-next 2/6] lwtunnel: add LWTUNNEL_IP_OPTS support for lwtunnel_ip
On Wed, 9 Oct 2019 09:55:27 +0200, Simon Horman wrote: > This is the same concerned that was raised by others when I posed a patch > to allow setting of Geneve options in a similar manner. I think what is > called for here, as was the case in the Geneve work, is to expose netlink > attributes for each option that may be set and have the kernel form > these into the internal format (which appears to also be the wire format). I agree with Simon. Thanks, Jiri
[PATCH bpf] bpf: lwtunnel: fix reroute supplying invalid dst
The dst in bpf_input() has lwtstate field set. As it is of the LWTUNNEL_ENCAP_BPF type, lwtstate->data is struct bpf_lwt. When the bpf program returns BPF_LWT_REROUTE, ip_route_input_noref is directly called on this skb. This causes invalid memory access, as ip_route_input_slow calls skb_tunnel_info(skb) that expects the dst->lwstate->data to be struct ip_tunnel_info. This results to struct bpf_lwt being accessed as struct ip_tunnel_info. Drop the dst before calling the IP route input functions (both for IPv4 and IPv6). Reported by KASAN. Fixes: 3bd0b15281af ("bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c") Cc: Peter Oskolkov Signed-off-by: Jiri Benc --- net/core/lwt_bpf.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c index f93785e5833c..74cfb8b5ab33 100644 --- a/net/core/lwt_bpf.c +++ b/net/core/lwt_bpf.c @@ -88,11 +88,16 @@ static int bpf_lwt_input_reroute(struct sk_buff *skb) int err = -EINVAL; if (skb->protocol == htons(ETH_P_IP)) { + struct net_device *dev = skb_dst(skb)->dev; struct iphdr *iph = ip_hdr(skb); + dev_hold(dev); + skb_dst_drop(skb); err = ip_route_input_noref(skb, iph->daddr, iph->saddr, - iph->tos, skb_dst(skb)->dev); + iph->tos, dev); + dev_put(dev); } else if (skb->protocol == htons(ETH_P_IPV6)) { + skb_dst_drop(skb); err = ipv6_stub->ipv6_route_input(skb); } else { err = -EAFNOSUPPORT; -- 2.18.1
Re: [PATCH v3 bpf-next 5/7] libbpf: move bpf_{helpers,endian,tracing}.h into libbpf
On Fri, 4 Oct 2019 18:37:44 +, Yonghong Song wrote: > distro can package bpf/btf uapi headers into libbpf package. > Users linking with libbpf.a/libbpf.so can use bpf/btf.h with include > path pointing to libbpf dev package include directory. > Could this work? I don't think it would. Distros have often a policy against bundling files that are available from one package (in this case, kernel-headers or similar) in a different package (libbpf). The correct way is making the libbpf package depend on a particular version of kernel-headers (or newer). As I said, I don't see a problem here. It's not a special situation, it's just usual dependencies. Jiri
Re: [PATCH v3 bpf-next 5/7] libbpf: move bpf_{helpers,endian,tracing}.h into libbpf
On Fri, 4 Oct 2019 11:30:26 -0700, Jakub Kicinski wrote: > Surely for distroes tho - they would have kernel headers matching the > kernel release they ship. If parts of libbpf from GH only work with > the latest kernel, distroes should ship libbpf from the kernel source, > rather than GH. I don't see a problem here for distros. Distros control both the kernel and libbpf, there's no problem in keeping those in sync. Packaging libbpf from the kernel source would not help anything - updating libbpf would still require a new kernel. There's no difference between updating libbpf from a github repo or from the kernel source, as far as dependencies are concerned. Jiri
[PATCH bpf 1/2] selftests/bpf: set rp_filter in test_flow_dissector
Many distributions enable rp_filter. However, the flow dissector test generates packets that have 1.1.1.1 set as (inner) source address without this address being reachable. This causes the selftest to fail. The selftests should not assume a particular initial configuration. Switch off rp_filter. Fixes: 50b3ed57dee9 ("selftests/bpf: test bpf flow dissection") Cc: Petar Penkov Signed-off-by: Jiri Benc --- tools/testing/selftests/bpf/test_flow_dissector.sh | 3 +++ 1 file changed, 3 insertions(+) diff --git a/tools/testing/selftests/bpf/test_flow_dissector.sh b/tools/testing/selftests/bpf/test_flow_dissector.sh index d23d4da66b83..e2d06191bd35 100755 --- a/tools/testing/selftests/bpf/test_flow_dissector.sh +++ b/tools/testing/selftests/bpf/test_flow_dissector.sh @@ -63,6 +63,9 @@ fi # Setup tc qdisc add dev lo ingress +echo 0 > /proc/sys/net/ipv4/conf/default/rp_filter +echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter +echo 0 > /proc/sys/net/ipv4/conf/lo/rp_filter echo "Testing IPv4..." # Drops all IP/UDP packets coming from port 9 -- 2.18.1
[PATCH bpf 0/2] selftests/bpf: fix false failures
The test_flow_dissector and test_lwt_ip_encap selftests were failing for me. It was caused by the tests not being enough system/distro independent. Jiri Benc (2): selftests/bpf: set rp_filter in test_flow_dissector selftests/bpf: more compatible nc options in test_lwt_ip_encap tools/testing/selftests/bpf/test_flow_dissector.sh | 3 +++ tools/testing/selftests/bpf/test_lwt_ip_encap.sh | 6 +++--- 2 files changed, 6 insertions(+), 3 deletions(-) -- 2.18.1
[PATCH bpf 2/2] selftests/bpf: more compatible nc options in test_lwt_ip_encap
Out of the three nc implementations widely in use, at least two (BSD netcat and nmap-ncat) do not support -l combined with -s. Modify the nc invocation to be accepted by all of them. Fixes: 17a90a788473 ("selftests/bpf: test that GSO works in lwt_ip_encap") Cc: Peter Oskolkov Signed-off-by: Jiri Benc --- tools/testing/selftests/bpf/test_lwt_ip_encap.sh | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh index acf7a74f97cd..59ea56945e6c 100755 --- a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh +++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh @@ -314,15 +314,15 @@ test_gso() command -v nc >/dev/null 2>&1 || \ { echo >&2 "nc is not available: skipping TSO tests"; return; } - # listen on IPv*_DST, capture TCP into $TMPFILE + # listen on port 9000, capture TCP into $TMPFILE if [ "${PROTO}" == "IPv4" ] ; then IP_DST=${IPv4_DST} ip netns exec ${NS3} bash -c \ - "nc -4 -l -s ${IPv4_DST} -p 9000 > ${TMPFILE} &" + "nc -4 -l -p 9000 > ${TMPFILE} &" elif [ "${PROTO}" == "IPv6" ] ; then IP_DST=${IPv6_DST} ip netns exec ${NS3} bash -c \ - "nc -6 -l -s ${IPv6_DST} -p 9000 > ${TMPFILE} &" + "nc -6 -l -p 9000 > ${TMPFILE} &" RET=$? else echo "test_gso: unknown PROTO: ${PROTO}" -- 2.18.1
[PATCH bpf-next] selftests: bpf: standardize to static __always_inline
The progs for bpf selftests use several different notations to force function inlining. Standardize to what most of them use, static __always_inline. Suggested-by: Song Liu Signed-off-by: Jiri Benc --- tools/testing/selftests/bpf/progs/pyperf.h| 9 +++-- .../testing/selftests/bpf/progs/strobemeta.h | 36 ++- .../testing/selftests/bpf/progs/test_jhash.h | 3 +- .../selftests/bpf/progs/test_seg6_loop.c | 23 ++-- .../selftests/bpf/progs/test_verif_scale2.c | 2 +- 5 files changed, 38 insertions(+), 35 deletions(-) diff --git a/tools/testing/selftests/bpf/progs/pyperf.h b/tools/testing/selftests/bpf/progs/pyperf.h index 6b0781391be5..abf6224649be 100644 --- a/tools/testing/selftests/bpf/progs/pyperf.h +++ b/tools/testing/selftests/bpf/progs/pyperf.h @@ -75,8 +75,7 @@ typedef struct { void* co_name; // PyCodeObject.co_name } FrameData; -static inline __attribute__((__always_inline__)) void* -get_thread_state(void* tls_base, PidData* pidData) +static __always_inline void *get_thread_state(void *tls_base, PidData *pidData) { void* thread_state; int key; @@ -87,8 +86,8 @@ get_thread_state(void* tls_base, PidData* pidData) return thread_state; } -static inline __attribute__((__always_inline__)) bool -get_frame_data(void* frame_ptr, PidData* pidData, FrameData* frame, Symbol* symbol) +static __always_inline bool get_frame_data(void *frame_ptr, PidData *pidData, + FrameData *frame, Symbol *symbol) { // read data from PyFrameObject bpf_probe_read(&frame->f_back, @@ -161,7 +160,7 @@ struct bpf_elf_map SEC("maps") stackmap = { .max_elem = 1000, }; -static inline __attribute__((__always_inline__)) int __on_event(struct pt_regs *ctx) +static __always_inline int __on_event(struct pt_regs *ctx) { uint64_t pid_tgid = bpf_get_current_pid_tgid(); pid_t pid = (pid_t)(pid_tgid >> 32); diff --git a/tools/testing/selftests/bpf/progs/strobemeta.h b/tools/testing/selftests/bpf/progs/strobemeta.h index 1ff73f60a3e4..553bc3b62e89 100644 --- a/tools/testing/selftests/bpf/progs/strobemeta.h +++ b/tools/testing/selftests/bpf/progs/strobemeta.h @@ -266,8 +266,8 @@ struct tls_index { uint64_t offset; }; -static inline __attribute__((always_inline)) -void *calc_location(struct strobe_value_loc *loc, void *tls_base) +static __always_inline void *calc_location(struct strobe_value_loc *loc, + void *tls_base) { /* * tls_mode value is: @@ -327,10 +327,10 @@ void *calc_location(struct strobe_value_loc *loc, void *tls_base) : NULL; } -static inline __attribute__((always_inline)) -void read_int_var(struct strobemeta_cfg *cfg, size_t idx, void *tls_base, - struct strobe_value_generic *value, - struct strobemeta_payload *data) +static __always_inline void read_int_var(struct strobemeta_cfg *cfg, +size_t idx, void *tls_base, +struct strobe_value_generic *value, +struct strobemeta_payload *data) { void *location = calc_location(&cfg->int_locs[idx], tls_base); if (!location) @@ -342,10 +342,11 @@ void read_int_var(struct strobemeta_cfg *cfg, size_t idx, void *tls_base, data->int_vals_set_mask |= (1 << idx); } -static inline __attribute__((always_inline)) -uint64_t read_str_var(struct strobemeta_cfg* cfg, size_t idx, void *tls_base, - struct strobe_value_generic *value, - struct strobemeta_payload *data, void *payload) +static __always_inline uint64_t read_str_var(struct strobemeta_cfg *cfg, +size_t idx, void *tls_base, +struct strobe_value_generic *value, +struct strobemeta_payload *data, +void *payload) { void *location; uint32_t len; @@ -371,10 +372,11 @@ uint64_t read_str_var(struct strobemeta_cfg* cfg, size_t idx, void *tls_base, return len; } -static inline __attribute__((always_inline)) -void *read_map_var(struct strobemeta_cfg *cfg, size_t idx, void *tls_base, - struct strobe_value_generic *value, - struct strobemeta_payload* data, void *payload) +static __always_inline void *read_map_var(struct strobemeta_cfg *cfg, + size_t idx, void *tls_base, + struct strobe_value_generic *value, + struct strobemeta_payload *data, + void *payload) { struct strobe_map_descr* descr = &data->map_descrs[idx]; struc
[PATCH bpf v2] selftests: bpf: fix inlines in test_lwt_seg6local
Selftests are reporting this failure in test_lwt_seg6local.sh: + ip netns exec ns2 ip -6 route add fb00::6 encap bpf in obj test_lwt_seg6local.o sec encap_srh dev veth2 Error fetching program/map! Failed to parse eBPF program: Operation not permitted The problem is __attribute__((always_inline)) alone is not enough to prevent clang from inserting those functions in .text. In that case, .text is not marked as relocateable. See the output of objdump -h test_lwt_seg6local.o: Idx Name Size VMA LMA File off Algn 0 .text 3530 0040 2**3 CONTENTS, ALLOC, LOAD, READONLY, CODE This causes the iproute bpf loader to fail in bpf_fetch_prog_sec: bpf_has_call_data returns true but bpf_fetch_prog_relo fails as there's no relocateable .text section in the file. To fix this, convert to 'static __always_inline'. v2: Use 'static __always_inline' instead of 'static inline __attribute__((always_inline))' Fixes: c99a84eac026 ("selftests/bpf: test for seg6local End.BPF action") Signed-off-by: Jiri Benc --- .../testing/selftests/bpf/progs/test_lwt_seg6local.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c b/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c index 0575751bc1bc..e2f6ed0a583d 100644 --- a/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c +++ b/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c @@ -61,7 +61,7 @@ struct sr6_tlv_t { unsigned char value[0]; } BPF_PACKET_HEADER; -__attribute__((always_inline)) struct ip6_srh_t *get_srh(struct __sk_buff *skb) +static __always_inline struct ip6_srh_t *get_srh(struct __sk_buff *skb) { void *cursor, *data_end; struct ip6_srh_t *srh; @@ -95,7 +95,7 @@ __attribute__((always_inline)) struct ip6_srh_t *get_srh(struct __sk_buff *skb) return srh; } -__attribute__((always_inline)) +static __always_inline int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad, uint32_t old_pad, uint32_t pad_off) { @@ -125,7 +125,7 @@ int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad, return 0; } -__attribute__((always_inline)) +static __always_inline int is_valid_tlv_boundary(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t *tlv_off, uint32_t *pad_size, uint32_t *pad_off) @@ -184,7 +184,7 @@ int is_valid_tlv_boundary(struct __sk_buff *skb, struct ip6_srh_t *srh, return 0; } -__attribute__((always_inline)) +static __always_inline int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off, struct sr6_tlv_t *itlv, uint8_t tlv_size) { @@ -228,7 +228,7 @@ int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off, return update_tlv_pad(skb, new_pad, pad_size, pad_off); } -__attribute__((always_inline)) +static __always_inline int delete_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off) { @@ -266,7 +266,7 @@ int delete_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, return update_tlv_pad(skb, new_pad, pad_size, pad_off); } -__attribute__((always_inline)) +static __always_inline int has_egr_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh) { int tlv_offset = sizeof(struct ip6_t) + sizeof(struct ip6_srh_t) + -- 2.18.1
Re: [PATCH bpf] selftests: bpf: fix inlines in test_lwt_seg6local
On Mon, 1 Jul 2019 11:39:27 -0700, Y Song wrote: > By default, we have > # define __always_inlineinline __attribute__((always_inline)) > > So just use __always_inline should be less verbose in your patch. I'll resubmit, converting everything to __always_inline, targeting bpf-next. > BTW, what compiler did you use have this behavior? clang version 7.0.1 (tags/RELEASE_701/final) LLVM version 7.0.1 > Did you have issues with `static __attribute__((always_inline))`? That seems to work, too. Jiri
Re: [PATCH bpf] selftests: bpf: fix inlines in test_lwt_seg6local
On Sat, 29 Jun 2019 11:04:54 -0700, Song Liu wrote: > > Maybe use "__always_inline" as most other tests do? > > I meant "static __always_inline". Sure, I can do that. It doesn't seem to be as consistent as you suggest, though. There are three different forms used in selftests/bpf/progs: static __always_inline static inline __attribute__((__always_inline__)) static inline __attribute__((always_inline)) As this is a bug causing selftests to fail (at least for some clang/llvm versions), how about applying this to bpf.git as a minimal fix and unifying the progs in bpf-next? Thanks, Jiri
[PATCH bpf] selftests: bpf: fix inlines in test_lwt_seg6local
Selftests are reporting this failure in test_lwt_seg6local.sh: + ip netns exec ns2 ip -6 route add fb00::6 encap bpf in obj test_lwt_seg6local.o sec encap_srh dev veth2 Error fetching program/map! Failed to parse eBPF program: Operation not permitted The problem is __attribute__((always_inline)) alone is not enough to prevent clang from inserting those functions in .text. In that case, .text is not marked as relocateable. See the output of objdump -h test_lwt_seg6local.o: Idx Name Size VMA LMA File off Algn 0 .text 3530 0040 2**3 CONTENTS, ALLOC, LOAD, READONLY, CODE This causes the iproute bpf loader to fail in bpf_fetch_prog_sec: bpf_has_call_data returns true but bpf_fetch_prog_relo fails as there's no relocateable .text section in the file. Add 'static inline' to fix this. Fixes: c99a84eac026 ("selftests/bpf: test for seg6local End.BPF action") Signed-off-by: Jiri Benc --- .../selftests/bpf/progs/test_lwt_seg6local.c| 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c b/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c index 0575751bc1bc..5848c1b52554 100644 --- a/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c +++ b/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c @@ -61,7 +61,8 @@ struct sr6_tlv_t { unsigned char value[0]; } BPF_PACKET_HEADER; -__attribute__((always_inline)) struct ip6_srh_t *get_srh(struct __sk_buff *skb) +static inline __attribute__((always_inline)) +struct ip6_srh_t *get_srh(struct __sk_buff *skb) { void *cursor, *data_end; struct ip6_srh_t *srh; @@ -95,7 +96,7 @@ __attribute__((always_inline)) struct ip6_srh_t *get_srh(struct __sk_buff *skb) return srh; } -__attribute__((always_inline)) +static inline __attribute__((always_inline)) int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad, uint32_t old_pad, uint32_t pad_off) { @@ -125,7 +126,7 @@ int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad, return 0; } -__attribute__((always_inline)) +static inline __attribute__((always_inline)) int is_valid_tlv_boundary(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t *tlv_off, uint32_t *pad_size, uint32_t *pad_off) @@ -184,7 +185,7 @@ int is_valid_tlv_boundary(struct __sk_buff *skb, struct ip6_srh_t *srh, return 0; } -__attribute__((always_inline)) +static inline __attribute__((always_inline)) int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off, struct sr6_tlv_t *itlv, uint8_t tlv_size) { @@ -228,7 +229,7 @@ int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off, return update_tlv_pad(skb, new_pad, pad_size, pad_off); } -__attribute__((always_inline)) +static inline __attribute__((always_inline)) int delete_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off) { @@ -266,7 +267,7 @@ int delete_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, return update_tlv_pad(skb, new_pad, pad_size, pad_off); } -__attribute__((always_inline)) +static inline __attribute__((always_inline)) int has_egr_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh) { int tlv_offset = sizeof(struct ip6_t) + sizeof(struct ip6_srh_t) + -- 2.18.1
Re: [PATCH net-next] macvlan: pass get_ts_info and SIOC[SG]HWTSTAMP ioctl to real device
On Wed, 17 Apr 2019 08:43:06 -0700, Richard Cochran wrote: > If NET_ADMIN is enabled in the container, don't the host and container > contend with each other for the physical interfaces anyhow? Physical interfaces are not a problem, as each interface can be only in a single net name space. The problem here is this patch gives access to physical interface settings through a virtual interface layered on top of it. Whenever such thing is done, the virtual interface needs to provide a suitable way of moderating access to the shared resources, so the individual virtual interfaces do not affect each other. That's not what's being done here. I think this patch is wrong. Jiri
[PATCH net] geneve: correctly handle ipv6.disable module parameter
When IPv6 is compiled but disabled at runtime, geneve_sock_add returns -EAFNOSUPPORT. For metadata based tunnels, this causes failure of the whole operation of bringing up the tunnel. Ignore failure of IPv6 socket creation for metadata based tunnels caused by IPv6 not being available. This is the same fix as what commit d074bf960044 ("vxlan: correctly handle ipv6.disable module parameter") is doing for vxlan. Note there's also commit c0a47e44c098 ("geneve: should not call rt6_lookup() when ipv6 was disabled") which fixes a similar issue but for regular tunnels, while this patch is needed for metadata based tunnels. Signed-off-by: Jiri Benc --- drivers/net/geneve.c | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c index 3377ac66a347..5583d993480d 100644 --- a/drivers/net/geneve.c +++ b/drivers/net/geneve.c @@ -692,15 +692,20 @@ static int geneve_sock_add(struct geneve_dev *geneve, bool ipv6) static int geneve_open(struct net_device *dev) { struct geneve_dev *geneve = netdev_priv(dev); - bool ipv6 = !!(geneve->info.mode & IP_TUNNEL_INFO_IPV6); bool metadata = geneve->collect_md; + bool ipv4, ipv6; int ret = 0; + ipv6 = geneve->info.mode & IP_TUNNEL_INFO_IPV6 || metadata; + ipv4 = !ipv6 || metadata; #if IS_ENABLED(CONFIG_IPV6) - if (ipv6 || metadata) + if (ipv6) { ret = geneve_sock_add(geneve, true); + if (ret < 0 && ret != -EAFNOSUPPORT) + ipv4 = false; + } #endif - if (!ret && (!ipv6 || metadata)) + if (ipv4) ret = geneve_sock_add(geneve, false); if (ret < 0) geneve_sock_release(geneve); -- 2.18.1
Re: [PATCH net-next 00/11] ICMP error handling for UDP tunnels
On Tue, 6 Nov 2018 22:38:56 +0100, Stefano Brivio wrote: > - patch 1/11 adds a socket lookup for UDP tunnels that use, by design, > the same destination port on both endpoints -- i.e. VxLAN and GENEVE This is not necessarily true with lwtunnels (collect_md mode of VXLAN and GENEVE). While any sane setup will use the same dst ports, there's really nothing that enforces it. Of course, in that case we have no way to map the ICMP error back to the tunnel. Generally speaking, I'm not sure how ICMP error handling should work for external control planes. Are we sure they want PMTU discovery and route redirection done by the kernel? (I am not sure, neither way.) Jiri
Re: [PATCH RFC v2 net-next 03/25] netlink: introduce NLM_F_DUMP_PROPER_HDR flag
On Tue, 2 Oct 2018 08:55:19 -0600, David Ahern wrote: > You want take issue with a name suggest a different one. I don't have a strong opinion on this. STRICT_HDR? HDR_CHECKED? VERIFY_HDR? If you feel that your proposal is better, I don't mind. Thanks, Jiri
Re: [PATCH RFC v2 net-next 03/25] netlink: introduce NLM_F_DUMP_PROPER_HDR flag
On Tue, 2 Oct 2018 08:57:24 -0600, David Ahern wrote: > You can when you introduce a new option or a new flag that is required > to get new behavior like kernel side filtering. Yes. That was what I tried with the patchset a few years back. It would be nice to revive the effort. > I chose a netlink flag for consistency with NLM_F_DUMP_INTR and > NLM_F_DUMP_FILTERED. Both are netlink flags. This patch set fixes only > what is broken -- dumps. When we're introducing better input checking in netlink (which is a good thing!), it would be good to do it consistently and have it generic across all operations. Jiri
Re: [PATCH RFC v2 net-next 02/25] net/ipv6: Refactor address dump to push inet6_fill_args to in6_dump_addrs
On Tue, 2 Oct 2018 09:11:17 -0600, David Ahern wrote: > Generically speaking a filter modifies the output based on the input. > Specifying a target namespace is an input to the dump that modifies the > output. That's conventionally called "algorithm" :-) Let's just say we have a different understanding of what "filter" is. Perhaps we should look at it from a different side. What is the use case of having NLM_F_DUMP_FILTERED set for IFA_TARGET_NETNSID? How is this going to be used by applications? > Yes, you can do it in userspace which is what iproute2 has done to this > point, but it is grossly inefficient and that inefficiency has > implications at scale. You can't do that with IFA_TARGET_NETNSID. Which is my point. Without the flag, you don't get a list of all interfaces in all net name spaces. Jiri
Re: [PATCH RFC v2 net-next 03/25] netlink: introduce NLM_F_DUMP_PROPER_HDR flag
On Tue, 2 Oct 2018 13:18:32 +0200, Christian Brauner wrote: > I didn't find this in the linked thread. Maybe it was suggested in another thread or in person on a conference, I can't remember, it's too long ago, sorry. > What I find interesting and convincing is one of Dave's points: > > "I'm beginning to wonder if we can just change this unilaterally to > not ignore unrecognized attributes. > > I am increasingly certain that things that would "break" we wouldn't > want to succeed anyways." [1] It's unfortunate we can't do that. I'd like it. > But a socket option or this header flag both sound acceptable to me. Was > there any more detail on how a socket option would look like, i.e. an > api proposal or something? Look at how NETLINK_CAP_ACK and NETLINK_EXT_ACK is implemented. Jiri
Re: [PATCH RFC v2 net-next 02/25] net/ipv6: Refactor address dump to push inet6_fill_args to in6_dump_addrs
On Tue, 2 Oct 2018 13:03:00 +0200, Christian Brauner wrote: > Well, it's a namespace filter that's how I saw it. That would imply that without it, you get data from all name spaces (= unfiltered by name space), which is not what's happening :-) Jiri
Re: [PATCH RFC v2 net-next 03/25] netlink: introduce NLM_F_DUMP_PROPER_HDR flag
On Mon, 1 Oct 2018 17:28:29 -0700, David Ahern wrote: > Add a new flag, NLM_F_DUMP_PROPER_HDR, for userspace to indicate to the > kernel that it believes it is sending the right header struct for the > dump message type (ifinfomsg, ifaddrmsg, rtmsg, fib_rule_hdr, ...). Why is this limited to dumps? Other kind of netlink messages contain the common struct, too. When introducing such mechanism, please make it generic. Last time when we were discussing strict checking in netlink, it was suggested to add a socket option instead of adding NLM flags[1]. It makes a lot of sense: the number of flags is very limited and we'd run out of them pretty fast. It's not just the header structure that is currently checked sloppily. It's also attributes, flags in attributes, etc. We can't assign a flag to all of them. You should also consider a different name for the flag: it should reflect what the effect of the flag is. "Proper header" is not an effect, it's a requirement for the message to pass. The effect is enforced strict checking of the header. Jiri [1] https://marc.info/?l=linux-netdev&m=144492718118955
Re: [PATCH RFC v2 net-next 02/25] net/ipv6: Refactor address dump to push inet6_fill_args to in6_dump_addrs
On Mon, 1 Oct 2018 17:28:28 -0700, David Ahern wrote: > Pull the inet6_fill_args arg up to in6_dump_addrs and move netnsid > into it. Since IFA_TARGET_NETNSID is a kernel side filter add the > NLM_F_DUMP_FILTERED flag so userspace knows the request was honored. IFA_TARGET_NETNSID is not a filter. "Filter" returns a subset of the results. It's kind of optimization when one is interested only in some data but not all of them. Instead of dumping everything, going through the results and picking only the data one is interested in, it's better to pass a filter and get only the relevant data. But you're not really required to: you can filter in your app. By contrast, IFA_TARGET_NETNSID returns a completely different set of data. It's impossible to not set it and filter the results in your app. As the consequence, IFA_TARGET_NETNSID must not set NLM_F_DUMP_FILTERED (if not complemented by a real filter). I understand that you want to differentiate between data dumped without and with IFA_TARGET_NETNSID present. But we already have that: the IFA_TARGET_NETNSID attribute is returned back in the latter case. Nacked-by: Jiri Benc Jiri
Re: [PATCH net] rtnetlink: Fail dump if target netnsid is invalid
On Fri, 28 Sep 2018 12:28:41 -0700, David Ahern wrote: > --- a/net/core/rtnetlink.c > +++ b/net/core/rtnetlink.c > @@ -1898,10 +1898,8 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, > struct netlink_callback *cb) > if (tb[IFLA_IF_NETNSID]) { > netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]); > tgt_net = get_target_net(skb->sk, netnsid); > - if (IS_ERR(tgt_net)) { > - tgt_net = net; > - netnsid = -1; > - } > + if (IS_ERR(tgt_net)) > + return PTR_ERR(tgt_net); > } > > if (tb[IFLA_EXT_MASK]) Sorry for the late review, I see it has been applied. I intentionally chose the behavior to preserve the behavior of the older kernels: that attribute was silently ignored. Note that the IFLA_IF_NETNSID is not returned in such case, thus it's easy to distinguish that it was not applied. And the user space has to do such check anyway to support old kernels. But you're right that there was no way to distinguish "the kernel does not support IFLA_IF_NETNSID" from "wrong IFLA_IF_NETNSID provided". I'm okay with the patch, I just don't think the "Fixes" tag is justified but whatever, can't be unapplied :-) (and it's my fault for not reviewing the patches timely). Thanks! Jiri
Re: netlink: 16 bytes leftover after parsing attributes in process `ip'.
On Tue, 25 Sep 2018 09:37:41 -0600, David Ahern wrote: > For ifaddrmsg ifa_flags aligns with ifi_type which is set by kernel side > so this should be ok. Does the existing user space set ifi_type to anything? Does it zero out the field? Are we able to find a flag value that is not going to be set by unaware user space? I.e., a bit that is unused by the current ARPHRD values on both little and big endian? (ARPHRD_NONE might be a problem, though...) Jiri
Re: netlink: 16 bytes leftover after parsing attributes in process `ip'.
On Tue, 25 Sep 2018 11:49:10 +0200, Christian Brauner wrote: > So if people really want to hide this issue as much as we can then we > can play the guessing game. I could send a patch that roughly does the > following: > > if (nlmsg_len(cb->nlh) < sizeof(struct ifinfomsg)) > guessed_header_len = sizeof(struct ifaddrmsg); > else > guessed_header_len = sizeof(struct ifinfomsg); > > This will work since sizeof(ifaddrmsg) == 8 and sizeof(ifinfomsg) == 16. > The only valid property for RTM_GETADDR requests is IFA_TARGET_NETNSID. > This propert is a __s32 which should bring the message up to 12 bytes > (not sure about alignment requiremnts and where we might wend up ten) > which is still less than the 16 bytes without that property from > ifinfomsg. That's a hacky hacky hack-hack and will likely work but will > break when ifaddrmsg grows a new member or we introduce another property > that is valid in RTM_GETADDR requests. It also will not work cleanly > when users stuff additional properties in there that are vaif > (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,lid for the > address family but are not used int RTM_GETADDR requests. I'd expect that any potential existing code that makes use of other attributes already assumes ifaddrmsg. Hence, if the nlmsg_len is > sizeof(ifinfomsg), you can be sure that there are attributes and thus the struct used was ifaddrmsg. So, in order for RTM_GETADDR to work reliably with attributes, you have to ensure that the length is > sizeof(ifinfomsg). This can be achieved by putting IFA_TARGET_NETNSID into a nested attribute. Just define IFA_EXTENDED (feel free to invent a better name, of course) and put IFA_TARGET_NETNSID inside. Then in the code, attempt to parse only when the size is large enough: if (nlmsg_len(cb->nlh) > sizeof(struct ifinfomsg)) { int err; err = nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX, ifa_ipv6_policy, NULL); if (err < 0) return err; if (tb[IFA_EXTENDED]) { ...parse the nested attribute... if (tb_nested[IFA_TARGET_NETNSID]) { ...etc... } } } Another option is forcing the user space to add another attribute, for example, IFA_FLAGS_PRESENT, and attempt parsing only when it is present. The logic would then be: if (nlmsg_len(cb->nlh) > sizeof(struct ifinfomsg)) { int err; err = nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX, ifa_ipv6_policy, NULL); if (err < 0) return err; if (tb[IFA_FLAGS_PRESENT] && tb[IFA_TARGET_NETNSID]) { ...etc... } } Jiri
Re: [RFC 4/5] netlink: prepare validate extack setting for recursion
On Wed, 19 Sep 2018 11:25:17 +0200, Johannes Berg wrote: > Now, with this patch, all I'm doing is changing the internal behaviour > of nla_parse/nla_validate - externally, it still overwrites any existing > message if an error occurs, but internally it keeps the inner-most > error. Ah, okay, that answers my question about putting the flag into the ext_ack struct, too. It may still be worthwhile to have a mechanism for prioritizing certain extack messages over another ones but it's clearly out of scope of this patchset. The patchset looks good to me. Just include the description you just wrote in the commit message :-) Thanks! Jiri
Re: [RFC 4/5] netlink: prepare validate extack setting for recursion
On Wed, 19 Sep 2018 11:15:25 +0200, Johannes Berg wrote: > For one, having the NL_SET_* macros check it on their own will already > not work - as we discussed over in the NLA_REJECT thread, we do need to > be able to override the data, e.g. if somebody does > > NL_SET_ERR_MSG(extack, "warning: deprecated command"); > err = nla_parse(..., extack); > > and nla_parse() sets a new message. Thus, hiding all the logic in there > already will not work, and is also IMHO rather unexpected. Normally, > *later* messages should win, not *earlier* ones - and that doesn't > require any tracking, just setting them unconditionally. > > It's just a side effect of how we do the recursive validation here that > we want *earlier* messages to win (but only within this code piece - if > a previous message was set, we want it to be overwritten by nla_parse!). Fair enough. > It might be possible to do this differently, in theory, but all the ways > I've tried to come up with so far made the code vastly more complex. Wouldn't still make sense to store the flag in the struct netlink_ext_ack, though? The way the parameters are passed around in this patch looks ugly. And adding more users of the flag later (there may be other cases when we want the earlier messages to be preserved) would mean adding parameters all around, while the flag in the struct would be readily available. I don't have a strong opinion on this, just the patch seems to be inelegant. Jiri
Re: [RFC 4/5] netlink: prepare validate extack setting for recursion
On Tue, 18 Sep 2018 15:12:11 +0200, Johannes Berg wrote: > static int validate_nla(const struct nlattr *nla, int maxtype, > const struct nla_policy *policy, > - const char **error_msg) > + struct netlink_ext_ack *extack, bool *extack_set) Can't the extack_set be included in the struct netlink_ext_ack? One less parameter to pass around and the NL_SET_* macros could check this on their own. This way, it can be also easily turned to a more complex mechanism, such as the one Marcelo proposed. Thanks, Jiri
Re: [PATCH net] geneve: add ttl inherit support
On Wed, 12 Sep 2018 10:04:21 +0800, Hangbin Liu wrote: > Similar with commit 72f6d71e491e6 ("vxlan: add ttl inherit support"), > currently ttl == 0 means "use whatever default value" on geneve instead > of inherit inner ttl. To respect compatibility with old behavior, let's > add a new IFLA_GENEVE_TTL_INHERIT for geneve ttl inherit support. > > Reported-by: Jianlin Shi > Suggested-by: Jiri Benc > Signed-off-by: Hangbin Liu Reviewed-by: Jiri Benc Thanks, Hangbin! Jiri
Re: [PATCH net] nsh: set mac len based on inner packet
On Wed, 11 Jul 2018 12:00:44 -0400, Willem de Bruijn wrote: > From: Willem de Bruijn > > When pulling the NSH header in nsh_gso_segment, set the mac length > based on the encapsulated packet type. > > skb_reset_mac_len computes an offset to the network header, which > here still points to the outer packet: > > > skb_reset_network_header(skb); > > [...] > > __skb_pull(skb, nsh_len); > > skb_reset_mac_header(skb);// now mac hdr starts nsh_len == 8B > after net hdr > > skb_reset_mac_len(skb); // mac len = net hdr - mac hdr == (u16) > -8 == 65528 > > [..] > > skb_mac_gso_segment(skb, ..) > > Link: > http://lkml.kernel.org/r/CAF=yd-keactson4axiraxl8m7qas8gbbe1w09eziywvpbbu...@mail.gmail.com > Reported-by: syzbot+7b9ed9872dab8c323...@syzkaller.appspotmail.com > Fixes: c411ed854584 ("nsh: add GSO support") > Signed-off-by: Willem de Bruijn Acked-by: Jiri Benc
Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
On Thu, 28 Jun 2018 10:05:36 -0700, Jakub Kicinski wrote: > Can we take this as a follow up through the bpf-next tree or do you > want us to respin as part of this set? I think BPF is a separate issue. Jiri
Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
On Thu, 28 Jun 2018 09:54:52 -0700, Jakub Kicinski wrote: > Hmm... in practice we could steal top bits of the size parameter for > some flags, since it seems to be limited to values < 256 today? Is it > worth it? > > It would look something along the lines of: Something like that, yes. I'll leave to Daniel to review how much sense it makes from the BPF side. Thanks! Jiri
Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags
On Wed, 27 Jun 2018 11:49:49 +0200, Daniel Borkmann wrote: > Looks good to me, and yes in BPF case a mask like TUNNEL_OPTIONS_PRESENT is > right approach since this is opaque info and solely defined by the BPF prog > that is using the generic helper. Wouldn't it make sense to introduce some safeguards here (in a backward compatible way, of course)? It's easy to mistakenly set data for a different tunnel type in a BPF program and then be surprised by the result. It might help users if such usage was detected by the kernel, one way or another. I'm thinking about something like the BPF program voluntarily specifying the type of the data; if not specified, the wildcard would be used as it is now. Jiri
Re: [PATCH net] nsh: fix infinite loop
On Thu, 3 May 2018 13:37:54 -0700, Eric Dumazet wrote: > diff --git a/net/nsh/nsh.c b/net/nsh/nsh.c > index > d7da99a0b0b852d7459eed9ac6d3cdf3d49a1a1c..9696ef96b719bf24625adea2a959deac1d2a975f > 100644 > --- a/net/nsh/nsh.c > +++ b/net/nsh/nsh.c > @@ -57,6 +57,8 @@ int nsh_pop(struct sk_buff *skb) > return -ENOMEM; > nh = (struct nshhdr *)(skb->data); > length = nsh_hdr_len(nh); > + if (length < NSH_BASE_HDR_LEN) > + return -EINVAL; > inner_proto = tun_p_to_eth_p(nh->np); > if (!pskb_may_pull(skb, length)) > return -ENOMEM; > @@ -90,6 +92,8 @@ static struct sk_buff *nsh_gso_segment(struct sk_buff *skb, > if (unlikely(!pskb_may_pull(skb, NSH_BASE_HDR_LEN))) > goto out; > nsh_len = nsh_hdr_len(nsh_hdr(skb)); > + if (nsh_len < NSH_BASE_HDR_LEN) > + goto out; > if (unlikely(!pskb_may_pull(skb, nsh_len))) > goto out; > Acked-by: Jiri Benc Thanks, Eric, and shame on me! Jiri
Re: [bpf PATCH v2] bpf: fix for lex/yacc build error with gcc-5
On Wed, 25 Apr 2018 14:22:45 -0700, John Fastabend wrote: > --- a/tools/bpf/Makefile > +++ b/tools/bpf/Makefile > @@ -76,6 +76,8 @@ $(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o > $(OUTPUT)bpf_exp.yacc.o $(OUTPUT)bpf_exp.le > $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ > > $(OUTPUT)bpf_exp.lex.c: $(OUTPUT)bpf_exp.yacc.c > +$(OUTPUT)bpf_exp.yacc.o: $(OUTPUT)bpf_exp.yacc.c > +$(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.lex.c Looks better than v1, the first dependency is important. For some reason, I did not need the other two rules. By the way, make invoked from tools/bpf/ has never really worked, even before my patchset. This works correctly: cd tools make bpf Jiri
Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind
On Wed, 14 Mar 2018 20:37:00 -0700, Alexei Starovoitov wrote: > Hanness expressed the reasons why RHEL doesn't support ipvlan long ago. > I couldn't find the complete link. This one mentions some of the issues: > https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html ipvlan improved a lot since that time :-) And Paolo has recently fixed the remaining issues we were aware of. Jiri
Re: [PATCH net-next] net: do not create fallback tunnels for non-default namespaces
On Fri, 9 Mar 2018 04:53:07 -0800, Eric Dumazet wrote: > Unless you bring it up ;) Often, you need some options to be specified, thus you end up creating another interface anyway. > Compatibility problems, mostly. > Some users might depend on existing behavior. How is this a compatibility problem, when the user has to opt in? Without the default value, the kernel behaves as before. I'm sorry, I don't see the problem, what am I missing? > You and me would not care of breaking our setups, but maybe not > unaware people out there. Sure, that's why this is an option and it defaults to off. > Since init_ns is created at boot time before the sysctl can be > changed, we rather should not change the default behavior for init_ns. This could be a problem for built in drivers. With modules, I'm perfectly happy doing this in initrd :-) To behave consistently in all cases, we might consider adding a boot option for this, too, though. Jiri
Re: [PATCH/RFC 3/3] net/sched: add tunnel option support to act_tunnel_key
On Tue, 6 Mar 2018 18:08:05 +0100, Simon Horman wrote: > +static int > +tunnel_key_copy_geneve_opt(const struct nlattr *nla, int dst_len, void *dst, > +struct netlink_ext_ack *extack) > +{ > + struct nlattr *tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX + 1]; > + int err, data_len, opt_len; > + u8 *data; > + > + err = nla_parse_nested(tb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX, > +nla, geneve_opt_policy, extack); > + if (err < 0) > + return err; > + > + if (!tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS] || > + !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE] || > + !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]) { > + NL_SET_ERR_MSG(extack, "Missing tunnel key enc geneve option > class, type or data"); I think it's obvious by now that I don't like the "enc" :-) > + return -EINVAL; > + } > + > + data = nla_data(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]); > + data_len = nla_len(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]); > + if (data_len < 4) { > + NL_SET_ERR_MSG(extack, "Tunnel key enc geneve option data is > less than 4 bytes long"); > + return -ERANGE; > + } > + if (data_len % 4) { > + NL_SET_ERR_MSG(extack, "Tunnel key enc geneve option data is > not a multiple of 4 bytes long"); > + return -ERANGE; > + } > + > + opt_len = sizeof(struct geneve_opt) + data_len; > + if (dst) { > + struct geneve_opt *opt = dst; > + u16 class; > + > + if (dst_len < opt_len) { > + NL_SET_ERR_MSG(extack, "Tunnel key enc geneve option > data length is longer than the supplied data"); I don't understand this error. What can the user do about it? Furthermore, the error is not propagated to the user space (see also below). Shouldn't this be WARN_ON or something? > + return -EINVAL; > + } > + > + class = nla_get_u16(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS]); > + put_unaligned_be16(class, &opt->opt_class); How this can be unaligned, given we check that the option length is a multiple of 4 bytes and the option header is 4 bytes? > + > + opt->type = nla_get_u8(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE]); > + opt->length = data_len / 4; /* length is in units of 4 bytes */ > + opt->r1 = 0; > + opt->r2 = 0; > + opt->r3 = 0; > + > + memcpy(opt + 1, data, data_len); > + } > + > + return opt_len; > +} > + > +static int tunnel_key_copy_opts(const struct nlattr *nla, u8 *dst, > + int dst_len, struct netlink_ext_ack *extack) Please be consistent with parameter ordering, dst and dst_len are in a different order here and in tunnel_key_copy_geneve_opt. [...] > @@ -157,6 +292,11 @@ static int tunnel_key_init(struct net *net, struct > nlattr *nla, > goto err_out; > } > > + if (opts_len) > + tunnel_key_opts_set(tb[TCA_TUNNEL_KEY_ENC_OPTS], > + &metadata->u.tun_info, opts_len, > + extack); You need to propagate the error here. The previous validation is not enough as errors while copying to tun_info were not covered. > + > metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX; > break; > default: > @@ -221,6 +361,53 @@ static void tunnel_key_release(struct tc_action *a) > kfree_rcu(params, rcu); > } > > +static int tunnel_key_geneve_opts_dump(struct sk_buff *skb, > +const struct ip_tunnel_info *info) > +{ > + int len = info->options_len; > + u8 *src = (u8 *)(info + 1); > + > + while (len > 0) { > + struct geneve_opt *opt = (struct geneve_opt *)src; > + u16 class; > + > + class = get_unaligned_be16(&opt->opt_class); I don't think this can be unaligned. > + if (nla_put_u16(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS, > + class) || > + nla_put_u8(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE, > +opt->type) || > + nla_put(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA, > + opt->length * 4, opt + 1)) > + return -EMSGSIZE; > + > + len -= sizeof(struct geneve_opt) + opt->length * 4; > + src += sizeof(struct geneve_opt) + opt->length * 4; > + } All of this needs to be nested in TCA_TUNNEL_KEY_ENC_OPTS_GENEVE. > + > + return 0; > +} > + > +static int tunnel_key_opts_dump(struct sk_buff *skb, > + const struct ip_tunnel_info *info) > +{ > + struct nlattr *start; > + int err; > + > + if (!info->options_len) > + return 0; > + > + start = nla_nest_start(skb, TCA_TUNNEL_KEY_ENC_OPTS); > +
Re: [PATCH/RFC 2/3] net/sched: act_tunnel_key: add extended ack support
On Tue, 6 Mar 2018 18:08:04 +0100, Simon Horman wrote: > - if (!tb[TCA_TUNNEL_KEY_PARMS]) > + if (!tb[TCA_TUNNEL_KEY_PARMS]) { > + NL_SET_ERR_MSG(extack, "Missing tunnel key parameter"); "parameters" (it's not just one parameter) > @@ -107,6 +109,7 @@ static int tunnel_key_init(struct net *net, struct nlattr > *nla, > break; > case TCA_TUNNEL_KEY_ACT_SET: > if (!tb[TCA_TUNNEL_KEY_ENC_KEY_ID]) { > + NL_SET_ERR_MSG(extack, "Missing tunnel key enc id"); This is not much helpful to the user. What's "enc"? I guess "Missing tunnel key id" would be enough and better. > @@ -144,6 +147,7 @@ static int tunnel_key_init(struct net *net, struct nlattr > *nla, > 0, flags, > key_id, 0); > } else { > + NL_SET_ERR_MSG(extack, "Missing both ipv4 and ipv6 enc > src and dst"); We don't need both but only one of them. And again, "enc" does not have a clear meaning. "Missing either IPv4 or IPv6 source and destination address"? In addition, it makes little sense to me to add extacks to just some of the errors in the tunnel_key_init function. Please add extacks to all of them. Thanks, Jiri
Re: [PATCH net-next] net: do not create fallback tunnels for non-default namespaces
On Thu, 8 Mar 2018 12:51:41 -0800, Eric Dumazet wrote: > Note that these tunnels are still created for the initial namespace, > to be the least intrusive for typical setups. Since this is a knob and must be turned on explicitly, why we don't get rid of the automatic interfaces even for the initial name space? It causes only problems nowadays, such as ip link add name gre0 type gre failing with "File exists" even if there was no gre0 interface before. And of course, even with the error, the interface with the name "gre0" appears in the system. And of course, it does not have any of the options specified. This is highly confusing. Not to mention the autocreated gre0 interface is basically useless. I'd like to switch the knob on by default on my systems and have the kernel behave sane, finally, even without name spaces. Thanks! Jiri
Re: [PATCH bpf-next 1/7] tools: bpftool: silence 'missing initializer' warnings
On Thu, 8 Mar 2018 23:38:12 -0800, Jakub Kicinski wrote: > FWIW I couldn't reproduce this one. Out of curiosity what GCC did you > use? gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16) Jiri
[PATCH bpf-next 2/7] tools: bpf: respect output directory during build
Currently, the programs under tools/bpf (with the notable exception of bpftool) do not respect the output directory (make O=dir). Fix that. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 38 ++ 1 file changed, 22 insertions(+), 16 deletions(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index c8ec0ae16bf0..e7b15967492e 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -1,4 +1,6 @@ # SPDX-License-Identifier: GPL-2.0 +include ../scripts/Makefile.include + prefix = /usr CC = gcc @@ -7,7 +9,7 @@ YACC = bison MAKE = make CFLAGS += -Wall -O2 -CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include +CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/include/uapi -I$(srctree)/include ifeq ($(srctree),) srctree := $(patsubst %/,%,$(dir $(CURDIR))) @@ -38,32 +40,36 @@ ifeq ($(feature-disassembler-four-args), 1) CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE endif -%.yacc.c: %.y +$(OUTPUT)%.yacc.c: $(srctree)/tools/bpf/%.y $(YACC) -o $@ -d $< -%.lex.c: %.l +$(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l $(LEX) -o $@ $< -all: bpf_jit_disasm bpf_dbg bpf_asm bpftool +$(OUTPUT)%.o: $(srctree)/tools/bpf/%.c + $(COMPILE.c) -o $@ $< + +all: $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm bpftool -bpf_jit_disasm : CFLAGS += -DPACKAGE='bpf_jit_disasm' -bpf_jit_disasm : LDLIBS = -lopcodes -lbfd -ldl -bpf_jit_disasm : bpf_jit_disasm.o +$(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm' +$(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl +$(OUTPUT)bpf_jit_disasm: $(OUTPUT)bpf_jit_disasm.o -bpf_dbg : LDLIBS = -lreadline -bpf_dbg : bpf_dbg.o +$(OUTPUT)bpf_dbg: LDLIBS = -lreadline +$(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o -bpf_asm : LDLIBS = -bpf_asm : bpf_asm.o bpf_exp.yacc.o bpf_exp.lex.o -bpf_exp.lex.o : bpf_exp.yacc.c +$(OUTPUT)bpf_asm: LDLIBS = +$(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o $(OUTPUT)bpf_exp.lex.o +$(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c clean: bpftool_clean - rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.* + rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \ + $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.* install: bpftool_install - install bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm - install bpf_dbg $(prefix)/bin/bpf_dbg - install bpf_asm $(prefix)/bin/bpf_asm + install $(OUTPUT)bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm + install $(OUTPUT)bpf_dbg $(prefix)/bin/bpf_dbg + install $(OUTPUT)bpf_asm $(prefix)/bin/bpf_asm bpftool: $(MAKE) -C bpftool -- 1.8.3.1
[PATCH bpf-next 4/7] tools: bpf: make install should build first
Make the 'install' target depend on the 'all' target to build the binaries first. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index c42ca24a072d..e8d9e4125bf3 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -50,7 +50,9 @@ $(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l $(OUTPUT)%.o: $(srctree)/tools/bpf/%.c $(COMPILE.c) -o $@ $< -all: $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm bpftool +PROGS = $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm + +all: $(PROGS) bpftool $(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm' $(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl @@ -67,7 +69,7 @@ clean: bpftool_clean rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \ $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.* -install: bpftool_install +install: $(PROGS) bpftool_install $(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin $(INSTALL) $(OUTPUT)bpf_jit_disasm $(DESTDIR)$(prefix)/bin/bpf_jit_disasm $(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg -- 1.8.3.1
[PATCH bpf-next 0/7] tools: bpf: standardize make
Currently, 'make bpf' in the tools/ directory does not provide the standard quiet output except for bpftool (which is however listed with a wrong directory). Worse, it does not respect the build output directory. The 'make bpf_install' does not work as one would expect, either. It installs unconditionally to /usr/bin without respecting DESTDIR and prefix. This patchset improves that behavior. Jiri Benc (7): tools: bpftool: silence 'missing initializer' warnings tools: bpf: respect output directory during build tools: bpf: consistent make bpf_install tools: bpf: make install should build first tools: bpf: call descend in Makefile tools: bpf: respect quiet/verbose build tools: bpf: silence make by not deleting intermediate file tools/bpf/Makefile | 74 +++--- tools/bpf/bpftool/Makefile | 2 +- 2 files changed, 51 insertions(+), 25 deletions(-) -- 1.8.3.1
[PATCH bpf-next 6/7] tools: bpf: respect quiet/verbose build
Default to quiet build, with V=1 enabling verbose build as is usual. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 38 +++--- 1 file changed, 27 insertions(+), 11 deletions(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index daca0a4277d1..757ea22c428a 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -17,6 +17,12 @@ srctree := $(patsubst %/,%,$(dir $(CURDIR))) srctree := $(patsubst %/,%,$(dir $(srctree))) endif +ifeq ($(V),1) + Q = +else + Q = @ +endif + FEATURE_USER = .bpf FEATURE_TESTS = libbfd disassembler-four-args FEATURE_DISPLAY = libbfd disassembler-four-args @@ -42,38 +48,48 @@ CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE endif $(OUTPUT)%.yacc.c: $(srctree)/tools/bpf/%.y - $(YACC) -o $@ -d $< + $(QUIET_BISON)$(YACC) -o $@ -d $< $(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l - $(LEX) -o $@ $< + $(QUIET_FLEX)$(LEX) -o $@ $< $(OUTPUT)%.o: $(srctree)/tools/bpf/%.c - $(COMPILE.c) -o $@ $< + $(QUIET_CC)$(COMPILE.c) -o $@ $< + +$(OUTPUT)%.yacc.o: $(OUTPUT)%.yacc.c + $(QUIET_CC)$(COMPILE.c) -o $@ $< +$(OUTPUT)%.lex.o: $(OUTPUT)%.lex.c + $(QUIET_CC)$(COMPILE.c) -o $@ $< PROGS = $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm all: $(PROGS) bpftool $(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm' -$(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl $(OUTPUT)bpf_jit_disasm: $(OUTPUT)bpf_jit_disasm.o + $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ -lopcodes -lbfd -ldl -$(OUTPUT)bpf_dbg: LDLIBS = -lreadline $(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o + $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ -lreadline -$(OUTPUT)bpf_asm: LDLIBS = $(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o $(OUTPUT)bpf_exp.lex.o + $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ + $(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c clean: bpftool_clean - rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \ + $(call QUIET_CLEAN, bpf-progs) + $(Q)rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \ $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.* install: $(PROGS) bpftool_install - $(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin - $(INSTALL) $(OUTPUT)bpf_jit_disasm $(DESTDIR)$(prefix)/bin/bpf_jit_disasm - $(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg - $(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm + $(call QUIET_INSTALL, bpf_jit_disasm) + $(Q)$(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin + $(Q)$(INSTALL) $(OUTPUT)bpf_jit_disasm $(DESTDIR)$(prefix)/bin/bpf_jit_disasm + $(call QUIET_INSTALL, bpf_dbg) + $(Q)$(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg + $(call QUIET_INSTALL, bpf_asm) + $(Q)$(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm bpftool: $(call descend,bpftool) -- 1.8.3.1
[PATCH bpf-next 7/7] tools: bpf: silence make by not deleting intermediate file
Even in quiet mode, make finishes with rm tools/bpf/bpf_exp.lex.c That's because it considers the file to be intermediate. Silence that by mentioning the lex.c file instead of the lex.o file; the dependency still stays. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index 757ea22c428a..c07b4e718eeb 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -75,7 +75,7 @@ $(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o $(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o $(OUTPUT)bpf_exp.lex.o $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ -$(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c +$(OUTPUT)bpf_exp.lex.c: $(OUTPUT)bpf_exp.yacc.c clean: bpftool_clean $(call QUIET_CLEAN, bpf-progs) -- 1.8.3.1
[PATCH bpf-next 3/7] tools: bpf: consistent make bpf_install
Currently, make bpf_install in tools/ does not respect DESTDIR. Moreover, it installs to /usr/bin/ unconditionally. Let it respect DESTDIR and allow prefix to be specified. Also, to be more consistent with bpftool and with the usual customs, default the prefix to /usr/local instead of /usr. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index e7b15967492e..c42ca24a072d 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -1,12 +1,13 @@ # SPDX-License-Identifier: GPL-2.0 include ../scripts/Makefile.include -prefix = /usr +prefix ?= /usr/local CC = gcc LEX = flex YACC = bison MAKE = make +INSTALL ?= install CFLAGS += -Wall -O2 CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/include/uapi -I$(srctree)/include @@ -67,9 +68,10 @@ clean: bpftool_clean $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.* install: bpftool_install - install $(OUTPUT)bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm - install $(OUTPUT)bpf_dbg $(prefix)/bin/bpf_dbg - install $(OUTPUT)bpf_asm $(prefix)/bin/bpf_asm + $(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin + $(INSTALL) $(OUTPUT)bpf_jit_disasm $(DESTDIR)$(prefix)/bin/bpf_jit_disasm + $(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg + $(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm bpftool: $(MAKE) -C bpftool -- 1.8.3.1
[PATCH bpf-next 1/7] tools: bpftool: silence 'missing initializer' warnings
When building bpf tool, gcc emits piles of warnings: prog.c: In function ‘prog_fd_by_tag’: prog.c:101:9: warning: missing initializer for field ‘type’ of ‘struct bpf_prog_info’ [-Wmissing-field-initializers] struct bpf_prog_info info = {}; ^ In file included from /home/storage/jbenc/git/net-next/tools/lib/bpf/bpf.h:26:0, from prog.c:47: /home/storage/jbenc/git/net-next/tools/include/uapi/linux/bpf.h:925:8: note: ‘type’ declared here __u32 type; ^ As these warnings are not useful, switch them off. Signed-off-by: Jiri Benc --- tools/bpf/bpftool/Makefile | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile index 26901ec87361..4c2867481f5c 100644 --- a/tools/bpf/bpftool/Makefile +++ b/tools/bpf/bpftool/Makefile @@ -38,7 +38,7 @@ bash_compdir ?= /usr/share/bash-completion/completions CC = gcc CFLAGS += -O2 -CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow +CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow -Wno-missing-field-initializers CFLAGS += -DPACKAGE='"bpftool"' -D__EXPORTED_HEADERS__ -I$(srctree)/tools/include/uapi -I$(srctree)/tools/include -I$(srctree)/tools/lib/bpf -I$(srctree)/kernel/bpf/ CFLAGS += -DBPFTOOL_VERSION='"$(BPFTOOL_VERSION)"' LIBS = -lelf -lbfd -lopcodes $(LIBBPF) -- 1.8.3.1
[PATCH bpf-next 5/7] tools: bpf: call descend in Makefile
Use the descend macro to properly propagate $(subdir) to bpftool. Signed-off-by: Jiri Benc --- tools/bpf/Makefile | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile index e8d9e4125bf3..daca0a4277d1 100644 --- a/tools/bpf/Makefile +++ b/tools/bpf/Makefile @@ -76,12 +76,12 @@ install: $(PROGS) bpftool_install $(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm bpftool: - $(MAKE) -C bpftool + $(call descend,bpftool) bpftool_install: - $(MAKE) -C bpftool install + $(call descend,bpftool,install) bpftool_clean: - $(MAKE) -C bpftool clean + $(call descend,bpftool,clean) .PHONY: bpftool FORCE -- 1.8.3.1
Re: [PATCH bpf] tools: bpftool: fix compilation with older headers
On Tue, 6 Mar 2018 15:39:07 +0100, Daniel Borkmann wrote: > Thanks for the fix, Jiri! The standard approach to resolve such header > dependencies under > tools/ would be to add a copy of magic.h uapi header into > tools/include/uapi/linux/magic.h. > > Both bpftool and libbpf have tools/include/uapi/ in their include path from > their > Makefile, so they would pull this in automatically and it would also allow to > get rid > of the extra ifdef in libbpf then. Could you look into that? That's what I tried at first. But honestly, this is a shortcut to hell. Eventually, we'd end up with most of uapi headers duplicated under tools/include/uapi and hopelessly out of sync. The right approach would be to export uapi headers from the currently built kernel somewhere (a temporary directory, perhaps) and use that to build the tools. We should not have duplicated and out of sync headers in the kernel tree. Just look at the git log for tools/include/uapi to see what I mean by "out of sync". But that's certainly not a fix suitable for bpf.git, while I think the build problem should be fixed in bpf.git. We can do something more clever in bpf-next. I can of course add the magic.h header if you have strong opinion about that. I just don't think it's the right approach and seeing the precedent in tools/lib/bpf/libbpf.c and tools/lib/api/fs/fs.c, I think this minimal patch is better at this point until the duplicate headers mess is sorted out. Please let me know if you really want me to duplicate the magic.h file. Thanks, Jiri
[PATCH bpf] tools: bpftool: fix compilation with older headers
Compilation of bpftool on a distro that lacks eBPF support in the installed kernel headers fails with: common.c: In function ‘is_bpffs’: common.c:96:40: error: ‘BPF_FS_MAGIC’ undeclared (first use in this function) return (unsigned long)st_fs.f_type == BPF_FS_MAGIC; ^ Fix this the same way it is already in tools/lib/bpf/libbpf.c and tools/lib/api/fs/fs.c. Signed-off-by: Jiri Benc --- tools/bpf/bpftool/common.c | 4 1 file changed, 4 insertions(+) diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c index 0b482c0070e0..465995281dcd 100644 --- a/tools/bpf/bpftool/common.c +++ b/tools/bpf/bpftool/common.c @@ -55,6 +55,10 @@ #include "main.h" +#ifndef BPF_FS_MAGIC +#define BPF_FS_MAGIC 0xcafe4a11 +#endif + void p_err(const char *fmt, ...) { va_list ap; -- 1.8.3.1
Re: [PATCH net-next] team: Use extack to report enslavement failures
On Wed, 28 Feb 2018 08:12:41 +0100, Jiri Pirko wrote: > Yeah, or if you have an older iproute2 package. I would keep the existing > dmesg msgs for now. In the future, when everyone is used to exacks, then > we can remove them. How do we ensure that this will actually happen in the future? Usually, when not done right away, such things tend to be forgotten for years or even forever. Also, how distant future are we talking about? Thanks, Jiri
Re: [PATCH net-next] team: Use extack to report enslavement failures
On Tue, 27 Feb 2018 17:38:08 +0200, Ido Schimmel wrote: > if (port_dev->flags & IFF_LOOPBACK) { > + NL_SET_ERR_MSG(extack, "Loopback device can't be added as a > team port"); > netdev_err(dev, "Device %s is loopback device. Loopback devices > can't be added as a team port\n", > portname); Aren't the netdev_errs unnecessary now? Or do you keep them for people using old config tools with a new kernel? If so, for how long? They should certainly be removed eventually. How do we ensure we don't forget? Seems to me it would be better to remove them right now. Thanks, Jiri
Re: Automatic TAP destruction/Monitoring namespace destruction
On Fri, 23 Feb 2018 04:39:37 -0500, Andrew Cann wrote: > In a program I'm writing I have a network namespace with a virtual (TAP) > network interface assigned to it. I would like it so that the interface is > automatically destroyed when the namespace is destroyed (ie. when the last > process in the namespace exits). I can't see any way to implement this.. This should just work. > As I understand it, when a namespace is destroyed all its interfaces are moved > to the root namespace. If this is the case, is there anyway to detect when an > interface is moved so that I can close it manually? It is the case only for interfaces backed by a physical device. Virtual interfaces are deleted when the netns is destroyed. That includes tun/tap interfaces. > Alternatively, is there a way to detect when a namespace is destroyed? I don't think we emit any netlink event on netns exit. > I figured it might possible to use inotify to do this, but it won't let me > watch directories under /proc. Also the files under /proc/*/ns/ seem to be > some > kind of wierd symlink-to-a-raw-inode-thing (?) - is there a way to detect when > an inode is destroyed that I can use with these? You'd need this patchset: https://lkml.org/lkml/2016/10/15/40 but I don't think it went anywhere. Plus it probably wouldn't be enough anyway. > I also thought it might be possible to use a netlink socket to detect when an > interface changes namespace. But the netlink docs don't seem to suggest that > this is possible. Yes, that's possible. You'll need a recent kernel with commit e8368d9ebb94 included. > Basically I'm looking for any event the Linux kernel can give me that I can > use > to implement what I want. Does anyone have any ideas? What you want should already be happening automatically. Have you tried? ip netns add ns0 ip -n ns0 tuntap add name tap0 mode tap ip -n ns0 link show dev tap0 ip netns del ns0 ip a# no tap interface Jiri
Re: [PATCH net 1/1 v4] rtnetlink: require unique netns identifier
On Wed, 7 Feb 2018 14:36:21 +0100, Christian Brauner wrote: > On Wed, Feb 07, 2018 at 04:20:01PM +0300, Kirill Tkhai wrote: > > Can't we write these 3 above branches more compact? Something like this: > > > > if (!!tb[IFLA_NET_NS_FD] + !!tb[IFLA_IF_NETNSID] + > > !!tb[IFLA_NET_NS_PID] <= 1) > > return 0; > > I always prefer for conditions to be separate and readable even if it > means additional code. But if others feel that there's value in avoiding > two additional conditions I'm happy to adapt the patch. FWIW, I don't like the n x n conditions much. But Kirill's proposal seems not to be much better. I was thinking about: int cnt = 0; if (tb[IFLA_NET_NS_FD]) cnt++; if (tb[IFLA_NET_NS_PID]) cnt++; if (tb[IFLA_NET_NETNSID]) cnt++; if (cnt > 1) { ...errorr... } but that's not better, either. As we're unlikely to add a fourth value, I guess I'm okay with the current approach in the patch. > Before I added support for netns ids for additional requests Jiri made > it so that all request that specified properties that they did not > support returned ENOTSUPP instead of EINVAL. This just keeps things > consistent. Users would now suddenly receive EINVAL. That's potentially > confusing. Yes, please, keep -EOPNOTSUPP. > As for the case of passing multiple netns identifying properties into > the same request EINVAL seems the perfect candidate and the error > message seems instructive to userspace programs. Agreed. Acked-by: Jiri Benc Thanks, Jiri
Re: [PATCH net 1/1 v3] rtnetlink: require unique netns identifier
On Tue, 6 Feb 2018 14:19:02 +0100, Christian Brauner wrote: > +/* Verify that rtnetlink requests supporting network namespace ids > + * do not pass additional properties potentially referring to different > + * network namespaces. > + */ > +static int rtnl_ensure_unique_netns(struct nlattr *tb[], > + struct netlink_ext_ack *extack) > +{ > + /* Requests without network namespace ids have been able to specify > + * multiple properties referring to different network namespaces so > + * don't regress them. > + */ > + if (!tb[IFLA_IF_NETNSID]) > + return 0; I agree with Eric that we should enforce this also for the existing pid/fd attributes. > + > + /* Caller operates on the current network namespace. */ > + if (!tb[IFLA_NET_NS_PID] && !tb[IFLA_NET_NS_FD]) > + return 0; > + > + NL_SET_ERR_MSG(extack, "multiple netns identifying attributes > specified"); > + return -EINVAL; But if we don't reach an agreement on that, this version is the next best one. No reason to compare the namespaces whether they're the same, a message with more than one such attribute is just invalid. > @@ -2649,6 +2675,10 @@ static int rtnl_dellink(struct sk_buff *skb, struct > nlmsghdr *nlh, > if (err < 0) > return err; > > + err = rtnl_ensure_unique_netns(tb, extack); > + if (err < 0) > + return err; > + > if (tb[IFLA_IFNAME]) > nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ); > > @@ -3045,6 +3079,10 @@ static int rtnl_getlink(struct sk_buff *skb, struct > nlmsghdr *nlh, > if (err < 0) > return err; > > + err = rtnl_ensure_unique_netns(tb, extack); > + if (err < 0) > + return err; > + > if (tb[IFLA_IF_NETNSID]) { > netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]); > tgt_net = get_target_net(NETLINK_CB(skb).sk, netnsid); dellink and getlink support only netnsid, we should just reject a message with pid or fd set. Jiri
Re: [PATCH net 1/1 v2] rtnetlink: require unique netns identifier
On Tue, 06 Feb 2018 16:31:29 -0600, Eric W. Biederman wrote: > Frankly. If we are talking precedence it should be: > fds > netnsids > pids The current order is 1. pids, 2. fds, though. Not that it matters much, see below. > I do think it makes a lot of sense to error if someone passes in > duplicate arguments. AKA multiple attribute that could select for > the same thing. No one will do that deliberately. It doesn't make > sense. So it is just a nonsense case we have to handle gracefully, > and correctly. With correctness being the most important as otherwise > people might just send in nonsense to exploit bugs. Completely agreed. Let's just start returning error if more than one of the pid/fs/netnsid attributes is specified. I don't think this is going to break any user. And we'll not have to care about the order. > I agree refusing to combine multiple attributes for the same thing > sounds the most sensible course. Yes, please. Thanks! Jiri
Re: [PATCH net-next 0/3 V1] rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK
On Fri, 26 Jan 2018 00:34:51 +0100, Nicolas Dichtel wrote: > Why meaningful? The user knows that the answer is like if if was done in > another > netns. It enables to have only one netlink socket instead of one per netns. > But > the code using it will be the same. Because you can't use it to query the linked interface. You can't even use it as an opaque value to track interfaces (netnsid+ifindex) because netnsids are not unique across net name spaces. You can easily have two interfaces that have all the ifindex, ifname, netnsid (and basically everything else) identical but being completely different interfaces. That's really not helpful. > I fear that with your approach, it will results to a lot of complexity in the > kernel. The complexity is (at least partly) already there. It's an inevitable result of the design decision to have relative identifiers. I agree that we should think about how to make this easy to implement. I like your idea of doing this somehow generically. Perhaps it's possible to do while keeping the netnsids valid in the caller's netns? > What is really missing for me, is a way to get a fd from an nsid. The user > should be able to call RTM_GETNSID with an fd and a nsid and the kernel > performs > the needed operations so that the fd points to the corresponding netns. That's what I was missing, too. I even looked into implementing it. But opening a fd on behalf of the process and returning it over netlink is a wrong thing to do. Netlink messages can get lost. Then you have a fd leak you can do nothing about. Given that we have netnsids used for so much stuff already (like NETLINK_LISTEN_ALL_NSID) you need to track them anyway. And if you need to track them, why bother with another identifier? It would be better if netnsid can be used universally for anything. Then there will be no need for the conversion. Jiri
Re: [PATCH net-next 2/2] dev: advertise the new ifindex when the netns iface changes
On Thu, 25 Jan 2018 15:01:39 +0100, Nicolas Dichtel wrote: > The goal is to let the user follow an interface that moves to another > netns. > > CC: Jiri Benc > CC: Christian Brauner > Signed-off-by: Nicolas Dichtel This is great, thanks, Nicolas! Reviewed-by: Jiri Benc
Re: [PATCH net-next 1/2] dev: always advertise the new nsid when the netns iface changes
On Thu, 25 Jan 2018 15:01:38 +0100, Nicolas Dichtel wrote: > The user should be able to follow any interface that moves to another > netns. There is no reason to hide physical interfaces. > > CC: Jiri Benc > CC: Christian Brauner > Signed-off-by: Nicolas Dichtel Reviewed-by: Jiri Benc
Re: [PATCH net-next 0/3 V1] rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK
On Thu, 25 Jan 2018 15:20:59 +0100, Nicolas Dichtel wrote: > Hmm, I don't agree. For me, it would be the correct answer. If user has a > socket > in ns_a and targets a RTM_GETLINK in ns_b, the answer he gets should be like > if > it was done in ns_b. But that information would be useless for the caller. Why return a value that has no meaning for the caller and can not be used? More so when the kernel is aware of what the correct meaningful value is? > This is already the case with messages received with NETLINK_LISTEN_ALL_NSID, > there is no reason to do something different. NETLINK_LISTEN_ALL_NSID is tough due to way it is implemented. But yes, it should translate the netnsids to be valid in the socket's netns. That's the only sane way for the listener. Jiri
Re: [PATCH net-next 0/3 V1] rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK
On Wed, 24 Jan 2018 16:24:34 +0100, Nicolas Dichtel wrote: > I wonder if it would be possible to do something in the netlink framework, > like > NETLINK_LISTEN_ALL_NSID. > Having some ancillary data at the netlink socket level and a function like > nlsock_net() (instead of sock_net()) to get the corresponding netns. > With that, it would be possible, in a generci way, to support this feature for > all netlink family. I'm not sure it's worth the effort to do that in the framework. You'll need modifications all the way down to the code that generates the attributes anyway. It's not enough to just specify that the operation should be done on a different netns and hide that from the handlers. Take for example the existing RTM_GETLINK. Let's say it's executed from within ns_a and targeted to ns_b (thus IFLA_IF_NETNSID = netnsid of ns_b). Now, if there's a veth interface in ns_b whose other end is in ns_c, there will be IFLA_LINK_NETNSID attribute in the response. But the value has to be netnsid of ns_c as seen from *ns_a*. If you just pretended to switch to ns_b before invoking rtnl_getlink, it would be netnsid of ns_c as seen from ns_b which would be wrong. That's why 79e1ad148c844 added the tgt_net stuff. Jiri
Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd
On Wed, 24 Jan 2018 11:53:11 +0100, Nicolas Dichtel wrote: > Le 23/01/2018 à 18:08, Jiri Benc a écrit : > > It would be much better if the whole (ifindex, netnsid) pair was > > returned. I think it could be added. > Sure. Do you plan to send a patch? I can do that but it will take a while. I'll be happy to leave that to someone else but if there's no one, I'll eventually do it. > > I think we need that. > If you agree, I can send a patch to remove this limitation. That would be great. Thanks! Jiri
Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd
On Tue, 23 Jan 2018 17:37:11 +0100, Nicolas Dichtel wrote: > When a virtual interface moves to another netns, the netlink RTM_DELLINK > message > contains the attribute IFLA_NEW_NETNSID, which identifies where the interface > moves. The nsid may be allocated if needed. The problem is that ifindex may change and it's not announced. The only way is to track both ifindex and ifname, watch for the ifname to appear in the target netns and update the application's view of ifindex. It would be much better if the whole (ifindex, netnsid) pair was returned. I think it could be added. > I don't know if it's acceptable to also allocate an nsid in case of a physical > interface. I think we need that. Thanks, Jiri
Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd
(Christian, I'm adding back the netdev list, there's no reason not to have the discussion in open.) On Tue, 23 Jan 2018 12:42:19 +0100, Christian Brauner wrote: > Thanks for the comments and discussion. Sorry, for not going through the > list but I just have a quick question that doesn't deserve to spam the > whole netdev list. :) I have written another set of patches that would > add support for IFLA_IF_NETNSID to RTM_{N,S}ETLINK and friends which I > planned to send out after the pid and fd ones. Would you be interested > in those if I sent them out over the next week? Yes, please CC me. I will have limited time next week (whole day meetings for the most of the week) but I'll try to review. In general, I see value in adding netnsid support to setlink, newlink and dellink. It's a bit of minefield, though. Look for and be sure to resolve: - Backwards compatibility with old kernels. Old kernels will ignore the IFLA_IF_NETNSID attribute that is unknown to them. And it's certainly undesirable for an application to attempt to set/create an interface in a given netns and instead the kernel sets/creates an interface in the current netns. In 79e1ad148c844, I put into place means to handle that: any kernel that understands IFLA_IF_NETNSID but does not support setlink/newlink/dellink will return -EOPNOTSUPP. This reduces the problem to detecting whether the kernel understands IFLA_IF_NETNSID. This can be done by invoking getlink with IFLA_IF_NETNSID and looking whether the reply contains IFLA_IF_NETNSID. The current getlink calls are also limited to CAP_NET_ADMIN in the *target* netns. If/when this is relaxed in the future, we still need to stay backwards compatible. So, the application needs to do this: 1. call RTM_GETLINK with IFLA_IF_NETNSID on lo (or do a dump) 2. if getting EACCESS, assume IFLA_IF_NETNSID is not supported => fall back to other means 2. if the reply does not contain IFLA_IF_NETNSID, the kernel does not support IFLA_IF_NETNSID => fall back to other means 3. try RTM_SETLINK/NEWLINK/DELLINK and check for EOPNOTSUPP => fall back to other means What is important for your kernel patches is to not break this. If you implement only partial support for some of the operation, think about the future extensibility. Applications have to be able to reliably detect what is supported on the running kernel, regardless of the kernel version. - Access rights. This is security sensitive stuff. I took the big hammer and limited everything to CAP_NET_ADMIN in the target netns. If you want to relax this, be sure to get review from people involved in name space security. Jiri
Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd
On Tue, 23 Jan 2018 11:26:58 +0100, Wolfgang Bumiller wrote: > Even if you know the netnsid, do the mentioned watches work for > nested/child namespaces if eg. a container creates new namespace before > and/or after the watch was established and moves interfaces to these > child namespaces, would you just see them disappear, or can you keep > track of them later on as well? What do you mean by "nested namespaces"? There's no such thing for net name spaces. As for missing API to get netnsid of the netns the interface is moved to, see my previous emails in this thread. This needs to be added. > Even if that works, from what the documentation tells me netlink is an > unreliable protocol, so if my watcher's socket buffer is full, wouldn't > I be losing important tracking information? Sure. But that's fundamentally unfixable independently on netlink, the kernel needs to take an action if a program is not reading its messages. Either some messages get dropped or the program is killed or infinite amount of memory is consumed. This has nothing to do with uAPI design. > I think one possible solution to tracking interfaces would be to have a > unique identifier that never changes (even if it's just a simple > uint64_t incremented whenever an interface is created). But since > they're not local to the current namespace that may require a lot of > extra permission checks (but I'm just speculating here...). You'll get a hard NACK from CRIU folks if you try to propose this. > In any case, IFLA_NET_NS_FD/PID are already there and I had been > wondering previously why they couldn't be used with RTM_GETLINK, it > would just make sense. Those predate netnsids and we can't get rid of them now, since they're part of uAPI. But we can (and should) make sure we don't add more of those. Jiri
Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd
On Mon, 22 Jan 2018 23:25:41 +0100, Christian Brauner wrote: > This is not necessarily true in scenarios where I move a network device > via RTM_NEWLINK + IFLA_NET_NS_PID into a network namespace I haven't > created. Here is an example: > > nlmsghdr->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK; > nlmsghdr->nlmsg_type = RTM_NEWLINK; > /* move to network namespace of pid */ > nla_put_u32(nlmsg, IFLA_NET_NS_PID, pid) > /* give interface new name */ > nla_put_string(nlmsg, IFLA_IFNAME, ifname) > > The only thing I have is the pid that identifies the network namespace. How do you know the interface did not get renamed in the new netns? This is racy and won't work reliably. You really need to know the netnsid before moving the interface to the netns to be able to do meaningful queries. You may argue that for your case, you're fine with the race. But I know about use cases where it matters a lot: those are tools that show network topology including changes in real time, such as Skydive. It's important to have the uAPI designed right. And we don't want two different APIs for the same thing. If you want to do any watching over the interfaces (as opposed to "move to the netns and forget"), you really have to work with netnsids. Let's focus on how to do that more easily. We don't return netnsid at all places where we should return it and we need to fix that. > There's no non-syscall way to learn the netnsid. And that is the primary problem. Jiri
Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd
On Mon, 22 Jan 2018 22:23:54 +0100, Christian Brauner wrote: > That is certainly a good idea and I'm happy to send a follow-up patch > for that! Note that I haven't looked into that and I don't know whether it is easily possible. I'll appreciate if you could try that. > But there's still value in being able to use > IFLA_NET_NS_{FD,PID} in scenarios where the network namespace has been > created by another process. In this case we don't know what its netnsid > is and not even if it had been assigned one at creation time or not. In > this case it would be useful to refer to the netns via a pid or fd. A > more concrete and frequent example is querying a network namespace of a > (sorry for the buzzword :)) container for all defined network > interfaces. That's what spurred my original comment. If you don't know the netnsid in such case, we're missing something in uAPI but at a different point than RTM_GETLINK. When you find yourself in a need to query an interface in another netns, you had to learn about that interface in the first place. Meaning you got its ifindex (or ifname, perhaps) somehow. My point is, you should have learned the netnsid at the same time. ifindex alone is not an unique identifier of an interface. The (ifindex, netnsid) pair is. (Also, note that ifindex can change when moving the interface to a different netns.) So you should never get ifindex alone when the interface is in another netns than the current one. If that happens, that is the uAPI that needs to be fixed. You have to always get the (ifindex, netnsid) pair. And that's also the way how it should operate in the other direction: (ifindex, netnsid) is the identifier to query interface in another netns. Note that many APIs in networking are based around netnsid - look at NETLINK_LISTEN_ALL_NSID. It allows you to keep track of interfaces as they are moved from name spaces to name spaces using a single socket: you get notifications about interfaces disappearing or appearing in "watched" name spaces. The name spaces are, of course, referenced by their netnsid. In order to add another "watched" name space, just assign it (or, more typically, let the kernel assign) a netnsid. Btw, we have one missing piece here: when an interface is moved to a name space that does not have netnsid attached, we want to find out where the interface was moved to. But there's currently no reliable way to do it. For veth, the other end can be used to get the netnsid (note that RTM_GETLINK will return the correct link netnsid even when the queried veth interface is in a different name space), for openvswitch, we can now use genetlink, etc., but using different ways for different interface types is not the best API ever and for standalone interfaces we have nothing. I'd like to see something added to uAPI to cover this in a generic way. But as for this patch, I don't think it's the correct way. We do have missing pieces in uAPI wrt. netns support but I think they're at different places. If you have a counter example, please speak up. Thanks, Jiri
Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd
On Thu, 18 Jan 2018 21:55:53 +0100, Christian Brauner wrote: > A more concrete scenario is creating a network namespace, moving a > device into it via RTM_SETLINK which also supports IFLA_NET_NS_{FD,PID} > and then wanting to query the device. This would be very easy to do if > one could reuse the IFLA_NET_NS_{FD,PID} without having to set a > netnsid. Wouldn't be a better solution to have a way to request the netnsid allocation (and return it) right away when creating the name space, then? Jiri
Re: [PATCH iproute2-next 1/2] ip/tunnel: Be consistent when printing tunnel collect metadata
On Mon, 22 Jan 2018 18:46:53 +0200, Serhey Popovych wrote: > + if (tb[IFLA_GRE_COLLECT_METADATA]) { > + print_bool(PRINT_ANY, "collect_metadata", "external", true); > + return; > + } Nacked-by: Jiri Benc Don't ever use "collect_metadata" for anything visible to the user. collect_metadata is a *horrible* name. It describes the internal implementation of the lwtunneling in the kernel and provides zero explanation to the user about what's that feature good for. The netlink attribute should have never had such name but it's uAPI and we have to live with it. But there's no reason to expose this to the user. Stick with the "external" name. It explains what it is about: instead of the traffic being controlled by the tunnel internal logic (or tunnel control plane, if you want), an external logic needs to be attached to the tunnel in order for the tunneling to work. Jiri
Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd
On Thu, 18 Jan 2018 21:21:24 +0100, Christian Brauner wrote: > In such scenarios setting a netns id property is > not really wanted Why? I think that's what you should do if you want to avoid setns. Just use netnsid. I don't see any problem with that and you didn't provide any explanation. Jiri
Re: [PATCH net-next] macvlan: Pass SIOC[SG]HWTSTAMP ioctls and get_ts_info to lower device
On Thu, 18 Jan 2018 02:12:34 +0100, grzegorz.ha...@gmail.com wrote: > This patch allows to enable hardware timestamping on macvlan intefaces and > find out capabilities of the lower device. > > Signed-off-by: Grzegorz Halat NACK. This does not work. For start, how do you deal with fwd_priv? When a packet is sent to other software ports, it wouldn't be time stamped. And I expect more cases like this to be there in macvlan, I only spent 10 seconds checking it. Please study how time stamping in the kernel works. A good start is Documentation/networking/timestamping.txt. Then examine all possible packet paths with macvlan, both egress and ingress. > +static int macvlan_ethtool_get_ts_info(struct net_device *dev, > + struct ethtool_ts_info *ts_info) > +{ > + const struct macvlan_dev *vlan = netdev_priv(dev); > + const struct ethtool_ops *eth_ops = vlan->lowerdev->ethtool_ops; > + > + if (eth_ops->get_ts_info) > + return eth_ops->get_ts_info(vlan->lowerdev, ts_info); > + > + ts_info->so_timestamping = SOF_TIMESTAMPING_RX_SOFTWARE | > +SOF_TIMESTAMPING_SOFTWARE; > + ts_info->phc_index = -1; What calls skb_tx_timestamp if the driver does not support it? Jiri
Re: [PATCHv3 net-next 2/2] openvswitch: add erspan version I and II support
Looks much better. On Wed, 17 Jan 2018 09:32:51 -0800, William Tu wrote: > + OVS_ERSPAN_OPT_IDX, /* be32 index */ Why don't you convert this to u32 while passing to/from user space? > + [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) }, sizeof(__be32) but see above. Thanks, Jiri
Re: [PATCH net] Revert "openvswitch: Add erspan tunnel support."
On Fri, 12 Jan 2018 12:29:22 -0800, William Tu wrote: > This reverts commit ceaa001a170e43608854d5290a48064f57b565ed. > > The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr should be designed > as a nested attribute to support all ERSPAN v1 and v2's fields. > The current attr is a be32 supporting only one field. Thus, this > patch reverts it and later patch will redo it using nested attr. > > Signed-off-by: William Tu > Cc: Jiri Benc > Cc: Pravin Shelar Acked-by: Jiri Benc
Re: [PATCH V2] ipvlan: fix ipvlan MTU limits
On Fri, 12 Jan 2018 09:50:35 -0800, Mahesh Bandewar (महेश बंडेवार) wrote: > (Looks like you missed the last update I mentioned) I did not miss it. The proposed behavior is inconsistent and has no clear pattern (I used the word "magic" for that). I guess examples will help more. See below. > Here is the approach in details - > (a) At slave creation time - it's exactly how it's done currently > where slave copies masters' mtu. at the same time max_mtu of the slave > is set as the current mtu of the master. > (b) If slave updates mtu - ipvlan_change_mtu() will be called and the > slave's mtu will get set and it will set a flag indicating that slave > has changed it's mtu (dissociation from master if the mtu is different > from masters'). If slave mtu is set same as masters' then this flag > will get reset-ed indicating it wants to follow master (current > behavior). Consider these two cases: # ip l a link eth0 type ipvlan # ip l s ipvlan0 mtu 1400 # ip l s eth0 mtu 1400 # ip l s eth0 mtu 1500 Now MTU of ipvlan0 is 1400. # ip l a link eth0 type ipvlan # ip l s eth0 mtu 1400 # ip l s ipvlan0 mtu 1400 # ip l s eth0 mtu 1500 Now MTU of ipvlan0 is 1500. See why I call that behavior "magic"? Before the last step, it's impossible to tell from the user space what will happen. And no, don't propose to artificially cover the first example by forcing the auto follow flag, it doesn't help. It just moves the magic behavior to different scenarios. > (c) When master updates mtu - ipvlan_adj_mtu() gets called where all > slaves' max_mtu changes to the master's mtu value (clamping applies > for the all slaves which are not following master). All the slaves > which are following master (flag per slave) will get this new mtu. > Another consequence of this is that slaves' flag might get reset-ed if > the master's mtu is reduced to the value that was set earlier for the > slave (and it will start following slave). Okay, you're actually proposing that. So, another example: # ip l a link eth0 type ipvlan # ip l s ipvlan0 mtu 1400 # ip l s eth0 mtu 1400 # ip l s eth0 mtu 1500 Now MTU of ipvlan0 is 1500. # ip l a link eth0 type ipvlan # ip l s ipvlan0 mtu 1400 # ip l s eth0 mtu 1600 # ip l s eth0 mtu 1500 Now MTU of ipvlan0 is 1400. There's still no consistency here. > The above should work? The user-space can query the mtu of the slave > device just like any other device. I was thinking about 'mtu_adj' with > some additional future extention but for now; we can live with a flag > on the slave device(s). It absolutely does not. Never introduce magic behavior, it just forces you to add more layers of hacks later. Really, the only way is to introduce user visible and user changeable flag. Yes, it's more work. But that's something you'll have to swallow. Introducing hacky behavior is not the way to go. We try hard to preserve user space compatibility which is pretty much impossible if you introduce magic hacky behavior. Do it right from the start. Jiri
Re: [PATCH V2] ipvlan: fix ipvlan MTU limits
On Fri, 12 Jan 2018 09:34:13 +0100, Jiri Benc wrote: > I don't think this works currently. When someone (does not have to be > you, it can be a management software running in background) sets the > MTU to the current value, the magic behavior is lost without any way to > restore it (unless I'm missing a way to restore it, see my question > above). So any user that depends on the magic behavior is broken anyway > even now. Upon further inspection, it seems that currently, slaves always follow master's MTU without a way to change it. Tough situation. Even implementing user space toggleable mtu_adj could break users in the way I described. But it seems to be the lesser evil, at least there would be a way to unbreak the scripts with one line addition. But it's absolute must to have this visible to the user space and changeable. Something like this: # ip a 123: ipvlan0: mtu 1500 (auto) qdisc ... # ip l s ipvlan0 mtu 1400 # ip a 123: ipvlan0: mtu 1400 qdisc ... # ip l s ipvlan0 mtu auto # ip a 123: ipvlan0: mtu 1500 (auto) qdisc ... Jiri
Re: [PATCH V2] ipvlan: fix ipvlan MTU limits
On Thu, 11 Jan 2018 08:59:58 -0800, Mahesh Bandewar (महेश बंडेवार) wrote: > I guess the logic would be as simple as - if mtu_adj for a slave is > set to 0, then it's > following master otherwise not. By setting different mtu for a slave, you will > set this mtu_adj a positive number which would mean it's not following master. > So it's subjected to clamping when masters' mtu is reducing but should stay > otherwise. Also when slave decides to follow master again, it can set the mtu > to be same as masters' (making mtu_adj == 0) and then it would start following > master again. How can the mtu_adj value be queried and set from user space? > Whether it's magic or not, it's the current behavior and I know several use > cases depend on this behavior which would be broken otherwise. The > approach I proposed keeps that going for those who depend on that while > adds an ability to set mtu per slave for the use case mentioned in this > patch-set too. I don't think this works currently. When someone (does not have to be you, it can be a management software running in background) sets the MTU to the current value, the magic behavior is lost without any way to restore it (unless I'm missing a way to restore it, see my question above). So any user that depends on the magic behavior is broken anyway even now. Jiri
Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support
On Thu, 11 Jan 2018 08:34:14 -0800, William Tu wrote: > I'd also prefer reverting ceaa001a170e since it's more clean but I > also hope to have this feature in 4.15. > How long does reverting take? Am I only able to submit the new patch > after the reverting is merged? Or I can submit revert and this new > patch in one series? I have little experience in reverting, can you > suggest which way is better? Send the revert for net (subject will be "[PATCH net] revert: openvswitch: Add erspan tunnel support."). Don't forget to explain why you're proposing a revert. After it is accepted and applied to net.git, wait until the patch appears in net-next.git. It may take a little while. After that, send the new patch(es) for net-next. Jiri
Re: [PATCH V2] ipvlan: fix ipvlan MTU limits
On Wed, 10 Jan 2018 18:09:50 -0800, Mahesh Bandewar (महेश बंडेवार) wrote: > I still prefer the approach I had mentioned that uses 'mtu_adj'. In > that approach you can leave those slaves which have changed their mtu > to be lower than masters' but if master's mtu changes to larger value > all other slaves will get updated mtu leaving behind the slaves who > have opted to change their mtu on their own. Also the same thing is > true when mtu get reduced at master. The problem with this magic behavior is, well, that it's magic. There's no way to tell what happens with a given slave when the master's MTU gets changed just by looking at the current configuration. There's also no way to switch the magic behavior back on once the slave's MTU is changed. At minimum, you'd need some kind of indication that the slave's MTU is following the master. And a way to toggle this back. Keefe's patch is much saner, the behavior is completely deterministic. Jiri
Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support
On Wed, 10 Jan 2018 22:35:14 +0100, Jiri Benc wrote: > The existing field must continue to work in the same way as before. It must > be accepted and *returned* by the kernel. You may add an additional field > but the existing behavior must be 100% preserved, both uABI and uAPI wise. Another way around this is reverting ceaa001a170e in net.git and designing the uAPI properly in net-next. I think that should be the preferred way, as ceaa001a170e is clearly wrong since you need to redo it after 3 months. Not sure when Linus intends to release 4.15 and how much time you have for this, though. Jiri
Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support
On Tue, 9 Jan 2018 17:51:22 -0800, William Tu wrote: > - [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS] = { .len = sizeof(u32) }, > + [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV1] = { .len = sizeof(u32) }, > + [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS] = { .len = OVS_ATTR_NESTED, > + .next = ovs_erspan_opt_lens }, Ouch. It's actually much worse than that, you're redefining the meaning of the field. That's complete no-go. > + case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV1: > + OVS_NLERR(log, "ERSPAN attribute %d is deprecated.", > + type); > + return -EINVAL; As is this. > @@ -906,8 +1017,8 @@ static int __ip_tun_to_nlattr(struct sk_buff *skb, >vxlan_opt_to_nlattr(skb, tun_opts, swkey_tun_opts_len)) > return -EMSGSIZE; > else if (output->tun_flags & TUNNEL_ERSPAN_OPT && > - nla_put_be32(skb, OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS, > - ((struct erspan_metadata > *)tun_opts)->u.index)) > + erspan_opt_to_nlattr(skb, tun_opts, > + swkey_tun_opts_len)) And this. The existing field must continue to work in the same way as before. It must be accepted and *returned* by the kernel. You may add an additional field but the existing behavior must be 100% preserved, both uABI and uAPI wise. Jiri
Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support
On Tue, 9 Jan 2018 17:51:22 -0800, William Tu wrote: > --- a/include/uapi/linux/openvswitch.h > +++ b/include/uapi/linux/openvswitch.h > @@ -363,7 +373,8 @@ enum ovs_tunnel_key_attr { > OVS_TUNNEL_KEY_ATTR_IPV6_SRC, /* struct in6_addr src IPv6 > address. */ > OVS_TUNNEL_KEY_ATTR_IPV6_DST, /* struct in6_addr dst IPv6 > address. */ > OVS_TUNNEL_KEY_ATTR_PAD, > - OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* be32 ERSPAN index. */ > + OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV1, /* be32 ERSPAN v1 index > (deprecated). */ You can't do this in an uapi header, the existing name must stay the same forever. Otherwise existing programs would suddenly stop to compile. Jiri
Re: [PATCH net v3] openvswitch: Fix pop_vlan action for double tagged frames
On Wed, 20 Dec 2017 15:09:22 -0500, Eric Garver wrote: > skb_vlan_pop() expects skb->protocol to be a valid TPID for double > tagged frames. So set skb->protocol to the TPID and let skb_vlan_pop() > shift the true ethertype into position for us. > > Fixes: 5108bbaddc37 ("openvswitch: add processing of L3 packets") > Signed-off-by: Eric Garver Thanks! Reviewed-by: Jiri Benc
Re: [PATCH net v2] openvswitch: Fix pop_vlan action for double tagged frames
On Wed, 20 Dec 2017 10:39:32 -0500, Eric Garver wrote: > + if (is_flow_key_valid(key) && key->eth.vlan.tci && key->eth.cvlan.tci) Maybe (key->eth.vlan.tci & htons(VLAN_TAG_PRESENT)) for consistency with the rest of the code? But it's just nitpicking. The real problem here is when a double tagged packet leaves the ovs bridge, it won't have the skb->protocol that the kernel expects: it will be ethertype of the payload, while my understanding is it should be the inner tpid, right? This patch fixes that nicely for the pop vlan case. But what about other cases? It seems to me that we need to add the logic to key_extract. Thanks! Jiri
Re: [PATCH net] openvswitch: Fix pop_vlan action for double tagged frames
On Tue, 19 Dec 2017 13:57:53 -0500, Eric Garver wrote: > --- a/net/openvswitch/flow.c > +++ b/net/openvswitch/flow.c > @@ -559,8 +559,9 @@ static int parse_nsh(struct sk_buff *skb, struct > sw_flow_key *key) > * of a correct length, otherwise the same as skb->network_header. > * For other key->eth.type values it is left untouched. > * > - *- skb->protocol: the type of the data starting at skb->network_header. > - * Equals to key->eth.type. > + *- skb->protocol: For Ethernet, the ethertype or VLAN TPID. > + * For non-Ethernet, the type of the data starting at > skb->network_header > + * (also equal to key->eth.type). > */ > static int key_extract(struct sk_buff *skb, struct sw_flow_key *key) > { > @@ -579,6 +580,7 @@ static int key_extract(struct sk_buff *skb, struct > sw_flow_key *key) > return -EINVAL; > > skb_reset_network_header(skb); > + key->eth.type = skb->protocol; > } else { > eth = eth_hdr(skb); > ether_addr_copy(key->eth.src, eth->h_source); > @@ -592,15 +594,14 @@ static int key_extract(struct sk_buff *skb, struct > sw_flow_key *key) > if (unlikely(parse_vlan(skb, key))) > return -ENOMEM; > > - skb->protocol = parse_ethertype(skb); > - if (unlikely(skb->protocol == htons(0))) > + key->eth.type = parse_ethertype(skb); > + if (unlikely(key->eth.type == htons(0))) > return -ENOMEM; > > skb_reset_network_header(skb); > __skb_push(skb, skb->data - skb_mac_header(skb)); > } > skb_reset_mac_len(skb); > - key->eth.type = skb->protocol; > > /* Network layer. */ > if (key->eth.type == htons(ETH_P_IP)) { Unfortunately, this does not work. key_extract must set skb->protocol even for Ethernet frames that come from a mixed L2/L3 tunnel. Such packets will have key->mac_proto set to MAC_PROTO_ETHERNET and skb->protocol set to ETH_P_TEB (see key_extract_mac_proto). In key_extract, skb->protocol has to be correctly set to the dissected value. Which means that we have to check for the existence of inner vlan tag (by checking key->eth.cvlan.tci or, perhaps better, by returning it from parse_vlan) and set skb->protocol accordingly. Jiri
Re: [PATCH] rtnetlink: fix missing size for IFLA_IF_NETNSID
On Mon, 6 Nov 2017 15:04:54 +, Colin King wrote: > The size for IFLA_IF_NETNSID is missing from the size calculation > because the proceeding semicolon was not removed. Fix this by removing > the semicolon. Acked-by: Jiri Benc Thanks for spotting this! Looking at my initial code, I had that right, this was probably introduced during one of rebases, so I blame Flavio :-p (On a serious note, thank you, Flavio, for taking care of the rebases.) Hopefully, with the "+ 0" added, this won't happen again in this particular piece of code in the future. Jiri