Re: [PATCH bpf-next] libbpf: Add libbpf_version() function to get library version at runtime

2020-11-19 Thread Jiri Benc
On Wed, 18 Nov 2020 09:43:25 -0800, Alexei Starovoitov wrote:
> Just like the kernel doesn't add features for out-of-tree modules
> libbpf doesn't add features for projects where libbpf is optional.

A more fitting comparison would be the kernel refusing to add a new
uAPI call because some application refuses to bundle the kernel.
A libbpf equivalent of a kernel module would be some kind of libbpf
plugin (which does not exist), asking for an internal libbpf API to be
added.

Alexei, could you please start cooperating with others and actually
listening to others' needs? I know you started eBPF but you're not the
only user anymore and as much convinced you may be about your view,
people have reasons for what they're doing. It would help greatly if
you could listen to these reasons.

 Jiri



Re: [PATCHv3 iproute2-next 0/5] iproute2: add libbpf support

2020-11-06 Thread Jiri Benc
On Thu, 5 Nov 2020 12:45:39 -0800, Andrii Nakryiko wrote:
> That's not true. If you need new functionality like BTF, CO-RE,
> function-by-function verification, etc., then yes, you have to update
> kernel, compiler, libbpf, sometimes pahole. But if you have an BPF
> application that doesn't use and need any of the newer features, it
> will keep working just fine with the old kernel, old libbpf, and old
> compiler.

I'm fine with this.

It doesn't work that well in practice, we've found ourselves chasing
problems caused by llvm update (problems for older bpf programs, not
new ones), problems on non-x86_64 caused by kernel updates, etc. It can
be attributed to living on the edge and it should stabilize over time,
hopefully. But it's still what the users are experiencing and it's
probably what David is referring to. I expect it to smooth itself over
time.

Add to that the fact that something that is in fact a new feature is
perceived as a bug fix by some users. For example, a perfectly valid
and simple C program, not using anything shiny but a basic simple loop,
compiles just fine but is rejected by the kernel. A newer kernel and a
newer compiler and a newer libbpf and a newer pahole will cause the
same program to be accepted. Now, the user does not see that for this,
a new load of BTF functionality had to be added and all those mentioned
projects enhanced with substantial code. All they see is their simple
hello world test program did not work and now it does.

I'm not saying I have a solution nor I'm saying you should do something
about it. Just trying to explain the perception.

 Jiri



Re: [PATCHv3 iproute2-next 0/5] iproute2: add libbpf support

2020-11-06 Thread Jiri Benc
On Thu, 5 Nov 2020 12:19:00 -0800, Andrii Nakryiko wrote:
> I'll just quote myself here for your convenience.

Sorry, I missed your original email for some reason.

>   Submodule is a way that I know of to make this better for end users.
>   If there are other ways to pull this off with shared library use, I'm
>   all for it, it will save the security angle that distros are arguing
>   for. E.g., if distributions will always have the latest libbpf
>   available almost as soon as it's cut upstream *and* new iproute2
>   versions enforce the latest libbpf when they are packaged/released,
>   then this might work equivalently for end users. If Linux distros
>   would be willing to do this faithfully and promptly, I have no
>   objections whatsoever. Because all that matters is BPF end user
>   experience, as Daniel explained above.

That's basically what we already do, for both Fedora and RHEL.

Of course, it follows the distro release cycle, i.e. no version
upgrades - or very limited ones - during lifetime of a particular
release. But that would not be different if libbpf was bundled in
individual projects.

 Jiri



Re: [PATCH v3 bpf-next] bpf: make verifier log more relevant by default

2020-11-06 Thread Jiri Benc
On Thu, 5 Nov 2020 13:22:12 -0800, Andrii Nakryiko wrote:
> test_progs is the only test runner that's run continuously on every
> patch. libbpf CI also runs test_maps and test_verifier. All the other
> test binaries/scripts rely on humans to not forget about them. Which
> works so-so, as you can see :)

Did not know that, thanks for the explanation.

Would be good to add full testing to some of the numerous CIs we have
(some taker? :-)).

 Jiri



Re: [PATCH v3 bpf-next] bpf: make verifier log more relevant by default

2020-11-06 Thread Jiri Benc
On Thu, 5 Nov 2020 14:57:13 -0800, Jakub Kicinski wrote:
> If you're saying the driver message would still be there if
> verification or translation failed that's perfectly fine, we 
> can definitely adjust the test. But some check that driver 
> message reporting is working is needed, don't just remove it.

Should we change the test to fail the verification? Sounds reasonable
to me.

 Jiri



Re: [PATCH v3 bpf-next] bpf: make verifier log more relevant by default

2020-11-05 Thread Jiri Benc
On Thu, 23 Apr 2020 12:58:50 -0700, Andrii Nakryiko wrote:
> To make BPF verifier verbose log more releavant and easier to use to debug
> verification failures, "pop" parts of log that were successfully verified.
> This has effect of leaving only verifier logs that correspond to code branches
> that lead to verification failure, which in practice should result in much
> shorter and more relevant verifier log dumps. This behavior is made the
> default behavior and can be overriden to do exhaustive logging by specifying
> BPF_LOG_LEVEL2 log level.

This patch broke the test_offload.py selftest:

[...]
Test TC offloads work...
FAIL: Missing or incorrect message from netdevsim in verifier log
[...]

The selftest expects to receive "[netdevsim] Hello from netdevsim!" in
the log (coming from nsim_bpf_verify_insn) but that part of the log is
cleared by bpf_vlog_reset added by this patch.

How can this be fixed? The log level 1 comes from the "verbose" keyword
passed to tc, I don't think it should be increased to 2.

On a related note, the selftest had to start failing after this commit.
It's a bit surprising it did not get caught, is there a bug somewhere
in the test matrix?

Thanks,

 Jiri



Re: [PATCHv3 iproute2-next 0/5] iproute2: add libbpf support

2020-11-04 Thread Jiri Benc
On Tue, 3 Nov 2020 19:11:45 -0800, Alexei Starovoitov wrote:
> When we release new version of libbpf it goes through rigorous testing.
> bpftool gets a lot of test coverage as well.
> iproute2 with shared libbpf will get nothing. It's the same random roll of 
> dice.

"Random roll of dice" would be true only if libbpf did incredibly bad
job in keeping backward compatibility. In my experience it is not the
case. Sure, a bug in retaining the compatibility may occasionally
appear; after all, any software tends to contain bugs in various
places. You are right that such bug may not be caught by your testing.

I also believe that if there is a bug in backward compatibility
reported by someone, it will be fixed (if possible). So this is really
just a matter of testing, not a fundamental problem of ABI
compatibility.

Let the distros worry about the testing. Upstream may test (and
even recommend!) certain combinations of iproute2 + libbpf, such as the
latest of both at the time of testing. If distros want to use a
different combination, they can and should do their own testing. If
their testing reveals a bug in backward compatibility and a patch to
fix it is accepted, everything will work smoothly for the distro users.

Non-distro users (or small distros) may just rely on the upstream
tested combination of iproute2 + libbpf.

> Few years from now the situation could be different and shared libbpf would
> be the most appropriate choice. But that day is not today.

Interestingly, the major compatibility problems we had were with llvm
updates. After llvm update while keeping the same kernel version, llvm
started to emit code that the verifier did not accept. Meaning a bpf
program that was previously accepted by the kernel was rejected after
recompilation. This was solved by adding a translation code to libbpf
(which nicely demonstrates that indeed libbpf cares about backward
compatibility).

Now, with dynamically linked libbpf, a single package update was able
to solve the problem for everything on the system, including users' own
programs. All that was needed was making the llvm package force update
the libbpf package (which rpm can do easily with its Conflicts
dependency).

So, at least for us, there was so far no disadvantage (and no problem)
with dynamic linking and a quite substantial advantage.

 Jiri



Re: [PATCHv3 iproute2-next 0/5] iproute2: add libbpf support

2020-11-04 Thread Jiri Benc
On Tue, 3 Nov 2020 18:45:59 -0800, Alexei Starovoitov wrote:
> libbpf is the only library I know that is backward and forward compatible.

This is great to hear. It means there will be no problem with iproute2
using the system libbpf. As libbpf is both backward and forward
compatible, iproute2 will just work with whatever version it is used
with.

> All other libraries are backwards compatible only.

Backward compatibility would be enough for iproute2 but forward
compatibility does not hurt, of course.

> The users can upgrade and downgrade libbpf version at any time.
> They can upgrade and downgrade kernel while keeping libbpf version the same.
> The users can upgrade llvm as well and libbpf has to expect unexpected
> and deal with all combinations.

This actually goes beyond what would be needed for iproute2 dynamically
linked against system libbpf.

> > How so? If libbpf is written against kernel APIs and properly versioned,
> > it should just work. A new version of libbpf changes the .so version, so
> > old commands will not load it.  
> 
> Please point out where do you see this happening in the patch set.
> See tools/lib/bpf/README.rst to understand the versioning.

If the iproute2 binaries are linked against a symbol of a newer version than
is available in the system libbpf (which should not really happen
unless the system is broken), the dynamic linker will refuse to load
it. If the binary is linked against an old version of a particular
symbol, that old version will be used, if it's still provided by the
library. Otherwise, it will not load. I don't see a problem here?

The only problem would be if a particular function changed its
semantics while retaining ABI. But since libbpf is backward and forward
compatible, this should not happen.

 Jiri



Re: [PATCHv3 iproute2-next 0/5] iproute2: add libbpf support

2020-11-03 Thread Jiri Benc
On Mon, 2 Nov 2020 22:58:06 -0800, Andrii Nakryiko wrote:
> But I don't think I got a real answer as to what's the exact reason
> against the submodule. Like what "inappropriate" even means in this
> case? Jesper's security argument so far was the only objective
> criteria, as far as I can tell.

It's the fundamental objection. Distributions in general have the "no
bundled libraries" policy. It is sometimes annoying but it helps to
understand that the policy is not a whim of distros, it's coming from
years of experience with package maintenance for security and stability.

> But I also see that using libbpf through submodule gives iproute2
> exact control over which version of libbpf is being used. And that
> does not depend at all on any specific Linux distribution, its
> version, LTS vs non-LTS, etc. iproute2 will just work the same across
> all of them. So matches your stated goals very directly and
> explicitly.

If you take this route, the end result would be all dependencies for
all projects being included as submodules and bundled. At the first
sight, this sounds easier for the developers. Why bother with dynamic
linking at all? Everything can be linked statically.

The result would be nightmare for both distros and users. No timely
security updates possible, critical bugs not being fixed in some
programs, etc. There is enough experience with this kind of setup to
conclude it is not the right way to go.

Yes, dynamic linking is initially more work for developers of both apps
and libraries. However, it pays off over time - there's no need to keep
track of security and other important fixes in the dependencies, it
comes for free from the distro work.

Btw, taking the bundling to the extreme, every app could bundle its own
well tested and compatible kernel version and be run in a VM. This
might sound far fetched but there were actual attempts to do that. It
didn't take off; I think part of the reason was that the Linux kernel
is very good in keeping its APIs stable.

And I'm convinced this is the way to go for libraries, too: put an
emphasis on API stability. Make it easy to get consumed and updated
under the hood. Everybody wins this way.

 Jiri



[PATCH bpf] selftests: bpf: switch off timeout

2020-08-06 Thread Jiri Benc
Several bpf tests are interrupted by the default timeout of 45 seconds added
by commit 852c8cbf34d3 ("selftests/kselftest/runner.sh: Add 45 second
timeout per test"). In my case it was test_progs, test_tunnel.sh,
test_lwt_ip_encap.sh and test_xdping.sh.

There's not much value in having a timeout for bpf tests, switch it off.

Signed-off-by: Jiri Benc 
---
 tools/testing/selftests/bpf/settings | 1 +
 1 file changed, 1 insertion(+)
 create mode 100644 tools/testing/selftests/bpf/settings

diff --git a/tools/testing/selftests/bpf/settings 
b/tools/testing/selftests/bpf/settings
new file mode 100644
index ..e7b9417537fb
--- /dev/null
+++ b/tools/testing/selftests/bpf/settings
@@ -0,0 +1 @@
+timeout=0
-- 
2.18.1



[PATCH net] geneve: change from tx_error to tx_dropped on missing metadata

2020-06-03 Thread Jiri Benc
If the geneve interface is in collect_md (external) mode, it can't send any
packets submitted directly to its net interface, as such packets won't have
metadata attached. This is expected.

However, the kernel itself sends some packets to the interface, most
notably, IPv6 DAD, IPv6 multicast listener reports, etc. This is not wrong,
as tunnel metadata can be specified in routing table (although technically,
that has never worked for IPv6, but hopefully will be fixed eventually) and
then the interface must correctly participate in IPv6 housekeeping.

The problem is that any such attempt increases the tx_error counter. Just
bringing up a geneve interface with IPv6 enabled is enough to see a number
of tx_errors. That causes confusion among users, prompting them to find
a network error where there is none.

Change the counter used to tx_dropped. That better conveys the meaning
(there's nothing wrong going on, just some packets are getting dropped) and
hopefully will make admins panic less.

Signed-off-by: Jiri Benc 
---
 drivers/net/geneve.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 6b461be1820b..75266580b586 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -987,9 +987,10 @@ static netdev_tx_t geneve_xmit(struct sk_buff *skb, struct 
net_device *dev)
if (geneve->collect_md) {
info = skb_tunnel_info(skb);
if (unlikely(!info || !(info->mode & IP_TUNNEL_INFO_TX))) {
-   err = -EINVAL;
netdev_dbg(dev, "no tunnel metadata\n");
-   goto tx_error;
+   dev_kfree_skb(skb);
+   dev->stats.tx_dropped++;
+   return NETDEV_TX_OK;
}
} else {
info = &geneve->info;
@@ -1006,7 +1007,7 @@ static netdev_tx_t geneve_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
if (likely(!err))
return NETDEV_TX_OK;
-tx_error:
+
dev_kfree_skb(skb);
 
if (err == -ELOOP)
-- 
2.18.1



Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.

2019-10-21 Thread Jiri Benc
On Fri, 18 Oct 2019 13:03:56 -0700, Tom Herbert wrote:
> More specifically fou allows encapsulation of anything that has an IP
> protocol number. That includes an L3 protocols that have been assigned
> a number (e.g. MPLS, GRE, IPv6, IPv4, EtherIP). So the only need for
> an alternate method to do L3 encapsulation would be for those
> protocols that don't have an IP protocol number assignment.
> Presumably, those just have an EtherType. In that case, it seems
> simple enough to just extend fou to processed an encapsulated
> EtherType. This should be little more than modifying the "struct fou"
> to hold 16 bit EtherType (union with protocol), adding
> FOU_ENCAP_ETHER, corresponding attribute, and then populate
> appropriate receive functions for the socket.

How do you suggest to plug that in? We need the received inner packets
to be "encapsulated" in an Ethernet header; the current approach of
"ip link add type ipip|sit encap fou" does not work here. Are you
suggesting to add another rtnl link type? How would it look like?
"ip link add type ethernet encap fou" does not sound appealing,
especially since "type ethernet" would have no meaning without the
"encap fou".

Thanks,

 Jiri



[PATCH bpf] selftests/bpf: More compatible nc options in test_tc_edt

2019-10-18 Thread Jiri Benc
Out of the three nc implementations widely in use, at least two (BSD netcat
and nmap-ncat) do not support -l combined with -s. Modify the nc invocation
to be accepted by all of them.

Fixes: 7df5e3db8f63 ("selftests: bpf: tc-bpf flow shaping with EDT")
Cc: Peter Oskolkov 
Signed-off-by: Jiri Benc 
---
 tools/testing/selftests/bpf/test_tc_edt.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_tc_edt.sh 
b/tools/testing/selftests/bpf/test_tc_edt.sh
index f38567ef694b..daa7d1b8d309 100755
--- a/tools/testing/selftests/bpf/test_tc_edt.sh
+++ b/tools/testing/selftests/bpf/test_tc_edt.sh
@@ -59,7 +59,7 @@ ip netns exec ${NS_SRC} tc filter add dev veth_src egress \
 
 # start the listener
 ip netns exec ${NS_DST} bash -c \
-   "nc -4 -l -s ${IP_DST} -p 9000 >/dev/null &"
+   "nc -4 -l -p 9000 >/dev/null &"
 declare -i NC_PID=$!
 sleep 1
 
-- 
2.18.1



Re: [PATCH net-next 1/2] UDP tunnel encapsulation module for tunnelling different protocols like MPLS,IP,NSH etc.

2019-10-09 Thread Jiri Benc
On Wed, 9 Oct 2019 10:58:51 -0400, Willem de Bruijn wrote:
> Yes, this needs a call to gro_cells_receive like geneve_udp_encap_recv
> and vxlan_rcv, instead of returning a negative value back to
> udp_queue_rcv_one_skb. Though that's not a major change.

Willem, how would you do that? The whole fou code including its netlink
API is built around the assumption that it's L4 encapsulation. I see no
way to extend it to do L3 encapsulation without major hacks - in fact,
you'll add all that Martin's patch adds, including a new netlink API,
but just mix that into fou.c.

> Transmit is easier. The example I shared already encapsulates packets
> with MPLs and sends them through a tunnel device. Admittedly using
> mpls routing. But the ip tunneling infrastructure supports adding
> arbitrary inner headers using ip_tunnel_encap_ops.build_header.
> net/ipv4/fou.c could be extended with a mpls_build_header alongside
> fou_build_header and gue_build_header?

The nice thing about the bareudp tunnel is that it's generic. It's just
(outer) UDP header followed by (inner) L3 header. You can use it for
NSH over UDP. Or for multitude of other purposes.

What you're proposing is MPLS only. And it would require similar amount
of code as the bareudp generic solution. It also doesn't fit fou
design: MPLS is not an IP protocol. Trying to add it there is like
forcing a round peg into a square hole. Just look at the code.
Start with parse_nl_config.

> Extending this existing code has more benefits than code reuse (and
> thereby fewer bugs), it also allows reusing the existing GRO logic,
> say.

In this case, it's the opposite: trying to retrofit L3 encapsulation
into fou is going to introduce bugs.

Moreover, given the expected usage of bareudp, the nice benefit is it's
lwtunnel only. No need to carry all the baggage of standalone
configuration fou has.

> At a high level, I think we should try hard to avoid duplicating
> tunnel code for every combination of inner and outer protocol.

Yet you're proposing to do exactly that with special casing MPLS in fou.

I'd say bareudp is as generic as you could get it. It allows any L3
protocol inside UDP. You can hardly make it more generic than that.
With ETH_P_TEB, it could also encapsulate Ethernet (that is, L2).

> Geneve
> and vxlan on the one hand and generic ip tunnel and FOU on the other
> implement most of these features. Indeed, already implements the
> IPxIPx, just not MPLS. It will be less code to add MPLS support based
> on existing infrastructure.

You're mixing apples with oranges. IP tunnels and FOU encapsulate L4
payload. VXLAN and Geneve encapsulate L2 payload (or L3 payload for
VXLAN-GPE).

I agree that VXLAN, Geneve and the proposed bareudp module have
duplicated code. The solution is to factoring it out to a common
location.

> I think it will be preferable to work the other way around and extend
> existing ip tunnel infra to add MPLS.

You see, that's the problem: this is not IP tunneling.

 Jiri


Re: [PATCHv2 net-next 2/6] lwtunnel: add LWTUNNEL_IP_OPTS support for lwtunnel_ip

2019-10-09 Thread Jiri Benc
On Wed, 9 Oct 2019 09:55:27 +0200, Simon Horman wrote:
> This is the same concerned that was raised by others when I posed a patch
> to allow setting of Geneve options in a similar manner. I think what is
> called for here, as was the case in the Geneve work, is to expose netlink
> attributes for each option that may be set and have the kernel form
> these into the internal format (which appears to also be the wire format).

I agree with Simon.

Thanks,

 Jiri


[PATCH bpf] bpf: lwtunnel: fix reroute supplying invalid dst

2019-10-09 Thread Jiri Benc
The dst in bpf_input() has lwtstate field set. As it is of the
LWTUNNEL_ENCAP_BPF type, lwtstate->data is struct bpf_lwt. When the bpf
program returns BPF_LWT_REROUTE, ip_route_input_noref is directly called on
this skb. This causes invalid memory access, as ip_route_input_slow calls
skb_tunnel_info(skb) that expects the dst->lwstate->data to be
struct ip_tunnel_info. This results to struct bpf_lwt being accessed as
struct ip_tunnel_info.

Drop the dst before calling the IP route input functions (both for IPv4 and
IPv6).

Reported by KASAN.

Fixes: 3bd0b15281af ("bpf: add handling of BPF_LWT_REROUTE to lwt_bpf.c")
Cc: Peter Oskolkov 
Signed-off-by: Jiri Benc 
---
 net/core/lwt_bpf.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index f93785e5833c..74cfb8b5ab33 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -88,11 +88,16 @@ static int bpf_lwt_input_reroute(struct sk_buff *skb)
int err = -EINVAL;
 
if (skb->protocol == htons(ETH_P_IP)) {
+   struct net_device *dev = skb_dst(skb)->dev;
struct iphdr *iph = ip_hdr(skb);
 
+   dev_hold(dev);
+   skb_dst_drop(skb);
err = ip_route_input_noref(skb, iph->daddr, iph->saddr,
-  iph->tos, skb_dst(skb)->dev);
+  iph->tos, dev);
+   dev_put(dev);
} else if (skb->protocol == htons(ETH_P_IPV6)) {
+   skb_dst_drop(skb);
err = ipv6_stub->ipv6_route_input(skb);
} else {
err = -EAFNOSUPPORT;
-- 
2.18.1



Re: [PATCH v3 bpf-next 5/7] libbpf: move bpf_{helpers,endian,tracing}.h into libbpf

2019-10-08 Thread Jiri Benc
On Fri, 4 Oct 2019 18:37:44 +, Yonghong Song wrote:
> distro can package bpf/btf uapi headers into libbpf package.
> Users linking with libbpf.a/libbpf.so can use bpf/btf.h with include
> path pointing to libbpf dev package include directory.
> Could this work?

I don't think it would. Distros have often a policy against bundling
files that are available from one package (in this case, kernel-headers
or similar) in a different package (libbpf).

The correct way is making the libbpf package depend on a particular
version of kernel-headers (or newer). As I said, I don't see a problem
here. It's not a special situation, it's just usual dependencies.

 Jiri


Re: [PATCH v3 bpf-next 5/7] libbpf: move bpf_{helpers,endian,tracing}.h into libbpf

2019-10-08 Thread Jiri Benc
On Fri, 4 Oct 2019 11:30:26 -0700, Jakub Kicinski wrote:
> Surely for distroes tho - they would have kernel headers matching the
> kernel release they ship. If parts of libbpf from GH only work with
> the latest kernel, distroes should ship libbpf from the kernel source,
> rather than GH.

I don't see a problem here for distros. Distros control both the kernel
and libbpf, there's no problem in keeping those in sync.

Packaging libbpf from the kernel source would not help anything -
updating libbpf would still require a new kernel. There's no difference
between updating libbpf from a github repo or from the kernel source,
as far as dependencies are concerned.

 Jiri


[PATCH bpf 1/2] selftests/bpf: set rp_filter in test_flow_dissector

2019-10-08 Thread Jiri Benc
Many distributions enable rp_filter. However, the flow dissector test
generates packets that have 1.1.1.1 set as (inner) source address without
this address being reachable. This causes the selftest to fail.

The selftests should not assume a particular initial configuration. Switch
off rp_filter.

Fixes: 50b3ed57dee9 ("selftests/bpf: test bpf flow dissection")
Cc: Petar Penkov 
Signed-off-by: Jiri Benc 
---
 tools/testing/selftests/bpf/test_flow_dissector.sh | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_flow_dissector.sh 
b/tools/testing/selftests/bpf/test_flow_dissector.sh
index d23d4da66b83..e2d06191bd35 100755
--- a/tools/testing/selftests/bpf/test_flow_dissector.sh
+++ b/tools/testing/selftests/bpf/test_flow_dissector.sh
@@ -63,6 +63,9 @@ fi
 
 # Setup
 tc qdisc add dev lo ingress
+echo 0 > /proc/sys/net/ipv4/conf/default/rp_filter
+echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
+echo 0 > /proc/sys/net/ipv4/conf/lo/rp_filter
 
 echo "Testing IPv4..."
 # Drops all IP/UDP packets coming from port 9
-- 
2.18.1



[PATCH bpf 0/2] selftests/bpf: fix false failures

2019-10-08 Thread Jiri Benc
The test_flow_dissector and test_lwt_ip_encap selftests were failing for me.
It was caused by the tests not being enough system/distro independent.

Jiri Benc (2):
  selftests/bpf: set rp_filter in test_flow_dissector
  selftests/bpf: more compatible nc options in test_lwt_ip_encap

 tools/testing/selftests/bpf/test_flow_dissector.sh | 3 +++
 tools/testing/selftests/bpf/test_lwt_ip_encap.sh   | 6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)

-- 
2.18.1



[PATCH bpf 2/2] selftests/bpf: more compatible nc options in test_lwt_ip_encap

2019-10-08 Thread Jiri Benc
Out of the three nc implementations widely in use, at least two (BSD netcat
and nmap-ncat) do not support -l combined with -s. Modify the nc invocation
to be accepted by all of them.

Fixes: 17a90a788473 ("selftests/bpf: test that GSO works in lwt_ip_encap")
Cc: Peter Oskolkov 
Signed-off-by: Jiri Benc 
---
 tools/testing/selftests/bpf/test_lwt_ip_encap.sh | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh 
b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
index acf7a74f97cd..59ea56945e6c 100755
--- a/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
+++ b/tools/testing/selftests/bpf/test_lwt_ip_encap.sh
@@ -314,15 +314,15 @@ test_gso()
command -v nc >/dev/null 2>&1 || \
{ echo >&2 "nc is not available: skipping TSO tests"; return; }
 
-   # listen on IPv*_DST, capture TCP into $TMPFILE
+   # listen on port 9000, capture TCP into $TMPFILE
if [ "${PROTO}" == "IPv4" ] ; then
IP_DST=${IPv4_DST}
ip netns exec ${NS3} bash -c \
-   "nc -4 -l -s ${IPv4_DST} -p 9000 > ${TMPFILE} &"
+   "nc -4 -l -p 9000 > ${TMPFILE} &"
elif [ "${PROTO}" == "IPv6" ] ; then
IP_DST=${IPv6_DST}
ip netns exec ${NS3} bash -c \
-   "nc -6 -l -s ${IPv6_DST} -p 9000 > ${TMPFILE} &"
+   "nc -6 -l -p 9000 > ${TMPFILE} &"
RET=$?
else
echo "test_gso: unknown PROTO: ${PROTO}"
-- 
2.18.1



[PATCH bpf-next] selftests: bpf: standardize to static __always_inline

2019-07-02 Thread Jiri Benc
The progs for bpf selftests use several different notations to force
function inlining. Standardize to what most of them use,
static __always_inline.

Suggested-by: Song Liu 
Signed-off-by: Jiri Benc 
---
 tools/testing/selftests/bpf/progs/pyperf.h|  9 +++--
 .../testing/selftests/bpf/progs/strobemeta.h  | 36 ++-
 .../testing/selftests/bpf/progs/test_jhash.h  |  3 +-
 .../selftests/bpf/progs/test_seg6_loop.c  | 23 ++--
 .../selftests/bpf/progs/test_verif_scale2.c   |  2 +-
 5 files changed, 38 insertions(+), 35 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/pyperf.h 
b/tools/testing/selftests/bpf/progs/pyperf.h
index 6b0781391be5..abf6224649be 100644
--- a/tools/testing/selftests/bpf/progs/pyperf.h
+++ b/tools/testing/selftests/bpf/progs/pyperf.h
@@ -75,8 +75,7 @@ typedef struct {
void* co_name; // PyCodeObject.co_name
 } FrameData;
 
-static inline __attribute__((__always_inline__)) void*
-get_thread_state(void* tls_base, PidData* pidData)
+static __always_inline void *get_thread_state(void *tls_base, PidData *pidData)
 {
void* thread_state;
int key;
@@ -87,8 +86,8 @@ get_thread_state(void* tls_base, PidData* pidData)
return thread_state;
 }
 
-static inline __attribute__((__always_inline__)) bool
-get_frame_data(void* frame_ptr, PidData* pidData, FrameData* frame, Symbol* 
symbol)
+static __always_inline bool get_frame_data(void *frame_ptr, PidData *pidData,
+  FrameData *frame, Symbol *symbol)
 {
// read data from PyFrameObject
bpf_probe_read(&frame->f_back,
@@ -161,7 +160,7 @@ struct bpf_elf_map SEC("maps") stackmap = {
.max_elem = 1000,
 };
 
-static inline __attribute__((__always_inline__)) int __on_event(struct pt_regs 
*ctx)
+static __always_inline int __on_event(struct pt_regs *ctx)
 {
uint64_t pid_tgid = bpf_get_current_pid_tgid();
pid_t pid = (pid_t)(pid_tgid >> 32);
diff --git a/tools/testing/selftests/bpf/progs/strobemeta.h 
b/tools/testing/selftests/bpf/progs/strobemeta.h
index 1ff73f60a3e4..553bc3b62e89 100644
--- a/tools/testing/selftests/bpf/progs/strobemeta.h
+++ b/tools/testing/selftests/bpf/progs/strobemeta.h
@@ -266,8 +266,8 @@ struct tls_index {
uint64_t offset;
 };
 
-static inline __attribute__((always_inline))
-void *calc_location(struct strobe_value_loc *loc, void *tls_base)
+static __always_inline void *calc_location(struct strobe_value_loc *loc,
+  void *tls_base)
 {
/*
 * tls_mode value is:
@@ -327,10 +327,10 @@ void *calc_location(struct strobe_value_loc *loc, void 
*tls_base)
: NULL;
 }
 
-static inline __attribute__((always_inline))
-void read_int_var(struct strobemeta_cfg *cfg, size_t idx, void *tls_base,
- struct strobe_value_generic *value,
- struct strobemeta_payload *data)
+static __always_inline void read_int_var(struct strobemeta_cfg *cfg,
+size_t idx, void *tls_base,
+struct strobe_value_generic *value,
+struct strobemeta_payload *data)
 {
void *location = calc_location(&cfg->int_locs[idx], tls_base);
if (!location)
@@ -342,10 +342,11 @@ void read_int_var(struct strobemeta_cfg *cfg, size_t idx, 
void *tls_base,
data->int_vals_set_mask |= (1 << idx);
 }
 
-static inline __attribute__((always_inline))
-uint64_t read_str_var(struct strobemeta_cfg* cfg, size_t idx, void *tls_base,
- struct strobe_value_generic *value,
- struct strobemeta_payload *data, void *payload)
+static __always_inline uint64_t read_str_var(struct strobemeta_cfg *cfg,
+size_t idx, void *tls_base,
+struct strobe_value_generic *value,
+struct strobemeta_payload *data,
+void *payload)
 {
void *location;
uint32_t len;
@@ -371,10 +372,11 @@ uint64_t read_str_var(struct strobemeta_cfg* cfg, size_t 
idx, void *tls_base,
return len;
 }
 
-static inline __attribute__((always_inline))
-void *read_map_var(struct strobemeta_cfg *cfg, size_t idx, void *tls_base,
-  struct strobe_value_generic *value,
-  struct strobemeta_payload* data, void *payload)
+static __always_inline void *read_map_var(struct strobemeta_cfg *cfg,
+ size_t idx, void *tls_base,
+ struct strobe_value_generic *value,
+ struct strobemeta_payload *data,
+ void *payload)
 {
struct strobe_map_descr* descr = &data->map_descrs[idx];
struc

[PATCH bpf v2] selftests: bpf: fix inlines in test_lwt_seg6local

2019-07-02 Thread Jiri Benc
Selftests are reporting this failure in test_lwt_seg6local.sh:

+ ip netns exec ns2 ip -6 route add fb00::6 encap bpf in obj 
test_lwt_seg6local.o sec encap_srh dev veth2
Error fetching program/map!
Failed to parse eBPF program: Operation not permitted

The problem is __attribute__((always_inline)) alone is not enough to prevent
clang from inserting those functions in .text. In that case, .text is not
marked as relocateable.

See the output of objdump -h test_lwt_seg6local.o:

Idx Name  Size  VMA   LMA   File off  Algn
  0 .text 3530      0040  2**3
  CONTENTS, ALLOC, LOAD, READONLY, CODE

This causes the iproute bpf loader to fail in bpf_fetch_prog_sec:
bpf_has_call_data returns true but bpf_fetch_prog_relo fails as there's no
relocateable .text section in the file.

To fix this, convert to 'static __always_inline'.

v2: Use 'static __always_inline' instead of 'static inline
__attribute__((always_inline))'

Fixes: c99a84eac026 ("selftests/bpf: test for seg6local End.BPF action")
Signed-off-by: Jiri Benc 
---
 .../testing/selftests/bpf/progs/test_lwt_seg6local.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c 
b/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c
index 0575751bc1bc..e2f6ed0a583d 100644
--- a/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c
+++ b/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c
@@ -61,7 +61,7 @@ struct sr6_tlv_t {
unsigned char value[0];
 } BPF_PACKET_HEADER;
 
-__attribute__((always_inline)) struct ip6_srh_t *get_srh(struct __sk_buff *skb)
+static __always_inline struct ip6_srh_t *get_srh(struct __sk_buff *skb)
 {
void *cursor, *data_end;
struct ip6_srh_t *srh;
@@ -95,7 +95,7 @@ __attribute__((always_inline)) struct ip6_srh_t 
*get_srh(struct __sk_buff *skb)
return srh;
 }
 
-__attribute__((always_inline))
+static __always_inline
 int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad,
   uint32_t old_pad, uint32_t pad_off)
 {
@@ -125,7 +125,7 @@ int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad,
return 0;
 }
 
-__attribute__((always_inline))
+static __always_inline
 int is_valid_tlv_boundary(struct __sk_buff *skb, struct ip6_srh_t *srh,
  uint32_t *tlv_off, uint32_t *pad_size,
  uint32_t *pad_off)
@@ -184,7 +184,7 @@ int is_valid_tlv_boundary(struct __sk_buff *skb, struct 
ip6_srh_t *srh,
return 0;
 }
 
-__attribute__((always_inline))
+static __always_inline
 int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off,
struct sr6_tlv_t *itlv, uint8_t tlv_size)
 {
@@ -228,7 +228,7 @@ int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, 
uint32_t tlv_off,
return update_tlv_pad(skb, new_pad, pad_size, pad_off);
 }
 
-__attribute__((always_inline))
+static __always_inline
 int delete_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh,
   uint32_t tlv_off)
 {
@@ -266,7 +266,7 @@ int delete_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh,
return update_tlv_pad(skb, new_pad, pad_size, pad_off);
 }
 
-__attribute__((always_inline))
+static __always_inline
 int has_egr_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh)
 {
int tlv_offset = sizeof(struct ip6_t) + sizeof(struct ip6_srh_t) +
-- 
2.18.1



Re: [PATCH bpf] selftests: bpf: fix inlines in test_lwt_seg6local

2019-07-02 Thread Jiri Benc
On Mon, 1 Jul 2019 11:39:27 -0700, Y Song wrote:
> By default, we have
> # define __always_inlineinline __attribute__((always_inline))
> 
> So just use __always_inline should be less verbose in your patch.

I'll resubmit, converting everything to __always_inline, targeting
bpf-next.

> BTW, what compiler did you use have this behavior?

clang version 7.0.1 (tags/RELEASE_701/final)
LLVM version 7.0.1

> Did you have issues with `static __attribute__((always_inline))`?

That seems to work, too.

 Jiri


Re: [PATCH bpf] selftests: bpf: fix inlines in test_lwt_seg6local

2019-07-01 Thread Jiri Benc
On Sat, 29 Jun 2019 11:04:54 -0700, Song Liu wrote:
> > Maybe use "__always_inline" as most other tests do?  
> 
> I meant "static __always_inline".

Sure, I can do that. It doesn't seem to be as consistent as you
suggest, though.

There are three different forms used in selftests/bpf/progs:

static __always_inline
static inline __attribute__((__always_inline__))
static inline __attribute__((always_inline))

As this is a bug causing selftests to fail (at least for some clang/llvm
versions), how about applying this to bpf.git as a minimal fix and
unifying the progs in bpf-next?

Thanks,

 Jiri


[PATCH bpf] selftests: bpf: fix inlines in test_lwt_seg6local

2019-06-29 Thread Jiri Benc
Selftests are reporting this failure in test_lwt_seg6local.sh:

+ ip netns exec ns2 ip -6 route add fb00::6 encap bpf in obj 
test_lwt_seg6local.o sec encap_srh dev veth2
Error fetching program/map!
Failed to parse eBPF program: Operation not permitted

The problem is __attribute__((always_inline)) alone is not enough to prevent
clang from inserting those functions in .text. In that case, .text is not
marked as relocateable.

See the output of objdump -h test_lwt_seg6local.o:

Idx Name  Size  VMA   LMA   File off  Algn
  0 .text 3530      0040  2**3
  CONTENTS, ALLOC, LOAD, READONLY, CODE

This causes the iproute bpf loader to fail in bpf_fetch_prog_sec:
bpf_has_call_data returns true but bpf_fetch_prog_relo fails as there's no
relocateable .text section in the file.

Add 'static inline' to fix this.

Fixes: c99a84eac026 ("selftests/bpf: test for seg6local End.BPF action")
Signed-off-by: Jiri Benc 
---
 .../selftests/bpf/progs/test_lwt_seg6local.c| 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c 
b/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c
index 0575751bc1bc..5848c1b52554 100644
--- a/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c
+++ b/tools/testing/selftests/bpf/progs/test_lwt_seg6local.c
@@ -61,7 +61,8 @@ struct sr6_tlv_t {
unsigned char value[0];
 } BPF_PACKET_HEADER;
 
-__attribute__((always_inline)) struct ip6_srh_t *get_srh(struct __sk_buff *skb)
+static inline __attribute__((always_inline))
+struct ip6_srh_t *get_srh(struct __sk_buff *skb)
 {
void *cursor, *data_end;
struct ip6_srh_t *srh;
@@ -95,7 +96,7 @@ __attribute__((always_inline)) struct ip6_srh_t 
*get_srh(struct __sk_buff *skb)
return srh;
 }
 
-__attribute__((always_inline))
+static inline __attribute__((always_inline))
 int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad,
   uint32_t old_pad, uint32_t pad_off)
 {
@@ -125,7 +126,7 @@ int update_tlv_pad(struct __sk_buff *skb, uint32_t new_pad,
return 0;
 }
 
-__attribute__((always_inline))
+static inline __attribute__((always_inline))
 int is_valid_tlv_boundary(struct __sk_buff *skb, struct ip6_srh_t *srh,
  uint32_t *tlv_off, uint32_t *pad_size,
  uint32_t *pad_off)
@@ -184,7 +185,7 @@ int is_valid_tlv_boundary(struct __sk_buff *skb, struct 
ip6_srh_t *srh,
return 0;
 }
 
-__attribute__((always_inline))
+static inline __attribute__((always_inline))
 int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, uint32_t tlv_off,
struct sr6_tlv_t *itlv, uint8_t tlv_size)
 {
@@ -228,7 +229,7 @@ int add_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh, 
uint32_t tlv_off,
return update_tlv_pad(skb, new_pad, pad_size, pad_off);
 }
 
-__attribute__((always_inline))
+static inline __attribute__((always_inline))
 int delete_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh,
   uint32_t tlv_off)
 {
@@ -266,7 +267,7 @@ int delete_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh,
return update_tlv_pad(skb, new_pad, pad_size, pad_off);
 }
 
-__attribute__((always_inline))
+static inline __attribute__((always_inline))
 int has_egr_tlv(struct __sk_buff *skb, struct ip6_srh_t *srh)
 {
int tlv_offset = sizeof(struct ip6_t) + sizeof(struct ip6_srh_t) +
-- 
2.18.1



Re: [PATCH net-next] macvlan: pass get_ts_info and SIOC[SG]HWTSTAMP ioctl to real device

2019-04-17 Thread Jiri Benc
On Wed, 17 Apr 2019 08:43:06 -0700, Richard Cochran wrote:
> If NET_ADMIN is enabled in the container, don't the host and container
> contend with each other for the physical interfaces anyhow?

Physical interfaces are not a problem, as each interface can be only in
a single net name space.

The problem here is this patch gives access to physical interface
settings through a virtual interface layered on top of it. Whenever
such thing is done, the virtual interface needs to provide a suitable
way of moderating access to the shared resources, so the individual
virtual interfaces do not affect each other. That's not what's being
done here.

I think this patch is wrong.

 Jiri


[PATCH net] geneve: correctly handle ipv6.disable module parameter

2019-02-28 Thread Jiri Benc
When IPv6 is compiled but disabled at runtime, geneve_sock_add returns
-EAFNOSUPPORT. For metadata based tunnels, this causes failure of the whole
operation of bringing up the tunnel.

Ignore failure of IPv6 socket creation for metadata based tunnels caused by
IPv6 not being available.

This is the same fix as what commit d074bf960044 ("vxlan: correctly handle
ipv6.disable module parameter") is doing for vxlan.

Note there's also commit c0a47e44c098 ("geneve: should not call rt6_lookup()
when ipv6 was disabled") which fixes a similar issue but for regular
tunnels, while this patch is needed for metadata based tunnels.

Signed-off-by: Jiri Benc 
---
 drivers/net/geneve.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
index 3377ac66a347..5583d993480d 100644
--- a/drivers/net/geneve.c
+++ b/drivers/net/geneve.c
@@ -692,15 +692,20 @@ static int geneve_sock_add(struct geneve_dev *geneve, 
bool ipv6)
 static int geneve_open(struct net_device *dev)
 {
struct geneve_dev *geneve = netdev_priv(dev);
-   bool ipv6 = !!(geneve->info.mode & IP_TUNNEL_INFO_IPV6);
bool metadata = geneve->collect_md;
+   bool ipv4, ipv6;
int ret = 0;
 
+   ipv6 = geneve->info.mode & IP_TUNNEL_INFO_IPV6 || metadata;
+   ipv4 = !ipv6 || metadata;
 #if IS_ENABLED(CONFIG_IPV6)
-   if (ipv6 || metadata)
+   if (ipv6) {
ret = geneve_sock_add(geneve, true);
+   if (ret < 0 && ret != -EAFNOSUPPORT)
+   ipv4 = false;
+   }
 #endif
-   if (!ret && (!ipv6 || metadata))
+   if (ipv4)
ret = geneve_sock_add(geneve, false);
if (ret < 0)
geneve_sock_release(geneve);
-- 
2.18.1



Re: [PATCH net-next 00/11] ICMP error handling for UDP tunnels

2018-11-07 Thread Jiri Benc
On Tue,  6 Nov 2018 22:38:56 +0100, Stefano Brivio wrote:
> - patch 1/11 adds a socket lookup for UDP tunnels that use, by design,
>   the same destination port on both endpoints -- i.e. VxLAN and GENEVE

This is not necessarily true with lwtunnels (collect_md mode of VXLAN
and GENEVE). While any sane setup will use the same dst ports, there's
really nothing that enforces it. Of course, in that case we have no way
to map the ICMP error back to the tunnel.

Generally speaking, I'm not sure how ICMP error handling should work
for external control planes. Are we sure they want PMTU discovery and
route redirection done by the kernel? (I am not sure, neither way.)

Jiri


Re: [PATCH RFC v2 net-next 03/25] netlink: introduce NLM_F_DUMP_PROPER_HDR flag

2018-10-02 Thread Jiri Benc
On Tue, 2 Oct 2018 08:55:19 -0600, David Ahern wrote:
> You want take issue with a name suggest a different one.

I don't have a strong opinion on this. STRICT_HDR? HDR_CHECKED?
VERIFY_HDR? If you feel that your proposal is better, I don't mind.

Thanks,

 Jiri


Re: [PATCH RFC v2 net-next 03/25] netlink: introduce NLM_F_DUMP_PROPER_HDR flag

2018-10-02 Thread Jiri Benc
On Tue, 2 Oct 2018 08:57:24 -0600, David Ahern wrote:
> You can when you introduce a new option or a new flag that is required
> to get new behavior like kernel side filtering.

Yes. That was what I tried with the patchset a few years back. It would
be nice to revive the effort.

> I chose a netlink flag for consistency with NLM_F_DUMP_INTR and
> NLM_F_DUMP_FILTERED. Both are netlink flags. This patch set fixes only
> what is broken -- dumps.

When we're introducing better input checking in netlink (which is a
good thing!), it would be good to do it consistently and have it
generic across all operations.

 Jiri


Re: [PATCH RFC v2 net-next 02/25] net/ipv6: Refactor address dump to push inet6_fill_args to in6_dump_addrs

2018-10-02 Thread Jiri Benc
On Tue, 2 Oct 2018 09:11:17 -0600, David Ahern wrote:
> Generically speaking a filter modifies the output based on the input.
> Specifying a target namespace is an input to the dump that modifies the
> output.

That's conventionally called "algorithm" :-)

Let's just say we have a different understanding of what "filter" is.
Perhaps we should look at it from a different side. What is the use
case of having NLM_F_DUMP_FILTERED set for IFA_TARGET_NETNSID? How is
this going to be used by applications?

> Yes, you can do it in userspace which is what iproute2 has done to this
> point, but it is grossly inefficient and that inefficiency has
> implications at scale.

You can't do that with IFA_TARGET_NETNSID. Which is my point. Without
the flag, you don't get a list of all interfaces in all net name spaces.

 Jiri


Re: [PATCH RFC v2 net-next 03/25] netlink: introduce NLM_F_DUMP_PROPER_HDR flag

2018-10-02 Thread Jiri Benc
On Tue, 2 Oct 2018 13:18:32 +0200, Christian Brauner wrote:
> I didn't find this in the linked thread.

Maybe it was suggested in another thread or in person on a conference,
I can't remember, it's too long ago, sorry.

> What I find interesting and convincing is one of Dave's points:
> 
> "I'm beginning to wonder if we can just change this unilaterally to
> not ignore unrecognized attributes.
> 
> I am increasingly certain that things that would "break" we wouldn't
> want to succeed anyways." [1]

It's unfortunate we can't do that. I'd like it.

> But a socket option or this header flag both sound acceptable to me. Was
> there any more detail on how a socket option would look like, i.e. an
> api proposal or something?

Look at how NETLINK_CAP_ACK and NETLINK_EXT_ACK is implemented.

 Jiri


Re: [PATCH RFC v2 net-next 02/25] net/ipv6: Refactor address dump to push inet6_fill_args to in6_dump_addrs

2018-10-02 Thread Jiri Benc
On Tue, 2 Oct 2018 13:03:00 +0200, Christian Brauner wrote:
> Well, it's a namespace filter that's how I saw it.

That would imply that without it, you get data from all name spaces
(= unfiltered by name space), which is not what's happening :-)

 Jiri


Re: [PATCH RFC v2 net-next 03/25] netlink: introduce NLM_F_DUMP_PROPER_HDR flag

2018-10-02 Thread Jiri Benc
On Mon,  1 Oct 2018 17:28:29 -0700, David Ahern wrote:
> Add a new flag, NLM_F_DUMP_PROPER_HDR, for userspace to indicate to the
> kernel that it believes it is sending the right header struct for the
> dump message type (ifinfomsg, ifaddrmsg, rtmsg, fib_rule_hdr, ...).

Why is this limited to dumps? Other kind of netlink messages contain
the common struct, too. When introducing such mechanism, please make it
generic.

Last time when we were discussing strict checking in netlink, it was
suggested to add a socket option instead of adding NLM flags[1].
It makes a lot of sense: the number of flags is very limited and we'd
run out of them pretty fast. It's not just the header structure that
is currently checked sloppily. It's also attributes, flags in
attributes, etc. We can't assign a flag to all of them.

You should also consider a different name for the flag: it should
reflect what the effect of the flag is. "Proper header" is not an
effect, it's a requirement for the message to pass. The effect is
enforced strict checking of the header.

 Jiri

[1] https://marc.info/?l=linux-netdev&m=144492718118955


Re: [PATCH RFC v2 net-next 02/25] net/ipv6: Refactor address dump to push inet6_fill_args to in6_dump_addrs

2018-10-02 Thread Jiri Benc
On Mon,  1 Oct 2018 17:28:28 -0700, David Ahern wrote:
> Pull the inet6_fill_args arg up to in6_dump_addrs and move netnsid
> into it. Since IFA_TARGET_NETNSID is a kernel side filter add the
> NLM_F_DUMP_FILTERED flag so userspace knows the request was honored.

IFA_TARGET_NETNSID is not a filter.

"Filter" returns a subset of the results. It's kind of optimization
when one is interested only in some data but not all of them. Instead
of dumping everything, going through the results and picking only the
data one is interested in, it's better to pass a filter and get only
the relevant data. But you're not really required to: you can filter in
your app.

By contrast, IFA_TARGET_NETNSID returns a completely different set of
data. It's impossible to not set it and filter the results in your app.

As the consequence, IFA_TARGET_NETNSID must not set NLM_F_DUMP_FILTERED
(if not complemented by a real filter).

I understand that you want to differentiate between data dumped without
and with IFA_TARGET_NETNSID present. But we already have that: the
IFA_TARGET_NETNSID attribute is returned back in the latter case.

Nacked-by: Jiri Benc 

 Jiri


Re: [PATCH net] rtnetlink: Fail dump if target netnsid is invalid

2018-10-02 Thread Jiri Benc
On Fri, 28 Sep 2018 12:28:41 -0700, David Ahern wrote:
> --- a/net/core/rtnetlink.c
> +++ b/net/core/rtnetlink.c
> @@ -1898,10 +1898,8 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, 
> struct netlink_callback *cb)
>   if (tb[IFLA_IF_NETNSID]) {
>   netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
>   tgt_net = get_target_net(skb->sk, netnsid);
> - if (IS_ERR(tgt_net)) {
> - tgt_net = net;
> - netnsid = -1;
> - }
> + if (IS_ERR(tgt_net))
> + return PTR_ERR(tgt_net);
>   }
>  
>   if (tb[IFLA_EXT_MASK])

Sorry for the late review, I see it has been applied.

I intentionally chose the behavior to preserve the behavior of the
older kernels: that attribute was silently ignored. Note that the
IFLA_IF_NETNSID is not returned in such case, thus it's easy to
distinguish that it was not applied. And the user space has to do such
check anyway to support old kernels.

But you're right that there was no way to distinguish "the kernel does
not support IFLA_IF_NETNSID" from "wrong IFLA_IF_NETNSID provided". I'm
okay with the patch, I just don't think the "Fixes" tag is justified but
whatever, can't be unapplied :-) (and it's my fault for not reviewing
the patches timely).

Thanks!

 Jiri


Re: netlink: 16 bytes leftover after parsing attributes in process `ip'.

2018-09-25 Thread Jiri Benc
On Tue, 25 Sep 2018 09:37:41 -0600, David Ahern wrote:
> For ifaddrmsg ifa_flags aligns with ifi_type which is set by kernel side
> so this should be ok.

Does the existing user space set ifi_type to anything? Does it zero out
the field?

Are we able to find a flag value that is not going to be set by unaware
user space? I.e., a bit that is unused by the current ARPHRD values on
both little and big endian? (ARPHRD_NONE might be a problem, though...)

 Jiri


Re: netlink: 16 bytes leftover after parsing attributes in process `ip'.

2018-09-25 Thread Jiri Benc
On Tue, 25 Sep 2018 11:49:10 +0200, Christian Brauner wrote:
> So if people really want to hide this issue as much as we can then we
> can play the guessing game. I could send a patch that roughly does the
> following:
> 
> if (nlmsg_len(cb->nlh) < sizeof(struct ifinfomsg))
> guessed_header_len = sizeof(struct ifaddrmsg);
> else
> guessed_header_len = sizeof(struct ifinfomsg);
> 
> This will work since sizeof(ifaddrmsg) == 8 and sizeof(ifinfomsg) == 16.
> The only valid property for RTM_GETADDR requests is IFA_TARGET_NETNSID.
> This propert is a __s32 which should bring the message up to 12 bytes
> (not sure about alignment requiremnts and where we might wend up ten)
> which is still less than the 16 bytes without that property from
> ifinfomsg. That's a hacky hacky hack-hack and will likely work but will
> break when ifaddrmsg grows a new member or we introduce another property
> that is valid in RTM_GETADDR requests. It also will not work cleanly
> when users stuff additional properties in there that are vaif 
> (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,lid for the
> address family but are not used int RTM_GETADDR requests.

I'd expect that any potential existing code that makes use of other
attributes already assumes ifaddrmsg. Hence, if the nlmsg_len is >
sizeof(ifinfomsg), you can be sure that there are attributes and thus
the struct used was ifaddrmsg.

So, in order for RTM_GETADDR to work reliably with attributes, you have
to ensure that the length is > sizeof(ifinfomsg).

This can be achieved by putting IFA_TARGET_NETNSID into a nested
attribute. Just define IFA_EXTENDED (feel free to invent a better name,
of course) and put IFA_TARGET_NETNSID inside. Then in the code, attempt
to parse only when the size is large enough:

if (nlmsg_len(cb->nlh) > sizeof(struct ifinfomsg)) {
int err;
 
err = nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb,
  IFA_MAX, ifa_ipv6_policy, NULL);
if (err < 0)
return err;
if (tb[IFA_EXTENDED]) {
...parse the nested attribute...
if (tb_nested[IFA_TARGET_NETNSID]) {
...etc...
}
}
}

Another option is forcing the user space to add another attribute, for
example, IFA_FLAGS_PRESENT, and attempt parsing only when it is
present. The logic would then be:

if (nlmsg_len(cb->nlh) > sizeof(struct ifinfomsg)) {
int err;
 
err = nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb,
  IFA_MAX, ifa_ipv6_policy, NULL);
if (err < 0)
return err;
if (tb[IFA_FLAGS_PRESENT] && tb[IFA_TARGET_NETNSID]) {
...etc...
}
}

 Jiri


Re: [RFC 4/5] netlink: prepare validate extack setting for recursion

2018-09-19 Thread Jiri Benc
On Wed, 19 Sep 2018 11:25:17 +0200, Johannes Berg wrote:
> Now, with this patch, all I'm doing is changing the internal behaviour
> of nla_parse/nla_validate - externally, it still overwrites any existing
> message if an error occurs, but internally it keeps the inner-most
> error.

Ah, okay, that answers my question about putting the flag into the
ext_ack struct, too.

It may still be worthwhile to have a mechanism for prioritizing certain
extack messages over another ones but it's clearly out of scope of this
patchset.

The patchset looks good to me. Just include the description you just
wrote in the commit message :-)

Thanks!

 Jiri


Re: [RFC 4/5] netlink: prepare validate extack setting for recursion

2018-09-19 Thread Jiri Benc
On Wed, 19 Sep 2018 11:15:25 +0200, Johannes Berg wrote:
> For one, having the NL_SET_* macros check it on their own will already
> not work - as we discussed over in the NLA_REJECT thread, we do need to
> be able to override the data, e.g. if somebody does
> 
> NL_SET_ERR_MSG(extack, "warning: deprecated command");
> err = nla_parse(..., extack);
> 
> and nla_parse() sets a new message. Thus, hiding all the logic in there
> already will not work, and is also IMHO rather unexpected. Normally,
> *later* messages should win, not *earlier* ones - and that doesn't
> require any tracking, just setting them unconditionally.
> 
> It's just a side effect of how we do the recursive validation here that
> we want *earlier* messages to win (but only within this code piece - if
> a previous message was set, we want it to be overwritten by nla_parse!).

Fair enough.

> It might be possible to do this differently, in theory, but all the ways
> I've tried to come up with so far made the code vastly more complex.

Wouldn't still make sense to store the flag in the struct
netlink_ext_ack, though? The way the parameters are passed around in
this patch looks ugly. And adding more users of the flag later (there
may be other cases when we want the earlier messages to be preserved)
would mean adding parameters all around, while the flag in the struct
would be readily available.

I don't have a strong opinion on this, just the patch seems to be
inelegant.

 Jiri


Re: [RFC 4/5] netlink: prepare validate extack setting for recursion

2018-09-19 Thread Jiri Benc
On Tue, 18 Sep 2018 15:12:11 +0200, Johannes Berg wrote:
>  static int validate_nla(const struct nlattr *nla, int maxtype,
>   const struct nla_policy *policy,
> - const char **error_msg)
> + struct netlink_ext_ack *extack, bool *extack_set)

Can't the extack_set be included in the struct netlink_ext_ack? One less
parameter to pass around and the NL_SET_* macros could check this on
their own.

This way, it can be also easily turned to a more complex mechanism,
such as the one Marcelo proposed.

Thanks,

 Jiri


Re: [PATCH net] geneve: add ttl inherit support

2018-09-11 Thread Jiri Benc
On Wed, 12 Sep 2018 10:04:21 +0800, Hangbin Liu wrote:
> Similar with commit 72f6d71e491e6 ("vxlan: add ttl inherit support"),
> currently ttl == 0 means "use whatever default value" on geneve instead
> of inherit inner ttl. To respect compatibility with old behavior, let's
> add a new IFLA_GENEVE_TTL_INHERIT for geneve ttl inherit support.
> 
> Reported-by: Jianlin Shi 
> Suggested-by: Jiri Benc 
> Signed-off-by: Hangbin Liu 

Reviewed-by: Jiri Benc 

Thanks, Hangbin!

 Jiri


Re: [PATCH net] nsh: set mac len based on inner packet

2018-07-12 Thread Jiri Benc
On Wed, 11 Jul 2018 12:00:44 -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> When pulling the NSH header in nsh_gso_segment, set the mac length
> based on the encapsulated packet type.
> 
> skb_reset_mac_len computes an offset to the network header, which
> here still points to the outer packet:
> 
>   > skb_reset_network_header(skb);
>   > [...]
>   > __skb_pull(skb, nsh_len);
>   > skb_reset_mac_header(skb);// now mac hdr starts nsh_len == 8B 
> after net hdr
>   > skb_reset_mac_len(skb);   // mac len = net hdr - mac hdr == (u16) 
> -8 == 65528
>   > [..]
>   > skb_mac_gso_segment(skb, ..)  
> 
> Link: 
> http://lkml.kernel.org/r/CAF=yd-keactson4axiraxl8m7qas8gbbe1w09eziywvpbbu...@mail.gmail.com
> Reported-by: syzbot+7b9ed9872dab8c323...@syzkaller.appspotmail.com
> Fixes: c411ed854584 ("nsh: add GSO support")
> Signed-off-by: Willem de Bruijn 

Acked-by: Jiri Benc 


Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags

2018-06-29 Thread Jiri Benc
On Thu, 28 Jun 2018 10:05:36 -0700, Jakub Kicinski wrote:
> Can we take this as a follow up through the bpf-next tree or do you
> want us to respin as part of this set?

I think BPF is a separate issue.

 Jiri


Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags

2018-06-28 Thread Jiri Benc
On Thu, 28 Jun 2018 09:54:52 -0700, Jakub Kicinski wrote:
> Hmm... in practice we could steal top bits of the size parameter for
> some flags, since it seems to be limited to values < 256 today?  Is it
> worth it?
> 
> It would look something along the lines of:

Something like that, yes. I'll leave to Daniel to review how much sense
it makes from the BPF side.

Thanks!

 Jiri


Re: [PATCH net-next v2 3/4] net: check tunnel option type in tunnel flags

2018-06-28 Thread Jiri Benc
On Wed, 27 Jun 2018 11:49:49 +0200, Daniel Borkmann wrote:
> Looks good to me, and yes in BPF case a mask like TUNNEL_OPTIONS_PRESENT is
> right approach since this is opaque info and solely defined by the BPF prog
> that is using the generic helper.

Wouldn't it make sense to introduce some safeguards here (in a backward
compatible way, of course)? It's easy to mistakenly set data for a
different tunnel type in a BPF program and then be surprised by the
result. It might help users if such usage was detected by the kernel,
one way or another.

I'm thinking about something like the BPF program voluntarily
specifying the type of the data; if not specified, the wildcard would be
used as it is now.

 Jiri


Re: [PATCH net] nsh: fix infinite loop

2018-05-04 Thread Jiri Benc
On Thu,  3 May 2018 13:37:54 -0700, Eric Dumazet wrote:
> diff --git a/net/nsh/nsh.c b/net/nsh/nsh.c
> index 
> d7da99a0b0b852d7459eed9ac6d3cdf3d49a1a1c..9696ef96b719bf24625adea2a959deac1d2a975f
>  100644
> --- a/net/nsh/nsh.c
> +++ b/net/nsh/nsh.c
> @@ -57,6 +57,8 @@ int nsh_pop(struct sk_buff *skb)
>   return -ENOMEM;
>   nh = (struct nshhdr *)(skb->data);
>   length = nsh_hdr_len(nh);
> + if (length < NSH_BASE_HDR_LEN)
> + return -EINVAL;
>   inner_proto = tun_p_to_eth_p(nh->np);
>   if (!pskb_may_pull(skb, length))
>   return -ENOMEM;
> @@ -90,6 +92,8 @@ static struct sk_buff *nsh_gso_segment(struct sk_buff *skb,
>   if (unlikely(!pskb_may_pull(skb, NSH_BASE_HDR_LEN)))
>   goto out;
>   nsh_len = nsh_hdr_len(nsh_hdr(skb));
> + if (nsh_len < NSH_BASE_HDR_LEN)
> + goto out;
>   if (unlikely(!pskb_may_pull(skb, nsh_len)))
>   goto out;
>  

Acked-by: Jiri Benc 

Thanks, Eric, and shame on me!

 Jiri


Re: [bpf PATCH v2] bpf: fix for lex/yacc build error with gcc-5

2018-04-26 Thread Jiri Benc
On Wed, 25 Apr 2018 14:22:45 -0700, John Fastabend wrote:
> --- a/tools/bpf/Makefile
> +++ b/tools/bpf/Makefile
> @@ -76,6 +76,8 @@ $(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o 
> $(OUTPUT)bpf_exp.yacc.o $(OUTPUT)bpf_exp.le
>   $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^
>  
>  $(OUTPUT)bpf_exp.lex.c: $(OUTPUT)bpf_exp.yacc.c
> +$(OUTPUT)bpf_exp.yacc.o: $(OUTPUT)bpf_exp.yacc.c
> +$(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.lex.c

Looks better than v1, the first dependency is important.

For some reason, I did not need the other two rules.

By the way, make invoked from tools/bpf/ has never really worked, even
before my patchset. This works correctly:

cd tools
make bpf

 Jiri


Re: [PATCH RFC bpf-next 1/6] bpf: Hooks for sys_bind

2018-03-15 Thread Jiri Benc
On Wed, 14 Mar 2018 20:37:00 -0700, Alexei Starovoitov wrote:
> Hanness expressed the reasons why RHEL doesn't support ipvlan long ago.
> I couldn't find the complete link. This one mentions some of the issues:
> https://www.mail-archive.com/netdev@vger.kernel.org/msg157614.html

ipvlan improved a lot since that time :-) And Paolo has recently fixed
the remaining issues we were aware of.

 Jiri


Re: [PATCH net-next] net: do not create fallback tunnels for non-default namespaces

2018-03-09 Thread Jiri Benc
On Fri, 9 Mar 2018 04:53:07 -0800, Eric Dumazet wrote:
> Unless you bring it up ;)

Often, you need some options to be specified, thus you end up creating
another interface anyway.

> Compatibility problems,  mostly.
> Some users might depend on existing behavior.

How is this a compatibility problem, when the user has to opt in?
Without the default value, the kernel behaves as before. I'm sorry,
I don't see the problem, what am I missing?

> You and me would not care of breaking our setups, but maybe not
> unaware people out there.

Sure, that's why this is an option and it defaults to off.

> Since init_ns is created at boot time before the sysctl can be
> changed, we rather should not change the default behavior for init_ns.

This could be a problem for built in drivers. With modules, I'm
perfectly happy doing this in initrd :-)

To behave consistently in all cases, we might consider adding a boot
option for this, too, though.

 Jiri


Re: [PATCH/RFC 3/3] net/sched: add tunnel option support to act_tunnel_key

2018-03-09 Thread Jiri Benc
On Tue,  6 Mar 2018 18:08:05 +0100, Simon Horman wrote:
> +static int
> +tunnel_key_copy_geneve_opt(const struct nlattr *nla, int dst_len, void *dst,
> +struct netlink_ext_ack *extack)
> +{
> + struct nlattr *tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX + 1];
> + int err, data_len, opt_len;
> + u8 *data;
> +
> + err = nla_parse_nested(tb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_MAX,
> +nla, geneve_opt_policy, extack);
> + if (err < 0)
> + return err;
> +
> + if (!tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS] ||
> + !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE] ||
> + !tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]) {
> + NL_SET_ERR_MSG(extack, "Missing tunnel key enc geneve option 
> class, type or data");

I think it's obvious by now that I don't like the "enc" :-)

> + return -EINVAL;
> + }
> +
> + data = nla_data(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
> + data_len = nla_len(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA]);
> + if (data_len < 4) {
> + NL_SET_ERR_MSG(extack, "Tunnel key enc geneve option data is 
> less than 4 bytes long");
> + return -ERANGE;
> + }
> + if (data_len % 4) {
> + NL_SET_ERR_MSG(extack, "Tunnel key enc geneve option data is 
> not a multiple of 4 bytes long");
> + return -ERANGE;
> + }
> +
> + opt_len = sizeof(struct geneve_opt) + data_len;
> + if (dst) {
> + struct geneve_opt *opt = dst;
> + u16 class;
> +
> + if (dst_len < opt_len) {
> + NL_SET_ERR_MSG(extack, "Tunnel key enc geneve option 
> data length is longer than the supplied data");

I don't understand this error. What can the user do about it?
Furthermore, the error is not propagated to the user space (see also
below).

Shouldn't this be WARN_ON or something?

> + return -EINVAL;
> + }
> +
> + class = nla_get_u16(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS]);
> + put_unaligned_be16(class, &opt->opt_class);

How this can be unaligned, given we check that the option length is a
multiple of 4 bytes and the option header is 4 bytes?

> +
> + opt->type = nla_get_u8(tb[TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE]);
> + opt->length = data_len / 4; /* length is in units of 4 bytes */
> + opt->r1 = 0;
> + opt->r2 = 0;
> + opt->r3 = 0;
> +
> + memcpy(opt + 1, data, data_len);
> + }
> +
> + return opt_len;
> +}
> +
> +static int tunnel_key_copy_opts(const struct nlattr *nla, u8 *dst,
> + int dst_len, struct netlink_ext_ack *extack)

Please be consistent with parameter ordering, dst and dst_len are in a
different order here and in tunnel_key_copy_geneve_opt.

[...]
> @@ -157,6 +292,11 @@ static int tunnel_key_init(struct net *net, struct 
> nlattr *nla,
>   goto err_out;
>   }
>  
> + if (opts_len)
> + tunnel_key_opts_set(tb[TCA_TUNNEL_KEY_ENC_OPTS],
> + &metadata->u.tun_info, opts_len,
> + extack);

You need to propagate the error here. The previous validation is not
enough as errors while copying to tun_info were not covered.

> +
>   metadata->u.tun_info.mode |= IP_TUNNEL_INFO_TX;
>   break;
>   default:
> @@ -221,6 +361,53 @@ static void tunnel_key_release(struct tc_action *a)
>   kfree_rcu(params, rcu);
>  }
>  
> +static int tunnel_key_geneve_opts_dump(struct sk_buff *skb,
> +const struct ip_tunnel_info *info)
> +{
> + int len = info->options_len;
> + u8 *src = (u8 *)(info + 1);
> +
> + while (len > 0) {
> + struct geneve_opt *opt = (struct geneve_opt *)src;
> + u16 class;
> +
> + class = get_unaligned_be16(&opt->opt_class);

I don't think this can be unaligned.

> + if (nla_put_u16(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_CLASS,
> + class) ||
> + nla_put_u8(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_TYPE,
> +opt->type) ||
> + nla_put(skb, TCA_TUNNEL_KEY_ENC_OPT_GENEVE_DATA,
> + opt->length * 4, opt + 1))
> + return -EMSGSIZE;
> +
> + len -= sizeof(struct geneve_opt) + opt->length * 4;
> + src += sizeof(struct geneve_opt) + opt->length * 4;
> + }

All of this needs to be nested in TCA_TUNNEL_KEY_ENC_OPTS_GENEVE.

> +
> + return 0;
> +}
> +
> +static int tunnel_key_opts_dump(struct sk_buff *skb,
> + const struct ip_tunnel_info *info)
> +{
> + struct nlattr *start;
> + int err;
> +
> + if (!info->options_len)
> + return 0;
> +
> + start = nla_nest_start(skb, TCA_TUNNEL_KEY_ENC_OPTS);
> +  

Re: [PATCH/RFC 2/3] net/sched: act_tunnel_key: add extended ack support

2018-03-09 Thread Jiri Benc
On Tue,  6 Mar 2018 18:08:04 +0100, Simon Horman wrote:
> - if (!tb[TCA_TUNNEL_KEY_PARMS])
> + if (!tb[TCA_TUNNEL_KEY_PARMS]) {
> + NL_SET_ERR_MSG(extack, "Missing tunnel key parameter");

"parameters" (it's not just one parameter)

> @@ -107,6 +109,7 @@ static int tunnel_key_init(struct net *net, struct nlattr 
> *nla,
>   break;
>   case TCA_TUNNEL_KEY_ACT_SET:
>   if (!tb[TCA_TUNNEL_KEY_ENC_KEY_ID]) {
> + NL_SET_ERR_MSG(extack, "Missing tunnel key enc id");

This is not much helpful to the user. What's "enc"? I guess "Missing
tunnel key id" would be enough and better.

> @@ -144,6 +147,7 @@ static int tunnel_key_init(struct net *net, struct nlattr 
> *nla,
> 0, flags,
> key_id, 0);
>   } else {
> + NL_SET_ERR_MSG(extack, "Missing both ipv4 and ipv6 enc 
> src and dst");

We don't need both but only one of them. And again, "enc" does not have
a clear meaning.

"Missing either IPv4 or IPv6 source and destination address"?

In addition, it makes little sense to me to add extacks to just some of
the errors in the tunnel_key_init function. Please add extacks to all
of them.

Thanks,

 Jiri


Re: [PATCH net-next] net: do not create fallback tunnels for non-default namespaces

2018-03-09 Thread Jiri Benc
On Thu,  8 Mar 2018 12:51:41 -0800, Eric Dumazet wrote:
> Note that these tunnels are still created for the initial namespace,
> to be the least intrusive for typical setups.

Since this is a knob and must be turned on explicitly, why we don't get
rid of the automatic interfaces even for the initial name space? It
causes only problems nowadays, such as

ip link add name gre0 type gre 

failing with "File exists" even if there was no gre0 interface before.
And of course, even with the error, the interface with the name "gre0"
appears in the system. And of course, it does not have any of the
options specified. This is highly confusing. Not to mention the
autocreated gre0 interface is basically useless.

I'd like to switch the knob on by default on my systems and have the
kernel behave sane, finally, even without name spaces.

Thanks!

 Jiri


Re: [PATCH bpf-next 1/7] tools: bpftool: silence 'missing initializer' warnings

2018-03-09 Thread Jiri Benc
On Thu, 8 Mar 2018 23:38:12 -0800, Jakub Kicinski wrote:
> FWIW I couldn't reproduce this one.  Out of curiosity what GCC did you
> use?

gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)

 Jiri


[PATCH bpf-next 2/7] tools: bpf: respect output directory during build

2018-03-08 Thread Jiri Benc
Currently, the programs under tools/bpf (with the notable exception of
bpftool) do not respect the output directory (make O=dir). Fix that.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 38 ++
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index c8ec0ae16bf0..e7b15967492e 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -1,4 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
+include ../scripts/Makefile.include
+
 prefix = /usr
 
 CC = gcc
@@ -7,7 +9,7 @@ YACC = bison
 MAKE = make
 
 CFLAGS += -Wall -O2
-CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include
+CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/include/uapi -I$(srctree)/include
 
 ifeq ($(srctree),)
 srctree := $(patsubst %/,%,$(dir $(CURDIR)))
@@ -38,32 +40,36 @@ ifeq ($(feature-disassembler-four-args), 1)
 CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
 endif
 
-%.yacc.c: %.y
+$(OUTPUT)%.yacc.c: $(srctree)/tools/bpf/%.y
$(YACC) -o $@ -d $<
 
-%.lex.c: %.l
+$(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l
$(LEX) -o $@ $<
 
-all: bpf_jit_disasm bpf_dbg bpf_asm bpftool
+$(OUTPUT)%.o: $(srctree)/tools/bpf/%.c
+   $(COMPILE.c) -o $@ $<
+
+all: $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm bpftool
 
-bpf_jit_disasm : CFLAGS += -DPACKAGE='bpf_jit_disasm'
-bpf_jit_disasm : LDLIBS = -lopcodes -lbfd -ldl
-bpf_jit_disasm : bpf_jit_disasm.o
+$(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm'
+$(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl
+$(OUTPUT)bpf_jit_disasm: $(OUTPUT)bpf_jit_disasm.o
 
-bpf_dbg : LDLIBS = -lreadline
-bpf_dbg : bpf_dbg.o
+$(OUTPUT)bpf_dbg: LDLIBS = -lreadline
+$(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o
 
-bpf_asm : LDLIBS =
-bpf_asm : bpf_asm.o bpf_exp.yacc.o bpf_exp.lex.o
-bpf_exp.lex.o : bpf_exp.yacc.c
+$(OUTPUT)bpf_asm: LDLIBS =
+$(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o 
$(OUTPUT)bpf_exp.lex.o
+$(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c
 
 clean: bpftool_clean
-   rm -rf *.o bpf_jit_disasm bpf_dbg bpf_asm bpf_exp.yacc.* bpf_exp.lex.*
+   rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \
+  $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.*
 
 install: bpftool_install
-   install bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm
-   install bpf_dbg $(prefix)/bin/bpf_dbg
-   install bpf_asm $(prefix)/bin/bpf_asm
+   install $(OUTPUT)bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm
+   install $(OUTPUT)bpf_dbg $(prefix)/bin/bpf_dbg
+   install $(OUTPUT)bpf_asm $(prefix)/bin/bpf_asm
 
 bpftool:
$(MAKE) -C bpftool
-- 
1.8.3.1



[PATCH bpf-next 4/7] tools: bpf: make install should build first

2018-03-08 Thread Jiri Benc
Make the 'install' target depend on the 'all' target to build the binaries
first.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index c42ca24a072d..e8d9e4125bf3 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -50,7 +50,9 @@ $(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l
 $(OUTPUT)%.o: $(srctree)/tools/bpf/%.c
$(COMPILE.c) -o $@ $<
 
-all: $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm bpftool
+PROGS = $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm
+
+all: $(PROGS) bpftool
 
 $(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm'
 $(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl
@@ -67,7 +69,7 @@ clean: bpftool_clean
rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \
   $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.*
 
-install: bpftool_install
+install: $(PROGS) bpftool_install
$(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin
$(INSTALL) $(OUTPUT)bpf_jit_disasm 
$(DESTDIR)$(prefix)/bin/bpf_jit_disasm
$(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg
-- 
1.8.3.1



[PATCH bpf-next 0/7] tools: bpf: standardize make

2018-03-08 Thread Jiri Benc
Currently, 'make bpf' in the tools/ directory does not provide the standard
quiet output except for bpftool (which is however listed with a wrong
directory). Worse, it does not respect the build output directory.

The 'make bpf_install' does not work as one would expect, either. It
installs unconditionally to /usr/bin without respecting DESTDIR and prefix.

This patchset improves that behavior.

Jiri Benc (7):
  tools: bpftool: silence 'missing initializer' warnings
  tools: bpf: respect output directory during build
  tools: bpf: consistent make bpf_install
  tools: bpf: make install should build first
  tools: bpf: call descend in Makefile
  tools: bpf: respect quiet/verbose build
  tools: bpf: silence make by not deleting intermediate file

 tools/bpf/Makefile | 74 +++---
 tools/bpf/bpftool/Makefile |  2 +-
 2 files changed, 51 insertions(+), 25 deletions(-)

-- 
1.8.3.1



[PATCH bpf-next 6/7] tools: bpf: respect quiet/verbose build

2018-03-08 Thread Jiri Benc
Default to quiet build, with V=1 enabling verbose build as is usual.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 38 +++---
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index daca0a4277d1..757ea22c428a 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -17,6 +17,12 @@ srctree := $(patsubst %/,%,$(dir $(CURDIR)))
 srctree := $(patsubst %/,%,$(dir $(srctree)))
 endif
 
+ifeq ($(V),1)
+  Q =
+else
+  Q = @
+endif
+
 FEATURE_USER = .bpf
 FEATURE_TESTS = libbfd disassembler-four-args
 FEATURE_DISPLAY = libbfd disassembler-four-args
@@ -42,38 +48,48 @@ CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
 endif
 
 $(OUTPUT)%.yacc.c: $(srctree)/tools/bpf/%.y
-   $(YACC) -o $@ -d $<
+   $(QUIET_BISON)$(YACC) -o $@ -d $<
 
 $(OUTPUT)%.lex.c: $(srctree)/tools/bpf/%.l
-   $(LEX) -o $@ $<
+   $(QUIET_FLEX)$(LEX) -o $@ $<
 
 $(OUTPUT)%.o: $(srctree)/tools/bpf/%.c
-   $(COMPILE.c) -o $@ $<
+   $(QUIET_CC)$(COMPILE.c) -o $@ $<
+
+$(OUTPUT)%.yacc.o: $(OUTPUT)%.yacc.c
+   $(QUIET_CC)$(COMPILE.c) -o $@ $<
+$(OUTPUT)%.lex.o: $(OUTPUT)%.lex.c
+   $(QUIET_CC)$(COMPILE.c) -o $@ $<
 
 PROGS = $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg $(OUTPUT)bpf_asm
 
 all: $(PROGS) bpftool
 
 $(OUTPUT)bpf_jit_disasm: CFLAGS += -DPACKAGE='bpf_jit_disasm'
-$(OUTPUT)bpf_jit_disasm: LDLIBS = -lopcodes -lbfd -ldl
 $(OUTPUT)bpf_jit_disasm: $(OUTPUT)bpf_jit_disasm.o
+   $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ -lopcodes -lbfd -ldl
 
-$(OUTPUT)bpf_dbg: LDLIBS = -lreadline
 $(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o
+   $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^ -lreadline
 
-$(OUTPUT)bpf_asm: LDLIBS =
 $(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o 
$(OUTPUT)bpf_exp.lex.o
+   $(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^
+
 $(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c
 
 clean: bpftool_clean
-   rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \
+   $(call QUIET_CLEAN, bpf-progs)
+   $(Q)rm -rf $(OUTPUT)*.o $(OUTPUT)bpf_jit_disasm $(OUTPUT)bpf_dbg \
   $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.*
 
 install: $(PROGS) bpftool_install
-   $(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin
-   $(INSTALL) $(OUTPUT)bpf_jit_disasm 
$(DESTDIR)$(prefix)/bin/bpf_jit_disasm
-   $(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg
-   $(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm
+   $(call QUIET_INSTALL, bpf_jit_disasm)
+   $(Q)$(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin
+   $(Q)$(INSTALL) $(OUTPUT)bpf_jit_disasm 
$(DESTDIR)$(prefix)/bin/bpf_jit_disasm
+   $(call QUIET_INSTALL, bpf_dbg)
+   $(Q)$(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg
+   $(call QUIET_INSTALL, bpf_asm)
+   $(Q)$(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm
 
 bpftool:
$(call descend,bpftool)
-- 
1.8.3.1



[PATCH bpf-next 7/7] tools: bpf: silence make by not deleting intermediate file

2018-03-08 Thread Jiri Benc
Even in quiet mode, make finishes with

rm tools/bpf/bpf_exp.lex.c

That's because it considers the file to be intermediate. Silence that by
mentioning the lex.c file instead of the lex.o file; the dependency still
stays.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index 757ea22c428a..c07b4e718eeb 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -75,7 +75,7 @@ $(OUTPUT)bpf_dbg: $(OUTPUT)bpf_dbg.o
 $(OUTPUT)bpf_asm: $(OUTPUT)bpf_asm.o $(OUTPUT)bpf_exp.yacc.o 
$(OUTPUT)bpf_exp.lex.o
$(QUIET_LINK)$(CC) $(CFLAGS) -o $@ $^
 
-$(OUTPUT)bpf_exp.lex.o: $(OUTPUT)bpf_exp.yacc.c
+$(OUTPUT)bpf_exp.lex.c: $(OUTPUT)bpf_exp.yacc.c
 
 clean: bpftool_clean
$(call QUIET_CLEAN, bpf-progs)
-- 
1.8.3.1



[PATCH bpf-next 3/7] tools: bpf: consistent make bpf_install

2018-03-08 Thread Jiri Benc
Currently, make bpf_install in tools/ does not respect DESTDIR. Moreover, it
installs to /usr/bin/ unconditionally.

Let it respect DESTDIR and allow prefix to be specified. Also, to be more
consistent with bpftool and with the usual customs, default the prefix to
/usr/local instead of /usr.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index e7b15967492e..c42ca24a072d 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -1,12 +1,13 @@
 # SPDX-License-Identifier: GPL-2.0
 include ../scripts/Makefile.include
 
-prefix = /usr
+prefix ?= /usr/local
 
 CC = gcc
 LEX = flex
 YACC = bison
 MAKE = make
+INSTALL ?= install
 
 CFLAGS += -Wall -O2
 CFLAGS += -D__EXPORTED_HEADERS__ -I$(srctree)/include/uapi -I$(srctree)/include
@@ -67,9 +68,10 @@ clean: bpftool_clean
   $(OUTPUT)bpf_asm $(OUTPUT)bpf_exp.yacc.* $(OUTPUT)bpf_exp.lex.*
 
 install: bpftool_install
-   install $(OUTPUT)bpf_jit_disasm $(prefix)/bin/bpf_jit_disasm
-   install $(OUTPUT)bpf_dbg $(prefix)/bin/bpf_dbg
-   install $(OUTPUT)bpf_asm $(prefix)/bin/bpf_asm
+   $(INSTALL) -m 0755 -d $(DESTDIR)$(prefix)/bin
+   $(INSTALL) $(OUTPUT)bpf_jit_disasm 
$(DESTDIR)$(prefix)/bin/bpf_jit_disasm
+   $(INSTALL) $(OUTPUT)bpf_dbg $(DESTDIR)$(prefix)/bin/bpf_dbg
+   $(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm
 
 bpftool:
$(MAKE) -C bpftool
-- 
1.8.3.1



[PATCH bpf-next 1/7] tools: bpftool: silence 'missing initializer' warnings

2018-03-08 Thread Jiri Benc
When building bpf tool, gcc emits piles of warnings:

prog.c: In function ‘prog_fd_by_tag’:
prog.c:101:9: warning: missing initializer for field ‘type’ of ‘struct 
bpf_prog_info’ [-Wmissing-field-initializers]
  struct bpf_prog_info info = {};
 ^
In file included from /home/storage/jbenc/git/net-next/tools/lib/bpf/bpf.h:26:0,
 from prog.c:47:
/home/storage/jbenc/git/net-next/tools/include/uapi/linux/bpf.h:925:8: note: 
‘type’ declared here
  __u32 type;
^

As these warnings are not useful, switch them off.

Signed-off-by: Jiri Benc 
---
 tools/bpf/bpftool/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
index 26901ec87361..4c2867481f5c 100644
--- a/tools/bpf/bpftool/Makefile
+++ b/tools/bpf/bpftool/Makefile
@@ -38,7 +38,7 @@ bash_compdir ?= /usr/share/bash-completion/completions
 CC = gcc
 
 CFLAGS += -O2
-CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow
+CFLAGS += -W -Wall -Wextra -Wno-unused-parameter -Wshadow 
-Wno-missing-field-initializers
 CFLAGS += -DPACKAGE='"bpftool"' -D__EXPORTED_HEADERS__ 
-I$(srctree)/tools/include/uapi -I$(srctree)/tools/include 
-I$(srctree)/tools/lib/bpf -I$(srctree)/kernel/bpf/
 CFLAGS += -DBPFTOOL_VERSION='"$(BPFTOOL_VERSION)"'
 LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
-- 
1.8.3.1



[PATCH bpf-next 5/7] tools: bpf: call descend in Makefile

2018-03-08 Thread Jiri Benc
Use the descend macro to properly propagate $(subdir) to bpftool.

Signed-off-by: Jiri Benc 
---
 tools/bpf/Makefile | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
index e8d9e4125bf3..daca0a4277d1 100644
--- a/tools/bpf/Makefile
+++ b/tools/bpf/Makefile
@@ -76,12 +76,12 @@ install: $(PROGS) bpftool_install
$(INSTALL) $(OUTPUT)bpf_asm $(DESTDIR)$(prefix)/bin/bpf_asm
 
 bpftool:
-   $(MAKE) -C bpftool
+   $(call descend,bpftool)
 
 bpftool_install:
-   $(MAKE) -C bpftool install
+   $(call descend,bpftool,install)
 
 bpftool_clean:
-   $(MAKE) -C bpftool clean
+   $(call descend,bpftool,clean)
 
 .PHONY: bpftool FORCE
-- 
1.8.3.1



Re: [PATCH bpf] tools: bpftool: fix compilation with older headers

2018-03-06 Thread Jiri Benc
On Tue, 6 Mar 2018 15:39:07 +0100, Daniel Borkmann wrote:
> Thanks for the fix, Jiri! The standard approach to resolve such header 
> dependencies under
> tools/ would be to add a copy of magic.h uapi header into 
> tools/include/uapi/linux/magic.h.
> 
> Both bpftool and libbpf have tools/include/uapi/ in their include path from 
> their
> Makefile, so they would pull this in automatically and it would also allow to 
> get rid
> of the extra ifdef in libbpf then. Could you look into that?

That's what I tried at first. But honestly, this is a shortcut to hell.
Eventually, we'd end up with most of uapi headers duplicated under
tools/include/uapi and hopelessly out of sync.

The right approach would be to export uapi headers from the currently
built kernel somewhere (a temporary directory, perhaps) and use that to
build the tools. We should not have duplicated and out of sync headers
in the kernel tree. Just look at the git log for tools/include/uapi to
see what I mean by "out of sync".

But that's certainly not a fix suitable for bpf.git, while I think the
build problem should be fixed in bpf.git. We can do something more
clever in bpf-next.

I can of course add the magic.h header if you have strong opinion about
that. I just don't think it's the right approach and seeing the
precedent in tools/lib/bpf/libbpf.c and tools/lib/api/fs/fs.c, I think
this minimal patch is better at this point until the duplicate headers
mess is sorted out.

Please let me know if you really want me to duplicate the magic.h file.

Thanks,

 Jiri


[PATCH bpf] tools: bpftool: fix compilation with older headers

2018-03-06 Thread Jiri Benc
Compilation of bpftool on a distro that lacks eBPF support in the installed
kernel headers fails with:

common.c: In function ‘is_bpffs’:
common.c:96:40: error: ‘BPF_FS_MAGIC’ undeclared (first use in this function)
  return (unsigned long)st_fs.f_type == BPF_FS_MAGIC;
^
Fix this the same way it is already in tools/lib/bpf/libbpf.c and
tools/lib/api/fs/fs.c.

Signed-off-by: Jiri Benc 
---
 tools/bpf/bpftool/common.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/bpf/bpftool/common.c b/tools/bpf/bpftool/common.c
index 0b482c0070e0..465995281dcd 100644
--- a/tools/bpf/bpftool/common.c
+++ b/tools/bpf/bpftool/common.c
@@ -55,6 +55,10 @@
 
 #include "main.h"
 
+#ifndef BPF_FS_MAGIC
+#define BPF_FS_MAGIC   0xcafe4a11
+#endif
+
 void p_err(const char *fmt, ...)
 {
va_list ap;
-- 
1.8.3.1



Re: [PATCH net-next] team: Use extack to report enslavement failures

2018-02-28 Thread Jiri Benc
On Wed, 28 Feb 2018 08:12:41 +0100, Jiri Pirko wrote:
> Yeah, or if you have an older iproute2 package. I would keep the existing
> dmesg msgs for now. In the future, when everyone is used to exacks, then
> we can remove them.

How do we ensure that this will actually happen in the future? Usually,
when not done right away, such things tend to be forgotten for years or
even forever.

Also, how distant future are we talking about?

Thanks,

 Jiri


Re: [PATCH net-next] team: Use extack to report enslavement failures

2018-02-27 Thread Jiri Benc
On Tue, 27 Feb 2018 17:38:08 +0200, Ido Schimmel wrote:
>   if (port_dev->flags & IFF_LOOPBACK) {
> + NL_SET_ERR_MSG(extack, "Loopback device can't be added as a 
> team port");
>   netdev_err(dev, "Device %s is loopback device. Loopback devices 
> can't be added as a team port\n",
>  portname);

Aren't the netdev_errs unnecessary now?

Or do you keep them for people using old config tools with a new kernel?
If so, for how long? They should certainly be removed eventually. How
do we ensure we don't forget?

Seems to me it would be better to remove them right now.

Thanks,

 Jiri


Re: Automatic TAP destruction/Monitoring namespace destruction

2018-02-26 Thread Jiri Benc
On Fri, 23 Feb 2018 04:39:37 -0500, Andrew Cann wrote:
> In a program I'm writing I have a network namespace with a virtual (TAP)
> network interface assigned to it. I would like it so that the interface is
> automatically destroyed when the namespace is destroyed (ie. when the last
> process in the namespace exits). I can't see any way to implement this..

This should just work.

> As I understand it, when a namespace is destroyed all its interfaces are moved
> to the root namespace. If this is the case, is there anyway to detect when an
> interface is moved so that I can close it manually?

It is the case only for interfaces backed by a physical device. Virtual
interfaces are deleted when the netns is destroyed. That includes
tun/tap interfaces.

> Alternatively, is there a way to detect when a namespace is destroyed?

I don't think we emit any netlink event on netns exit.

> I figured it might possible to use inotify to do this, but it won't let me
> watch directories under /proc. Also the files under /proc/*/ns/ seem to be 
> some
> kind of wierd symlink-to-a-raw-inode-thing (?) - is there a way to detect when
> an inode is destroyed that I can use with these?

You'd need this patchset: https://lkml.org/lkml/2016/10/15/40 but
I don't think it went anywhere. Plus it probably wouldn't be enough
anyway.

> I also thought it might be possible to use a netlink socket to detect when an
> interface changes namespace. But the netlink docs don't seem to suggest that
> this is possible.

Yes, that's possible. You'll need a recent kernel with commit
e8368d9ebb94 included.

> Basically I'm looking for any event the Linux kernel can give me that I can 
> use
> to implement what I want. Does anyone have any ideas?

What you want should already be happening automatically. Have you tried?

ip netns add ns0
ip -n ns0 tuntap add name tap0 mode tap
ip -n ns0 link show dev tap0
ip netns del ns0
ip a# no tap interface

 Jiri


Re: [PATCH net 1/1 v4] rtnetlink: require unique netns identifier

2018-02-07 Thread Jiri Benc
On Wed, 7 Feb 2018 14:36:21 +0100, Christian Brauner wrote:
> On Wed, Feb 07, 2018 at 04:20:01PM +0300, Kirill Tkhai wrote:
> > Can't we write these 3 above branches more compact? Something like this:
> > 
> > if (!!tb[IFLA_NET_NS_FD] + !!tb[IFLA_IF_NETNSID] + 
> > !!tb[IFLA_NET_NS_PID] <= 1)
> > return 0;  
> 
> I always prefer for conditions to be separate and readable even if it
> means additional code. But if others feel that there's value in avoiding
> two additional conditions I'm happy to adapt the patch.

FWIW, I don't like the n x n conditions much. But Kirill's proposal
seems not to be much better. I was thinking about:

int cnt = 0;
if (tb[IFLA_NET_NS_FD])
cnt++;
if (tb[IFLA_NET_NS_PID])
cnt++;
if (tb[IFLA_NET_NETNSID])
cnt++;
if (cnt > 1) {
...errorr...
}

but that's not better, either. As we're unlikely to add a fourth value,
I guess I'm okay with the current approach in the patch.

> Before I added support for netns ids for additional requests Jiri made
> it so that all request that specified properties that they did not
> support returned ENOTSUPP instead of EINVAL. This just keeps things
> consistent. Users would now suddenly receive EINVAL. That's potentially
> confusing.

Yes, please, keep -EOPNOTSUPP.

> As for the case of passing multiple netns identifying properties into
> the same request EINVAL seems the perfect candidate and the error
> message seems instructive to userspace programs.

Agreed.

Acked-by: Jiri Benc 

Thanks,

 Jiri


Re: [PATCH net 1/1 v3] rtnetlink: require unique netns identifier

2018-02-07 Thread Jiri Benc
On Tue,  6 Feb 2018 14:19:02 +0100, Christian Brauner wrote:
> +/* Verify that rtnetlink requests supporting network namespace ids
> + * do not pass additional properties potentially referring to different
> + * network namespaces.
> + */
> +static int rtnl_ensure_unique_netns(struct nlattr *tb[],
> + struct netlink_ext_ack *extack)
> +{
> + /* Requests without network namespace ids have been able to specify
> +  * multiple properties referring to different network namespaces so
> +  * don't regress them.
> +  */
> + if (!tb[IFLA_IF_NETNSID])
> + return 0;

I agree with Eric that we should enforce this also for the existing
pid/fd attributes.

> +
> + /* Caller operates on the current network namespace. */
> + if (!tb[IFLA_NET_NS_PID] && !tb[IFLA_NET_NS_FD])
> + return 0;
> +
> + NL_SET_ERR_MSG(extack, "multiple netns identifying attributes 
> specified");
> + return -EINVAL;

But if we don't reach an agreement on that, this version is the next
best one. No reason to compare the namespaces whether they're the same,
a message with more than one such attribute is just invalid.

> @@ -2649,6 +2675,10 @@ static int rtnl_dellink(struct sk_buff *skb, struct 
> nlmsghdr *nlh,
>   if (err < 0)
>   return err;
>  
> + err = rtnl_ensure_unique_netns(tb, extack);
> + if (err < 0)
> + return err;
> +
>   if (tb[IFLA_IFNAME])
>   nla_strlcpy(ifname, tb[IFLA_IFNAME], IFNAMSIZ);
>  
> @@ -3045,6 +3079,10 @@ static int rtnl_getlink(struct sk_buff *skb, struct 
> nlmsghdr *nlh,
>   if (err < 0)
>   return err;
>  
> + err = rtnl_ensure_unique_netns(tb, extack);
> + if (err < 0)
> + return err;
> +
>   if (tb[IFLA_IF_NETNSID]) {
>   netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
>   tgt_net = get_target_net(NETLINK_CB(skb).sk, netnsid);

dellink and getlink support only netnsid, we should just reject a
message with pid or fd set.

 Jiri


Re: [PATCH net 1/1 v2] rtnetlink: require unique netns identifier

2018-02-07 Thread Jiri Benc
On Tue, 06 Feb 2018 16:31:29 -0600, Eric W. Biederman wrote:
> Frankly.  If we are talking precedence it should be:
> fds
> netnsids
> pids

The current order is 1. pids, 2. fds, though. Not that it matters much,
see below.

> I do think it makes a lot of sense to error if someone passes in
> duplicate arguments.  AKA multiple attribute that could select for
> the same thing.   No one will do that deliberately.  It doesn't make
> sense.  So it is just a nonsense case we have to handle gracefully,
> and correctly.  With correctness being the most important as otherwise
> people might just send in nonsense to exploit bugs.

Completely agreed. Let's just start returning error if more than one of
the pid/fs/netnsid attributes is specified. I don't think this is going
to break any user. And we'll not have to care about the order.

> I agree refusing to combine multiple attributes for the same thing
> sounds the most sensible course.

Yes, please.

Thanks!

 Jiri


Re: [PATCH net-next 0/3 V1] rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK

2018-01-26 Thread Jiri Benc
On Fri, 26 Jan 2018 00:34:51 +0100, Nicolas Dichtel wrote:
> Why meaningful? The user knows that the answer is like if if was done in 
> another
> netns. It enables to have only one netlink socket instead of one per netns. 
> But
> the code using it will be the same.  

Because you can't use it to query the linked interface. You can't even
use it as an opaque value to track interfaces (netnsid+ifindex) because
netnsids are not unique across net name spaces. You can easily have two
interfaces that have all the ifindex, ifname, netnsid (and basically
everything else) identical but being completely different interfaces.
That's really not helpful.

> I fear that with your approach, it will results to a lot of complexity in the
> kernel.  

The complexity is (at least partly) already there. It's an inevitable
result of the design decision to have relative identifiers.

I agree that we should think about how to make this easy to implement.
I like your idea of doing this somehow generically. Perhaps it's
possible to do while keeping the netnsids valid in the caller's netns?

> What is really missing for me, is a way to get a fd from an nsid. The user
> should be able to call RTM_GETNSID with an fd and a nsid and the kernel 
> performs
> the needed operations so that the fd points to the corresponding netns.  

That's what I was missing, too. I even looked into implementing it. But
opening a fd on behalf of the process and returning it over netlink is a
wrong thing to do. Netlink messages can get lost. Then you have a fd
leak you can do nothing about.

Given that we have netnsids used for so much stuff already (like
NETLINK_LISTEN_ALL_NSID) you need to track them anyway. And if you need
to track them, why bother with another identifier? It would be better
if netnsid can be used universally for anything. Then there will be no
need for the conversion.

 Jiri


Re: [PATCH net-next 2/2] dev: advertise the new ifindex when the netns iface changes

2018-01-25 Thread Jiri Benc
On Thu, 25 Jan 2018 15:01:39 +0100, Nicolas Dichtel wrote:
> The goal is to let the user follow an interface that moves to another
> netns.
> 
> CC: Jiri Benc 
> CC: Christian Brauner 
> Signed-off-by: Nicolas Dichtel 

This is great, thanks, Nicolas!

Reviewed-by: Jiri Benc 


Re: [PATCH net-next 1/2] dev: always advertise the new nsid when the netns iface changes

2018-01-25 Thread Jiri Benc
On Thu, 25 Jan 2018 15:01:38 +0100, Nicolas Dichtel wrote:
> The user should be able to follow any interface that moves to another
> netns.  There is no reason to hide physical interfaces.
> 
> CC: Jiri Benc 
> CC: Christian Brauner 
> Signed-off-by: Nicolas Dichtel 

Reviewed-by: Jiri Benc 


Re: [PATCH net-next 0/3 V1] rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK

2018-01-25 Thread Jiri Benc
On Thu, 25 Jan 2018 15:20:59 +0100, Nicolas Dichtel wrote:
> Hmm, I don't agree. For me, it would be the correct answer. If user has a 
> socket
> in ns_a and targets a RTM_GETLINK in ns_b, the answer he gets should be like 
> if
> it was done in ns_b.

But that information would be useless for the caller. Why return a
value that has no meaning for the caller and can not be used? More so
when the kernel is aware of what the correct meaningful value is?

> This is already the case with messages received with NETLINK_LISTEN_ALL_NSID,
> there is no reason to do something different.

NETLINK_LISTEN_ALL_NSID is tough due to way it is implemented. But yes,
it should translate the netnsids to be valid in the socket's netns.
That's the only sane way for the listener.

 Jiri


Re: [PATCH net-next 0/3 V1] rtnetlink: enable IFLA_IF_NETNSID for RTM_{DEL,SET}LINK

2018-01-24 Thread Jiri Benc
On Wed, 24 Jan 2018 16:24:34 +0100, Nicolas Dichtel wrote:
> I wonder if it would be possible to do something in the netlink framework, 
> like
> NETLINK_LISTEN_ALL_NSID.
> Having some ancillary data at the netlink socket level and a function like
> nlsock_net() (instead of sock_net()) to get the corresponding netns.
> With that, it would be possible, in a generci way, to support this feature for
> all netlink family.

I'm not sure it's worth the effort to do that in the framework. You'll
need modifications all the way down to the code that generates the
attributes anyway.

It's not enough to just specify that the operation should be done on a
different netns and hide that from the handlers. Take for example the
existing RTM_GETLINK. Let's say it's executed from within ns_a and
targeted to ns_b (thus IFLA_IF_NETNSID = netnsid of ns_b). Now, if
there's a veth interface in ns_b whose other end is in ns_c, there will
be IFLA_LINK_NETNSID attribute in the response. But the value has to be
netnsid of ns_c as seen from *ns_a*. If you just pretended to switch to
ns_b before invoking rtnl_getlink, it would be netnsid of ns_c as seen
from ns_b which would be wrong.

That's why 79e1ad148c844 added the tgt_net stuff.

 Jiri


Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd

2018-01-24 Thread Jiri Benc
On Wed, 24 Jan 2018 11:53:11 +0100, Nicolas Dichtel wrote:
> Le 23/01/2018 à 18:08, Jiri Benc a écrit :
> > It would be much better if the whole (ifindex, netnsid) pair was
> > returned. I think it could be added.  
> Sure. Do you plan to send a patch?

I can do that but it will take a while. I'll be happy to leave that to
someone else but if there's no one, I'll eventually do it.

> > I think we need that.  
> If you agree, I can send a patch to remove this limitation.

That would be great.

Thanks!

 Jiri


Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd

2018-01-23 Thread Jiri Benc
On Tue, 23 Jan 2018 17:37:11 +0100, Nicolas Dichtel wrote:
> When a virtual interface moves to another netns, the netlink RTM_DELLINK 
> message
> contains the attribute IFLA_NEW_NETNSID, which identifies where the interface
> moves. The nsid may be allocated if needed.

The problem is that ifindex may change and it's not announced. The only
way is to track both ifindex and ifname, watch for the ifname to appear
in the target netns and update the application's view of ifindex.

It would be much better if the whole (ifindex, netnsid) pair was
returned. I think it could be added.

> I don't know if it's acceptable to also allocate an nsid in case of a physical
> interface.

I think we need that.

Thanks,

 Jiri


Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd

2018-01-23 Thread Jiri Benc
(Christian, I'm adding back the netdev list, there's no reason not to
have the discussion in open.)

On Tue, 23 Jan 2018 12:42:19 +0100, Christian Brauner wrote:
> Thanks for the comments and discussion. Sorry, for not going through the
> list but I just have a quick question that doesn't deserve to spam the
> whole netdev list. :) I have written another set of patches that would
> add support for IFLA_IF_NETNSID to RTM_{N,S}ETLINK and friends which I
> planned to send out after the pid and fd ones. Would you be interested
> in those if I sent them out over the next week?

Yes, please CC me. I will have limited time next week (whole day
meetings for the most of the week) but I'll try to review.

In general, I see value in adding netnsid support to setlink, newlink
and dellink. It's a bit of minefield, though. Look for and be sure to
resolve:

- Backwards compatibility with old kernels. Old kernels will ignore
  the IFLA_IF_NETNSID attribute that is unknown to them. And it's
  certainly undesirable for an application to attempt to set/create an
  interface in a given netns and instead the kernel sets/creates an
  interface in the current netns.

  In 79e1ad148c844, I put into place means to handle that: any kernel
  that understands IFLA_IF_NETNSID but does not support
  setlink/newlink/dellink will return -EOPNOTSUPP. This reduces the
  problem to detecting whether the kernel understands IFLA_IF_NETNSID.
  This can be done by invoking getlink with IFLA_IF_NETNSID and
  looking whether the reply contains IFLA_IF_NETNSID.

  The current getlink calls are also limited to CAP_NET_ADMIN in the
  *target* netns. If/when this is relaxed in the future, we still need
  to stay backwards compatible.

  So, the application needs to do this:
  1. call RTM_GETLINK with IFLA_IF_NETNSID on lo (or do a dump)
  2. if getting EACCESS, assume IFLA_IF_NETNSID is not supported =>
 fall back to other means
  2. if the reply does not contain IFLA_IF_NETNSID, the kernel does not
 support IFLA_IF_NETNSID => fall back to other means
  3. try RTM_SETLINK/NEWLINK/DELLINK and check for EOPNOTSUPP => fall
 back to other means

  What is important for your kernel patches is to not break this. If
  you implement only partial support for some of the operation, think
  about the future extensibility. Applications have to be able to
  reliably detect what is supported on the running kernel, regardless
  of the kernel version.

- Access rights. This is security sensitive stuff. I took the big
  hammer and limited everything to CAP_NET_ADMIN in the target netns.
  If you want to relax this, be sure to get review from people involved
  in name space security.

 Jiri


Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd

2018-01-23 Thread Jiri Benc
On Tue, 23 Jan 2018 11:26:58 +0100, Wolfgang Bumiller wrote:
> Even if you know the netnsid, do the mentioned watches work for
> nested/child namespaces if eg. a container creates new namespace before
> and/or after the watch was established and moves interfaces to these
> child namespaces, would you just see them disappear, or can you keep
> track of them later on as well?

What do you mean by "nested namespaces"? There's no such thing for net
name spaces.

As for missing API to get netnsid of the netns the interface is moved
to, see my previous emails in this thread. This needs to be added.

> Even if that works, from what the documentation tells me netlink is an
> unreliable protocol, so if my watcher's socket buffer is full, wouldn't
> I be losing important tracking information?

Sure. But that's fundamentally unfixable independently on netlink, the
kernel needs to take an action if a program is not reading its
messages. Either some messages get dropped or the program is killed or
infinite amount of memory is consumed. This has nothing to do with uAPI
design.

> I think one possible solution to tracking interfaces would be to have a
> unique identifier that never changes (even if it's just a simple
> uint64_t incremented whenever an interface is created). But since
> they're not local to the current namespace that may require a lot of
> extra permission checks (but I'm just speculating here...).

You'll get a hard NACK from CRIU folks if you try to propose this.

> In any case, IFLA_NET_NS_FD/PID are already there and I had been
> wondering previously why they couldn't be used with RTM_GETLINK, it
> would just make sense.

Those predate netnsids and we can't get rid of them now, since they're
part of uAPI. But we can (and should) make sure we don't add more of
those.

 Jiri


Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd

2018-01-23 Thread Jiri Benc
On Mon, 22 Jan 2018 23:25:41 +0100, Christian Brauner wrote:
> This is not necessarily true in scenarios where I move a network device
> via RTM_NEWLINK + IFLA_NET_NS_PID into a network namespace I haven't
> created. Here is an example:
> 
> nlmsghdr->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
> nlmsghdr->nlmsg_type = RTM_NEWLINK;
> /* move to network namespace of pid */
> nla_put_u32(nlmsg, IFLA_NET_NS_PID, pid)
> /* give interface new name */
> nla_put_string(nlmsg, IFLA_IFNAME, ifname)
> 
> The only thing I have is the pid that identifies the network namespace.

How do you know the interface did not get renamed in the new netns?

This is racy and won't work reliably. You really need to know the
netnsid before moving the interface to the netns to be able to do
meaningful queries.

You may argue that for your case, you're fine with the race. But I know
about use cases where it matters a lot: those are tools that show
network topology including changes in real time, such as Skydive. It's
important to have the uAPI designed right. And we don't want two
different APIs for the same thing.

If you want to do any watching over the interfaces (as opposed to "move
to the netns and forget"), you really have to work with netnsids. Let's
focus on how to do that more easily. We don't return netnsid at all
places where we should return it and we need to fix that.

> There's no non-syscall way to learn the netnsid.

And that is the primary problem.

 Jiri


Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd

2018-01-22 Thread Jiri Benc
On Mon, 22 Jan 2018 22:23:54 +0100, Christian Brauner wrote:
> That is certainly a good idea and I'm happy to send a follow-up patch
> for that!

Note that I haven't looked into that and I don't know whether it is
easily possible. I'll appreciate if you could try that.

> But there's still value in being able to use
> IFLA_NET_NS_{FD,PID} in scenarios where the network namespace has been
> created by another process. In this case we don't know what its netnsid
> is and not even if it had been assigned one at creation time or not. In
> this case it would be useful to refer to the netns via a pid or fd. A
> more concrete and frequent example is querying a network namespace of a
> (sorry for the buzzword :)) container for all defined network
> interfaces.

That's what spurred my original comment. If you don't know the netnsid
in such case, we're missing something in uAPI but at a different point
than RTM_GETLINK.

When you find yourself in a need to query an interface in another
netns, you had to learn about that interface in the first place.
Meaning you got its ifindex (or ifname, perhaps) somehow. My point is,
you should have learned the netnsid at the same time.

ifindex alone is not an unique identifier of an interface. The
(ifindex, netnsid) pair is. (Also, note that ifindex can change when
moving the interface to a different netns.) So you should never get
ifindex alone when the interface is in another netns than the current
one. If that happens, that is the uAPI that needs to be fixed. You have
to always get the (ifindex, netnsid) pair.

And that's also the way how it should operate in the other direction:
(ifindex, netnsid) is the identifier to query interface in another
netns.

Note that many APIs in networking are based around netnsid - look at
NETLINK_LISTEN_ALL_NSID. It allows you to keep track of interfaces as
they are moved from name spaces to name spaces using a single socket:
you get notifications about interfaces disappearing or appearing in
"watched" name spaces. The name spaces are, of course, referenced by
their netnsid. In order to add another "watched" name space, just
assign it (or, more typically, let the kernel assign) a netnsid.

Btw, we have one missing piece here: when an interface is moved to a
name space that does not have netnsid attached, we want to find out
where the interface was moved to. But there's currently no reliable way
to do it. For veth, the other end can be used to get the netnsid (note
that RTM_GETLINK will return the correct link netnsid even when the
queried veth interface is in a different name space), for openvswitch,
we can now use genetlink, etc., but using different ways for different
interface types is not the best API ever and for standalone interfaces
we have nothing. I'd like to see something added to uAPI to cover this
in a generic way.

But as for this patch, I don't think it's the correct way. We do have
missing pieces in uAPI wrt. netns support but I think they're at
different places. If you have a counter example, please speak up.

Thanks,

 Jiri


Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd

2018-01-22 Thread Jiri Benc
On Thu, 18 Jan 2018 21:55:53 +0100, Christian Brauner wrote:
> A more concrete scenario is creating a network namespace, moving a
> device into it via RTM_SETLINK which also supports IFLA_NET_NS_{FD,PID}
> and then wanting to query the device. This would be very easy to do if
> one could reuse the IFLA_NET_NS_{FD,PID} without having to set a
> netnsid.

Wouldn't be a better solution to have a way to request the netnsid
allocation (and return it) right away when creating the name space,
then?

 Jiri


Re: [PATCH iproute2-next 1/2] ip/tunnel: Be consistent when printing tunnel collect metadata

2018-01-22 Thread Jiri Benc
On Mon, 22 Jan 2018 18:46:53 +0200, Serhey Popovych wrote:
> + if (tb[IFLA_GRE_COLLECT_METADATA]) {
> + print_bool(PRINT_ANY, "collect_metadata", "external", true);
> + return;
> + }

Nacked-by: Jiri Benc 

Don't ever use "collect_metadata" for anything visible to the user.

collect_metadata is a *horrible* name. It describes the internal
implementation of the lwtunneling in the kernel and provides zero
explanation to the user about what's that feature good for.

The netlink attribute should have never had such name but it's uAPI and
we have to live with it. But there's no reason to expose this to the
user.

Stick with the "external" name. It explains what it is about: instead of
the traffic being controlled by the tunnel internal logic (or tunnel
control plane, if you want), an external logic needs to be attached to
the tunnel in order for the tunneling to work.

 Jiri


Re: [PATCH net-next 1/1] rtnetlink: request RTM_GETLINK by pid or fd

2018-01-18 Thread Jiri Benc
On Thu, 18 Jan 2018 21:21:24 +0100, Christian Brauner wrote:
> In such scenarios setting a netns id property is
> not really wanted

Why? I think that's what you should do if you want to avoid setns. Just
use netnsid. I don't see any problem with that and you didn't provide
any explanation.

 Jiri


Re: [PATCH net-next] macvlan: Pass SIOC[SG]HWTSTAMP ioctls and get_ts_info to lower device

2018-01-18 Thread Jiri Benc
On Thu, 18 Jan 2018 02:12:34 +0100, grzegorz.ha...@gmail.com wrote:
> This patch allows to enable hardware timestamping on macvlan intefaces and 
> find out capabilities of the lower device.
> 
> Signed-off-by: Grzegorz Halat 

NACK. This does not work.

For start, how do you deal with fwd_priv? When a packet is sent to
other software ports, it wouldn't be time stamped. And I expect more
cases like this to be there in macvlan, I only spent 10 seconds
checking it.

Please study how time stamping in the kernel works. A good start is
Documentation/networking/timestamping.txt. Then examine all possible
packet paths with macvlan, both egress and ingress.

> +static int macvlan_ethtool_get_ts_info(struct net_device *dev,
> + struct ethtool_ts_info *ts_info)
> +{
> + const struct macvlan_dev *vlan = netdev_priv(dev);
> + const struct ethtool_ops *eth_ops = vlan->lowerdev->ethtool_ops;
> +
> + if (eth_ops->get_ts_info)
> + return eth_ops->get_ts_info(vlan->lowerdev, ts_info);
> +
> + ts_info->so_timestamping = SOF_TIMESTAMPING_RX_SOFTWARE |
> +SOF_TIMESTAMPING_SOFTWARE;
> + ts_info->phc_index = -1;

What calls skb_tx_timestamp if the driver does not support it?

 Jiri


Re: [PATCHv3 net-next 2/2] openvswitch: add erspan version I and II support

2018-01-18 Thread Jiri Benc
Looks much better.

On Wed, 17 Jan 2018 09:32:51 -0800, William Tu wrote:
> + OVS_ERSPAN_OPT_IDX, /* be32 index */

Why don't you convert this to u32 while passing to/from user space?

> + [OVS_ERSPAN_OPT_IDX]= { .len = sizeof(u32) },

sizeof(__be32) but see above.

Thanks,

 Jiri


Re: [PATCH net] Revert "openvswitch: Add erspan tunnel support."

2018-01-12 Thread Jiri Benc
On Fri, 12 Jan 2018 12:29:22 -0800, William Tu wrote:
> This reverts commit ceaa001a170e43608854d5290a48064f57b565ed.
> 
> The OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS attr should be designed
> as a nested attribute to support all ERSPAN v1 and v2's fields.
> The current attr is a be32 supporting only one field.  Thus, this
> patch reverts it and later patch will redo it using nested attr.
> 
> Signed-off-by: William Tu 
> Cc: Jiri Benc 
> Cc: Pravin Shelar 

Acked-by: Jiri Benc 


Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-12 Thread Jiri Benc
On Fri, 12 Jan 2018 09:50:35 -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
> (Looks like you missed the last update I mentioned)

I did not miss it. The proposed behavior is inconsistent and has no
clear pattern (I used the word "magic" for that). I guess examples will
help more. See below.

> Here is the approach in details -

> (a) At slave creation time - it's exactly how it's done currently
> where slave copies masters' mtu. at the same time max_mtu of the slave
> is set as the current mtu of the master.
> (b) If slave updates mtu - ipvlan_change_mtu() will be called and the
> slave's mtu will get set and it will set a flag indicating that slave
> has changed it's mtu (dissociation from master if the mtu is different
> from masters'). If slave mtu is set same as masters' then this flag
> will get reset-ed indicating it wants to follow master (current
> behavior).

Consider these two cases:

# ip l a link eth0 type ipvlan 
# ip l s ipvlan0 mtu 1400
# ip l s eth0 mtu 1400
# ip l s eth0 mtu 1500

Now MTU of ipvlan0 is 1400.

# ip l a link eth0 type ipvlan 
# ip l s eth0 mtu 1400
# ip l s ipvlan0 mtu 1400
# ip l s eth0 mtu 1500

Now MTU of ipvlan0 is 1500.

See why I call that behavior "magic"? Before the last step, it's
impossible to tell from the user space what will happen. And no, don't
propose to artificially cover the first example by forcing the auto
follow flag, it doesn't help. It just moves the magic behavior to
different scenarios.

> (c) When master updates mtu - ipvlan_adj_mtu() gets called where all
> slaves' max_mtu changes to the master's mtu value (clamping applies
> for the all slaves which are not following master). All the slaves
> which are following master (flag per slave) will get this new mtu.
> Another consequence of this is that slaves' flag might get reset-ed if
> the master's mtu is reduced to the value that was set earlier for the
> slave (and it will start following slave).

Okay, you're actually proposing that. So, another example:

# ip l a link eth0 type ipvlan 
# ip l s ipvlan0 mtu 1400
# ip l s eth0 mtu 1400
# ip l s eth0 mtu 1500

Now MTU of ipvlan0 is 1500.

# ip l a link eth0 type ipvlan 
# ip l s ipvlan0 mtu 1400
# ip l s eth0 mtu 1600
# ip l s eth0 mtu 1500

Now MTU of ipvlan0 is 1400. There's still no consistency here.

> The above should work? The user-space can query the mtu of the slave
> device just like any other device. I was thinking about 'mtu_adj' with
> some additional future extention but for now; we can live with a flag
> on the slave device(s).

It absolutely does not. Never introduce magic behavior, it just forces
you to add more layers of hacks later.

Really, the only way is to introduce user visible and user changeable
flag. Yes, it's more work. But that's something you'll have to swallow.
Introducing hacky behavior is not the way to go. We try hard to
preserve user space compatibility which is pretty much impossible if
you introduce magic hacky behavior. Do it right from the start.

 Jiri


Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-12 Thread Jiri Benc
On Fri, 12 Jan 2018 09:34:13 +0100, Jiri Benc wrote:
> I don't think this works currently. When someone (does not have to be
> you, it can be a management software running in background) sets the
> MTU to the current value, the magic behavior is lost without any way to
> restore it (unless I'm missing a way to restore it, see my question
> above). So any user that depends on the magic behavior is broken anyway
> even now.

Upon further inspection, it seems that currently, slaves always follow
master's MTU without a way to change it. Tough situation. Even
implementing user space toggleable mtu_adj could break users in the way
I described. But it seems to be the lesser evil, at least there would
be a way to unbreak the scripts with one line addition.

But it's absolute must to have this visible to the user space and
changeable. Something like this:

# ip a
123: ipvlan0:  mtu 1500 (auto) qdisc ...
# ip l s ipvlan0 mtu 1400
# ip a
123: ipvlan0:  mtu 1400 qdisc ...
# ip l s ipvlan0 mtu auto
# ip a
123: ipvlan0:  mtu 1500 (auto) qdisc ...

 Jiri


Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-12 Thread Jiri Benc
On Thu, 11 Jan 2018 08:59:58 -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
> I guess the logic would be as simple as - if mtu_adj for a slave is
> set to 0, then it's
> following master otherwise not. By setting different mtu for a slave, you will
> set this mtu_adj a positive number which would mean it's not following master.
> So it's subjected to clamping when masters' mtu is reducing but should stay
> otherwise. Also when slave decides to follow master again, it can set the mtu
> to be same as masters' (making mtu_adj == 0) and then it would start following
> master again.

How can the mtu_adj value be queried and set from user space?

> Whether it's magic or not, it's the current behavior and I know several use
> cases depend on this behavior which would be broken otherwise. The
> approach I proposed keeps that going for those who depend on that while
> adds an ability to set mtu per slave for the use case mentioned in this
> patch-set too.

I don't think this works currently. When someone (does not have to be
you, it can be a management software running in background) sets the
MTU to the current value, the magic behavior is lost without any way to
restore it (unless I'm missing a way to restore it, see my question
above). So any user that depends on the magic behavior is broken anyway
even now.

 Jiri


Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support

2018-01-12 Thread Jiri Benc
On Thu, 11 Jan 2018 08:34:14 -0800, William Tu wrote:
> I'd also prefer reverting ceaa001a170e since it's more clean but I
> also hope to have this feature in 4.15.
> How long does reverting take? Am I only able to submit the new patch
> after the reverting is merged? Or I can submit revert and this new
> patch in one series? I have little experience in reverting, can you
> suggest which way is better?

Send the revert for net (subject will be "[PATCH net] revert:
openvswitch: Add erspan tunnel support."). Don't forget to explain why
you're proposing a revert.

After it is accepted and applied to net.git, wait until the patch
appears in net-next.git. It may take a little while. After that, send
the new patch(es) for net-next.

 Jiri


Re: [PATCH V2] ipvlan: fix ipvlan MTU limits

2018-01-11 Thread Jiri Benc
On Wed, 10 Jan 2018 18:09:50 -0800, Mahesh Bandewar (महेश बंडेवार) wrote:
> I still prefer the approach I had mentioned that uses 'mtu_adj'. In
> that approach you can leave those slaves which have changed their mtu
> to be lower than masters' but if master's mtu changes to larger value
> all other slaves will get updated mtu leaving behind the slaves who
> have opted to change their mtu on their own. Also the same thing is
> true when mtu get reduced at master.

The problem with this magic behavior is, well, that it's magic. There's
no way to tell what happens with a given slave when the master's MTU
gets changed just by looking at the current configuration. There's also
no way to switch the magic behavior back on once the slave's MTU is
changed.

At minimum, you'd need some kind of indication that the slave's MTU is
following the master. And a way to toggle this back.

Keefe's patch is much saner, the behavior is completely deterministic.

 Jiri


Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support

2018-01-10 Thread Jiri Benc
On Wed, 10 Jan 2018 22:35:14 +0100, Jiri Benc wrote:
> The existing field must continue to work in the same way as before. It must
> be accepted and *returned* by the kernel. You may add an additional field
> but the existing behavior must be 100% preserved, both uABI and uAPI wise.

Another way around this is reverting ceaa001a170e in net.git and
designing the uAPI properly in net-next. I think that should be the
preferred way, as ceaa001a170e is clearly wrong since you need to redo
it after 3 months.

Not sure when Linus intends to release 4.15 and how much time you have
for this, though.

 Jiri


Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support

2018-01-10 Thread Jiri Benc
On Tue,  9 Jan 2018 17:51:22 -0800, William Tu wrote:
> - [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = sizeof(u32) },
> + [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV1] = { .len = sizeof(u32) },
> + [OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS]   = { .len = OVS_ATTR_NESTED,
> + .next = ovs_erspan_opt_lens },

Ouch. It's actually much worse than that, you're redefining the meaning of
the field. That's complete no-go.

> + case OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV1:
> + OVS_NLERR(log, "ERSPAN attribute %d is deprecated.",
> +   type);
> + return -EINVAL;

As is this.

> @@ -906,8 +1017,8 @@ static int __ip_tun_to_nlattr(struct sk_buff *skb,
>vxlan_opt_to_nlattr(skb, tun_opts, swkey_tun_opts_len))
>   return -EMSGSIZE;
>   else if (output->tun_flags & TUNNEL_ERSPAN_OPT &&
> -  nla_put_be32(skb, OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,
> -   ((struct erspan_metadata 
> *)tun_opts)->u.index))
> +  erspan_opt_to_nlattr(skb, tun_opts,
> +   swkey_tun_opts_len))

And this.

The existing field must continue to work in the same way as before. It must
be accepted and *returned* by the kernel. You may add an additional field
but the existing behavior must be 100% preserved, both uABI and uAPI wise.

 Jiri


Re: [PATCHv2 net-next 2/2] openvswitch: add erspan version II support

2018-01-10 Thread Jiri Benc
On Tue,  9 Jan 2018 17:51:22 -0800, William Tu wrote:
> --- a/include/uapi/linux/openvswitch.h
> +++ b/include/uapi/linux/openvswitch.h
> @@ -363,7 +373,8 @@ enum ovs_tunnel_key_attr {
>   OVS_TUNNEL_KEY_ATTR_IPV6_SRC,   /* struct in6_addr src IPv6 
> address. */
>   OVS_TUNNEL_KEY_ATTR_IPV6_DST,   /* struct in6_addr dst IPv6 
> address. */
>   OVS_TUNNEL_KEY_ATTR_PAD,
> - OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,/* be32 ERSPAN index. */
> + OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTSV1,  /* be32 ERSPAN v1 index 
> (deprecated). */

You can't do this in an uapi header, the existing name must stay the
same forever. Otherwise existing programs would suddenly stop to
compile.

 Jiri


Re: [PATCH net v3] openvswitch: Fix pop_vlan action for double tagged frames

2017-12-20 Thread Jiri Benc
On Wed, 20 Dec 2017 15:09:22 -0500, Eric Garver wrote:
> skb_vlan_pop() expects skb->protocol to be a valid TPID for double
> tagged frames. So set skb->protocol to the TPID and let skb_vlan_pop()
> shift the true ethertype into position for us.
> 
> Fixes: 5108bbaddc37 ("openvswitch: add processing of L3 packets")
> Signed-off-by: Eric Garver 

Thanks!

Reviewed-by: Jiri Benc 


Re: [PATCH net v2] openvswitch: Fix pop_vlan action for double tagged frames

2017-12-20 Thread Jiri Benc
On Wed, 20 Dec 2017 10:39:32 -0500, Eric Garver wrote:
> + if (is_flow_key_valid(key) && key->eth.vlan.tci && key->eth.cvlan.tci)

Maybe (key->eth.vlan.tci & htons(VLAN_TAG_PRESENT)) for consistency
with the rest of the code? But it's just nitpicking.

The real problem here is when a double tagged packet leaves the ovs
bridge, it won't have the skb->protocol that the kernel expects: it
will be ethertype of the payload, while my understanding is it should
be the inner tpid, right?

This patch fixes that nicely for the pop vlan case. But what about
other cases? It seems to me that we need to add the logic to
key_extract.

Thanks!

 Jiri


Re: [PATCH net] openvswitch: Fix pop_vlan action for double tagged frames

2017-12-19 Thread Jiri Benc
On Tue, 19 Dec 2017 13:57:53 -0500, Eric Garver wrote:
> --- a/net/openvswitch/flow.c
> +++ b/net/openvswitch/flow.c
> @@ -559,8 +559,9 @@ static int parse_nsh(struct sk_buff *skb, struct 
> sw_flow_key *key)
>   *  of a correct length, otherwise the same as skb->network_header.
>   *  For other key->eth.type values it is left untouched.
>   *
> - *- skb->protocol: the type of the data starting at skb->network_header.
> - *  Equals to key->eth.type.
> + *- skb->protocol: For Ethernet, the ethertype or VLAN TPID.
> + *  For non-Ethernet, the type of the data starting at 
> skb->network_header
> + *  (also equal to key->eth.type).
>   */
>  static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
>  {
> @@ -579,6 +580,7 @@ static int key_extract(struct sk_buff *skb, struct 
> sw_flow_key *key)
>   return -EINVAL;
>  
>   skb_reset_network_header(skb);
> + key->eth.type = skb->protocol;
>   } else {
>   eth = eth_hdr(skb);
>   ether_addr_copy(key->eth.src, eth->h_source);
> @@ -592,15 +594,14 @@ static int key_extract(struct sk_buff *skb, struct 
> sw_flow_key *key)
>   if (unlikely(parse_vlan(skb, key)))
>   return -ENOMEM;
>  
> - skb->protocol = parse_ethertype(skb);
> - if (unlikely(skb->protocol == htons(0)))
> + key->eth.type = parse_ethertype(skb);
> + if (unlikely(key->eth.type == htons(0)))
>   return -ENOMEM;
>  
>   skb_reset_network_header(skb);
>   __skb_push(skb, skb->data - skb_mac_header(skb));
>   }
>   skb_reset_mac_len(skb);
> - key->eth.type = skb->protocol;
>  
>   /* Network layer. */
>   if (key->eth.type == htons(ETH_P_IP)) {

Unfortunately, this does not work. key_extract must set skb->protocol
even for Ethernet frames that come from a mixed L2/L3 tunnel. Such
packets will have key->mac_proto set to MAC_PROTO_ETHERNET and
skb->protocol set to ETH_P_TEB (see key_extract_mac_proto). In
key_extract, skb->protocol has to be correctly set to the dissected
value.

Which means that we have to check for the existence of inner vlan tag
(by checking key->eth.cvlan.tci or, perhaps better, by returning it
from parse_vlan) and set skb->protocol accordingly.

 Jiri


Re: [PATCH] rtnetlink: fix missing size for IFLA_IF_NETNSID

2017-11-06 Thread Jiri Benc
On Mon,  6 Nov 2017 15:04:54 +, Colin King wrote:
> The size for IFLA_IF_NETNSID is missing from the size calculation
> because the proceeding semicolon was not removed. Fix this by removing
> the semicolon.

Acked-by: Jiri Benc 

Thanks for spotting this! Looking at my initial code, I had that right,
this was probably introduced during one of rebases, so I blame
Flavio :-p (On a serious note, thank you, Flavio, for taking care of
the rebases.)

Hopefully, with the "+ 0" added, this won't happen again in this
particular piece of code in the future.

 Jiri


  1   2   3   4   5   6   7   8   9   10   >