Re: [PATCH 00/12 net-next,v2] add flow_rule infrastructure

2018-11-20 Thread Jiri Pirko
Tue, Nov 20, 2018 at 06:16:40PM CET, da...@davemloft.net wrote:
>From: Jiri Pirko 
>Date: Tue, 20 Nov 2018 08:39:12 +0100
>
>> If later on the netfilter code will use it, through another
>> ndo/notifier/whatever, that is a nice side-effect in my
>> opinion.
>
>Netfilter HW offloading is the main motivation of these changes.
>
>You can try to spin it any way you like, but I think this is pretty
>clear.
>
>Would the author of these changes even be remotely interested in
>this "cleanup" in areas of code he has never been involved in if that
>were not the case?

No, of course not. I'm just saying that the cleanup is nice and handy
even if the code were never used by netfilter. That is why I considered
the info irrelevant for the review. Anyway, I get your point.


>
>I think it is very dishonest to portray the situation differently.
>
>Thank you.


Re: [PATCH net] sctp: hold transport before accessing its asoc in sctp_hash_transport

2018-11-20 Thread Xin Long
On Wed, Nov 21, 2018 at 9:46 AM Marcelo Ricardo Leitner
 wrote:
>
> On Tue, Nov 20, 2018 at 07:52:48AM -0500, Neil Horman wrote:
> > On Tue, Nov 20, 2018 at 07:09:16PM +0800, Xin Long wrote:
> > > In sctp_hash_transport, it dereferences a transport's asoc only under
> > > rcu_read_lock. Without holding the transport, its asoc could be freed
> > > already, which leads to a use-after-free panic.
> > >
> > > A similar fix as Commit bab1be79a516 ("sctp: hold transport before
> > > accessing its asoc in sctp_transport_get_next") is needed to hold
> > > the transport before accessing its asoc in sctp_hash_transport.
> > >
> > > Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new transport")
> > > Reported-by: syzbot+0b05d8aa7cb185107...@syzkaller.appspotmail.com
> > > Signed-off-by: Xin Long 
> > > ---
> > >  net/sctp/input.c | 7 ++-
> > >  1 file changed, 6 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/sctp/input.c b/net/sctp/input.c
> > > index 5c36a99..69584e9 100644
> > > --- a/net/sctp/input.c
> > > +++ b/net/sctp/input.c
> > > @@ -896,11 +896,16 @@ int sctp_hash_transport(struct sctp_transport *t)
> > > list = rhltable_lookup(&sctp_transport_hashtable, &arg,
> > >sctp_hash_params);
> > >
> > > -   rhl_for_each_entry_rcu(transport, tmp, list, node)
> > > +   rhl_for_each_entry_rcu(transport, tmp, list, node) {
> > > +   if (!sctp_transport_hold(transport))
> > > +   continue;
> > > if (transport->asoc->ep == t->asoc->ep) {
> > > +   sctp_transport_put(transport);
> > > rcu_read_unlock();
> > > return -EEXIST;
> > > }
> > > +   sctp_transport_put(transport);
> > > +   }
> > > rcu_read_unlock();
> > >
> > > err = rhltable_insert_key(&sctp_transport_hashtable, &arg,
> > > --
> > > 2.1.0
> > >
> > >
> >
> > something doesn't feel at all right about this.  If we are inserting a
> > transport to an association, it would seem to me that we should have at
> > least one user of the association (i.e. non-zero refcount).  As such it
> > seems something is wrong with the association refcount here.  At the
> > very least, if there is a case where an association is being removed
> > while a transport is being added, the better solution would be to
> > ensure that sctp_association_destroy goes through a quiescent point
> > prior to unhashing transports from the list, to ensure that there is
> > no conflict with the add operation above.
Changing to do call_rcu(&asoc->rcu, sctp_association_destroy) can
work for this case.
But it means the asoc and socket (taking the port) will have to wait for
a grace period, which is not expected. We seem to have talked about this
before, Marcelo?

>
> Consider that the rhl_for_each_entry_rcu() is traversing the global
> rhashtable, and that it may operate on unrelated transports/asocs.
> E.g., transport->asoc in the for() is potentially different from the
> asoc under socket lock.
>
> The core of the fix is at:
> +   if (!sctp_transport_hold(transport))
> +   continue;
> If we can get a hold, the asoc will be available for dereferencing in
> subsequent lines. Otherwise, move on.
>
> With that, the patch makes sense to me.
>
> Although I would prefer if we come up with a better way to do this
> jump, or even avoid the jump. We are only comparing pointers here and,
> if we had asoc->ep cached on sctp_transport itself, we could avoid the
> atomics here.
Right, but it's another u64.

>
> This change, in the next patch on sctp_epaddr_lookup_transport, will
> hurt performance as that is called in datapath. Rhashtable will help
> on keeping entry lists to a size, but still.
This loop is normally not long; will just a few atomic operations really
hurt performance noticeably?


[iproute2-next PATCH v4] tc: flower: Classify packets based port ranges

2018-11-20 Thread Amritha Nambiar
Added support for filtering based on port ranges.
UAPI changes have been accepted into net-next.

Example:
1. Match on a port range:
-
$ tc filter add dev enp4s0 protocol ip parent :\
  prio 1 flower ip_proto tcp dst_port range 20-30 skip_hw\
  action drop

$ tc -s filter show dev enp4s0 parent :
filter protocol ip pref 1 flower chain 0
filter protocol ip pref 1 flower chain 0 handle 0x1
  eth_type ipv4
  ip_proto tcp
  dst_port range 20-30
  skip_hw
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 1 ref 1 bind 1 installed 85 sec used 3 sec
Action statistics:
Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

2. Match on IP address and port range:
--
$ tc filter add dev enp4s0 protocol ip parent :\
  prio 1 flower dst_ip 192.168.1.1 ip_proto tcp dst_port range 100-200\
  skip_hw action drop

$ tc -s filter show dev enp4s0 parent :
filter protocol ip pref 1 flower chain 0 handle 0x2
  eth_type ipv4
  ip_proto tcp
  dst_ip 192.168.1.1
  dst_port range 100-200
  skip_hw
  not_in_hw
action order 1: gact action drop
 random type none pass val 0
 index 2 ref 1 bind 1 installed 58 sec used 2 sec
Action statistics:
Sent 920 bytes 20 pkt (dropped 20, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

v4:
Added man updates explaining filtering based on port ranges.
Removed 'range' keyword.

v3:
Modified flower_port_range_attr_type calls.

v2:
Addressed Jiri's comment to sync output format with input

Signed-off-by: Amritha Nambiar 
---
 man/man8/tc-flower.8 |   13 +++--
 tc/f_flower.c|  136 ++
 2 files changed, 134 insertions(+), 15 deletions(-)

diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index 8be8882..adff41e 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -56,8 +56,9 @@ flower \- flow based traffic control filter
 .IR MASKED_IP_TTL " | { "
 .BR dst_ip " | " src_ip " } "
 .IR PREFIX " | { "
-.BR dst_port " | " src_port " } "
-.IR port_number " } | "
+.BR dst_port " | " src_port " } { "
+.IR port_number " | "
+.IR min_port_number-max_port_number " } | "
 .B tcp_flags
 .IR MASKED_TCP_FLAGS " | "
 .B type
@@ -220,10 +221,12 @@ must be a valid IPv4 or IPv6 address, depending on the \fBprotocol\fR
 option to tc filter, optionally followed by a slash and the prefix length.
 If the prefix is missing, \fBtc\fR assumes a full-length host match.
 .TP
-.BI dst_port " NUMBER"
+.IR \fBdst_port " { "  NUMBER " | " " MIN_VALUE-MAX_VALUE "  }
 .TQ
-.BI src_port " NUMBER"
-Match on layer 4 protocol source or destination port number. Only available for
+.IR \fBsrc_port " { "  NUMBER " | " " MIN_VALUE-MAX_VALUE "  }
+Match on layer 4 protocol source or destination port number. Alternatively, the
+minimum and maximum values can be specified to match on a range of layer 4
+protocol source or destination port numbers. Only available for
 .BR ip_proto " values " udp ", " tcp  " and " sctp
 which have to be specified in beforehand.
 .TP
diff --git a/tc/f_flower.c b/tc/f_flower.c
index 65fca04..722647d 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -494,6 +494,68 @@ static int flower_parse_port(char *str, __u8 ip_proto,
return 0;
 }
 
+static int flower_port_range_attr_type(__u8 ip_proto, enum flower_endpoint type,
+  __be16 *min_port_type,
+  __be16 *max_port_type)
+{
+   if (ip_proto == IPPROTO_TCP || ip_proto == IPPROTO_UDP ||
+   ip_proto == IPPROTO_SCTP) {
+   if (type == FLOWER_ENDPOINT_SRC) {
+   *min_port_type = TCA_FLOWER_KEY_PORT_SRC_MIN;
+   *max_port_type = TCA_FLOWER_KEY_PORT_SRC_MAX;
+   } else {
+   *min_port_type = TCA_FLOWER_KEY_PORT_DST_MIN;
+   *max_port_type = TCA_FLOWER_KEY_PORT_DST_MAX;
+   }
+   } else {
+   return -1;
+   }
+
+   return 0;
+}
+
+static int flower_parse_port_range(__be16 *min, __be16 *max, __u8 ip_proto,
+  enum flower_endpoint endpoint,
+  struct nlmsghdr *n)
+{
+   __be16 min_port_type, max_port_type;
+
+   if (htons(*max) <= htons(*min)) {
+   fprintf(stderr, "max value should be greater than min value\n");
+   return -1;
+   }
+
+   if (flower_port_range_attr_type(ip_proto, endpoint, &min_port_type,
+   &max_port_type))
+   return -1;
+
+   addattr16(n, MAX_MSG, min_port_type, *min);
+   addattr16(n, MAX_MSG, max_port_type, *max);
+
+   return 0;
+}
+
+static int get_range(__be16 *min, __be16 *max, char *argv)
+{
+   char *r;
+
+   r = strchr(argv, '-');
+   if (r) {
+   

Re: [iproute2-next PATCH v3 2/2] man: tc-flower: Add explanation for range option

2018-11-20 Thread Nambiar, Amritha
On 11/20/2018 8:59 PM, David Ahern wrote:
> On 11/20/18 9:59 PM, Nambiar, Amritha wrote:
>> Oops, submitted the v2 patch for man changes too soon, without seeing
>> this. So, in this case, should I re-submit the iproute2-flower patch
>> that was accepted removing the 'range' keyword?
> 
> I think so. Consistency across commands is a good thing.
> 

Okay, will do. I'll also combine the 'man patch' into 'flower patch' and
make a single patch as Jiri recommended.


Re: netns_id in bpf_sk_lookup_{tcp,udp}

2018-11-20 Thread David Ahern
On 11/20/18 2:05 AM, Nicolas Dichtel wrote:
> Le 20/11/2018 à 00:46, David Ahern a écrit :
> [snip]
>> That revelation shows another hole:
>> $ ip netns add foo
>> $ ip netns set foo 0x
> It also works with 0xf000 ...
> 
>> $ ip netns list
>> foo (id: 0)
>>
>> Seems like alloc_netid() should error out if reqid < -1 (-1 being the
>> NETNSA_NSID_NOT_ASSIGNED flag) as opposed to blindly ignoring it.
> alloc_netid() tries to allocate the specified nsid if this nsid is valid,
> ie >= 0, else it allocates a new nsid (actually the lowest available).
> This is the expected behavior.
> 
> For me, it's more an iproute2 problem: iproute2 parses an unsigned value
> and silently casts it to a signed one.
> 
> -8<
> 
> From 79bac98bfd0acbf2526a3427d5aba96564844209 Mon Sep 17 00:00:00 2001
> From: Nicolas Dichtel 
> Date: Tue, 20 Nov 2018 09:59:46 +0100
> Subject: ipnetns: parse nsid as a signed integer
> 
> Don't confuse the user: nsid is a signed integer, so this kind of command
> should return an error: 'ip netns set foo 0x'.
> 
> Signed-off-by: Nicolas Dichtel 
> ---
>  ip/ipnetns.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/ip/ipnetns.c b/ip/ipnetns.c
> index 0eac18cf2682..54346ac987cf 100644
> --- a/ip/ipnetns.c
> +++ b/ip/ipnetns.c
> @@ -739,8 +739,7 @@ static int netns_set(int argc, char **argv)
>  {
>   char netns_path[PATH_MAX];
>   const char *name;
> - unsigned int nsid;
> - int netns;
> + int netns, nsid;
> 
>   if (argc < 1) {
>   fprintf(stderr, "No netns name specified\n");
> @@ -754,7 +753,7 @@ static int netns_set(int argc, char **argv)
>   /* If a negative nsid is specified the kernel will select the nsid. */
>   if (strcmp(argv[1], "auto") == 0)
>   nsid = -1;
> - else if (get_unsigned(&nsid, argv[1], 0))
> + else if (get_integer(&nsid, argv[1], 0))
>   invarg("Invalid \"netnsid\" value\n", argv[1]);
> 
>   snprintf(netns_path, sizeof(netns_path), "%s/%s", NETNS_RUN_DIR, name);
> 

Nicolas: Can you send this formally and cc Stephen so it goes into the
master branch? Thanks


Re: [net-next 00/11][pull request] 100GbE Intel Wired LAN Driver Updates 2018-11-20

2018-11-20 Thread David Miller
From: Jeff Kirsher 
Date: Tue, 20 Nov 2018 12:22:36 -0800

> This series contains updates to the ice driver only.

Pulled, thanks Jeff.


Re: [iproute2-next PATCH v3 2/2] man: tc-flower: Add explanation for range option

2018-11-20 Thread David Ahern
On 11/20/18 9:59 PM, Nambiar, Amritha wrote:
> Oops, submitted the v2 patch for man changes too soon, without seeing
> this. So, in this case, should I re-submit the iproute2-flower patch
> that was accepted removing the 'range' keyword?

I think so. Consistency across commands is a good thing.


Re: [iproute2-next PATCH v3 2/2] man: tc-flower: Add explanation for range option

2018-11-20 Thread Nambiar, Amritha
On 11/20/2018 8:46 PM, David Ahern wrote:
> On 11/20/18 9:44 PM, Nambiar, Amritha wrote:
>> On 11/20/2018 2:56 PM, David Ahern wrote:
>>> On 11/15/18 5:55 PM, Amritha Nambiar wrote:
 Add details explaining filtering based on port ranges.

 Signed-off-by: Amritha Nambiar 
 ---
  man/man8/tc-flower.8 |   12 ++--
  1 file changed, 10 insertions(+), 2 deletions(-)

 diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
 index 8be8882..768bfa1 100644
 --- a/man/man8/tc-flower.8
 +++ b/man/man8/tc-flower.8
 @@ -56,8 +56,10 @@ flower \- flow based traffic control filter
  .IR MASKED_IP_TTL " | { "
  .BR dst_ip " | " src_ip " } "
  .IR PREFIX " | { "
 -.BR dst_port " | " src_port " } "
 -.IR port_number " } | "
 +.BR dst_port " | " src_port " } { "
 +.IR port_number " | "
 +.B range
 +.IR min_port_number-max_port_number " } | "
  .B tcp_flags
  .IR MASKED_TCP_FLAGS " | "
  .B type
 @@ -227,6 +229,12 @@ Match on layer 4 protocol source or destination port 
 number. Only available for
  .BR ip_proto " values " udp ", " tcp  " and " sctp
  which have to be specified in beforehand.
  .TP
 +.BI range " MIN_VALUE-MAX_VALUE"
 +Match on a range of layer 4 protocol source or destination port number. 
 Only
 +available for
 +.BR ip_proto " values " udp ", " tcp  " and " sctp
 +which have to be specified in beforehand.
 +.TP
  .BI tcp_flags " MASKED_TCP_FLAGS"
  Match on TCP flags represented as 12bit bitfield in in hexadecimal format.
  A mask may be optionally provided to limit the bits which are matched. A 
 mask

>>>
>>> This prints as:
>>>
>>> dst_port NUMBER
>>> src_port NUMBER
>>>   Match  on  layer  4  protocol source or destination port number.
>>>   Only available for ip_proto values udp, tcp and sctp which  have
>>>   to be specified in beforehand.
>>>
>>> range MIN_VALUE-MAX_VALUE
>>>   Match  on a range of layer 4 protocol source or destination port
>>>   number. Only available for ip_proto values  udp,  tcp  and  sctp
>>>   which have to be specified in beforehand.
>>>
>>> ###
>>>
>>> That makes it look like range is a standalone option - independent of
>>> dst_port/src_port.
>>>
>>> It seems to me the dst_port / src_port should be updated to:
>>>
>>> dst_port {NUMBER | range MIN_VALUE-MAX_VALUE}
>>>
>>> with the description updated for both options and indented under
>>> dst_port / src_port
>>>
>>
>> Okay, will do.
>>
> 
> Thinking about this perhaps the 'range' keyword can just be dropped. We
> do not use it in other places -- e.g., ip rule.
> 

Oops, submitted the v2 patch for man changes too soon, without seeing
this. So, in this case, should I re-submit the iproute2-flower patch
that was accepted removing the 'range' keyword?


Re: [PATCH v4 net-next 0/6] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers

2018-11-20 Thread David Miller
From: 
Date: Tue, 20 Nov 2018 15:55:04 -0800

> This series of patches is to modify the original KSZ9477 DSA driver so
> that other KSZ switch drivers can be added and use the common code.

Series applied.


[PATCH v5 bpf-next 0/2] bpf: adding support for mapinmap in libbpf

2018-11-20 Thread Nikita V. Shirokov
In this patch series I'm adding a helper for libbpf which would allow
it to load a map-in-map (BPF_MAP_TYPE_ARRAY_OF_MAPS and
BPF_MAP_TYPE_HASH_OF_MAPS).
The first patch contains the new helper and explains the proposed workflow;
the second patch contains tests which can also be used as usage examples.

v4->v5:
 - naming: renamed everything to map_in_map instead of mapinmap
 - start to return nonzero val if set_inner_map_fd failed

v3->v4:
 - renamed helper to set_inner_map_fd
 - now we set this value only if it hasn't
   been set before and only for (array|hash) of maps

v2->v3:
 - fixing typo in patch description
 - initializing inner_map_fd to -1 by default

v1->v2:
 - addressing nits
 - removing const identifier from fd in new helper
 - starting to check return val for bpf_map_update_elem

Nikita V. Shirokov (2):
  bpf: adding support for map in map in libbpf
  bpf: adding tests for map_in_map helper in libbpf

 tools/lib/bpf/libbpf.c| 40 ++--
 tools/lib/bpf/libbpf.h|  2 +
 tools/testing/selftests/bpf/Makefile  |  3 +-
 tools/testing/selftests/bpf/test_map_in_map.c | 49 +++
 tools/testing/selftests/bpf/test_maps.c   | 90 +++
 5 files changed, 177 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_map_in_map.c

-- 
2.15.1



[PATCH v5 bpf-next 2/2] bpf: adding tests for map_in_map helper in libbpf

2018-11-20 Thread Nikita V. Shirokov
adding test/example of bpf_map__set_inner_map_fd usage

Signed-off-by: Nikita V. Shirokov 
Acked-by: Yonghong Song 
---
 tools/testing/selftests/bpf/Makefile  |  3 +-
 tools/testing/selftests/bpf/test_map_in_map.c | 49 +++
 tools/testing/selftests/bpf/test_maps.c   | 90 +++
 3 files changed, 141 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_map_in_map.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 1dde03ea1484..43157bd89165 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -38,7 +38,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test
 	test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \
 	get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
 	test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o \
-	test_sk_lookup_kern.o test_xdp_vlan.o test_queue_map.o test_stack_map.o
+	test_sk_lookup_kern.o test_xdp_vlan.o test_queue_map.o test_stack_map.o \
+	test_map_in_map.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_map_in_map.c b/tools/testing/selftests/bpf/test_map_in_map.c
new file mode 100644
index ..ce923e67e08e
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_map_in_map.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Facebook */
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") mim_array = {
+   .type = BPF_MAP_TYPE_ARRAY_OF_MAPS,
+   .key_size = sizeof(int),
+   /* must be sizeof(__u32) for map in map */
+   .value_size = sizeof(__u32),
+   .max_entries = 1,
+   .map_flags = 0,
+};
+
+struct bpf_map_def SEC("maps") mim_hash = {
+   .type = BPF_MAP_TYPE_HASH_OF_MAPS,
+   .key_size = sizeof(int),
+   /* must be sizeof(__u32) for map in map */
+   .value_size = sizeof(__u32),
+   .max_entries = 1,
+   .map_flags = 0,
+};
+
+SEC("xdp_mimtest")
+int xdp_mimtest0(struct xdp_md *ctx)
+{
+   int value = 123;
+   int key = 0;
+   void *map;
+
+   map = bpf_map_lookup_elem(&mim_array, &key);
+   if (!map)
+   return XDP_DROP;
+
+   bpf_map_update_elem(map, &key, &value, 0);
+
+   map = bpf_map_lookup_elem(&mim_hash, &key);
+   if (!map)
+   return XDP_DROP;
+
+   bpf_map_update_elem(map, &key, &value, 0);
+
+   return XDP_PASS;
+}
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c
index 9f0a5b16a246..9c79ee017df3 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -1125,6 +1125,94 @@ static void test_sockmap(int tasks, void *data)
exit(1);
 }
 
+#define MAPINMAP_PROG "./test_map_in_map.o"
+static void test_map_in_map(void)
+{
+   struct bpf_program *prog;
+   struct bpf_object *obj;
+   struct bpf_map *map;
+   int mim_fd, fd, err;
+   int pos = 0;
+
+   obj = bpf_object__open(MAPINMAP_PROG);
+
+   fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(int), sizeof(int),
+   2, 0);
+   if (fd < 0) {
+   printf("Failed to create hashmap '%s'!\n", strerror(errno));
+   exit(1);
+   }
+
+   map = bpf_object__find_map_by_name(obj, "mim_array");
+   if (IS_ERR(map)) {
+   printf("Failed to load array of maps from test prog\n");
+   goto out_map_in_map;
+   }
+   err = bpf_map__set_inner_map_fd(map, fd);
+   if (err) {
+   printf("Failed to set inner_map_fd for array of maps\n");
+   goto out_map_in_map;
+   }
+
+   map = bpf_object__find_map_by_name(obj, "mim_hash");
+   if (IS_ERR(map)) {
+   printf("Failed to load hash of maps from test prog\n");
+   goto out_map_in_map;
+   }
+   err = bpf_map__set_inner_map_fd(map, fd);
+   if (err) {
+   printf("Failed to set inner_map_fd for hash of maps\n");
+   goto out_map_in_map;
+   }
+
+   bpf_object__for_each_program(prog, obj) {
+   bpf_program__set_xdp(prog);
+   }
+   bpf_object__load(obj);
+
+   map = bpf_object__find_map_by_name(obj, "mim_array");
+   if (IS_ERR(map)) {
+   printf("Failed to load array of maps from test prog\n");
+   goto out_map_in_map;
+   }
+   mim_fd = bpf_map__fd(map);
+   if (mim_fd < 0) {
+   printf("Failed to get descriptor for array of maps\n");
+   goto out_map_in_map;
+   }
+
+   err = bpf_map_update_elem(mim_fd, &pos, &fd, 0);
+   if (err) {
+   printf("Failed to update array of maps\n");
+   goto 

[PATCH v5 bpf-next 1/2] bpf: adding support for map in map in libbpf

2018-11-20 Thread Nikita V. Shirokov
The idea is pretty simple: for a specified map (pointed to by struct
bpf_map) we provide the descriptor of an already loaded map, which is
going to be used as a prototype for the inner map. Proposed workflow:
1) open bpf's object (bpf_object__open)
2) create the bpf map which is going to be used as a prototype
3) find (by name) the map-in-map which you want to load and update it
with the descriptor of the inner map, using the new helper from this patch
4) load the bpf program with bpf_object__load

Signed-off-by: Nikita V. Shirokov 
Acked-by: Yonghong Song 
---
 tools/lib/bpf/libbpf.c | 40 ++--
 tools/lib/bpf/libbpf.h |  2 ++
 2 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index cb6565d79603..ba12e070f182 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -167,6 +167,7 @@ struct bpf_map {
char *name;
size_t offset;
int map_ifindex;
+   int inner_map_fd;
struct bpf_map_def def;
__u32 btf_key_type_id;
__u32 btf_value_type_id;
@@ -594,6 +595,14 @@ static int compare_bpf_map(const void *_a, const void *_b)
return a->offset - b->offset;
 }
 
+static bool bpf_map_type__is_map_in_map(enum bpf_map_type type)
+{
+   if (type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
+   type == BPF_MAP_TYPE_HASH_OF_MAPS)
+   return true;
+   return false;
+}
+
 static int
 bpf_object__init_maps(struct bpf_object *obj, int flags)
 {
@@ -657,13 +666,15 @@ bpf_object__init_maps(struct bpf_object *obj, int flags)
}
obj->nr_maps = nr_maps;
 
-   /*
-* fill all fd with -1 so won't close incorrect
-* fd (fd=0 is stdin) when failure (zclose won't close
-* negative fd)).
-*/
-   for (i = 0; i < nr_maps; i++)
+   for (i = 0; i < nr_maps; i++) {
+   /*
+* fill all fd with -1 so won't close incorrect
+* fd (fd=0 is stdin) when failure (zclose won't close
+* negative fd)).
+*/
obj->maps[i].fd = -1;
+   obj->maps[i].inner_map_fd = -1;
+   }
 
/*
 * Fill obj->maps using data in "maps" section.
@@ -1164,6 +1175,9 @@ bpf_object__create_maps(struct bpf_object *obj)
create_attr.btf_fd = 0;
create_attr.btf_key_type_id = 0;
create_attr.btf_value_type_id = 0;
+   if (bpf_map_type__is_map_in_map(def->type) &&
+   map->inner_map_fd >= 0)
+   create_attr.inner_map_fd = map->inner_map_fd;
 
if (obj->btf && !bpf_map_find_btf_info(map, obj->btf)) {
create_attr.btf_fd = btf__fd(obj->btf);
@@ -2621,6 +2635,20 @@ void bpf_map__set_ifindex(struct bpf_map *map, __u32 ifindex)
map->map_ifindex = ifindex;
 }
 
+int bpf_map__set_inner_map_fd(struct bpf_map *map, int fd)
+{
+   if (!bpf_map_type__is_map_in_map(map->def.type)) {
+   pr_warning("error: unsupported map type\n");
+   return -EINVAL;
+   }
+   if (map->inner_map_fd != -1) {
+   pr_warning("error: inner_map_fd already specified\n");
+   return -EINVAL;
+   }
+   map->inner_map_fd = fd;
+   return 0;
+}
+
 static struct bpf_map *
 __bpf_map__iter(struct bpf_map *m, struct bpf_object *obj, int i)
 {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index b1686a787102..16158b6b213f 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -293,6 +293,8 @@ LIBBPF_API void bpf_map__set_ifindex(struct bpf_map *map, __u32 ifindex);
 LIBBPF_API int bpf_map__pin(struct bpf_map *map, const char *path);
 LIBBPF_API int bpf_map__unpin(struct bpf_map *map, const char *path);
 
+LIBBPF_API int bpf_map__set_inner_map_fd(struct bpf_map *map, int fd);
+
 LIBBPF_API long libbpf_get_error(const void *ptr);
 
 struct bpf_prog_load_attr {
-- 
2.15.1



[iproute2-next PATCH v2] man: tc-flower: Add explanation for range option

2018-11-20 Thread Amritha Nambiar
Add details explaining filtering based on port ranges.

v2: Modified description to remove range as standalone option
and updated as part of dst_port/src_port.

Signed-off-by: Amritha Nambiar 
---
 man/man8/tc-flower.8 |   15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
index 8be8882..1d195d0 100644
--- a/man/man8/tc-flower.8
+++ b/man/man8/tc-flower.8
@@ -56,8 +56,10 @@ flower \- flow based traffic control filter
 .IR MASKED_IP_TTL " | { "
 .BR dst_ip " | " src_ip " } "
 .IR PREFIX " | { "
-.BR dst_port " | " src_port " } "
-.IR port_number " } | "
+.BR dst_port " | " src_port " } { "
+.IR port_number " | "
+.B range
+.IR min_port_number-max_port_number " } | "
 .B tcp_flags
 .IR MASKED_TCP_FLAGS " | "
 .B type
@@ -220,10 +222,13 @@ must be a valid IPv4 or IPv6 address, depending on the \fBprotocol\fR
 option to tc filter, optionally followed by a slash and the prefix length.
 If the prefix is missing, \fBtc\fR assumes a full-length host match.
 .TP
-.BI dst_port " NUMBER"
+.BR dst_port " { "  \fINUMBER " | " range " \fIMIN_VALUE-MAX_VALUE "  \fR }
 .TQ
-.BI src_port " NUMBER"
-Match on layer 4 protocol source or destination port number. Only available for
+.BR src_port " { "  \fINUMBER " | " range " \fIMIN_VALUE-MAX_VALUE "  \fR }
+Match on layer 4 protocol source or destination port number. Alternatively, the
+\fBrange\fR option can be used to match on a range of layer 4 protocol source
+or destination port numbers by specifying the minimum and maximum values. Only
+available for
 .BR ip_proto " values " udp ", " tcp  " and " sctp
 which have to be specified in beforehand.
 .TP



Re: [iproute2-next PATCH v3 2/2] man: tc-flower: Add explanation for range option

2018-11-20 Thread David Ahern
On 11/20/18 9:44 PM, Nambiar, Amritha wrote:
> On 11/20/2018 2:56 PM, David Ahern wrote:
>> On 11/15/18 5:55 PM, Amritha Nambiar wrote:
>>> Add details explaining filtering based on port ranges.
>>>
>>> Signed-off-by: Amritha Nambiar 
>>> ---
>>>  man/man8/tc-flower.8 |   12 ++--
>>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
>>> index 8be8882..768bfa1 100644
>>> --- a/man/man8/tc-flower.8
>>> +++ b/man/man8/tc-flower.8
>>> @@ -56,8 +56,10 @@ flower \- flow based traffic control filter
>>>  .IR MASKED_IP_TTL " | { "
>>>  .BR dst_ip " | " src_ip " } "
>>>  .IR PREFIX " | { "
>>> -.BR dst_port " | " src_port " } "
>>> -.IR port_number " } | "
>>> +.BR dst_port " | " src_port " } { "
>>> +.IR port_number " | "
>>> +.B range
>>> +.IR min_port_number-max_port_number " } | "
>>>  .B tcp_flags
>>>  .IR MASKED_TCP_FLAGS " | "
>>>  .B type
>>> @@ -227,6 +229,12 @@ Match on layer 4 protocol source or destination port 
>>> number. Only available for
>>>  .BR ip_proto " values " udp ", " tcp  " and " sctp
>>>  which have to be specified in beforehand.
>>>  .TP
>>> +.BI range " MIN_VALUE-MAX_VALUE"
>>> +Match on a range of layer 4 protocol source or destination port number. 
>>> Only
>>> +available for
>>> +.BR ip_proto " values " udp ", " tcp  " and " sctp
>>> +which have to be specified in beforehand.
>>> +.TP
>>>  .BI tcp_flags " MASKED_TCP_FLAGS"
>>>  Match on TCP flags represented as 12bit bitfield in in hexadecimal format.
>>>  A mask may be optionally provided to limit the bits which are matched. A 
>>> mask
>>>
>>
>> This prints as:
>>
>> dst_port NUMBER
>> src_port NUMBER
>>   Match  on  layer  4  protocol source or destination port number.
>>   Only available for ip_proto values udp, tcp and sctp which  have
>>   to be specified in beforehand.
>>
>> range MIN_VALUE-MAX_VALUE
>>   Match  on a range of layer 4 protocol source or destination port
>>   number. Only available for ip_proto values  udp,  tcp  and  sctp
>>   which have to be specified in beforehand.
>>
>> ###
>>
>> That makes it look like range is a standalone option - independent of
>> dst_port/src_port.
>>
>> It seems to me the dst_port / src_port should be updated to:
>>
>> dst_port {NUMBER | range MIN_VALUE-MAX_VALUE}
>>
>> with the description updated for both options and indented under
>> dst_port / src_port
>>
> 
> Okay, will do.
> 

Thinking about this perhaps the 'range' keyword can just be dropped. We
do not use it in other places -- e.g., ip rule.


Re: [iproute2-next PATCH v3 2/2] man: tc-flower: Add explanation for range option

2018-11-20 Thread Nambiar, Amritha
On 11/20/2018 2:56 PM, David Ahern wrote:
> On 11/15/18 5:55 PM, Amritha Nambiar wrote:
>> Add details explaining filtering based on port ranges.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>  man/man8/tc-flower.8 |   12 ++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
>> index 8be8882..768bfa1 100644
>> --- a/man/man8/tc-flower.8
>> +++ b/man/man8/tc-flower.8
>> @@ -56,8 +56,10 @@ flower \- flow based traffic control filter
>>  .IR MASKED_IP_TTL " | { "
>>  .BR dst_ip " | " src_ip " } "
>>  .IR PREFIX " | { "
>> -.BR dst_port " | " src_port " } "
>> -.IR port_number " } | "
>> +.BR dst_port " | " src_port " } { "
>> +.IR port_number " | "
>> +.B range
>> +.IR min_port_number-max_port_number " } | "
>>  .B tcp_flags
>>  .IR MASKED_TCP_FLAGS " | "
>>  .B type
>> @@ -227,6 +229,12 @@ Match on layer 4 protocol source or destination port 
>> number. Only available for
>>  .BR ip_proto " values " udp ", " tcp  " and " sctp
>>  which have to be specified in beforehand.
>>  .TP
>> +.BI range " MIN_VALUE-MAX_VALUE"
>> +Match on a range of layer 4 protocol source or destination port number. Only
>> +available for
>> +.BR ip_proto " values " udp ", " tcp  " and " sctp
>> +which have to be specified in beforehand.
>> +.TP
>>  .BI tcp_flags " MASKED_TCP_FLAGS"
>>  Match on TCP flags represented as 12bit bitfield in in hexadecimal format.
>>  A mask may be optionally provided to limit the bits which are matched. A 
>> mask
>>
> 
> This prints as:
> 
> dst_port NUMBER
> src_port NUMBER
>   Match  on  layer  4  protocol source or destination port number.
>   Only available for ip_proto values udp, tcp and sctp which  have
>   to be specified in beforehand.
> 
> range MIN_VALUE-MAX_VALUE
>   Match  on a range of layer 4 protocol source or destination port
>   number. Only available for ip_proto values  udp,  tcp  and  sctp
>   which have to be specified in beforehand.
> 
> ###
> 
> That makes it look like range is a standalone option - independent of
> dst_port/src_port.
> 
> It seems to me the dst_port / src_port should be updated to:
> 
> dst_port {NUMBER | range MIN_VALUE-MAX_VALUE}
> 
> with the description updated for both options and indented under
> dst_port / src_port
> 

Okay, will do.

- Amritha
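
For readers following the man-page discussion, a hypothetical invocation of the documented syntax might look like the following (device name and port numbers are made up for illustration; the `range` keyword follows `dst_port`/`src_port` as the patched man page describes, and requires an iproute2 build with this patch applied):

```
# Match TCP traffic with destination port anywhere in 8080-8090 and drop it.
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress protocol ip flower \
    ip_proto tcp dst_port range 8080-8090 \
    action drop

# A single port still uses the plain number form.
tc filter add dev eth0 ingress protocol ip flower \
    ip_proto tcp src_port 22 \
    action pass
```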


RE: thunderx: nicvf_xdp_setup error code path

2018-11-20 Thread Goutham, Sunil



> -Original Message-
> From: Lorenzo Bianconi 
> Sent: 20 November 2018 23:27
> To: netdev@vger.kernel.org
> Cc: Goutham, Sunil 
> Subject: net: thunderx: nicvf_xdp_setup error code path
> 
> External Email
> 
> Hi all,
> 
> looking at thunderx XDP support I noticed that the nic->xdp_prog pointer in
> nicvf_xdp_setup() is not actually set to NULL if bpf_prog_add() fails;
> instead it is left holding the bpf_prog_add() error code. The xdp_prog
> pointer value is used in the driver to verify whether XDP is currently enabled.
> Moreover, nicvf_xdp_setup() does not report any error code to userspace in
> case of failure.
> I wrote the following patch to fix the reported issues. Please note I have
> only compile-tested it, since I have no thunderx NIC at the moment.
> 
> @Sunil: could you please give it a whirl? If it is ok I will post a formal
> patch, thanks
> 
> Regards,
> Lorenzo
> 

Thanks for fixing, changes look good to me, 

Sunil.


Re: [PATCH v4 net-next 0/6] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers

2018-11-20 Thread Florian Fainelli



On 11/20/2018 3:55 PM, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> This series of patches is to modify the original KSZ9477 DSA driver so
> that other KSZ switch drivers can be added and use the common code.
> 
> There are several steps to accomplish this.  First, rename some
> function names with a prefix to indicate chip-specific
> functions.  Second, move common code into a header that can be shared.
> Last, modify tag_ksz.c so that it can handle the many tail tag formats
> used by the different KSZ switch drivers.
> 
> ksz_common.c will contain the common code used by all KSZ switch drivers.
> ksz9477.c will contain KSZ9477 code from the original ksz_common.c.
> ksz9477_spi.c is renamed from ksz_spi.c.
> ksz9477_reg.h is renamed from ksz_9477_reg.h.
> ksz_common.h is added to provide common code access to KSZ switch
> drivers.
> ksz_spi.h is added to provide common SPI access functions to KSZ SPI
> drivers.

Thanks a lot for getting this series out, hopefully there is no blocker
to getting it merged now and you can follow up with additional features.

> 
> v4
> - Patches were removed to concentrate on changing driver structure without
> adding new code.
> 
> v3
> - The phy_device structure is used to hold port link information
> - A structure is passed in ksz_xmit and ksz_rcv instead of function pointer
> - Switch offload forwarding is supported
> 
> v2
> - Initialize reg_mutex before use
> - The alu_mutex is only used inside chip specific functions
> 
> v1
> - Each patch in the set is self-contained
> - Use ksz9477 prefix to indicate KSZ9477 specific code
> 
> Tristram Ha (6):
>   net: dsa: microchip: replace license with GPL
>   net: dsa: microchip: clean up code
>   net: dsa: microchip: rename some functions with ksz9477 prefix
>   net: dsa: microchip: rename ksz_spi.c to ksz9477_spi.c
>   net: dsa: microchip: break KSZ9477 DSA driver into two files
>   net: dsa: microchip: rename ksz_9477_reg.h to ksz9477_reg.h
> 
>  drivers/net/dsa/microchip/Kconfig  |   16 +-
>  drivers/net/dsa/microchip/Makefile |5 +-
>  drivers/net/dsa/microchip/ksz9477.c| 1316 
> 
>  .../microchip/{ksz_9477_reg.h => ksz9477_reg.h}|   17 +-
>  drivers/net/dsa/microchip/ksz9477_spi.c|  177 +++
>  drivers/net/dsa/microchip/ksz_common.c | 1183 +++---
>  drivers/net/dsa/microchip/ksz_common.h |  214 
>  drivers/net/dsa/microchip/ksz_priv.h   |  245 ++--
>  drivers/net/dsa/microchip/ksz_spi.c|  217 
>  drivers/net/dsa/microchip/ksz_spi.h|   69 +
>  10 files changed, 2039 insertions(+), 1420 deletions(-)
>  create mode 100644 drivers/net/dsa/microchip/ksz9477.c
>  rename drivers/net/dsa/microchip/{ksz_9477_reg.h => ksz9477_reg.h} (98%)
>  create mode 100644 drivers/net/dsa/microchip/ksz9477_spi.c
>  create mode 100644 drivers/net/dsa/microchip/ksz_common.h
>  delete mode 100644 drivers/net/dsa/microchip/ksz_spi.c
>  create mode 100644 drivers/net/dsa/microchip/ksz_spi.h
> 

-- 
Florian


Re: [PATCH v4 net-next 6/6] net: dsa: microchip: rename ksz_9477_reg.h to ksz9477_reg.h

2018-11-20 Thread Florian Fainelli



On 11/20/2018 3:55 PM, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> Rename ksz_9477_reg.h to ksz9477_reg.h for consistency as the product
> name is always KSZ.
> 
> Signed-off-by: Tristram Ha 
> Reviewed-by: Woojung Huh 
> Reviewed-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH v4 net-next 1/6] net: dsa: microchip: replace license with GPL

2018-11-20 Thread Florian Fainelli



On 11/20/2018 3:55 PM, tristram...@microchip.com wrote:
> From: Tristram Ha 
> 
> Replace license with GPL.
> 
> Signed-off-by: Tristram Ha 
> Reviewed-by: Woojung Huh 
> Reviewed-by: Andrew Lunn 
> Acked-by: Pavel Machek 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH net v2] net/sched: act_police: fix race condition on state variables

2018-11-20 Thread Cong Wang
On Tue, Nov 20, 2018 at 3:30 PM Eric Dumazet  wrote:
>
> On Tue, Nov 20, 2018 at 3:28 PM David Miller  wrote:
> >
> > Applied.
>
> We need a fix to make lockdep happy, as reported by Cong.
>
> Cong, do you want to handle this ?
>

I hope Davide can send a follow-up fix and actually test it
with LOCKDEP enabled.


Re: [Patch net] net: invert the check of detecting hardware RX checksum fault

2018-11-20 Thread Herbert Xu
On Tue, Nov 20, 2018 at 10:18:17AM -0800, Eric Dumazet wrote:
>
> > Something like this?  Is it safe to linearize here?

It looks safe to me.  It's only unsafe if your skb is shared, which
from my grepping does not appear to be the case (and it cannot be
shared if you're modifying skb->csum, which all the callers that
I found were doing, for obvious reasons).

If it's cloned, skb_linearize will unclone it.

> I guess we should dump from skb->head instead, to show all headers.
>
> Maybe dump the mac header offset as well, so that we can ignore the
> padding bytes (NET_SKB_PAD) if needed.

Yes I agree.  It doesn't hurt to dump more data.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH net] sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit

2018-11-20 Thread David Miller
From: Marcelo Ricardo Leitner 
Date: Tue, 20 Nov 2018 22:56:07 -0200

> On Mon, Nov 19, 2018 at 12:39:55PM -0800, David Miller wrote:
>> From: Xin Long 
>> Date: Sun, 18 Nov 2018 15:07:38 +0800
>> 
>> > Now sctp increases sk_wmem_alloc by 1 when doing set_owner_w for the
> >> > skb allocated in sctp_packet_transmit and decreases by 1 when freeing
>> > this skb.
>> > 
>> > But when this skb goes through networking stack, some subcomponents
>> > might change skb->truesize and add the same amount on sk_wmem_alloc.
> >> > However sctp doesn't know the amount to decrease by, which causes
> >> > a leak on sk->sk_wmem_alloc so the sock can never be freed.
>> > 
>> > Xiumei found this issue when it hit esp_output_head() by using sctp
>> > over ipsec, where skb->truesize is added and so is sk->sk_wmem_alloc.
>> > 
>> > Since sctp has used sk_wmem_queued to count for writable space since
>> > Commit cd305c74b0f8 ("sctp: use sk_wmem_queued to check for writable
>> > space"), it's ok to fix it by counting sk_wmem_alloc by skb truesize
>> > in sctp_packet_transmit.
>> > 
>> > Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
>> > Reported-by: Xiumei Mu 
>> > Signed-off-by: Xin Long 
>> 
>> Applied and queued up for -stable.
> 
> Dave, is there a way that we can check to which versions you queued it
> up?

I queued up the patch as-is, and then do backports as needed.

If you think it's too complex to backport this, I'll toss it from the
-stable queue and that's what I have just done.


Re: [PATCH v1 net] lan743x: fix return value for lan743x_tx_napi_poll

2018-11-20 Thread David Miller
From: 
Date: Wed, 21 Nov 2018 02:13:30 +

> Slightly off-topic, but I am not sure why NAPI is used on the transmit side.
> Originally NAPI was designed to fix the problem of the receive interrupt
> firing on each received frame, so on the transmit side is it to avoid the
> transmit done interrupt on each transmitted frame?  Typically hardware has
> a way to trigger the transmit done interrupt or not in each transmit frame.

It puts transmit completion, like receive processing, inside of a
software interrupt instead of a hardware interrupt.

It is very much intended that all drivers do transmit completion
inside of a NAPI context.

Avoiding TX interrupts by clearing interrupt indication bits in the
TX descriptors or turning TX completion interrupts off completely
is a non-starter.

All TX completion events MUST occur in a short, finite amount of time,
otherwise you wedge TCP sockets waiting for memory to be freed up,
etc.


[PATCH net-next,v3 01/12] flow_dissector: add flow_rule and flow_match structures and use them

2018-11-20 Thread Pablo Neira Ayuso
This patch wraps the dissector key and mask - which flower uses to
represent the matching side - in the new flow_match structure.

To avoid a follow-up patch that would edit the same LoCs in the drivers,
this patch also embeds this new flow match structure in the flow rule
object. This new structure will also contain the flow actions in follow-up
patches.

This introduces two new interfaces:

bool flow_rule_match_key(rule, dissector_id)

which returns true if the given matching key is set on the rule, and:

flow_rule_match_XYZ(rule, );

which fetches the matching side XYZ into the match container structure,
retrieving both the key and the mask with a single call.

Signed-off-by: Pablo Neira Ayuso 
---
v3: Proposed by Jiri Pirko:
- Place this new API in net/core/flow_offload.c and
  include/net/flow_offload.h.
- Add flow_rule_alloc() helper function.

 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   | 174 -
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 194 --
 drivers/net/ethernet/intel/i40e/i40e_main.c| 178 -
 drivers/net/ethernet/intel/iavf/iavf_main.c| 195 --
 drivers/net/ethernet/intel/igb/igb_main.c  |  64 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 420 +
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 202 +-
 drivers/net/ethernet/netronome/nfp/flower/action.c |  11 +-
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 417 ++--
 .../net/ethernet/netronome/nfp/flower/offload.c| 145 +++
 drivers/net/ethernet/qlogic/qede/qede_filter.c |  85 ++---
 include/net/flow_offload.h | 115 ++
 include/net/pkt_cls.h  |  11 +-
 net/core/Makefile  |   2 +-
 net/core/flow_offload.c| 143 +++
 net/sched/cls_flower.c |  45 ++-
 16 files changed, 1195 insertions(+), 1206 deletions(-)
 create mode 100644 include/net/flow_offload.h
 create mode 100644 net/core/flow_offload.c

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 749f63beddd8..b82143d6cdde 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -177,18 +177,12 @@ static int bnxt_tc_parse_actions(struct bnxt *bp,
return 0;
 }
 
-#define GET_KEY(flow_cmd, key_type)\
-   skb_flow_dissector_target((flow_cmd)->dissector, key_type,\
- (flow_cmd)->key)
-#define GET_MASK(flow_cmd, key_type)   \
-   skb_flow_dissector_target((flow_cmd)->dissector, key_type,\
- (flow_cmd)->mask)
-
 static int bnxt_tc_parse_flow(struct bnxt *bp,
  struct tc_cls_flower_offload *tc_flow_cmd,
  struct bnxt_tc_flow *flow)
 {
-   struct flow_dissector *dissector = tc_flow_cmd->dissector;
+   struct flow_rule *rule = tc_cls_flower_offload_flow_rule(tc_flow_cmd);
+   struct flow_dissector *dissector = rule->match.dissector;
 
/* KEY_CONTROL and KEY_BASIC are needed for forming a meaningful key */
if ((dissector->used_keys & BIT(FLOW_DISSECTOR_KEY_CONTROL)) == 0 ||
@@ -198,140 +192,120 @@ static int bnxt_tc_parse_flow(struct bnxt *bp,
return -EOPNOTSUPP;
}
 
-   if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_BASIC)) {
-   struct flow_dissector_key_basic *key =
-   GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC);
-   struct flow_dissector_key_basic *mask =
-   GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC);
+   if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_BASIC)) {
+   struct flow_match_basic match;
 
-   flow->l2_key.ether_type = key->n_proto;
-   flow->l2_mask.ether_type = mask->n_proto;
+   flow_rule_match_basic(rule, );
+   flow->l2_key.ether_type = match.key->n_proto;
+   flow->l2_mask.ether_type = match.mask->n_proto;
 
-   if (key->n_proto == htons(ETH_P_IP) ||
-   key->n_proto == htons(ETH_P_IPV6)) {
-   flow->l4_key.ip_proto = key->ip_proto;
-   flow->l4_mask.ip_proto = mask->ip_proto;
+   if (match.key->n_proto == htons(ETH_P_IP) ||
+   match.key->n_proto == htons(ETH_P_IPV6)) {
+   flow->l4_key.ip_proto = match.key->ip_proto;
+   flow->l4_mask.ip_proto = match.mask->ip_proto;
}
}
 
-   if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
-   struct flow_dissector_key_eth_addrs *key =
-   GET_KEY(tc_flow_cmd, 

[PATCH net-next,v3 05/12] cls_flower: add statistics retrieval infrastructure and use it

2018-11-20 Thread Pablo Neira Ayuso
This patch provides the flow_stats structure that acts as a container for
tc_cls_flower_offload, which we can then use to restore the statistics on
the existing TC actions. Hence, tcf_exts_stats_update() is no longer used
from drivers.

Signed-off-by: Pablo Neira Ayuso 
---
v3: Suggested by Jiri Pirko:
- Rename to struct flow_stats and to function flow_stats_update().

 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c  |  4 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c  |  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   |  2 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c |  2 +-
 drivers/net/ethernet/netronome/nfp/flower/offload.c   |  5 ++---
 include/net/flow_offload.h| 14 ++
 include/net/pkt_cls.h |  1 +
 net/sched/cls_flower.c|  4 
 8 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index b82143d6cdde..09cd75f54eba 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -1366,8 +1366,8 @@ static int bnxt_tc_get_flow_stats(struct bnxt *bp,
lastused = flow->lastused;
spin_unlock(>stats_lock);
 
-   tcf_exts_stats_update(tc_flow_cmd->exts, stats.bytes, stats.packets,
- lastused);
+   flow_stats_update(_flow_cmd->stats, stats.bytes, stats.packets,
+ lastused);
return 0;
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
index 39c5af5dad3d..8a2d66ee1d7b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
@@ -807,9 +807,9 @@ int cxgb4_tc_flower_stats(struct net_device *dev,
if (ofld_stats->packet_count != packets) {
if (ofld_stats->prev_packet_count != packets)
ofld_stats->last_used = jiffies;
-   tcf_exts_stats_update(cls->exts, bytes - ofld_stats->byte_count,
- packets - ofld_stats->packet_count,
- ofld_stats->last_used);
+   flow_stats_update(>stats, bytes - ofld_stats->byte_count,
+ packets - ofld_stats->packet_count,
+ ofld_stats->last_used);
 
ofld_stats->packet_count = packets;
ofld_stats->byte_count = bytes;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 2645e5d1e790..ad53214e0ee5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -3224,7 +3224,7 @@ int mlx5e_stats_flower(struct mlx5e_priv *priv,
 
mlx5_fc_query_cached(counter, , , );
 
-   tcf_exts_stats_update(f->exts, bytes, packets, lastuse);
+   flow_stats_update(>stats, bytes, packets, lastuse);
 
return 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index e6c4c672b1ca..60900e53243b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -460,7 +460,7 @@ int mlxsw_sp_flower_stats(struct mlxsw_sp *mlxsw_sp,
if (err)
goto err_rule_get_stats;
 
-   tcf_exts_stats_update(f->exts, bytes, packets, lastuse);
+   flow_stats_update(>stats, bytes, packets, lastuse);
 
mlxsw_sp_acl_ruleset_put(mlxsw_sp, ruleset);
return 0;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 708331234908..524b9ae1a639 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -532,9 +532,8 @@ nfp_flower_get_stats(struct nfp_app *app, struct net_device 
*netdev,
ctx_id = be32_to_cpu(nfp_flow->meta.host_ctx_id);
 
spin_lock_bh(>stats_lock);
-   tcf_exts_stats_update(flow->exts, priv->stats[ctx_id].bytes,
- priv->stats[ctx_id].pkts,
- priv->stats[ctx_id].used);
+   flow_stats_update(>stats, priv->stats[ctx_id].bytes,
+ priv->stats[ctx_id].pkts, priv->stats[ctx_id].used);
 
priv->stats[ctx_id].pkts = 0;
priv->stats[ctx_id].bytes = 0;
diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 48aa47ba5561..8c1235fb6ed2 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -179,4 +179,18 @@ static inline bool flow_rule_match_key(const struct 
flow_rule *rule,
return dissector_uses_key(rule->match.dissector, key);
 }
 
+struct flow_stats {
+   u64 

[PATCH net-next,v3 07/12] cls_flower: don't expose TC actions to drivers anymore

2018-11-20 Thread Pablo Neira Ayuso
Now that drivers have been converted to use the flow action
infrastructure, remove this field from the tc_cls_flower_offload
structure.

Signed-off-by: Pablo Neira Ayuso 
---
v3: no changes.

 include/net/pkt_cls.h  | 1 -
 net/sched/cls_flower.c | 5 -
 2 files changed, 6 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a08c06e383db..9bd724bfa860 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -768,7 +768,6 @@ struct tc_cls_flower_offload {
unsigned long cookie;
struct flow_rule *rule;
struct flow_stats stats;
-   struct tcf_exts *exts;
u32 classid;
 };
 
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index b88cf29aff7b..ea92228ddc12 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -392,7 +392,6 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
cls_flower.rule->match.dissector = >mask->dissector;
cls_flower.rule->match.mask = >mask->key;
cls_flower.rule->match.key = >mkey;
-   cls_flower.exts = >exts;
cls_flower.classid = f->res.classid;
 
err = tc_setup_flow_action(_flower.rule->action, >exts);
@@ -427,7 +426,6 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct 
cls_fl_filter *f)
tc_cls_common_offload_init(_flower.common, tp, f->flags, NULL);
cls_flower.command = TC_CLSFLOWER_STATS;
cls_flower.cookie = (unsigned long) f;
-   cls_flower.exts = >exts;
cls_flower.classid = f->res.classid;
 
tc_setup_cb_call(block, >exts, TC_SETUP_CLSFLOWER,
@@ -1490,7 +1488,6 @@ static int fl_reoffload(struct tcf_proto *tp, bool add, 
tc_setup_cb_t *cb,
cls_flower.rule->match.dissector = >dissector;
cls_flower.rule->match.mask = >key;
cls_flower.rule->match.key = >mkey;
-   cls_flower.exts = >exts;
 
err = tc_setup_flow_action(_flower.rule->action,
   >exts);
@@ -1523,7 +1520,6 @@ static int fl_hw_create_tmplt(struct tcf_chain *chain,
 {
struct tc_cls_flower_offload cls_flower = {};
struct tcf_block *block = chain->block;
-   struct tcf_exts dummy_exts = { 0, };
 
cls_flower.rule = flow_rule_alloc(0);
if (!cls_flower.rule)
@@ -1535,7 +1531,6 @@ static int fl_hw_create_tmplt(struct tcf_chain *chain,
cls_flower.rule->match.dissector = >dissector;
cls_flower.rule->match.mask = >mask;
cls_flower.rule->match.key = >dummy_key;
-   cls_flower.exts = _exts;
 
/* We don't care if driver (any of them) fails to handle this
 * call. It serves just as a hint for it.
-- 
2.11.0



[PATCH net-next,v3 06/12] drivers: net: use flow action infrastructure

2018-11-20 Thread Pablo Neira Ayuso
This patch updates drivers to use the new flow action infrastructure.

Signed-off-by: Pablo Neira Ayuso 
---
v3: rebase on top of previous patches.

 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   |  74 +++---
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 250 +--
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 266 ++---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |   2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  54 +++--
 drivers/net/ethernet/netronome/nfp/flower/action.c | 185 +++---
 drivers/net/ethernet/qlogic/qede/qede_filter.c |  12 +-
 7 files changed, 417 insertions(+), 426 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 09cd75f54eba..b7bd27edd80e 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -61,9 +61,9 @@ static u16 bnxt_flow_get_dst_fid(struct bnxt *pf_bp, struct 
net_device *dev)
 
 static int bnxt_tc_parse_redir(struct bnxt *bp,
   struct bnxt_tc_actions *actions,
-  const struct tc_action *tc_act)
+  const struct flow_action_entry *act)
 {
-   struct net_device *dev = tcf_mirred_dev(tc_act);
+   struct net_device *dev = act->dev;
 
if (!dev) {
netdev_info(bp->dev, "no dev in mirred action");
@@ -77,16 +77,16 @@ static int bnxt_tc_parse_redir(struct bnxt *bp,
 
 static int bnxt_tc_parse_vlan(struct bnxt *bp,
  struct bnxt_tc_actions *actions,
- const struct tc_action *tc_act)
+ const struct flow_action_entry *act)
 {
-   switch (tcf_vlan_action(tc_act)) {
-   case TCA_VLAN_ACT_POP:
+   switch (act->id) {
+   case FLOW_ACTION_VLAN_POP:
actions->flags |= BNXT_TC_ACTION_FLAG_POP_VLAN;
break;
-   case TCA_VLAN_ACT_PUSH:
+   case FLOW_ACTION_VLAN_PUSH:
actions->flags |= BNXT_TC_ACTION_FLAG_PUSH_VLAN;
-   actions->push_vlan_tci = htons(tcf_vlan_push_vid(tc_act));
-   actions->push_vlan_tpid = tcf_vlan_push_proto(tc_act);
+   actions->push_vlan_tci = htons(act->vlan.vid);
+   actions->push_vlan_tpid = act->vlan.proto;
break;
default:
return -EOPNOTSUPP;
@@ -96,10 +96,10 @@ static int bnxt_tc_parse_vlan(struct bnxt *bp,
 
 static int bnxt_tc_parse_tunnel_set(struct bnxt *bp,
struct bnxt_tc_actions *actions,
-   const struct tc_action *tc_act)
+   const struct flow_action_entry *act)
 {
-   struct ip_tunnel_info *tun_info = tcf_tunnel_info(tc_act);
-   struct ip_tunnel_key *tun_key = _info->key;
+   const struct ip_tunnel_info *tun_info = act->tunnel;
+   const struct ip_tunnel_key *tun_key = _info->key;
 
if (ip_tunnel_info_af(tun_info) != AF_INET) {
netdev_info(bp->dev, "only IPv4 tunnel-encap is supported");
@@ -113,51 +113,43 @@ static int bnxt_tc_parse_tunnel_set(struct bnxt *bp,
 
 static int bnxt_tc_parse_actions(struct bnxt *bp,
 struct bnxt_tc_actions *actions,
-struct tcf_exts *tc_exts)
+struct flow_action *flow_action)
 {
-   const struct tc_action *tc_act;
+   struct flow_action_entry *act;
int i, rc;
 
-   if (!tcf_exts_has_actions(tc_exts)) {
+   if (!flow_action_has_entries(flow_action)) {
netdev_info(bp->dev, "no actions");
return -EINVAL;
}
 
-   tcf_exts_for_each_action(i, tc_act, tc_exts) {
-   /* Drop action */
-   if (is_tcf_gact_shot(tc_act)) {
+   flow_action_for_each(i, act, flow_action) {
+   switch (act->id) {
+   case FLOW_ACTION_DROP:
actions->flags |= BNXT_TC_ACTION_FLAG_DROP;
return 0; /* don't bother with other actions */
-   }
-
-   /* Redirect action */
-   if (is_tcf_mirred_egress_redirect(tc_act)) {
-   rc = bnxt_tc_parse_redir(bp, actions, tc_act);
+   case FLOW_ACTION_REDIRECT:
+   rc = bnxt_tc_parse_redir(bp, actions, act);
if (rc)
return rc;
-   continue;
-   }
-
-   /* Push/pop VLAN */
-   if (is_tcf_vlan(tc_act)) {
-   rc = bnxt_tc_parse_vlan(bp, actions, tc_act);
+   break;
+   case FLOW_ACTION_VLAN_POP:
+   case FLOW_ACTION_VLAN_PUSH:
+   case FLOW_ACTION_VLAN_MANGLE:
+   rc = 

[PATCH net-next,v3 11/12] qede: place ethtool_rx_flow_spec after code after TC flower codebase

2018-11-20 Thread Pablo Neira Ayuso
This is a preparation patch to reuse the existing TC flower codebase
from ethtool_rx_flow_spec.

This patch merely moves the core ethtool_rx_flow_spec parser after the
tc flower offload driver code so we can skip a few forward function
declarations in the follow-up patch.

Signed-off-by: Pablo Neira Ayuso 
---
v3: no changes.

 drivers/net/ethernet/qlogic/qede/qede_filter.c | 264 -
 1 file changed, 132 insertions(+), 132 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c 
b/drivers/net/ethernet/qlogic/qede/qede_filter.c
index 833c9ec58a6e..ed77950f6cf9 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_filter.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c
@@ -1791,72 +1791,6 @@ static int qede_flow_spec_to_tuple_udpv6(struct qede_dev 
*edev,
return 0;
 }
 
-static int qede_flow_spec_to_tuple(struct qede_dev *edev,
-  struct qede_arfs_tuple *t,
-  struct ethtool_rx_flow_spec *fs)
-{
-   memset(t, 0, sizeof(*t));
-
-   if (qede_flow_spec_validate_unused(edev, fs))
-   return -EOPNOTSUPP;
-
-   switch ((fs->flow_type & ~FLOW_EXT)) {
-   case TCP_V4_FLOW:
-   return qede_flow_spec_to_tuple_tcpv4(edev, t, fs);
-   case UDP_V4_FLOW:
-   return qede_flow_spec_to_tuple_udpv4(edev, t, fs);
-   case TCP_V6_FLOW:
-   return qede_flow_spec_to_tuple_tcpv6(edev, t, fs);
-   case UDP_V6_FLOW:
-   return qede_flow_spec_to_tuple_udpv6(edev, t, fs);
-   default:
-   DP_VERBOSE(edev, NETIF_MSG_IFUP,
-  "Can't support flow of type %08x\n", fs->flow_type);
-   return -EOPNOTSUPP;
-   }
-
-   return 0;
-}
-
-static int qede_flow_spec_validate(struct qede_dev *edev,
-  struct ethtool_rx_flow_spec *fs,
-  struct qede_arfs_tuple *t)
-{
-   if (fs->location >= QEDE_RFS_MAX_FLTR) {
-   DP_INFO(edev, "Location out-of-bounds\n");
-   return -EINVAL;
-   }
-
-   /* Check location isn't already in use */
-   if (test_bit(fs->location, edev->arfs->arfs_fltr_bmap)) {
-   DP_INFO(edev, "Location already in use\n");
-   return -EINVAL;
-   }
-
-   /* Check if the filtering-mode could support the filter */
-   if (edev->arfs->filter_count &&
-   edev->arfs->mode != t->mode) {
-   DP_INFO(edev,
-   "flow_spec would require filtering mode %08x, but %08x 
is configured\n",
-   t->mode, edev->arfs->filter_count);
-   return -EINVAL;
-   }
-
-   /* If drop requested then no need to validate other data */
-   if (fs->ring_cookie == RX_CLS_FLOW_DISC)
-   return 0;
-
-   if (ethtool_get_flow_spec_ring_vf(fs->ring_cookie))
-   return 0;
-
-   if (fs->ring_cookie >= QEDE_RSS_COUNT(edev)) {
-   DP_INFO(edev, "Queue out-of-bounds\n");
-   return -EINVAL;
-   }
-
-   return 0;
-}
-
 /* Must be called while qede lock is held */
 static struct qede_arfs_fltr_node *
 qede_flow_find_fltr(struct qede_dev *edev, struct qede_arfs_tuple *t)
@@ -1896,72 +1830,6 @@ static void qede_flow_set_destination(struct qede_dev 
*edev,
   "Configuring N-tuple for VF 0x%02x\n", n->vfid - 1);
 }
 
-int qede_add_cls_rule(struct qede_dev *edev, struct ethtool_rxnfc *info)
-{
-   struct ethtool_rx_flow_spec *fsp = >fs;
-   struct qede_arfs_fltr_node *n;
-   struct qede_arfs_tuple t;
-   int min_hlen, rc;
-
-   __qede_lock(edev);
-
-   if (!edev->arfs) {
-   rc = -EPERM;
-   goto unlock;
-   }
-
-   /* Translate the flow specification into something fittign our DB */
-   rc = qede_flow_spec_to_tuple(edev, , fsp);
-   if (rc)
-   goto unlock;
-
-   /* Make sure location is valid and filter isn't already set */
-   rc = qede_flow_spec_validate(edev, fsp, );
-   if (rc)
-   goto unlock;
-
-   if (qede_flow_find_fltr(edev, )) {
-   rc = -EINVAL;
-   goto unlock;
-   }
-
-   n = kzalloc(sizeof(*n), GFP_KERNEL);
-   if (!n) {
-   rc = -ENOMEM;
-   goto unlock;
-   }
-
-   min_hlen = qede_flow_get_min_header_size();
-   n->data = kzalloc(min_hlen, GFP_KERNEL);
-   if (!n->data) {
-   kfree(n);
-   rc = -ENOMEM;
-   goto unlock;
-   }
-
-   n->sw_id = fsp->location;
-   set_bit(n->sw_id, edev->arfs->arfs_fltr_bmap);
-   n->buf_len = min_hlen;
-
-   memcpy(>tuple, , sizeof(n->tuple));
-
-   qede_flow_set_destination(edev, n, fsp);
-
-   /* Build a minimal header according to the flow */
-   n->tuple.build_hdr(>tuple, n->data);
-
-   rc = 

[PATCH net-next,v3 02/12] net/mlx5e: support for two independent packet edit actions

2018-11-20 Thread Pablo Neira Ayuso
This patch adds pedit_headers_action structure to store the result of
parsing tc pedit actions. Then, it calls alloc_tc_pedit_action() to
populate the mlx5e hardware intermediate representation once all actions
have been parsed.

This patch comes in preparation for the new flow_action infrastructure,
where each packet mangling comes as a separate action, i.e. not packed
as in tc pedit.

Signed-off-by: Pablo Neira Ayuso 
---
v3: no changes.

 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 81 ++---
 1 file changed, 59 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 6a22f7f22890..2645e5d1e790 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1748,6 +1748,12 @@ struct pedit_headers {
struct udphdr  udp;
 };
 
+struct pedit_headers_action {
+   struct pedit_headersvals;
+   struct pedit_headersmasks;
+   u32 pedits;
+};
+
 static int pedit_header_offsets[] = {
[TCA_PEDIT_KEY_EX_HDR_TYPE_ETH] = offsetof(struct pedit_headers, eth),
[TCA_PEDIT_KEY_EX_HDR_TYPE_IP4] = offsetof(struct pedit_headers, ip4),
@@ -1759,16 +1765,15 @@ static int pedit_header_offsets[] = {
 #define pedit_header(_ph, _htype) ((void *)(_ph) + 
pedit_header_offsets[_htype])
 
 static int set_pedit_val(u8 hdr_type, u32 mask, u32 val, u32 offset,
-struct pedit_headers *masks,
-struct pedit_headers *vals)
+struct pedit_headers_action *hdrs)
 {
u32 *curr_pmask, *curr_pval;
 
if (hdr_type >= __PEDIT_HDR_TYPE_MAX)
goto out_err;
 
-   curr_pmask = (u32 *)(pedit_header(masks, hdr_type) + offset);
-   curr_pval  = (u32 *)(pedit_header(vals, hdr_type) + offset);
+   curr_pmask = (u32 *)(pedit_header(>masks, hdr_type) + offset);
+   curr_pval  = (u32 *)(pedit_header(>vals, hdr_type) + offset);
 
if (*curr_pmask & mask)  /* disallow acting twice on the same location 
*/
goto out_err;
@@ -1824,8 +1829,7 @@ static struct mlx5_fields fields[] = {
  * max from the SW pedit action. On success, it says how many HW actions were
  * actually parsed.
  */
-static int offload_pedit_fields(struct pedit_headers *masks,
-   struct pedit_headers *vals,
+static int offload_pedit_fields(struct pedit_headers_action *hdrs,
struct mlx5e_tc_flow_parse_attr *parse_attr,
struct netlink_ext_ack *extack)
 {
@@ -1840,10 +1844,10 @@ static int offload_pedit_fields(struct pedit_headers 
*masks,
__be16 mask_be16;
void *action;
 
-   set_masks = [TCA_PEDIT_KEY_EX_CMD_SET];
-   add_masks = [TCA_PEDIT_KEY_EX_CMD_ADD];
-   set_vals = [TCA_PEDIT_KEY_EX_CMD_SET];
-   add_vals = [TCA_PEDIT_KEY_EX_CMD_ADD];
+   set_masks = [TCA_PEDIT_KEY_EX_CMD_SET].masks;
+   add_masks = [TCA_PEDIT_KEY_EX_CMD_ADD].masks;
+   set_vals = [TCA_PEDIT_KEY_EX_CMD_SET].vals;
+   add_vals = [TCA_PEDIT_KEY_EX_CMD_ADD].vals;
 
action_size = MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto);
action = parse_attr->mod_hdr_actions;
@@ -1939,12 +1943,14 @@ static int offload_pedit_fields(struct pedit_headers 
*masks,
 }
 
 static int alloc_mod_hdr_actions(struct mlx5e_priv *priv,
-const struct tc_action *a, int namespace,
+struct pedit_headers_action *hdrs,
+int namespace,
 struct mlx5e_tc_flow_parse_attr *parse_attr)
 {
int nkeys, action_size, max_actions;
 
-   nkeys = tcf_pedit_nkeys(a);
+   nkeys = hdrs[TCA_PEDIT_KEY_EX_CMD_SET].pedits +
+   hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].pedits;
action_size = MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto);
 
if (namespace == MLX5_FLOW_NAMESPACE_FDB) /* FDB offloading */
@@ -1968,18 +1974,15 @@ static const struct pedit_headers zero_masks = {};
 static int parse_tc_pedit_action(struct mlx5e_priv *priv,
 const struct tc_action *a, int namespace,
 struct mlx5e_tc_flow_parse_attr *parse_attr,
+struct pedit_headers_action *hdrs,
 struct netlink_ext_ack *extack)
 {
-   struct pedit_headers masks[__PEDIT_CMD_MAX], vals[__PEDIT_CMD_MAX], *cmd_masks;
int nkeys, i, err = -EOPNOTSUPP;
u32 mask, val, offset;
u8 cmd, htype;
 
nkeys = tcf_pedit_nkeys(a);
 
-   memset(masks, 0, sizeof(struct pedit_headers) * __PEDIT_CMD_MAX);
-   memset(vals,  0, sizeof(struct pedit_headers) * __PEDIT_CMD_MAX);
-
for (i = 0; i < nkeys; i++) {
htype = tcf_pedit_htype(a, i);
cmd = 

[PATCH net-next,v3 09/12] flow_dissector: add basic ethtool_rx_flow_spec to flow_rule structure translator

2018-11-20 Thread Pablo Neira Ayuso
This patch adds a function to translate the ethtool_rx_flow_spec
structure to the flow_rule representation.

This allows us to reuse code from the driver side given that both flower
and ethtool_rx_flow interfaces use the same representation.

Signed-off-by: Pablo Neira Ayuso 
---
v3: Suggested by Jiri Pirko:
- Add struct ethtool_rx_flow_rule, keep placeholder to private
  dissector information.
Reported by Manish Chopra:
- Fix incorrect dissector user_keys flags.

 include/linux/ethtool.h |  10 +++
 net/core/ethtool.c  | 189 
 2 files changed, 199 insertions(+)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index afd9596ce636..99849e0858b2 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -400,4 +400,14 @@ struct ethtool_ops {
void(*get_ethtool_phy_stats)(struct net_device *,
 struct ethtool_stats *, u64 *);
 };
+
+struct ethtool_rx_flow_rule {
+   struct flow_rule*rule;
+   unsigned long   priv[0];
+};
+
+struct ethtool_rx_flow_rule *
+ethtool_rx_flow_rule_alloc(const struct ethtool_rx_flow_spec *fs);
+void ethtool_rx_flow_rule_free(struct ethtool_rx_flow_rule *rule);
+
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index d05402868575..e679d6478371 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Some useful ethtool_ops methods that're device independent.
@@ -2808,3 +2809,191 @@ int dev_ethtool(struct net *net, struct ifreq *ifr)
 
return rc;
 }
+
+struct ethtool_rx_flow_key {
+   struct flow_dissector_key_basic basic;
+   union {
+   struct flow_dissector_key_ipv4_addrsipv4;
+   struct flow_dissector_key_ipv6_addrsipv6;
+   };
+   struct flow_dissector_key_ports tp;
+   struct flow_dissector_key_ipip;
+} __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. */
+
+struct ethtool_rx_flow_match {
+   struct flow_dissector   dissector;
+   struct ethtool_rx_flow_key  key;
+   struct ethtool_rx_flow_key  mask;
+};
+
+struct ethtool_rx_flow_rule *
+ethtool_rx_flow_rule_alloc(const struct ethtool_rx_flow_spec *fs)
+{
+   static struct in6_addr zero_addr = {};
+   struct ethtool_rx_flow_match *match;
+   struct ethtool_rx_flow_rule *flow;
+   struct flow_action_entry *act;
+
+   flow = kzalloc(sizeof(struct ethtool_rx_flow_rule) +
+  sizeof(struct ethtool_rx_flow_match), GFP_KERNEL);
+   if (!flow)
+   return NULL;
+
+   /* ethtool_rx supports only one single action per rule. */
+   flow->rule = flow_rule_alloc(1);
+   if (!flow->rule) {
+   kfree(flow);
+   return NULL;
+   }
+
+   match = (struct ethtool_rx_flow_match *)flow->priv;
+   flow->rule->match.dissector = &match->dissector;
+   flow->rule->match.mask  = &match->mask;
+   flow->rule->match.key   = &match->key;
+
+   match->mask.basic.n_proto = 0xffff;
+
+   switch (fs->flow_type & ~FLOW_EXT) {
+   case TCP_V4_FLOW:
+   case UDP_V4_FLOW: {
+   const struct ethtool_tcpip4_spec *v4_spec, *v4_m_spec;
+
+   match->key.basic.n_proto = htons(ETH_P_IP);
+
+   v4_spec = &fs->h_u.tcp_ip4_spec;
+   v4_m_spec = &fs->m_u.tcp_ip4_spec;
+
+   if (v4_m_spec->ip4src) {
+   match->key.ipv4.src = v4_spec->ip4src;
+   match->mask.ipv4.src = v4_m_spec->ip4src;
+   }
+   if (v4_m_spec->ip4dst) {
+   match->key.ipv4.dst = v4_spec->ip4dst;
+   match->mask.ipv4.dst = v4_m_spec->ip4dst;
+   }
+   if (v4_m_spec->ip4src ||
+   v4_m_spec->ip4dst) {
+   match->dissector.used_keys |=
+   (1 << FLOW_DISSECTOR_KEY_IPV4_ADDRS);
+   match->dissector.offset[FLOW_DISSECTOR_KEY_IPV4_ADDRS] =
+   offsetof(struct ethtool_rx_flow_key, ipv4);
+   }
+   if (v4_m_spec->psrc) {
+   match->key.tp.src = v4_spec->psrc;
+   match->mask.tp.src = v4_m_spec->psrc;
+   }
+   if (v4_m_spec->pdst) {
+   match->key.tp.dst = v4_spec->pdst;
+   match->mask.tp.dst = v4_m_spec->pdst;
+   }
+   if (v4_m_spec->psrc ||
+   v4_m_spec->pdst) {
+   match->dissector.used_keys |=
+   (1 << FLOW_DISSECTOR_KEY_PORTS);
+   match->dissector.offset[FLOW_DISSECTOR_KEY_PORTS] =
+   

[PATCH net-next,v3 08/12] flow_dissector: add wake-up-on-lan and queue to flow_action

2018-11-20 Thread Pablo Neira Ayuso
These actions need to be added to support bcm sf2 features available
through the ethtool_rx_flow interface.

Reviewed-by: Florian Fainelli 
Signed-off-by: Pablo Neira Ayuso 
---
v3: no changes.

 include/net/flow_offload.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 8c1235fb6ed2..7205abd6cac6 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -116,6 +116,8 @@ enum flow_action_id {
FLOW_ACTION_ADD,
FLOW_ACTION_CSUM,
FLOW_ACTION_MARK,
+   FLOW_ACTION_WAKE,
+   FLOW_ACTION_QUEUE,
 };
 
 /* This is mirroring enum pedit_header_type definition for easy mapping between
@@ -150,6 +152,7 @@ struct flow_action_entry {
const struct ip_tunnel_info *tunnel;/* FLOW_ACTION_TUNNEL_ENCAP */
u32 csum_flags; /* FLOW_ACTION_CSUM */
u32 mark;   /* FLOW_ACTION_MARK */
+   u32 queue_index;/* FLOW_ACTION_QUEUE */
};
 };
 
-- 
2.11.0



[PATCH net-next,v3 04/12] cls_api: add translator to flow_action representation

2018-11-20 Thread Pablo Neira Ayuso
This patch implements a new function to translate from native TC action
to the new flow_action representation. Moreover, this patch also updates
cls_flower to use this new function.

Signed-off-by: Pablo Neira Ayuso 
---
v3: add tcf_exts_num_actions() and pass it to flow_rule_alloc() to calculate
the size of the array of actions.

 include/net/pkt_cls.h  |   5 +++
 net/sched/cls_api.c| 116 +
 net/sched/cls_flower.c |  21 +++--
 3 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index 359876ee32be..abb035f84321 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -620,6 +620,11 @@ tcf_match_indev(struct sk_buff *skb, int ifindex)
 }
 #endif /* CONFIG_NET_CLS_IND */
 
+unsigned int tcf_exts_num_actions(struct tcf_exts *exts);
+
+int tc_setup_flow_action(struct flow_action *flow_action,
+const struct tcf_exts *exts);
+
 int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts,
 enum tc_setup_type type, void *type_data, bool err_stop);
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index d92f44ac4c39..6f8b953dabc4 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -31,6 +31,14 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
 
@@ -2567,6 +2575,114 @@ int tc_setup_cb_call(struct tcf_block *block, struct tcf_exts *exts,
 }
 EXPORT_SYMBOL(tc_setup_cb_call);
 
+int tc_setup_flow_action(struct flow_action *flow_action,
+const struct tcf_exts *exts)
+{
+   const struct tc_action *act;
+   int i, j, k;
+
+   if (!exts)
+   return 0;
+
+   j = 0;
+   tcf_exts_for_each_action(i, act, exts) {
+   struct flow_action_entry *key;
+
+   key = &flow_action->entries[j];
+   if (is_tcf_gact_ok(act)) {
+   key->id = FLOW_ACTION_ACCEPT;
+   } else if (is_tcf_gact_shot(act)) {
+   key->id = FLOW_ACTION_DROP;
+   } else if (is_tcf_gact_trap(act)) {
+   key->id = FLOW_ACTION_TRAP;
+   } else if (is_tcf_gact_goto_chain(act)) {
+   key->id = FLOW_ACTION_GOTO;
+   key->chain_index = tcf_gact_goto_chain_index(act);
+   } else if (is_tcf_mirred_egress_redirect(act)) {
+   key->id = FLOW_ACTION_REDIRECT;
+   key->dev = tcf_mirred_dev(act);
+   } else if (is_tcf_mirred_egress_mirror(act)) {
+   key->id = FLOW_ACTION_MIRRED;
+   key->dev = tcf_mirred_dev(act);
+   } else if (is_tcf_vlan(act)) {
+   switch (tcf_vlan_action(act)) {
+   case TCA_VLAN_ACT_PUSH:
+   key->id = FLOW_ACTION_VLAN_PUSH;
+   key->vlan.vid = tcf_vlan_push_vid(act);
+   key->vlan.proto = tcf_vlan_push_proto(act);
+   key->vlan.prio = tcf_vlan_push_prio(act);
+   break;
+   case TCA_VLAN_ACT_POP:
+   key->id = FLOW_ACTION_VLAN_POP;
+   break;
+   case TCA_VLAN_ACT_MODIFY:
+   key->id = FLOW_ACTION_VLAN_MANGLE;
+   key->vlan.vid = tcf_vlan_push_vid(act);
+   key->vlan.proto = tcf_vlan_push_proto(act);
+   key->vlan.prio = tcf_vlan_push_prio(act);
+   break;
+   default:
+   goto err_out;
+   }
+   } else if (is_tcf_tunnel_set(act)) {
+   key->id = FLOW_ACTION_TUNNEL_ENCAP;
+   key->tunnel = tcf_tunnel_info(act);
+   } else if (is_tcf_tunnel_release(act)) {
+   key->id = FLOW_ACTION_TUNNEL_DECAP;
+   key->tunnel = tcf_tunnel_info(act);
+   } else if (is_tcf_pedit(act)) {
+   for (k = 0; k < tcf_pedit_nkeys(act); k++) {
+   switch (tcf_pedit_cmd(act, k)) {
+   case TCA_PEDIT_KEY_EX_CMD_SET:
+   key->id = FLOW_ACTION_MANGLE;
+   break;
+   case TCA_PEDIT_KEY_EX_CMD_ADD:
+   key->id = FLOW_ACTION_ADD;
+   break;
+   default:
+   goto err_out;
+   }
+  

[PATCH net-next,v3 10/12] dsa: bcm_sf2: use flow_rule infrastructure

2018-11-20 Thread Pablo Neira Ayuso
Update this driver to use the flow_rule infrastructure, hence we can use
the same code to populate hardware IR from ethtool_rx_flow and the
cls_flower interfaces.

Signed-off-by: Pablo Neira Ayuso 
---
v3: adapt it to use new ethtool_rx_flow_rule_alloc()

 drivers/net/dsa/bcm_sf2_cfp.c | 109 +++---
 1 file changed, 71 insertions(+), 38 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2_cfp.c b/drivers/net/dsa/bcm_sf2_cfp.c
index e14663ab6dbc..3bdc65fe8408 100644
--- a/drivers/net/dsa/bcm_sf2_cfp.c
+++ b/drivers/net/dsa/bcm_sf2_cfp.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "bcm_sf2.h"
 #include "bcm_sf2_regs.h"
@@ -257,7 +258,8 @@ static int bcm_sf2_cfp_act_pol_set(struct bcm_sf2_priv *priv,
 }
 
 static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv,
-  struct ethtool_tcpip4_spec *v4_spec,
+  struct flow_dissector_key_ipv4_addrs *addrs,
+  struct flow_dissector_key_ports *ports,
   unsigned int slice_num,
   bool mask)
 {
@@ -278,7 +280,7 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv,
 * UDF_n_A6 [23:8]
 * UDF_n_A5 [7:0]
 */
-   reg = be16_to_cpu(v4_spec->pdst) >> 8;
+   reg = be16_to_cpu(ports->dst) >> 8;
if (mask)
offset = CORE_CFP_MASK_PORT(3);
else
@@ -289,9 +291,9 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv,
 * UDF_n_A4 [23:8]
 * UDF_n_A3 [7:0]
 */
-   reg = (be16_to_cpu(v4_spec->pdst) & 0xff) << 24 |
- (u32)be16_to_cpu(v4_spec->psrc) << 8 |
- (be32_to_cpu(v4_spec->ip4dst) & 0xff00) >> 8;
+   reg = (be16_to_cpu(ports->dst) & 0xff) << 24 |
+ (u32)be16_to_cpu(ports->src) << 8 |
+ (be32_to_cpu(addrs->dst) & 0xff00) >> 8;
if (mask)
offset = CORE_CFP_MASK_PORT(2);
else
@@ -302,9 +304,9 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv,
 * UDF_n_A2 [23:8]
 * UDF_n_A1 [7:0]
 */
-   reg = (u32)(be32_to_cpu(v4_spec->ip4dst) & 0xff) << 24 |
- (u32)(be32_to_cpu(v4_spec->ip4dst) >> 16) << 8 |
- (be32_to_cpu(v4_spec->ip4src) & 0xff00) >> 8;
+   reg = (u32)(be32_to_cpu(addrs->dst) & 0xff) << 24 |
+ (u32)(be32_to_cpu(addrs->dst) >> 16) << 8 |
+ (be32_to_cpu(addrs->src) & 0xff00) >> 8;
if (mask)
offset = CORE_CFP_MASK_PORT(1);
else
@@ -317,8 +319,8 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv,
 * Slice ID [3:2]
 * Slice valid  [1:0]
 */
-   reg = (u32)(be32_to_cpu(v4_spec->ip4src) & 0xff) << 24 |
- (u32)(be32_to_cpu(v4_spec->ip4src) >> 16) << 8 |
+   reg = (u32)(be32_to_cpu(addrs->src) & 0xff) << 24 |
+ (u32)(be32_to_cpu(addrs->src) >> 16) << 8 |
  SLICE_NUM(slice_num) | SLICE_VALID;
if (mask)
offset = CORE_CFP_MASK_PORT(0);
@@ -332,9 +334,13 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv *priv, int port,
 unsigned int queue_num,
 struct ethtool_rx_flow_spec *fs)
 {
-   struct ethtool_tcpip4_spec *v4_spec, *v4_m_spec;
const struct cfp_udf_layout *layout;
unsigned int slice_num, rule_index;
+   struct ethtool_rx_flow_rule *flow;
+   struct flow_match_ipv4_addrs ipv4;
+   struct flow_match_ports ports;
+   struct flow_match_basic basic;
+   struct flow_match_ip ip;
u8 ip_proto, ip_frag;
u8 num_udf;
u32 reg;
@@ -343,13 +349,9 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv *priv, int port,
switch (fs->flow_type & ~FLOW_EXT) {
case TCP_V4_FLOW:
ip_proto = IPPROTO_TCP;
-   v4_spec = &fs->h_u.tcp_ip4_spec;
-   v4_m_spec = &fs->m_u.tcp_ip4_spec;
break;
case UDP_V4_FLOW:
ip_proto = IPPROTO_UDP;
-   v4_spec = &fs->h_u.udp_ip4_spec;
-   v4_m_spec = &fs->m_u.udp_ip4_spec;
break;
default:
return -EINVAL;
@@ -367,11 +369,22 @@ static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv *priv, int port,
if (rule_index > bcm_sf2_cfp_rule_size(priv))
return -ENOSPC;
 
+   flow = ethtool_rx_flow_rule_alloc(fs);
+   if (!flow)
+   return -ENOMEM;
+
+   flow_rule_match_ipv4_addrs(flow->rule, &ipv4);
+   flow_rule_match_ports(flow->rule, &ports);
+   flow_rule_match_basic(flow->rule, &basic);
+   flow_rule_match_ip(flow->rule, &ip);
+
layout = &udf_tcpip4_layout;
/* We only use one UDF slice for now 

[PATCH net-next,v3 12/12] qede: use ethtool_rx_flow_rule() to remove duplicated parser code

2018-11-20 Thread Pablo Neira Ayuso
The qede driver supports both ethtool_rx_flow_spec and flower; the two
codebases look very similar.

This patch uses the ethtool_rx_flow_rule() infrastructure to remove the
duplicated ethtool_rx_flow_spec parser and consolidate ACL offload
support around the flow_rule infrastructure.

Furthermore, more code can be consolidated by merging
qede_add_cls_rule() and qede_add_tc_flower_fltr(), these two functions
also look very similar.

This driver currently provides simple ACL support, such as 5-tuple
matching, drop policy and queue to CPU.

Drivers that support more features can benefit from this infrastructure
to save even more redundant codebase.

Signed-off-by: Pablo Neira Ayuso 
---
v3: Suggested by Jiri Pirko:
- Pass struct flow_rule *rule to all parser functions.

Moreover, do not remove qede_flow_spec_validate_unused().

 drivers/net/ethernet/qlogic/qede/qede_filter.c | 271 +++--
 1 file changed, 71 insertions(+), 200 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qede/qede_filter.c 
b/drivers/net/ethernet/qlogic/qede/qede_filter.c
index ed77950f6cf9..b7562fb86e52 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_filter.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_filter.c
@@ -1665,132 +1665,6 @@ static int qede_set_v6_tuple_to_profile(struct qede_dev *edev,
return 0;
 }
 
-static int qede_flow_spec_to_tuple_ipv4_common(struct qede_dev *edev,
-  struct qede_arfs_tuple *t,
-  struct ethtool_rx_flow_spec *fs)
-{
-   if ((fs->h_u.tcp_ip4_spec.ip4src &
-fs->m_u.tcp_ip4_spec.ip4src) != fs->h_u.tcp_ip4_spec.ip4src) {
-   DP_INFO(edev, "Don't support IP-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip4_spec.ip4dst &
-fs->m_u.tcp_ip4_spec.ip4dst) != fs->h_u.tcp_ip4_spec.ip4dst) {
-   DP_INFO(edev, "Don't support IP-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip4_spec.psrc &
-fs->m_u.tcp_ip4_spec.psrc) != fs->h_u.tcp_ip4_spec.psrc) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip4_spec.pdst &
-fs->m_u.tcp_ip4_spec.pdst) != fs->h_u.tcp_ip4_spec.pdst) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if (fs->h_u.tcp_ip4_spec.tos) {
-   DP_INFO(edev, "Don't support tos\n");
-   return -EOPNOTSUPP;
-   }
-
-   t->eth_proto = htons(ETH_P_IP);
-   t->src_ipv4 = fs->h_u.tcp_ip4_spec.ip4src;
-   t->dst_ipv4 = fs->h_u.tcp_ip4_spec.ip4dst;
-   t->src_port = fs->h_u.tcp_ip4_spec.psrc;
-   t->dst_port = fs->h_u.tcp_ip4_spec.pdst;
-
-   return qede_set_v4_tuple_to_profile(edev, t);
-}
-
-static int qede_flow_spec_to_tuple_tcpv4(struct qede_dev *edev,
-struct qede_arfs_tuple *t,
-struct ethtool_rx_flow_spec *fs)
-{
-   t->ip_proto = IPPROTO_TCP;
-
-   if (qede_flow_spec_to_tuple_ipv4_common(edev, t, fs))
-   return -EINVAL;
-
-   return 0;
-}
-
-static int qede_flow_spec_to_tuple_udpv4(struct qede_dev *edev,
-struct qede_arfs_tuple *t,
-struct ethtool_rx_flow_spec *fs)
-{
-   t->ip_proto = IPPROTO_UDP;
-
-   if (qede_flow_spec_to_tuple_ipv4_common(edev, t, fs))
-   return -EINVAL;
-
-   return 0;
-}
-
-static int qede_flow_spec_to_tuple_ipv6_common(struct qede_dev *edev,
-  struct qede_arfs_tuple *t,
-  struct ethtool_rx_flow_spec *fs)
-{
-   struct in6_addr zero_addr;
-
-   memset(&zero_addr, 0, sizeof(zero_addr));
-
-   if ((fs->h_u.tcp_ip6_spec.psrc &
-fs->m_u.tcp_ip6_spec.psrc) != fs->h_u.tcp_ip6_spec.psrc) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if ((fs->h_u.tcp_ip6_spec.pdst &
-fs->m_u.tcp_ip6_spec.pdst) != fs->h_u.tcp_ip6_spec.pdst) {
-   DP_INFO(edev, "Don't support port-masks\n");
-   return -EOPNOTSUPP;
-   }
-
-   if (fs->h_u.tcp_ip6_spec.tclass) {
-   DP_INFO(edev, "Don't support tclass\n");
-   return -EOPNOTSUPP;
-   }
-
-   t->eth_proto = htons(ETH_P_IPV6);
-   memcpy(&t->src_ipv6, &fs->h_u.tcp_ip6_spec.ip6src,
-  sizeof(struct in6_addr));
-   memcpy(&t->dst_ipv6, &fs->h_u.tcp_ip6_spec.ip6dst,
-  sizeof(struct in6_addr));
-   t->src_port = fs->h_u.tcp_ip6_spec.psrc;
-   t->dst_port = fs->h_u.tcp_ip6_spec.pdst;
-
-   return qede_set_v6_tuple_to_profile(edev, t, &zero_addr);
-}
-
-static int qede_flow_spec_to_tuple_tcpv6(struct qede_dev 

[PATCH net-next,v3 03/12] flow_dissector: add flow action infrastructure

2018-11-20 Thread Pablo Neira Ayuso
This new infrastructure defines the nic actions that you can perform
from existing network drivers. This infrastructure allows us to avoid a
direct dependency with the native software TC action representation.

Signed-off-by: Pablo Neira Ayuso 
---
v3: Suggested by Jiri Pirko:
- Remove _key postfix and _KEY_ infix in flow_action definitions.
- Use enum flow_action_mangle_base for consistency.
- Rename key field to entries and num_keys to num_entries.
- Rename struct flow_action_key to flow_action_entry.
- Use placeholder in struct flow_action to store array of actions
  from flow_rule_alloc().

 include/net/flow_offload.h | 69 +-
 net/core/flow_offload.c| 14 --
 2 files changed, 80 insertions(+), 3 deletions(-)

diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
index 461c66595763..48aa47ba5561 100644
--- a/include/net/flow_offload.h
+++ b/include/net/flow_offload.h
@@ -100,11 +100,78 @@ void flow_rule_match_enc_keyid(const struct flow_rule *rule,
 void flow_rule_match_enc_opts(const struct flow_rule *rule,
  struct flow_match_enc_opts *out);
 
+enum flow_action_id {
+   FLOW_ACTION_ACCEPT  = 0,
+   FLOW_ACTION_DROP,
+   FLOW_ACTION_TRAP,
+   FLOW_ACTION_GOTO,
+   FLOW_ACTION_REDIRECT,
+   FLOW_ACTION_MIRRED,
+   FLOW_ACTION_VLAN_PUSH,
+   FLOW_ACTION_VLAN_POP,
+   FLOW_ACTION_VLAN_MANGLE,
+   FLOW_ACTION_TUNNEL_ENCAP,
+   FLOW_ACTION_TUNNEL_DECAP,
+   FLOW_ACTION_MANGLE,
+   FLOW_ACTION_ADD,
+   FLOW_ACTION_CSUM,
+   FLOW_ACTION_MARK,
+};
+
+/* This is mirroring enum pedit_header_type definition for easy mapping between
+ * tc pedit action. Legacy TCA_PEDIT_KEY_EX_HDR_TYPE_NETWORK is mapped to
+ * FLOW_ACT_MANGLE_UNSPEC, which is supported by no driver.
+ */
+enum flow_action_mangle_base {
+   FLOW_ACT_MANGLE_UNSPEC  = 0,
+   FLOW_ACT_MANGLE_HDR_TYPE_ETH,
+   FLOW_ACT_MANGLE_HDR_TYPE_IP4,
+   FLOW_ACT_MANGLE_HDR_TYPE_IP6,
+   FLOW_ACT_MANGLE_HDR_TYPE_TCP,
+   FLOW_ACT_MANGLE_HDR_TYPE_UDP,
+};
+
+struct flow_action_entry {
+   enum flow_action_id id;
+   union {
+   u32 chain_index;/* FLOW_ACTION_GOTO */
+   struct net_device   *dev;   /* FLOW_ACTION_REDIRECT */
+   struct {/* FLOW_ACTION_VLAN */
+   u16 vid;
+   __be16  proto;
+   u8  prio;
+   } vlan;
+   struct {/* FLOW_ACTION_PACKET_EDIT */
+   enum flow_action_mangle_base htype;
+   u32 offset;
+   u32 mask;
+   u32 val;
+   } mangle;
+   const struct ip_tunnel_info *tunnel;/* FLOW_ACTION_TUNNEL_ENCAP */
+   u32 csum_flags; /* FLOW_ACTION_CSUM */
+   u32 mark;   /* FLOW_ACTION_MARK */
+   };
+};
+
+struct flow_action {
+   unsigned intnum_entries;
+   struct flow_action_entryentries[0];
+};
+
+static inline bool flow_action_has_entries(const struct flow_action *action)
+{
+   return action->num_entries;
+}
+
+#define flow_action_for_each(__i, __act, __actions)\
+for (__i = 0, __act = &(__actions)->entries[0]; __i < (__actions)->num_entries; __act = &(__actions)->entries[++__i])
+
 struct flow_rule {
struct flow_match   match;
+   struct flow_action  action;
 };
 
-struct flow_rule *flow_rule_alloc(void);
+struct flow_rule *flow_rule_alloc(unsigned int num_actions);
 
 static inline bool flow_rule_match_key(const struct flow_rule *rule,
   enum flow_dissector_key_id key)
diff --git a/net/core/flow_offload.c b/net/core/flow_offload.c
index 2fbf6903d2f6..c3a00eac4804 100644
--- a/net/core/flow_offload.c
+++ b/net/core/flow_offload.c
@@ -3,9 +3,19 @@
 #include 
 #include 
 
-struct flow_rule *flow_rule_alloc(void)
+struct flow_rule *flow_rule_alloc(unsigned int num_actions)
 {
-   return kzalloc(sizeof(struct flow_rule), GFP_KERNEL);
+   struct flow_rule *rule;
+
+   rule = kzalloc(sizeof(struct flow_rule) +
+  sizeof(struct flow_action_entry) * num_actions,
+  GFP_KERNEL);
+   if (!rule)
+   return NULL;
+
+   rule->action.num_entries = num_actions;
+
+   return rule;
 }
 EXPORT_SYMBOL(flow_rule_alloc);
 
-- 
2.11.0



[PATCH net-next,v3 00/12] add flow_rule infrastructure

2018-11-20 Thread Pablo Neira Ayuso
Hi,

This patchset is the third iteration [1] [2] [3] to introduce a kernel
intermediate (IR) to express ACL hardware offloads.

This round addresses feedback from Jiri Pirko:

* Add net/core/flow_offload.c and include/net/flow_offload.h.
* Add flow_rule_alloc() helper function.
* Remove _key postfix and _KEY_ infix in flow_action definitions.
* Use enum flow_action_mangle_base for consistency.
* Rename key field to entries and num_keys to num_entries.
* Rename struct flow_action_key to flow_action_entry.
* Use placeholder in struct flow_action to store array of actions
  from flow_rule_alloc().
* Add tcf_exts_num_actions() and pass it to flow_rule_alloc() to
  calculate the size of the array of actions.
* Rename to struct flow_stats and to function flow_stats_update().
* Add struct ethtool_rx_flow_rule, keep placeholder to private
  dissector information.
* Pass struct flow_rule *rule to all parser functions in qlogic/qede
  driver.

This also fixes a bug reported by Manish Chopra, in the ethtool_rx_spec
to flow_rule translator.

Making all these changes has been an exercise in reviewing the existing
infrastructure, understanding what has been done and proposing
improvements to the _great work_ that core driver developers have done
so far to introduce HW offloads through the existing frontend APIs.  I
still have more feedback and technical ideas that I'm very much looking
forward to discuss with them in the future.

Main goal of this patchset is to avoid code duplication for driver
developers. There are no netfilter changes coming in this batch.
I would like to explore Netfilter hardware offloads in the future.

Thanks a lot for reviewing!

[1] https://lwn.net/Articles/766695/
[2] https://marc.info/?l=linux-netdev=154233253114506=2
[3] https://marc.info/?l=linux-netdev=154258780717036=2

Pablo Neira Ayuso (12):
  flow_dissector: add flow_rule and flow_match structures and use them
  net/mlx5e: support for two independent packet edit actions
  flow_dissector: add flow action infrastructure
  cls_api: add translator to flow_action representation
  cls_flower: add statistics retrieval infrastructure and use it
  drivers: net: use flow action infrastructure
  cls_flower: don't expose TC actions to drivers anymore
  flow_dissector: add wake-up-on-lan and queue to flow_action
  flow_dissector: add basic ethtool_rx_flow_spec to flow_rule structure 
translator
  dsa: bcm_sf2: use flow_rule infrastructure
  qede: place ethtool_rx_flow_spec after code after TC flower codebase
  qede: use ethtool_rx_flow_rule() to remove duplicated parser code

 drivers/net/dsa/bcm_sf2_cfp.c  | 109 +--
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   | 252 +++
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 450 ++---
 drivers/net/ethernet/intel/i40e/i40e_main.c| 178 ++---
 drivers/net/ethernet/intel/iavf/iavf_main.c| 195 +++---
 drivers/net/ethernet/intel/igb/igb_main.c  |  64 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 743 ++---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |   2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 258 ---
 drivers/net/ethernet/netronome/nfp/flower/action.c | 196 +++---
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 417 ++--
 .../net/ethernet/netronome/nfp/flower/offload.c| 150 ++---
 drivers/net/ethernet/qlogic/qede/qede_filter.c | 560 ++--
 include/linux/ethtool.h|  10 +
 include/net/flow_offload.h | 199 ++
 include/net/pkt_cls.h  |  18 +-
 net/core/Makefile  |   2 +-
 net/core/ethtool.c | 189 ++
 net/core/flow_offload.c| 153 +
 net/sched/cls_api.c| 116 
 net/sched/cls_flower.c |  69 +-
 21 files changed, 2339 insertions(+), 1991 deletions(-)
 create mode 100644 include/net/flow_offload.h
 create mode 100644 net/core/flow_offload.c

-- 
2.11.0



Re: [PATCH bpf-next] bpf: libbpf: retry program creation without the name

2018-11-20 Thread Quentin Monnet
2018-11-20 15:26 UTC-0800 ~ Stanislav Fomichev 
> On 11/20, Alexei Starovoitov wrote:
>> On Wed, Nov 21, 2018 at 12:18:57AM +0100, Daniel Borkmann wrote:
>>> On 11/21/2018 12:04 AM, Alexei Starovoitov wrote:
 On Tue, Nov 20, 2018 at 01:19:05PM -0800, Stanislav Fomichev wrote:
> On 11/20, Alexei Starovoitov wrote:
>> On Mon, Nov 19, 2018 at 04:46:25PM -0800, Stanislav Fomichev wrote:
>>> [Recent commit 23499442c319 ("bpf: libbpf: retry map creation without
>>> the name") fixed this issue for maps, let's do the same for programs.]
>>>
>>> Since commit 88cda1c9da02 ("bpf: libbpf: Provide basic API support
>>> to specify BPF obj name"), libbpf unconditionally sets bpf_attr->name
>>> for programs. Pre v4.14 kernels don't know about programs names and
>>> return an error about unexpected non-zero data. Retry sys_bpf without
>>> a program name to cover older kernels.
>>>
>>> Signed-off-by: Stanislav Fomichev 
>>> ---
>>>  tools/lib/bpf/bpf.c | 10 ++
>>>  1 file changed, 10 insertions(+)
>>>
>>> diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>>> index 961e1b9fc592..cbe9d757c646 100644
>>> --- a/tools/lib/bpf/bpf.c
>>> +++ b/tools/lib/bpf/bpf.c
>>> @@ -212,6 +212,16 @@ int bpf_load_program_xattr(const struct 
>>> bpf_load_program_attr *load_attr,
>>> if (fd >= 0 || !log_buf || !log_buf_sz)
>>> return fd;
>>>  
>>> +   if (fd < 0 && errno == E2BIG && load_attr->name) {
>>> +   /* Retry the same syscall, but without the name.
>>> +* Pre v4.14 kernels don't support prog names.
>>> +*/
>>
>> I'm afraid that will put unnecessary stress on the kernel.
>> This check needs to be tighter.
>> Like E2BIG and anything in the log_buf probably means that
>> E2BIG came from the verifier and nothing to do with prog_name.
>> Asking kernel to repeat is an unnecessary work.
>>
>> In general we need to think beyond this single prog_name field.
>> There are bunch of other fields in bpf_load_program_xattr() and older 
>> kernels
>> won't support them. Are we going to zero them out one by one
>> and retry? I don't think that would be practical.
> I general, we don't want to zero anything out. However,
> for this particular problem the rationale is the following:
> In commit 88cda1c9da02 we started unconditionally setting {prog,map}->name
> from the 'higher' libbpfc layer which breaks users on the older kernels.
>
>> Also libbpf silently ignoring prog_name is not great for debugging.
>> A warning is needed.
>> But it cannot be done out of lib/bpf/bpf.c, since it's a set of syscall
>> wrappers.
>> Imo such "old kernel -> lets retry" feature should probably be done
>> at lib/bpf/libbpf.c level. inside load_program().
> For maps bpftools calls bpf_create_map_xattr directly, that's why
> for maps I did the retry on the lower level (and why for programs I 
> initially
> thought about doing the same). However, in this case maybe asking
> user to omit 'name' argument might be a better option.
>
> For program names, I agree, we might think about doing it on the higher
> level (although I'm not sure whether we want to have different API
> expectations, i.e. bpf_create_map_xattr ignoring the name and
> bpf_load_program_xattr not ignoring the name).
>
> So given that rationale above, what do you think is the best way to
> move forward?
> 1. Same patch, but tighten the retry check inside bpf_load_program_xattr ?
> 2. Move this retry logic into load_program and have different handling
>for bpf_create_map_xattr vs bpf_load_program_xattr ?
> 3. Do 2 and move the retry check for maps from bpf_create_map_xattr
>into bpf_object__create_maps ?
>
> (I'm slightly leaning towards #3)

 me too. I think it's cleaner for maps to do it in
 bpf_object__create_maps().
 Originally bpf.c was envisioned to be a thin layer on top of bpf syscall.
 Whereas 'smart bits' would go into libbpf.c
>>>
>>> Can't we create in bpf_object__load() a small helper 
>>> bpf_object__probe_caps()
>>> which would figure this out _once_ upon start with a few things to probe for
>>> availability in the underlying kernel for maps and programs? E.g. programs
>>> it could try to inject a tiny 'r0 = 0; exit' snippet where we figure out
>>> things like prog name support etc. Given underlying kernel doesn't change, 
>>> we
>>> would only try this once and it doesn't require fallback every time.
>>
>> +1. great idea!
> Sounds good, let me try to do it.
> 
> It sounds more like a recent LPC proposal/idea to have some sys_bpf option
> to query BPF features. This new bpf_object__probe_caps can probably query
> that in the future if we eventually add support for it.
> 

Hi,

LPC proposal indeed. I've 

Re: [PATCH bpf-next v2] libbpf: make sure bpf headers are c++ include-able

2018-11-20 Thread Alexei Starovoitov
On Tue, Nov 20, 2018 at 05:59:52PM -0800, Stanislav Fomichev wrote:
> On 11/20, Alexei Starovoitov wrote:
> > On Tue, Nov 20, 2018 at 04:05:55PM -0800, Stanislav Fomichev wrote:
> > > On 11/20, Alexei Starovoitov wrote:
> > > > On Tue, Nov 20, 2018 at 01:37:23PM -0800, Stanislav Fomichev wrote:
> > > > > Wrap headers in extern "C", to turn off C++ mangling.
> > > > > This simplifies including libbpf in c++ and linking against it.
> > > > > 
> > > > > v2 changes:
> > > > > * do the same for btf.h
> > > > > 
> > > > > Signed-off-by: Stanislav Fomichev 
> > > > > ---
> > > > >  tools/lib/bpf/bpf.h| 9 +
> > > > >  tools/lib/bpf/btf.h| 8 
> > > > >  tools/lib/bpf/libbpf.h | 9 +
> > > > >  3 files changed, 26 insertions(+)
> > > > > 
> > > > > diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> > > > > index 26a51538213c..9ea3aec82d8a 100644
> > > > > --- a/tools/lib/bpf/bpf.h
> > > > > +++ b/tools/lib/bpf/bpf.h
> > > > > @@ -27,6 +27,10 @@
> > > > >  #include 
> > > > >  #include 
> > > > >  
> > > > > +#ifdef __cplusplus
> > > > > +extern "C" {
> > > > > +#endif
> > > > 
> > > > Acked-by: Alexei Starovoitov 
> > > > 
> > > > was wondering whether it's possible to make it testable.
> > > > HOSTCXX is available, but I don't see much of the kernel tree
> > > > using it...
> > > By testable you mean compile some dummy c++ main and link against libbpf?
> > 
> > yes. something like this.
> > to make sure that it keeps being functional and no one introduces 'int new'
> > in some function argument list by accident.
> I tried something like the patch below, it does seem to work locally (building
> in the same directory, no cross-compile). Who would be the best to
> review that kind of stuff?
> 
> diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
> index 425b480bda75..4c0e58628aad 100644
> --- a/tools/lib/bpf/Makefile
> +++ b/tools/lib/bpf/Makefile
> @@ -66,7 +66,7 @@ ifndef VERBOSE
>  endif
>  
>  FEATURE_USER = .libbpf
> -FEATURE_TESTS = libelf libelf-mmap bpf reallocarray
> +FEATURE_TESTS = libelf libelf-mmap bpf reallocarray cxx
>  FEATURE_DISPLAY = libelf bpf
>  
>  INCLUDES = -I. -I$(srctree)/tools/include 
> -I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi
> @@ -148,6 +148,10 @@ LIB_FILE := $(addprefix $(OUTPUT),$(LIB_FILE))
>  
>  CMD_TARGETS = $(LIB_FILE)
>  
> +ifeq ($(feature-cxx), 1)
> + CMD_TARGETS += $(OUTPUT)test_libbpf
> +endif
> +
>  TARGETS = $(CMD_TARGETS)
>  
>  all: fixdep all_cmd
> @@ -175,6 +179,9 @@ $(OUTPUT)libbpf.so: $(BPF_IN)
>  $(OUTPUT)libbpf.a: $(BPF_IN)
>   $(QUIET_LINK)$(RM) $@; $(AR) rcs $@ $^
>  
> +$(OUTPUT)test_libbpf: test_libbpf.cpp $(OUTPUT)libbpf.a
> + $(QUIET_LINK)$(CXX) $^ -lelf -o $@

looks good to me.
pls include test_libbpf.cpp and resubmit for bpf-next.



Re: [PATCH bpf-next] bpf: add read/write access to skb->tstamp from tc clsact progs

2018-11-20 Thread Alexei Starovoitov
On Tue, Nov 20, 2018 at 07:18:48PM -0500, Vlad Dumitrescu wrote:
> This could be used to rate limit egress traffic in concert with a qdisc
> which supports Earliest Departure Time, such as FQ.
> 
> Signed-off-by: Vlad Dumitrescu 
> ---
>  include/uapi/linux/bpf.h|  1 +
>  net/core/filter.c   | 26 +
>  tools/include/uapi/linux/bpf.h  |  1 +
>  tools/testing/selftests/bpf/test_verifier.c |  4 
>  4 files changed, 32 insertions(+)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index c1554aa074659..23e2031a43d43 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2468,6 +2468,7 @@ struct __sk_buff {
>  
>   __u32 data_meta;
>   struct bpf_flow_keys *flow_keys;
> + __u64 tstamp;
>  };
>  
>  struct bpf_tunnel_key {
> diff --git a/net/core/filter.c b/net/core/filter.c
> index f6ca38a7d4332..c45155c8e519c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -5573,6 +5573,10 @@ static bool bpf_skb_is_valid_access(int off, int size, 
> enum bpf_access_type type
>   if (size != sizeof(struct bpf_flow_keys *))
>   return false;
>   break;
> + case bpf_ctx_range(struct __sk_buff, tstamp):
> + if (size != sizeof(__u64))
> + return false;
> + break;
>   default:
>   /* Only narrow read access allowed for now. */
>   if (type == BPF_WRITE) {
> @@ -5600,6 +5604,7 @@ static bool sk_filter_is_valid_access(int off, int size,
>   case bpf_ctx_range(struct __sk_buff, data_end):
>   case bpf_ctx_range(struct __sk_buff, flow_keys):
>   case bpf_ctx_range_till(struct __sk_buff, family, local_port):
> + case bpf_ctx_range(struct __sk_buff, tstamp):
>   return false;
>   }
>  
> @@ -5624,6 +5629,7 @@ static bool cg_skb_is_valid_access(int off, int size,
>   case bpf_ctx_range(struct __sk_buff, tc_classid):
>   case bpf_ctx_range(struct __sk_buff, data_meta):
>   case bpf_ctx_range(struct __sk_buff, flow_keys):
> + case bpf_ctx_range(struct __sk_buff, tstamp):
>   return false;

looks good to me.

Any particular reason you decided to disable it for cg_skb ?
It seems to me the same EDT approach will work from
cgroup-bpf skb hooks just as well, and then we can have a neat
way of controlling traffic per-container instead of tc-clsbpf global.
If you're already on cgroup v2 it will save you a lot of classifier
cycles, since you'd be able to group apps by cgroup
instead of relying on ip only.



RE: [PATCH v1 net] lan743x: fix return value for lan743x_tx_napi_poll

2018-11-20 Thread Tristram.Ha
Slightly off topic: I am not sure why NAPI is used on the transmit side.
Originally NAPI was designed to fix the problem of a receive interrupt
firing for each received frame, so on the transmit side is it meant to
avoid a transmit-done interrupt for each transmitted frame?  Typically
hardware has a way to enable or suppress the transmit-done interrupt
per transmit frame.

NAPI may have other uses in newer kernels that I am not aware of.

I notice 2 problems in the driver:

1. netif_napi_add is used instead of netif_tx_napi_add.
2. Of the other drivers that use netif_tx_napi_add, most do not call
napi_complete_done; they call napi_complete directly and return 0.

freescale/gianfar.c
rocker/rocker_main.c
ti/cpsw.c

virtio_net.c does use napi_complete_done but it also passes 0 as a parameter.


> -Original Message-
> From: Florian Fainelli 
> Sent: Tuesday, November 20, 2018 2:12 PM
> To: Bryan Whitehead - C21958 ;
> and...@lunn.ch
> Cc: da...@davemloft.net; netdev@vger.kernel.org; UNGLinuxDriver
> 
> Subject: Re: [PATCH v1 net] lan743x: fix return value for
> lan743x_tx_napi_poll
> 
> On 11/20/18 1:39 PM, bryan.whiteh...@microchip.com wrote:
> >> -Original Message-
> >> From: Andrew Lunn 
> >> Sent: Tuesday, November 20, 2018 2:31 PM
> >> To: Bryan Whitehead - C21958 
> >> Cc: da...@davemloft.net; netdev@vger.kernel.org; UNGLinuxDriver
> >> 
> >> Subject: Re: [PATCH v1 net] lan743x: fix return value for
> >> lan743x_tx_napi_poll
> >>
> >> On Tue, Nov 20, 2018 at 01:26:43PM -0500, Bryan Whitehead wrote:
> >>> It has been noticed that under stress the lan743x driver will
> >>> sometimes hang or cause a kernel panic. It has been noticed that
> >>> returning '0' instead of 'weight' fixes this issue.
> >>>
> >>> fixes: rare kernel panic under heavy traffic load.
> >>> Signed-off-by: Bryan Whitehead 
> >>
> >> Hi Bryan
> >>
> >> This sounds like a band aid over something which is broken, not a real fix.
> >>
> >> Can you show us the stack trace from the panic?
> >>
> >> Andrew
> >
> > Andrew,
> >
> > Admittedly, my knowledge of what the kernel is doing behind the scenes is
> > limited.
> >
> > But according to documentation found on
> > https://wiki.linuxfoundation.org/networking/napi
> >
> > It states the following
> > "The poll() function may also process TX completions, in which case if it
> processes
> > the entire TX ring then it should count that work as the rest of the budget.
> > Otherwise, TX completions are not counted."
> >
> > So based on that, the original driver was returning the full budget. But I
> > was having issues with it. And the above documentation seems to suggest
> > that I could return 0, as in "not counted" from above.
> >
> > I tried it, and my lock up issues disappeared.
> >
> > Regarding the kernel panic stack trace: so far it's very hard to replicate
> > that on the latest kernel. I've seen it more frequently when back-porting
> > to older kernels such as 4.14 and 4.9. This same fix caused those kernel
> > panics to disappear.
> > Are you interested in seeing a stack dump from older kernels?
> >
> > In the latest kernel the issue manifests as a kernel message which states
> > "[  945.021101] enp48s0: Budget exhausted after napi rescheduled"
> >
> > I'm not sure what that means. But it does not lock up immediately after
> > seeing that message. It usually locks up within a minute of seeing it.
> >
> > And the sometimes I get the following warning
> > [ 1240.425020] [ cut here ]
> > [ 1240.426014] NETDEV WATCHDOG: enp0s25 (e1000e): transmit queue 0
> timed out
> > [ 1240.430027] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461
> dev_watchdog+0x1ef/0x200
> > [ 1240.430027] Modules linked in: lan743x
> > [ 1240.430027] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G  I   
> > 4.19.2 #1
> > [ 1240.430027] Hardware name: Hewlett-Packard HP Compaq dc7900
> Convertible Minitower/3032h, BIOS 786G1 v01.16 03/05/2009
> > [ 1240.430027] RIP: 0010:dev_watchdog+0x1ef/0x200
> > [ 1240.430027] Code: 00 48 63 4d e0 eb 93 4c 89 e7 c6 05 68 30 b3 00 01 e8 
> > 25
> 3d fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 98 92 48 ab e8 f1 28 87 ff <0f> 0b 
> eb c0
> 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 c7 47 08 00
> > [ 1240.430027] RSP: 0018:98490be03e90 EFLAGS: 00010282
> > [ 1240.430027] RAX:  RBX:  RCX:
> 
> > [ 1240.497168] RDX: 00040400 RSI: 00f6 RDI:
> 0300
> > [ 1240.497168] RBP: 984908574440 R08:  R09:
> 03a4
> > [ 1240.497168] R10: 0020 R11: abc928ed R12:
> 984908574000
> > [ 1240.497168] R13:  R14:  R15:
> 98490be195b0
> > [ 1240.497168] FS:  () GS:98490be0()
> knlGS:
> > [ 1240.497168] CS:  0010 DS:  ES:  CR0: 80050033
> > [ 1240.497168] CR2: 7f31cd4c CR3: 000109bca000 CR4:
> 

[Patch net-next 2/2] net: dump whole skb data in netdev_rx_csum_fault()

2018-11-20 Thread Cong Wang
Currently, we only dump a few selected skb fields in
netdev_rx_csum_fault(). It is not sufficient for debugging checksum
fault. This patch introduces skb_dump() which dumps skb mac header,
network header and its whole skb->data too.

Cc: Herbert Xu 
Cc: Eric Dumazet 
Cc: David Miller 
Signed-off-by: Cong Wang 
---
 include/linux/skbuff.h |  5 +
 net/core/dev.c |  6 +-
 net/core/skbuff.c  | 49 ++
 3 files changed, 55 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index afddb5c17ce5..844c0a7ff52f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4218,5 +4218,10 @@ static inline __wsum lco_csum(struct sk_buff *skb)
return csum_partial(l4_hdr, csum_start - l4_hdr, partial);
 }
 
+#ifdef CONFIG_BUG
+void skb_dump(const char *level, const struct sk_buff *skb, bool dump_header,
+ bool dump_mac_header, bool dump_network_header);
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_SKBUFF_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index f2bfd2eda7b2..dc54c89fb4b1 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3097,11 +3097,7 @@ void netdev_rx_csum_fault(struct net_device *dev, struct 
sk_buff *skb)
pr_err("%s: hw csum failure\n", dev ? dev->name : "");
if (dev)
pr_err("dev features: %pNF\n", &dev->features);
-   pr_err("skb len=%u data_len=%u pkt_type=%u gso_size=%u 
gso_type=%u nr_frags=%u ip_summed=%u csum=%x csum_complete_sw=%d csum_valid=%d 
csum_level=%u\n",
-  skb->len, skb->data_len, skb->pkt_type,
-  skb_shinfo(skb)->gso_size, skb_shinfo(skb)->gso_type,
-  skb_shinfo(skb)->nr_frags, skb->ip_summed, skb->csum,
-  skb->csum_complete_sw, skb->csum_valid, skb->csum_level);
+   skb_dump(KERN_ERR, skb, true, true, true);
dump_stack();
}
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b6ba923e7dc7..21aaef3f6a4a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -,3 +,52 @@ void skb_condense(struct sk_buff *skb)
 */
skb->truesize = SKB_TRUESIZE(skb_end_offset(skb));
 }
+
+#ifdef CONFIG_BUG
+void skb_dump(const char *level, const struct sk_buff *skb, bool dump_header,
+ bool dump_mac_header, bool dump_network_header)
+{
+   struct sk_buff *frag_iter;
+   int i;
+
+   if (dump_header)
+   printk("%sskb len=%u data_len=%u pkt_type=%u gso_size=%u 
gso_type=%u nr_frags=%u ip_summed=%u csum=%x csum_complete_sw=%d csum_valid=%d 
csum_level=%u\n",
+  level, skb->len, skb->data_len, skb->pkt_type,
+  skb_shinfo(skb)->gso_size, skb_shinfo(skb)->gso_type,
+  skb_shinfo(skb)->nr_frags, skb->ip_summed, skb->csum,
+  skb->csum_complete_sw, skb->csum_valid, skb->csum_level);
+
+   if (dump_mac_header && skb_mac_header_was_set(skb))
+   print_hex_dump(level, "mac header: ", DUMP_PREFIX_OFFSET, 16, 1,
+  skb_mac_header(skb), skb_mac_header_len(skb),
+  false);
+
+   if (dump_network_header && skb_network_header_was_set(skb))
+   print_hex_dump(level, "network header: ", DUMP_PREFIX_OFFSET,
+  16, 1, skb_network_header(skb),
+  skb_network_header_len(skb), false);
+
+   print_hex_dump(level, "skb data: ", DUMP_PREFIX_OFFSET, 16, 1,
+  skb->data, skb->len, false);
+
+   for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+   skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+   u32 p_off, p_len, copied;
+   struct page *p;
+   u8 *vaddr;
+
+   skb_frag_foreach_page(frag, frag->page_offset, 
skb_frag_size(frag),
+ p, p_off, p_len, copied) {
+   vaddr = kmap_atomic(p);
+   print_hex_dump(level, "skb frag: ", DUMP_PREFIX_OFFSET,
+  16, 1, vaddr + p_off, p_len, false);
+   kunmap_atomic(vaddr);
+   }
+   }
+
+   if (skb_has_frag_list(skb))
+   printk("%sskb frags list:\n", level);
+   skb_walk_frags(skb, frag_iter)
+   skb_dump(level, frag_iter, false, false, false);
+}
+#endif
-- 
2.19.1



[Patch net-next 1/2] net: introduce skb_network_header_was_set()

2018-11-20 Thread Cong Wang
Signed-off-by: Cong Wang 
---
 include/linux/skbuff.h | 5 +
 net/core/skbuff.c  | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a2e8297a5b00..afddb5c17ce5 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2444,6 +2444,11 @@ static inline u32 skb_network_header_len(const struct 
sk_buff *skb)
return skb->transport_header - skb->network_header;
 }
 
+static inline int skb_network_header_was_set(const struct sk_buff *skb)
+{
+   return skb->network_header != (typeof(skb->network_header))~0U;
+}
+
 static inline u32 skb_inner_network_header_len(const struct sk_buff *skb)
 {
return skb->inner_transport_header - skb->inner_network_header;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9a8a72cefe9b..b6ba923e7dc7 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -227,6 +227,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
gfp_mask,
skb_reset_tail_pointer(skb);
skb->end = skb->tail + size;
skb->mac_header = (typeof(skb->mac_header))~0U;
+   skb->network_header = (typeof(skb->network_header))~0U;
skb->transport_header = (typeof(skb->transport_header))~0U;
 
/* make sure we initialize shinfo sequentially */
@@ -292,6 +293,7 @@ struct sk_buff *__build_skb(void *data, unsigned int 
frag_size)
skb_reset_tail_pointer(skb);
skb->end = skb->tail + size;
skb->mac_header = (typeof(skb->mac_header))~0U;
+   skb->network_header = (typeof(skb->network_header))~0U;
skb->transport_header = (typeof(skb->transport_header))~0U;
 
/* make sure we initialize shinfo sequentially */
-- 
2.19.1



Re: [PATCH v4 bpf-next 1/2] bpf: adding support for map in map in libbpf

2018-11-20 Thread Alexei Starovoitov
On Tue, Nov 20, 2018 at 05:33:43PM -0800, Nikita V. Shirokov wrote:
> idea is pretty simple. for specified map (pointed by struct bpf_map)
> we would provide descriptor of already loaded map, which is going to be
> used as a prototype for inner map. proposed workflow:
> 1) open bpf's object (bpf_object__open)
> 2) create bpf's map which is going to be used as a prototype
> 3) find (by name) map-in-map which you want to load and update w/
> descriptor of inner map w/ a new helper from this patch
> 4) load bpf program w/ bpf_object__load
> 
> Signed-off-by: Nikita V. Shirokov 
> Acked-by: Yonghong Song 
> ---
>  tools/lib/bpf/libbpf.c | 33 +++--
>  tools/lib/bpf/libbpf.h |  2 ++
>  2 files changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index a01eb9584e52..0f46e8497ab8 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -163,6 +163,7 @@ struct bpf_map {
>   char *name;
>   size_t offset;
>   int map_ifindex;
> + int inner_map_fd;
>   struct bpf_map_def def;
>   __u32 btf_key_type_id;
>   __u32 btf_value_type_id;
> @@ -585,6 +586,14 @@ static int compare_bpf_map(const void *_a, const void 
> *_b)
>   return a->offset - b->offset;
>  }
>  
> +static bool bpf_map_type__is_mapinmap(enum bpf_map_type type)

there is already public api bpf_create_map_in_map()
Please use the existing naming convention.

> +{
> + if (type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
> + type == BPF_MAP_TYPE_HASH_OF_MAPS)
> + return true;
> + return false;
> +}
> +
>  static int
>  bpf_object__init_maps(struct bpf_object *obj, int flags)
>  {
> @@ -648,13 +657,15 @@ bpf_object__init_maps(struct bpf_object *obj, int flags)
>   }
>   obj->nr_maps = nr_maps;
>  
> - /*
> -  * fill all fd with -1 so won't close incorrect
> -  * fd (fd=0 is stdin) when failure (zclose won't close
> -  * negative fd)).
> -  */
> - for (i = 0; i < nr_maps; i++)
> + for (i = 0; i < nr_maps; i++) {
> + /*
> +  * fill all fd with -1 so won't close incorrect
> +  * fd (fd=0 is stdin) when failure (zclose won't close
> +  * negative fd)).
> +  */
>   obj->maps[i].fd = -1;
> + obj->maps[i].inner_map_fd = -1;
> + }
>  
>   /*
>* Fill obj->maps using data in "maps" section.
> @@ -1146,6 +1157,9 @@ bpf_object__create_maps(struct bpf_object *obj)
>   create_attr.btf_fd = 0;
>   create_attr.btf_key_type_id = 0;
>   create_attr.btf_value_type_id = 0;
> + if (bpf_map_type__is_mapinmap(def->type) &&
> + map->inner_map_fd >= 0)
> + create_attr.inner_map_fd = map->inner_map_fd;
>  
>   if (obj->btf && !bpf_map_find_btf_info(map, obj->btf)) {
>   create_attr.btf_fd = btf__fd(obj->btf);
> @@ -2562,6 +2576,13 @@ void bpf_map__set_ifindex(struct bpf_map *map, __u32 
> ifindex)
>   map->map_ifindex = ifindex;
>  }
>  
> +void bpf_map__set_inner_map_fd(struct bpf_map *map, int fd)
> +{
> + if (bpf_map_type__is_mapinmap(map->def.type) &&
> + map->inner_map_fd == -1)
> + map->inner_map_fd = fd;

return an error?

> +}
> +
>  static struct bpf_map *
>  __bpf_map__iter(struct bpf_map *m, struct bpf_object *obj, int i)
>  {
> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index b1686a787102..e2132c8c84ae 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -293,6 +293,8 @@ LIBBPF_API void bpf_map__set_ifindex(struct bpf_map *map, 
> __u32 ifindex);
>  LIBBPF_API int bpf_map__pin(struct bpf_map *map, const char *path);
>  LIBBPF_API int bpf_map__unpin(struct bpf_map *map, const char *path);
>  
> +LIBBPF_API void bpf_map__set_inner_map_fd(struct bpf_map *map, int fd);
> +
>  LIBBPF_API long libbpf_get_error(const void *ptr);
>  
>  struct bpf_prog_load_attr {
> -- 
> 2.15.1
> 


Re: [PATCH bpf-next] bpf: add read/write access to skb->tstamp from tc clsact progs

2018-11-20 Thread Willem de Bruijn
On Tue, Nov 20, 2018 at 8:22 PM Eric Dumazet  wrote:
>
>
>
> On 11/20/2018 04:18 PM, Vlad Dumitrescu wrote:
> > This could be used to rate limit egress traffic in concert with a qdisc
> > which supports Earliest Departure Time, such as FQ.
> >
> > Signed-off-by: Vlad Dumitrescu 
> > ---
> >  include/uapi/linux/bpf.h|  1 +
> >  net/core/filter.c   | 26 +
> >  tools/include/uapi/linux/bpf.h  |  1 +
> >  tools/testing/selftests/bpf/test_verifier.c |  4 
> >  4 files changed, 32 insertions(+)
> >
>
> Awesome, thanks Vlad
>
> Note that this also can be used to implement a delay (a la netem).
>
> Acked-by: Eric Dumazet 

Acked-by: Willem de Bruijn 


Re: [PATCH bpf-next v2] libbpf: make sure bpf headers are c++ include-able

2018-11-20 Thread Stanislav Fomichev
On 11/20, Alexei Starovoitov wrote:
> On Tue, Nov 20, 2018 at 04:05:55PM -0800, Stanislav Fomichev wrote:
> > On 11/20, Alexei Starovoitov wrote:
> > > On Tue, Nov 20, 2018 at 01:37:23PM -0800, Stanislav Fomichev wrote:
> > > > Wrap headers in extern "C", to turn off C++ mangling.
> > > > This simplifies including libbpf in c++ and linking against it.
> > > > 
> > > > v2 changes:
> > > > * do the same for btf.h
> > > > 
> > > > Signed-off-by: Stanislav Fomichev 
> > > > ---
> > > >  tools/lib/bpf/bpf.h| 9 +
> > > >  tools/lib/bpf/btf.h| 8 
> > > >  tools/lib/bpf/libbpf.h | 9 +
> > > >  3 files changed, 26 insertions(+)
> > > > 
> > > > diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> > > > index 26a51538213c..9ea3aec82d8a 100644
> > > > --- a/tools/lib/bpf/bpf.h
> > > > +++ b/tools/lib/bpf/bpf.h
> > > > @@ -27,6 +27,10 @@
> > > >  #include 
> > > >  #include 
> > > >  
> > > > +#ifdef __cplusplus
> > > > +extern "C" {
> > > > +#endif
> > > 
> > > Acked-by: Alexei Starovoitov 
> > > 
> > > was wondering whether it's possible to make it testable.
> > > HOSTCXX is available, but I don't see much of the kernel tree
> > > using it...
> > By testable you mean compile some dummy c++ main and link against libbpf?
> 
> yes. something like this.
> to make sure that it keeps being functional and no one introduces 'int new'
> in some function argument list by accident.
I tried something like the patch below, it does seem to work locally (building
in the same directory, no cross-compile). Who would be the best to
review that kind of stuff?

diff --git a/tools/lib/bpf/Makefile b/tools/lib/bpf/Makefile
index 425b480bda75..4c0e58628aad 100644
--- a/tools/lib/bpf/Makefile
+++ b/tools/lib/bpf/Makefile
@@ -66,7 +66,7 @@ ifndef VERBOSE
 endif
 
 FEATURE_USER = .libbpf
-FEATURE_TESTS = libelf libelf-mmap bpf reallocarray
+FEATURE_TESTS = libelf libelf-mmap bpf reallocarray cxx
 FEATURE_DISPLAY = libelf bpf
 
 INCLUDES = -I. -I$(srctree)/tools/include 
-I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi
@@ -148,6 +148,10 @@ LIB_FILE := $(addprefix $(OUTPUT),$(LIB_FILE))
 
 CMD_TARGETS = $(LIB_FILE)
 
+ifeq ($(feature-cxx), 1)
+   CMD_TARGETS += $(OUTPUT)test_libbpf
+endif
+
 TARGETS = $(CMD_TARGETS)
 
 all: fixdep all_cmd
@@ -175,6 +179,9 @@ $(OUTPUT)libbpf.so: $(BPF_IN)
 $(OUTPUT)libbpf.a: $(BPF_IN)
$(QUIET_LINK)$(RM) $@; $(AR) rcs $@ $^
 
+$(OUTPUT)test_libbpf: test_libbpf.cpp $(OUTPUT)libbpf.a
+   $(QUIET_LINK)$(CXX) $^ -lelf -o $@
+
 define do_install
if [ ! -d '$(DESTDIR_SQ)$2' ]; then \
$(INSTALL) -d -m 755 '$(DESTDIR_SQ)$2'; \


Re: [PATCH net-next] net: don't keep lonely packets forever in the gro hash

2018-11-20 Thread Willem de Bruijn
On Tue, Nov 20, 2018 at 1:19 PM Paolo Abeni  wrote:
>
> Eric noted that with UDP GRO and NAPI timeout, we could keep a single
> UDP packet inside the GRO hash forever, if the related NAPI instance
> calls napi_gro_complete() at a higher frequency than the NAPI timeout.
> Willem noted that even TCP packets could be trapped there, till the
> next retransmission.
> This patch tries to address the issue, flushing the oldest packets before
> scheduling the NAPI timeout. The rationale is that such a timeout should be
> well below a jiffy and we are not flushing packets eligible for sane GRO.

Might be useful to be a bit more exact: oldest packets -> old packets,
those with a NAPI_GRO_CB age before the current jiffy. That helps
explain the "well below a jiffy" comment. I had to reread the code to
understand that this did not just flush everything, obviating the
purpose of the gro timer.

Agreed that something more fine-grained than jiffies would help. In
the case of gro_timer we know the interval, so even an age expressed
with an epoch counter would suffice? Anyway, out of scope for this patch.

> RFC -> v1:
>  - added 'Fixes tags', cleaned-up the wording.
>
> Reported-by: Eric Dumazet 
> Fixes: 3b47d30396ba ("net: gro: add a per device gro flush timer")
> Fixes: e20cf8d3f1f7 ("udp: implement GRO for plain UDP sockets.")
> Signed-off-by: Paolo Abeni 

Acked on the change. Thanks, Paolo. Just a few comments and questions
about the description and target. Feel free to ignore, in which case
I'll just add my Acked-by.

> --
> Note: since one of the fixed commit is currently only on net-next,
> the other one is really old, and the affected scenario without the
> more recent commit is really a corner case, targeting net-next.

IMHO there is nothing UDP specific here. Any sender that applies GRO
and that does not use push bits can trigger this, including non-native
TCP stacks. The gro_timer is a bit of an edge case, but that makes it
fine to send it to net, too.

> ---
>  net/core/dev.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 5927f6a7c301..b6eb4e0bfa91 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5975,11 +5975,14 @@ bool napi_complete_done(struct napi_struct *n, int 
> work_done)
> if (work_done)
> timeout = n->dev->gro_flush_timeout;
>
> +   /* When the NAPI instance uses a timeout, we still need to
> +* somehow bound the time packets are keept in the GRO layer

nit: keept -> kept

> +* under heavy traffic

this only requires one packet per gro_flush_timeout?

> +*/
> +   napi_gro_flush(n, !!timeout);
> if (timeout)
> hrtimer_start(&n->timer, ns_to_ktime(timeout),
>   HRTIMER_MODE_REL_PINNED);
> -   else
> -   napi_gro_flush(n, false);
> }
> if (unlikely(!list_empty(>poll_list))) {
> /* If n->poll_list is not empty, we need to mask irqs */
> --
> 2.17.2
>


[PATCH v4 bpf-next 2/2] bpf: adding tests for mapinmap helper in libbpf

2018-11-20 Thread Nikita V. Shirokov
adding a test/example of bpf_map__set_inner_map_fd usage

Signed-off-by: Nikita V. Shirokov 
Acked-by: Yonghong Song 
---
 tools/testing/selftests/bpf/Makefile|  3 +-
 tools/testing/selftests/bpf/test_mapinmap.c | 49 +
 tools/testing/selftests/bpf/test_maps.c | 82 +
 3 files changed, 133 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_mapinmap.c

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 57b4712a6276..a3ea69dc9bdf 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -38,7 +38,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o \
-   test_sk_lookup_kern.o test_xdp_vlan.o test_queue_map.o test_stack_map.o
+   test_sk_lookup_kern.o test_xdp_vlan.o test_queue_map.o test_stack_map.o 
\
+   test_mapinmap.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
diff --git a/tools/testing/selftests/bpf/test_mapinmap.c 
b/tools/testing/selftests/bpf/test_mapinmap.c
new file mode 100644
index ..ce923e67e08e
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_mapinmap.c
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018 Facebook */
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+struct bpf_map_def SEC("maps") mim_array = {
+   .type = BPF_MAP_TYPE_ARRAY_OF_MAPS,
+   .key_size = sizeof(int),
+   /* must be sizeof(__u32) for map in map */
+   .value_size = sizeof(__u32),
+   .max_entries = 1,
+   .map_flags = 0,
+};
+
+struct bpf_map_def SEC("maps") mim_hash = {
+   .type = BPF_MAP_TYPE_HASH_OF_MAPS,
+   .key_size = sizeof(int),
+   /* must be sizeof(__u32) for map in map */
+   .value_size = sizeof(__u32),
+   .max_entries = 1,
+   .map_flags = 0,
+};
+
+SEC("xdp_mimtest")
+int xdp_mimtest0(struct xdp_md *ctx)
+{
+   int value = 123;
+   int key = 0;
+   void *map;
+
+   map = bpf_map_lookup_elem(&mim_array, &key);
+   if (!map)
+   return XDP_DROP;
+
+   bpf_map_update_elem(map, &key, &value, 0);
+
+   map = bpf_map_lookup_elem(&mim_hash, &key);
+   if (!map)
+   return XDP_DROP;
+
+   bpf_map_update_elem(map, &key, &value, 0);
+
+   return XDP_PASS;
+}
+
+int _version SEC("version") = 1;
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 4db2116e52be..6f2cf1a8a1b6 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -1080,6 +1080,86 @@ static void test_sockmap(int tasks, void *data)
exit(1);
 }
 
+#define MAPINMAP_PROG "./test_mapinmap.o"
+static void test_mapinmap(void)
+{
+   struct bpf_program *prog;
+   struct bpf_object *obj;
+   struct bpf_map *map;
+   int mim_fd, fd, err;
+   int pos = 0;
+
+   obj = bpf_object__open(MAPINMAP_PROG);
+
+   fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(int), sizeof(int),
+   2, 0);
+   if (fd < 0) {
+   printf("Failed to create hashmap '%s'!\n", strerror(errno));
+   exit(1);
+   }
+
+   map = bpf_object__find_map_by_name(obj, "mim_array");
+   if (IS_ERR(map)) {
+   printf("Failed to load array of maps from test prog\n");
+   goto out_mapinmap;
+   }
+   bpf_map__set_inner_map_fd(map, fd);
+
+   map = bpf_object__find_map_by_name(obj, "mim_hash");
+   if (IS_ERR(map)) {
+   printf("Failed to load hash of maps from test prog\n");
+   goto out_mapinmap;
+   }
+   bpf_map__set_inner_map_fd(map, fd);
+
+   bpf_object__for_each_program(prog, obj) {
+   bpf_program__set_xdp(prog);
+   }
+   bpf_object__load(obj);
+
+   map = bpf_object__find_map_by_name(obj, "mim_array");
+   if (IS_ERR(map)) {
+   printf("Failed to load array of maps from test prog\n");
+   goto out_mapinmap;
+   }
+   mim_fd = bpf_map__fd(map);
+   if (mim_fd < 0) {
+   printf("Failed to get descriptor for array of maps\n");
+   goto out_mapinmap;
+   }
+
+   err = bpf_map_update_elem(mim_fd, &pos, &fd, 0);
+   if (err) {
+   printf("Failed to update array of maps\n");
+   goto out_mapinmap;
+   }
+
+   map = bpf_object__find_map_by_name(obj, "mim_hash");
+   if (IS_ERR(map)) {
+   printf("Failed to load hash of maps from test prog\n");
+   goto out_mapinmap;
+   }
+   mim_fd = bpf_map__fd(map);
+   if (mim_fd < 0) {
+   

[PATCH v4 bpf-next 0/2] bpf: adding support for mapinmap in libbpf

2018-11-20 Thread Nikita V. Shirokov
In this patch series I'm adding a helper for libbpf which would allow
it to load map-in-map (BPF_MAP_TYPE_ARRAY_OF_MAPS and
BPF_MAP_TYPE_HASH_OF_MAPS).
The first patch contains the new helper and explains the proposed workflow;
the second patch contains tests which can also be used as an example of usage.

v3->v4:
 - renamed helper to set_inner_map_fd
 - now we set this value only if it hasn't
   been set before, and only for (array|hash) of maps

v2->v3:
 - fixing typo in patch description
 - initializing inner_map_fd to -1 by default

v1->v2:
 - addressing nits
 - removing const identifier from fd in new helper
 - starting to check return val for bpf_map_update_elem

Nikita V. Shirokov (2):
  bpf: adding support for map in map in libbpf
  bpf: adding tests for mapinmap helper in libbpf

 tools/lib/bpf/libbpf.c  | 33 +---
 tools/lib/bpf/libbpf.h  |  2 +
 tools/testing/selftests/bpf/Makefile|  3 +-
 tools/testing/selftests/bpf/test_mapinmap.c | 49 +
 tools/testing/selftests/bpf/test_maps.c | 82 +
 5 files changed, 162 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_mapinmap.c

-- 
2.15.1



[PATCH v4 bpf-next 1/2] bpf: adding support for map in map in libbpf

2018-11-20 Thread Nikita V. Shirokov
idea is pretty simple. for specified map (pointed by struct bpf_map)
we would provide descriptor of already loaded map, which is going to be
used as a prototype for inner map. proposed workflow:
1) open bpf's object (bpf_object__open)
2) create bpf's map which is going to be used as a prototype
3) find (by name) map-in-map which you want to load and update w/
descriptor of inner map w/ a new helper from this patch
4) load bpf program w/ bpf_object__load

Signed-off-by: Nikita V. Shirokov 
Acked-by: Yonghong Song 
---
 tools/lib/bpf/libbpf.c | 33 +++--
 tools/lib/bpf/libbpf.h |  2 ++
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index a01eb9584e52..0f46e8497ab8 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -163,6 +163,7 @@ struct bpf_map {
char *name;
size_t offset;
int map_ifindex;
+   int inner_map_fd;
struct bpf_map_def def;
__u32 btf_key_type_id;
__u32 btf_value_type_id;
@@ -585,6 +586,14 @@ static int compare_bpf_map(const void *_a, const void *_b)
return a->offset - b->offset;
 }
 
+static bool bpf_map_type__is_mapinmap(enum bpf_map_type type)
+{
+   if (type == BPF_MAP_TYPE_ARRAY_OF_MAPS ||
+   type == BPF_MAP_TYPE_HASH_OF_MAPS)
+   return true;
+   return false;
+}
+
 static int
 bpf_object__init_maps(struct bpf_object *obj, int flags)
 {
@@ -648,13 +657,15 @@ bpf_object__init_maps(struct bpf_object *obj, int flags)
}
obj->nr_maps = nr_maps;
 
-   /*
-* fill all fd with -1 so won't close incorrect
-* fd (fd=0 is stdin) when failure (zclose won't close
-* negative fd)).
-*/
-   for (i = 0; i < nr_maps; i++)
+   for (i = 0; i < nr_maps; i++) {
+   /*
+* fill all fd with -1 so won't close incorrect
+* fd (fd=0 is stdin) when failure (zclose won't close
+* negative fd)).
+*/
obj->maps[i].fd = -1;
+   obj->maps[i].inner_map_fd = -1;
+   }
 
/*
 * Fill obj->maps using data in "maps" section.
@@ -1146,6 +1157,9 @@ bpf_object__create_maps(struct bpf_object *obj)
create_attr.btf_fd = 0;
create_attr.btf_key_type_id = 0;
create_attr.btf_value_type_id = 0;
+   if (bpf_map_type__is_mapinmap(def->type) &&
+   map->inner_map_fd >= 0)
+   create_attr.inner_map_fd = map->inner_map_fd;
 
if (obj->btf && !bpf_map_find_btf_info(map, obj->btf)) {
create_attr.btf_fd = btf__fd(obj->btf);
@@ -2562,6 +2576,13 @@ void bpf_map__set_ifindex(struct bpf_map *map, __u32 
ifindex)
map->map_ifindex = ifindex;
 }
 
+void bpf_map__set_inner_map_fd(struct bpf_map *map, int fd)
+{
+   if (bpf_map_type__is_mapinmap(map->def.type) &&
+   map->inner_map_fd == -1)
+   map->inner_map_fd = fd;
+}
+
 static struct bpf_map *
 __bpf_map__iter(struct bpf_map *m, struct bpf_object *obj, int i)
 {
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index b1686a787102..e2132c8c84ae 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -293,6 +293,8 @@ LIBBPF_API void bpf_map__set_ifindex(struct bpf_map *map, 
__u32 ifindex);
 LIBBPF_API int bpf_map__pin(struct bpf_map *map, const char *path);
 LIBBPF_API int bpf_map__unpin(struct bpf_map *map, const char *path);
 
+LIBBPF_API void bpf_map__set_inner_map_fd(struct bpf_map *map, int fd);
+
 LIBBPF_API long libbpf_get_error(const void *ptr);
 
 struct bpf_prog_load_attr {
-- 
2.15.1



[PATCH bpf-next v3 3/3] bpf: libbpf: don't specify prog name if kernel doesn't support it

2018-11-20 Thread Stanislav Fomichev
Use recently added capability check.

See commit 23499442c319 ("bpf: libbpf: retry map creation without the
name") for rationale.

Signed-off-by: Stanislav Fomichev 
---
 tools/lib/bpf/libbpf.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index bf45f285d0a0..a080aeff7e2e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1439,7 +1439,8 @@ load_program(struct bpf_program *prog, struct bpf_insn 
*insns, int insns_cnt,
memset(_attr, 0, sizeof(struct bpf_load_program_attr));
load_attr.prog_type = prog->type;
load_attr.expected_attach_type = prog->expected_attach_type;
-   load_attr.name = prog->name;
+   if (prog->caps->name)
+   load_attr.name = prog->name;
load_attr.insns = insns;
load_attr.insns_cnt = insns_cnt;
load_attr.license = license;
-- 
2.19.1.1215.g8438c0b245-goog



[PATCH bpf-next v3 2/3] bpf: libbpf: remove map name retry from bpf_create_map_xattr

2018-11-20 Thread Stanislav Fomichev
Instead, check for a newly created caps.name bpf_object capability.
If kernel doesn't support names, don't specify the attribute.

See commit 23499442c319 ("bpf: libbpf: retry map creation without the
name") for rationale.

Signed-off-by: Stanislav Fomichev 
---
 tools/lib/bpf/bpf.c| 11 +--
 tools/lib/bpf/libbpf.c |  3 ++-
 2 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 836447bb4f14..ce1822194590 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -69,7 +69,6 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr 
*create_attr)
 {
__u32 name_len = create_attr->name ? strlen(create_attr->name) : 0;
union bpf_attr attr;
-   int ret;
 
memset(, '\0', sizeof(attr));
 
@@ -87,15 +86,7 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr 
*create_attr)
attr.map_ifindex = create_attr->map_ifindex;
attr.inner_map_fd = create_attr->inner_map_fd;
 
-   ret = sys_bpf(BPF_MAP_CREATE, , sizeof(attr));
-   if (ret < 0 && errno == EINVAL && create_attr->name) {
-   /* Retry the same syscall, but without the name.
-* Pre v4.14 kernels don't support map names.
-*/
-   memset(attr.map_name, 0, sizeof(attr.map_name));
-   return sys_bpf(BPF_MAP_CREATE, , sizeof(attr));
-   }
-   return ret;
+   return sys_bpf(BPF_MAP_CREATE, , sizeof(attr));
 }
 
 int bpf_create_map_node(enum bpf_map_type map_type, const char *name,
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 3f7f476d8751..bf45f285d0a0 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1211,7 +1211,8 @@ bpf_object__create_maps(struct bpf_object *obj)
continue;
}
 
-   create_attr.name = map->name;
+   if (obj->caps.name)
+   create_attr.name = map->name;
create_attr.map_ifindex = map->map_ifindex;
create_attr.map_type = def->type;
create_attr.map_flags = def->map_flags;
-- 
2.19.1.1215.g8438c0b245-goog



[PATCH bpf-next v3 1/3] bpf, libbpf: introduce bpf_object__probe_caps to test BPF capabilities

2018-11-20 Thread Stanislav Fomichev
It currently only checks whether kernel supports map/prog names.
This capability check will be used in the next two commits to skip setting
prog/map names.

Suggested-by: Daniel Borkmann 
Signed-off-by: Stanislav Fomichev 
---
 tools/lib/bpf/libbpf.c | 58 ++
 1 file changed, 58 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index cb6565d79603..3f7f476d8751 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -114,6 +115,11 @@ void libbpf_set_print(libbpf_print_fn_t warn,
 # define LIBBPF_ELF_C_READ_MMAP ELF_C_READ
 #endif
 
+struct bpf_capabilities {
+   /* v4.14: kernel support for program & map names. */
+   __u32 name:1;
+};
+
 /*
  * bpf_prog should be a better name but it has been used in
  * linux/filter.h.
@@ -160,6 +166,8 @@ struct bpf_program {
void *func_info;
__u32 func_info_rec_size;
__u32 func_info_len;
+
+   struct bpf_capabilities *caps;
 };
 
 struct bpf_map {
@@ -221,6 +229,8 @@ struct bpf_object {
void *priv;
bpf_object_clear_priv_t clear_priv;
 
+   struct bpf_capabilities caps;
+
char path[];
 };
 #define obj_elf_valid(o)   ((o)->efile.elf)
@@ -342,6 +352,7 @@ bpf_object__add_program(struct bpf_object *obj, void *data, 
size_t size,
if (err)
return err;
 
+   prog.caps = >caps;
progs = obj->programs;
nr_progs = obj->nr_programs;
 
@@ -1135,6 +1146,52 @@ int bpf_map__reuse_fd(struct bpf_map *map, int fd)
return -errno;
 }
 
+static int
+bpf_object__probe_name(struct bpf_object *obj)
+{
+   struct bpf_load_program_attr attr;
+   char *cp, errmsg[STRERR_BUFSIZE];
+   struct bpf_insn insns[] = {
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   };
+   int ret;
+
+   /* make sure basic loading works */
+
+   memset(, 0, sizeof(attr));
+   attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
+   attr.insns = insns;
+   attr.insns_cnt = ARRAY_SIZE(insns);
+   attr.license = "GPL";
+
+   ret = bpf_load_program_xattr(, NULL, 0);
+   if (ret < 0) {
+   cp = libbpf_strerror_r(errno, errmsg, sizeof(errmsg));
+   pr_warning("Error in %s():%s(%d). Couldn't load basic 'r0 = 0' 
BPF program.\n",
+  __func__, cp, errno);
+   return -errno;
+   }
+   close(ret);
+
+   /* now try the same program, but with the name */
+
+   attr.name = "test";
+   ret = bpf_load_program_xattr(, NULL, 0);
+   if (ret >= 0) {
+   obj->caps.name = 1;
+   close(ret);
+   }
+
+   return 0;
+}
+
+static int
+bpf_object__probe_caps(struct bpf_object *obj)
+{
+   return bpf_object__probe_name(obj);
+}
+
 static int
 bpf_object__create_maps(struct bpf_object *obj)
 {
@@ -1708,6 +1765,7 @@ int bpf_object__load(struct bpf_object *obj)
 
obj->loaded = true;
 
+   CHECK_ERR(bpf_object__probe_caps(obj), err, out);
CHECK_ERR(bpf_object__create_maps(obj), err, out);
CHECK_ERR(bpf_object__relocate(obj), err, out);
CHECK_ERR(bpf_object__load_progs(obj), err, out);
-- 
2.19.1.1215.g8438c0b245-goog



Re: [PATCH net] sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit

2018-11-20 Thread Marcelo Ricardo Leitner
On Mon, Nov 19, 2018 at 12:39:55PM -0800, David Miller wrote:
> From: Xin Long 
> Date: Sun, 18 Nov 2018 15:07:38 +0800
> 
> > Now sctp increases sk_wmem_alloc by 1 when doing set_owner_w for the
> > skb allocated in sctp_packet_transmit and decreases by 1 when freeing
> > this skb.
> > 
> > But when this skb goes through networking stack, some subcomponents
> > might change skb->truesize and add the same amount on sk_wmem_alloc.
> > However sctp doesn't know the amount to decrease by, it would cause
> > a leak on sk->sk_wmem_alloc and the sock can never be freed.
> > 
> > Xiumei found this issue when it hit esp_output_head() by using sctp
> > over ipsec, where skb->truesize is added and so is sk->sk_wmem_alloc.
> > 
> > Since sctp has used sk_wmem_queued to count for writable space since
> > Commit cd305c74b0f8 ("sctp: use sk_wmem_queued to check for writable
> > space"), it's ok to fix it by counting sk_wmem_alloc by skb truesize
> > in sctp_packet_transmit.
> > 
> > Fixes: cac2661c53f3 ("esp4: Avoid skb_cow_data whenever possible")
> > Reported-by: Xiumei Mu 
> > Signed-off-by: Xin Long 
> 
> Applied and queued up for -stable.

Dave, is there a way that we can check to which versions you queued it
up?

Asking because even though this patch fixes cac2661c53f3 (v4.10) and
the patch probably applies cleanly, it has a dependency on
cd305c74b0f8 (v4.19) and fixing the issue in older kernels either need
a different fix or backport of cd305c74b0f8 too.



Re: [PATCH bpf-next] bpf: add read/write access to skb->tstamp from tc clsact progs

2018-11-20 Thread Eric Dumazet



On 11/20/2018 04:18 PM, Vlad Dumitrescu wrote:
> This could be used to rate limit egress traffic in concert with a qdisc
> which supports Earliest Departure Time, such as FQ.
> 
> Signed-off-by: Vlad Dumitrescu 
> ---
>  include/uapi/linux/bpf.h|  1 +
>  net/core/filter.c   | 26 +
>  tools/include/uapi/linux/bpf.h  |  1 +
>  tools/testing/selftests/bpf/test_verifier.c |  4 
>  4 files changed, 32 insertions(+)
>

Awesome, thanks Vlad

Note that this also can be used to implement a delay (a la netem).

Acked-by: Eric Dumazet 




Re: [PATCH net] sctp: hold transport before accessing its asoc in sctp_hash_transport

2018-11-20 Thread Marcelo Ricardo Leitner
On Tue, Nov 20, 2018 at 07:52:48AM -0500, Neil Horman wrote:
> On Tue, Nov 20, 2018 at 07:09:16PM +0800, Xin Long wrote:
> > In sctp_hash_transport, it dereferences a transport's asoc only under
> > rcu_read_lock. Without holding the transport, its asoc could be freed
> > already, which leads to a use-after-free panic.
> > 
> > A similar fix as Commit bab1be79a516 ("sctp: hold transport before
> > accessing its asoc in sctp_transport_get_next") is needed to hold
> > the transport before accessing its asoc in sctp_hash_transport.
> > 
> > Fixes: cd2b70875058 ("sctp: check duplicate node before inserting a new 
> > transport")
> > Reported-by: syzbot+0b05d8aa7cb185107...@syzkaller.appspotmail.com
> > Signed-off-by: Xin Long 
> > ---
> >  net/sctp/input.c | 7 ++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/sctp/input.c b/net/sctp/input.c
> > index 5c36a99..69584e9 100644
> > --- a/net/sctp/input.c
> > +++ b/net/sctp/input.c
> > @@ -896,11 +896,16 @@ int sctp_hash_transport(struct sctp_transport *t)
> > list = rhltable_lookup(_transport_hashtable, ,
> >sctp_hash_params);
> >  
> > -   rhl_for_each_entry_rcu(transport, tmp, list, node)
> > +   rhl_for_each_entry_rcu(transport, tmp, list, node) {
> > +   if (!sctp_transport_hold(transport))
> > +   continue;
> > if (transport->asoc->ep == t->asoc->ep) {
> > +   sctp_transport_put(transport);
> > rcu_read_unlock();
> > return -EEXIST;
> > }
> > +   sctp_transport_put(transport);
> > +   }
> > rcu_read_unlock();
> >  
> > err = rhltable_insert_key(_transport_hashtable, ,
> > -- 
> > 2.1.0
> > 
> > 
> 
> something doesn't feel at all right about this.  If we are inserting a 
> transport
> to an association, it would seem to me that we should have at least one user 
> of
> the association (i.e. non-zero refcount).  As such it seems something is wrong
> with the association refcount here.  At the very least, if there is a case 
> where
> an association is being removed while a transport is being added, the better
> solution would be to ensure that sctp_association_destroy goes through a
> quiescent point prior to unhashing transports from the list, to ensure that
> there is no conflict with the add operation above.

Consider that the rhl_for_each_entry_rcu() is traversing the global
rhashtable, and that it may operate on unrelated transports/asocs.
E.g., transport->asoc in the for() is potentially different from the
asoc under socket lock.

The core of the fix is at:
+   if (!sctp_transport_hold(transport))
+   continue;
If we can get a hold, the asoc will be available for dereferencing in
subsequent lines. Otherwise, move on.

With that, the patch makes sense to me.

Although I would prefer if we come up with a better way to do this
jump, or even avoid the jump. We are only comparing pointers here and
if we had asoc->ep cached on sctp_transport itself, we could avoid the
atomics here.

This change, in the next patch on sctp_epaddr_lookup_transport, will
hurt performance as that is called in datapath. Rhashtable will help
on keeping entry lists to a size, but still.

  Marcelo


[PATCH bpf-next] bpf: add read/write access to skb->tstamp from tc clsact progs

2018-11-20 Thread Vlad Dumitrescu
This could be used to rate limit egress traffic in concert with a qdisc
which supports Earliest Departure Time, such as FQ.

Signed-off-by: Vlad Dumitrescu 
---
 include/uapi/linux/bpf.h|  1 +
 net/core/filter.c   | 26 +
 tools/include/uapi/linux/bpf.h  |  1 +
 tools/testing/selftests/bpf/test_verifier.c |  4 
 4 files changed, 32 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c1554aa074659..23e2031a43d43 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2468,6 +2468,7 @@ struct __sk_buff {
 
__u32 data_meta;
struct bpf_flow_keys *flow_keys;
+   __u64 tstamp;
 };
 
 struct bpf_tunnel_key {
diff --git a/net/core/filter.c b/net/core/filter.c
index f6ca38a7d4332..c45155c8e519c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5573,6 +5573,10 @@ static bool bpf_skb_is_valid_access(int off, int size, 
enum bpf_access_type type
if (size != sizeof(struct bpf_flow_keys *))
return false;
break;
+   case bpf_ctx_range(struct __sk_buff, tstamp):
+   if (size != sizeof(__u64))
+   return false;
+   break;
default:
/* Only narrow read access allowed for now. */
if (type == BPF_WRITE) {
@@ -5600,6 +5604,7 @@ static bool sk_filter_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, data_end):
case bpf_ctx_range(struct __sk_buff, flow_keys):
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
+   case bpf_ctx_range(struct __sk_buff, tstamp):
return false;
}
 
@@ -5624,6 +5629,7 @@ static bool cg_skb_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, flow_keys):
+   case bpf_ctx_range(struct __sk_buff, tstamp):
return false;
case bpf_ctx_range(struct __sk_buff, data):
case bpf_ctx_range(struct __sk_buff, data_end):
@@ -5665,6 +5671,7 @@ static bool lwt_is_valid_access(int off, int size,
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, flow_keys):
+   case bpf_ctx_range(struct __sk_buff, tstamp):
return false;
}
 
@@ -5874,6 +5881,7 @@ static bool tc_cls_act_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, priority):
case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range_till(struct __sk_buff, cb[0], cb[4]):
+   case bpf_ctx_range(struct __sk_buff, tstamp):
break;
default:
return false;
@@ -6093,6 +6101,7 @@ static bool sk_skb_is_valid_access(int off, int size,
case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range(struct __sk_buff, flow_keys):
+   case bpf_ctx_range(struct __sk_buff, tstamp):
return false;
}
 
@@ -6179,6 +6188,7 @@ static bool flow_dissector_is_valid_access(int off, int 
size,
case bpf_ctx_range(struct __sk_buff, tc_classid):
case bpf_ctx_range(struct __sk_buff, data_meta):
case bpf_ctx_range_till(struct __sk_buff, family, local_port):
+   case bpf_ctx_range(struct __sk_buff, tstamp):
return false;
}
 
@@ -6488,6 +6498,22 @@ static u32 bpf_convert_ctx_access(enum bpf_access_type 
type,
*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg,
  si->src_reg, off);
break;
+
+   case offsetof(struct __sk_buff, tstamp):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sk_buff, tstamp) != 8);
+
+   if (type == BPF_WRITE)
+   *insn++ = BPF_STX_MEM(BPF_DW,
+ si->dst_reg, si->src_reg,
+ bpf_target_off(struct sk_buff,
+tstamp, 8,
+target_size));
+   else
+   *insn++ = BPF_LDX_MEM(BPF_DW,
+ si->dst_reg, si->src_reg,
+ bpf_target_off(struct sk_buff,
+tstamp, 8,
+target_size));
}
 
return insn - insn_buf;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c1554aa074659..23e2031a43d43 100644
--- 

Re: [PATCH bpf-next v2] libbpf: make sure bpf headers are c++ include-able

2018-11-20 Thread Alexei Starovoitov
On Tue, Nov 20, 2018 at 04:05:55PM -0800, Stanislav Fomichev wrote:
> On 11/20, Alexei Starovoitov wrote:
> > On Tue, Nov 20, 2018 at 01:37:23PM -0800, Stanislav Fomichev wrote:
> > > Wrap headers in extern "C", to turn off C++ mangling.
> > > This simplifies including libbpf in c++ and linking against it.
> > > 
> > > v2 changes:
> > > * do the same for btf.h
> > > 
> > > Signed-off-by: Stanislav Fomichev 
> > > ---
> > >  tools/lib/bpf/bpf.h| 9 +
> > >  tools/lib/bpf/btf.h| 8 
> > >  tools/lib/bpf/libbpf.h | 9 +
> > >  3 files changed, 26 insertions(+)
> > > 
> > > diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> > > index 26a51538213c..9ea3aec82d8a 100644
> > > --- a/tools/lib/bpf/bpf.h
> > > +++ b/tools/lib/bpf/bpf.h
> > > @@ -27,6 +27,10 @@
> > >  #include 
> > >  #include 
> > >  
> > > +#ifdef __cplusplus
> > > +extern "C" {
> > > +#endif
> > 
> > Acked-by: Alexei Starovoitov 
> > 
> > was wondering whether it's possible to make it testable.
> > HOSTCXX is available, but I don't see much of the kernel tree
> > using it...
> By testable you mean compile some dummy c++ main and link against libbpf?

yes. something like this.
to make sure that it keeps being functional and no one introduces 'int new'
in some function argument list by accident.



Re: [PATCH bpf-next v2] libbpf: make sure bpf headers are c++ include-able

2018-11-20 Thread Stanislav Fomichev
On 11/20, Alexei Starovoitov wrote:
> On Tue, Nov 20, 2018 at 01:37:23PM -0800, Stanislav Fomichev wrote:
> > Wrap headers in extern "C", to turn off C++ mangling.
> > This simplifies including libbpf in c++ and linking against it.
> > 
> > v2 changes:
> > * do the same for btf.h
> > 
> > Signed-off-by: Stanislav Fomichev 
> > ---
> >  tools/lib/bpf/bpf.h| 9 +
> >  tools/lib/bpf/btf.h| 8 
> >  tools/lib/bpf/libbpf.h | 9 +
> >  3 files changed, 26 insertions(+)
> > 
> > diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> > index 26a51538213c..9ea3aec82d8a 100644
> > --- a/tools/lib/bpf/bpf.h
> > +++ b/tools/lib/bpf/bpf.h
> > @@ -27,6 +27,10 @@
> >  #include 
> >  #include 
> >  
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> 
> Acked-by: Alexei Starovoitov 
> 
> was wondering whether it's possible to make it testable.
> HOSTCXX is available, but I don't see much of the kernel tree
> using it...
By testable you mean compile some dummy c++ main and link against libbpf?

perf has something similar:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/c++/clang-test.cpp#n7

But they don't use Makefiles for that (there is USE_CXX feature test as
well), so I'm not sure either :-/


Re: BPF probe namespacing

2018-11-20 Thread Alexei Starovoitov
On Mon, Nov 19, 2018 at 03:29:07PM +, Peter Parkanyi wrote:
> Hi,
> 
> At LPC I raised the observation that currently it doesn't seem
> feasible to insert a BPF probe from within a container that sees
> events happening outside of the container, while it is possible to
> insert a kernel module.
> 
> It was suggested that this is not the case, and things should just work.
> I wanted to get a minimal reproduction of what I've seen in Docker
> containers, so if somebody could take a look, I'd appreciate any
> comments on the right way of doing this.
> 
> The kprobe in question:
> https://github.com/redsift/ingraind/blob/master/bpf/file.c

I suspect there is something in the bpf prog that makes it skip in-chroot
events.
Can you add bpf_trace_printk()s throughout the code to see what's wrong?
For instance, can your test even create the test file?
As far as I can see on the kernel side kprobe will be firing.



[PATCH v4 net-next 5/6] net: dsa: microchip: break KSZ9477 DSA driver into two files

2018-11-20 Thread Tristram.Ha
From: Tristram Ha 

Break KSZ9477 DSA driver into two files in preparation to add more KSZ
switch drivers.
Add common functions in ksz_common.h so that other KSZ switch drivers
can access code in ksz_common.c.
Add ksz_spi.h for common functions used by KSZ switch SPI drivers.

Signed-off-by: Tristram Ha 
Reviewed-by: Woojung Huh 
Reviewed-by: Pavel Machek 
Reviewed-by: Florian Fainelli 
Reviewed-by: Andrew Lunn 
---
 drivers/net/dsa/microchip/Kconfig   |4 +
 drivers/net/dsa/microchip/Makefile  |3 +-
 drivers/net/dsa/microchip/ksz9477.c | 1316 +++
 drivers/net/dsa/microchip/ksz9477_spi.c |  143 ++--
 drivers/net/dsa/microchip/ksz_common.c  | 1170 ---
 drivers/net/dsa/microchip/ksz_common.h  |  214 +
 drivers/net/dsa/microchip/ksz_priv.h|  226 +++---
 drivers/net/dsa/microchip/ksz_spi.h |   69 ++
 8 files changed, 1903 insertions(+), 1242 deletions(-)
 create mode 100644 drivers/net/dsa/microchip/ksz9477.c
 create mode 100644 drivers/net/dsa/microchip/ksz_common.h
 create mode 100644 drivers/net/dsa/microchip/ksz_spi.h

diff --git a/drivers/net/dsa/microchip/Kconfig 
b/drivers/net/dsa/microchip/Kconfig
index 4e25fe4..a8caf92 100644
--- a/drivers/net/dsa/microchip/Kconfig
+++ b/drivers/net/dsa/microchip/Kconfig
@@ -1,7 +1,11 @@
+config NET_DSA_MICROCHIP_KSZ_COMMON
+   tristate
+
 menuconfig NET_DSA_MICROCHIP_KSZ9477
tristate "Microchip KSZ9477 series switch support"
depends on NET_DSA
select NET_DSA_TAG_KSZ
+   select NET_DSA_MICROCHIP_KSZ_COMMON
help
  This driver adds support for Microchip KSZ9477 switch chips.
 
diff --git a/drivers/net/dsa/microchip/Makefile 
b/drivers/net/dsa/microchip/Makefile
index 9393e73..3142c18 100644
--- a/drivers/net/dsa/microchip/Makefile
+++ b/drivers/net/dsa/microchip/Makefile
@@ -1,2 +1,3 @@
-obj-$(CONFIG_NET_DSA_MICROCHIP_KSZ9477)+= ksz_common.o
+obj-$(CONFIG_NET_DSA_MICROCHIP_KSZ_COMMON) += ksz_common.o
+obj-$(CONFIG_NET_DSA_MICROCHIP_KSZ9477)+= ksz9477.o
 obj-$(CONFIG_NET_DSA_MICROCHIP_KSZ9477_SPI)+= ksz9477_spi.o
diff --git a/drivers/net/dsa/microchip/ksz9477.c 
b/drivers/net/dsa/microchip/ksz9477.c
new file mode 100644
index 000..80df6c0
--- /dev/null
+++ b/drivers/net/dsa/microchip/ksz9477.c
@@ -0,0 +1,1316 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Microchip KSZ9477 switch driver main logic
+ *
+ * Copyright (C) 2017-2018 Microchip Technology Inc.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "ksz_priv.h"
+#include "ksz_common.h"
+#include "ksz_9477_reg.h"
+
+static const struct {
+   int index;
+   char string[ETH_GSTRING_LEN];
+} ksz9477_mib_names[TOTAL_SWITCH_COUNTER_NUM] = {
+   { 0x00, "rx_hi" },
+   { 0x01, "rx_undersize" },
+   { 0x02, "rx_fragments" },
+   { 0x03, "rx_oversize" },
+   { 0x04, "rx_jabbers" },
+   { 0x05, "rx_symbol_err" },
+   { 0x06, "rx_crc_err" },
+   { 0x07, "rx_align_err" },
+   { 0x08, "rx_mac_ctrl" },
+   { 0x09, "rx_pause" },
+   { 0x0A, "rx_bcast" },
+   { 0x0B, "rx_mcast" },
+   { 0x0C, "rx_ucast" },
+   { 0x0D, "rx_64_or_less" },
+   { 0x0E, "rx_65_127" },
+   { 0x0F, "rx_128_255" },
+   { 0x10, "rx_256_511" },
+   { 0x11, "rx_512_1023" },
+   { 0x12, "rx_1024_1522" },
+   { 0x13, "rx_1523_2000" },
+   { 0x14, "rx_2001" },
+   { 0x15, "tx_hi" },
+   { 0x16, "tx_late_col" },
+   { 0x17, "tx_pause" },
+   { 0x18, "tx_bcast" },
+   { 0x19, "tx_mcast" },
+   { 0x1A, "tx_ucast" },
+   { 0x1B, "tx_deferred" },
+   { 0x1C, "tx_total_col" },
+   { 0x1D, "tx_exc_col" },
+   { 0x1E, "tx_single_col" },
+   { 0x1F, "tx_mult_col" },
+   { 0x80, "rx_total" },
+   { 0x81, "tx_total" },
+   { 0x82, "rx_discards" },
+   { 0x83, "tx_discards" },
+};
+
+static void ksz9477_cfg32(struct ksz_device *dev, u32 addr, u32 bits, bool set)
+{
+   u32 data;
+
+   ksz_read32(dev, addr, );
+   if (set)
+   data |= bits;
+   else
+   data &= ~bits;
+   ksz_write32(dev, addr, data);
+}
+
+static void ksz9477_port_cfg32(struct ksz_device *dev, int port, int offset,
+  u32 bits, bool set)
+{
+   u32 addr;
+   u32 data;
+
+   addr = PORT_CTRL_ADDR(port, offset);
+   ksz_read32(dev, addr, );
+
+   if (set)
+   data |= bits;
+   else
+   data &= ~bits;
+
+   ksz_write32(dev, addr, data);
+}
+
+static int ksz9477_wait_vlan_ctrl_ready(struct ksz_device *dev, u32 waiton,
+   int timeout)
+{
+   u8 data;
+
+   do {
+   ksz_read8(dev, REG_SW_VLAN_CTRL, );
+   if (!(data & waiton))
+   break;
+   usleep_range(1, 10);
+   } while 

[PATCH v4 net-next 2/6] net: dsa: microchip: clean up code

2018-11-20 Thread Tristram.Ha
From: Tristram Ha 

Clean up code according to patch check suggestions.

Signed-off-by: Tristram Ha 
Reviewed-by: Woojung Huh 
Reviewed-by: Pavel Machek 
Reviewed-by: Florian Fainelli 
Reviewed-by: Andrew Lunn 
---
 drivers/net/dsa/microchip/ksz_common.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/dsa/microchip/ksz_common.c 
b/drivers/net/dsa/microchip/ksz_common.c
index d47d03b..eb833df 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -890,9 +890,9 @@ static void ksz_port_mdb_add(struct dsa_switch *ds, int 
port,
 
if (static_table[0] & ALU_V_STATIC_VALID) {
/* check this has same vid & mac address */
-   if (((static_table[2] >> ALU_V_FID_S) == (mdb->vid)) &&
+   if (((static_table[2] >> ALU_V_FID_S) == mdb->vid) &&
((static_table[2] & ALU_V_MAC_ADDR_HI) == mac_hi) &&
-   (static_table[3] == mac_lo)) {
+   static_table[3] == mac_lo) {
/* found matching one */
break;
}
@@ -963,9 +963,9 @@ static int ksz_port_mdb_del(struct dsa_switch *ds, int port,
if (static_table[0] & ALU_V_STATIC_VALID) {
/* check this has same vid & mac address */
 
-   if (((static_table[2] >> ALU_V_FID_S) == (mdb->vid)) &&
+   if (((static_table[2] >> ALU_V_FID_S) == mdb->vid) &&
((static_table[2] & ALU_V_MAC_ADDR_HI) == mac_hi) &&
-   (static_table[3] == mac_lo)) {
+   static_table[3] == mac_lo) {
/* found matching one */
break;
}
-- 
1.9.1



[PATCH v4 net-next 3/6] net: dsa: microchip: rename some functions with ksz9477 prefix

2018-11-20 Thread Tristram.Ha
From: Tristram Ha 

Rename some functions with ksz9477 prefix to separate chip specific code
from common code.

Signed-off-by: Tristram Ha 
Reviewed-by: Woojung Huh 
Reviewed-by: Pavel Machek 
Reviewed-by: Florian Fainelli 
Reviewed-by: Andrew Lunn 
---
 drivers/net/dsa/microchip/ksz_common.c | 116 +
 1 file changed, 59 insertions(+), 57 deletions(-)

diff --git a/drivers/net/dsa/microchip/ksz_common.c 
b/drivers/net/dsa/microchip/ksz_common.c
index eb833df..50b24dc 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -253,9 +253,8 @@ static int wait_alu_sta_ready(struct ksz_device *dev, u32 
waiton, int timeout)
return 0;
 }
 
-static int ksz_reset_switch(struct dsa_switch *ds)
+static int ksz9477_reset_switch(struct ksz_device *dev)
 {
-   struct ksz_device *dev = ds->priv;
u8 data8;
u16 data16;
u32 data32;
@@ -288,7 +287,7 @@ static int ksz_reset_switch(struct dsa_switch *ds)
return 0;
 }
 
-static void port_setup(struct ksz_device *dev, int port, bool cpu_port)
+static void ksz9477_port_setup(struct ksz_device *dev, int port, bool cpu_port)
 {
u8 data8;
u16 data16;
@@ -334,7 +333,7 @@ static void port_setup(struct ksz_device *dev, int port, 
bool cpu_port)
ksz_pread16(dev, port, REG_PORT_PHY_INT_ENABLE, );
 }
 
-static void ksz_config_cpu_port(struct dsa_switch *ds)
+static void ksz9477_config_cpu_port(struct dsa_switch *ds)
 {
struct ksz_device *dev = ds->priv;
int i;
@@ -346,12 +345,12 @@ static void ksz_config_cpu_port(struct dsa_switch *ds)
dev->cpu_port = i;
 
/* enable cpu port */
-   port_setup(dev, i, true);
+   ksz9477_port_setup(dev, i, true);
}
}
 }
 
-static int ksz_setup(struct dsa_switch *ds)
+static int ksz9477_setup(struct dsa_switch *ds)
 {
struct ksz_device *dev = ds->priv;
int ret = 0;
@@ -361,7 +360,7 @@ static int ksz_setup(struct dsa_switch *ds)
if (!dev->vlan_cache)
return -ENOMEM;
 
-   ret = ksz_reset_switch(ds);
+   ret = ksz9477_reset_switch(dev);
if (ret) {
dev_err(ds->dev, "failed to reset switch\n");
return ret;
@@ -370,7 +369,7 @@ static int ksz_setup(struct dsa_switch *ds)
/* accept packet up to 2000bytes */
ksz_cfg(dev, REG_SW_MAC_CTRL_1, SW_LEGAL_PACKET_DISABLE, true);
 
-   ksz_config_cpu_port(ds);
+   ksz9477_config_cpu_port(ds);
 
ksz_cfg(dev, REG_SW_MAC_CTRL_1, MULTICAST_STORM_DISABLE, true);
 
@@ -383,13 +382,13 @@ static int ksz_setup(struct dsa_switch *ds)
return 0;
 }
 
-static enum dsa_tag_protocol ksz_get_tag_protocol(struct dsa_switch *ds,
- int port)
+static enum dsa_tag_protocol ksz9477_get_tag_protocol(struct dsa_switch *ds,
+ int port)
 {
return DSA_TAG_PROTO_KSZ;
 }
 
-static int ksz_phy_read16(struct dsa_switch *ds, int addr, int reg)
+static int ksz9477_phy_read16(struct dsa_switch *ds, int addr, int reg)
 {
struct ksz_device *dev = ds->priv;
u16 val = 0;
@@ -399,7 +398,8 @@ static int ksz_phy_read16(struct dsa_switch *ds, int addr, 
int reg)
return val;
 }
 
-static int ksz_phy_write16(struct dsa_switch *ds, int addr, int reg, u16 val)
+static int ksz9477_phy_write16(struct dsa_switch *ds, int addr, int reg,
+  u16 val)
 {
struct ksz_device *dev = ds->priv;
 
@@ -414,7 +414,7 @@ static int ksz_enable_port(struct dsa_switch *ds, int port,
struct ksz_device *dev = ds->priv;
 
/* setup slave port */
-   port_setup(dev, port, false);
+   ksz9477_port_setup(dev, port, false);
 
return 0;
 }
@@ -436,8 +436,8 @@ static int ksz_sset_count(struct dsa_switch *ds, int port, 
int sset)
return TOTAL_SWITCH_COUNTER_NUM;
 }
 
-static void ksz_get_strings(struct dsa_switch *ds, int port,
-   u32 stringset, uint8_t *buf)
+static void ksz9477_get_strings(struct dsa_switch *ds, int port,
+   u32 stringset, uint8_t *buf)
 {
int i;
 
@@ -490,7 +490,8 @@ static void ksz_get_ethtool_stats(struct dsa_switch *ds, 
int port,
mutex_unlock(>stats_mutex);
 }
 
-static void ksz_port_stp_state_set(struct dsa_switch *ds, int port, u8 state)
+static void ksz9477_port_stp_state_set(struct dsa_switch *ds, int port,
+  u8 state)
 {
struct ksz_device *dev = ds->priv;
u8 data;
@@ -535,7 +536,8 @@ static void ksz_port_fast_age(struct dsa_switch *ds, int 
port)
ksz_write8(dev, REG_SW_LUE_CTRL_1, data8);
 }
 
-static int ksz_port_vlan_filtering(struct dsa_switch *ds, int port, bool flag)
+static int ksz9477_port_vlan_filtering(struct dsa_switch *ds, int port,
+ 

[PATCH v4 net-next 6/6] net: dsa: microchip: rename ksz_9477_reg.h to ksz9477_reg.h

2018-11-20 Thread Tristram.Ha
From: Tristram Ha 

Rename ksz_9477_reg.h to ksz9477_reg.h for consistency as the product
name is always KSZ.

Signed-off-by: Tristram Ha 
Reviewed-by: Woojung Huh 
Reviewed-by: Andrew Lunn 
---
 drivers/net/dsa/microchip/ksz9477.c | 2 +-
 drivers/net/dsa/microchip/{ksz_9477_reg.h => ksz9477_reg.h} | 0
 drivers/net/dsa/microchip/ksz_priv.h| 2 +-
 3 files changed, 2 insertions(+), 2 deletions(-)
 rename drivers/net/dsa/microchip/{ksz_9477_reg.h => ksz9477_reg.h} (100%)

diff --git a/drivers/net/dsa/microchip/ksz9477.c 
b/drivers/net/dsa/microchip/ksz9477.c
index 80df6c0..0684657 100644
--- a/drivers/net/dsa/microchip/ksz9477.c
+++ b/drivers/net/dsa/microchip/ksz9477.c
@@ -19,7 +19,7 @@
 
 #include "ksz_priv.h"
 #include "ksz_common.h"
-#include "ksz_9477_reg.h"
+#include "ksz9477_reg.h"
 
 static const struct {
int index;
diff --git a/drivers/net/dsa/microchip/ksz_9477_reg.h 
b/drivers/net/dsa/microchip/ksz9477_reg.h
similarity index 100%
rename from drivers/net/dsa/microchip/ksz_9477_reg.h
rename to drivers/net/dsa/microchip/ksz9477_reg.h
diff --git a/drivers/net/dsa/microchip/ksz_priv.h 
b/drivers/net/dsa/microchip/ksz_priv.h
index 74c5c1a..a38ff08 100644
--- a/drivers/net/dsa/microchip/ksz_priv.h
+++ b/drivers/net/dsa/microchip/ksz_priv.h
@@ -14,7 +14,7 @@
 #include 
 #include 
 
-#include "ksz_9477_reg.h"
+#include "ksz9477_reg.h"
 
 struct ksz_io_ops;
 
-- 
1.9.1



[PATCH v4 net-next 1/6] net: dsa: microchip: replace license with GPL

2018-11-20 Thread Tristram.Ha
From: Tristram Ha 

Replace license with GPL.

Signed-off-by: Tristram Ha 
Reviewed-by: Woojung Huh 
Reviewed-by: Andrew Lunn 
Acked-by: Pavel Machek 
---
 drivers/net/dsa/microchip/ksz_9477_reg.h | 17 +++--
 drivers/net/dsa/microchip/ksz_common.c   | 15 ++-
 drivers/net/dsa/microchip/ksz_priv.h | 17 +++--
 drivers/net/dsa/microchip/ksz_spi.c  | 15 ++-
 4 files changed, 10 insertions(+), 54 deletions(-)

diff --git a/drivers/net/dsa/microchip/ksz_9477_reg.h 
b/drivers/net/dsa/microchip/ksz_9477_reg.h
index 6aa6752..2938e89 100644
--- a/drivers/net/dsa/microchip/ksz_9477_reg.h
+++ b/drivers/net/dsa/microchip/ksz_9477_reg.h
@@ -1,19 +1,8 @@
-/*
- * Microchip KSZ9477 register definitions
- *
- * Copyright (C) 2017
+/* SPDX-License-Identifier: GPL-2.0
  *
- * Permission to use, copy, modify, and/or distribute this software for any
- * purpose with or without fee is hereby granted, provided that the above
- * copyright notice and this permission notice appear in all copies.
+ * Microchip KSZ9477 register definitions
  *
- * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
- * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
- * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
- * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
- * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
- * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ * Copyright (C) 2017-2018 Microchip Technology Inc.
  */
 
 #ifndef __KSZ9477_REGS_H
diff --git a/drivers/net/dsa/microchip/ksz_common.c 
b/drivers/net/dsa/microchip/ksz_common.c
index 86b6464..d47d03b 100644
--- a/drivers/net/dsa/microchip/ksz_common.c
+++ b/drivers/net/dsa/microchip/ksz_common.c
@@ -1,19 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Microchip switch driver main logic
  *
- * Copyright (C) 2017
- *
- * Permission to use, copy, modify, and/or distribute this software for any
- * purpose with or without fee is hereby granted, provided that the above
- * copyright notice and this permission notice appear in all copies.
- *
- * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
- * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
- * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
- * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
- * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
- * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ * Copyright (C) 2017-2018 Microchip Technology Inc.
  */
 
 #include 
diff --git a/drivers/net/dsa/microchip/ksz_priv.h 
b/drivers/net/dsa/microchip/ksz_priv.h
index 2a98dbd..6a27933 100644
--- a/drivers/net/dsa/microchip/ksz_priv.h
+++ b/drivers/net/dsa/microchip/ksz_priv.h
@@ -1,19 +1,8 @@
-/*
- * Microchip KSZ series switch common definitions
- *
- * Copyright (C) 2017
+/* SPDX-License-Identifier: GPL-2.0
  *
- * Permission to use, copy, modify, and/or distribute this software for any
- * purpose with or without fee is hereby granted, provided that the above
- * copyright notice and this permission notice appear in all copies.
+ * Microchip KSZ series switch common definitions
  *
- * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
- * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
- * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
- * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
- * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
- * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+ * Copyright (C) 2017-2018 Microchip Technology Inc.
  */
 
 #ifndef __KSZ_PRIV_H
diff --git a/drivers/net/dsa/microchip/ksz_spi.c 
b/drivers/net/dsa/microchip/ksz_spi.c
index 8c1778b..dc70f48 100644
--- a/drivers/net/dsa/microchip/ksz_spi.c
+++ b/drivers/net/dsa/microchip/ksz_spi.c
@@ -1,19 +1,8 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * Microchip KSZ series register access through SPI
  *
- * Copyright (C) 2017
- *
- * Permission to use, copy, modify, and/or distribute this software for any
- * purpose with or without fee is hereby granted, provided that the above
- * copyright notice and this permission notice appear in all copies.
- *
- * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
- * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
- * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
- * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
- * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER 

[PATCH v4 net-next 0/6] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers

2018-11-20 Thread Tristram.Ha
From: Tristram Ha 

This series of patches modifies the original KSZ9477 DSA driver so
that other KSZ switch drivers can be added and share the common code.

There are several steps to accomplish this.  First, rename some
functions with a prefix indicating chip-specific code.  Second, move
common code into a header that can be shared.  Last, modify tag_ksz.c
so that it can handle the different tail tag formats used by the
various KSZ switch drivers.

ksz_common.c will contain the common code used by all KSZ switch drivers.
ksz9477.c will contain KSZ9477 code from the original ksz_common.c.
ksz9477_spi.c is renamed from ksz_spi.c.
ksz9477_reg.h is renamed from ksz_9477_reg.h.
ksz_common.h is added to provide common code access to KSZ switch
drivers.
ksz_spi.h is added to provide common SPI access functions to KSZ SPI
drivers.

v4
- Patches were removed to concentrate on changing driver structure without
adding new code.

v3
- The phy_device structure is used to hold port link information
- A structure is passed in ksz_xmit and ksz_rcv instead of function pointer
- Switch offload forwarding is supported

v2
- Initialize reg_mutex before use
- The alu_mutex is only used inside chip specific functions

v1
- Each patch in the set is self-contained
- Use ksz9477 prefix to indicate KSZ9477 specific code

Tristram Ha (6):
  net: dsa: microchip: replace license with GPL
  net: dsa: microchip: clean up code
  net: dsa: microchip: rename some functions with ksz9477 prefix
  net: dsa: microchip: rename ksz_spi.c to ksz9477_spi.c
  net: dsa: microchip: break KSZ9477 DSA driver into two files
  net: dsa: microchip: rename ksz_9477_reg.h to ksz9477_reg.h

 drivers/net/dsa/microchip/Kconfig  |   16 +-
 drivers/net/dsa/microchip/Makefile |5 +-
 drivers/net/dsa/microchip/ksz9477.c| 1316 
 .../microchip/{ksz_9477_reg.h => ksz9477_reg.h}|   17 +-
 drivers/net/dsa/microchip/ksz9477_spi.c|  177 +++
 drivers/net/dsa/microchip/ksz_common.c | 1183 +++---
 drivers/net/dsa/microchip/ksz_common.h |  214 
 drivers/net/dsa/microchip/ksz_priv.h   |  245 ++--
 drivers/net/dsa/microchip/ksz_spi.c|  217 
 drivers/net/dsa/microchip/ksz_spi.h|   69 +
 10 files changed, 2039 insertions(+), 1420 deletions(-)
 create mode 100644 drivers/net/dsa/microchip/ksz9477.c
 rename drivers/net/dsa/microchip/{ksz_9477_reg.h => ksz9477_reg.h} (98%)
 create mode 100644 drivers/net/dsa/microchip/ksz9477_spi.c
 create mode 100644 drivers/net/dsa/microchip/ksz_common.h
 delete mode 100644 drivers/net/dsa/microchip/ksz_spi.c
 create mode 100644 drivers/net/dsa/microchip/ksz_spi.h

-- 
1.9.1



[PATCH v4 net-next 4/6] net: dsa: microchip: rename ksz_spi.c to ksz9477_spi.c

2018-11-20 Thread Tristram.Ha
From: Tristram Ha 

Rename ksz_spi.c to ksz9477_spi.c and update Kconfig in preparation to add
more KSZ switch drivers.

Signed-off-by: Tristram Ha 
Reviewed-by: Woojung Huh 
Reviewed-by: Pavel Machek 
Reviewed-by: Florian Fainelli 
Reviewed-by: Andrew Lunn 
---
 drivers/net/dsa/microchip/Kconfig  | 12 ++--
 drivers/net/dsa/microchip/Makefile |  4 ++--
 drivers/net/dsa/microchip/{ksz_spi.c => ksz9477_spi.c} |  0
 3 files changed, 8 insertions(+), 8 deletions(-)
 rename drivers/net/dsa/microchip/{ksz_spi.c => ksz9477_spi.c} (100%)

diff --git a/drivers/net/dsa/microchip/Kconfig 
b/drivers/net/dsa/microchip/Kconfig
index a8b8f59..4e25fe4 100644
--- a/drivers/net/dsa/microchip/Kconfig
+++ b/drivers/net/dsa/microchip/Kconfig
@@ -1,12 +1,12 @@
-menuconfig MICROCHIP_KSZ
-   tristate "Microchip KSZ series switch support"
+menuconfig NET_DSA_MICROCHIP_KSZ9477
+   tristate "Microchip KSZ9477 series switch support"
depends on NET_DSA
select NET_DSA_TAG_KSZ
help
- This driver adds support for Microchip KSZ switch chips.
+ This driver adds support for Microchip KSZ9477 switch chips.
 
-config MICROCHIP_KSZ_SPI_DRIVER
-   tristate "KSZ series SPI connected switch driver"
-   depends on MICROCHIP_KSZ && SPI
+config NET_DSA_MICROCHIP_KSZ9477_SPI
+   tristate "KSZ9477 series SPI connected switch driver"
+   depends on NET_DSA_MICROCHIP_KSZ9477 && SPI
help
  Select to enable support for registering switches configured through 
SPI.
diff --git a/drivers/net/dsa/microchip/Makefile 
b/drivers/net/dsa/microchip/Makefile
index ed335e2..9393e73 100644
--- a/drivers/net/dsa/microchip/Makefile
+++ b/drivers/net/dsa/microchip/Makefile
@@ -1,2 +1,2 @@
-obj-$(CONFIG_MICROCHIP_KSZ)+= ksz_common.o
-obj-$(CONFIG_MICROCHIP_KSZ_SPI_DRIVER) += ksz_spi.o
+obj-$(CONFIG_NET_DSA_MICROCHIP_KSZ9477)+= ksz_common.o
+obj-$(CONFIG_NET_DSA_MICROCHIP_KSZ9477_SPI)+= ksz9477_spi.o
diff --git a/drivers/net/dsa/microchip/ksz_spi.c 
b/drivers/net/dsa/microchip/ksz9477_spi.c
similarity index 100%
rename from drivers/net/dsa/microchip/ksz_spi.c
rename to drivers/net/dsa/microchip/ksz9477_spi.c
-- 
1.9.1



Re: [PATCH bpf-next v2] libbpf: make sure bpf headers are c++ include-able

2018-11-20 Thread Alexei Starovoitov
On Tue, Nov 20, 2018 at 01:37:23PM -0800, Stanislav Fomichev wrote:
> Wrap headers in extern "C" to turn off C++ name mangling.
> This simplifies including libbpf in C++ and linking against it.
> 
> v2 changes:
> * do the same for btf.h
> 
> Signed-off-by: Stanislav Fomichev 
> ---
>  tools/lib/bpf/bpf.h| 9 +
>  tools/lib/bpf/btf.h| 8 
>  tools/lib/bpf/libbpf.h | 9 +
>  3 files changed, 26 insertions(+)
> 
> diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
> index 26a51538213c..9ea3aec82d8a 100644
> --- a/tools/lib/bpf/bpf.h
> +++ b/tools/lib/bpf/bpf.h
> @@ -27,6 +27,10 @@
>  #include 
>  #include 
>  
> +#ifdef __cplusplus
> +extern "C" {
> +#endif

Acked-by: Alexei Starovoitov 

was wondering whether it's possible to make it testable.
HOSTCXX is available, but I don't see much of the kernel tree
using it...



Re: [PATCH net v2] net/sched: act_police: fix race condition on state variables

2018-11-20 Thread Eric Dumazet
On Tue, Nov 20, 2018 at 3:28 PM David Miller  wrote:
>
> From: Davide Caratti 
> Date: Tue, 20 Nov 2018 22:18:44 +0100
>
> > after 'police' configuration parameters were converted to use RCU instead
> > of spinlock, the state variables used to compute the traffic rate (namely
> > 'tcfp_toks', 'tcfp_ptoks' and 'tcfp_t_c') are erroneously read/updated in
> > the traffic path without any protection.
> >
> > Use a dedicated spinlock to avoid race conditions on these variables, and
> > ensure proper cache-line alignment. In this way, 'police' is still faster
> > than what we observed when 'tcf_lock' was used in the traffic path, i.e.
> > reverting commit 2d550dbad83c ("net/sched: act_police: don't use spinlock
> > in the data path"). Moreover, we preserve the throughput improvement that
> > was obtained after 'police' started using per-cpu counters, when 'avrate'
> > is used instead of 'rate'.
> >
> > Changes since v1 (thanks to Eric Dumazet):
> > - call ktime_get_ns() before acquiring the lock in the traffic path
> > - use a dedicated spinlock instead of tcf_lock
> > - improve cache-line usage
> >
> > Fixes: 2d550dbad83c ("net/sched: act_police: don't use spinlock in the data 
> > path")
> > Reported-and-suggested-by: Eric Dumazet 
> > Signed-off-by: Davide Caratti 
>
> Applied.

We need a fix to make lockdep happy, as reported by Cong.

Cong, do you want to handle this ?

Thanks !


Re: [PATCH net v2] net/sched: act_police: fix race condition on state variables

2018-11-20 Thread David Miller
From: Davide Caratti 
Date: Tue, 20 Nov 2018 22:18:44 +0100

> after 'police' configuration parameters were converted to use RCU instead
> of spinlock, the state variables used to compute the traffic rate (namely
> 'tcfp_toks', 'tcfp_ptoks' and 'tcfp_t_c') are erroneously read/updated in
> the traffic path without any protection.
> 
> Use a dedicated spinlock to avoid race conditions on these variables, and
> ensure proper cache-line alignment. In this way, 'police' is still faster
> than what we observed when 'tcf_lock' was used in the traffic path, i.e.
> reverting commit 2d550dbad83c ("net/sched: act_police: don't use spinlock
> in the data path"). Moreover, we preserve the throughput improvement that
> was obtained after 'police' started using per-cpu counters, when 'avrate'
> is used instead of 'rate'.
> 
> Changes since v1 (thanks to Eric Dumazet):
> - call ktime_get_ns() before acquiring the lock in the traffic path
> - use a dedicated spinlock instead of tcf_lock
> - improve cache-line usage
> 
> Fixes: 2d550dbad83c ("net/sched: act_police: don't use spinlock in the data 
> path")
> Reported-and-suggested-by: Eric Dumazet 
> Signed-off-by: Davide Caratti 

Applied.


Re: [PATCH bpf-next] bpf: libbpf: retry program creation without the name

2018-11-20 Thread Stanislav Fomichev
On 11/20, Alexei Starovoitov wrote:
> On Wed, Nov 21, 2018 at 12:18:57AM +0100, Daniel Borkmann wrote:
> > On 11/21/2018 12:04 AM, Alexei Starovoitov wrote:
> > > On Tue, Nov 20, 2018 at 01:19:05PM -0800, Stanislav Fomichev wrote:
> > >> On 11/20, Alexei Starovoitov wrote:
> > >>> On Mon, Nov 19, 2018 at 04:46:25PM -0800, Stanislav Fomichev wrote:
> >  [Recent commit 23499442c319 ("bpf: libbpf: retry map creation without
> >  the name") fixed this issue for maps, let's do the same for programs.]
> > 
> >  Since commit 88cda1c9da02 ("bpf: libbpf: Provide basic API support
> >  to specify BPF obj name"), libbpf unconditionally sets bpf_attr->name
> >  for programs. Pre v4.14 kernels don't know about programs names and
> >  return an error about unexpected non-zero data. Retry sys_bpf without
> >  a program name to cover older kernels.
> > 
> >  Signed-off-by: Stanislav Fomichev 
> >  ---
> >   tools/lib/bpf/bpf.c | 10 ++
> >   1 file changed, 10 insertions(+)
> > 
> >  diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> >  index 961e1b9fc592..cbe9d757c646 100644
> >  --- a/tools/lib/bpf/bpf.c
> >  +++ b/tools/lib/bpf/bpf.c
> >  @@ -212,6 +212,16 @@ int bpf_load_program_xattr(const struct 
> >  bpf_load_program_attr *load_attr,
> > if (fd >= 0 || !log_buf || !log_buf_sz)
> > return fd;
> >   
> >  +  if (fd < 0 && errno == E2BIG && load_attr->name) {
> >  +  /* Retry the same syscall, but without the name.
> >  +   * Pre v4.14 kernels don't support prog names.
> >  +   */
> > >>>
> > >>> I'm afraid that will put unnecessary stress on the kernel.
> > >>> This check needs to be tighter.
> > >>> Like E2BIG and anything in the log_buf probably means that
> > >>> E2BIG came from the verifier and nothing to do with prog_name.
> > >>> Asking kernel to repeat is an unnecessary work.
> > >>>
> > >>> In general we need to think beyond this single prog_name field.
> > >>> There are bunch of other fields in bpf_load_program_xattr() and older 
> > >>> kernels
> > >>> won't support them. Are we going to zero them out one by one
> > >>> and retry? I don't think that would be practical.
> > >> In general, we don't want to zero anything out. However,
> > >> for this particular problem the rationale is the following:
> > >> In commit 88cda1c9da02 we started unconditionally setting 
> > >> {prog,map}->name
> > >> from the 'higher' libbpf.c layer which breaks users on the older kernels.
> > >>
> > >>> Also libbpf silently ignoring prog_name is not great for debugging.
> > >>> A warning is needed.
> > >>> But it cannot be done out of lib/bpf/bpf.c, since it's a set of syscall
> > >>> wrappers.
> > >>> Imo such "old kernel -> lets retry" feature should probably be done
> > >>> at lib/bpf/libbpf.c level. inside load_program().
> > >> For maps bpftools calls bpf_create_map_xattr directly, that's why
> > >> for maps I did the retry on the lower level (and why for programs I 
> > >> initially
> > >> thought about doing the same). However, in this case maybe asking
> > >> user to omit 'name' argument might be a better option.
> > >>
> > >> For program names, I agree, we might think about doing it on the higher
> > >> level (although I'm not sure whether we want to have different API
> > >> expectations, i.e. bpf_create_map_xattr ignoring the name and
> > >> bpf_load_program_xattr not ignoring the name).
> > >>
> > >> So given that rationale above, what do you think is the best way to
> > >> move forward?
> > >> 1. Same patch, but tighten the retry check inside bpf_load_program_xattr 
> > >> ?
> > >> 2. Move this retry logic into load_program and have different handling
> > >>for bpf_create_map_xattr vs bpf_load_program_xattr ?
> > >> 3. Do 2 and move the retry check for maps from bpf_create_map_xattr
> > >>into bpf_object__create_maps ?
> > >>
> > >> (I'm slightly leaning towards #3)
> > > 
> > > me too. I think it's cleaner for maps to do it in
> > > bpf_object__create_maps().
> > > Originally bpf.c was envisioned to be a thin layer on top of bpf syscall.
> > > Whereas 'smart bits' would go into libbpf.c
> > 
> > Can't we create in bpf_object__load() a small helper 
> > bpf_object__probe_caps()
> > which would figure this out _once_ upon start with a few things to probe for
> > availability in the underlying kernel for maps and programs? E.g. programs
> > it could try to inject a tiny 'r0 = 0; exit' snippet where we figure out
> > things like prog name support etc. Given underlying kernel doesn't change, 
> > we
> > would only try this once and it doesn't require fallback every time.
> 
> +1. great idea!
Sounds good, let me try to do it.

It sounds more like a recent LPC proposal/idea to have some sys_bpf option
to query BPF features. This new bpf_object__probe_caps can probably query
that in the future if we eventually add support for it.


Re: [PATCH bpf-next v2] bpf: fix a compilation error when CONFIG_BPF_SYSCALL is not defined

2018-11-20 Thread Alexei Starovoitov
On Tue, Nov 20, 2018 at 02:08:20PM -0800, Yonghong Song wrote:
> Kernel test robot (l...@intel.com) reports a compilation error at
>   https://www.spinics.net/lists/netdev/msg534913.html
> introduced by commit 838e96904ff3 ("bpf: Introduce bpf_func_info").
> 
> If CONFIG_BPF is defined and CONFIG_BPF_SYSCALL is not defined,
> the following error will appear:
>   kernel/bpf/core.c:414: undefined reference to `btf_type_by_id'
>   kernel/bpf/core.c:415: undefined reference to `btf_name_by_offset'
> 
> When CONFIG_BPF_SYSCALL is not defined,
> let us define stub inline functions for btf_type_by_id()
> and btf_name_by_offset() in include/linux/btf.h.
> This way, the compilation failure can be avoided.
> 
> Fixes: 838e96904ff3 ("bpf: Introduce bpf_func_info")
> Reported-by: kbuild test robot 
> Cc: Martin KaFai Lau 
> Signed-off-by: Yonghong Song 

Applied, Thanks



Re: [PATCH bpf-next] bpf: libbpf: retry program creation without the name

2018-11-20 Thread Alexei Starovoitov
On Wed, Nov 21, 2018 at 12:18:57AM +0100, Daniel Borkmann wrote:
> On 11/21/2018 12:04 AM, Alexei Starovoitov wrote:
> > On Tue, Nov 20, 2018 at 01:19:05PM -0800, Stanislav Fomichev wrote:
> >> On 11/20, Alexei Starovoitov wrote:
> >>> On Mon, Nov 19, 2018 at 04:46:25PM -0800, Stanislav Fomichev wrote:
>  [Recent commit 23499442c319 ("bpf: libbpf: retry map creation without
>  the name") fixed this issue for maps, let's do the same for programs.]
> 
>  Since commit 88cda1c9da02 ("bpf: libbpf: Provide basic API support
>  to specify BPF obj name"), libbpf unconditionally sets bpf_attr->name
>  for programs. Pre v4.14 kernels don't know about programs names and
>  return an error about unexpected non-zero data. Retry sys_bpf without
>  a program name to cover older kernels.
> 
>  Signed-off-by: Stanislav Fomichev 
>  ---
>   tools/lib/bpf/bpf.c | 10 ++
>   1 file changed, 10 insertions(+)
> 
>  diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
>  index 961e1b9fc592..cbe9d757c646 100644
>  --- a/tools/lib/bpf/bpf.c
>  +++ b/tools/lib/bpf/bpf.c
>  @@ -212,6 +212,16 @@ int bpf_load_program_xattr(const struct 
>  bpf_load_program_attr *load_attr,
>   if (fd >= 0 || !log_buf || !log_buf_sz)
>   return fd;
>   
>  +if (fd < 0 && errno == E2BIG && load_attr->name) {
>  +/* Retry the same syscall, but without the name.
>  + * Pre v4.14 kernels don't support prog names.
>  + */
> >>>
> >>> I'm afraid that will put unnecessary stress on the kernel.
> >>> This check needs to be tighter.
> >>> Like E2BIG and anything in the log_buf probably means that
> >>> E2BIG came from the verifier and nothing to do with prog_name.
> >>> Asking kernel to repeat is an unnecessary work.
> >>>
> >>> In general we need to think beyond this single prog_name field.
> >>> There are bunch of other fields in bpf_load_program_xattr() and older 
> >>> kernels
> >>> won't support them. Are we going to zero them out one by one
> >>> and retry? I don't think that would be practical.
> >> In general, we don't want to zero anything out. However,
> >> for this particular problem the rationale is the following:
> >> In commit 88cda1c9da02 we started unconditionally setting {prog,map}->name
> >> from the 'higher' libbpf.c layer which breaks users on the older kernels.
> >>
> >>> Also libbpf silently ignoring prog_name is not great for debugging.
> >>> A warning is needed.
> >>> But it cannot be done out of lib/bpf/bpf.c, since it's a set of syscall
> >>> wrappers.
> >>> Imo such "old kernel -> lets retry" feature should probably be done
> >>> at lib/bpf/libbpf.c level. inside load_program().
> >> For maps bpftools calls bpf_create_map_xattr directly, that's why
> >> for maps I did the retry on the lower level (and why for programs I 
> >> initially
> >> thought about doing the same). However, in this case maybe asking
> >> user to omit 'name' argument might be a better option.
> >>
> >> For program names, I agree, we might think about doing it on the higher
> >> level (although I'm not sure whether we want to have different API
> >> expectations, i.e. bpf_create_map_xattr ignoring the name and
> >> bpf_load_program_xattr not ignoring the name).
> >>
> >> So given that rationale above, what do you think is the best way to
> >> move forward?
> >> 1. Same patch, but tighten the retry check inside bpf_load_program_xattr ?
> >> 2. Move this retry logic into load_program and have different handling
> >>for bpf_create_map_xattr vs bpf_load_program_xattr ?
> >> 3. Do 2 and move the retry check for maps from bpf_create_map_xattr
> >>into bpf_object__create_maps ?
> >>
> >> (I'm slightly leaning towards #3)
> > 
> > me too. I think it's cleaner for maps to do it in
> > bpf_object__create_maps().
> > Originally bpf.c was envisioned to be a thin layer on top of bpf syscall.
> > Whereas 'smart bits' would go into libbpf.c
> 
> Can't we create in bpf_object__load() a small helper bpf_object__probe_caps()
> which would figure this out _once_ upon start with a few things to probe for
> availability in the underlying kernel for maps and programs? E.g. programs
> it could try to inject a tiny 'r0 = 0; exit' snippet where we figure out
> things like prog name support etc. Given underlying kernel doesn't change, we
> would only try this once and it doesn't require fallback every time.

+1. great idea!



Re: [PATCH iproute2-next 2/8] json: add %hhu helpers

2018-11-20 Thread David Ahern
On 11/19/18 6:40 PM, Jakub Kicinski wrote:
> On Mon, 19 Nov 2018 17:18:42 -0800, Stephen Hemminger wrote:
>>>  void jsonw_hu_field(json_writer_t *self, const char *prop, unsigned short 
>>> num)
>>>  {
>>> jsonw_name(self, prop);  
>>
>> Do you really need this? it turns out that because of C type
>> conversions print_uint should just work?
> 
> I wondered about that for a second, but I took the existence of
> jsonw_hu_field() etc. as a proof that explicit typing is preferred.
> 

Stephen: you ok with the explicit typing version?


Re: [PATCH bpf-next] bpf: libbpf: retry program creation without the name

2018-11-20 Thread Daniel Borkmann
On 11/21/2018 12:04 AM, Alexei Starovoitov wrote:
> On Tue, Nov 20, 2018 at 01:19:05PM -0800, Stanislav Fomichev wrote:
>> On 11/20, Alexei Starovoitov wrote:
>>> On Mon, Nov 19, 2018 at 04:46:25PM -0800, Stanislav Fomichev wrote:
 [Recent commit 23499442c319 ("bpf: libbpf: retry map creation without
 the name") fixed this issue for maps, let's do the same for programs.]

 Since commit 88cda1c9da02 ("bpf: libbpf: Provide basic API support
 to specify BPF obj name"), libbpf unconditionally sets bpf_attr->name
 for programs. Pre v4.14 kernels don't know about programs names and
 return an error about unexpected non-zero data. Retry sys_bpf without
 a program name to cover older kernels.

 Signed-off-by: Stanislav Fomichev 
 ---
  tools/lib/bpf/bpf.c | 10 ++
  1 file changed, 10 insertions(+)

 diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
 index 961e1b9fc592..cbe9d757c646 100644
 --- a/tools/lib/bpf/bpf.c
 +++ b/tools/lib/bpf/bpf.c
 @@ -212,6 +212,16 @@ int bpf_load_program_xattr(const struct 
 bpf_load_program_attr *load_attr,
if (fd >= 0 || !log_buf || !log_buf_sz)
return fd;
  
 +  if (fd < 0 && errno == E2BIG && load_attr->name) {
 +  /* Retry the same syscall, but without the name.
 +   * Pre v4.14 kernels don't support prog names.
 +   */
>>>
>>> I'm afraid that will put unnecessary stress on the kernel.
>>> This check needs to be tighter.
>>> Like E2BIG and anything in the log_buf probably means that
>>> E2BIG came from the verifier and nothing to do with prog_name.
>>> Asking kernel to repeat is an unnecessary work.
>>>
>>> In general we need to think beyond this single prog_name field.
>>> There are bunch of other fields in bpf_load_program_xattr() and older 
>>> kernels
>>> won't support them. Are we going to zero them out one by one
>>> and retry? I don't think that would be practical.
>> In general, we don't want to zero anything out. However,
>> for this particular problem the rationale is the following:
>> In commit 88cda1c9da02 we started unconditionally setting {prog,map}->name
>> from the 'higher' libbpf.c layer which breaks users on the older kernels.
>>
>>> Also libbpf silently ignoring prog_name is not great for debugging.
>>> A warning is needed.
>>> But it cannot be done out of lib/bpf/bpf.c, since it's a set of syscall
>>> wrappers.
>>> Imo such "old kernel -> lets retry" feature should probably be done
>>> at lib/bpf/libbpf.c level. inside load_program().
>> For maps bpftools calls bpf_create_map_xattr directly, that's why
>> for maps I did the retry on the lower level (and why for programs I initially
>> thought about doing the same). However, in this case maybe asking
>> user to omit 'name' argument might be a better option.
>>
>> For program names, I agree, we might think about doing it on the higher
>> level (although I'm not sure whether we want to have different API
>> expectations, i.e. bpf_create_map_xattr ignoring the name and
>> bpf_load_program_xattr not ignoring the name).
>>
>> So given that rationale above, what do you think is the best way to
>> move forward?
>> 1. Same patch, but tighten the retry check inside bpf_load_program_xattr ?
>> 2. Move this retry logic into load_program and have different handling
>>for bpf_create_map_xattr vs bpf_load_program_xattr ?
>> 3. Do 2 and move the retry check for maps from bpf_create_map_xattr
>>into bpf_object__create_maps ?
>>
>> (I'm slightly leaning towards #3)
> 
> me too. I think it's cleaner for maps to do it in
> bpf_object__create_maps().
> Originally bpf.c was envisioned to be a thin layer on top of bpf syscall.
> Whereas 'smart bits' would go into libbpf.c

Can't we create in bpf_object__load() a small helper bpf_object__probe_caps()
which would figure this out _once_ upon start with a few things to probe for
availability in the underlying kernel for maps and programs? E.g. programs
it could try to inject a tiny 'r0 = 0; exit' snippet where we figure out
things like prog name support etc. Given underlying kernel doesn't change, we
would only try this once and it doesn't require fallback every time.

> Right now this boundary is unfortunately blurry.
> May be as #4 long term option we'll introduce another 'smart' layer
> between bpf.c that will assume the latest kernel and libbpf.c that deals
> with elf. May be will call this new layer a 'compat' layer?
> For now I think doing #3 as you suggested is probably the best short term.
> 



Re: [PATCH v3 bpf-next 1/2] bpf: adding support for map in map in libbpf

2018-11-20 Thread Alexei Starovoitov
On Mon, Nov 19, 2018 at 10:42:21PM -0800, Nikita V. Shirokov wrote:
> idea is pretty simple. for specified map (pointed by struct bpf_map)
> we would provide descriptor of already loaded map, which is going to be
> used as a prototype for inner map. proposed workflow:
> 1) open bpf's object (bpf_object__open)
> 2) create bpf's map which is going to be used as a prototype
> 3) find (by name) map-in-map which you want to load and update w/
> descriptor of inner map w/ a new helper from this patch
> 4) load bpf program w/ bpf_object__load
> 
> inner_map_fd is ignored by any other maps aside from (hash|array) of
> maps
> 
> Signed-off-by: Nikita V. Shirokov 
> Acked-by: Yonghong Song 
> ---
>  tools/lib/bpf/libbpf.c | 11 ++-
>  tools/lib/bpf/libbpf.h |  2 ++
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index a01eb9584e52..7e130e0c8fc9 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -163,6 +163,7 @@ struct bpf_map {
>   char *name;
>   size_t offset;
>   int map_ifindex;
> + int inner_map_fd;
>   struct bpf_map_def def;
>   __u32 btf_key_type_id;
>   __u32 btf_value_type_id;
> @@ -653,8 +654,10 @@ bpf_object__init_maps(struct bpf_object *obj, int flags)
>* fd (fd=0 is stdin) when failure (zclose won't close
>* negative fd)).
>*/
> - for (i = 0; i < nr_maps; i++)
> + for (i = 0; i < nr_maps; i++) {
>   obj->maps[i].fd = -1;
> + obj->maps[i].inner_map_fd = -1;
> + }
>  
>   /*
>* Fill obj->maps using data in "maps" section.
> @@ -1146,6 +1149,7 @@ bpf_object__create_maps(struct bpf_object *obj)
>   create_attr.btf_fd = 0;
>   create_attr.btf_key_type_id = 0;
>   create_attr.btf_value_type_id = 0;
> + create_attr.inner_map_fd = map->inner_map_fd;
>  
>   if (obj->btf && !bpf_map_find_btf_info(map, obj->btf)) {
>   create_attr.btf_fd = btf__fd(obj->btf);
> @@ -2562,6 +2566,11 @@ void bpf_map__set_ifindex(struct bpf_map *map, __u32 
> ifindex)
>   map->map_ifindex = ifindex;
>  }
>  
> +void bpf_map__add_inner_map_fd(struct bpf_map *map, int fd)
> +{
> + map->inner_map_fd = fd;

I think the name bpf_map__set_inner_map_fd() would be more appropriate
and it should check that map->def->type == map-in-map && map->inner_map_fd == -1
before assigning new one.

Also the behavior of bpf_object__create_maps() is not great.
If nothing is set the function will be passing
create_attr.inner_map_fd == -1 to the kernel.
For regular maps that field is sadly ignored by kernel.
Only for map-in-map the value of -1 will be triggering map_create error.
Imo bpf_object__create_maps() should be doing:
if (create_attr.map_type == map-in-map && map->inner_map_fd >= 0)
 create_attr.inner_map_fd = map->inner_map_fd;
// otherwise keep it zero inited



[PATCH bpf-next v2 2/2] bpf: libbpf: move map name retry into libbpf.c

2018-11-20 Thread Stanislav Fomichev
To be in line with the previous commit ("bpf: libbpf: retry program
creation without the name"), do the retry at the higher level, not the
syscall level.

Signed-off-by: Stanislav Fomichev 
---
 tools/lib/bpf/bpf.c| 11 +--
 tools/lib/bpf/libbpf.c | 11 +++
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 836447bb4f14..ce1822194590 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -69,7 +69,6 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr 
*create_attr)
 {
__u32 name_len = create_attr->name ? strlen(create_attr->name) : 0;
union bpf_attr attr;
-   int ret;
 
	memset(&attr, '\0', sizeof(attr));
 
@@ -87,15 +86,7 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr 
*create_attr)
attr.map_ifindex = create_attr->map_ifindex;
attr.inner_map_fd = create_attr->inner_map_fd;
 
-   ret = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
-   if (ret < 0 && errno == EINVAL && create_attr->name) {
-   /* Retry the same syscall, but without the name.
-* Pre v4.14 kernels don't support map names.
-*/
-   memset(attr.map_name, 0, sizeof(attr.map_name));
-   return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
-   }
-   return ret;
+   return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
 }
 
 int bpf_create_map_node(enum bpf_map_type map_type, const char *name,
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 2ea14dfa28fc..c081b5b8f68f 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1184,6 +1184,17 @@ bpf_object__create_maps(struct bpf_object *obj)
*pfd = bpf_create_map_xattr(&create_attr);
}
 
+   if (*pfd < 0 && errno == EINVAL && create_attr.name) {
+   /* Retry the same operation, but without the name.
+* Pre v4.14 kernels don't support map names.
+*/
+   cp = libbpf_strerror_r(errno, errmsg, sizeof(errmsg));
+   pr_warning("Error in bpf_create_map_xattr(%s):%s(%d). Retrying without name.\n",
+  map->name, cp, errno);
+   create_attr.name = NULL;
+   *pfd = bpf_create_map_xattr(&create_attr);
+   }
+
if (*pfd < 0) {
size_t j;
 
-- 
2.19.1.1215.g8438c0b245-goog



[PATCH bpf-next v2 1/2] bpf: libbpf: retry program creation without the name

2018-11-20 Thread Stanislav Fomichev
[Recent commit 23499442c319 ("bpf: libbpf: retry map creation without
the name") fixed this issue for maps, let's do the same for programs.]

Since commit 88cda1c9da02 ("bpf: libbpf: Provide basic API support
to specify BPF obj name"), libbpf unconditionally sets bpf_attr->name
for programs. Pre v4.14 kernels don't know about program names and
return an error about unexpected non-zero data. Retry
bpf_load_program_xattr without a program name to cover older kernels.

Signed-off-by: Stanislav Fomichev 
---
 tools/lib/bpf/libbpf.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index cb6565d79603..2ea14dfa28fc 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -1401,6 +1401,18 @@ load_program(struct bpf_program *prog, struct bpf_insn 
*insns, int insns_cnt,
 
ret = bpf_load_program_xattr(&load_attr, log_buf, BPF_LOG_BUF_SIZE);
 
+   if (ret < 0 && errno == E2BIG && load_attr.name) {
+   /* Retry the same operation, but without the name.
+* Pre v4.14 kernels don't support prog names.
+*/
+   cp = libbpf_strerror_r(errno, errmsg, sizeof(errmsg));
+   pr_warning("Error in bpf_load_program_xattr(%s):%s(%d). Retrying without name.\n",
+  prog->name, cp, errno);
+   load_attr.name = NULL;
+   ret = bpf_load_program_xattr(&load_attr, log_buf,
+BPF_LOG_BUF_SIZE);
+   }
+
if (ret >= 0) {
*pfd = ret;
ret = 0;
-- 
2.19.1.1215.g8438c0b245-goog



Re: [PATCH bpf-next] bpf: libbpf: retry program creation without the name

2018-11-20 Thread Alexei Starovoitov
On Tue, Nov 20, 2018 at 01:19:05PM -0800, Stanislav Fomichev wrote:
> On 11/20, Alexei Starovoitov wrote:
> > On Mon, Nov 19, 2018 at 04:46:25PM -0800, Stanislav Fomichev wrote:
> > > [Recent commit 23499442c319 ("bpf: libbpf: retry map creation without
> > > the name") fixed this issue for maps, let's do the same for programs.]
> > > 
> > > Since commit 88cda1c9da02 ("bpf: libbpf: Provide basic API support
> > > to specify BPF obj name"), libbpf unconditionally sets bpf_attr->name
> > > for programs. Pre v4.14 kernels don't know about programs names and
> > > return an error about unexpected non-zero data. Retry sys_bpf without
> > > a program name to cover older kernels.
> > > 
> > > Signed-off-by: Stanislav Fomichev 
> > > ---
> > >  tools/lib/bpf/bpf.c | 10 ++
> > >  1 file changed, 10 insertions(+)
> > > 
> > > diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> > > index 961e1b9fc592..cbe9d757c646 100644
> > > --- a/tools/lib/bpf/bpf.c
> > > +++ b/tools/lib/bpf/bpf.c
> > > @@ -212,6 +212,16 @@ int bpf_load_program_xattr(const struct 
> > > bpf_load_program_attr *load_attr,
> > >   if (fd >= 0 || !log_buf || !log_buf_sz)
> > >   return fd;
> > >  
> > > + if (fd < 0 && errno == E2BIG && load_attr->name) {
> > > + /* Retry the same syscall, but without the name.
> > > +  * Pre v4.14 kernels don't support prog names.
> > > +  */
> > 
> > I'm afraid that will put unnecessary stress on the kernel.
> > This check needs to be tighter.
> > Like E2BIG and anything in the log_buf probably means that
> > E2BIG came from the verifier and nothing to do with prog_name.
> > Asking kernel to repeat is an unnecessary work.
> > 
> > In general we need to think beyond this single prog_name field.
> > There are bunch of other fields in bpf_load_program_xattr() and older 
> > kernels
> > won't support them. Are we going to zero them out one by one
> > and retry? I don't think that would be practical.
> In general, we don't want to zero anything out. However,
> for this particular problem the rationale is the following:
> In commit 88cda1c9da02 we started unconditionally setting {prog,map}->name
> from the 'higher' libbpf layer which breaks users on the older kernels.
> 
> > Also libbpf silently ignoring prog_name is not great for debugging.
> > A warning is needed.
> > But it cannot be done out of lib/bpf/bpf.c, since it's a set of syscall
> > wrappers.
> > Imo such "old kernel -> lets retry" feature should probably be done
> > at lib/bpf/libbpf.c level. inside load_program().
> For maps bpftools calls bpf_create_map_xattr directly, that's why
> for maps I did the retry on the lower level (and why for programs I initially
> thought about doing the same). However, in this case maybe asking
> user to omit 'name' argument might be a better option.
> 
> For program names, I agree, we might think about doing it on the higher
> level (although I'm not sure whether we want to have different API
> expectations, i.e. bpf_create_map_xattr ignoring the name and
> bpf_load_program_xattr not ignoring the name).
> 
> So given that rationale above, what do you think is the best way to
> move forward?
> 1. Same patch, but tighten the retry check inside bpf_load_program_xattr ?
> 2. Move this retry logic into load_program and have different handling
>for bpf_create_map_xattr vs bpf_load_program_xattr ?
> 3. Do 2 and move the retry check for maps from bpf_create_map_xattr
>into bpf_object__create_maps ?
> 
> (I'm slightly leaning towards #3)

me too. I think it's cleaner for maps to do it in
bpf_object__create_maps().
Originally bpf.c was envisioned to be a thin layer on top of bpf syscall.
Whereas 'smart bits' would go into libbpf.c
Right now this boundary is unfortunately blurry.
May be as #4 long term option we'll introduce another 'smart' layer
between bpf.c that will assume the latest kernel and libbpf.c that deals
with elf. May be will call this new layer a 'compat' layer?
For now I think doing #3 as you suggested is probably the best short term.



Re: [iproute2-next PATCH v3 1/2] tc: flower: Classify packets based port ranges

2018-11-20 Thread David Ahern
On 11/15/18 5:55 PM, Amritha Nambiar wrote:
> Added support for filtering based on port ranges.
> UAPI changes have been accepted into net-next.
> 
> Example:
> 1. Match on a port range:
> -
> $ tc filter add dev enp4s0 protocol ip parent ffff:\
>   prio 1 flower ip_proto tcp dst_port range 20-30 skip_hw\
>   action drop
> 
> $ tc -s filter show dev enp4s0 parent ffff:
> filter protocol ip pref 1 flower chain 0
> filter protocol ip pref 1 flower chain 0 handle 0x1
>   eth_type ipv4
>   ip_proto tcp
>   dst_port range 20-30
>   skip_hw
>   not_in_hw
> action order 1: gact action drop
>  random type none pass val 0
>  index 1 ref 1 bind 1 installed 85 sec used 3 sec
> Action statistics:
> Sent 460 bytes 10 pkt (dropped 10, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> 
> 2. Match on IP address and port range:
> --
> $ tc filter add dev enp4s0 protocol ip parent ffff:\
>   prio 1 flower dst_ip 192.168.1.1 ip_proto tcp dst_port range 100-200\
>   skip_hw action drop
> 
> $ tc -s filter show dev enp4s0 parent ffff:
> filter protocol ip pref 1 flower chain 0 handle 0x2
>   eth_type ipv4
>   ip_proto tcp
>   dst_ip 192.168.1.1
>   dst_port range 100-200
>   skip_hw
>   not_in_hw
> action order 1: gact action drop
>  random type none pass val 0
>  index 2 ref 1 bind 1 installed 58 sec used 2 sec
> Action statistics:
> Sent 920 bytes 20 pkt (dropped 20, overlimits 0 requeues 0)
> backlog 0b 0p requeues 0
> 
> v3:
> Modified flower_port_range_attr_type calls.
> 
> v2:
> Addressed Jiri's comment to sync output format with input
> 
> Signed-off-by: Amritha Nambiar 
> ---
>  include/uapi/linux/pkt_cls.h |7 ++
>  tc/f_flower.c|  143 
> +++---
>  2 files changed, 140 insertions(+), 10 deletions(-)
> 

applied to iproute2-next. Thanks




Re: Kernel 4.19 network performance - forwarding/routing normal users traffic

2018-11-20 Thread Paweł Staszewski



On 19.11.2018 at 22:59, David Ahern wrote:

On 11/9/18 5:06 PM, David Ahern wrote:

On 11/9/18 9:21 AM, David Ahern wrote:

Is there possible to add only counters from xdp for vlans ?
This will help me in testing.

I will take a look today at adding counters that you can dump using
bpftool. It will be a temporary solution for this xdp program only.


Same tree, kernel-tables-wip-02 branch. Compile kernel and install.
Compile samples as before.

new version:
 https://github.com/dsahern/linux.git bpf/kernel-tables-wip-03

This one prototypes incrementing counters for VLAN devices (rx/tx,
packets and bytes). Counters for netdevices representing physical ports
should be managed by the NIC driver.


Will test it today


Thanks

Paweł

I will look at what can be done for packet captures (e.g., xdpdump and
https://github.com/facebookincubator/katran/tree/master/tools). Most
likely a project for next week.



Re: [iproute2-next PATCH v3 2/2] man: tc-flower: Add explanation for range option

2018-11-20 Thread David Ahern
On 11/15/18 5:55 PM, Amritha Nambiar wrote:
> Add details explaining filtering based on port ranges.
> 
> Signed-off-by: Amritha Nambiar 
> ---
>  man/man8/tc-flower.8 |   12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/man/man8/tc-flower.8 b/man/man8/tc-flower.8
> index 8be8882..768bfa1 100644
> --- a/man/man8/tc-flower.8
> +++ b/man/man8/tc-flower.8
> @@ -56,8 +56,10 @@ flower \- flow based traffic control filter
>  .IR MASKED_IP_TTL " | { "
>  .BR dst_ip " | " src_ip " } "
>  .IR PREFIX " | { "
> -.BR dst_port " | " src_port " } "
> -.IR port_number " } | "
> +.BR dst_port " | " src_port " } { "
> +.IR port_number " | "
> +.B range
> +.IR min_port_number-max_port_number " } | "
>  .B tcp_flags
>  .IR MASKED_TCP_FLAGS " | "
>  .B type
> @@ -227,6 +229,12 @@ Match on layer 4 protocol source or destination port 
> number. Only available for
>  .BR ip_proto " values " udp ", " tcp  " and " sctp
>  which have to be specified in beforehand.
>  .TP
> +.BI range " MIN_VALUE-MAX_VALUE"
> +Match on a range of layer 4 protocol source or destination port number. Only
> +available for
> +.BR ip_proto " values " udp ", " tcp  " and " sctp
> +which have to be specified in beforehand.
> +.TP
>  .BI tcp_flags " MASKED_TCP_FLAGS"
>  Match on TCP flags represented as 12bit bitfield in in hexadecimal format.
>  A mask may be optionally provided to limit the bits which are matched. A mask
> 

This prints as:

dst_port NUMBER
src_port NUMBER
  Match  on  layer  4  protocol source or destination port number.
  Only available for ip_proto values udp, tcp and sctp which  have
  to be specified in beforehand.

range MIN_VALUE-MAX_VALUE
  Match  on a range of layer 4 protocol source or destination port
  number. Only available for ip_proto values  udp,  tcp  and  sctp
  which have to be specified in beforehand.

###

That makes it look like range is a standalone option - independent of
dst_port/src_port.

It seems to me the dst_port / src_port should be updated to:

dst_port {NUMBER | range MIN_VALUE-MAX_VALUE}

with the description updated for both options and indented under
dst_port / src_port


Re: [PATCH v2 3/4] libbpf: require size hint in bpf_prog_test_run

2018-11-20 Thread Alexei Starovoitov
On Tue, Nov 20, 2018 at 07:43:57PM +, Lorenz Bauer wrote:
> On Tue, 20 Nov 2018 at 19:18, Alexei Starovoitov
>  wrote:
> >
> > On Tue, Nov 20, 2018 at 03:43:05PM +, Lorenz Bauer wrote:
> > > Require size_out to be non-NULL if data_out is given. This prevents
> > > accidental overwriting of process memory after the output buffer.
> > >
> > > Adjust callers of bpf_prog_test_run to this behaviour.
> > >
> > > Signed-off-by: Lorenz Bauer 
> > > ---
> > >  tools/lib/bpf/bpf.c  |  7 ++-
> > >  tools/testing/selftests/bpf/test_progs.c | 10 ++
> > >  2 files changed, 16 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
> > > index 961e1b9fc592..1a835ff27486 100644
> > > --- a/tools/lib/bpf/bpf.c
> > > +++ b/tools/lib/bpf/bpf.c
> > > @@ -407,15 +407,20 @@ int bpf_prog_test_run(int prog_fd, int repeat, void 
> > > *data, __u32 size,
> > >   union bpf_attr attr;
> > >   int ret;
> > >
> > > + if (data_out && !size_out)
> > > + return -EINVAL;
> > > +
> > >   bzero(&attr, sizeof(attr));
> > >   attr.test.prog_fd = prog_fd;
> > >   attr.test.data_in = ptr_to_u64(data);
> > >   attr.test.data_out = ptr_to_u64(data_out);
> > >   attr.test.data_size_in = size;
> > > + if (data_out)
> > > + attr.test.data_size_out = *size_out;
> > >   attr.test.repeat = repeat;
> > >
> > >   ret = sys_bpf(BPF_PROG_TEST_RUN, &attr, sizeof(attr));
> > > - if (size_out)
> > > + if (data_out)
> > >   *size_out = attr.test.data_size_out;
> > >   if (retval)
> > >   *retval = attr.test.retval;
> > > diff --git a/tools/testing/selftests/bpf/test_progs.c 
> > > b/tools/testing/selftests/bpf/test_progs.c
> > > index c1e688f61061..299938603cb6 100644
> > > --- a/tools/testing/selftests/bpf/test_progs.c
> > > +++ b/tools/testing/selftests/bpf/test_progs.c
> > > @@ -150,6 +150,7 @@ static void test_xdp(void)
> > >   bpf_map_update_elem(map_fd, , , 0);
> > >   bpf_map_update_elem(map_fd, , , 0);
> > >
> > > + size = sizeof(buf);
> > >   err = bpf_prog_test_run(prog_fd, 1, &pkt_v4, sizeof(pkt_v4),
> > >   buf, &size, &retval, &duration);
> > >
> > > @@ -158,6 +159,7 @@ static void test_xdp(void)
> > > "err %d errno %d retval %d size %d\n",
> > > err, errno, retval, size);
> > >
> > > + size = sizeof(buf);
> > >   err = bpf_prog_test_run(prog_fd, 1, &pkt_v6, sizeof(pkt_v6),
> > >   buf, &size, &retval, &duration);
> >
> > This will surely break existing bpf_prog_test_run users.
> > Like it will break our testing framework.
> > we can fix out stuff and libbpf is a user space library, but I don't
> > think that this is the case to invoke such pain.
> > libbpf's bpf_prog_test_run() should be a simple wrapper on top of syscall.
> > I don't think it should be making such restrictions on api.
> >
> > btw patch 1 looks good to me.
> >
> 
> What if I add bpf_prog_test_run_safe or similar, with the behaviour
> proposed in the patch?
> Makes sense that you don't want to break existing users of libbpf
> outside the kernel, OTOH
> user space really should specify the output buffer length (or be given
> the choice).

+ if (data_out && !size_out)
+ return -EINVAL;
+
+ if (data_out)
+ attr.test.data_size_out = *size_out;

this is actually worse than I thought, since it will cause sporadic
failures in the test frameworks that don't init size_out.
Like test_progs.c will be randomly passing/failing depending
on the state of uninit bytes in the stack.

Also consider that during bpf uconf folks have requested to extend
prog_test_run with __sk_buff in/out argument, so not only packet data,
but skb related fields can be tested as well.
I think that was a valid request and prog_test_run should be extended.
So soon such libbpf's bpf_prog_test_run_safe() will not be enough.
I think it's the best to use _xattr approach we did for map_create
and prog_load.
This new bpf_prog_test_run_xattr() will be able to do the check
you're proposing:
+ if (data_out && !size_out)
+ return -EINVAL;
+
+ if (data_out)
+ attr.test.data_size_out = *size_out;
it can also check that both size and size_out are sane
with similar check to kernel:
if (size < ETH_HLEN || size > PAGE_SIZE - headroom - tailroom);

and will be extendable in the near future with __sk_buff in/out.



Re: [PATCH net v2] net/sched: act_police: fix race condition on state variables

2018-11-20 Thread Cong Wang
On Tue, Nov 20, 2018 at 1:19 PM Davide Caratti  wrote:
>
> after 'police' configuration parameters were converted to use RCU instead
> of spinlock, the state variables used to compute the traffic rate (namely
> 'tcfp_toks', 'tcfp_ptoks' and 'tcfp_t_c') are erroneously read/updated in
> the traffic path without any protection.
>
> Use a dedicated spinlock to avoid race conditions on these variables, and
> ensure proper cache-line alignment. In this way, 'police' is still faster
> than what we observed when 'tcf_lock' was used in the traffic path _ i.e.
> reverting commit 2d550dbad83c ("net/sched: act_police: don't use spinlock
> in the data path"). Moreover, we preserve the throughput improvement that
> was obtained after 'police' started using per-cpu counters, when 'avrate'
> is used instead of 'rate'.
>
> Changes since v1 (thanks to Eric Dumazet):
> - call ktime_get_ns() before acquiring the lock in the traffic path
> - use a dedicated spinlock instead of tcf_lock

Do you initialize this dedicated spinlock?


Re: [PATCH iproute2 22/22] rdma: make local functions static

2018-11-20 Thread David Ahern
On 11/15/18 3:36 PM, Stephen Hemminger wrote:
> Several functions only used inside utils.c
> 
> Signed-off-by: Stephen Hemminger 
> ---
>  rdma/rdma.h  | 11 ---
>  rdma/utils.c | 12 ++--
>  2 files changed, 6 insertions(+), 17 deletions(-)
> 

this patch breaks builds for me on Debian stretch:

rdma
CC   rdma.o
CC   utils.o
CC   dev.o
CC   link.o
dev.c: In function 'dev_set_name':
dev.c:248:6: warning: implicit declaration of function 'rd_no_arg'
[-Wimplicit-function-declaration]
  if (rd_no_arg(rd)) {
  ^
dev.c:256:55: warning: implicit declaration of function 'rd_argv'
[-Wimplicit-function-declaration]
  mnl_attr_put_strz(rd->nlh, RDMA_NLDEV_ATTR_DEV_NAME, rd_argv(rd));
   ^~~
dev.c:256:55: warning: passing argument 3 of 'mnl_attr_put_strz' makes
pointer from integer without a cast [-Wint-conversion]
In file included from rdma.h:19:0,
 from dev.c:12:
/usr/include/libmnl/libmnl.h:103:13: note: expected 'const char *' but
argument is of type 'int'
 extern void mnl_attr_put_strz(struct nlmsghdr *nlh, uint16_t type,
const char *data);


Reverting the patch fixes it.


Re: [PATCH net] MAINTAINERS: add myself as co-maintainer for r8169

2018-11-20 Thread David Miller
From: Heiner Kallweit 
Date: Tue, 20 Nov 2018 21:22:50 +0100

> Meanwhile I know the driver quite well and I refactored bigger parts
> of it. As a result people contact me already with r8169 questions.
> Therefore I'd volunteer to become co-maintainer of the driver also
> officially.
> 
> Signed-off-by: Heiner Kallweit 

I honestly thought we had done this already :-)

Applied, thanks.


[PATCH mlx5-next 08/11] net/mlx5: Resource tables, Use async events chain

2018-11-20 Thread Saeed Mahameed
Remove the explicit calls to the QP/SRQ resource event handlers on several FW
events and let the resource logic register its own event notifiers via the
new API.

Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 29 
 drivers/net/ethernet/mellanox/mlx5/core/qp.c  | 68 +++
 drivers/net/ethernet/mellanox/mlx5/core/srq.c | 55 +--
 include/linux/mlx5/driver.h   |  6 +-
 4 files changed, 108 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index b28869aa1a4e..0cf448575ebd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -324,7 +324,6 @@ static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
struct mlx5_eqe *eqe;
int set_ci = 0;
u32 cqn = -1;
-   u32 rsn;
u8 port;
 
dev = eq->dev;
@@ -340,34 +339,6 @@ static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
mlx5_core_dbg(eq->dev, "eqn %d, eqe type %s\n",
  eq->eqn, eqe_type_str(eqe->type));
switch (eqe->type) {
-   case MLX5_EVENT_TYPE_DCT_DRAINED:
-   rsn = be32_to_cpu(eqe->data.dct.dctn) & 0xffffff;
-   rsn |= (MLX5_RES_DCT << MLX5_USER_INDEX_LEN);
-   mlx5_rsc_event(dev, rsn, eqe->type);
-   break;
-   case MLX5_EVENT_TYPE_PATH_MIG:
-   case MLX5_EVENT_TYPE_COMM_EST:
-   case MLX5_EVENT_TYPE_SQ_DRAINED:
-   case MLX5_EVENT_TYPE_SRQ_LAST_WQE:
-   case MLX5_EVENT_TYPE_WQ_CATAS_ERROR:
-   case MLX5_EVENT_TYPE_PATH_MIG_FAILED:
-   case MLX5_EVENT_TYPE_WQ_INVAL_REQ_ERROR:
-   case MLX5_EVENT_TYPE_WQ_ACCESS_ERROR:
-   rsn = be32_to_cpu(eqe->data.qp_srq.qp_srq_n) & 0xffffff;
-   rsn |= (eqe->data.qp_srq.type << MLX5_USER_INDEX_LEN);
-   mlx5_core_dbg(dev, "event %s(%d) arrived on resource 0x%x\n",
- eqe_type_str(eqe->type), eqe->type, rsn);
-   mlx5_rsc_event(dev, rsn, eqe->type);
-   break;
-
-   case MLX5_EVENT_TYPE_SRQ_RQ_LIMIT:
-   case MLX5_EVENT_TYPE_SRQ_CATAS_ERROR:
-   rsn = be32_to_cpu(eqe->data.qp_srq.qp_srq_n) & 0xffffff;
-   mlx5_core_dbg(dev, "SRQ event %s(%d): srqn 0x%x\n",
- eqe_type_str(eqe->type), eqe->type, rsn);
-   mlx5_srq_event(dev, rsn, eqe->type);
-   break;
-
case MLX5_EVENT_TYPE_PORT_CHANGE:
port = (eqe->data.port.port >> 4) & 0xf;
switch (eqe->sub_type) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index cba4a435043a..28726c63101f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -38,11 +38,11 @@
 #include 
 
 #include "mlx5_core.h"
+#include "lib/eq.h"
 
-static struct mlx5_core_rsc_common *mlx5_get_rsc(struct mlx5_core_dev *dev,
-u32 rsn)
+static struct mlx5_core_rsc_common *
+mlx5_get_rsc(struct mlx5_qp_table *table, u32 rsn)
 {
-   struct mlx5_qp_table *table = &dev->priv.qp_table;
struct mlx5_core_rsc_common *common;
 
spin_lock(&table->lock);
@@ -53,11 +53,6 @@ static struct mlx5_core_rsc_common *mlx5_get_rsc(struct 
mlx5_core_dev *dev,
 
spin_unlock(&table->lock);
 
-   if (!common) {
-   mlx5_core_warn(dev, "Async event for bogus resource 0x%x\n",
-  rsn);
-   return NULL;
-   }
return common;
 }
 
@@ -120,14 +115,52 @@ static bool is_event_type_allowed(int rsc_type, int 
event_type)
}
 }
 
-void mlx5_rsc_event(struct mlx5_core_dev *dev, u32 rsn, int event_type)
+static int rsc_event_notifier(struct notifier_block *nb,
+ unsigned long type, void *data)
 {
-   struct mlx5_core_rsc_common *common = mlx5_get_rsc(dev, rsn);
+   struct mlx5_core_rsc_common *common;
+   struct mlx5_qp_table *table;
+   struct mlx5_core_dev *dev;
struct mlx5_core_dct *dct;
+   u8 event_type = (u8)type;
struct mlx5_core_qp *qp;
+   struct mlx5_priv *priv;
+   struct mlx5_eqe *eqe;
+   u32 rsn;
+
+   switch (event_type) {
+   case MLX5_EVENT_TYPE_DCT_DRAINED:
+   eqe = data;
+   rsn = be32_to_cpu(eqe->data.dct.dctn) & 0xffffff;
+   rsn |= (MLX5_RES_DCT << MLX5_USER_INDEX_LEN);
+   break;
+   case MLX5_EVENT_TYPE_PATH_MIG:
+   case MLX5_EVENT_TYPE_COMM_EST:
+   case MLX5_EVENT_TYPE_SQ_DRAINED:
+   

[PATCH mlx5-next 03/11] net/mlx5: FPGA, Use async events chain

2018-11-20 Thread Saeed Mahameed
Remove the explicit call to mlx5_fpga_event on
MLX5_EVENT_TYPE_FPGA_ERROR or MLX5_EVENT_TYPE_FPGA_QP_ERROR and
let the fpga core register its own handler when it is ready.

Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  5 ---
 .../ethernet/mellanox/mlx5/core/fpga/core.c   | 38 ---
 .../ethernet/mellanox/mlx5/core/fpga/core.h   | 11 +++---
 3 files changed, 38 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index c7c436b0ed2e..8aabd23d2166 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -421,11 +421,6 @@ static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
mlx5_pps_event(dev, eqe);
break;
 
-   case MLX5_EVENT_TYPE_FPGA_ERROR:
-   case MLX5_EVENT_TYPE_FPGA_QP_ERROR:
-   mlx5_fpga_event(dev, eqe->type, &eqe->data.raw);
-   break;
-
case MLX5_EVENT_TYPE_TEMP_WARN_EVENT:
mlx5_temp_warning_event(dev, eqe);
break;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c
index 436a8136f26f..27c5f6c7d36a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c
@@ -36,6 +36,7 @@
 
 #include "mlx5_core.h"
 #include "lib/mlx5.h"
+#include "lib/eq.h"
 #include "fpga/core.h"
 #include "fpga/conn.h"
 
@@ -145,6 +146,22 @@ static int mlx5_fpga_device_brb(struct mlx5_fpga_device 
*fdev)
return 0;
 }
 
+static int mlx5_fpga_event(struct mlx5_fpga_device *, unsigned long, void *);
+
+static int fpga_err_event(struct notifier_block *nb, unsigned long event, void 
*eqe)
+{
+   struct mlx5_fpga_device *fdev = mlx5_nb_cof(nb, struct mlx5_fpga_device, fpga_err_nb);
+
+   return mlx5_fpga_event(fdev, event, eqe);
+}
+
+static int fpga_qp_err_event(struct notifier_block *nb, unsigned long event, 
void *eqe)
+{
+   struct mlx5_fpga_device *fdev = mlx5_nb_cof(nb, struct mlx5_fpga_device, fpga_qp_err_nb);
+
+   return mlx5_fpga_event(fdev, event, eqe);
+}
+
 int mlx5_fpga_device_start(struct mlx5_core_dev *mdev)
 {
struct mlx5_fpga_device *fdev = mdev->fpga;
@@ -185,6 +202,11 @@ int mlx5_fpga_device_start(struct mlx5_core_dev *mdev)
if (err)
goto out;
 
+   MLX5_NB_INIT(&fdev->fpga_err_nb, fpga_err_event, FPGA_ERROR);
+   MLX5_NB_INIT(&fdev->fpga_qp_err_nb, fpga_qp_err_event, FPGA_QP_ERROR);
+   mlx5_eq_notifier_register(fdev->mdev, &fdev->fpga_err_nb);
+   mlx5_eq_notifier_register(fdev->mdev, &fdev->fpga_qp_err_nb);
+
err = mlx5_fpga_conn_device_init(fdev);
if (err)
goto err_rsvd_gid;
@@ -201,6 +223,8 @@ int mlx5_fpga_device_start(struct mlx5_core_dev *mdev)
mlx5_fpga_conn_device_cleanup(fdev);
 
 err_rsvd_gid:
+   mlx5_eq_notifier_unregister(fdev->mdev, &fdev->fpga_err_nb);
+   mlx5_eq_notifier_unregister(fdev->mdev, &fdev->fpga_qp_err_nb);
mlx5_core_unreserve_gids(mdev, max_num_qps);
 out:
spin_lock_irqsave(&fdev->state_lock, flags);
@@ -256,6 +280,9 @@ void mlx5_fpga_device_stop(struct mlx5_core_dev *mdev)
}
 
mlx5_fpga_conn_device_cleanup(fdev);
+   mlx5_eq_notifier_unregister(fdev->mdev, &fdev->fpga_err_nb);
+   mlx5_eq_notifier_unregister(fdev->mdev, &fdev->fpga_qp_err_nb);
+
max_num_qps = MLX5_CAP_FPGA(mdev, shell_caps.max_num_qps);
mlx5_core_unreserve_gids(mdev, max_num_qps);
 }
@@ -283,9 +310,10 @@ static const char *mlx5_fpga_qp_syndrome_to_string(u8 
syndrome)
return "Unknown";
 }
 
-void mlx5_fpga_event(struct mlx5_core_dev *mdev, u8 event, void *data)
+static int mlx5_fpga_event(struct mlx5_fpga_device *fdev,
+  unsigned long event, void *eqe)
 {
-   struct mlx5_fpga_device *fdev = mdev->fpga;
+   void *data = ((struct mlx5_eqe *)eqe)->data.raw;
const char *event_name;
bool teardown = false;
unsigned long flags;
@@ -303,9 +331,7 @@ void mlx5_fpga_event(struct mlx5_core_dev *mdev, u8 event, 
void *data)
fpga_qpn = MLX5_GET(fpga_qp_error_event, data, fpga_qpn);
break;
default:
-   mlx5_fpga_warn_ratelimited(fdev, "Unexpected event %u\n",
-  event);
-   return;
+   return NOTIFY_DONE;
}
 
spin_lock_irqsave(&fdev->state_lock, flags);
@@ -326,4 +352,6 @@ void mlx5_fpga_event(struct mlx5_core_dev *mdev, u8 event, 
void *data)
 */
if (teardown)
mlx5_trigger_health_work(fdev->mdev);
+
+   return NOTIFY_OK;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h 
b/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.h
index 3e2355c8df3f..7e2e871dbf83 100644
--- 

[PATCH mlx5-next 01/11] net/mlx5: EQ, Introduce atomic notifier chain subscription API

2018-11-20 Thread Saeed Mahameed
Use atomic_notifier_chain to fire firmware events at internal mlx5 core
components such as eswitch/fpga/clock/FW tracer/etc.., this is to
avoid explicit calls from low level mlx5_core to upper components and to
simplify the mlx5_core API for future developments.

Simply provide register/unregister notifiers API and call the notifier
chain on firmware async events.

Example: to subscribe to a FW event:
struct mlx5_nb port_event;

MLX5_NB_INIT(_event, port_event_handler, PORT_CHANGE);
mlx5_eq_notifier_register(mdev, _event);

where:
 - port_event_handler is the notifier block callback.
 - PORT_CHANGE is the suffix of MLX5_EVENT_TYPE_PORT_CHANGE.

The above will guarantee that port_event_handler will receive all FW
events of the type MLX5_EVENT_TYPE_PORT_CHANGE.

To receive all FW/HW events one can subscribe to
MLX5_EVENT_TYPE_NOTIFY_ANY.

The next few patches will start moving all mlx5 core components to use
this new API and clean up the mlx5_eq_async_int msix handler from
component explicit calls and specific logic.

Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 42 +--
 .../net/ethernet/mellanox/mlx5/core/lib/eq.h  |  5 +++
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  5 +++
 include/linux/mlx5/device.h   | 10 -
 include/linux/mlx5/eq.h   | 16 ++-
 5 files changed, 72 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 6ba8e401a0c7..34e4b2c246ff 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -31,6 +31,7 @@
  */
 
 #include 
+#include <linux/notifier.h>
 #include 
 #include 
 #include 
@@ -68,8 +69,10 @@ struct mlx5_irq_info {
 struct mlx5_eq_table {
struct list_headcomp_eqs_list;
struct mlx5_eq  pages_eq;
-   struct mlx5_eq  async_eq;
struct mlx5_eq  cmd_eq;
+   struct mlx5_eq  async_eq;
+
+   struct atomic_notifier_head nh[MLX5_EVENT_TYPE_MAX];
 
struct mutexlock; /* sync async eqs creations */
int num_comp_vectors;
@@ -316,13 +319,17 @@ u32 mlx5_eq_poll_irq_disabled(struct mlx5_eq_comp *eq)
 static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
 {
struct mlx5_eq *eq = eq_ptr;
-   struct mlx5_core_dev *dev = eq->dev;
+   struct mlx5_eq_table *eqt;
+   struct mlx5_core_dev *dev;
struct mlx5_eqe *eqe;
int set_ci = 0;
u32 cqn = -1;
u32 rsn;
u8 port;
 
+   dev = eq->dev;
+   eqt = dev->priv.eq_table;
+
while ((eqe = next_eqe_sw(eq))) {
/*
 * Make sure we read EQ entry contents after we've
@@ -437,6 +444,13 @@ static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
break;
}
 
+		if (likely(eqe->type < MLX5_EVENT_TYPE_MAX))
+			atomic_notifier_call_chain(&eqt->nh[eqe->type], eqe->type, eqe);
+		else
+			mlx5_core_warn_once(dev, "notifier_call_chain is not setup for eqe: %d\n", eqe->type);
+
+		atomic_notifier_call_chain(&eqt->nh[MLX5_EVENT_TYPE_NOTIFY_ANY], eqe->type, eqe);
+
++eq->cons_index;
++set_ci;
 
@@ -625,7 +639,7 @@ int mlx5_eq_del_cq(struct mlx5_eq *eq, struct mlx5_core_cq 
*cq)
 int mlx5_eq_table_init(struct mlx5_core_dev *dev)
 {
struct mlx5_eq_table *eq_table;
-   int err;
+   int i, err;
 
eq_table = kvzalloc(sizeof(*eq_table), GFP_KERNEL);
if (!eq_table)
@@ -638,6 +652,8 @@ int mlx5_eq_table_init(struct mlx5_core_dev *dev)
goto kvfree_eq_table;
 
	mutex_init(&eq_table->lock);
+	for (i = 0; i < MLX5_EVENT_TYPE_MAX; i++)
+		ATOMIC_INIT_NOTIFIER_HEAD(&eq_table->nh[i]);
 
return 0;
 
@@ -1202,3 +1218,23 @@ void mlx5_eq_table_destroy(struct mlx5_core_dev *dev)
destroy_async_eqs(dev);
free_irq_vectors(dev);
 }
+
+int mlx5_eq_notifier_register(struct mlx5_core_dev *dev, struct mlx5_nb *nb)
+{
+   struct mlx5_eq_table *eqt = dev->priv.eq_table;
+
+   if (nb->event_type >= MLX5_EVENT_TYPE_MAX)
+   return -EINVAL;
+
+	return atomic_notifier_chain_register(&eqt->nh[nb->event_type], &nb->nb);
+}
+
+int mlx5_eq_notifier_unregister(struct mlx5_core_dev *dev, struct mlx5_nb *nb)
+{
+   struct mlx5_eq_table *eqt = dev->priv.eq_table;
+
+   if (nb->event_type >= MLX5_EVENT_TYPE_MAX)
+   return -EINVAL;
+
+	return atomic_notifier_chain_unregister(&eqt->nh[nb->event_type], &nb->nb);
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h 
b/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h
index 6d8c8a57d52b..c0fb6d72b695 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/eq.h
@@ -4,6 +4,8 @@
 #ifndef 

[PATCH mlx5-next 10/11] net/mlx5: Device events, Use async events chain

2018-11-20 Thread Saeed Mahameed
Move all the generic async event handling into a new dedicated file,
events.c, to keep eq.c clean of concrete event handling logic.

Use the new API to register for NOTIFY_ANY to handle generic events and
dispatch the allowed events to the mlx5_core consumers (mlx5_ib and
mlx5e).
Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 .../ethernet/mellanox/mlx5/core/en_stats.c|   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 157 --
 .../net/ethernet/mellanox/mlx5/core/events.c  | 283 ++
 .../ethernet/mellanox/mlx5/core/lib/mlx5.h|  34 +++
 .../net/ethernet/mellanox/mlx5/core/main.c|  16 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |   6 +-
 .../net/ethernet/mellanox/mlx5/core/port.c|  57 
 include/linux/mlx5/driver.h   |  29 +-
 include/linux/mlx5/port.h |   3 -
 10 files changed, 344 insertions(+), 252 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/events.c

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/Makefile 
b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
index d324a3884462..26afe0779a0c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/Makefile
+++ b/drivers/net/ethernet/mellanox/mlx5/core/Makefile
@@ -14,7 +14,7 @@ obj-$(CONFIG_MLX5_CORE) += mlx5_core.o
 mlx5_core-y := main.o cmd.o debugfs.o fw.o eq.o uar.o pagealloc.o \
health.o mcg.o cq.o srq.o alloc.o qp.o port.o mr.o pd.o \
mad.o transobj.o vport.o sriov.o fs_cmd.o fs_core.o \
-   fs_counters.o rl.o lag.o dev.o wq.o lib/gid.o  \
+   fs_counters.o rl.o lag.o dev.o events.o wq.o lib/gid.o \
diag/fs_tracepoint.o diag/fw_tracer.o
 
 #
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 1e55b9c27ffc..748d23806391 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -30,6 +30,7 @@
  * SOFTWARE.
  */
 
+#include "lib/mlx5.h"
 #include "en.h"
 #include "en_accel/ipsec.h"
 #include "en_accel/tls.h"
@@ -1120,15 +1121,17 @@ static int mlx5e_grp_pme_fill_strings(struct mlx5e_priv 
*priv, u8 *data,
 static int mlx5e_grp_pme_fill_stats(struct mlx5e_priv *priv, u64 *data,
int idx)
 {
-	struct mlx5_priv *mlx5_priv = &priv->mdev->priv;
+   struct mlx5_pme_stats pme_stats;
int i;
 
+	mlx5_get_pme_stats(priv->mdev, &pme_stats);
+
for (i = 0; i < NUM_PME_STATUS_STATS; i++)
-		data[idx++] = MLX5E_READ_CTR64_CPU(mlx5_priv->pme_stats.status_counters,
+		data[idx++] = MLX5E_READ_CTR64_CPU(pme_stats.status_counters,
   mlx5e_pme_status_desc, i);
 
for (i = 0; i < NUM_PME_ERR_STATS; i++)
-		data[idx++] = MLX5E_READ_CTR64_CPU(mlx5_priv->pme_stats.error_counters,
+		data[idx++] = MLX5E_READ_CTR64_CPU(pme_stats.error_counters,
   mlx5e_pme_error_desc, i);
 
return idx;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 4e3febbf639d..4aa39a1fe23f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -108,121 +108,6 @@ static int mlx5_cmd_destroy_eq(struct mlx5_core_dev *dev, 
u8 eqn)
return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
 }
 
-static const char *eqe_type_str(u8 type)
-{
-   switch (type) {
-   case MLX5_EVENT_TYPE_COMP:
-   return "MLX5_EVENT_TYPE_COMP";
-   case MLX5_EVENT_TYPE_PATH_MIG:
-   return "MLX5_EVENT_TYPE_PATH_MIG";
-   case MLX5_EVENT_TYPE_COMM_EST:
-   return "MLX5_EVENT_TYPE_COMM_EST";
-   case MLX5_EVENT_TYPE_SQ_DRAINED:
-   return "MLX5_EVENT_TYPE_SQ_DRAINED";
-   case MLX5_EVENT_TYPE_SRQ_LAST_WQE:
-   return "MLX5_EVENT_TYPE_SRQ_LAST_WQE";
-   case MLX5_EVENT_TYPE_SRQ_RQ_LIMIT:
-   return "MLX5_EVENT_TYPE_SRQ_RQ_LIMIT";
-   case MLX5_EVENT_TYPE_CQ_ERROR:
-   return "MLX5_EVENT_TYPE_CQ_ERROR";
-   case MLX5_EVENT_TYPE_WQ_CATAS_ERROR:
-   return "MLX5_EVENT_TYPE_WQ_CATAS_ERROR";
-   case MLX5_EVENT_TYPE_PATH_MIG_FAILED:
-   return "MLX5_EVENT_TYPE_PATH_MIG_FAILED";
-   case MLX5_EVENT_TYPE_WQ_INVAL_REQ_ERROR:
-   return "MLX5_EVENT_TYPE_WQ_INVAL_REQ_ERROR";
-   case MLX5_EVENT_TYPE_WQ_ACCESS_ERROR:
-   return "MLX5_EVENT_TYPE_WQ_ACCESS_ERROR";
-   case MLX5_EVENT_TYPE_SRQ_CATAS_ERROR:
-   return "MLX5_EVENT_TYPE_SRQ_CATAS_ERROR";
-   case MLX5_EVENT_TYPE_INTERNAL_ERROR:
-   return "MLX5_EVENT_TYPE_INTERNAL_ERROR";
-   case MLX5_EVENT_TYPE_PORT_CHANGE:
-   return 

[PATCH mlx5-next 04/11] net/mlx5: Clock, Use async events chain

2018-11-20 Thread Saeed Mahameed
Remove the explicit call to mlx5_pps_event on MLX5_EVENT_TYPE_PPS_EVENT
and let the clock logic register its own handler when it is ready.

Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  4 
 .../ethernet/mellanox/mlx5/core/lib/clock.c   | 24 +--
 .../ethernet/mellanox/mlx5/core/lib/clock.h   |  3 ---
 include/linux/mlx5/driver.h   |  4 +++-
 4 files changed, 20 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 8aabd23d2166..e5fcce9ca107 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -417,10 +417,6 @@ static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
mlx5_port_module_event(dev, eqe);
break;
 
-   case MLX5_EVENT_TYPE_PPS_EVENT:
-   mlx5_pps_event(dev, eqe);
-   break;
-
case MLX5_EVENT_TYPE_TEMP_WARN_EVENT:
mlx5_temp_warning_event(dev, eqe);
break;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c 
b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
index 0d90b1b4a3d3..d27c239e7d6c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include "lib/eq.h"
 #include "en.h"
 #include "clock.h"
 
@@ -439,16 +440,17 @@ static void mlx5_get_pps_caps(struct mlx5_core_dev *mdev)
clock->pps_info.pin_caps[7] = MLX5_GET(mtpps_reg, out, cap_pin_7_mode);
 }
 
-void mlx5_pps_event(struct mlx5_core_dev *mdev,
-   struct mlx5_eqe *eqe)
+static int mlx5_pps_event(struct notifier_block *nb,
+ unsigned long type, void *data)
 {
-	struct mlx5_clock *clock = &mdev->clock;
+   struct mlx5_clock *clock = mlx5_nb_cof(nb, struct mlx5_clock, pps_nb);
+   struct mlx5_core_dev *mdev = clock->mdev;
struct ptp_clock_event ptp_event;
-   struct timespec64 ts;
-   u64 nsec_now, nsec_delta;
u64 cycles_now, cycles_delta;
+   u64 nsec_now, nsec_delta, ns;
+   struct mlx5_eqe *eqe = data;
int pin = eqe->data.pps.pin;
-   s64 ns;
+   struct timespec64 ts;
unsigned long flags;
 
switch (clock->ptp_info.pin_config[pin].func) {
@@ -463,6 +465,7 @@ void mlx5_pps_event(struct mlx5_core_dev *mdev,
} else {
ptp_event.type = PTP_CLOCK_EXTTS;
}
+	/* TODO: clock->ptp can be NULL if ptp_clock_register fails */
		ptp_clock_event(clock->ptp, &ptp_event);
break;
case PTP_PF_PEROUT:
@@ -481,8 +484,11 @@ void mlx5_pps_event(struct mlx5_core_dev *mdev,
		write_sequnlock_irqrestore(&clock->lock, flags);
break;
default:
-   mlx5_core_err(mdev, " Unhandled event\n");
+   mlx5_core_err(mdev, " Unhandled clock PPS event, func %d\n",
+ clock->ptp_info.pin_config[pin].func);
}
+
+   return NOTIFY_OK;
 }
 
 void mlx5_init_clock(struct mlx5_core_dev *mdev)
@@ -567,6 +573,9 @@ void mlx5_init_clock(struct mlx5_core_dev *mdev)
   PTR_ERR(clock->ptp));
clock->ptp = NULL;
}
+
+	MLX5_NB_INIT(&clock->pps_nb, mlx5_pps_event, PPS_EVENT);
+	mlx5_eq_notifier_register(mdev, &clock->pps_nb);
 }
 
 void mlx5_cleanup_clock(struct mlx5_core_dev *mdev)
@@ -576,6 +585,7 @@ void mlx5_cleanup_clock(struct mlx5_core_dev *mdev)
if (!MLX5_CAP_GEN(mdev, device_frequency_khz))
return;
 
+	mlx5_eq_notifier_unregister(mdev, &clock->pps_nb);
if (clock->ptp) {
ptp_clock_unregister(clock->ptp);
clock->ptp = NULL;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.h 
b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.h
index 263cb6e2aeee..31600924bdc3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/clock.h
@@ -36,7 +36,6 @@
 #if IS_ENABLED(CONFIG_PTP_1588_CLOCK)
 void mlx5_init_clock(struct mlx5_core_dev *mdev);
 void mlx5_cleanup_clock(struct mlx5_core_dev *mdev);
-void mlx5_pps_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe);
 
 static inline int mlx5_clock_get_ptp_index(struct mlx5_core_dev *mdev)
 {
@@ -60,8 +59,6 @@ static inline ktime_t mlx5_timecounter_cyc2time(struct 
mlx5_clock *clock,
 #else
 static inline void mlx5_init_clock(struct mlx5_core_dev *mdev) {}
 static inline void mlx5_cleanup_clock(struct mlx5_core_dev *mdev) {}
-static inline void mlx5_pps_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe) {}
-
 static inline int mlx5_clock_get_ptp_index(struct mlx5_core_dev *mdev)
 {
return -1;
diff --git a/include/linux/mlx5/driver.h 

[PATCH mlx5-next 06/11] net/mlx5: FWPage, Use async events chain

2018-11-20 Thread Saeed Mahameed
Remove the explicit call to mlx5_core_req_pages_handler on
MLX5_EVENT_TYPE_PAGE_REQUEST and let the FW page logic register its own
handler when it is ready.

Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 11 -
 .../net/ethernet/mellanox/mlx5/core/main.c| 27 ++--
 .../ethernet/mellanox/mlx5/core/pagealloc.c   | 44 +--
 include/linux/mlx5/driver.h   |  5 ++-
 4 files changed, 47 insertions(+), 40 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 7c8b2d89645b..7f6a644700eb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -398,17 +398,6 @@ static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
mlx5_eq_cq_event(eq, cqn, eqe->type);
break;
 
-		case MLX5_EVENT_TYPE_PAGE_REQUEST:
-			{
-				u16 func_id = be16_to_cpu(eqe->data.req_pages.func_id);
-				s32 npages = be32_to_cpu(eqe->data.req_pages.num_pages);
-
-				mlx5_core_dbg(dev, "page request for func 0x%x, npages %d\n",
-					      func_id, npages);
-				mlx5_core_req_pages_handler(dev, func_id, npages);
-			}
-			break;
-
case MLX5_EVENT_TYPE_PORT_MODULE_EVENT:
mlx5_port_module_event(dev, eqe);
break;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 91022f141855..9e4cd2757ea8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -916,16 +916,10 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
goto reclaim_boot_pages;
}
 
-   err = mlx5_pagealloc_start(dev);
-   if (err) {
-		dev_err(&pdev->dev, "mlx5_pagealloc_start failed\n");
-   goto reclaim_boot_pages;
-   }
-
err = mlx5_cmd_init_hca(dev, sw_owner_id);
if (err) {
		dev_err(&pdev->dev, "init hca failed\n");
-   goto err_pagealloc_stop;
+   goto reclaim_boot_pages;
}
 
mlx5_set_driver_version(dev);
@@ -953,6 +947,8 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, struct 
mlx5_priv *priv,
goto err_get_uars;
}
 
+   mlx5_pagealloc_start(dev);
+
err = mlx5_eq_table_create(dev);
if (err) {
dev_err(>dev, "Failed to create EQs\n");
@@ -1039,6 +1035,7 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
mlx5_eq_table_destroy(dev);
 
 err_eq_table:
+   mlx5_pagealloc_stop(dev);
mlx5_put_uars_page(dev, priv->uar);
 
 err_get_uars:
@@ -1052,9 +1049,6 @@ static int mlx5_load_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
goto out_err;
}
 
-err_pagealloc_stop:
-   mlx5_pagealloc_stop(dev);
-
 reclaim_boot_pages:
mlx5_reclaim_startup_pages(dev);
 
@@ -1100,16 +1094,18 @@ static int mlx5_unload_one(struct mlx5_core_dev *dev, 
struct mlx5_priv *priv,
mlx5_fpga_device_stop(dev);
mlx5_fw_tracer_cleanup(dev->tracer);
mlx5_eq_table_destroy(dev);
+   mlx5_pagealloc_stop(dev);
mlx5_put_uars_page(dev, priv->uar);
+
if (cleanup)
mlx5_cleanup_once(dev);
mlx5_stop_health_poll(dev, cleanup);
+
err = mlx5_cmd_teardown_hca(dev);
if (err) {
		dev_err(&dev->pdev->dev, "tear_down_hca failed, skip cleanup\n");
goto out;
}
-   mlx5_pagealloc_stop(dev);
mlx5_reclaim_startup_pages(dev);
mlx5_core_disable_hca(dev, 0);
mlx5_cmd_cleanup(dev);
@@ -1186,12 +1182,14 @@ static int init_one(struct pci_dev *pdev,
goto close_pci;
}
 
-   mlx5_pagealloc_init(dev);
+   err = mlx5_pagealloc_init(dev);
+   if (err)
+   goto err_pagealloc_init;
 
err = mlx5_load_one(dev, priv, true);
if (err) {
		dev_err(&pdev->dev, "mlx5_load_one failed with error code %d\n", err);
-   goto clean_health;
+   goto err_load_one;
}
 
request_module_nowait(MLX5_IB_MOD);
@@ -1205,8 +1203,9 @@ static int init_one(struct pci_dev *pdev,
 
 clean_load:
mlx5_unload_one(dev, priv, true);
-clean_health:
+err_load_one:
mlx5_pagealloc_cleanup(dev);
+err_pagealloc_init:
mlx5_health_cleanup(dev);
 close_pci:
mlx5_pci_close(dev, priv);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c
index e36d3e3675f9..a83b517b0714 100644
--- 

[PATCH mlx5-next 05/11] net/mlx5: E-Switch, Use async events chain

2018-11-20 Thread Saeed Mahameed
Remove the explicit call to mlx5_eswitch_vport_event on
MLX5_EVENT_TYPE_NIC_VPORT_CHANGE and let the eswitch register its own
handler when it is ready.

Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  4 --
 .../net/ethernet/mellanox/mlx5/core/eswitch.c | 44 +++
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |  3 +-
 3 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index e5fcce9ca107..7c8b2d89645b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -409,10 +409,6 @@ static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
}
break;
 
-   case MLX5_EVENT_TYPE_NIC_VPORT_CHANGE:
-   mlx5_eswitch_vport_event(dev->priv.eswitch, eqe);
-   break;
-
case MLX5_EVENT_TYPE_PORT_MODULE_EVENT:
mlx5_port_module_event(dev, eqe);
break;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 2346b6ba3d54..e6a9b19d8626 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include "mlx5_core.h"
+#include "lib/eq.h"
 #include "eswitch.h"
 #include "fs_core.h"
 #include "lib/eq.h"
@@ -1568,7 +1569,6 @@ static void esw_disable_vport(struct mlx5_eswitch *esw, 
int vport_num)
/* Mark this vport as disabled to discard new events */
vport->enabled = false;
 
-   mlx5_eq_synchronize_async_irq(esw->dev);
/* Wait for current already scheduled events to complete */
flush_workqueue(esw->work_queue);
/* Disable events from this vport */
@@ -1594,10 +1594,25 @@ static void esw_disable_vport(struct mlx5_eswitch *esw, 
int vport_num)
mutex_unlock(>state_lock);
 }
 
+static int eswitch_vport_event(struct notifier_block *nb,
+  unsigned long type, void *data)
+{
+   struct mlx5_eswitch *esw = mlx5_nb_cof(nb, struct mlx5_eswitch, nb);
+   struct mlx5_eqe *eqe = data;
+   struct mlx5_vport *vport;
+   u16 vport_num;
+
+   vport_num = be16_to_cpu(eqe->data.vport_change.vport_num);
+	vport = &esw->vports[vport_num];
+   if (vport->enabled)
+		queue_work(esw->work_queue, &vport->vport_change_handler);
+
+   return NOTIFY_OK;
+}
+
 /* Public E-Switch API */
 #define ESW_ALLOWED(esw) ((esw) && MLX5_ESWITCH_MANAGER((esw)->dev))
 
-
 int mlx5_eswitch_enable_sriov(struct mlx5_eswitch *esw, int nvfs, int mode)
 {
int err;
@@ -1641,6 +1656,11 @@ int mlx5_eswitch_enable_sriov(struct mlx5_eswitch *esw, 
int nvfs, int mode)
for (i = 0; i <= nvfs; i++)
esw_enable_vport(esw, i, enabled_events);
 
+   if (mode == SRIOV_LEGACY) {
+		MLX5_NB_INIT(&esw->nb, eswitch_vport_event, NIC_VPORT_CHANGE);
+		mlx5_eq_notifier_register(esw->dev, &esw->nb);
+   }
+
esw_info(esw->dev, "SRIOV enabled: active vports(%d)\n",
 esw->enabled_vports);
return 0;
@@ -1670,6 +1690,9 @@ void mlx5_eswitch_disable_sriov(struct mlx5_eswitch *esw)
	mc_promisc = &esw->mc_promisc;
nvports = esw->enabled_vports;
 
+   if (esw->mode == SRIOV_LEGACY)
+		mlx5_eq_notifier_unregister(esw->dev, &esw->nb);
+
for (i = 0; i < esw->total_vports; i++)
esw_disable_vport(esw, i);
 
@@ -1778,23 +1801,6 @@ void mlx5_eswitch_cleanup(struct mlx5_eswitch *esw)
kfree(esw);
 }
 
-void mlx5_eswitch_vport_event(struct mlx5_eswitch *esw, struct mlx5_eqe *eqe)
-{
-	struct mlx5_eqe_vport_change *vc_eqe = &eqe->data.vport_change;
-   u16 vport_num = be16_to_cpu(vc_eqe->vport_num);
-   struct mlx5_vport *vport;
-
-   if (!esw) {
-		pr_warn("MLX5 E-Switch: vport %d got an event while eswitch is not initialized\n",
-			vport_num);
-   return;
-   }
-
-	vport = &esw->vports[vport_num];
-   if (vport->enabled)
-		queue_work(esw->work_queue, &vport->vport_change_handler);
-}
-
 /* Vport Administration */
 #define LEGAL_VPORT(esw, vport) (vport >= 0 && vport < esw->total_vports)
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h 
b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
index aaafc9f17115..480ffa294867 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.h
@@ -181,6 +181,7 @@ struct esw_mc_addr { /* SRIOV only */
 
 struct mlx5_eswitch {
struct mlx5_core_dev*dev;
+   struct mlx5_nb  nb;
struct mlx5_eswitch_fdb fdb_table;
struct hlist_head   mc_table[MLX5_L2_ADDR_HASH_SIZE];
struct workqueue_struct *work_queue;
@@ 

[PATCH mlx5-next 11/11] net/mlx5: Improve core device events handling

2018-11-20 Thread Saeed Mahameed
Register a separate handler per event type, rather than listening for all
events and looking for the events to handle in a switch case.

Signed-off-by: Saeed Mahameed 
---
 .../net/ethernet/mellanox/mlx5/core/events.c  | 223 +++---
 1 file changed, 136 insertions(+), 87 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/events.c 
b/drivers/net/ethernet/mellanox/mlx5/core/events.c
index d3ab86bd394b..3ad004af37d7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/events.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/events.c
@@ -2,15 +2,41 @@
 // Copyright (c) 2018 Mellanox Technologies
 
 #include 
+
 #include "mlx5_core.h"
 #include "lib/eq.h"
 #include "lib/mlx5.h"
 
+struct mlx5_event_nb {
+   struct mlx5_nb  nb;
+   void   *ctx;
+};
+
+/* General events handlers for the low level mlx5_core driver
+ *
+ * Other Major feature specific events such as
+ * clock/eswitch/fpga/FW trace and many others, are handled elsewhere, with
+ * separate notifiers callbacks, specifically by those mlx5 components.
+ */
+static int any_notifier(struct notifier_block *, unsigned long, void *);
+static int port_change(struct notifier_block *, unsigned long, void *);
+static int general_event(struct notifier_block *, unsigned long, void *);
+static int temp_warn(struct notifier_block *, unsigned long, void *);
+static int port_module(struct notifier_block *, unsigned long, void *);
+
+static struct mlx5_nb events_nbs_ref[] = {
+	{.nb.notifier_call = any_notifier,  .event_type = MLX5_EVENT_TYPE_NOTIFY_ANY },
+	{.nb.notifier_call = port_change,   .event_type = MLX5_EVENT_TYPE_PORT_CHANGE },
+	{.nb.notifier_call = general_event, .event_type = MLX5_EVENT_TYPE_GENERAL_EVENT },
+	{.nb.notifier_call = temp_warn,     .event_type = MLX5_EVENT_TYPE_TEMP_WARN_EVENT },
+	{.nb.notifier_call = port_module,   .event_type = MLX5_EVENT_TYPE_PORT_MODULE_EVENT },
+};
+
 struct mlx5_events {
-	struct mlx5_nb        nb;
struct mlx5_core_dev *dev;
+   struct mlx5_event_nb  notifiers[ARRAY_SIZE(events_nbs_ref)];
 
-   /* port module evetns stats */
+   /* port module events stats */
struct mlx5_pme_stats pme_stats;
 };
 
@@ -80,6 +106,19 @@ static const char *eqe_type_str(u8 type)
}
 }
 
+/* handles all FW events, type == eqe->type */
+static int any_notifier(struct notifier_block *nb,
+   unsigned long type, void *data)
+{
+	struct mlx5_event_nb *event_nb = mlx5_nb_cof(nb, struct mlx5_event_nb, nb);
+   struct mlx5_events   *events   = event_nb->ctx;
+   struct mlx5_eqe  *eqe  = data;
+
+   mlx5_core_dbg(events->dev, "Async eqe type %s, subtype (%d)\n",
+ eqe_type_str(eqe->type), eqe->sub_type);
+   return NOTIFY_OK;
+}
+
 static enum mlx5_dev_event port_subtype2dev(u8 subtype)
 {
switch (subtype) {
@@ -101,19 +140,92 @@ static enum mlx5_dev_event port_subtype2dev(u8 subtype)
return -1;
 }
 
-static void temp_warning_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe)
+/* type == MLX5_EVENT_TYPE_PORT_CHANGE */
+static int port_change(struct notifier_block *nb,
+  unsigned long type, void *data)
 {
+	struct mlx5_event_nb *event_nb = mlx5_nb_cof(nb, struct mlx5_event_nb, nb);
+   struct mlx5_events   *events   = event_nb->ctx;
+   struct mlx5_core_dev *dev  = events->dev;
+
+   bool dev_event_dispatch = false;
+   enum mlx5_dev_event dev_event;
+   unsigned long dev_event_data;
+   struct mlx5_eqe *eqe = data;
+   u8 port = (eqe->data.port.port >> 4) & 0xf;
+
+   switch (eqe->sub_type) {
+   case MLX5_PORT_CHANGE_SUBTYPE_DOWN:
+   case MLX5_PORT_CHANGE_SUBTYPE_ACTIVE:
+   case MLX5_PORT_CHANGE_SUBTYPE_LID:
+   case MLX5_PORT_CHANGE_SUBTYPE_PKEY:
+   case MLX5_PORT_CHANGE_SUBTYPE_GUID:
+   case MLX5_PORT_CHANGE_SUBTYPE_CLIENT_REREG:
+   case MLX5_PORT_CHANGE_SUBTYPE_INITIALIZED:
+   dev_event = port_subtype2dev(eqe->sub_type);
+   dev_event_data = (unsigned long)port;
+   dev_event_dispatch = true;
+   break;
+   default:
+		mlx5_core_warn(dev, "Port event with unrecognized subtype: port %d, sub_type %d\n",
+			       port, eqe->sub_type);
+   }
+
+   if (dev->event && dev_event_dispatch)
+   dev->event(dev, dev_event, dev_event_data);
+
+   return NOTIFY_OK;
+}
+
+/* type == MLX5_EVENT_TYPE_GENERAL_EVENT */
+static int general_event(struct notifier_block *nb, unsigned long type, void *data)
+{
+	struct mlx5_event_nb *event_nb = mlx5_nb_cof(nb, struct mlx5_event_nb, nb);
+   struct mlx5_events   *events   = event_nb->ctx;
+   struct mlx5_core_dev *dev  = events->dev;
+
+   bool dev_event_dispatch = false;
+   enum mlx5_dev_event dev_event;
+   unsigned long dev_event_data;
+   struct mlx5_eqe *eqe = data;
+

[PATCH mlx5-next 07/11] net/mlx5: CmdIF, Use async events chain

2018-11-20 Thread Saeed Mahameed
Remove the explicit call to mlx5_cmd_comp_handler on MLX5_EVENT_TYPE_CMD
and let the command interface register its own handler when it is ready.

Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 48 ++-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  4 --
 .../net/ethernet/mellanox/mlx5/core/health.c  | 25 +-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  2 +-
 include/linux/mlx5/driver.h   |  2 +
 5 files changed, 50 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c 
b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index 7b18aff955f1..8ab636d59edb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -40,9 +40,11 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "mlx5_core.h"
+#include "lib/eq.h"
 
 enum {
CMD_IF_REV = 5,
@@ -805,6 +807,8 @@ static u16 msg_to_opcode(struct mlx5_cmd_msg *in)
return MLX5_GET(mbox_in, in->first.data, opcode);
 }
 
+static void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec, bool forced);
+
 static void cb_timeout_handler(struct work_struct *work)
 {
struct delayed_work *dwork = container_of(work, struct delayed_work,
@@ -1412,14 +1416,32 @@ static void mlx5_cmd_change_mod(struct mlx5_core_dev 
*dev, int mode)
up(>sem);
 }
 
+static int cmd_comp_notifier(struct notifier_block *nb,
+unsigned long type, void *data)
+{
+   struct mlx5_core_dev *dev;
+   struct mlx5_cmd *cmd;
+   struct mlx5_eqe *eqe;
+
+   cmd = mlx5_nb_cof(nb, struct mlx5_cmd, nb);
+   dev = container_of(cmd, struct mlx5_core_dev, cmd);
+   eqe = data;
+
+   mlx5_cmd_comp_handler(dev, be32_to_cpu(eqe->data.cmd.vector), false);
+
+   return NOTIFY_OK;
+}
 void mlx5_cmd_use_events(struct mlx5_core_dev *dev)
 {
+	MLX5_NB_INIT(&dev->cmd.nb, cmd_comp_notifier, CMD);
+	mlx5_eq_notifier_register(dev, &dev->cmd.nb);
mlx5_cmd_change_mod(dev, CMD_MODE_EVENTS);
 }
 
 void mlx5_cmd_use_polling(struct mlx5_core_dev *dev)
 {
mlx5_cmd_change_mod(dev, CMD_MODE_POLLING);
+	mlx5_eq_notifier_unregister(dev, &dev->cmd.nb);
 }
 
 static void free_msg(struct mlx5_core_dev *dev, struct mlx5_cmd_msg *msg)
@@ -1435,7 +1457,7 @@ static void free_msg(struct mlx5_core_dev *dev, struct 
mlx5_cmd_msg *msg)
}
 }
 
-void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec, bool forced)
+static void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec, bool forced)
 {
struct mlx5_cmd *cmd = >cmd;
struct mlx5_cmd_work_ent *ent;
@@ -1533,7 +1555,29 @@ void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, 
u64 vec, bool forced)
}
}
 }
-EXPORT_SYMBOL(mlx5_cmd_comp_handler);
+
+void mlx5_cmd_trigger_completions(struct mlx5_core_dev *dev)
+{
+   unsigned long flags;
+   u64 vector;
+
+   /* wait for pending handlers to complete */
+   mlx5_eq_synchronize_cmd_irq(dev);
+	spin_lock_irqsave(&dev->cmd.alloc_lock, flags);
+   vector = ~dev->cmd.bitmask & ((1ul << (1 << dev->cmd.log_sz)) - 1);
+   if (!vector)
+   goto no_trig;
+
+   vector |= MLX5_TRIGGERED_CMD_COMP;
+	spin_unlock_irqrestore(&dev->cmd.alloc_lock, flags);
+
+   mlx5_core_dbg(dev, "vector 0x%llx\n", vector);
+   mlx5_cmd_comp_handler(dev, vector, true);
+   return;
+
+no_trig:
+	spin_unlock_irqrestore(&dev->cmd.alloc_lock, flags);
+}
 
 static int status_to_err(u8 status)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 7f6a644700eb..b28869aa1a4e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -368,10 +368,6 @@ static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
mlx5_srq_event(dev, rsn, eqe->type);
break;
 
-   case MLX5_EVENT_TYPE_CMD:
-		mlx5_cmd_comp_handler(dev, be32_to_cpu(eqe->data.cmd.vector), false);
-   break;
-
case MLX5_EVENT_TYPE_PORT_CHANGE:
port = (eqe->data.port.port >> 4) & 0xf;
switch (eqe->sub_type) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/health.c 
b/drivers/net/ethernet/mellanox/mlx5/core/health.c
index 066883003aea..4e42bd290959 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/health.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/health.c
@@ -79,29 +79,6 @@ void mlx5_set_nic_state(struct mlx5_core_dev *dev, u8 state)
		    &dev->iseg->cmdq_addr_l_sz);
 }
 
-static void trigger_cmd_completions(struct mlx5_core_dev *dev)
-{
-   unsigned long flags;
-   u64 vector;
-
-   /* wait for pending handlers to complete */
-   mlx5_eq_synchronize_cmd_irq(dev);
-   spin_lock_irqsave(>cmd.alloc_lock, 

[PATCH mlx5-next 02/11] net/mlx5: FWTrace, Use async events chain

2018-11-20 Thread Saeed Mahameed
Remove the explicit call to mlx5_fw_tracer_event on
MLX5_EVENT_TYPE_DEVICE_TRACER and let the FW tracer register its own
handler when it is ready.

Signed-off-by: Saeed Mahameed 
---
 .../mellanox/mlx5/core/diag/fw_tracer.c   | 27 ++-
 .../mellanox/mlx5/core/diag/fw_tracer.h   |  2 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  4 ---
 3 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
index d4ec93bde4de..6999f4486e9e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.c
@@ -30,6 +30,7 @@
  * SOFTWARE.
  */
 #define CREATE_TRACE_POINTS
+#include "lib/eq.h"
 #include "fw_tracer.h"
 #include "fw_tracer_tracepoint.h"
 
@@ -846,9 +847,9 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct 
mlx5_core_dev *dev)
return ERR_PTR(err);
 }
 
-/* Create HW resources + start tracer
- * must be called before Async EQ is created
- */
+static int fw_tracer_event(struct notifier_block *nb, unsigned long action, void *data);
+
+/* Create HW resources + start tracer */
 int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer)
 {
struct mlx5_core_dev *dev;
@@ -874,6 +875,9 @@ int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer)
goto err_dealloc_pd;
}
 
+	MLX5_NB_INIT(&tracer->nb, fw_tracer_event, DEVICE_TRACER);
+	mlx5_eq_notifier_register(dev, &tracer->nb);
+
mlx5_fw_tracer_start(tracer);
 
return 0;
@@ -883,9 +887,7 @@ int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer)
return err;
 }
 
-/* Stop tracer + Cleanup HW resources
- * must be called after Async EQ is destroyed
- */
+/* Stop tracer + Cleanup HW resources */
 void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer)
 {
if (IS_ERR_OR_NULL(tracer))
@@ -893,7 +895,7 @@ void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer)
 
mlx5_core_dbg(tracer->dev, "FWTracer: Cleanup, is owner ? (%d)\n",
  tracer->owner);
-
+	mlx5_eq_notifier_unregister(tracer->dev, &tracer->nb);
cancel_work_sync(>ownership_change_work);
cancel_work_sync(>handle_traces_work);
 
@@ -922,12 +924,11 @@ void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer)
kfree(tracer);
 }
 
-void mlx5_fw_tracer_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe)
+static int fw_tracer_event(struct notifier_block *nb, unsigned long action, void *data)
 {
-   struct mlx5_fw_tracer *tracer = dev->tracer;
-
-   if (!tracer)
-   return;
+	struct mlx5_fw_tracer *tracer = mlx5_nb_cof(nb, struct mlx5_fw_tracer, nb);
+   struct mlx5_core_dev *dev = tracer->dev;
+   struct mlx5_eqe *eqe = data;
 
switch (eqe->sub_type) {
case MLX5_TRACER_SUBTYPE_OWNERSHIP_CHANGE:
@@ -942,6 +943,8 @@ void mlx5_fw_tracer_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe)
		mlx5_core_dbg(dev, "FWTracer: Event with unrecognized subtype: sub_type %d\n",
			      eqe->sub_type);
}
+
+   return NOTIFY_OK;
 }
 
 EXPORT_TRACEPOINT_SYMBOL(mlx5_fw);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
index 0347f2dd5cee..a8b8747f2b61 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fw_tracer.h
@@ -55,6 +55,7 @@
 
 struct mlx5_fw_tracer {
struct mlx5_core_dev *dev;
+   struct mlx5_nb nb;
bool owner;
u8   trc_ver;
struct workqueue_struct *work_queue;
@@ -170,6 +171,5 @@ struct mlx5_fw_tracer *mlx5_fw_tracer_create(struct 
mlx5_core_dev *dev);
 int mlx5_fw_tracer_init(struct mlx5_fw_tracer *tracer);
 void mlx5_fw_tracer_cleanup(struct mlx5_fw_tracer *tracer);
 void mlx5_fw_tracer_destroy(struct mlx5_fw_tracer *tracer);
-void mlx5_fw_tracer_event(struct mlx5_core_dev *dev, struct mlx5_eqe *eqe);
 
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 34e4b2c246ff..c7c436b0ed2e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -434,10 +434,6 @@ static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
general_event_handler(dev, eqe);
break;
 
-   case MLX5_EVENT_TYPE_DEVICE_TRACER:
-   mlx5_fw_tracer_event(dev, eqe);
-   break;
-
default:
mlx5_core_warn(dev, "Unhandled event 0x%x on EQ 0x%x\n",
   eqe->type, eq->eqn);
-- 
2.19.1



[PATCH mlx5-next 00/11] mlx5 core internal firmware events handling improvements

2018-11-20 Thread Saeed Mahameed
Hi

This patchset is for mlx5-next shared branch, and will be applied there
once the review is done.

The main idea of this change is to define a flexible, scalable, and
simpler low-level mlx5 core API for upper-level components, for better
feature decoupling and maximum code locality and modularity.

Improve and simplify mlx5 core internal firmware and device async event
handling and subscription. Currently all async firmware events are
handled in one place (a switch case in eq.c), and every time we need to
update one of the mlx5_core handlers or add new event handling to the
system, the driver needs to be changed in many places in order to deliver
the new event to its consumer.

To improve this we will use an atomic_notifier_chain to fire firmware events
at internal mlx5 core components such as eswitch/fpga/clock/FW tracer/etc.
This avoids explicit calls from low-level mlx5_core to upper components
and simplifies the mlx5_core API for future development.

Provide register/unregister notifiers API and call the notifier chain on
firmware async events.

Example to subscribe to a FW event:

struct mlx5_nb port_event;

MLX5_NB_INIT(&port_event, port_event_handler, PORT_CHANGE);
mlx5_eq_notifier_register(mdev, &port_event);

Where:
  - port_event_handler is the notifier block callback.
  - PORT_CHANGE is the suffix of MLX5_EVENT_TYPE_PORT_CHANGE (the event
    type to subscribe to)

The above will guarantee that port_event_handler will receive all FW
events of the type MLX5_EVENT_TYPE_PORT_CHANGE.

To receive all FW/HW events one can subscribe to MLX5_EVENT_TYPE_NOTIFY_ANY.
 
There can be only 128 types of firmware events, each with its own 64-byte
EQE (Event Queue Element) data. We will have one atomic_notifier_chain
per event type for maximum performance and verbosity.
Each handler will receive the event_type as an unsigned long and the
event data as a void pointer, exactly as defined in the notifier block
handler prototype.
   
This API is implemented in the first patch of this series; all following
patches modify the existing mlx5 components to use the new API to
subscribe to FW events.

Thanks,
Saeed.

---

Saeed Mahameed (11):
  net/mlx5: EQ, Introduce atomic notifier chain subscription API
  net/mlx5: FWTrace, Use async events chain
  net/mlx5: FPGA, Use async events chain
  net/mlx5: Clock, Use async events chain
  net/mlx5: E-Switch, Use async events chain
  net/mlx5: FWPage, Use async events chain
  net/mlx5: CmdIF, Use async events chain
  net/mlx5: Resource tables, Use async events chain
  net/mlx5: CQ ERR, Use async events chain
  net/mlx5: Device events, Use async events chain
  net/mlx5: Improve core device events handling

 .../net/ethernet/mellanox/mlx5/core/Makefile  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c |  48 ++-
 .../mellanox/mlx5/core/diag/fw_tracer.c   |  27 +-
 .../mellanox/mlx5/core/diag/fw_tracer.h   |   2 +-
 .../ethernet/mellanox/mlx5/core/en_stats.c|   9 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c  | 322 +
 .../net/ethernet/mellanox/mlx5/core/eswitch.c |  44 ++-
 .../net/ethernet/mellanox/mlx5/core/eswitch.h |   3 +-
 .../net/ethernet/mellanox/mlx5/core/events.c  | 332 ++
 .../ethernet/mellanox/mlx5/core/fpga/core.c   |  38 +-
 .../ethernet/mellanox/mlx5/core/fpga/core.h   |  11 +-
 .../net/ethernet/mellanox/mlx5/core/health.c  |  25 +-
 .../ethernet/mellanox/mlx5/core/lib/clock.c   |  24 +-
 .../ethernet/mellanox/mlx5/core/lib/clock.h   |   3 -
 .../net/ethernet/mellanox/mlx5/core/lib/eq.h  |   5 +
 .../ethernet/mellanox/mlx5/core/lib/mlx5.h|  34 ++
 .../net/ethernet/mellanox/mlx5/core/main.c|  41 ++-
 .../ethernet/mellanox/mlx5/core/mlx5_core.h   |  13 +-
 .../ethernet/mellanox/mlx5/core/pagealloc.c   |  44 ++-
 .../net/ethernet/mellanox/mlx5/core/port.c|  57 ---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c  |  68 +++-
 drivers/net/ethernet/mellanox/mlx5/core/srq.c |  55 ++-
 include/linux/mlx5/device.h   |  10 +-
 include/linux/mlx5/driver.h   |  46 +--
 include/linux/mlx5/eq.h   |  16 +-
 include/linux/mlx5/port.h |   3 -
 26 files changed, 811 insertions(+), 471 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/events.c

-- 
2.19.1



[PATCH mlx5-next 09/11] net/mlx5: CQ ERR, Use async events chain

2018-11-20 Thread Saeed Mahameed
Remove the explicit call to mlx5_eq_cq_event on MLX5_EVENT_TYPE_CQ_ERROR
and register a specific CQ ERROR handler via the new API.

Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 66 +---
 1 file changed, 44 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c 
b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 0cf448575ebd..4e3febbf639d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -74,6 +74,9 @@ struct mlx5_eq_table {
 
struct atomic_notifier_head nh[MLX5_EVENT_TYPE_MAX];
 
+   /* Since CQ DB is stored in async_eq */
+   struct mlx5_nb  cq_err_nb;
+
struct mutex lock; /* sync async eqs creations */
int num_comp_vectors;
struct mlx5_irq_info *irq_info;
@@ -235,20 +238,6 @@ static struct mlx5_core_cq *mlx5_eq_cq_get(struct mlx5_eq 
*eq, u32 cqn)
return cq;
 }
 
-static void mlx5_eq_cq_event(struct mlx5_eq *eq, u32 cqn, int event_type)
-{
-   struct mlx5_core_cq *cq = mlx5_eq_cq_get(eq, cqn);
-
-   if (unlikely(!cq)) {
-   mlx5_core_warn(eq->dev, "Async event for bogus CQ 0x%x\n", cqn);
-   return;
-   }
-
-   cq->event(cq, event_type);
-
-   mlx5_cq_put(cq);
-}
-
 static irqreturn_t mlx5_eq_comp_int(int irq, void *eq_ptr)
 {
struct mlx5_eq_comp *eq_comp = eq_ptr;
@@ -323,7 +312,6 @@ static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
struct mlx5_core_dev *dev;
struct mlx5_eqe *eqe;
int set_ci = 0;
-   u32 cqn = -1;
u8 port;
 
dev = eq->dev;
@@ -358,12 +346,6 @@ static irqreturn_t mlx5_eq_async_int(int irq, void *eq_ptr)
   port, eqe->sub_type);
}
break;
-   case MLX5_EVENT_TYPE_CQ_ERROR:
-   cqn = be32_to_cpu(eqe->data.cq_err.cqn) & 0xff;
-   mlx5_core_warn(dev, "CQ error on CQN 0x%x, syndrome 
0x%x\n",
-  cqn, eqe->data.cq_err.syndrome);
-   mlx5_eq_cq_event(eq, cqn, eqe->type);
-   break;
 
case MLX5_EVENT_TYPE_PORT_MODULE_EVENT:
mlx5_port_module_event(dev, eqe);
@@ -639,6 +621,38 @@ static int destroy_async_eq(struct mlx5_core_dev *dev, 
struct mlx5_eq *eq)
return err;
 }
 
+static int cq_err_event_notifier(struct notifier_block *nb,
+unsigned long type, void *data)
+{
+   struct mlx5_eq_table *eqt;
+   struct mlx5_core_cq *cq;
+   struct mlx5_eqe *eqe;
+   struct mlx5_eq *eq;
+   u32 cqn;
+
+   /* type == MLX5_EVENT_TYPE_CQ_ERROR */
+
+   eqt = mlx5_nb_cof(nb, struct mlx5_eq_table, cq_err_nb);
+   eq  = &eqt->async_eq;
+   eqe = data;
+
+   cqn = be32_to_cpu(eqe->data.cq_err.cqn) & 0xff;
+   mlx5_core_warn(eq->dev, "CQ error on CQN 0x%x, syndrome 0x%x\n",
+  cqn, eqe->data.cq_err.syndrome);
+
+   cq = mlx5_eq_cq_get(eq, cqn);
+   if (unlikely(!cq)) {
+   mlx5_core_warn(eq->dev, "Async event for bogus CQ 0x%x\n", cqn);
+   return NOTIFY_OK;
+   }
+
+   cq->event(cq, type);
+
+   mlx5_cq_put(cq);
+
+   return NOTIFY_OK;
+}
+
 static u64 gather_async_events_mask(struct mlx5_core_dev *dev)
 {
u64 async_event_mask = MLX5_ASYNC_EVENT_MASK;
@@ -679,6 +693,9 @@ static int create_async_eqs(struct mlx5_core_dev *dev)
struct mlx5_eq_param param = {};
int err;
 
+   MLX5_NB_INIT(&table->cq_err_nb, cq_err_event_notifier, CQ_ERROR);
+   mlx5_eq_notifier_register(dev, &table->cq_err_nb);
+
param = (struct mlx5_eq_param) {
.index = MLX5_EQ_CMD_IDX,
.mask = 1ull << MLX5_EVENT_TYPE_CMD,
@@ -689,7 +706,7 @@ static int create_async_eqs(struct mlx5_core_dev *dev)
err = create_async_eq(dev, "mlx5_cmd_eq", &table->cmd_eq, &param);
if (err) {
mlx5_core_warn(dev, "failed to create cmd EQ %d\n", err);
-   return err;
+   goto err0;
}
 
mlx5_cmd_use_events(dev);
@@ -728,6 +745,8 @@ static int create_async_eqs(struct mlx5_core_dev *dev)
 err1:
mlx5_cmd_use_polling(dev);
destroy_async_eq(dev, &table->cmd_eq);
+err0:
+   mlx5_eq_notifier_unregister(dev, &table->cq_err_nb);
return err;
 }
 
@@ -745,12 +764,15 @@ static void destroy_async_eqs(struct mlx5_core_dev *dev)
if (err)
mlx5_core_err(dev, "failed to destroy async eq, err(%d)\n",
  err);
+
mlx5_cmd_use_polling(dev);
 
err = destroy_async_eq(dev, &table->cmd_eq);
if (err)
mlx5_core_err(dev, "failed to destroy command eq, err(%d)\n",
  err);
+
+   mlx5_eq_notifier_unregister(dev, 

Re: [PATCH bpf-next v2] libbpf: make sure bpf headers are c++ include-able

2018-11-20 Thread Y Song
On Tue, Nov 20, 2018 at 1:37 PM Stanislav Fomichev  wrote:
>
> Wrap headers in extern "C", to turn off C++ mangling.
> This simplifies including libbpf in c++ and linking against it.
>
> v2 changes:
> * do the same for btf.h
>
> Signed-off-by: Stanislav Fomichev 

Acked-by: Yonghong Song 

> ---
>  tools/lib/bpf/bpf.h| 9 +
>  tools/lib/bpf/btf.h| 8 
>  tools/lib/bpf/libbpf.h | 9 +
>  3 files changed, 26 insertions(+)


Re: [PATCH v1 net] lan743x: fix return value for lan743x_tx_napi_poll

2018-11-20 Thread Florian Fainelli
On 11/20/18 1:39 PM, bryan.whiteh...@microchip.com wrote:
>> -Original Message-
>> From: Andrew Lunn 
>> Sent: Tuesday, November 20, 2018 2:31 PM
>> To: Bryan Whitehead - C21958 
>> Cc: da...@davemloft.net; netdev@vger.kernel.org; UNGLinuxDriver
>> 
>> Subject: Re: [PATCH v1 net] lan743x: fix return value for
>> lan743x_tx_napi_poll
>>
>> On Tue, Nov 20, 2018 at 01:26:43PM -0500, Bryan Whitehead wrote:
>>> It has been noticed that under stress the lan743x driver will
>>> sometimes hang or cause a kernel panic. It has been noticed that
>>> returning '0' instead of 'weight' fixes this issue.
>>>
>>> fixes: rare kernel panic under heavy traffic load.
>>> Signed-off-by: Bryan Whitehead 
>>
>> Hi Bryan
>>
>> This sounds like a band aid over something which is broken, not a real fix.
>>
>> Can you show us the stack trace from the panic?
>>
>> Andrew
> 
> Andrew,
> 
> Admittedly, my knowledge of what the kernel is doing behind the scenes is 
> limited.
> 
> But according to documentation found on 
> https://wiki.linuxfoundation.org/networking/napi
> 
> It states the following
> "The poll() function may also process TX completions, in which case if it 
> processes
> the entire TX ring then it should count that work as the rest of the budget.
> Otherwise, TX completions are not counted."
> 
> So based on that, the original driver was returning the full budget. But I
> was having issues with it. And the above documentation seems to suggest
> that I could return 0, as in "not counted" above.
> 
> I tried it, and my lock up issues disappeared.
> 
> Regarding the kernel panic stack trace: so far it's very hard to replicate
> that on the latest kernel. I've seen it more frequently when backporting to
> older kernels such as 4.14 and 4.9. This same fix caused those kernel
> panics to disappear.
> Are you interested in seeing a stack dump from older kernels?
> 
> In the latest kernel the issue manifests as a kernel message which states
> "[  945.021101] enp48s0: Budget exhausted after napi rescheduled"
> 
> I'm not sure what that means. But it does not lock up immediately after
> seeing that message; it usually locks up within a minute of seeing it.
> 
> And then sometimes I get the following warning:
> [ 1240.425020] [ cut here ]
> [ 1240.426014] NETDEV WATCHDOG: enp0s25 (e1000e): transmit queue 0 timed out
> [ 1240.430027] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:461 
> dev_watchdog+0x1ef/0x200
> [ 1240.430027] Modules linked in: lan743x
> [ 1240.430027] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G  I   
> 4.19.2 #1
> [ 1240.430027] Hardware name: Hewlett-Packard HP Compaq dc7900 Convertible 
> Minitower/3032h, BIOS 786G1 v01.16 03/05/2009
> [ 1240.430027] RIP: 0010:dev_watchdog+0x1ef/0x200
> [ 1240.430027] Code: 00 48 63 4d e0 eb 93 4c 89 e7 c6 05 68 30 b3 00 01 e8 25 
> 3d fd ff 89 d9 48 89 c2 4c 89 e6 48 c7 c7 98 92 48 ab e8 f1 28 87 ff <0f> 0b 
> eb c0 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 c7 47 08 00
> [ 1240.430027] RSP: 0018:98490be03e90 EFLAGS: 00010282
> [ 1240.430027] RAX:  RBX:  RCX: 
> 
> [ 1240.497168] RDX: 00040400 RSI: 00f6 RDI: 
> 0300
> [ 1240.497168] RBP: 984908574440 R08:  R09: 
> 03a4
> [ 1240.497168] R10: 0020 R11: abc928ed R12: 
> 984908574000
> [ 1240.497168] R13:  R14:  R15: 
> 98490be195b0
> [ 1240.497168] FS:  () GS:98490be0() 
> knlGS:
> [ 1240.497168] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1240.497168] CR2: 7f31cd4c CR3: 000109bca000 CR4: 
> 000406f0
> [ 1240.497168] Call Trace:
> [ 1240.497168]  
> [ 1240.497168]  ? qdisc_reset+0xe0/0xe0
> [ 1240.497168]  call_timer_fn+0x26/0x130
> [ 1240.497168]  run_timer_softirq+0x1cd/0x400
> [ 1240.497168]  ? hpet_interrupt_handler+0x10/0x30
> [ 1240.497168]  __do_softirq+0xed/0x2aa
> [ 1240.497168]  irq_exit+0xb7/0xc0
> [ 1240.497168]  do_IRQ+0x45/0xd0
> [ 1240.497168]  common_interrupt+0xf/0xf
> [ 1240.497168]  
> [ 1240.497168] RIP: 0010:cpuidle_enter_state+0xa6/0x330
> [ 1240.497168] Code: 65 8b 3d 1d b0 4d 55 e8 58 6a 95 ff 48 89 c3 66 66 66 66 
> 90 31 ff e8 59 73 95 ff 80 7c 24 0b 00 0f 85 25 02 00 00 fb 4c 29 eb <48> ba 
> cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 ea b8 ff
> [ 1240.497168] RSP: 0018:ab603e60 EFLAGS: 0216 ORIG_RAX: 
> ffde
> [ 1240.497168] RAX: 98490be20a80 RBX: 0081035c RCX: 
> 0120cf178c49
> [ 1240.497168] RDX: 0120cf178ca0 RSI: 0120cf178ca0 RDI: 
> 
> [ 1240.497168] RBP: 984908fbd000 R08: fffb58ea5f9e R09: 
> 01208e0b48df
> [ 1240.497168] R10: 18c4 R11: 2468 R12: 
> 0002
> [ 1240.497168] R13: 0120ce968944 R14: ab6a68a0 R15: 
> ab611740
> [ 

[PATCH bpf-next v2] bpf: fix a compilation error when CONFIG_BPF_SYSCALL is not defined

2018-11-20 Thread Yonghong Song
Kernel test robot (l...@intel.com) reports a compilation error at
  https://www.spinics.net/lists/netdev/msg534913.html
introduced by commit 838e96904ff3 ("bpf: Introduce bpf_func_info").

If CONFIG_BPF is defined and CONFIG_BPF_SYSCALL is not defined,
the following error will appear:
  kernel/bpf/core.c:414: undefined reference to `btf_type_by_id'
  kernel/bpf/core.c:415: undefined reference to `btf_name_by_offset'

When CONFIG_BPF_SYSCALL is not defined,
let us define stub inline functions for btf_type_by_id()
and btf_name_by_offset() in include/linux/btf.h.
This way, the compilation failure can be avoided.

Fixes: 838e96904ff3 ("bpf: Introduce bpf_func_info")
Reported-by: kbuild test robot 
Cc: Martin KaFai Lau 
Signed-off-by: Yonghong Song 
---
 include/linux/btf.h | 14 ++
 1 file changed, 14 insertions(+)

Changelog:
  v1 -> v2:
. Two functions should be static inline functions
  if CONFIG_BPF_SYSCALL is not defined.

diff --git a/include/linux/btf.h b/include/linux/btf.h
index 7f2c0a4a45ea..8c2199b5d250 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -46,7 +46,21 @@ void btf_type_seq_show(const struct btf *btf, u32 type_id, 
void *obj,
   struct seq_file *m);
 int btf_get_fd_by_id(u32 id);
 u32 btf_id(const struct btf *btf);
+
+#ifdef CONFIG_BPF_SYSCALL
 const struct btf_type *btf_type_by_id(const struct btf *btf, u32 type_id);
 const char *btf_name_by_offset(const struct btf *btf, u32 offset);
+#else
+static inline const struct btf_type *btf_type_by_id(const struct btf *btf,
+   u32 type_id)
+{
+   return NULL;
+}
+static inline const char *btf_name_by_offset(const struct btf *btf,
+u32 offset)
+{
+   return NULL;
+}
+#endif
 
 #endif
-- 
2.17.1


