pull-request: bpf-next 2018-10-08

2018-10-08 Thread Alexei Starovoitov
Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) sk_lookup_[tcp|udp] and sk_release helpers from Joe Stringer, which allow
BPF programs to perform socket lookups in a network namespace. This lets a
program determine early on in processing whether the stack is expecting to
receive the packet, and perform some action (e.g. drop, forward somewhere)
based on this information (see the illustrative sketch after this list).

2) per-cpu cgroup local storage from Roman Gushchin.
Per-cpu cgroup local storage is very similar to simple cgroup storage
except that all the data is per-cpu. The main goal of the per-cpu variant is
to implement super fast counters (e.g. packet counters), which require
neither lookups nor atomic operations in the fast path.
An example of these hybrid counters is in selftests/bpf/netcnt_prog.c
(a minimal sketch follows after this list).

3) allow HW offload of programs with BPF-to-BPF function calls from Quentin 
Monnet

4) support more than 64-byte key/value in HW offloaded BPF maps from Jakub 
Kicinski

5) rename of libbpf interfaces from Andrey Ignatov.
libbpf is maturing as a library and should follow good practices in
library design and implementation to play well with other libraries.
This patch set brings a consistent naming convention to global symbols.

6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause from Alexei Starovoitov
to let Apache2 projects use libbpf

7) various AF_XDP fixes from Björn and Magnus
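
As a rough illustration of 1), here is a minimal sketch of how a tc
classifier program might use the new helpers. It is not taken from the
series; the section name, include paths and the zeroed tuple are
assumptions for the example, and the netns/flags arguments are simply
left at 0 (see the helper documentation added by the series for their
exact semantics).

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"	/* assumed: selftests-style helper header */

SEC("classifier")
int drop_unexpected(struct __sk_buff *skb)
{
	struct bpf_sock_tuple tuple = {};
	struct bpf_sock *sk;

	/* A real program would fill the tuple from the parsed packet
	 * headers; it is left zeroed here for brevity.
	 */
	sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), 0, 0);
	if (!sk)
		return TC_ACT_SHOT;	/* no local socket expects this packet */

	/* The verifier's new reference tracking requires the lookup
	 * reference to be released on every path that holds it.
	 */
	bpf_sk_release(sk);
	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";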

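And for 2), a minimal sketch in the spirit of the hybrid counters in
selftests/bpf/netcnt_prog.c; the map and program names here are
illustrative only, and the real selftest keeps a richer per-cpu
structure that is periodically flushed into a shared counter.

#include <linux/bpf.h>
#include "bpf_helpers.h"	/* assumed: selftests-style helper header */

struct bpf_map_def SEC("maps") percpu_cnt = {
	.type		= BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
	.key_size	= sizeof(struct bpf_cgroup_storage_key),
	.value_size	= sizeof(__u64),
	.max_entries	= 0,	/* cgroup storage is bound to the attachment, not keyed */
};

SEC("cgroup/skb")
int count_packets(struct __sk_buff *skb)
{
	__u64 *cnt;

	/* No map lookup and no atomics: the helper returns this CPU's
	 * slot for the current cgroup directly.
	 */
	cnt = bpf_get_local_storage(&percpu_cnt, 0);
	(*cnt)++;

	return 1;	/* accept the packet */
}

char _license[] SEC("license") = "GPL";
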
Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!



The following changes since commit 1042caa79e9351b81ed19dc8d2d7fd6ff51a4422:

  net-ipv4: remove 2 always zero parameters from ipv4_redirect() (2018-09-26 
20:30:55 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to df3f94a0bbeb6cb6a02eb16b8e76f16b33cb2f8f:

  bpf: fix building without CONFIG_INET (2018-10-09 00:49:44 +0200)


Alexei Starovoitov (1):
  libbpf: relicense libbpf as LGPL-2.1 OR BSD-2-Clause

Andrey Ignatov (12):
  bpftool: Fix bpftool net output
  libbpf: Introduce libbpf_attach_type_by_name
  libbpf: Support cgroup_skb/{e,in}gress section names
  libbpf: Support sk_skb/stream_{parser, verdict} section names
  selftests/bpf: Use libbpf_attach_type_by_name in test_socket_cookie
  selftests/bpf: Test libbpf_{prog,attach}_type_by_name
  libbpf: Move __dump_nlmsg_t from API to implementation
  libbpf: Consistent prefixes for interfaces in libbpf.h.
  libbpf: Consistent prefixes for interfaces in nlattr.h.
  libbpf: Consistent prefixes for interfaces in str_error.h.
  libbpf: Make include guards consistent
  libbpf: Use __u32 instead of u32 in bpf_program__load

Arnd Bergmann (1):
  bpf: fix building without CONFIG_INET

Arthur Fabre (1):
  bpf, doc: Document Jump X addressing mode

Björn Töpel (1):
  xsk: proper AF_XDP socket teardown ordering

Bo YU (1):
  bpf, tracex3_user: erase "ARRAY_SIZE" redefined

Daniel Borkmann (7):
  Merge branch 'bpf-libbpf-attach-by-name'
  Merge branch 'bpf-per-cpu-cgroup-storage'
  Merge branch 'bpf-big-map-entries'
  Merge branch 'bpf-sk-lookup'
  Merge branch 'bpf-libbpf-consistent-iface'
  Merge branch 'bpf-xsk-fix-mixed-mode'
  Merge branch 'bpf-to-bpf-calls-nfp'

Jakub Kicinski (5):
  nfp: bpf: parse global BPF ABI version capability
  nfp: allow apps to request larger MTU on control vNIC
  nfp: bpf: allow control message sizing for map ops
  ethtool: rename local variable max -> curr
  ethtool: don't allow disabling queues with umem installed

Joe Stringer (14):
  bpf: Add iterator for spilled registers
  bpf: Simplify ptr_min_max_vals adjustment
  bpf: Reuse canonical string formatter for ctx errs
  bpf: Generalize ptr_or_null regs check
  bpf: Add PTR_TO_SOCKET verifier type
  bpf: Macrofy stack state copy
  bpf: Add reference tracking to verifier
  bpf: Add helper to retrieve socket in BPF
  selftests/bpf: Generalize dummy program types
  selftests/bpf: Add tests for reference tracking
  libbpf: Support loading individual progs
  selftests/bpf: Add C tests for reference tracking
  Documentation: Describe bpf reference tracking
  net: core: Fix build with CONFIG_IPV6=m

Konrad Djimeli (1):
  bpf: typo fix in Documentation/networking/af_xdp.rst

Magnus Karlsson (3):
  net: add umem reference in netdev{_rx}_queue
  xsk: fix bug when trying to use both copy and zero-copy on one queue id
  xsk: simplify xdp_clear_umem_at_qid implementation

Quentin Monnet (12):
  bpf: add verifier callback to get stack usage info for offloaded progs
  nfp: bpf: rename nfp_prog->stack_depth as nfp_prog->stack_frame_depth
  nfp: 

Re: [PATCH v2 ipsec] Clear secpath on loopback_xmit

2018-10-08 Thread Steffen Klassert
On Mon, Oct 08, 2018 at 11:13:36AM -0700, Benedict Wong wrote:
> This patch clears the skb->sp when transmitted over loopback. This
> ensures that the loopback-ed packet does not have any secpath
> information from the outbound transforms.
> 
> At present, this causes XFRM tunnel mode packets to be dropped with
> XFRMINNOPOLS, due to the outbound state being in the secpath, without
> a matching inbound policy. Clearing the secpath ensures that all states
> added to the secpath are exclusively from the inbound processing.
> 
> Tests: xfrm tunnel mode tests added for loopback:
> https://android-review.googlesource.com/c/kernel/tests/+/777328
> Fixes: 8fe7ee2ba983 ("[IPsec]: Strengthen policy checks")

The commit you reference is from 2003; it predates our git history.
The Fixes tag is mainly there to ensure proper backporting, and a
commit that can't be found in the mainline git repo does not help
much with that.

If this bug was really introduced in pre git times,
use the very first git commit in our history.

It should look like:

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

But I doubt that this commit introduced the bug. We started to
use outbound secpaths when we added offloading support, so I
think one of the offloading patches introduced it.

Do you have CONFIG_INET_ESP_OFFLOAD or CONFIG_INET6_ESP_OFFLOAD
enabled? Do you see the bug without these two config options
enabled?

> Signed-off-by: Benedict Wong 
> ---
>  drivers/net/loopback.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
> index 30612497643c..a6bf54df94bd 100644
> --- a/drivers/net/loopback.c
> +++ b/drivers/net/loopback.c
> @@ -50,6 +50,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include   /* For the statistics structure. */
>  #include /* For ARPHRD_ETHER */
>  #include 
> @@ -82,6 +83,9 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
>*/
>   skb_dst_force(skb);
>  
> + // Clear secpath to ensure xfrm policy check not tainted by outbound 
> SAs.

Please align your comment to the format of the other comments in this file.
It should look like this:

/* Clear secpath to ensure xfrm policy check not tainted by
 * outbound SAs.
 */

> + secpath_reset(skb);
> +
>   skb->protocol = eth_type_trans(skb, dev);
>  
>   /* it's OK to use per_cpu_ptr() because BHs are off */
> -- 
> 2.19.0.605.g01d371f741-goog


Re: [PATCH net-next] net/ipv6: stop leaking percpu memory in fib6 info

2018-10-08 Thread Mike Rapoport
On Mon, Oct 08, 2018 at 12:15:54PM -0600, David Ahern wrote:
> On 10/8/18 6:06 AM, Mike Rapoport wrote:
> > The fib6_info_alloc() function allocates percpu memory to hold per CPU
> > pointers to rt6_info, but this memory is never freed. Fix it.
> > 
> > Fixes: a64efe142f5e ("net/ipv6: introduce fib6_info struct and helpers")
> > 
> > Signed-off-by: Mike Rapoport 
> > Cc: sta...@vger.kernel.org
> > ---
> >  net/ipv6/ip6_fib.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> > index cf709eadc932..cc7de7eb8b9c 100644
> > --- a/net/ipv6/ip6_fib.c
> > +++ b/net/ipv6/ip6_fib.c
> > @@ -194,6 +194,8 @@ void fib6_info_destroy_rcu(struct rcu_head *head)
> > *ppcpu_rt = NULL;
> > }
> > }
> > +
> > +   free_percpu(f6i->rt6i_pcpu);
> > }
> >  
> > lwtstate_put(f6i->fib6_nh.nh_lwtstate);
> > 
> 
> Odd that KMEMLEAK is not detecting this. Thanks for the fix.

There's a comment in kmemleak that says:

/*
 * Percpu allocations are only scanned and not reported as leaks
 * (min_count is set to 0).
 */

No idea why, though...
 
> Reviewed-by: David Ahern 
> 

-- 
Sincerely yours,
Mike.



[PATCH net] net/ipv6: stop leaking percpu memory in fib6 info

2018-10-08 Thread Mike Rapoport
The fib6_info_alloc() function allocates percpu memory to hold per CPU
pointers to rt6_info, but this memory is never freed. Fix it.

Fixes: a64efe142f5e ("net/ipv6: introduce fib6_info struct and helpers")
Signed-off-by: Mike Rapoport 
Reviewed-by: David Ahern 
---
 net/ipv6/ip6_fib.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 5516f55..cbe4617 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -196,6 +196,8 @@ void fib6_info_destroy_rcu(struct rcu_head *head)
*ppcpu_rt = NULL;
}
}
+
+   free_percpu(f6i->rt6i_pcpu);
}
 
lwtstate_put(f6i->fib6_nh.nh_lwtstate);
-- 
2.7.4



[PATCH net-next v2] net: tun: remove useless codes of tun_automq_select_queue

2018-10-08 Thread Wang Li
The function __skb_get_hash_symmetric() always returns a non-zero hash,
so the fallback code paths that handle a zero hash are never taken;
remove them.

Signed-off-by: Zhang Yu 
Signed-off-by: Wang Li 
---
 drivers/net/tun.c | 35 +--
 1 file changed, 13 insertions(+), 22 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e2648b5a3861..dc72f60578e5 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -562,12 +562,11 @@ static inline void tun_flow_save_rps_rxhash(struct 
tun_flow_entry *e, u32 hash)
e->rps_rxhash = hash;
 }
 
-/* We try to identify a flow through its rxhash first. The reason that
+/* We try to identify a flow through its rxhash. The reason that
  * we do not check rxq no. is because some cards(e.g 82599), chooses
  * the rxq based on the txq where the last packet of the flow comes. As
  * the userspace application move between processors, we may get a
- * different rxq no. here. If we could not get rxhash, then we would
- * hope the rxq no. may help here.
+ * different rxq no. here.
  */
 static u16 tun_automq_select_queue(struct tun_struct *tun, struct sk_buff *skb)
 {
@@ -578,18 +577,13 @@ static u16 tun_automq_select_queue(struct tun_struct 
*tun, struct sk_buff *skb)
numqueues = READ_ONCE(tun->numqueues);
 
txq = __skb_get_hash_symmetric(skb);
-   if (txq) {
-   e = tun_flow_find(&tun->flows[tun_hashfn(txq)], txq);
-   if (e) {
-   tun_flow_save_rps_rxhash(e, txq);
-   txq = e->queue_index;
-   } else
-   /* use multiply and shift instead of expensive divide */
-   txq = ((u64)txq * numqueues) >> 32;
-   } else if (likely(skb_rx_queue_recorded(skb))) {
-   txq = skb_get_rx_queue(skb);
-   while (unlikely(txq >= numqueues))
-   txq -= numqueues;
+   e = tun_flow_find(&tun->flows[tun_hashfn(txq)], txq);
+   if (e) {
+   tun_flow_save_rps_rxhash(e, txq);
+   txq = e->queue_index;
+   } else {
+   /* use multiply and shift instead of expensive divide */
+   txq = ((u64)txq * numqueues) >> 32;
}
 
return txq;
@@ -1044,16 +1038,13 @@ static void tun_automq_xmit(struct tun_struct *tun, 
struct sk_buff *skb)
/* Select queue was not called for the skbuff, so we extract the
 * RPS hash and save it into the flow_table here.
 */
+   struct tun_flow_entry *e;
__u32 rxhash;
 
rxhash = __skb_get_hash_symmetric(skb);
-   if (rxhash) {
-   struct tun_flow_entry *e;
-   e = tun_flow_find(&tun->flows[tun_hashfn(rxhash)],
-   rxhash);
-   if (e)
-   tun_flow_save_rps_rxhash(e, rxhash);
-   }
+   e = tun_flow_find(&tun->flows[tun_hashfn(rxhash)], rxhash);
+   if (e)
+   tun_flow_save_rps_rxhash(e, rxhash);
}
 #endif
 }
-- 
2.15.2 (Apple Git-101.1)



[PATCH net-next] rxrpc: Remove set but not used variable 'ioc'

2018-10-08 Thread YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning:

net/rxrpc/output.c: In function 'rxrpc_reject_packets':
net/rxrpc/output.c:527:11: warning:
 variable 'ioc' set but not used [-Wunused-but-set-variable]

It has never been used since its introduction in
commit ece64fec164f ("rxrpc: Emit BUSY packets when supposed to rather than
ABORTs").

Signed-off-by: YueHaibing 
---
 net/rxrpc/output.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/net/rxrpc/output.c b/net/rxrpc/output.c
index e8fb892..f3ed16a 100644
--- a/net/rxrpc/output.c
+++ b/net/rxrpc/output.c
@@ -524,7 +524,7 @@ void rxrpc_reject_packets(struct rxrpc_local *local)
struct kvec iov[2];
size_t size;
__be32 code;
-   int ret, ioc;
+   int ret;
 
_enter("%d", local->debug_id);
 
@@ -548,13 +548,11 @@ void rxrpc_reject_packets(struct rxrpc_local *local)
case RXRPC_SKB_MARK_REJECT_BUSY:
whdr.type = RXRPC_PACKET_TYPE_BUSY;
size = sizeof(whdr);
-   ioc = 1;
break;
case RXRPC_SKB_MARK_REJECT_ABORT:
whdr.type = RXRPC_PACKET_TYPE_ABORT;
code = htonl(skb->priority);
size = sizeof(whdr) + sizeof(code);
-   ioc = 2;
break;
default:
rxrpc_free_skb(skb, rxrpc_skb_rx_freed);



[PATCH net-next 3/3] nfp: flower: use host context count provided by firmware

2018-10-08 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

Read the host context count symbol provided by firmware and use
it to determine the number of allocated stats ids. Previously it
was not possible to offload more than 2^17 filters even if the FW
was able to do so.

Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Jakub Kicinski 
---
 .../net/ethernet/netronome/nfp/flower/main.c   | 15 +--
 .../net/ethernet/netronome/nfp/flower/main.h   |  9 +
 .../ethernet/netronome/nfp/flower/metadata.c   | 18 +-
 3 files changed, 27 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c 
b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 34b0c3602ab2..255a0d28ea15 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -518,8 +518,8 @@ static int nfp_flower_vnic_init(struct nfp_app *app, struct 
nfp_net *nn)
 static int nfp_flower_init(struct nfp_app *app)
 {
const struct nfp_pf *pf = app->pf;
+   u64 version, features, ctx_count;
struct nfp_flower_priv *app_priv;
-   u64 version, features;
int err;
 
if (!pf->eth_tbl) {
@@ -543,6 +543,16 @@ static int nfp_flower_init(struct nfp_app *app)
return err;
}
 
+   ctx_count = nfp_rtsym_read_le(app->pf->rtbl, "CONFIG_FC_HOST_CTX_COUNT",
+ &err);
+   if (err) {
+   nfp_warn(app->cpp,
+"FlowerNIC: unsupported host context count: %d\n",
+err);
+   err = 0;
+   ctx_count = BIT(17);
+   }
+
/* We need to ensure hardware has enough flower capabilities. */
if (version != NFP_FLOWER_ALLOWED_VER) {
nfp_warn(app->cpp, "FlowerNIC: unsupported firmware version\n");
@@ -553,6 +563,7 @@ static int nfp_flower_init(struct nfp_app *app)
if (!app_priv)
return -ENOMEM;
 
+   app_priv->stats_ring_size = roundup_pow_of_two(ctx_count);
app->priv = app_priv;
app_priv->app = app;
skb_queue_head_init(&app_priv->cmsg_skbs_high);
@@ -563,7 +574,7 @@ static int nfp_flower_init(struct nfp_app *app)
init_waitqueue_head(&app_priv->mtu_conf.wait_q);
spin_lock_init(&app_priv->mtu_conf.lock);
 
-   err = nfp_flower_metadata_init(app);
+   err = nfp_flower_metadata_init(app, ctx_count);
if (err)
goto err_free_app_priv;
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h 
b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 21a167df90c1..037aa9553597 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -51,9 +51,8 @@ struct net_device;
 struct nfp_app;
 
 #define NFP_FL_STATS_CTX_DONT_CARE cpu_to_be32(0x)
-#define NFP_FL_STATS_ENTRY_RS  BIT(20)
-#define NFP_FL_STATS_ELEM_RS   4
-#define NFP_FL_REPEATED_HASH_MAX   BIT(17)
+#define NFP_FL_STATS_ELEM_RS   FIELD_SIZEOF(struct nfp_fl_stats_id, \
+init_unalloc)
 #define NFP_FLOWER_MASK_ENTRY_RS   256
 #define NFP_FLOWER_MASK_ELEMENT_RS 1
 #define NFP_FLOWER_MASK_HASH_BITS  10
@@ -138,6 +137,7 @@ struct nfp_fl_lag {
  * @stats_ids: List of free stats ids
  * @mask_ids:  List of free mask ids
  * @mask_table:Hash table used to store masks
+ * @stats_ring_size:   Maximum number of allowed stats ids
  * @flow_table:Hash table used to store flower rules
  * @stats: Stored stats updates for flower rules
  * @stats_lock:Lock for flower rule stats updates
@@ -174,6 +174,7 @@ struct nfp_flower_priv {
struct nfp_fl_stats_id stats_ids;
struct nfp_fl_mask_id mask_ids;
DECLARE_HASHTABLE(mask_table, NFP_FLOWER_MASK_HASH_BITS);
+   u32 stats_ring_size;
struct rhashtable flow_table;
struct nfp_fl_stats *stats;
spinlock_t stats_lock; /* lock stats */
@@ -253,7 +254,7 @@ struct nfp_fl_stats_frame {
__be64 stats_cookie;
 };
 
-int nfp_flower_metadata_init(struct nfp_app *app);
+int nfp_flower_metadata_init(struct nfp_app *app, u64 host_ctx_count);
 void nfp_flower_metadata_cleanup(struct nfp_app *app);
 
 int nfp_flower_setup_tc(struct nfp_app *app, struct net_device *netdev,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/metadata.c 
b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
index f0db7f9122d2..a4cce9a30830 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/metadata.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
@@ -61,14 +61,14 @@ static int nfp_release_stats_entry(struct nfp_app *app, u32 
stats_context_id)
 
ring = &priv->stats_ids.free_list;
/* Check if buffer is full. */
-   if (!CIRC_SPACE(ring->head, ring->tail, NFP_FL_STATS_ENTRY_RS *
-   

[PATCH net-next 2/3] nfp: flower: use stats array instead of storing stats per flow

2018-10-08 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

Make use of a stats array instead of storing the stats per flow, which
would require a hash lookup at critical times.

Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Jakub Kicinski 
---
 .../net/ethernet/netronome/nfp/flower/main.h  |  6 +-
 .../ethernet/netronome/nfp/flower/metadata.c  | 56 +--
 .../ethernet/netronome/nfp/flower/offload.c   | 19 ---
 3 files changed, 40 insertions(+), 41 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h 
b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 8b2b656da7ca..21a167df90c1 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -139,6 +139,8 @@ struct nfp_fl_lag {
  * @mask_ids:  List of free mask ids
  * @mask_table:Hash table used to store masks
  * @flow_table:Hash table used to store flower rules
+ * @stats: Stored stats updates for flower rules
+ * @stats_lock:Lock for flower rule stats updates
  * @cmsg_work: Workqueue for control messages processing
  * @cmsg_skbs_high:List of higher priority skbs for control message
  * processing
@@ -173,6 +175,8 @@ struct nfp_flower_priv {
struct nfp_fl_mask_id mask_ids;
DECLARE_HASHTABLE(mask_table, NFP_FLOWER_MASK_HASH_BITS);
struct rhashtable flow_table;
+   struct nfp_fl_stats *stats;
+   spinlock_t stats_lock; /* lock stats */
struct work_struct cmsg_work;
struct sk_buff_head cmsg_skbs_high;
struct sk_buff_head cmsg_skbs_low;
@@ -232,8 +236,6 @@ struct nfp_fl_payload {
unsigned long tc_flower_cookie;
struct rhash_head fl_node;
struct rcu_head rcu;
-   spinlock_t lock; /* lock stats */
-   struct nfp_fl_stats stats;
__be32 nfp_tun_ipv4_addr;
struct net_device *ingress_dev;
char *unmasked_data;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/metadata.c 
b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
index 2427c994c91d..f0db7f9122d2 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/metadata.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
@@ -119,42 +119,26 @@ nfp_flower_search_fl_table(struct nfp_app *app, unsigned 
long tc_flower_cookie,
  nfp_flower_table_params);
 }
 
-static void
-nfp_flower_update_stats(struct nfp_app *app, struct nfp_fl_stats_frame *stats)
-{
-   struct nfp_fl_payload *nfp_flow;
-   unsigned long flower_cookie;
-
-   flower_cookie = be64_to_cpu(stats->stats_cookie);
-
-   rcu_read_lock();
-   nfp_flow = nfp_flower_search_fl_table(app, flower_cookie, NULL,
- stats->stats_con_id);
-   if (!nfp_flow)
-   goto exit_rcu_unlock;
-
-   spin_lock(&nfp_flow->lock);
-   nfp_flow->stats.pkts += be32_to_cpu(stats->pkt_count);
-   nfp_flow->stats.bytes += be64_to_cpu(stats->byte_count);
-   nfp_flow->stats.used = jiffies;
-   spin_unlock(&nfp_flow->lock);
-
-exit_rcu_unlock:
-   rcu_read_unlock();
-}
-
 void nfp_flower_rx_flow_stats(struct nfp_app *app, struct sk_buff *skb)
 {
unsigned int msg_len = nfp_flower_cmsg_get_data_len(skb);
-   struct nfp_fl_stats_frame *stats_frame;
+   struct nfp_flower_priv *priv = app->priv;
+   struct nfp_fl_stats_frame *stats;
unsigned char *msg;
+   u32 ctx_id;
int i;
 
msg = nfp_flower_cmsg_get_data(skb);
 
-   stats_frame = (struct nfp_fl_stats_frame *)msg;
-   for (i = 0; i < msg_len / sizeof(*stats_frame); i++)
-   nfp_flower_update_stats(app, stats_frame + i);
+   spin_lock(&priv->stats_lock);
+   for (i = 0; i < msg_len / sizeof(*stats); i++) {
+   stats = (struct nfp_fl_stats_frame *)msg + i;
+   ctx_id = be32_to_cpu(stats->stats_con_id);
+   priv->stats[ctx_id].pkts += be32_to_cpu(stats->pkt_count);
+   priv->stats[ctx_id].bytes += be64_to_cpu(stats->byte_count);
+   priv->stats[ctx_id].used = jiffies;
+   }
+   spin_unlock(&priv->stats_lock);
 }
 
 static int nfp_release_mask_id(struct nfp_app *app, u8 mask_id)
@@ -348,9 +332,9 @@ int nfp_compile_flow_metadata(struct nfp_app *app,
 
/* Update flow payload with mask ids. */
nfp_flow->unmasked_data[NFP_FL_MASK_ID_LOCATION] = new_mask_id;
-   nfp_flow->stats.pkts = 0;
-   nfp_flow->stats.bytes = 0;
-   nfp_flow->stats.used = jiffies;
+   priv->stats[stats_cxt].pkts = 0;
+   priv->stats[stats_cxt].bytes = 0;
+   priv->stats[stats_cxt].used = jiffies;
 
check_entry = nfp_flower_search_fl_table(app, flow->cookie, netdev,
 NFP_FL_STATS_CTX_DONT_CARE);
@@ -469,8 +453,17 @@ int nfp_flower_metadata_init(struct nfp_app *app)
 
priv->stats_ids.init_unalloc = 

[PATCH net-next 1/3] nfp: flower: use rhashtable for flow caching

2018-10-08 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

Make use of relativistic hash tables (rhashtable) for tracking flows
instead of fixed-size hash tables.

Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  5 --
 .../net/ethernet/netronome/nfp/flower/main.h  |  8 +-
 .../ethernet/netronome/nfp/flower/metadata.c  | 73 ---
 .../ethernet/netronome/nfp/flower/offload.c   | 12 ++-
 drivers/net/ethernet/netronome/nfp/nfp_app.c  |  5 ++
 drivers/net/ethernet/netronome/nfp/nfp_app.h  |  1 +
 6 files changed, 82 insertions(+), 22 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index d9d37aa860e0..28af263d4577 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -509,11 +509,6 @@ static int nfp_bpf_init(struct nfp_app *app)
return err;
 }
 
-static void nfp_check_rhashtable_empty(void *ptr, void *arg)
-{
-   WARN_ON_ONCE(1);
-}
-
 static void nfp_bpf_clean(struct nfp_app *app)
 {
struct nfp_app_bpf *bpf = app->priv;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h 
b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 5f27318ecdbd..8b2b656da7ca 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -38,6 +38,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -53,7 +54,6 @@ struct nfp_app;
 #define NFP_FL_STATS_ENTRY_RS  BIT(20)
 #define NFP_FL_STATS_ELEM_RS   4
 #define NFP_FL_REPEATED_HASH_MAX   BIT(17)
-#define NFP_FLOWER_HASH_BITS   19
 #define NFP_FLOWER_MASK_ENTRY_RS   256
 #define NFP_FLOWER_MASK_ELEMENT_RS 1
 #define NFP_FLOWER_MASK_HASH_BITS  10
@@ -172,7 +172,7 @@ struct nfp_flower_priv {
struct nfp_fl_stats_id stats_ids;
struct nfp_fl_mask_id mask_ids;
DECLARE_HASHTABLE(mask_table, NFP_FLOWER_MASK_HASH_BITS);
-   DECLARE_HASHTABLE(flow_table, NFP_FLOWER_HASH_BITS);
+   struct rhashtable flow_table;
struct work_struct cmsg_work;
struct sk_buff_head cmsg_skbs_high;
struct sk_buff_head cmsg_skbs_low;
@@ -230,7 +230,7 @@ struct nfp_fl_stats {
 struct nfp_fl_payload {
struct nfp_fl_rule_metadata meta;
unsigned long tc_flower_cookie;
-   struct hlist_node link;
+   struct rhash_head fl_node;
struct rcu_head rcu;
spinlock_t lock; /* lock stats */
struct nfp_fl_stats stats;
@@ -242,6 +242,8 @@ struct nfp_fl_payload {
bool ingress_offload;
 };
 
+extern const struct rhashtable_params nfp_flower_table_params;
+
 struct nfp_fl_stats_frame {
__be32 stats_con_id;
__be32 pkt_count;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/metadata.c 
b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
index c098730544b7..2427c994c91d 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/metadata.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
@@ -48,6 +48,12 @@ struct nfp_mask_id_table {
u8 mask_id;
 };
 
+struct nfp_fl_flow_table_cmp_arg {
+   struct net_device *netdev;
+   unsigned long cookie;
+   __be32 host_ctx;
+};
+
 static int nfp_release_stats_entry(struct nfp_app *app, u32 stats_context_id)
 {
struct nfp_flower_priv *priv = app->priv;
@@ -102,18 +108,15 @@ struct nfp_fl_payload *
 nfp_flower_search_fl_table(struct nfp_app *app, unsigned long tc_flower_cookie,
   struct net_device *netdev, __be32 host_ctx)
 {
+   struct nfp_fl_flow_table_cmp_arg flower_cmp_arg;
struct nfp_flower_priv *priv = app->priv;
-   struct nfp_fl_payload *flower_entry;
 
-   hash_for_each_possible_rcu(priv->flow_table, flower_entry, link,
-  tc_flower_cookie)
-   if (flower_entry->tc_flower_cookie == tc_flower_cookie &&
-   (!netdev || flower_entry->ingress_dev == netdev) &&
-   (host_ctx == NFP_FL_STATS_CTX_DONT_CARE ||
-flower_entry->meta.host_ctx_id == host_ctx))
-   return flower_entry;
+   flower_cmp_arg.netdev = netdev;
+   flower_cmp_arg.cookie = tc_flower_cookie;
+   flower_cmp_arg.host_ctx = host_ctx;
 
-   return NULL;
+   return rhashtable_lookup_fast(&priv->flow_table, &flower_cmp_arg,
+ nfp_flower_table_params);
 }
 
 static void
@@ -389,12 +392,56 @@ int nfp_modify_flow_metadata(struct nfp_app *app,
return nfp_release_stats_entry(app, temp_ctx_id);
 }
 
+static int nfp_fl_obj_cmpfn(struct rhashtable_compare_arg *arg,
+   const void *obj)
+{
+   const struct nfp_fl_flow_table_cmp_arg *cmp_arg = arg->key;
+   const struct nfp_fl_payload *flow_entry = obj;
+
+   if ((!cmp_arg->netdev || flow_entry->ingress_dev == cmp_arg->netdev) &&
+   (cmp_arg->host_ctx 

[PATCH net-next 0/3] nfp: flower: speed up stats update loop

2018-10-08 Thread Jakub Kicinski
Hi!

This set from Pieter improves performance of processing FW stats
update notifications.  The FW seems to send those at a relatively
high rate (roughly ten per second per flow); therefore, if we want
to approach the million-flow mark we have to be very careful
about our data structures.

We tried rhashtable for stat updates, but according to our experiments
an rhashtable lookup on a u32 key takes roughly 60 ns on a Xeon E5-2670 v3.
That translates to a hard limit of about 16M lookups per second on this CPU,
and, according to perf record, jhash and memcmp account for 60% of the CPU
usage on the core handling the updates.
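In other words, 1 s / 60 ns is roughly 16.7M lookups per second, while a
million flows at about ten updates per second per flow already means around
10M lookups per second, so the stats updates alone would consume most of a
core.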

Given that our statistic IDs are already array indices, and considering
each statistic is only 24B in size, we decided to forego the use
of hashtables and use a directly indexed array.  The CPU savings are
considerable.

With the recent improvements in TC core and with our own bottlenecks
out of the way Pieter removes the artificial limit of 128 flows, and
allows the driver to install as many flows as FW supports.

Pieter Jansen van Vuuren (3):
  nfp: flower: use rhashtable for flow caching
  nfp: flower: use stats array instead of storing stats per flow
  nfp: flower: use host context count provided by firmware

 drivers/net/ethernet/netronome/nfp/bpf/main.c |   5 -
 .../net/ethernet/netronome/nfp/flower/main.c  |  15 +-
 .../net/ethernet/netronome/nfp/flower/main.h  |  23 +--
 .../ethernet/netronome/nfp/flower/metadata.c  | 145 --
 .../ethernet/netronome/nfp/flower/offload.c   |  31 ++--
 drivers/net/ethernet/netronome/nfp/nfp_app.c  |   5 +
 drivers/net/ethernet/netronome/nfp/nfp_app.h  |   1 +
 7 files changed, 148 insertions(+), 77 deletions(-)

-- 
2.17.1



Re: [PATCH bpf-next 0/6] Error handling when map lookup isn't supported

2018-10-08 Thread David Miller
From: Prashant Bhole 
Date: Tue, 9 Oct 2018 09:02:15 +0900

> Thanks. Is there any reason this series did not get posted on
> netdev-list and can not be seen in the patchwork?

Bad timing.  I did something stupid on vger.kernel.org; it ran out of
disk space while the postings were being processed, and thus they were
lost.


Re: [PATCH bpf-next 1/6] bpf: rename stack trace map operations

2018-10-08 Thread Song Liu
Actually, does it make sense to implement a list_map that supports
both pop_head()
and pop_tail()? Maybe gate one of the pop operations with options?

I am asking because mixing stack with stack trace is still confusing
after this patch.

Thanks,
Song

On Mon, Oct 8, 2018 at 12:11 PM Mauricio Vasquez B
 wrote:
>
> In the following patches, queue and stack maps (FIFO and LIFO
> data structures) will be implemented.  In order to avoid confusion and
> a possible name clash, rename stack_map_ops to stack_trace_map_ops.
>
> Signed-off-by: Mauricio Vasquez B 
> ---
>  include/linux/bpf_types.h |2 +-
>  kernel/bpf/stackmap.c |2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 5432f4c9f50e..658509daacd4 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -51,7 +51,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_HASH, htab_lru_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_PERCPU_HASH, htab_lru_percpu_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_LPM_TRIE, trie_map_ops)
>  #ifdef CONFIG_PERF_EVENTS
> -BPF_MAP_TYPE(BPF_MAP_TYPE_STACK_TRACE, stack_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_STACK_TRACE, stack_trace_map_ops)
>  #endif
>  BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_of_maps_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops)
> diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
> index 8061a439ef18..bb41e293418d 100644
> --- a/kernel/bpf/stackmap.c
> +++ b/kernel/bpf/stackmap.c
> @@ -600,7 +600,7 @@ static void stack_map_free(struct bpf_map *map)
> put_callchain_buffers();
>  }
>
> -const struct bpf_map_ops stack_map_ops = {
> +const struct bpf_map_ops stack_trace_map_ops = {
> .map_alloc = stack_map_alloc,
> .map_free = stack_map_free,
> .map_get_next_key = stack_map_get_next_key,
>


Re: [PATCH bpf-next 4/6] bpf: add queue and stack maps

2018-10-08 Thread Song Liu
On Mon, Oct 8, 2018 at 12:12 PM Mauricio Vasquez B
 wrote:
>
> Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
> These maps support peek, pop and push operations that are exposed to eBPF
> programs through the new bpf_map[peek/pop/push] helpers.  Those operations
> are exposed to userspace applications through the already existing
> syscalls in the following way:
>
> BPF_MAP_LOOKUP_ELEM-> peek
> BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
> BPF_MAP_UPDATE_ELEM-> push
>
> Queue/stack maps are implemented using a buffer, tail and head indexes,
> hence BPF_F_NO_PREALLOC is not supported.
>
> As opposed to other maps, queue and stack do not use RCU for protecting
> map values; the bpf_map_[peek/pop] helpers have an ARG_PTR_TO_UNINIT_MAP_VALUE
> argument that is a pointer to a memory zone where the value of the map is
> saved.  It is basically the same as ARG_PTR_TO_UNINIT_MEM, but the size does
> not have to be passed as an extra argument.
>
> Our main motivation for implementing queue/stack maps was to keep track
> of a pool of elements, like network ports in a SNAT, however we foresee
> other use cases, like for example saving the last N kernel events in a map
> and then analysing them from userspace.
>
> Signed-off-by: Mauricio Vasquez B 
> ---
>  include/linux/bpf.h   |7 +
>  include/linux/bpf_types.h |2
>  include/uapi/linux/bpf.h  |   35 -
>  kernel/bpf/Makefile   |2
>  kernel/bpf/core.c |3
>  kernel/bpf/helpers.c  |   43 ++
>  kernel/bpf/queue_stack_maps.c |  288 
> +
>  kernel/bpf/syscall.c  |   30 +++-
>  kernel/bpf/verifier.c |   28 +++-
>  net/core/filter.c |6 +
>  10 files changed, 426 insertions(+), 18 deletions(-)
>  create mode 100644 kernel/bpf/queue_stack_maps.c
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 98c7eeb6d138..cad3bc5cffd1 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -40,6 +40,9 @@ struct bpf_map_ops {
> int (*map_update_elem)(struct bpf_map *map, void *key, void *value, 
> u64 flags);
> int (*map_delete_elem)(struct bpf_map *map, void *key);
> void *(*map_lookup_and_delete_elem)(struct bpf_map *map, void *key);
> +   int (*map_push_elem)(struct bpf_map *map, void *value, u64 flags);
> +   int (*map_pop_elem)(struct bpf_map *map, void *value);
> +   int (*map_peek_elem)(struct bpf_map *map, void *value);
>
> /* funcs called by prog_array and perf_event_array map */
> void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
> @@ -139,6 +142,7 @@ enum bpf_arg_type {
> ARG_CONST_MAP_PTR,  /* const argument used as pointer to bpf_map 
> */
> ARG_PTR_TO_MAP_KEY, /* pointer to stack used as map key */
> ARG_PTR_TO_MAP_VALUE,   /* pointer to stack used as map value */
> +   ARG_PTR_TO_UNINIT_MAP_VALUE,/* pointer to valid memory used to 
> store a map value */

How about we put ARG_PTR_TO_UNINIT_MAP_VALUE and related logic to a
separate patch?

>
> /* the following constraints used to prototype bpf_memcmp() and other
>  * functions that access data on eBPF program stack
> @@ -825,6 +829,9 @@ static inline int 
> bpf_fd_reuseport_array_update_elem(struct bpf_map *map,
>  extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
>  extern const struct bpf_func_proto bpf_map_update_elem_proto;
>  extern const struct bpf_func_proto bpf_map_delete_elem_proto;
> +extern const struct bpf_func_proto bpf_map_push_elem_proto;
> +extern const struct bpf_func_proto bpf_map_pop_elem_proto;
> +extern const struct bpf_func_proto bpf_map_peek_elem_proto;
>
>  extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
>  extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
> diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
> index 658509daacd4..a2ec73aa1ec7 100644
> --- a/include/linux/bpf_types.h
> +++ b/include/linux/bpf_types.h
> @@ -69,3 +69,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
>  BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
>  #endif
>  #endif
> +BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
> +BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 3bb94aa2d408..bfa042273fad 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -129,6 +129,8 @@ enum bpf_map_type {
> BPF_MAP_TYPE_CGROUP_STORAGE,
> BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
> BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
> +   BPF_MAP_TYPE_QUEUE,
> +   BPF_MAP_TYPE_STACK,
>  };
>
>  enum bpf_prog_type {
> @@ -463,6 +465,28 @@ union bpf_attr {
>   * Return
>   * 0 on success, or a negative error in case of failure.
>   *
> + * int bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
> + * Description
> + * 

Re: [PATCH bpf-next 3/6] bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall

2018-10-08 Thread Song Liu
On Mon, Oct 8, 2018 at 12:12 PM Mauricio Vasquez B
 wrote:
>
> The following patch implements bpf queue/stack maps that
> provide the peek/pop/push functions.  There is no direct
> relationship between those functions and the current map
> syscalls, hence a new MAP_LOOKUP_AND_DELETE_ELEM syscall is added;
> it is mapped to the pop operation in the queue/stack maps
> and is still to be implemented for other kinds of maps.

Do we need this system call for other maps (non-stack/queue)?
If not, maybe we can just call it POP, and only implement it for
stack and queue?

>
> Signed-off-by: Mauricio Vasquez B 
> ---
>  include/linux/bpf.h  |1 +
>  include/uapi/linux/bpf.h |1 +
>  kernel/bpf/syscall.c |   81 
> ++
>  3 files changed, 83 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 027697b6a22f..98c7eeb6d138 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -39,6 +39,7 @@ struct bpf_map_ops {
> void *(*map_lookup_elem)(struct bpf_map *map, void *key);
> int (*map_update_elem)(struct bpf_map *map, void *key, void *value, 
> u64 flags);
> int (*map_delete_elem)(struct bpf_map *map, void *key);
> +   void *(*map_lookup_and_delete_elem)(struct bpf_map *map, void *key);
>
> /* funcs called by prog_array and perf_event_array map */
> void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f9187b41dff6..3bb94aa2d408 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -103,6 +103,7 @@ enum bpf_cmd {
> BPF_BTF_LOAD,
> BPF_BTF_GET_FD_BY_ID,
> BPF_TASK_FD_QUERY,
> +   BPF_MAP_LOOKUP_AND_DELETE_ELEM,
>  };
>
>  enum bpf_map_type {
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index eb75e8af73ff..c33d9303f72f 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -975,6 +975,84 @@ static int map_get_next_key(union bpf_attr *attr)
> return err;
>  }
>
> +#define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
> +
> +static int map_lookup_and_delete_elem(union bpf_attr *attr)
> +{
> +   void __user *ukey = u64_to_user_ptr(attr->key);
> +   void __user *uvalue = u64_to_user_ptr(attr->value);
> +   int ufd = attr->map_fd;
> +   struct bpf_map *map;
> +   void *key, *value, *ptr;
> +   u32 value_size;
> +   struct fd f;
> +   int err;
> +
> +   if (CHECK_ATTR(BPF_MAP_LOOKUP_AND_DELETE_ELEM))
> +   return -EINVAL;
> +
> +   f = fdget(ufd);
> +   map = __bpf_map_get(f);
> +   if (IS_ERR(map))
> +   return PTR_ERR(map);
> +
> +   if (!(f.file->f_mode & FMODE_CAN_WRITE)) {
> +   err = -EPERM;
> +   goto err_put;
> +   }
> +
> +   key = __bpf_copy_key(ukey, map->key_size);
> +   if (IS_ERR(key)) {
> +   err = PTR_ERR(key);
> +   goto err_put;
> +   }
> +
> +   value_size = map->value_size;
> +
> +   err = -ENOMEM;
> +   value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
> +   if (!value)
> +   goto free_key;
> +
> +   err = -EFAULT;
> +   if (copy_from_user(value, uvalue, value_size) != 0)
> +   goto free_value;
> +
> +   /* must increment bpf_prog_active to avoid kprobe+bpf triggering from
> +* inside bpf map update or delete otherwise deadlocks are possible
> +*/
> +   preempt_disable();
> +   __this_cpu_inc(bpf_prog_active);
> +   if (!map->ops->map_lookup_and_delete_elem) {
Why do we have this check here? Shouldn't it be checked much earlier?
If we really need it here, we need to at least add the following:

__this_cpu_dec(bpf_prog_active);
preempt_enable();


> +   err = -ENOTSUPP;
> +   goto free_value;
> +   }
> +   rcu_read_lock();
> +   ptr = map->ops->map_lookup_and_delete_elem(map, key);
> +   if (ptr)
> +   memcpy(value, ptr, value_size);
> +   rcu_read_unlock();
> +   err = ptr ? 0 : -ENOENT;
> +   __this_cpu_dec(bpf_prog_active);
> +   preempt_enable();
> +
> +   if (err)
> +   goto free_value;
> +
> +   if (copy_to_user(uvalue, value, value_size) != 0)
> +   goto free_value;
> +
> +   err = 0;
> +
> +free_value:
> +   kfree(value);
> +free_key:
> +   kfree(key);
> +err_put:
> +   fdput(f);
> +   return err;
> +}
> +
>  static const struct bpf_prog_ops * const bpf_prog_types[] = {
>  #define BPF_PROG_TYPE(_id, _name) \
> [_id] = & _name ## _prog_ops,
> @@ -2448,6 +2526,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
> uattr, unsigned int, siz
> case BPF_TASK_FD_QUERY:
> err = bpf_task_fd_query(&attr, uattr);
> break;
> +   case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
> +   err = 

[PATCH bpf-next 6/6] selftests/bpf: test_verifier, check bpf_map_lookup_elem access in bpf prog

2018-10-08 Thread Prashant Bhole
map_lookup_elem isn't supported by certain map types like:
- BPF_MAP_TYPE_PROG_ARRAY
- BPF_MAP_TYPE_STACK_TRACE
- BPF_MAP_TYPE_XSKMAP
- BPF_MAP_TYPE_SOCKMAP/BPF_MAP_TYPE_SOCKHASH
Let's add verifier tests to check whether the verifier prevents
bpf_map_lookup_elem calls on the above map types from a bpf program.

Signed-off-by: Prashant Bhole 
Acked-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/test_verifier.c | 121 +++-
 1 file changed, 120 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index 65ae44c85d27..cf4cd32b6772 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -48,7 +48,7 @@
 
 #define MAX_INSNS  BPF_MAXINSNS
 #define MAX_FIXUPS 8
-#define MAX_NR_MAPS 8
+#define MAX_NR_MAPS 13
 #define POINTER_VALUE  0xcafe4all
 #define TEST_DATA_LEN  64
 
@@ -65,6 +65,10 @@ struct bpf_test {
int fixup_map_hash_48b[MAX_FIXUPS];
int fixup_map_hash_16b[MAX_FIXUPS];
int fixup_map_array_48b[MAX_FIXUPS];
+   int fixup_map_sockmap[MAX_FIXUPS];
+   int fixup_map_sockhash[MAX_FIXUPS];
+   int fixup_map_xskmap[MAX_FIXUPS];
+   int fixup_map_stacktrace[MAX_FIXUPS];
int fixup_prog1[MAX_FIXUPS];
int fixup_prog2[MAX_FIXUPS];
int fixup_map_in_map[MAX_FIXUPS];
@@ -4541,6 +4545,85 @@ static struct bpf_test tests[] = {
.errstr = "invalid access to packet",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
},
+   {
+   "prevent map lookup in sockmap",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map_sockmap = { 3 },
+   .result = REJECT,
+   .errstr = "cannot pass map_type 15 into func 
bpf_map_lookup_elem",
+   .prog_type = BPF_PROG_TYPE_SOCK_OPS,
+   },
+   {
+   "prevent map lookup in sockhash",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map_sockhash = { 3 },
+   .result = REJECT,
+   .errstr = "cannot pass map_type 18 into func 
bpf_map_lookup_elem",
+   .prog_type = BPF_PROG_TYPE_SOCK_OPS,
+   },
+   {
+   "prevent map lookup in xskmap",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map_xskmap = { 3 },
+   .result = REJECT,
+   .errstr = "cannot pass map_type 17 into func 
bpf_map_lookup_elem",
+   .prog_type = BPF_PROG_TYPE_XDP,
+   },
+   {
+   "prevent map lookup in stack trace",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+BPF_FUNC_map_lookup_elem),
+   BPF_EXIT_INSN(),
+   },
+   .fixup_map_stacktrace = { 3 },
+   .result = REJECT,
+   .errstr = "cannot pass map_type 7 into func 
bpf_map_lookup_elem",
+   .prog_type = BPF_PROG_TYPE_PERF_EVENT,
+   },
+   {
+   "prevent map lookup in prog array",
+   .insns = {
+   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
+   BPF_LD_MAP_FD(BPF_REG_1, 0),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+

[PATCH bpf-next 4/6] tools/bpf: bpftool, print strerror when map lookup error occurs

2018-10-08 Thread Prashant Bhole
Since a map lookup error can be ENOENT or EOPNOTSUPP, let's print
strerror() as the error message in both normal and JSON output.

This patch adds a helper function print_entry_error() to print the
entry when a lookup error occurs.

Example: The following dumps a map which does not support lookup.

Output before:
root# bpftool map -jp dump id 40
[
"key": ["0x0a","0x00","0x00","0x00"
],
"value": {
"error": "can\'t lookup element"
},
"key": ["0x0b","0x00","0x00","0x00"
],
"value": {
"error": "can\'t lookup element"
}
]

root# bpftool map dump id 40
can't lookup element with key:
0a 00 00 00
can't lookup element with key:
0b 00 00 00
Found 0 elements

Output after changes:
root# bpftool map dump -jp  id 45
[
"key": ["0x0a","0x00","0x00","0x00"
],
"value": {
"error": "Operation not supported"
},
"key": ["0x0b","0x00","0x00","0x00"
],
"value": {
"error": "Operation not supported"
}
]

root# bpftool map dump id 45
key:
0a 00 00 00
value:
Operation not supported
key:
0b 00 00 00
value:
Operation not supported
Found 0 elements

Signed-off-by: Prashant Bhole 
Acked-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 
---
 tools/bpf/bpftool/map.c | 29 -
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 28d365435fea..9f5de48f8a99 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -336,6 +336,25 @@ static void print_entry_json(struct bpf_map_info *info, 
unsigned char *key,
jsonw_end_object(json_wtr);
 }
 
+static void print_entry_error(struct bpf_map_info *info, unsigned char *key,
+ const char *value)
+{
+   int value_size = strlen(value);
+   bool single_line, break_names;
+
+   break_names = info->key_size > 16 || value_size > 16;
+   single_line = info->key_size + value_size <= 24 && !break_names;
+
+   printf("key:%c", break_names ? '\n' : ' ');
+   fprint_hex(stdout, key, info->key_size, " ");
+
+   printf(single_line ? "  " : "\n");
+
+   printf("value:%c%s", break_names ? '\n' : ' ', value);
+
+   printf("\n");
+}
+
 static void print_entry_plain(struct bpf_map_info *info, unsigned char *key,
  unsigned char *value)
 {
@@ -663,6 +682,7 @@ static int dump_map_elem(int fd, void *key, void *value,
 json_writer_t *btf_wtr)
 {
int num_elems = 0;
+   int lookup_errno;
 
if (!bpf_map_lookup_elem(fd, key, value)) {
if (json_output) {
@@ -685,6 +705,8 @@ static int dump_map_elem(int fd, void *key, void *value,
}
 
/* lookup error handling */
+   lookup_errno = errno;
+
if (map_is_map_of_maps(map_info->type) ||
map_is_map_of_progs(map_info->type))
return 0;
@@ -694,13 +716,10 @@ static int dump_map_elem(int fd, void *key, void *value,
print_hex_data_json(key, map_info->key_size);
jsonw_name(json_wtr, "value");
jsonw_start_object(json_wtr);
-   jsonw_string_field(json_wtr, "error",
-  "can't lookup element");
+   jsonw_string_field(json_wtr, "error", strerror(lookup_errno));
jsonw_end_object(json_wtr);
} else {
-   p_info("can't lookup element with key: ");
-   fprint_hex(stderr, key, map_info->key_size, " ");
-   fprintf(stderr, "\n");
+   print_entry_error(map_info, key, strerror(lookup_errno));
}
 
return 0;
-- 
2.17.1




[PATCH bpf-next 3/6] tools/bpf: bpftool, split the function do_dump()

2018-10-08 Thread Prashant Bhole
The do_dump() function in bpftool/map.c is deeply indented. In order
to reduce the indentation, let's move the element printing code out of
do_dump() into a new dump_map_elem() function.

Signed-off-by: Prashant Bhole 
Acked-by: Jakub Kicinski 
Acked-by: Alexei Starovoitov 
---
 tools/bpf/bpftool/map.c | 83 -
 1 file changed, 49 insertions(+), 34 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 6003e9598973..28d365435fea 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -658,6 +658,54 @@ static int do_show(int argc, char **argv)
return errno == ENOENT ? 0 : -1;
 }
 
+static int dump_map_elem(int fd, void *key, void *value,
+struct bpf_map_info *map_info, struct btf *btf,
+json_writer_t *btf_wtr)
+{
+   int num_elems = 0;
+
+   if (!bpf_map_lookup_elem(fd, key, value)) {
+   if (json_output) {
+   print_entry_json(map_info, key, value, btf);
+   } else {
+   if (btf) {
+   struct btf_dumper d = {
+   .btf = btf,
+   .jw = btf_wtr,
+   .is_plain_text = true,
+   };
+
+   do_dump_btf(, map_info, key, value);
+   } else {
+   print_entry_plain(map_info, key, value);
+   }
+   num_elems++;
+   }
+   return num_elems;
+   }
+
+   /* lookup error handling */
+   if (map_is_map_of_maps(map_info->type) ||
+   map_is_map_of_progs(map_info->type))
+   return 0;
+
+   if (json_output) {
+   jsonw_name(json_wtr, "key");
+   print_hex_data_json(key, map_info->key_size);
+   jsonw_name(json_wtr, "value");
+   jsonw_start_object(json_wtr);
+   jsonw_string_field(json_wtr, "error",
+  "can't lookup element");
+   jsonw_end_object(json_wtr);
+   } else {
+   p_info("can't lookup element with key: ");
+   fprint_hex(stderr, key, map_info->key_size, " ");
+   fprintf(stderr, "\n");
+   }
+
+   return 0;
+}
+
 static int do_dump(int argc, char **argv)
 {
struct bpf_map_info info = {};
@@ -713,40 +761,7 @@ static int do_dump(int argc, char **argv)
err = 0;
break;
}
-
-   if (!bpf_map_lookup_elem(fd, key, value)) {
-   if (json_output)
-   print_entry_json(, key, value, btf);
-   else
-   if (btf) {
-   struct btf_dumper d = {
-   .btf = btf,
-   .jw = btf_wtr,
-   .is_plain_text = true,
-   };
-
-   do_dump_btf(, , key, value);
-   } else {
-   print_entry_plain(, key, value);
-   }
-   num_elems++;
-   } else if (!map_is_map_of_maps(info.type) &&
-  !map_is_map_of_progs(info.type)) {
-   if (json_output) {
-   jsonw_name(json_wtr, "key");
-   print_hex_data_json(key, info.key_size);
-   jsonw_name(json_wtr, "value");
-   jsonw_start_object(json_wtr);
-   jsonw_string_field(json_wtr, "error",
-  "can't lookup element");
-   jsonw_end_object(json_wtr);
-   } else {
-   p_info("can't lookup element with key: ");
-   fprint_hex(stderr, key, info.key_size, " ");
-   fprintf(stderr, "\n");
-   }
-   }
-
+   num_elems += dump_map_elem(fd, key, value, , btf, btf_wtr);
prev_key = key;
}
 
-- 
2.17.1




[PATCH bpf-next 2/6] bpf: return EOPNOTSUPP when map lookup isn't supported

2018-10-08 Thread Prashant Bhole
Return ERR_PTR(-EOPNOTSUPP) from map_lookup_elem() methods of below
map types:
- BPF_MAP_TYPE_PROG_ARRAY
- BPF_MAP_TYPE_STACK_TRACE
- BPF_MAP_TYPE_XSKMAP
- BPF_MAP_TYPE_SOCKMAP/BPF_MAP_TYPE_SOCKHASH

Signed-off-by: Prashant Bhole 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/arraymap.c | 2 +-
 kernel/bpf/sockmap.c  | 2 +-
 kernel/bpf/stackmap.c | 2 +-
 kernel/bpf/xskmap.c   | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index dded84cbe814..24583da9ffd1 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -449,7 +449,7 @@ static void fd_array_map_free(struct bpf_map *map)
 
 static void *fd_array_map_lookup_elem(struct bpf_map *map, void *key)
 {
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 
 /* only called from syscall */
diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index d37a1a0a6e1e..5d0677d808ae 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -2096,7 +2096,7 @@ int sockmap_get_from_fd(const union bpf_attr *attr, int 
type,
 
 static void *sock_map_lookup(struct bpf_map *map, void *key)
 {
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 
 static int sock_map_update_elem(struct bpf_map *map,
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 8061a439ef18..b2ade10f7ec3 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -505,7 +505,7 @@ const struct bpf_func_proto bpf_get_stack_proto = {
 /* Called from eBPF program */
 static void *stack_map_lookup_elem(struct bpf_map *map, void *key)
 {
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 
 /* Called from syscall */
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9f8463afda9c..ef0b7b6ef8a5 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -154,7 +154,7 @@ void __xsk_map_flush(struct bpf_map *map)
 
 static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
 {
-   return NULL;
+   return ERR_PTR(-EOPNOTSUPP);
 }
 
 static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
-- 
2.17.1




[PATCH bpf-next 5/6] selftests/bpf: test_verifier, change names of fixup maps

2018-10-08 Thread Prashant Bhole
Currently the fixup maps are named like fixup_map1, fixup_map2, and so on.
As suggested by Alexei, let's change the map names such that we can
identify the map type by looking at the name.

This patch is basically a find and replace change:
fixup_map1  ->  fixup_map_hash_8b
fixup_map2  ->  fixup_map_hash_48b
fixup_map3  ->  fixup_map_hash_16b
fixup_map4  ->  fixup_map_array_48b

Suggested-by: Alexei Starovoitov 
Signed-off-by: Prashant Bhole 
Acked-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/test_verifier.c | 380 ++--
 1 file changed, 190 insertions(+), 190 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c 
b/tools/testing/selftests/bpf/test_verifier.c
index bc9cd8537467..65ae44c85d27 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -61,10 +61,10 @@ static bool unpriv_disabled = false;
 struct bpf_test {
const char *descr;
struct bpf_insn insns[MAX_INSNS];
-   int fixup_map1[MAX_FIXUPS];
-   int fixup_map2[MAX_FIXUPS];
-   int fixup_map3[MAX_FIXUPS];
-   int fixup_map4[MAX_FIXUPS];
+   int fixup_map_hash_8b[MAX_FIXUPS];
+   int fixup_map_hash_48b[MAX_FIXUPS];
+   int fixup_map_hash_16b[MAX_FIXUPS];
+   int fixup_map_array_48b[MAX_FIXUPS];
int fixup_prog1[MAX_FIXUPS];
int fixup_prog2[MAX_FIXUPS];
int fixup_map_in_map[MAX_FIXUPS];
@@ -876,7 +876,7 @@ static struct bpf_test tests[] = {
 BPF_FUNC_map_lookup_elem),
BPF_EXIT_INSN(),
},
-   .fixup_map1 = { 2 },
+   .fixup_map_hash_8b = { 2 },
.errstr = "invalid indirect read from stack",
.result = REJECT,
},
@@ -1110,7 +1110,7 @@ static struct bpf_test tests[] = {
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
BPF_EXIT_INSN(),
},
-   .fixup_map1 = { 3 },
+   .fixup_map_hash_8b = { 3 },
.errstr = "R0 invalid mem access 'map_value_or_null'",
.result = REJECT,
},
@@ -1127,7 +1127,7 @@ static struct bpf_test tests[] = {
BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
BPF_EXIT_INSN(),
},
-   .fixup_map1 = { 3 },
+   .fixup_map_hash_8b = { 3 },
.errstr = "misaligned value access",
.result = REJECT,
.flags = F_LOAD_WITH_STRICT_ALIGNMENT,
@@ -1147,7 +1147,7 @@ static struct bpf_test tests[] = {
BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
BPF_EXIT_INSN(),
},
-   .fixup_map1 = { 3 },
+   .fixup_map_hash_8b = { 3 },
.errstr = "R0 invalid mem access",
.errstr_unpriv = "R0 leaks addr",
.result = REJECT,
@@ -1237,7 +1237,7 @@ static struct bpf_test tests[] = {
 BPF_FUNC_map_delete_elem),
BPF_EXIT_INSN(),
},
-   .fixup_map1 = { 24 },
+   .fixup_map_hash_8b = { 24 },
.errstr_unpriv = "R1 pointer comparison",
.result_unpriv = REJECT,
.result = ACCEPT,
@@ -1391,7 +1391,7 @@ static struct bpf_test tests[] = {
offsetof(struct __sk_buff, pkt_type)),
BPF_EXIT_INSN(),
},
-   .fixup_map1 = { 4 },
+   .fixup_map_hash_8b = { 4 },
.errstr = "different pointers",
.errstr_unpriv = "R1 pointer comparison",
.result = REJECT,
@@ -1414,7 +1414,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
BPF_JMP_IMM(BPF_JA, 0, 0, -12),
},
-   .fixup_map1 = { 6 },
+   .fixup_map_hash_8b = { 6 },
.errstr = "different pointers",
.errstr_unpriv = "R1 pointer comparison",
.result = REJECT,
@@ -1438,7 +1438,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
BPF_JMP_IMM(BPF_JA, 0, 0, -13),
},
-   .fixup_map1 = { 7 },
+   .fixup_map_hash_8b = { 7 },
.errstr = "different pointers",
.errstr_unpriv = "R1 pointer comparison",
.result = REJECT,
@@ -2575,7 +2575,7 @@ static struct bpf_test tests[] = {
BPF_MOV64_IMM(BPF_REG_0, 0),
BPF_EXIT_INSN(),
},
-   .fixup_map1 = { 3 },
+   .fixup_map_hash_8b = { 3 },
.errstr_unpriv = "R4 leaks addr",
.result_unpriv = REJECT,
.result = ACCEPT,
@@ -2592,7 +2592,7 @@ 

[PATCH bpf-next 1/6] bpf: error handling when map_lookup_elem isn't supported

2018-10-08 Thread Prashant Bhole
The error value returned by map_lookup_elem doesn't differentiate
whether the lookup failed because of an invalid key or because lookup
is not supported.

Let's add handling for the -EOPNOTSUPP return value of a map's
map_lookup_elem() method, with the expectation that the map's
implementation returns -EOPNOTSUPP if lookup is not supported.

The errno of the bpf syscall for the BPF_MAP_LOOKUP_ELEM command will
be set to EOPNOTSUPP if map lookup is not supported.

Signed-off-by: Prashant Bhole 
Acked-by: Alexei Starovoitov 
---
 kernel/bpf/syscall.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5742df21598c..4f416234251f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -719,10 +719,15 @@ static int map_lookup_elem(union bpf_attr *attr)
} else {
rcu_read_lock();
ptr = map->ops->map_lookup_elem(map, key);
-   if (ptr)
+   if (IS_ERR(ptr)) {
+   err = PTR_ERR(ptr);
+   } else if (!ptr) {
+   err = -ENOENT;
+   } else {
+   err = 0;
memcpy(value, ptr, value_size);
+   }
rcu_read_unlock();
-   err = ptr ? 0 : -ENOENT;
}
 
if (err)
-- 
2.17.1




[PATCH bpf-next 0/6] Error handling when map lookup isn't supported

2018-10-08 Thread Prashant Bhole
Currently when a map lookup fails, the user space API cannot make any
distinction whether the given key was not found or lookup is not
supported by the particular map.

In this series we modify the return value of maps which do not support
lookup. Lookup on such a map implementation will return -EOPNOTSUPP.
The bpf() syscall with the BPF_MAP_LOOKUP_ELEM command will set errno
to EOPNOTSUPP. We also handle this error in bpftool to print an
appropriate message.
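
As an illustrative sketch (not part of the series), user space can then
tell the two failure cases apart, assuming libbpf's
bpf_map_lookup_elem() wrapper, which returns -1 and sets errno on
failure:

	/* uses libbpf's bpf.h, plus <errno.h> and <stdio.h> */
	long long key = 0, value;

	if (bpf_map_lookup_elem(map_fd, &key, &value)) {
		if (errno == EOPNOTSUPP)
			printf("map type does not support lookup\n");
		else if (errno == ENOENT)
			printf("key not found\n");
	}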

Patch 1: adds handling of the BPF_MAP_LOOKUP_ELEM command of the bpf
syscall such that errno will be set to EOPNOTSUPP when the map doesn't
support lookup

Patch 2: Modifies the return value of map_lookup_elem() to EOPNOTSUPP
for maps which do not support lookup

Patch 3: Splits do_dump() in bpftool/map.c. Element printing code is
moved out into new function dump_map_elem(). This was done in order to
reduce deep indentation and accommodate further changes.

Patch 4: Changes in bpftool to print a strerror() message when a
lookup error occurs. This will result in an appropriate message like
"Operation not supported" when the map doesn't support lookup.

Patch 5: test_verifier: change fixup map naming convention as
suggested by Alexei

Patch 6: Added verifier tests to check whether the verifier rejects
calls to bpf_map_lookup_elem from a bpf program, for all map types
that do not support map lookup.

Prashant Bhole (6):
  bpf: error handling when map_lookup_elem isn't supported
  bpf: return EOPNOTSUPP when map lookup isn't supported
  tools/bpf: bpftool, split the function do_dump()
  tools/bpf: bpftool, print strerror when map lookup error occurs
  selftests/bpf: test_verifier, change names of fixup maps
  selftests/bpf: test_verifier, check bpf_map_lookup_elem access in bpf
prog

 kernel/bpf/arraymap.c   |   2 +-
 kernel/bpf/sockmap.c|   2 +-
 kernel/bpf/stackmap.c   |   2 +-
 kernel/bpf/syscall.c|   9 +-
 kernel/bpf/xskmap.c |   2 +-
 tools/bpf/bpftool/map.c | 102 ++--
 tools/testing/selftests/bpf/test_verifier.c | 501 
 7 files changed, 389 insertions(+), 231 deletions(-)

-- 
2.17.1




Re: [PATCH bpf-next 0/6] Error handling when map lookup isn't supported

2018-10-08 Thread Alexei Starovoitov
On Tue, Oct 9, 2018 at 12:48 AM Prashant Bhole
 wrote:
>
> Shall I repost with the same version and Alexei's Acked-by for the series?

yes. please repost as-is and add my Ack to all patches.
Thanks!


Re: [PATCH bpf-next 0/6] Error handling when map lookup isn't supported

2018-10-08 Thread Prashant Bhole




On 10/9/2018 9:43 AM, Daniel Borkmann wrote:

On 10/09/2018 02:02 AM, Prashant Bhole wrote:

On 10/6/2018 3:35 AM, Alexei Starovoitov wrote:

On Fri, Oct 05, 2018 at 12:35:55PM +0900, Prashant Bhole wrote:

Currently when map a lookup fails, user space API can not make any
distinction whether given key was not found or lookup is not supported
by particular map.

In this series we modify return value of maps which do not support
lookup. Lookup on such map implementation will return -EOPNOTSUPP.
bpf() syscall with BPF_MAP_LOOKUP_ELEM command will set EOPNOTSUPP
errno. We also handle this error in bpftool to print appropriate
message.

Patch 1: adds handling of BPF_MAP_LOOKUP ELEM command of bpf syscall
such that errno will set to EOPNOTSUPP when map doesn't support lookup

Patch 2: Modifies the return value of map_lookup_elem() to EOPNOTSUPP
for maps which do not support lookup

Patch 3: Splits do_dump() in bpftool/map.c. Element printing code is
moved out into new function dump_map_elem(). This was done in order to
reduce deep indentation and accomodate further changes.

Patch 4: Changes in bpftool to print strerror() message when lookup
error is occured. This will result in appropriate message like
"Operation not supported" when map doesn't support lookup.

Patch 5: test_verifier: change fixup map naming convention as
suggested by Alexei

Patch 6: Added verifier tests to check whether verifier rejects call
to bpf_map_lookup_elem from bpf program. For all map types those
do not support map lookup.


for the set:
Acked-by: Alexei Starovoitov 


Thanks. Is there any reason this series did not get posted on netdev-list and 
can not be seen in the patchwork?


Hmm, could you repost to netdev? Perhaps a netdev or patchwork issue that
it did not land there. I just double-checked and it's indeed not present.



Shall I repost with the same version and Alexei's Acked-by for the series?

-Prashant




Re: [PATCH bpf-next 0/6] Error handling when map lookup isn't supported

2018-10-08 Thread Daniel Borkmann
On 10/09/2018 02:02 AM, Prashant Bhole wrote:
> On 10/6/2018 3:35 AM, Alexei Starovoitov wrote:
>> On Fri, Oct 05, 2018 at 12:35:55PM +0900, Prashant Bhole wrote:
>>> Currently when map a lookup fails, user space API can not make any
>>> distinction whether given key was not found or lookup is not supported
>>> by particular map.
>>>
>>> In this series we modify return value of maps which do not support
>>> lookup. Lookup on such map implementation will return -EOPNOTSUPP.
>>> bpf() syscall with BPF_MAP_LOOKUP_ELEM command will set EOPNOTSUPP
>>> errno. We also handle this error in bpftool to print appropriate
>>> message.
>>>
>>> Patch 1: adds handling of BPF_MAP_LOOKUP ELEM command of bpf syscall
>>> such that errno will set to EOPNOTSUPP when map doesn't support lookup
>>>
>>> Patch 2: Modifies the return value of map_lookup_elem() to EOPNOTSUPP
>>> for maps which do not support lookup
>>>
>>> Patch 3: Splits do_dump() in bpftool/map.c. Element printing code is
>>> moved out into new function dump_map_elem(). This was done in order to
>>> reduce deep indentation and accomodate further changes.
>>>
>>> Patch 4: Changes in bpftool to print strerror() message when lookup
>>> error is occured. This will result in appropriate message like
>>> "Operation not supported" when map doesn't support lookup.
>>>
>>> Patch 5: test_verifier: change fixup map naming convention as
>>> suggested by Alexei
>>>
>>> Patch 6: Added verifier tests to check whether verifier rejects call
>>> to bpf_map_lookup_elem from bpf program. For all map types those
>>> do not support map lookup.
>>
>> for the set:
>> Acked-by: Alexei Starovoitov 
> 
> Thanks. Is there any reason this series did not get posted on netdev-list and 
> can not be seen in the patchwork?

Hmm, could you repost to netdev? Perhaps a netdev or patchwork issue that
it did not land there. I just double-checked and it's indeed not present.

Thanks,
Daniel


Re: [PATCH bpf] xsk: do not call synchronize_net() under RCU read lock

2018-10-08 Thread Song Liu
On Mon, Oct 8, 2018 at 10:41 AM Björn Töpel  wrote:
>
> From: Björn Töpel 
>
> The XSKMAP update and delete functions called synchronize_net(), which
> can sleep. It is not allowed to sleep during an RCU read section.
>
> Instead we need to make sure that the sock sk_destruct (xsk_destruct)
> function is asynchronously called after an RCU grace period. Setting
> the SOCK_RCU_FREE flag for XDP sockets takes care of this.
>
> Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type 
> BPF_MAP_TYPE_XSKMAP")
> Reported-by: Eric Dumazet 
> Signed-off-by: Björn Töpel 
Acked-by: Song Liu 

> ---
>  kernel/bpf/xskmap.c | 10 ++
>  net/xdp/xsk.c   |  2 ++
>  2 files changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
> index 9f8463afda9c..47147c9e184d 100644
> --- a/kernel/bpf/xskmap.c
> +++ b/kernel/bpf/xskmap.c
> @@ -192,11 +192,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void 
> *key, void *value,
> sock_hold(sock->sk);
>
> old_xs = xchg(&m->xsk_map[i], xs);
> -   if (old_xs) {
> -   /* Make sure we've flushed everything. */
> -   synchronize_net();
> +   if (old_xs)
> sock_put((struct sock *)old_xs);
> -   }
>
> sockfd_put(sock);
> return 0;
> @@ -212,11 +209,8 @@ static int xsk_map_delete_elem(struct bpf_map *map, void 
> *key)
> return -EINVAL;
>
> old_xs = xchg(&m->xsk_map[k], NULL);
> -   if (old_xs) {
> -   /* Make sure we've flushed everything. */
> -   synchronize_net();
> +   if (old_xs)
> sock_put((struct sock *)old_xs);
> -   }
>
> return 0;
>  }
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 0577cd49aa72..07156f43d295 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -754,6 +754,8 @@ static int xsk_create(struct net *net, struct socket 
> *sock, int protocol,
> sk->sk_destruct = xsk_destruct;
> sk_refcnt_debug_inc(sk);
>
> +   sock_set_flag(sk, SOCK_RCU_FREE);
> +
> xs = xdp_sk(sk);
> mutex_init(&xs->mutex);
> spin_lock_init(&xs->tx_completion_lock);
> --
> 2.17.1
>


Re: [PATCH bpf-next 0/6] Error handling when map lookup isn't supported

2018-10-08 Thread Prashant Bhole




On 10/6/2018 3:35 AM, Alexei Starovoitov wrote:

On Fri, Oct 05, 2018 at 12:35:55PM +0900, Prashant Bhole wrote:

Currently when map a lookup fails, user space API can not make any
distinction whether given key was not found or lookup is not supported
by particular map.

In this series we modify return value of maps which do not support
lookup. Lookup on such map implementation will return -EOPNOTSUPP.
bpf() syscall with BPF_MAP_LOOKUP_ELEM command will set EOPNOTSUPP
errno. We also handle this error in bpftool to print appropriate
message.

Patch 1: adds handling of BPF_MAP_LOOKUP ELEM command of bpf syscall
such that errno will set to EOPNOTSUPP when map doesn't support lookup

Patch 2: Modifies the return value of map_lookup_elem() to EOPNOTSUPP
for maps which do not support lookup

Patch 3: Splits do_dump() in bpftool/map.c. Element printing code is
moved out into new function dump_map_elem(). This was done in order to
reduce deep indentation and accomodate further changes.

Patch 4: Changes in bpftool to print strerror() message when lookup
error is occured. This will result in appropriate message like
"Operation not supported" when map doesn't support lookup.

Patch 5: test_verifier: change fixup map naming convention as
suggested by Alexei

Patch 6: Added verifier tests to check whether verifier rejects call
to bpf_map_lookup_elem from bpf program. For all map types those
do not support map lookup.


for the set:
Acked-by: Alexei Starovoitov 


Thanks. Is there any reason this series did not get posted on 
netdev-list and can not be seen in the patchwork?




Re: [PATCH v2 2/3] tools: sync linux/bpf.h

2018-10-08 Thread Song Liu
On Mon, Oct 8, 2018 at 3:34 AM Lorenz Bauer  wrote:
>
> Synchronize changes to linux/bpf.h from
> commit 88db241b34bf ("bpf: allow zero-initializing hash map seed").
I guess we cannot keep this hash during git-am? We probably don't
need this hash anyway, as the two patches will be applied back to back.

>
> Signed-off-by: Lorenz Bauer 
> ---
>  tools/include/uapi/linux/bpf.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
> index f9187b41dff6..2c121f862082 100644
> --- a/tools/include/uapi/linux/bpf.h
> +++ b/tools/include/uapi/linux/bpf.h
> @@ -253,6 +253,8 @@ enum bpf_attach_type {
>  #define BPF_F_NO_COMMON_LRU(1U << 1)
>  /* Specify numa node during map creation */
>  #define BPF_F_NUMA_NODE(1U << 2)
> +/* Zero-initialize hash function seed. This should only be used for testing. 
> */
> +#define BPF_F_ZERO_SEED(1U << 6)

Same as 01.

>
>  /* flags for BPF_PROG_QUERY */
>  #define BPF_F_QUERY_EFFECTIVE  (1U << 0)
> --
> 2.17.1
>


[PATCH net-next v3] net/ncsi: Extend NC-SI Netlink interface to allow user space to send NC-SI command

2018-10-08 Thread Justin.Lee1
The new command (NCSI_CMD_SEND_CMD) is added to allow a user space
application to send NC-SI commands to the network card.
Also, add a new attribute (NCSI_ATTR_DATA) for transferring the request
and response.

The workflow is as below.

Request:
User space application
-> Netlink interface (msg)
-> new Netlink handler - ncsi_send_cmd_nl()
-> ncsi_xmit_cmd()

Response:
Response received - ncsi_rcv_rsp()
-> internal response handler - ncsi_rsp_handler_xxx()
-> ncsi_rsp_handler_netlink()
-> ncsi_send_netlink_rsp ()
-> Netlink interface (msg)
-> user space application

Command timeout - ncsi_request_timeout()
-> ncsi_send_netlink_timeout ()
-> Netlink interface (msg with zero data length)
-> user space application

Error:
Error detected
-> ncsi_send_netlink_err ()
-> Netlink interface (err msg)
-> user space application
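
For illustration only (editor's sketch, not part of this patch): a
request along the lines of the flow above could be built with libnl.
The family name "NCSI", the NCSI_ATTR_IFINDEX attribute, the exact set
of attributes required and the payload layout are assumptions here, not
taken from this patch.

#include <net/if.h>
#include <netlink/netlink.h>
#include <netlink/genl/genl.h>
#include <netlink/genl/ctrl.h>

	struct nl_sock *sk = nl_socket_alloc();
	unsigned char ncsi_cmd[64] = { 0 };	/* raw NC-SI command, format assumed */
	int ifindex = if_nametoindex("eth0");	/* target interface, name assumed */
	struct nl_msg *msg;
	int family;

	genl_connect(sk);
	family = genl_ctrl_resolve(sk, "NCSI");	/* family name assumed */
	msg = nlmsg_alloc();
	genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, family, 0, 0,
		    NCSI_CMD_SEND_CMD, 0);
	nla_put_u32(msg, NCSI_ATTR_IFINDEX, ifindex);
	nla_put_u32(msg, NCSI_ATTR_PACKAGE_ID, 0);
	nla_put_u32(msg, NCSI_ATTR_CHANNEL_ID, 0);
	nla_put(msg, NCSI_ATTR_DATA, sizeof(ncsi_cmd), ncsi_cmd);
	nl_send_auto(sk, msg);	/* response or timeout comes back as a netlink msg */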

V3: Based on http://patchwork.ozlabs.org/patch/979688/ to remove the duplicated 
code.
V2: Remove non-related debug message and clean up the code.


Signed-off-by: Justin Lee  


---
 include/uapi/linux/ncsi.h |   3 +
 net/ncsi/internal.h   |  10 ++-
 net/ncsi/ncsi-cmd.c   |   8 ++
 net/ncsi/ncsi-manage.c|  16 
 net/ncsi/ncsi-netlink.c   | 204 ++
 net/ncsi/ncsi-netlink.h   |  12 +++
 net/ncsi/ncsi-rsp.c   |  67 +--
 7 files changed, 314 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/ncsi.h b/include/uapi/linux/ncsi.h
index 4c292ec..4992bfc 100644
--- a/include/uapi/linux/ncsi.h
+++ b/include/uapi/linux/ncsi.h
@@ -30,6 +30,7 @@ enum ncsi_nl_commands {
NCSI_CMD_PKG_INFO,
NCSI_CMD_SET_INTERFACE,
NCSI_CMD_CLEAR_INTERFACE,
+   NCSI_CMD_SEND_CMD,
 
__NCSI_CMD_AFTER_LAST,
NCSI_CMD_MAX = __NCSI_CMD_AFTER_LAST - 1
@@ -43,6 +44,7 @@ enum ncsi_nl_commands {
  * @NCSI_ATTR_PACKAGE_LIST: nested array of NCSI_PKG_ATTR attributes
  * @NCSI_ATTR_PACKAGE_ID: package ID
  * @NCSI_ATTR_CHANNEL_ID: channel ID
+ * @NCSI_ATTR_DATA: command payload
  * @NCSI_ATTR_MAX: highest attribute number
  */
 enum ncsi_nl_attrs {
@@ -51,6 +53,7 @@ enum ncsi_nl_attrs {
NCSI_ATTR_PACKAGE_LIST,
NCSI_ATTR_PACKAGE_ID,
NCSI_ATTR_CHANNEL_ID,
+   NCSI_ATTR_DATA,
 
__NCSI_ATTR_AFTER_LAST,
NCSI_ATTR_MAX = __NCSI_ATTR_AFTER_LAST - 1
diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
index 3d0a33b..e9db100 100644
--- a/net/ncsi/internal.h
+++ b/net/ncsi/internal.h
@@ -175,6 +175,8 @@ struct ncsi_package;
 #define NCSI_RESERVED_CHANNEL  0x1f
 #define NCSI_CHANNEL_INDEX(c)  ((c) & ((1 << NCSI_PACKAGE_SHIFT) - 1))
 #define NCSI_TO_CHANNEL(p, c)  (((p) << NCSI_PACKAGE_SHIFT) | (c))
+#define NCSI_MAX_PACKAGE   8
+#define NCSI_MAX_CHANNEL   32
 
 struct ncsi_channel {
unsigned char   id;
@@ -219,12 +221,17 @@ struct ncsi_request {
unsigned charid;  /* Request ID - 0 to 255   */
bool used;/* Request that has been assigned  */
unsigned int flags;   /* NCSI request property   */
-#define NCSI_REQ_FLAG_EVENT_DRIVEN 1
+#define NCSI_REQ_FLAG_EVENT_DRIVEN 1
+#define NCSI_REQ_FLAG_NETLINK_DRIVEN   2
struct ncsi_dev_priv *ndp;/* Associated NCSI device  */
struct sk_buff   *cmd;/* Associated NCSI command packet  */
struct sk_buff   *rsp;/* Associated NCSI response packet */
struct timer_listtimer;   /* Timer on waiting for response   */
bool enabled; /* Time has been enabled or not*/
+
+   u32  snd_seq; /* netlink sending sequence number */
+   u32  snd_portid;  /* netlink portid of sender*/
+   struct nlmsghdr  nlhdr;   /* netlink message header  */
 };
 
 enum {
@@ -310,6 +317,7 @@ struct ncsi_cmd_arg {
unsigned int   dwords[4];
};
unsigned char*data;   /* NCSI OEM data */
+   struct genl_info *info;   /* Netlink information   */
 };
 
 extern struct list_head ncsi_dev_list;
diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
index 82b7d92..356af47 100644
--- a/net/ncsi/ncsi-cmd.c
+++ b/net/ncsi/ncsi-cmd.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 #include "ncsi-pkt.h"
@@ -346,6 +347,13 @@ int ncsi_xmit_cmd(struct ncsi_cmd_arg *nca)
if (!nr)
return -ENOMEM;
 
+   /* track netlink information */
+   if (nca->req_flags == NCSI_REQ_FLAG_NETLINK_DRIVEN) {
+   nr->snd_seq = nca->info->snd_seq;
+   nr->snd_portid = nca->info->snd_portid;
+   nr->nlhdr = *nca->info->nlhdr;
+   }
+
/* Prepare the packet */
nca->id = nr->id;
ret = nch->handler(nr->cmd, nca);
diff --git 

Re: [PATCH v2 3/3] tools: add selftest for BPF_F_ZERO_SEED

2018-10-08 Thread Song Liu
On Mon, Oct 8, 2018 at 3:34 AM Lorenz Bauer  wrote:
>
> Check that iterating two separate hash maps produces the same
> order of keys if BPF_F_ZERO_SEED is used.
>
> Signed-off-by: Lorenz Bauer 
> ---
>  tools/testing/selftests/bpf/test_maps.c | 68 +
>  1 file changed, 57 insertions(+), 11 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/test_maps.c 
> b/tools/testing/selftests/bpf/test_maps.c
> index 9b552c0fc47d..a8d6af27803a 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -257,23 +257,35 @@ static void test_hashmap_percpu(int task, void *data)
> close(fd);
>  }
>
> +static int helper_fill_hashmap(int max_entries)

How about we add map_flags as the second argument of this function?
This will help
avoid the old_flags hack.

Thanks,
Song

> +{
> +   int i, fd, ret;
> +   long long key, value;
> +
> +   fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value),
> +   max_entries, map_flags);
> +   CHECK(fd < 0,
> + "failed to create hashmap",
> + "err: %s, flags: 0x%x\n", strerror(errno), map_flags);
> +
> +   for (i = 0; i < max_entries; i++) {
> +   key = i; value = key;
> +   ret = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
> +   CHECK(ret != 0,
> + "can't update hashmap",
> + "err: %s\n", strerror(ret));
> +   }
> +
> +   return fd;
> +}
> +
>  static void test_hashmap_walk(int task, void *data)
>  {
> int fd, i, max_entries = 1000;
> long long key, value, next_key;
> bool next_key_valid = true;
>
> -   fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value),
> -   max_entries, map_flags);
> -   if (fd < 0) {
> -   printf("Failed to create hashmap '%s'!\n", strerror(errno));
> -   exit(1);
> -   }
> -
> -   for (i = 0; i < max_entries; i++) {
> -   key = i; value = key;
> -   assert(bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST) == 
> 0);
> -   }
> +   fd = helper_fill_hashmap(max_entries);
>
> for (i = 0; bpf_map_get_next_key(fd, !i ? NULL : &key,
>  &next_key) == 0; i++) {
> @@ -305,6 +317,39 @@ static void test_hashmap_walk(int task, void *data)
> close(fd);
>  }
>
> +static void test_hashmap_zero_seed(void)
> +{
> +   int i, first, second, old_flags;
> +   long long key, next_first, next_second;
> +
> +   old_flags = map_flags;
> +   map_flags |= BPF_F_ZERO_SEED;
> +
> +   first = helper_fill_hashmap(3);
> +   second = helper_fill_hashmap(3);
> +
> +   for (i = 0; ; i++) {
> +   void *key_ptr = !i ? NULL : &key;
> +
> +   if (bpf_map_get_next_key(first, key_ptr, &next_first) != 0)
> +   break;
> +
> +   CHECK(bpf_map_get_next_key(second, key_ptr, &next_second) != 
> 0,
> + "next_key for second map must succeed",
> + "key_ptr: %p", key_ptr);
> +   CHECK(next_first != next_second,
> + "keys must match",
> + "i: %d first: %lld second: %lld\n", i,
> + next_first, next_second);
> +
> +   key = next_first;
> +   }
> +
> +   map_flags = old_flags;
> +   close(first);
> +   close(second);
> +}
> +
>  static void test_arraymap(int task, void *data)
>  {
> int key, next_key, fd;
> @@ -1417,6 +1462,7 @@ static void run_all_tests(void)
> test_hashmap(0, NULL);
> test_hashmap_percpu(0, NULL);
> test_hashmap_walk(0, NULL);
> +   test_hashmap_zero_seed();
>
> test_arraymap(0, NULL);
> test_arraymap_percpu(0, NULL);
> --
> 2.17.1
>
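
Something along the lines of the editor's sketch below, derived from
the quoted code (not Lorenz's actual v3), is presumably what the
suggestion about passing map_flags amounts to:

static int helper_fill_hashmap(int max_entries, int map_flags)
{
	long long key, value;
	int i, fd, ret;

	fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(key), sizeof(value),
			    max_entries, map_flags);
	CHECK(fd < 0, "failed to create hashmap",
	      "err: %s, flags: 0x%x\n", strerror(errno), map_flags);

	for (i = 0; i < max_entries; i++) {
		key = i; value = key;
		ret = bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST);
		CHECK(ret != 0, "can't update hashmap", "err: %s\n",
		      strerror(ret));
	}

	return fd;
}

/* callers then pass the flags explicitly, e.g. in test_hashmap_zero_seed(): */
	first = helper_fill_hashmap(3, map_flags | BPF_F_ZERO_SEED);
	second = helper_fill_hashmap(3, map_flags | BPF_F_ZERO_SEED);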


Re: [PATCH v2 1/3] bpf: allow zero-initializing hash map seed

2018-10-08 Thread Song Liu
On Mon, Oct 8, 2018 at 3:34 AM Lorenz Bauer  wrote:
>
> Add a new flag BPF_F_ZERO_SEED, which forces a hash map
> to initialize the seed to zero. This is useful when doing
> performance analysis both on individual BPF programs, as
> well as the kernel's hash table implementation.
>
> Signed-off-by: Lorenz Bauer 
> ---
>  include/uapi/linux/bpf.h |  2 ++
>  kernel/bpf/hashtab.c | 13 +++--
>  2 files changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index f9187b41dff6..2c121f862082 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -253,6 +253,8 @@ enum bpf_attach_type {
>  #define BPF_F_NO_COMMON_LRU(1U << 1)
>  /* Specify numa node during map creation */
>  #define BPF_F_NUMA_NODE(1U << 2)
> +/* Zero-initialize hash function seed. This should only be used for testing. 
> */
> +#define BPF_F_ZERO_SEED(1U << 6)

Please add this line after
#define BPF_F_STACK_BUILD_ID(1U << 5)

Other than this
Acked-by: Song Liu 


>
>  /* flags for BPF_PROG_QUERY */
>  #define BPF_F_QUERY_EFFECTIVE  (1U << 0)
> diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> index 2c1790288138..4b7c76765d9d 100644
> --- a/kernel/bpf/hashtab.c
> +++ b/kernel/bpf/hashtab.c
> @@ -23,7 +23,7 @@
>
>  #define HTAB_CREATE_FLAG_MASK  \
> (BPF_F_NO_PREALLOC | BPF_F_NO_COMMON_LRU | BPF_F_NUMA_NODE |\
> -BPF_F_RDONLY | BPF_F_WRONLY)
> +BPF_F_RDONLY | BPF_F_WRONLY | BPF_F_ZERO_SEED)
>
>  struct bucket {
> struct hlist_nulls_head head;
> @@ -244,6 +244,7 @@ static int htab_map_alloc_check(union bpf_attr *attr)
>  */
> bool percpu_lru = (attr->map_flags & BPF_F_NO_COMMON_LRU);
> bool prealloc = !(attr->map_flags & BPF_F_NO_PREALLOC);
> +   bool zero_seed = (attr->map_flags & BPF_F_ZERO_SEED);
> int numa_node = bpf_map_attr_numa_node(attr);
>
> BUILD_BUG_ON(offsetof(struct htab_elem, htab) !=
> @@ -257,6 +258,10 @@ static int htab_map_alloc_check(union bpf_attr *attr)
>  */
> return -EPERM;
>
> +   if (zero_seed && !capable(CAP_SYS_ADMIN))
> +   /* Guard against local DoS, and discourage production use. */
> +   return -EPERM;
> +
> if (attr->map_flags & ~HTAB_CREATE_FLAG_MASK)
> /* reserved bits should not be used */
> return -EINVAL;
> @@ -373,7 +378,11 @@ static struct bpf_map *htab_map_alloc(union bpf_attr 
> *attr)
> if (!htab->buckets)
> goto free_htab;
>
> -   htab->hashrnd = get_random_int();
> +   if (htab->map.map_flags & BPF_F_ZERO_SEED)
> +   htab->hashrnd = 0;
> +   else
> +   htab->hashrnd = get_random_int();
> +
> for (i = 0; i < htab->n_buckets; i++) {
> INIT_HLIST_NULLS_HEAD(&htab->buckets[i].head, i);
> raw_spin_lock_init(&htab->buckets[i].lock);
> --
> 2.17.1
>


[PATCH net-next] tcp: refactor DCTCP ECN ACK handling

2018-10-08 Thread Yuchung Cheng
DCTCP has two parts - a new ECN signalling mechanism and the response
function to it. The first part can be used by other congestion
controls on DCTCP-ECN deployed networks. This patch moves that part
into a separate tcp_dctcp.h to be used by other congestion control
modules (like how YeAH uses the Vegas algorithm). For example, BBR is
currently experimenting with such an ECN signal:
https://tinyurl.com/ietf-102-iccrg-bbr2
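
For illustration only (editor's sketch, not from this patch): a
hypothetical congestion control module "foo" that keeps its own
prior_rcv_nxt/ce_state fields could reuse the shared helper exactly as
DCTCP does in the diff below:

#include "tcp_dctcp.h"

struct foo {
	u32 prior_rcv_nxt;
	u32 ce_state;
	/* ... other per-connection state ... */
};

static void foo_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
{
	struct foo *ca = inet_csk_ca(sk);

	switch (ev) {
	case CA_EVENT_ECN_IS_CE:
	case CA_EVENT_ECN_NO_CE:
		/* shared DCTCP-style immediate-ACK handling on CE changes */
		dctcp_ece_ack_update(sk, ev, &ca->prior_rcv_nxt, &ca->ce_state);
		break;
	default:
		break;
	}
}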

Signed-off-by: Yuchung Cheng 
Signed-off-by: Yousuk Seung 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_dctcp.c | 55 
 net/ipv4/tcp_dctcp.h | 40 
 2 files changed, 44 insertions(+), 51 deletions(-)
 create mode 100644 net/ipv4/tcp_dctcp.h

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index ca61e2a659e7..cd4814f7e962 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -44,6 +44,7 @@
 #include 
 #include 
 #include 
+#include "tcp_dctcp.h"
 
 #define DCTCP_MAX_ALPHA1024U
 
@@ -118,54 +119,6 @@ static u32 dctcp_ssthresh(struct sock *sk)
return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) >> 11U), 
2U);
 }
 
-/* Minimal DCTP CE state machine:
- *
- * S:  0 <- last pkt was non-CE
- * 1 <- last pkt was CE
- */
-
-static void dctcp_ce_state_0_to_1(struct sock *sk)
-{
-   struct dctcp *ca = inet_csk_ca(sk);
-   struct tcp_sock *tp = tcp_sk(sk);
-
-   if (!ca->ce_state) {
-   /* State has changed from CE=0 to CE=1, force an immediate
-* ACK to reflect the new CE state. If an ACK was delayed,
-* send that first to reflect the prior CE state.
-*/
-   if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
-   __tcp_send_ack(sk, ca->prior_rcv_nxt);
-   inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
-   }
-
-   ca->prior_rcv_nxt = tp->rcv_nxt;
-   ca->ce_state = 1;
-
-   tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
-}
-
-static void dctcp_ce_state_1_to_0(struct sock *sk)
-{
-   struct dctcp *ca = inet_csk_ca(sk);
-   struct tcp_sock *tp = tcp_sk(sk);
-
-   if (ca->ce_state) {
-   /* State has changed from CE=1 to CE=0, force an immediate
-* ACK to reflect the new CE state. If an ACK was delayed,
-* send that first to reflect the prior CE state.
-*/
-   if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER)
-   __tcp_send_ack(sk, ca->prior_rcv_nxt);
-   inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
-   }
-
-   ca->prior_rcv_nxt = tp->rcv_nxt;
-   ca->ce_state = 0;
-
-   tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
-}
-
 static void dctcp_update_alpha(struct sock *sk, u32 flags)
 {
const struct tcp_sock *tp = tcp_sk(sk);
@@ -230,12 +183,12 @@ static void dctcp_state(struct sock *sk, u8 new_state)
 
 static void dctcp_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
 {
+   struct dctcp *ca = inet_csk_ca(sk);
+
switch (ev) {
case CA_EVENT_ECN_IS_CE:
-   dctcp_ce_state_0_to_1(sk);
-   break;
case CA_EVENT_ECN_NO_CE:
-   dctcp_ce_state_1_to_0(sk);
+   dctcp_ece_ack_update(sk, ev, &ca->prior_rcv_nxt, &ca->ce_state);
break;
default:
/* Don't care for the rest. */
diff --git a/net/ipv4/tcp_dctcp.h b/net/ipv4/tcp_dctcp.h
new file mode 100644
index ..d69a77cbd0c7
--- /dev/null
+++ b/net/ipv4/tcp_dctcp.h
@@ -0,0 +1,40 @@
+#ifndef _TCP_DCTCP_H
+#define _TCP_DCTCP_H
+
+static inline void dctcp_ece_ack_cwr(struct sock *sk, u32 ce_state)
+{
+   struct tcp_sock *tp = tcp_sk(sk);
+
+   if (ce_state == 1)
+   tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
+   else
+   tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+}
+
+/* Minimal DCTP CE state machine:
+ *
+ * S:  0 <- last pkt was non-CE
+ * 1 <- last pkt was CE
+ */
+static inline void dctcp_ece_ack_update(struct sock *sk, enum tcp_ca_event evt,
+   u32 *prior_rcv_nxt, u32 *ce_state)
+{
+   u32 new_ce_state = (evt == CA_EVENT_ECN_IS_CE) ? 1 : 0;
+
+   if (*ce_state != new_ce_state) {
+   /* CE state has changed, force an immediate ACK to
+* reflect the new CE state. If an ACK was delayed,
+* send that first to reflect the prior CE state.
+*/
+   if (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
+   dctcp_ece_ack_cwr(sk, *ce_state);
+   __tcp_send_ack(sk, *prior_rcv_nxt);
+   }
+   inet_csk(sk)->icsk_ack.pending |= ICSK_ACK_NOW;
+   }
+   *prior_rcv_nxt = tcp_sk(sk)->rcv_nxt;
+   *ce_state = new_ce_state;
+   dctcp_ece_ack_cwr(sk, new_ce_state);
+}
+
+#endif
-- 
2.19.0.605.g01d371f741-goog



Re: [PATCH net 1/2] net: ipv4: update fnhe_pmtu when first hop's MTU changes

2018-10-08 Thread Sabrina Dubroca
2018-10-08, 11:18:49 -0600, David Ahern wrote:
> On 10/8/18 6:36 AM, Sabrina Dubroca wrote:
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index c7861e4b402c..dc9d2668d9bb 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -2458,6 +2458,13 @@ struct netdev_notifier_info {
> > struct netlink_ext_ack  *extack;
> >  };
> >  
> > +struct netdev_notifier_info_ext {
> > +   struct netdev_notifier_info info; /* must be first */
> > +   union {
> > +   u32 u32;
> 
> I realize you want this to be generic, but that is a really odd
> definition. can you make that mtu instead? the union allows other use
> cases to add new names.

It might get ugly if we end up with 4 different u32, but ok, I'll
rename this and we can see how it evolves.

-- 
Sabrina
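
For reference, an editor's sketch of what the renamed variant might
look like (the 'ext' wrapper name is an assumption, not taken from the
thread):

struct netdev_notifier_info_ext {
	struct netdev_notifier_info info; /* must be first */
	union {
		u32 mtu;
	} ext;
};

/* callers would then fill and read .ext.mtu instead of an opaque .u32 */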


Re: [PATCH net-next] net: core: change bool members of struct net_device to bitfield members

2018-10-08 Thread Heiner Kallweit
On 08.10.2018 23:20, Stephen Hemminger wrote:
> On Mon, 8 Oct 2018 22:00:51 +0200
> Heiner Kallweit  wrote:
> 
>>   *
>> + *  @uc_promisc:Counter that indicates promiscuous mode
>> + *  has been enabled due to the need to listen to
>> + *  additional unicast addresses in a device that
>> + *  does not implement ndo_set_rx_mode()
>> + *
> 
> I see you just moved the pre-existing comment, but it the comment
> looks incorrect. uc_promisc is not a counter but a flag. A counter would
> have more than two states normally.
> 
Right. A v2 fixing the comment has been submitted already.


Re: [PATCH net-next] net: core: change bool members of struct net_device to bitfield members

2018-10-08 Thread Stephen Hemminger
On Mon, 8 Oct 2018 22:00:51 +0200
Heiner Kallweit  wrote:

>   *
> + *   @uc_promisc:Counter that indicates promiscuous mode
> + *   has been enabled due to the need to listen to
> + *   additional unicast addresses in a device that
> + *   does not implement ndo_set_rx_mode()
> + *

I see you just moved the pre-existing comment, but the comment
looks incorrect. uc_promisc is not a counter but a flag. A counter would
have more than two states normally.


Re: [PATCH net 1/2] net: ipv4: update fnhe_pmtu when first hop's MTU changes

2018-10-08 Thread kbuild test robot
Hi Sabrina,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:
https://github.com/0day-ci/linux/commits/Sabrina-Dubroca/net-ipv4-update-fnhe_pmtu-when-first-hop-s-MTU-changes/20181008-225709
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'status_stats.retry_failed' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'status_stats.retry_count' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'status_stats.lost_packets' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'status_stats.last_tdls_pkt_time' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'status_stats.msdu_retries' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'status_stats.msdu_failed' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'status_stats.last_ack' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'status_stats.last_ack_signal' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'status_stats.ack_signal_filled' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'status_stats.avg_ack_signal' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'tx_stats.packets' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'tx_stats.bytes' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'tx_stats.last_rate' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 
'tx_stats.msdu' not described in 'sta_info'
   include/linux/dma-buf.h:304: warning: Function parameter or member 
'cb_excl.cb' not described in 'dma_buf'
   include/linux/dma-buf.h:304: warning: Function parameter or member 
'cb_excl.poll' not described in 'dma_buf'
   include/linux/dma-buf.h:304: warning: Function parameter or member 
'cb_excl.active' not described in 'dma_buf'
   include/linux/dma-buf.h:304: warning: Function parameter or member 
'cb_shared.cb' not described in 'dma_buf'
   include/linux/dma-buf.h:304: warning: Function parameter or member 
'cb_shared.poll' not described in 'dma_buf'
   include/linux/dma-buf.h:304: warning: Function parameter or member 
'cb_shared.active' not described in 'dma_buf'
   include/linux/dma-fence-array.h:54: warning: Function parameter or member 
'work' not described in 'dma_fence_array'
   include/linux/gpio/driver.h:142: warning: Function parameter or member 
'request_key' not described in 'gpio_irq_chip'
   include/linux/iio/hw-consumer.h:1: warning: no structured comments found
   include/linux/input/sparse-keymap.h:46: warning: Function parameter or 
member 'sw' not described in 'key_entry'
   drivers/pci/pci.c:218: warning: Excess function parameter 'p' description in 
'pci_dev_str_match_path'
   include/linux/regulator/driver.h:227: warning: Function parameter or member 
'resume' not described in 'regulator_ops'
   drivers/regulator/core.c:4479: warning: Excess function parameter 'state' 
description in 'regulator_suspend'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 
'esw.esw0' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 
'esw.esw1' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 
'esw.esw2' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 
'esw.esw3' not described in 'irb'
   arch/s390/include/asm/cio.h:245: warning: Function parameter or member 
'esw.eadm' not described in 'irb'
   drivers/slimbus/stream.c:1: warning: no structured comments found
   drivers/target/target_core_device.c:1: warning: no structured comments found
   drivers/usb/typec/bus.c:1: warning: no structured comments found
   drivers/usb/typec/class.c:1: warning: no structured comments found
   include/linux/w1.h:281: warning: Function parameter or member 
'of_match_table' not described in 'w1_family'
   fs/direct-io.c:257: warning: Excess function parameter 'offset' description 
in 'dio_complete'
   fs/file_table.c:1: warning: no structured comments found
   fs/libfs.c:477: warning: Excess function parameter 'available' description 
in 'simple_write_end'
   fs/posix_acl.c:646: warning: Function parameter or member 'inode' not 
described in 'posix_acl_update_mode'
   fs/posix_acl.c:646: warning: Function parameter or member 'mode_p' not 
described in 'posix_acl_update_mode'
   fs/posix_acl.c:646: warning: Function par

[PATCH net-next] net/ipv6: Make ipv6_route_table_template static

2018-10-08 Thread David Ahern
From: David Ahern 

ipv6_route_table_template is exported but there are no users outside
of route.c. Make it static.

Signed-off-by: David Ahern 
---
 include/net/ipv6.h | 2 --
 net/ipv6/route.c   | 2 +-
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index ff33f498c137..829650540780 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -1089,8 +1089,6 @@ static inline int snmp6_unregister_dev(struct inet6_dev 
*idev) { return 0; }
 #endif
 
 #ifdef CONFIG_SYSCTL
-extern struct ctl_table ipv6_route_table_template[];
-
 struct ctl_table *ipv6_icmp_sysctl_init(struct net *net);
 struct ctl_table *ipv6_route_sysctl_init(struct net *net);
 int ipv6_sysctl_register(void);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 7c38e0e058ae..bf4cd647d8b8 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -5031,7 +5031,7 @@ int ipv6_sysctl_rtcache_flush(struct ctl_table *ctl, int 
write,
return 0;
 }
 
-struct ctl_table ipv6_route_table_template[] = {
+static struct ctl_table ipv6_route_table_template[] = {
{
.procname   =   "flush",
.data   =   &init_net.ipv6.sysctl.flush_delay,
-- 
2.11.0



[PATCH net-next] rtnetlink: Update comment in rtnl_stats_dump regarding strict data checking

2018-10-08 Thread David Ahern
From: David Ahern 

The NLM_F_DUMP_PROPER_HDR netlink flag was replaced by a setsockopt.
Update the comment in rtnl_stats_dump.

Fixes: 841891ec0c65 ("rtnetlink: Update rtnl_stats_dump for strict data 
checking")
Reported-by: Christian Brauner 
Signed-off-by: David Ahern 
---
 net/core/rtnetlink.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index c894c4af8981..6406e26171ff 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -4775,8 +4775,8 @@ static int rtnl_stats_dump(struct sk_buff *skb, struct 
netlink_callback *cb)
 
ifsm = nlmsg_data(cb->nlh);
 
-   /* only requests using NLM_F_DUMP_PROPER_HDR can pass data to
-* influence the dump. The legacy exception is filter_mask.
+   /* only requests using strict checks can pass data to influence
+* the dump. The legacy exception is filter_mask.
 */
if (cb->strict_check) {
if (ifsm->pad1 || ifsm->pad2 || ifsm->ifindex) {
-- 
2.11.0



[PATCH net-next] rtnetlink: Move ifm in valid_fdb_dump_legacy to closer to use

2018-10-08 Thread David Ahern
From: David Ahern 

Move setting of local variable ifm to after the message parsing in
valid_fdb_dump_legacy. Avoid potential future use of unchecked variable.

Fixes: 8dfbda19a21b ("rtnetlink: Move input checking for rtnl_fdb_dump to 
helper")
Reported-by: Christian Brauner 
Signed-off-by: David Ahern 
---
 net/core/rtnetlink.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 6406e26171ff..46328a10034a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -3857,7 +3857,6 @@ static int valid_fdb_dump_legacy(const struct nlmsghdr 
*nlh,
 int *br_idx, int *brport_idx,
 struct netlink_ext_ack *extack)
 {
-   struct ifinfomsg *ifm = nlmsg_data(nlh);
struct nlattr *tb[IFLA_MAX+1];
int err;
 
@@ -3871,6 +3870,8 @@ static int valid_fdb_dump_legacy(const struct nlmsghdr 
*nlh,
if (nlmsg_len(nlh) != sizeof(struct ndmsg) &&
(nlmsg_len(nlh) != sizeof(struct ndmsg) +
 nla_attr_size(sizeof(u32)))) {
+   struct ifinfomsg *ifm;
+
err = nlmsg_parse(nlh, sizeof(struct ifinfomsg), tb, IFLA_MAX,
  ifla_policy, extack);
if (err < 0) {
@@ -3880,6 +3881,7 @@ static int valid_fdb_dump_legacy(const struct nlmsghdr 
*nlh,
*br_idx = nla_get_u32(tb[IFLA_MASTER]);
}
 
+   ifm = nlmsg_data(nlh);
*brport_idx = ifm->ifi_index;
}
return 0;
-- 
2.11.0



[PATCH iproute2-next] libnetlink: fix use-after-free of message buf

2018-10-08 Thread Vlad Buslov
In the __rtnl_talk_iov() main loop, err is a pointer into the dynamically
allocated 'buf' that is used to store netlink messages. If the netlink
message is an error message, buf is deallocated before returning with an
error code. However, on return err->error is checked one more time to
generate the return value, after the memory which err points to has
already been freed. Save the error code in a temporary variable and use
that variable to generate the return value.

Signed-off-by: Vlad Buslov 
---
 lib/libnetlink.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index f8b8fbfd0010..bc8338052e17 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -802,6 +802,7 @@ static int __rtnl_talk_iov(struct rtnl_handle *rtnl, struct 
iovec *iov,
 
if (h->nlmsg_type == NLMSG_ERROR) {
struct nlmsgerr *err = (struct nlmsgerr 
*)NLMSG_DATA(h);
+   int error = err->error;
 
if (l < sizeof(struct nlmsgerr)) {
fprintf(stderr, "ERROR truncated\n");
@@ -825,7 +826,7 @@ static int __rtnl_talk_iov(struct rtnl_handle *rtnl, struct 
iovec *iov,
else
free(buf);
 
-   return err->error ? -i : 0;
+   return error ? -i : 0;
}
 
if (answer) {
-- 
2.7.5



Re: [RFC PATCH v2 bpf-next 0/2] verifier liveness simplification

2018-10-08 Thread Jiong Wang

On 03/10/2018 17:53, Jiong Wang wrote:

On 03/10/2018 16:59, Alexei Starovoitov wrote:

On Wed, Oct 03, 2018 at 04:36:31PM +0100, Jiong Wang wrote:



Now this hasn't happened. I am still debugging the root cause, but kind of
feel
"64-bit" attribute propagation is the issue, it seems to me it can't be
nicely
integrated into the existing register read/write propagation infrastructure.


may be share your patches that modify the liveness propagation?


OK, I will share it after some clean up.


I have done more experiments on the algo, and finally got something that
passed my local tests. The bpf selftests passed as well, except several
test_verifier failures due to some corner cases I guess; I will have a
look later.

In these tests, x86-64 is using the 32-bit information. In the insn loop
inside sanitize_dead_code, once an insn is not marked as 64-bit and it is
ALU64, it will then be changed to ALU32. When enabling tracking using
printk, I could see quite a few ALU64 instructions really being rewritten
into ALU32, so tests like test_l4lb running OK looks like a positive
signal for correctness.

The algo separates the problem into two steps.

  - First, assume there is no code path prune and all code paths will be walked
through. In this case, if we could propagate 64-bit info backward along the
use-def chains for all paths walked, an insn will be marked as 64-bit
correctly. Finishing this step requires building the use-def chain, and it is
done in the following way:

 1. each insn could have two explicit uses, so add two chain fields in
bpf_insn_aux_data.
 2. we need a finer enum to describe register use, so split SRC_OP into
SRC_OP_0, SRC_OP64_0, SRC_OP_1, SRC_OP64_1 to indicate whether the source
is the first/second source and whether it is a 64-bit source.
 3. create the chain at check_reg_arg, which exactly covers all
register use sites. The function to create the chain is link_reg_to_def.
 4. when creating the chain, if a source is a 64-bit source, also
propagate the information backward along the use-def chain straight away.
This is done in mark_reg_read, which further calls the new function
"mark_insn_64bit", which does the real job. "mark_insn_64bit" fetches
the def insn for the 64-bit source, and further marks the def insns of
its sources as 64-bit. This is repeated until the whole use-def
chain is consumed.
 5. by the use-def chain described above, if there is no code path prune, an
insn must have been marked as 64-bit when its result has a 64-bit use.
 6. a helper call causes implicit reg use and must be conservatively treated
as a 64-bit use; a bpf-to-bpf function call has insns connected by use-def,
so it doesn't need that conservative assumption.

  - Second, to handle code path prune, define the new liveness enums
REG_LIVE_READ64 and REG_LIVE_UNIQUE_WRITTEN. The latter will only be
set if reg_arg_type is the new U_DST_OP or U_DST_OP_NO_MARK, and
REG_LIVE_READ64 will be set if a 64-bit read is not screened off by
REG_LIVE_UNIQUE_WRITTEN.

The thought is that 64-bit use info will only be screened off if the dst
register is unique among all register operands, meaning not the same as any
source. For example, insn 18 below will screen off r4's 64-bit propagation.

  17: r3 += r7
  18: r4 = 1
  19: *(u64 *)(r10 - 32) = r3
  20: *(u64 *)(r10 - 40) = r4

So, U_DST_OP/U_DST_OP_NO_MARK have been introduced to differentiate them
from DST_OP/DST_OP_NO_MARK. Inside check_reg_arg, checks are done on dst_reg,
which pass U_DST_* as reg_arg_type when it is unique. U_DST_* then
sets liveness to REG_LIVE_UNIQUE_WRITTEN. Inside mark_reg_read, if
a 64-bit read is not screened off by REG_LIVE_UNIQUE_WRITTEN, then
REG_LIVE_READ64 will be set in the reg_state. REG_LIVE_READ64 further
triggers propagating downstream 64-bit uses from the pruned paths into the
current path inside propagate_liveness when path prune happens.

Compared with propagating REG_LIVE_READ, propagating REG_LIVE_READ64 needs
more work, because one 64-bit read could indicate that more than one register
is 64-bit. For example, insn 19 above shows r3 is a 64-bit source, so its
def at insn 17 is 64-bit, then all sources of insn 17 must be 64-bit,
meaning both r3 and r7 are 64-bit. Therefore, REG_LIVE_READ64 needs to be
propagated on both r3 and r7 upward along the register parentage chain.
While walking the def-use chain, we record all such affected reg_state and
propagate REG_LIVE_READ64 for all of them. This logic is done inside
mark_insn_64bit as well.

  - In short, the implementation treats the new 64-bit info (REG_LIVE_READ64)
propagation the same as REG_LIVE_READ propagation.

REG_LIVE_READ triggers more path prune during state equal comparison, while
REG_LIVE_READ64 triggers 64-bit insn marking during def-use 
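
Purely as an editor's illustration of the backward 64-bit marking
described above (not actual verifier code; the struct, its fields and
the function name are all made up for the sketch):

/* kernel-style sketch; in user space add <stdbool.h> */
struct insn_aux_sketch {
	bool full_ref;		/* result known to be needed in full 64 bits */
	int src_def[2];		/* insn index defining each source, -1 if none */
};

static void mark_insn_64bit_sketch(struct insn_aux_sketch *aux, int def_idx)
{
	int i;

	if (def_idx < 0 || aux[def_idx].full_ref)
		return;
	aux[def_idx].full_ref = true;

	/* every source feeding a 64-bit result is itself a 64-bit use,
	 * so keep walking the recorded use-def chain backwards
	 */
	for (i = 0; i < 2; i++)
		mark_insn_64bit_sketch(aux, aux[def_idx].src_def[i]);
}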

[PATCH net-next v2] net: core: change bool members of struct net_device to bitfield members

2018-10-08 Thread Heiner Kallweit
bool is good as a parameter type or function return type, but if used
for struct members it consumes more memory than needed.
Changing the bool members of struct net_device to bitfield members
makes it possible to decrease the memory footprint of this struct.
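
A quick editor's illustration of the effect (not from the patch; exact
sizes depend on the ABI, but on a typical LP64 target this prints
"32 vs 24"):

#include <stdbool.h>
#include <stdio.h>

struct with_bools {		/* bool members scattered between pointers */
	void *p;
	bool a;
	void *q;
	bool b;
};

struct with_bitfields {		/* flags packed into one bitfield word */
	void *p;
	void *q;
	unsigned a:1;
	unsigned b:1;
};

int main(void)
{
	printf("%zu vs %zu\n", sizeof(struct with_bools),
	       sizeof(struct with_bitfields));
	return 0;
}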

Signed-off-by: Heiner Kallweit 
---
v2:
- Change Counter to Flag in description of uc_promisc
---
 include/linux/netdevice.h | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 76603ee13..3d7b8df2e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1651,10 +1651,6 @@ enum netdev_priv_flags {
  * @dev_port:  Used to differentiate devices that share
  * the same function
  * @addr_list_lock:XXX: need comments on this one
- * @uc_promisc:Counter that indicates promiscuous mode
- * has been enabled due to the need to listen to
- * additional unicast addresses in a device that
- * does not implement ndo_set_rx_mode()
  * @uc:unicast mac addresses
  * @mc:multicast mac addresses
  * @dev_addrs: list of device hw addresses
@@ -1714,11 +1710,9 @@ enum netdev_priv_flags {
  * @link_watch_list:   XXX: need comments on this one
  *
  * @reg_state: Register/unregister state machine
- * @dismantle: Device is going to be freed
  * @rtnl_link_state:   This enum represents the phases of creating
  * a new link
  *
- * @needs_free_netdev: Should unregister perform free_netdev?
  * @priv_destructor:   Called from unregister
  * @npinfo:XXX: need comments on this one
  * @nd_net:Network namespace this network device is inside
@@ -1758,6 +1752,15 @@ enum netdev_priv_flags {
  * @qdisc_tx_busylock: lockdep class annotating Qdisc->busylock spinlock
  * @qdisc_running_key: lockdep class annotating Qdisc->running seqcount
  *
+ * @uc_promisc:Flag that indicates promiscuous mode
+ * has been enabled due to the need to listen to
+ * additional unicast addresses in a device that
+ * does not implement ndo_set_rx_mode()
+ *
+ * @dismantle: Device is going to be freed
+ *
+ * @needs_free_netdev: Should unregister perform free_netdev?
+ *
  * @proto_down:protocol port state information can be sent to the
  * switch driver and used to set the phys state of the
  * switch port.
@@ -1879,7 +1882,6 @@ struct net_device {
unsigned short  dev_port;
spinlock_t  addr_list_lock;
unsigned char   name_assign_type;
-   booluc_promisc;
struct netdev_hw_addr_list  uc;
struct netdev_hw_addr_list  mc;
struct netdev_hw_addr_list  dev_addrs;
@@ -1986,14 +1988,11 @@ struct net_device {
   NETREG_DUMMY,/* dummy device for NAPI poll */
} reg_state:8;
 
-   bool dismantle;
-
enum {
RTNL_LINK_INITIALIZED,
RTNL_LINK_INITIALIZING,
} rtnl_link_state:16;
 
-   bool needs_free_netdev;
void (*priv_destructor)(struct net_device *dev);
 
 #ifdef CONFIG_NETPOLL
@@ -2046,7 +2045,10 @@ struct net_device {
struct sfp_bus  *sfp_bus;
struct lock_class_key   *qdisc_tx_busylock;
struct lock_class_key   *qdisc_running_key;
-   boolproto_down;
+   unsigneduc_promisc:1;
+   unsigneddismantle:1;
+   unsignedneeds_free_netdev:1;
+   unsignedproto_down:1;
unsignedwol_enabled:1;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
-- 
2.19.1



Re: [PATCH net-next] net: core: change bool members of struct net_device to bitfield members

2018-10-08 Thread Heiner Kallweit
On 08.10.2018 22:07, Randy Dunlap wrote:
> On 10/8/18 1:00 PM, Heiner Kallweit wrote:
>> bool is good as parameter type or function return type, but if used
>> for struct members it consumes more memory than needed.
>> Changing the bool members of struct net_device to bitfield members
>> allows to decrease the memory footprint of this struct.
>>
>> Signed-off-by: Heiner Kallweit 
>> ---
>>  include/linux/netdevice.h | 24 +---
>>  1 file changed, 13 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 76603ee13..3d7b8df2e 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1651,10 +1651,6 @@ enum netdev_priv_flags {
>>   *  @dev_port:  Used to differentiate devices that share
>>   *  the same function
>>   *  @addr_list_lock:XXX: need comments on this one
>> - *  @uc_promisc:Counter that indicates promiscuous mode
>> - *  has been enabled due to the need to listen to
>> - *  additional unicast addresses in a device that
>> - *  does not implement ndo_set_rx_mode()
>>   *  @uc:unicast mac addresses
>>   *  @mc:multicast mac addresses
>>   *  @dev_addrs: list of device hw addresses
>> @@ -1714,11 +1710,9 @@ enum netdev_priv_flags {
>>   *  @link_watch_list:   XXX: need comments on this one
>>   *
>>   *  @reg_state: Register/unregister state machine
>> - *  @dismantle: Device is going to be freed
>>   *  @rtnl_link_state:   This enum represents the phases of creating
>>   *  a new link
>>   *
>> - *  @needs_free_netdev: Should unregister perform free_netdev?
>>   *  @priv_destructor:   Called from unregister
>>   *  @npinfo:XXX: need comments on this one
>>   *  @nd_net:Network namespace this network device is inside
>> @@ -1758,6 +1752,15 @@ enum netdev_priv_flags {
>>   *  @qdisc_tx_busylock: lockdep class annotating Qdisc->busylock spinlock
>>   *  @qdisc_running_key: lockdep class annotating Qdisc->running seqcount
>>   *
>> + *  @uc_promisc:Counter that indicates promiscuous mode
>> + *  has been enabled due to the need to listen to
>> + *  additional unicast addresses in a device that
>> + *  does not implement ndo_set_rx_mode()
> 
> Hi,
> 
> I see that all you did is copy/paste that text (above), but I wouldn't call
> a single bit a [1-bit] Counter.
> 
I stumbled across this comment too. Neither a bool member nor a 1-bit
bitfield member should be called a counter. I kept the original comment, 
but I'm totally fine with changing Counter -> Flag and will provide a v2.

> thanks,
> 


Re: [PATCH net-next] net: core: change bool members of struct net_device to bitfield members

2018-10-08 Thread Randy Dunlap
On 10/8/18 1:00 PM, Heiner Kallweit wrote:
> bool is good as parameter type or function return type, but if used
> for struct members it consumes more memory than needed.
> Changing the bool members of struct net_device to bitfield members
> allows to decrease the memory footprint of this struct.
> 
> Signed-off-by: Heiner Kallweit 
> ---
>  include/linux/netdevice.h | 24 +---
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 76603ee13..3d7b8df2e 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1651,10 +1651,6 @@ enum netdev_priv_flags {
>   *   @dev_port:  Used to differentiate devices that share
>   *   the same function
>   *   @addr_list_lock:XXX: need comments on this one
> - *   @uc_promisc:Counter that indicates promiscuous mode
> - *   has been enabled due to the need to listen to
> - *   additional unicast addresses in a device that
> - *   does not implement ndo_set_rx_mode()
>   *   @uc:unicast mac addresses
>   *   @mc:multicast mac addresses
>   *   @dev_addrs: list of device hw addresses
> @@ -1714,11 +1710,9 @@ enum netdev_priv_flags {
>   *   @link_watch_list:   XXX: need comments on this one
>   *
>   *   @reg_state: Register/unregister state machine
> - *   @dismantle: Device is going to be freed
>   *   @rtnl_link_state:   This enum represents the phases of creating
>   *   a new link
>   *
> - *   @needs_free_netdev: Should unregister perform free_netdev?
>   *   @priv_destructor:   Called from unregister
>   *   @npinfo:XXX: need comments on this one
>   *   @nd_net:Network namespace this network device is inside
> @@ -1758,6 +1752,15 @@ enum netdev_priv_flags {
>   *   @qdisc_tx_busylock: lockdep class annotating Qdisc->busylock spinlock
>   *   @qdisc_running_key: lockdep class annotating Qdisc->running seqcount
>   *
> + *   @uc_promisc:Counter that indicates promiscuous mode
> + *   has been enabled due to the need to listen to
> + *   additional unicast addresses in a device that
> + *   does not implement ndo_set_rx_mode()

Hi,

I see that all you did is copy/paste that text (above), but I wouldn't call
a single bit a [1-bit] Counter.

thanks,
-- 
~Randy


[PATCH net-next] net: core: change bool members of struct net_device to bitfield members

2018-10-08 Thread Heiner Kallweit
bool is good as a parameter type or function return type, but if used
for struct members it consumes more memory than needed.
Changing the bool members of struct net_device to bitfield members
makes it possible to decrease the memory footprint of this struct.

Signed-off-by: Heiner Kallweit 
---
 include/linux/netdevice.h | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 76603ee13..3d7b8df2e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1651,10 +1651,6 @@ enum netdev_priv_flags {
  * @dev_port:  Used to differentiate devices that share
  * the same function
  * @addr_list_lock:XXX: need comments on this one
- * @uc_promisc:Counter that indicates promiscuous mode
- * has been enabled due to the need to listen to
- * additional unicast addresses in a device that
- * does not implement ndo_set_rx_mode()
  * @uc:unicast mac addresses
  * @mc:multicast mac addresses
  * @dev_addrs: list of device hw addresses
@@ -1714,11 +1710,9 @@ enum netdev_priv_flags {
  * @link_watch_list:   XXX: need comments on this one
  *
  * @reg_state: Register/unregister state machine
- * @dismantle: Device is going to be freed
  * @rtnl_link_state:   This enum represents the phases of creating
  * a new link
  *
- * @needs_free_netdev: Should unregister perform free_netdev?
  * @priv_destructor:   Called from unregister
  * @npinfo:XXX: need comments on this one
  * @nd_net:Network namespace this network device is inside
@@ -1758,6 +1752,15 @@ enum netdev_priv_flags {
  * @qdisc_tx_busylock: lockdep class annotating Qdisc->busylock spinlock
  * @qdisc_running_key: lockdep class annotating Qdisc->running seqcount
  *
+ * @uc_promisc:Counter that indicates promiscuous mode
+ * has been enabled due to the need to listen to
+ * additional unicast addresses in a device that
+ * does not implement ndo_set_rx_mode()
+ *
+ * @dismantle: Device is going to be freed
+ *
+ * @needs_free_netdev: Should unregister perform free_netdev?
+ *
  * @proto_down:protocol port state information can be sent to the
  * switch driver and used to set the phys state of the
  * switch port.
@@ -1879,7 +1882,6 @@ struct net_device {
unsigned short  dev_port;
spinlock_t  addr_list_lock;
unsigned char   name_assign_type;
-   booluc_promisc;
struct netdev_hw_addr_list  uc;
struct netdev_hw_addr_list  mc;
struct netdev_hw_addr_list  dev_addrs;
@@ -1986,14 +1988,11 @@ struct net_device {
   NETREG_DUMMY,/* dummy device for NAPI poll */
} reg_state:8;
 
-   bool dismantle;
-
enum {
RTNL_LINK_INITIALIZED,
RTNL_LINK_INITIALIZING,
} rtnl_link_state:16;
 
-   bool needs_free_netdev;
void (*priv_destructor)(struct net_device *dev);
 
 #ifdef CONFIG_NETPOLL
@@ -2046,7 +2045,10 @@ struct net_device {
struct sfp_bus  *sfp_bus;
struct lock_class_key   *qdisc_tx_busylock;
struct lock_class_key   *qdisc_running_key;
-   boolproto_down;
+   unsigneduc_promisc:1;
+   unsigneddismantle:1;
+   unsignedneeds_free_netdev:1;
+   unsignedproto_down:1;
unsignedwol_enabled:1;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
-- 
2.19.1



[PATCH bpf-next 6/6] selftests/bpf: add test cases for queue and stack maps

2018-10-08 Thread Mauricio Vasquez B
test_maps:
Tests that queue/stack maps are behaving correctly even in corner cases

test_progs:
Tests new ebpf helpers
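
For illustration (editor's sketch, not one of the selftests added by
this patch): a BPF program would use the new helpers declared in
bpf_helpers.h below roughly like this. The includes, section name, map
sizes and the key_size == 0 convention are assumptions here.

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") queue_map = {
	.type = BPF_MAP_TYPE_QUEUE,
	.key_size = 0,			/* queue/stack maps take no key */
	.value_size = sizeof(__u32),
	.max_entries = 32,
};

SEC("test")
int queue_user(struct __sk_buff *skb)
{
	__u32 val = skb->len;

	/* enqueue, then peek and dequeue in FIFO order */
	bpf_map_push_elem(&queue_map, &val, 0);
	if (bpf_map_peek_elem(&queue_map, &val) == 0 &&
	    bpf_map_pop_elem(&queue_map, &val) == 0)
		return val;

	return 0;
}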

Signed-off-by: Mauricio Vasquez B 
---
 tools/lib/bpf/bpf.c|   12 ++
 tools/lib/bpf/bpf.h|1 
 tools/testing/selftests/bpf/Makefile   |6 +
 tools/testing/selftests/bpf/bpf_helpers.h  |7 +
 tools/testing/selftests/bpf/test_maps.c|  122 
 tools/testing/selftests/bpf/test_progs.c   |   99 
 tools/testing/selftests/bpf/test_queue_map.c   |4 +
 tools/testing/selftests/bpf/test_queue_stack_map.h |   59 ++
 tools/testing/selftests/bpf/test_stack_map.c   |4 +
 9 files changed, 313 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
 create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
 create mode 100644 tools/testing/selftests/bpf/test_stack_map.c

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 3878a26a2071..13810c88e1b6 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -278,6 +278,18 @@ int bpf_map_lookup_elem(int fd, const void *key, void 
*value)
	return sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
 }
 
+int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value)
+{
+   union bpf_attr attr;
+
+   bzero(&attr, sizeof(attr));
+   attr.map_fd = fd;
+   attr.key = ptr_to_u64(key);
+   attr.value = ptr_to_u64(value);
+
+   return sys_bpf(BPF_MAP_LOOKUP_AND_DELETE_ELEM, &attr, sizeof(attr));
+}
+
 int bpf_map_delete_elem(int fd, const void *key)
 {
union bpf_attr attr;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 6f38164b2618..6134ed9517d3 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -86,6 +86,7 @@ int bpf_map_update_elem(int fd, const void *key, const void 
*value,
__u64 flags);
 
 int bpf_map_lookup_elem(int fd, const void *key, void *value);
+int bpf_map_lookup_and_delete_elem(int fd, const void *key, const void *value);
 int bpf_map_delete_elem(int fd, const void *key);
 int bpf_map_get_next_key(int fd, const void *key, void *next_key);
 int bpf_obj_pin(int fd, const char *pathname);
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 1381ab81099c..f78cf72832aa 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -36,7 +36,8 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-   test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o test_sk_lookup_kern.o
+   test_skb_cgroup_id_kern.o bpf_flow.o netcnt_prog.o 
test_sk_lookup_kern.o \
+   test_queue_map.o test_stack_map.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -114,6 +115,9 @@ CLANG_FLAGS = -I. -I./include/uapi -I../../../include/uapi \
 $(OUTPUT)/test_l4lb_noinline.o: CLANG_FLAGS += -fno-inline
 $(OUTPUT)/test_xdp_noinline.o: CLANG_FLAGS += -fno-inline
 
+$(OUTPUT)/test_queue_map.o: test_queue_stack_map.h
+$(OUTPUT)/test_stack_map.o: test_queue_stack_map.h
+
 BTF_LLC_PROBE := $(shell $(LLC) -march=bpf -mattr=help 2>&1 | grep dwarfris)
 BTF_PAHOLE_PROBE := $(shell $(BTF_PAHOLE) --help 2>&1 | grep BTF)
 BTF_OBJCOPY_PROBE := $(shell $(LLVM_OBJCOPY) --help 2>&1 | grep -i 
'usage.*llvm')
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h 
b/tools/testing/selftests/bpf/bpf_helpers.h
index 1d407b3494f9..58dfcb88f9b4 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -16,6 +16,13 @@ static int (*bpf_map_update_elem)(void *map, void *key, void 
*value,
(void *) BPF_FUNC_map_update_elem;
 static int (*bpf_map_delete_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_delete_elem;
+static int (*bpf_map_push_elem)(void *map, void *value,
+   unsigned long long flags) =
+   (void *) BPF_FUNC_map_push_elem;
+static int (*bpf_map_pop_elem)(void *map, void *value) =
+   (void *) BPF_FUNC_map_pop_elem;
+static int (*bpf_map_peek_elem)(void *map, void *value) =
+   (void *) BPF_FUNC_map_peek_elem;
 static int (*bpf_probe_read)(void *dst, int size, void *unsafe_ptr) =
(void *) BPF_FUNC_probe_read;
 static unsigned long long (*bpf_ktime_get_ns)(void) =
diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 9b552c0fc47d..4db2116e52be 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -471,6 +472,122 

[PATCH bpf-next 5/6] Sync uapi/bpf.h to tools/include

2018-10-08 Thread Mauricio Vasquez B
Sync tools/include/uapi/linux/bpf.h with the kernel's include/uapi/linux/bpf.h, which earlier patches in this series modified.

Signed-off-by: Mauricio Vasquez B 
---
 tools/include/uapi/linux/bpf.h |   36 
 1 file changed, 32 insertions(+), 4 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index f9187b41dff6..bfa042273fad 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -103,6 +103,7 @@ enum bpf_cmd {
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
BPF_TASK_FD_QUERY,
+   BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 };
 
 enum bpf_map_type {
@@ -128,6 +129,8 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_STORAGE,
BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
+   BPF_MAP_TYPE_QUEUE,
+   BPF_MAP_TYPE_STACK,
 };
 
 enum bpf_prog_type {
@@ -462,6 +465,28 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
+ * int bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+ * Description
+ * Push an element *value* in *map*. *flags* is one of:
+ *
+ * **BPF_EXIST**
+ * If the queue/stack is full, the oldest element is removed to
+ * make room for this.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_map_pop_elem(struct bpf_map *map, void *value)
+ * Description
+ * Pop an element from *map*.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_map_peek_elem(struct bpf_map *map, void *value)
+ * Description
+ * Get an element from *map* without removing it.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
  * int bpf_probe_read(void *dst, u32 size, const void *src)
  * Description
  * For tracing programs, safely attempt to read *size* bytes from
@@ -789,14 +814,14 @@ union bpf_attr {
  *
  * int ret;
  * struct bpf_tunnel_key key = {};
- * 
+ *
  * ret = bpf_skb_get_tunnel_key(skb, &key, sizeof(key), 0);
  * if (ret < 0)
  * return TC_ACT_SHOT; // drop packet
- * 
+ *
  * if (key.remote_ipv4 != 0x0a000001)
  * return TC_ACT_SHOT; // drop packet
- * 
+ *
  * return TC_ACT_OK;   // accept packet
  *
  * This interface can also be used with all encapsulation devices
@@ -2303,7 +2328,10 @@ union bpf_attr {
FN(skb_ancestor_cgroup_id), \
FN(sk_lookup_tcp),  \
FN(sk_lookup_udp),  \
-   FN(sk_release),
+   FN(sk_release), \
+   FN(map_push_elem),  \
+   FN(map_pop_elem),   \
+   FN(map_peek_elem),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call



[PATCH bpf-next 2/6] bpf/syscall: allow key to be null in map functions

2018-10-08 Thread Mauricio Vasquez B
This commit adds the logic required to allow the key to be NULL
when the key_size of the map is 0.

A new helper, __bpf_copy_key(), copies the key from userspace only
when key_size != 0; otherwise it enforces that the key must be
NULL.

Signed-off-by: Mauricio Vasquez B 
---
 kernel/bpf/syscall.c |   19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5742df21598c..eb75e8af73ff 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -651,6 +651,17 @@ int __weak bpf_stackmap_copy(struct bpf_map *map, void 
*key, void *value)
return -ENOTSUPP;
 }
 
+static void *__bpf_copy_key(void __user *ukey, u64 key_size)
+{
+   if (key_size)
+   return memdup_user(ukey, key_size);
+
+   if (ukey)
+   return ERR_PTR(-EINVAL);
+
+   return NULL;
+}
+
 /* last field in 'union bpf_attr' used by this command */
 #define BPF_MAP_LOOKUP_ELEM_LAST_FIELD value
 
@@ -678,7 +689,7 @@ static int map_lookup_elem(union bpf_attr *attr)
goto err_put;
}
 
-   key = memdup_user(ukey, map->key_size);
+   key = __bpf_copy_key(ukey, map->key_size);
if (IS_ERR(key)) {
err = PTR_ERR(key);
goto err_put;
@@ -769,7 +780,7 @@ static int map_update_elem(union bpf_attr *attr)
goto err_put;
}
 
-   key = memdup_user(ukey, map->key_size);
+   key = __bpf_copy_key(ukey, map->key_size);
if (IS_ERR(key)) {
err = PTR_ERR(key);
goto err_put;
@@ -871,7 +882,7 @@ static int map_delete_elem(union bpf_attr *attr)
goto err_put;
}
 
-   key = memdup_user(ukey, map->key_size);
+   key = __bpf_copy_key(ukey, map->key_size);
if (IS_ERR(key)) {
err = PTR_ERR(key);
goto err_put;
@@ -923,7 +934,7 @@ static int map_get_next_key(union bpf_attr *attr)
}
 
if (ukey) {
-   key = memdup_user(ukey, map->key_size);
+   key = __bpf_copy_key(ukey, map->key_size);
if (IS_ERR(key)) {
err = PTR_ERR(key);
goto err_put;



[PATCH bpf-next 3/6] bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall

2018-10-08 Thread Mauricio Vasquez B
A following patch implements BPF queue/stack maps that provide the
peek/pop/push operations.  There is no direct relationship between
those operations and the existing map syscalls, hence a new
BPF_MAP_LOOKUP_AND_DELETE_ELEM command is added.  It is mapped to the
pop operation in the queue/stack maps and is left unimplemented for
other kinds of maps.
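
For illustration only (not part of the patch), userspace can drive the
new command directly through the bpf(2) syscall; the wrapper added to
libbpf later in this series does essentially the same thing:

#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Minimal sketch, assuming uapi headers from a kernel with this series
 * applied (BPF_MAP_LOOKUP_AND_DELETE_ELEM is new); error handling omitted.
 */
static int map_pop(int map_fd, const void *key, void *value)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = map_fd;
	attr.key = (__u64)(unsigned long)key;	/* NULL for queue/stack maps */
	attr.value = (__u64)(unsigned long)value;

	return syscall(__NR_bpf, BPF_MAP_LOOKUP_AND_DELETE_ELEM, &attr,
		       sizeof(attr));
}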

Signed-off-by: Mauricio Vasquez B 
---
 include/linux/bpf.h  |1 +
 include/uapi/linux/bpf.h |1 +
 kernel/bpf/syscall.c |   81 ++
 3 files changed, 83 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 027697b6a22f..98c7eeb6d138 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -39,6 +39,7 @@ struct bpf_map_ops {
void *(*map_lookup_elem)(struct bpf_map *map, void *key);
int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 
flags);
int (*map_delete_elem)(struct bpf_map *map, void *key);
+   void *(*map_lookup_and_delete_elem)(struct bpf_map *map, void *key);
 
/* funcs called by prog_array and perf_event_array map */
void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index f9187b41dff6..3bb94aa2d408 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -103,6 +103,7 @@ enum bpf_cmd {
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
BPF_TASK_FD_QUERY,
+   BPF_MAP_LOOKUP_AND_DELETE_ELEM,
 };
 
 enum bpf_map_type {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index eb75e8af73ff..c33d9303f72f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -975,6 +975,84 @@ static int map_get_next_key(union bpf_attr *attr)
return err;
 }
 
+#define BPF_MAP_LOOKUP_AND_DELETE_ELEM_LAST_FIELD value
+
+static int map_lookup_and_delete_elem(union bpf_attr *attr)
+{
+   void __user *ukey = u64_to_user_ptr(attr->key);
+   void __user *uvalue = u64_to_user_ptr(attr->value);
+   int ufd = attr->map_fd;
+   struct bpf_map *map;
+   void *key, *value, *ptr;
+   u32 value_size;
+   struct fd f;
+   int err;
+
+   if (CHECK_ATTR(BPF_MAP_LOOKUP_AND_DELETE_ELEM))
+   return -EINVAL;
+
+   f = fdget(ufd);
+   map = __bpf_map_get(f);
+   if (IS_ERR(map))
+   return PTR_ERR(map);
+
+   if (!(f.file->f_mode & FMODE_CAN_WRITE)) {
+   err = -EPERM;
+   goto err_put;
+   }
+
+   key = __bpf_copy_key(ukey, map->key_size);
+   if (IS_ERR(key)) {
+   err = PTR_ERR(key);
+   goto err_put;
+   }
+
+   value_size = map->value_size;
+
+   err = -ENOMEM;
+   value = kmalloc(value_size, GFP_USER | __GFP_NOWARN);
+   if (!value)
+   goto free_key;
+
+   err = -EFAULT;
+   if (copy_from_user(value, uvalue, value_size) != 0)
+   goto free_value;
+
+   /* must increment bpf_prog_active to avoid kprobe+bpf triggering from
+* inside bpf map update or delete otherwise deadlocks are possible
+*/
+   preempt_disable();
+   __this_cpu_inc(bpf_prog_active);
+   if (!map->ops->map_lookup_and_delete_elem) {
+   err = -ENOTSUPP;
+   goto free_value;
+   }
+   rcu_read_lock();
+   ptr = map->ops->map_lookup_and_delete_elem(map, key);
+   if (ptr)
+   memcpy(value, ptr, value_size);
+   rcu_read_unlock();
+   err = ptr ? 0 : -ENOENT;
+   __this_cpu_dec(bpf_prog_active);
+   preempt_enable();
+
+   if (err)
+   goto free_value;
+
+   if (copy_to_user(uvalue, value, value_size) != 0)
+   goto free_value;
+
+   err = 0;
+
+free_value:
+   kfree(value);
+free_key:
+   kfree(key);
+err_put:
+   fdput(f);
+   return err;
+}
+
 static const struct bpf_prog_ops * const bpf_prog_types[] = {
 #define BPF_PROG_TYPE(_id, _name) \
[_id] = & _name ## _prog_ops,
@@ -2448,6 +2526,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
uattr, unsigned int, siz
case BPF_TASK_FD_QUERY:
err = bpf_task_fd_query(&attr, uattr);
break;
+   case BPF_MAP_LOOKUP_AND_DELETE_ELEM:
+   err = map_lookup_and_delete_elem(&attr);
+   break;
default:
err = -EINVAL;
break;



[PATCH bpf-next 4/6] bpf: add queue and stack maps

2018-10-08 Thread Mauricio Vasquez B
Queue/stack maps implement FIFO/LIFO data storage for eBPF programs.
These maps support peek, pop and push operations that are exposed to eBPF
programs through the new bpf_map_{peek,pop,push}_elem helpers.  Those
operations are exposed to userspace applications through the already
existing syscalls in the following way:

BPF_MAP_LOOKUP_ELEM            -> peek
BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
BPF_MAP_UPDATE_ELEM            -> push

Queue/stack maps are implemented using a buffer plus head and tail indexes,
hence BPF_F_NO_PREALLOC is not supported.

Unlike other maps, queue and stack maps do not use RCU for protecting map
values; the bpf_map_{peek,pop}_elem helpers take an
ARG_PTR_TO_UNINIT_MAP_VALUE argument, which is a pointer to a memory area
where the map value is saved.  It is basically the same as
ARG_PTR_TO_UNINIT_MEM, except that the size does not have to be passed as
an extra argument.

Our main motivation for implementing queue/stack maps was to keep track of
a pool of elements, like network ports in a SNAT; however, we foresee other
use cases, such as saving the last N kernel events in a map and then
analysing them from userspace.
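
Purely as an illustration (not part of this patch), an eBPF program built
against the selftests' bpf_helpers.h from this series could use the new
helpers along these lines; the XDP section name and the map layout are
assumptions made for the sketch:

#include <linux/bpf.h>
#include "bpf_helpers.h"

/* Hypothetical pool of ports kept in a queue map. */
struct bpf_map_def SEC("maps") port_pool = {
	.type = BPF_MAP_TYPE_QUEUE,
	.key_size = 0,			/* queue/stack maps have no key */
	.value_size = sizeof(__u32),
	.max_entries = 1024,
};

SEC("xdp")
int use_port_pool(struct xdp_md *ctx)
{
	__u32 port;

	/* Take a free port from the pool; nothing to do if it is empty. */
	if (bpf_map_pop_elem(&port_pool, &port))
		return XDP_PASS;

	/* ... use the port, then give it back.  Flags 0 means "fail if the
	 * map is full" rather than replacing the oldest element (BPF_EXIST).
	 */
	bpf_map_push_elem(&port_pool, &port, 0);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";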

Signed-off-by: Mauricio Vasquez B 
---
 include/linux/bpf.h   |7 +
 include/linux/bpf_types.h |2 
 include/uapi/linux/bpf.h  |   35 -
 kernel/bpf/Makefile   |2 
 kernel/bpf/core.c |3 
 kernel/bpf/helpers.c  |   43 ++
 kernel/bpf/queue_stack_maps.c |  288 +
 kernel/bpf/syscall.c  |   30 +++-
 kernel/bpf/verifier.c |   28 +++-
 net/core/filter.c |6 +
 10 files changed, 426 insertions(+), 18 deletions(-)
 create mode 100644 kernel/bpf/queue_stack_maps.c

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 98c7eeb6d138..cad3bc5cffd1 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -40,6 +40,9 @@ struct bpf_map_ops {
int (*map_update_elem)(struct bpf_map *map, void *key, void *value, u64 
flags);
int (*map_delete_elem)(struct bpf_map *map, void *key);
void *(*map_lookup_and_delete_elem)(struct bpf_map *map, void *key);
+   int (*map_push_elem)(struct bpf_map *map, void *value, u64 flags);
+   int (*map_pop_elem)(struct bpf_map *map, void *value);
+   int (*map_peek_elem)(struct bpf_map *map, void *value);
 
/* funcs called by prog_array and perf_event_array map */
void *(*map_fd_get_ptr)(struct bpf_map *map, struct file *map_file,
@@ -139,6 +142,7 @@ enum bpf_arg_type {
ARG_CONST_MAP_PTR,  /* const argument used as pointer to bpf_map */
ARG_PTR_TO_MAP_KEY, /* pointer to stack used as map key */
ARG_PTR_TO_MAP_VALUE,   /* pointer to stack used as map value */
+   ARG_PTR_TO_UNINIT_MAP_VALUE,/* pointer to valid memory used to 
store a map value */
 
/* the following constraints used to prototype bpf_memcmp() and other
 * functions that access data on eBPF program stack
@@ -825,6 +829,9 @@ static inline int bpf_fd_reuseport_array_update_elem(struct 
bpf_map *map,
 extern const struct bpf_func_proto bpf_map_lookup_elem_proto;
 extern const struct bpf_func_proto bpf_map_update_elem_proto;
 extern const struct bpf_func_proto bpf_map_delete_elem_proto;
+extern const struct bpf_func_proto bpf_map_push_elem_proto;
+extern const struct bpf_func_proto bpf_map_pop_elem_proto;
+extern const struct bpf_func_proto bpf_map_peek_elem_proto;
 
 extern const struct bpf_func_proto bpf_get_prandom_u32_proto;
 extern const struct bpf_func_proto bpf_get_smp_processor_id_proto;
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 658509daacd4..a2ec73aa1ec7 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -69,3 +69,5 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_XSKMAP, xsk_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, reuseport_array_ops)
 #endif
 #endif
+BPF_MAP_TYPE(BPF_MAP_TYPE_QUEUE, queue_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STACK, stack_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 3bb94aa2d408..bfa042273fad 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -129,6 +129,8 @@ enum bpf_map_type {
BPF_MAP_TYPE_CGROUP_STORAGE,
BPF_MAP_TYPE_REUSEPORT_SOCKARRAY,
BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE,
+   BPF_MAP_TYPE_QUEUE,
+   BPF_MAP_TYPE_STACK,
 };
 
 enum bpf_prog_type {
@@ -463,6 +465,28 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
+ * int bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+ * Description
+ * Push an element *value* in *map*. *flags* is one of:
+ *
+ * **BPF_EXIST**
+ * If the queue/stack is full, the oldest element is removed to
+ * make room for this.
+ * Return
+ * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_map_pop_elem(struct bpf_map *map, void 

[PATCH bpf-next 0/6] Implement queue/stack maps

2018-10-08 Thread Mauricio Vasquez B
Some applications need a pool of free elements, for example the list of
free L4 ports in a SNAT.  None of the current maps allow this, as it is not
possible to get an element without knowing the key it is associated with;
even if it were possible, the lack of locking mechanisms in eBPF would make
it almost impossible to implement without data races.

This patchset implements two new kinds of eBPF maps: queue and stack.
Those maps provide the peek, push and pop operations to eBPF programs, and
for userspace applications a new bpf_map_lookup_and_delete_elem() is added.
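
A minimal userspace sketch of that flow, using the libbpf wrappers added in
patch 6/6 (map creation is not shown; the include path and the value are
just assumptions of the example):

#include <bpf/bpf.h>	/* tools/lib/bpf, with the wrapper from patch 6/6 */

/* Push one value into an already-created queue map and pop it back.
 * Queue/stack maps have key_size 0, so the key argument is NULL (allowed
 * by patch 2/6).
 */
static int push_pop_once(int map_fd)
{
	__u32 in = 42, out = 0;
	int err;

	err = bpf_map_update_elem(map_fd, NULL, &in, 0);	/* push */
	if (err)
		return err;

	return bpf_map_lookup_and_delete_elem(map_fd, NULL, &out); /* pop */
}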

Signed-off-by: Mauricio Vasquez B 

RFC v4 -> v1:
 - Remove roundup to power of 2 in memory allocation
 - Remove count and use a free slot to check if queue/stack is empty
 - Use if + assignment for wrapping indexes
 - Fix some minor style issues
 - Squash two patches together

RFC v3 -> RFC v4:
 - Revert renaming of kernel/bpf/stackmap.c
 - Remove restriction on value size
 - Remove len arguments from peek/pop helpers
 - Add new ARG_PTR_TO_UNINIT_MAP_VALUE

RFC v2 -> RFC v3:
 - Return elements by value instead of by reference
 - Implement queue/stack based on an array and head + tail indexes
 - Rename stack trace related files to avoid confusion and conflicts

RFC v1 -> RFC v2:
 - Create two separate maps instead of single one + flags
 - Implement bpf_map_lookup_and_delete syscall
 - Support peek operation
 - Define replacement policy through flags in the update() method
 - Add eBPF side tests

---

Mauricio Vasquez B (6):
  bpf: rename stack trace map operations
  bpf/syscall: allow key to be null in map functions
  bpf: add MAP_LOOKUP_AND_DELETE_ELEM syscall
  bpf: add queue and stack maps
  Sync uapi/bpf.h to tools/include
  selftests/bpf: add test cases for queue and stack maps


 include/linux/bpf.h|8 +
 include/linux/bpf_types.h  |4 
 include/uapi/linux/bpf.h   |   36 ++-
 kernel/bpf/Makefile|2 
 kernel/bpf/core.c  |3 
 kernel/bpf/helpers.c   |   43 +++
 kernel/bpf/queue_stack_maps.c  |  288 
 kernel/bpf/stackmap.c  |2 
 kernel/bpf/syscall.c   |  112 
 kernel/bpf/verifier.c  |   28 ++
 net/core/filter.c  |6 
 tools/include/uapi/linux/bpf.h |   36 ++-
 tools/lib/bpf/bpf.c|   12 +
 tools/lib/bpf/bpf.h|1 
 tools/testing/selftests/bpf/Makefile   |6 
 tools/testing/selftests/bpf/bpf_helpers.h  |7 
 tools/testing/selftests/bpf/test_maps.c|  122 
 tools/testing/selftests/bpf/test_progs.c   |   99 +++
 tools/testing/selftests/bpf/test_queue_map.c   |4 
 tools/testing/selftests/bpf/test_queue_stack_map.h |   59 
 tools/testing/selftests/bpf/test_stack_map.c   |4 
 21 files changed, 862 insertions(+), 20 deletions(-)
 create mode 100644 kernel/bpf/queue_stack_maps.c
 create mode 100644 tools/testing/selftests/bpf/test_queue_map.c
 create mode 100644 tools/testing/selftests/bpf/test_queue_stack_map.h
 create mode 100644 tools/testing/selftests/bpf/test_stack_map.c

--



[PATCH bpf-next 1/6] bpf: rename stack trace map operations

2018-10-08 Thread Mauricio Vasquez B
In the following patches, queue and stack maps (FIFO and LIFO
data structures) will be implemented.  In order to avoid confusion and
a possible name clash, rename stack_map_ops to stack_trace_map_ops.

Signed-off-by: Mauricio Vasquez B 
---
 include/linux/bpf_types.h |2 +-
 kernel/bpf/stackmap.c |2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 5432f4c9f50e..658509daacd4 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -51,7 +51,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_HASH, htab_lru_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_LRU_PERCPU_HASH, htab_lru_percpu_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_LPM_TRIE, trie_map_ops)
 #ifdef CONFIG_PERF_EVENTS
-BPF_MAP_TYPE(BPF_MAP_TYPE_STACK_TRACE, stack_map_ops)
+BPF_MAP_TYPE(BPF_MAP_TYPE_STACK_TRACE, stack_trace_map_ops)
 #endif
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY_OF_MAPS, array_of_maps_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops)
diff --git a/kernel/bpf/stackmap.c b/kernel/bpf/stackmap.c
index 8061a439ef18..bb41e293418d 100644
--- a/kernel/bpf/stackmap.c
+++ b/kernel/bpf/stackmap.c
@@ -600,7 +600,7 @@ static void stack_map_free(struct bpf_map *map)
put_callchain_buffers();
 }
 
-const struct bpf_map_ops stack_map_ops = {
+const struct bpf_map_ops stack_trace_map_ops = {
.map_alloc = stack_map_alloc,
.map_free = stack_map_free,
.map_get_next_key = stack_map_get_next_key,



[PATCH net-next 0/3] mlxsw: selftests: Few small updates

2018-10-08 Thread Ido Schimmel
First patch fixes a typo in mlxsw.

Second patch fixes a race in a recent test.

Third patch makes a recent test executable.

Nir Dotan (1):
  mlxsw: pci: Fix a typo

Petr Machata (2):
  selftests: forwarding: Have lldpad_app_wait_set() wait for unknown,
too
  selftests: mlxsw: qos_mc_aware: Make executable

 drivers/net/ethernet/mellanox/mlxsw/pci_hw.h  | 2 +-
 tools/testing/selftests/drivers/net/mlxsw/qos_mc_aware.sh | 0
 tools/testing/selftests/net/forwarding/lib.sh | 2 +-
 3 files changed, 2 insertions(+), 2 deletions(-)
 mode change 100644 => 100755 
tools/testing/selftests/drivers/net/mlxsw/qos_mc_aware.sh

-- 
2.17.1



[PATCH net-next 3/3] selftests: mlxsw: qos_mc_aware: Make executable

2018-10-08 Thread Ido Schimmel
From: Petr Machata 

This is a self-standing test and as such should itself be executable.

Fixes: b5638d46c90a ("selftests: mlxsw: Add a test for UC behavior under MC 
flood")
Signed-off-by: Petr Machata 
Reviewed-by: Jiri Pirko 
Signed-off-by: Ido Schimmel 
---
 tools/testing/selftests/drivers/net/mlxsw/qos_mc_aware.sh | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 mode change 100644 => 100755 
tools/testing/selftests/drivers/net/mlxsw/qos_mc_aware.sh

diff --git a/tools/testing/selftests/drivers/net/mlxsw/qos_mc_aware.sh 
b/tools/testing/selftests/drivers/net/mlxsw/qos_mc_aware.sh
old mode 100644
new mode 100755
-- 
2.17.1



[PATCH net-next 1/3] mlxsw: pci: Fix a typo

2018-10-08 Thread Ido Schimmel
From: Nir Dotan 

Signed-off-by: Nir Dotan 
Signed-off-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/pci_hw.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h 
b/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h
index 83f452b7ccbb..bb99f6d41fe0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci_hw.h
@@ -221,7 +221,7 @@ MLXSW_ITEM32(pci, eqe, event_type, 0x0C, 24, 8);
 MLXSW_ITEM32(pci, eqe, event_sub_type, 0x0C, 16, 8);
 
 /* pci_eqe_cqn
- * Completion Queue that triggeret this EQE.
+ * Completion Queue that triggered this EQE.
  */
 MLXSW_ITEM32(pci, eqe, cqn, 0x0C, 8, 7);
 
-- 
2.17.1



[PATCH net-next 2/3] selftests: forwarding: Have lldpad_app_wait_set() wait for unknown, too

2018-10-08 Thread Ido Schimmel
From: Petr Machata 

Immediately after the mlxsw module is probed and lldpad is started, added
APP entries are briefly in the "unknown" state before becoming "pending".
That is the state lldpad_app_wait_set() typically sees, and since there are
no pending entries at that time, it bails out.  However, the entries have
not been pushed to the kernel yet at that point, and thus the test case
fails.

Fix by waiting for both unknown and pending entries to disappear before
proceeding.

Fixes: d159261f3662 ("selftests: mlxsw: Add test for trust-DSCP")
Signed-off-by: Petr Machata 
Signed-off-by: Ido Schimmel 
---
 tools/testing/selftests/net/forwarding/lib.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh 
b/tools/testing/selftests/net/forwarding/lib.sh
index 0e73698b2048..85d253546684 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -251,7 +251,7 @@ lldpad_app_wait_set()
 {
local dev=$1; shift
 
-   while lldptool -t -i $dev -V APP -c app | grep -q pending; do
+   while lldptool -t -i $dev -V APP -c app | grep -Eq "pending|unknown"; do
echo "$dev: waiting for lldpad to push pending APP updates"
sleep 5
done
-- 
2.17.1



RE: [Intel-wired-lan] [PATCH][-next] ixgbe: don't clear_bit on xdp_ring->state if xdp_ring is null

2018-10-08 Thread Bowers, AndrewX
> -Original Message-
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Colin King
> Sent: Thursday, October 4, 2018 10:58 AM
> To: Kirsher, Jeffrey T ; David S . Miller
> ; intel-wired-...@lists.osuosl.org;
> netdev@vger.kernel.org
> Cc: kernel-janit...@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH][-next] ixgbe: don't clear_bit on xdp_ring-
> >state if xdp_ring is null
> 
> From: Colin Ian King 
> 
> There is an earlier check to see if xdp_ring is null when configuring the tx 
> ring,
> so assuming that it can still be null, the clearing of the xdp_ring->state
> currently could end up with a null pointer dereference.  Fix this by only
> clearing the bit if xdp_ring is not null.
> 
> Detected by CoverityScan, CID#1473795 ("Dereference after null check")
> 
> Fixes: 024aa5800f32 ("ixgbe: added Rx/Tx ring disable/enable functions")
> Signed-off-by: Colin Ian King 
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Tested-by: Andrew Bowers 




Re: skb length without fragments

2018-10-08 Thread Eric Dumazet



On 10/08/2018 11:02 AM, pradeep kumar nalla wrote:
> Hi
> 
> While testing my network driver with pktgen I could see an skb greater
> than 16K without fragments in the xmit function.  This led to a fix in my
> driver, which assumed that an SKB whose length is greater than 16K will
> always come with fragments.  Apart from pktgen, what are the chances or
> possibilities of getting an SKB greater than 16K without fragments?
> When I tried tools like iperf/iperf3/netperf, I didn't see a single
> instance where the SKB length was greater than 16K and without frags.
> Even at the socket layer I see the function alloc_skb_with_frags(); does
> this mean all larger packets come with frags?
>

There are cases where skb_linearize() calls happen, and those can certainly
feed a driver a big linear skb.

Even if the current tree would not hit this, you have to play it safe and
do the check in the driver.
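
A minimal sketch of the kind of check meant here (the helper name and the
limit are hypothetical, only for illustration):

#include <linux/skbuff.h>

#define MY_MAX_DESC_LEN	16384	/* assumed per-descriptor limit */

/* Test the linear area explicitly instead of assuming that skbs longer
 * than 16K always carry page fragments.
 */
static bool my_skb_linear_too_long(const struct sk_buff *skb)
{
	return skb_headlen(skb) > MY_MAX_DESC_LEN;
}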


Re: [PATCH net-next] net/ipv6: stop leaking percpu memory in fib6 info

2018-10-08 Thread David Ahern
On 10/8/18 6:06 AM, Mike Rapoport wrote:
> The fib6_info_alloc() function allocates percpu memory to hold per CPU
> pointers to rt6_info, but this memory is never freed. Fix it.
> 
> Fixes: a64efe142f5e ("net/ipv6: introduce fib6_info struct and helpers")
> 
> Signed-off-by: Mike Rapoport 
> Cc: sta...@vger.kernel.org
> ---
>  net/ipv6/ip6_fib.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index cf709eadc932..cc7de7eb8b9c 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -194,6 +194,8 @@ void fib6_info_destroy_rcu(struct rcu_head *head)
>   *ppcpu_rt = NULL;
>   }
>   }
> +
> + free_percpu(f6i->rt6i_pcpu);
>   }
>  
>   lwtstate_put(f6i->fib6_nh.nh_lwtstate);
> 

Odd that KMEMLEAK is not detecting this. Thanks for the fix.

Reviewed-by: David Ahern 



[PATCH v2 ipsec] Clear secpath on loopback_xmit

2018-10-08 Thread Benedict Wong
This patch clears skb->sp when an skb is transmitted over loopback.  This
ensures that the looped-back packet does not carry any secpath
information from the outbound transforms.

At present, such packets cause XFRM tunnel mode traffic to be dropped with
XFRMINNOPOLS, because the outbound state ends up in the secpath without a
matching inbound policy.  Clearing the secpath ensures that all states
added to it come exclusively from inbound processing.

Tests: xfrm tunnel mode tests added for loopback:
https://android-review.googlesource.com/c/kernel/tests/+/777328
Fixes: 8fe7ee2ba983 ("[IPsec]: Strengthen policy checks")
Signed-off-by: Benedict Wong 
---
 drivers/net/loopback.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index 30612497643c..a6bf54df94bd 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -50,6 +50,7 @@
 #include 
 #include 
 #include 
+#include 
 #include /* For the statistics structure. */
 #include   /* For ARPHRD_ETHER */
 #include 
@@ -82,6 +83,9 @@ static netdev_tx_t loopback_xmit(struct sk_buff *skb,
 */
skb_dst_force(skb);
 
+   // Clear secpath to ensure xfrm policy check not tainted by outbound 
SAs.
+   secpath_reset(skb);
+
skb->protocol = eth_type_trans(skb, dev);
 
/* it's OK to use per_cpu_ptr() because BHs are off */
-- 
2.19.0.605.g01d371f741-goog



Re: [PATCH net-next] dpaa2-eth: Don't account Tx confirmation frames on NAPI poll

2018-10-08 Thread David Miller
From: Ioana Ciocoi Radulescu 
Date: Mon, 8 Oct 2018 14:16:31 +

> Until now, both Rx and Tx confirmation frames handled during
> NAPI poll were counted toward the NAPI budget. However, Tx
> confirmations are lighter to process than Rx frames, which can
> skew the amount of work actually done inside one NAPI cycle.
> 
> Update the code to only count Rx frames toward the NAPI budget
> and set a separate threshold on how many Tx conf frames can be
> processed in one poll cycle.
> 
> The NAPI poll routine stops when either the budget is consumed
> by Rx frames or when Tx confirmation frames reach this threshold.
> 
> Signed-off-by: Ioana Radulescu 

Applied.
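
A much-simplified sketch of the accounting scheme described in the quoted
changelog (all names below are placeholders, not dpaa2-eth internals, and
the real driver interleaves the work differently):

#include <linux/netdevice.h>

/* Hypothetical per-driver work functions, declared only so the sketch is
 * self-contained; a real driver would pull frames from its own queues.
 */
extern bool my_consume_one_rx(struct napi_struct *napi);
extern bool my_consume_one_tx_conf(struct napi_struct *napi);

#define MY_TX_CONF_BUDGET	64	/* assumed threshold */

static int my_napi_poll(struct napi_struct *napi, int budget)
{
	int rx_done = 0, tx_conf_done = 0;

	/* Only Rx frames count against the NAPI budget. */
	while (rx_done < budget && my_consume_one_rx(napi))
		rx_done++;

	/* Tx confirmations are cheaper, so cap them separately. */
	while (tx_conf_done < MY_TX_CONF_BUDGET && my_consume_one_tx_conf(napi))
		tx_conf_done++;

	/* Stop polling only when neither limit was hit. */
	if (rx_done < budget && tx_conf_done < MY_TX_CONF_BUDGET) {
		napi_complete_done(napi, rx_done);
		return rx_done;
	}

	return budget;
}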


Re: [PATCH net-next] net: mscc: ocelot: remove set but not used variable 'phy_mode'

2018-10-08 Thread David Miller
From: YueHaibing 
Date: Mon, 8 Oct 2018 14:07:50 +

> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/ethernet/mscc/ocelot_board.c: In function 'mscc_ocelot_probe':
> drivers/net/ethernet/mscc/ocelot_board.c:262:17: warning:
>  variable 'phy_mode' set but not used [-Wunused-but-set-variable]
>enum phy_mode phy_mode;
> 
> It has never been used since its introduction in
> commit 71e32a20cfbf ("net: mscc: ocelot: make use of SerDes PHYs for handling
> their configuration")
> 
> Signed-off-by: YueHaibing 

Applied.


Re: [PATCH v2 net-next 11/23] rtnetlink: Update rtnl_stats_dump for strict data checking

2018-10-08 Thread David Miller
From: David Ahern 
Date: Mon, 8 Oct 2018 07:25:34 -0600

> On 10/8/18 4:17 AM, Christian Brauner wrote:
>>> @@ -4696,13 +4697,32 @@ static int rtnl_stats_dump(struct sk_buff *skb, 
>>> struct netlink_callback *cb)
>>>  
>>> cb->seq = net->dev_base_seq;
>>>  
>>> -   if (nlmsg_len(cb->nlh) < sizeof(*ifsm))
>>> +   if (nlmsg_len(cb->nlh) < sizeof(*ifsm)) {
>>> +   NL_SET_ERR_MSG(extack, "Invalid header for stats dump");
>>> return -EINVAL;
>>> +   }
>>>  
>>> ifsm = nlmsg_data(cb->nlh);
>>> +
>>> +   /* only requests using NLM_F_DUMP_PROPER_HDR can pass data to
>> 
>> That looks like an accidental leftover before we changed this to a
>> socket option. :)
>> 
> 
> ugh. thanks for noticing.

David, I applied this series, please send me relative fixups at this point
if necessary.

Thanks.


skb length without fragments

2018-10-08 Thread pradeep kumar nalla
Hi

While testing my network driver with pktgen I could see an skb greater
than 16K without fragments in the xmit function.  This led to a fix in my
driver, which assumed that an SKB whose length is greater than 16K will
always come with fragments.  Apart from pktgen, what are the chances or
possibilities of getting an SKB greater than 16K without fragments?
When I tried tools like iperf/iperf3/netperf, I didn't see a single
instance where the SKB length was greater than 16K and without frags.
Even at the socket layer I see the function alloc_skb_with_frags(); does
this mean all larger packets come with frags?

Please pardon if it is not a correct mailing list to ask above question.

Thanks
Pradeep.


Re: [PATCH V1 net 0/5] minor bug fixes for ENA Ethernet driver

2018-10-08 Thread David Miller
From: "Bshara, Nafea" 
Date: Mon, 8 Oct 2018 12:47:50 +

> Ship it

Anything but... see my feedback.


Re: [PATCH net-next 0/3] selftests: add more PMTU tests

2018-10-08 Thread David Miller
From: Sabrina Dubroca 
Date: Mon,  8 Oct 2018 14:37:02 +0200

> The current selftests for PMTU cover VTI tunnels, but there's nothing
> about the generation and handling of PMTU exceptions by intermediate
> routers. This series adds and improves existing helpers, then adds
> IPv4 and IPv6 selftests with a setup involving an intermediate router.
> 
> Joint work with Stefano Brivio.

This is really great stuff to see.

Series applied, thanks.


Re: [PATCH net 1/2] net: ipv4: update fnhe_pmtu when first hop's MTU changes

2018-10-08 Thread David Miller
From: Sabrina Dubroca 
Date: Mon,  8 Oct 2018 14:36:38 +0200

> Since commit 5aad1de5ea2c ("ipv4: use separate genid for next hop
> exceptions"), exceptions get deprecated separately from cached
> routes. In particular, administrative changes don't clear PMTU anymore.
> 
> As Stefano described in commit e9fa1495d738 ("ipv6: Reflect MTU changes
> on PMTU of exceptions for MTU-less routes"), the PMTU discovered before
> the local MTU change can become stale:
>  - if the local MTU is now lower than the PMTU, that PMTU is now
>incorrect
>  - if the local MTU was the lowest value in the path, and is increased,
>we might discover a higher PMTU
> 
> Similarly to what commit e9fa1495d738 did for IPv6, update PMTU in those
> cases.
> 
> If the exception was locked, the discovered PMTU was smaller than the
> minimal accepted PMTU. In that case, if the new local MTU is smaller
> than the current PMTU, let PMTU discovery figure out if locking of the
> exception is still needed.
> 
> To do this, we need to know the old link MTU in the NETDEV_CHANGEMTU
> notifier. By the time the notifier is called, dev->mtu has been
> changed. This patch adds the old MTU as additional information in the
> notifier structure, and a new call_netdevice_notifiers_u32() function.
> 
> Fixes: 5aad1de5ea2c ("ipv4: use separate genid for next hop exceptions")
> Signed-off-by: Sabrina Dubroca 
> Reviewed-by: Stefano Brivio 

Please, when you respin this to address David Ahern's feedback, provide
a proper "0/N" header posting.

Thank you.


Re: [PATCH V1 net 2/5] net: ena: fix warning in rmmod caused by double iounmap

2018-10-08 Thread David Miller
From: 
Date: Mon, 8 Oct 2018 15:28:41 +0300

> From: Arthur Kiyanovski 
> 
> Memory mapped with devm_ioremap is automatically freed when the driver
> is disconnected from the device. Therefore there is no need to
> explicitly call devm_iounmap.
> 
> Signed-off-by: Arthur Kiyanovski 
> ---
>  drivers/net/ethernet/amazon/ena/ena_netdev.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
> b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> index 25621a2..23f2dda 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
> @@ -3099,15 +3099,7 @@ static int ena_rss_init_default(struct ena_adapter 
> *adapter)
>  
>  static void ena_release_bars(struct ena_com_dev *ena_dev, struct pci_dev 
> *pdev)
>  {
> - int release_bars;
> -
> - if (ena_dev->mem_bar)
> - devm_iounmap(&pdev->dev, ena_dev->mem_bar);
> -
> - if (ena_dev->reg_bar)
> - devm_iounmap(&pdev->dev, ena_dev->reg_bar);
> -
> - release_bars = pci_select_bars(pdev, IORESOURCE_MEM) & ENA_BAR_MASK;
> + int release_bars = pci_select_bars(pdev, IORESOURCE_MEM) & ENA_BAR_MASK;
>   pci_release_selected_regions(pdev, release_bars);

Always have an empty line between local variable declarations and actual
function code.


Re: [PATCH V1 net 0/5] minor bug fixes for ENA Ethernet driver

2018-10-08 Thread David Miller
From: 
Date: Mon, 8 Oct 2018 15:28:39 +0300

> From: Arthur Kiyanovski 
> 
> Arthur Kiyanovski (5):
>   net: ena: fix indentations in ena_defs for better readability
>   net: ena: fix warning in rmmod caused by double iounmap
>   net: ena: fix rare bug when failed restart/resume is followed by
> driver removal
>   net: ena: fix NULL dereference due to untimely napi initialization
>   net: ena: fix auto casting to boolean

Indentation fixes are not appropriate for a series of bug fixes.  Separate
those out from the real legitimate bug fixes into a series for net-next.


Re: [PATCH net-next] net/ipv6: stop leaking percpu memory in fib6 info

2018-10-08 Thread David Miller
From: Mike Rapoport 
Date: Mon,  8 Oct 2018 15:06:03 +0300

> The fib6_info_alloc() function allocates percpu memory to hold per CPU
> pointers to rt6_info, but this memory is never freed. Fix it.
> 
> Fixes: a64efe142f5e ("net/ipv6: introduce fib6_info struct and helpers")
> 

Please do not put empty lines between Fixes: and other tags.  They
belong together as a group.

> Signed-off-by: Mike Rapoport 
> Cc: sta...@vger.kernel.org

Several problems here.

You cannot submit patches to net-next and expect them to go onward to
-stable.  That is not appropriate.  If it is going to stable, you
must target 'net' not 'net-next'.

Furthermore, for networking changes, explicit CC: of stable is not
appropriate.  Instead you must explicitly ask me to queue the patch
up for -stable as I handle networking stable submissions myself.

Thank you.


[PATCH bpf] xsk: do not call synchronize_net() under RCU read lock

2018-10-08 Thread Björn Töpel
From: Björn Töpel 

The XSKMAP update and delete functions called synchronize_net(), which
can sleep. It is not allowed to sleep during an RCU read section.

Instead we need to make sure that the sock sk_destruct (xsk_destruct)
function is asynchronously called after an RCU grace period. Setting
the SOCK_RCU_FREE flag for XDP sockets takes care of this.

Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type 
BPF_MAP_TYPE_XSKMAP")
Reported-by: Eric Dumazet 
Signed-off-by: Björn Töpel 
---
 kernel/bpf/xskmap.c | 10 ++
 net/xdp/xsk.c   |  2 ++
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9f8463afda9c..47147c9e184d 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -192,11 +192,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void 
*key, void *value,
sock_hold(sock->sk);
 
old_xs = xchg(&m->xsk_map[i], xs);
-   if (old_xs) {
-   /* Make sure we've flushed everything. */
-   synchronize_net();
+   if (old_xs)
sock_put((struct sock *)old_xs);
-   }
 
sockfd_put(sock);
return 0;
@@ -212,11 +209,8 @@ static int xsk_map_delete_elem(struct bpf_map *map, void 
*key)
return -EINVAL;
 
old_xs = xchg(&m->xsk_map[k], NULL);
-   if (old_xs) {
-   /* Make sure we've flushed everything. */
-   synchronize_net();
+   if (old_xs)
sock_put((struct sock *)old_xs);
-   }
 
return 0;
 }
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 0577cd49aa72..07156f43d295 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -754,6 +754,8 @@ static int xsk_create(struct net *net, struct socket *sock, 
int protocol,
sk->sk_destruct = xsk_destruct;
sk_refcnt_debug_inc(sk);
 
+   sock_set_flag(sk, SOCK_RCU_FREE);
+
xs = xdp_sk(sk);
mutex_init(&xs->mutex);
spin_lock_init(&xs->tx_completion_lock);
-- 
2.17.1



Re: [PATCH v2 net-next 00/23] rtnetlink: Add support for rigid checking of data in dump request

2018-10-08 Thread David Miller
From: Christian Brauner 
Date: Mon, 8 Oct 2018 13:04:13 +0200

> On Sun, Oct 07, 2018 at 08:16:21PM -0700, David Ahern wrote:
>> From: David Ahern 
>> 
>> There are many use cases where a user wants to influence what is
>> returned in a dump for some rtnetlink command: one is wanting data
>> for a different namespace than the one the request is received and
>> another is limiting the amount of data returned in the dump to a
>> specific set of interest to userspace, reducing the cpu overhead of
>> both kernel and userspace. Unfortunately, the kernel has historically
>> not been strict with checking for the proper header or checking the
>> values passed in the header. This lenient implementation has allowed
>> iproute2 and other packages to pass any struct or data in the dump
>> request as long as the family is the first byte. For example, ifinfomsg
>> struct is used by iproute2 for all generic dump requests - links,
>> addresses, routes and rules when it is really only valid for link
>> requests.
>> 
>> There is 1 is example where the kernel deals with the wrong struct: link
>> dumps after VF support was added. Older iproute2 was sending rtgenmsg as
>> the header instead of ifinfomsg so a patch was added to try and detect
>> old userspace vs new:
>> e5eca6d41f53 ("rtnetlink: fix userspace API breakage for iproute2 < v3.9.0")
>> 
>> The latest example is Christian's patch set wanting to return addresses for
>> a target namespace. It guesses the header struct is an ifaddrmsg and if it
>> guesses wrong a netlink warning is generated in the kernel log on every
>> address dump which is unacceptable.
>> 
>> Another example where the kernel is a bit lenient is route dumps: iproute2
>> can send either a request with either ifinfomsg or a rtmsg as the header
>> struct, yet the kernel always treats the header as an rtmsg (see
>> inet_dump_fib and rtm_flags check). The header inconsistency impacts the
>> ability to add kernel side filters for route dumps - a necessary feature
>> for scale setups with 100k+ routes.
>> 
>> How to resolve the problem of not breaking old userspace yet be able to
>> move forward with new features such as kernel side filtering which are
>> crucial for efficient operation at high scale?
>> 
>> This patch set addresses the problem by adding a new socket flag,
>> NETLINK_DUMP_STRICT_CHK, that userspace can use with setsockopt to
>> request strict checking of headers and attributes on dump requests and
>> hence unlock the ability to use kernel side filters as they are added.
 ...
> At this point it's all nits so it's got my ACK but keener eyes than mine
> might see other issues.
> 
> Acked-by: Christian Brauner 

Series applied, thanks everyone.

Please be on the lookout for userspace regressions from this patch set.

Thanks.
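
A minimal sketch of how an application could opt in to the strict checking
described in the quoted cover letter (it assumes uapi headers from a tree
with this series applied; the option name is the one used by the series):

#include <sys/socket.h>
#include <linux/netlink.h>

static int open_strict_rtnetlink(void)
{
	int one = 1;
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	/* Ask the kernel to strictly check headers and attributes in dump
	 * requests made on this socket.
	 */
	if (fd >= 0)
		setsockopt(fd, SOL_NETLINK, NETLINK_DUMP_STRICT_CHK,
			   &one, sizeof(one));
	return fd;
}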


Re: [PATCH net-next 3/3] selftests: pmtu: add basic IPv4 and IPv6 PMTU tests

2018-10-08 Thread David Ahern
On 10/8/18 6:37 AM, Sabrina Dubroca wrote:
> Commit d1f1b9cbf34c ("selftests: net: Introduce first PMTU test") and
> follow-ups introduced some PMTU tests, but they all rely on tunneling,
> and, particularly, on VTI.
> 
> These new tests use simple routing to exercise the generation and
> update of PMTU exceptions in IPv4 and IPv6.
> 
> Signed-off-by: Sabrina Dubroca 
> Signed-off-by: Stefano Brivio 
> ---
>  tools/testing/selftests/net/pmtu.sh | 207 +++-
>  1 file changed, 203 insertions(+), 4 deletions(-)
> 

Thanks for adding more pmtu tests.

Reviewed-by: David Ahern 



Re: [PATCH net-next 1/3] selftests: pmtu: Introduce check_pmtu_value()

2018-10-08 Thread David Ahern
On 10/8/18 6:37 AM, Sabrina Dubroca wrote:
> From: Stefano Brivio 
> 
> Introduce and use a function that checks PMTU values against
> expected values and logs error messages, to remove some clutter.
> 
> Signed-off-by: Stefano Brivio 
> Signed-off-by: Sabrina Dubroca 
> ---
>  tools/testing/selftests/net/pmtu.sh | 49 +
>  1 file changed, 22 insertions(+), 27 deletions(-)
> 

Reviewed-by: David Ahern 



Re: [PATCH net-next 2/3] selftests: pmtu: extend MTU parsing helper to locked MTU

2018-10-08 Thread David Ahern
On 10/8/18 6:37 AM, Sabrina Dubroca wrote:
> The mtu_parse helper introduced in commit f2c929feeccd ("selftests:
> pmtu: Factor out MTU parsing helper") can only handle "mtu 1234", but
> not "mtu lock 1234". Extend it, so that we can do IPv4 tests with PMTU
> smaller than net.ipv4.route.min_pmtu
> 
> Signed-off-by: Sabrina Dubroca 
> ---
>  tools/testing/selftests/net/pmtu.sh | 2 ++
>  1 file changed, 2 insertions(+)
> 

Reviewed-by: David Ahern 



Re: [PATCH net-next v2 00/12] net: sched: cls_u32 Various improvements

2018-10-08 Thread David Miller
From: Jamal Hadi Salim 
Date: Mon,  8 Oct 2018 06:22:32 -0400

> Various improvements from Al.
> 
> Changes from version 1: Add missing commit 

Series applied.


Re: [PATCH net-next] netdev: remove useless codes of tun_automq_select_queue

2018-10-08 Thread David Miller
From: Wang Li 
Date: Mon,  8 Oct 2018 16:51:12 +0800

> @@ -1045,15 +1039,12 @@ static void tun_automq_xmit(struct tun_struct *tun, 
> struct sk_buff *skb)
>* RPS hash and save it into the flow_table here.
>*/
>   __u32 rxhash;
> + struct tun_flow_entry *e;

Please always order local variable declarations from longest to
shortest line.

Thank you.


Re: [PATCH net 1/2] net: ipv4: update fnhe_pmtu when first hop's MTU changes

2018-10-08 Thread David Ahern
On 10/8/18 6:36 AM, Sabrina Dubroca wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index c7861e4b402c..dc9d2668d9bb 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2458,6 +2458,13 @@ struct netdev_notifier_info {
>   struct netlink_ext_ack  *extack;
>  };
>  
> +struct netdev_notifier_info_ext {
> + struct netdev_notifier_info info; /* must be first */
> + union {
> + u32 u32;

I realize you want this to be generic, but that is a really odd
definition.  Can you make that mtu instead?  The union allows other use
cases to add new names.

> + } ext;
> +};
> +
>  struct netdev_notifier_change_info {
>   struct netdev_notifier_info info; /* must be first */
>   unsigned int flags_changed;
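
A minimal sketch of how a NETDEV_CHANGEMTU listener could consume the
proposed extension (it assumes a tree with this patch applied and uses the
field names of the version quoted above):

#include <linux/netdevice.h>
#include <linux/notifier.h>

static int my_mtu_notifier(struct notifier_block *nb, unsigned long event,
			   void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);

	if (event == NETDEV_CHANGEMTU) {
		struct netdev_notifier_info_ext *info_ext = ptr;

		/* Old MTU comes from the extension; dev->mtu is the new one. */
		pr_debug("%s: MTU %u -> %u\n", dev->name,
			 info_ext->ext.u32, dev->mtu);
	}

	return NOTIFY_DONE;
}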



Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP

2018-10-08 Thread Björn Töpel
Den mån 8 okt. 2018 kl 18:55 skrev Eric Dumazet :
>
[...]
>
> You might take a look at SOCK_RCU_FREE flag for sockets.
>

Ah, thanks! I'll use this instead.


Re: [PATCH net] rds: RDS (tcp) hangs on sendto() to unresponding address

2018-10-08 Thread Santosh Shilimkar

10/8/2018 9:17 AM, Ka-Cheong Poon wrote:

In rds_send_mprds_hash(), if the calculated hash value is non-zero and
the MPRDS connections are not yet up, it will wait.  But it should not
wait if the send is non-blocking.  In this case, it should just use the
base c_path for sending the message.

Signed-off-by: Ka-Cheong Poon 
---

Thanks for getting this out on the list.

Acked-by: Santosh Shilimkar 


Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP

2018-10-08 Thread Eric Dumazet



On 10/08/2018 09:05 AM, Björn Töpel wrote:

> 
> Thanks for finding and pointing this out, Eric!
> 
> I'll have look and get back with a patch.
> 
>

You might take a look at SOCK_RCU_FREE flag for sockets.




Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP

2018-10-08 Thread Björn Töpel
Den mån 8 okt. 2018 kl 18:05 skrev Björn Töpel :
>
> Den mån 8 okt. 2018 kl 17:31 skrev Eric Dumazet :
> >
[...]
> > So it is illegal to call synchronize_net(), since it is a reschedule point.
> >
>
> Thanks for finding and pointing this out, Eric!
>
> I'll have look and get back with a patch.
>

Eric, something along the lines of the patch below? Or is it considered
bad practice to use call_rcu() in this context (prone to DoSing the
kernel)?

Thanks for spending time on the xskmap code. Very much appreciated!

>From 491f7bd87705f72c45e59242fc6c3b1db9d3b56d Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= 
Date: Mon, 8 Oct 2018 18:34:11 +0200
Subject: [PATCH] xsk: do not call synchronize_net() under RCU read lock
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

XSKMAP update and delete functions called synchronize_net(), which can
sleep. It is not allowed to sleep during an RCU read section.

Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type
BPF_MAP_TYPE_XSKMAP")
Reported-by: Eric Dumazet 
Signed-off-by: Björn Töpel 
---
 include/net/xdp_sock.h |  1 +
 kernel/bpf/xskmap.c| 21 +++--
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 13acb9803a6d..5b430141a3f6 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -68,6 +68,7 @@ struct xdp_sock {
  */
 spinlock_t tx_completion_lock;
 u64 rx_dropped;
+struct rcu_head rcu;
 };

 struct xdp_buff;
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9f8463afda9c..51e8e2785612 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -157,6 +157,13 @@ static void *xsk_map_lookup_elem(struct bpf_map
*map, void *key)
 return NULL;
 }

+static void __xsk_map_remove_async(struct rcu_head *rcu)
+{
+struct xdp_sock *xs = container_of(rcu, struct xdp_sock, rcu);
+
+sock_put((struct sock *)xs);
+}
+
 static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
u64 map_flags)
 {
@@ -192,11 +199,8 @@ static int xsk_map_update_elem(struct bpf_map
*map, void *key, void *value,
 sock_hold(sock->sk);

old_xs = xchg(&m->xsk_map[i], xs);
-if (old_xs) {
-/* Make sure we've flushed everything. */
-synchronize_net();
-sock_put((struct sock *)old_xs);
-}
+if (old_xs)
+call_rcu(&old_xs->rcu, __xsk_map_remove_async);

 sockfd_put(sock);
 return 0;
@@ -212,11 +216,8 @@ static int xsk_map_delete_elem(struct bpf_map
*map, void *key)
 return -EINVAL;

old_xs = xchg(&m->xsk_map[k], NULL);
-if (old_xs) {
-/* Make sure we've flushed everything. */
-synchronize_net();
-sock_put((struct sock *)old_xs);
-}
+if (old_xs)
+call_rcu(&old_xs->rcu, __xsk_map_remove_async);

 return 0;
 }
-- 
2.17.1


[PATCH net] rds: RDS (tcp) hangs on sendto() to unresponding address

2018-10-08 Thread Ka-Cheong Poon
In rds_send_mprds_hash(), if the calculated hash value is non-zero and
the MPRDS connections are not yet up, it will wait.  But it should not
wait if the send is non-blocking.  In this case, it should just use the
base c_path for sending the message.

Signed-off-by: Ka-Cheong Poon 
---
 net/rds/send.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 57b3d5a..fe785ee 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1007,7 +1007,8 @@ static int rds_cmsg_send(struct rds_sock *rs, struct 
rds_message *rm,
return ret;
 }
 
-static int rds_send_mprds_hash(struct rds_sock *rs, struct rds_connection 
*conn)
+static int rds_send_mprds_hash(struct rds_sock *rs,
+  struct rds_connection *conn, int nonblock)
 {
int hash;
 
@@ -1023,10 +1024,16 @@ static int rds_send_mprds_hash(struct rds_sock *rs, 
struct rds_connection *conn)
 * used.  But if we are interrupted, we have to use the zero
 * c_path in case the connection ends up being non-MP capable.
 */
-   if (conn->c_npaths == 0)
+   if (conn->c_npaths == 0) {
+   /* Cannot wait for the connection be made, so just use
+* the base c_path.
+*/
+   if (nonblock)
+   return 0;
if (wait_event_interruptible(conn->c_hs_waitq,
 conn->c_npaths != 0))
hash = 0;
+   }
if (conn->c_npaths == 1)
hash = 0;
}
@@ -1256,7 +1263,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t payload_len)
}
 
if (conn->c_trans->t_mp_capable)
-   cpath = &conn->c_path[rds_send_mprds_hash(rs, conn)];
+   cpath = &conn->c_path[rds_send_mprds_hash(rs, conn, nonblock)];
else
cpath = &conn->c_path[0];
 
-- 
1.8.3.1



Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP

2018-10-08 Thread Björn Töpel
Den mån 8 okt. 2018 kl 17:31 skrev Eric Dumazet :
>
> On 05/02/2018 04:01 AM, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > The xskmap is yet another BPF map, very much inspired by
> > dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
> > adds AF_XDP sockets into the map, and by using the bpf_redirect_map
> > helper, an XDP program can redirect XDP frames to an AF_XDP socket.
> >
> > Note that a socket that is bound to certain ifindex/queue index will
> > *only* accept XDP frames from that netdev/queue index. If an XDP
> > program tries to redirect from a netdev/queue index other than what
> > the socket is bound to, the frame will not be received on the socket.
> >
> > A socket can reside in multiple maps.
> >
> > v3: Fixed race and simplified code.
> > v2: Removed one indirection in map lookup.
> >
> > Signed-off-by: Björn Töpel 
> > ---
> >  include/linux/bpf.h   |  25 +
> >  include/linux/bpf_types.h |   3 +
> >  include/net/xdp_sock.h|   7 ++
> >  include/uapi/linux/bpf.h  |   1 +
> >  kernel/bpf/Makefile   |   3 +
> >  kernel/bpf/verifier.c |   8 +-
> >  kernel/bpf/xskmap.c   | 239 
> > ++
> >  net/xdp/xsk.c |   5 +
> >  8 files changed, 289 insertions(+), 2 deletions(-)
> >  create mode 100644 kernel/bpf/xskmap.c
> >
>
> This function is called under rcu_read_lock() , from map_update_elem()
>
> > +
> > +static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
> > +u64 map_flags)
> > +{
> > + struct xsk_map *m = container_of(map, struct xsk_map, map);
> > + u32 i = *(u32 *)key, fd = *(u32 *)value;
> > + struct xdp_sock *xs, *old_xs;
> > + struct socket *sock;
> > + int err;
> > +
> > + if (unlikely(map_flags > BPF_EXIST))
> > + return -EINVAL;
> > + if (unlikely(i >= m->map.max_entries))
> > + return -E2BIG;
> > + if (unlikely(map_flags == BPF_NOEXIST))
> > + return -EEXIST;
> > +
> > + sock = sockfd_lookup(fd, &err);
> > + if (!sock)
> > + return err;
> > +
> > + if (sock->sk->sk_family != PF_XDP) {
> > + sockfd_put(sock);
> > + return -EOPNOTSUPP;
> > + }
> > +
> > + xs = (struct xdp_sock *)sock->sk;
> > +
> > + if (!xsk_is_setup_for_bpf_map(xs)) {
> > + sockfd_put(sock);
> > + return -EOPNOTSUPP;
> > + }
> > +
> > + sock_hold(sock->sk);
> > +
> > + old_xs = xchg(&m->xsk_map[i], xs);
> > + if (old_xs) {
> > + /* Make sure we've flushed everything. */
>
> So it is illegal to call synchronize_net(), since it is a reschedule point.
>

Thanks for finding and pointing this out, Eric!

I'll have look and get back with a patch.


Björn


> > + synchronize_net();
> > + sock_put((struct sock *)old_xs);
> > + }
> > +
> > + sockfd_put(sock);
> > + return 0;
> > +}
> >
>
>
>


Re: [PATCH V2 net-next 2/5] net: Introduce a new MII time stamping interface.

2018-10-08 Thread Richard Cochran
On Mon, Oct 08, 2018 at 08:28:00AM -0700, Richard Cochran wrote:
> Let's stick to phylib for now.  We can cross the other bridge when we
> come to it.  Maybe the net_device->phylink will emerge for purposes
> other that time stamping.  Let's not guess about how it should look.

In fact, it is kernel policy to reject a new framework that lacks users.
Even if we added new net_device->phylink hooks, I bet that davem would
reject them for that reason.  We can't just add to net_device for
vaporware.

The whole point of _this_ series is the first patch.  That new option
is important for complete PTP support, and it could be used by any
PTP hardware.  Unfortunately, I only have one device in hand that
implements this, and that is why you see patches 2-5.

Thanks,
Richard


Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP

2018-10-08 Thread Eric Dumazet



On 05/02/2018 04:01 AM, Björn Töpel wrote:
> From: Björn Töpel 
> 
> The xskmap is yet another BPF map, very much inspired by
> dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
> adds AF_XDP sockets into the map, and by using the bpf_redirect_map
> helper, an XDP program can redirect XDP frames to an AF_XDP socket.
> 
> Note that a socket that is bound to certain ifindex/queue index will
> *only* accept XDP frames from that netdev/queue index. If an XDP
> program tries to redirect from a netdev/queue index other than what
> the socket is bound to, the frame will not be received on the socket.
> 
> A socket can reside in multiple maps.
> 
> v3: Fixed race and simplified code.
> v2: Removed one indirection in map lookup.
> 
> Signed-off-by: Björn Töpel 
> ---
>  include/linux/bpf.h   |  25 +
>  include/linux/bpf_types.h |   3 +
>  include/net/xdp_sock.h|   7 ++
>  include/uapi/linux/bpf.h  |   1 +
>  kernel/bpf/Makefile   |   3 +
>  kernel/bpf/verifier.c |   8 +-
>  kernel/bpf/xskmap.c   | 239 
> ++
>  net/xdp/xsk.c |   5 +
>  8 files changed, 289 insertions(+), 2 deletions(-)
>  create mode 100644 kernel/bpf/xskmap.c
> 

This function is called under rcu_read_lock(), from map_update_elem().

> +
> +static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
> +u64 map_flags)
> +{
> + struct xsk_map *m = container_of(map, struct xsk_map, map);
> + u32 i = *(u32 *)key, fd = *(u32 *)value;
> + struct xdp_sock *xs, *old_xs;
> + struct socket *sock;
> + int err;
> +
> + if (unlikely(map_flags > BPF_EXIST))
> + return -EINVAL;
> + if (unlikely(i >= m->map.max_entries))
> + return -E2BIG;
> + if (unlikely(map_flags == BPF_NOEXIST))
> + return -EEXIST;
> +
> + sock = sockfd_lookup(fd, &err);
> + if (!sock)
> + return err;
> +
> + if (sock->sk->sk_family != PF_XDP) {
> + sockfd_put(sock);
> + return -EOPNOTSUPP;
> + }
> +
> + xs = (struct xdp_sock *)sock->sk;
> +
> + if (!xsk_is_setup_for_bpf_map(xs)) {
> + sockfd_put(sock);
> + return -EOPNOTSUPP;
> + }
> +
> + sock_hold(sock->sk);
> +
> + old_xs = xchg(&m->xsk_map[i], xs);
> + if (old_xs) {
> + /* Make sure we've flushed everything. */

So it is illegal to call synchronize_net(), since it is a reschedule point.

> + synchronize_net();
> + sock_put((struct sock *)old_xs);
> + }
> +
> + sockfd_put(sock);
> + return 0;
> +}
> 
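One pattern sometimes used to avoid sleeping inside an RCU read-side
section is to defer the final reference drop to an RCU callback rather
than waiting for a grace period synchronously. A minimal sketch of that
idea, assuming a hypothetical rcu_head member were added to struct
xdp_sock (this is not necessarily the fix that ended up upstream):

/* Sketch only: assumes struct xdp_sock gains an rcu_head member
 * named "rcu"; names are made up for illustration.
 */
static void xsk_map_sock_put_rcu(struct rcu_head *head)
{
        struct xdp_sock *xs = container_of(head, struct xdp_sock, rcu);

        /* Runs after a grace period, so no RCU reader still sees old_xs. */
        sock_put((struct sock *)xs);
}

        /* In xsk_map_update_elem(), instead of synchronize_net(): */
        old_xs = xchg(&m->xsk_map[i], xs);
        if (old_xs)
                call_rcu(&old_xs->rcu, xsk_map_sock_put_rcu);

Whether dropping the socket reference from callback context is safe
here depends on what sock_put() ends up doing for an AF_XDP socket, so
treat this purely as an illustration of the call_rcu() alternative to
blocking in a non-sleepable context.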





Re: [PATCH V2 net-next 2/5] net: Introduce a new MII time stamping interface.

2018-10-08 Thread Richard Cochran


On Mon, Oct 08, 2018 at 05:07:22PM +0200, Andrew Lunn wrote:
> So as you said, the phylib API has not changed much, which is common
> for mature code.

I meant that phy-LINK hasn't changed much.

> But i think long term, it will become less important.
> It will share the space with phylink. And any code which wants to be
> generically usable, should not depend on phydev.

Thanks for your view of the big picture.

> Architecturally, it
> seems wrong for you to hang what should be a generic time stamping
> framework on phydev. It is not future proof. net_device is future
> proof.

You still haven't said how net_device is going to work.

Today there are exactly zero phylink devices needing time stamping
support, but there are new phylib devices.  We don't have a
net_device->phylink connection, and it isn't needed yet.  Adding that
is way out of scope for this series.

Let's stick to phylib for now.  We can cross the other bridge when we
come to it.  Maybe the net_device->phylink will emerge for purposes
other than time stamping.  Let's not guess about how it should look.

We are only talking about kernel interfaces here, and so nothing is
set in stone.

Thanks,
Richard


Re: selftests/bpf: test_kmod.sh hangs on all devices

2018-10-08 Thread Shuah Khan
Hi Naresh,

Please use sh...@kernel.org for faster responses. I updated the
MAINTAINERS entry a while back, removing the shua...@osg.samsung.com
address due to IT infrastructure changes at Samsung.

On 10/08/2018 08:55 AM, Naresh Kamboju wrote:
> Daniel,
> 
> On Mon, 8 Oct 2018 at 18:58, Daniel Borkmann  wrote:
>>
>> On 10/08/2018 03:13 PM, Naresh Kamboju wrote:
>>> BPF test case test_kmod.sh hangs on all devices running linux next.
>>>
>>> + cd /opt/kselftests/default-in-kernel/bpf
>>> + ./test_kmod.sh
>>> sysctl: cannot stat /proc/sys/net/core/bpf_jit_enable: No such file or 
>>> directory
>>> sysctl: cannot stat /proc/sys/net/core/bpf_jit_harden: No such file or 
>>> directory
>>> sysctl: cannot stat /proc/sys/net/core/bpf_jit_enable: No such file or 
>>> directory
>>> sysctl: cannot stat /proc/sys/net/core/bpf_jit_harden: No such file or 
>>> directory
>>> [ JIT enabled:0 hardened:0 ]
>>>
>>> https://lkft.validation.linaro.org/scheduler/job/429726
>>>
>>> Test hangs started from 4.19.0-rc4-next-20180918.
>>> Linux version 4.19.0-rc4-next-20180918 (oe-user@oe-host) (gcc version
>>> 7.1.1 20170707 (Linaro GCC 7.1-2017.08)) #1 SMP Tue Sep 18 05:26:00
>>> UTC 2018
>>>
>>> History can be compared from this page.
>>> https://qa-reports.linaro.org/lkft/linux-next-oe/tests/kselftest/bpf_test_kmod.sh
>>>
>>> OTOH,
>>> There is a kernel BUG,
>>
>> This is quite an old linux-next kernel, should be fixed by 100811936f89 
>> ("bpf: test_bpf:
>> add init_net to dev for flow_dissector"). Please make sure you have that 
>> commit included
>> in your testing:
> 
> I will re-validate on latest code base and let you know.
> 
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=100811936f89fd455eda1984810c09003550555b
> 
> Thanks for the quick reply.
> 

Great. Looks like this has been sorted. Thanks Daniel.

thanks,
-- Shuah


Re: [PATCH V2 net-next 2/5] net: Introduce a new MII time stamping interface.

2018-10-08 Thread Andrew Lunn
On Sun, Oct 07, 2018 at 07:04:39PM -0700, Richard Cochran wrote:
> On Sun, Oct 07, 2018 at 09:54:00PM +0200, Andrew Lunn wrote:
> > Sure, but things have moved on since then.
> 
> I was curious about this.  Based on your use cases, I guess that you
> mean phylib?  But not much has changed AFAICT. (There is one new
> global function and two were removed, but that doesn't change the
> picture WRT time stamping.)
> 
> Phylink now has two or three new users, one of which is dsa.  Is that
> the big move?
> 
> The situation with MACs that handle their own PHYs without phylib is
> unchanged, AFAICT.
> 
> So what exactly do you mean?

Hi Richard

We are pushing phylink. I really do think anything using > 1Gbps links
should be using phylink, not phydev. And i think we have reached the
tipping point, that most new MACs will be > 1Gbps. 2.5G or maybe 5G
will be the new default. The MAC-PHY link is quite messy when you get
above 1G. There are a number of options which you can use, and the MAC
and PHY need to negotiate a common set to use. phylink can do this,
phylib cannot easily do it. So i see phylib slowly becoming a legacy
API for MAC drivers.

We are also slowly seeing more SFPs, and Linux controlling them. SFP
are not new, they have been in top end switches for a long time. But
they are slowly becoming more popular in industrial settings, and such
embedded industrial systems tend to let Linux control them, not
firmware. And i think industry makes more use of PTP than other
fields, but i could be wrong. Since optical SFP modules are passive, a
bump-in-the-wire time stamper actually makes sense for these.

Also, fibre on the last mile is slowly becoming more of a thing, so
maybe we will start seeing CPE, consumer routers, with SFP ports?

As i said before, we are seeing more MACs which use firmware for
controlling the PHYs. I'm not sure why yet. Maybe it is coupled with
more MACs supporting > 1G, which is messy. Or the lack of good PHY
drivers for PHYs which support > 1G? Hopefully the drivers will
improve with time.

So as you said, the phylib API has not changed much, which is common
for mature code. But i think long term, it will become less important.
It will share the space with phylink. And any code which wants to be
generically usable, should not depend on phydev. Architecturally, it
seems wrong for you to hang what should be a generic time stamping
framework on phydev. It is not future proof. net_device is future
proof.

  Andrew


Re: selftests/bpf: test_kmod.sh hangs on all devices

2018-10-08 Thread Naresh Kamboju
Daniel,

On Mon, 8 Oct 2018 at 18:58, Daniel Borkmann  wrote:
>
> On 10/08/2018 03:13 PM, Naresh Kamboju wrote:
> > BPF test case test_kmod.sh hangs on all devices running linux next.
> >
> > + cd /opt/kselftests/default-in-kernel/bpf
> > + ./test_kmod.sh
> > sysctl: cannot stat /proc/sys/net/core/bpf_jit_enable: No such file or 
> > directory
> > sysctl: cannot stat /proc/sys/net/core/bpf_jit_harden: No such file or 
> > directory
> > sysctl: cannot stat /proc/sys/net/core/bpf_jit_enable: No such file or 
> > directory
> > sysctl: cannot stat /proc/sys/net/core/bpf_jit_harden: No such file or 
> > directory
> > [ JIT enabled:0 hardened:0 ]
> >
> > https://lkft.validation.linaro.org/scheduler/job/429726
> >
> > Test hangs started from 4.19.0-rc4-next-20180918.
> > Linux version 4.19.0-rc4-next-20180918 (oe-user@oe-host) (gcc version
> > 7.1.1 20170707 (Linaro GCC 7.1-2017.08)) #1 SMP Tue Sep 18 05:26:00
> > UTC 2018
> >
> > History can be compared from this page.
> > https://qa-reports.linaro.org/lkft/linux-next-oe/tests/kselftest/bpf_test_kmod.sh
> >
> > OTOH,
> > There is a kernel BUG,
>
> This is quite an old linux-next kernel, should be fixed by 100811936f89 
> ("bpf: test_bpf:
> add init_net to dev for flow_dissector"). Please make sure you have that 
> commit included
> in your testing:

I will re-validate on latest code base and let you know.

>
> https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=100811936f89fd455eda1984810c09003550555b

Thanks for the quick reply.

- Naresh
>
> Thanks,
> Daniel


Re: [PATCH V1 net 0/5] minor bug fixes for ENA Ethernet driver

2018-10-08 Thread Machulsky, Zorik
Acked-by: Zorik Machulsky 

On 10/8/18, 5:28 AM, "akiy...@amazon.com"  wrote:

From: Arthur Kiyanovski 

Arthur Kiyanovski (5):
  net: ena: fix indentations in ena_defs for better readability
  net: ena: fix warning in rmmod caused by double iounmap
  net: ena: fix rare bug when failed restart/resume is followed by
driver removal
  net: ena: fix NULL dereference due to untimely napi initialization
  net: ena: fix auto casting to boolean

 drivers/net/ethernet/amazon/ena/ena_admin_defs.h  | 308 
+-
 drivers/net/ethernet/amazon/ena/ena_eth_com.c |   8 +-
 drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h | 219 ---
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |  23 +-
 drivers/net/ethernet/amazon/ena/ena_regs_defs.h   | 206 +++
 5 files changed, 341 insertions(+), 423 deletions(-)

-- 
2.7.4







Re: [PATCH V1 net 0/5] minor bug fixes for ENA Ethernet driver

2018-10-08 Thread Machulsky, Zorik
Ship it

On 10/8/18, 5:47 AM, "Bshara, Nafea"  wrote:

Ship it

On 10/8/18, 5:28 AM, "akiy...@amazon.com"  wrote:

From: Arthur Kiyanovski 

Arthur Kiyanovski (5):
  net: ena: fix indentations in ena_defs for better readability
  net: ena: fix warning in rmmod caused by double iounmap
  net: ena: fix rare bug when failed restart/resume is followed by
driver removal
  net: ena: fix NULL dereference due to untimely napi initialization
  net: ena: fix auto casting to boolean

 drivers/net/ethernet/amazon/ena/ena_admin_defs.h  | 308 
+-
 drivers/net/ethernet/amazon/ena/ena_eth_com.c |   8 +-
 drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h | 219 ---
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |  23 +-
 drivers/net/ethernet/amazon/ena/ena_regs_defs.h   | 206 +++
 5 files changed, 341 insertions(+), 423 deletions(-)

-- 
2.7.4







Re: [PATCH v8 08/15] octeontx2-af: Add RVU block LF provisioning support

2018-10-08 Thread Arnd Bergmann
On Mon, Oct 8, 2018 at 3:59 PM Sunil Kovvuri  wrote:
> On Mon, Oct 8, 2018 at 5:41 PM Arnd Bergmann  wrote:
> > On Sun, Oct 7, 2018 at 5:00 PM  wrote:
> > >
> > > +/* Structure for requesting resource provisioning.
> > > + * 'modify' flag to be used when either requesting more
> > > + * or to detach part of a certain resource type.
> > > + * Rest of the fields specify how many of what type to
> > > + * be attached.
> > > + */
> > > +struct rsrc_attach {
> > > +   struct mbox_msghdr hdr;
> > > +   u8   modify:1;
> > > +   u8   npalf:1;
> > > +   u8   nixlf:1;
> > > +   u16  sso;
> > > +   u16  ssow;
> > > +   u16  timlfs;
> > > +   u16  cptlfs;
> > > +};
> > > +
> > > +/* Structure for relinquishing resources.
> > > + * 'partial' flag to be used when relinquishing all resources
> > > + * but only of a certain type. If not set, all resources of all
> > > + * types provisioned to the RVU function will be detached.
> > > + */
> > > +struct rsrc_detach {
> > > +   struct mbox_msghdr hdr;
> > > +   u8 partial:1;
> > > +   u8 npalf:1;
> > > +   u8 nixlf:1;
> > > +   u8 sso:1;
> > > +   u8 ssow:1;
> > > +   u8 timlfs:1;
> > > +   u8 cptlfs:1;
> > > +};
> >
> > Are these bitfields part of the message that gets sent to the
> > underlying implementation? It seems there is still an endianess
> > issue then.
>
> No, these structures are not used for kernel-driver-to-firmware
> communication, where register reads via readq are involved. These
> structures are used for mailbox communication between different PCI
> devices, and this mailbox is shared memory.

Ok, thanks for the clarification.

   Arnd
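As an aside on the endianness question: the order in which C bitfields
are packed into their storage unit is implementation-defined, which is
why they are normally avoided in structures that must look identical to
both endpoints. A hypothetical, purely illustrative alternative for a
layout like rsrc_detach would use explicit flag bits in a
fixed-endianness field (the names below are made up):

#include <linux/types.h>
#include <linux/bitops.h>

#define RSRC_DETACH_PARTIAL     BIT(0)
#define RSRC_DETACH_NPALF       BIT(1)
#define RSRC_DETACH_NIXLF       BIT(2)

struct rsrc_detach_flags_example {
        struct mbox_msghdr hdr;
        __le16 flags;           /* same bit meaning on both ends */
};

static inline bool rsrc_detach_is_partial(const struct rsrc_detach_flags_example *req)
{
        return le16_to_cpu(req->flags) & RSRC_DETACH_PARTIAL;
}

As Sunil notes, in this series both endpoints are the same ARM64 kernel
image sharing memory, so the bitfield layout is consistent in practice;
the above only matters if the mailbox format ever needs to be
endian-stable.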


Re: [PATCH v8 01/15] octeontx2-af: Add Marvell OcteonTX2 RVU AF driver

2018-10-08 Thread Arnd Bergmann
On Mon, Oct 8, 2018 at 3:50 PM Sunil Kovvuri  wrote:
>
> On Mon, Oct 8, 2018 at 5:52 PM Arnd Bergmann  wrote:
> >
> > On Sun, Oct 7, 2018 at 4:59 PM  wrote:
> >
> > > --- /dev/null
> > > +++ b/drivers/net/ethernet/marvell/octeontx2/Kconfig
> > > @@ -0,0 +1,12 @@
> > > +#
> > > +# Marvell OcteonTX2 drivers configuration
> > > +#
> > > +
> > > +config OCTEONTX2_AF
> > > +   tristate "Marvell OcteonTX2 RVU Admin Function driver"
> > > +   depends on ARM64 && PCI
> >
> > You should try to allow building it on x86 and other architectures, even 
> > though
> > the driver won't be used there this helps get reports from static
> > build infrastructure.
> > You could use e.g.
> >
> >   depends on (64BIT && COMPILE_TEST) || ARM64
> >   depends on PCI
> >
> >Arnd
>
> Thanks for the suggestion.
> But going forward we will have a few arm64 assembly instructions used
> in the driver.

What is that inline assembly for? Is there a chance it can be
abstracted through some other interface?

 Arnd


[PATCH] qed: fix memory leak of pent on error exit paths

2018-10-08 Thread Colin King
From: Colin Ian King 

Currently, calls to qed_sp_init_request can leak pent on -EINVAL errors.
Fix this by adding the allocated pent back to the p_hwfn free_pool
on these error exits so that it can be re-used later.

Detected by CoverityScan, CID#1411643 ("Resource leak")

Fixes: fe56b9e6a8d9 ("qed: Add module with basic common support")
Signed-off-by: Colin Ian King 
---
 .../net/ethernet/qlogic/qed/qed_sp_commands.c| 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c 
b/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
index 77b6248ad3b9..810475a6522a 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_sp_commands.c
@@ -79,8 +79,10 @@ int qed_sp_init_request(struct qed_hwfn *p_hwfn,
break;
 
case QED_SPQ_MODE_BLOCK:
-   if (!p_data->p_comp_data)
-   return -EINVAL;
+   if (!p_data->p_comp_data) {
+   rc = -EINVAL;
+   goto err;
+   }
 
p_ent->comp_cb.cookie = p_data->p_comp_data->cookie;
break;
@@ -95,7 +97,8 @@ int qed_sp_init_request(struct qed_hwfn *p_hwfn,
default:
DP_NOTICE(p_hwfn, "Unknown SPQE completion mode %d\n",
  p_ent->comp_mode);
-   return -EINVAL;
+   rc = -EINVAL;
+   goto err;
}
 
DP_VERBOSE(p_hwfn, QED_MSG_SPQ,
@@ -109,6 +112,13 @@ int qed_sp_init_request(struct qed_hwfn *p_hwfn,
memset(&p_ent->ramrod, 0, sizeof(p_ent->ramrod));
 
return 0;
+
+err:
+   qed_spq_return_entry(p_hwfn, *pp_ent);
+   *pp_ent = NULL;
+
+   return rc;
+
 }
 
 static enum tunnel_clss qed_tunn_clss_to_fw_clss(u8 type)
-- 
2.17.1



[PATCH net-next] dpaa2-eth: Don't account Tx confirmation frames on NAPI poll

2018-10-08 Thread Ioana Ciocoi Radulescu
Until now, both Rx and Tx confirmation frames handled during
NAPI poll were counted toward the NAPI budget. However, Tx
confirmations are lighter to process than Rx frames, which can
skew the amount of work actually done inside one NAPI cycle.

Update the code to only count Rx frames toward the NAPI budget
and set a separate threshold on how many Tx conf frames can be
processed in one poll cycle.

The NAPI poll routine stops when either the budget is consumed
by Rx frames or when Tx confirmation frames reach this threshold.

Signed-off-by: Ioana Radulescu 
---
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 68 +++-
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h |  5 ++
 2 files changed, 47 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 108c137..156080d 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -289,10 +289,11 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv,
  *
  * Observance of NAPI budget is not our concern, leaving that to the caller.
  */
-static int consume_frames(struct dpaa2_eth_channel *ch)
+static int consume_frames(struct dpaa2_eth_channel *ch,
+ enum dpaa2_eth_fq_type *type)
 {
struct dpaa2_eth_priv *priv = ch->priv;
-   struct dpaa2_eth_fq *fq;
+   struct dpaa2_eth_fq *fq = NULL;
struct dpaa2_dq *dq;
const struct dpaa2_fd *fd;
int cleaned = 0;
@@ -311,12 +312,23 @@ static int consume_frames(struct dpaa2_eth_channel *ch)
 
fd = dpaa2_dq_fd(dq);
fq = (struct dpaa2_eth_fq *)(uintptr_t)dpaa2_dq_fqd_ctx(dq);
-   fq->stats.frames++;
 
fq->consume(priv, ch, fd, &ch->napi, fq->flowid);
cleaned++;
} while (!is_last);
 
+   if (!cleaned)
+   return 0;
+
+   fq->stats.frames += cleaned;
+   ch->stats.frames += cleaned;
+
+   /* A dequeue operation only pulls frames from a single queue
+* into the store. Return the frame queue type as an out param.
+*/
+   if (type)
+   *type = fq->type;
+
return cleaned;
 }
 
@@ -921,14 +933,16 @@ static int pull_channel(struct dpaa2_eth_channel *ch)
 static int dpaa2_eth_poll(struct napi_struct *napi, int budget)
 {
struct dpaa2_eth_channel *ch;
-   int cleaned = 0, store_cleaned;
struct dpaa2_eth_priv *priv;
+   int rx_cleaned = 0, txconf_cleaned = 0;
+   enum dpaa2_eth_fq_type type;
+   int store_cleaned;
int err;
 
ch = container_of(napi, struct dpaa2_eth_channel, napi);
priv = ch->priv;
 
-   while (cleaned < budget) {
+   do {
err = pull_channel(ch);
if (unlikely(err))
break;
@@ -936,30 +950,32 @@ static int dpaa2_eth_poll(struct napi_struct *napi, int 
budget)
/* Refill pool if appropriate */
refill_pool(priv, ch, priv->bpid);
 
-   store_cleaned = consume_frames(ch);
-   cleaned += store_cleaned;
+   store_cleaned = consume_frames(ch, &type);
+   if (type == DPAA2_RX_FQ)
+   rx_cleaned += store_cleaned;
+   else
+   txconf_cleaned += store_cleaned;
 
-   /* If we have enough budget left for a full store,
-* try a new pull dequeue, otherwise we're done here
+   /* If we either consumed the whole NAPI budget with Rx frames
+* or we reached the Tx confirmations threshold, we're done.
 */
-   if (store_cleaned == 0 ||
-   cleaned > budget - DPAA2_ETH_STORE_SIZE)
-   break;
-   }
-
-   if (cleaned < budget && napi_complete_done(napi, cleaned)) {
-   /* Re-enable data available notifications */
-   do {
-   err = dpaa2_io_service_rearm(ch->dpio, &ch->nctx);
-   cpu_relax();
-   } while (err == -EBUSY);
-   WARN_ONCE(err, "CDAN notifications rearm failed on core %d",
- ch->nctx.desired_cpu);
-   }
+   if (rx_cleaned >= budget ||
+   txconf_cleaned >= DPAA2_ETH_TXCONF_PER_NAPI)
+   return budget;
+   } while (store_cleaned);
 
-   ch->stats.frames += cleaned;
+   /* We didn't consume the entire budget, so finish napi and
+* re-enable data availability notifications
+*/
+   napi_complete_done(napi, rx_cleaned);
+   do {
+   err = dpaa2_io_service_rearm(ch->dpio, &ch->nctx);
+   cpu_relax();
+   } while (err == -EBUSY);
+   WARN_ONCE(err, "CDAN notifications rearm failed on core %d",
+ ch->nctx.desired_cpu);
 
-   return cleaned;
+   return 

Re: [PATCH v8 08/15] octeontx2-af: Add RVU block LF provisioning support

2018-10-08 Thread Sunil Kovvuri
On Mon, Oct 8, 2018 at 5:41 PM Arnd Bergmann  wrote:
>
> On Sun, Oct 7, 2018 at 5:00 PM  wrote:
> >
> > +/* Structure for requesting resource provisioning.
> > + * 'modify' flag to be used when either requesting more
> > + * or to detach part of a certain resource type.
> > + * Rest of the fields specify how many of what type to
> > + * be attached.
> > + */
> > +struct rsrc_attach {
> > +   struct mbox_msghdr hdr;
> > +   u8   modify:1;
> > +   u8   npalf:1;
> > +   u8   nixlf:1;
> > +   u16  sso;
> > +   u16  ssow;
> > +   u16  timlfs;
> > +   u16  cptlfs;
> > +};
> > +
> > +/* Structure for relinquishing resources.
> > + * 'partial' flag to be used when relinquishing all resources
> > + * but only of a certain type. If not set, all resources of all
> > + * types provisioned to the RVU function will be detached.
> > + */
> > +struct rsrc_detach {
> > +   struct mbox_msghdr hdr;
> > +   u8 partial:1;
> > +   u8 npalf:1;
> > +   u8 nixlf:1;
> > +   u8 sso:1;
> > +   u8 ssow:1;
> > +   u8 timlfs:1;
> > +   u8 cptlfs:1;
> > +};
>
> Are these bitfields part of the message that gets sent to the
> underlying implementation? It seems there is still an endianess
> issue then.
>
>Arnd


No, these structures are not used for kernel-driver-to-firmware
communication, where register reads via readq are involved. These
structures are used for mailbox communication between different PCI
devices, and this mailbox is shared memory.

Sunil.


[PATCH net-next] net: mscc: ocelot: remove set but not used variable 'phy_mode'

2018-10-08 Thread YueHaibing
Fixes gcc '-Wunused-but-set-variable' warning:

drivers/net/ethernet/mscc/ocelot_board.c: In function 'mscc_ocelot_probe':
drivers/net/ethernet/mscc/ocelot_board.c:262:17: warning:
 variable 'phy_mode' set but not used [-Wunused-but-set-variable]
   enum phy_mode phy_mode;

It has been unused since its introduction in commit 71e32a20cfbf
("net: mscc: ocelot: make use of SerDes PHYs for handling their
configuration").

Signed-off-by: YueHaibing 
---
 drivers/net/ethernet/mscc/ocelot_board.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/ethernet/mscc/ocelot_board.c 
b/drivers/net/ethernet/mscc/ocelot_board.c
index 953b326..0cf0b09 100644
--- a/drivers/net/ethernet/mscc/ocelot_board.c
+++ b/drivers/net/ethernet/mscc/ocelot_board.c
@@ -259,7 +259,6 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
struct phy_device *phy;
struct resource *res;
struct phy *serdes;
-   enum phy_mode phy_mode;
void __iomem *regs;
char res_name[8];
u32 port;
@@ -297,10 +296,8 @@ static int mscc_ocelot_probe(struct platform_device *pdev)
case PHY_INTERFACE_MODE_NA:
continue;
case PHY_INTERFACE_MODE_SGMII:
-   phy_mode = PHY_MODE_SGMII;
break;
case PHY_INTERFACE_MODE_QSGMII:
-   phy_mode = PHY_MODE_QSGMII;
break;
default:
dev_err(ocelot->dev,



Re: [PATCH v8 01/15] octeontx2-af: Add Marvell OcteonTX2 RVU AF driver

2018-10-08 Thread Sunil Kovvuri
On Mon, Oct 8, 2018 at 5:52 PM Arnd Bergmann  wrote:
>
> On Sun, Oct 7, 2018 at 4:59 PM  wrote:
>
> > --- /dev/null
> > +++ b/drivers/net/ethernet/marvell/octeontx2/Kconfig
> > @@ -0,0 +1,12 @@
> > +#
> > +# Marvell OcteonTX2 drivers configuration
> > +#
> > +
> > +config OCTEONTX2_AF
> > +   tristate "Marvell OcteonTX2 RVU Admin Function driver"
> > +   depends on ARM64 && PCI
>
> You should try to allow building it on x86 and other architectures, even 
> though
> the driver won't be used there this helps get reports from static
> build infrastructure.
> You could use e.g.
>
>   depends on (64BIT && COMPILE_TEST) || ARM64
>   depends on PCI
>
>Arnd

Thanks for the suggestion.
But going forward we will have a few arm64 assembly instructions used
in the driver, so an x86 build will fail. And I have built the code
with sparse and Coverity as well before submitting patches upstream.

Sunil.


Re: [PATCH v2 net-next 17/23] net/namespace: Update rtnl_net_dumpid for strict data checking

2018-10-08 Thread Christian Brauner
On Mon, Oct 08, 2018 at 07:28:33AM -0600, David Ahern wrote:
> On 10/8/18 4:54 AM, Christian Brauner wrote:
> > On Sun, Oct 07, 2018 at 08:16:38PM -0700, David Ahern wrote:
> >> From: David Ahern 
> >>
> >> Update rtnl_net_dumpid for strict data checking. If the flag is set,
> >> the dump request is expected to have an rtgenmsg struct as the header
> >> which has the family as the only element. No data may be appended.
> >>
> >> Signed-off-by: David Ahern 
> >> ---
> >>  net/core/net_namespace.c | 6 ++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
> >> index 670c84b1bfc2..fefe72774aeb 100644
> >> --- a/net/core/net_namespace.c
> >> +++ b/net/core/net_namespace.c
> >> @@ -853,6 +853,12 @@ static int rtnl_net_dumpid(struct sk_buff *skb, 
> >> struct netlink_callback *cb)
> >>.s_idx = cb->args[0],
> >>};
> >>  
> >> +  if (cb->strict_check &&
> > 
> > Hm, shouldn't this also verify that the passed header is indeed struct
> > rtgenmsg before checking whether there are any attributes?

Thanks!

> 
> rtgenmsg is a struct with only the family as an element.
> rtnetlink_rcv_msg has already verified that the nl msg header contains
> at least the rtgenmsg struct.
> 
> 
> > 
> >> +  nlmsg_attrlen(cb->nlh, sizeof(struct rtgenmsg))) {
> >> +  NL_SET_ERR_MSG(cb->extack, "Unknown data in network 
> >> namespace id dump request");
> >> +  return -EINVAL;
> >> +  }
> >> +
> >>  spin_lock_bh(&net->nsid_lock);
> >>  idr_for_each(&net->netns_ids, rtnl_net_dumpid_one, &net_cb);
> >>  spin_unlock_bh(&net->nsid_lock);
> >> -- 
> >> 2.11.0
> >>
> 
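For reference, the check works because nlmsg_attrlen() reports how many
bytes remain after the family-specific header; once rtnetlink_rcv_msg()
has guaranteed that at least a struct rtgenmsg is present, a non-zero
result can only mean extra data was appended. Roughly (a simplified
sketch of the helpers in include/net/netlink.h):

/* Sketch only; see include/net/netlink.h for the real definitions. */
static inline int nlmsg_len_sketch(const struct nlmsghdr *nlh)
{
        return nlh->nlmsg_len - NLMSG_HDRLEN;   /* payload after nlmsghdr */
}

static inline int nlmsg_attrlen_sketch(const struct nlmsghdr *nlh, int hdrlen)
{
        /* payload minus the (aligned) family header, e.g. struct rtgenmsg */
        return nlmsg_len_sketch(nlh) - NLMSG_ALIGN(hdrlen);
}

So nlmsg_attrlen(cb->nlh, sizeof(struct rtgenmsg)) being non-zero is
exactly the "no data may be appended" condition the commit message
describes.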


Re: selftests/bpf: test_kmod.sh hangs on all devices

2018-10-08 Thread Daniel Borkmann
On 10/08/2018 03:13 PM, Naresh Kamboju wrote:
> BPF test case test_kmod.sh hangs on all devices running linux next.
> 
> + cd /opt/kselftests/default-in-kernel/bpf
> + ./test_kmod.sh
> sysctl: cannot stat /proc/sys/net/core/bpf_jit_enable: No such file or 
> directory
> sysctl: cannot stat /proc/sys/net/core/bpf_jit_harden: No such file or 
> directory
> sysctl: cannot stat /proc/sys/net/core/bpf_jit_enable: No such file or 
> directory
> sysctl: cannot stat /proc/sys/net/core/bpf_jit_harden: No such file or 
> directory
> [ JIT enabled:0 hardened:0 ]
> 
> https://lkft.validation.linaro.org/scheduler/job/429726
> 
> Test hangs started from 4.19.0-rc4-next-20180918.
> Linux version 4.19.0-rc4-next-20180918 (oe-user@oe-host) (gcc version
> 7.1.1 20170707 (Linaro GCC 7.1-2017.08)) #1 SMP Tue Sep 18 05:26:00
> UTC 2018
> 
> History can be compared from this page.
> https://qa-reports.linaro.org/lkft/linux-next-oe/tests/kselftest/bpf_test_kmod.sh
> 
> OTOH,
> There is a kernel BUG,

This is quite an old linux-next kernel, should be fixed by 100811936f89 ("bpf: 
test_bpf:
add init_net to dev for flow_dissector"). Please make sure you have that commit 
included
in your testing:

https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=100811936f89fd455eda1984810c09003550555b

Thanks,
Daniel

> [   41.003698] BUG: unable to handle kernel paging request at 1460
> [   41.016603] PGD 80045d648067 P4D 80045d648067 PUD 4575c8067 PMD 0
> [   41.023475] Oops:  [#1] SMP PTI
> [   41.026959] CPU: 3 PID: 2790 Comm: modprobe Not tainted
> 4.19.0-rc4-next-20180920 #1
> [   41.034621] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
> 2.0b 07/27/2017
> [   41.042094] RIP: 0010:__skb_flow_dissect+0xd9/0x1740
> [   41.047057] Code: 7b 79 ff 85 c0 41 5a 74 0d 80 3d 06 cd 00 01 00
> 0f 84 4f 05 00 00 4d 85 ff 0f 84 4a 02 00 00 49 8b 47 10 48 8b 80 40
> 05 00 00 <4c> 8b 80 60 14 00 00 4c 89 85 28 ff ff ff e8 54 7b 79 ff 85
> c0 4c
> [   41.065795] RSP: 0018:9accc1cf79c0 EFLAGS: 00010286
> [   41.071019] RAX:  RBX: 000e RCX: 
> 2c29b266
> [   41.078143] RDX: 97969774 RSI:  RDI: 
> 0001
> [   41.085268] RBP: 9accc1cf7ac0 R08: 9b7517e9 R09: 
> 0002
> [   41.092392] R10: 9b7517e9 R11: 9c662080 R12: 
> 9c78b620
> [   41.099517] R13: 0008 R14: 9accc1cf7aec R15: 
> 882f575e6800
> [   41.106640] FS:  7fcb533fe740() GS:882f5fb8()
> knlGS:
> [   41.114716] CS:  0010 DS:  ES:  CR0: 80050033
> [   41.120454] CR2: 1460 CR3: 00045b606002 CR4: 
> 003606e0
> [   41.127590] DR0:  DR1:  DR2: 
> 
> [   41.134737] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   41.141860] Call Trace:
> [   41.144305]  ? find_held_lock+0x35/0xa0
> [   41.148136]  ? __lock_acquire+0x2b5/0x1340
> [   41.152227]  ? find_held_lock+0x35/0xa0
> [   41.156057]  skb_get_poff+0x4b/0x90
> [   41.159542]  ? skb_get_poff+0x4b/0x90
> [   41.163207]  bpf_skb_get_pay_offset+0xe/0x20
> [   41.167471]  ___bpf_prog_run+0x45f/0x10d0
> [   41.171475]  __bpf_prog_run32+0x39/0x50
> [   41.175305]  ? lockdep_hardirqs_on+0xef/0x180
> [   41.179656]  ? ktime_get+0x6b/0x110
> [   41.183143]  test_bpf_init+0x5ab/0x1000 [test_bpf]
> [   41.187933]  ? 0xc02e2000
> [   41.191244]  do_one_initcall+0x5f/0x2b7
> [   41.195074]  ? rcu_read_lock_sched_held+0x81/0x90
> [   41.199770]  ? kmem_cache_alloc_trace+0x1de/0x210
> [   41.204469]  ? do_init_module+0x27/0x212
> [   41.208387]  do_init_module+0x5f/0x212
> [   41.212138]  load_module+0x20f5/0x2640
> [   41.215886]  __do_sys_finit_module+0xd1/0xf0
> [   41.220156]  ? __do_sys_finit_module+0xd1/0xf0
> [   41.224626]  __x64_sys_finit_module+0x1a/0x20
> [   41.228984]  do_syscall_64+0x4f/0x190
> [   41.232645]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [   41.237697] RIP: 0033:0x7fcb52d087f9
> [   41.241266] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
> 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24
> 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 4f e6 2b 00 f7 d8 64 89
> 01 48
> [   41.260004] RSP: 002b:7ffe780dd828 EFLAGS: 0206 ORIG_RAX:
> 0139
> [   41.267589] RAX: ffda RBX: 006aba70 RCX: 
> 7fcb52d087f9
> [   41.274737] RDX:  RSI: 00418424 RDI: 
> 0003
> [   41.281861] RBP: 00418424 R08:  R09: 
> 
> [   41.288985] R10: 0003 R11: 0206 R12: 
> 
> [   41.296108] R13: 0004 R14:  R15: 
> 
> [   41.303234] Modules linked in: test_bpf(+) x86_pkg_temp_thermal fuse
> [   41.309590] CR2: 1460
> [   41.312930] ---[ end trace a7d5229a26c41aad ]---
> [   41.317541] RIP: 
