Re: [PATCH bpf-next 1/7] nfp: bpf: allow source ptr type be map ptr in memcpy optimization

2018-06-25 Thread Song Liu
On Sun, Jun 24, 2018 at 8:54 PM, Jakub Kicinski
 wrote:
> From: Jiong Wang 
>
> Map reads have been supported on NFP for a while; this patch enables the
> optimization for memcpy from map to packet.
>
> This patch also fixes a latent bug which would cause copying from an
> unexpected address once memcpy for map pointers is enabled.
>
> Reported-by: Mary Pham 
> Reported-by: David Beckett 
> Signed-off-by: Jiong Wang 
> Reviewed-by: Jakub Kicinski 
> ---
>  drivers/net/ethernet/netronome/nfp/bpf/jit.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
> b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> index 8a92088df0d7..33111739b210 100644
> --- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> +++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
> @@ -670,7 +670,7 @@ static int nfp_cpp_memcpy(struct nfp_prog *nfp_prog, 
> struct nfp_insn_meta *meta)
> xfer_num = round_up(len, 4) / 4;
>
> if (src_40bit_addr)
> -   addr40_offset(nfp_prog, meta->insn.src_reg, off, &src_base,
> +   addr40_offset(nfp_prog, meta->insn.src_reg * 2, off, &src_base,
>   &off);

Did this break other cases before this patch?

I am sorry if this is a dumb question. I don't think I fully
understand addr40_offset().

Song

>
> /* Setup PREV_ALU fields to override memory read length. */
> @@ -3299,7 +3299,8 @@ curr_pair_is_memcpy(struct nfp_insn_meta *ld_meta,
> if (!is_mbpf_load(ld_meta) || !is_mbpf_store(st_meta))
> return false;
>
> -   if (ld_meta->ptr.type != PTR_TO_PACKET)
> +   if (ld_meta->ptr.type != PTR_TO_PACKET &&
> +   ld_meta->ptr.type != PTR_TO_MAP_VALUE)
> return false;
>
> if (st_meta->ptr.type != PTR_TO_PACKET)
> --
> 2.17.1
>


Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support

2018-06-25 Thread Ka-Cheong Poon

On 06/26/2018 01:50 AM, Sowmini Varadhan wrote:


If a socket is bound, I guess the scope_id should be used.  So
if a socket is not bound to a link local address and the socket
is used to send to a link local peer, it should fail.


PF_RDS sockets *MUST* always be bound.  See
Documentation/networking/rds.txt:
"   Sockets must be bound before you can send or receive data.
 This is needed because binding also selects a transport and
 attaches it to the socket. Once bound, the transport assignment
 does not change."

Also, rds_sendmsg checks this (from net-next, your version
has the equivalent ipv6_addr_any etc check):

 if (daddr == 0 || rs->rs_bound_addr == 0) {
 release_sock(sk);
 ret = -ENOTCONN; /* XXX not a great errno */
 goto out;
 }



I think you misunderstood what I wrote.  The above is in response
to your original question:

--
> And this is even more confusing because the fastpath in rds_sendmsg
> does not take the bound_scope_id into consideration at all:
> 1213 if (rs->rs_conn && ipv6_addr_equal(&rs->rs_conn->c_faddr, &daddr))

> 1214 conn = rs->rs_conn;
> 1215 else {
> 1216 conn = rds_conn_create_outgoing( /* .. */, scope_id)
> so if I erroneously passed a msg_name on a connected rds socket, what
> would happen? (see also question about rds_connect() itself, below)
--


My answer to this is that if a socket is not bound to a link
local address (meaning it is bound to a non-link local address)
and it is used to send to a link local peer, I think it should
fail.  This is consistent with the scope_id check I mentioned in
the previous mail.  If the socket is not bound to a link local
address, the bound_scope_id is 0.  So if the socket is used to
send to a link local address (which has a non-zero scope_id), the
check will catch it and fail the call.  A new conn should not
be created in this case.
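The rule in the paragraph above can be sketched as a tiny standalone check (hypothetical helper name and return convention; the real logic lives in rds_sendmsg() and the connection lookup path):

```c
#include <stdbool.h>

/* Sketch of the rule above: a link-local destination carries a non-zero
 * scope_id; sending to it is only valid when the socket is bound to a
 * link-local address with the same scope_id.  A socket bound to a
 * non-link-local address has bound_scope_id == 0, so the check below
 * rejects it.  Hypothetical helper, not the actual kernel function.
 */
bool rds_scope_id_ok(int bound_scope_id, int dest_scope_id)
{
	if (dest_scope_id != 0 && bound_scope_id != dest_scope_id)
		return false;	/* no new conn should be created */
	return true;
}
```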



--
K. Poon
ka-cheong.p...@oracle.com




Re: [patch net-next 6/9] net: sched: cls_flower: propagate chain template creation and destruction to drivers

2018-06-25 Thread Jakub Kicinski
On Mon, 25 Jun 2018 23:01:45 +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Introduce a couple of flower offload commands in order to propagate
> template creation/destruction events down to device drivers.
> Drivers may use this information to prepare HW in an optimal way
> for future filter insertions.
> 
> Signed-off-by: Jiri Pirko 

> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> index d64d43843a3a..276ba25a09c3 100644
> --- a/net/sched/cls_flower.c
> +++ b/net/sched/cls_flower.c
> @@ -1120,6 +1120,43 @@ static void fl_walk(struct tcf_proto *tp, struct 
> tcf_walker *arg)
>   }
>  }
>  
> +static void fl_hw_create_tmplt(struct tcf_chain *chain,
> +struct fl_flow_tmplt *tmplt,
> +struct netlink_ext_ack *extack)
> +{
> + struct tc_cls_flower_offload cls_flower = {};
> + struct tcf_block *block = chain->block;
> + struct tcf_exts dummy_exts = { 0, };
> +
> + cls_flower.common.chain_index = chain->index;

Did you skip extack on purpose?

> + cls_flower.command = TC_CLSFLOWER_TMPLT_CREATE;
> + cls_flower.cookie = (unsigned long) tmplt;
> + cls_flower.dissector = &tmplt->dissector;
> + cls_flower.mask = &tmplt->mask;
> + cls_flower.key = &tmplt->dummy_key;
> + cls_flower.exts = &dummy_exts;
> +
> + /* We don't care if driver (any of them) fails to handle this
> +  * call. It serves just as a hint for it.
> +  */
> + tc_setup_cb_call(block, NULL, TC_SETUP_CLSFLOWER,
> +  &cls_flower, false);
> +}


Re: [patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-25 Thread Jakub Kicinski
On Mon, 25 Jun 2018 23:01:39 +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> For the TC clsact offload these days, some HW drivers need to hold a
> magic ball. The reason is that with the first rule inserted into HW
> they need to guess what fields will be used for matching. If this
> guess later proves wrong and the user adds a filter matching on a
> different field, there's a problem. Mlxsw resolves this now with a
> couple of patterns, which try to cover as many match fields as possible.
> This approach is far from optimal, both performance-wise and scale-wise.
> Also, there are combinations of filters that, inserted in a certain
> order, won't succeed.
> 
> Most of the time, when a user inserts filters into a chain, he knows
> right away what the filters are going to look like - what type and
> options they will have. For example, he knows that he will only insert
> filters of type flower matching on destination IP address. He can
> specify a template that would cover all the filters in the chain.

Perhaps it's lack of sleep, but this paragraph threw me a little off
the track.  IIUC the goal of this set is to provide a way to inform the
HW about expected matches before any rule is programmed into the HW.
Not before any rule is added to a particular chain.  One can just use
the first rule in the chain to make a guess about the chain, but thanks
to this set user can configure *all* chains before any rules are added.

And that's needed because once any rule is added the tcam config can no
longer be easily modified?


Re: [PATCHv2 net-next] sctp: add support for SCTP_REUSE_PORT sockopt

2018-06-25 Thread Xin Long
On Mon, Jun 25, 2018 at 9:30 PM, Marcelo Ricardo Leitner
 wrote:
> On Mon, Jun 25, 2018 at 10:06:46AM +0800, Xin Long wrote:
>> This feature is actually already supported by sk->sk_reuse which can be
>> set by socket level opt SO_REUSEADDR. But it's not working exactly as
>> RFC6458 demands in section 8.1.27, like:
>>
>>   - This option only supports one-to-one style SCTP sockets
>>   - This socket option must not be used after calling bind()
>> or sctp_bindx().
>>
>> Besides, an SCTP_REUSE_PORT sockopt should be provided for users'
>> programs. Otherwise, programs using SCTP_REUSE_PORT from other systems
>> will not work on Linux.
>>
>> To separate it from the socket level version, this patch adds 'reuse' in
>> sctp_sock and it works pretty much as sk->sk_reuse, but with some extra
>> setup limitations that are needed when it is being enabled.
>>
>> "It should be noted that the behavior of the socket-level socket option
>> to reuse ports and/or addresses for SCTP sockets is unspecified", so it
>> leaves SO_REUSEADDR as is for the compatibility.
>>
>> Note that the name SCTP_REUSE_PORT is kind of confusing, it is identical
>> to SO_REUSEADDR with some extra restriction, so here it uses 'reuse' in
>> sctp_sock instead of 'reuseport'. As for sk->sk_reuseport support for
>> SCTP, it will be added in another patch.
>
> To help changelog readers later, please update to something like:
>
> """\
> Note that the name SCTP_REUSE_PORT is somewhat confusing, as its
> functionality is nearly identical to SO_REUSEADDR, but with some
> extra restrictions. Here it uses 'reuse' in sctp_sock instead of
> 'reuseport'. As for sk->sk_reuseport support for SCTP, it will be
> added in another patch.
> """
>
> Makes sense, can you note the difference?
Sure, will post v3. Thanks.
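The two RFC 6458 restrictions quoted above reduce to a simple precondition test; a standalone sketch (hypothetical function name and illustrative errno choices, not the kernel implementation):

```c
#include <stdbool.h>
#include <errno.h>

/* RFC 6458 section 8.1.27: SCTP_REUSE_PORT is only valid on one-to-one
 * style sockets and must not be used after bind()/sctp_bindx().
 * Illustrative errnos; the real kernel code may choose differently.
 */
int sctp_reuse_port_check(bool one_to_one_style, bool already_bound)
{
	if (!one_to_one_style)
		return -EOPNOTSUPP;	/* one-to-many sockets: not supported */
	if (already_bound)
		return -EINVAL;		/* too late, port already bound */
	return 0;
}
```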


Re: [PATCH net-next 1/5] ipv4: add __ip_queue_xmit() that supports tos param

2018-06-25 Thread Xin Long
On Mon, Jun 25, 2018 at 7:13 PM, Neil Horman  wrote:
> On Mon, Jun 25, 2018 at 04:26:54PM +0900, David Miller wrote:
>> From: Xin Long 
>> Date: Mon, 25 Jun 2018 10:14:33 +0800
>>
>> > +EXPORT_SYMBOL(__ip_queue_xmit);
>> > +
>> > +int ip_queue_xmit(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
>> > +{
>> > +   return __ip_queue_xmit(sk, skb, fl, inet_sk(sk)->tos);
>> > +}
>> >  EXPORT_SYMBOL(ip_queue_xmit);
>>
>> Maybe better to only export __ip_queue_xmit() and make ip_queue_xmit() an
>> inline function in net/ip.h?
>>
> I concur.  No need to export both here
>
will do that.
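What David suggests would look roughly like the following standalone sketch (stand-in types and hypothetical names; the real change would live in include/net/ip.h and net/ipv4/ip_output.c):

```c
/* Standalone analogue of the suggestion: only the tos-taking variant is
 * "exported", and the plain variant becomes a header inline that passes
 * the socket's default tos.  struct fake_sock stands in for struct sock.
 */
struct fake_sock {
	int tos;	/* stands in for inet_sk(sk)->tos */
};

int __ip_queue_xmit_sketch(struct fake_sock *sk, int tos)
{
	(void)sk;
	return tos;	/* stand-in for the real transmit path */
}

/* would be a static inline in net/ip.h */
static inline int ip_queue_xmit_sketch(struct fake_sock *sk)
{
	return __ip_queue_xmit_sketch(sk, sk->tos);
}
```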


Re: [PATCH net-next 3/5] sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams

2018-06-25 Thread Xin Long
On Tue, Jun 26, 2018 at 12:31 AM, Marcelo Ricardo Leitner
 wrote:
> Hi,
>
> On Tue, Jun 26, 2018 at 01:12:00AM +0900, 吉藤英明 wrote:
>> Hi,
>>
>> 2018-06-25 22:03 GMT+09:00 Marcelo Ricardo Leitner 
>> :
>> > On Mon, Jun 25, 2018 at 07:28:47AM -0400, Neil Horman wrote:
>> >> On Mon, Jun 25, 2018 at 04:31:26PM +0900, David Miller wrote:
>> >> > From: Xin Long 
>> >> > Date: Mon, 25 Jun 2018 10:14:35 +0800
>> >> >
>> >> > >  struct sctp_paddrparams {
>> >> > > @@ -773,6 +775,8 @@ struct sctp_paddrparams {
>> >> > >   __u32   spp_pathmtu;
>> >> > >   __u32   spp_sackdelay;
>> >> > >   __u32   spp_flags;
>> >> > > + __u32   spp_ipv6_flowlabel;
>> >> > > + __u8spp_dscp;
>> >> > >  } __attribute__((packed, aligned(4)));
>> >> >
>> >> > I don't think you can change the size of this structure like this.
>> >> >
>> >> > This check in sctp_setsockopt_peer_addr_params():
>> >> >
>> >> > if (optlen != sizeof(struct sctp_paddrparams))
>> >> > return -EINVAL;
>> >> >
>> >> > is going to trigger in old kernels when executing programs
>> >> > built against the new struct definition.
>> >
>> > That will happen, yes, but do we really care about being future-proof
>> > here? I mean: if we also update such check(s) to support dealing with
>> > smaller-than-supported structs, newer kernels will be able to run
>> > programs built against the old struct, and the new one; while building
>> > using newer headers and running on older kernel may fool the
>> > application in other ways too (like enabling support for something
>> > that is available on newer kernel and that is not present in the older
>> > one).
>>
>> We should not break existing apps.
>> We still accept apps of pre-2.4 era without sin6_scope_id
>> (e.g., net/ipv6/af_inet6.c:inet6_bind()).
>
> Yes. That's what I tried to say. That is supporting an old app built
> with old kernel headers and running on a newer kernel, and not the
> other way around (an app built with fresh headers and running on an
> old kernel).
To handle that, I will update the check like this:

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 1df5d07..c949d8c 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -2715,13 +2715,18 @@ static int
sctp_setsockopt_peer_addr_params(struct sock *sk,
struct sctp_sock*sp = sctp_sk(sk);
int error;
int hb_change, pmtud_change, sackdelay_change;
+   int plen = sizeof(params);
+   int old_plen = plen - sizeof(u32) * 2;

-   if (optlen != sizeof(struct sctp_paddrparams))
+   if (optlen != plen && optlen != old_plen)
return -EINVAL;

if (copy_from_user(&params, optval, optlen))
return -EFAULT;

+   if (optlen == old_plen)
+   params.spp_flags &= ~(SPP_DSCP | SPP_IPV6_FLOWLABEL);
+
/* Validate flags and value parameters. */
hb_change= params.spp_flags & SPP_HB;
pmtud_change = params.spp_flags & SPP_PMTUD;
@@ -5591,10 +5596,13 @@ static int
sctp_getsockopt_peer_addr_params(struct sock *sk, int len,
struct sctp_transport   *trans = NULL;
struct sctp_association *asoc = NULL;
struct sctp_sock*sp = sctp_sk(sk);
+   int plen = sizeof(params);
+   int old_plen = plen - sizeof(u32) * 2;

-   if (len < sizeof(struct sctp_paddrparams))
+   if (len < old_plen)
return -EINVAL;
-   len = sizeof(struct sctp_paddrparams);
+
+   len = len >= plen ? plen : old_plen;
if (copy_from_user(&params, optval, len))
return -EFAULT;

does it look ok to you?
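The compatibility pattern in the diff above can be sketched as a standalone function (names, sizes, and flag values are illustrative, not the real sctp_paddrparams):

```c
#include <errno.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Accept both the old (shorter) and new struct layouts; for old callers
 * clear the flags that only the new layout can carry, so uninitialized
 * tail bytes are never interpreted.  Illustrative flag values.
 */
#define SPP_OLD_FLAG	0x1	/* a flag old callers may set */
#define SPP_NEW_FLAG	0x2	/* only meaningful with the new layout */

struct paddrparams_sketch {
	uint32_t spp_flags;
	uint32_t spp_new_field;	/* field appended by the "new" ABI */
};

int set_params_sketch(const void *optval, size_t optlen,
		      struct paddrparams_sketch *out)
{
	size_t plen = sizeof(*out);
	size_t old_plen = plen - sizeof(uint32_t);

	if (optlen != plen && optlen != old_plen)
		return -EINVAL;

	memset(out, 0, plen);
	memcpy(out, optval, optlen);

	if (optlen == old_plen)	/* old caller: new flag cannot be trusted */
		out->spp_flags &= ~SPP_NEW_FLAG;
	return 0;
}

/* convenience wrapper so the behaviour is easy to assert on */
uint32_t flags_after_set(const void *optval, size_t optlen)
{
	struct paddrparams_sketch p;

	if (set_params_sketch(optval, optlen, &p) != 0)
		return 0xdead;
	return p.spp_flags;
}
```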


Re: [PATCH rdma-next 00/12] RDMA fixes 2018-06-24

2018-06-25 Thread Leon Romanovsky
On Mon, Jun 25, 2018 at 03:34:38PM -0600, Jason Gunthorpe wrote:
> On Sun, Jun 24, 2018 at 11:23:41AM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky 
> >
> > Hi,
> >
> > This is a bunch of patches triggered by running syzkaller internally.
> >
> > I'm sending them based on rdma-next mainly for two reasons:
> > 1. Most of the patches fix old issues and it doesn't matter when
> > they hit Linus's tree: now, or later in a couple of weeks during
> > the merge window.
> > 2. They interleave with code cleanup, mlx5-next patches and Michael's
> > feedback on flow counters series.
> >
> > Thanks
> >
> > Leon Romanovsky (12):
> >   RDMA/uverbs: Protect from attempts to create flows on unsupported QP
> >   RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow
>
> I applied these two to for-rc
>
> >   RDMA/uverbs: Check existence of create_flow callback
> >   RDMA/verbs: Drop kernel variant of create_flow
> >   RDMA/verbs: Drop kernel variant of destroy_flow
> >   net/mlx5: Rate limit errors in command interface
> >   RDMA/uverbs: Don't overwrite NULL pointer with ZERO_SIZE_PTR
> >   RDMA/umem: Don't check for negative return value of dma_map_sg_attrs()
> >   RDMA/uverbs: Remove redundant check
>
> These to for-next

Jason,

We would like to see patch "[PATCH mlx5-next 05/12] net/mlx5:
Rate limit errors in command interface" in our mlx5-next tree. Is it
possible at this point to drop it from for-next, so I'll be able to
take it into mlx5-next?

Thanks

>
> >   overflow.h: Add arithmetic shift helper
> >   RDMA/mlx5: Fix shift overflow in mlx5_ib_create_wq
> >   RDMA/mlx5: Reuse existing shift_overflow helper
>
> And these will have to be respun.
>
> Thanks,
> Jason




[PATCH net v2 0/2] nfp: MPLS and shared blocks TC offload fixes

2018-06-25 Thread Jakub Kicinski
Hi!

This series brings two fixes to TC filter/action offload code.
Pieter fixes matching MPLS packets when the match is purely on
the MPLS ethertype and none of the MPLS fields are used.
John provides a fix for offload of shared blocks.  Unfortunately,
with shared blocks there is currently no guarantee that filters
which were added by the core will be removed before block unbind.
Our simple fix is to not support offload of rules on shared blocks
at all; a revert of this fix will be sent for -next once the
reoffload infrastructure lands.  The shared blocks became important
as we are trying to use them for bonding offload (managed from user
space) and the lack of remove calls leads to resource leaks.

v2:
 - fix build error reported by kbuild bot due to missing
   tcf_block_shared() helper.

John Hurley (1):
  nfp: reject binding to shared blocks

Pieter Jansen van Vuuren (1):
  nfp: flower: fix mpls ether type detection

 drivers/net/ethernet/netronome/nfp/bpf/main.c  |  3 +++
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 14 ++
 .../net/ethernet/netronome/nfp/flower/offload.c| 11 +++
 include/net/pkt_cls.h  |  5 +
 4 files changed, 33 insertions(+)

-- 
2.17.1



[PATCH net v2 2/2] nfp: reject binding to shared blocks

2018-06-25 Thread Jakub Kicinski
From: John Hurley 

TC shared blocks allow multiple qdiscs to be grouped together and filters
shared between them. Currently the chains of filters attached to a block
are only flushed when the block is removed. If a qdisc is removed from a
block but the block still exists, flow del messages are not passed to the
callback registered for that qdisc. For the NFP, this presents the
possibility of rules still existing in hw when they should be removed.

Prevent binding to shared blocks until the kernel can send per qdisc del
messages when block unbinds occur.

tcf_block_shared() was not used outside of the core until now, so also
add an empty implementation for builds with CONFIG_NET_CLS=n.

Fixes: 4861738775d7 ("net: sched: introduce shared filter blocks 
infrastructure")
Signed-off-by: John Hurley 
Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
v2:
 - add a tcf_block_shared() for !CONFIG_NET_CLS
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c   | 3 +++
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 3 +++
 include/net/pkt_cls.h   | 5 +
 3 files changed, 11 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index fcdfb8e7fdea..6b15e3b11956 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -202,6 +202,9 @@ static int nfp_bpf_setup_tc_block(struct net_device *netdev,
if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
return -EOPNOTSUPP;
 
+   if (tcf_block_shared(f->block))
+   return -EOPNOTSUPP;
+
switch (f->command) {
case TC_BLOCK_BIND:
return tcf_block_cb_register(f->block,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 477f584f6d28..525057bee0ed 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -631,6 +631,9 @@ static int nfp_flower_setup_tc_block(struct net_device 
*netdev,
if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
return -EOPNOTSUPP;
 
+   if (tcf_block_shared(f->block))
+   return -EOPNOTSUPP;
+
switch (f->command) {
case TC_BLOCK_BIND:
return tcf_block_cb_register(f->block,
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a3c1a2c47cd4..20b059574e60 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -111,6 +111,11 @@ void tcf_block_put_ext(struct tcf_block *block, struct 
Qdisc *q,
 {
 }
 
+static inline bool tcf_block_shared(struct tcf_block *block)
+{
+   return false;
+}
+
 static inline struct Qdisc *tcf_block_q(struct tcf_block *block)
 {
return NULL;
-- 
2.17.1



[PATCH net v2 1/2] nfp: flower: fix mpls ether type detection

2018-06-25 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

Previously it was not possible to distinguish between mpls ether types and
other ether types. This leads to incorrect classification of offloaded
filters that match on mpls ether type. For example the following two
filters overlap:

 # tc filter add dev eth0 parent ffff: \
protocol 0x8847 flower \
action mirred egress redirect dev eth1

 # tc filter add dev eth0 parent ffff: \
protocol 0x0800 flower \
action mirred egress redirect dev eth2

The driver now correctly includes the mac_mpls layer where HW stores mpls
fields, when it detects an mpls ether type. It also sets the MPLS_Q bit to
indicate that the filter should match mpls packets.
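The distinction described above (an MPLS ether type with no MPLS fields matched) hinges on recognizing the two MPLS ethertypes; a minimal standalone sketch, assuming the standard ETH_P_MPLS_UC/ETH_P_MPLS_MC values from if_ether.h:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of the detection the patch adds: an MPLS ether
 * type must still select the MPLS key layer even when no MPLS fields
 * are matched.  Constants mirror ETH_P_MPLS_UC/ETH_P_MPLS_MC.
 */
#define ETH_P_MPLS_UC_SK 0x8847	/* MPLS unicast */
#define ETH_P_MPLS_MC_SK 0x8848	/* MPLS multicast */

bool is_mpls_ethertype(uint16_t n_proto)
{
	return n_proto == ETH_P_MPLS_UC_SK || n_proto == ETH_P_MPLS_MC_SK;
}
```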

Fixes: bb055c198d9b ("nfp: add mpls match offloading support")
Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Simon Horman 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 14 ++
 .../net/ethernet/netronome/nfp/flower/offload.c|  8 
 2 files changed, 22 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/match.c 
b/drivers/net/ethernet/netronome/nfp/flower/match.c
index 91935405f586..84f7a5dbea9d 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/match.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/match.c
@@ -123,6 +123,20 @@ nfp_flower_compile_mac(struct nfp_flower_mac_mpls *frame,
 NFP_FLOWER_MASK_MPLS_Q;
 
frame->mpls_lse = cpu_to_be32(t_mpls);
+   } else if (dissector_uses_key(flow->dissector,
+ FLOW_DISSECTOR_KEY_BASIC)) {
+   /* Check for mpls ether type and set NFP_FLOWER_MASK_MPLS_Q
+* bit, which indicates an mpls ether type but without any
+* mpls fields.
+*/
+   struct flow_dissector_key_basic *key_basic;
+
+   key_basic = skb_flow_dissector_target(flow->dissector,
+ FLOW_DISSECTOR_KEY_BASIC,
+ flow->key);
+   if (key_basic->n_proto == cpu_to_be16(ETH_P_MPLS_UC) ||
+   key_basic->n_proto == cpu_to_be16(ETH_P_MPLS_MC))
+   frame->mpls_lse = cpu_to_be32(NFP_FLOWER_MASK_MPLS_Q);
}
 }
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index c42e64f32333..477f584f6d28 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -264,6 +264,14 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
case cpu_to_be16(ETH_P_ARP):
return -EOPNOTSUPP;
 
+   case cpu_to_be16(ETH_P_MPLS_UC):
+   case cpu_to_be16(ETH_P_MPLS_MC):
+   if (!(key_layer & NFP_FLOWER_LAYER_MAC)) {
+   key_layer |= NFP_FLOWER_LAYER_MAC;
+   key_size += sizeof(struct nfp_flower_mac_mpls);
+   }
+   break;
+
/* Will be included in layer 2. */
case cpu_to_be16(ETH_P_8021Q):
break;
-- 
2.17.1



[PATCH net-next] neighbour: force neigh_invalidate when NUD_FAILED update is from admin

2018-06-25 Thread Roopa Prabhu
From: Roopa Prabhu 

In systems where neigh gc thresholds are set to high values,
admin-deleted neigh entries (e.g. ip neigh flush or ip neigh del) can
linger in NUD_FAILED state for a long time until periodic gc kicks
in. This patch forces neigh_invalidate when a NUD_FAILED neigh_update
comes from an admin.

Signed-off-by: Roopa Prabhu 
---
My testing has not shown any problems with this patch. But I
am not sure why historically neigh admin was not considered here:
I am assuming that it is because the problem is not very obvious in
default low gc threshold deployments.
 net/core/neighbour.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 8e3fda9..cbe85d8 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1148,7 +1148,8 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
neigh->nud_state = new;
err = 0;
notify = old & NUD_VALID;
-   if ((old & (NUD_INCOMPLETE | NUD_PROBE)) &&
+   if (((old & (NUD_INCOMPLETE | NUD_PROBE)) ||
+(flags & NEIGH_UPDATE_F_ADMIN)) &&
(new & NUD_FAILED)) {
neigh_invalidate(neigh);
notify = 1;
-- 
2.1.4



Re: [net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short

2018-06-25 Thread Alexander Duyck
On Mon, Jun 25, 2018 at 6:34 PM, Tom Herbert  wrote:
>
>
> On Mon, Jun 25, 2018 at 11:04 AM, Amritha Nambiar
>  wrote:
>>
>> Change 'skc_tx_queue_mapping' field in sock_common structure from
>> 'int' to 'unsigned short' type with 0 indicating unset and
>> a positive queue value being set. This way it is consistent with
>> the queue_mapping field in the sk_buff. This will also accommodate
>> adding a new 'unsigned short' field in sock_common in the next
>> patch for rx_queue_mapping.
>>
>> Signed-off-by: Amritha Nambiar 
>> ---
>>  include/net/sock.h |   10 ++
>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index b3b7541..009fd30 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -214,7 +214,7 @@ struct sock_common {
>> struct hlist_node   skc_node;
>> struct hlist_nulls_node skc_nulls_node;
>> };
>> -   int skc_tx_queue_mapping;
>> +   unsigned short  skc_tx_queue_mapping;
>> union {
>> int skc_incoming_cpu;
>> u32 skc_rcv_wnd;
>> @@ -1681,17 +1681,19 @@ static inline int sk_receive_skb(struct sock *sk,
>> struct sk_buff *skb,
>>
>>  static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
>>  {
>> -   sk->sk_tx_queue_mapping = tx_queue;
>> +   /* sk_tx_queue_mapping accept only upto a 16-bit value */
>> +   WARN_ON((unsigned short)tx_queue > USHRT_MAX);
>
>
> Shouldn't this be USHRT_MAX - 1 ?

Actually just a ">=" would probably do as well.

>
>> +   sk->sk_tx_queue_mapping = tx_queue + 1;
>>  }
>>
>>  static inline void sk_tx_queue_clear(struct sock *sk)
>>  {
>> -   sk->sk_tx_queue_mapping = -1;
>>
>> +   sk->sk_tx_queue_mapping = 0;
>
>
> I think it's slightly better to define a new constant like NO_QUEUE_MAPPING
> to be USHRT_MAX. That avoids needing to do the arithmetic every time the
> value is accessed.
>>
>>  }
>>
>>  static inline int sk_tx_queue_get(const struct sock *sk)
>>  {
>> -   return sk ? sk->sk_tx_queue_mapping : -1;
>> +   return sk ? sk->sk_tx_queue_mapping - 1 : -1;
>
>
> Doesn't the comparison in __netdev_pick_tx need to be simultaneously changed
> for this?

This doesn't change the result. It was still -1 if the queue mapping
is not set. It was just initialized to 0 instead of to -1 so we have
to perform the operation to get there.

Also in regards to the comment above about needing an extra operation
I am not sure it makes much difference.

In the case of us starting with 0 as a reserved value I think the
instruction count should be about the same. We move the unsigned short
into an unsigned int, then decrement, and if the value is non-negative
we can assume it is valid. Although maybe I should double check the
code to make certain it is doing what I thought it was supposed to be
doing.
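The encoding being debated here can be captured in two pure helpers (a simplified standalone sketch, not the sock_common accessors themselves):

```c
/* Store queue+1 in an unsigned short so that 0 means "unset"; decoding
 * subtracts one, so the unset value reads back as -1 after the usual
 * integer promotion.  tx_queue must stay below USHRT_MAX for the +1 to
 * be representable - the point raised in the thread above.
 */
unsigned short encode_tx_queue(int tx_queue)
{
	return (unsigned short)(tx_queue + 1);
}

int decode_tx_queue(unsigned short stored)
{
	return stored - 1;	/* promoted to int, so 0 decodes to -1 */
}
```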

>
>>
>>
>>
>>  }
>>
>>  static inline void sk_set_socket(struct sock *sk, struct socket *sock)
>>
>


Re: [PATCH net-next] tcp: add SNMP counter for zero-window drops

2018-06-25 Thread David Miller
From: Yafang Shao 
Date: Sun, 24 Jun 2018 10:02:54 -0400

> It will be helpful if we could display the drops due to a zero window or
> insufficient window space.
> So a new SNMP MIB entry is added to track this behavior.
> This entry is named LINUX_MIB_TCPZEROWINDOWDROP and published in
> /proc/net/netstat in TcpExt line as TCPZeroWindowDrop.
> 
> Signed-off-by: Yafang Shao 

Applied.


Re: [PATCH v2 net] nfp: cast sizeof() to int when comparing with error code

2018-06-25 Thread cgxu519




On 06/26/2018 10:29 AM, Joe Perches wrote:

On Tue, 2018-06-26 at 09:16 +0800, Chengguang Xu wrote:

sizeof() returns an unsigned value, so in the error check a negative
error code will always be larger than sizeof().

[]

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c

[]

@@ -232,7 +232,7 @@ struct nfp_nffw_info *nfp_nffw_info_open(struct nfp_cpp 
*cpp)
err = nfp_cpp_read(cpp, nfp_resource_cpp_id(state->res),
   nfp_resource_address(state->res),
   fwinf, sizeof(*fwinf));
-   if (err < sizeof(*fwinf))
+   if (err < (int)sizeof(*fwinf))
goto err_release;
  
  	if (!nffw_res_flg_init_get(fwinf))

The way this is done in several places in the kernel is
to test first for < 0 and then test for < sizeof

if (err < 0 || err < sizeof(etc...)

see net/ceph/ceph_common.c etc...
If we need to distinguish between the <0 and >0 cases, the && approach
is better.

If not, I think casting to int will be enough.

Thanks,
Chengguang.




Re: [PATCH v2 net] nfp: cast sizeof() to int when comparing with error code

2018-06-25 Thread Joe Perches
On Tue, 2018-06-26 at 09:16 +0800, Chengguang Xu wrote:
> sizeof() returns an unsigned value, so in the error check a negative
> error code will always be larger than sizeof().
[]
> diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c 
> b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c
[]
> @@ -232,7 +232,7 @@ struct nfp_nffw_info *nfp_nffw_info_open(struct nfp_cpp 
> *cpp)
>   err = nfp_cpp_read(cpp, nfp_resource_cpp_id(state->res),
>  nfp_resource_address(state->res),
>  fwinf, sizeof(*fwinf));
> - if (err < sizeof(*fwinf))
> + if (err < (int)sizeof(*fwinf))
>   goto err_release;
>  
>   if (!nffw_res_flg_init_get(fwinf))

The way this is done in several places in the kernel is
to test first for < 0 and then test for < sizeof

if (err < 0 || err < sizeof(etc...)

see net/ceph/ceph_common.c etc...
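The underlying C pitfall: in `err < sizeof(*fwinf)` the usual arithmetic conversions turn the signed `err` into a `size_t`, so a negative errno compares as a huge unsigned value. A standalone illustration of both the bug and the two fixes discussed (the implicit conversion is written out explicitly here):

```c
#include <stdbool.h>
#include <stddef.h>

/* Buggy shape: err is converted to size_t, so err = -5 becomes a huge
 * unsigned value and the "short read" check never fires for errors.
 */
bool short_read_buggy(int err, size_t want)
{
	return (size_t)err < want;	/* the implicit conversion, made explicit */
}

/* Fix 1 (this patch): cast sizeof to int. */
bool short_read_cast(int err, size_t want)
{
	return err < (int)want;
}

/* Fix 2 (Joe's suggestion): test for a negative error first. */
bool short_read_two_step(int err, size_t want)
{
	return err < 0 || (size_t)err < want;
}
```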



[PATCH net-next] liquidio: fix kernel panic when NIC firmware is older than 1.7.2

2018-06-25 Thread Felix Manlunas
From: Rick Farrington 

Pre-1.7.2 NIC firmware does not support (and does not respond to) the "get
speed" command which is sent by the 1.7.2 driver during modprobe.  Due to a
bug in older firmware (with respect to unknown commands), this unsupported
command causes a cascade of errors that ends in a kernel panic.

Fix it by making the sending of the "get speed" command conditional on the
firmware version.

Signed-off-by: Rick Farrington 
Acked-by: Derek Chickles 
Signed-off-by: Felix Manlunas 
---
Note: To avoid checkpatch.pl "WARNING: line over 80 characters", the comma
  that separates the arguments in the call to strcmp() was placed one
  line below the usual spot.

 drivers/net/ethernet/cavium/liquidio/lio_main.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 7cb4e75..f83f884 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -3671,7 +3671,16 @@ static int setup_nic_devices(struct octeon_device 
*octeon_dev)
OCTEON_CN2350_25GB_SUBSYS_ID ||
octeon_dev->subsystem_id ==
OCTEON_CN2360_25GB_SUBSYS_ID) {
-   liquidio_get_speed(lio);
+   /* speed control unsupported in f/w older than 1.7.2 */
+   if (strcmp(octeon_dev->fw_info.liquidio_firmware_version
+  , "1.7.2") < 0) {
+   dev_info(&octeon_dev->pci_dev->dev,
+"speed setting not supported by f/w.");
+   octeon_dev->speed_setting = 25;
+   octeon_dev->no_speed_setting = 1;
+   } else {
+   liquidio_get_speed(lio);
+   }
 
if (octeon_dev->speed_setting == 0) {
octeon_dev->speed_setting = 25;
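One caveat worth noting: strcmp() compares the version strings lexicographically, which orders "1.7.1" before "1.7.2" correctly but would misorder a hypothetical "1.10.0" against "1.7.2". A quick standalone illustration:

```c
#include <string.h>

/* Mirrors the check above: "is the firmware older than ref?" via plain
 * lexicographic strcmp().  Works while version components keep the same
 * digit width; multi-digit components would need a real version parser.
 */
int fw_older_than(const char *fw_version, const char *ref)
{
	return strcmp(fw_version, ref) < 0;
}
```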


Re: [net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues

2018-06-25 Thread kbuild test robot
Hi Amritha,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Amritha-Nambiar/Symmetric-queue-selection-using-XPS-for-Rx-queues/20180626-070944
config: i386-randconfig-x078-201825 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   In file included from include/net/cls_cgroup.h:19:0,
from net/socket.c:98:
   include/net/sock.h: In function 'sk_rx_queue_get':
>> include/net/sock.h:1715:16: error: 'const struct sock' has no member named 
>> 'sk_rx_queue_mapping'
 return sk ? sk->sk_rx_queue_mapping - 1 : -1;
   ^~

vim +1715 include/net/sock.h

  1712  
  1713  static inline int sk_rx_queue_get(const struct sock *sk)
  1714  {
> 1715  return sk ? sk->sk_rx_queue_mapping - 1 : -1;
  1716  }
  1717  static inline void sk_set_socket(struct sock *sk, struct socket *sock)
  1718  {
  1719  sk_tx_queue_clear(sk);
  1720  sk->sk_socket = sock;
  1721  }
  1722  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH net 2/2] nfp: reject binding to shared blocks

2018-06-25 Thread kbuild test robot
Hi John,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net/master]

url:
https://github.com/0day-ci/linux/commits/Jakub-Kicinski/nfp-MPLS-and-shared-blocks-TC-offload-fixes/20180626-042358
config: x86_64-randconfig-u0-06260533 (attached as .config)
compiler: gcc-5 (Debian 5.5.0-3) 5.4.1 20171010
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/netronome/nfp/bpf/main.c: In function 
'nfp_bpf_setup_tc_block':
>> drivers/net/ethernet/netronome/nfp/bpf/main.c:205:6: error: implicit 
>> declaration of function 'tcf_block_shared' 
>> [-Werror=implicit-function-declaration]
 if (tcf_block_shared(f->block))
 ^
   cc1: some warnings being treated as errors

vim +/tcf_block_shared +205 drivers/net/ethernet/netronome/nfp/bpf/main.c

   196  
   197  static int nfp_bpf_setup_tc_block(struct net_device *netdev,
   198struct tc_block_offload *f)
   199  {
   200  struct nfp_net *nn = netdev_priv(netdev);
   201  
   202  if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
   203  return -EOPNOTSUPP;
   204  
 > 205  if (tcf_block_shared(f->block))
   206  return -EOPNOTSUPP;
   207  
   208  switch (f->command) {
   209  case TC_BLOCK_BIND:
   210  return tcf_block_cb_register(f->block,
   211   nfp_bpf_setup_tc_block_cb,
   212   nn, nn);
   213  case TC_BLOCK_UNBIND:
   214  tcf_block_cb_unregister(f->block,
   215  nfp_bpf_setup_tc_block_cb,
   216  nn);
   217  return 0;
   218  default:
   219  return -EOPNOTSUPP;
   220  }
   221  }
   222  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


[PATCH v2 net] nfp: cast sizeof() to int when comparing with error code

2018-06-25 Thread Chengguang Xu
sizeof() returns an unsigned value, so in the error check a negative
error code will always compare larger than sizeof().

Fixes: a0d8e02c35ff ("nfp: add support for reading nffw info")

Signed-off-by: Chengguang Xu 
Acked-by: Jakub Kicinski 
---
v2:
- Add more information to patch subject and commit log.

 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c
index cd34097b79f1..37a6d7822a38 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nffw.c
@@ -232,7 +232,7 @@ struct nfp_nffw_info *nfp_nffw_info_open(struct nfp_cpp 
*cpp)
err = nfp_cpp_read(cpp, nfp_resource_cpp_id(state->res),
   nfp_resource_address(state->res),
   fwinf, sizeof(*fwinf));
-   if (err < sizeof(*fwinf))
+   if (err < (int)sizeof(*fwinf))
goto err_release;
 
if (!nffw_res_flg_init_get(fwinf))
-- 
2.17.1



Re: [PATCH v2 net-next 3/4] netdevsim: add ipsec offload testing

2018-06-25 Thread Jakub Kicinski
On Mon, 25 Jun 2018 16:41:35 -0700, Shannon Nelson wrote:
> Implement the IPsec/XFRM offload API for testing.
> 
> Signed-off-by: Shannon Nelson 
> ---
> V2 - addressed formatting comments from Jakub Kicinski

Thanks!  One more comment below, otherwise:

Reviewed-by: Jakub Kicinski 

> +static void nsim_ipsec_del_sa(struct xfrm_state *xs)
> +{
> + struct net_device *dev = xs->xso.dev;
> + struct netdevsim *ns = netdev_priv(dev);
> + struct nsim_ipsec *ipsec = >ipsec;
> + u16 sa_idx;

I didn't point this out, but above also breaks the reverse xmas tree
rule.  Sorry for not mentioning it, the "please fix everywhere" was
implicit :(

> + sa_idx = xs->xso.offload_handle & ~NSIM_IPSEC_VALID;
> + if (!ipsec->sa[sa_idx].used) {
> + netdev_err(dev, "Invalid SA for delete sa_idx=%d\n", sa_idx);
> + return;
> + }
> +
> + memset(>sa[sa_idx], 0, sizeof(struct nsim_sa));
> + ipsec->count--;
> +}
> +
> +static bool nsim_ipsec_offload_ok(struct sk_buff *skb, struct xfrm_state *xs)
> +{
> + struct net_device *dev = xs->xso.dev;
> + struct netdevsim *ns = netdev_priv(dev);

And here, but you don't use the dev here, so you can just inline the
xs->xso.dev into netdev_priv().

> + struct nsim_ipsec *ipsec = >ipsec;
> +
> + ipsec->ok++;
> +
> + return true;
> +}



Re: [PATCH v2 net-next] net/sched: add skbprio scheduler

2018-06-25 Thread Nishanth Devarajan
On Sun, Jun 24, 2018 at 11:43:07AM -0400, Jamal Hadi Salim wrote:
> On 23/06/18 04:47 PM, Nishanth Devarajan wrote:
> [..]
> 
> >+/* Drop the packet at the tail of the lowest priority qdisc. */
> >+lp_qdisc = >qdiscs[lp];
> >+to_drop = __skb_dequeue_tail(lp_qdisc);
> >+BUG_ON(!to_drop);
> >+qdisc_qstats_backlog_dec(sch, to_drop);
> >+qdisc_drop(to_drop, sch, to_free);
> >+
> 
> Maybe also increase overlimit stat here? It will keep track
> of low prio things dropped because you were congested.
> Such a stat helps when debugging or collecting analytics.
> 
> Per Alex's comment, how about:
> 
> ---
> Skbprio (SKB Priority Queue) is a queueing discipline that
> prioritizes packets according to their skb->priority field.
> Under congestion, already-enqueued lower priority packets
> will be dropped to make space available for higher priority
> packets. Skbprio was conceived as a solution for
> denial-of-service defenses that need to route packets with
> different priorities as a means to overcome DoS attacks
> as described in paper ...
> 
> 
> cheers,
> jamal

Sounds good, will make some changes in v3.

Thanks,
Nishanth


Re: [RFC v2 PATCH 0/4] eBPF and struct scatterlist

2018-06-25 Thread Tushar Dave




On 06/19/2018 11:00 AM, Tushar Dave wrote:

This follows up on https://patchwork.ozlabs.org/cover/927050/
where the review feedback was to use bpf_skb_load_bytes() to deal with
linear and non-linear skbs. While that feedback is valid and correct,
the motivation for this work is to allow eBPF based firewalling for
kernel modules that do not always get their packet as an sk_buff from
their downlink drivers. One such instance of this use-case is RDS, which
can be run both over IB (driver RDMA's a scatterlist to the RDS module)
or over TCP (TCP passes an sk_buff to the RDS module)

This RFC (call it v2) uses the existing socket filter infrastructure and
extends it with a new eBPF program type that deals with struct scatterlist.
For RDS, the integrated approach treats the scatterlist as the common
denominator, and allows the application to write a filter for processing
a scatterlist.


Details:
Patch 1 adds a new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER which
uses the existing socket filter infrastructure for bpf program attach
and load. An eBPF program of type BPF_PROG_TYPE_SOCKET_SG_FILTER receives
struct scatterlist as its bpf context, in contrast to
BPF_PROG_TYPE_SOCKET_FILTER which deals with struct skb. This new eBPF
program type allows a socket filter to run on packet data that is in the
form of struct scatterlist.

Patch 2 adds functionality to run BPF_PROG_TYPE_SOCKET_SG_FILTER socket
filter program. A bpf helper bpf_sg_next() is also added so users can
retrieve sg elements from scatterlist.

Patch 3 adds a socket filter eBPF sample program that uses patch 1 and
patch 2. The sample program opens an rds socket, attaches an ebpf program
(socksg, i.e. BPF_PROG_TYPE_SOCKET_SG_FILTER) to the rds socket, and uses
the bpf_sg_next helper to look into the sg. For a test, the current ebpf
program only prints the first few bytes from each element of the sg list.

Finally, patch 4 allows rds_recv_incoming to invoke the socket filter
program that deals with the scatterlist.


It would be really helpful to have your thoughts/comments on my
direction. I am planning to put together some complex example for
filtering RDS traffic using eBPF, and so it would be good to have
feedback before I go too far ahead.

Thanks.

-Tushar




Tushar Dave (4):
   eBPF: Add new eBPF prog type BPF_PROG_TYPE_SOCKET_SG_FILTER
   ebpf: Add sg_filter_run and sg helper
   ebpf: Add sample ebpf program for SOCKET_SG_FILTER
   rds: invoke socket sg filter attached to rds socket

  include/linux/bpf_types.h |   1 +
  include/linux/filter.h|  10 +
  include/uapi/linux/bpf.h  |  17 +-
  kernel/bpf/syscall.c  |   1 +
  kernel/bpf/verifier.c |   1 +
  net/core/filter.c | 149 -
  net/rds/ib.c  |   1 +
  net/rds/ib.h  |   1 +
  net/rds/ib_recv.c |  12 ++
  net/rds/rds.h |   2 +
  net/rds/recv.c|  16 ++
  net/rds/tcp.c |   2 +
  net/rds/tcp.h |   2 +
  net/rds/tcp_recv.c|  38 
  samples/bpf/Makefile  |   3 +
  samples/bpf/bpf_load.c|  11 +-
  samples/bpf/rds_filter_kern.c |  78 +++
  samples/bpf/rds_filter_user.c | 339 ++
  tools/bpf/bpftool/prog.c  |   1 +
  tools/include/uapi/linux/bpf.h|  17 +-
  tools/lib/bpf/libbpf.c|   3 +
  tools/lib/bpf/libbpf.h|   2 +
  tools/testing/selftests/bpf/bpf_helpers.h |   3 +
  23 files changed, 703 insertions(+), 7 deletions(-)
  create mode 100644 samples/bpf/rds_filter_kern.c
  create mode 100644 samples/bpf/rds_filter_user.c



Re: [PATCH v2 net-next] net/sched: add skbprio scheduler

2018-06-25 Thread Nishanth Devarajan
On Sat, Jun 23, 2018 at 03:10:32PM -0700, Alexander Duyck wrote:
> On Sat, Jun 23, 2018 at 1:47 PM, Nishanth Devarajan  
> wrote:
> > net/sched: add skbprio scheduler
> >
> > Skbprio (SKB Priority Queue) is a queueing discipline that prioritizes 
> > packets
> > according to their skb->priority field. Although Skbprio can be employed in 
> > any
> > scenario in which a higher skb->priority field means a higher priority 
> > packet,
> > Skbprio was conceived as a solution for denial-of-service defenses that
> > need to
> > route packets with different priorities.
> 
> Really this description is not very good. Reading it I was thinking to
> myself "why do we need this, prio already does this". It wasn't until
> I read through the code that I figured out that you are basically
> adding dropping of lower priority frames.
> 

OK, I'll take Jamal's suggestion on this and write up a new description.

> >
> > v2
> > *Use skb->priority field rather than DS field. Rename queueing discipline as
> > SKB Priority Queue (previously Gatekeeper Priority Queue).
> >
> > *Queueing discipline is made classful to expose Skbprio's internal priority
> > queues.
> >
> > Signed-off-by: Nishanth Devarajan 
> > Reviewed-by: Sachin Paryani 
> > Reviewed-by: Cody Doucette 
> > Reviewed-by: Michel Machado 
> > ---
> >  include/uapi/linux/pkt_sched.h |  15 ++
> >  net/sched/Kconfig  |  13 ++
> >  net/sched/Makefile |   1 +
> >  net/sched/sch_skbprio.c| 347 
> > +
> >  4 files changed, 376 insertions(+)
> >  create mode 100644 net/sched/sch_skbprio.c
> >
> > diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> > index 37b5096..6fd07e8 100644
> > --- a/include/uapi/linux/pkt_sched.h
> > +++ b/include/uapi/linux/pkt_sched.h
> > @@ -124,6 +124,21 @@ struct tc_fifo_qopt {
> > __u32   limit;  /* Queue length: bytes for bfifo, packets for pfifo 
> > */
> >  };
> >
> > +/* SKBPRIO section */
> > +
> > +/*
> > + * Priorities go from zero to (SKBPRIO_MAX_PRIORITY - 1).
> > + * SKBPRIO_MAX_PRIORITY should be at least 64 in order for skbprio to be 
> > able
> > + * to map one to one the DS field of IPV4 and IPV6 headers.
> > + * Memory allocation grows linearly with SKBPRIO_MAX_PRIORITY.
> > + */
> > +
> > +#define SKBPRIO_MAX_PRIORITY 64
> > +
> > +struct tc_skbprio_qopt {
> > +   __u32   limit;  /* Queue length in packets. */
> > +};
> > +
> >  /* PRIO section */
> >
> >  #define TCQ_PRIO_BANDS 16
> > diff --git a/net/sched/Kconfig b/net/sched/Kconfig
> > index a01169f..9ac4b53 100644
> > --- a/net/sched/Kconfig
> > +++ b/net/sched/Kconfig
> > @@ -240,6 +240,19 @@ config NET_SCH_MQPRIO
> >
> >   If unsure, say N.
> >
> > +config NET_SCH_SKBPRIO
> > +   tristate "SKB priority queue scheduler (SKBPRIO)"
> > +   help
> > + Say Y here if you want to use the SKB priority queue
> > + scheduler. This schedules packets according to skb->priority,
> > + which is useful for request packets in DoS mitigation systems such
> > + as Gatekeeper.
> > +
> > + To compile this driver as a module, choose M here: the module will
> > + be called sch_skbprio.
> > +
> > + If unsure, say N.
> > +
> >  config NET_SCH_CHOKE
> > tristate "CHOose and Keep responsive flow scheduler (CHOKE)"
> > help
> > diff --git a/net/sched/Makefile b/net/sched/Makefile
> > index 8811d38..a4d8893 100644
> > --- a/net/sched/Makefile
> > +++ b/net/sched/Makefile
> > @@ -46,6 +46,7 @@ obj-$(CONFIG_NET_SCH_NETEM)   += sch_netem.o
> >  obj-$(CONFIG_NET_SCH_DRR)  += sch_drr.o
> >  obj-$(CONFIG_NET_SCH_PLUG) += sch_plug.o
> >  obj-$(CONFIG_NET_SCH_MQPRIO)   += sch_mqprio.o
> > +obj-$(CONFIG_NET_SCH_SKBPRIO)  += sch_skbprio.o
> >  obj-$(CONFIG_NET_SCH_CHOKE)+= sch_choke.o
> >  obj-$(CONFIG_NET_SCH_QFQ)  += sch_qfq.o
> >  obj-$(CONFIG_NET_SCH_CODEL)+= sch_codel.o
> > diff --git a/net/sched/sch_skbprio.c b/net/sched/sch_skbprio.c
> > new file mode 100644
> > index 000..5e89446
> > --- /dev/null
> > +++ b/net/sched/sch_skbprio.c
> > @@ -0,0 +1,347 @@
> > +/*
> > + * net/sched/sch_skbprio.c  SKB Priority Queue.
> > + *
> > + * This program is free software; you can redistribute it 
> > and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version
> > + * 2 of the License, or (at your option) any later version.
> > + *
> > + * Authors:Nishanth Devarajan, 
> > + * Cody Doucette, 
> > + * original idea by Michel Machado, Cody Doucette, and Qiaobin 
> > Fu
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +
> > +/*   SKB Priority Queue
> > + * =
> > + *
> > + * 

[PATCH net-next 6/6] selftests: forwarding: README: Require diagrams

2018-06-25 Thread Petr Machata
ASCII art diagrams are well suited for presenting the topology that a
test uses while being easy to embed directly in the test file itself.
They make the information very easy to grasp even for simple topologies,
and for more complex ones they are almost essential, as figuring out the
interconnects from the script itself proves to be difficult.

Therefore state the requirement for topology ASCII art in README.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 tools/testing/selftests/net/forwarding/README | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/net/forwarding/README 
b/tools/testing/selftests/net/forwarding/README
index 4a0964c42860..b8a2af8fcfb7 100644
--- a/tools/testing/selftests/net/forwarding/README
+++ b/tools/testing/selftests/net/forwarding/README
@@ -46,6 +46,8 @@ Guidelines for Writing Tests
 
 o Where possible, reuse an existing topology for different tests instead
   of recreating the same topology.
+o Tests that use anything but the most trivial topologies should include
+  an ASCII art showing the topology.
 o Where possible, IPv6 and IPv4 addresses shall conform to RFC 3849 and
   RFC 5737, respectively.
 o Where possible, tests shall be written so that they can be reused by
-- 
2.4.11



[PATCH net-next 3/6] selftests: forwarding: tc_rule_stats_get: Parameterize direction

2018-06-25 Thread Petr Machata
The GRE multipath tests need stats on an egress counter. Change
tc_rule_stats_get() to take direction as an optional argument, with
default of ingress.

Take the opportunity to change the line continuation character from | to \.
Move the | to the next line, and reindent accordingly.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 tools/testing/selftests/net/forwarding/lib.sh | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh 
b/tools/testing/selftests/net/forwarding/lib.sh
index 911d753c4ff0..f94ea4bafa13 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -383,9 +383,10 @@ tc_rule_stats_get()
 {
local dev=$1; shift
local pref=$1; shift
+   local dir=$1; shift
 
-   tc -j -s filter show dev $dev ingress pref $pref |
-   jq '.[1].options.actions[].stats.packets'
+   tc -j -s filter show dev $dev ${dir:-ingress} pref $pref \
+   | jq '.[1].options.actions[].stats.packets'
 }
 
 mac_get()
-- 
2.4.11
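The `${dir:-ingress}` expansion is what makes the new argument optional: when $dir is unset or empty, bash substitutes "ingress". A minimal illustration of the same pattern outside of tc (function and device names invented):

```shell
#!/bin/bash

# Mirrors the argument handling in tc_rule_stats_get(): the third
# positional parameter is optional and defaults to "ingress".
show_direction()
{
	local dev=$1; shift
	local pref=$1; shift
	local dir=$1

	echo "dev $dev ${dir:-ingress} pref $pref"
}

show_direction sw1p1 111          # -> dev sw1p1 ingress pref 111
show_direction sw1p1 222 egress   # -> dev sw1p1 egress pref 222
```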



[PATCH net-next 5/6] selftests: forwarding: Test multipath tunneling

2018-06-25 Thread Petr Machata
Add a GRE-tunneling test such that there are two tunnels involved, with
a multipath route listing both as next hops. Similarly to
router_multipath.sh, test that the distribution of traffic to the
tunnels honors the configured weights.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../selftests/net/forwarding/gre_multipath.sh  | 354 +
 1 file changed, 354 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/gre_multipath.sh

diff --git a/tools/testing/selftests/net/forwarding/gre_multipath.sh 
b/tools/testing/selftests/net/forwarding/gre_multipath.sh
new file mode 100755
index ..982cc8c23200
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/gre_multipath.sh
@@ -0,0 +1,354 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Test traffic distribution when a wECMP route forwards traffic to two GRE
+# tunnels.
+#
+# +-+
+# | H1  |
+# |   $h1 + |
+# |  192.0.2.1/28 | |
+# |  2001:db8:1::1/64 | |
+# +---|-+
+# |
+# +---|+
+# | SW1   ||
+# |  $ol1 +|
+# |  192.0.2.2/28  |
+# |  2001:db8:1::2/64  |
+# ||
+# |  + g1a (gre)  + g1b (gre)  |
+# |loc=192.0.2.65   loc=192.0.2.81 |
+# |rem=192.0.2.66 --.   rem=192.0.2.82 --. |
+# |tos=inherit  |   tos=inherit  | |
+# |  .--'| |
+# |  |.--' |
+# |  vv|
+# |  + $ul1.111 (vlan)+ $ul1.222 (vlan)|
+# |  | 192.0.2.129/28 | 192.0.2.145/28 |
+# |   \  / |
+# |\/  |
+# ||   |
+# |+ $ul1  |
+# +|---+
+#  |
+# +|---+
+# | SW2+ $ul2  |
+# | ___|   |
+# |/\  |
+# |   /  \ |
+# |  + $ul2.111 (vlan)+ $ul2.222 (vlan)|
+# |  ^ 192.0.2.130/28 ^ 192.0.2.146/28 |
+# |  |||
+# |  |'--. |
+# |  '--.| |
+# |  + g2a (gre)| + g2b (gre)| |
+# |loc=192.0.2.66   |   loc=192.0.2.82   | |
+# |rem=192.0.2.65 --'   rem=192.0.2.81 --' |
+# |tos=inherit  tos=inherit|
+# ||
+# |  $ol2 +|
+# | 192.0.2.17/28 ||
+# |  2001:db8:2::1/64 ||
+# +---|+
+# |
+# +---|-+
+# | H2| |
+# |   $h2 + |
+# | 192.0.2.18/28   |
+# |  2001:db8:2::2/64   |
+# +-+
+
+ALL_TESTS="
+   ping_ipv4
+   ping_ipv6
+   multipath_ipv4
+   multipath_ipv6
+   multipath_ipv6_l4
+"
+
+NUM_NETIFS=6
+source lib.sh
+
+h1_create()
+{
+   simple_if_init $h1 192.0.2.1/28 2001:db8:1::1/64
+   ip route add vrf v$h1 192.0.2.16/28 via 192.0.2.2
+   ip route add vrf v$h1 2001:db8:2::/64 via 2001:db8:1::2
+}
+
+h1_destroy()
+{
+   ip route del vrf v$h1 2001:db8:2::/64 via 2001:db8:1::2
+   ip route del vrf v$h1 192.0.2.16/28 via 192.0.2.2
+   simple_if_fini $h1 192.0.2.1/28
+}
+
+sw1_create()
+{
+   simple_if_init $ol1 192.0.2.2/28 2001:db8:1::2/64
+   __simple_if_init $ul1 v$ol1
+   vlan_create $ul1 111 v$ol1 192.0.2.129/28
+   vlan_create $ul1 222 v$ol1 192.0.2.145/28
+
+   tunnel_create g1a gre 192.0.2.65 192.0.2.66 tos inherit dev v$ol1
+   __simple_if_init g1a v$ol1 192.0.2.65/32
+   ip route add vrf v$ol1 192.0.2.66/32 via 192.0.2.130
+
+   tunnel_create g1b gre 192.0.2.81 192.0.2.82 tos inherit dev v$ol1
+   __simple_if_init g1b v$ol1 192.0.2.81/32
+   ip route add vrf v$ol1 192.0.2.82/32 via 192.0.2.146
+
+   ip route add vrf v$ol1 192.0.2.16/28 \
+  nexthop dev g1a \
+  nexthop dev g1b
+   ip route add vrf v$ol1 2001:db8:2::/64 \
+  nexthop dev g1a \
+  nexthop dev g1b
+
+   tc qdisc add dev $ul1 clsact
+   tc filter add dev $ul1 egress pref 111 prot 802.1q \
+  flower vlan_id 111 action pass
+   tc filter add dev $ul1 egress pref 222 prot 802.1q \
+  flower vlan_id 222 action pass
+}
+
+sw1_destroy()
+{
+   tc qdisc del dev $ul1 clsact
+
+   ip route del vrf v$ol1 2001:db8:2::/64
+   ip 

[PATCH net-next 4/6] selftests: forwarding: lib: Extract interface-init functions

2018-06-25 Thread Petr Machata
The function simple_if_init() does two things: it creates a VRF, then
moves an interface into this VRF and configures addresses. The latter
comes in handy when adding more interfaces into a VRF later on. The
situation is similar for simple_if_fini().

Therefore split the interface remastering and address de/initialization
logic to a new pair of helpers __simple_if_init() / __simple_if_fini(),
and defer to these helpers from simple_if_init() and simple_if_fini().

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 tools/testing/selftests/net/forwarding/lib.sh | 32 +--
 1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh 
b/tools/testing/selftests/net/forwarding/lib.sh
index f94ea4bafa13..1dfdf14894e2 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -287,6 +287,29 @@ __addr_add_del()
done
 }
 
+__simple_if_init()
+{
+   local if_name=$1; shift
+   local vrf_name=$1; shift
+   local addrs=("${@}")
+
+   ip link set dev $if_name master $vrf_name
+   ip link set dev $if_name up
+
+   __addr_add_del $if_name add "${addrs[@]}"
+}
+
+__simple_if_fini()
+{
+   local if_name=$1; shift
+   local addrs=("${@}")
+
+   __addr_add_del $if_name del "${addrs[@]}"
+
+   ip link set dev $if_name down
+   ip link set dev $if_name nomaster
+}
+
 simple_if_init()
 {
local if_name=$1
@@ -298,11 +321,8 @@ simple_if_init()
array=("${@}")
 
vrf_create $vrf_name
-   ip link set dev $if_name master $vrf_name
ip link set dev $vrf_name up
-   ip link set dev $if_name up
-
-   __addr_add_del $if_name add "${array[@]}"
+   __simple_if_init $if_name $vrf_name "${array[@]}"
 }
 
 simple_if_fini()
@@ -315,9 +335,7 @@ simple_if_fini()
vrf_name=v$if_name
array=("${@}")
 
-   __addr_add_del $if_name del "${array[@]}"
-
-   ip link set dev $if_name down
+   __simple_if_fini $if_name "${array[@]}"
vrf_destroy $vrf_name
 }
 
-- 
2.4.11



[PATCH net-next 1/6] selftests: forwarding: Move multipath_eval() to lib.sh

2018-06-25 Thread Petr Machata
This function will be useful for the GRE multipath test that is coming
later.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 tools/testing/selftests/net/forwarding/lib.sh  | 39 ++
 .../selftests/net/forwarding/router_multipath.sh   | 39 --
 2 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh 
b/tools/testing/selftests/net/forwarding/lib.sh
index 7b18a53aa556..7fae805147ae 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -557,6 +557,45 @@ tests_run()
done
 }
 
+multipath_eval()
+{
+   local desc="$1"
+   local weight_rp12=$2
+   local weight_rp13=$3
+   local packets_rp12=$4
+   local packets_rp13=$5
+   local weights_ratio packets_ratio diff
+
+   RET=0
+
+   if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then
+  check_err 1 "Packet difference is 0"
+  log_test "Multipath"
+  log_info "Expected ratio $weights_ratio"
+  return
+   fi
+
+   if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then
+   weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \
+  | bc -l)
+   packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \
+  | bc -l)
+   else
+   weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" | \
+  bc -l)
+   packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" | 
\
+  bc -l)
+   fi
+
+   diff=$(echo $weights_ratio - $packets_ratio | bc -l)
+   diff=${diff#-}
+
+   test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0
+   check_err $? "Too large discrepancy between expected and measured 
ratios"
+   log_test "$desc"
+   log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio"
+}
+
 ##
 # Tests
 
diff --git a/tools/testing/selftests/net/forwarding/router_multipath.sh 
b/tools/testing/selftests/net/forwarding/router_multipath.sh
index 8b6d0fb6d604..79a209927962 100755
--- a/tools/testing/selftests/net/forwarding/router_multipath.sh
+++ b/tools/testing/selftests/net/forwarding/router_multipath.sh
@@ -159,45 +159,6 @@ router2_destroy()
vrf_destroy "vrf-r2"
 }
 
-multipath_eval()
-{
-   local desc="$1"
-   local weight_rp12=$2
-   local weight_rp13=$3
-   local packets_rp12=$4
-   local packets_rp13=$5
-   local weights_ratio packets_ratio diff
-
-   RET=0
-
-   if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then
-  check_err 1 "Packet difference is 0"
-  log_test "Multipath"
-  log_info "Expected ratio $weights_ratio"
-  return
-   fi
-
-   if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then
-   weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \
-  | bc -l)
-   packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \
-  | bc -l)
-   else
-   weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" | \
-  bc -l)
-   packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" | 
\
-  bc -l)
-   fi
-
-   diff=$(echo $weights_ratio - $packets_ratio | bc -l)
-   diff=${diff#-}
-
-   test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0
-   check_err $? "Too large discrepancy between expected and measured 
ratios"
-   log_test "$desc"
-   log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio"
-}
-
 multipath4_test()
 {
local desc="$1"
-- 
2.4.11
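The acceptance criterion multipath_eval() encodes is |expected - measured| / expected <= 0.15. The same check can be written with scaled integer arithmetic; the sketch below is for illustration only, not what lib.sh does (it takes ratios pre-scaled by 100, so 2.00 becomes 200, avoiding the bc dependency):

```shell
#!/bin/bash

# Succeeds when the measured ratio is within 15% of the expected ratio.
# Both arguments are ratios scaled by 100, e.g. 2.00 -> 200.
ratio_within_15pct()
{
	local expected=$1; shift
	local measured=$1
	local diff=$((expected - measured))

	if [ "$diff" -lt 0 ]; then
		diff=$((-diff))
	fi
	# fail when diff/expected > 0.15, i.e. when 100*diff > 15*expected
	[ $((100 * diff)) -le $((15 * expected)) ]
}

ratio_within_15pct 200 210 && echo "within tolerance"   # 5% off
ratio_within_15pct 200 250 || echo "too far off"        # 25% off
```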



[PATCH net-next 2/6] selftests: forwarding: multipath_eval(): Improve style

2018-06-25 Thread Petr Machata
- Change the indentation of the function body from 7 spaces to one tab.
- Move initialization of weights_ratio up so that it can be referenced
  from the error message about packet difference being zero.
- Move |'s consistently to the continuation line, and reindent accordingly.

Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 tools/testing/selftests/net/forwarding/lib.sh | 74 ++-
 1 file changed, 39 insertions(+), 35 deletions(-)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh 
b/tools/testing/selftests/net/forwarding/lib.sh
index 7fae805147ae..911d753c4ff0 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -559,41 +559,45 @@ tests_run()
 
 multipath_eval()
 {
-   local desc="$1"
-   local weight_rp12=$2
-   local weight_rp13=$3
-   local packets_rp12=$4
-   local packets_rp13=$5
-   local weights_ratio packets_ratio diff
-
-   RET=0
-
-   if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then
-  check_err 1 "Packet difference is 0"
-  log_test "Multipath"
-  log_info "Expected ratio $weights_ratio"
-  return
-   fi
-
-   if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then
-   weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \
-  | bc -l)
-   packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \
-  | bc -l)
-   else
-   weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" | \
-  bc -l)
-   packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" | 
\
-  bc -l)
-   fi
-
-   diff=$(echo $weights_ratio - $packets_ratio | bc -l)
-   diff=${diff#-}
-
-   test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0
-   check_err $? "Too large discrepancy between expected and measured 
ratios"
-   log_test "$desc"
-   log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio"
+   local desc="$1"
+   local weight_rp12=$2
+   local weight_rp13=$3
+   local packets_rp12=$4
+   local packets_rp13=$5
+   local weights_ratio packets_ratio diff
+
+   RET=0
+
+   if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then
+   weights_ratio=$(echo "scale=2; $weight_rp12 / $weight_rp13" \
+   | bc -l)
+   else
+   weights_ratio=$(echo "scale=2; $weight_rp13 / $weight_rp12" \
+   | bc -l)
+   fi
+
+   if [[ "$packets_rp12" -eq "0" || "$packets_rp13" -eq "0" ]]; then
+  check_err 1 "Packet difference is 0"
+  log_test "Multipath"
+  log_info "Expected ratio $weights_ratio"
+  return
+   fi
+
+   if [[ "$weight_rp12" -gt "$weight_rp13" ]]; then
+   packets_ratio=$(echo "scale=2; $packets_rp12 / $packets_rp13" \
+   | bc -l)
+   else
+   packets_ratio=$(echo "scale=2; $packets_rp13 / $packets_rp12" \
+   | bc -l)
+   fi
+
+   diff=$(echo $weights_ratio - $packets_ratio | bc -l)
+   diff=${diff#-}
+
+   test "$(echo "$diff / $weights_ratio > 0.15" | bc -l)" -eq 0
+   check_err $? "Too large discrepancy between expected and measured 
ratios"
+   log_test "$desc"
+   log_info "Expected ratio $weights_ratio Measured ratio $packets_ratio"
 }
 
 ##
-- 
2.4.11



[PATCH net-next 0/6] Multipath tests for tunnel devices

2018-06-25 Thread Petr Machata
This patchset adds a test for ECMP and weighted ECMP between two GRE
tunnels.

In patches #1 and #2, the function multipath_eval() is first moved from
router_multipath.sh to lib.sh for ease of reuse, and then fixed up.

In patch #3, the function tc_rule_stats_get() is parameterized to be
useful for egress rules as well.

In patch #4, a new function __simple_if_init() is extracted from
simple_if_init(). This covers the logic that needs to be done for the
usual interface: VRF migration, upping and installation of IP addresses.

Patch #5 then adds the test itself.

Additionally in patch #6, a requirement to add diagrams to selftests is
documented.

Petr Machata (6):
  selftests: forwarding: Move multipath_eval() to lib.sh
  selftests: forwarding: multipath_eval(): Improve style
  selftests: forwarding: tc_rule_stats_get: Parameterize direction
  selftests: forwarding: lib: Extract interface-init functions
  selftests: forwarding: Test multipath tunneling
  selftests: forwarding: README: Require diagrams

 tools/testing/selftests/net/forwarding/README  |   2 +
 .../selftests/net/forwarding/gre_multipath.sh  | 354 +
 tools/testing/selftests/net/forwarding/lib.sh  |  80 -
 .../selftests/net/forwarding/router_multipath.sh   |  39 ---
 4 files changed, 427 insertions(+), 48 deletions(-)
 create mode 100755 tools/testing/selftests/net/forwarding/gre_multipath.sh

-- 
2.4.11



[PATCH v2 net-next 2/4] selftests: rtnetlink: use dummydev as a test device

2018-06-25 Thread Shannon Nelson
We really shouldn't mess with local system settings, so let's
use the already created dummy device instead for ipsec testing.
Oh, and let's put the temp file into a proper directory.

Signed-off-by: Shannon Nelson 
---
 tools/testing/selftests/net/rtnetlink.sh | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/net/rtnetlink.sh 
b/tools/testing/selftests/net/rtnetlink.sh
index 261a981..15948cf 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -523,21 +523,19 @@ kci_test_macsec()
 kci_test_ipsec()
 {
ret=0
-
-   # find an ip address on this machine and make up a destination
-   srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head 
-1 | cut -f1 -d/`
-   net=`echo $srcip | cut -f1-3 -d.`
-   base=`echo $srcip | cut -f4 -d.`
-   dstip="$net."`expr $base + 1`
-
algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 
128"
+   srcip=192.168.123.1
+   dstip=192.168.123.2
+   spi=7
+
+   ip addr add $srcip dev $devdummy
 
# flush to be sure there's nothing configured
ip x s flush ; ip x p flush
check_err $?
 
# start the monitor in the background
-   tmpfile=`mktemp ipsectestXXX`
+   tmpfile=`mktemp /var/run/ipsectestXXX`
mpid=`(ip x m > $tmpfile & echo $!) 2>/dev/null`
sleep 0.2
 
@@ -601,6 +599,7 @@ kci_test_ipsec()
check_err $?
ip x p flush
check_err $?
+   ip addr del $srcip/32 dev $devdummy
 
if [ $ret -ne 0 ]; then
echo "FAIL: ipsec"
-- 
2.7.4



[PATCH v2 net-next 4/4] selftests: rtnetlink: add ipsec offload API test

2018-06-25 Thread Shannon Nelson
Using the netdevsim as a device for testing, try out the XFRM commands
for setting up IPsec hardware offloads.

Signed-off-by: Shannon Nelson 
---
 tools/testing/selftests/net/rtnetlink.sh | 114 +++
 1 file changed, 114 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh 
b/tools/testing/selftests/net/rtnetlink.sh
index 15948cf..9e1a82e 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -608,6 +608,119 @@ kci_test_ipsec()
echo "PASS: ipsec"
 }
 
+#---
+# Example commands
+#   ip x s add proto esp src 14.0.0.52 dst 14.0.0.70 \
+#spi 0x07 mode transport reqid 0x07 replay-window 32 \
+#aead 'rfc4106(gcm(aes))' 1234567890123456dcba 128 \
+#sel src 14.0.0.52/24 dst 14.0.0.70/24
+#offload dev sim1 dir out
+#   ip x p add dir out src 14.0.0.52/24 dst 14.0.0.70/24 \
+#tmpl proto esp src 14.0.0.52 dst 14.0.0.70 \
+#spi 0x07 mode transport reqid 0x07
+#
+#---
+kci_test_ipsec_offload()
+{
+   ret=0
+   algo="aead rfc4106(gcm(aes)) 0x3132333435363738393031323334353664636261 128"
+   srcip=192.168.123.3
+   dstip=192.168.123.4
+   dev=simx1
+   sysfsd=/sys/kernel/debug/netdevsim/$dev
+   sysfsf=$sysfsd/ipsec
+
+   # setup netdevsim since dummydev doesn't have offload support
+   modprobe netdevsim
+   check_err $?
+   if [ $ret -ne 0 ]; then
+   echo "FAIL: ipsec_offload can't load netdevsim"
+   return 1
+   fi
+
+   ip link add $dev type netdevsim
+   ip addr add $srcip dev $dev
+   ip link set $dev up
+   if [ ! -d $sysfsd ] ; then
+   echo "FAIL: ipsec_offload can't create device $dev"
+   return 1
+   fi
+   if [ ! -f $sysfsf ] ; then
+   echo "FAIL: ipsec_offload netdevsim doesn't support IPsec offload"
+   return 1
+   fi
+
+   # flush to be sure there's nothing configured
+   ip x s flush ; ip x p flush
+
+   # create offloaded SAs, both in and out
+   ip x p add dir out src $srcip/24 dst $dstip/24 \
+   tmpl proto esp src $srcip dst $dstip spi 9 \
+   mode transport reqid 42
+   check_err $?
+   ip x p add dir out src $dstip/24 dst $srcip/24 \
+   tmpl proto esp src $dstip dst $srcip spi 9 \
+   mode transport reqid 42
+   check_err $?
+
+   ip x s add proto esp src $srcip dst $dstip spi 9 \
+   mode transport reqid 42 $algo sel src $srcip/24 dst $dstip/24 \
+   offload dev $dev dir out
+   check_err $?
+   ip x s add proto esp src $dstip dst $srcip spi 9 \
+   mode transport reqid 42 $algo sel src $dstip/24 dst $srcip/24 \
+   offload dev $dev dir in
+   check_err $?
+   if [ $ret -ne 0 ]; then
+   echo "FAIL: ipsec_offload can't create SA"
+   return 1
+   fi
+
+   # does offload show up in ip output
+   lines=`ip x s list | grep -c "crypto offload parameters: dev $dev dir"`
+   if [ $lines -ne 2 ] ; then
+   echo "FAIL: ipsec_offload SA offload missing from list output"
+   check_err 1
+   fi
+
+   # use ping to exercise the Tx path
+   ping -I $dev -c 3 -W 1 -i 0 $dstip >/dev/null
+
+   # does driver have correct offload info
+   diff $sysfsf - << EOF
+SA count=2 tx=3
+sa[0] tx ipaddr=0x00000000 00000000 00000000 00000000
+sa[0]    spi=0x00000009 proto=0x32 salt=0x61626364 crypt=1
+sa[0]    key=0x34333231 38373635 32313039 36353433
+sa[1] rx ipaddr=0x00000000 00000000 00000000 037ba8c0
+sa[1]    spi=0x00000009 proto=0x32 salt=0x61626364 crypt=1
+sa[1]    key=0x34333231 38373635 32313039 36353433
+EOF
+   if [ $? -ne 0 ] ; then
+   echo "FAIL: ipsec_offload incorrect driver data"
+   check_err 1
+   fi
+
+   # does offload get removed from driver
+   ip x s flush
+   ip x p flush
+   lines=`grep -c "SA count=0" $sysfsf`
+   if [ $lines -ne 1 ] ; then
+   echo "FAIL: ipsec_offload SA not removed from driver"
+   check_err 1
+   fi
+
+   # clean up any leftovers
+   ip link del $dev
+   rmmod netdevsim
+
+   if [ $ret -ne 0 ]; then
+   echo "FAIL: ipsec_offload"
+   return 1
+   fi
+   echo "PASS: ipsec_offload"
+}
+
 kci_test_gretap()
 {
testns="testns"
@@ -862,6 +975,7 @@ kci_test_rtnl()
kci_test_encap
kci_test_macsec
kci_test_ipsec
+   kci_test_ipsec_offload
 
kci_del_dummy
 }
-- 
2.7.4



[PATCH v2 net-next 3/4] netdevsim: add ipsec offload testing

2018-06-25 Thread Shannon Nelson
Implement the IPsec/XFRM offload API for testing.

Signed-off-by: Shannon Nelson 
---
V2 - addressed formatting comments from Jakub Kicinski

 drivers/net/netdevsim/Makefile|   4 +
 drivers/net/netdevsim/ipsec.c | 298 ++
 drivers/net/netdevsim/netdev.c|   7 +
 drivers/net/netdevsim/netdevsim.h |  41 ++
 4 files changed, 350 insertions(+)
 create mode 100644 drivers/net/netdevsim/ipsec.c

diff --git a/drivers/net/netdevsim/Makefile b/drivers/net/netdevsim/Makefile
index 449b2a1..0fee1d0 100644
--- a/drivers/net/netdevsim/Makefile
+++ b/drivers/net/netdevsim/Makefile
@@ -13,3 +13,7 @@ endif
 ifneq ($(CONFIG_NET_DEVLINK),)
 netdevsim-objs += devlink.o fib.o
 endif
+
+ifneq ($(CONFIG_XFRM_OFFLOAD),)
+netdevsim-objs += ipsec.o
+endif
diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c
new file mode 100644
index 000..1a23426
--- /dev/null
+++ b/drivers/net/netdevsim/ipsec.c
@@ -0,0 +1,298 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2018 Oracle and/or its affiliates. All rights reserved. */
+
+#include 
+#include 
+#include 
+
+#include "netdevsim.h"
+
+#define NSIM_IPSEC_AUTH_BITS   128
+
+static ssize_t nsim_dbg_netdev_ops_read(struct file *filp,
+   char __user *buffer,
+   size_t count, loff_t *ppos)
+{
+   struct netdevsim *ns = filp->private_data;
+   struct nsim_ipsec *ipsec = &ns->ipsec;
+   size_t bufsize;
+   char *buf, *p;
+   int len;
+   int i;
+
+   /* the buffer needed is
+* (num SAs * 3 lines each * ~60 bytes per line) + one more line
+*/
+   bufsize = (ipsec->count * 4 * 60) + 60;
+   buf = kzalloc(bufsize, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   p = buf;
+   p += snprintf(p, bufsize - (p - buf),
+ "SA count=%u tx=%u\n",
+ ipsec->count, ipsec->tx);
+
+   for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+   struct nsim_sa *sap = >sa[i];
+
+   if (!sap->used)
+   continue;
+
+   p += snprintf(p, bufsize - (p - buf),
+ "sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n",
+ i, (sap->rx ? 'r' : 't'), sap->ipaddr[0],
+ sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]);
+   p += snprintf(p, bufsize - (p - buf),
+ "sa[%i]    spi=0x%08x proto=0x%x salt=0x%08x crypt=%d\n",
+ i, be32_to_cpu(sap->xs->id.spi),
+ sap->xs->id.proto, sap->salt, sap->crypt);
+   p += snprintf(p, bufsize - (p - buf),
+ "sa[%i]    key=0x%08x %08x %08x %08x\n",
+ i, sap->key[0], sap->key[1],
+ sap->key[2], sap->key[3]);
+   }
+
+   len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf);
+
+   kfree(buf);
+   return len;
+}
+
+static const struct file_operations ipsec_dbg_fops = {
+   .owner = THIS_MODULE,
+   .open = simple_open,
+   .read = nsim_dbg_netdev_ops_read,
+};
+
+static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec)
+{
+   u32 i;
+
+   if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT)
+   return -ENOSPC;
+
+   /* search sa table */
+   for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+   if (!ipsec->sa[i].used)
+   return i;
+   }
+
+   return -ENOSPC;
+}
+
+static int nsim_ipsec_parse_proto_keys(struct xfrm_state *xs,
+  u32 *mykey, u32 *mysalt)
+{
+   const char aes_gcm_name[] = "rfc4106(gcm(aes))";
+   struct net_device *dev = xs->xso.dev;
+   unsigned char *key_data;
+   char *alg_name = NULL;
+   int key_len;
+
+   if (!xs->aead) {
+   netdev_err(dev, "Unsupported IPsec algorithm\n");
+   return -EINVAL;
+   }
+
+   if (xs->aead->alg_icv_len != NSIM_IPSEC_AUTH_BITS) {
+   netdev_err(dev, "IPsec offload requires %d bit authentication\n",
+  NSIM_IPSEC_AUTH_BITS);
+   return -EINVAL;
+   }
+
+   key_data = &xs->aead->alg_key[0];
+   key_len = xs->aead->alg_key_len;
+   alg_name = xs->aead->alg_name;
+
+   if (strcmp(alg_name, aes_gcm_name)) {
+   netdev_err(dev, "Unsupported IPsec algorithm - please use %s\n",
+  aes_gcm_name);
+   return -EINVAL;
+   }
+
+   /* 160 accounts for 16 byte key and 4 byte salt */
+   if (key_len > NSIM_IPSEC_AUTH_BITS) {
+   *mysalt = ((u32 *)key_data)[4];
+   } else if (key_len == NSIM_IPSEC_AUTH_BITS) {
+   *mysalt = 0;
+   } else {
+   netdev_err(dev, "IPsec hw offload only supports 128 bit keys 
with optional 32 bit 

[PATCH v2 net-next 0/4] Updates for ipsec selftests

2018-06-25 Thread Shannon Nelson
Fix up the existing ipsec selftest and add tests for
the ipsec offload driver API.

v2: addressed formatting nits in netdevsim from Jakub Kicinski

Shannon Nelson (4):
  selftests: rtnetlink: clear the return code at start of ipsec test
  selftests: rtnetlink: use dummydev as a test device
  netdevsim: add ipsec offload testing
  selftests: rtnetlink: add ipsec offload API test

 drivers/net/netdevsim/Makefile   |   4 +
 drivers/net/netdevsim/ipsec.c| 345 +++
 drivers/net/netdevsim/netdev.c   |   7 +
 drivers/net/netdevsim/netdevsim.h|  37 
 tools/testing/selftests/net/rtnetlink.sh | 132 +++-
 5 files changed, 518 insertions(+), 7 deletions(-)
 create mode 100644 drivers/net/netdevsim/ipsec.c

-- 
2.7.4



[PATCH v2 net-next 1/4] selftests: rtnetlink: clear the return code at start of ipsec test

2018-06-25 Thread Shannon Nelson
Following the custom from the other functions, clear the global
ret code before starting the test so as to not have previously
failed tests cause us to think this test has failed.

Reported-by: Anders Roxell 
Signed-off-by: Shannon Nelson 
---
 tools/testing/selftests/net/rtnetlink.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/net/rtnetlink.sh b/tools/testing/selftests/net/rtnetlink.sh
index b33a371..261a981 100755
--- a/tools/testing/selftests/net/rtnetlink.sh
+++ b/tools/testing/selftests/net/rtnetlink.sh
@@ -522,6 +522,8 @@ kci_test_macsec()
 #---
 kci_test_ipsec()
 {
+   ret=0
+
# find an ip address on this machine and make up a destination
	srcip=`ip -o addr | awk '/inet / { print $4; }' | grep -v "^127" | head -1 | cut -f1 -d/`
net=`echo $srcip | cut -f1-3 -d.`
-- 
2.7.4



[PATCH v2] fib_rules: match rules based on suppress_* properties too

2018-06-25 Thread Jason A. Donenfeld
Two rules with different values of suppress_prefix or suppress_ifgroup
are not the same. This fixes an -EEXIST when running:

   $ ip -4 rule add table main suppress_prefixlength 0

Signed-off-by: Jason A. Donenfeld 
Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")
---
This adds the new condition you mentioned. I'm not sure what you make of
DaveM's remark about this not being in the original code, but here is
nonetheless the requested change.

 net/core/fib_rules.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 126ffc5bc630..bc8425d81022 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -416,6 +416,14 @@ static struct fib_rule *rule_find(struct fib_rules_ops 
*ops,
if (rule->mark && r->mark != rule->mark)
continue;
 
+   if (rule->suppress_ifgroup != -1 &&
+   r->suppress_ifgroup != rule->suppress_ifgroup)
+   continue;
+
+   if (rule->suppress_prefixlen != -1 &&
+   r->suppress_prefixlen != rule->suppress_prefixlen)
+   continue;
+
if (rule->mark_mask && r->mark_mask != rule->mark_mask)
continue;
 
-- 
2.17.1



Re: [PATCH net 2/2] nfp: reject binding to shared blocks

2018-06-25 Thread kbuild test robot
Hi John,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net/master]

url:
https://github.com/0day-ci/linux/commits/Jakub-Kicinski/nfp-MPLS-and-shared-blocks-TC-offload-fixes/20180626-042358
config: i386-randconfig-x001-201825 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/netronome/nfp/flower/offload.c: In function 
'nfp_flower_setup_tc_block':
>> drivers/net/ethernet/netronome/nfp/flower/offload.c:634:6: error: implicit declaration of function 'tcf_block_shared'; did you mean 'tcf_block_dev'? [-Werror=implicit-function-declaration]
 if (tcf_block_shared(f->block))
 ^~~~
 tcf_block_dev
   cc1: some warnings being treated as errors

vim +634 drivers/net/ethernet/netronome/nfp/flower/offload.c

   625  
   626  static int nfp_flower_setup_tc_block(struct net_device *netdev,
   627   struct tc_block_offload *f)
   628  {
   629  struct nfp_repr *repr = netdev_priv(netdev);
   630  
   631  if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
   632  return -EOPNOTSUPP;
   633  
 > 634  if (tcf_block_shared(f->block))
   635  return -EOPNOTSUPP;
   636  
   637  switch (f->command) {
   638  case TC_BLOCK_BIND:
   639  return tcf_block_cb_register(f->block,
   640   nfp_flower_setup_tc_block_cb,
   641   repr, repr);
   642  case TC_BLOCK_UNBIND:
   643  tcf_block_cb_unregister(f->block,
   644  nfp_flower_setup_tc_block_cb,
   645  repr);
   646  return 0;
   647  default:
   648  return -EOPNOTSUPP;
   649  }
   650  }
   651  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[PATCH 3/4] net: lan78xx: Add support for VLAN tag stripping.

2018-06-25 Thread Dave Stevenson
The chip supports stripping the VLAN tag and reporting it
in metadata.
Complete the support for this.

Signed-off-by: Dave Stevenson 
---
 drivers/net/usb/lan78xx.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index afe7fa3..f72a8f5 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -64,6 +64,7 @@
 #define DEFAULT_RX_CSUM_ENABLE (true)
 #define DEFAULT_TSO_CSUM_ENABLE(true)
 #define DEFAULT_VLAN_FILTER_ENABLE (true)
+#define DEFAULT_VLAN_RX_OFFLOAD(true)
 #define TX_OVERHEAD(8)
 #define RXW_PADDING2
 
@@ -2363,6 +2364,11 @@ static int lan78xx_set_features(struct net_device *netdev,
pdata->rfe_ctl &= ~(RFE_CTL_ICMP_COE_ | RFE_CTL_IGMP_COE_);
}
 
+   if (features & NETIF_F_HW_VLAN_CTAG_RX)
+   pdata->rfe_ctl |= RFE_CTL_VLAN_STRIP_;
+   else
+   pdata->rfe_ctl &= ~RFE_CTL_VLAN_STRIP_;
+
if (features & NETIF_F_HW_VLAN_CTAG_FILTER)
pdata->rfe_ctl |= RFE_CTL_VLAN_FILTER_;
else
@@ -2976,6 +2982,9 @@ static int lan78xx_bind(struct lan78xx_net *dev, struct usb_interface *intf)
if (DEFAULT_TSO_CSUM_ENABLE)
dev->net->features |= NETIF_F_TSO | NETIF_F_TSO6 | NETIF_F_SG;
 
+   if (DEFAULT_VLAN_RX_OFFLOAD)
+   dev->net->features |= NETIF_F_HW_VLAN_CTAG_RX;
+
if (DEFAULT_VLAN_FILTER_ENABLE)
dev->net->features |= NETIF_F_HW_VLAN_CTAG_FILTER;
 
@@ -3052,6 +3061,16 @@ static void lan78xx_rx_csum_offload(struct lan78xx_net *dev,
}
 }
 
+static void lan78xx_rx_vlan_offload(struct lan78xx_net *dev,
+   struct sk_buff *skb,
+   u32 rx_cmd_a, u32 rx_cmd_b)
+{
+   if ((dev->net->features & NETIF_F_HW_VLAN_CTAG_RX) &&
+   (rx_cmd_a & RX_CMD_A_FVTG_))
+   __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
+  (rx_cmd_b & 0xFFFF));
+}
+
 static void lan78xx_skb_return(struct lan78xx_net *dev, struct sk_buff *skb)
 {
int status;
@@ -3116,6 +3135,8 @@ static int lan78xx_rx(struct lan78xx_net *dev, struct sk_buff *skb)
if (skb->len == size) {
lan78xx_rx_csum_offload(dev, skb,
rx_cmd_a, rx_cmd_b);
+   lan78xx_rx_vlan_offload(dev, skb,
+   rx_cmd_a, rx_cmd_b);
 
skb_trim(skb, skb->len - 4); /* remove fcs */
skb->truesize = size + sizeof(struct sk_buff);
@@ -3134,6 +3155,7 @@ static int lan78xx_rx(struct lan78xx_net *dev, struct sk_buff *skb)
skb_set_tail_pointer(skb2, size);
 
lan78xx_rx_csum_offload(dev, skb2, rx_cmd_a, rx_cmd_b);
+   lan78xx_rx_vlan_offload(dev, skb2, rx_cmd_a, rx_cmd_b);
 
skb_trim(skb2, skb2->len - 4); /* remove fcs */
skb2->truesize = size + sizeof(struct sk_buff);
-- 
2.7.4



Re: [PATCH v2 net-next] net/sched: add skbprio scheduler

2018-06-25 Thread Nishanth Devarajan
On Sat, Jun 23, 2018 at 02:43:16PM -0700, Cong Wang wrote:
> On Sat, Jun 23, 2018 at 1:47 PM, Nishanth Devarajan  
> wrote:
> > diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
> > index 37b5096..6fd07e8 100644
> > --- a/include/uapi/linux/pkt_sched.h
> > +++ b/include/uapi/linux/pkt_sched.h
> ...
> > +#define SKBPRIO_MAX_PRIORITY 64
> > +
> > +struct tc_skbprio_qopt {
> > +   __u32   limit;  /* Queue length in packets. */
> > +};
> 
> 
> Since this is just an integer, you can just make it NLA_U32 instead
> of a struct?
> 
>

Making it NLA_U32, wouldn't that be incurring a nla_policy struct in the
code? I also feel uneasy that we'd be straying from the convention of having
a tc qopt struct to pass in essential parameters from userspace.

> > +static int skbprio_change(struct Qdisc *sch, struct nlattr *opt,
> > +   struct netlink_ext_ack *extack)
> > +{
> > +   struct skbprio_sched_data *q = qdisc_priv(sch);
> > +   struct tc_skbprio_qopt *ctl = nla_data(opt);
> > +   const unsigned int min_limit = 1;
> > +
> > +   if (ctl->limit == (typeof(ctl->limit))-1)
> > +   q->max_limit = max(qdisc_dev(sch)->tx_queue_len, min_limit);
> > +   else if (ctl->limit < min_limit ||
> > +   ctl->limit > qdisc_dev(sch)->tx_queue_len)
> > +   return -EINVAL;
> > +   else
> > +   q->max_limit = ctl->limit;
> > +
> > +   return 0;
> > +}
> 
> Isn't q->max_limit same with sch->limit?
>

q->max_limit was intended to represent the maximum limit that Skbprio could
accommodate, i.e. the tx queue len of the device attached to the qdisc, to
check the limit parameter passed from userspace. I'll correct this in v3.
 
> Also, please avoid dev->tx_queue_len here, it may change
> independently of your qdisc change, unless you want to implement
> ops->change_tx_queue_len().

OK, will make this change.


[PATCH net-next] selftests: forwarding: mirror_gre_vlan_bridge_1q: Unset rp_filter

2018-06-25 Thread Petr Machata
The IP addresses of tunnel endpoint at H3 are set at the VLAN device
$h3.555. Therefore when test_gretap_untagged_egress() sets vlan 555 to
egress untagged at $swp3, $h3's rp_filter rejects these packets. The
test then spuriously fails.

Therefore turn off net.ipv4.conf.{all, $h3}.rp_filter.

Fixes: 9c7c8a82442c ("selftests: forwarding: mirror_gre_vlan_bridge_1q: Add more tests")
Signed-off-by: Petr Machata 
Reviewed-by: Ido Schimmel 
---
 .../selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh| 9 +
 1 file changed, 9 insertions(+)

diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh b/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
index 5dbc7a08f4bd..1ac5038ae256 100755
--- a/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_vlan_bridge_1q.sh
@@ -39,6 +39,12 @@ setup_prepare()
swp3=${NETIFS[p5]}
h3=${NETIFS[p6]}
 
+   # gt4's remote address is at $h3.555, not $h3. Thus the packets arriving
+   # directly to $h3 for test_gretap_untagged_egress() are rejected by
+   # rp_filter and the test spuriously fails.
+   sysctl_set net.ipv4.conf.all.rp_filter 0
+   sysctl_set net.ipv4.conf.$h3.rp_filter 0
+
vrf_prepare
mirror_gre_topo_create
 
@@ -65,6 +71,9 @@ cleanup()
 
mirror_gre_topo_destroy
vrf_cleanup
+
+   sysctl_restore net.ipv4.conf.$h3.rp_filter
+   sysctl_restore net.ipv4.conf.all.rp_filter
 }
 
 test_vlan_match()
-- 
2.4.11



Re: [PATCH net-next 3/4] netdevsim: add ipsec offload testing

2018-06-25 Thread Jakub Kicinski
On Mon, 25 Jun 2018 15:37:10 -0700, Shannon Nelson wrote:
> On 6/22/2018 9:07 PM, Jakub Kicinski wrote:
> > On Fri, 22 Jun 2018 17:31:37 -0700, Shannon Nelson wrote:  
> >> Implement the IPsec/XFRM offload API for testing.
> >>
> >> Signed-off-by: Shannon Nelson   
> >> +#define NSIM_IPSEC_AUTH_BITS  128
> >> +
> >> +/**
> >> + * nsim_ipsec_dbg_read - read for ipsec data
> >> + * @filp: the opened file
> >> + * @buffer: where to write the data for the user to read
> >> + * @count: the size of the user's buffer
> >> + * @ppos: file position offset
> >> + **/
> >> +static ssize_t nsim_dbg_netdev_ops_read(struct file *filp,  
> > 
> > Doesn't match the kdoc.  Please run
> > 
> > ./scripts/kernel-doc -none $file
> > 
> > if you want kdoc.  Although IMHO you may as well drop the kdoc, your
> > code is quite self explanatory and local.  
> 
> By adding -v to that I got a couple of warnings that I didn't include 
> the Return information - is that what you were commenting on?  The rest 
> seems acceptable to the script I'm using from the net-next tree.

Hm, strange.  Two things: first kdoc requires () after function name,
second the function is called nsim_dbg_netdev_ops_read() while the doc
refers to nsim_ipsec_dbg_read().  Perhaps the combination of the two
makes the script miss the problem. 

> >> +  char __user *buffer,
> >> +  size_t count, loff_t *ppos)
> >> +{
> >> +  struct netdevsim *ns = filp->private_data;
> >> +  struct nsim_ipsec *ipsec = &ns->ipsec;
> >> +  size_t bufsize;
> >> +  char *buf, *p;
> >> +  int len;
> >> +  int i;
> >> +
> >> +  /* don't allow partial reads */
> >> +  if (*ppos != 0)
> >> +  return 0;
> >> +
> >> +  /* the buffer needed is
> >> +   * (num SAs * 3 lines each * ~60 bytes per line) + one more line
> >> +   */
> >> +  bufsize = (ipsec->count * 4 * 60) + 60;
> >> +  buf = kzalloc(bufsize, GFP_KERNEL);
> >> +  if (!buf)
> >> +  return -ENOMEM;
> >> +
> >> +  p = buf;
> >> +  p += snprintf(p, bufsize - (p - buf),
> >> +"SA count=%u tx=%u\n",
> >> +ipsec->count, ipsec->tx);
> >> +
> >> +  for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
> >> +  struct nsim_sa *sap = &ipsec->sa[i];
> >> +
> >> +  if (!sap->used)
> >> +  continue;
> >> +
> >> +  p += snprintf(p, bufsize - (p - buf),
> >> +"sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n",
> >> +i, (sap->rx ? 'r' : 't'), sap->ipaddr[0],
> >> +sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]);
> >> +  p += snprintf(p, bufsize - (p - buf),
> >> +"sa[%i]spi=0x%08x proto=0x%x salt=0x%08x 
> >> crypt=%d\n",
> >> +i, be32_to_cpu(sap->xs->id.spi),
> >> +sap->xs->id.proto, sap->salt, sap->crypt);
> >> +  p += snprintf(p, bufsize - (p - buf),
> >> +"sa[%i]key=0x%08x %08x %08x %08x\n",
> >> +i, sap->key[0], sap->key[1],
> >> +sap->key[2], sap->key[3]);
> >> +  }
> >> +
> >> +  len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf);  
> > 
> > Why not seq_file for this?  
> 
> Why bother with more interface code?  This is useful enough to support 
> the API testing needed.

No objection on this, seq_file is less error prone, but I don't mind.
FWIW you can drop the *ppos == 0 requirement, simple_read_from_buffer()
will handle other cases just fine.

> >> +  kfree(buf);
> >> +  return len;
> >> +}
> >> +
> >> +static const struct file_operations ipsec_dbg_fops = {
> >> +  .owner = THIS_MODULE,
> >> +  .open = simple_open,
> >> +  .read = nsim_dbg_netdev_ops_read,
> >> +};
> >> +
> >> +/**
> >> + * nsim_ipsec_find_empty_idx - find the first unused security parameter 
> >> index
> >> + * @ipsec: pointer to ipsec struct
> >> + **/
> >> +static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec)
> >> +{
> >> +  u32 i;
> >> +
> >> +  if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT)
> >> +  return -ENOSPC;
> >> +
> >> +  /* search sa table */
> >> +  for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
> >> +  if (!ipsec->sa[i].used)
> >> +  return i;
> >> +  }
> >> +
> >> +  return -ENOSPC;  
> > 
> > FWIW I personally find bitmaps and find_first_zero_bit() etc. nice and
> > concise for a small ID allocator, but no objection to open coding.  
> 
> Sure, we could add a parallel bitmap data structure to track usage of 
> our array elements, and probably would for a much larger array so as to 
> lessen the impact of a serial search.  But, since this is a short array 
> for simple testing purposes, the search time is minimal so I think the 
> simple code is fine.

Ack, no objection.

> >   
> >> +  } else if (key_len == 128) {
> >> +  *mysalt = 0;
> >> +  } else {
> >> +  netdev_err(dev, "IPsec hw offload only supports 128 bit 

[net-next PATCH v4 7/7] Documentation: Add explanation for XPS using Rx-queue(s) map

2018-06-25 Thread Amritha Nambiar
Signed-off-by: Amritha Nambiar 
---
 Documentation/ABI/testing/sysfs-class-net-queues |   11 
 Documentation/networking/scaling.txt |   57 ++
 2 files changed, 58 insertions(+), 10 deletions(-)

diff --git a/Documentation/ABI/testing/sysfs-class-net-queues b/Documentation/ABI/testing/sysfs-class-net-queues
index 0c0df91..978b763 100644
--- a/Documentation/ABI/testing/sysfs-class-net-queues
+++ b/Documentation/ABI/testing/sysfs-class-net-queues
@@ -42,6 +42,17 @@ Description:
network device transmit queue. Possible vaules depend on the
number of available CPU(s) in the system.
 
+What:  /sys/class//queues/tx-/xps_rxqs
+Date:  June 2018
+KernelVersion: 4.18.0
+Contact:   netdev@vger.kernel.org
+Description:
+   Mask of the receive queue(s) currently enabled to participate
+   into the Transmit Packet Steering packet processing flow for this
+   network device transmit queue. Possible values depend on the
+   number of available receive queue(s) in the network device.
+   Default is disabled.
+
 What:  /sys/class//queues/tx-/byte_queue_limits/hold_time
 Date:  November 2011
 KernelVersion: 3.3
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index f55639d..8336116 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -366,8 +366,13 @@ XPS: Transmit Packet Steering
 
 Transmit Packet Steering is a mechanism for intelligently selecting
 which transmit queue to use when transmitting a packet on a multi-queue
-device. To accomplish this, a mapping from CPU to hardware queue(s) is
-recorded. The goal of this mapping is usually to assign queues
+device. This can be accomplished by recording two kinds of maps, either
+a mapping of CPU to hardware queue(s) or a mapping of receive queue(s)
+to hardware transmit queue(s).
+
+1. XPS using CPUs map
+
+The goal of this mapping is usually to assign queues
 exclusively to a subset of CPUs, where the transmit completions for
 these queues are processed on a CPU within this set. This choice
 provides two benefits. First, contention on the device queue lock is
@@ -377,12 +382,35 @@ transmit queue). Secondly, cache miss rate on transmit completion is
 reduced, in particular for data cache lines that hold the sk_buff
 structures.
 
-XPS is configured per transmit queue by setting a bitmap of CPUs that
-may use that queue to transmit. The reverse mapping, from CPUs to
-transmit queues, is computed and maintained for each network device.
-When transmitting the first packet in a flow, the function
-get_xps_queue() is called to select a queue. This function uses the ID
-of the running CPU as a key into the CPU-to-queue lookup table. If the
+2. XPS using receive queues map
+
+This mapping is used to pick transmit queue based on the receive
+queue(s) map configuration set by the administrator. A set of receive
+queues can be mapped to a set of transmit queues (many:many), although
+the common use case is a 1:1 mapping. This will enable sending packets
+on the same queue associations for transmit and receive. This is useful for
+busy polling multi-threaded workloads where there are challenges in
+associating a given CPU to a given application thread. The application
+threads are not pinned to CPUs and each thread handles packets
+received on a single queue. The receive queue number is cached in the
+socket for the connection. In this model, sending the packets on the same
+transmit queue corresponding to the associated receive queue has benefits
+in keeping the CPU overhead low. Transmit completion work is locked into
+the same queue-association that a given application is polling on. This
+avoids the overhead of triggering an interrupt on another CPU. When the
+application cleans up the packets during the busy poll, transmit completion
+may be processed along with it in the same thread context and so result in
+reduced latency.
+
+XPS is configured per transmit queue by setting a bitmap of
+CPUs/receive-queues that may use that queue to transmit. The reverse
+mapping, from CPUs to transmit queues or from receive-queues to transmit
+queues, is computed and maintained for each network device. When
+transmitting the first packet in a flow, the function get_xps_queue() is
+called to select a queue. This function uses the ID of the receive queue
+for the socket connection for a match in the receive queue-to-transmit queue
+lookup table. Alternatively, this function can also use the ID of the
+running CPU as a key into the CPU-to-queue lookup table. If the
 ID matches a single queue, that is used for transmission. If multiple
 queues match, one is selected by using the flow hash to compute an index
 into the set.
@@ -404,11 +432,15 @@ acknowledged.
 
 XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
 default for SMP). 

[net-next PATCH v4 6/7] net-sysfs: Add interface for Rx queue(s) map per Tx queue

2018-06-25 Thread Amritha Nambiar
Extend transmit queue sysfs attribute to configure Rx queue(s) map
per Tx queue. By default no receive queues are configured for the
Tx queue.

- /sys/class/net/eth0/queues/tx-*/xps_rxqs

Signed-off-by: Amritha Nambiar 
---
 net/core/net-sysfs.c |   81 ++
 1 file changed, 81 insertions(+)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index b39987c..5d2ed02 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1283,6 +1283,86 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
 
 static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init
= __ATTR_RW(xps_cpus);
+
+static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf)
+{
+   struct net_device *dev = queue->dev;
+   struct xps_dev_maps *dev_maps;
+   unsigned long *mask, index;
+   int j, len, num_tc = 1, tc = 0;
+
+   mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+  GFP_KERNEL);
+   if (!mask)
+   return -ENOMEM;
+
+   index = get_netdev_queue_index(queue);
+
+   if (dev->num_tc) {
+   num_tc = dev->num_tc;
+   tc = netdev_txq_to_tc(dev, index);
+   if (tc < 0)
+   return -EINVAL;
+   }
+
+   rcu_read_lock();
+   dev_maps = rcu_dereference(dev->xps_rxqs_map);
+   if (dev_maps) {
+   for (j = -1; j = attrmask_next(j, NULL, dev->num_rx_queues),
+j < dev->num_rx_queues;) {
+   int i, tci = j * num_tc + tc;
+   struct xps_map *map;
+
+   map = rcu_dereference(dev_maps->attr_map[tci]);
+   if (!map)
+   continue;
+
+   for (i = map->len; i--;) {
+   if (map->queues[i] == index) {
+   set_bit(j, mask);
+   break;
+   }
+   }
+   }
+   }
+
+   len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues);
+   rcu_read_unlock();
+   kfree(mask);
+
+   return len < PAGE_SIZE ? len : -EINVAL;
+}
+
+static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf,
+ size_t len)
+{
+   struct net_device *dev = queue->dev;
+   unsigned long *mask, index;
+   int err;
+
+   if (!capable(CAP_NET_ADMIN))
+   return -EPERM;
+
+   mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long),
+  GFP_KERNEL);
+   if (!mask)
+   return -ENOMEM;
+
+   index = get_netdev_queue_index(queue);
+
+   err = bitmap_parse(buf, len, mask, dev->num_rx_queues);
+   if (err) {
+   kfree(mask);
+   return err;
+   }
+
+   err = __netif_set_xps_queue(dev, mask, index, true);
+   kfree(mask);
+   return err ? : len;
+}
+
+static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init
+   = __ATTR_RW(xps_rxqs);
 #endif /* CONFIG_XPS */
 
 static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
@@ -1290,6 +1370,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = {
	&queue_traffic_class.attr,
 #ifdef CONFIG_XPS
	&xps_cpus_attribute.attr,
+	&xps_rxqs_attribute.attr,
	&queue_tx_maxrate.attr,
 #endif
NULL



[net-next PATCH v4 4/7] net: Record receive queue number for a connection

2018-06-25 Thread Amritha Nambiar
This patch adds a new field to sock_common 'skc_rx_queue_mapping'
which holds the receive queue number for the connection. The Rx queue
is marked in tcp_finish_connect() to allow a client app to do
SO_INCOMING_NAPI_ID after a connect() call to get the right queue
association for a socket. Rx queue is also marked in tcp_conn_request()
to allow syn-ack to go on the right tx-queue associated with
the queue on which syn is received.

Signed-off-by: Amritha Nambiar 
Signed-off-by: Sridhar Samudrala 
---
 include/net/busy_poll.h |1 +
 include/net/sock.h  |   14 ++
 net/core/sock.c |4 
 net/ipv4/tcp_input.c|3 +++
 4 files changed, 22 insertions(+)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index c518743..9e36fda6 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -151,6 +151,7 @@ static inline void sk_mark_napi_id(struct sock *sk, const struct sk_buff *skb)
 #ifdef CONFIG_NET_RX_BUSY_POLL
sk->sk_napi_id = skb->napi_id;
 #endif
+   sk_rx_queue_set(sk, skb);
 }
 
 /* variant used for unconnected sockets */
diff --git a/include/net/sock.h b/include/net/sock.h
index 009fd30..0ff4416 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -139,6 +139,7 @@ typedef __u64 __bitwise __addrpair;
  * @skc_node: main hash linkage for various protocol lookup tables
  * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  * @skc_tx_queue_mapping: tx queue number for this connection
+ * @skc_rx_queue_mapping: rx queue number for this connection
  * @skc_flags: place holder for sk_flags
  * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
  * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
@@ -215,6 +216,9 @@ struct sock_common {
struct hlist_nulls_node skc_nulls_node;
};
unsigned short  skc_tx_queue_mapping;
+#ifdef CONFIG_XPS
+   unsigned short  skc_rx_queue_mapping;
+#endif
union {
int skc_incoming_cpu;
u32 skc_rcv_wnd;
@@ -326,6 +330,9 @@ struct sock {
 #define sk_nulls_node  __sk_common.skc_nulls_node
 #define sk_refcnt  __sk_common.skc_refcnt
 #define sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping
+#ifdef CONFIG_XPS
+#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping
+#endif
 
 #define sk_dontcopy_begin  __sk_common.skc_dontcopy_begin
 #define sk_dontcopy_end__sk_common.skc_dontcopy_end
@@ -1696,6 +1703,13 @@ static inline int sk_tx_queue_get(const struct sock *sk)
return sk ? sk->sk_tx_queue_mapping - 1 : -1;
 }
 
+static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
+{
+#ifdef CONFIG_XPS
+   sk->sk_rx_queue_mapping = skb_get_rx_queue(skb) + 1;
+#endif
+}
+
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
sk_tx_queue_clear(sk);
diff --git a/net/core/sock.c b/net/core/sock.c
index bcc4182..5e4715b 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2818,6 +2818,10 @@ void sock_init_data(struct socket *sock, struct sock *sk)
sk->sk_pacing_rate = ~0U;
sk->sk_pacing_shift = 10;
sk->sk_incoming_cpu = -1;
+
+#ifdef CONFIG_XPS
+   sk->sk_rx_queue_mapping = ~0;
+#endif
/*
 * Before updating sk_refcnt, we must commit prior changes to memory
 * (Documentation/RCU/rculist_nulls.txt for details)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f..c404c53 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -78,6 +78,7 @@
 #include 
 #include 
 #include 
+#include 
 
 int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 
@@ -5584,6 +5585,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
if (skb) {
icsk->icsk_af_ops->sk_rx_dst_set(sk, skb);
security_inet_conn_established(sk, skb);
+   sk_mark_napi_id(sk, skb);
}
 
tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
@@ -6412,6 +6414,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
tcp_rsk(req)->snt_isn = isn;
tcp_rsk(req)->txhash = net_tx_rndhash();
tcp_openreq_init_rwin(req, sk, dst);
+   sk_rx_queue_set(req_to_sk(req), skb);
if (!want_cookie) {
tcp_reqsk_record_syn(sk, req, skb);
fastopen_sk = tcp_try_fastopen(sk, skb, req, , dst);



[net-next PATCH v4 5/7] net: Enable Tx queue selection based on Rx queues

2018-06-25 Thread Amritha Nambiar
This patch adds support to pick Tx queue based on the Rx queue(s) map
configuration set by the admin through the sysfs attribute
for each Tx queue. If the user configuration for receive queue(s) map
does not apply, then the Tx queue selection falls back to CPU(s) map
based selection and finally to hashing.

Signed-off-by: Amritha Nambiar 
---
 include/net/sock.h |4 +++
 net/core/dev.c |   62 ++--
 2 files changed, 49 insertions(+), 17 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 0ff4416..cb18139 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1710,6 +1710,10 @@ static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
 #endif
 }
 
+static inline int sk_rx_queue_get(const struct sock *sk)
+{
+   return sk ? sk->sk_rx_queue_mapping - 1 : -1;
+}
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 {
sk_tx_queue_clear(sk);
diff --git a/net/core/dev.c b/net/core/dev.c
index df2a78d..2450c5e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3454,35 +3454,63 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
 }
 #endif /* CONFIG_NET_EGRESS */
 
-static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
+#ifdef CONFIG_XPS
+static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb,
+  struct xps_dev_maps *dev_maps, unsigned int tci)
+{
+   struct xps_map *map;
+   int queue_index = -1;
+
+   if (dev->num_tc) {
+   tci *= dev->num_tc;
+   tci += netdev_get_prio_tc_map(dev, skb->priority);
+   }
+
+   map = rcu_dereference(dev_maps->attr_map[tci]);
+   if (map) {
+   if (map->len == 1)
+   queue_index = map->queues[0];
+   else
+   queue_index = map->queues[reciprocal_scale(
+   skb_get_hash(skb), map->len)];
+   if (unlikely(queue_index >= dev->real_num_tx_queues))
+   queue_index = -1;
+   }
+   return queue_index;
+}
+#endif
+
+static int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
 {
 #ifdef CONFIG_XPS
struct xps_dev_maps *dev_maps;
-   struct xps_map *map;
+   struct sock *sk = skb->sk;
int queue_index = -1;
 
if (!static_key_false(_needed))
return -1;
 
rcu_read_lock();
-   dev_maps = rcu_dereference(dev->xps_cpus_map);
+   if (!static_key_false(_rxqs_needed))
+   goto get_cpus_map;
+
+   dev_maps = rcu_dereference(dev->xps_rxqs_map);
if (dev_maps) {
-   unsigned int tci = skb->sender_cpu - 1;
+   int tci = sk_rx_queue_get(sk);
 
-   if (dev->num_tc) {
-   tci *= dev->num_tc;
-   tci += netdev_get_prio_tc_map(dev, skb->priority);
-   }
+   if (tci >= 0 && tci < dev->num_rx_queues)
+   queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
+ tci);
+   }
 
-   map = rcu_dereference(dev_maps->attr_map[tci]);
-   if (map) {
-   if (map->len == 1)
-   queue_index = map->queues[0];
-   else
-   queue_index = map->queues[reciprocal_scale(skb_get_hash(skb),
-  map->len)];
-   if (unlikely(queue_index >= dev->real_num_tx_queues))
-   queue_index = -1;
+get_cpus_map:
+   if (queue_index < 0) {
+   dev_maps = rcu_dereference(dev->xps_cpus_map);
+   if (dev_maps) {
+   unsigned int tci = skb->sender_cpu - 1;
+
+   queue_index = __get_xps_queue_idx(dev, skb, dev_maps,
+ tci);
}
}
rcu_read_unlock();



[net-next PATCH v4 0/7] Symmetric queue selection using XPS for Rx queues

2018-06-25 Thread Amritha Nambiar
This patch series implements support for Tx queue selection based on
Rx queue(s) map. This is done by configuring Rx queue(s) map per Tx-queue
using sysfs attribute. If the user configuration for Rx queues does
not apply, then the Tx queue selection falls back to XPS using CPUs and
finally to hashing.

XPS is refactored to support Tx queue selection based on either the
CPUs map or the Rx-queues map. The config option CONFIG_XPS needs to be
enabled. By default no receive queues are configured for the Tx queue.

- /sys/class/net//queues/tx-*/xps_rxqs

A set of receive queues can be mapped to a set of transmit queues (many:many),
although the common use case is a 1:1 mapping. This will enable sending
packets on the same Tx-Rx queue association as this is useful for busy polling
multi-threaded workloads where it is not possible to pin the threads to
a CPU. This is a rework of Sridhar's patch for symmetric queueing via
socket option:
https://www.spinics.net/lists/netdev/msg453106.html

Testing Hints:
Kernel:  Linux 4.17.0-rc7+
Interface:
driver: ixgbe
version: 5.1.0-k
firmware-version: 0x00015e0b

Configuration:
ethtool -L $iface combined 16
ethtool -C $iface rx-usecs 1000
sysctl net.core.busy_poll=1000
ATR disabled:
ethtool -K $iface ntuple on

Workload:
Modified memcached that changes the thread selection policy to be based
on the incoming rx-queue of a connection using SO_INCOMING_NAPI_ID socket
option. The default is round-robin.

Default: No rxqs_map configured
Symmetric queues: Enable rxqs_map for all queues 1:1 mapped to Tx queue

System:
Architecture:  x86_64
CPU(s):72
Model name:Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

16 threads  400K requests/sec
=
---
Default Symmetric queues
---
RTT min/avg/max 4/51/2215   2/30/5163
(usec)


intr/sec                26655   18606

contextswitch/sec       5145    4044

insn per cycle          0.43    0.72

cache-misses            6.919   4.310
(% of all cache refs)

L1-dcache-load-         4.49    3.29
-misses
(% of all L1-dcache hits)

LLC-load-misses 13.26   8.96
(% of all LL-cache hits)

---

32 threads  400K requests/sec
=
---
Default Symmetric queues
---
RTT min/avg/max 10/112/5562 9/46/4637
(usec)


intr/sec                30456   27666

contextswitch/sec       7552    5133

insn per cycle          0.41    0.49

cache-misses            9.357   2.769
(% of all cache refs)

L1-dcache-load-         4.09    3.98
-misses
(% of all L1-dcache hits)

LLC-load-misses 12.96   3.96
(% of all LL-cache hits)

---

16 threads  800K requests/sec
=
---
Default Symmetric queues
---
RTT min/avg/max  5/151/4989 9/69/2611
(usec)


intr/sec                35686   22907

contextswitch/sec   25522   12281

insn per cycle          0.67    0.74

cache-misses            8.652   6.38
(% of all cache refs)

L1-dcache-load-         3.19    2.86
-misses
(% of all L1-dcache hits)

LLC-load-misses 16.53   11.99
(% of all LL-cache hits)

---
32 threads  800K requests/sec
=
---
Default Symmetric queues
---
RTT min/avg/max  6/163/6152 8/88/4209
(usec)


intr/sec                47079   26548

contextswitch/sec   42190   39168

insn per cycle          0.45    0.54

cache-misses            8.798 

[net-next PATCH v4 3/7] net: sock: Change tx_queue_mapping in sock_common to unsigned short

2018-06-25 Thread Amritha Nambiar
Change 'skc_tx_queue_mapping' field in sock_common structure from
'int' to 'unsigned short' type with 0 indicating unset and
a positive queue value being set. This way it is consistent with
the queue_mapping field in the sk_buff. This will also accommodate
adding a new 'unsigned short' field in sock_common in the next
patch for rx_queue_mapping.

Signed-off-by: Amritha Nambiar 
---
 include/net/sock.h |   10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b3b7541..009fd30 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -214,7 +214,7 @@ struct sock_common {
struct hlist_node   skc_node;
struct hlist_nulls_node skc_nulls_node;
};
-   int skc_tx_queue_mapping;
+   unsigned short  skc_tx_queue_mapping;
union {
int skc_incoming_cpu;
u32 skc_rcv_wnd;
@@ -1681,17 +1681,19 @@ static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 
 static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
 {
-   sk->sk_tx_queue_mapping = tx_queue;
+   /* sk_tx_queue_mapping accepts only up to a 16-bit value */
+   WARN_ON((unsigned short)tx_queue > USHRT_MAX);
+   sk->sk_tx_queue_mapping = tx_queue + 1;
 }
 
 static inline void sk_tx_queue_clear(struct sock *sk)
 {
-   sk->sk_tx_queue_mapping = -1;
+   sk->sk_tx_queue_mapping = 0;
 }
 
 static inline int sk_tx_queue_get(const struct sock *sk)
 {
-   return sk ? sk->sk_tx_queue_mapping : -1;
+   return sk ? sk->sk_tx_queue_mapping - 1 : -1;
 }
 
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)



[net-next PATCH v4 1/7] net: Refactor XPS for CPUs and Rx queues

2018-06-25 Thread Amritha Nambiar
Refactor XPS code to support Tx queue selection based on
CPU(s) map or Rx queue(s) map.

Signed-off-by: Amritha Nambiar 
---
 include/linux/cpumask.h   |   11 ++
 include/linux/netdevice.h |  100 +
 net/core/dev.c|  211 ++---
 net/core/net-sysfs.c  |4 -
 4 files changed, 246 insertions(+), 80 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index bf53d89..57f20a0 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -115,12 +115,17 @@ extern struct cpumask __cpu_active_mask;
 #define cpu_active(cpu)((cpu) == 0)
 #endif
 
-/* verify cpu argument to cpumask_* operators */
-static inline unsigned int cpumask_check(unsigned int cpu)
+static inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bits)
 {
 #ifdef CONFIG_DEBUG_PER_CPU_MAPS
-   WARN_ON_ONCE(cpu >= nr_cpumask_bits);
+   WARN_ON_ONCE(cpu >= bits);
 #endif /* CONFIG_DEBUG_PER_CPU_MAPS */
+}
+
+/* verify cpu argument to cpumask_* operators */
+static inline unsigned int cpumask_check(unsigned int cpu)
+{
+   cpu_max_bits_warn(cpu, nr_cpumask_bits);
return cpu;
 }
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850..c534f03 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -730,10 +730,15 @@ struct xps_map {
  */
 struct xps_dev_maps {
struct rcu_head rcu;
-   struct xps_map __rcu *cpu_map[0];
+   struct xps_map __rcu *attr_map[0]; /* Either CPUs map or RXQs map */
 };
-#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
+
+#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \
(nr_cpu_ids * (_tcs) * sizeof(struct xps_map *)))
+
+#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\
+   (_rxqs * (_tcs) * sizeof(struct xps_map *)))
+
 #endif /* CONFIG_XPS */
 
 #define TC_MAX_QUEUE   16
@@ -1909,7 +1914,8 @@ struct net_device {
int watchdog_timeo;
 
 #ifdef CONFIG_XPS
-   struct xps_dev_maps __rcu *xps_maps;
+   struct xps_dev_maps __rcu *xps_cpus_map;
+   struct xps_dev_maps __rcu *xps_rxqs_map;
 #endif
 #ifdef CONFIG_NET_CLS_ACT
struct mini_Qdisc __rcu *miniq_egress;
@@ -3258,6 +3264,94 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index)
 #ifdef CONFIG_XPS
 int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask,
u16 index);
+int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
+ u16 index, bool is_rxqs_map);
+
+/**
+ * attr_test_mask - Test a CPU or Rx queue set in a cpumask/rx queues mask
+ * @j: CPU/Rx queue index
+ * @mask: bitmask of all cpus/rx queues
+ * @nr_bits: number of bits in the bitmask
+ *
+ * Test if a CPU or Rx queue index is set in a mask of all CPU/Rx queues.
+ */
+static inline bool attr_test_mask(unsigned long j, const unsigned long *mask,
+ unsigned int nr_bits)
+{
+   cpu_max_bits_warn(j, nr_bits);
+   return test_bit(j, mask);
+}
+
+/**
+ * attr_test_online - Test for online CPU/Rx queue
+ * @j: CPU/Rx queue index
+ * @online_mask: bitmask for CPUs/Rx queues that are online
+ * @nr_bits: number of bits in the bitmask
+ *
+ * Returns true if a CPU/Rx queue is online.
+ */
+static inline bool attr_test_online(unsigned long j,
+   const unsigned long *online_mask,
+   unsigned int nr_bits)
+{
+   cpu_max_bits_warn(j, nr_bits);
+
+   if (online_mask)
+   return test_bit(j, online_mask);
+
+   if (j >= 0 && j < nr_bits)
+   return true;
+
+   return false;
+}
+
+/**
+ * attrmask_next - get the next CPU/Rx queue in a cpumask/Rx queues mask
+ * @n: CPU/Rx queue index
+ * @srcp: the cpumask/Rx queue mask pointer
+ * @nr_bits: number of bits in the bitmask
+ *
+ * Returns >= nr_bits if no further CPUs/Rx queues set.
+ */
+static inline unsigned int attrmask_next(int n, const unsigned long *srcp,
+unsigned int nr_bits)
+{
+   /* -1 is a legal arg here. */
+   if (n != -1)
+   cpu_max_bits_warn(n, nr_bits);
+
+   if (srcp)
+   return find_next_bit(srcp, nr_bits, n + 1);
+
+   return n + 1;
+}
+
+/**
+ * attrmask_next_and - get the next CPU/Rx queue in *src1p & *src2p
+ * @n: CPU/Rx queue index
+ * @src1p: the first CPUs/Rx queues mask pointer
+ * @src2p: the second CPUs/Rx queues mask pointer
+ * @nr_bits: number of bits in the bitmask
+ *
+ * Returns >= nr_bits if no further CPUs/Rx queues set in both.
+ */
+static inline int attrmask_next_and(int n, const unsigned long *src1p,
+   const unsigned long *src2p,
+   

[net-next PATCH v4 2/7] net: Use static_key for XPS maps

2018-06-25 Thread Amritha Nambiar
Use static_key for XPS maps to reduce the cost of extra map checks,
similar to how it is used for RPS and RFS. This includes static_key
'xps_needed' for XPS and another for 'xps_rxqs_needed' for XPS using
Rx queues map.

Signed-off-by: Amritha Nambiar 
---
 net/core/dev.c |   26 --
 1 file changed, 20 insertions(+), 6 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 2552556..df2a78d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2081,6 +2081,10 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned int txq)
 EXPORT_SYMBOL(netdev_txq_to_tc);
 
 #ifdef CONFIG_XPS
+struct static_key xps_needed __read_mostly;
+EXPORT_SYMBOL(xps_needed);
+struct static_key xps_rxqs_needed __read_mostly;
+EXPORT_SYMBOL(xps_rxqs_needed);
 static DEFINE_MUTEX(xps_map_mutex);
 #define xmap_dereference(P)\
rcu_dereference_protected((P), lockdep_is_held(_map_mutex))
@@ -2170,12 +2174,14 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
 
mutex_lock(_map_mutex);
 
-   dev_maps = xmap_dereference(dev->xps_rxqs_map);
-   if (dev_maps) {
-   nr_ids = dev->num_rx_queues;
-   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids, offset,
-  count, true);
-
+   if (static_key_false(_rxqs_needed)) {
+   dev_maps = xmap_dereference(dev->xps_rxqs_map);
+   if (dev_maps) {
+   nr_ids = dev->num_rx_queues;
+   clean_xps_maps(dev, possible_mask, dev_maps, nr_ids,
+  offset, count, true);
+   }
+   static_key_slow_dec(_rxqs_needed);
}
 
dev_maps = xmap_dereference(dev->xps_cpus_map);
@@ -2189,6 +2195,7 @@ static void netif_reset_xps_queues(struct net_device *dev, u16 offset,
   false);
 
 out_no_maps:
+   static_key_slow_dec(_needed);
mutex_unlock(_map_mutex);
 }
 
@@ -2297,6 +2304,10 @@ int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask,
if (!new_dev_maps)
goto out_no_new_maps;
 
+   static_key_slow_inc(_needed);
+   if (is_rxqs_map)
+   static_key_slow_inc(_rxqs_needed);
+
for (j = -1; j = attrmask_next(j, possible_mask, nr_ids),
 j < nr_ids;) {
/* copy maps belonging to foreign traffic classes */
@@ -3450,6 +3461,9 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb)
struct xps_map *map;
int queue_index = -1;
 
+   if (!static_key_false(_needed))
+   return -1;
+
rcu_read_lock();
dev_maps = rcu_dereference(dev->xps_cpus_map);
if (dev_maps) {



Re: [PATCH net-next 3/4] netdevsim: add ipsec offload testing

2018-06-25 Thread Shannon Nelson

On 6/22/2018 9:07 PM, Jakub Kicinski wrote:

On Fri, 22 Jun 2018 17:31:37 -0700, Shannon Nelson wrote:

Implement the IPsec/XFRM offload API for testing.

Signed-off-by: Shannon Nelson 


Thanks for the patch!  Just a number of stylistic nit picks.


diff --git a/drivers/net/netdevsim/ipsec.c b/drivers/net/netdevsim/ipsec.c
new file mode 100644
index 000..ad64266
--- /dev/null
+++ b/drivers/net/netdevsim/ipsec.c
@@ -0,0 +1,345 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2018 Oracle and/or its affiliates. All rights reserved. */
+
+#include 
+#include 
+#include 
+#include "netdevsim.h"


Other files in the driver sort headers alphabetically and put an empty
line between global and local headers.


Sure.




+#define NSIM_IPSEC_AUTH_BITS   128
+
+/**
+ * nsim_ipsec_dbg_read - read for ipsec data
+ * @filp: the opened file
+ * @buffer: where to write the data for the user to read
+ * @count: the size of the user's buffer
+ * @ppos: file position offset
+ **/
+static ssize_t nsim_dbg_netdev_ops_read(struct file *filp,


Doesn't match the kdoc.  Please run

./scripts/kernel-doc -none $file

if you want kdoc.  Although IMHO you may as well drop the kdoc, your
code is quite self explanatory and local.


By adding -v to that I got a couple of warnings that I didn't include 
the Return information - is that what you were commenting on?  The rest 
seems acceptable to the script I'm using from the net-next tree.





+   char __user *buffer,
+   size_t count, loff_t *ppos)
+{
+   struct netdevsim *ns = filp->private_data;
+   struct nsim_ipsec *ipsec = >ipsec;
+   size_t bufsize;
+   char *buf, *p;
+   int len;
+   int i;
+
+   /* don't allow partial reads */
+   if (*ppos != 0)
+   return 0;
+
+   /* the buffer needed is
+* (num SAs * 3 lines each * ~60 bytes per line) + one more line
+*/
+   bufsize = (ipsec->count * 4 * 60) + 60;
+   buf = kzalloc(bufsize, GFP_KERNEL);
+   if (!buf)
+   return -ENOMEM;
+
+   p = buf;
+   p += snprintf(p, bufsize - (p - buf),
+ "SA count=%u tx=%u\n",
+ ipsec->count, ipsec->tx);
+
+   for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+   struct nsim_sa *sap = >sa[i];
+
+   if (!sap->used)
+   continue;
+
+   p += snprintf(p, bufsize - (p - buf),
+ "sa[%i] %cx ipaddr=0x%08x %08x %08x %08x\n",
+ i, (sap->rx ? 'r' : 't'), sap->ipaddr[0],
+ sap->ipaddr[1], sap->ipaddr[2], sap->ipaddr[3]);
+   p += snprintf(p, bufsize - (p - buf),
+ "sa[%i]spi=0x%08x proto=0x%x salt=0x%08x crypt=%d\n",
+ i, be32_to_cpu(sap->xs->id.spi),
+ sap->xs->id.proto, sap->salt, sap->crypt);
+   p += snprintf(p, bufsize - (p - buf),
+ "sa[%i]key=0x%08x %08x %08x %08x\n",
+ i, sap->key[0], sap->key[1],
+ sap->key[2], sap->key[3]);
+   }
+
+   len = simple_read_from_buffer(buffer, count, ppos, buf, p - buf);


Why not seq_file for this?


Why bother with more interface code?  This is useful enough to support 
the API testing needed.





+   kfree(buf);
+   return len;
+}
+
+static const struct file_operations ipsec_dbg_fops = {
+   .owner = THIS_MODULE,
+   .open = simple_open,
+   .read = nsim_dbg_netdev_ops_read,
+};
+
+/**
+ * nsim_ipsec_find_empty_idx - find the first unused security parameter index
+ * @ipsec: pointer to ipsec struct
+ **/
+static int nsim_ipsec_find_empty_idx(struct nsim_ipsec *ipsec)
+{
+   u32 i;
+
+   if (ipsec->count == NSIM_IPSEC_MAX_SA_COUNT)
+   return -ENOSPC;
+
+   /* search sa table */
+   for (i = 0; i < NSIM_IPSEC_MAX_SA_COUNT; i++) {
+   if (!ipsec->sa[i].used)
+   return i;
+   }
+
+   return -ENOSPC;


FWIW I personally find bitmaps and find_first_zero_bit() etc. nice and
concise for a small ID allocator, but no objection to open coding.


Sure, we could add a parallel bitmap data structure to track usage of 
our array elements, and probably would for a much larger array so as to 
lessen the impact of a serial search.  But, since this is a short array 
for simple testing purposes, the search time is minimal so I think the 
simple code is fine.





+}
+
+/**
+ * nsim_ipsec_parse_proto_keys - find the key and salt based on the protocol
+ * @xs: pointer to xfrm_state struct
+ * @mykey: pointer to key array to populate
+ * @mysalt: pointer to salt value to populate
+ *
+ * This copies the protocol keys and salt to our own data tables.  The
+ * 82599 family only supports the one algorithm.


82599 is a fine chip, it's not 

Re: [PATCH net-next v2 7/7] net: sched: call reoffload op on block callback reg

2018-06-25 Thread Jiri Pirko
Mon, Jun 25, 2018 at 11:30:10PM CEST, jakub.kicin...@netronome.com wrote:
>From: John Hurley 
>
>Call the reoffload tcf_proto_op on all tcf_proto nodes in all chains of a
>block when a callback tries to register to a block that already has
>offloaded rules. If all existing rules cannot be offloaded then the
>registration is rejected. This replaces the previous policy of rejecting
>such callback registration outright.
>
>On unregistration of a callback, the rules are flushed for that given cb.
>The implementation of block sharing in the NFP driver, for example,
>duplicates shared rules to all devs bound to a block. This meant that
>rules could still exist in hw even after a device is unbound from a block
>(assuming the block still remains active).
>
>Signed-off-by: John Hurley 
>Signed-off-by: Jakub Kicinski 

Acked-by: Jiri Pirko 


Re: [PATCH net-next v2 4/7] net: sched: cls_matchall: implement offload tcf_proto_op

2018-06-25 Thread Jiri Pirko
Mon, Jun 25, 2018 at 11:30:07PM CEST, jakub.kicin...@netronome.com wrote:
>From: John Hurley 
>
>Add the reoffload tcf_proto_op in matchall to generate an offload message
>for each filter in the given tcf_proto. Call the specified callback with
>this new offload message. The function only returns an error if the
>callback rejects adding a 'hardware only' rule.
>
>Ensure matchall flags correctly report if the rule is in hw by keeping a
>reference counter for the number of instances of the rule offloaded. Only
>update the flag when this counter changes from or to 0.
>
>Signed-off-by: John Hurley 
>Signed-off-by: Jakub Kicinski 

Acked-by: Jiri Pirko 


Re: [PATCH net-next v2 3/7] net: sched: cls_flower: implement offload tcf_proto_op

2018-06-25 Thread Jiri Pirko
Mon, Jun 25, 2018 at 11:30:06PM CEST, jakub.kicin...@netronome.com wrote:
>From: John Hurley 
>
>Add the reoffload tcf_proto_op in flower to generate an offload message
>for each filter in the given tcf_proto. Call the specified callback with
>this new offload message. The function only returns an error if the
>callback rejects adding a 'hardware only' rule.
>
>A filter contains a flag to indicate if it is in hardware or not. To
>ensure the reoffload function properly maintains this flag, keep a
>reference counter for the number of instances of the filter that are in
>hardware. Only update the flag when this counter changes from or to 0. Add
>a generic helper function to implement this behaviour.
>
>Signed-off-by: John Hurley 
>Signed-off-by: Jakub Kicinski 

Acked-by: Jiri Pirko 


Re: [PATCH net-next v2 2/7] net: sched: add tcf_proto_op to offload a rule

2018-06-25 Thread Jiri Pirko
Mon, Jun 25, 2018 at 11:30:05PM CEST, jakub.kicin...@netronome.com wrote:
>From: John Hurley 
>
>Create a new tcf_proto_op called 'reoffload' that generates a new offload
>message for each node in a tcf_proto. Pointers to the tcf_proto and
>whether the offload request is to add or delete the node are included.
>Also included is a callback function to send the offload message to and
>the option of priv data to go with the cb.
>
>Signed-off-by: John Hurley 
>Signed-off-by: Jakub Kicinski 

Acked-by: Jiri Pirko 


Re: [PATCH net-next v2 1/7] net: sched: pass extack pointer to block binds and cb registration

2018-06-25 Thread Jiri Pirko
Mon, Jun 25, 2018 at 11:30:04PM CEST, jakub.kicin...@netronome.com wrote:
>From: John Hurley 
>
>Pass the extack struct from a tc qdisc add to the block bind function and,
>in turn, to the setup_tc ndo of binding device via the tc_block_offload
>struct. Pass this back to any block callback registrations to allow
>netlink logging of fails in the bind process.
>
>Signed-off-by: John Hurley 
>Signed-off-by: Jakub Kicinski 

Acked-by: Jiri Pirko 


Re: [PATCH rdma-next 00/12] RDMA fixes 2018-06-24

2018-06-25 Thread Jason Gunthorpe
On Sun, Jun 24, 2018 at 11:23:41AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> Hi,
> 
> This is bunch of patches trigged by running syzkaller internally.
> 
> I'm sending them based on rdma-next mainly for two reasons:
> 1, Most of the patches fix the old issues and it doesn't matter when
> they will hit the Linus's tree: now or later in a couple of weeks
> during merge window.
> 2. They interleave with code cleanup, mlx5-next patches and Michael's
> feedback on flow counters series.
> 
> Thanks
> 
> Leon Romanovsky (12):
>   RDMA/uverbs: Protect from attempts to create flows on unsupported QP
>   RDMA/uverbs: Fix slab-out-of-bounds in ib_uverbs_ex_create_flow

I applied these two to for-rc

>   RDMA/uverbs: Check existence of create_flow callback
>   RDMA/verbs: Drop kernel variant of create_flow
>   RDMA/verbs: Drop kernel variant of destroy_flow
>   net/mlx5: Rate limit errors in command interface
>   RDMA/uverbs: Don't overwrite NULL pointer with ZERO_SIZE_PTR
>   RDMA/umem: Don't check for negative return value of dma_map_sg_attrs()
>   RDMA/uverbs: Remove redundant check

These to for-next

>   overflow.h: Add arithmetic shift helper
>   RDMA/mlx5: Fix shift overflow in mlx5_ib_create_wq
>   RDMA/mlx5: Reuse existed shift_overlow helper

And these will have to be respun.

Thanks,
Jason


[PATCH net-next v2 1/7] net: sched: pass extack pointer to block binds and cb registration

2018-06-25 Thread Jakub Kicinski
From: John Hurley 

Pass the extack struct from a tc qdisc add to the block bind function and,
in turn, to the setup_tc ndo of binding device via the tc_block_offload
struct. Pass this back to any block callback registrations to allow
netlink logging of fails in the bind process.

Signed-off-by: John Hurley 
Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |  2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c |  2 +-
 .../net/ethernet/chelsio/cxgb4/cxgb4_main.c   |  2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  2 +-
 .../net/ethernet/intel/i40evf/i40evf_main.c   |  2 +-
 drivers/net/ethernet/intel/igb/igb_main.c |  2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  2 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |  2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c| 10 +---
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  2 +-
 .../ethernet/netronome/nfp/flower/offload.c   |  2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |  2 +-
 drivers/net/netdevsim/netdev.c|  2 +-
 include/net/pkt_cls.h | 11 +---
 net/dsa/slave.c   |  2 +-
 net/sched/cls_api.c   | 25 ---
 17 files changed, 43 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 176fc9f4d7de..b5fc6414a951 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7984,7 +7984,7 @@ static int bnxt_setup_tc_block(struct net_device *dev,
switch (f->command) {
case TC_BLOCK_BIND:
return tcf_block_cb_register(f->block, bnxt_setup_tc_block_cb,
-bp, bp);
+bp, bp, f->extack);
case TC_BLOCK_UNBIND:
tcf_block_cb_unregister(f->block, bnxt_setup_tc_block_cb, bp);
return 0;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
index 05d405905906..0745f2dfc80c 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c
@@ -173,7 +173,7 @@ static int bnxt_vf_rep_setup_tc_block(struct net_device *dev,
case TC_BLOCK_BIND:
return tcf_block_cb_register(f->block,
 bnxt_vf_rep_setup_tc_block_cb,
-vf_rep, vf_rep);
+vf_rep, vf_rep, f->extack);
case TC_BLOCK_UNBIND:
tcf_block_cb_unregister(f->block,
bnxt_vf_rep_setup_tc_block_cb, vf_rep);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index dd04a2f89ce6..84eca1d45ad1 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -3016,7 +3016,7 @@ static int cxgb_setup_tc_block(struct net_device *dev,
switch (f->command) {
case TC_BLOCK_BIND:
return tcf_block_cb_register(f->block, cxgb_setup_tc_block_cb,
-pi, dev);
+pi, dev, f->extack);
case TC_BLOCK_UNBIND:
tcf_block_cb_unregister(f->block, cxgb_setup_tc_block_cb, pi);
return 0;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 7ad2b1b0b125..426b0ccb1fc6 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -7554,7 +7554,7 @@ static int i40e_setup_tc_block(struct net_device *dev,
switch (f->command) {
case TC_BLOCK_BIND:
return tcf_block_cb_register(f->block, i40e_setup_tc_block_cb,
-np, np);
+np, np, f->extack);
case TC_BLOCK_UNBIND:
tcf_block_cb_unregister(f->block, i40e_setup_tc_block_cb, np);
return 0;
diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
index dc56a8667495..5906c1c1d19d 100644
--- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c
+++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c
@@ -2926,7 +2926,7 @@ static int i40evf_setup_tc_block(struct net_device *dev,
switch (f->command) {
case TC_BLOCK_BIND:
return tcf_block_cb_register(f->block, i40evf_setup_tc_block_cb,
-adapter, adapter);
+adapter, adapter, f->extack);
case TC_BLOCK_UNBIND:
tcf_block_cb_unregister(f->block, 

[PATCH net-next v2 2/7] net: sched: add tcf_proto_op to offload a rule

2018-06-25 Thread Jakub Kicinski
From: John Hurley 

Create a new tcf_proto_op called 'reoffload' that generates a new offload
message for each node in a tcf_proto. Pointers to the tcf_proto and
whether the offload request is to add or delete the node are included.
Also included are a callback function to which the offload message is
sent and optional priv data to pass along with the callback.

Signed-off-by: John Hurley 
Signed-off-by: Jakub Kicinski 
---
v2:
 - name parameters in header file.
---
 include/net/act_api.h | 3 ---
 include/net/sch_generic.h | 6 ++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 9e59ebfded62..5ff11adbe2a6 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -190,9 +190,6 @@ static inline void tcf_action_stats_update(struct tc_action 
*a, u64 bytes,
 #endif
 }
 
-typedef int tc_setup_cb_t(enum tc_setup_type type,
- void *type_data, void *cb_priv);
-
 #ifdef CONFIG_NET_CLS_ACT
 int tc_setup_cb_egdev_register(const struct net_device *dev,
   tc_setup_cb_t *cb, void *cb_priv);
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 6488daa32f82..18adc9142b18 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -20,6 +20,9 @@ struct qdisc_walker;
 struct tcf_walker;
 struct module;
 
+typedef int tc_setup_cb_t(enum tc_setup_type type,
+ void *type_data, void *cb_priv);
+
 struct qdisc_rate_table {
struct tc_ratespec rate;
u32 data[256];
@@ -256,6 +259,9 @@ struct tcf_proto_ops {
  bool *last,
  struct netlink_ext_ack *);
void(*walk)(struct tcf_proto*, struct tcf_walker 
*arg);
+   int (*reoffload)(struct tcf_proto *tp, bool add,
+tc_setup_cb_t *cb, void *cb_priv,
+struct netlink_ext_ack *extack);
void(*bind_class)(void *, u32, unsigned long);
 
/* rtnetlink specific */
-- 
2.17.1



[PATCH net-next v2 5/7] net: sched: cls_u32: implement offload tcf_proto_op

2018-06-25 Thread Jakub Kicinski
From: John Hurley 

Add the offload tcf_proto_op in cls_u32 to generate an offload message for
each filter and the hashtable in the given tcf_proto. Call the specified
callback with this new offload message. The function only returns an error
if the callback rejects adding a 'hardware only' rule.

A filter contains a flag to indicate if it is in hardware or not. To
ensure the offload function properly maintains this flag, keep a reference
counter for the number of instances of the filter that are in hardware.
Only update the flag when this counter changes from or to 0.

Signed-off-by: John Hurley 
Signed-off-by: Jakub Kicinski 
---
 net/sched/cls_u32.c | 111 
 1 file changed, 111 insertions(+)

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index fb861f90fde6..d5d2a6dc3921 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -62,6 +62,7 @@ struct tc_u_knode {
struct tc_u32_pcnt __percpu *pf;
 #endif
u32 flags;
+   unsigned intin_hw_count;
 #ifdef CONFIG_CLS_U32_MARK
u32 val;
u32 mask;
@@ -571,6 +572,7 @@ static int u32_replace_hw_knode(struct tcf_proto *tp, 
struct tc_u_knode *n,
u32_remove_hw_knode(tp, n, NULL);
return err;
} else if (err > 0) {
+   n->in_hw_count = err;
tcf_block_offload_inc(block, &n->flags);
}
 
@@ -1199,6 +1201,114 @@ static void u32_walk(struct tcf_proto *tp, struct 
tcf_walker *arg)
}
 }
 
+static int u32_reoffload_hnode(struct tcf_proto *tp, struct tc_u_hnode *ht,
+  bool add, tc_setup_cb_t *cb, void *cb_priv,
+  struct netlink_ext_ack *extack)
+{
+   struct tc_cls_u32_offload cls_u32 = {};
+   int err;
+
+   tc_cls_common_offload_init(&cls_u32.common, tp, ht->flags, extack);
+   cls_u32.command = add ? TC_CLSU32_NEW_HNODE : TC_CLSU32_DELETE_HNODE;
+   cls_u32.hnode.divisor = ht->divisor;
+   cls_u32.hnode.handle = ht->handle;
+   cls_u32.hnode.prio = ht->prio;
+
+   err = cb(TC_SETUP_CLSU32, &cls_u32, cb_priv);
+   if (err && add && tc_skip_sw(ht->flags))
+   return err;
+
+   return 0;
+}
+
+static int u32_reoffload_knode(struct tcf_proto *tp, struct tc_u_knode *n,
+  bool add, tc_setup_cb_t *cb, void *cb_priv,
+  struct netlink_ext_ack *extack)
+{
+   struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);
+   struct tcf_block *block = tp->chain->block;
+   struct tc_cls_u32_offload cls_u32 = {};
+   int err;
+
+   tc_cls_common_offload_init(&cls_u32.common, tp, n->flags, extack);
+   cls_u32.command = add ?
+   TC_CLSU32_REPLACE_KNODE : TC_CLSU32_DELETE_KNODE;
+   cls_u32.knode.handle = n->handle;
+
+   if (add) {
+   cls_u32.knode.fshift = n->fshift;
+#ifdef CONFIG_CLS_U32_MARK
+   cls_u32.knode.val = n->val;
+   cls_u32.knode.mask = n->mask;
+#else
+   cls_u32.knode.val = 0;
+   cls_u32.knode.mask = 0;
+#endif
+   cls_u32.knode.sel = &n->sel;
+   cls_u32.knode.exts = &n->exts;
+   if (n->ht_down)
+   cls_u32.knode.link_handle = ht->handle;
+   }
+
+   err = cb(TC_SETUP_CLSU32, &cls_u32, cb_priv);
+   if (err) {
+   if (add && tc_skip_sw(n->flags))
+   return err;
+   return 0;
+   }
+
+   tc_cls_offload_cnt_update(block, &n->in_hw_count, &n->flags, add);
+
+   return 0;
+}
+
+static int u32_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+void *cb_priv, struct netlink_ext_ack *extack)
+{
+   struct tc_u_common *tp_c = tp->data;
+   struct tc_u_hnode *ht;
+   struct tc_u_knode *n;
+   unsigned int h;
+   int err;
+
+   for (ht = rtnl_dereference(tp_c->hlist);
+ht;
+ht = rtnl_dereference(ht->next)) {
+   if (ht->prio != tp->prio)
+   continue;
+
+   /* When adding filters to a new dev, try to offload the
+* hashtable first. When removing, do the filters before the
+* hashtable.
+*/
+   if (add && !tc_skip_hw(ht->flags)) {
+   err = u32_reoffload_hnode(tp, ht, add, cb, cb_priv,
+ extack);
+   if (err)
+   return err;
+   }
+
+   for (h = 0; h <= ht->divisor; h++) {
+   for (n = rtnl_dereference(ht->ht[h]);
+n;
+n = rtnl_dereference(n->next)) {
+   if (tc_skip_hw(n->flags))
+   continue;
+
+ 

[PATCH net-next v2 4/7] net: sched: cls_matchall: implement offload tcf_proto_op

2018-06-25 Thread Jakub Kicinski
From: John Hurley 

Add the reoffload tcf_proto_op in matchall to generate an offload message
for each filter in the given tcf_proto. Call the specified callback with
this new offload message. The function only returns an error if the
callback rejects adding a 'hardware only' rule.

Ensure matchall flags correctly report if the rule is in hw by keeping a
reference counter for the number of instances of the rule offloaded. Only
update the flag when this counter changes from or to 0.

Signed-off-by: John Hurley 
Signed-off-by: Jakub Kicinski 
---
 net/sched/cls_matchall.c | 32 
 1 file changed, 32 insertions(+)

diff --git a/net/sched/cls_matchall.c b/net/sched/cls_matchall.c
index 47b207ef7762..af16f36ed578 100644
--- a/net/sched/cls_matchall.c
+++ b/net/sched/cls_matchall.c
@@ -21,6 +21,7 @@ struct cls_mall_head {
struct tcf_result res;
u32 handle;
u32 flags;
+   unsigned int in_hw_count;
struct rcu_work rwork;
 };
 
@@ -95,6 +96,7 @@ static int mall_replace_hw_filter(struct tcf_proto *tp,
mall_destroy_hw_filter(tp, head, cookie, NULL);
return err;
} else if (err > 0) {
+   head->in_hw_count = err;
tcf_block_offload_inc(block, &head->flags);
}
 
@@ -235,6 +237,35 @@ static void mall_walk(struct tcf_proto *tp, struct 
tcf_walker *arg)
arg->count++;
 }
 
+static int mall_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+ void *cb_priv, struct netlink_ext_ack *extack)
+{
+   struct cls_mall_head *head = rtnl_dereference(tp->root);
+   struct tc_cls_matchall_offload cls_mall = {};
+   struct tcf_block *block = tp->chain->block;
+   int err;
+
+   if (tc_skip_hw(head->flags))
+   return 0;
+
+   tc_cls_common_offload_init(&cls_mall.common, tp, head->flags, extack);
+   cls_mall.command = add ?
+   TC_CLSMATCHALL_REPLACE : TC_CLSMATCHALL_DESTROY;
+   cls_mall.exts = &head->exts;
+   cls_mall.cookie = (unsigned long)head;
+
+   err = cb(TC_SETUP_CLSMATCHALL, &cls_mall, cb_priv);
+   if (err) {
+   if (add && tc_skip_sw(head->flags))
+   return err;
+   return 0;
+   }
+
+   tc_cls_offload_cnt_update(block, &head->in_hw_count, &head->flags, add);
+
+   return 0;
+}
+
 static int mall_dump(struct net *net, struct tcf_proto *tp, void *fh,
 struct sk_buff *skb, struct tcmsg *t)
 {
@@ -289,6 +320,7 @@ static struct tcf_proto_ops cls_mall_ops __read_mostly = {
.change = mall_change,
.delete = mall_delete,
.walk   = mall_walk,
+   .reoffload  = mall_reoffload,
.dump   = mall_dump,
.bind_class = mall_bind_class,
.owner  = THIS_MODULE,
-- 
2.17.1



[PATCH net-next v2 7/7] net: sched: call reoffload op on block callback reg

2018-06-25 Thread Jakub Kicinski
From: John Hurley 

Call the reoffload tcf_proto_op on all tcf_proto nodes in all chains of a
block when a callback tries to register to a block that already has
offloaded rules. If all existing rules cannot be offloaded then the
registration is rejected. This replaces the previous policy of rejecting
such callback registration outright.

On unregistration of a callback, the rules are flushed for that given cb.
The implementation of block sharing in the NFP driver, for example,
duplicates shared rules to all devs bound to a block. This meant that
rules could still exist in hw even after a device is unbound from a block
(assuming the block still remains active).

Signed-off-by: John Hurley 
Signed-off-by: Jakub Kicinski 
---
v2:
 - improve extack message.
---
 .../net/ethernet/mellanox/mlxsw/spectrum.c|  4 +-
 include/net/pkt_cls.h |  6 ++-
 net/sched/cls_api.c   | 54 ---
 3 files changed, 52 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index d2bc335dda11..52437363766a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1542,7 +1542,7 @@ mlxsw_sp_setup_tc_block_flower_bind(struct mlxsw_sp_port 
*mlxsw_sp_port,
 
 err_block_bind:
if (!tcf_block_cb_decref(block_cb)) {
-   __tcf_block_cb_unregister(block_cb);
+   __tcf_block_cb_unregister(block, block_cb);
 err_cb_register:
mlxsw_sp_acl_block_destroy(acl_block);
}
@@ -1572,7 +1572,7 @@ mlxsw_sp_setup_tc_block_flower_unbind(struct 
mlxsw_sp_port *mlxsw_sp_port,
err = mlxsw_sp_acl_block_unbind(mlxsw_sp, acl_block,
mlxsw_sp_port, ingress);
if (!err && !tcf_block_cb_decref(block_cb)) {
-   __tcf_block_cb_unregister(block_cb);
+   __tcf_block_cb_unregister(block, block_cb);
mlxsw_sp_acl_block_destroy(acl_block);
}
 }
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a2c6d35ba057..4070b8eb6d14 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -78,7 +78,8 @@ struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block 
*block,
 int tcf_block_cb_register(struct tcf_block *block,
  tc_setup_cb_t *cb, void *cb_ident,
  void *cb_priv, struct netlink_ext_ack *extack);
-void __tcf_block_cb_unregister(struct tcf_block_cb *block_cb);
+void __tcf_block_cb_unregister(struct tcf_block *block,
+  struct tcf_block_cb *block_cb);
 void tcf_block_cb_unregister(struct tcf_block *block,
 tc_setup_cb_t *cb, void *cb_ident);
 
@@ -177,7 +178,8 @@ int tcf_block_cb_register(struct tcf_block *block,
 }
 
 static inline
-void __tcf_block_cb_unregister(struct tcf_block_cb *block_cb)
+void __tcf_block_cb_unregister(struct tcf_block *block,
+  struct tcf_block_cb *block_cb)
 {
 }
 
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 8c9fb4b827a1..bbf8dda96b0e 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -751,19 +751,53 @@ unsigned int tcf_block_cb_decref(struct tcf_block_cb 
*block_cb)
 }
 EXPORT_SYMBOL(tcf_block_cb_decref);
 
+static int
+tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
+   void *cb_priv, bool add, bool offload_in_use,
+   struct netlink_ext_ack *extack)
+{
+   struct tcf_chain *chain;
+   struct tcf_proto *tp;
+   int err;
+
+   list_for_each_entry(chain, &block->chain_list, list) {
+   for (tp = rtnl_dereference(chain->filter_chain); tp;
+tp = rtnl_dereference(tp->next)) {
+   if (tp->ops->reoffload) {
+   err = tp->ops->reoffload(tp, add, cb, cb_priv,
+extack);
+   if (err && add)
+   goto err_playback_remove;
+   } else if (add && offload_in_use) {
+   err = -EOPNOTSUPP;
+   NL_SET_ERR_MSG(extack, "Filter HW offload 
failed - classifier without re-offloading support");
+   goto err_playback_remove;
+   }
+   }
+   }
+
+   return 0;
+
+err_playback_remove:
+   tcf_block_playback_offloads(block, cb, cb_priv, false, offload_in_use,
+   extack);
+   return err;
+}
+
 struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block,
 tc_setup_cb_t *cb, void *cb_ident,
 void *cb_priv,
 struct netlink_ext_ack *extack)
 

[PATCH net-next v2 6/7] net: sched: cls_bpf: implement offload tcf_proto_op

2018-06-25 Thread Jakub Kicinski
From: John Hurley 

Add the offload tcf_proto_op in cls_bpf to generate an offload message for
each bpf prog in the given tcf_proto. Call the specified callback with
this new offload message. The function only returns an error if the
callback rejects adding a 'hardware only' prog.

A prog contains a flag to indicate if it is in hardware or not. To
ensure the offload function properly maintains this flag, keep a reference
counter for the number of instances of the prog that are in hardware. Only
update the flag when this counter changes from or to 0.

Signed-off-by: John Hurley 
Signed-off-by: Jakub Kicinski 
---
 net/sched/cls_bpf.c | 39 +++
 1 file changed, 39 insertions(+)

diff --git a/net/sched/cls_bpf.c b/net/sched/cls_bpf.c
index 1aa7f6511065..66e0ac9811f9 100644
--- a/net/sched/cls_bpf.c
+++ b/net/sched/cls_bpf.c
@@ -43,6 +43,7 @@ struct cls_bpf_prog {
struct tcf_result res;
bool exts_integrated;
u32 gen_flags;
+   unsigned int in_hw_count;
struct tcf_exts exts;
u32 handle;
u16 bpf_num_ops;
@@ -174,6 +175,7 @@ static int cls_bpf_offload_cmd(struct tcf_proto *tp, struct 
cls_bpf_prog *prog,
cls_bpf_offload_cmd(tp, oldprog, prog, extack);
return err;
} else if (err > 0) {
+   prog->in_hw_count = err;
tcf_block_offload_inc(block, &prog->gen_flags);
}
}
@@ -652,6 +654,42 @@ static void cls_bpf_walk(struct tcf_proto *tp, struct 
tcf_walker *arg)
}
 }
 
+static int cls_bpf_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+void *cb_priv, struct netlink_ext_ack *extack)
+{
+   struct cls_bpf_head *head = rtnl_dereference(tp->root);
+   struct tcf_block *block = tp->chain->block;
+   struct tc_cls_bpf_offload cls_bpf = {};
+   struct cls_bpf_prog *prog;
+   int err;
+
+   list_for_each_entry(prog, &head->plist, link) {
+   if (tc_skip_hw(prog->gen_flags))
+   continue;
+
+   tc_cls_common_offload_init(&cls_bpf.common, tp, prog->gen_flags,
+  extack);
+   cls_bpf.command = TC_CLSBPF_OFFLOAD;
+   cls_bpf.exts = &prog->exts;
+   cls_bpf.prog = add ? prog->filter : NULL;
+   cls_bpf.oldprog = add ? NULL : prog->filter;
+   cls_bpf.name = prog->bpf_name;
+   cls_bpf.exts_integrated = prog->exts_integrated;
+
+   err = cb(TC_SETUP_CLSBPF, &cls_bpf, cb_priv);
+   if (err) {
+   if (add && tc_skip_sw(prog->gen_flags))
+   return err;
+   continue;
+   }
+
+   tc_cls_offload_cnt_update(block, &prog->in_hw_count,
+ &prog->gen_flags, add);
+   }
+
+   return 0;
+}
+
 static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
.kind   =   "bpf",
.owner  =   THIS_MODULE,
@@ -662,6 +700,7 @@ static struct tcf_proto_ops cls_bpf_ops __read_mostly = {
.change =   cls_bpf_change,
.delete =   cls_bpf_delete,
.walk   =   cls_bpf_walk,
+   .reoffload  =   cls_bpf_reoffload,
.dump   =   cls_bpf_dump,
.bind_class =   cls_bpf_bind_class,
 };
-- 
2.17.1



[PATCH net-next v2 0/7] net: sched: support replay of filter offload when binding to block

2018-06-25 Thread Jakub Kicinski
Hi!

This series from John adds the ability to replay filter offload requests
when new offload callback is being registered on a TC block.  This is most
likely to take place for shared blocks today, when a block which already
has rules is bound to another interface.  Prior to this patch set if any
of the rules were offloaded the block bind would fail.

A new tcf_proto_op is added to generate a filter-specific offload request.
The new 'offload' op supports extack from day 0, hence we need to
propagate extack to .ndo_setup_tc TC_BLOCK_BIND/TC_BLOCK_UNBIND and
through tcf_block_cb_register() to tcf_block_playback_offloads().

The immediate use of this patch set is to simplify life of drivers which
require duplicating rules when sharing blocks.  Switch drivers (mlxsw)
can bind ports to rule lists dynamically, NIC drivers generally don't
have that ability and need the rules to be duplicated for each ingress
they match on.  In code terms this means that switch drivers don't
register multiple callbacks for each port.  NIC drivers do, and get a
separate request and hence rule per port, as if the block was not shared.
The registration fails today, however, if some rules were already present.

As John notes in description of patch 7, drivers which register multiple
callbacks to shared blocks will likely need to flush the rules on block
unbind.  This set makes the core not only replay the offload add
requests but also the offload remove requests when a callback is unregistered.

v2:
 - name parameters in patch 2;
 - use unsigned int instead of u32 for in_hw_count;
 - improve extack message in patch 7.

John Hurley (7):
  net: sched: pass extack pointer to block binds and cb registration
  net: sched: add tcf_proto_op to offload a rule
  net: sched: cls_flower: implement offload tcf_proto_op
  net: sched: cls_matchall: implement offload tcf_proto_op
  net: sched: cls_u32: implement offload tcf_proto_op
  net: sched: cls_bpf: implement offload tcf_proto_op
  net: sched: call reoffload op on block callback reg

 drivers/net/ethernet/broadcom/bnxt/bnxt.c |   2 +-
 drivers/net/ethernet/broadcom/bnxt/bnxt_vfr.c |   2 +-
 .../net/ethernet/chelsio/cxgb4/cxgb4_main.c   |   2 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   2 +-
 .../net/ethernet/intel/i40evf/i40evf_main.c   |   2 +-
 drivers/net/ethernet/intel/igb/igb_main.c |   2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   2 +-
 .../net/ethernet/mellanox/mlx5/core/en_rep.c  |   2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c|  14 ++-
 drivers/net/ethernet/netronome/nfp/bpf/main.c |   2 +-
 .../ethernet/netronome/nfp/flower/offload.c   |   2 +-
 .../net/ethernet/stmicro/stmmac/stmmac_main.c |   2 +-
 drivers/net/netdevsim/netdev.c|   2 +-
 include/net/act_api.h |   3 -
 include/net/pkt_cls.h |  17 ++-
 include/net/sch_generic.h |  21 
 net/dsa/slave.c   |   2 +-
 net/sched/cls_api.c   |  79 ++---
 net/sched/cls_bpf.c   |  39 ++
 net/sched/cls_flower.c|  44 +++
 net/sched/cls_matchall.c  |  32 +
 net/sched/cls_u32.c   | 111 ++
 23 files changed, 342 insertions(+), 46 deletions(-)

-- 
2.17.1



[PATCH net-next v2 3/7] net: sched: cls_flower: implement offload tcf_proto_op

2018-06-25 Thread Jakub Kicinski
From: John Hurley 

Add the reoffload tcf_proto_op in flower to generate an offload message
for each filter in the given tcf_proto. Call the specified callback with
this new offload message. The function only returns an error if the
callback rejects adding a 'hardware only' rule.

A filter contains a flag to indicate if it is in hardware or not. To
ensure the reoffload function properly maintains this flag, keep a
reference counter for the number of instances of the filter that are in
hardware. Only update the flag when this counter changes from or to 0. Add
a generic helper function to implement this behaviour.

Signed-off-by: John Hurley 
Signed-off-by: Jakub Kicinski 
--
v2:
 - use unsigned int for counting how many times filter was offloaded.
---
 include/net/sch_generic.h | 15 +
 net/sched/cls_flower.c| 44 +++
 2 files changed, 59 insertions(+)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 18adc9142b18..7432100027b7 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -336,6 +336,21 @@ static inline void tcf_block_offload_dec(struct tcf_block 
*block, u32 *flags)
block->offloadcnt--;
 }
 
+static inline void
+tc_cls_offload_cnt_update(struct tcf_block *block, unsigned int *cnt,
+ u32 *flags, bool add)
+{
+   if (add) {
+   if (!*cnt)
+   tcf_block_offload_inc(block, flags);
+   (*cnt)++;
+   } else {
+   (*cnt)--;
+   if (!*cnt)
+   tcf_block_offload_dec(block, flags);
+   }
+}
+
 static inline void qdisc_cb_private_validate(const struct sk_buff *skb, int sz)
 {
struct qdisc_skb_cb *qcb;
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9e8b26a80fb3..352876bb901b 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -87,6 +87,7 @@ struct cls_fl_filter {
struct list_head list;
u32 handle;
u32 flags;
+   unsigned int in_hw_count;
struct rcu_work rwork;
struct net_device *hw_dev;
 };
@@ -289,6 +290,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
fl_hw_destroy_filter(tp, f, NULL);
return err;
} else if (err > 0) {
+   f->in_hw_count = err;
tcf_block_offload_inc(block, &f->flags);
}
 
@@ -1087,6 +1089,47 @@ static void fl_walk(struct tcf_proto *tp, struct 
tcf_walker *arg)
}
 }
 
+static int fl_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
+   void *cb_priv, struct netlink_ext_ack *extack)
+{
+   struct cls_fl_head *head = rtnl_dereference(tp->root);
+   struct tc_cls_flower_offload cls_flower = {};
+   struct tcf_block *block = tp->chain->block;
+   struct fl_flow_mask *mask;
+   struct cls_fl_filter *f;
+   int err;
+
+   list_for_each_entry(mask, &head->masks, list) {
+   list_for_each_entry(f, &mask->filters, list) {
+   if (tc_skip_hw(f->flags))
+   continue;
+
+   tc_cls_common_offload_init(&cls_flower.common, tp,
+  f->flags, extack);
+   cls_flower.command = add ?
+   TC_CLSFLOWER_REPLACE : TC_CLSFLOWER_DESTROY;
+   cls_flower.cookie = (unsigned long)f;
+   cls_flower.dissector = &mask->dissector;
+   cls_flower.mask = &f->mkey;
+   cls_flower.key = &f->key;
+   cls_flower.exts = &f->exts;
+   cls_flower.classid = f->res.classid;
+
+   err = cb(TC_SETUP_CLSFLOWER, &cls_flower, cb_priv);
+   if (err) {
+   if (add && tc_skip_sw(f->flags))
+   return err;
+   continue;
+   }
+
+   tc_cls_offload_cnt_update(block, &f->in_hw_count,
+ &f->flags, add);
+   }
+   }
+
+   return 0;
+}
+
 static int fl_dump_key_val(struct sk_buff *skb,
   void *val, int val_type,
   void *mask, int mask_type, int len)
@@ -1438,6 +1481,7 @@ static struct tcf_proto_ops cls_fl_ops __read_mostly = {
.change = fl_change,
.delete = fl_delete,
.walk   = fl_walk,
+   .reoffload  = fl_reoffload,
.dump   = fl_dump,
.bind_class = fl_bind_class,
.owner  = THIS_MODULE,
-- 
2.17.1



Re: [PATCH net-next 7/7] net: sched: call reoffload op on block callback reg

2018-06-25 Thread Jiri Pirko
Mon, Jun 25, 2018 at 11:10:14PM CEST, jakub.kicin...@netronome.com wrote:
>On Mon, 25 Jun 2018 22:58:32 +0200, Jiri Pirko wrote:
>> Mon, Jun 25, 2018 at 06:34:31AM CEST, jakub.kicin...@netronome.com wrote:
>> >From: John Hurley   
>> 
>> [...]
>>  
>> >+static int
>> >+tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
>> >+   void *cb_priv, bool add, bool offload_in_use,
>> >+   struct netlink_ext_ack *extack)
>> >+{
>> >+   struct tcf_chain *chain;
>> >+   struct tcf_proto *tp;
>> >+   int err;
>> >+
>> >+   list_for_each_entry(chain, &block->chain_list, list) {
>> >+   for (tp = rtnl_dereference(chain->filter_chain); tp;
>> >+tp = rtnl_dereference(tp->next)) {
>> >+   if (tp->ops->reoffload) {
>> >+   err = tp->ops->reoffload(tp, add, cb, cb_priv,
>> >+extack);
>> >+   if (err && add)
>> >+   goto err_playback_remove;
>> >+   } else if (add && offload_in_use) {
>> >+   err = -EOPNOTSUPP;
>> >+   NL_SET_ERR_MSG(extack, "Filter replay failed - 
>> >a filters doesn't support re-offloading");  
>> 
>> This msg sounds weird. Please fix it.
>
>Indeed..  How about: 
>
>"Filter HW offload failed - classifier without re-offloading support"

Sounds good.

>
>> Otherwise this looks very good to me! Thanks!
>
>Cool, thanks for the comments!  I will respin shortly.


Re: [PATCH rdma-next 01/12] RDMA/uverbs: Protect from attempts to create flows on unsupported QP

2018-06-25 Thread Jason Gunthorpe
On Sun, Jun 24, 2018 at 11:23:42AM +0300, Leon Romanovsky wrote:
> From: Leon Romanovsky 
> 
> Flows can be created on UD and RAW_PACKET QP types. Attempts to provide
> other QP types as input cause various unpredictable failures.
> 
> The reason is that in order to support all the various types (e.g. XRC),
> we are supposed to use the real_qp handle rather than the qp handle and let
> the driver/FW fail such (XRC) flows. While valuable, the simpler
> and safer variant is to ban all QP types except UD and RAW_PACKET,
> instead of relying on driver/FW.
> 
> Cc:  # 3.11
> Fixes: 436f2ad05a0b ("IB/core: Export ib_create/destroy_flow through uverbs")
> Cc: syzkaller 
> Reported-by: Noa Osherovich 
> Signed-off-by: Leon Romanovsky 
> ---
>  drivers/infiniband/core/uverbs_cmd.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/infiniband/core/uverbs_cmd.c 
> b/drivers/infiniband/core/uverbs_cmd.c
> index 779892b63729..c842a9423fbf 100644
> --- a/drivers/infiniband/core/uverbs_cmd.c
> +++ b/drivers/infiniband/core/uverbs_cmd.c
> @@ -3553,14 +3553,20 @@ int ib_uverbs_ex_create_flow(struct ib_uverbs_file 
> *file,
>   goto err_free_attr;
>   }
>  
> - qp = uobj_get_obj_read(qp, UVERBS_OBJECT_QP, cmd.qp_handle, 
> file->ucontext);
> + qp = uobj_get_obj_read(qp, UVERBS_OBJECT_QP, cmd.qp_handle,
> +file->ucontext);

This hunk is just whitespace changing

>   if (!qp) {
>   err = -EINVAL;
>   goto err_uobj;
>   }
>  
> + if (qp->qp_type != IB_QPT_UD && qp->qp_type != IB_QPT_RAW_PACKET) {
> + err = -EINVAL;
> + goto err_put;
> + }
> +
>   flow_attr = kzalloc(struct_size(flow_attr, flows,
> - cmd.flow_attr.num_of_specs), GFP_KERNEL);
> + cmd.flow_attr.num_of_specs), 
> GFP_KERNEL);

Same here.

I dropped the two hunks and applied this to for-rc since it has
stable tags.

Jason


Re: [PATCH net-next 7/7] net: sched: call reoffload op on block callback reg

2018-06-25 Thread Jakub Kicinski
On Mon, 25 Jun 2018 22:58:32 +0200, Jiri Pirko wrote:
> Mon, Jun 25, 2018 at 06:34:31AM CEST, jakub.kicin...@netronome.com wrote:
> >From: John Hurley   
> 
> [...] 
>   
> >+static int
> >+tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
> >+void *cb_priv, bool add, bool offload_in_use,
> >+struct netlink_ext_ack *extack)
> >+{
> >+struct tcf_chain *chain;
> >+struct tcf_proto *tp;
> >+int err;
> >+
> >+list_for_each_entry(chain, &block->chain_list, list) {
> >+for (tp = rtnl_dereference(chain->filter_chain); tp;
> >+ tp = rtnl_dereference(tp->next)) {
> >+if (tp->ops->reoffload) {
> >+err = tp->ops->reoffload(tp, add, cb, cb_priv,
> >+ extack);
> >+if (err && add)
> >+goto err_playback_remove;
> >+} else if (add && offload_in_use) {
> >+err = -EOPNOTSUPP;
> >+NL_SET_ERR_MSG(extack, "Filter replay failed - 
> >a filters doesn't support re-offloading");  
> 
> This msg sounds weird. Please fix it.

Indeed..  How about: 

"Filter HW offload failed - classifier without re-offloading support"

> Otherwise this looks very good to me! Thanks!

Cool, thanks for the comments!  I will respin shortly.


[patch iproute2/net-next] tc: introduce support for chain templates

2018-06-25 Thread Jiri Pirko
From: Jiri Pirko 

Signed-off-by: Jiri Pirko 
---
 include/uapi/linux/rtnetlink.h |  7 
 man/man8/tc.8  | 26 
 tc/tc_filter.c | 95 ++
 tc/tc_monitor.c|  5 ++-
 4 files changed, 105 insertions(+), 28 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index c3a7d8ecc7b9..dddb05e5cca8 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -150,6 +150,13 @@ enum {
RTM_NEWCACHEREPORT = 96,
 #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT
 
+   RTM_NEWCHAINTMPLT = 100,
+#define RTM_NEWCHAINTMPLT RTM_NEWCHAINTMPLT
+   RTM_DELCHAINTMPLT,
+#define RTM_DELCHAINTMPLT RTM_DELCHAINTMPLT
+   RTM_GETCHAINTMPLT,
+#define RTM_GETCHAINTMPLT RTM_GETCHAINTMPLT
+
__RTM_MAX,
 #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/man/man8/tc.8 b/man/man8/tc.8
index 840880fbdba6..291eeb2ce8ca 100644
--- a/man/man8/tc.8
+++ b/man/man8/tc.8
@@ -58,6 +58,22 @@ tc \- show / manipulate traffic control settings
 .B flowid
 \fIflow-id\fR
 
+.B tc
+.RI "[ " OPTIONS " ]"
+.B filter template [ add | delete | get ] dev
+\fIDEV\fR
+.B [ parent
+\fIqdisc-id\fR
+.B | root ]\fR filtertype
+[ filtertype specific parameters ]
+
+.B tc
+.RI "[ " OPTIONS " ]"
+.B filter template [ add | delete | get ] block
+\fIBLOCK_INDEX\fR filtertype
+[ filtertype specific parameters ]
+
+
 .B tc
 .RI "[ " OPTIONS " ]"
 .RI "[ " FORMAT " ]"
@@ -80,6 +96,16 @@ tc \- show / manipulate traffic control settings
 .RI "[ " OPTIONS " ]"
 .B filter show block
 \fIBLOCK_INDEX\fR
+.P
+.B tc
+.RI "[ " OPTIONS " ]"
+.B filter template show dev
+\fIDEV\fR
+.P
+.B tc
+.RI "[ " OPTIONS " ]"
+.B filter template show block
+\fIBLOCK_INDEX\fR
 
 .P
 .B tc
diff --git a/tc/tc_filter.c b/tc/tc_filter.c
index c5bb0bffe19b..94d7692fe2fa 100644
--- a/tc/tc_filter.c
+++ b/tc/tc_filter.c
@@ -39,6 +39,8 @@ static void usage(void)
"\n"
"   tc filter show [ dev STRING ] [ root | ingress | egress 
| parent CLASSID ]\n"
"   tc filter show [ block BLOCK_INDEX ]\n"
+   "   tc filter template [ add | del | get | show ] [ dev 
STRING ]\n"
+   "   tc filter template [ add | del | get | show ] [ block 
BLOCK_INDEX ] ]\n"
"Where:\n"
"FILTER_TYPE := { rsvp | u32 | bpf | fw | route | etc. }\n"
"FILTERID := ... format depends on classifier, see there\n"
@@ -85,7 +87,8 @@ static int tc_filter_modify(int cmd, unsigned int flags, int 
argc, char **argv,
req->n.nlmsg_type = cmd;
req->t.tcm_family = AF_UNSPEC;
 
-   if (cmd == RTM_NEWTFILTER && flags & NLM_F_CREATE)
+   if ((cmd == RTM_NEWTFILTER || cmd == RTM_NEWCHAINTMPLT) &&
+   flags & NLM_F_CREATE)
protocol = htons(ETH_P_ALL);
 
while (argc > 0) {
@@ -261,7 +264,10 @@ int print_filter(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
 
if (n->nlmsg_type != RTM_NEWTFILTER &&
n->nlmsg_type != RTM_GETTFILTER &&
-   n->nlmsg_type != RTM_DELTFILTER) {
+   n->nlmsg_type != RTM_DELTFILTER &&
+   n->nlmsg_type != RTM_NEWCHAINTMPLT &&
+   n->nlmsg_type != RTM_GETCHAINTMPLT &&
+   n->nlmsg_type != RTM_DELCHAINTMPLT) {
fprintf(stderr, "Not a filter(cmd %d)\n", n->nlmsg_type);
return 0;
}
@@ -280,20 +286,26 @@ int print_filter(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
 
open_json_object(NULL);
 
-   if (n->nlmsg_type == RTM_DELTFILTER)
+   if (n->nlmsg_type == RTM_DELTFILTER || n->nlmsg_type == 
RTM_DELCHAINTMPLT)
print_bool(PRINT_ANY, "deleted", "deleted ", true);
 
-   if (n->nlmsg_type == RTM_NEWTFILTER &&
+   if ((n->nlmsg_type == RTM_NEWTFILTER ||
+n->nlmsg_type == RTM_NEWCHAINTMPLT) &&
(n->nlmsg_flags & NLM_F_CREATE) &&
!(n->nlmsg_flags & NLM_F_EXCL))
print_bool(PRINT_ANY, "replaced", "replaced ", true);
 
-   if (n->nlmsg_type == RTM_NEWTFILTER &&
+   if ((n->nlmsg_type == RTM_NEWTFILTER ||
+n->nlmsg_type == RTM_NEWCHAINTMPLT) &&
(n->nlmsg_flags & NLM_F_CREATE) &&
(n->nlmsg_flags & NLM_F_EXCL))
print_bool(PRINT_ANY, "added", "added ", true);
 
-   print_string(PRINT_FP, NULL, "filter ", NULL);
+   if (n->nlmsg_type == RTM_NEWTFILTER ||
+   n->nlmsg_type == RTM_DELTFILTER)
+   print_string(PRINT_FP, NULL, "filter ", NULL);
+   else
+   print_string(PRINT_FP, NULL, "filter template ", NULL);
if (t->tcm_ifindex == TCM_IFINDEX_MAGIC_BLOCK) {
if (!filter_block_index ||
filter_block_index != t->tcm_block_index)
@@ -317,7 +329,9 @@ int 

[patch net-next 4/9] net: sched: cls_flower: change fl_init_dissector to accept mask and dissector

2018-06-25 Thread Jiri Pirko
From: Jiri Pirko 

This function is going to be used for templates as well, so we need to
pass the pointer separately.

Signed-off-by: Jiri Pirko 
---
 net/sched/cls_flower.c | 39 ---
 1 file changed, 20 insertions(+), 19 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 76c5516357d5..9ce4375b3252 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -793,47 +793,48 @@ static int fl_init_mask_hashtable(struct fl_flow_mask 
*mask)
FL_KEY_SET(keys, cnt, id, member);  
\
} while(0);
 
-static void fl_init_dissector(struct fl_flow_mask *mask)
+static void fl_init_dissector(struct flow_dissector *dissector,
+ struct fl_flow_key *mask)
 {
struct flow_dissector_key keys[FLOW_DISSECTOR_KEY_MAX];
size_t cnt = 0;
 
FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_CONTROL, control);
FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_BASIC, basic);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ETH_ADDRS, eth);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_IPV4_ADDRS, ipv4);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_IPV6_ADDRS, ipv6);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_PORTS, tp);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_IP, ip);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_TCP, tcp);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ICMP, icmp);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ARP, arp);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_MPLS, mpls);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_VLAN, vlan);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ENC_KEYID, enc_key_id);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ENC_IPV4_ADDRS, enc_ipv4);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ENC_IPV6_ADDRS, enc_ipv6);
-   if (FL_KEY_IS_MASKED(>key, enc_ipv4) ||
-   FL_KEY_IS_MASKED(>key, enc_ipv6))
+   if (FL_KEY_IS_MASKED(mask, enc_ipv4) ||
+   FL_KEY_IS_MASKED(mask, enc_ipv6))
FL_KEY_SET(keys, cnt, FLOW_DISSECTOR_KEY_ENC_CONTROL,
   enc_control);
-   FL_KEY_SET_IF_MASKED(>key, keys, cnt,
+   FL_KEY_SET_IF_MASKED(mask, keys, cnt,
 FLOW_DISSECTOR_KEY_ENC_PORTS, enc_tp);
 
-   skb_flow_dissector_init(>dissector, keys, cnt);
+   skb_flow_dissector_init(dissector, keys, cnt);
 }
 
 static struct fl_flow_mask *fl_create_new_mask(struct cls_fl_head *head,
@@ -852,7 +853,7 @@ static struct fl_flow_mask *fl_create_new_mask(struct cls_fl_head *head,
if (err)
goto errout_free;
 
-   fl_init_dissector(newmask);
+   fl_init_dissector(>dissector, >key);
 
INIT_LIST_HEAD_RCU(>filters);
 
-- 
2.14.4



[patch net-next 5/9] net: sched: cls_flower: implement chain templates

2018-06-25 Thread Jiri Pirko
From: Jiri Pirko 

Use the previously introduced template extension and implement
callbacks to create, destroy and dump a chain template. The existing
parsing and dumping functions are re-used. Also, check whether newly
added filters fit the template, if one is set.

Signed-off-by: Jiri Pirko 
---
 net/sched/cls_flower.c | 107 -
 1 file changed, 106 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9ce4375b3252..d64d43843a3a 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -70,6 +70,13 @@ struct fl_flow_mask {
struct list_head list;
 };
 
+struct fl_flow_tmplt {
+   struct fl_flow_key dummy_key;
+   struct fl_flow_key mask;
+   struct flow_dissector dissector;
+   struct tcf_chain *chain;
+};
+
 struct cls_fl_head {
struct rhashtable ht;
struct list_head masks;
@@ -144,6 +151,23 @@ static void fl_set_masked_key(struct fl_flow_key *mkey, struct fl_flow_key *key,
*lmkey++ = *lkey++ & *lmask++;
 }
 
+static bool fl_mask_fits_tmplt(struct fl_flow_tmplt *tmplt,
+  struct fl_flow_mask *mask)
+{
+   const long *lmask = fl_key_get_start(>key, mask);
+   const long *ltmplt;
+   int i;
+
+   if (!tmplt)
+   return true;
+   ltmplt = fl_key_get_start(>mask, mask);
+   for (i = 0; i < fl_mask_range(mask); i += sizeof(long)) {
+   if (~*ltmplt++ & *lmask++)
+   return false;
+   }
+   return true;
+}
+
 static void fl_clear_masked_range(struct fl_flow_key *key,
  struct fl_flow_mask *mask)
 {
@@ -902,6 +926,7 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp,
struct cls_fl_filter *f, struct fl_flow_mask *mask,
unsigned long base, struct nlattr **tb,
struct nlattr *est, bool ovr,
+   struct fl_flow_tmplt *tmplt,
struct netlink_ext_ack *extack)
 {
int err;
@@ -922,6 +947,11 @@ static int fl_set_parms(struct net *net, struct tcf_proto *tp,
fl_mask_update_range(mask);
fl_set_masked_key(>mkey, >key, mask);
 
+   if (!fl_mask_fits_tmplt(tmplt, mask)) {
+   NL_SET_ERR_MSG_MOD(extack, "Mask does not fit the template");
+   return -EINVAL;
+   }
+
return 0;
 }
 
@@ -932,6 +962,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
 struct netlink_ext_ack *extack)
 {
struct cls_fl_head *head = rtnl_dereference(tp->root);
+   struct fl_flow_tmplt *tmplt = tmplt_priv;
struct cls_fl_filter *fold = *arg;
struct cls_fl_filter *fnew;
struct nlattr **tb;
@@ -988,7 +1019,7 @@ static int fl_change(struct net *net, struct sk_buff *in_skb,
}
 
err = fl_set_parms(net, tp, fnew, , base, tb, tca[TCA_RATE], ovr,
-  extack);
+  tmplt, extack);
if (err)
goto errout_idr;
 
@@ -1089,6 +1120,52 @@ static void fl_walk(struct tcf_proto *tp, struct tcf_walker *arg)
}
 }
 
+static void *fl_tmplt_create(struct net *net, struct tcf_chain *chain,
+struct nlattr **tca,
+struct netlink_ext_ack *extack)
+{
+   struct fl_flow_tmplt *tmplt;
+   struct nlattr **tb;
+   int err;
+
+   if (!tca[TCA_OPTIONS])
+   return ERR_PTR(-EINVAL);
+
+   tb = kcalloc(TCA_FLOWER_MAX + 1, sizeof(struct nlattr *), GFP_KERNEL);
+   if (!tb)
+   return ERR_PTR(-ENOBUFS);
+   err = nla_parse_nested(tb, TCA_FLOWER_MAX, tca[TCA_OPTIONS],
+  fl_policy, NULL);
+   if (err)
+   goto errout_tb;
+
+   tmplt = kzalloc(sizeof(*tmplt), GFP_KERNEL);
+   if (!tmplt)
+   goto errout_tb;
+   tmplt->chain = chain;
+   err = fl_set_key(net, tb, >dummy_key, >mask, extack);
+   if (err)
+   goto errout_tmplt;
+   kfree(tb);
+
+   fl_init_dissector(>dissector, >mask);
+
+   return tmplt;
+
+errout_tmplt:
+   kfree(tmplt);
+errout_tb:
+   kfree(tb);
+   return ERR_PTR(err);
+}
+
+static void fl_tmplt_destroy(void *tmplt_priv)
+{
+   struct fl_flow_tmplt *tmplt = tmplt_priv;
+
+   kfree(tmplt);
+}
+
 static int fl_dump_key_val(struct sk_buff *skb,
   void *val, int val_type,
   void *mask, int mask_type, int len)
@@ -1435,6 +1512,31 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
return -1;
 }
 
+static int fl_tmplt_dump(struct sk_buff *skb, struct net *net, void *tmplt_priv)
+{
+   struct fl_flow_tmplt *tmplt = tmplt_priv;
+   struct fl_flow_key *key, *mask;
+   struct nlattr *nest;
+
+   nest = nla_nest_start(skb, TCA_OPTIONS);
+   

[patch net-next 6/9] net: sched: cls_flower: propagate chain template creation and destruction to drivers

2018-06-25 Thread Jiri Pirko
From: Jiri Pirko 

Introduce a couple of flower offload commands in order to propagate
template creation/destruction events down to device drivers.
Drivers may use this information to prepare HW in an optimal way
for future filter insertions.

Signed-off-by: Jiri Pirko 
---
 include/net/pkt_cls.h  |  2 ++
 net/sched/cls_flower.c | 40 
 2 files changed, 42 insertions(+)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index a3c1a2c47cd4..e83968cf9a70 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -715,6 +715,8 @@ enum tc_fl_command {
TC_CLSFLOWER_REPLACE,
TC_CLSFLOWER_DESTROY,
TC_CLSFLOWER_STATS,
+   TC_CLSFLOWER_TMPLT_CREATE,
+   TC_CLSFLOWER_TMPLT_DESTROY,
 };
 
 struct tc_cls_flower_offload {
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index d64d43843a3a..276ba25a09c3 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1120,6 +1120,43 @@ static void fl_walk(struct tcf_proto *tp, struct tcf_walker *arg)
}
 }
 
+static void fl_hw_create_tmplt(struct tcf_chain *chain,
+  struct fl_flow_tmplt *tmplt,
+  struct netlink_ext_ack *extack)
+{
+   struct tc_cls_flower_offload cls_flower = {};
+   struct tcf_block *block = chain->block;
+   struct tcf_exts dummy_exts = { 0, };
+
+   cls_flower.common.chain_index = chain->index;
+   cls_flower.command = TC_CLSFLOWER_TMPLT_CREATE;
+   cls_flower.cookie = (unsigned long) tmplt;
+   cls_flower.dissector = >dissector;
+   cls_flower.mask = >mask;
+   cls_flower.key = >dummy_key;
+   cls_flower.exts = _exts;
+
+   /* We don't care if driver (any of them) fails to handle this
+* call. It serves just as a hint for it.
+*/
+   tc_setup_cb_call(block, NULL, TC_SETUP_CLSFLOWER,
+_flower, false);
+}
+
+static void fl_hw_destroy_tmplt(struct tcf_chain *chain,
+   struct fl_flow_tmplt *tmplt)
+{
+   struct tc_cls_flower_offload cls_flower = {};
+   struct tcf_block *block = chain->block;
+
+   cls_flower.common.chain_index = chain->index;
+   cls_flower.command = TC_CLSFLOWER_TMPLT_DESTROY;
+   cls_flower.cookie = (unsigned long) tmplt;
+
+   tc_setup_cb_call(block, NULL, TC_SETUP_CLSFLOWER,
+_flower, false);
+}
+
 static void *fl_tmplt_create(struct net *net, struct tcf_chain *chain,
 struct nlattr **tca,
 struct netlink_ext_ack *extack)
@@ -1150,6 +1187,8 @@ static void *fl_tmplt_create(struct net *net, struct tcf_chain *chain,
 
fl_init_dissector(>dissector, >mask);
 
+   fl_hw_create_tmplt(chain, tmplt, extack);
+
return tmplt;
 
 errout_tmplt:
@@ -1163,6 +1202,7 @@ static void fl_tmplt_destroy(void *tmplt_priv)
 {
struct fl_flow_tmplt *tmplt = tmplt_priv;
 
+   fl_hw_destroy_tmplt(tmplt->chain, tmplt);
kfree(tmplt);
 }
 
-- 
2.14.4



[patch net-next 7/9] mlxsw: spectrum: Implement chain template hinting

2018-06-25 Thread Jiri Pirko
From: Jiri Pirko 

Since cls_flower provides information about the filter template for a
specific chain, use this information in order to prepare a region.
Use the template to find out which elements are going to be used
and pass that down to mlxsw_sp_acl_tcam_group_add(). Later on, when the
first filter is inserted, the mlxsw_sp_acl_tcam_group_use_patterns()
function will use this element usage information instead of looking
up a pattern.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  5 +++
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h | 12 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c | 12 --
 .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c| 25 ++--
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 44 --
 5 files changed, 86 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 968b88af2ef5..da19fa343d0b 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1441,6 +1441,11 @@ mlxsw_sp_setup_tc_cls_flower(struct mlxsw_sp_acl_block *acl_block,
return 0;
case TC_CLSFLOWER_STATS:
return mlxsw_sp_flower_stats(mlxsw_sp, acl_block, f);
+   case TC_CLSFLOWER_TMPLT_CREATE:
+   return mlxsw_sp_flower_tmplt_create(mlxsw_sp, acl_block, f);
+   case TC_CLSFLOWER_TMPLT_DESTROY:
+   mlxsw_sp_flower_tmplt_destroy(mlxsw_sp, acl_block, f);
+   return 0;
default:
return -EOPNOTSUPP;
}
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index 4a519d8edec8..b0a8e611e730 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -459,7 +459,8 @@ enum mlxsw_sp_acl_profile {
 struct mlxsw_sp_acl_profile_ops {
size_t ruleset_priv_size;
int (*ruleset_add)(struct mlxsw_sp *mlxsw_sp,
-  void *priv, void *ruleset_priv);
+  void *priv, void *ruleset_priv,
+  struct mlxsw_afk_element_usage *tmplt_elusage);
void (*ruleset_del)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv);
int (*ruleset_bind)(struct mlxsw_sp *mlxsw_sp, void *ruleset_priv,
struct mlxsw_sp_port *mlxsw_sp_port,
@@ -514,7 +515,8 @@ mlxsw_sp_acl_ruleset_lookup(struct mlxsw_sp *mlxsw_sp,
 struct mlxsw_sp_acl_ruleset *
 mlxsw_sp_acl_ruleset_get(struct mlxsw_sp *mlxsw_sp,
 struct mlxsw_sp_acl_block *block, u32 chain_index,
-enum mlxsw_sp_acl_profile profile);
+enum mlxsw_sp_acl_profile profile,
+struct mlxsw_afk_element_usage *tmplt_elusage);
 void mlxsw_sp_acl_ruleset_put(struct mlxsw_sp *mlxsw_sp,
  struct mlxsw_sp_acl_ruleset *ruleset);
 u16 mlxsw_sp_acl_ruleset_group_id(struct mlxsw_sp_acl_ruleset *ruleset);
@@ -594,6 +596,12 @@ void mlxsw_sp_flower_destroy(struct mlxsw_sp *mlxsw_sp,
 int mlxsw_sp_flower_stats(struct mlxsw_sp *mlxsw_sp,
  struct mlxsw_sp_acl_block *block,
  struct tc_cls_flower_offload *f);
+int mlxsw_sp_flower_tmplt_create(struct mlxsw_sp *mlxsw_sp,
+struct mlxsw_sp_acl_block *block,
+struct tc_cls_flower_offload *f);
+void mlxsw_sp_flower_tmplt_destroy(struct mlxsw_sp *mlxsw_sp,
+  struct mlxsw_sp_acl_block *block,
+  struct tc_cls_flower_offload *f);
 
 /* spectrum_qdisc.c */
 int mlxsw_sp_tc_qdisc_init(struct mlxsw_sp_port *mlxsw_sp_port);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
index 79b1fa27a9a4..ea42605c451d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
@@ -319,7 +319,8 @@ int mlxsw_sp_acl_block_unbind(struct mlxsw_sp *mlxsw_sp,
 static struct mlxsw_sp_acl_ruleset *
 mlxsw_sp_acl_ruleset_create(struct mlxsw_sp *mlxsw_sp,
struct mlxsw_sp_acl_block *block, u32 chain_index,
-   const struct mlxsw_sp_acl_profile_ops *ops)
+   const struct mlxsw_sp_acl_profile_ops *ops,
+   struct mlxsw_afk_element_usage *tmplt_elusage)
 {
struct mlxsw_sp_acl *acl = mlxsw_sp->acl;
struct mlxsw_sp_acl_ruleset *ruleset;
@@ -339,7 +340,8 @@ mlxsw_sp_acl_ruleset_create(struct mlxsw_sp *mlxsw_sp,
if (err)
goto err_rhashtable_init;
 
-   err = ops->ruleset_add(mlxsw_sp, acl->priv, ruleset->priv);
+   err = ops->ruleset_add(mlxsw_sp, acl->priv, ruleset->priv,
+ 

[patch net-next 3/9] net: sched: cls_flower: move key/mask dumping into a separate function

2018-06-25 Thread Jiri Pirko
From: Jiri Pirko 

Push key/mask dumping from fl_dump() into a separate function,
fl_dump_key(), which will be reused for template dumping.

Signed-off-by: Jiri Pirko 
---
 net/sched/cls_flower.c | 62 ++
 1 file changed, 37 insertions(+), 25 deletions(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 09d6c6e67f9d..76c5516357d5 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -1217,29 +1217,9 @@ static int fl_dump_key_flags(struct sk_buff *skb, u32 flags_key, u32 flags_mask)
return nla_put(skb, TCA_FLOWER_KEY_FLAGS_MASK, 4, &_mask);
 }
 
-static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
-  struct sk_buff *skb, struct tcmsg *t)
+static int fl_dump_key(struct sk_buff *skb, struct net *net,
+  struct fl_flow_key *key, struct fl_flow_key *mask)
 {
-   struct cls_fl_filter *f = fh;
-   struct nlattr *nest;
-   struct fl_flow_key *key, *mask;
-
-   if (!f)
-   return skb->len;
-
-   t->tcm_handle = f->handle;
-
-   nest = nla_nest_start(skb, TCA_OPTIONS);
-   if (!nest)
-   goto nla_put_failure;
-
-   if (f->res.classid &&
-   nla_put_u32(skb, TCA_FLOWER_CLASSID, f->res.classid))
-   goto nla_put_failure;
-
-   key = >key;
-   mask = >mask->key;
-
if (mask->indev_ifindex) {
struct net_device *dev;
 
@@ -1248,9 +1228,6 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
goto nla_put_failure;
}
 
-   if (!tc_skip_hw(f->flags))
-   fl_hw_update_stats(tp, f);
-
if (fl_dump_key_val(skb, key->eth.dst, TCA_FLOWER_KEY_ETH_DST,
mask->eth.dst, TCA_FLOWER_KEY_ETH_DST_MASK,
sizeof(key->eth.dst)) ||
@@ -1404,6 +1381,41 @@ static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
if (fl_dump_key_flags(skb, key->control.flags, mask->control.flags))
goto nla_put_failure;
 
+   return 0;
+
+nla_put_failure:
+   return -EMSGSIZE;
+}
+
+static int fl_dump(struct net *net, struct tcf_proto *tp, void *fh,
+  struct sk_buff *skb, struct tcmsg *t)
+{
+   struct cls_fl_filter *f = fh;
+   struct nlattr *nest;
+   struct fl_flow_key *key, *mask;
+
+   if (!f)
+   return skb->len;
+
+   t->tcm_handle = f->handle;
+
+   nest = nla_nest_start(skb, TCA_OPTIONS);
+   if (!nest)
+   goto nla_put_failure;
+
+   if (f->res.classid &&
+   nla_put_u32(skb, TCA_FLOWER_CLASSID, f->res.classid))
+   goto nla_put_failure;
+
+   key = >key;
+   mask = >mask->key;
+
+   if (fl_dump_key(skb, net, key, mask))
+   goto nla_put_failure;
+
+   if (!tc_skip_hw(f->flags))
+   fl_hw_update_stats(tp, f);
+
if (f->flags && nla_put_u32(skb, TCA_FLOWER_FLAGS, f->flags))
goto nla_put_failure;
 
-- 
2.14.4



[patch net-next 2/9] net: sched: introduce chain templates

2018-06-25 Thread Jiri Pirko
From: Jiri Pirko 

Introduce a group of new tc-rtnl commands that allow the user to set a
per-chain template. Templates lock down individual chains to a particular
classifier type/options combination. The classifier needs to support
templates, otherwise the kernel replies with an error.

For example, to lock chain 22 so that it allows only filters of type
flower matching on destination mac address, the user needs to do:
  chain 22 flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF

In case the chain already contains some filters, it is not possible to
add or remove a template. That is permitted only for empty chains.

Alongside the add/del commands, also introduce get/dump commands and
notifications.

Signed-off-by: Jiri Pirko 
---
 include/net/sch_generic.h  |  14 +-
 include/uapi/linux/rtnetlink.h |   7 +
 net/sched/cls_api.c| 371 -
 net/sched/cls_basic.c  |   2 +-
 net/sched/cls_bpf.c|   3 +-
 net/sched/cls_cgroup.c |   2 +-
 net/sched/cls_flow.c   |   3 +-
 net/sched/cls_flower.c |   3 +-
 net/sched/cls_fw.c |   3 +-
 net/sched/cls_matchall.c   |   3 +-
 net/sched/cls_route.c  |   2 +-
 net/sched/cls_rsvp.h   |   3 +-
 net/sched/cls_tcindex.c|   2 +-
 net/sched/cls_u32.c|   2 +-
 security/selinux/nlmsgtab.c|   2 +-
 15 files changed, 405 insertions(+), 17 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 6488daa32f82..f2a27d41fed5 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -235,6 +235,8 @@ struct tcf_result {
};
 };
 
+struct tcf_chain;
+
 struct tcf_proto_ops {
struct list_headhead;
charkind[IFNAMSIZ];
@@ -250,17 +252,25 @@ struct tcf_proto_ops {
int (*change)(struct net *net, struct sk_buff *,
struct tcf_proto*, unsigned long,
u32 handle, struct nlattr **,
-   void **, bool,
+   void **, bool, void *tmplt_priv,
struct netlink_ext_ack *);
int (*delete)(struct tcf_proto *tp, void *arg,
  bool *last,
  struct netlink_ext_ack *);
void(*walk)(struct tcf_proto*, struct tcf_walker *arg);
void(*bind_class)(void *, u32, unsigned long);
+   void *  (*tmplt_create)(struct net *net,
+   struct tcf_chain *chain,
+   struct nlattr **tca,
+   struct netlink_ext_ack *extack);
+   void(*tmplt_destroy)(void *tmplt_priv);
 
/* rtnetlink specific */
int (*dump)(struct net*, struct tcf_proto*, void *,
struct sk_buff *skb, struct tcmsg*);
+   int (*tmplt_dump)(struct sk_buff *skb,
+ struct net *net,
+ void *tmplt_priv);
 
struct module   *owner;
 };
@@ -299,6 +309,8 @@ struct tcf_chain {
struct tcf_block *block;
u32 index; /* chain index */
unsigned int refcnt;
+   const struct tcf_proto_ops *tmplt_ops;
+   void *tmplt_priv;
 };
 
 struct tcf_block {
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index 7d8502313c99..45fd8cc1fdb2 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -150,6 +150,13 @@ enum {
RTM_NEWCACHEREPORT = 96,
 #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT
 
+   RTM_NEWCHAINTMPLT = 100,
+#define RTM_NEWCHAINTMPLT RTM_NEWCHAINTMPLT
+   RTM_DELCHAINTMPLT,
+#define RTM_DELCHAINTMPLT RTM_DELCHAINTMPLT
+   RTM_GETCHAINTMPLT,
+#define RTM_GETCHAINTMPLT RTM_GETCHAINTMPLT
+
__RTM_MAX,
 #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1)
 };
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index db45931bbada..0c88520f80f2 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -227,7 +227,7 @@ static void tcf_chain_head_change(struct tcf_chain *chain,
tcf_chain_head_change_item(item, tp_head);
 }
 
-static void tcf_chain_flush(struct tcf_chain *chain)
+static void tcf_chain_flush(struct tcf_chain *chain, bool destroy_template)
 {
struct tcf_proto *tp = rtnl_dereference(chain->filter_chain);
 
@@ -238,6 +238,11 @@ static void tcf_chain_flush(struct tcf_chain *chain)
tp = rtnl_dereference(chain->filter_chain);
tcf_chain_put(chain);
}
+   if (destroy_template && chain->tmplt_ops) {
+   

[patch net-next 1/9] net: sched: push ops lookup bits into tcf_proto_lookup_ops()

2018-06-25 Thread Jiri Pirko
From: Jiri Pirko 

Push all the bits that take care of ops lookup, including module
loading, out of the tcf_proto_create() function and into
tcf_proto_lookup_ops().

Signed-off-by: Jiri Pirko 
---
 net/sched/cls_api.c | 53 +++--
 1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index cdc3c87c53e6..db45931bbada 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -39,7 +39,7 @@ static DEFINE_RWLOCK(cls_mod_lock);
 
 /* Find classifier type by string name */
 
-static const struct tcf_proto_ops *tcf_proto_lookup_ops(const char *kind)
+static const struct tcf_proto_ops *__tcf_proto_lookup_ops(const char *kind)
 {
const struct tcf_proto_ops *t, *res = NULL;
 
@@ -57,6 +57,33 @@ static const struct tcf_proto_ops *tcf_proto_lookup_ops(const char *kind)
return res;
 }
 
+static const struct tcf_proto_ops *
+tcf_proto_lookup_ops(const char *kind, struct netlink_ext_ack *extack)
+{
+   const struct tcf_proto_ops *ops;
+
+   ops = __tcf_proto_lookup_ops(kind);
+   if (ops)
+   return ops;
+#ifdef CONFIG_MODULES
+   rtnl_unlock();
+   request_module("cls_%s", kind);
+   rtnl_lock();
+   ops = __tcf_proto_lookup_ops(kind);
+   /* We dropped the RTNL semaphore in order to perform
+* the module load. So, even if we succeeded in loading
+* the module we have to replay the request. We indicate
+* this using -EAGAIN.
+*/
+   if (ops) {
+   module_put(ops->owner);
+   return ERR_PTR(-EAGAIN);
+   }
+#endif
+   NL_SET_ERR_MSG(extack, "TC classifier not found");
+   return ERR_PTR(-ENOENT);
+}
+
 /* Register(unregister) new classifier type */
 
 int register_tcf_proto_ops(struct tcf_proto_ops *ops)
@@ -133,27 +160,9 @@ static struct tcf_proto *tcf_proto_create(const char *kind, u32 protocol,
if (!tp)
return ERR_PTR(-ENOBUFS);
 
-   err = -ENOENT;
-   tp->ops = tcf_proto_lookup_ops(kind);
-   if (!tp->ops) {
-#ifdef CONFIG_MODULES
-   rtnl_unlock();
-   request_module("cls_%s", kind);
-   rtnl_lock();
-   tp->ops = tcf_proto_lookup_ops(kind);
-   /* We dropped the RTNL semaphore in order to perform
-* the module load. So, even if we succeeded in loading
-* the module we have to replay the request. We indicate
-* this using -EAGAIN.
-*/
-   if (tp->ops) {
-   module_put(tp->ops->owner);
-   err = -EAGAIN;
-   } else {
-   NL_SET_ERR_MSG(extack, "TC classifier not found");
-   err = -ENOENT;
-   }
-#endif
+   tp->ops = tcf_proto_lookup_ops(kind, extack);
+   if (IS_ERR(tp->ops)) {
+   err = PTR_ERR(tp->ops);
goto errout;
}
tp->classify = tp->ops->classify;
-- 
2.14.4



[patch net-next 0/9] net: sched: introduce chain templates support with offloading to mlxsw

2018-06-25 Thread Jiri Pirko
From: Jiri Pirko 

For TC clsact offload these days, some HW drivers need
to hold a magic ball. The reason is that, with the first rule inserted into
HW, they need to guess what fields will be used for the matching. If
later on this guess proves to be wrong and the user adds a filter with a
different field to match, there's a problem. Mlxsw resolves this now with a
couple of patterns. Those try to cover as many match fields as possible.
This approach is far from optimal, both performance-wise and scale-wise.
Also, there are combinations of filters that, inserted in a certain order,
won't succeed.

Most of the time, when the user inserts filters into a chain, he knows right
away what the filters are going to look like - what type and options they
will have. For example, he knows that he will only insert filters of type
flower matching on destination IP address. He can specify a template that
covers all the filters in the chain.

This patchset gives the user the possibility to provide such a
template to the kernel and propagates it all the way down to device
drivers.

See the examples below.

Create dummy device with clsact first:
# ip link add type dummy
# tc qdisc add dev dummy0 clsact

There is no template assigned by default:
# tc filter template show dev dummy0 ingress

Add a template of type flower allowing to insert rules matching on last
2 bytes of destination mac address:
# tc filter template add dev dummy0 ingress proto ip flower dst_mac 00:00:00:00:00:00/00:00:00:00:FF:FF

The template is now showed in the list:
# tc filter template show dev dummy0 ingress
filter flower chain 0
  dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
  eth_type ipv4

Add another template, this time for chain number 22:
# tc filter template add dev dummy0 ingress proto ip chain 22 flower dst_ip 0.0.0.0/16
# tc filter template show dev dummy0 ingress
filter flower chain 0
  dst_mac 00:00:00:00:00:00/00:00:00:00:ff:ff
  eth_type ipv4
filter flower chain 22
  eth_type ipv4
  dst_ip 0.0.0.0/16

Add a filter that fits the template:
# tc filter add dev dummy0 ingress proto ip flower dst_mac aa:bb:cc:dd:ee:ff/00:00:00:00:00:0F action drop

Addition of filters that does not fit the template would fail:
# tc filter add dev dummy0 ingress proto ip flower dst_mac aa:11:22:33:44:55/00:00:00:FF:00:00 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1
# tc filter add dev dummy0 ingress proto ip flower dst_ip 10.0.0.1 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1

Additions of filters to chain 22:
# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/8 action drop
# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1
# tc filter add dev dummy0 ingress proto ip chain 22 flower dst_ip 10.0.0.1/24 action drop
Error: Mask does not fit the template.
We have an error talking to the kernel, -1

Removal of a template from non-empty chain would fail:
# tc filter template del dev dummy0 ingress
Error: The chain is not empty, unable to delete template.
We have an error talking to the kernel, -1

Once the chain is flushed, the template could be removed:
# tc filter del dev dummy0 ingress
# tc filter template del dev dummy0 ingress

Jiri Pirko (9):
  net: sched: push ops lookup bits into tcf_proto_lookup_ops()
  net: sched: introduce chain templates
  net: sched: cls_flower: move key/mask dumping into a separate function
  net: sched: cls_flower: change fl_init_dissector to accept mask and
dissector
  net: sched: cls_flower: implement chain templates
  net: sched: cls_flower: propagate chain template creation and
destruction to drivers
  mlxsw: spectrum: Implement chain template hinting
  selftests: forwarding: move shblock tc support check to a separate
helper
  selftests: forwarding: add tests for TC chain templates

 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |   5 +
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  12 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |  12 +-
 .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c|  25 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  44 ++-
 include/net/pkt_cls.h  |   2 +
 include/net/sch_generic.h  |  14 +-
 include/uapi/linux/rtnetlink.h |   7 +
 net/sched/cls_api.c| 424 +++--
 net/sched/cls_basic.c  |   2 +-
 net/sched/cls_bpf.c|   3 +-
 net/sched/cls_cgroup.c |   2 +-
 net/sched/cls_flow.c   |   3 +-
 net/sched/cls_flower.c | 251 +---
 net/sched/cls_fw.c |   3 +-
 net/sched/cls_matchall.c   |   3 +-
 net/sched/cls_route.c 

[patch net-next 9/9] selftests: forwarding: add tests for TC chain templates

2018-06-25 Thread Jiri Pirko
From: Jiri Pirko 

Add basic sanity tests for TC chain templates.

Signed-off-by: Jiri Pirko 
---
 tools/testing/selftests/net/forwarding/lib.sh  |   9 ++
 .../selftests/net/forwarding/tc_chaintemplates.sh  | 160 +
 2 files changed, 169 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/tc_chaintemplates.sh

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index a736d1d7ecdb..128a5b5a8ea9 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -39,6 +39,15 @@ check_tc_shblock_support()
fi
 }
 
+check_tc_chaintemplate_support()
+{
+   tc filter help 2>&1 | grep template &> /dev/null
+   if [[ $? -ne 0 ]]; then
+   echo "SKIP: iproute2 too old; tc is missing chain template support"
+   exit 1
+   fi
+}
+
 if [[ "$(id -u)" -ne 0 ]]; then
echo "SKIP: need root privileges"
exit 0
diff --git a/tools/testing/selftests/net/forwarding/tc_chaintemplates.sh b/tools/testing/selftests/net/forwarding/tc_chaintemplates.sh
new file mode 100755
index ..21f2c18e973a
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/tc_chaintemplates.sh
@@ -0,0 +1,160 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+ALL_TESTS="template_create_destroy template_filter_fits \
+  template_create_nonempty template_destroy_nonempty"
+NUM_NETIFS=2
+source tc_common.sh
+source lib.sh
+
+h1_create()
+{
+   simple_if_init $h1 192.0.2.1/24
+}
+
+h1_destroy()
+{
+   simple_if_fini $h1 192.0.2.1/24
+}
+
+h2_create()
+{
+   simple_if_init $h2 192.0.2.2/24
+   tc qdisc add dev $h2 clsact
+}
+
+h2_destroy()
+{
+   tc qdisc del dev $h2 clsact
+   simple_if_fini $h2 192.0.2.2/24
+}
+
+template_create_destroy()
+{
+   RET=0
+
+   tc filter template add dev $h2 ingress protocol ip \
+   flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF
+   check_err $? "Failed to create template for default chain"
+
+   tc filter template add dev $h2 ingress chain 1 protocol ip \
+   flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF
+   check_err $? "Failed to create template for chain 1"
+
+   tc filter template del dev $h2 ingress
+   check_err $? "Failed to destroy template for default chain"
+
+   tc filter template del dev $h2 ingress chain 1
+   check_err $? "Failed to destroy template for chain 1"
+
+   log_test "template create destroy"
+}
+
+template_filter_fits()
+{
+   RET=0
+
+   tc filter template add dev $h2 ingress protocol ip \
+   flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF &> /dev/null
+   tc filter template add dev $h2 ingress chain 1 protocol ip \
+   flower src_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF &> /dev/null
+
+   tc filter add dev $h2 ingress protocol ip pref 1 handle 1101 \
+   flower dst_mac $h2mac action drop
+   check_err $? "Failed to insert filter which fits template"
+
+   tc filter add dev $h2 ingress protocol ip pref 1 handle 1102 \
+   flower src_mac $h2mac action drop &> /dev/null
+   check_fail $? "Incorrectly succeeded to insert filter which does not fit template"
+
+   tc filter add dev $h2 ingress chain 1 protocol ip pref 1 handle 1101 \
+   flower src_mac $h2mac action drop
+   check_err $? "Failed to insert filter which fits template"
+
+   tc filter add dev $h2 ingress chain 1 protocol ip pref 1 handle 1102 \
+   flower dst_mac $h2mac action drop &> /dev/null
+   check_fail $? "Incorrectly succeded to insert filter which does not 
template"
+
+   tc filter del dev $h2 ingress chain 1 protocol ip pref 1 handle 1102 \
+   flower &> /dev/null
+   tc filter del dev $h2 ingress chain 1 protocol ip pref 1 handle 1101 \
+   flower &> /dev/null
+
+   tc filter del dev $h2 ingress protocol ip pref 1 handle 1102 \
+   flower &> /dev/null
+   tc filter del dev $h2 ingress protocol ip pref 1 handle 1101 \
+   flower &> /dev/null
+
+   tc filter template del dev $h2 ingress chain 1
+   tc filter template del dev $h2 ingress
+
+   log_test "template filter fits"
+}
+
+template_create_nonempty()
+{
+   RET=0
+
+   tc filter add dev $h2 ingress protocol ip pref 1 handle 1101 \
+   flower dst_mac $h2mac action drop
+   tc filter template add dev $h2 ingress protocol ip \
+   flower dst_mac 00:00:00:00:00:00/FF:FF:FF:FF:FF:FF &> /dev/null
+   check_fail $? "Incorrectly succeeded to create template for non-empty chain"
+
+   tc filter template del dev $h2 ingress &> /dev/null
+   tc filter del dev $h2 ingress protocol ip pref 1 handle 1101 flower
+
+   log_test "template create non-empty"
+}
+
+template_destroy_nonempty()
+{
+   RET=0
+
+   tc filter template 

[patch net-next 8/9] selftests: forwarding: move shblock tc support check to a separate helper

2018-06-25 Thread Jiri Pirko
From: Jiri Pirko 

The shared block support is only needed for tc_shblock.sh. No need to
require that for other tests.

Signed-off-by: Jiri Pirko 
---
 tools/testing/selftests/net/forwarding/lib.sh | 3 +++
 tools/testing/selftests/net/forwarding/tc_shblocks.sh | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/tools/testing/selftests/net/forwarding/lib.sh 
b/tools/testing/selftests/net/forwarding/lib.sh
index 7b18a53aa556..a736d1d7ecdb 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -28,7 +28,10 @@ check_tc_version()
echo "SKIP: iproute2 too old; tc is missing JSON support"
exit 1
fi
+}
 
+check_tc_shblock_support()
+{
tc filter help 2>&1 | grep block &> /dev/null
if [[ $? -ne 0 ]]; then
echo "SKIP: iproute2 too old; tc is missing shared block 
support"
diff --git a/tools/testing/selftests/net/forwarding/tc_shblocks.sh 
b/tools/testing/selftests/net/forwarding/tc_shblocks.sh
index b5b917203815..9826a446e2c0 100755
--- a/tools/testing/selftests/net/forwarding/tc_shblocks.sh
+++ b/tools/testing/selftests/net/forwarding/tc_shblocks.sh
@@ -105,6 +105,8 @@ cleanup()
ip link set $swp2 address $swp2origmac
 }
 
+check_tc_shblock_support
+
 trap cleanup EXIT
 
 setup_prepare
-- 
2.14.4



Re: [PATCH net-next 7/7] net: sched: call reoffload op on block callback reg

2018-06-25 Thread Jiri Pirko
Mon, Jun 25, 2018 at 06:34:31AM CEST, jakub.kicin...@netronome.com wrote:
>From: John Hurley 

[...]   

>+static int
>+tcf_block_playback_offloads(struct tcf_block *block, tc_setup_cb_t *cb,
>+  void *cb_priv, bool add, bool offload_in_use,
>+  struct netlink_ext_ack *extack)
>+{
>+  struct tcf_chain *chain;
>+  struct tcf_proto *tp;
>+  int err;
>+
>+  list_for_each_entry(chain, >chain_list, list) {
>+  for (tp = rtnl_dereference(chain->filter_chain); tp;
>+   tp = rtnl_dereference(tp->next)) {
>+  if (tp->ops->reoffload) {
>+  err = tp->ops->reoffload(tp, add, cb, cb_priv,
>+   extack);
>+  if (err && add)
>+  goto err_playback_remove;
>+  } else if (add && offload_in_use) {
>+  err = -EOPNOTSUPP;
>+  NL_SET_ERR_MSG(extack, "Filter replay failed - 
>a filters doesn't support re-offloading");

This msg sounds weird. Please fix it.

Otherwise this looks very good to me! Thanks!


Re: [PATCH net 2/2] nfp: reject binding to shared blocks

2018-06-25 Thread Jiri Pirko
Mon, Jun 25, 2018 at 10:50:14PM CEST, jakub.kicin...@netronome.com wrote:
>On Mon, 25 Jun 2018 22:40:21 +0200, Jiri Pirko wrote:
>> Mon, Jun 25, 2018 at 10:22:46PM CEST, jakub.kicin...@netronome.com wrote:
>> >From: John Hurley 
>> >
>> >TC shared blocks allow multiple qdiscs to be grouped together and filters
>> >shared between them. Currently the chains of filters attached to a block
>> >are only flushed when the block is removed. If a qdisc is removed from a
>> >block but the block still exists, flow del messages are not passed to the
>> >callback registered for that qdisc. For the NFP, this presents the
>> >possibility of rules still existing in hw when they should be removed.
>> >
>> >Prevent binding to shared blocks until the kernel can send per qdisc del
>> >messages when block unbinds occur.  
>> 
>> This is not an nfp-specific problem and should be handled differently. The
>> driver has information about offloaded filters. On unbind, it has
>> enough info to do the flush, doesn't it?
>
>Certainly.  But this fix is simpler and sufficient.  We need to
>backport it back to 4.16.  If we have to go through driver tables 
>and flush filters we may as well merge the reoffload series to net.

Oh, I missed this is for net. Sorry.

Acked-by: Jiri Pirko 



Re: [PATCH net 2/2] nfp: reject binding to shared blocks

2018-06-25 Thread Jakub Kicinski
On Mon, 25 Jun 2018 22:40:21 +0200, Jiri Pirko wrote:
> Mon, Jun 25, 2018 at 10:22:46PM CEST, jakub.kicin...@netronome.com wrote:
> >From: John Hurley 
> >
> >TC shared blocks allow multiple qdiscs to be grouped together and filters
> >shared between them. Currently the chains of filters attached to a block
> >are only flushed when the block is removed. If a qdisc is removed from a
> >block but the block still exists, flow del messages are not passed to the
> >callback registered for that qdisc. For the NFP, this presents the
> >possibility of rules still existing in hw when they should be removed.
> >
> >Prevent binding to shared blocks until the kernel can send per qdisc del
> >messages when block unbinds occur.  
> 
> This is not an nfp-specific problem and should be handled differently. The
> driver has information about offloaded filters. On unbind, it has
> enough info to do the flush, doesn't it?

Certainly.  But this fix is simpler and sufficient.  We need to
backport it back to 4.16.  If we have to go through driver tables 
and flush filters we may as well merge the reoffload series to net.


Re: [PATCH net-next 3/7] net: sched: cls_flower: implement offload tcf_proto_op

2018-06-25 Thread Jiri Pirko
Mon, Jun 25, 2018 at 06:34:27AM CEST, jakub.kicin...@netronome.com wrote:
>From: John Hurley 
>
>Add the reoffload tcf_proto_op in flower to generate an offload message
>for each filter in the given tcf_proto. Call the specified callback with
>this new offload message. The function only returns an error if the
>callback rejects adding a 'hardware only' rule.
>
>A filter contains a flag to indicate if it is in hardware or not. To
>ensure the reoffload function properly maintains this flag, keep a
>reference counter for the number of instances of the filter that are in
>hardware. Only update the flag when this counter changes from or to 0. Add
>a generic helper function to implement this behaviour.
>
>Signed-off-by: John Hurley 
>Signed-off-by: Jakub Kicinski 
>---
> include/net/sch_generic.h | 15 +
> net/sched/cls_flower.c| 44 +++
> 2 files changed, 59 insertions(+)
>
>diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>index 88ed64f60056..c0bd11a928ed 100644
>--- a/include/net/sch_generic.h
>+++ b/include/net/sch_generic.h
>@@ -336,6 +336,21 @@ static inline void tcf_block_offload_dec(struct tcf_block 
>*block, u32 *flags)
>   block->offloadcnt--;
> }
> 
>+static inline void
>+tc_cls_offload_cnt_update(struct tcf_block *block, u32 *cnt, u32 *flags,

Why u32? This is not uapi or hw facing. Just use "unsigned int".

[...]


Re: [PATCH v3,net-next] vlan: implement vlan id and protocol changes

2018-06-25 Thread David Ahern
On 6/25/18 4:30 AM, Chas Williams wrote:
> vlan_changelink silently ignores attempts to change the vlan id
> or protocol id of an existing vlan interface.  Implement by adding
> the new vlan id and protocol to the interface's vlan group and then
> removing the old vlan id and protocol from the vlan group.
> 
> Signed-off-by: Chas Williams <3ch...@gmail.com>
> ---
>  include/linux/netdevice.h |  1 +
>  net/8021q/vlan.c  |  4 ++--
>  net/8021q/vlan.h  |  2 ++
>  net/8021q/vlan_netlink.c  | 38 ++
>  net/core/dev.c|  1 +
>  5 files changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 3ec9850c7936..a95ae238addf 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2409,6 +2409,7 @@ enum netdev_cmd {
>   NETDEV_CVLAN_FILTER_DROP_INFO,
>   NETDEV_SVLAN_FILTER_PUSH_INFO,
>   NETDEV_SVLAN_FILTER_DROP_INFO,
> + NETDEV_CHANGEVLAN,
>  };
>  const char *netdev_cmd_to_name(enum netdev_cmd cmd);
>  

You add the new notifier, but do not add any hooks to catch and process it.

Personally, I think it is a bit sketchy to change the vlan id on an
existing device and I suspect it will cause latent errors.

What's your use case for trying to implement the change versus causing
it to generate an unsupported error?

If this patch does get accepted, I believe the mlxsw switchdev driver
will be impacted.


> diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
> index 73a65789271b..b5e0ad1a581a 100644
> --- a/net/8021q/vlan.c
> +++ b/net/8021q/vlan.c
> @@ -51,8 +51,8 @@ const char vlan_version[] = DRV_VERSION;
>  
>  /* End of global variables definitions. */
>  
> -static int vlan_group_prealloc_vid(struct vlan_group *vg,
> -__be16 vlan_proto, u16 vlan_id)
> +int vlan_group_prealloc_vid(struct vlan_group *vg,
> + __be16 vlan_proto, u16 vlan_id)
>  {
>   struct net_device **array;
>   unsigned int pidx, vidx;
> diff --git a/net/8021q/vlan.h b/net/8021q/vlan.h
> index 44df1c3df02d..c734dd21d70d 100644
> --- a/net/8021q/vlan.h
> +++ b/net/8021q/vlan.h
> @@ -116,6 +116,8 @@ int register_vlan_dev(struct net_device *dev, struct 
> netlink_ext_ack *extack);
>  void unregister_vlan_dev(struct net_device *dev, struct list_head *head);
>  bool vlan_dev_inherit_address(struct net_device *dev,
> struct net_device *real_dev);
> +int vlan_group_prealloc_vid(struct vlan_group *vg,
> + __be16 vlan_proto, u16 vlan_id);
>  
>  static inline u32 vlan_get_ingress_priority(struct net_device *dev,
>   u16 vlan_tci)
> diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
> index 9b60c1e399e2..ee27781939e0 100644
> --- a/net/8021q/vlan_netlink.c
> +++ b/net/8021q/vlan_netlink.c
> @@ -107,10 +107,48 @@ static int vlan_changelink(struct net_device *dev, 
> struct nlattr *tb[],
>  struct nlattr *data[],
>  struct netlink_ext_ack *extack)
>  {
> + struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
>   struct ifla_vlan_flags *flags;
>   struct ifla_vlan_qos_mapping *m;
>   struct nlattr *attr;
>   int rem;
> + __be16 vlan_proto = vlan->vlan_proto;
> + u16 vlan_id = vlan->vlan_id;
> +
> + if (data[IFLA_VLAN_ID])
> + vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
> +
> + if (data[IFLA_VLAN_PROTOCOL])
> + vlan_proto = nla_get_be16(data[IFLA_VLAN_PROTOCOL]);
> +
> + if (vlan->vlan_id != vlan_id || vlan->vlan_proto != vlan_proto) {
> + struct net_device *real_dev = vlan->real_dev;
> + struct vlan_info *vlan_info;
> + struct vlan_group *grp;
> + __be16 old_vlan_proto = vlan->vlan_proto;
> + u16 old_vlan_id = vlan->vlan_id;
> + int err;
> +
> + err = vlan_vid_add(real_dev, vlan_proto, vlan_id);
> + if (err)
> + return err;
> + vlan_info = rtnl_dereference(real_dev->vlan_info);
> + grp = _info->grp;
> + err = vlan_group_prealloc_vid(grp, vlan_proto, vlan_id);
> + if (err < 0) {
> + vlan_vid_del(real_dev, vlan_proto, vlan_id);
> + return err;
> + }
> + vlan_group_set_device(grp, vlan_proto, vlan_id, dev);
> + vlan->vlan_proto = vlan_proto;
> + vlan->vlan_id = vlan_id;
> +
> + vlan_group_set_device(grp, old_vlan_proto, old_vlan_id, NULL);
> + vlan_vid_del(real_dev, old_vlan_proto, old_vlan_id);
> +
> + err = call_netdevice_notifiers(NETDEV_CHANGEVLAN, dev);
> + notifier_to_errno(err);
> + }
>  
>   if (data[IFLA_VLAN_FLAGS]) {
>   flags = nla_data(data[IFLA_VLAN_FLAGS]);
> diff --git a/net/core/dev.c b/net/core/dev.c

Re: [PATCH net-next 2/7] net: sched: add tcf_proto_op to offload a rule

2018-06-25 Thread Jiri Pirko
Mon, Jun 25, 2018 at 06:34:26AM CEST, jakub.kicin...@netronome.com wrote:
>From: John Hurley 
>
>Create a new tcf_proto_op called 'reoffload' that generates a new offload
>message for each node in a tcf_proto. Pointers to the tcf_proto and
>whether the offload request is to add or delete the node are included.
>Also included is a callback function to send the offload message to and
>the option of priv data to go with the cb.
>
>Signed-off-by: John Hurley 
>Signed-off-by: Jakub Kicinski 
>---
> include/net/act_api.h | 3 ---
> include/net/sch_generic.h | 6 ++
> 2 files changed, 6 insertions(+), 3 deletions(-)
>
>diff --git a/include/net/act_api.h b/include/net/act_api.h
>index 9e59ebfded62..5ff11adbe2a6 100644
>--- a/include/net/act_api.h
>+++ b/include/net/act_api.h
>@@ -190,9 +190,6 @@ static inline void tcf_action_stats_update(struct 
>tc_action *a, u64 bytes,
> #endif
> }
> 
>-typedef int tc_setup_cb_t(enum tc_setup_type type,
>-void *type_data, void *cb_priv);
>-
> #ifdef CONFIG_NET_CLS_ACT
> int tc_setup_cb_egdev_register(const struct net_device *dev,
>  tc_setup_cb_t *cb, void *cb_priv);
>diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>index 6488daa32f82..88ed64f60056 100644
>--- a/include/net/sch_generic.h
>+++ b/include/net/sch_generic.h
>@@ -20,6 +20,9 @@ struct qdisc_walker;
> struct tcf_walker;
> struct module;
> 
>+typedef int tc_setup_cb_t(enum tc_setup_type type,
>+void *type_data, void *cb_priv);
>+
> struct qdisc_rate_table {
>   struct tc_ratespec rate;
>   u32 data[256];
>@@ -256,6 +259,9 @@ struct tcf_proto_ops {
> bool *last,
> struct netlink_ext_ack *);
>   void(*walk)(struct tcf_proto*, struct tcf_walker 
> *arg);
>+  int (*reoffload)(struct tcf_proto *, bool,
>+   tc_setup_cb_t *, void *,
>+   struct netlink_ext_ack *);

Please, name the args. Unnamed args are annoying...
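E.g., with argument names guessed from the cls_flower implementation in patch 3 (a suggestion only, not tested):

```c
int (*reoffload)(struct tcf_proto *tp, bool add,
                 tc_setup_cb_t *cb, void *cb_priv,
                 struct netlink_ext_ack *extack);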


>   void(*bind_class)(void *, u32, unsigned long);
> 
>   /* rtnetlink specific */
>-- 
>2.17.1
>


Re: [PATCH net 2/2] nfp: reject binding to shared blocks

2018-06-25 Thread Jiri Pirko
Mon, Jun 25, 2018 at 10:22:46PM CEST, jakub.kicin...@netronome.com wrote:
>From: John Hurley 
>
>TC shared blocks allow multiple qdiscs to be grouped together and filters
>shared between them. Currently the chains of filters attached to a block
>are only flushed when the block is removed. If a qdisc is removed from a
>block but the block still exists, flow del messages are not passed to the
>callback registered for that qdisc. For the NFP, this presents the
>possibility of rules still existing in hw when they should be removed.
>
>Prevent binding to shared blocks until the kernel can send per qdisc del
>messages when block unbinds occur.

This is not an nfp-specific problem and should be handled differently. The
driver has information about offloaded filters. On unbind, it has
enough info to do the flush, doesn't it?


>
>Fixes: 4861738775d7 ("net: sched: introduce shared filter blocks 
>infrastructure")
>Signed-off-by: John Hurley 
>Signed-off-by: Jakub Kicinski 
>Reviewed-by: Simon Horman 
>---
> drivers/net/ethernet/netronome/nfp/bpf/main.c   | 3 +++
> drivers/net/ethernet/netronome/nfp/flower/offload.c | 3 +++
> 2 files changed, 6 insertions(+)
>
>diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
>b/drivers/net/ethernet/netronome/nfp/bpf/main.c
>index fcdfb8e7fdea..6b15e3b11956 100644
>--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
>+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
>@@ -202,6 +202,9 @@ static int nfp_bpf_setup_tc_block(struct net_device 
>*netdev,
>   if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
>   return -EOPNOTSUPP;
> 
>+  if (tcf_block_shared(f->block))
>+  return -EOPNOTSUPP;
>+
>   switch (f->command) {
>   case TC_BLOCK_BIND:
>   return tcf_block_cb_register(f->block,
>diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
>b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>index 477f584f6d28..525057bee0ed 100644
>--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
>+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
>@@ -631,6 +631,9 @@ static int nfp_flower_setup_tc_block(struct net_device 
>*netdev,
>   if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
>   return -EOPNOTSUPP;
> 
>+  if (tcf_block_shared(f->block))
>+  return -EOPNOTSUPP;
>+
>   switch (f->command) {
>   case TC_BLOCK_BIND:
>   return tcf_block_cb_register(f->block,
>-- 
>2.17.1
>


[PATCH net 2/2] nfp: reject binding to shared blocks

2018-06-25 Thread Jakub Kicinski
From: John Hurley 

TC shared blocks allow multiple qdiscs to be grouped together and filters
shared between them. Currently the chains of filters attached to a block
are only flushed when the block is removed. If a qdisc is removed from a
block but the block still exists, flow del messages are not passed to the
callback registered for that qdisc. For the NFP, this presents the
possibility of rules still existing in hw when they should be removed.

Prevent binding to shared blocks until the kernel can send per qdisc del
messages when block unbinds occur.

Fixes: 4861738775d7 ("net: sched: introduce shared filter blocks 
infrastructure")
Signed-off-by: John Hurley 
Signed-off-by: Jakub Kicinski 
Reviewed-by: Simon Horman 
---
 drivers/net/ethernet/netronome/nfp/bpf/main.c   | 3 +++
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c 
b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index fcdfb8e7fdea..6b15e3b11956 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -202,6 +202,9 @@ static int nfp_bpf_setup_tc_block(struct net_device *netdev,
if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
return -EOPNOTSUPP;
 
+   if (tcf_block_shared(f->block))
+   return -EOPNOTSUPP;
+
switch (f->command) {
case TC_BLOCK_BIND:
return tcf_block_cb_register(f->block,
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 477f584f6d28..525057bee0ed 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -631,6 +631,9 @@ static int nfp_flower_setup_tc_block(struct net_device 
*netdev,
if (f->binder_type != TCF_BLOCK_BINDER_TYPE_CLSACT_INGRESS)
return -EOPNOTSUPP;
 
+   if (tcf_block_shared(f->block))
+   return -EOPNOTSUPP;
+
switch (f->command) {
case TC_BLOCK_BIND:
return tcf_block_cb_register(f->block,
-- 
2.17.1



[PATCH net 0/2] nfp: MPLS and shared blocks TC offload fixes

2018-06-25 Thread Jakub Kicinski
Hi!

This series brings two fixes to TC filter/action offload code.
Pieter fixes matching MPLS packets when the match is purely on
the MPLS ethertype and none of the MPLS fields are used.
John provides a fix for offload of shared blocks.  Unfortunately,
with shared blocks there is currently no guarantee that filters
which were added by the core will be removed before block unbind.
Our simple fix is to not support offload of rules on shared blocks
at all; a revert of this fix will be sent for -next once the
reoffload infrastructure lands.  The shared blocks became important
as we are trying to use them for bonding offload (managed from user
space) and lack of remove calls leads to resource leaks.


John Hurley (1):
  nfp: reject binding to shared blocks

Pieter Jansen van Vuuren (1):
  nfp: flower: fix mpls ether type detection

 drivers/net/ethernet/netronome/nfp/bpf/main.c  |  3 +++
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 14 ++
 .../net/ethernet/netronome/nfp/flower/offload.c| 12 
 3 files changed, 29 insertions(+)

-- 
2.17.1



[PATCH net 1/2] nfp: flower: fix mpls ether type detection

2018-06-25 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

Previously it was not possible to distinguish between mpls ether types and
other ether types. This leads to incorrect classification of offloaded
filters that match on mpls ether type. For example the following two
filters overlap:

 # tc filter add dev eth0 parent : \
protocol 0x8847 flower \
action mirred egress redirect dev eth1

 # tc filter add dev eth0 parent : \
protocol 0x0800 flower \
action mirred egress redirect dev eth2

The driver now correctly includes the mac_mpls layer where HW stores mpls
fields, when it detects an mpls ether type. It also sets the MPLS_Q bit to
indicate that the filter should match mpls packets.

Fixes: bb055c198d9b ("nfp: add mpls match offloading support")
Signed-off-by: Pieter Jansen van Vuuren 
Reviewed-by: Simon Horman 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 14 ++
 .../net/ethernet/netronome/nfp/flower/offload.c|  8 
 2 files changed, 22 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/match.c 
b/drivers/net/ethernet/netronome/nfp/flower/match.c
index 91935405f586..84f7a5dbea9d 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/match.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/match.c
@@ -123,6 +123,20 @@ nfp_flower_compile_mac(struct nfp_flower_mac_mpls *frame,
 NFP_FLOWER_MASK_MPLS_Q;
 
frame->mpls_lse = cpu_to_be32(t_mpls);
+   } else if (dissector_uses_key(flow->dissector,
+ FLOW_DISSECTOR_KEY_BASIC)) {
+   /* Check for mpls ether type and set NFP_FLOWER_MASK_MPLS_Q
+* bit, which indicates an mpls ether type but without any
+* mpls fields.
+*/
+   struct flow_dissector_key_basic *key_basic;
+
+   key_basic = skb_flow_dissector_target(flow->dissector,
+ FLOW_DISSECTOR_KEY_BASIC,
+ flow->key);
+   if (key_basic->n_proto == cpu_to_be16(ETH_P_MPLS_UC) ||
+   key_basic->n_proto == cpu_to_be16(ETH_P_MPLS_MC))
+   frame->mpls_lse = cpu_to_be32(NFP_FLOWER_MASK_MPLS_Q);
}
 }
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index c42e64f32333..477f584f6d28 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -264,6 +264,14 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
case cpu_to_be16(ETH_P_ARP):
return -EOPNOTSUPP;
 
+   case cpu_to_be16(ETH_P_MPLS_UC):
+   case cpu_to_be16(ETH_P_MPLS_MC):
+   if (!(key_layer & NFP_FLOWER_LAYER_MAC)) {
+   key_layer |= NFP_FLOWER_LAYER_MAC;
+   key_size += sizeof(struct nfp_flower_mac_mpls);
+   }
+   break;
+
/* Will be included in layer 2. */
case cpu_to_be16(ETH_P_8021Q):
break;
-- 
2.17.1



Re: Route fallback issue

2018-06-25 Thread Grant Taylor

On 06/25/2018 12:50 PM, Julian Anastasov wrote:

Hello,


Hi Julian,

Yes, ARP state for unreachable GWs may be updated slowly, there is 
in-time feedback only for reachable state.


Fair.

Most of the installations where I needed D.G.D. to work would be okay 
with a < 5 minute timeout.  Obviously they would like faster, but 
automation is a LOT better than waiting on manual intervention.


IMHO < 30 seconds is great.  < 90 seconds is acceptable.  < 300 seconds 
leaves some room for improvement.


You can create the two routes, of course. But only the default routes 
are alternative.


Are you saying that the functionality I'm describing only works for 
default gateways or that the term "alternative route" only applies to 
default gateways?


The testing that I did indicated that alternative routes worked for 
specific prefixes too.


I tested multiple NetNSs with only directly attached routes and appended 
routes to a destination prefix, no default gateway / route of last resort.


The behavior seemed to be different when ignore_routes_with_linkdown was 
set verses unset.  Specifically, ignore_routes_with_linkdown seemed to 
help considerably.


Hence why I question the requirement for the "default" route verses a 
route to a specific prefix.


Can you explain why I saw the behavior difference with 
ignore_routes_with_linkdown if it only applies to the default route?



The alternative routes work in this way:

- on lookup, routes are walked in order - as listed in table

- as long as route contains reachable gateway (ARP state), only this 
route is used


- if some gateway becomes unreachable (ARP state), next alternative 
routes are tried


- if ARP entry is expired (missing), this gateway can be probed if the 
route is before the currently used route. This is what happens initially 
when no ARP state is present for the GWs. It is bad luck if the probed 
GW is actually unreachable.


- active probing by user space (ping GWs) can only help to keep the ARP 
state present for the used gateways. By this way, if ARP entry for GW 
is missing, the kernel will not risk to select unavailable route with 
the goal to probe the GW.


This all makes sense.

Please confirm if "gateway" in this context is the "/default/ gateway" 
or not.  I ask because arguably "gateway" can be used as a term to 
describe the next hop for a route, or gateway, to a prefix.  Further, 
the "/default/ (gateway,router)" is the gateway or route of last resort. 
 Which to me means that "gateway" can be any route in this context.



nexthop is the GW in the route


Thank you for confirming.


Yes, the kernel avoids alternative routes with unreachable GWs


Fair enough.

The multipath route uses all its alive nexthops at the same time... But 
you may need in the same way active probing by user space, otherwise 
unavailable GW can be selected.


I assume that the dead ECMP NEXTHOP is also subject to similar timeouts 
as alternative routes.  Correct?



Yes, if you prefer, you may run PING every second to avoid such delays...


Agreed.

I'm trying to make sure I understand basic functionality before I do 
things to modify it.




--
Grant. . . .
unix || die





Re: Route fallback issue

2018-06-25 Thread Julian Anastasov


Hello,

On Sun, 24 Jun 2018, Erik Auerswald wrote:

> Hello Julian,
> 
> > http://ja.ssi.bg/dgd-usage.txt
> 
> Thanks for that info!
> 
> Can you tell us what parts from the above text is actually implemented
> in the upstream Linux kernel, and starting with which version(s)
> (approximately)? The text describes ideas and patches from nearly two
> decades ago, is more recent documentation available somewhere?

Nothing is included in kernel. The idea is that user space
has more control. It is best done with CONNMARK: stick NATed
connection to some path (via alive ISP), use route lookup just
to select alive path for the first packet in connection. So, what
we balance are connections, not packets (which does not work with
different ISPs). Probe GWs to keep only alive routes in the table.

Regards

--
Julian Anastasov 


Re: Route fallback issue

2018-06-25 Thread Julian Anastasov


Hello,

On Thu, 21 Jun 2018, Grant Taylor wrote:

> On 06/21/2018 01:57 PM, Julian Anastasov wrote:
> > Hello,
> 
> > http://ja.ssi.bg/dgd-usage.txt
> 
> "DGD" or "Dead Gateway Detection" sounds very familiar.  I referenced it in an
> earlier reply.
> 
> I distinctly remember DGD not behaving satisfactorily years ago.  Where
> unsatisfactorily was something like 90 seconds (or more) to recover. Which
> actually matches what I was getting without the ignore_routes_with_linkdown=1
> setting that David A. mentioned.

Yes, ARP state for unreachable GWs may be updated slowly,
there is in-time feedback only for reachable state.

> With ignore_routes_with_linkdown=1 things behaved much better.
> 
> > Not true. net/ipv4/fib_semantics.c:fib_select_path() calls
> > fib_select_default() only when prefixlen = 0 (default route).
> 
> Okay.  My testing last night disagrees with you.  Specifically, I was able
> to add alternate routes to the same prefix, 192.0.2.128/26. There was not
> any default gateway configured on any of the NetNSs.  So everything was using
> routes for locally attached or the two added via "ip route append".
> 
> What am I misinterpreting?  Or where are we otherwise talking past each other?

You can create the two routes, of course. But only the
default routes are alternative.

> 
> > Otherwise, only the first route will be considered.
> 
> "only the first route" almost sounds like something akin to Equal Cost Multi
> Path.
> 
> I was not expecting "alternative routes" to use more than one route at a time,
> equally or otherwise.  I was wanting for the kernel to fall back to an
> alternate route / gateway / path in the event that the one that was being used
> became unusable / unreachable.
> 
> So what should "Alternative Routes" do?  How does this compare / contract to
> E.C.M.P. or D.G.D.

The alternative routes work in this way:

- on lookup, routes are walked in order - as listed in table

- as long as route contains reachable gateway (ARP state), only this route 
is used

- if some gateway becomes unreachable (ARP state), next alternative routes 
are tried

- if ARP entry is expired (missing), this gateway can be probed if the 
route is before the currently used route. This is what happens initially
when no ARP state is present for the GWs. It is bad luck if the probed
GW is actually unreachable.

- active probing by user space (ping GWs) can only help to keep the
ARP state present for the used gateways. By this way, if ARP entry 
for GW is missing, the kernel will not risk to select unavailable route 
with the goal to probe the GW.
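Put concretely, alternative routes are the ones created with "ip route append"; the interfaces and gateway addresses below are illustrative only:

```shell
# Two alternative default routes; on lookup they are walked in this order.
ip route append default via 192.0.2.1 dev eth0
ip route append default via 198.51.100.1 dev eth1

# While 192.0.2.1 has reachable ARP state it is used exclusively; if its
# ARP state goes unreachable, fib_select_default() falls back to the
# 198.51.100.1 route, as described above.
```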

> > fib_select_default() is the function that decides which nexthop is reachable
> > and whether to contact it. It uses the ARP state via fib_detect_death().
> > That is all code that is behind this feature called "alternative routes":
> > the kernel selects one based on nexthop's ARP state.
> 
> Please confirm that you aren't entering / referring to E.C.M.P. territory when
> you say "nexthop".  I think that you are not, but I want to ask and be sure,
> particularly seeing as how things are very closely related.

nexthop is the GW in the route

> It sounds like you're referring to literally the router that is the next hop
> in the path.  I.e. the device on the other end of the wire.

Yes, the kernel avoids alternative routes with unreachable GWs

> I want to do some testing to see if fib_multipath_use_neigh alters this
> behavior at all.  I'm hoping that it will invalidate an alternate route if the
> MAC is not resolvable even if the physical link stays up.

The multipath route uses all its alive nexthops at the same 
time... But you may need in the same way active probing by user space,
otherwise unavailable GW can be selected.

> Sure, the ARP cache may have a 30 ~ 120 second timeout before triggering this
> behavior.  But having that timeout and starting to use an alternative route is
> considerably better than not using an alternative route.

Yes, if you prefer, you may run PING every second to avoid such 
delays...

Regards

--
Julian Anastasov 


Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support

2018-06-25 Thread Santosh Shilimkar

On 6/25/2018 10:50 AM, Sowmini Varadhan wrote:

On (06/26/18 01:43), Ka-Cheong Poon wrote:


Yes, I think if the socket is bound, it should check the scope_id
in msg_name (if not NULL) to make sure that they match.  A bound
RDS socket can send to multiple peers.  But if the bound local
address is link local, it should only be allowed to send to peers
on the same link.


agree.

Yep. It's in line with RDS bind behavior.





If a socket is bound, I guess the scope_id should be used.  So
if a socket is not bound to a link local address and the socket
is used to sent to a link local peer, it should fail.


PF_RDS sockets *MUST* always be bound.  See
Documentation/networking/rds.txt:
"   Sockets must be bound before you can send or receive data.
 This is needed because binding also selects a transport and
 attaches it to the socket. Once bound, the transport assignment
 does not change."


In any case link local or not, the socket needs to be bound before
any data can be sent as documented. Send path already enforces
it.


Also, why is there no IPv6 support in rds_connect?



Oops, I missed this when I ported the internal version to the
net-next version.  Will add it back.



So the net-next version wasn't tested? IPv6 connections
themselves wouldn't be formed with this missing. As mentioned
already, please test v2 before posting to the list.

Regards,
Santosh


Re: [net regression] "fib_rules: move common handling of newrule delrule msgs into fib_nl2rule" breaks suppress_prefixlength

2018-06-25 Thread Roopa Prabhu
On Mon, Jun 25, 2018 at 8:23 AM, Roopa Prabhu  wrote:
> On Sat, Jun 23, 2018 at 8:46 AM, Jason A. Donenfeld  wrote:
>> Hey Roopa,
>>
>> On a kernel with a minimal networking config,
>> CONFIG_IP_MULTIPLE_TABLES appears to be broken for certain rules after
>> f9d4b0c1e9695e3de7af3768205bacc27312320c.
>>
>> Try, for example, running:
>>
>> $ ip -4 rule add table main suppress_prefixlength 0
>>
>> It returns with EEXIST.
>>
>> Perhaps the reason is that the new rule_find function does not match
>> on suppress_prefixlength? However, rule_exist from before didn't do
>> that either. I'll keep playing and see if I can track it down myself,
>> but thought I should let you know first.
>
> I am surprised at that also. I cannot find prior rule_exist looking at
> suppress_prefixlength.
I will dig deeper today as well. But your patch LGTM with a small change
that I commented on.
>

So the previous rule_exists code did not check for attribute matches correctly.
It would ignore a rule at the first mismatch on a non-existent attribute;
e.g. in your case, it would return at the pref mismatch.


$ip -4 rule add table main suppress_prefixlength 0
$ip -4 rule add table main suppress_prefixlength 0
$ip -4 rule add table main suppress_prefixlength 0

$ip rule show
0:      from all lookup local
32763:  from all lookup main suppress_prefixlength 0
32764:  from all lookup main suppress_prefixlength 0
32765:  from all lookup main suppress_prefixlength 0
32766:  from all lookup main
32767:  from all lookup default


With your patch (with my proposed fixes), you should get a proper EEXIST check:

$ git diff HEAD
diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 126ffc5..de4c171 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -428,6 +428,14 @@ static struct fib_rule *rule_find(struct fib_rules_ops *ops,
 		if (rule->l3mdev && r->l3mdev != rule->l3mdev)
 			continue;
 
+		if (rule->suppress_ifgroup != -1 &&
+		    (r->suppress_ifgroup != rule->suppress_ifgroup))
+			continue;
+
+		if (rule->suppress_prefixlen != -1 &&
+		    (r->suppress_prefixlen != rule->suppress_prefixlen))
+			continue;
+
 		if (uid_range_set(&rule->uid_range) &&
 		    (!uid_eq(r->uid_range.start, rule->uid_range.start) ||
 		     !uid_eq(r->uid_range.end, rule->uid_range.end)))

$ ip -4 rule add table main suppress_prefixlength 0
$ ip -4 rule add table main suppress_prefixlength 0
RTNETLINK answers: File exists


[PATCH net-next] r8169: reject unsupported WoL options

2018-06-25 Thread Heiner Kallweit
So far unsupported WoL options are silently ignored. Change this and
reject attempts to set unsupported options. This prevents situations
where a user tries to set an unsupported WoL option and is under the
impression it was successful because ethtool doesn't complain.

Signed-off-by: Heiner Kallweit 
---
 drivers/net/ethernet/realtek/r8169.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index 1d33672c..70c13cc2 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -1712,11 +1712,14 @@ static int rtl8169_set_wol(struct net_device *dev, 
struct ethtool_wolinfo *wol)
struct rtl8169_private *tp = netdev_priv(dev);
struct device *d = tp_to_dev(tp);
 
+   if (wol->wolopts & ~WAKE_ANY)
+   return -EINVAL;
+
pm_runtime_get_noresume(d);
 
rtl_lock_work(tp);
 
-   tp->saved_wolopts = wol->wolopts & WAKE_ANY;
+   tp->saved_wolopts = wol->wolopts;
 
if (pm_runtime_active(d))
__rtl8169_set_wol(tp, tp->saved_wolopts);
-- 
2.18.0



Re: [PATCH] net/nfp: cast sizeof() to int when comparing with error code

2018-06-25 Thread Jakub Kicinski
On Mon, 25 Jun 2018 15:33:18 +0800, Chengguang Xu wrote:
> A negative error code will compare larger than sizeof().
> 
> Signed-off-by: Chengguang Xu 

You should include the name of the tree in the tag, e.g. [PATCH net] or
[PATCH net-next] and if it's a fix, i.e. directed at the net tree, you
should add a fixes tag, in this case it would most likely be:

Fixes: a0d8e02c35ff ("nfp: add support for reading nffw info")

You can find this and more information about the development process of
the networking subsystem in:

https://www.kernel.org/doc/Documentation/networking/netdev-FAQ.txt

Acked-by: Jakub Kicinski 

Thank you!


[PATCH 0/4] lan78xx minor fixes

2018-06-25 Thread Dave Stevenson
This is a small set of patches for the Microchip LAN78xx chip,
as used in the Raspberry Pi 3B+.
The main debug/discussion was on
https://github.com/raspberrypi/linux/issues/2458

Initial symptoms were that VLANs were very unreliable.
A couple of things were found:
- firstly, the hardware timeout value being set failed to
  take the VLAN tag into account, so a full-MTU packet
  would be timed out.
- secondly, regular checksum failures were being
  reported. Disabling checksum offload confirmed that
  the checksums were valid, and further experimentation
  identified that it was only if the VLAN tags were being
  passed through to the kernel that there were issues.
  The hardware supports VLAN filtering and tag stripping,
  therefore those have been implemented (much of the work
  was already done), and the driver drops back to s/w
  checksums should the choice be made not to use the h/w
  VLAN stripping.

Dave Stevenson (4):
  net: lan78xx: Allow for VLAN headers in timeout calcs
  net: lan78xx: Add support for VLAN filtering.
  net: lan78xx: Add support for VLAN tag stripping.
  net: lan78xx: Use s/w csum check on VLANs without tag stripping

 drivers/net/usb/lan78xx.c | 37 ++---
 1 file changed, 34 insertions(+), 3 deletions(-)

-- 
2.7.4



[PATCH 1/4] net: lan78xx: Allow for VLAN headers in timeout calcs

2018-06-25 Thread Dave Stevenson
The frame abort timeout being set by lan78xx_set_rx_max_frame_length
didn't account for any VLAN headers, resulting in very low
throughput if used with tagged VLANs.

Use VLAN_ETH_HLEN instead of ETH_HLEN to correct for this.

Signed-off-by: Dave Stevenson 
---
 drivers/net/usb/lan78xx.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index a89570f..2f793d4 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -2298,7 +2298,7 @@ static int lan78xx_change_mtu(struct net_device *netdev, 
int new_mtu)
if ((ll_mtu % dev->maxpacket) == 0)
return -EDOM;
 
-   ret = lan78xx_set_rx_max_frame_length(dev, new_mtu + ETH_HLEN);
+   ret = lan78xx_set_rx_max_frame_length(dev, new_mtu + VLAN_ETH_HLEN);
 
netdev->mtu = new_mtu;
 
@@ -2587,7 +2587,8 @@ static int lan78xx_reset(struct lan78xx_net *dev)
buf |= FCT_TX_CTL_EN_;
ret = lan78xx_write_reg(dev, FCT_TX_CTL, buf);
 
-   ret = lan78xx_set_rx_max_frame_length(dev, dev->net->mtu + ETH_HLEN);
+   ret = lan78xx_set_rx_max_frame_length(dev,
+ dev->net->mtu + VLAN_ETH_HLEN);
 
ret = lan78xx_read_reg(dev, MAC_RX, );
buf |= MAC_RX_RXEN_;
-- 
2.7.4



Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-25 Thread Samudrala, Sridhar

On 6/22/2018 5:17 PM, Siwei Liu wrote:

On Fri, Jun 22, 2018 at 4:40 PM, Siwei Liu  wrote:

On Fri, Jun 22, 2018 at 3:25 PM, Michael S. Tsirkin  wrote:

On Fri, Jun 22, 2018 at 02:51:11PM -0700, Siwei Liu wrote:

On Fri, Jun 22, 2018 at 2:29 PM, Michael S. Tsirkin  wrote:

On Fri, Jun 22, 2018 at 01:03:04PM -0700, Siwei Liu wrote:

On Fri, Jun 22, 2018 at 1:00 PM, Siwei Liu  wrote:

On Thu, Jun 21, 2018 at 7:32 PM, Michael S. Tsirkin  wrote:

On Thu, Jun 21, 2018 at 06:21:55PM -0700, Siwei Liu wrote:

On Thu, Jun 21, 2018 at 7:59 AM, Cornelia Huck  wrote:

On Wed, 20 Jun 2018 22:48:58 +0300
"Michael S. Tsirkin"  wrote:


On Wed, Jun 20, 2018 at 06:06:19PM +0200, Cornelia Huck wrote:

In any case, I'm not sure anymore why we'd want the extra uuid.

It's mostly so we can have e.g. multiple devices with same MAC
(which some people seem to want in order to then use
then with different containers).

But it is also handy for when you assign a PF, since then you
can't set the MAC.


OK, so what about the following:

- introduce a new feature bit, VIRTIO_NET_F_STANDBY_UUID that indicates
   that we have a new uuid field in the virtio-net config space
- in QEMU, add a property for virtio-net that allows to specify a uuid,
   offer VIRTIO_NET_F_STANDBY_UUID if set
- when configuring, set the property to the group UUID of the vfio-pci
   device

If feature negotiation fails on VIRTIO_NET_F_STANDBY_UUID, is it safe
to still expose UUID in the config space on virtio-pci?


Yes but guest is not supposed to read it.


I'm not even sure if it's sane to expose group UUID on the PCI bridge
where the corresponding vfio-pci device attached to for a guest which
doesn't support the feature (legacy).

-Siwei

Yes but you won't add the primary behind such a bridge.

I assume the UUID feature is a new one besides the exiting
VIRTIO_NET_F_STANDBY feature, where I think the current proposal is,
if UUID feature is present and supported by guest, we'll do pairing
based on UUID; if UUID feature present

  ^^^  is NOT present

Let's go back for a bit.

What is wrong with matching by mac?

1. Does not allow multiple NICs with same mac
2. Requires MAC to be programmed by host in the PT device
(which is often possible with VFs but isn't possible with all PT
devices)

Nothing is necessarily wrong with it, but prohibiting the VF's MAC from
changing while enslaved by failover is very limiting.

You mean guest changing MAC? I'm not sure why we prohibit that.

I think Sridhar and Jiri might be better person to answer it. My
impression was that sync'ing the MAC address change between all 3
devices is challenging, as the failover driver uses MAC address to
match net_device internally.


Yes. The MAC address is assigned by the hypervisor, and it needs to manage
the movement of the MAC between the PF and VF. Allowing the guest to change
the MAC will require synchronization between the hypervisor and the PF/VF
drivers. Most of the VF drivers don't allow changing the guest MAC unless it
is a trusted VF.




Re: [PATCH net-next 2/3] rds: Enable RDS IPv6 support

2018-06-25 Thread Sowmini Varadhan
On (06/26/18 01:43), Ka-Cheong Poon wrote:
> 
> Yes, I think if the socket is bound, it should check the scope_id
> in msg_name (if not NULL) to make sure that they match.  A bound
> RDS socket can send to multiple peers.  But if the bound local
> address is link local, it should only be allowed to send to peers
> on the same link.

agree.


> If a socket is bound, I guess the scope_id should be used.  So
> if a socket is not bound to a link local address and the socket
> is used to sent to a link local peer, it should fail.

PF_RDS sockets *MUST* always be bound.  See
Documentation/networking/rds.txt:
"   Sockets must be bound before you can send or receive data.
This is needed because binding also selects a transport and
attaches it to the socket. Once bound, the transport assignment
does not change."

Also, rds_sendmsg checks this (from net-next, your version
has the equivalent ipv6_addr_any etc check):

if (daddr == 0 || rs->rs_bound_addr == 0) {
release_sock(sk);
ret = -ENOTCONN; /* XXX not a great errno */
goto out;
}

> 
> >Also, why is there no IPv6 support in rds_connect?
> 
> 
> Oops, I missed this when I ported the internal version to the
> net-next version.  Will add it back.

Ok

--Sowmini


