[PATCH net] netlabel: out of bound access in cipso_v4_validate()

2017-02-03 Thread Eric Dumazet
From: Eric Dumazet 

syzkaller found another out-of-bounds access in ip_options_compile(),
or more exactly in cipso_v4_validate().

Fixes: 20e2a8648596 ("cipso: handle CIPSO options correctly when NetLabel is disabled")
Fixes: 446fda4f2682 ("[NetLabel]: CIPSOv4 engine")
Signed-off-by: Eric Dumazet 
Reported-by: Dmitry Vyukov  
Cc: Paul Moore 
---
 include/net/cipso_ipv4.h | 4 ++++
 net/ipv4/cipso_ipv4.c    | 4 ++++
 2 files changed, 8 insertions(+)

diff --git a/include/net/cipso_ipv4.h b/include/net/cipso_ipv4.h
index 3ebb168b9afc68ad639b5d32f6182a845c83d759..a34b141f125f0032662f147b598c9fef4fb4bcef 100644
--- a/include/net/cipso_ipv4.h
+++ b/include/net/cipso_ipv4.h
@@ -309,6 +309,10 @@ static inline int cipso_v4_validate(const struct sk_buff *skb,
}
 
for (opt_iter = 6; opt_iter < opt_len;) {
+   if (opt_iter + 1 == opt_len) {
+   err_offset = opt_iter;
+   goto out;
+   }
tag_len = opt[opt_iter + 1];
if ((tag_len == 0) || (tag_len > (opt_len - opt_iter))) {
err_offset = opt_iter + 1;
diff --git a/net/ipv4/cipso_ipv4.c b/net/ipv4/cipso_ipv4.c
index 72d6f056d863603c959e1d04b9f863909a37c758..ae206163c273381ba6e8bd8a24fa050619a4a6ae 100644
--- a/net/ipv4/cipso_ipv4.c
+++ b/net/ipv4/cipso_ipv4.c
@@ -1587,6 +1587,10 @@ int cipso_v4_validate(const struct sk_buff *skb, unsigned char **option)
goto validate_return_locked;
}
 
+   if (opt_iter + 1 == opt_len) {
+   err_offset = opt_iter;
+   goto validate_return_locked;
+   }
tag_len = tag[1];
if (tag_len > (opt_len - opt_iter)) {
err_offset = opt_iter + 1;
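To see why the extra check matters, here is a minimal userspace sketch of the tag walk with the patch's guard in place. It is a hypothetical model, not the kernel source; the function name and return convention are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the cipso_v4_validate() tag walk.
 * Returns 0 on success (offsets below 6 are never bad here),
 * or the offset of the first bad byte. */
static int cipso_tags_valid(const unsigned char *opt, size_t opt_len)
{
	size_t opt_iter;

	/* Tags start after the 6-byte CIPSO header (type, len, 4-byte DOI). */
	for (opt_iter = 6; opt_iter < opt_len;) {
		/* The fix: a lone tag-type byte at the very end would make
		 * the tag_len read below touch opt[opt_len], one byte past
		 * the end of the option buffer. */
		if (opt_iter + 1 == opt_len)
			return (int)opt_iter;

		unsigned char tag_len = opt[opt_iter + 1];

		if (tag_len == 0 || tag_len > opt_len - opt_iter)
			return (int)(opt_iter + 1);
		opt_iter += tag_len;
	}
	return 0;
}
```

With this guard, a 7-byte option (a full header plus one stray tag-type byte) is rejected at offset 6 instead of reading past the buffer.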




Re: Inconsistency in packet drop due to MTU (eth vs veth)

2017-02-03 Thread Fredrik Markstrom


/F


  On Tue, 31 Jan 2017 17:27:09 +0100 Eric Dumazet wrote 
 > On Tue, 2017-01-31 at 14:32 +0100, Fredrik Markstrom wrote: 
 > >   On Thu, 19 Jan 2017 19:53:47 +0100 Eric Dumazet wrote   
 > >  > On Thu, 2017-01-19 at 17:41 +0100, Fredrik Markstrom wrote:  
 > >  > > Hello,  
 > >  > >   
 > >  > > I've noticed an inconsistency between how physical ethernet and 
 > > veth handles mtu.  
 > >  > >   
 > >  > > If I setup two physical interfaces (directly connected) with 
 > > different mtu:s, only the size of the outgoing packets are limited by 
 > > the mtu. But with veth a packet is dropped if the mtu of the receiving 
 > > interface is smaller than the packet size.   
 > >  > >   
 > >  > > This seems inconsistent to me, but maybe there is a reason for 
 > > it ?   
 > >  > >   
 > >  > > Can someone confirm if it's a deliberate inconsistency or just a 
 > > side effect of using dev_forward_skb() ?  
 > >  >   
 > >  > It looks this was added in commit  
 > >  > 38d408152a86598a50680a82fe3353b506630409  
 > >  > ("veth: Allow setting the L3 MTU")  
 > >  >   
 > >  > But what was really needed here was a way to change MRU :(  
 > >  
 > > Ok, do we consider this correct and/or something we need to be 
 > > backwards compatible with ? Is it insane to believe that we can fix 
 > > this "inconsistency" by removing the check ? 
 > >  
 > > The commit message reads "For consistency I drop packets on the 
 > > receive side when they are larger than the MTU", do we know what it's 
 > > supposed 
 > > to be consistent with or is that lost in history ? 
 >  
 > There is no consistency among existing Ethernet drivers. 
 >  
 > Many ethernet drivers size the buffers they post in RX ring buffer 
 > according to MTU. 
 >  
 > If MTU is set to 1500, RX buffers are sized to be about 1536 bytes, 
 > so you won't be able to receive a 1700-byte frame. 
 >  
 > I guess that you could add a specific veth attribute to precisely 
 > control MRU, that would not break existing applications. 

Ok, I will propose a patch shortly. And thanks, your response time is
awesome!

/Fredrik




Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit

2017-02-03 Thread nick viljoen


> On Feb 2, 2017, at 11:04 PM, Shubham Bansal  wrote:
> 
> Hi Nick,
> 
> On Thu, Feb 2, 2017 at 12:59 PM, nick viljoen
>  wrote:
>> Hey Shubham,
>> 
>> I have been doing some similar work - might be worth pooling
>> resources if there is interest?
> 
> Sure. That sounds great.
> 
>> 
>> We made a presentation at the previous netdev conference about
>> what we are doing - you can check it out here :)
>> 
>> https://www.youtube.com/watch?v=-5BzT1ch19s&t=45s
> 
> Sorry for the late reply. I had to watch the whole video. It was fun.
> It seems a small part of your complete project was related to mapping
> eBPF 64-bit registers to 32-bit registers, although I don't have any
> knowledge about the hardware aspect of it.
> Now, getting back to your slides, on page 7 you are mapping eBPF 64-bit
> registers to 32-bit registers.
> 
> 1. Can you explain that to me? I didn't get this part from your presentation.
> 2. How are you taking care of Race Condition on 64 bit eBPF registers
> Read/Write as you are using 32 bit registers to emulate them ?
> 
>> 
>> What is your reason for looking at these problems?
> 
> I just wanted to contribute toward linux kernel. This is the only
> reason I think.
There seems to have been some mixing of emails here - my previous
email ended here. Currently on my mail client it appears as though the
below is my email. As you have implied, I presume the below is you
replying to yourself.
-
> 
>> I was thinking of first implementing only instructions with 32 bit
>> register operands. It will hugely decrease the surface area of eBPF
>> instructions that I have to cover for the first patch.
>> 
>> So, What I am thinking is something like this :
>> 
>> - bpf_mov r0(64),r1(64) will be JITed like this :
>> - ar1(32) <- r1(64). Convert/Mask 64 bit ebpf register(r1) value into 32
>> bit and store it in arm register(ar1).
>> - Do MOV ar0(32),ar1(32) as an ARM instruction.
>> - ar0(32) -> r0(64). Zero Extend the ar0 32 bit register value
>> and store it in 64 bit ebpf register r0.
> 
> What about this? Does this make sense to you?
>> 
>> - Similarly, For all BPF_ALU class instructions.
>> - For BPF_ADD, I will mask the addition result to 32 bit only.
>> I am not sure; overflow might be a problem.
>> - For BPF_SUB, I will mask the subtraction result to 32 bit only.
>> I am not sure; underflow might be a problem.
>> - For BPF_MUL, similar to BPF_ADD. Overflow Problem ?
>> - For BPF_DIV, 32 bit masking should be fine, I guess.
>> - For BPF_OR, BPF_AND, BPF_XOR, BPF_LSH, BPF_RSH, BPF_MOD 32 bit
>> masking should be fine.
>> - For BPF_NEG and BPF_ARSH, might be a problem because of the sign bit.
>> - For BPF_END, 32 bit masking should work fine.
>> Let me know if any of the above points are wrong or need your suggestions.
> What about this ?
>> 
>> - Although, for ALU instructions, there is a big problem of register
>> flag manipulation. Generally, the architecture's ABI takes care of this
>> part, but as we are doing 64-bit instruction emulation (kind of) on a
>> 32-bit machine, it needs to be done manually. Does that sound correct ?
>> 
>> - I am not JITing BPF_ALU64 class instructions as of now, since we have
>> to take care of atomic instructions and race conditions with these
>> instructions, which looks complicated to me. Will try to figure out
>> this part and implement it later. Currently, I will just let them be
>> interpreted by the eBPF interpreter.
>> 
>> - For BPF_JMP class, I am assuming that, although eBPF is a 64-bit ABI,
>> the address pointers on a 32-bit arch like arm will be 32-bit only.
>> So, for BPF_JMP, masking the 64-bit destination address to 32 bits
>> should do the trick and no address will be corrupted in this way. Am I
>> correct to assume this ?
>> Also, I need to check for address getting out of the allowed memory
>> range.
>> 
>> - For BPF_LD, BPF_LDX, BPF_ST and BPF_STX class instructions, I am
>> assuming the same thing as above - all addresses and pointers are 32
>> bit - which can be taken care of just by masking the eBPF register
>> values. Does that sound correct ?
>> Also, I need to check for the address overflow, address getting out
>> of the allowed memory range and things like that.
>> 

> Nick, it would be great if you could give me your comments/suggestions
> on all of the above points for JIT implementation.

As we are selectively offloading to an NPU-based NIC we can avoid some of
the problems you have mentioned, so I am afraid I don't have all the
answers.

While we have stated publicly we are doing this work and aren't trying to
hide anything, the reason I replied to you in private is that it is generally
not a good idea to share half-baked ideas on the mailing list, as it wastes
people's time :).

The best approach is to wait until you are able to post an RFC patch for
public discussion.
> 
>> Do you have any code references for me to take a look at? Otherwise, I think
>> it's not possible for me to implement it without using any reference.
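The mask-then-zero-extend scheme proposed in the thread above can be modeled in a few lines. This is a hypothetical userspace sketch of the strategy (not JIT output); function names are invented:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the proposed 32-bit JIT strategy: mask the 64-bit eBPF
 * source register down to 32 bits, perform the operation with 32-bit
 * arithmetic (as a 32-bit ARM core would), then zero-extend the
 * 32-bit result back into the 64-bit eBPF destination register. */
static uint64_t emu_mov32(uint64_t r_src)
{
	uint32_t ar1 = (uint32_t)r_src;   /* ar1(32) <- r1(64), masked        */
	uint32_t ar0 = ar1;               /* MOV ar0, ar1 as an ARM insn      */
	return (uint64_t)ar0;             /* ar0(32) -> r0(64), zero-extended */
}

static uint64_t emu_add32(uint64_t r_dst, uint64_t r_src)
{
	uint32_t a = (uint32_t)r_dst;
	uint32_t b = (uint32_t)r_src;
	uint32_t res = a + b;             /* wraps at 2^32 - silently          */
	return (uint64_t)res;
}
```

Note how emu_add32 silently wraps at 2^32: that is exactly the overflow concern raised for BPF_ADD and BPF_MUL in the list above.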

Re: [PATCH net-next 2/2] Add a eBPF helper function to retrieve socket uid

2017-02-03 Thread Daniel Borkmann

On 02/03/2017 02:51 AM, Eric Dumazet wrote:

On Fri, 2017-02-03 at 10:18 +0900, Lorenzo Colitti wrote:

On Fri, Feb 3, 2017 at 9:31 AM, Eric Dumazet  wrote:

It should be safe to call sock_net_uid on any type of socket
(including NULL). sk_uid was added to struct sock in 86741ec25462
("net: core: Add a UID field to struct sock.")


But a request socket or a timewait socket does not have this field.

Daniel's point is valid.


My bad. Yes.

It would definitely be useful to have the UID available in request
sockets, and perhaps timewait sockets as well. That could be done by
moving the UID to sock_common, or with something along the lines of:

  static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk)
  {
+   if (sk && sk->sk_state == TCP_NEW_SYN_RECV)
+   sk = sk->__sk_common.skc_listener;
+   else if (sk && !sk_fullsock(sk))
+   sk = NULL;
+
 return sk ? sk->sk_uid : make_kuid(net->user_ns, 0);
  }

Any thoughts on which is better?


You could use

if (sk) {
 sk = sk_to_full_sk(sk);
 if (sk_fullsock(sk))
 return sk->sk_uid;
}


Yeah, if that moves into the sock_net_uid() helper, then you could
remove the sk && sk_fullsock(sk) ? sk : NULL tests from the current
sock_net_uid() call sites such as in tcp code. Maybe then also make
sock_net_uid() __always_inline, so that most of the callers
with sock_net_uid(net, NULL) are guaranteed to optimize away their
sk checks at compile time?
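The behavior being discussed - map a request socket to its listener before reading the UID, and fall back for anything that is not a full socket - can be modeled in userspace. The structs and state names below are simplified stand-ins for the kernel's, invented for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mock of the sock_net_uid() logic under discussion. */
enum mock_state { MOCK_ESTABLISHED, MOCK_NEW_SYN_RECV, MOCK_TIME_WAIT };

struct mock_sock {
	enum mock_state state;
	unsigned int uid;            /* only valid on full sockets       */
	struct mock_sock *listener;  /* set on request (SYN_RECV) socks  */
};

static unsigned int mock_sock_net_uid(const struct mock_sock *sk)
{
	if (sk && sk->state == MOCK_NEW_SYN_RECV)
		sk = sk->listener;               /* sk_to_full_sk() equivalent */
	if (sk && sk->state != MOCK_TIME_WAIT)   /* sk_fullsock() equivalent   */
		return sk->uid;
	return 0;                                /* make_kuid(net->user_ns, 0) */
}
```

A request socket thus reports its listener's UID, while timewait sockets and NULL fall back to root's UID, matching the fallback in the quoted helper.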


Re: [PATCH RFC net-next 4/4] bridge: add ability to turn off fdb used updates

2017-02-03 Thread Nikolay Aleksandrov
On 03/02/17 03:47, David Miller wrote:
> From: Nikolay Aleksandrov 
> Date: Tue, 31 Jan 2017 16:31:58 +0100
> 
>> @@ -197,7 +197,8 @@ int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb
>>  if (dst->is_local)
>>  return br_pass_frame_up(skb);
>>  
>> -dst->used = jiffies;
>> +if (br->used_enabled)
>> +dst->used = jiffies;
> 
> Have you tried:
> 
>   if (dst->used != jiffies)
>   dst->used = jiffies;
> 
> If that isn't effective, you can tweak the test to decrease the
> granularity of the value.  Basically, if dst->used is within
> 1 HZ of jiffies, don't do the write.
> 
> I suspect this might help a lot, and not require a new bridging
> option.
> 

Yes, I actually have a patch titled "used granularity". :-) I've tested
with different values and it does help, but it either needs to be paired
with another similar test for the "updated" field (since they share a
write-heavy cache line) or they need to be in separate cache lines, to
avoid the dst's source port causing the load HitM for all who check the
value.

I'll run some more tests and probably go this way for now.

Thanks,
 Nik
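The conditional-store idea suggested above can be illustrated with a toy model (names invented, not bridge code) that counts how often the cache line would actually be dirtied:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the suggestion: only write the timestamp when it has
 * actually changed, so the fdb entry's cache line is not dirtied on
 * every forwarded packet. */
struct mock_fdb_entry {
	uint64_t used;            /* last-used timestamp, in jiffies */
};

static unsigned long mock_writes; /* counts real stores, for illustration */

static void mock_touch_used(struct mock_fdb_entry *dst, uint64_t jiffies)
{
	if (dst->used != jiffies) {  /* cheap read on the common path        */
		dst->used = jiffies;     /* dirty the line at most once per jiffy */
		mock_writes++;
	}
}
```

Within one jiffy the hot path degenerates to a read of an already-shared line, which is why this avoids the HitM traffic without a new bridge option.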



Re: [PATCH net-next] bpf: fix verifier issue at check_packet_ptr_add

2017-02-03 Thread Daniel Borkmann

On 02/03/2017 06:31 AM, William Tu wrote:
[...]

Yes, this is auto-generated. We want to use P4 2016 as front end to
generate ebpf for XDP.


[...]

R2 is no longer pkt_end, it's R2 == R0 == 0
269: (bf) r2 = r0
270: (77) r2 >>= 3
271: (bf) r4 = r1
272: (0f) r4 += r2

So at line 272, it's pkt_ptr = pkt_ptr + 0
thus the following fix works for us.
-   if (imm <= 0) {
+   if (imm < 0) {


Okay, makes sense. I'll hold off on my ACK until your respin with a
kselftest case.

Thanks,
Daniel


Re: [PATCH net-next] sfc: get rid of custom busy polling code

2017-02-03 Thread Bert Kenward
On 03/02/17 01:13, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> In linux-4.5, busy polling was implemented in core
> NAPI stack, meaning that all custom implementation can
> be removed from drivers.
> 
> Not only do we remove lots of tricky code, we also remove
> one lock operation in the fast path.
> 
> Signed-off-by: Eric Dumazet 
> Cc: Edward Cree 
> Cc: Bert Kenward 

We were talking about doing this just yesterday.
Thanks Eric.

Acked-by: Bert Kenward 


Re: [PATCH net-next] sfc-falcon: get rid of custom busy polling code

2017-02-03 Thread Bert Kenward
On 03/02/17 02:22, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> In linux-4.5, busy polling was implemented in core
> NAPI stack, meaning that all custom implementation can
> be removed from drivers.
> 
> Not only do we remove lots of tricky code, we also remove
> one lock operation in the fast path.
> 
> Signed-off-by: Eric Dumazet 
> Cc: Edward Cree 
> Cc: Bert Kenward 

Acked-by: Bert Kenward 


Re: [PATCHv3 net-next 5/7] net: add confirm_neigh method to dst_ops

2017-02-03 Thread Julian Anastasov

Hello,

On Fri, 3 Feb 2017, Steffen Klassert wrote:

> On Thu, Feb 02, 2017 at 01:04:34AM +0200, Julian Anastasov wrote:
> > 
> > It may sound good. But only dst->path->ops->confirm_neigh
> > points to a real IPv4/IPv6 function. And also, I guess, the
> > family can change while walking the chain, so we should be
> > careful while providing the original daddr (which comes from
> > sendmsg). I had the idea to walk all xforms to get the latest
> > tunnel address but this can be slow. 
> 
> Is this a per packet call or is the information cached somewhere?

It is for every packet that is sent with both
MSG_CONFIRM and MSG_PROBE, i.e. when nothing is sent
on the wire. It is used by patch 6 just for UDP, RAW, ICMP, L2TP.

> > Something like this?:
> > 
> > static void xfrm_confirm_neigh(const struct dst_entry *dst, const void *daddr)
> > {
> > const struct dst_entry *path = dst->path;
> > 
> > /* By default, daddr is from sendmsg() if we have no tunnels */
> > for (;dst != path; dst = dst->child) {
> > const struct xfrm_state *xfrm = dst->xfrm;
> > 
> > /* Use address from last tunnel */
> > if (xfrm->props.mode != XFRM_MODE_TRANSPORT)
> > daddr = &xfrm->id.daddr;
> > }
> > path->ops->confirm_neigh(path, daddr);
> > }
> 
> I thought about this (completely untested) one:
> 
> static void xfrm_confirm_neigh(const struct dst_entry *dst, const void *daddr)
> 
> {
>   const struct dst_entry *dst = dst->child;

When starting and dst arg is first xform, the above
assignment skips it. May be both lines should be swapped.

>   const struct xfrm_state *xfrm = dst->xfrm;
> 
>   if (xfrm)
>   daddr = &xfrm->id.daddr;
> 
>   dst->ops->confirm_neigh(dst, daddr);
> }
> 
> Only the last dst_entry in this call chain (path) should
> not have dst->xfrm set. So it finally calls path->ops->confirm_neigh
> with the daddr of the last transformation. But your version
> should do the same.

The above can be fixed, but recursion is risky for stack
usage. In practice, there should not be many xforms, though.
Also, is id.daddr valid for transports?

> > This should work as long as path and last tunnel are
> > from same family.
> 
> Yes, the outer mode of the last transformation has the same
> family as path.
> 
> > Also, after checking xfrm_dst_lookup() I'm not
> > sure using just &xfrm->id.daddr is enough. Should we consider
> > more places for daddr value?
> 
> Yes, indeed. We should do it like xfrm_dst_lookup() does it.

OK, I'll take the logic from there. Should I use a loop
or recursion?

Regards

--
Julian Anastasov 


[patch net-next v2 09/19] mlxsw: reg: Add Policy-Engine Policy Based Switching Register

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

The PPBS register retrieves and sets Policy Based Switching Table entries.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 31 +++
 1 file changed, 31 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 555cb80..c503363 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -2019,6 +2019,36 @@ static inline void mlxsw_reg_ptar_unpack(char *payload, char *tcam_region_info)
mlxsw_reg_ptar_tcam_region_info_memcpy_from(payload, tcam_region_info);
 }
 
+/* PPBS - Policy-Engine Policy Based Switching Register
+ * -----------------------------------------------------
+ * This register retrieves and sets Policy Based Switching Table entries.
+ */
+#define MLXSW_REG_PPBS_ID 0x300C
+#define MLXSW_REG_PPBS_LEN 0x14
+
+MLXSW_REG_DEFINE(ppbs, MLXSW_REG_PPBS_ID, MLXSW_REG_PPBS_LEN);
+
+/* reg_ppbs_pbs_ptr
+ * Index into the PBS table.
+ * For Spectrum, the index points to the KVD Linear.
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, ppbs, pbs_ptr, 0x08, 0, 24);
+
+/* reg_ppbs_system_port
+ * Unique port identifier for the final destination of the packet.
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, ppbs, system_port, 0x10, 0, 16);
+
+static inline void mlxsw_reg_ppbs_pack(char *payload, u32 pbs_ptr,
+  u16 system_port)
+{
+   MLXSW_REG_ZERO(ppbs, payload);
+   mlxsw_reg_ppbs_pbs_ptr_set(payload, pbs_ptr);
+   mlxsw_reg_ppbs_system_port_set(payload, system_port);
+}
+
 /* PRCR - Policy-Engine Rules Copy Register
 * -----------------------------------------------------
  * This register is used for accessing rules within a TCAM region.
@@ -5875,6 +5905,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
MLXSW_REG(pacl),
MLXSW_REG(pagt),
MLXSW_REG(ptar),
+   MLXSW_REG(ppbs),
MLXSW_REG(prcr),
MLXSW_REG(ptce2),
MLXSW_REG(qpcr),
-- 
2.7.4
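The MLXSW_ITEM32() definitions above generate setters such as mlxsw_reg_ppbs_pbs_ptr_set(). A rough userspace model of what such a setter does with its (offset, lsb, width) triple is sketched below; the byte order and helper name are assumptions for illustration, not the driver's actual macro expansion:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of an MLXSW_ITEM32()-style setter: place `width`
 * bits at bit position `lsb` of the 32-bit big-endian word that
 * starts at byte `offset` of the register payload. */
static void mock_item32_set(uint8_t *payload, unsigned int offset,
			    unsigned int lsb, unsigned int width,
			    uint32_t val)
{
	uint32_t word = ((uint32_t)payload[offset] << 24) |
			((uint32_t)payload[offset + 1] << 16) |
			((uint32_t)payload[offset + 2] << 8) |
			(uint32_t)payload[offset + 3];
	uint32_t mask = (width == 32 ? 0xffffffffu
				     : ((1u << width) - 1)) << lsb;

	word = (word & ~mask) | ((val << lsb) & mask); /* read-modify-write */
	payload[offset] = word >> 24;
	payload[offset + 1] = word >> 16;
	payload[offset + 2] = word >> 8;
	payload[offset + 3] = word;
}
```

With the PPBS layout quoted above, pbs_ptr would land at offset 0x08 (lsb 0, width 24) and system_port at offset 0x10 (lsb 0, width 16) of the 0x14-byte payload.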



[patch net-next v2 00/19] mlxsw: Introduce TC Flower offload using TCAM

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

This patchset introduces support for offloading TC cls_flower and actions
to the Spectrum TCAM-based policy engine.

The patchset contains patches to allow working with the flexible keys and
actions which are used in the Spectrum TCAM.

It also contains in-driver infrastructure for offloading TC rules to TCAM HW.
The TCAM management code is simple and limited for now. It is going to be
extended in follow-up work.

The last patch uses the previously introduced infra to implement
cls_flower offloading. Initially, only a limited set of match keys and
only drop and forward actions are supported.

As a dependency, this patchset introduces parman - priority array
area manager - as a library.

---
v1->v2:
- patch11:
  - use __set_bit and __test_and_clear_bit as suggested by DaveM
- patch16:
  - Added documentation to the API functions as suggested by Tom Herbert
- patch17:
  - use __set_bit and __clear_bit as suggested by DaveM

Jiri Pirko (19):
  mlxsw: item: Add 8bit item helpers
  mlxsw: item: Add helpers for getting pointer into payload for char
buffer item
  mlxsw: reg: Add Policy-Engine ACL Register
  mlxsw: reg: Add Policy-Engine ACL Group Table register
  mlxsw: reg: Add Policy-Engine TCAM Allocation Register
  mlxsw: reg: Add Policy-Engine TCAM Entry Register Version 2
  mlxsw: reg: Add Policy-Engine Port Binding Table
  mlxsw: reg: Add Policy-Engine Rules Copy Register
  mlxsw: reg: Add Policy-Engine Policy Based Switching Register
  mlxsw: reg: Add Policy-Engine Extended Flexible Action Register
  mlxsw: core: Introduce flexible keys support
  mlxsw: core: Introduce flexible actions support
  mlxsw: spectrum: Introduce basic set of flexible key blocks
  mlxsw: resources: Add ACL related resources
  list: introduce list_for_each_entry_from_reverse helper
  lib: Introduce priority array area manager
  mlxsw: spectrum: Introduce ACL core with simple TCAM implementation
  sched: cls_flower: expose priority to offloading netdevice
  mlxsw: spectrum: Implement TC flower offload

 MAINTAINERS|8 +
 drivers/net/ethernet/mellanox/mlxsw/Kconfig|1 +
 drivers/net/ethernet/mellanox/mlxsw/Makefile   |6 +-
 .../mellanox/mlxsw/core_acl_flex_actions.c |  685 +
 .../mellanox/mlxsw/core_acl_flex_actions.h |   66 ++
 .../ethernet/mellanox/mlxsw/core_acl_flex_keys.c   |  475 +
 .../ethernet/mellanox/mlxsw/core_acl_flex_keys.h   |  238 +
 drivers/net/ethernet/mellanox/mlxsw/item.h |   98 +-
 drivers/net/ethernet/mellanox/mlxsw/reg.h  |  511 -
 drivers/net/ethernet/mellanox/mlxsw/resources.h|   20 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |   32 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  106 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |  572 +++
 .../mellanox/mlxsw/spectrum_acl_flex_keys.h|  109 ++
 .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c| 1084 
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  309 ++
 include/linux/list.h   |   13 +
 include/linux/parman.h |   76 ++
 include/net/pkt_cls.h  |1 +
 lib/Kconfig|3 +
 lib/Kconfig.debug  |   10 +
 lib/Makefile   |3 +
 lib/parman.c   |  376 +++
 lib/test_parman.c  |  395 +++
 net/sched/cls_flower.c |3 +
 25 files changed, 5184 insertions(+), 16 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.h
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
 create mode 100644 include/linux/parman.h
 create mode 100644 lib/parman.c
 create mode 100644 lib/test_parman.c

-- 
2.7.4



[patch net-next v2 06/19] mlxsw: reg: Add Policy-Engine TCAM Entry Register Version 2

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

The PTCE-V2 register is used for accessing rules within a TCAM region.
It is a new version of PTCE in order to support wider key, mask and
action within a TCAM region.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 100 ++
 1 file changed, 100 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 444f0a3..1008251 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -1957,6 +1957,105 @@ static inline void mlxsw_reg_ptar_unpack(char *payload, char *tcam_region_info)
mlxsw_reg_ptar_tcam_region_info_memcpy_from(payload, tcam_region_info);
 }
 
+/* PTCE-V2 - Policy-Engine TCAM Entry Register Version 2
+ * -----------------------------------------------------
+ * This register is used for accessing rules within a TCAM region.
+ * It is a new version of PTCE in order to support wider key,
+ * mask and action within a TCAM region. This register is not supported
+ * by SwitchX and SwitchX-2.
+ */
+#define MLXSW_REG_PTCE2_ID 0x3017
+#define MLXSW_REG_PTCE2_LEN 0x1D8
+
+MLXSW_REG_DEFINE(ptce2, MLXSW_REG_PTCE2_ID, MLXSW_REG_PTCE2_LEN);
+
+/* reg_ptce2_v
+ * Valid.
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, ptce2, v, 0x00, 31, 1);
+
+/* reg_ptce2_a
+ * Activity. Set if a packet lookup has hit on the specific entry.
+ * To clear the "a" bit, use "clear activity" op or "clear on read" op.
+ * Access: RO
+ */
+MLXSW_ITEM32(reg, ptce2, a, 0x00, 30, 1);
+
+enum mlxsw_reg_ptce2_op {
+   /* Read operation. */
+   MLXSW_REG_PTCE2_OP_QUERY_READ = 0,
+   /* clear on read operation. Used to read entry
+* and clear Activity bit.
+*/
+   MLXSW_REG_PTCE2_OP_QUERY_CLEAR_ON_READ = 1,
+   /* Write operation. Used to write a new entry to the table.
+* All R/W fields are relevant for new entry. Activity bit is set
+* for new entries - Note write with v = 0 will delete the entry.
+*/
+   MLXSW_REG_PTCE2_OP_WRITE_WRITE = 0,
+   /* Update action. Only action set will be updated. */
+   MLXSW_REG_PTCE2_OP_WRITE_UPDATE = 1,
+   /* Clear activity. A bit is cleared for the entry. */
+   MLXSW_REG_PTCE2_OP_WRITE_CLEAR_ACTIVITY = 2,
+};
+
+/* reg_ptce2_op
+ * Access: OP
+ */
+MLXSW_ITEM32(reg, ptce2, op, 0x00, 20, 3);
+
+/* reg_ptce2_offset
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, ptce2, offset, 0x00, 0, 16);
+
+/* reg_ptce2_tcam_region_info
+ * Opaque object that represents the TCAM region.
+ * Access: Index
+ */
+MLXSW_ITEM_BUF(reg, ptce2, tcam_region_info, 0x10,
+  MLXSW_REG_PXXX_TCAM_REGION_INFO_LEN);
+
+#define MLXSW_REG_PTCE2_FLEX_KEY_BLOCKS_LEN 96
+
+/* reg_ptce2_flex_key_blocks
+ * ACL Key.
+ * Access: RW
+ */
+MLXSW_ITEM_BUF(reg, ptce2, flex_key_blocks, 0x20,
+  MLXSW_REG_PTCE2_FLEX_KEY_BLOCKS_LEN);
+
+/* reg_ptce2_mask
+ * mask- in the same size as key. A bit that is set directs the TCAM
+ * to compare the corresponding bit in key. A bit that is clear directs
+ * the TCAM to ignore the corresponding bit in key.
+ * Access: RW
+ */
+MLXSW_ITEM_BUF(reg, ptce2, mask, 0x80,
+  MLXSW_REG_PTCE2_FLEX_KEY_BLOCKS_LEN);
+
+#define MLXSW_REG_PTCE2_FLEX_ACTION_SET_LEN 0xA8
+
+/* reg_ptce2_flex_action_set
+ * ACL action set.
+ * Access: RW
+ */
+MLXSW_ITEM_BUF(reg, ptce2, flex_action_set, 0xE0,
+  MLXSW_REG_PTCE2_FLEX_ACTION_SET_LEN);
+
+static inline void mlxsw_reg_ptce2_pack(char *payload, bool valid,
+   enum mlxsw_reg_ptce2_op op,
+   const char *tcam_region_info,
+   u16 offset)
+{
+   MLXSW_REG_ZERO(ptce2, payload);
+   mlxsw_reg_ptce2_v_set(payload, valid);
+   mlxsw_reg_ptce2_op_set(payload, op);
+   mlxsw_reg_ptce2_offset_set(payload, offset);
+   mlxsw_reg_ptce2_tcam_region_info_memcpy_to(payload, tcam_region_info);
+}
+
 /* QPCR - QoS Policer Configuration Register
  * -----------------------------------------------------
  * The QPCR register is used to create policers - that limit
@@ -5637,6 +5736,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
MLXSW_REG(pacl),
MLXSW_REG(pagt),
MLXSW_REG(ptar),
+   MLXSW_REG(ptce2),
MLXSW_REG(qpcr),
MLXSW_REG(qtct),
MLXSW_REG(qeec),
-- 
2.7.4



[patch net-next v2 03/19] mlxsw: reg: Add Policy-Engine ACL Register

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

The PACL register is used for configuration of the ACL.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 9fb0316..18b2da4 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -1,9 +1,9 @@
 /*
  * drivers/net/ethernet/mellanox/mlxsw/reg.h
- * Copyright (c) 2015 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2015-2017 Mellanox Technologies. All rights reserved.
  * Copyright (c) 2015-2016 Ido Schimmel 
  * Copyright (c) 2015 Elad Raz 
- * Copyright (c) 2015-2016 Jiri Pirko 
+ * Copyright (c) 2015-2017 Jiri Pirko 
  * Copyright (c) 2016 Yotam Gigi 
  *
  * Redistribution and use in source and binary forms, with or without
@@ -1757,6 +1757,48 @@ static inline void mlxsw_reg_spvmlr_pack(char *payload, u8 local_port,
}
 }
 
+/* PACL - Policy-Engine ACL Register
+ * -----------------------------------------------------
+ * This register is used for configuration of the ACL.
+ */
+#define MLXSW_REG_PACL_ID 0x3004
+#define MLXSW_REG_PACL_LEN 0x70
+
+MLXSW_REG_DEFINE(pacl, MLXSW_REG_PACL_ID, MLXSW_REG_PACL_LEN);
+
+/* reg_pacl_v
+ * Valid. Setting the v bit makes the ACL valid. It should not be cleared
+ * while the ACL is bound to either a port, VLAN or ACL rule.
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, pacl, v, 0x00, 24, 1);
+
+/* reg_pacl_acl_id
+ * An identifier representing the ACL (managed by software)
+ * Range 0 .. cap_max_acl_regions - 1
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, pacl, acl_id, 0x08, 0, 16);
+
+#define MLXSW_REG_PXXX_TCAM_REGION_INFO_LEN 16
+
+/* reg_pacl_tcam_region_info
+ * Opaque object that represents a TCAM region.
+ * Obtained through PTAR register.
+ * Access: RW
+ */
+MLXSW_ITEM_BUF(reg, pacl, tcam_region_info, 0x30,
+  MLXSW_REG_PXXX_TCAM_REGION_INFO_LEN);
+
+static inline void mlxsw_reg_pacl_pack(char *payload, u16 acl_id,
+  bool valid, const char *tcam_region_info)
+{
+   MLXSW_REG_ZERO(pacl, payload);
+   mlxsw_reg_pacl_acl_id_set(payload, acl_id);
+   mlxsw_reg_pacl_v_set(payload, valid);
+   mlxsw_reg_pacl_tcam_region_info_memcpy_to(payload, tcam_region_info);
+}
+
 /* QPCR - QoS Policer Configuration Register
  * -----------------------------------------------------
  * The QPCR register is used to create policers - that limit
@@ -5434,6 +5476,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
MLXSW_REG(svpe),
MLXSW_REG(sfmr),
MLXSW_REG(spvmlr),
+   MLXSW_REG(pacl),
MLXSW_REG(qpcr),
MLXSW_REG(qtct),
MLXSW_REG(qeec),
-- 
2.7.4



[patch net-next v2 11/19] mlxsw: core: Introduce flexible keys support

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

Hardware supports matching on so-called "flexible keys". The idea is to
assemble an optimal key to use for matching according to the fields in
the packet (elements) requested by the user. Certain sets of elements are
combined into pre-defined blocks. There is a picker to find the needed
blocks. Keys consist of 1..n blocks.

Alongside that, an initial set of elements is introduced in order
to be able to offload basic cls_flower rules.

Picked keys are cached so multiple rules can share them.

There is an encode function provided that takes care of encoding key and
mask values according to a given key.
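The block "picker" described above can be thought of as a small covering problem. The sketch below is a deliberately naive greedy picker over invented block contents, just to make the idea concrete; the driver's real blocks and selection logic live in the mlxsw sources and may differ:

```c
#include <assert.h>

/* Toy model of the picker: given a bitmask of elements a rule wants
 * to match on, greedily pick pre-defined blocks until every requested
 * element is covered. Block contents here are invented. */
struct mock_block {
	unsigned int elements;   /* bitmask of elements this block carries */
};

/* Returns a bitmask of picked block indices, or -1 if uncoverable. */
static int mock_pick_blocks(const struct mock_block *blocks, int n_blocks,
			    unsigned int wanted)
{
	int picked = 0;

	for (int i = 0; i < n_blocks && wanted; i++) {
		if (blocks[i].elements & wanted) {
			picked |= 1 << i;                /* take this block   */
			wanted &= ~blocks[i].elements;   /* elements covered  */
		}
	}
	return wanted ? -1 : picked;
}
```

Since picked keys are cached, two rules requesting the same element sets would map to the same block combination and share one key.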

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
v1->v2:
- use __set_bit and __test_and_clear_bit as suggested by DaveM
---
 drivers/net/ethernet/mellanox/mlxsw/Makefile   |   2 +-
 .../ethernet/mellanox/mlxsw/core_acl_flex_keys.c   | 475 +
 .../ethernet/mellanox/mlxsw/core_acl_flex_keys.h   | 238 +++
 3 files changed, 714 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.h

diff --git a/drivers/net/ethernet/mellanox/mlxsw/Makefile b/drivers/net/ethernet/mellanox/mlxsw/Makefile
index fe8dadb..6a83768 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Makefile
+++ b/drivers/net/ethernet/mellanox/mlxsw/Makefile
@@ -1,5 +1,5 @@
 obj-$(CONFIG_MLXSW_CORE)   += mlxsw_core.o
-mlxsw_core-objs:= core.o
+mlxsw_core-objs:= core.o core_acl_flex_keys.o
 mlxsw_core-$(CONFIG_MLXSW_CORE_HWMON) += core_hwmon.o
 mlxsw_core-$(CONFIG_MLXSW_CORE_THERMAL) += core_thermal.o
 obj-$(CONFIG_MLXSW_PCI)+= mlxsw_pci.o
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
new file mode 100644
index 000..b32a009
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
@@ -0,0 +1,475 @@
+/*
+ * drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_keys.c
+ * Copyright (c) 2017 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2017 Jiri Pirko 
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. Neither the names of the copyright holders nor the names of its
+ *contributors may be used to endorse or promote products derived from
+ *this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#include "item.h"
+#include "core_acl_flex_keys.h"
+
+struct mlxsw_afk {
+   struct list_head key_info_list;
+   unsigned int max_blocks;
+   const struct mlxsw_afk_block *blocks;
+   unsigned int blocks_count;
+};
+
+static bool mlxsw_afk_blocks_check(struct mlxsw_afk *mlxsw_afk)
+{
+   int i;
+   int j;
+
+   for (i = 0; i < mlxsw_afk->blocks_count; i++) {
+   const struct mlxsw_afk_block *block = &mlxsw_afk->blocks[i];
+
+   for (j = 0; j < block->instances_count; j++) {
+   struct mlxsw_afk_element_inst *elinst;
+
+   elinst = &block->instances[j];
+   if (elinst->type != elinst->info->type ||
+   elinst->item.size.bits !=
+   elinst->info->item.size.bits)
+   return false;
+   }
+   }
+   return true;
+}
+
+struct mlxsw_afk *mlxsw_afk_create(unsigned int max_blocks,
+  con

[patch net-next v2 08/19] mlxsw: reg: Add Policy-Engine Rules Copy Register

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

The PRCR register is used for accessing rules within a TCAM region.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 77 +++
 1 file changed, 77 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h 
b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index ce6d85a..555cb80 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -2019,6 +2019,82 @@ static inline void mlxsw_reg_ptar_unpack(char *payload, 
char *tcam_region_info)
mlxsw_reg_ptar_tcam_region_info_memcpy_from(payload, tcam_region_info);
 }
 
+/* PRCR - Policy-Engine Rules Copy Register
+ * 
+ * This register is used for accessing rules within a TCAM region.
+ */
+#define MLXSW_REG_PRCR_ID 0x300D
+#define MLXSW_REG_PRCR_LEN 0x40
+
+MLXSW_REG_DEFINE(prcr, MLXSW_REG_PRCR_ID, MLXSW_REG_PRCR_LEN);
+
+enum mlxsw_reg_prcr_op {
+   /* Move rules. Moves the rules from "tcam_region_info" starting
+* at offset "offset" to "dest_tcam_region_info"
+* at offset "dest_offset."
+*/
+   MLXSW_REG_PRCR_OP_MOVE,
+   /* Copy rules. Copies the rules from "tcam_region_info" starting
+* at offset "offset" to "dest_tcam_region_info"
+* at offset "dest_offset."
+*/
+   MLXSW_REG_PRCR_OP_COPY,
+};
+
+/* reg_prcr_op
+ * Access: OP
+ */
+MLXSW_ITEM32(reg, prcr, op, 0x00, 28, 4);
+
+/* reg_prcr_offset
+ * Offset within the source region to copy/move from.
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, prcr, offset, 0x00, 0, 16);
+
+/* reg_prcr_size
+ * The number of rules to copy/move.
+ * Access: WO
+ */
+MLXSW_ITEM32(reg, prcr, size, 0x04, 0, 16);
+
+/* reg_prcr_tcam_region_info
+ * Opaque object that represents the source TCAM region.
+ * Access: Index
+ */
+MLXSW_ITEM_BUF(reg, prcr, tcam_region_info, 0x10,
+  MLXSW_REG_PXXX_TCAM_REGION_INFO_LEN);
+
+/* reg_prcr_dest_offset
+ * Offset within the destination region to copy/move to.
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, prcr, dest_offset, 0x20, 0, 16);
+
+/* reg_prcr_dest_tcam_region_info
+ * Opaque object that represents the destination TCAM region.
+ * Access: Index
+ */
+MLXSW_ITEM_BUF(reg, prcr, dest_tcam_region_info, 0x30,
+  MLXSW_REG_PXXX_TCAM_REGION_INFO_LEN);
+
+static inline void mlxsw_reg_prcr_pack(char *payload, enum mlxsw_reg_prcr_op 
op,
+  const char *src_tcam_region_info,
+  u16 src_offset,
+  const char *dest_tcam_region_info,
+  u16 dest_offset, u16 size)
+{
+   MLXSW_REG_ZERO(prcr, payload);
+   mlxsw_reg_prcr_op_set(payload, op);
+   mlxsw_reg_prcr_offset_set(payload, src_offset);
+   mlxsw_reg_prcr_size_set(payload, size);
+   mlxsw_reg_prcr_tcam_region_info_memcpy_to(payload,
+ src_tcam_region_info);
+   mlxsw_reg_prcr_dest_offset_set(payload, dest_offset);
+   mlxsw_reg_prcr_dest_tcam_region_info_memcpy_to(payload,
+  dest_tcam_region_info);
+}
+
 /* PTCE-V2 - Policy-Engine TCAM Entry Register Version 2
  * -
  * This register is used for accessing rules within a TCAM region.
@@ -5799,6 +5875,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
MLXSW_REG(pacl),
MLXSW_REG(pagt),
MLXSW_REG(ptar),
+   MLXSW_REG(prcr),
MLXSW_REG(ptce2),
MLXSW_REG(qpcr),
MLXSW_REG(qtct),
-- 
2.7.4



[patch net-next v2 12/19] mlxsw: core: Introduce flexible actions support

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

Each entry which is matched during ACL lookup points to an action set.
This action set contains up to three separate actions. If more actions
need to be chained, an extended set is created to hold them in the KVD
linear area.

This patch implements handling of sets and encoding of actions.
Currently, only two actions are supported: drop and forward. The forward
action uses a PBS pointer to the KVD linear area, so the action code
needs to take care of this as well.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/Makefile   |   3 +-
 .../mellanox/mlxsw/core_acl_flex_actions.c | 685 +
 .../mellanox/mlxsw/core_acl_flex_actions.h |  66 ++
 3 files changed, 753 insertions(+), 1 deletion(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.h

diff --git a/drivers/net/ethernet/mellanox/mlxsw/Makefile 
b/drivers/net/ethernet/mellanox/mlxsw/Makefile
index 6a83768..c4c48ba 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Makefile
+++ b/drivers/net/ethernet/mellanox/mlxsw/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_MLXSW_CORE)   += mlxsw_core.o
-mlxsw_core-objs:= core.o core_acl_flex_keys.o
+mlxsw_core-objs:= core.o core_acl_flex_keys.o \
+  core_acl_flex_actions.o
 mlxsw_core-$(CONFIG_MLXSW_CORE_HWMON) += core_hwmon.o
 mlxsw_core-$(CONFIG_MLXSW_CORE_THERMAL) += core_thermal.o
 obj-$(CONFIG_MLXSW_PCI)+= mlxsw_pci.o
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c 
b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
new file mode 100644
index 000..34e2fef
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
@@ -0,0 +1,685 @@
+/*
+ * drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c
+ * Copyright (c) 2017 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2017 Jiri Pirko 
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. Neither the names of the copyright holders nor the names of its
+ *contributors may be used to endorse or promote products derived from
+ *this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "item.h"
+#include "core_acl_flex_actions.h"
+
+enum mlxsw_afa_set_type {
+   MLXSW_AFA_SET_TYPE_NEXT,
+   MLXSW_AFA_SET_TYPE_GOTO,
+};
+
+/* afa_set_type
+ * Type of the record at the end of the action set.
+ */
+MLXSW_ITEM32(afa, set, type, 0xA0, 28, 4);
+
+/* afa_set_next_action_set_ptr
+ * A pointer to the next action set in the KVD Centralized database.
+ */
+MLXSW_ITEM32(afa, set, next_action_set_ptr, 0xA4, 0, 24);
+
+/* afa_set_goto_g
+ * group - When set, the binding is of an ACL group. When cleared,
+ * the binding is of an ACL.
+ * Must be set to 1 for Spectrum.
+ */
+MLXSW_ITEM32(afa, set, goto_g, 0xA4, 29, 1);
+
+enum mlxsw_afa_set_goto_binding_cmd {
+   /* continue to the next binding point */
+   MLXSW_AFA_SET_GOTO_BINDING_CMD_NONE,
+   /* jump to the next binding point, no return */
+   MLXSW_AFA_SET_GOTO_BINDING_CMD_JUMP,
+   /* terminate the ACL binding */
+   MLXSW_AFA_SET_GOTO_BINDING_CMD_TERM = 4,
+};
+
+/* afa_set_goto_binding_cmd */
+MLXSW_ITEM32(afa, set, goto_binding_cmd, 0xA4, 24, 3);
+
+/* afa_set_goto_next_binding
+ * ACL/ACL group identifier. If the

[patch net-next v2 10/19] mlxsw: reg: Add Policy-Engine Extended Flexible Action Register

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

The PEFA register is used for accessing an extended flexible action entry
in the central KVD Linear Database.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 39 ---
 1 file changed, 36 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h 
b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index c503363..b50a312 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -2125,6 +2125,40 @@ static inline void mlxsw_reg_prcr_pack(char *payload, 
enum mlxsw_reg_prcr_op op,
   dest_tcam_region_info);
 }
 
+/* PEFA - Policy-Engine Extended Flexible Action Register
+ * --
+ * This register is used for accessing an extended flexible action entry
+ * in the central KVD Linear Database.
+ */
+#define MLXSW_REG_PEFA_ID 0x300F
+#define MLXSW_REG_PEFA_LEN 0xB0
+
+MLXSW_REG_DEFINE(pefa, MLXSW_REG_PEFA_ID, MLXSW_REG_PEFA_LEN);
+
+/* reg_pefa_index
+ * Index in the KVD Linear Centralized Database.
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, pefa, index, 0x00, 0, 24);
+
+#define MLXSW_REG_PXXX_FLEX_ACTION_SET_LEN 0xA8
+
+/* reg_pefa_flex_action_set
+ * Action-set to perform when rule is matched.
+ * Must be zero padded if action set is shorter.
+ * Access: RW
+ */
+MLXSW_ITEM_BUF(reg, pefa, flex_action_set, 0x08,
+  MLXSW_REG_PXXX_FLEX_ACTION_SET_LEN);
+
+static inline void mlxsw_reg_pefa_pack(char *payload, u32 index,
+  const char *flex_action_set)
+{
+   MLXSW_REG_ZERO(pefa, payload);
+   mlxsw_reg_pefa_index_set(payload, index);
+   mlxsw_reg_pefa_flex_action_set_memcpy_to(payload, flex_action_set);
+}
+
 /* PTCE-V2 - Policy-Engine TCAM Entry Register Version 2
  * -
  * This register is used for accessing rules within a TCAM region.
@@ -2203,14 +2237,12 @@ MLXSW_ITEM_BUF(reg, ptce2, flex_key_blocks, 0x20,
 MLXSW_ITEM_BUF(reg, ptce2, mask, 0x80,
   MLXSW_REG_PTCE2_FLEX_KEY_BLOCKS_LEN);
 
-#define MLXSW_REG_PTCE2_FLEX_ACTION_SET_LEN 0xA8
-
 /* reg_ptce2_flex_action_set
  * ACL action set.
  * Access: RW
  */
 MLXSW_ITEM_BUF(reg, ptce2, flex_action_set, 0xE0,
-  MLXSW_REG_PTCE2_FLEX_ACTION_SET_LEN);
+  MLXSW_REG_PXXX_FLEX_ACTION_SET_LEN);
 
 static inline void mlxsw_reg_ptce2_pack(char *payload, bool valid,
enum mlxsw_reg_ptce2_op op,
@@ -5907,6 +5939,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
MLXSW_REG(ptar),
MLXSW_REG(ppbs),
MLXSW_REG(prcr),
+   MLXSW_REG(pefa),
MLXSW_REG(ptce2),
MLXSW_REG(qpcr),
MLXSW_REG(qtct),
-- 
2.7.4



[patch net-next v2 17/19] mlxsw: spectrum: Introduce ACL core with simple TCAM implementation

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

Add ACL core infrastructure for the Spectrum ASIC. This infra provides an
abstraction layer over specific HW implementations. There are two basic
objects used. One is a "rule" and the second is a "ruleset", which serves
as a container of multiple rules. In general, within one ruleset the rules
are allowed to have multiple priorities and masks. Each ruleset is bound
to either the ingress or egress of a port netdevice.

The initial TCAM implementation is very simple and limited. It utilizes
the parman lsort manager to take care of the TCAM region layout.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
v1->v2:
- use __set_bit and __clear_bit as suggested by DaveM
---
 drivers/net/ethernet/mellanox/mlxsw/Kconfig|1 +
 drivers/net/ethernet/mellanox/mlxsw/Makefile   |3 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |   17 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |  100 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |  572 +++
 .../ethernet/mellanox/mlxsw/spectrum_acl_tcam.c| 1084 
 6 files changed, 1769 insertions(+), 8 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_tcam.c

diff --git a/drivers/net/ethernet/mellanox/mlxsw/Kconfig 
b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
index 16f44b9..76a7574 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Kconfig
+++ b/drivers/net/ethernet/mellanox/mlxsw/Kconfig
@@ -73,6 +73,7 @@ config MLXSW_SWITCHX2
 config MLXSW_SPECTRUM
tristate "Mellanox Technologies Spectrum support"
depends on MLXSW_CORE && MLXSW_PCI && NET_SWITCHDEV && VLAN_8021Q
+   select PARMAN
default m
---help---
  This driver supports Mellanox Technologies Spectrum Ethernet
diff --git a/drivers/net/ethernet/mellanox/mlxsw/Makefile 
b/drivers/net/ethernet/mellanox/mlxsw/Makefile
index c4c48ba..1459716 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Makefile
+++ b/drivers/net/ethernet/mellanox/mlxsw/Makefile
@@ -14,7 +14,8 @@ mlxsw_switchx2-objs   := switchx2.o
 obj-$(CONFIG_MLXSW_SPECTRUM)   += mlxsw_spectrum.o
 mlxsw_spectrum-objs:= spectrum.o spectrum_buffers.o \
   spectrum_switchdev.o spectrum_router.o \
-  spectrum_kvdl.o
+  spectrum_kvdl.o spectrum_acl.o \
+  spectrum_acl_tcam.o
 mlxsw_spectrum-$(CONFIG_MLXSW_SPECTRUM_DCB)+= spectrum_dcb.o
 obj-$(CONFIG_MLXSW_MINIMAL)+= mlxsw_minimal.o
 mlxsw_minimal-objs := minimal.o
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 467aa52..b1d77e1 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1,7 +1,7 @@
 /*
  * drivers/net/ethernet/mellanox/mlxsw/spectrum.c
- * Copyright (c) 2015 Mellanox Technologies. All rights reserved.
- * Copyright (c) 2015 Jiri Pirko 
+ * Copyright (c) 2015-2017 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2015-2017 Jiri Pirko 
  * Copyright (c) 2015 Ido Schimmel 
  * Copyright (c) 2015 Elad Raz 
  *
@@ -138,8 +138,6 @@ MLXSW_ITEM32(tx, hdr, fid, 0x08, 0, 16);
  */
 MLXSW_ITEM32(tx, hdr, type, 0x0C, 0, 4);
 
-static bool mlxsw_sp_port_dev_check(const struct net_device *dev);
-
 static void mlxsw_sp_txhdr_construct(struct sk_buff *skb,
 const struct mlxsw_tx_info *tx_info)
 {
@@ -3203,6 +3201,12 @@ static int mlxsw_sp_init(struct mlxsw_core *mlxsw_core,
goto err_span_init;
}
 
+   err = mlxsw_sp_acl_init(mlxsw_sp);
+   if (err) {
+   dev_err(mlxsw_sp->bus_info->dev, "Failed to initialize ACL\n");
+   goto err_acl_init;
+   }
+
err = mlxsw_sp_ports_create(mlxsw_sp);
if (err) {
dev_err(mlxsw_sp->bus_info->dev, "Failed to create ports\n");
@@ -3212,6 +3216,8 @@ static int mlxsw_sp_init(struct mlxsw_core *mlxsw_core,
return 0;
 
 err_ports_create:
+   mlxsw_sp_acl_fini(mlxsw_sp);
+err_acl_init:
mlxsw_sp_span_fini(mlxsw_sp);
 err_span_init:
mlxsw_sp_router_fini(mlxsw_sp);
@@ -3232,6 +3238,7 @@ static void mlxsw_sp_fini(struct mlxsw_core *mlxsw_core)
struct mlxsw_sp *mlxsw_sp = mlxsw_core_driver_priv(mlxsw_core);
 
mlxsw_sp_ports_remove(mlxsw_sp);
+   mlxsw_sp_acl_fini(mlxsw_sp);
mlxsw_sp_span_fini(mlxsw_sp);
mlxsw_sp_router_fini(mlxsw_sp);
mlxsw_sp_switchdev_fini(mlxsw_sp);
@@ -3297,7 +3304,7 @@ static struct mlxsw_driver mlxsw_sp_driver = {
.profile= &mlxsw_sp_config_profile,
 };
 
-static bool mlxsw_sp_port_dev_check(const struct net_device *dev)
+bool mlxsw_sp_port_dev_check(const struct net_device *dev)
 {
return dev->netdev_ops == &mlxsw_sp_port_netdev_ops

[patch net-next v2 13/19] mlxsw: spectrum: Introduce basic set of flexible key blocks

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

Introduce a basic set of Spectrum flexible key blocks. It contains the
blocks needed to carry all of the elements defined so far.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 .../mellanox/mlxsw/spectrum_acl_flex_keys.h| 109 +
 1 file changed, 109 insertions(+)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.h

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.h
new file mode 100644
index 000..82b81cf
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.h
@@ -0,0 +1,109 @@
+/*
+ * drivers/net/ethernet/mellanox/mlxsw/spectrum_acl_flex_keys.h
+ * Copyright (c) 2017 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2017 Jiri Pirko 
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. Neither the names of the copyright holders nor the names of its
+ *contributors may be used to endorse or promote products derived from
+ *this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _MLXSW_SPECTRUM_ACL_FLEX_KEYS_H
+#define _MLXSW_SPECTRUM_ACL_FLEX_KEYS_H
+
+#include "core_acl_flex_keys.h"
+
+static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_l2_dmac[] = {
+   MLXSW_AFK_ELEMENT_INST_BUF(DMAC, 0x00, 6),
+   MLXSW_AFK_ELEMENT_INST_U32(SRC_SYS_PORT, 0x0C, 0, 16),
+};
+
+static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_l2_smac[] = {
+   MLXSW_AFK_ELEMENT_INST_BUF(SMAC, 0x00, 6),
+   MLXSW_AFK_ELEMENT_INST_U32(SRC_SYS_PORT, 0x0C, 0, 16),
+};
+
+static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_l2_smac_ex[] = {
+   MLXSW_AFK_ELEMENT_INST_BUF(SMAC, 0x02, 6),
+   MLXSW_AFK_ELEMENT_INST_U32(ETHERTYPE, 0x0C, 0, 16),
+};
+
+static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv4_sip[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(SRC_IP4, 0x00, 0, 32),
+   MLXSW_AFK_ELEMENT_INST_U32(IP_PROTO, 0x08, 0, 8),
+   MLXSW_AFK_ELEMENT_INST_U32(SRC_SYS_PORT, 0x0C, 0, 16),
+};
+
+static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv4_dip[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(DST_IP4, 0x00, 0, 32),
+   MLXSW_AFK_ELEMENT_INST_U32(IP_PROTO, 0x08, 0, 8),
+   MLXSW_AFK_ELEMENT_INST_U32(SRC_SYS_PORT, 0x0C, 0, 16),
+};
+
+static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv4_ex[] = {
+   MLXSW_AFK_ELEMENT_INST_U32(SRC_L4_PORT, 0x08, 0, 16),
+   MLXSW_AFK_ELEMENT_INST_U32(DST_L4_PORT, 0x0C, 0, 16),
+};
+
+static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv6_dip[] = {
+   MLXSW_AFK_ELEMENT_INST_BUF(DST_IP6_LO, 0x00, 8),
+};
+
+static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv6_ex1[] = {
+   MLXSW_AFK_ELEMENT_INST_BUF(DST_IP6_HI, 0x00, 8),
+   MLXSW_AFK_ELEMENT_INST_U32(IP_PROTO, 0x08, 0, 8),
+};
+
+static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv6_sip[] = {
+   MLXSW_AFK_ELEMENT_INST_BUF(SRC_IP6_LO, 0x00, 8),
+};
+
+static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_ipv6_sip_ex[] = 
{
+   MLXSW_AFK_ELEMENT_INST_BUF(SRC_IP6_HI, 0x00, 8),
+};
+
+static struct mlxsw_afk_element_inst mlxsw_sp_afk_element_info_packet_type[] = 
{
+   MLXSW_AFK_ELEMENT_INST_U32(ETHERTYPE, 0x00, 0, 16),
+};
+
+static const struct mlxsw_afk_block mlxsw_sp_afk_blocks[] = {
+   MLXSW_AFK_BLOCK(0x10, mlxsw_sp_afk_element_info_l2_dmac),
+   MLXSW_AFK_BLOCK(0x11, mlxsw_sp_afk_element_info_l2_smac),
+   MLX

[patch net-next v2 15/19] list: introduce list_for_each_entry_from_reverse helper

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

Similar to list_for_each_entry_continue and its reverse variant
list_for_each_entry_continue_reverse, introduce a reverse helper for
list_for_each_entry_from.

Signed-off-by: Jiri Pirko 
Acked-by: Ido Schimmel 
---
 include/linux/list.h | 13 +
 1 file changed, 13 insertions(+)

diff --git a/include/linux/list.h b/include/linux/list.h
index d1039ec..ae537fa 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -527,6 +527,19 @@ static inline void list_splice_tail_init(struct list_head 
*list,
 pos = list_next_entry(pos, member))
 
 /**
+ * list_for_each_entry_from_reverse - iterate backwards over list of given type
+ *from the current point
+ * @pos:   the type * to use as a loop cursor.
+ * @head:  the head for your list.
+ * @member:the name of the list_head within the struct.
+ *
+ * Iterate backwards over list of given type, continuing from current position.
+ */
+#define list_for_each_entry_from_reverse(pos, head, member)\
+   for (; &pos->member != (head);  \
+pos = list_prev_entry(pos, member))
+
+/**
  * list_for_each_entry_safe - iterate over list of given type safe against 
removal of list entry
  * @pos:   the type * to use as a loop cursor.
  * @n: another type * to use as temporary storage
-- 
2.7.4



[patch net-next v2 16/19] lib: Introduce priority array area manager

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

This introduces an infrastructure for the management of linear priority
array areas. Priority order in the array matters; however, the order of
items inside a priority group does not matter.

As an initial implementation, the L-sort algorithm is used. It is quite
trivial. A more advanced algorithm, called P-sort, will be introduced as
a follow-up. The infrastructure is prepared for other algorithms.

Alongside this, a testing module is introduced as well.

Signed-off-by: Jiri Pirko 
---
v1->v2:
- Added documentation to the API functions as suggested by Tom Herbert
---
 MAINTAINERS|   8 +
 include/linux/parman.h |  76 ++
 lib/Kconfig|   3 +
 lib/Kconfig.debug  |  10 ++
 lib/Makefile   |   3 +
 lib/parman.c   | 376 ++
 lib/test_parman.c  | 395 +
 7 files changed, 871 insertions(+)
 create mode 100644 include/linux/parman.h
 create mode 100644 lib/parman.c
 create mode 100644 lib/test_parman.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 300d2ec..626758b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9375,6 +9375,14 @@ F:   drivers/video/fbdev/sti*
 F: drivers/video/console/sti*
 F: drivers/video/logo/logo_parisc*
 
+PARMAN
+M: Jiri Pirko 
+L: netdev@vger.kernel.org
+S: Supported
+F: lib/parman.c
+F: lib/test_parman.c
+F: include/linux/parman.h
+
 PC87360 HARDWARE MONITORING DRIVER
 M: Jim Cromie 
 L: linux-hw...@vger.kernel.org
diff --git a/include/linux/parman.h b/include/linux/parman.h
new file mode 100644
index 000..3c8
--- /dev/null
+++ b/include/linux/parman.h
@@ -0,0 +1,76 @@
+/*
+ * include/linux/parman.h - Manager for linear priority array areas
+ * Copyright (c) 2017 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2017 Jiri Pirko 
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. Neither the names of the copyright holders nor the names of its
+ *contributors may be used to endorse or promote products derived from
+ *this software without specific prior written permission.
+ *
+ * Alternatively, this software may be distributed under the terms of the
+ * GNU General Public License ("GPL") version 2 as published by the Free
+ * Software Foundation.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+ * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
+ * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+ * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+ * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+ * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+ * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+ * POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _PARMAN_H
+#define _PARMAN_H
+
+#include 
+
+enum parman_algo_type {
+   PARMAN_ALGO_TYPE_LSORT,
+};
+
+struct parman_item {
+   struct list_head list;
+   unsigned long index;
+};
+
+struct parman_prio {
+   struct list_head list;
+   struct list_head item_list;
+   unsigned long priority;
+};
+
+struct parman_ops {
+   unsigned long base_count;
+   unsigned long resize_step;
+   int (*resize)(void *priv, unsigned long new_count);
+   void (*move)(void *priv, unsigned long from_index,
+unsigned long to_index, unsigned long count);
+   enum parman_algo_type algo;
+};
+
+struct parman;
+
+struct parman *parman_create(const struct parman_ops *ops, void *priv);
+void parman_destroy(struct parman *parman);
+void parman_prio_init(struct parman *parman, struct parman_prio *prio,
+ unsigned long priority);
+void parman_prio_fini(struct parman_prio *prio);
+int parman_item_add(struct parman *parman, struct parman_prio *prio,
+   struct parman_item *item);
+void parman_item_remove(struct parman *parman, struct parman_prio *prio,
+   struct parman_item *item);
+
+#endif
diff --git a/lib/Kconfig b/lib/Kconfig
index 260a80e..5d644f1 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -550,4 +550,7 @@ config STACKDEPOT
 config SBITMAP
bool
 
+config PARMAN
+   trist

[patch net-next v2 04/19] mlxsw: reg: Add Policy-Engine ACL Group Table register

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

The PAGT register is used for configuration of the ACL Group Table.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 54 +++
 1 file changed, 54 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h 
b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 18b2da4..5f76fbc 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -1799,6 +1799,59 @@ static inline void mlxsw_reg_pacl_pack(char *payload, 
u16 acl_id,
mlxsw_reg_pacl_tcam_region_info_memcpy_to(payload, tcam_region_info);
 }
 
+/* PAGT - Policy-Engine ACL Group Table
+ * 
+ * This register is used for configuration of the ACL Group Table.
+ */
+#define MLXSW_REG_PAGT_ID 0x3005
+#define MLXSW_REG_PAGT_BASE_LEN 0x30
+#define MLXSW_REG_PAGT_ACL_LEN 4
+#define MLXSW_REG_PAGT_ACL_MAX_NUM 16
+#define MLXSW_REG_PAGT_LEN (MLXSW_REG_PAGT_BASE_LEN + \
+   MLXSW_REG_PAGT_ACL_MAX_NUM * MLXSW_REG_PAGT_ACL_LEN)
+
+MLXSW_REG_DEFINE(pagt, MLXSW_REG_PAGT_ID, MLXSW_REG_PAGT_LEN);
+
+/* reg_pagt_size
+ * Number of ACLs in the group.
+ * Size 0 invalidates a group.
+ * Range 0 .. cap_max_acl_group_size (hard coded to 16 for now)
+ * Total number of ACLs in all groups must be lower or equal
+ * to cap_max_acl_tot_groups
+ * Note: a group which is bound must not be invalidated
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, pagt, size, 0x00, 0, 8);
+
+/* reg_pagt_acl_group_id
+ * An identifier (numbered from 0..cap_max_acl_groups-1) representing
+ * the ACL Group identifier (managed by software).
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, pagt, acl_group_id, 0x08, 0, 16);
+
+/* reg_pagt_acl_id
+ * ACL identifier
+ * Access: RW
+ */
+MLXSW_ITEM32_INDEXED(reg, pagt, acl_id, 0x30, 0, 16, 0x04, 0x00, false);
+
+static inline void mlxsw_reg_pagt_pack(char *payload, u16 acl_group_id)
+{
+   MLXSW_REG_ZERO(pagt, payload);
+   mlxsw_reg_pagt_acl_group_id_set(payload, acl_group_id);
+}
+
+static inline void mlxsw_reg_pagt_acl_id_pack(char *payload, int index,
+ u16 acl_id)
+{
+   u8 size = mlxsw_reg_pagt_size_get(payload);
+
+   if (index >= size)
+   mlxsw_reg_pagt_size_set(payload, index + 1);
+   mlxsw_reg_pagt_acl_id_set(payload, index, acl_id);
+}
+
 /* QPCR - QoS Policer Configuration Register
  * -
  * The QPCR register is used to create policers - that limit
@@ -5477,6 +5530,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
MLXSW_REG(sfmr),
MLXSW_REG(spvmlr),
MLXSW_REG(pacl),
+   MLXSW_REG(pagt),
MLXSW_REG(qpcr),
MLXSW_REG(qtct),
MLXSW_REG(qeec),
-- 
2.7.4



[patch net-next v2 01/19] mlxsw: item: Add 8bit item helpers

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

Item helpers for 8-bit values are needed, so let's add them.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/item.h | 79 +-
 1 file changed, 77 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/item.h 
b/drivers/net/ethernet/mellanox/mlxsw/item.h
index 3c95e3d..09f35de 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/item.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/item.h
@@ -1,7 +1,7 @@
 /*
  * drivers/net/ethernet/mellanox/mlxsw/item.h
- * Copyright (c) 2015 Mellanox Technologies. All rights reserved.
- * Copyright (c) 2015 Jiri Pirko 
+ * Copyright (c) 2015-2017 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2015-2017 Jiri Pirko 
  * Copyright (c) 2015 Ido Schimmel 
  *
  * Redistribution and use in source and binary forms, with or without
@@ -72,6 +72,40 @@ __mlxsw_item_offset(const struct mlxsw_item *item, unsigned 
short index,
typesize);
 }
 
+static inline u8 __mlxsw_item_get8(const char *buf,
+  const struct mlxsw_item *item,
+  unsigned short index)
+{
+   unsigned int offset = __mlxsw_item_offset(item, index, sizeof(u8));
+   u8 *b = (u8 *) buf;
+   u8 tmp;
+
+   tmp = b[offset];
+   tmp >>= item->shift;
+   tmp &= GENMASK(item->size.bits - 1, 0);
+   if (item->no_real_shift)
+   tmp <<= item->shift;
+   return tmp;
+}
+
+static inline void __mlxsw_item_set8(char *buf, const struct mlxsw_item *item,
+unsigned short index, u8 val)
+{
+   unsigned int offset = __mlxsw_item_offset(item, index,
+ sizeof(u8));
+   u8 *b = (u8 *) buf;
+   u8 mask = GENMASK(item->size.bits - 1, 0) << item->shift;
+   u8 tmp;
+
+   if (!item->no_real_shift)
+   val <<= item->shift;
+   val &= mask;
+   tmp = b[offset];
+   tmp &= ~mask;
+   tmp |= val;
+   b[offset] = tmp;
+}
+
 static inline u16 __mlxsw_item_get16(const char *buf,
 const struct mlxsw_item *item,
 unsigned short index)
@@ -253,6 +287,47 @@ static inline void __mlxsw_item_bit_array_set(char *buf,
  * _iname: item name within the container
  */
 
+#define MLXSW_ITEM8(_type, _cname, _iname, _offset, _shift, _sizebits) 
\
+static struct mlxsw_item __ITEM_NAME(_type, _cname, _iname) = {
\
+   .offset = _offset,  
\
+   .shift = _shift,
\
+   .size = {.bits = _sizebits,},   
\
+   .name = #_type "_" #_cname "_" #_iname, 
\
+}; 
\
+static inline u8 mlxsw_##_type##_##_cname##_##_iname##_get(const char *buf)
\
+{  
\
+   return __mlxsw_item_get8(buf, &__ITEM_NAME(_type, _cname, _iname), 0);  
\
+}  
\
+static inline void mlxsw_##_type##_##_cname##_##_iname##_set(char *buf, u8 
val)\
+{  
\
+   __mlxsw_item_set8(buf, &__ITEM_NAME(_type, _cname, _iname), 0, val);
\
+}
+
+#define MLXSW_ITEM8_INDEXED(_type, _cname, _iname, _offset, _shift, _sizebits, 
\
+   _step, _instepoffset, _norealshift) 
\
+static struct mlxsw_item __ITEM_NAME(_type, _cname, _iname) = {
\
+   .offset = _offset,  
\
+   .step = _step,  
\
+   .in_step_offset = _instepoffset,
\
+   .shift = _shift,
\
+   .no_real_shift = _norealshift,  
\
+   .size = {.bits = _sizebits,},   
\
+   .name = #_type "_" #_cname "_" #_iname, 
\
+}; 
\
+static inline u8   
\
+mlxsw_##_type##_##_cname##_##_iname##_get(const char *buf, unsigned short 
index)\
+{  
\
+   return __mlxsw_item_get8(buf, &__ITEM_NAME(_type, _cname, _iname),  
\
+index);
\
+}  
\
+stat
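
The shift/mask logic of __mlxsw_item_get8()/__mlxsw_item_set8() above can be exercised in isolation. Below is a minimal userspace sketch; the offset, shift, and width values are assumptions for illustration only (in the driver they come from struct mlxsw_item), and GENMASK is reduced to an 8-bit local macro:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field parameters -- assumptions, not any real register field. */
#define FIELD_OFFSET 0
#define FIELD_SHIFT  3
#define FIELD_BITS   4

/* 8-bit reduction of the kernel's GENMASK(h, l). */
#define GENMASK8(h, l) ((uint8_t)((0xffu << (l)) & (0xffu >> (7 - (h)))))

static uint8_t item_get8(const uint8_t *buf)
{
	uint8_t tmp = buf[FIELD_OFFSET];

	tmp >>= FIELD_SHIFT;                    /* move field to bit 0 */
	tmp &= GENMASK8(FIELD_BITS - 1, 0);     /* strip neighbouring bits */
	return tmp;
}

static void item_set8(uint8_t *buf, uint8_t val)
{
	uint8_t mask = GENMASK8(FIELD_BITS - 1, 0) << FIELD_SHIFT;
	uint8_t tmp;

	val <<= FIELD_SHIFT;
	val &= mask;
	tmp = buf[FIELD_OFFSET];
	tmp &= ~mask;                           /* preserve other fields */
	tmp |= val;
	buf[FIELD_OFFSET] = tmp;
}
```

A set followed by a get round-trips the value while leaving the bits outside the field untouched.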

[patch net-next v2 05/19] mlxsw: reg: Add Policy-Engine TCAM Allocation Register

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

The PTAR register is used for allocation of regions in the TCAM.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 106 ++
 1 file changed, 106 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h 
b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 5f76fbc..444f0a3 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -1852,6 +1852,111 @@ static inline void mlxsw_reg_pagt_acl_id_pack(char 
*payload, int index,
mlxsw_reg_pagt_acl_id_set(payload, index, acl_id);
 }
 
+/* PTAR - Policy-Engine TCAM Allocation Register
+ * -
+ * This register is used for allocation of regions in the TCAM.
+ * Note: Query method is not supported on this register.
+ */
+#define MLXSW_REG_PTAR_ID 0x3006
+#define MLXSW_REG_PTAR_BASE_LEN 0x20
+#define MLXSW_REG_PTAR_KEY_ID_LEN 1
+#define MLXSW_REG_PTAR_KEY_ID_MAX_NUM 16
+#define MLXSW_REG_PTAR_LEN (MLXSW_REG_PTAR_BASE_LEN + \
+   MLXSW_REG_PTAR_KEY_ID_MAX_NUM * MLXSW_REG_PTAR_KEY_ID_LEN)
+
+MLXSW_REG_DEFINE(ptar, MLXSW_REG_PTAR_ID, MLXSW_REG_PTAR_LEN);
+
+enum mlxsw_reg_ptar_op {
+   /* allocate a TCAM region */
+   MLXSW_REG_PTAR_OP_ALLOC,
+   /* resize a TCAM region */
+   MLXSW_REG_PTAR_OP_RESIZE,
+   /* deallocate TCAM region */
+   MLXSW_REG_PTAR_OP_FREE,
+   /* test allocation */
+   MLXSW_REG_PTAR_OP_TEST,
+};
+
+/* reg_ptar_op
+ * Access: OP
+ */
+MLXSW_ITEM32(reg, ptar, op, 0x00, 28, 4);
+
+/* reg_ptar_action_set_type
+ * Type of action set to be used on this region.
+ * For Spectrum, this is always type 2 - "flexible"
+ * Access: WO
+ */
+MLXSW_ITEM32(reg, ptar, action_set_type, 0x00, 16, 8);
+
+/* reg_ptar_key_type
+ * TCAM key type for the region.
+ * For Spectrum, this is always type 0x50 - "FLEX_KEY"
+ * Access: WO
+ */
+MLXSW_ITEM32(reg, ptar, key_type, 0x00, 0, 8);
+
+/* reg_ptar_region_size
+ * TCAM region size. When allocating/resizing this is the requested size,
+ * the response is the actual size. Note that actual size may be
+ * larger than requested.
+ * Allowed range 1 .. cap_max_rules-1
+ * Reserved during op deallocate.
+ * Access: WO
+ */
+MLXSW_ITEM32(reg, ptar, region_size, 0x04, 0, 16);
+
+/* reg_ptar_region_id
+ * Region identifier
+ * Range 0 .. cap_max_regions-1
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, ptar, region_id, 0x08, 0, 16);
+
+/* reg_ptar_tcam_region_info
+ * Opaque object that represents the TCAM region.
+ * Returned when allocating a region.
+ * Provided by software for ACL generation and region deallocation and resize.
+ * Access: RW
+ */
+MLXSW_ITEM_BUF(reg, ptar, tcam_region_info, 0x10,
+  MLXSW_REG_PXXX_TCAM_REGION_INFO_LEN);
+
+/* reg_ptar_flexible_key_id
+ * Identifier of the Flexible Key.
+ * Only valid if key_type == "FLEX_KEY"
+ * The key size will be rounded up to one of the following values:
+ * 9B, 18B, 36B, 54B.
+ * This field is reserved during a resize operation.
+ * Access: WO
+ */
+MLXSW_ITEM8_INDEXED(reg, ptar, flexible_key_id, 0x20, 0, 8,
+   MLXSW_REG_PTAR_KEY_ID_LEN, 0x00, false);
+
+static inline void mlxsw_reg_ptar_pack(char *payload, enum mlxsw_reg_ptar_op 
op,
+  u16 region_size, u16 region_id,
+  const char *tcam_region_info)
+{
+   MLXSW_REG_ZERO(ptar, payload);
+   mlxsw_reg_ptar_op_set(payload, op);
+   mlxsw_reg_ptar_action_set_type_set(payload, 2); /* "flexible" */
+   mlxsw_reg_ptar_key_type_set(payload, 0x50); /* "FLEX_KEY" */
+   mlxsw_reg_ptar_region_size_set(payload, region_size);
+   mlxsw_reg_ptar_region_id_set(payload, region_id);
+   mlxsw_reg_ptar_tcam_region_info_memcpy_to(payload, tcam_region_info);
+}
+
+static inline void mlxsw_reg_ptar_key_id_pack(char *payload, int index,
+ u16 key_id)
+{
+   mlxsw_reg_ptar_flexible_key_id_set(payload, index, key_id);
+}
+
+static inline void mlxsw_reg_ptar_unpack(char *payload, char *tcam_region_info)
+{
+   mlxsw_reg_ptar_tcam_region_info_memcpy_from(payload, tcam_region_info);
+}
+
 /* QPCR - QoS Policer Configuration Register
  * -
  * The QPCR register is used to create policers - that limit
@@ -5531,6 +5636,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
MLXSW_REG(spvmlr),
MLXSW_REG(pacl),
MLXSW_REG(pagt),
+   MLXSW_REG(ptar),
MLXSW_REG(qpcr),
MLXSW_REG(qtct),
MLXSW_REG(qeec),
-- 
2.7.4
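
The key-size rounding documented for reg_ptar_flexible_key_id above (rounded up to 9B, 18B, 36B, or 54B) can be sketched as a small helper. The function name and return convention here are our own invention, not part of the driver:

```c
#include <assert.h>

/* Round a requested key size up to the nearest supported size from the
 * list in the reg_ptar_flexible_key_id comment; -1 if it does not fit.
 * Illustrative helper only -- an assumption, not driver code. */
static int ptar_round_key_size(int size)
{
	static const int sizes[] = { 9, 18, 36, 54 };
	unsigned int i;

	for (i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
		if (size <= sizes[i])
			return sizes[i];
	return -1; /* larger than the biggest supported key */
}
```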



[patch net-next v2 02/19] mlxsw: item: Add helpers for getting pointer into payload for char buffer item

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

Sometimes it is handy to get a pointer to a char buffer item and use it
directly to read/write data, so add these helpers.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/item.h | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/item.h 
b/drivers/net/ethernet/mellanox/mlxsw/item.h
index 09f35de..28427f0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/item.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/item.h
@@ -225,6 +225,14 @@ static inline void __mlxsw_item_memcpy_to(char *buf, const 
char *src,
memcpy(&buf[offset], src, item->size.bytes);
 }
 
+static inline char *__mlxsw_item_data(char *buf, const struct mlxsw_item *item,
+ unsigned short index)
+{
+   unsigned int offset = __mlxsw_item_offset(item, index, sizeof(char));
+
+   return &buf[offset];
+}
+
 static inline u16
 __mlxsw_item_bit_array_offset(const struct mlxsw_item *item,
  u16 index, u8 *shift)
@@ -468,6 +476,11 @@ mlxsw_##_type##_##_cname##_##_iname##_memcpy_to(char *buf, 
const char *src)\
 {  
\
__mlxsw_item_memcpy_to(buf, src,
\
   &__ITEM_NAME(_type, _cname, _iname), 0); 
\
+}  
\
+static inline char *   
\
+mlxsw_##_type##_##_cname##_##_iname##_data(char *buf)  
\
+{  
\
+   return __mlxsw_item_data(buf, &__ITEM_NAME(_type, _cname, _iname), 0);  
\
 }
 
 #define MLXSW_ITEM_BUF_INDEXED(_type, _cname, _iname, _offset, _sizebytes, 
\
@@ -494,6 +507,12 @@ mlxsw_##_type##_##_cname##_##_iname##_memcpy_to(char *buf, 
\
 {  
\
__mlxsw_item_memcpy_to(buf, src,
\
   &__ITEM_NAME(_type, _cname, _iname), index); 
\
+}  
\
+static inline char *   
\
+mlxsw_##_type##_##_cname##_##_iname##_data(char *buf, unsigned short index)
\
+{  
\
+   return __mlxsw_item_data(buf,   
\
+&__ITEM_NAME(_type, _cname, _iname), index);   
\
 }
 
 #define MLXSW_ITEM_BIT_ARRAY(_type, _cname, _iname, _offset, _sizebytes,   
\
-- 
2.7.4
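
The pointer-returning helper added above boils down to handing back &buf[offset] so callers can memcpy into the payload directly. A minimal userspace sketch, with the field offset and size being assumptions for illustration:

```c
#include <assert.h>
#include <string.h>

/* Assumed layout: a 4-byte field at offset 2 of an 8-byte payload. */
#define ITEM_OFFSET 2
#define ITEM_SIZE   4

/* Mirrors the idea of __mlxsw_item_data(): return a pointer into the
 * payload instead of copying through an intermediate buffer. */
static char *item_data(char *buf)
{
	return &buf[ITEM_OFFSET];
}
```

Callers can then read or write the field in place without a temporary copy.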



[patch net-next v2 07/19] mlxsw: reg: Add Policy-Engine Port Binding Table

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

The PPBT is used for configuration of the Port Binding Table.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/reg.h | 63 +++
 1 file changed, 63 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/reg.h 
b/drivers/net/ethernet/mellanox/mlxsw/reg.h
index 1008251..ce6d85a 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/reg.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/reg.h
@@ -1757,6 +1757,68 @@ static inline void mlxsw_reg_spvmlr_pack(char *payload, 
u8 local_port,
}
 }
 
+/* PPBT - Policy-Engine Port Binding Table
+ * ---
+ * This register is used for configuration of the Port Binding Table.
+ */
+#define MLXSW_REG_PPBT_ID 0x3002
+#define MLXSW_REG_PPBT_LEN 0x14
+
+MLXSW_REG_DEFINE(ppbt, MLXSW_REG_PPBT_ID, MLXSW_REG_PPBT_LEN);
+
+enum mlxsw_reg_pxbt_e {
+   MLXSW_REG_PXBT_E_IACL,
+   MLXSW_REG_PXBT_E_EACL,
+};
+
+/* reg_ppbt_e
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, ppbt, e, 0x00, 31, 1);
+
+enum mlxsw_reg_pxbt_op {
+   MLXSW_REG_PXBT_OP_BIND,
+   MLXSW_REG_PXBT_OP_UNBIND,
+};
+
+/* reg_ppbt_op
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, ppbt, op, 0x00, 28, 3);
+
+/* reg_ppbt_local_port
+ * Local port. Not including CPU port.
+ * Access: Index
+ */
+MLXSW_ITEM32(reg, ppbt, local_port, 0x00, 16, 8);
+
+/* reg_ppbt_g
+ * group - When set, the binding is of an ACL group. When cleared,
+ * the binding is of an ACL.
+ * Must be set to 1 for Spectrum.
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, ppbt, g, 0x10, 31, 1);
+
+/* reg_ppbt_acl_info
+ * ACL/ACL group identifier. If the g bit is set, this field should hold
+ * the acl_group_id, else it should hold the acl_id.
+ * Access: RW
+ */
+MLXSW_ITEM32(reg, ppbt, acl_info, 0x10, 0, 16);
+
+static inline void mlxsw_reg_ppbt_pack(char *payload, enum mlxsw_reg_pxbt_e e,
+  enum mlxsw_reg_pxbt_op op,
+  u8 local_port, u16 acl_info)
+{
+   MLXSW_REG_ZERO(ppbt, payload);
+   mlxsw_reg_ppbt_e_set(payload, e);
+   mlxsw_reg_ppbt_op_set(payload, op);
+   mlxsw_reg_ppbt_local_port_set(payload, local_port);
+   mlxsw_reg_ppbt_g_set(payload, true);
+   mlxsw_reg_ppbt_acl_info_set(payload, acl_info);
+}
+
 /* PACL - Policy-Engine ACL Register
  * -
  * This register is used for configuration of the ACL.
@@ -5733,6 +5795,7 @@ static const struct mlxsw_reg_info *mlxsw_reg_infos[] = {
MLXSW_REG(svpe),
MLXSW_REG(sfmr),
MLXSW_REG(spvmlr),
+   MLXSW_REG(ppbt),
MLXSW_REG(pacl),
MLXSW_REG(pagt),
MLXSW_REG(ptar),
-- 
2.7.4



[patch net-next v2 18/19] sched: cls_flower: expose priority to offloading netdevice

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

The driver that offloads flower rules needs to know the priority with
which the user inserted the rules, so add this information to the offload
struct.

Signed-off-by: Jiri Pirko 
Acked-by: Ido Schimmel 
---
 include/net/pkt_cls.h  | 1 +
 net/sched/cls_flower.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index b43077e..dabb00a 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -481,6 +481,7 @@ enum tc_fl_command {
 
 struct tc_cls_flower_offload {
enum tc_fl_command command;
+   u32 prio;
unsigned long cookie;
struct flow_dissector *dissector;
struct fl_flow_key *mask;
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index 9e74b0f..e96ced5 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -229,6 +229,7 @@ static void fl_hw_destroy_filter(struct tcf_proto *tp, 
struct cls_fl_filter *f)
return;
 
offload.command = TC_CLSFLOWER_DESTROY;
+   offload.prio = tp->prio;
offload.cookie = (unsigned long)f;
 
tc->type = TC_SETUP_CLSFLOWER;
@@ -260,6 +261,7 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
}
 
offload.command = TC_CLSFLOWER_REPLACE;
+   offload.prio = tp->prio;
offload.cookie = (unsigned long)f;
offload.dissector = dissector;
offload.mask = mask;
@@ -287,6 +289,7 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct 
cls_fl_filter *f)
return;
 
offload.command = TC_CLSFLOWER_STATS;
+   offload.prio = tp->prio;
offload.cookie = (unsigned long)f;
offload.exts = &f->exts;
 
-- 
2.7.4



[patch net-next v2 19/19] mlxsw: spectrum: Implement TC flower offload

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

Extend the existing setup_tc ndo call to allow offloading of cls_flower
rules. Only a limited set of dissector keys and actions is supported for
now. Use the previously introduced ACL infrastructure to offload cls_flower
rules for processing in the HW.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/Makefile   |   4 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c |  15 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum.h |   6 +
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 309 +
 4 files changed, 331 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c

diff --git a/drivers/net/ethernet/mellanox/mlxsw/Makefile 
b/drivers/net/ethernet/mellanox/mlxsw/Makefile
index 1459716..6b6c30d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/Makefile
+++ b/drivers/net/ethernet/mellanox/mlxsw/Makefile
@@ -14,8 +14,8 @@ mlxsw_switchx2-objs   := switchx2.o
 obj-$(CONFIG_MLXSW_SPECTRUM)   += mlxsw_spectrum.o
 mlxsw_spectrum-objs:= spectrum.o spectrum_buffers.o \
   spectrum_switchdev.o spectrum_router.o \
-  spectrum_kvdl.o spectrum_acl.o \
-  spectrum_acl_tcam.o
+  spectrum_kvdl.o spectrum_acl_tcam.o \
+  spectrum_acl.o spectrum_flower.o
 mlxsw_spectrum-$(CONFIG_MLXSW_SPECTRUM_DCB)+= spectrum_dcb.o
 obj-$(CONFIG_MLXSW_MINIMAL)+= mlxsw_minimal.o
 mlxsw_minimal-objs := minimal.o
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index b1d77e1..8a52c86 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -1355,7 +1355,8 @@ static int mlxsw_sp_setup_tc(struct net_device *dev, u32 
handle,
struct mlxsw_sp_port *mlxsw_sp_port = netdev_priv(dev);
bool ingress = TC_H_MAJ(handle) == TC_H_MAJ(TC_H_INGRESS);
 
-   if (tc->type == TC_SETUP_MATCHALL) {
+   switch (tc->type) {
+   case TC_SETUP_MATCHALL:
switch (tc->cls_mall->command) {
case TC_CLSMATCHALL_REPLACE:
return mlxsw_sp_port_add_cls_matchall(mlxsw_sp_port,
@@ -1369,6 +1370,18 @@ static int mlxsw_sp_setup_tc(struct net_device *dev, u32 
handle,
default:
return -EINVAL;
}
+   case TC_SETUP_CLSFLOWER:
+   switch (tc->cls_flower->command) {
+   case TC_CLSFLOWER_REPLACE:
+   return mlxsw_sp_flower_replace(mlxsw_sp_port, ingress,
+  proto, tc->cls_flower);
+   case TC_CLSFLOWER_DESTROY:
+   mlxsw_sp_flower_destroy(mlxsw_sp_port, ingress,
+   tc->cls_flower);
+   return 0;
+   default:
+   return -EOPNOTSUPP;
+   }
}
 
return -EOPNOTSUPP;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
index cd9b4b2..4d251e0 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.h
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "port.h"
 #include "core.h"
@@ -698,4 +699,9 @@ void mlxsw_sp_acl_fini(struct mlxsw_sp *mlxsw_sp);
 
 extern const struct mlxsw_sp_acl_ops mlxsw_sp_acl_tcam_ops;
 
+int mlxsw_sp_flower_replace(struct mlxsw_sp_port *mlxsw_sp_port, bool ingress,
+   __be16 protocol, struct tc_cls_flower_offload *f);
+void mlxsw_sp_flower_destroy(struct mlxsw_sp_port *mlxsw_sp_port, bool ingress,
+struct tc_cls_flower_offload *f);
+
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
new file mode 100644
index 000..35b147a
--- /dev/null
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -0,0 +1,309 @@
+/*
+ * drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+ * Copyright (c) 2017 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2017 Jiri Pirko 
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in the
+ *documentation and/or other materials provided with the distribution.
+ * 3. Neither the names of the copyright holders nor the names of its
+ *contri

[patch net-next v2 14/19] mlxsw: resources: Add ACL related resources

2017-02-03 Thread Jiri Pirko
From: Jiri Pirko 

Add a couple of resource limits related to ACL.

Signed-off-by: Jiri Pirko 
Reviewed-by: Ido Schimmel 
---
 drivers/net/ethernet/mellanox/mlxsw/resources.h | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/resources.h 
b/drivers/net/ethernet/mellanox/mlxsw/resources.h
index 3c2171d..bce8c2e 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/resources.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/resources.h
@@ -1,7 +1,7 @@
 /*
  * drivers/net/ethernet/mellanox/mlxsw/resources.h
- * Copyright (c) 2016 Mellanox Technologies. All rights reserved.
- * Copyright (c) 2016 Jiri Pirko 
+ * Copyright (c) 2016-2017 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2016-2017 Jiri Pirko 
  *
  * Redistribution and use in source and binary forms, with or without
  * modification, are permitted provided that the following conditions are met:
@@ -48,6 +48,14 @@ enum mlxsw_res_id {
MLXSW_RES_ID_MAX_LAG,
MLXSW_RES_ID_MAX_LAG_MEMBERS,
MLXSW_RES_ID_MAX_BUFFER_SIZE,
+   MLXSW_RES_ID_ACL_MAX_TCAM_REGIONS,
+   MLXSW_RES_ID_ACL_MAX_TCAM_RULES,
+   MLXSW_RES_ID_ACL_MAX_REGIONS,
+   MLXSW_RES_ID_ACL_MAX_GROUPS,
+   MLXSW_RES_ID_ACL_MAX_GROUP_SIZE,
+   MLXSW_RES_ID_ACL_FLEX_KEYS,
+   MLXSW_RES_ID_ACL_MAX_ACTION_PER_RULE,
+   MLXSW_RES_ID_ACL_ACTIONS_PER_SET,
MLXSW_RES_ID_MAX_CPU_POLICERS,
MLXSW_RES_ID_MAX_VRS,
MLXSW_RES_ID_MAX_RIFS,
@@ -72,6 +80,14 @@ static u16 mlxsw_res_ids[] = {
[MLXSW_RES_ID_MAX_LAG] = 0x2520,
[MLXSW_RES_ID_MAX_LAG_MEMBERS] = 0x2521,
[MLXSW_RES_ID_MAX_BUFFER_SIZE] = 0x2802,/* Bytes */
+   [MLXSW_RES_ID_ACL_MAX_TCAM_REGIONS] = 0x2901,
+   [MLXSW_RES_ID_ACL_MAX_TCAM_RULES] = 0x2902,
+   [MLXSW_RES_ID_ACL_MAX_REGIONS] = 0x2903,
+   [MLXSW_RES_ID_ACL_MAX_GROUPS] = 0x2904,
+   [MLXSW_RES_ID_ACL_MAX_GROUP_SIZE] = 0x2905,
+   [MLXSW_RES_ID_ACL_FLEX_KEYS] = 0x2910,
+   [MLXSW_RES_ID_ACL_MAX_ACTION_PER_RULE] = 0x2911,
+   [MLXSW_RES_ID_ACL_ACTIONS_PER_SET] = 0x2912,
[MLXSW_RES_ID_MAX_CPU_POLICERS] = 0x2A13,
[MLXSW_RES_ID_MAX_VRS] = 0x2C01,
[MLXSW_RES_ID_MAX_RIFS] = 0x2C02,
-- 
2.7.4



[PATCH net-next 0/2] enable relax order mode in intel NIC's driver

2017-02-03 Thread Mao Wenan
These two patches provide a way to enable relaxed ordering mode in Intel
NIC drivers. CONFIG_ARCH_WANT_RELAX_ORDER is a common macro for a CPU
architecture that uses relaxed ordering; it can be defined in
arch/xxx/Kconfig, for example arch/sparc/Kconfig. For some architectures,
such as arm64, it cannot be defined in arch/arm64/Kconfig, because not
all arm64 CPUs must use relaxed ordering mode, so this patch provides a
way to select CONFIG_ARCH_WANT_RELAX_ORDER from the Intel NIC Kconfig.

Mao Wenan (2):
  ixgbevf and 82598 relax order mode support
  add one config to select relax order mode in intel NIC's Kconfig

 drivers/net/ethernet/intel/Kconfig| 15 +++
 drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c|  4 ++--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 +-
 3 files changed, 18 insertions(+), 3 deletions(-)

-- 
2.7.0




[PATCH net-next 2/2] add one config to select relax order mode in intel NIC's Kconfig

2017-02-03 Thread Mao Wenan
This patch allows one to enable relaxed ordering mode from the Intel NIC
Kconfig. CONFIG_ARCH_WANT_RELAX_ORDER is a common macro for a CPU
architecture to use relaxed ordering in NIC driver code. It can be
defined in arch/xxx/Kconfig, as sparc does in arch/sparc/Kconfig, but not
all arm64 systems can use relaxed ordering, so it cannot be defined in
arch/arm64/Kconfig. The PCI_RELAX_ORDER option in the NIC Kconfig
therefore provides a way to define CONFIG_ARCH_WANT_RELAX_ORDER.

Signed-off-by: Mao Wenan 
---
 drivers/net/ethernet/intel/Kconfig | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/net/ethernet/intel/Kconfig 
b/drivers/net/ethernet/intel/Kconfig
index 1349b45..b366722 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -275,4 +275,19 @@ config FM10K
  To compile this driver as a module, choose M here. The module
  will be called fm10k.  MSI-X interrupt support is required
 
+config PCI_RELAX_ORDER
+bool "PCI relax order mode support"
+default n
+select ARCH_WANT_RELAX_ORDER
+---help---
+  This allows one to enable relaxed ordering mode in the driver.
+  CONFIG_ARCH_WANT_RELAX_ORDER is a common macro for some
+  CPU architectures to use relaxed ordering in NIC driver code.
+  CONFIG_ARCH_WANT_RELAX_ORDER can be defined in arch/xxx/Kconfig,
+  as sparc does in arch/sparc/Kconfig, but not all arm64 systems
+  can use relaxed ordering, so it cannot be defined in
+  arch/arm64/Kconfig. Therefore PCI_RELAX_ORDER provides one way
+  to define the macro CONFIG_ARCH_WANT_RELAX_ORDER. Say Y here
+  if you want to enable relaxed ordering.
+
 endif # NET_VENDOR_INTEL
-- 
2.7.0




[PATCH net-next 1/2] ixgbevf and 82598 relax order mode support

2017-02-03 Thread Mao Wenan
This patch allows one to use the common macro
CONFIG_ARCH_WANT_RELAX_ORDER to enable or disable
relaxed ordering mode.

Signed-off-by: Mao Wenan 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c| 4 ++--
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c
index 523f9d0..167d1fc 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_82598.c
@@ -175,7 +175,7 @@ static s32 ixgbe_init_phy_ops_82598(struct ixgbe_hw *hw)
  **/
 static s32 ixgbe_start_hw_82598(struct ixgbe_hw *hw)
 {
-#ifndef CONFIG_SPARC
+#ifndef CONFIG_ARCH_WANT_RELAX_ORDER
u32 regval;
u32 i;
 #endif
@@ -183,7 +183,7 @@ static s32 ixgbe_start_hw_82598(struct ixgbe_hw *hw)
 
ret_val = ixgbe_start_hw_generic(hw);
 
-#ifndef CONFIG_SPARC
+#ifndef CONFIG_ARCH_WANT_RELAX_ORDER
/* Disable relaxed ordering */
for (i = 0; ((i < hw->mac.max_tx_queues) &&
 (i < IXGBE_DCA_MAX_QUEUES_82598)); i++) {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index b068635..0efbf0b 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -1765,7 +1765,7 @@ static void ixgbevf_configure_rx_ring(struct 
ixgbevf_adapter *adapter,
IXGBE_WRITE_REG(hw, IXGBE_VFRDLEN(reg_idx),
ring->count * sizeof(union ixgbe_adv_rx_desc));
 
-#ifndef CONFIG_SPARC
+#ifndef CONFIG_ARCH_WANT_RELAX_ORDER
/* enable relaxed ordering */
IXGBE_WRITE_REG(hw, IXGBE_VFDCA_RXCTRL(reg_idx),
IXGBE_DCA_RXCTRL_DESC_RRO_EN);
-- 
2.7.0




[PATCH 1/1] gtp: support SGSN-side tunnels

2017-02-03 Thread Jonas Bonn
The GTP-tunnel driver is explicitly GGSN-side, as it searches for PDP
contexts based on the incoming packet's _destination_ address.  If we
want to write an SGSN, then we want to be identifying PDP contexts
based on the _source_ address.

This patch adds a "flags" argument at GTP-link creation time to specify
whether we are on the GGSN or SGSN side of the tunnel; this flag is then
used to determine which part of the IP packet to use in determining
the PDP context.

Signed-off-by: Jonas Bonn 
---

 drivers/net/gtp.c| 43 ---
 include/uapi/linux/gtp.h |  2 +-
 include/uapi/linux/if_link.h |  5 +
 3 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c
index 50349a9..1bbac69 100644
--- a/drivers/net/gtp.c
+++ b/drivers/net/gtp.c
@@ -72,6 +72,7 @@ struct gtp_dev {
struct net  *net;
struct net_device   *dev;
 
+   unsigned intflags;
unsigned inthash_size;
struct hlist_head   *tid_hash;
struct hlist_head   *addr_hash;
@@ -150,8 +151,8 @@ static struct pdp_ctx *ipv4_pdp_find(struct gtp_dev *gtp, 
__be32 ms_addr)
return NULL;
 }
 
-static bool gtp_check_src_ms_ipv4(struct sk_buff *skb, struct pdp_ctx *pctx,
- unsigned int hdrlen)
+static bool gtp_check_ms_ipv4(struct sk_buff *skb, struct pdp_ctx *pctx,
+ unsigned int hdrlen, unsigned int flags)
 {
struct iphdr *iph;
 
@@ -160,18 +161,22 @@ static bool gtp_check_src_ms_ipv4(struct sk_buff *skb, 
struct pdp_ctx *pctx,
 
iph = (struct iphdr *)(skb->data + hdrlen);
 
-   return iph->saddr == pctx->ms_addr_ip4.s_addr;
+   if (flags & GTP_FLAGS_SGSN) {
+   return iph->daddr == pctx->ms_addr_ip4.s_addr;
+   } else {
+   return iph->saddr == pctx->ms_addr_ip4.s_addr;
+   }
 }
 
-/* Check if the inner IP source address in this packet is assigned to any
+/* Check if the inner IP address in this packet is assigned to any
  * existing mobile subscriber.
  */
-static bool gtp_check_src_ms(struct sk_buff *skb, struct pdp_ctx *pctx,
-unsigned int hdrlen)
+static bool gtp_check_ms(struct sk_buff *skb, struct pdp_ctx *pctx,
+unsigned int hdrlen, unsigned int flags)
 {
switch (ntohs(skb->protocol)) {
case ETH_P_IP:
-   return gtp_check_src_ms_ipv4(skb, pctx, hdrlen);
+   return gtp_check_ms_ipv4(skb, pctx, hdrlen, flags);
}
return false;
 }
@@ -205,7 +210,7 @@ static int gtp0_udp_encap_recv(struct gtp_dev *gtp, struct 
sk_buff *skb,
goto out_rcu;
}
 
-   if (!gtp_check_src_ms(skb, pctx, hdrlen)) {
+   if (!gtp_check_ms(skb, pctx, hdrlen, gtp->flags)) {
netdev_dbg(gtp->dev, "No PDP ctx for this MS\n");
ret = -1;
goto out_rcu;
@@ -248,7 +253,7 @@ static int gtp1u_udp_encap_recv(struct gtp_dev *gtp, struct 
sk_buff *skb,
if (gtp1->flags & GTP1_F_MASK)
hdrlen += 4;
 
-   /* Make sure the header is larger enough, including extensions. */
+   /* Make sure the header is large enough, including extensions. */
if (!pskb_may_pull(skb, hdrlen))
return -1;
 
@@ -262,7 +267,7 @@ static int gtp1u_udp_encap_recv(struct gtp_dev *gtp, struct 
sk_buff *skb,
goto out_rcu;
}
 
-   if (!gtp_check_src_ms(skb, pctx, hdrlen)) {
+   if (!gtp_check_ms(skb, pctx, hdrlen, gtp->flags)) {
netdev_dbg(gtp->dev, "No PDP ctx for this MS\n");
ret = -1;
goto out_rcu;
@@ -491,7 +496,11 @@ static int gtp_build_skb_ip4(struct sk_buff *skb, struct 
net_device *dev,
 * Prepend PDP header with TEI/TID from PDP ctx.
 */
iph = ip_hdr(skb);
-   pctx = ipv4_pdp_find(gtp, iph->daddr);
+   if (gtp->flags & GTP_FLAGS_SGSN) {
+   pctx = ipv4_pdp_find(gtp, iph->saddr);
+   } else {
+   pctx = ipv4_pdp_find(gtp, iph->daddr);
+   }
if (!pctx) {
netdev_dbg(dev, "no PDP ctx found for %pI4, skip\n",
   &iph->daddr);
@@ -666,12 +675,23 @@ static int gtp_newlink(struct net *src_net, struct 
net_device *dev,
int hashsize, err, fd0, fd1;
struct gtp_dev *gtp;
struct gtp_net *gn;
+   unsigned int flags;
+
+   if (data[IFLA_GTP_FLAGS]) {
+   flags = nla_get_u32(data[IFLA_GTP_FLAGS]);
+   if (flags & ~GTP_FLAGS_MASK)
+   return -EINVAL;
+   } else {
+   flags = 0;
+   }
 
if (!data[IFLA_GTP_FD0] || !data[IFLA_GTP_FD1])
return -EINVAL;
 
gtp = netdev_priv(dev);
 
+   gtp->flags = flags;
+
fd0 = nla_get_u32(data[IFLA_GTP_FD0]);
fd1 = nla_get_u32(data[IFLA_
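
The flags validation pattern used in gtp_newlink() above — reject any attribute bit outside GTP_FLAGS_MASK with -EINVAL — is a common netlink idiom and can be sketched standalone. The flag values below mirror the patch's single SGSN flag but are assumptions for illustration:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Illustrative values mirroring the patch: one defined flag bit and a
 * mask covering all currently valid bits. */
#define GTP_FLAGS_SGSN 0x1u
#define GTP_FLAGS_MASK (GTP_FLAGS_SGSN)

static int gtp_validate_flags(uint32_t flags)
{
	if (flags & ~GTP_FLAGS_MASK)
		return -EINVAL; /* reject any unknown bit */
	return 0;
}
```

Rejecting unknown bits up front keeps them free for future extensions, since old kernels will refuse rather than silently ignore them.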

[PATCH net-next] sctp: process fwd tsn chunk only when prsctp is enabled

2017-02-03 Thread Xin Long
This patch is to check whether asoc->peer.prsctp_capable is set before
processing a fwd tsn chunk; if not, it will return an ERROR to the
peer, just as RFC 3758 section 3.3.1 demands.

Reported-by: Julian Cordes 
Signed-off-by: Xin Long 
---
 net/sctp/sm_statefuns.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index 782e579..d8798dd 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -3867,6 +3867,9 @@ sctp_disposition_t sctp_sf_eat_fwd_tsn(struct net *net,
return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
}
 
+   if (!asoc->peer.prsctp_capable)
+   return sctp_sf_unk_chunk(net, ep, asoc, type, arg, commands);
+
/* Make sure that the FORWARD_TSN chunk has valid length.  */
if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_fwdtsn_chunk)))
return sctp_sf_violation_chunklen(net, ep, asoc, type, arg,
@@ -3935,6 +3938,9 @@ sctp_disposition_t sctp_sf_eat_fwd_tsn_fast(
return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
}
 
+   if (!asoc->peer.prsctp_capable)
+   return sctp_sf_unk_chunk(net, ep, asoc, type, arg, commands);
+
/* Make sure that the FORWARD_TSN chunk has a valid length.  */
if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_fwdtsn_chunk)))
return sctp_sf_violation_chunklen(net, ep, asoc, type, arg,
-- 
2.1.0



Re: [PATCH] xen-netfront: Delete rx_refill_timer in xennet_disconnect_backend()

2017-02-03 Thread Juergen Gross
On 30/01/17 18:45, Boris Ostrovsky wrote:
> rx_refill_timer should be deleted as soon as we disconnect from the
> backend since otherwise it is possible for the timer to go off before
> we get to xennet_destroy_queues(). If this happens we may dereference
> queue->rx.sring which is set to NULL in xennet_disconnect_backend().
> 
> Signed-off-by: Boris Ostrovsky 
> CC: sta...@vger.kernel.org

Reviewed-by: Juergen Gross 


Juergen


[patch net-next] net/mlx4_en: fix a condition

2017-02-03 Thread Dan Carpenter
There is a "||" vs "|" typo here so we test 0x1 instead of 0x6.

Fixes: 1f8176f7352a ("net/mlx4_en: Check the enabling pptx/pprx flags in 
SET_PORT wrapper flow")
Signed-off-by: Dan Carpenter 

diff --git a/drivers/net/ethernet/mellanox/mlx4/port.c 
b/drivers/net/ethernet/mellanox/mlx4/port.c
index 5053c949148f..4e36e287d605 100644
--- a/drivers/net/ethernet/mellanox/mlx4/port.c
+++ b/drivers/net/ethernet/mellanox/mlx4/port.c
@@ -1395,7 +1395,7 @@ static int mlx4_common_set_port(struct mlx4_dev *dev, int 
slave, u32 in_mod,
  gen_context);
 
if (gen_context->flags &
-   (MLX4_FLAG_V_PPRX_MASK || MLX4_FLAG_V_PPTX_MASK))
+   (MLX4_FLAG_V_PPRX_MASK | MLX4_FLAG_V_PPTX_MASK))
mlx4_en_set_port_global_pause(dev, slave,
  gen_context);
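
The difference between the buggy and fixed forms is easy to demonstrate standalone: logical "||" collapses the two masks into the value 1, so the condition tested bit 0x1 instead of bits 0x2|0x4. The mask values below are illustrative, not the real MLX4_FLAG_V_PPRX_MASK/MLX4_FLAG_V_PPTX_MASK values:

```c
#include <assert.h>

/* Illustrative masks; only the operator matters here. */
#define MASK_A 0x2
#define MASK_B 0x4

static int test_flags_buggy(int flags)
{
	/* Buggy form: (MASK_A || MASK_B) evaluates to 1, so this tests 0x1. */
	return (flags & (MASK_A || MASK_B)) != 0;
}

static int test_flags_fixed(int flags)
{
	/* Fixed form: "|" builds the 0x6 mask the code intended. */
	return (flags & (MASK_A | MASK_B)) != 0;
}
```

The buggy predicate misses flags that set only MASK_A or MASK_B, and fires on an unrelated bit 0.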
 


Re: [PATCH net] net: phy: Fix lack of reference count on PHY driver

2017-02-03 Thread Russell King - ARM Linux
On Thu, Feb 02, 2017 at 09:54:07PM -0500, David Miller wrote:
> Hot plugging PHYs and notifications and all of that business is
> net-next material.

I was talking more about unbinding of the driver, which is something
that can be done today, eg:

$ ls -l /sys/bus/mdio_bus/drivers/Atheros\ 8035\ ethernet/
total 0
lrwxrwxrwx 1 root root0 Feb  3 09:49 2188000.ethernet:00 -> 
../../../../devices/soc0/soc/210.aips-bus/2188000.ethernet/mdio_bus/2188000.ethernet/2188000.ethernet:00
--w--- 1 root root 4096 Feb  3 09:49 bind
--w--- 1 root root 4096 Feb  3 09:49 uevent
--w--- 1 root root 4096 Feb  3 09:49 unbind
$ echo 2188000.ethernet:00 > /sys/bus/mdio_bus/drivers/Atheros\ 8035\ 
ethernet/unbind

is all it takes, and the same oops will happen.  Try it on a box
you don't care about crashing. :)

This is my point - locking the module into the kernel using
try_module_get() doesn't actually fix the problem where drivers are
concerned, it just has the illusion of being safe.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [PATCHv3 net-next 5/7] net: add confirm_neigh method to dst_ops

2017-02-03 Thread Steffen Klassert
On Fri, Feb 03, 2017 at 11:16:16AM +0200, Julian Anastasov wrote:
> 
>   Hello,
> 
> On Fri, 3 Feb 2017, Steffen Klassert wrote:
> > 
> > I thought about this (completely untested) one:
> > 
> > static void xfrm_confirm_neigh(const struct dst_entry *dst, const void
> > *daddr)
> > 
> > {
> > const struct dst_entry *dst = dst->child;
> 
>   When starting, the dst arg is the first xform, and the above
> assignment skips it. Maybe both lines should be swapped.

Yes, that's better :)

> 
> > const struct xfrm_state *xfrm = dst->xfrm;
> > 
> > if (xfrm)
> > daddr = &xfrm->id.daddr;
> > 
> > dst->ops->confirm_neigh(dst, daddr);
> > }
> > 
> > Only the last dst_entry in this call chain (path) sould
> > not have dst->xfrm set. So it finally calls path->ops->confirm_neigh
> > with the daddr of the last transformation. But your version
> > should do the same.
> 
>   The above can be fixed, but it is risky for stack
> usage when using recursion. In practice, there should not be
> many xforms, though. Also, is id.daddr valid for transports?

Yes, it is needed for the lookup. But id.daddr is the same
as the daddr of the packet in transport mode.

> 
> > >   This should work as long as path and last tunnel are
> > > from same family.
> > 
> > Yes, the outer mode of the last transformation has the same
> > family as path.
> > 
> > > Also, after checking xfrm_dst_lookup() I'm not
> > > sure using just &xfrm->id.daddr is enough. Should we consider
> > > more places for daddr value?
> > 
> > Yes, indeed. We should do it like xfrm_dst_lookup() does it.
> 
>   OK, I'll take the logic from there. Should I use a loop or
> recursion?

I don't have a strong opinion on that. Both should work,
choose whatever you prefer.


Re: [PATCH 0/8] Bug fixes

2017-02-03 Thread Herbert Xu
On Fri, Jan 27, 2017 at 04:09:04PM +0530, Harsh Jain wrote:
> This patch series is based on Herbert's cryptodev-2.6 tree and depends on 
> patch series "Bug Fixes for 4.10". It includes bug fixes.
> 
> Atul Gupta (2)
>   crypto:chcr-Change flow IDs
>   crypto:chcr- Fix wrong typecasting
> Harsh Jain (8):
>   crypto:chcr- Fix key length for RFC4106
>   crypto:chcr-fix itnull.cocci warnings
>   crypto:chcr- Use cipher instead of Block Cipher in gcm setkey
>   crypto:chcr: Change cra_flags for cipher algos
>   crypto:chcr- Change algo priority
>   crypto:chcr-Fix Smatch Complaint

All applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH net] bpf: expose netns inode to bpf programs

2017-02-03 Thread Eric W. Biederman
Alexei Starovoitov  writes:

> On Fri, Feb 03, 2017 at 05:33:45PM +1300, Eric W. Biederman wrote:
>> 
>> The point is that we can make the inode number stable across migration
>> and the user space API for namespaces has been designed with that
>> possibility in mind.
>> 
>> What you have proposed is the equivalent of reporting a file name, and
>> instead of reporting /dir1/file1 /dir2/file1 just reporting file1 for
>> both cases.
>> 
>> That is problematic.
>> 
>> It doesn't matter that eBPF and CRIU do not mix.  When we implement
>> migration of the namespace file descriptors and can move them from
>> one system to another preserving the device number and inode number
>> so that criu of other parts of userspace can function better there will
>> be a problem.  There is not one unique inode number per namespace and
>> the proposed interface in your eBPF programs is broken.
>> 
>> I don't know when inode numbers are going to be the bottleneck we decide
>> to make migratable to make CRIU work better but things have been
>> designed and maintained very carefully so that we can do that.
>> 
>> Inode numbers are in the namespace of the filesystem they reside in.
>
> I saw that iproute2 is doing:
>   if ((st.st_dev == netst.st_dev) &&
>   (st.st_ino == netst.st_ino)) {
> but proc_alloc_inum() is using global ida,
> so I figured that iproute2 extra st_dev check must have been obsolete.
> So the long term plan is to make /proc to be namespace-aware?

The long-term plan is to make /proc more namespace aware.  It is pretty
strongly namespace aware in several senses.  Of course when those things
are well executed you don't see them in userspace, so they are easy to
overlook.

> That's fair. In such a case, exposing only the inode will
> lead to wrong assumptions.

Exactly.

>> >> But you told Eric that his nack doesn't matter, and maybe it would be
>> >> nice to ask him to clarify instead.
>> >
>> > Fair enough. Eric, thoughts?
>> 
>> In very short terms exporting just the inode number would require
>> implementing a namespace of namespaces, and that is NOT happening.
>> We are not going to design our kernel interfaces so badly that we need
>> to do that.
>> 
>> At a bare minimum you need to export the device number of the filesystem
>> as well as the inode number.
>
> Agree. Will do.
>
>> My expectation would be that now you are starting to look at concepts
>> that are namespaced the way you would proceed would be to associate a
>> full set of namespaces with your ebpf program.  Those namespaces would
>> come from the submitter of your ebpf program.  Namespaced values
>> would be in the terms of your associated namespaces.
>> 
>> That keeps things working the way userspace would expect.
>> 
>> The easy way to build such an association is to disallow your
>> contextless ebpf programs from being submitted to the kernel in anything
>> other than the initial set of namespaces.
>> 
>> But please assume all global identifiers are namespaced.  If they aren't
>> that needs to be fixed because not having them namespaced will break
>> process migration at some point.
>> 
>> In short the fix here is to export both the inode number and the device
>> number.  That is what it takes to uniquely identify a file.  It would be
>
> Agree. Will respin.
>
>> good if you went farther and limited your contextless ebpf programs to
>> only being installed by programs in the initial set of namespaces.
>
> you mean to limit to init_net only? This might break existing users.

And for proc/pid_namespace things like the device and inum the initial
pid namespace.

I expect you can limit yourself to fields that are namespace specific
before adding such a requirement in which case that will avoid breaking
existing users.

>> Does that make things clearer?
>
> yep. thanks for the feedback.

Welcome.

Eric



Re: "TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised." message with "ethtool -K eth0 gro off"

2017-02-03 Thread Marcelo Ricardo Leitner
On Thu, Feb 02, 2017 at 05:59:24AM -0800, Eric Dumazet wrote:
> On Thu, 2017-02-02 at 05:31 -0800, Eric Dumazet wrote:
> 
> > Anyway, I suspect the test is simply buggy ;)
> > 
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 
> > 41dcbd568cbe2403f2a9e659669afe462a42e228..5394a39fcce964a7fe7075b1531a8a1e05550a54
> >  100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -164,7 +164,7 @@ static void tcp_measure_rcv_mss(struct sock *sk, const 
> > struct sk_buff *skb)
> > if (len >= icsk->icsk_ack.rcv_mss) {
> > icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> >tcp_sk(sk)->advmss);
> > -   if (unlikely(icsk->icsk_ack.rcv_mss != len))
> > +   if (unlikely(icsk->icsk_ack.rcv_mss != len && skb_is_gso(skb)))
> > tcp_gro_dev_warn(sk, skb);
> > } else {
> > /* Otherwise, we make more careful check taking into account,
> 
> This won't really help.
> 
> Our tcp_sk(sk)->advmss can be lower than the MSS used by the remote
> peer.
> 
> ip ro add  advmss 512

I don't follow. With a good driver, how can advmss be smaller than the
MSS used by the remote peer? Even with the route entry above, I get
segments just up to advmss, and no warning.

Though yeah, interesting that this driver doesn't even support GRO. FCS
perhaps?

Markus, do you have other interfaces in your system? Which MTU do you
use, and please try the (untested) patch below, to gather more debug
info:

---8<---

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bfa165cc455a..eddd5b6a28b1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -128,6 +128,7 @@ int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
 
 static void tcp_gro_dev_warn(struct sock *sk, const struct sk_buff *skb)
 {
+   struct inet_connection_sock *icsk = inet_csk(sk);
static bool __once __read_mostly;
 
if (!__once) {
@@ -137,8 +138,9 @@ static void tcp_gro_dev_warn(struct sock *sk, const struct 
sk_buff *skb)
 
rcu_read_lock();
dev = dev_get_by_index_rcu(sock_net(sk), skb->skb_iif);
-   pr_warn("%s: Driver has suspect GRO implementation, TCP 
performance may be compromised.\n",
-   dev ? dev->name : "Unknown driver");
+   pr_warn("%s: Driver has suspect GRO implementation, TCP 
performance may be compromised. rcv_mss:%u advmss:%u len:%u\n",
+   dev ? dev->name : "Unknown driver",
+   icsk->icsk_ack.rcv_mss, tcp_sk(sk)->advmss, skb->len);
rcu_read_unlock();
}
 }


Re: "TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised." message with "ethtool -K eth0 gro off"

2017-02-03 Thread Markus Trippelsdorf
On 2017.02.03 at 09:54 -0200, Marcelo Ricardo Leitner wrote:
> On Thu, Feb 02, 2017 at 05:59:24AM -0800, Eric Dumazet wrote:
> > On Thu, 2017-02-02 at 05:31 -0800, Eric Dumazet wrote:
> > 
> > > Anyway, I suspect the test is simply buggy ;)
> > > 
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 
> > > 41dcbd568cbe2403f2a9e659669afe462a42e228..5394a39fcce964a7fe7075b1531a8a1e05550a54
> > >  100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -164,7 +164,7 @@ static void tcp_measure_rcv_mss(struct sock *sk, 
> > > const struct sk_buff *skb)
> > >   if (len >= icsk->icsk_ack.rcv_mss) {
> > >   icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> > >  tcp_sk(sk)->advmss);
> > > - if (unlikely(icsk->icsk_ack.rcv_mss != len))
> > > + if (unlikely(icsk->icsk_ack.rcv_mss != len && skb_is_gso(skb)))
> > >   tcp_gro_dev_warn(sk, skb);
> > >   } else {
> > >   /* Otherwise, we make more careful check taking into account,
> > 
> > This won't really help.
> > 
> > Our tcp_sk(sk)->advmss can be lower than the MSS used by the remote
> > peer.
> > 
> > ip ro add  advmss 512
> 
> I don't follow. With a good driver, how can advmss be smaller than the
> MSS used by the remote peer? Even with the route entry above, I get
> segments just up to advmss, and no warning.
> 
> Though yeah, interesting that this driver doesn't even support GRO. FCS
> perhaps?
> 
> Markus, do you have other interfaces in your system? Which MTU do you
> use, and please try the (untested) patch below, to gather more debug
> info:

No, eth0 is the only interface. MTU = 1500.
Sure, I will try your patch. But I don't know how to reproduce the
issue, so you will have to wait until it triggers again.

-- 
Markus


[PATCH 00/27] Netfilter updates for net-next

2017-02-03 Thread Pablo Neira Ayuso
Hi David,

The following patchset contains Netfilter updates for your net-next
tree, they are:

1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from
   sk_buff so we only access one single cacheline in the conntrack
   hotpath. Patchset from Florian Westphal.

2) Don't leak pointer to internal structures when exporting x_tables
   ruleset back to userspace, from Willem DeBruijn. This includes new
   helper functions to copy data to userspace such as xt_data_to_user()
   as well as conversions of our ip_tables, ip6_tables and arp_tables
   clients to use it. Not surprisingly, ebtables requires an ad-hoc
   update. There is also a new field in x_tables extensions to indicate
   the amount of bytes that we copy to userspace.

3) Add nf_log_all_netns sysctl: This new knob allows you to enable
   logging via nf_log infrastructure for all existing netnamespaces.
   Given the effort to provide pernet syslog has been discontinued,
   let's provide a way to restore logging using netfilter kernel logging
   facilities in trusted environments. Patch from Michal Kubecek.

4) Validate SCTP checksum from conntrack helper, from Davide Caratti.

5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly
   a copy&paste from the original helper, from Florian Westphal.

6) Reset netfilter state when duplicating packets, also from Florian.

7) Remove unnecessary check for broadcast in IPv6 in pkttype match and
   nft_meta, from Liping Zhang.

8) Add missing code to deal with loopback packets from nft_meta when
   used by the netdev family, also from Liping.

9) Several cleanups on nf_tables, one to remove unnecessary check from
   the netlink control plane path to add table, set and stateful objects
   and code consolidation when unregister chain hooks, from Gao Feng.

10) Fix harmless reference counter underflow in IPVS that, however,
results in problems with the introduction of the new refcount_t
type, from David Windsor.

11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp,
from Davide Caratti.

12) Missing documentation on nf_tables uapi header, from Liping Zhang.

13) Use rb_entry() helper in xt_connlimit, from Geliang Tang.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git

Thanks!



The following changes since commit 0a0a8d6b0e88d947d7ab3198b325e31f677bebc2:

  net: fealnx: use new api ethtool_{get|set}_link_ksettings (2017-01-02 
16:59:10 -0500)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git HEAD

for you to fetch changes up to 2851940ffee313e0ff12540a8e11a8c54dea9c65:

  netfilter: allow logging from non-init namespaces (2017-02-02 14:31:58 +0100)


David Windsor (1):
  ipvs: free ip_vs_dest structs when refcnt=0

Davide Caratti (2):
  netfilter: select LIBCRC32C together with SCTP conntrack
  netfilter: conntrack: validate SCTP crc32c in PREROUTING

Feng (1):
  netfilter: nf_tables: Eliminate duplicated code in 
nf_tables_table_enable()

Florian Westphal (9):
  netfilter: merge udp and udplite conntrack helpers
  netfilter: nat: merge udp and udplite helpers
  netfilter: conntrack: no need to pass ctinfo to error handler
  netfilter: reset netfilter state when duplicating packet
  netfilter: reduce direct skb->nfct usage
  skbuff: add and use skb_nfct helper
  netfilter: add and use nf_ct_set helper
  netfilter: guarantee 8 byte minalign for template addresses
  netfilter: merge ctinfo into nfct pointer storage area

Gao Feng (1):
  netfilter: nf_tables: eliminate useless condition checks

Geliang Tang (1):
  netfilter: xt_connlimit: use rb_entry()

Liping Zhang (4):
  netfilter: nf_tables: add missing descriptions in nft_ct_keys
  netfilter: nft_ct: add average bytes per packet support
  netfilter: pkttype: unnecessary to check ipv6 multicast address
  netfilter: nft_meta: deal with PACKET_LOOPBACK in netdev family

Michal Kubeček (1):
  netfilter: allow logging from non-init namespaces

Willem de Bruijn (7):
  xtables: add xt_match, xt_target and data copy_to_user functions
  iptables: use match, target and data copy_to_user helpers
  ip6tables: use match, target and data copy_to_user helpers
  arptables: use match, target and data copy_to_user helpers
  ebtables: use match, target and data copy_to_user helpers
  xtables: use match, target and data copy_to_user helpers in compat
  xtables: extend matches and targets with .usersize

 Documentation/networking/netfilter-sysctl.txt  |  10 +
 include/linux/netfilter/x_tables.h |   9 +
 include/linux/skbuff.h |  32 +--
 include/net/ip_vs.h|  12 +-
 include/net/netfilter/ipv4/nf_conntrack_ipv4.h |   1 +
 include/net/ne

[PATCH 03/27] netfilter: nf_tables: add missing descriptions in nft_ct_keys

2017-02-03 Thread Pablo Neira Ayuso
From: Liping Zhang 

We missed to add descriptions about NFT_CT_LABELS, NFT_CT_PKTS and
NFT_CT_BYTES, now add it.

Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nf_tables.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/uapi/linux/netfilter/nf_tables.h 
b/include/uapi/linux/netfilter/nf_tables.h
index 881d49e94569..5726f90bfc2f 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -860,6 +860,9 @@ enum nft_rt_attributes {
  * @NFT_CT_PROTOCOL: conntrack layer 4 protocol
  * @NFT_CT_PROTO_SRC: conntrack layer 4 protocol source
  * @NFT_CT_PROTO_DST: conntrack layer 4 protocol destination
+ * @NFT_CT_LABELS: conntrack labels
+ * @NFT_CT_PKTS: conntrack packets
+ * @NFT_CT_BYTES: conntrack bytes
  */
 enum nft_ct_keys {
NFT_CT_STATE,
-- 
2.1.4



[PATCH 19/27] netfilter: conntrack: no need to pass ctinfo to error handler

2017-02-03 Thread Pablo Neira Ayuso
From: Florian Westphal 

It is never accessed for reading and the only places that write to it
are the icmp(6) handlers, which also set skb->nfct (and skb->nfctinfo).

The conntrack core specifically checks for attached skb->nfct after
->error() invocation and returns early in this case.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_conntrack_l4proto.h   |  2 +-
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   | 12 ++--
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c | 12 ++--
 net/netfilter/nf_conntrack_core.c  |  3 +--
 net/netfilter/nf_conntrack_proto_dccp.c|  1 -
 net/netfilter/nf_conntrack_proto_sctp.c|  2 +-
 net/netfilter/nf_conntrack_proto_tcp.c |  1 -
 net/netfilter/nf_conntrack_proto_udp.c |  3 +--
 8 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack_l4proto.h 
b/include/net/netfilter/nf_conntrack_l4proto.h
index e7b836590f0b..85e993e278d5 100644
--- a/include/net/netfilter/nf_conntrack_l4proto.h
+++ b/include/net/netfilter/nf_conntrack_l4proto.h
@@ -55,7 +55,7 @@ struct nf_conntrack_l4proto {
void (*destroy)(struct nf_conn *ct);
 
int (*error)(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
-unsigned int dataoff, enum ip_conntrack_info *ctinfo,
+unsigned int dataoff,
 u_int8_t pf, unsigned int hooknum);
 
/* Print out the per-protocol part of the tuple. Return like seq_* */
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c 
b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index d075b3cf2400..566afac98a88 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -128,13 +128,13 @@ static bool icmp_new(struct nf_conn *ct, const struct 
sk_buff *skb,
 /* Returns conntrack if it dealt with ICMP, and filled in skb fields */
 static int
 icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
-enum ip_conntrack_info *ctinfo,
 unsigned int hooknum)
 {
struct nf_conntrack_tuple innertuple, origtuple;
const struct nf_conntrack_l4proto *innerproto;
const struct nf_conntrack_tuple_hash *h;
const struct nf_conntrack_zone *zone;
+   enum ip_conntrack_info ctinfo;
struct nf_conntrack_zone tmp;
 
NF_CT_ASSERT(skb->nfct == NULL);
@@ -160,7 +160,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, 
struct sk_buff *skb,
return -NF_ACCEPT;
}
 
-   *ctinfo = IP_CT_RELATED;
+   ctinfo = IP_CT_RELATED;
 
h = nf_conntrack_find_get(net, zone, &innertuple);
if (!h) {
@@ -169,11 +169,11 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, 
struct sk_buff *skb,
}
 
if (NF_CT_DIRECTION(h) == IP_CT_DIR_REPLY)
-   *ctinfo += IP_CT_IS_REPLY;
+   ctinfo += IP_CT_IS_REPLY;
 
/* Update skb to refer to this connection */
skb->nfct = &nf_ct_tuplehash_to_ctrack(h)->ct_general;
-   skb->nfctinfo = *ctinfo;
+   skb->nfctinfo = ctinfo;
return NF_ACCEPT;
 }
 
@@ -181,7 +181,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, 
struct sk_buff *skb,
 static int
 icmp_error(struct net *net, struct nf_conn *tmpl,
   struct sk_buff *skb, unsigned int dataoff,
-  enum ip_conntrack_info *ctinfo, u_int8_t pf, unsigned int hooknum)
+  u8 pf, unsigned int hooknum)
 {
const struct icmphdr *icmph;
struct icmphdr _ih;
@@ -225,7 +225,7 @@ icmp_error(struct net *net, struct nf_conn *tmpl,
icmph->type != ICMP_REDIRECT)
return NF_ACCEPT;
 
-   return icmp_error_message(net, tmpl, skb, ctinfo, hooknum);
+   return icmp_error_message(net, tmpl, skb, hooknum);
 }
 
 #if IS_ENABLED(CONFIG_NF_CT_NETLINK)
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c 
b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index f5a61bc3ec2b..44b9af3f813e 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -145,12 +145,12 @@ static int
 icmpv6_error_message(struct net *net, struct nf_conn *tmpl,
 struct sk_buff *skb,
 unsigned int icmp6off,
-enum ip_conntrack_info *ctinfo,
 unsigned int hooknum)
 {
struct nf_conntrack_tuple intuple, origtuple;
const struct nf_conntrack_tuple_hash *h;
const struct nf_conntrack_l4proto *inproto;
+   enum ip_conntrack_info ctinfo;
struct nf_conntrack_zone tmp;
 
NF_CT_ASSERT(skb->nfct == NULL);
@@ -176,7 +176,7 @@ icmpv6_error_message(struct net *net, struct nf_conn *tmpl,
return -NF_ACCEPT;
}
 
-   *ctinfo = IP_CT_RELATED;
+   ctinfo = IP_CT_RELATED;
 
h = nf_conntrack_find_get(n

[PATCH 26/27] ipvs: free ip_vs_dest structs when refcnt=0

2017-02-03 Thread Pablo Neira Ayuso
From: David Windsor 

Currently, the ip_vs_dest cache frees ip_vs_dest objects when their
reference count becomes < 0.  Aside from not being semantically sound,
this is problematic for the new type refcount_t, which will be introduced
shortly in a separate patch. refcount_t is the new kernel type for
holding reference counts, and provides overflow protection and a
constrained interface relative to atomic_t (the type currently being
used for kernel reference counts).

Per Julian Anastasov: "The problem is that dest_trash currently holds
deleted dests (unlinked from RCU lists) with refcnt=0."  Changing
dest_trash to hold dest with refcnt=1 will allow us to free ip_vs_dest
structs when their refcnt=0, in ip_vs_dest_put_and_free().

Signed-off-by: David Windsor 
Signed-off-by: Julian Anastasov 
Signed-off-by: Simon Horman 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/ip_vs.h| 2 +-
 net/netfilter/ipvs/ip_vs_ctl.c | 8 +++-
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 4b46c591b542..7bdfa7d78363 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1421,7 +1421,7 @@ static inline void ip_vs_dest_put(struct ip_vs_dest *dest)
 
 static inline void ip_vs_dest_put_and_free(struct ip_vs_dest *dest)
 {
-   if (atomic_dec_return(&dest->refcnt) < 0)
+   if (atomic_dec_and_test(&dest->refcnt))
kfree(dest);
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 55e0169caa4c..5fc4836e7c79 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -711,7 +711,6 @@ ip_vs_trash_get_dest(struct ip_vs_service *svc, int dest_af,
  dest->vport == svc->port))) {
/* HIT */
list_del(&dest->t_list);
-   ip_vs_dest_hold(dest);
goto out;
}
}
@@ -741,7 +740,7 @@ static void ip_vs_dest_free(struct ip_vs_dest *dest)
  *  When the ip_vs_control_clearup is activated by ipvs module exit,
  *  the service tables must have been flushed and all the connections
  *  are expired, and the refcnt of each destination in the trash must
- *  be 0, so we simply release them here.
+ *  be 1, so we simply release them here.
  */
 static void ip_vs_trash_cleanup(struct netns_ipvs *ipvs)
 {
@@ -1080,11 +1079,10 @@ static void __ip_vs_del_dest(struct netns_ipvs *ipvs, 
struct ip_vs_dest *dest,
if (list_empty(&ipvs->dest_trash) && !cleanup)
mod_timer(&ipvs->dest_trash_timer,
  jiffies + (IP_VS_DEST_TRASH_PERIOD >> 1));
-   /* dest lives in trash without reference */
+   /* dest lives in trash with reference */
list_add(&dest->t_list, &ipvs->dest_trash);
dest->idle_start = 0;
spin_unlock_bh(&ipvs->dest_trash_lock);
-   ip_vs_dest_put(dest);
 }
 
 
@@ -1160,7 +1158,7 @@ static void ip_vs_dest_trash_expire(unsigned long data)
 
spin_lock(&ipvs->dest_trash_lock);
list_for_each_entry_safe(dest, next, &ipvs->dest_trash, t_list) {
-   if (atomic_read(&dest->refcnt) > 0)
+   if (atomic_read(&dest->refcnt) > 1)
continue;
if (dest->idle_start) {
if (time_before(now, dest->idle_start +
-- 
2.1.4



[PATCH 02/27] netfilter: nat: merge udp and udplite helpers

2017-02-03 Thread Pablo Neira Ayuso
From: Florian Westphal 

udplite nat was copied from udp nat; they are virtually 100% identical.
Not really surprising given udplite is just udp with partial csum coverage.

old:
   textdata bss dec hex filename
  116061457 210   1327333d9 nf_nat.ko
330   0   2 332 14c nf_nat_proto_udp.o
276   0   2 278 116 nf_nat_proto_udplite.o
new:
   textdata bss dec hex filename
  115981457 210   1326533d1 nf_nat.ko
640   0   4 644 284 nf_nat_proto_udp.o

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/Makefile   |  1 -
 net/netfilter/nf_nat_proto_udp.c | 78 ++--
 net/netfilter/nf_nat_proto_udplite.c | 73 -
 3 files changed, 66 insertions(+), 86 deletions(-)
 delete mode 100644 net/netfilter/nf_nat_proto_udplite.c

diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index bf5c577113b6..6b3034f12661 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -46,7 +46,6 @@ nf_nat-y  := nf_nat_core.o nf_nat_proto_unknown.o 
nf_nat_proto_common.o \
 # NAT protocols (nf_nat)
 nf_nat-$(CONFIG_NF_NAT_PROTO_DCCP) += nf_nat_proto_dccp.o
 nf_nat-$(CONFIG_NF_NAT_PROTO_SCTP) += nf_nat_proto_sctp.o
-nf_nat-$(CONFIG_NF_NAT_PROTO_UDPLITE) += nf_nat_proto_udplite.o
 
 # generic transport layer logging
 obj-$(CONFIG_NF_LOG_COMMON) += nf_log_common.o
diff --git a/net/netfilter/nf_nat_proto_udp.c b/net/netfilter/nf_nat_proto_udp.c
index b1e627227b6e..edd4a77dc09a 100644
--- a/net/netfilter/nf_nat_proto_udp.c
+++ b/net/netfilter/nf_nat_proto_udp.c
@@ -30,20 +30,15 @@ udp_unique_tuple(const struct nf_nat_l3proto *l3proto,
&udp_port_rover);
 }
 
-static bool
-udp_manip_pkt(struct sk_buff *skb,
- const struct nf_nat_l3proto *l3proto,
- unsigned int iphdroff, unsigned int hdroff,
- const struct nf_conntrack_tuple *tuple,
- enum nf_nat_manip_type maniptype)
+static void
+__udp_manip_pkt(struct sk_buff *skb,
+   const struct nf_nat_l3proto *l3proto,
+   unsigned int iphdroff, struct udphdr *hdr,
+   const struct nf_conntrack_tuple *tuple,
+   enum nf_nat_manip_type maniptype, bool do_csum)
 {
-   struct udphdr *hdr;
__be16 *portptr, newport;
 
-   if (!skb_make_writable(skb, hdroff + sizeof(*hdr)))
-   return false;
-   hdr = (struct udphdr *)(skb->data + hdroff);
-
if (maniptype == NF_NAT_MANIP_SRC) {
/* Get rid of src port */
newport = tuple->src.u.udp.port;
@@ -53,7 +48,7 @@ udp_manip_pkt(struct sk_buff *skb,
newport = tuple->dst.u.udp.port;
portptr = &hdr->dest;
}
-   if (hdr->check || skb->ip_summed == CHECKSUM_PARTIAL) {
+   if (do_csum) {
l3proto->csum_update(skb, iphdroff, &hdr->check,
 tuple, maniptype);
inet_proto_csum_replace2(&hdr->check, skb, *portptr, newport,
@@ -62,9 +57,68 @@ udp_manip_pkt(struct sk_buff *skb,
hdr->check = CSUM_MANGLED_0;
}
*portptr = newport;
+}
+
+static bool udp_manip_pkt(struct sk_buff *skb,
+ const struct nf_nat_l3proto *l3proto,
+ unsigned int iphdroff, unsigned int hdroff,
+ const struct nf_conntrack_tuple *tuple,
+ enum nf_nat_manip_type maniptype)
+{
+   struct udphdr *hdr;
+   bool do_csum;
+
+   if (!skb_make_writable(skb, hdroff + sizeof(*hdr)))
+   return false;
+
+   hdr = (struct udphdr *)(skb->data + hdroff);
+   do_csum = hdr->check || skb->ip_summed == CHECKSUM_PARTIAL;
+
+   __udp_manip_pkt(skb, l3proto, iphdroff, hdr, tuple, maniptype, do_csum);
+   return true;
+}
+
+#ifdef CONFIG_NF_NAT_PROTO_UDPLITE
+static u16 udplite_port_rover;
+
+static bool udplite_manip_pkt(struct sk_buff *skb,
+ const struct nf_nat_l3proto *l3proto,
+ unsigned int iphdroff, unsigned int hdroff,
+ const struct nf_conntrack_tuple *tuple,
+ enum nf_nat_manip_type maniptype)
+{
+   struct udphdr *hdr;
+
+   if (!skb_make_writable(skb, hdroff + sizeof(*hdr)))
+   return false;
+
+   hdr = (struct udphdr *)(skb->data + hdroff);
+   __udp_manip_pkt(skb, l3proto, iphdroff, hdr, tuple, maniptype, true);
return true;
 }
 
+static void
+udplite_unique_tuple(const struct nf_nat_l3proto *l3proto,
+struct nf_conntrack_tuple *tuple,
+const struct nf_nat_range *range,
+enum nf_nat_manip_type maniptype,
+const struct nf_conn *ct)
+{
+   nf_nat_l4proto_unique_tuple(l

[PATCH 23/27] netfilter: add and use nf_ct_set helper

2017-02-03 Thread Pablo Neira Ayuso
From: Florian Westphal 

Add a helper to assign a nf_conn entry and the ctinfo bits to an sk_buff.
This avoids changing code in the followup patch that merges skb->nfct and
skb->nfctinfo into skb->_nfct.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/ip_vs.h|  3 +--
 include/net/netfilter/nf_conntrack.h   |  8 
 net/ipv4/netfilter/ipt_SYNPROXY.c  |  3 +--
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |  3 +--
 net/ipv4/netfilter/nf_dup_ipv4.c   |  3 +--
 net/ipv6/netfilter/ip6t_SYNPROXY.c |  3 +--
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |  6 ++
 net/ipv6/netfilter/nf_dup_ipv6.c   |  3 +--
 net/netfilter/nf_conntrack_core.c  | 11 +++
 net/netfilter/nft_ct.c |  3 +--
 net/netfilter/xt_CT.c  |  6 ++
 net/openvswitch/conntrack.c|  6 ++
 12 files changed, 24 insertions(+), 34 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index 2a344ebd7ebe..4b46c591b542 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1559,8 +1559,7 @@ static inline void ip_vs_notrack(struct sk_buff *skb)
nf_conntrack_put(&ct->ct_general);
untracked = nf_ct_untracked_get();
nf_conntrack_get(&untracked->ct_general);
-   skb->nfct = &untracked->ct_general;
-   skb->nfctinfo = IP_CT_NEW;
+   nf_ct_set(skb, untracked, IP_CT_NEW);
}
 #endif
 }
diff --git a/include/net/netfilter/nf_conntrack.h 
b/include/net/netfilter/nf_conntrack.h
index 5916aa9ab3f0..d704aed11684 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -34,6 +34,7 @@ union nf_conntrack_proto {
struct ip_ct_sctp sctp;
struct ip_ct_tcp tcp;
struct nf_ct_gre gre;
+   unsigned int tmpl_padto;
 };
 
 union nf_conntrack_expect_proto {
@@ -341,6 +342,13 @@ struct nf_conn *nf_ct_tmpl_alloc(struct net *net,
 gfp_t flags);
 void nf_ct_tmpl_free(struct nf_conn *tmpl);
 
+static inline void
+nf_ct_set(struct sk_buff *skb, struct nf_conn *ct, enum ip_conntrack_info info)
+{
+   skb->nfct = &ct->ct_general;
+   skb->nfctinfo = info;
+}
+
 #define NF_CT_STAT_INC(net, count)   __this_cpu_inc((net)->ct.stat->count)
 #define NF_CT_STAT_INC_ATOMIC(net, count) this_cpu_inc((net)->ct.stat->count)
 #define NF_CT_STAT_ADD_ATOMIC(net, count, v) 
this_cpu_add((net)->ct.stat->count, (v))
diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c 
b/net/ipv4/netfilter/ipt_SYNPROXY.c
index a12d4f0aa674..3240a2614e82 100644
--- a/net/ipv4/netfilter/ipt_SYNPROXY.c
+++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
@@ -57,8 +57,7 @@ synproxy_send_tcp(struct net *net,
goto free_nskb;
 
if (nfct) {
-   nskb->nfct = nfct;
-   nskb->nfctinfo = ctinfo;
+   nf_ct_set(nskb, (struct nf_conn *)nfct, ctinfo);
nf_conntrack_get(nfct);
}
 
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c 
b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index 478a025909fc..73c591d8a9a8 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -172,8 +172,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, 
struct sk_buff *skb,
ctinfo += IP_CT_IS_REPLY;
 
/* Update skb to refer to this connection */
-   skb->nfct = &nf_ct_tuplehash_to_ctrack(h)->ct_general;
-   skb->nfctinfo = ctinfo;
+   nf_ct_set(skb, nf_ct_tuplehash_to_ctrack(h), ctinfo);
return NF_ACCEPT;
 }
 
diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c
index 1a5e1f53ceaa..f0dbff05fc28 100644
--- a/net/ipv4/netfilter/nf_dup_ipv4.c
+++ b/net/ipv4/netfilter/nf_dup_ipv4.c
@@ -69,8 +69,7 @@ void nf_dup_ipv4(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
/* Avoid counting cloned packets towards the original connection. */
nf_reset(skb);
-   skb->nfct = &nf_ct_untracked_get()->ct_general;
-   skb->nfctinfo = IP_CT_NEW;
+   nf_ct_set(skb, nf_ct_untracked_get(), IP_CT_NEW);
nf_conntrack_get(skb_nfct(skb));
 #endif
/*
diff --git a/net/ipv6/netfilter/ip6t_SYNPROXY.c b/net/ipv6/netfilter/ip6t_SYNPROXY.c
index 2dc01d2c6ec0..4ef1ddd4bbbd 100644
--- a/net/ipv6/netfilter/ip6t_SYNPROXY.c
+++ b/net/ipv6/netfilter/ip6t_SYNPROXY.c
@@ -71,8 +71,7 @@ synproxy_send_tcp(struct net *net,
skb_dst_set(nskb, dst);
 
if (nfct) {
-   nskb->nfct = nfct;
-   nskb->nfctinfo = ctinfo;
+   nf_ct_set(nskb, (struct nf_conn *)nfct, ctinfo);
nf_conntrack_get(nfct);
}
 
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c

[PATCH 20/27] netfilter: reset netfilter state when duplicating packet

2017-02-03 Thread Pablo Neira Ayuso
From: Florian Westphal 

We should also toss nf_bridge_info, if any: the packet is leaving via
ip_local_out and this skb isn't bridged -- it is a locally generated
copy.  This also avoids the need to touch this spot later when skb->nfct
is replaced with 'unsigned long _nfct' in a follow-up patch.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv4/netfilter/nf_dup_ipv4.c | 2 +-
 net/ipv6/netfilter/nf_dup_ipv6.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/netfilter/nf_dup_ipv4.c b/net/ipv4/netfilter/nf_dup_ipv4.c
index cf986e1c7bbd..a981ef7151ca 100644
--- a/net/ipv4/netfilter/nf_dup_ipv4.c
+++ b/net/ipv4/netfilter/nf_dup_ipv4.c
@@ -68,7 +68,7 @@ void nf_dup_ipv4(struct net *net, struct sk_buff *skb, unsigned int hooknum,
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
/* Avoid counting cloned packets towards the original connection. */
-   nf_conntrack_put(skb->nfct);
+   nf_reset(skb);
skb->nfct = &nf_ct_untracked_get()->ct_general;
skb->nfctinfo = IP_CT_NEW;
nf_conntrack_get(skb->nfct);
diff --git a/net/ipv6/netfilter/nf_dup_ipv6.c b/net/ipv6/netfilter/nf_dup_ipv6.c
index 4a84b5ad9ecb..5f52e5f90e7e 100644
--- a/net/ipv6/netfilter/nf_dup_ipv6.c
+++ b/net/ipv6/netfilter/nf_dup_ipv6.c
@@ -57,7 +57,7 @@ void nf_dup_ipv6(struct net *net, struct sk_buff *skb, unsigned int hooknum,
return;
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-   nf_conntrack_put(skb->nfct);
+   nf_reset(skb);
skb->nfct = &nf_ct_untracked_get()->ct_general;
skb->nfctinfo = IP_CT_NEW;
nf_conntrack_get(skb->nfct);
-- 
2.1.4



[PATCH 25/27] netfilter: merge ctinfo into nfct pointer storage area

2017-02-03 Thread Pablo Neira Ayuso
From: Florian Westphal 

After this change conntrack operations (lookup, creation, matching from
ruleset) only access one instead of two sk_buff cache lines.

This works for normal conntracks because those are allocated from a slab
that guarantees hw cacheline or 8 byte alignment (whichever is larger),
so the 3 bits needed for ctinfo won't overlap with nf_conn addresses.

Template allocation now does manual address alignment (see previous change)
on arches that don't have sufficient kmalloc min alignment.

Some spots intentionally use skb->_nfct instead of skb_nfct() helpers,
this is to avoid undoing the skb_nfct() use when we remove untracked
conntrack object in the future.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/linux/skbuff.h  | 21 +
 include/net/netfilter/nf_conntrack.h| 11 ++-
 net/ipv6/netfilter/nf_dup_ipv6.c|  2 +-
 net/netfilter/core.c|  2 +-
 net/netfilter/nf_conntrack_core.c   | 11 ++-
 net/netfilter/nf_conntrack_standalone.c |  3 +++
 net/netfilter/xt_CT.c   |  4 ++--
 7 files changed, 28 insertions(+), 26 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 276431e047af..ac0bc085b139 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -585,7 +585,6 @@ static inline bool skb_mstamp_after(const struct skb_mstamp *t1,
  * @cloned: Head may be cloned (check refcnt to be sure)
  * @ip_summed: Driver fed us an IP checksum
  * @nohdr: Payload reference only, must not modify header
- * @nfctinfo: Relationship of this skb to the connection
  * @pkt_type: Packet class
  * @fclone: skbuff clone status
  * @ipvs_property: skbuff is owned by ipvs
@@ -594,7 +593,7 @@ static inline bool skb_mstamp_after(const struct skb_mstamp *t1,
  * @nf_trace: netfilter packet trace flag
  * @protocol: Packet protocol from driver
  * @destructor: Destruct function
- * @nfct: Associated connection, if any
+ * @_nfct: Associated connection, if any (with nfctinfo bits)
  * @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
  * @skb_iif: ifindex of device we arrived on
  * @tc_index: Traffic control index
@@ -668,7 +667,7 @@ struct sk_buff {
struct  sec_path*sp;
 #endif
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
-   struct nf_conntrack *nfct;
+   unsigned long_nfct;
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
struct nf_bridge_info   *nf_bridge;
@@ -721,7 +720,6 @@ struct sk_buff {
__u8pkt_type:3;
__u8pfmemalloc:1;
__u8ignore_df:1;
-   __u8nfctinfo:3;
 
__u8nf_trace:1;
__u8ip_summed:2;
@@ -836,6 +834,7 @@ static inline bool skb_pfmemalloc(const struct sk_buff *skb)
 #define SKB_DST_NOREF  1UL
 #define SKB_DST_PTRMASK~(SKB_DST_NOREF)
 
+#define SKB_NFCT_PTRMASK   ~(7UL)
 /**
  * skb_dst - returns skb dst_entry
  * @skb: buffer
@@ -3556,7 +3555,7 @@ static inline void skb_remcsum_process(struct sk_buff *skb, void *ptr,
 static inline struct nf_conntrack *skb_nfct(const struct sk_buff *skb)
 {
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-   return skb->nfct;
+   return (void *)(skb->_nfct & SKB_NFCT_PTRMASK);
 #else
return NULL;
 #endif
@@ -3590,8 +3589,8 @@ static inline void nf_bridge_get(struct nf_bridge_info *nf_bridge)
 static inline void nf_reset(struct sk_buff *skb)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
-   nf_conntrack_put(skb->nfct);
-   skb->nfct = NULL;
+   nf_conntrack_put(skb_nfct(skb));
+   skb->_nfct = 0;
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
nf_bridge_put(skb->nf_bridge);
@@ -3611,10 +3610,8 @@ static inline void __nf_copy(struct sk_buff *dst, const struct sk_buff *src,
 bool copy)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
-   dst->nfct = src->nfct;
-   nf_conntrack_get(src->nfct);
-   if (copy)
-   dst->nfctinfo = src->nfctinfo;
+   dst->_nfct = src->_nfct;
+   nf_conntrack_get(skb_nfct(src));
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
dst->nf_bridge  = src->nf_bridge;
@@ -3629,7 +3626,7 @@ static inline void __nf_copy(struct sk_buff *dst, const 
struct sk_buff *src,
 static inline void nf_copy(struct sk_buff *dst, const struct sk_buff *src)
 {
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
-   nf_conntrack_put(dst->nfct);
+   nf_conntrack_put(skb_nfct(dst));
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
nf_bridge_put(dst->nf_bridge);
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 06d3d2d24fe0..f540f9ad2af4 100644
--- a/i

[PATCH 21/27] netfilter: reduce direct skb->nfct usage

2017-02-03 Thread Pablo Neira Ayuso
From: Florian Westphal 

The next patch makes direct skb->nfct access illegal; reduce noise in
that patch by using accessors we already have.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/ip_vs.h   |  9 ++---
 net/netfilter/nf_conntrack_core.c | 15 +--
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/include/net/ip_vs.h b/include/net/ip_vs.h
index cd6018a9ee24..2a344ebd7ebe 100644
--- a/include/net/ip_vs.h
+++ b/include/net/ip_vs.h
@@ -1554,10 +1554,13 @@ static inline void ip_vs_notrack(struct sk_buff *skb)
struct nf_conn *ct = nf_ct_get(skb, &ctinfo);
 
if (!ct || !nf_ct_is_untracked(ct)) {
-   nf_conntrack_put(skb->nfct);
-   skb->nfct = &nf_ct_untracked_get()->ct_general;
+   struct nf_conn *untracked;
+
+   nf_conntrack_put(&ct->ct_general);
+   untracked = nf_ct_untracked_get();
+   nf_conntrack_get(&untracked->ct_general);
+   skb->nfct = &untracked->ct_general;
skb->nfctinfo = IP_CT_NEW;
-   nf_conntrack_get(skb->nfct);
}
 #endif
 }
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 86186a2e2715..adb7af3a4c4c 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -686,8 +686,11 @@ static int nf_ct_resolve_clash(struct net *net, struct sk_buff *skb,
!nfct_nat(ct) &&
!nf_ct_is_dying(ct) &&
atomic_inc_not_zero(&ct->ct_general.use)) {
-   nf_ct_acct_merge(ct, ctinfo, (struct nf_conn *)skb->nfct);
-   nf_conntrack_put(skb->nfct);
+   enum ip_conntrack_info oldinfo;
+   struct nf_conn *loser_ct = nf_ct_get(skb, &oldinfo);
+
+   nf_ct_acct_merge(ct, ctinfo, loser_ct);
+   nf_conntrack_put(&loser_ct->ct_general);
/* Assign conntrack already in hashes to this skbuff. Don't
 * modify skb->nfctinfo to ensure consistent stateful filtering.
 */
@@ -1288,7 +1291,7 @@ unsigned int
 nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
struct sk_buff *skb)
 {
-   struct nf_conn *ct, *tmpl = NULL;
+   struct nf_conn *ct, *tmpl;
enum ip_conntrack_info ctinfo;
struct nf_conntrack_l3proto *l3proto;
struct nf_conntrack_l4proto *l4proto;
@@ -1298,9 +1301,9 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
int set_reply = 0;
int ret;
 
-   if (skb->nfct) {
+   tmpl = nf_ct_get(skb, &ctinfo);
+   if (tmpl) {
/* Previously seen (loopback or untracked)?  Ignore. */
-   tmpl = (struct nf_conn *)skb->nfct;
if (!nf_ct_is_template(tmpl)) {
NF_CT_STAT_INC_ATOMIC(net, ignore);
return NF_ACCEPT;
@@ -1364,7 +1367,7 @@ nf_conntrack_in(struct net *net, u_int8_t pf, unsigned int hooknum,
/* Invalid: inverse of the return code tells
 * the netfilter core what to do */
pr_debug("nf_conntrack_in: Can't track with proto module\n");
-   nf_conntrack_put(skb->nfct);
+   nf_conntrack_put(&ct->ct_general);
skb->nfct = NULL;
NF_CT_STAT_INC_ATOMIC(net, invalid);
if (ret == -NF_DROP)
-- 
2.1.4



[PATCH 17/27] netfilter: nf_tables: eliminate useless condition checks

2017-02-03 Thread Pablo Neira Ayuso
From: Gao Feng 

The return value of nf_tables_table_lookup() is either a valid pointer
or an error pointer. There are two cases:

1) IS_ERR(table) is true: the function either returns the error or
   resets table to NULL, so the later check "table != NULL" is
   unnecessary.

2) IS_ERR(table) is false: table is a valid pointer, so that check
   is equally unnecessary.

nf_tables_newset() and nf_tables_newobj() contain the same logic.

In summary, we can move the body of the "table != NULL" check into the
else branch and eliminate the original condition checks.

Signed-off-by: Gao Feng 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_tables_api.c | 15 +++
 1 file changed, 3 insertions(+), 12 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index a019a87e58ee..6e07c214c208 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -696,10 +696,7 @@ static int nf_tables_newtable(struct net *net, struct sock *nlsk,
if (IS_ERR(table)) {
if (PTR_ERR(table) != -ENOENT)
return PTR_ERR(table);
-   table = NULL;
-   }
-
-   if (table != NULL) {
+   } else {
if (nlh->nlmsg_flags & NLM_F_EXCL)
return -EEXIST;
if (nlh->nlmsg_flags & NLM_F_REPLACE)
@@ -2963,10 +2960,7 @@ static int nf_tables_newset(struct net *net, struct sock *nlsk,
if (IS_ERR(set)) {
if (PTR_ERR(set) != -ENOENT)
return PTR_ERR(set);
-   set = NULL;
-   }
-
-   if (set != NULL) {
+   } else {
if (nlh->nlmsg_flags & NLM_F_EXCL)
return -EEXIST;
if (nlh->nlmsg_flags & NLM_F_REPLACE)
@@ -4153,10 +4147,7 @@ static int nf_tables_newobj(struct net *net, struct sock *nlsk,
if (err != -ENOENT)
return err;
 
-   obj = NULL;
-   }
-
-   if (obj != NULL) {
+   } else {
if (nlh->nlmsg_flags & NLM_F_EXCL)
return -EEXIST;
 
-- 
2.1.4



[PATCH 18/27] netfilter: nf_tables: Eliminate duplicated code in nf_tables_table_enable()

2017-02-03 Thread Pablo Neira Ayuso
From: Feng 

If something fails in nf_tables_table_enable(), it unregisters the
chains. But the rollback code is almost identical to
nf_tables_table_disable(), differing only in one counter check. Create
one wrapper function to eliminate the duplicated code.

Signed-off-by: Feng 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_tables_api.c | 48 ++-
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 6e07c214c208..e6741ac4ccc1 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -576,6 +576,28 @@ static int nf_tables_gettable(struct net *net, struct sock *nlsk,
return err;
 }
 
+static void _nf_tables_table_disable(struct net *net,
+const struct nft_af_info *afi,
+struct nft_table *table,
+u32 cnt)
+{
+   struct nft_chain *chain;
+   u32 i = 0;
+
+   list_for_each_entry(chain, &table->chains, list) {
+   if (!nft_is_active_next(net, chain))
+   continue;
+   if (!(chain->flags & NFT_BASE_CHAIN))
+   continue;
+
+   if (cnt && i++ == cnt)
+   break;
+
+   nf_unregister_net_hooks(net, nft_base_chain(chain)->ops,
+   afi->nops);
+   }
+}
+
 static int nf_tables_table_enable(struct net *net,
  const struct nft_af_info *afi,
  struct nft_table *table)
@@ -598,18 +620,8 @@ static int nf_tables_table_enable(struct net *net,
}
return 0;
 err:
-   list_for_each_entry(chain, &table->chains, list) {
-   if (!nft_is_active_next(net, chain))
-   continue;
-   if (!(chain->flags & NFT_BASE_CHAIN))
-   continue;
-
-   if (i-- <= 0)
-   break;
-
-   nf_unregister_net_hooks(net, nft_base_chain(chain)->ops,
-   afi->nops);
-   }
+   if (i)
+   _nf_tables_table_disable(net, afi, table, i);
return err;
 }
 
@@ -617,17 +629,7 @@ static void nf_tables_table_disable(struct net *net,
const struct nft_af_info *afi,
struct nft_table *table)
 {
-   struct nft_chain *chain;
-
-   list_for_each_entry(chain, &table->chains, list) {
-   if (!nft_is_active_next(net, chain))
-   continue;
-   if (!(chain->flags & NFT_BASE_CHAIN))
-   continue;
-
-   nf_unregister_net_hooks(net, nft_base_chain(chain)->ops,
-   afi->nops);
-   }
+   _nf_tables_table_disable(net, afi, table, 0);
 }
 
 static int nf_tables_updtable(struct nft_ctx *ctx)
-- 
2.1.4



[PATCH 27/27] netfilter: allow logging from non-init namespaces

2017-02-03 Thread Pablo Neira Ayuso
From: Michal Kubeček 

Commit 69b34fb996b2 ("netfilter: xt_LOG: add net namespace support for
xt_LOG") disabled logging packets using the LOG target from non-init
namespaces. The motivation was to prevent containers from flooding the
kernel log of the host. The plan was to keep it that way until a syslog
namespace implementation allows containers to log in a safe way.

However, the work on syslog namespace seems to have hit a dead end
somewhere in 2013 and there are users who want to use xt_LOG in all
network namespaces. This patch allows doing so by setting

  /proc/sys/net/netfilter/nf_log_all_netns

to a nonzero value. This sysctl is only accessible from init_net so that
one cannot switch the behaviour from inside a container.
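
With a kernel carrying this patch, the switch would be flipped from the
init namespace roughly like this (a configuration sketch, not part of
the patch itself):

```
# one-off, at runtime
echo 1 > /proc/sys/net/netfilter/nf_log_all_netns

# or persistently, via sysctl.conf / sysctl.d
net.netfilter.nf_log_all_netns = 1
```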

Signed-off-by: Michal Kubecek 
Signed-off-by: Pablo Neira Ayuso 
---
 Documentation/networking/netfilter-sysctl.txt | 10 ++
 include/net/netfilter/nf_log.h|  3 +++
 net/bridge/netfilter/ebt_log.c|  2 +-
 net/ipv4/netfilter/nf_log_arp.c   |  2 +-
 net/ipv4/netfilter/nf_log_ipv4.c  |  2 +-
 net/ipv6/netfilter/nf_log_ipv6.c  |  2 +-
 net/netfilter/nf_log.c| 24 
 7 files changed, 41 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/networking/netfilter-sysctl.txt

diff --git a/Documentation/networking/netfilter-sysctl.txt b/Documentation/networking/netfilter-sysctl.txt
new file mode 100644
index ..55791e50e169
--- /dev/null
+++ b/Documentation/networking/netfilter-sysctl.txt
@@ -0,0 +1,10 @@
+/proc/sys/net/netfilter/* Variables:
+
+nf_log_all_netns - BOOLEAN
+   0 - disabled (default)
+   not 0 - enabled
+
+   By default, only init_net namespace can log packets into kernel log
+   with LOG target; this aims to prevent containers from flooding host
+   kernel log. If enabled, this target also works in other network
+   namespaces. This variable is only accessible from init_net.
diff --git a/include/net/netfilter/nf_log.h b/include/net/netfilter/nf_log.h
index 450f87f95415..42e0696f38d8 100644
--- a/include/net/netfilter/nf_log.h
+++ b/include/net/netfilter/nf_log.h
@@ -51,6 +51,9 @@ struct nf_logger {
struct module   *me;
 };
 
+/* sysctl_nf_log_all_netns - allow LOG target in all network namespaces */
+extern int sysctl_nf_log_all_netns;
+
 /* Function to register/unregister log function. */
 int nf_log_register(u_int8_t pf, struct nf_logger *logger);
 void nf_log_unregister(struct nf_logger *logger);
diff --git a/net/bridge/netfilter/ebt_log.c b/net/bridge/netfilter/ebt_log.c
index e88bd4827ac1..98b9c8e8615e 100644
--- a/net/bridge/netfilter/ebt_log.c
+++ b/net/bridge/netfilter/ebt_log.c
@@ -78,7 +78,7 @@ ebt_log_packet(struct net *net, u_int8_t pf, unsigned int hooknum,
unsigned int bitmask;
 
/* FIXME: Disabled from containers until syslog ns is supported */
-   if (!net_eq(net, &init_net))
+   if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
return;
 
spin_lock_bh(&ebt_log_lock);
diff --git a/net/ipv4/netfilter/nf_log_arp.c b/net/ipv4/netfilter/nf_log_arp.c
index b24795e2ee6d..f6f713376e6e 100644
--- a/net/ipv4/netfilter/nf_log_arp.c
+++ b/net/ipv4/netfilter/nf_log_arp.c
@@ -87,7 +87,7 @@ static void nf_log_arp_packet(struct net *net, u_int8_t pf,
struct nf_log_buf *m;
 
/* FIXME: Disabled from containers until syslog ns is supported */
-   if (!net_eq(net, &init_net))
+   if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
return;
 
m = nf_log_buf_open();
diff --git a/net/ipv4/netfilter/nf_log_ipv4.c b/net/ipv4/netfilter/nf_log_ipv4.c
index 856648966f4c..c83a9963269b 100644
--- a/net/ipv4/netfilter/nf_log_ipv4.c
+++ b/net/ipv4/netfilter/nf_log_ipv4.c
@@ -319,7 +319,7 @@ static void nf_log_ip_packet(struct net *net, u_int8_t pf,
struct nf_log_buf *m;
 
/* FIXME: Disabled from containers until syslog ns is supported */
-   if (!net_eq(net, &init_net))
+   if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
return;
 
m = nf_log_buf_open();
diff --git a/net/ipv6/netfilter/nf_log_ipv6.c b/net/ipv6/netfilter/nf_log_ipv6.c
index 57d86066a13b..055c51b80f5d 100644
--- a/net/ipv6/netfilter/nf_log_ipv6.c
+++ b/net/ipv6/netfilter/nf_log_ipv6.c
@@ -351,7 +351,7 @@ static void nf_log_ip6_packet(struct net *net, u_int8_t pf,
struct nf_log_buf *m;
 
/* FIXME: Disabled from containers until syslog ns is supported */
-   if (!net_eq(net, &init_net))
+   if (!net_eq(net, &init_net) && !sysctl_nf_log_all_netns)
return;
 
m = nf_log_buf_open();
diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index 3dca90dc24ad..0a034f52b912 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -16,6 +16,9 @@
 #define NF_LOG_PREFIXLEN   128
 #define NFLOGGER_NAME_LEN  64
 
+int 

[PATCH 24/27] netfilter: guarantee 8 byte minalign for template addresses

2017-02-03 Thread Pablo Neira Ayuso
From: Florian Westphal 

The next change will merge the skb->nfct pointer and the skb->nfctinfo
status bits into a single skb->_nfct (unsigned long) field.

For this to work, nf_conn addresses must always be aligned on at least
an 8 byte boundary, since we will need the lower 3 bits to store nfctinfo.

Conntrack templates are allocated via kmalloc.
kbuild test robot reported
BUILD_BUG_ON failed: NFCT_INFOMASK >= ARCH_KMALLOC_MINALIGN
on v1 of this patchset, so not all platforms meet this requirement.

Do manual alignment if needed; the alignment offset is stored in the
nf_conn entry protocol area. This works because templates are not
handed off to L4 protocol trackers.

Reported-by: kbuild test robot 
Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_conntrack.h |  2 ++
 net/netfilter/nf_conntrack_core.c| 29 -
 2 files changed, 26 insertions(+), 5 deletions(-)

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index d704aed11684..06d3d2d24fe0 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -163,6 +163,8 @@ void nf_conntrack_alter_reply(struct nf_conn *ct,
 int nf_conntrack_tuple_taken(const struct nf_conntrack_tuple *tuple,
 const struct nf_conn *ignored_conntrack);
 
+#define NFCT_INFOMASK  7UL
+
 /* Return conntrack_info and tuple hash for given skb. */
 static inline struct nf_conn *
 nf_ct_get(const struct sk_buff *skb, enum ip_conntrack_info *ctinfo)
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index c9bd10747864..768968fba7f6 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -350,16 +350,31 @@ static void nf_ct_del_from_dying_or_unconfirmed_list(struct nf_conn *ct)
spin_unlock(&pcpu->lock);
 }
 
+#define NFCT_ALIGN(len)(((len) + NFCT_INFOMASK) & ~NFCT_INFOMASK)
+
 /* Released via destroy_conntrack() */
 struct nf_conn *nf_ct_tmpl_alloc(struct net *net,
 const struct nf_conntrack_zone *zone,
 gfp_t flags)
 {
-   struct nf_conn *tmpl;
+   struct nf_conn *tmpl, *p;
 
-   tmpl = kzalloc(sizeof(*tmpl), flags);
-   if (tmpl == NULL)
-   return NULL;
+   if (ARCH_KMALLOC_MINALIGN <= NFCT_INFOMASK) {
+   tmpl = kzalloc(sizeof(*tmpl) + NFCT_INFOMASK, flags);
+   if (!tmpl)
+   return NULL;
+
+   p = tmpl;
+   tmpl = (struct nf_conn *)NFCT_ALIGN((unsigned long)p);
+   if (tmpl != p) {
+   tmpl = (struct nf_conn *)NFCT_ALIGN((unsigned long)p);
+   tmpl->proto.tmpl_padto = (char *)tmpl - (char *)p;
+   }
+   } else {
+   tmpl = kzalloc(sizeof(*tmpl), flags);
+   if (!tmpl)
+   return NULL;
+   }
 
tmpl->status = IPS_TEMPLATE;
write_pnet(&tmpl->ct_net, net);
@@ -374,7 +389,11 @@ void nf_ct_tmpl_free(struct nf_conn *tmpl)
 {
nf_ct_ext_destroy(tmpl);
nf_ct_ext_free(tmpl);
-   kfree(tmpl);
+
+   if (ARCH_KMALLOC_MINALIGN <= NFCT_INFOMASK)
+   kfree((char *)tmpl - tmpl->proto.tmpl_padto);
+   else
+   kfree(tmpl);
 }
 EXPORT_SYMBOL_GPL(nf_ct_tmpl_free);
 
-- 
2.1.4



[PATCH 07/27] netfilter: xt_connlimit: use rb_entry()

2017-02-03 Thread Pablo Neira Ayuso
From: Geliang Tang 

To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Signed-off-by: Geliang Tang 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/xt_connlimit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
index 2aff2b7c4689..660b61dbd776 100644
--- a/net/netfilter/xt_connlimit.c
+++ b/net/netfilter/xt_connlimit.c
@@ -218,7 +218,7 @@ count_tree(struct net *net, struct rb_root *root,
int diff;
bool addit;
 
-   rbconn = container_of(*rbnode, struct xt_connlimit_rb, node);
+   rbconn = rb_entry(*rbnode, struct xt_connlimit_rb, node);
 
parent = *rbnode;
diff = same_source_net(addr, mask, &rbconn->addr, family);
@@ -398,7 +398,7 @@ static void destroy_tree(struct rb_root *r)
struct rb_node *node;
 
while ((node = rb_first(r)) != NULL) {
-   rbconn = container_of(node, struct xt_connlimit_rb, node);
+   rbconn = rb_entry(node, struct xt_connlimit_rb, node);
 
rb_erase(node, r);
 
-- 
2.1.4



[PATCH 22/27] skbuff: add and use skb_nfct helper

2017-02-03 Thread Pablo Neira Ayuso
From: Florian Westphal 

A follow-up patch renames skb->nfct and changes its type, so add a
helper now to avoid an intrusive rename later.

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/linux/skbuff.h | 13 ++---
 include/net/netfilter/nf_conntrack_core.h  |  2 +-
 net/core/skbuff.c  |  2 +-
 net/ipv4/netfilter/ipt_SYNPROXY.c  |  8 
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |  2 +-
 net/ipv4/netfilter/nf_defrag_ipv4.c|  4 ++--
 net/ipv4/netfilter/nf_dup_ipv4.c   |  2 +-
 net/ipv6/netfilter/ip6t_SYNPROXY.c |  8 
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |  4 ++--
 net/ipv6/netfilter/nf_defrag_ipv6_hooks.c  |  4 ++--
 net/netfilter/nf_conntrack_core.c  |  4 ++--
 net/netfilter/nf_nat_helper.c  |  2 +-
 net/netfilter/xt_CT.c  |  2 +-
 net/openvswitch/conntrack.c|  6 +++---
 net/sched/cls_flow.c   |  2 +-
 15 files changed, 36 insertions(+), 29 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index b53c0cfd417e..276431e047af 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3553,6 +3553,15 @@ static inline void skb_remcsum_process(struct sk_buff *skb, void *ptr,
skb->csum = csum_add(skb->csum, delta);
 }
 
+static inline struct nf_conntrack *skb_nfct(const struct sk_buff *skb)
+{
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+   return skb->nfct;
+#else
+   return NULL;
+#endif
+}
+
 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
 void nf_conntrack_destroy(struct nf_conntrack *nfct);
 static inline void nf_conntrack_put(struct nf_conntrack *nfct)
@@ -3652,9 +3661,7 @@ static inline bool skb_irq_freeable(const struct sk_buff *skb)
 #if IS_ENABLED(CONFIG_XFRM)
!skb->sp &&
 #endif
-#if IS_ENABLED(CONFIG_NF_CONNTRACK)
-   !skb->nfct &&
-#endif
+   !skb_nfct(skb) &&
!skb->_skb_refdst &&
!skb_has_frag_list(skb);
 }
diff --git a/include/net/netfilter/nf_conntrack_core.h b/include/net/netfilter/nf_conntrack_core.h
index 62e17d1319ff..84ec7ca5f195 100644
--- a/include/net/netfilter/nf_conntrack_core.h
+++ b/include/net/netfilter/nf_conntrack_core.h
@@ -62,7 +62,7 @@ int __nf_conntrack_confirm(struct sk_buff *skb);
 /* Confirm a connection: returns NF_DROP if packet must be dropped. */
 static inline int nf_conntrack_confirm(struct sk_buff *skb)
 {
-   struct nf_conn *ct = (struct nf_conn *)skb->nfct;
+   struct nf_conn *ct = (struct nf_conn *)skb_nfct(skb);
int ret = NF_ACCEPT;
 
if (ct && !nf_ct_is_untracked(ct)) {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5a03730fbc1a..cac3ebfb4b45 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -655,7 +655,7 @@ static void skb_release_head_state(struct sk_buff *skb)
skb->destructor(skb);
}
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-   nf_conntrack_put(skb->nfct);
+   nf_conntrack_put(skb_nfct(skb));
 #endif
 #if IS_ENABLED(CONFIG_BRIDGE_NETFILTER)
nf_bridge_put(skb->nf_bridge);
diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
index 30c0de53e254..a12d4f0aa674 100644
--- a/net/ipv4/netfilter/ipt_SYNPROXY.c
+++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
@@ -107,8 +107,8 @@ synproxy_send_client_synack(struct net *net,
 
synproxy_build_options(nth, opts);
 
-   synproxy_send_tcp(net, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
- niph, nth, tcp_hdr_size);
+   synproxy_send_tcp(net, skb, nskb, skb_nfct(skb),
+ IP_CT_ESTABLISHED_REPLY, niph, nth, tcp_hdr_size);
 }
 
 static void
@@ -230,8 +230,8 @@ synproxy_send_client_ack(struct net *net,
 
synproxy_build_options(nth, opts);
 
-   synproxy_send_tcp(net, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
- niph, nth, tcp_hdr_size);
+   synproxy_send_tcp(net, skb, nskb, skb_nfct(skb),
+ IP_CT_ESTABLISHED_REPLY, niph, nth, tcp_hdr_size);
 }
 
 static bool
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index 566afac98a88..478a025909fc 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -137,7 +137,7 @@ icmp_error_message(struct net *net, struct nf_conn *tmpl, struct sk_buff *skb,
enum ip_conntrack_info ctinfo;
struct nf_conntrack_zone tmp;
 
-   NF_CT_ASSERT(skb->nfct == NULL);
+   NF_CT_ASSERT(!skb_nfct(skb));
zone = nf_ct_zone_tmpl(tmpl, skb, &tmp);
 
/* Are they talking about one of our connections? */
diff --git a/net/ipv4/netfilter/nf_defrag_ipv4.c b/net/ipv4/netfilter/nf_defrag_ipv4.c
index 49bd6a54404f..346bf7ccac08 100644
--- a/ne

[PATCH 14/27] xtables: extend matches and targets with .usersize

2017-02-03 Thread Pablo Neira Ayuso
From: Willem de Bruijn 

In matches and targets that define a kernel-only tail to their
xt_match and xt_target data structs, add a field .usersize that
specifies up to where data is to be shared with userspace.

Performed a search for comment "Used internally by the kernel" to find
relevant matches and targets. Manually inspected the structs to derive
a valid offsetof.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Pablo Neira Ayuso 
---
 net/bridge/netfilter/ebt_limit.c   | 1 +
 net/ipv4/netfilter/ipt_CLUSTERIP.c | 1 +
 net/ipv6/netfilter/ip6t_NPT.c  | 2 ++
 net/netfilter/xt_CT.c  | 3 +++
 net/netfilter/xt_RATEEST.c | 1 +
 net/netfilter/xt_TEE.c | 2 ++
 net/netfilter/xt_bpf.c | 2 ++
 net/netfilter/xt_cgroup.c  | 1 +
 net/netfilter/xt_connlimit.c   | 1 +
 net/netfilter/xt_hashlimit.c   | 4 
 net/netfilter/xt_limit.c   | 2 ++
 net/netfilter/xt_quota.c   | 1 +
 net/netfilter/xt_rateest.c | 1 +
 net/netfilter/xt_string.c  | 1 +
 14 files changed, 23 insertions(+)

diff --git a/net/bridge/netfilter/ebt_limit.c b/net/bridge/netfilter/ebt_limit.c
index 517e78befcb2..61a9f1be1263 100644
--- a/net/bridge/netfilter/ebt_limit.c
+++ b/net/bridge/netfilter/ebt_limit.c
@@ -105,6 +105,7 @@ static struct xt_match ebt_limit_mt_reg __read_mostly = {
.match  = ebt_limit_mt,
.checkentry = ebt_limit_mt_check,
.matchsize  = sizeof(struct ebt_limit_info),
+   .usersize   = offsetof(struct ebt_limit_info, prev),
 #ifdef CONFIG_COMPAT
.compatsize = sizeof(struct ebt_compat_limit_info),
 #endif
diff --git a/net/ipv4/netfilter/ipt_CLUSTERIP.c b/net/ipv4/netfilter/ipt_CLUSTERIP.c
index 21db00d0362b..8a3d20ebb815 100644
--- a/net/ipv4/netfilter/ipt_CLUSTERIP.c
+++ b/net/ipv4/netfilter/ipt_CLUSTERIP.c
@@ -468,6 +468,7 @@ static struct xt_target clusterip_tg_reg __read_mostly = {
.checkentry = clusterip_tg_check,
.destroy= clusterip_tg_destroy,
.targetsize = sizeof(struct ipt_clusterip_tgt_info),
+   .usersize   = offsetof(struct ipt_clusterip_tgt_info, config),
 #ifdef CONFIG_COMPAT
.compatsize = sizeof(struct compat_ipt_clusterip_tgt_info),
 #endif /* CONFIG_COMPAT */
diff --git a/net/ipv6/netfilter/ip6t_NPT.c b/net/ipv6/netfilter/ip6t_NPT.c
index 590f767db5d4..a379d2f79b19 100644
--- a/net/ipv6/netfilter/ip6t_NPT.c
+++ b/net/ipv6/netfilter/ip6t_NPT.c
@@ -112,6 +112,7 @@ static struct xt_target ip6t_npt_target_reg[] __read_mostly = {
.table  = "mangle",
.target = ip6t_snpt_tg,
.targetsize = sizeof(struct ip6t_npt_tginfo),
+   .usersize   = offsetof(struct ip6t_npt_tginfo, adjustment),
.checkentry = ip6t_npt_checkentry,
.family = NFPROTO_IPV6,
.hooks  = (1 << NF_INET_LOCAL_IN) |
@@ -123,6 +124,7 @@ static struct xt_target ip6t_npt_target_reg[] __read_mostly = {
.table  = "mangle",
.target = ip6t_dnpt_tg,
.targetsize = sizeof(struct ip6t_npt_tginfo),
+   .usersize   = offsetof(struct ip6t_npt_tginfo, adjustment),
.checkentry = ip6t_npt_checkentry,
.family = NFPROTO_IPV6,
.hooks  = (1 << NF_INET_PRE_ROUTING) |
diff --git a/net/netfilter/xt_CT.c b/net/netfilter/xt_CT.c
index 95c750358747..26b0bccfa0c5 100644
--- a/net/netfilter/xt_CT.c
+++ b/net/netfilter/xt_CT.c
@@ -373,6 +373,7 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = {
.name   = "CT",
.family = NFPROTO_UNSPEC,
.targetsize = sizeof(struct xt_ct_target_info),
+   .usersize   = offsetof(struct xt_ct_target_info, ct),
.checkentry = xt_ct_tg_check_v0,
.destroy= xt_ct_tg_destroy_v0,
.target = xt_ct_target_v0,
@@ -384,6 +385,7 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = {
.family = NFPROTO_UNSPEC,
.revision   = 1,
.targetsize = sizeof(struct xt_ct_target_info_v1),
+   .usersize   = offsetof(struct xt_ct_target_info, ct),
.checkentry = xt_ct_tg_check_v1,
.destroy= xt_ct_tg_destroy_v1,
.target = xt_ct_target_v1,
@@ -395,6 +397,7 @@ static struct xt_target xt_ct_tg_reg[] __read_mostly = {
.family = NFPROTO_UNSPEC,
.revision   = 2,
.targetsize = sizeof(struct xt_ct_target_info_v1),
+   .usersize   = offsetof(struct xt_ct_target_info, ct),
.checkentry = xt_ct_tg_check_v2,
.destroy= xt_ct_tg_destroy_v1,
.target  

[PATCH 15/27] netfilter: pkttype: unnecessary to check ipv6 multicast address

2017-02-03 Thread Pablo Neira Ayuso
From: Liping Zhang 

Since there is no broadcast address in IPv6, a PACKET_LOOPBACK packet in
the IPv6 family must be multicast, so there is no need to check it again.
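For reference, IPv6 multicast addresses are exactly ff00::/8, so the check the patch removes only inspects the first byte of the destination address. A minimal standalone model of that check (hypothetical helper name, not the kernel API):

```c
#include <assert.h>
#include <stdint.h>

/* IPv6 has no broadcast addresses; multicast is exactly ff00::/8,
 * i.e. the first byte of the destination address is 0xFF. For a
 * looped-back IPv6 packet that is not unicast, multicast is the
 * only remaining possibility, which is why the check is redundant. */
static int ipv6_daddr_is_multicast(const uint8_t addr[16])
{
    return addr[0] == 0xFF;
}
```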

Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_meta.c   | 5 +----
 net/netfilter/xt_pkttype.c | 3 +--
 2 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index 66c7f4b4c49b..9a22b24346b8 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -154,10 +154,7 @@ void nft_meta_get_eval(const struct nft_expr *expr,
*dest = PACKET_BROADCAST;
break;
case NFPROTO_IPV6:
-   if (ipv6_hdr(skb)->daddr.s6_addr[0] == 0xFF)
-   *dest = PACKET_MULTICAST;
-   else
-   *dest = PACKET_BROADCAST;
+   *dest = PACKET_MULTICAST;
break;
default:
WARN_ON(1);
diff --git a/net/netfilter/xt_pkttype.c b/net/netfilter/xt_pkttype.c
index 57efb703ff18..1ef99151b3ba 100644
--- a/net/netfilter/xt_pkttype.c
+++ b/net/netfilter/xt_pkttype.c
@@ -33,8 +33,7 @@ pkttype_mt(const struct sk_buff *skb, struct xt_action_param *par)
else if (xt_family(par) == NFPROTO_IPV4 &&
ipv4_is_multicast(ip_hdr(skb)->daddr))
type = PACKET_MULTICAST;
-   else if (xt_family(par) == NFPROTO_IPV6 &&
-   ipv6_hdr(skb)->daddr.s6_addr[0] == 0xFF)
+   else if (xt_family(par) == NFPROTO_IPV6)
type = PACKET_MULTICAST;
else
type = PACKET_BROADCAST;
-- 
2.1.4



[PATCH 13/27] xtables: use match, target and data copy_to_user helpers in compat

2017-02-03 Thread Pablo Neira Ayuso
From: Willem de Bruijn 

Convert compat to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/x_tables.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index feccf527abdd..016db6be94b9 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -619,17 +619,14 @@ int xt_compat_match_to_user(const struct xt_entry_match *m,
int off = xt_compat_match_offset(match);
u_int16_t msize = m->u.user.match_size - off;
 
-   if (copy_to_user(cm, m, sizeof(*cm)) ||
-   put_user(msize, &cm->u.user.match_size) ||
-   copy_to_user(cm->u.user.name, m->u.kernel.match->name,
-strlen(m->u.kernel.match->name) + 1))
+   if (XT_OBJ_TO_USER(cm, m, match, msize))
return -EFAULT;
 
if (match->compat_to_user) {
if (match->compat_to_user((void __user *)cm->data, m->data))
return -EFAULT;
} else {
-   if (copy_to_user(cm->data, m->data, msize - sizeof(*cm)))
+   if (XT_DATA_TO_USER(cm, m, match, msize - sizeof(*cm)))
return -EFAULT;
}
 
@@ -977,17 +974,14 @@ int xt_compat_target_to_user(const struct xt_entry_target *t,
int off = xt_compat_target_offset(target);
u_int16_t tsize = t->u.user.target_size - off;
 
-   if (copy_to_user(ct, t, sizeof(*ct)) ||
-   put_user(tsize, &ct->u.user.target_size) ||
-   copy_to_user(ct->u.user.name, t->u.kernel.target->name,
-strlen(t->u.kernel.target->name) + 1))
+   if (XT_OBJ_TO_USER(ct, t, target, tsize))
return -EFAULT;
 
if (target->compat_to_user) {
if (target->compat_to_user((void __user *)ct->data, t->data))
return -EFAULT;
} else {
-   if (copy_to_user(ct->data, t->data, tsize - sizeof(*ct)))
+   if (XT_DATA_TO_USER(ct, t, target, tsize - sizeof(*ct)))
return -EFAULT;
}
 
-- 
2.1.4



[PATCH 12/27] ebtables: use match, target and data copy_to_user helpers

2017-02-03 Thread Pablo Neira Ayuso
From: Willem de Bruijn 

Convert ebtables to copying entries, matches and targets one by one.

The solution is analogous to that of the generic xt_(match|target)_to_user
helpers, but is applied to different structs.

Convert the existing ebt_make_XXXname helpers, which overwrite fields of an
already copy_to_user'd struct, into ebt_XXX_to_user helpers that copy all
relevant fields of the struct from scratch.
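The shared name handling in the new ebt_obj_to_user helper can be modeled in isolation: ebtables' ABI expects a fixed 32-byte name field, while xt match/target names are shorter, so the tail must be zero-padded before copying out. A hedged userspace sketch (plain memory copies standing in for copy_to_user):

```c
#include <assert.h>
#include <string.h>

#define EBT_FUNCTION_MAXNAMELEN 32

/* Model of the name copy in ebt_obj_to_user: zero the whole 32-byte
 * field first, then copy at most 31 characters of the (shorter) xt
 * name, so no uninitialized bytes reach userspace. */
static void ebt_pad_name(char out[EBT_FUNCTION_MAXNAMELEN],
                         const char *name)
{
    memset(out, 0, EBT_FUNCTION_MAXNAMELEN);
    strncpy(out, name, EBT_FUNCTION_MAXNAMELEN - 1);
}
```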

Signed-off-by: Willem de Bruijn 
Signed-off-by: Pablo Neira Ayuso 
---
 net/bridge/netfilter/ebtables.c | 78 +
 1 file changed, 47 insertions(+), 31 deletions(-)

diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 537e3d506fc2..79b69917f521 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -1346,56 +1346,72 @@ static int update_counters(struct net *net, const void __user *user,
hlp.num_counters, user, len);
 }
 
-static inline int ebt_make_matchname(const struct ebt_entry_match *m,
-const char *base, char __user *ubase)
+static inline int ebt_obj_to_user(char __user *um, const char *_name,
+ const char *data, int entrysize,
+ int usersize, int datasize)
 {
-   char __user *hlp = ubase + ((char *)m - base);
-   char name[EBT_FUNCTION_MAXNAMELEN] = {};
+   char name[EBT_FUNCTION_MAXNAMELEN] = {0};
 
/* ebtables expects 32 bytes long names but xt_match names are 29 bytes
 * long. Copy 29 bytes and fill remaining bytes with zeroes.
 */
-   strlcpy(name, m->u.match->name, sizeof(name));
-   if (copy_to_user(hlp, name, EBT_FUNCTION_MAXNAMELEN))
+   strlcpy(name, _name, sizeof(name));
+   if (copy_to_user(um, name, EBT_FUNCTION_MAXNAMELEN) ||
+   put_user(datasize, (int __user *)(um + EBT_FUNCTION_MAXNAMELEN)) ||
+   xt_data_to_user(um + entrysize, data, usersize, datasize))
return -EFAULT;
+
return 0;
 }
 
-static inline int ebt_make_watchername(const struct ebt_entry_watcher *w,
-  const char *base, char __user *ubase)
+static inline int ebt_match_to_user(const struct ebt_entry_match *m,
+   const char *base, char __user *ubase)
 {
-   char __user *hlp = ubase + ((char *)w - base);
-   char name[EBT_FUNCTION_MAXNAMELEN] = {};
+   return ebt_obj_to_user(ubase + ((char *)m - base),
+  m->u.match->name, m->data, sizeof(*m),
+  m->u.match->usersize, m->match_size);
+}
 
-   strlcpy(name, w->u.watcher->name, sizeof(name));
-   if (copy_to_user(hlp, name, EBT_FUNCTION_MAXNAMELEN))
-   return -EFAULT;
-   return 0;
+static inline int ebt_watcher_to_user(const struct ebt_entry_watcher *w,
+ const char *base, char __user *ubase)
+{
+   return ebt_obj_to_user(ubase + ((char *)w - base),
+  w->u.watcher->name, w->data, sizeof(*w),
+  w->u.watcher->usersize, w->watcher_size);
 }
 
-static inline int ebt_make_names(struct ebt_entry *e, const char *base,
-char __user *ubase)
+static inline int ebt_entry_to_user(struct ebt_entry *e, const char *base,
+   char __user *ubase)
 {
int ret;
char __user *hlp;
const struct ebt_entry_target *t;
-   char name[EBT_FUNCTION_MAXNAMELEN] = {};
 
-   if (e->bitmask == 0)
+   if (e->bitmask == 0) {
+   /* special case !EBT_ENTRY_OR_ENTRIES */
+   if (copy_to_user(ubase + ((char *)e - base), e,
+sizeof(struct ebt_entries)))
+   return -EFAULT;
return 0;
+   }
+
+   if (copy_to_user(ubase + ((char *)e - base), e, sizeof(*e)))
+   return -EFAULT;
 
hlp = ubase + (((char *)e + e->target_offset) - base);
t = (struct ebt_entry_target *)(((char *)e) + e->target_offset);
 
-   ret = EBT_MATCH_ITERATE(e, ebt_make_matchname, base, ubase);
+   ret = EBT_MATCH_ITERATE(e, ebt_match_to_user, base, ubase);
if (ret != 0)
return ret;
-   ret = EBT_WATCHER_ITERATE(e, ebt_make_watchername, base, ubase);
+   ret = EBT_WATCHER_ITERATE(e, ebt_watcher_to_user, base, ubase);
if (ret != 0)
return ret;
-   strlcpy(name, t->u.target->name, sizeof(name));
-   if (copy_to_user(hlp, name, EBT_FUNCTION_MAXNAMELEN))
-   return -EFAULT;
+   ret = ebt_obj_to_user(hlp, t->u.target->name, t->data, sizeof(*t),
+ t->u.target->usersize, t->target_size);
+   if (ret != 0)
+   return ret;
+
return 0;
 }
 
@@ -1475,13 +1491,9 @@ static int copy_everything_to_user(struct ebt_table *t, void

[PATCH 16/27] netfilter: nft_meta: deal with PACKET_LOOPBACK in netdev family

2017-02-03 Thread Pablo Neira Ayuso
From: Liping Zhang 

After adding the following nft rule and then pinging 224.0.0.1:
  # nft add rule netdev t c pkttype host counter

The following warning message will be printed out again and again:
  WARNING: CPU: 0 PID: 10182 at net/netfilter/nft_meta.c:163 \
   nft_meta_get_eval+0x3fe/0x460 [nft_meta]
  [...]
  Call Trace:
  
  dump_stack+0x85/0xc2
  __warn+0xcb/0xf0
  warn_slowpath_null+0x1d/0x20
  nft_meta_get_eval+0x3fe/0x460 [nft_meta]
  nft_do_chain+0xff/0x5e0 [nf_tables]

So we should deal with PACKET_LOOPBACK in the netdev family too. For IPv4,
convert it to PACKET_BROADCAST/MULTICAST according to the type of the
destination address; for IPv6, convert it to PACKET_MULTICAST directly.
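The IPv4 branch hinges on the 224.0.0.0/4 multicast range tested by ipv4_is_multicast(). A standalone model of the classification (host-byte-order addresses and literal PACKET_* values for illustration; the real code reads the header via skb_header_pointer):

```c
#include <assert.h>
#include <stdint.h>

/* Model of ipv4_is_multicast(): an IPv4 address is multicast when it
 * falls in 224.0.0.0/4, i.e. its top nibble is 0xE. Addresses are in
 * host byte order here for simplicity. */
static int addr_is_multicast(uint32_t daddr)
{
    return (daddr & 0xF0000000u) == 0xE0000000u;
}

/* Classification done by the patch for looped-back IPv4 packets,
 * using the if_packet.h values PACKET_BROADCAST=1, PACKET_MULTICAST=2. */
static int classify_loopback_v4(uint32_t daddr)
{
    return addr_is_multicast(daddr) ? 2 /* PACKET_MULTICAST */
                                    : 1 /* PACKET_BROADCAST */;
}
```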

Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nft_meta.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index 9a22b24346b8..e1f5ca9b423b 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -156,8 +156,34 @@ void nft_meta_get_eval(const struct nft_expr *expr,
case NFPROTO_IPV6:
*dest = PACKET_MULTICAST;
break;
+   case NFPROTO_NETDEV:
+   switch (skb->protocol) {
+   case htons(ETH_P_IP): {
+   int noff = skb_network_offset(skb);
+   struct iphdr *iph, _iph;
+
+   iph = skb_header_pointer(skb, noff,
+sizeof(_iph), &_iph);
+   if (!iph)
+   goto err;
+
+   if (ipv4_is_multicast(iph->daddr))
+   *dest = PACKET_MULTICAST;
+   else
+   *dest = PACKET_BROADCAST;
+
+   break;
+   }
+   case htons(ETH_P_IPV6):
+   *dest = PACKET_MULTICAST;
+   break;
+   default:
+   WARN_ON_ONCE(1);
+   goto err;
+   }
+   break;
default:
-   WARN_ON(1);
+   WARN_ON_ONCE(1);
goto err;
}
break;
-- 
2.1.4



[PATCH 09/27] iptables: use match, target and data copy_to_user helpers

2017-02-03 Thread Pablo Neira Ayuso
From: Willem de Bruijn 

Convert iptables to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv4/netfilter/ip_tables.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 91656a1d8fbd..384b85713e06 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -826,10 +826,6 @@ copy_entries_to_user(unsigned int total_size,
return PTR_ERR(counters);
 
loc_cpu_entry = private->entries;
-   if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) {
-   ret = -EFAULT;
-   goto free_counters;
-   }
 
/* FIXME: use iterator macros --RR */
/* ... then go back and fix counters and names */
@@ -839,6 +835,10 @@ copy_entries_to_user(unsigned int total_size,
const struct xt_entry_target *t;
 
e = (struct ipt_entry *)(loc_cpu_entry + off);
+   if (copy_to_user(userptr + off, e, sizeof(*e))) {
+   ret = -EFAULT;
+   goto free_counters;
+   }
if (copy_to_user(userptr + off
 + offsetof(struct ipt_entry, counters),
 &counters[num],
@@ -852,23 +852,14 @@ copy_entries_to_user(unsigned int total_size,
 i += m->u.match_size) {
m = (void *)e + i;
 
-   if (copy_to_user(userptr + off + i
-+ offsetof(struct xt_entry_match,
-   u.user.name),
-m->u.kernel.match->name,
-strlen(m->u.kernel.match->name)+1)
-   != 0) {
+   if (xt_match_to_user(m, userptr + off + i)) {
ret = -EFAULT;
goto free_counters;
}
}
 
t = ipt_get_target_c(e);
-   if (copy_to_user(userptr + off + e->target_offset
-+ offsetof(struct xt_entry_target,
-   u.user.name),
-t->u.kernel.target->name,
-strlen(t->u.kernel.target->name)+1) != 0) {
+   if (xt_target_to_user(t, userptr + off + e->target_offset)) {
ret = -EFAULT;
goto free_counters;
}
-- 
2.1.4



[PATCH 10/27] ip6tables: use match, target and data copy_to_user helpers

2017-02-03 Thread Pablo Neira Ayuso
From: Willem de Bruijn 

Convert ip6tables to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv6/netfilter/ip6_tables.c | 21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)

diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 25a022d41a70..1e15c54fd5e2 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -855,10 +855,6 @@ copy_entries_to_user(unsigned int total_size,
return PTR_ERR(counters);
 
loc_cpu_entry = private->entries;
-   if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) {
-   ret = -EFAULT;
-   goto free_counters;
-   }
 
/* FIXME: use iterator macros --RR */
/* ... then go back and fix counters and names */
@@ -868,6 +864,10 @@ copy_entries_to_user(unsigned int total_size,
const struct xt_entry_target *t;
 
e = (struct ip6t_entry *)(loc_cpu_entry + off);
+   if (copy_to_user(userptr + off, e, sizeof(*e))) {
+   ret = -EFAULT;
+   goto free_counters;
+   }
if (copy_to_user(userptr + off
 + offsetof(struct ip6t_entry, counters),
 &counters[num],
@@ -881,23 +881,14 @@ copy_entries_to_user(unsigned int total_size,
 i += m->u.match_size) {
m = (void *)e + i;
 
-   if (copy_to_user(userptr + off + i
-+ offsetof(struct xt_entry_match,
-   u.user.name),
-m->u.kernel.match->name,
-strlen(m->u.kernel.match->name)+1)
-   != 0) {
+   if (xt_match_to_user(m, userptr + off + i)) {
ret = -EFAULT;
goto free_counters;
}
}
 
t = ip6t_get_target_c(e);
-   if (copy_to_user(userptr + off + e->target_offset
-+ offsetof(struct xt_entry_target,
-   u.user.name),
-t->u.kernel.target->name,
-strlen(t->u.kernel.target->name)+1) != 0) {
+   if (xt_target_to_user(t, userptr + off + e->target_offset)) {
ret = -EFAULT;
goto free_counters;
}
-- 
2.1.4



[PATCH 06/27] netfilter: conntrack: validate SCTP crc32c in PREROUTING

2017-02-03 Thread Pablo Neira Ayuso
From: Davide Caratti 

Implement sctp_error to let nf_conntrack_in validate the crc32c checksum on
the packet's transport header. Assign skb->ip_summed to CHECKSUM_UNNECESSARY
and return NF_ACCEPT on successful validation; otherwise, return -NF_ACCEPT to
let netfilter skip connection tracking, as other protocols do.

Besides preventing corrupted packets from matching conntrack entries, this
fixes functionality of REJECT target: it was not generating any ICMP upon
reception of SCTP packets, because it was computing RFC 1624 checksum on
the packet and systematically mismatching crc32c in the SCTP header.
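CRC-32C (Castagnoli) differs from the RFC 1624 Internet checksum in both polynomial and bit order, which is why the two can never agree. A minimal bitwise sketch of the algorithm, only as an illustrative model; the kernel's sctp_compute_cksum() uses the LIBCRC32C library, which is table-driven or hardware-accelerated:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32C: reflected algorithm with polynomial
 * 0x1EDC6F41 (reversed form 0x82F63B78), initial value and final
 * XOR of 0xFFFFFFFF. Check value for "123456789" is 0xE3069283. */
static uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
    }
    return ~crc;
}
```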

Signed-off-by: Davide Caratti 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/nf_conntrack_proto_sctp.c | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
index a0efde38da44..44a647418948 100644
--- a/net/netfilter/nf_conntrack_proto_sctp.c
+++ b/net/netfilter/nf_conntrack_proto_sctp.c
@@ -22,7 +22,9 @@
 #include 
 #include 
 #include 
+#include 
 
+#include 
 #include 
 #include 
 #include 
@@ -505,6 +507,34 @@ static bool sctp_new(struct nf_conn *ct, const struct sk_buff *skb,
return true;
 }
 
+static int sctp_error(struct net *net, struct nf_conn *tpl, struct sk_buff *skb,
+ unsigned int dataoff, enum ip_conntrack_info *ctinfo,
+ u8 pf, unsigned int hooknum)
+{
+   const struct sctphdr *sh;
+   struct sctphdr _sctph;
+   const char *logmsg;
+
+   sh = skb_header_pointer(skb, dataoff, sizeof(_sctph), &_sctph);
+   if (!sh) {
+   logmsg = "nf_ct_sctp: short packet ";
+   goto out_invalid;
+   }
+   if (net->ct.sysctl_checksum && hooknum == NF_INET_PRE_ROUTING &&
+   skb->ip_summed == CHECKSUM_NONE) {
+   if (sh->checksum != sctp_compute_cksum(skb, dataoff)) {
+   logmsg = "nf_ct_sctp: bad CRC ";
+   goto out_invalid;
+   }
+   skb->ip_summed = CHECKSUM_UNNECESSARY;
+   }
+   return NF_ACCEPT;
+out_invalid:
+   if (LOG_INVALID(net, IPPROTO_SCTP))
+   nf_log_packet(net, pf, 0, skb, NULL, NULL, NULL, "%s", logmsg);
+   return -NF_ACCEPT;
+}
+
 #if IS_ENABLED(CONFIG_NF_CT_NETLINK)
 
 #include 
@@ -752,6 +782,7 @@ struct nf_conntrack_l4proto nf_conntrack_l4proto_sctp4 __read_mostly = {
.packet = sctp_packet,
.get_timeouts   = sctp_get_timeouts,
.new= sctp_new,
+   .error  = sctp_error,
.me = THIS_MODULE,
 #if IS_ENABLED(CONFIG_NF_CT_NETLINK)
.to_nlattr  = sctp_to_nlattr,
@@ -786,6 +817,7 @@ struct nf_conntrack_l4proto nf_conntrack_l4proto_sctp6 __read_mostly = {
.packet = sctp_packet,
.get_timeouts   = sctp_get_timeouts,
.new= sctp_new,
+   .error  = sctp_error,
.me = THIS_MODULE,
 #if IS_ENABLED(CONFIG_NF_CT_NETLINK)
.to_nlattr  = sctp_to_nlattr,
-- 
2.1.4



[PATCH 04/27] netfilter: nft_ct: add average bytes per packet support

2017-02-03 Thread Pablo Neira Ayuso
From: Liping Zhang 

Similar to xt_connbytes, the user can match the average number of bytes per
packet that a connection has transferred so far.
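The value itself is a guarded integer division of the accounting counters, as in the NFT_CT_AVGPKT case below. A sketch with div64_u64() replaced by plain division for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the NFT_CT_AVGPKT computation: total bytes divided by
 * total packets, with a zero-packet guard so a connection with no
 * accounted packets yields 0 instead of dividing by zero. */
static uint64_t avg_bytes_per_packet(uint64_t bcnt, uint64_t pcnt)
{
    return pcnt ? bcnt / pcnt : 0;
}
```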

Signed-off-by: Liping Zhang 
Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nf_tables.h |  2 ++
 net/netfilter/nft_ct.c   | 22 +++++++++++++++++++++-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 5726f90bfc2f..b00a05d1ee56 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -863,6 +863,7 @@ enum nft_rt_attributes {
  * @NFT_CT_LABELS: conntrack labels
  * @NFT_CT_PKTS: conntrack packets
  * @NFT_CT_BYTES: conntrack bytes
+ * @NFT_CT_AVGPKT: conntrack average bytes per packet
  */
 enum nft_ct_keys {
NFT_CT_STATE,
@@ -881,6 +882,7 @@ enum nft_ct_keys {
NFT_CT_LABELS,
NFT_CT_PKTS,
NFT_CT_BYTES,
+   NFT_CT_AVGPKT,
 };
 
 /**
diff --git a/net/netfilter/nft_ct.c b/net/netfilter/nft_ct.c
index e6baeaebe653..d774d7823688 100644
--- a/net/netfilter/nft_ct.c
+++ b/net/netfilter/nft_ct.c
@@ -129,6 +129,22 @@ static void nft_ct_get_eval(const struct nft_expr *expr,
memcpy(dest, &count, sizeof(count));
return;
}
+   case NFT_CT_AVGPKT: {
+   const struct nf_conn_acct *acct = nf_conn_acct_find(ct);
+   u64 avgcnt = 0, bcnt = 0, pcnt = 0;
+
+   if (acct) {
+   pcnt = nft_ct_get_eval_counter(acct->counter,
+  NFT_CT_PKTS, priv->dir);
+   bcnt = nft_ct_get_eval_counter(acct->counter,
+  NFT_CT_BYTES, priv->dir);
+   if (pcnt != 0)
+   avgcnt = div64_u64(bcnt, pcnt);
+   }
+
+   memcpy(dest, &avgcnt, sizeof(avgcnt));
+   return;
+   }
case NFT_CT_L3PROTOCOL:
*dest = nf_ct_l3num(ct);
return;
@@ -316,6 +332,7 @@ static int nft_ct_get_init(const struct nft_ctx *ctx,
break;
case NFT_CT_BYTES:
case NFT_CT_PKTS:
+   case NFT_CT_AVGPKT:
/* no direction? return sum of original + reply */
if (tb[NFTA_CT_DIRECTION] == NULL)
priv->dir = IP_CT_DIR_MAX;
@@ -346,7 +363,9 @@ static int nft_ct_get_init(const struct nft_ctx *ctx,
if (err < 0)
return err;
 
-   if (priv->key == NFT_CT_BYTES || priv->key == NFT_CT_PKTS)
+   if (priv->key == NFT_CT_BYTES ||
+   priv->key == NFT_CT_PKTS  ||
+   priv->key == NFT_CT_AVGPKT)
nf_ct_set_acct(ctx->net, true);
 
return 0;
@@ -445,6 +464,7 @@ static int nft_ct_get_dump(struct sk_buff *skb, const struct nft_expr *expr)
break;
case NFT_CT_BYTES:
case NFT_CT_PKTS:
+   case NFT_CT_AVGPKT:
if (priv->dir < IP_CT_DIR_MAX &&
nla_put_u8(skb, NFTA_CT_DIRECTION, priv->dir))
goto nla_put_failure;
-- 
2.1.4



[PATCH 05/27] netfilter: select LIBCRC32C together with SCTP conntrack

2017-02-03 Thread Pablo Neira Ayuso
From: Davide Caratti 

nf_conntrack needs to compute crc32c when dealing with SCTP packets.
Moreover, NF_NAT_PROTO_SCTP (currently selecting LIBCRC32C) can be enabled
only if conntrack support for SCTP is enabled. Therefore, move the LIBCRC32C
select so that kernel crc32c support is enabled whenever NF_CT_PROTO_SCTP=y.

Signed-off-by: Davide Caratti 
Reviewed-by: Marcelo Ricardo Leitner 
Signed-off-by: Pablo Neira Ayuso 
---
 net/netfilter/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 63729b489c2c..6d425e355bf5 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -162,6 +162,7 @@ config NF_CT_PROTO_SCTP
bool 'SCTP protocol connection tracking support'
depends on NETFILTER_ADVANCED
default y
+   select LIBCRC32C
help
  With this option enabled, the layer 3 independent connection
  tracking code will be able to do state tracking on SCTP connections.
@@ -397,7 +398,6 @@ config NF_NAT_PROTO_SCTP
bool
default NF_NAT && NF_CT_PROTO_SCTP
depends on NF_NAT && NF_CT_PROTO_SCTP
-   select LIBCRC32C
 
 config NF_NAT_AMANDA
tristate
-- 
2.1.4



[PATCH 08/27] xtables: add xt_match, xt_target and data copy_to_user functions

2017-02-03 Thread Pablo Neira Ayuso
From: Willem de Bruijn 

xt_entry_target, xt_entry_match and their private data may contain
kernel data.

Introduce helper functions xt_match_to_user, xt_target_to_user and
xt_data_to_user that copy only the expected fields. These replace
existing logic that calls copy_to_user on entire structs, then
overwrites select fields.

Private data is defined in xt_match and xt_target. All matches and
targets that maintain kernel data store this at the tail of their
private structure. Extend xt_match and xt_target with .usersize to
limit how many bytes of data are copied. The remainder is cleared.

If compatsize is specified, usersize can only safely be used if all
fields up to usersize use platform-independent types. Otherwise, the
compat_to_user callback must be defined.

This patch does not yet enable the support logic.
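The copy-then-clear semantics of the new xt_data_to_user helper can be modeled in userspace (hypothetical names, with plain memcpy/memset standing in for copy_to_user/clear_user):

```c
#include <assert.h>
#include <string.h>

/* Userspace model of xt_data_to_user: copy only the first `usersize`
 * bytes of the private data and zero the tail, so kernel-internal
 * fields stored past usersize are never exposed. A usersize of 0
 * means "copy everything", matching the `usersize ? : size` idiom. */
static int data_to_user_model(void *dst, const void *src,
                              int usersize, int size)
{
    usersize = usersize ? usersize : size;
    memcpy(dst, src, (size_t)usersize);
    memset((char *)dst + usersize, 0, (size_t)(size - usersize));
    return 0;
}
```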

Signed-off-by: Willem de Bruijn 
Signed-off-by: Pablo Neira Ayuso 
---
 include/linux/netfilter/x_tables.h |  9 +++
 net/netfilter/x_tables.c   | 54 ++
 2 files changed, 63 insertions(+)

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 5117e4d2ddfa..be378cf47fcc 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -167,6 +167,7 @@ struct xt_match {
 
const char *table;
unsigned int matchsize;
+   unsigned int usersize;
 #ifdef CONFIG_COMPAT
unsigned int compatsize;
 #endif
@@ -207,6 +208,7 @@ struct xt_target {
 
const char *table;
unsigned int targetsize;
+   unsigned int usersize;
 #ifdef CONFIG_COMPAT
unsigned int compatsize;
 #endif
@@ -287,6 +289,13 @@ int xt_check_match(struct xt_mtchk_param *, unsigned int size, u_int8_t proto,
 int xt_check_target(struct xt_tgchk_param *, unsigned int size, u_int8_t proto,
bool inv_proto);
 
+int xt_match_to_user(const struct xt_entry_match *m,
+struct xt_entry_match __user *u);
+int xt_target_to_user(const struct xt_entry_target *t,
+ struct xt_entry_target __user *u);
+int xt_data_to_user(void __user *dst, const void *src,
+   int usersize, int size);
+
 void *xt_copy_counters_from_user(const void __user *user, unsigned int len,
 struct xt_counters_info *info, bool compat);
 
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 2ff499680cc6..feccf527abdd 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -262,6 +262,60 @@ struct xt_target *xt_request_find_target(u8 af, const char *name, u8 revision)
 }
 EXPORT_SYMBOL_GPL(xt_request_find_target);
 
+
+static int xt_obj_to_user(u16 __user *psize, u16 size,
+ void __user *pname, const char *name,
+ u8 __user *prev, u8 rev)
+{
+   if (put_user(size, psize))
+   return -EFAULT;
+   if (copy_to_user(pname, name, strlen(name) + 1))
+   return -EFAULT;
+   if (put_user(rev, prev))
+   return -EFAULT;
+
+   return 0;
+}
+
+#define XT_OBJ_TO_USER(U, K, TYPE, C_SIZE) \
+   xt_obj_to_user(&U->u.TYPE##_size, C_SIZE ? : K->u.TYPE##_size,  \
+  U->u.user.name, K->u.kernel.TYPE->name,  \
+  &U->u.user.revision, K->u.kernel.TYPE->revision)
+
+int xt_data_to_user(void __user *dst, const void *src,
+   int usersize, int size)
+{
+   usersize = usersize ? : size;
+   if (copy_to_user(dst, src, usersize))
+   return -EFAULT;
+   if (usersize != size && clear_user(dst + usersize, size - usersize))
+   return -EFAULT;
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(xt_data_to_user);
+
+#define XT_DATA_TO_USER(U, K, TYPE, C_SIZE)\
+   xt_data_to_user(U->data, K->data,   \
+   K->u.kernel.TYPE->usersize, \
+   C_SIZE ? : K->u.kernel.TYPE->TYPE##size)
+
+int xt_match_to_user(const struct xt_entry_match *m,
+struct xt_entry_match __user *u)
+{
+   return XT_OBJ_TO_USER(u, m, match, 0) ||
+  XT_DATA_TO_USER(u, m, match, 0);
+}
+EXPORT_SYMBOL_GPL(xt_match_to_user);
+
+int xt_target_to_user(const struct xt_entry_target *t,
+ struct xt_entry_target __user *u)
+{
+   return XT_OBJ_TO_USER(u, t, target, 0) ||
+  XT_DATA_TO_USER(u, t, target, 0);
+}
+EXPORT_SYMBOL_GPL(xt_target_to_user);
+
 static int match_revfn(u8 af, const char *name, u8 revision, int *bestp)
 {
const struct xt_match *m;
-- 
2.1.4



[PATCH 11/27] arptables: use match, target and data copy_to_user helpers

2017-02-03 Thread Pablo Neira Ayuso
From: Willem de Bruijn 

Convert arptables to copying entries, matches and targets one by one,
using the xt_match_to_user and xt_target_to_user helper functions.

Signed-off-by: Willem de Bruijn 
Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv4/netfilter/arp_tables.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index a467e1236c43..6241a81fd7f5 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -677,11 +677,6 @@ static int copy_entries_to_user(unsigned int total_size,
return PTR_ERR(counters);
 
loc_cpu_entry = private->entries;
-   /* ... then copy entire thing ... */
-   if (copy_to_user(userptr, loc_cpu_entry, total_size) != 0) {
-   ret = -EFAULT;
-   goto free_counters;
-   }
 
/* FIXME: use iterator macros --RR */
/* ... then go back and fix counters and names */
@@ -689,6 +684,10 @@ static int copy_entries_to_user(unsigned int total_size,
const struct xt_entry_target *t;
 
e = (struct arpt_entry *)(loc_cpu_entry + off);
+   if (copy_to_user(userptr + off, e, sizeof(*e))) {
+   ret = -EFAULT;
+   goto free_counters;
+   }
if (copy_to_user(userptr + off
 + offsetof(struct arpt_entry, counters),
 &counters[num],
@@ -698,11 +697,7 @@ static int copy_entries_to_user(unsigned int total_size,
}
 
t = arpt_get_target_c(e);
-   if (copy_to_user(userptr + off + e->target_offset
-+ offsetof(struct xt_entry_target,
-   u.user.name),
-t->u.kernel.target->name,
-strlen(t->u.kernel.target->name)+1) != 0) {
+   if (xt_target_to_user(t, userptr + off + e->target_offset)) {
ret = -EFAULT;
goto free_counters;
}
-- 
2.1.4



[PATCH 01/27] netfilter: merge udp and udplite conntrack helpers

2017-02-03 Thread Pablo Neira Ayuso
From: Florian Westphal 

udplite was copied from udp; they are virtually 100% identical.

This adds the udplite tracker to udp instead, removes the udplite
module, and then makes the udplite tracker builtin.

udplite will then simply re-use the udp timeout settings.
It makes little sense to add separate sysctls; nowadays we have
fine-grained timeout policy support via the CT target.

old:
 text    data   bss    dec    hex filename
 1633     672     0   2305    901 nf_conntrack_proto_udp.o
 1756     672     0   2428    97c nf_conntrack_proto_udplite.o
69526   17937   268  87731  156b3 nf_conntrack.ko

new:
 text    data   bss    dec    hex filename
 2442    1184     0   3626    e2a nf_conntrack_proto_udp.o
68565   17721   268  86554  1521a nf_conntrack.ko

Signed-off-by: Florian Westphal 
Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/ipv4/nf_conntrack_ipv4.h |   1 +
 include/net/netfilter/ipv6/nf_conntrack_ipv6.h |   1 +
 include/net/netns/conntrack.h  |  16 --
 net/netfilter/Makefile |   1 -
 net/netfilter/nf_conntrack_proto_udp.c | 123 ++
 net/netfilter/nf_conntrack_proto_udplite.c | 324 -
 6 files changed, 125 insertions(+), 341 deletions(-)
 delete mode 100644 net/netfilter/nf_conntrack_proto_udplite.c

diff --git a/include/net/netfilter/ipv4/nf_conntrack_ipv4.h b/include/net/netfilter/ipv4/nf_conntrack_ipv4.h
index 919e4e8af327..6ff32815641b 100644
--- a/include/net/netfilter/ipv4/nf_conntrack_ipv4.h
+++ b/include/net/netfilter/ipv4/nf_conntrack_ipv4.h
@@ -14,6 +14,7 @@ extern struct nf_conntrack_l3proto nf_conntrack_l3proto_ipv4;
 
 extern struct nf_conntrack_l4proto nf_conntrack_l4proto_tcp4;
 extern struct nf_conntrack_l4proto nf_conntrack_l4proto_udp4;
+extern struct nf_conntrack_l4proto nf_conntrack_l4proto_udplite4;
 extern struct nf_conntrack_l4proto nf_conntrack_l4proto_icmp;
 #ifdef CONFIG_NF_CT_PROTO_DCCP
 extern struct nf_conntrack_l4proto nf_conntrack_l4proto_dccp4;
diff --git a/include/net/netfilter/ipv6/nf_conntrack_ipv6.h b/include/net/netfilter/ipv6/nf_conntrack_ipv6.h
index eaea968f8657..c59b82456f89 100644
--- a/include/net/netfilter/ipv6/nf_conntrack_ipv6.h
+++ b/include/net/netfilter/ipv6/nf_conntrack_ipv6.h
@@ -5,6 +5,7 @@ extern struct nf_conntrack_l3proto nf_conntrack_l3proto_ipv6;
 
 extern struct nf_conntrack_l4proto nf_conntrack_l4proto_tcp6;
 extern struct nf_conntrack_l4proto nf_conntrack_l4proto_udp6;
+extern struct nf_conntrack_l4proto nf_conntrack_l4proto_udplite6;
 extern struct nf_conntrack_l4proto nf_conntrack_l4proto_icmpv6;
 #ifdef CONFIG_NF_CT_PROTO_DCCP
 extern struct nf_conntrack_l4proto nf_conntrack_l4proto_dccp6;
diff --git a/include/net/netns/conntrack.h b/include/net/netns/conntrack.h
index cf799fc3fdec..17724c62de97 100644
--- a/include/net/netns/conntrack.h
+++ b/include/net/netns/conntrack.h
@@ -69,19 +69,6 @@ struct nf_sctp_net {
 };
 #endif
 
-#ifdef CONFIG_NF_CT_PROTO_UDPLITE
-enum udplite_conntrack {
-   UDPLITE_CT_UNREPLIED,
-   UDPLITE_CT_REPLIED,
-   UDPLITE_CT_MAX
-};
-
-struct nf_udplite_net {
-   struct nf_proto_net pn;
-   unsigned int timeouts[UDPLITE_CT_MAX];
-};
-#endif
-
 struct nf_ip_net {
struct nf_generic_net   generic;
struct nf_tcp_net   tcp;
@@ -94,9 +81,6 @@ struct nf_ip_net {
 #ifdef CONFIG_NF_CT_PROTO_SCTP
struct nf_sctp_net  sctp;
 #endif
-#ifdef CONFIG_NF_CT_PROTO_UDPLITE
-   struct nf_udplite_net   udplite;
-#endif
 };
 
 struct ct_pcpu {
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index ca30d1960f1d..bf5c577113b6 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -7,7 +7,6 @@ nf_conntrack-$(CONFIG_NF_CONNTRACK_EVENTS) += nf_conntrack_ecache.o
 nf_conntrack-$(CONFIG_NF_CONNTRACK_LABELS) += nf_conntrack_labels.o
 nf_conntrack-$(CONFIG_NF_CT_PROTO_DCCP) += nf_conntrack_proto_dccp.o
 nf_conntrack-$(CONFIG_NF_CT_PROTO_SCTP) += nf_conntrack_proto_sctp.o
-nf_conntrack-$(CONFIG_NF_CT_PROTO_UDPLITE) += nf_conntrack_proto_udplite.o
 
 obj-$(CONFIG_NETFILTER) = netfilter.o
 
diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
index 20f35ed68030..ae63944c9dc4 100644
--- a/net/netfilter/nf_conntrack_proto_udp.c
+++ b/net/netfilter/nf_conntrack_proto_udp.c
@@ -108,6 +108,59 @@ static bool udp_new(struct nf_conn *ct, const struct sk_buff *skb,
return true;
 }
 
+#ifdef CONFIG_NF_CT_PROTO_UDPLITE
+static int udplite_error(struct net *net, struct nf_conn *tmpl,
+struct sk_buff *skb,
+unsigned int dataoff,
+enum ip_conntrack_info *ctinfo,
+u8 pf, unsigned int hooknum)
+{
+   unsigned int udplen = skb->len - dataoff;
+   const struct udphdr *hdr;
+   struct udphdr _hdr;
+   unsigned int cscov;
+
+   /* Header is too small? */
+   hdr = skb_header_pointer(skb, dataoff, sizeof(_

Re: [PATCH net-next 0/2] Extract IFE logic to module

2017-02-03 Thread Jamal Hadi Salim

On 17-02-02 06:12 AM, Yotam Gigi wrote:

-Original Message-




I have no objection to this modularisation but I am curious to know
if you have a use-case in mind. My understanding is that earlier versions
of the sample action used IFE but that is not the case in the version that
was ultimately accepted.


Hi Simon.

You are right that the patches were done for the former version of the sample
classifier, and they are not required for the current version. We don't have a
current use case in mind, but I did send the patches because I think they can
help others, or us, in the future.


For what it's worth, given that Yotam has done this work and vetted it, and
we've reviewed and discussed it in the past, I am going to sign off on it.


cheers,
jamal



Re: [PATCH 1/3] net/sched: act_ife: Unexport ife_tlv_meta_encode

2017-02-03 Thread Jamal Hadi Salim

On 17-02-01 08:30 AM, Yotam Gigi wrote:

As the function ife_tlv_meta_encode is not used by any other module,
unexport it and make it static for the act_ife module.

Signed-off-by: Yotam Gigi 


Signed-off-by: Jamal Hadi Salim 

cheers,
jamal



Re: [PATCH 2/3] net: Introduce ife encapsulation module

2017-02-03 Thread Jamal Hadi Salim

On 17-02-01 08:30 AM, Yotam Gigi wrote:

This module is responsible for the ife encapsulation protocol
encode/decode logics. That module can:
 - ife_encode: encode skb and reserve space for the ife meta header
 - ife_decode: decode skb and extract the meta header size
 - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
   header space.
 - ife_tlv_meta_decode - decodes one tlv entry from the packet
 - ife_tlv_meta_next - advance to the next tlv

Reviewed-by: Jiri Pirko 
Signed-off-by: Yotam Gigi 



Signed-off-by: Jamal Hadi Salim 

cheers,
jamal


Re: [PATCH 3/3] net/sched: act_ife: Change to use ife module

2017-02-03 Thread Jamal Hadi Salim

On 17-02-01 08:30 AM, Yotam Gigi wrote:

Use the encode/decode functionality from the ife module instead of using
implementation inside the act_ife.

Reviewed-by: Jiri Pirko 
Signed-off-by: Yotam Gigi 


Signed-off-by: Jamal Hadi Salim 


cheers,
jamal


Re: [PATCH 14/17] net: stmmac: print phy information

2017-02-03 Thread Corentin Labbe
On Tue, Jan 31, 2017 at 11:10:04AM +0100, Giuseppe CAVALLARO wrote:
> On 1/31/2017 10:11 AM, Corentin Labbe wrote:
> > When a PHY is found, printing which one was found (and which type/model) is
> > good information to have.
> >
> > Signed-off-by: Corentin Labbe 
> > ---
> >  drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
> > b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > index e53b727..3d52b8c 100644
> > --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> > @@ -885,6 +885,7 @@ static int stmmac_init_phy(struct net_device *dev)
> > netdev_dbg(priv->dev, "%s: attached to PHY (UID 0x%x) Link = %d\n",
> >__func__, phydev->phy_id, phydev->link);
> >
> > +   phy_attached_info(phydev);
> 
> maybe we could remove the netdev_dbg above and just keep
> phy_attached_info(phydev);
> 
> peppe
> 

Ok, I will remove it

Regards
Corentin Labbe


Re: "TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised." message with "ethtool -K eth0 gro off"

2017-02-03 Thread Eric Dumazet
On Fri, 2017-02-03 at 09:54 -0200, Marcelo Ricardo Leitner wrote:
> On Thu, Feb 02, 2017 at 05:59:24AM -0800, Eric Dumazet wrote:
> > On Thu, 2017-02-02 at 05:31 -0800, Eric Dumazet wrote:
> > 
> > > Anyway, I suspect the test is simply buggy ;)
> > > 
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 
> > > 41dcbd568cbe2403f2a9e659669afe462a42e228..5394a39fcce964a7fe7075b1531a8a1e05550a54
> > >  100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -164,7 +164,7 @@ static void tcp_measure_rcv_mss(struct sock *sk, 
> > > const struct sk_buff *skb)
> > >   if (len >= icsk->icsk_ack.rcv_mss) {
> > >   icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> > >  tcp_sk(sk)->advmss);
> > > - if (unlikely(icsk->icsk_ack.rcv_mss != len))
> > > + if (unlikely(icsk->icsk_ack.rcv_mss != len && skb_is_gso(skb)))
> > >   tcp_gro_dev_warn(sk, skb);
> > >   } else {
> > >   /* Otherwise, we make more careful check taking into account,
> > 
> > This wont really help.
> > 
> > Our tcp_sk(sk)->advmss can be lower than the MSS used by the remote
> > peer.
> > 
> > ip ro add  advmss 512
> 
> I don't follow. With a good driver, how can advmss be smaller than the
> MSS used by the remote peer? Even with the route entry above, I get
> segments just up to advmss, and no warning.
> 

A TCP flow has two ends.

Common MTU = 1500

One can have advmss 500, the other one no advmss (or the standard 1460
one)

So if we compare apple and orange, result might be shocking ;)

If you want to reproduce this use the "ip ro add  advmss 512" hint,
and/or play with sysctl_tcp_mtu_probing






Re: [PATCHv2 net-next 05/16] net: mvpp2: introduce PPv2.2 HW descriptors and adapt accessors

2017-02-03 Thread Thomas Petazzoni
Hello,

On Fri, 6 Jan 2017 14:44:56 +, Robin Murphy wrote:

> >> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
> >> +  dma_addr_t dma_addr =
> >> +  rx_desc->pp22.buf_phys_addr_key_hash & DMA_BIT_MASK(40);
> >> +  phys_addr_t phys_addr =
> >> +  dma_to_phys(port->dev->dev.parent, dma_addr);  
> 
> Ugh, this looks bogus. dma_to_phys(), in the arm64 case at least, is
> essentially a SWIOTLB internal helper function which has to be
> implemented in architecture code because reasons. Calling it from a
> driver is almost certainly wrong (it doesn't even exist on most
> architectures). Besides, if this is really a genuine dma_addr_t obtained
> from a DMA API call, you cannot infer it to be related to a CPU physical
> address, or convertible to one at all.

So do you have a better suggestion? The descriptors only have enough
space to store a 40-bit virtual address, which is not enough to fit the
virtual addresses used by Linux for SKBs. This is why I'm instead
relying on the fact that the descriptors can store the 40-bit physical
address, and convert it back to a virtual address, which should be fine
on ARM64 because the entire physical memory is part of the kernel linear
mapping.

> >> +  return (unsigned long)phys_to_virt(phys_addr);
> >> +#else
> >> +  return rx_desc->pp22.buf_cookie_misc & DMA_BIT_MASK(40);
> >> +#endif  
> > 
> > I'm not sure that's the best way of selecting the difference.  
> 
> Given that CONFIG_ARCH_DMA_ADDR_T_64BIT could be enabled on 32-bit LPAE
> systems, indeed it definitely isn't.

Russell proposal of testing the size of a virtual address
pointer instead would solve this I believe, correct?

Thanks,

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com


Re: [PATCH net-next] sfc: get rid of custom busy polling code

2017-02-03 Thread Eric Dumazet
On Fri, 2017-02-03 at 09:14 +, Bert Kenward wrote:
> On 03/02/17 01:13, Eric Dumazet wrote:
> > From: Eric Dumazet 
> > 
> > In linux-4.5, busy polling was implemented in the core
> > NAPI stack, meaning that all custom implementations can
> > be removed from drivers.
> > 
> > Not only do we remove lots of tricky code, we also remove
> > one lock operation in the fast path.
> > 
> > Signed-off-by: Eric Dumazet 
> > Cc: Edward Cree 
> > Cc: Bert Kenward 
> 
> We were talking about doing this just yesterday.
> Thanks Eric.
> 
> Acked-by: Bert Kenward 

Excellent then, thanks !




Re: [PATCH 1/3] net/sched: act_ife: Unexport ife_tlv_meta_encode

2017-02-03 Thread Roman Mashak
Yotam Gigi  writes:

> As the function ife_tlv_meta_encode is not used by any other module,
> unexport it and make it static for the act_ife module.
>
> Signed-off-by: Yotam Gigi 

Signed-off-by: Roman Mashak 

-- 
Roman Mashak


Re: [PATCH 2/3] net: Introduce ife encapsulation module

2017-02-03 Thread Roman Mashak
Yotam Gigi  writes:

> This module is responsible for the ife encapsulation protocol
> encode/decode logics. That module can:
>  - ife_encode: encode skb and reserve space for the ife meta header
>  - ife_decode: decode skb and extract the meta header size
>  - ife_tlv_meta_encode - encodes one tlv entry into the reserved ife
>header space.
>  - ife_tlv_meta_decode - decodes one tlv entry from the packet
>  - ife_tlv_meta_next - advance to the next tlv
>
> Reviewed-by: Jiri Pirko 
> Signed-off-by: Yotam Gigi 

Signed-off-by: Roman Mashak 

-- 
Roman Mashak


Re: [PATCH 3/3] net/sched: act_ife: Change to use ife module

2017-02-03 Thread Roman Mashak
Yotam Gigi  writes:

> Use the encode/decode functionality from the ife module instead of using
> implementation inside the act_ife.
>
> Reviewed-by: Jiri Pirko 
> Signed-off-by: Yotam Gigi 

Signed-off-by: Roman Mashak 

-- 
Roman Mashak


Re: [PATCH 13/17] net: stmmac: Implement NAPI for TX

2017-02-03 Thread Corentin Labbe
On Tue, Jan 31, 2017 at 11:12:25PM -0500, David Miller wrote:
> From: Corentin Labbe 
> Date: Tue, 31 Jan 2017 10:11:48 +0100
> 
> > The stmmac driver run TX completion under NAPI but without checking
> > the work done by the TX completion function.
> 
> The current behavior is correct and completely intentional.
> 
> A driver should _never_ account TX work to the NAPI poll budget.
> 
> This is because TX liberation is orders of magnitude cheaper than
> receiving a packet, and such SKB freeing makes more SKBs available
> for RX processing.
> 
> Therefore, TX work should never count against the NAPI budget.
> 
> Please do not fix something which is not broken.

So at least the documentation I read must be fixed 
(https://wiki.linuxfoundation.org/networking/napi)

So perhaps the best way is to do as intel igb/ixgbe do, keeping it under NAPI
until the stmmac_tx_clean function says that it has finished handling the queue?

Regards
Corentin Labbe


Re: "TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised." message with "ethtool -K eth0 gro off"

2017-02-03 Thread Marcelo Ricardo Leitner
On Fri, Feb 03, 2017 at 05:24:06AM -0800, Eric Dumazet wrote:
> On Fri, 2017-02-03 at 09:54 -0200, Marcelo Ricardo Leitner wrote:
> > On Thu, Feb 02, 2017 at 05:59:24AM -0800, Eric Dumazet wrote:
> > > On Thu, 2017-02-02 at 05:31 -0800, Eric Dumazet wrote:
> > > 
> > > > Anyway, I suspect the test is simply buggy ;)
> > > > 
> > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > > index 
> > > > 41dcbd568cbe2403f2a9e659669afe462a42e228..5394a39fcce964a7fe7075b1531a8a1e05550a54
> > > >  100644
> > > > --- a/net/ipv4/tcp_input.c
> > > > +++ b/net/ipv4/tcp_input.c
> > > > @@ -164,7 +164,7 @@ static void tcp_measure_rcv_mss(struct sock *sk, 
> > > > const struct sk_buff *skb)
> > > > if (len >= icsk->icsk_ack.rcv_mss) {
> > > > icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> > > >tcp_sk(sk)->advmss);
> > > > -   if (unlikely(icsk->icsk_ack.rcv_mss != len))
> > > > +   if (unlikely(icsk->icsk_ack.rcv_mss != len && 
> > > > skb_is_gso(skb)))
> > > > tcp_gro_dev_warn(sk, skb);
> > > > } else {
> > > > /* Otherwise, we make more careful check taking into 
> > > > account,
> > > 
> > > This wont really help.
> > > 
> > > Our tcp_sk(sk)->advmss can be lower than the MSS used by the remote
> > > peer.
> > > 
> > > ip ro add  advmss 512
> > 
> > I don't follow. With a good driver, how can advmss be smaller than the
> > MSS used by the remote peer? Even with the route entry above, I get
> > segments just up to advmss, and no warning.
> > 
> 
> A TCP flow has two ends.

Indeed, though it should mostly be about only one of them.

> 
> Common MTU = 1500
> 
> One can have advmss 500, the other one no advmss (or the standard 1460
> one)

Considering the rx side of peer A. Peer A advertises a given MSS to peer
B and should not receive any segment from peer B larger than so.
I'm failing to see how advmss can be smaller than the segment size just
received.

> 
> So if we compare apple and orange, result might be shocking ;)

Yes heh just not seeing the mix here..

> 
> If you want to reproduce this use the "ip ro add  advmss 512" hint,
> and/or play with sysctl_tcp_mtu_probing

I tried the route with advmss, no luck so far.
Still digging..

  Marcelo



Re: [PATCHv2 net-next 05/16] net: mvpp2: introduce PPv2.2 HW descriptors and adapt accessors

2017-02-03 Thread Robin Murphy
On 03/02/17 13:24, Thomas Petazzoni wrote:
> Hello,
> 
> On Fri, 6 Jan 2017 14:44:56 +, Robin Murphy wrote:
> 
>>>> +#ifdef CONFIG_ARCH_DMA_ADDR_T_64BIT
>>>> +  dma_addr_t dma_addr =
>>>> +  rx_desc->pp22.buf_phys_addr_key_hash & DMA_BIT_MASK(40);
>>>> +  phys_addr_t phys_addr =
>>>> +  dma_to_phys(port->dev->dev.parent, dma_addr);  
>>
>> Ugh, this looks bogus. dma_to_phys(), in the arm64 case at least, is
>> essentially a SWIOTLB internal helper function which has to be
>> implemented in architecture code because reasons. Calling it from a
>> driver is almost certainly wrong (it doesn't even exist on most
>> architectures). Besides, if this is really a genuine dma_addr_t obtained
>> from a DMA API call, you cannot infer it to be related to a CPU physical
>> address, or convertible to one at all.
> 
> So do you have a better suggestion? The descriptors only have enough
> space to store a 40-bit virtual address, which is not enough to fit the
> virtual addresses used by Linux for SKBs. This is why I'm instead
> relying on the fact that the descriptors can store the 40-bit physical
> address, and convert it back to a virtual address, which should be fine
> on ARM64 because the entire physical memory is part of the kernel linear
> mapping.

OK, that has nothing to do with DMA addresses then.

>>>> +  return (unsigned long)phys_to_virt(phys_addr);
>>>> +#else
>>>> +  return rx_desc->pp22.buf_cookie_misc & DMA_BIT_MASK(40);
>>>> +#endif  
>>>
>>> I'm not sure that's the best way of selecting the difference.  
>>
>> Given that CONFIG_ARCH_DMA_ADDR_T_64BIT could be enabled on 32-bit LPAE
>> systems, indeed it definitely isn't.
> 
> Russell proposal of testing the size of a virtual address
> pointer instead would solve this I believe, correct?

AFAICS, even that shouldn't really be necessary - for all VA/PA
combinations of 32/32, 32/40 and 64/40, storing virt_to_phys() of the
SKB VA won't overflow 40 bits, so a corresponding phys_to_virt() at the
other end can't go wrong either. If you really do want to special-case
things based on VA size, though, either CONFIG_64BIT or sizeof(void *)
would indeed be infinitely more useful than the unrelated DMA address
width - I know this driver's never going to run on SPARC64, but that's
one example of where the above logic would lead to precisely the
truncated VA it's trying to avoid.

Robin.

> 
> Thanks,
> 
> Thomas
> 



Re: [PATCH net-next] sctp: process fwd tsn chunk only when prsctp is enabled

2017-02-03 Thread Neil Horman
On Fri, Feb 03, 2017 at 05:37:06PM +0800, Xin Long wrote:
> This patch is to check if asoc->peer.prsctp_capable is set before
> processing fwd tsn chunk, if not, it will return an ERROR to the
> peer, just as rfc3758 section 3.3.1 demands.
> 
> Reported-by: Julian Cordes 
> Signed-off-by: Xin Long 
> ---
>  net/sctp/sm_statefuns.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> index 782e579..d8798dd 100644
> --- a/net/sctp/sm_statefuns.c
> +++ b/net/sctp/sm_statefuns.c
> @@ -3867,6 +3867,9 @@ sctp_disposition_t sctp_sf_eat_fwd_tsn(struct net *net,
>   return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
>   }
>  
> + if (!asoc->peer.prsctp_capable)
> + return sctp_sf_unk_chunk(net, ep, asoc, type, arg, commands);
> +
>   /* Make sure that the FORWARD_TSN chunk has valid length.  */
>   if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_fwdtsn_chunk)))
>   return sctp_sf_violation_chunklen(net, ep, asoc, type, arg,
> @@ -3935,6 +3938,9 @@ sctp_disposition_t sctp_sf_eat_fwd_tsn_fast(
>   return sctp_sf_pdiscard(net, ep, asoc, type, arg, commands);
>   }
>  
> + if (!asoc->peer.prsctp_capable)
> + return sctp_sf_unk_chunk(net, ep, asoc, type, arg, commands);
> +
>   /* Make sure that the FORWARD_TSN chunk has a valid length.  */
>   if (!sctp_chunk_length_valid(chunk, sizeof(struct sctp_fwdtsn_chunk)))
>   return sctp_sf_violation_chunklen(net, ep, asoc, type, arg,
> -- 
> 2.1.0
> 
> 
Acked-by: Neil Horman 


Re: "TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised." message with "ethtool -K eth0 gro off"

2017-02-03 Thread Eric Dumazet
On Fri, 2017-02-03 at 11:53 -0200, Marcelo Ricardo Leitner wrote:
> On Fri, Feb 03, 2017 at 05:24:06AM -0800, Eric Dumazet wrote:
> > On Fri, 2017-02-03 at 09:54 -0200, Marcelo Ricardo Leitner wrote:
> > > On Thu, Feb 02, 2017 at 05:59:24AM -0800, Eric Dumazet wrote:
> > > > On Thu, 2017-02-02 at 05:31 -0800, Eric Dumazet wrote:
> > > > 
> > > > > Anyway, I suspect the test is simply buggy ;)
> > > > > 
> > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > > > index 
> > > > > 41dcbd568cbe2403f2a9e659669afe462a42e228..5394a39fcce964a7fe7075b1531a8a1e05550a54
> > > > >  100644
> > > > > --- a/net/ipv4/tcp_input.c
> > > > > +++ b/net/ipv4/tcp_input.c
> > > > > @@ -164,7 +164,7 @@ static void tcp_measure_rcv_mss(struct sock *sk, 
> > > > > const struct sk_buff *skb)
> > > > >   if (len >= icsk->icsk_ack.rcv_mss) {
> > > > >   icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> > > > >  tcp_sk(sk)->advmss);
> > > > > - if (unlikely(icsk->icsk_ack.rcv_mss != len))
> > > > > + if (unlikely(icsk->icsk_ack.rcv_mss != len && 
> > > > > skb_is_gso(skb)))
> > > > >   tcp_gro_dev_warn(sk, skb);
> > > > >   } else {
> > > > >   /* Otherwise, we make more careful check taking into 
> > > > > account,
> > > > 
> > > > This wont really help.
> > > > 
> > > > Our tcp_sk(sk)->advmss can be lower than the MSS used by the remote
> > > > peer.
> > > > 
> > > > ip ro add  advmss 512
> > > 
> > > I don't follow. With a good driver, how can advmss be smaller than the
> > > MSS used by the remote peer? Even with the route entry above, I get
> > > segments just up to advmss, and no warning.
> > > 
> > 
> > A TCP flow has two ends.
> 
> Indeed, though should be mostly about only one of them.
> 
> > 
> > Common MTU = 1500
> > 
> > One can have advmss 500, the other one no advmss (or the standard 1460
> > one)
> 
> Considering the rx side of peer A. Peer A advertises a given MSS to peer
> B and should not receive any segment from peer B larger than so.
> I'm failing to see how advmss can be smaller than the segment size just
> received.

tcp_sk(sk)->advmss records what the peer announced during its SYN (or
SYNACK) message, in the MSS option.

Nothing prevents the peer to change its mind later.

Eg starting with MSS 512, then switch later to sending packets of 1024
or 1400 bytes.

So the innocent NIC driver is not the problem here.




Re: "TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised." message with "ethtool -K eth0 gro off"

2017-02-03 Thread Marcelo Ricardo Leitner
On Fri, Feb 03, 2017 at 06:16:06AM -0800, Eric Dumazet wrote:
> On Fri, 2017-02-03 at 11:53 -0200, Marcelo Ricardo Leitner wrote:
> > On Fri, Feb 03, 2017 at 05:24:06AM -0800, Eric Dumazet wrote:
> > > On Fri, 2017-02-03 at 09:54 -0200, Marcelo Ricardo Leitner wrote:
> > > > On Thu, Feb 02, 2017 at 05:59:24AM -0800, Eric Dumazet wrote:
> > > > > On Thu, 2017-02-02 at 05:31 -0800, Eric Dumazet wrote:
> > > > > 
> > > > > > Anyway, I suspect the test is simply buggy ;)
> > > > > > 
> > > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > > > > index 
> > > > > > 41dcbd568cbe2403f2a9e659669afe462a42e228..5394a39fcce964a7fe7075b1531a8a1e05550a54
> > > > > >  100644
> > > > > > --- a/net/ipv4/tcp_input.c
> > > > > > +++ b/net/ipv4/tcp_input.c
> > > > > > @@ -164,7 +164,7 @@ static void tcp_measure_rcv_mss(struct sock 
> > > > > > *sk, const struct sk_buff *skb)
> > > > > > if (len >= icsk->icsk_ack.rcv_mss) {
> > > > > > icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
> > > > > >tcp_sk(sk)->advmss);
> > > > > > -   if (unlikely(icsk->icsk_ack.rcv_mss != len))
> > > > > > +   if (unlikely(icsk->icsk_ack.rcv_mss != len && 
> > > > > > skb_is_gso(skb)))
> > > > > > tcp_gro_dev_warn(sk, skb);
> > > > > > } else {
> > > > > > /* Otherwise, we make more careful check taking into 
> > > > > > account,
> > > > > 
> > > > > This wont really help.
> > > > > 
> > > > > Our tcp_sk(sk)->advmss can be lower than the MSS used by the remote
> > > > > peer.
> > > > > 
> > > > > ip ro add  advmss 512
> > > > 
> > > > I don't follow. With a good driver, how can advmss be smaller than the
> > > > MSS used by the remote peer? Even with the route entry above, I get
> > > > segments just up to advmss, and no warning.
> > > > 
> > > 
> > > A TCP flow has two ends.
> > 
> > Indeed, though should be mostly about only one of them.
> > 
> > > 
> > > Common MTU = 1500
> > > 
> > > One can have advmss 500, the other one no advmss (or the standard 1460
> > > one)
> > 
> > Considering the rx side of peer A. Peer A advertises a given MSS to peer
> > B and should not receive any segment from peer B larger than so.
> > I'm failing to see how advmss can be smaller than the segment size just
> > received.
> 
> tcp_sk(sk)->advmss records what the peer announced during its SYN (or
> SYNACK) message, in the MSS option.
> 
> Nothing prevents the peer to change its mind later.
> 
> Eg starting with MSS 512, then switch later to sending packets of 1024
> or 1400 bytes.

Aren't you mixing the endpoints here? MSS is the largest amount of data
that the peer can receive in a single segment, and not how much it will
send. For the sending part, that depends on what the other peer
announced, and we can have 2 different MSS in a single connection, one
for each peer.

If a peer later wants to send larger segments, it can, but it must
respect the mss advertised by the other peer during handshake.

> 
> So the innocent NIC driver is not the problem here.
> 
> 


Re: "TCP: eth0: Driver has suspect GRO implementation, TCP performance may be compromised." message with "ethtool -K eth0 gro off"

2017-02-03 Thread Eric Dumazet
On Fri, 2017-02-03 at 12:28 -0200, Marcelo Ricardo Leitner wrote:

> Aren't you mixing the endpoints here? MSS is the largest amount of data
> that the peer can receive in a single segment, and not how much it will
> send. For the sending part, that depends on what the other peer
> announced, and we can have 2 different MSS in a single connection, one
> for each peer.
> 
> If a peer later wants to send larger segments, it can, but it must
> respect the mss advertised by the other peer during handshake.
> 

I am not mixing endpoints, you are.

If you need to be convinced, please grab :
https://patchwork.ozlabs.org/patch/723028/

And just watch "ss -temoi ..." 




Re: Inconsistency in packet drop due to MTU (eth vs veth)

2017-02-03 Thread Toshiaki Makita

On 17/02/03 (Fri) 17:07, Fredrik Markstrom wrote:

  On Tue, 31 Jan 2017 17:27:09 +0100 Eric Dumazet  
wrote 
 > On Tue, 2017-01-31 at 14:32 +0100, Fredrik Markstrom wrote:
 > >   On Thu, 19 Jan 2017 19:53:47 +0100 Eric Dumazet 
 wrote 
 > >  > On Thu, 2017-01-19 at 17:41 +0100, Fredrik Markstrom wrote:
 > >  > > Hello,
 > >  > >
 > >  > > I've noticed an inconsistency between how physical ethernet and
 > > veth handles mtu.
 > >  > >
 > >  > > If I setup two physical interfaces (directly connected) with
 > > different mtu:s, only the size of the outgoing packets are limited by
 > > the mtu. But with veth a packet is dropped if the mtu of the receiving
 > > interface is smaller then the packet size.
 > >  > >
 > >  > > This seems inconsistent to me, but maybe there is a reason for
 > > it ?
 > >  > >
 > >  > > Can someone confirm if it's a deliberate inconsistency or just a
 > > side effect of using dev_forward_skb() ?
 > >  >
 > >  > It looks this was added in commit
 > >  > 38d408152a86598a50680a82fe3353b506630409
 > >  > ("veth: Allow setting the L3 MTU")
 > >  >
 > >  > But what was really needed here was a way to change MRU :(
 > >
 > > Ok, do we consider this correct and/or something we need to be
 > > backwards compatible with ? Is it insane to believe that we can fix
 > > this "inconsistency" by removing the check ?
 > >
 > > The commit message reads "For consistency I drop packets on the
 > > receive side when they are larger than the MTU", do we know what it's
 > > supposed
 > > to be consistent with or is that lost in history ?
 >
 > There is no consistency among existing Ethernet drivers.
 >
 > Many ethernet drivers size the buffers they post in RX ring buffer
 > according to MTU.
 >
 > If MTU is set to 1500, RX buffers are sized to be about 1536 bytes,
 > so you wont be able to receive a 1700 bytes frame.
 >
 > I guess that you could add a specific veth attribute to precisely
 > control MRU, that would not break existing applications.

Ok, I will propose a patch shortly. And thanks, your response time is
awesome !


But why do you want to configure the MRU?
What is the problem with setting the MTU instead?

Toshiaki Makita


Re: [PATCH net-next] sfc: get rid of custom busy polling code

2017-02-03 Thread David Miller
From: Eric Dumazet 
Date: Thu, 02 Feb 2017 17:13:19 -0800

> From: Eric Dumazet 
> 
> In linux-4.5, busy polling was implemented in the core
> NAPI stack, meaning that all custom implementations can
> be removed from drivers.
> 
> Not only do we remove lots of tricky code, we also remove
> one lock operation in the fast path.
> 
> Signed-off-by: Eric Dumazet 

Applied.


Re: [PATCH net-next] sfc-falcon: get rid of custom busy polling code

2017-02-03 Thread David Miller
From: Eric Dumazet 
Date: Thu, 02 Feb 2017 18:22:28 -0800

> From: Eric Dumazet 
> 
> In linux-4.5, busy polling was implemented in the core
> NAPI stack, meaning that all custom implementations can
> be removed from drivers.
> 
> Not only do we remove lots of tricky code, we also remove
> one lock operation in the fast path.
> 
> Signed-off-by: Eric Dumazet 

Applied.


Re: [PATCHv2 net-next 05/16] net: mvpp2: introduce PPv2.2 HW descriptors and adapt accessors

2017-02-03 Thread Thomas Petazzoni
Hello,

On Fri, 3 Feb 2017 14:05:13 +, Robin Murphy wrote:

> > So do you have a better suggestion? The descriptors only have enough
> > space to store a 40-bit virtual address, which is not enough to fit the
> > virtual addresses used by Linux for SKBs. This is why I'm instead
> > relying on the fact that the descriptors can store the 40-bit physical
> > address, and convert it back to a virtual address, which should be fine
> > on ARM64 because the entire physical memory is part of the kernel linear
> > mapping.  
> 
> OK, that has nothing to do with DMA addresses then.

Well, it has to do with DMA in the sense that those buffers are
mapped with dma_map_single(). So the address that is given to us by the
hardware as the "physical address of the RX buffer" is the one that we
have initially given to the hardware and was the result of 
dma_map_single().

> > Russell proposal of testing the size of a virtual address
> > pointer instead would solve this I believe, correct?  
> 
> AFAICS, even that shouldn't really be necessary - for all VA/PA
> combinations of 32/32, 32/40 and 64/40, storing virt_to_phys() of the
> SKB VA won't overflow 40 bits,

I'm already lost here. Why are you talking about virt_to_phys() ? See
above: we have the dma_addr_t returned from dma_map_single(), and we
need to find back the corresponding virtual address, because there is
not enough room in the HW descriptors to store a full 64-bit VA.

> so a corresponding phys_to_virt() at the other end can't go wrong
> either. If you really do want to special-case things based on VA
> size, though, either CONFIG_64BIT or sizeof(void *) would indeed be
> infinitely more useful than the unrelated DMA address width - I know
> this driver's never going to run on SPARC64, but that's one example
> of where the above logic would lead to precisely the truncated VA
> it's trying to avoid.

What is different on SPARC64 here?

The situation we have is the following:

 - On systems where VAs are 32-bit wide, we have enough room to store
   the VA in the HW descriptor. So when we receive a packet, the HW
   descriptor provides us directly with the VA of the network packet,
   and the DMA address of the packet. We can dma_unmap_single() the
   packet, and do its processing.

 - On systems where VAs are 64-bit wide, we don't have enough room to
   store the VA in the HW descriptor. However, on 64-bit systems, the
   entire physical memory is mapped in the kernel linear mapping, so
   phys_to_virt() is valid on any physical address. And we use this
   property to retrieve the full 64-bit VA using the DMA address that
   we get from the HW descriptor.

   Since what we get from the HW descriptor is a DMA address, that's
   why we're using phys_to_virt(dma_to_phys(...)).

Best regards,

Thomas
-- 
Thomas Petazzoni, CTO, Free Electrons
Embedded Linux and Kernel engineering
http://free-electrons.com


Re: [PATCH net-next] virtio: Fix affinity for >32 VCPUs

2017-02-03 Thread Michael S. Tsirkin
On Thu, Feb 02, 2017 at 10:19:05PM -0800, Ben Serebrin wrote:
> From: Benjamin Serebrin 
> 
> If the number of virtio queue pairs is not equal to the
> number of VCPUs, the virtio guest driver doesn't assign
> any CPU affinity for the queue interrupts or the xps
> aggregation interrupt.
> 
> Google Compute Engine currently provides 1 queue pair for
> every VCPU, but limits that at a maximum of 32 queue pairs.
> 
> This code assigns interrupt affinity even when there are more than
> 32 VCPUs.
> 
> Tested:
> 
> (on a 64-VCPU VM with debian 8, jessie-backports 4.9.2)
> 
> Without the fix we see all queues affinitized to all CPUs:
> 
> cd /proc/irq
> for i in `seq 24 92` ; do sudo grep ".*" $i/smp_affinity_list;  done
> 0-63
> [...]
> 0-63
> 
> and we see all TX queues' xps_cpus affinitized to no cores:
> 
> for i in `seq 0 31` ; do sudo grep ".*" tx-$i/xps_cpus; done
> ,
> [...]
> ,
> 
> With the fix, we see each queue assigned to a single core,
> and xps affinity set to 1 unique cpu per TX queue.
> 
> 64 VCPU:
> 
> cd /proc/irq
> for i in `seq 24 92` ; do sudo grep ".*" $i/smp_affinity_list;  done
> 
> 0-63
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> 4
> 4
> 5
> 5
> 6
> 6
> 7
> 7
> 8
> 8
> 9
> 9
> 10
> 10
> 11
> 11
> 12
> 12
> 13
> 13
> 14
> 14
> 15
> 15
> 16
> 16
> 17
> 17
> 18
> 18
> 19
> 19
> 20
> 20
> 21
> 21
> 22
> 22
> 23
> 23
> 24
> 24
> 25
> 25
> 26
> 26
> 27
> 27
> 28
> 28
> 29
> 29
> 30
> 30
> 31
> 31
> 0-63
> 0-63
> 0-63
> 0-63
> 
> cd /sys/class/net/eth0/queues
> for i in `seq 0 31` ; do sudo grep ".*" tx-$i/xps_cpus;  done
> 
> 0001,0001
> 0002,0002
> 0004,0004
> 0008,0008
> 0010,0010
> 0020,0020
> 0040,0040
> 0080,0080
> 0100,0100
> 0200,0200
> 0400,0400
> 0800,0800
> 1000,1000
> 2000,2000
> 4000,4000
> 8000,8000
> 0001,0001
> 0002,0002
> 0004,0004
> 0008,0008
> 0010,0010
> 0020,0020
> 0040,0040
> 0080,0080
> 0100,0100
> 0200,0200
> 0400,0400
> 0800,0800
> 1000,1000
> 2000,2000
> 4000,4000
> 8000,8000
> 
> 48 VCPU:
> 
> cd /proc/irq
> for i in `seq 24 92` ; do sudo grep ".*" $i/smp_affinity_list;  done
> 0-47
> 0
> 0
> 1
> 1
> 2
> 2
> 3
> 3
> 4
> 4
> 5
> 5
> 6
> 6
> 7
> 7
> 8
> 8
> 9
> 9
> 10
> 10
> 11
> 11
> 12
> 12
> 13
> 13
> 14
> 14
> 15
> 15
> 16
> 16
> 17
> 17
> 18
> 18
> 19
> 19
> 20
> 20
> 21
> 21
> 22
> 22
> 23
> 23
> 24
> 24
> 25
> 25
> 26
> 26
> 27
> 27
> 28
> 28
> 29
> 29
> 30
> 30
> 31
> 31
> 0-47
> 0-47
> 0-47
> 0-47
> 
> cd /sys/class/net/eth0/queues
> for i in `seq 0 31` ; do sudo grep ".*" tx-$i/xps_cpus;  done
> 
> 0001,0001
> 0002,0002
> 0004,0004
> 0008,0008
> 0010,0010
> 0020,0020
> 0040,0040
> 0080,0080
> 0100,0100
> 0200,0200
> 0400,0400
> 0800,0800
> 1000,1000
> 2000,2000
> 4000,4000
> 8000,8000
> ,0001
> ,0002
> ,0004
> ,0008
> ,0010
> ,0020
> ,0040
> ,0080
> ,0100
> ,0200
> ,0400
> ,0800
> ,1000
> ,2000
> ,4000
> ,8000
> 
> Signed-off-by: Ben Serebrin 
> Acked-by: Willem de Bruijn 
> Acked-by: Jim Mattson 
> Acked-by: Venkatesh Srinivas 

I wonder why not just do it in userspace though.
It would be nice to mention this in the commit log.
Are we sure this distribution is best for all workloads?
While irqbalancer is hardly a perfect oracle it does
manage to balance the load somewhat, and IIUC kernel
affinities would break that.
Thoughts?


> Effort: kvm
> ---
>  drivers/net/virtio_net.c | 30 +++---
>  1 file changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 765c2d6358da..0dc3a102bfc4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1502,20 +1502,44 @@ static void virtnet_set_affinity(struct virtnet_info *vi)
>* queue pairs, we let the queue pairs to be private to one cpu by
>* setting the affinity hint to eliminate the contention.
>*/
> - if (vi->curr_queue_pairs == 1 ||
> - vi->max_queue_pairs != num_online_cpus()) {
> + if (vi->curr_queue_pairs == 1) {
>   virtnet_clean_affinity(vi, -1);
>   return;
>   }
>  
> + /* If there are more cpus than queues, then assign the queues'
> +  * interrupts to the first cpus until we run out.
> +  */
>   i = 0;
>   for_each_online_cpu(cpu) {
> + if (i == vi->max_queue_pairs)
> + break;
>   virtqueue_set_affinity(vi->rq[i].vq, cpu);
>   virtqueue_set_affinity(vi->sq[i].vq, cpu);
> - netif_set_xps_queue(vi->dev, cpumask_of(cpu), i);
>   i++;
>   }
>  
> + /* Stripe the XPS a
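The striped assignment the truncated hunk above is implementing — once CPUs outnumber queues, queue i serves every online CPU congruent to i modulo the queue count — can be reproduced with a short sketch. The helper below is hypothetical (not the driver code), and the sysfs listings earlier in the thread may group the hex words with different padding:

```python
def xps_masks(n_cpus, n_queues, word_bits=32):
    """Map cpu -> queue (cpu % n_queues) and render each queue's CPU
    set as a comma-separated hex bitmap, sysfs-style."""
    masks = [0] * n_queues
    for cpu in range(n_cpus):
        masks[cpu % n_queues] |= 1 << cpu
    rendered = []
    for m in masks:
        words = []
        while m or not words:
            words.append(format(m & ((1 << word_bits) - 1),
                                "0%dx" % (word_bits // 4)))
            m >>= word_bits
        rendered.append(",".join(reversed(words)))
    return rendered

masks = xps_masks(64, 32)
# queue 0 serves cpus 0 and 32, queue 15 serves cpus 15 and 47, etc.
assert masks[0] == "00000001,00000001"
assert masks[15] == "00008000,00008000"
```

With 48 CPUs and 32 queues the same helper yields single-CPU masks for queues 16-31, roughly the shape of the second listing above.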

Re: [PATCH 13/17] net: stmmac: Implement NAPI for TX

2017-02-03 Thread David Miller
From: Corentin Labbe 
Date: Fri, 3 Feb 2017 14:41:45 +0100

> On Tue, Jan 31, 2017 at 11:12:25PM -0500, David Miller wrote:
>> From: Corentin Labbe 
>> Date: Tue, 31 Jan 2017 10:11:48 +0100
>> 
> >> > The stmmac driver runs TX completion under NAPI but without checking
>> > the work done by the TX completion function.
>> 
>> The current behavior is correct and completely intentional.
>> 
>> A driver should _never_ account TX work to the NAPI poll budget.
>> 
>> This is because TX liberation is orders of magnitude cheaper than
>> receiving a packet, and such SKB freeing makes more SKBs available
>> for RX processing.
>> 
>> Therefore, TX work should never count against the NAPI budget.
>> 
>> Please do not fix something which is not broken.
> 
> So at least the documentation I read must be fixed 
> (https://wiki.linuxfoundation.org/networking/napi)

We have no control over nor care about what the Linux Foundation writes
about the Linux networking code.

Complain to them and please do not bother us about it.

Thank you.


Re: [PATCH net-next 0/2] Extract IFE logic to module

2017-02-03 Thread Simon Horman
No objections here.

On Fri, Feb 3, 2017 at 2:02 PM, Jamal Hadi Salim  wrote:
> On 17-02-02 06:12 AM, Yotam Gigi wrote:
>>>
>>> -Original Message-
>
>
>>>
>>> I have no objection to this modularisation but I am curious to know
>>> if you have a use-case in mind. My understanding is that earlier versions
>>> of the sample action used IFE but that is not the case in the version
>>> that
>>> was ultimately accepted.
>>
>>
>> Hi Simon.
>>
>> You are right that the patches were done for the former version of the
>> sample
> >> classifier, and they are not required for the current version. We don't
>> have
> >> a current use-case in mind, but I did send the patches because I think it
>> can help
>> others, or us in the future.
>
>
> For what its worth given Yotam has done this work and vetted it and weve
> reviewed and discussed it in the past, I am going to sign off on it.
>
>
> cheers,
> jamal
>


Re: [PATCHv2 net-next 05/16] net: mvpp2: introduce PPv2.2 HW descriptors and adapt accessors

2017-02-03 Thread Russell King - ARM Linux
On Fri, Feb 03, 2017 at 02:05:13PM +, Robin Murphy wrote:
> AFAICS, even that shouldn't really be necessary - for all VA/PA
> combinations of 32/32, 32/40 and 64/40, storing virt_to_phys() of the
> SKB VA won't overflow 40 bits, so a corresponding phys_to_virt() at the
> other end can't go wrong either.

Except for the detail that virt_to_phys()/phys_to_virt() is only defined
for the direct-mapped memory, not for highmem.  That matters a lot for
32-bit platforms.

Now for a bit of a whinge.  Reading through this code is rather annoying
because of what's called a "physical" address which is actually a DMA
address as far as the kernel is concerned - this makes it much harder
when thinking about this issue because it causes all sorts of confusion.
Please can the next revision of the patches start calling things by their
right name - a dma_addr_t is a DMA address, not a physical address, even
though _numerically_ it may be the same thing.  From the point of view
of the kernel, you must not do phys_to_virt() on a dma_addr_t address.
Thanks.

If we're going to start dealing with _real_ physical addresses, then
this is even more important to separate the concept of what's a physical
address and what's a DMA address in this driver.

Taking a step backwards...

How do DMA addresses and this cookie get into the receive ring - from what
I can see, the driver doesn't write these into the receive ring, it's the
hardware that writes it, and the only route I can see that they get there
is via writes performed in mvpp2_bm_pool_put().

Now, from what I can see, the "buf_virt_addr" comes from:

+static void *mvpp2_frag_alloc(const struct mvpp2_bm_pool *pool)
+{
+   if (likely(pool->frag_size <= PAGE_SIZE))
+   return netdev_alloc_frag(pool->frag_size);
+   else
+   return kmalloc(pool->frag_size, GFP_ATOMIC);
+}

via mvpp2_buf_alloc().

Both kmalloc() and netdev_alloc_frag() guarantee that the virtual
address will be in lowmem.

Given that, I would suggest changing mvpp2_bm_pool_put() as follows -
and this is where my point above about separating the notion of "dma
address" and "physical address" becomes very important:

 static inline void mvpp2_bm_pool_put(struct mvpp2_port *port, int pool,
-dma_addr_t buf_phys_addr,
-unsigned long buf_virt_addr)
+dma_addr_t dma, phys_addr_t phys)
 {

and updating it to write "phys" as the existing buf_virt_addr.

In mvpp2_bm_bufs_add():

buf = mvpp2_buf_alloc(port, bm_pool, &phys_addr, GFP_KERNEL);
if (!buf)
break;
 
mvpp2_bm_pool_put(port, bm_pool->id, phys_addr,
- (unsigned long)buf);
+ virt_to_phys(buf));

which I think means that mvpp2_rxdesc_virt_addr_get() can just become:

phys_addr_t cookie;
 
/* PPv2.1 can only be used on 32 bits architectures, and there
 * are 32 bits in buf_cookie which are enough to store the
 * full virtual address, so things are easy.
 */
if (port->priv->hw_version == MVPP21)
cookie = rx_desc->pp21.buf_cookie;
else
cookie = rx_desc->pp22.buf_cookie_misc & FORTY_BIT_MASK;

return phys_to_virt(cookie);

I'd suggest against using DMA_BIT_MASK(40) there - because it's not a
DMA address, even though it happens to resolve to the same number.

Again, I may have missed how the addresses end up getting into
buf_cookie_misc, so what I suggest above may not be possible.

I'd also suggest that there is some test in mvpp2_bm_bufs_add() to
verify that the physical addresses and DMA addresses do fit within
the available number of bits - if they don't we could end up scribbling
over memory that we shouldn't be, and it looks like we have a failure
path there to gracefully handle that situation - gracefully compared
to a nasty BUG_ON().

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [PATCH 13/17] net: stmmac: Implement NAPI for TX

2017-02-03 Thread Corentin Labbe
On Fri, Feb 03, 2017 at 10:15:30AM -0500, David Miller wrote:
> From: Corentin Labbe 
> Date: Fri, 3 Feb 2017 14:41:45 +0100
> 
> > On Tue, Jan 31, 2017 at 11:12:25PM -0500, David Miller wrote:
> >> From: Corentin Labbe 
> >> Date: Tue, 31 Jan 2017 10:11:48 +0100
> >> 
> > >> > The stmmac driver runs TX completion under NAPI but without checking
> >> > the work done by the TX completion function.
> >> 
> >> The current behavior is correct and completely intentional.
> >> 
> >> A driver should _never_ account TX work to the NAPI poll budget.
> >> 
> >> This is because TX liberation is orders of magnitude cheaper than
> >> receiving a packet, and such SKB freeing makes more SKBs available
> >> for RX processing.
> >> 
> >> Therefore, TX work should never count against the NAPI budget.
> >> 
> >> Please do not fix something which is not broken.
> > 
> > So at least the documentation I read must be fixed 
> > (https://wiki.linuxfoundation.org/networking/napi)
> 
> We have no control over nor care about what the Linux Foundation writes
> about the Linux networking code.
> 
> Complain to them and please do not bother us about it.
> 
> Thank you.

Sorry, this was not to bother you.

Could you give me your opinion on the other question in the mail? (just copied 
below)
So perhaps the best way is to do as Intel igb/ixgbe do, staying under NAPI until 
the stmmac_tx_clean function says it has finished handling the queue (with a 
distinct TX budget)?

Thanks
Regards


Re: [PATCH net-next] cxgb4: Fix uld_send() for ctrl pkts

2017-02-03 Thread David Miller
From: Ganesh Goudar 
Date: Thu,  2 Feb 2017 12:43:29 +0530

> From: Arjun V 
> 
> Without any uld being loaded, uld_txq_info[] will be NULL. uld_send()
> is also used for sending control work requests (e.g. setting a filter)
> that don't require any ulds to be loaded. Hence move uld_txq_info[]
> assignment after ctrl_xmit().
> 
> Also added a NULL check for uld_txq_info[].
> 
> Fixes: 94cdb8bb993a (cxgb4: Add support for dynamic allocation of resources for ULD).
> Signed-off-by: Arjun V 
> Signed-off-by: Casey Leedom 
> Signed-off-by: Ganesh Goudar 

Applied, thanks.


Re: [PATCH v2 3/4] phy: Add USB3 PHY support for Broadcom NSP SoC

2017-02-03 Thread Jon Mason
On Thu, Feb 2, 2017 at 1:48 AM, Rafał Miłecki  wrote:
> [Resending with fixed/complete Cc-s]
>
> On Tue, 17 Jan 2017 11:14:29 -0500, Yendapally Reddy Dhananjaya Reddy
>  wrote:
>> This patch adds support for Broadcom NSP USB3 PHY
>>
>> Signed-off-by: Yendapally Reddy Dhananjaya Reddy
>> 
>
> Seriously?! I really dislike what you did there.
>
> NACK.
>
> You are aware this block is common for both: Northstar and Northstar Plus
> and
> we already have phy-bcm-ns-usb3.c! In fact Jon told me to rewrite my initial
> driver to make it possible to reuse it on NSP and I did that!
>
> This is old comment from Jon:
>
> In 30 March 2016 at 23:31, Jon Mason  wrote:
>> On Mon, Mar 28, 2016 at 9:46 PM, Florian Fainelli 
>> wrote:
>>>
>>> CC: bcm-kernel-feedback-list, Jon
>>
>>
>> This is a common IP block with NSP.  I believe with some minor changes it
>> can support both.  Please allow me 1-2 days to look at these in more
>> detail
>> and see if I can get these patches working on NSP.
>
> Please start using existing code instead of inventing everything from
> scratch internally at Broadcom. You did the same thing with the (Q)SPI driver.
>
>
> This driver duplicates phy-bcm-ns-usb3.c and should not have been accepted.
> I strongly suggest *reverting* it and adjusting the existing driver if needed.

I agree that we need to be heading in the same direction with 4708/9
(Northstar) and Northstar+.  Duplication of work is a sin (and if not,
it should be).  So, I apologize for this and let's move forward
together.

Regarding the SPI duplication of drivers, the QSPI driver covers a
much broader array of SoCs across Broadcom, and was a joint effort
between multiple teams internally.  To resolve this, I believe the
best way forward is to add QSPI to the 4708/9 device trees, and remove
the BSPI driver from Linux.  I'll have someone work on this internally
and get it out ASAP.

Regarding the duplication of function for the USB PHYs, the MDIO bus
for our PHYs is the way we would like to support everything going
forward.  This MDIO bus supports more than just USB.  So, it will be
much more extensible in the future.  Since there is already a USB PHY
driver for NS, I would recommend that we modify that driver to have
MDIO support.  If we are in agreement, Kishon can drop the current
series in his tree and Dhananjay will abandon the unaccepted ones.

Thanks,
Jon


Re: [PATCH net] ipv6: sr: remove cleanup flag and fix HMAC computation

2017-02-03 Thread David Miller
From: David Lebrun 
Date: Thu, 2 Feb 2017 11:29:38 +0100

> In the latest version of the IPv6 Segment Routing IETF draft [1] the
> cleanup flag is removed and the flags field length is shrunk from 16 bits
> to 8 bits. As a consequence, the input of the HMAC computation is modified
> in a non-backward compatible way by covering the whole octet of flags
> instead of only the cleanup bit. As such, if an implementation compatible
> with the latest draft computes the HMAC of an SRH who has other flags set
> to 1, then the HMAC result would differ from the current implementation.
> 
> This patch carries those modifications to prevent conflict with other
> implementations of IPv6 SR.
> 
> [1] https://tools.ietf.org/html/draft-ietf-6man-segment-routing-header-05
> 
> Signed-off-by: David Lebrun 

Applied.
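The non-backward-compatible change described in the commit message is easy to demonstrate: once the HMAC input covers the whole flags octet instead of only the cleanup bit, any other flag set to 1 changes the digest. A sketch under stated assumptions (the key, bit positions, and hash are illustrative, not the exact SRH HMAC layout):

```python
import hmac
import hashlib

key = b"pre-shared-key"   # illustrative key
flags = 0b00000110        # cleanup bit clear, two other flags set

# Pre-draft-05 scheme: only the cleanup bit (assumed at bit 7 here)
# contributes to the HMAC input.
old_input = bytes([flags & 0x80])
# Draft-05 scheme: the whole flags octet is covered.
new_input = bytes([flags])

old_mac = hmac.new(key, old_input, hashlib.sha256).digest()
new_mac = hmac.new(key, new_input, hashlib.sha256).digest()

# With non-cleanup flags set, the two schemes disagree...
assert old_mac != new_mac
# ...but with every flag clear they still produce the same digest.
assert hmac.new(key, b"\x00", hashlib.sha256).digest() == \
       hmac.new(key, bytes([0]), hashlib.sha256).digest()
```

This is why an implementation following draft-05 would reject the HMAC of an SRH produced by the earlier code whenever any non-cleanup flag is set.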

