[PATCHv4 2/3] net: socionext: Add Synquacer NetSec driver

2017-12-22 Thread jassisinghbrar
From: Jassi Brar 

This driver adds support for Socionext "netsec" IP Gigabit
Ethernet + PHY IP used in the Synquacer SC2A11 SoC.

Signed-off-by: Ard Biesheuvel 
Signed-off-by: Jassi Brar 
---
 drivers/net/ethernet/Kconfig            |    1 +
 drivers/net/ethernet/Makefile           |    1 +
 drivers/net/ethernet/socionext/Kconfig  |   29 +
 drivers/net/ethernet/socionext/Makefile |    1 +
 drivers/net/ethernet/socionext/netsec.c | 1844 +++
 5 files changed, 1876 insertions(+)
 create mode 100644 drivers/net/ethernet/socionext/Kconfig
 create mode 100644 drivers/net/ethernet/socionext/Makefile
 create mode 100644 drivers/net/ethernet/socionext/netsec.c

diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
index c604213..d50519e 100644
--- a/drivers/net/ethernet/Kconfig
+++ b/drivers/net/ethernet/Kconfig
@@ -170,6 +170,7 @@ source "drivers/net/ethernet/sis/Kconfig"
 source "drivers/net/ethernet/sfc/Kconfig"
 source "drivers/net/ethernet/sgi/Kconfig"
 source "drivers/net/ethernet/smsc/Kconfig"
+source "drivers/net/ethernet/socionext/Kconfig"
 source "drivers/net/ethernet/stmicro/Kconfig"
 source "drivers/net/ethernet/sun/Kconfig"
 source "drivers/net/ethernet/tehuti/Kconfig"
diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
index 39f62733..6cf5ade 100644
--- a/drivers/net/ethernet/Makefile
+++ b/drivers/net/ethernet/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_SFC) += sfc/
 obj-$(CONFIG_SFC_FALCON) += sfc/falcon/
 obj-$(CONFIG_NET_VENDOR_SGI) += sgi/
 obj-$(CONFIG_NET_VENDOR_SMSC) += smsc/
+obj-$(CONFIG_NET_VENDOR_SOCIONEXT) += socionext/
 obj-$(CONFIG_NET_VENDOR_STMICRO) += stmicro/
 obj-$(CONFIG_NET_VENDOR_SUN) += sun/
 obj-$(CONFIG_NET_VENDOR_TEHUTI) += tehuti/
diff --git a/drivers/net/ethernet/socionext/Kconfig b/drivers/net/ethernet/socionext/Kconfig
new file mode 100644
index 0000000..4601c2f
--- /dev/null
+++ b/drivers/net/ethernet/socionext/Kconfig
@@ -0,0 +1,29 @@
+#
+# Socionext Network device configuration
+#
+
+config NET_VENDOR_SOCIONEXT
+   bool "Socionext devices"
+   default y
+   ---help---
+ If you have a network (Ethernet) card belonging to this class, say Y.
+
+ Note that the answer to this question doesn't directly affect the
+ questions about Socionext cards. If you say Y, you will be asked
+ for your specific card in the following questions.
+
+if NET_VENDOR_SOCIONEXT
+
+config SNI_NETSEC
+   tristate "NETSEC Driver Support"
+   depends on (ARCH_SYNQUACER || COMPILE_TEST) && OF
+   select PHYLIB
+   select MII
+   help
+ Enable to add support for the SocioNext NetSec Gigabit Ethernet
+ controller + PHY, as found on the Synquacer SC2A11 SoC.
+
+ To compile this driver as a module, choose M here: the module will be
+ called netsec.  If unsure, say N.
+
+endif # NET_VENDOR_SOCIONEXT
diff --git a/drivers/net/ethernet/socionext/Makefile b/drivers/net/ethernet/socionext/Makefile
new file mode 100644
index 0000000..9505923
--- /dev/null
+++ b/drivers/net/ethernet/socionext/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_SNI_NETSEC) += netsec.o
diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
new file mode 100644
index 0000000..6af047b
--- /dev/null
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -0,0 +1,1844 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#define NETSEC_REG_SOFT_RST            0x104
+#define NETSEC_REG_COM_INIT            0x120
+
+#define NETSEC_REG_TOP_STATUS          0x200
+#define NETSEC_IRQ_RX                  BIT(1)
+#define NETSEC_IRQ_TX                  BIT(0)
+
+#define NETSEC_REG_TOP_INTEN           0x204
+#define NETSEC_REG_INTEN_SET           0x234
+#define NETSEC_REG_INTEN_CLR           0x238
+
+#define NETSEC_REG_NRM_TX_STATUS       0x400
+#define NETSEC_REG_NRM_TX_INTEN        0x404
+#define NETSEC_REG_NRM_TX_INTEN_SET    0x428
+#define NETSEC_REG_NRM_TX_INTEN_CLR    0x42c
+#define NRM_TX_ST_NTOWNR               BIT(17)
+#define NRM_TX_ST_TR_ERR               BIT(16)
+#define NRM_TX_ST_TXDONE               BIT(15)
+#define NRM_TX_ST_TMREXP               BIT(14)
+
+#define NETSEC_REG_NRM_RX_STATUS       0x440
+#define NETSEC_REG_NRM_RX_INTEN        0x444
+#define NETSEC_REG_NRM_RX_INTEN_SET    0x468
+#define NETSEC_REG_NRM_RX_INTEN_CLR    0x46c
+#define NRM_RX_ST_RC_ERR               BIT(16)
+#define NRM_RX_ST_PKTCNT               BIT(15)
+#define NRM_RX_ST_TMREXP               BIT(14)
+
+#define NETSEC_REG_PKT_CMD_BUF         0xd0
+
+#define NETSEC_REG_CLK_EN              0x100
+
+#define NETSEC_REG_PKT_CTRL            0x140
+
+#define NETSEC_REG_DMA_TMR_CTRL

[PATCHv4 3/3] MAINTAINERS: Add entry for Socionext ethernet driver

2017-12-22 Thread jassisinghbrar
From: Jassi Brar 

Add entry for the Socionext Netsec controller driver and DT bindings.

Acked-by: Ard Biesheuvel 
Signed-off-by: Jassi Brar 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9e0045e..0e1f0d4 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12630,6 +12630,13 @@ F: drivers/md/raid*
 F: include/linux/raid/
 F: include/uapi/linux/raid/
 
+SOCIONEXT (SNI) NETSEC NETWORK DRIVER
+M: Jassi Brar 
+L: netdev@vger.kernel.org
+S: Maintained
+F: drivers/net/ethernet/socionext/netsec.c
+F: Documentation/devicetree/bindings/net/socionext-netsec.txt
+
 SONIC NETWORK DRIVER
 M: Thomas Bogendoerfer 
 L: netdev@vger.kernel.org
-- 
2.7.4



[PATCHv4 1/3] dt-bindings: net: Add DT bindings for Socionext Netsec

2017-12-22 Thread jassisinghbrar
From: Jassi Brar 

This patch adds documentation for Device-Tree bindings for the
Socionext NetSec Controller driver.

Signed-off-by: Jassi Brar 
Signed-off-by: Ard Biesheuvel 
---
 .../devicetree/bindings/net/socionext-netsec.txt   | 55 ++
 1 file changed, 55 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/socionext-netsec.txt

diff --git a/Documentation/devicetree/bindings/net/socionext-netsec.txt b/Documentation/devicetree/bindings/net/socionext-netsec.txt
new file mode 100644
index 0000000..adc7bfa
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/socionext-netsec.txt
@@ -0,0 +1,55 @@
+* Socionext NetSec Ethernet Controller IP
+
+Required properties:
+- compatible: Should be "socionext,synquacer-netsec"
+- reg: Address and length of the control register area, followed by the
+   address and length of the EEPROM holding the MAC address and
+   microengine firmware
+- interrupts: Should contain ethernet controller interrupt
+- clocks: phandle to the PHY reference clock, and any other clocks to be
+  switched by runtime_pm
+- clock-names: Required only if more than a single clock is listed in 'clocks'.
+   The PHY reference clock must be named 'phy_refclk'
+- phy-mode: See ethernet.txt file in the same directory
+- phy-handle: See ethernet.txt in the same directory.
+
+- mdio device tree subnode: When the Netsec has a phy connected to its local
+   mdio, there must be device tree subnode with the following
+   required properties:
+
+   - #address-cells: Must be <1>.
+   - #size-cells: Must be <0>.
+
+   For each phy on the mdio bus, there must be a node with the following
+   fields:
+   - compatible: Refer to phy.txt
+   - reg: phy id used to communicate to phy.
+
+Optional properties: (See ethernet.txt file in the same directory)
+- dma-coherent: Boolean property, must only be present if memory
+   accesses performed by the device are cache coherent.
+- local-mac-address: See ethernet.txt in the same directory.
+- mac-address: See ethernet.txt in the same directory.
+- max-speed: See ethernet.txt in the same directory.
+- max-frame-size: See ethernet.txt in the same directory.
+
+Example:
+   eth0: netsec@522d0000 {
+   compatible = "socionext,synquacer-netsec";
+   reg = <0 0x522d0000 0x0 0x10000>, <0 0x10000000 0x0 0x10000>;
+   interrupts = <GIC_SPI 176 IRQ_TYPE_LEVEL_HIGH>;
+   clocks = <&clk_netsec>;
+   phy-mode = "rgmii";
+   max-speed = <1000>;
+   max-frame-size = <9000>;
+   phy-handle = <&phy1>;
+
+   mdio {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   phy1: ethernet-phy@1 {
+   compatible = "ethernet-phy-ieee802.3-c22";
+   reg = <1>;
+   };
+   };
+   };
-- 
2.7.4



[PATCHv4 0/3] Socionext Synquacer NETSEC driver

2017-12-22 Thread jassisinghbrar
From: Jassi Brar 

Hi,

Changes since v3
# Discard 'socionext,snq-mdio', and simply use 'mdio' subnode.
# Use ioremap on ucode region as well, instead of memremap.

Changes since v2
# Use 'mdio' subnode in DT bindings.
# Use phy_interface_mode_is_rgmii(), instead of open coding the check.
# Use readl/b with eeprom_base pointer.
# Unregister mdio bus upon failure in probe.

Changes since v1
# Switched from using memremap to ioremap
# Implemented ndo_do_ioctl callback
# Defined optional 'dma-coherent' DT property

Jassi Brar (3):
  dt-bindings: net: Add DT bindings for Socionext Netsec
  net: socionext: Add Synquacer NetSec driver
  MAINTAINERS: Add entry for Socionext ethernet driver

 .../devicetree/bindings/net/socionext-netsec.txt |   55 +
 MAINTAINERS                                      |    7 +
 drivers/net/ethernet/Kconfig                     |    1 +
 drivers/net/ethernet/Makefile                    |    1 +
 drivers/net/ethernet/socionext/Kconfig           |   29 +
 drivers/net/ethernet/socionext/Makefile          |    1 +
 drivers/net/ethernet/socionext/netsec.c          | 1844 ++++++++++++++++++++
 7 files changed, 1938 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/socionext-netsec.txt
 create mode 100644 drivers/net/ethernet/socionext/Kconfig
 create mode 100644 drivers/net/ethernet/socionext/Makefile
 create mode 100644 drivers/net/ethernet/socionext/netsec.c

-- 
2.7.4



Re: [bpf-next V2 PATCH 13/14] bpf: finally expose xdp_rxq_info to XDP bpf-programs

2017-12-22 Thread Alexei Starovoitov
On Fri, Dec 22, 2017 at 06:12:41PM +0100, Jesper Dangaard Brouer wrote:
> Now all XDP drivers have been updated to setup xdp_rxq_info and assign
> this to xdp_buff->rxq.  Thus, it is now safe to enable access to some
> of the xdp_rxq_info struct members.
> 
> This patch extends xdp_md and exposes UAPI to userspace for
> ingress_ifindex and rx_queue_index.  Access happens via a bpf
> instruction rewrite that loads data directly from struct xdp_rxq_info.
> 
> * ingress_ifindex maps to xdp_rxq_info->dev->ifindex
> * rx_queue_index  maps to xdp_rxq_info->queue_index
> 
> Signed-off-by: Jesper Dangaard Brouer 
...
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 69eabfcb9bdb..a6000a95d40e 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -899,6 +899,9 @@ struct xdp_md {
>   __u32 data;
>   __u32 data_end;
>   __u32 data_meta;
> + /* Below accesses go through struct xdp_rxq_info */
> + __u32 ingress_ifindex; /* rxq->dev->ifindex */
> + __u32 rx_queue_index;  /* rxq->queue_index  */
>  };

Acked-by: Alexei Starovoitov 

I think this is very useful extension and I hope driver maintainers
will do a timely review of corresponding patches.

my only nit:
please use SPDX license header for two new files added in patch 14.
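
(As an illustration of the new uapi -- not part of the patch, and using
the usual samples/bpf conventions for SEC() and the helper header -- an
XDP program reading the new fields might look like:

	#include <linux/bpf.h>
	#include "bpf_helpers.h"	/* for SEC() */

	SEC("xdp")
	int xdp_rxq_demo(struct xdp_md *ctx)
	{
		/* rewritten by the verifier into loads from
		 * struct xdp_rxq_info, per the patch above
		 */
		__u32 ifindex = ctx->ingress_ifindex;
		__u32 queue = ctx->rx_queue_index;

		/* toy policy: drop traffic landing on high queues */
		return (ifindex && queue > 7) ? XDP_DROP : XDP_PASS;
	}
)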



kasan for bpf

2017-12-22 Thread Alexei Starovoitov
Hi All,

the recent bugs make people question the safety of bpf, and it's no surprise
that some folks propose to set kernel.unprivileged_bpf_disabled = 1 by default.

I think that would be the wrong long-term decision, since it would steer
bpf into a "root only" mentality. The verifier checks would become sloppier
and developers would worry less about missing safety checks.
I'd like bpf to go the other way. The root-only programs should be
treated with the same respect as unprivileged ones.

Today out of 15 program types only one BPF_PROG_TYPE_SOCKET_FILTER is
allowed for unpriv while in many cases like XDP, CLS, ACT, SKB, SOCK progs
could be relaxed to cap_net_admin and sometimes to unpriv.
I think we need to disallow leaking kernel address in XDP, CLS and
all other networking programs. They should be operating on packets
and should never leak addresses. The existing approach:
env->allow_ptr_leaks = capable(CAP_SYS_ADMIN);
if (env->allow_ptr_leaks) ...;
if (env->allow_ptr_leaks) ...;
is just a big hammer.
Realistically only tracing programs need to deal with kernel data and
they probably should be restricted further with additional sysctl or CAPs.

So instead of existing two classes of root/unpriv we need several:
- unpriv prog type
- all prog types for unpriv but execute via bpf_prog_run only (for testing)
- net_admin without leaks (that can run in containers)
- root without leaks
- root with leaks for tracing only

As far as unpriv goes, I think it's clear that the verifier alone
is not enough and we need to optionally add run-time checks.
I think 'kasan for bpf' by default would be a good first step.
The basic idea is that all accessible memory (stack, ctx, maps)
will have shadow bits and every load/store will be automatically
instrumented with shadow memory checks.
'kasan for bpf' will be on by default for unpriv, with an extra
sysctl to turn it off in setups that cannot tolerate
slightly degraded performance.
For root and other classes of programs we will be able to
turn it on for testing as well.
I think that will stop many exploits except side channel attacks.
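
To make the run-time check concrete, a purely illustrative sketch (names
and shadow layout invented here, one shadow bit per accessible byte):

	static bool bpf_shadow_access_ok(const u8 *shadow, u32 off, u32 len)
	{
		u32 i;

		/* every byte touched by the access must have its
		 * shadow bit set, i.e. be initialized/accessible
		 */
		for (i = off; i < off + len; i++)
			if (!(shadow[i / 8] & BIT(i % 8)))
				return false;
		return true;
	}

The interpreter (or JIT) would emit a call to such a check in front of
every load/store when the run-time mode is enabled.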

Thoughts?



Re: [PATCH 4.9] bpf/verifier: Fix states_equal() comparison of pointer and UNKNOWN

2017-12-22 Thread Alexei Starovoitov
On Sat, Dec 23, 2017 at 02:26:17AM +, Ben Hutchings wrote:
> An UNKNOWN_VALUE is not supposed to be derived from a pointer, unless
> pointer leaks are allowed.  Therefore, states_equal() must not treat
> a state with a pointer in a register as "equal" to a state with an
> UNKNOWN_VALUE in that register.
> 
> This was fixed differently upstream, but the code around here was
> largely rewritten in 4.14 by commit f1174f77b50c "bpf/verifier: rework
> value tracking".  The bug can be detected by the bpf/verifier sub-test
> "pointer/scalar confusion in state equality check (way 1)".
> 
> Signed-off-by: Ben Hutchings 
> Cc: Edward Cree 
> Cc: Jann Horn 
> Cc: Alexei Starovoitov 

Acked-by: Alexei Starovoitov 



Re: [PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic

2017-12-22 Thread Alexei Starovoitov
On Sat, Dec 23, 2017 at 03:26:27AM +0100, Jann Horn wrote:
> On Sat, Dec 23, 2017 at 3:07 AM, Alexei Starovoitov
>  wrote:
> > On Sat, Dec 23, 2017 at 02:38:29AM +0100, Jann Horn wrote:
> >> On Fri, Dec 22, 2017 at 10:33 PM, Alexei Starovoitov  
> >> wrote:
> >> > instead of computing max stack depth for current call chain only
> >> > track the maximum possible stack depth of any function at given
> >> > frame position. Such algorithm is simple and fast, but conservative,
> >> > since it overestimates amount of stack used. Consider:
> >> > main() // stack 32
> >> > {
> >> >   A();
> >> >   B();
> >> > }
> >> >
> >> > A(){} // stack 256
> >> >
> >> > B()  // stack 64
> >> > {
> >> >   A();
> >> > }
> >> >
> >> > since A() is called at frame[1] and frame[2], the algorithm
> >> > will estimate the max stack depth as 32 + 256 + 256 and will reject
> >> > such program, though real max is 32 + 64 + 256.
> >> >
> >> > Fortunately the algorithm is good enough in practice. The alternative
> >> > would be to track max stack of every function in the fast pass through
> >> > the verifier and then do additional CFG walk just to compute max total.
> >> >
> >> > Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
> >> > Reported-by: Jann Horn 
> >> > Signed-off-by: Alexei Starovoitov 
> >>
> >> Does this work in cases where multiple invocations of a function have
> >> different stack access patterns because their inputs have different
> >> bounds?
> >>
> >> Consider this pseudocode example:
> >>
> >> void main(void) {
> >>   func1(0);
> >>   func1(1);
> >>   func2(1);
> >> }
> >> void func1(int alloc_or_recurse) {
> >>   if (alloc_or_recurse) {
> >> frame_pointer[-300] = 1;
> >>   } else {
> >> func2(alloc_or_recurse);
> >>   }
> >> }
> >> void func2(int alloc_or_recurse) {
> >>   if (alloc_or_recurse) {
> >> frame_pointer[-300] = 1;
> >>   }
> >> }
> >>
> >> AFAICS this will work as follows:
> >>
> >> Call to func1->func2 runs without any stack accesses because the
> >> verifier can prove that alloc_or_recurse is 0.
> >
> > argh. right.
> > I guess that ruins my attemp to do the stack check inline
> > with the main verifier pass.
> > Do you see an algorithm that can do it without extra
> > cfg walk at the end?
> 
> A crappy heuristic would be to forbid recursion (calling a function
> that is already present somewhere in the call stack) 

the recursion is already forbidden. It's a 'back-edge' from cfg point
of view. There are 3 tests that cover that in few variants.

> and then sum up
> the maximum stack depths of all functions at the end and see whether
> the sum is bigger than the maximum stack size. While it'd be horribly
> conservative, it might work for now? 512 bytes are a lot of stack.

That's what I tried first while developing bpf calls.
It should have been good enough, but rounding up to 32 bytes turns even
a tiny function into a sizeable one, and both *_noinline.c tests fail to
load. Arguably they are artificial stress tests that push the limits with
the number of calls, argument passing and pointer accesses, but I don't
want to scale them down.

> Or as a more complicated, but slightly less conservative heuristic,
> you could forbid recursion, record the maximum number of stack frames
> (max_stack_frames), then at the end select the top max_stack_frames
> functions with the biggest stack sizes and sum up their sizes?

it's pretty much the same as the previous one. In these two tests I'm
building a long stack chain out of all the functions.

> Anything else I can come up with is probably more complicated than an
> extra cfg walk.

I guess we have to generalize (or remember) cfg anyway for future use,
so I will code it up now and see how it looks.

Thank you for the feedback!



Re: [PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic

2017-12-22 Thread Jann Horn
On Sat, Dec 23, 2017 at 3:07 AM, Alexei Starovoitov
 wrote:
> On Sat, Dec 23, 2017 at 02:38:29AM +0100, Jann Horn wrote:
>> On Fri, Dec 22, 2017 at 10:33 PM, Alexei Starovoitov  wrote:
>> > instead of computing max stack depth for current call chain only
>> > track the maximum possible stack depth of any function at given
>> > frame position. Such algorithm is simple and fast, but conservative,
>> > since it overestimates amount of stack used. Consider:
>> > main() // stack 32
>> > {
>> >   A();
>> >   B();
>> > }
>> >
>> > A(){} // stack 256
>> >
>> > B()  // stack 64
>> > {
>> >   A();
>> > }
>> >
>> > since A() is called at frame[1] and frame[2], the algorithm
>> > will estimate the max stack depth as 32 + 256 + 256 and will reject
>> > such program, though real max is 32 + 64 + 256.
>> >
>> > Fortunately the algorithm is good enough in practice. The alternative
>> > would be to track max stack of every function in the fast pass through
>> > the verifier and then do additional CFG walk just to compute max total.
>> >
>> > Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
>> > Reported-by: Jann Horn 
>> > Signed-off-by: Alexei Starovoitov 
>>
>> Does this work in cases where multiple invocations of a function have
>> different stack access patterns because their inputs have different
>> bounds?
>>
>> Consider this pseudocode example:
>>
>> void main(void) {
>>   func1(0);
>>   func1(1);
>>   func2(1);
>> }
>> void func1(int alloc_or_recurse) {
>>   if (alloc_or_recurse) {
>> frame_pointer[-300] = 1;
>>   } else {
>> func2(alloc_or_recurse);
>>   }
>> }
>> void func2(int alloc_or_recurse) {
>>   if (alloc_or_recurse) {
>> frame_pointer[-300] = 1;
>>   }
>> }
>>
>> AFAICS this will work as follows:
>>
>> Call to func1->func2 runs without any stack accesses because the
>> verifier can prove that alloc_or_recurse is 0.
>
> argh. right.
> I guess that ruins my attemp to do the stack check inline
> with the main verifier pass.
> Do you see an algorithm that can do it without extra
> cfg walk at the end?

A crappy heuristic would be to forbid recursion (calling a function
that is already present somewhere in the call stack) and then sum up
the maximum stack depths of all functions at the end and see whether
the sum is bigger than the maximum stack size. While it'd be horribly
conservative, it might work for now? 512 bytes are a lot of stack.
Or as a more complicated, but slightly less conservative heuristic,
you could forbid recursion, record the maximum number of stack frames
(max_stack_frames), then at the end select the top max_stack_frames
functions with the biggest stack sizes and sum up their sizes? (Or if
you want to support recursion, check
max_stack_frames*biggest_frame_size<=MAX_BPF_STACK.)
Anything else I can come up with is probably more complicated than an
extra cfg walk.


[PATCH 4.9] bpf/verifier: Fix states_equal() comparison of pointer and UNKNOWN

2017-12-22 Thread Ben Hutchings
An UNKNOWN_VALUE is not supposed to be derived from a pointer, unless
pointer leaks are allowed.  Therefore, states_equal() must not treat
a state with a pointer in a register as "equal" to a state with an
UNKNOWN_VALUE in that register.

This was fixed differently upstream, but the code around here was
largely rewritten in 4.14 by commit f1174f77b50c "bpf/verifier: rework
value tracking".  The bug can be detected by the bpf/verifier sub-test
"pointer/scalar confusion in state equality check (way 1)".

Signed-off-by: Ben Hutchings 
Cc: Edward Cree 
Cc: Jann Horn 
Cc: Alexei Starovoitov 
---
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2722,11 +2722,12 @@ static bool states_equal(struct bpf_veri
 
/* If we didn't map access then again we don't care about the
 * mismatched range values and it's ok if our old type was
-* UNKNOWN and we didn't go to a NOT_INIT'ed reg.
+* UNKNOWN and we didn't go to a NOT_INIT'ed or pointer reg.
 */
if (rold->type == NOT_INIT ||
(!varlen_map_access && rold->type == UNKNOWN_VALUE &&
-rcur->type != NOT_INIT))
+rcur->type != NOT_INIT &&
+!__is_pointer_value(env->allow_ptr_leaks, rcur)))
continue;
 
/* Don't care about the reg->id in this case. */
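
(For context, the confusion being fixed looks roughly like the following
simplified pseudocode -- not the literal selftest:

	if (cond)
		r2 = 42;	/* UNKNOWN_VALUE scalar */
	else
		r2 = r10;	/* frame pointer */
	/* join point: with the bug, the pointer state is pruned as
	 * "equal" to the already-verified UNKNOWN_VALUE state, so the
	 * pointer branch is never re-checked and r2 can subsequently
	 * be leaked, e.g. stored to a map readable from userspace.
	 */
)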




Re: [PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic

2017-12-22 Thread Alexei Starovoitov
On Sat, Dec 23, 2017 at 02:38:29AM +0100, Jann Horn wrote:
> On Fri, Dec 22, 2017 at 10:33 PM, Alexei Starovoitov  wrote:
> > instead of computing max stack depth for current call chain only
> > track the maximum possible stack depth of any function at given
> > frame position. Such algorithm is simple and fast, but conservative,
> > since it overestimates amount of stack used. Consider:
> > main() // stack 32
> > {
> >   A();
> >   B();
> > }
> >
> > A(){} // stack 256
> >
> > B()  // stack 64
> > {
> >   A();
> > }
> >
> > since A() is called at frame[1] and frame[2], the algorithm
> > will estimate the max stack depth as 32 + 256 + 256 and will reject
> > such program, though real max is 32 + 64 + 256.
> >
> > Fortunately the algorithm is good enough in practice. The alternative
> > would be to track max stack of every function in the fast pass through
> > the verifier and then do additional CFG walk just to compute max total.
> >
> > Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
> > Reported-by: Jann Horn 
> > Signed-off-by: Alexei Starovoitov 
> 
> Does this work in cases where multiple invocations of a function have
> different stack access patterns because their inputs have different
> bounds?
> 
> Consider this pseudocode example:
> 
> void main(void) {
>   func1(0);
>   func1(1);
>   func2(1);
> }
> void func1(int alloc_or_recurse) {
>   if (alloc_or_recurse) {
> frame_pointer[-300] = 1;
>   } else {
> func2(alloc_or_recurse);
>   }
> }
> void func2(int alloc_or_recurse) {
>   if (alloc_or_recurse) {
> frame_pointer[-300] = 1;
>   }
> }
> 
> AFAICS this will work as follows:
> 
> Call to func1->func2 runs without any stack accesses because the
> verifier can prove that alloc_or_recurse is 0.

argh. right.
I guess that ruins my attemp to do the stack check inline
with the main verifier pass.
Do you see an algorithm that can do it without extra
cfg walk at the end?



Re: [PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic

2017-12-22 Thread Jann Horn
On Fri, Dec 22, 2017 at 10:33 PM, Alexei Starovoitov  wrote:
> instead of computing max stack depth for current call chain only
> track the maximum possible stack depth of any function at given
> frame position. Such algorithm is simple and fast, but conservative,
> since it overestimates amount of stack used. Consider:
> main() // stack 32
> {
>   A();
>   B();
> }
>
> A(){} // stack 256
>
> B()  // stack 64
> {
>   A();
> }
>
> since A() is called at frame[1] and frame[2], the algorithm
> will estimate the max stack depth as 32 + 256 + 256 and will reject
> such program, though real max is 32 + 64 + 256.
>
> Fortunately the algorithm is good enough in practice. The alternative
> would be to track max stack of every function in the fast pass through
> the verifier and then do additional CFG walk just to compute max total.
>
> Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
> Reported-by: Jann Horn 
> Signed-off-by: Alexei Starovoitov 

Does this work in cases where multiple invocations of a function have
different stack access patterns because their inputs have different
bounds?

Consider this pseudocode example:

void main(void) {
  func1(0);
  func1(1);
  func2(1);
}
void func1(int alloc_or_recurse) {
  if (alloc_or_recurse) {
frame_pointer[-300] = 1;
  } else {
func2(alloc_or_recurse);
  }
}
void func2(int alloc_or_recurse) {
  if (alloc_or_recurse) {
frame_pointer[-300] = 1;
  }
}

AFAICS this will work as follows:

Call to func1->func2 runs without any stack accesses because the
verifier can prove that alloc_or_recurse is 0.
Second call to func1 allocates stack memory, bumping up
frame_stack_depth[1].
Second call to func2 allocates stack memory, leaving
frame_stack_depth[1] the same.

So I think this will still pass the verifier, even though func1
and func2 will end up with 300 bytes stack memory each, causing
the func1->func2 call to use more stack memory than permitted.
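
(For readers following along: the per-frame estimate from the commit
message boils down to something like the sketch below -- illustrative
only, frame_depth[] is hypothetical -- and this is exactly the check the
example above slips past:

	/* frame_depth[i] = max stack of any function seen at frame i */
	static int check_total_stack_depth(const u16 frame_depth[MAX_CALL_FRAMES])
	{
		u32 depth = 0;
		int i;

		for (i = 0; i < MAX_CALL_FRAMES; i++)
			depth += frame_depth[i];

		return depth > MAX_BPF_STACK ? -EACCES : 0;
	}

With the main()/A()/B() example from the commit message this sums
32 + 256 + 256 = 544 > 512 and rejects the program, although the real
worst case is only 32 + 64 + 256 = 352.)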


Re: [PATCH net-next] net/trace: fix printk format in inet_sock_set_state

2017-12-22 Thread Yafang Shao
On Sat, Dec 23, 2017 at 9:10 AM, Yafang Shao  wrote:
> On Sat, Dec 23, 2017 at 1:04 AM, Sergei Shtylyov
>  wrote:
>> Hello!
>>
>> On 12/22/2017 06:37 PM, Yafang Shao wrote:
>>
>>> There's a space character missed in the printk messages.
>>> This error should be prevented with checkscript.pl, but it couldn't caught
>>
>>  ^ be?
>
> It is checkpatch.pl.
>
>>
>>> by running with "checkscript.pl -f .patch", that's what I had run
>>> before.
>>> What a carelessness.
>>
>>
>>You generally don't need to break up the messages violating 80-column
>> limit, and checkpatch.pl should be aware of this...
>>
>
> Oh. That's right.
> It can be aware of that.
>
> I just want to make the code easy to read and limit the textwidth to
> 80 character.
>
> If the message takes two lines as below,
> printk("xxx "
>  ^ space character.
>   "yyy");
> The checkpatch.pl could also be aware of that if the first line does
> not end with a space character, but it couldn't be aware of that if run
> with "checkpatch.pl -f .patch".
>

Should we also check for that error when we run with
"checkpatch.pl -f"?

>
>>> Fixes: 563e0bb0dc74("net: tracepoint: replace tcp_set_state tracepoint
>>> with
>>> inet_sock_set_state tracepoint")
>>> Signed-off-by: Yafang Shao 
>>
>> [...]
>>
>> MBR, Sergei


[PATCH net-next] net/trace: fix printk format in inet_sock_set_state

2017-12-22 Thread Yafang Shao
There's a space character missing in the printk message.
This error should have been caught by checkpatch.pl, but it isn't caught
when running "checkpatch.pl -f .patch", which is what I had run
before.
What carelessness.

Fixes: 563e0bb0dc74 ("net: tracepoint: replace tcp_set_state tracepoint with
inet_sock_set_state tracepoint")
Cc: Sergei Shtylyov 
Signed-off-by: Yafang Shao 
---
 include/trace/events/sock.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/trace/events/sock.h b/include/trace/events/sock.h
index 3b9094a..598399da 100644
--- a/include/trace/events/sock.h
+++ b/include/trace/events/sock.h
@@ -160,7 +160,7 @@
}
),

-   TP_printk("protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4"
+   TP_printk("protocol=%s sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 "
"saddrv6=%pI6c daddrv6=%pI6c oldstate=%s newstate=%s",
show_inet_protocol_name(__entry->protocol),
__entry->sport, __entry->dport,
--
1.8.3.1



Re: [PATCH net-next] net/trace: fix printk format in inet_sock_set_state

2017-12-22 Thread Yafang Shao
On Sat, Dec 23, 2017 at 1:04 AM, Sergei Shtylyov
 wrote:
> Hello!
>
> On 12/22/2017 06:37 PM, Yafang Shao wrote:
>
>> There's a space character missed in the printk messages.
>> This error should be prevented with checkscript.pl, but it couldn't caught
>
>  ^ be?

It is checkpatch.pl.

>
>> by running with "checkscript.pl -f .patch", that's what I had run
>> before.
>> What a carelessness.
>
>
>You generally don't need to break up the messages violating 80-column
> limit, and checkpatch.pl should be aware of this...
>

Oh. That's right.
It can be aware of that.

I just want to make the code easy to read and limit the textwidth to
80 characters.

If the message takes two lines as below,
printk("xxx "
 ^ space character.
  "yyy");
The checkpatch.pl could also be aware of that if the first line does
not end with a space character, but it couldn't be aware of that if run
with "checkpatch.pl -f .patch".


>> Fixes: 563e0bb0dc74("net: tracepoint: replace tcp_set_state tracepoint
>> with
>> inet_sock_set_state tracepoint")
>> Signed-off-by: Yafang Shao 
>
> [...]
>
> MBR, Sergei


Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing

2017-12-22 Thread Ed Swierk
On Fri, Dec 22, 2017 at 3:31 PM, Pravin Shelar  wrote:
> On Thu, Dec 21, 2017 at 7:17 AM, Ed Swierk  wrote:
>> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
>> included in the L3 length. For example, a short IPv4 packet may have
>> up to 6 bytes of padding following the IP payload when received on an
>> Ethernet device. In the normal IPv4 receive path, ip_rcv() trims the
>> packet to ip_hdr->tot_len before invoking netfilter hooks (including
>> conntrack and nat).
>>
>> In the IPv6 receive path, ip6_rcv() does the same using
>> ipv6_hdr->payload_len. Similarly in the br_netfilter receive path,
>> br_validate_ipv4() and br_validate_ipv6() trim the packet to the L3
>> length before invoking NF_INET_PRE_ROUTING hooks.
>>
>> In the OVS conntrack receive path, ovs_ct_execute() pulls the skb to
>> the L3 header but does not trim it to the L3 length before calling
>> nf_conntrack_in(NF_INET_PRE_ROUTING). When nf_conntrack_proto_tcp
>> encounters a packet with lower-layer padding, nf_checksum() fails and
>> logs "nf_ct_tcp: bad TCP checksum". While extra zero bytes don't
>> affect the checksum, the length in the IP pseudoheader does. That
>> length is based on skb->len, and without trimming, it doesn't match
>> the length the sender used when computing the checksum.
>>
>> The assumption throughout nf_conntrack and nf_nat is that skb->len
>> reflects the length of the L3 header and payload, so there is no need
>> to refer back to ip_hdr->tot_len or ipv6_hdr->payload_len.
>>
>> This change brings OVS into line with other netfilter users, trimming
>> IPv4 and IPv6 packets prior to L3+ netfilter processing.
>>
>> Signed-off-by: Ed Swierk 
>> ---
>> v2:
>> - Trim packet in nat receive path as well as conntrack
>> - Free skb on error
>> ---
>>  net/openvswitch/conntrack.c | 34 ++
>>  1 file changed, 34 insertions(+)
>>
>> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
>> index b27c5c6..1bdc78f 100644
>> --- a/net/openvswitch/conntrack.c
>> +++ b/net/openvswitch/conntrack.c
>> @@ -703,6 +703,33 @@ static bool skb_nfct_cached(struct net *net,
>> return ct_executed;
>>  }
>>
>> +/* Trim the skb to the L3 length. Assumes the skb is already pulled to
>> + * the L3 header. The skb is freed on error.
>> + */
>> +static int skb_trim_l3(struct sk_buff *skb)
>> +{
>> +   unsigned int nh_len;
>> +   int err;
>> +
>> +   switch (skb->protocol) {
>> +   case htons(ETH_P_IP):
>> +   nh_len = ntohs(ip_hdr(skb)->tot_len);
>> +   break;
>> +   case htons(ETH_P_IPV6):
>> +   nh_len = ntohs(ipv6_hdr(skb)->payload_len)
>> +   + sizeof(struct ipv6hdr);
>> +   break;
>> +   default:
>> +   nh_len = skb->len;
>> +   }
>> +
>> +   err = pskb_trim_rcsum(skb, nh_len);
>> +   if (err)
> This should use unlikely().

I'll add unlikely().

>> +   kfree_skb(skb);
>> +
>> +   return err;
>> +}
>> +
> This looks like a generic function, it probably does not belong to OVS
> code base.

Indeed. I'll move it to skbuff.c, unless you have a better idea.

>>  #ifdef CONFIG_NF_NAT_NEEDED
>>  /* Modelled after nf_nat_ipv[46]_fn().
>>   * range is only used for new, uninitialized NAT state.
>> @@ -715,8 +742,12 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, 
>> struct nf_conn *ct,
>>  {
>> int hooknum, nh_off, err = NF_ACCEPT;
>>
>> +   /* The nat module expects to be working at L3. */
>> nh_off = skb_network_offset(skb);
>> skb_pull_rcsum(skb, nh_off);
>> +   err = skb_trim_l3(skb);
>> +   if (err)
>> +   return err;
>>
> ct-nat is executed within the ct action, so I do not see why you call
> skb-trim again from ovs_ct_nat_execute().
> ovs_ct_execute() trim should take care of the skb.

I see. Doesn't that mean that skb_pull_rcsum() is also unnecessary in
ovs_ct_nat_execute(), as ovs_ct_execute() has already pulled the skb
to the L3 header?

>> /* See HOOK2MANIP(). */
>> if (maniptype == NF_NAT_MANIP_SRC)
>> @@ -1111,6 +1142,9 @@ int ovs_ct_execute(struct net *net, struct sk_buff 
>> *skb,
>> /* The conntrack module expects to be working at L3. */
>> nh_ofs = skb_network_offset(skb);
>> skb_pull_rcsum(skb, nh_ofs);
>> +   err = skb_trim_l3(skb);
>> +   if (err)
>> +   return err;
>>
>> if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
>> err = handle_fragments(net, key, info->zone.id, skb);
>> --
>> 1.9.1
>>


Re: thunderx sgmii interface hang

2017-12-22 Thread David Daney

On 12/22/2017 04:22 PM, Tim Harvey wrote:

On Fri, Dec 22, 2017 at 3:00 PM, David Daney  wrote:

On 12/22/2017 02:19 PM, Tim Harvey wrote:


On Tue, Dec 19, 2017 at 12:52 PM, Andrew Lunn  wrote:


On Mon, Dec 18, 2017 at 01:53:47PM -0800, Tim Harvey wrote:


On Wed, Dec 13, 2017 at 11:43 AM, Andrew Lunn  wrote:


The nic appears to work fine (pings, TCP etc) up until a performance
test is attempted.
When an iperf bandwidth test is attempted the nic ends up in a state
where truncated-ip packets are being sent out (per a tcpdump from
another board):



Hi Tim

Are pause frames supported? Have you tried turning them off?

Can you reproduce the issue with UDP? Or is it TCP only?



Andrew,

Pause frames don't appear to be supported yet and the issue occurs
when using UDP as well as TCP. I'm not clear what the best way to
troubleshoot this is.



Hi Tim

Is pause being negotiated? In theory, it should not be. The PHY should
not offer it, if the MAC has not enabled it. But some PHY drivers are
probably broken and offer pause when they should not.

Also, can you trigger the issue using UDP at say 75% the maximum
bandwidth. That should be low enough that the peer never even tries to
use pause.

All this pause stuff is just a stab in the dark. Something else to try
is to turn off various forms of acceleration, ethtool -K, and see if
that makes a difference.



Andrew,

Currently I'm not using the DP83867_PHY driver (after verifying the
issue occurs with or without that driver).

It does not occur if I limit UDP (ie 950mbps). I disabled all offloads
and the issue still occurs.

I have found that once the issue occurs I can recover to a working
state by clearing/setting BGX_CMRX_CFG[BGX_EN] and once I encounter
the issue and recover with that, I can never trigger the issue again.
If I toggle that register bit upon power-up before the issue occurs, it
will still occur.

The CN80XX reference manual describes BGX_CMRX_CFG[BGX_EN] as:
- when cleared all dedicated BGX context state for LMAC (state
machine, FIFOs, counters etc) are reset and LMAC access to shared BGX
resources (data path, serdes lanes) is disabled
- when set LMAC operation is enabled (link bring-up, sync, and tx/rx
of idles and fault sequences)



You could try looking at
BGXX_GMP_PCS_INTX
BGXX_GMP_GMI_RXX_INT
BGXX_GMP_GMI_TXX_INT

Those are all W1C registers that should contain all zeros.  If they don't,
just write back to them to clear before running a test.

If there are bits asserting in these when the thing gets wedged up, it might
point to a possible cause.


David,

BGXX_GMP_GMI_TXX_INT[UNDFLW] is getting set when the issue is
triggered. From CN80XX-HM-1.2P this is caused by:

"In the unlikely event that P2X data cannot keep the GMP TX FIFO full,
the SGMII/1000BASE-X/ QSGMII packet transfer will underflow. This
should be detected by the receiving device as an FCS error.
Internally, the packet is drained and lost"



Yikes!


Perhaps this needs to be caught and handled in some way. There's some
interrupt handlers in nicvf_main.c yet I'm not clear where to hook up
this one.


This would be an interrupt generated by the BGX device, not the NIC 
device  It will have an MSI-X index of (6 + LMAC * 7).  See 
BGX_INT_VEC_E in the HRM.


Note that I am telling you which interrupt it is, but not recommending 
that catching it and doing something is necessarily the best thing to do.






You could also look at these RO registers:
BGXX_GMP_PCS_TXX_STATES
BGXX_GMP_PCS_RXX_STATES



These show the same before/after triggering the issue and
RX_BAD/TX_BAD are still 0.

Tim





Re: thunderx sgmii interface hang

2017-12-22 Thread Tim Harvey
On Fri, Dec 22, 2017 at 3:00 PM, David Daney  wrote:
> On 12/22/2017 02:19 PM, Tim Harvey wrote:
>>
>> On Tue, Dec 19, 2017 at 12:52 PM, Andrew Lunn  wrote:
>>>
>>> On Mon, Dec 18, 2017 at 01:53:47PM -0800, Tim Harvey wrote:

 On Wed, Dec 13, 2017 at 11:43 AM, Andrew Lunn  wrote:
>>
>> The nic appears to work fine (pings, TCP etc) up until a performance
>> test is attempted.
>> When an iperf bandwidth test is attempted the nic ends up in a state
>> where truncated-ip packets are being sent out (per a tcpdump from
>> another board):
>
>
> Hi Tim
>
> Are pause frames supported? Have you tried turning them off?
>
> Can you reproduce the issue with UDP? Or is it TCP only?
>

 Andrew,

 Pause frames don't appear to be supported yet and the issue occurs
 when using UDP as well as TCP. I'm not clear what the best way to
 troubleshoot this is.
>>>
>>>
>>> Hi Tim
>>>
>>> Is pause being negotiated? In theory, it should not be. The PHY should
>>> not offer it, if the MAC has not enabled it. But some PHY drivers are
>>> probably broken and offer pause when they should not.
>>>
>>> Also, can you trigger the issue using UDP at say 75% the maximum
>>> bandwidth. That should be low enough that the peer never even tries to
>>> use pause.
>>>
>>> All this pause stuff is just a stab in the dark. Something else to try
>>> is to turn off various forms of acceleration, ethtool -K, and see if
>>> that makes a difference.
>>>
>>
>> Andrew,
>>
>> Currently I'm not using the DP83867_PHY driver (after verifying the
>> issue occurs with or without that driver).
>>
>> It does not occur if I limit UDP (ie 950mbps). I disabled all offloads
>> and the issue still occurs.
>>
>> I have found that once the issue occurs I can recover to a working
>> state by clearing/setting BGX_CMRX_CFG[BGX_EN] and once I encounter
>> the issue and recover with that, I can never trigger the issue again.
>> If I toggle that register bit upon power-up before the issue occurs, it
>> will still occur.
>>
>> The CN80XX reference manual describes BGX_CMRX_CFG[BGX_EN] as:
>> - when cleared all dedicated BGX context state for LMAC (state
>> machine, FIFOs, counters etc) are reset and LMAC access to shared BGX
>> resources (data path, serdes lanes) is disabled
>> - when set LMAC operation is enabled (link bring-up, sync, and tx/rx
>> of idles and fault sequences)
>
>
> You could try looking at
> BGXX_GMP_PCS_INTX
> BGXX_GMP_GMI_RXX_INT
> BGXX_GMP_GMI_TXX_INT
>
> Those are all W1C registers that should contain all zeros.  If they don't,
> just write back to them to clear before running a test.
>
> If there are bits asserting in these when the thing gets wedged up, it might
> point to a possible cause.

David,

BGXX_GMP_GMI_TXX_INT[UNDFLW] is getting set when the issue is
triggered. From CN80XX-HM-1.2P this is caused by:

"In the unlikely event that P2X data cannot keep the GMP TX FIFO full,
the SGMII/1000BASE-X/ QSGMII packet transfer will underflow. This
should be detected by the receiving device as an FCS error.
Internally, the packet is drained and lost"

Perhaps this needs to be caught and handled in some way. There's some
interrupt handlers in nicvf_main.c yet I'm not clear where to hook up
this one.

>
> You could also look at these RO registers:
> BGXX_GMP_PCS_TXX_STATES
> BGXX_GMP_PCS_RXX_STATES
>

These show the same before/after triggering the issue and
RX_BAD/TX_BAD are still 0.

Tim


Re: thunderx sgmii interface hang

2017-12-22 Thread Tim Harvey
On Fri, Dec 22, 2017 at 2:45 PM, Andrew Lunn  wrote:
>> Currently I'm not using the DP83867_PHY driver (after verifying the
>> issue occurs with or without that driver).
>>
>> It does not occur if I limit UDP (ie 950mbps). I disabled all offloads
>> and the issue still occurs.
>
>> I'm told that the particular Cavium reference board with an SGMII phy
>> doesn't show this issue (I don't have that specific board to do my own
>> testing or comparisons against our board) so I'm inclined to think it
>> has something to do with an interaction with the DP83867 PHY. I would
>> like to start poking at PHY registers to see if I can find anything
>> unusual. The best way to do that from userspace is via
>> SIOCGMIIREG/SIOCSMIIREG right? The thunderx nic doesn't currently
>> support ioctl's so I guess I'll have to add that support unless
>> there's a way to get at phy registers from userspace through a phy
>> driver?
>
> phy_mii_ioctl() does what you need, and is simple to use.
>
> mii-tool will then give you access to the standard PHY registers.
>
>  Andrew

I didn't think mii-tool or ethtool (its replacement right?) could
read/write specific phy registers via cmdline? I wrote my own tool
some time ago to do that [1].

Tim

[1] http://trac.gateworks.com/attachment/wiki/conformance_testing/mii-reg.c
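
For reference, a minimal userspace sketch of SIOCGMIIREG once the driver
wires up ndo_do_ioctl (interface name and register are placeholders):

	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <sys/socket.h>
	#include <net/if.h>
	#include <linux/mii.h>
	#include <linux/sockios.h>

	int main(void)
	{
		struct ifreq ifr;
		struct mii_ioctl_data *mii = (struct mii_ioctl_data *)&ifr.ifr_data;
		int fd = socket(AF_INET, SOCK_DGRAM, 0);

		memset(&ifr, 0, sizeof(ifr));
		strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);

		if (ioctl(fd, SIOCGMIIPHY, &ifr) < 0)	/* fills mii->phy_id */
			return 1;
		mii->reg_num = MII_BMSR;	/* basic mode status register */
		if (ioctl(fd, SIOCGMIIREG, &ifr) < 0)
			return 1;
		printf("phy %d reg %d = 0x%04x\n", mii->phy_id, mii->reg_num,
		       mii->val_out);
		close(fd);
		return 0;
	}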


Re: [bpf-next V2 PATCH 01/14] xdp: base API for new XDP rx-queue info concept

2017-12-22 Thread Jakub Kicinski
On Fri, 22 Dec 2017 18:11:39 +0100, Jesper Dangaard Brouer wrote:
> +struct xdp_rxq_info {
> + struct net_device *dev;
> + u32 queue_index;
> + u32 reg_state;
> +} ____cacheline_aligned; /* perf critical, avoid false-sharing */

I'm assuming this is cacheline_aligned, because of some stuff you will
add here in the future for the completion path?  (The comment could
mention that this data is read-mostly.)  Drivers are likely to already
have a read-mostly (or unused-mostly) section of the rx ring structure.
Would it be possible to define this in a way that would allow people
who carefully lay out their data path structures to save cache space?


Re: [PATCH bpf 0/2] tools: bpftool: fix unlikely race and JSON output on error path

2017-12-22 Thread Daniel Borkmann
On 12/22/2017 08:36 PM, Jakub Kicinski wrote:
> Hi!
> 
> Two small fixes here to listing maps and programs.  The loop for showing
> maps is written slightly differently to programs which was missed in JSON
> output support, and output would be broken if any of the system calls
> failed.  The second fix is for the very unlikely case that a program or map
> disappears after we get its ID; we should just skip over that object instead
> of failing.

Applied to bpf tree, thanks Jakub!


Re: pull-request: bpf-next 2017-12-18

2017-12-22 Thread Daniel Borkmann
On 12/22/2017 03:42 PM, David Miller wrote:
> From: Daniel Borkmann 
> Date: Fri, 22 Dec 2017 00:48:22 +0100
> 
>> Looks good, one thing: If I spot this correctly, isn't here a ...
>>
>>  prog->aux->jit_data = jit_data;
>>
>> ... missing? Otherwise the context from the initial pass is neither
>> saved for the extra pass nor freed.
> 
> Good catch, here is an updated patch:
> 
> 
> bpf: sparc64: Add JIT support for multi-function programs.
> 
> Modelled strongly upon the arm64 implementation.
> 
> Signed-off-by: David S. Miller 

Applied to bpf-next, thanks David!


[PATCH net-next 1/4] skbuff: in skb_segment, call zerocopy functions once per nskb

2017-12-22 Thread Willem de Bruijn
From: Willem de Bruijn 

This is a net-next follow-up to commit 268b79067942 ("skbuff: orphan
frags before zerocopy clone"), which fixed a bug in net, but added a
call to skb_zerocopy_clone at each frag to do so.

When segmenting skbs with user frags, either the user frags must be
replaced with private copies and uarg released, or the uarg must have
its refcount increased for each new skb.

skb_orphan_frags does the first, except for cases that can handle
reference counting. skb_zerocopy_clone then does the second.

Call these once per nskb, instead of once per frag.

That is, in the common case. With a frag list, also refresh when the
origin skb (frag_skb) changes.

Signed-off-by: Willem de Bruijn 
---
 net/core/skbuff.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a3cb0be4c6f3..00b0757830e2 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3656,6 +3656,10 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
skb_shinfo(nskb)->tx_flags |= skb_shinfo(head_skb)->tx_flags &
  SKBTX_SHARED_FRAG;
 
+   if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
+   skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC))
+   goto err;
+
while (pos < offset + len) {
if (i >= nfrags) {
BUG_ON(skb_headlen(list_skb));
@@ -3667,6 +3671,11 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
 
BUG_ON(!nfrags);
 
+   if (skb_orphan_frags(frag_skb, GFP_ATOMIC) ||
+   skb_zerocopy_clone(nskb, frag_skb,
+  GFP_ATOMIC))
+   goto err;
+
list_skb = list_skb->next;
}
 
@@ -3678,11 +3687,6 @@ struct sk_buff *skb_segment(struct sk_buff *head_skb,
goto err;
}
 
-   if (unlikely(skb_orphan_frags(frag_skb, GFP_ATOMIC)))
-   goto err;
-   if (skb_zerocopy_clone(nskb, frag_skb, GFP_ATOMIC))
-   goto err;
-
*nskb_frag = *frag;
__skb_frag_ref(nskb_frag);
size = skb_frag_size(nskb_frag);
-- 
2.15.1.620.gb9897f4670-goog



[PATCH net-next 3/4] tcp: place all zerocopy payload in frags

2017-12-22 Thread Willem de Bruijn
From: Willem de Bruijn 

This avoids an unnecessary copy of 1-2KB and improves tso_fragment,
which has to fall back to tcp_fragment if skb->len != skb->data_len.

It also avoids a surprising inconsistency in notifications:
Zerocopy packets sent over loopback have their frags copied, so set
SO_EE_CODE_ZEROCOPY_COPIED in the notification. But this currently
does not happen for small packets, because when all data fits in the
linear fragment, data is not copied in skb_orphan_frags_rx.

Reported-by: Tom Deseyn 
Signed-off-by: Willem de Bruijn 
---
 net/ipv4/tcp.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 44102484a76f..947348872c3e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1186,7 +1186,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
int flags, err, copied = 0;
int mss_now = 0, size_goal, copied_syn = 0;
bool process_backlog = false;
-   bool sg;
+   bool sg, zc = false;
long timeo;
 
flags = msg->msg_flags;
@@ -1204,7 +1204,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
goto out_err;
}
 
-   if (!(sk_check_csum_caps(sk) && sk->sk_route_caps & NETIF_F_SG))
+   zc = sk_check_csum_caps(sk) && sk->sk_route_caps & NETIF_F_SG;
+   if (!zc)
uarg->zerocopy = 0;
}
 
@@ -1325,13 +1326,13 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
copy = msg_data_left(msg);
 
/* Where to copy to? */
-   if (skb_availroom(skb) > 0) {
+   if (skb_availroom(skb) > 0 && !zc) {
/* We have some space in skb head. Superb! */
copy = min_t(int, copy, skb_availroom(skb));
err = skb_add_data_nocache(sk, skb, &msg->msg_iter, 
copy);
if (err)
goto do_fault;
-   } else if (!uarg || !uarg->zerocopy) {
+   } else if (!zc) {
bool merge = true;
int i = skb_shinfo(skb)->nr_frags;
struct page_frag *pfrag = sk_page_frag(sk);
-- 
2.15.1.620.gb9897f4670-goog



[PATCH net-next 4/4] tcp: do not allocate linear memory for zerocopy skbs

2017-12-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Zerocopy payload is now always stored in frags, and space for headers
is reserved, so this memory is unused.

Signed-off-by: Willem de Bruijn 
---
 net/ipv4/tcp.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 947348872c3e..7ac583a2b9fe 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1104,12 +1104,15 @@ static int linear_payload_sz(bool first_skb)
return 0;
 }
 
-static int select_size(const struct sock *sk, bool sg, bool first_skb)
+static int select_size(const struct sock *sk, bool sg, bool first_skb, bool zc)
 {
const struct tcp_sock *tp = tcp_sk(sk);
int tmp = tp->mss_cache;
 
if (sg) {
+   if (zc)
+   return 0;
+
if (sk_can_gso(sk)) {
tmp = linear_payload_sz(first_skb);
} else {
@@ -1282,6 +1285,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
 
if (copy <= 0 || !tcp_skb_can_collapse_to(skb)) {
bool first_skb;
+   int linear;
 
 new_segment:
/* Allocate new segment. If the interface is SG,
@@ -1295,9 +1299,8 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
goto restart;
}
first_skb = tcp_rtx_and_write_queues_empty(sk);
-   skb = sk_stream_alloc_skb(sk,
- select_size(sk, sg, 
first_skb),
- sk->sk_allocation,
+   linear = select_size(sk, sg, first_skb, zc);
+   skb = sk_stream_alloc_skb(sk, linear, sk->sk_allocation,
  first_skb);
if (!skb)
goto wait_for_memory;
-- 
2.15.1.620.gb9897f4670-goog



[PATCH net-next 2/4] tcp: push full zerocopy packets

2017-12-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Skbs that reach MAX_SKB_FRAGS cannot be extended further. Do the
same for zerocopy frags as non-zerocopy frags and set the PSH bit.
This improves GRO assembly.

Suggested-by: Eric Dumazet 
Signed-off-by: Willem de Bruijn 
---
 net/ipv4/tcp.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 67d39b79c801..44102484a76f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1371,8 +1371,10 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr 
*msg, size_t size)
pfrag->offset += copy;
} else {
err = skb_zerocopy_iter_stream(sk, skb, msg, copy, 
uarg);
-   if (err == -EMSGSIZE || err == -EEXIST)
+   if (err == -EMSGSIZE || err == -EEXIST) {
+   tcp_mark_push(tp, skb);
goto new_segment;
+   }
if (err < 0)
goto do_error;
copy = err;
-- 
2.15.1.620.gb9897f4670-goog



[PATCH net-next 0/4] zerocopy refinements

2017-12-22 Thread Willem de Bruijn
From: Willem de Bruijn 

1/4 is a small optimization follow-up to the earlier fix to skb_segment:
check skb state once per skb, instead of once per frag.
2/4 makes behavior more consistent between standard and zerocopy send:
set the PSH bit when hitting MAX_SKB_FRAGS. This helps GRO.
3/4 resolves a surprising inconsistency in notification:
because small packets were not stored in frags, they would not set
the copied error code over loopback. This change also optimizes
the path by removing copying and making tso_fragment cheaper.
4/4 follows-up to 3/4 by no longer allocated now unused memory.
this was actually already in RFC patches, but dropped as I pared
down the patch set during revisions.
   
Willem de Bruijn (4):
  skbuff: in skb_segment, call zerocopy functions once per nskb
  tcp: push full zerocopy packets
  tcp: place all zerocopy payload in frags
  tcp: do not allocate linear memory for zerocopy skbs

 net/core/skbuff.c | 14 +-
 net/ipv4/tcp.c| 24 +++-
 2 files changed, 24 insertions(+), 14 deletions(-)

-- 
2.15.1.620.gb9897f4670-goog



Re: [PATCH RFC 13/18] r8168: replace speed_down with genphy_restart_aneg

2017-12-22 Thread Heiner Kallweit
On 22.12.2017 11:14, Andrew Lunn wrote:
> On Thu, Dec 21, 2017 at 09:50:39PM +0100, Heiner Kallweit wrote:
>> Dealing with link partner abilities is handled by phylib, so let's
>> just trigger autonegotiation here.
>>
>> Signed-off-by: Heiner Kallweit 
>> ---
>>  drivers/net/ethernet/realtek/r8168.c | 26 +-
>>  1 file changed, 1 insertion(+), 25 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/realtek/r8168.c 
>> b/drivers/net/ethernet/realtek/r8168.c
>> index d33f93a31..6b398915f 100644
>> --- a/drivers/net/ethernet/realtek/r8168.c
>> +++ b/drivers/net/ethernet/realtek/r8168.c
>> @@ -4360,30 +4360,6 @@ static void rtl_init_mdio_ops(struct rtl8168_private 
>> *tp)
>>  }
>>  }
>>  
>> -static void rtl_speed_down(struct rtl8168_private *tp)
>> -{
>> -u32 adv;
>> -int lpa;
>> -
>> -rtl_writephy(tp, 0x1f, 0x);
>> -lpa = rtl_readphy(tp, MII_LPA);
>> -
>> -if (lpa & (LPA_10HALF | LPA_10FULL))
>> -adv = ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full;
>> -else if (lpa & (LPA_100HALF | LPA_100FULL))
>> -adv = ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full |
>> -  ADVERTISED_100baseT_Half | ADVERTISED_100baseT_Full;
>> -else
>> -adv = ADVERTISED_10baseT_Half | ADVERTISED_10baseT_Full |
>> -  ADVERTISED_100baseT_Half | ADVERTISED_100baseT_Full |
>> -  (tp->mii.supports_gmii ?
>> -   ADVERTISED_1000baseT_Half |
>> -   ADVERTISED_1000baseT_Full : 0);
>> -
>> -rtl8168_set_speed(tp->dev, AUTONEG_ENABLE, SPEED_1000, DUPLEX_FULL,
>> -  adv);
>> -}
>> -
>>  static void rtl_wol_suspend_quirk(struct rtl8168_private *tp)
>>  {
>>  void __iomem *ioaddr = tp->mmio_addr;
>> @@ -4424,7 +4400,7 @@ static bool rtl_wol_pll_power_down(struct 
>> rtl8168_private *tp)
>>  if (!(__rtl8168_get_wol(tp) & WAKE_ANY))
>>  return false;
>>  
>> -rtl_speed_down(tp);
>> +genphy_restart_aneg(tp->dev->phydev);
>>  rtl_wol_suspend_quirk(tp);
>>  
>>  return true;
> 
> I'm not too clear what is going on here? Is this suspend while WOL is
> enabled? There should be no need to change the PHY settings. The PHY
> driver should leave the PHY running in whatever state it was
> configured to. The only danger here is that the MAC driver has called
> phy_stop() during suspend. That should not be done when WOL is
> enabled.
> 
> Is the wol being passed to the phylib? phy_ethtool_set_wol() and
> phy_ethtool_get_wol()?
> 
I also have a hard time understanding what this is good for.
Here's the mail thread regarding introduction of this function:
https://lkml.org/lkml/2013/4/2/669

If I understand this correctly it's about fixing an issue when
aneg is disabled and the PHY automatically changes the speed.

In this case we may have to use genphy_config_aneg here, which
calls genphy_setup_forced if aneg is disabled.
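
As a sketch, that suggestion would amount to something like (untested):

-	genphy_restart_aneg(tp->dev->phydev);
+	/* restarts aneg, or re-applies the forced speed/duplex via
+	 * genphy_setup_forced() when aneg is disabled
+	 */
+	genphy_config_aneg(tp->dev->phydev);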

>   Andrew
> 



Re: [PATCH v2] openvswitch: Trim off padding before L3+ netfilter processing

2017-12-22 Thread Pravin Shelar
On Thu, Dec 21, 2017 at 7:17 AM, Ed Swierk  wrote:
> IPv4 and IPv6 packets may arrive with lower-layer padding that is not
> included in the L3 length. For example, a short IPv4 packet may have
> up to 6 bytes of padding following the IP payload when received on an
> Ethernet device. In the normal IPv4 receive path, ip_rcv() trims the
> packet to ip_hdr->tot_len before invoking netfilter hooks (including
> conntrack and nat).
>
> In the IPv6 receive path, ip6_rcv() does the same using
> ipv6_hdr->payload_len. Similarly in the br_netfilter receive path,
> br_validate_ipv4() and br_validate_ipv6() trim the packet to the L3
> length before invoking NF_INET_PRE_ROUTING hooks.
>
> In the OVS conntrack receive path, ovs_ct_execute() pulls the skb to
> the L3 header but does not trim it to the L3 length before calling
> nf_conntrack_in(NF_INET_PRE_ROUTING). When nf_conntrack_proto_tcp
> encounters a packet with lower-layer padding, nf_checksum() fails and
> logs "nf_ct_tcp: bad TCP checksum". While extra zero bytes don't
> affect the checksum, the length in the IP pseudoheader does. That
> length is based on skb->len, and without trimming, it doesn't match
> the length the sender used when computing the checksum.
>
> The assumption throughout nf_conntrack and nf_nat is that skb->len
> reflects the length of the L3 header and payload, so there is no need
> to refer back to ip_hdr->tot_len or ipv6_hdr->payload_len.
>
> This change brings OVS into line with other netfilter users, trimming
> IPv4 and IPv6 packets prior to L3+ netfilter processing.
>
> Signed-off-by: Ed Swierk 
> ---
> v2:
> - Trim packet in nat receive path as well as conntrack
> - Free skb on error
> ---
>  net/openvswitch/conntrack.c | 34 ++
>  1 file changed, 34 insertions(+)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index b27c5c6..1bdc78f 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -703,6 +703,33 @@ static bool skb_nfct_cached(struct net *net,
> return ct_executed;
>  }
>
> +/* Trim the skb to the L3 length. Assumes the skb is already pulled to
> + * the L3 header. The skb is freed on error.
> + */
> +static int skb_trim_l3(struct sk_buff *skb)
> +{
> +   unsigned int nh_len;
> +   int err;
> +
> +   switch (skb->protocol) {
> +   case htons(ETH_P_IP):
> +   nh_len = ntohs(ip_hdr(skb)->tot_len);
> +   break;
> +   case htons(ETH_P_IPV6):
> +   nh_len = ntohs(ipv6_hdr(skb)->payload_len)
> +   + sizeof(struct ipv6hdr);
> +   break;
> +   default:
> +   nh_len = skb->len;
> +   }
> +
> +   err = pskb_trim_rcsum(skb, nh_len);
> +   if (err)
This should be marked unlikely().
> +   kfree_skb(skb);
> +
> +   return err;
> +}
> +
This looks like a generic function, it probably does not belong to OVS
code base.
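
For illustration, a shared helper could look roughly like this (the name
and location are hypothetical, and the caller would still have to free
the skb on error):

/* Trim an skb to its L3 length so that lower-layer padding is dropped
 * before netfilter processing. Assumes the skb is already pulled to
 * the L3 header.
 */
static inline int skb_trim_network_len(struct sk_buff *skb)
{
        unsigned int len;

        switch (skb->protocol) {
        case htons(ETH_P_IP):
                len = ntohs(ip_hdr(skb)->tot_len);
                break;
        case htons(ETH_P_IPV6):
                len = sizeof(struct ipv6hdr) +
                      ntohs(ipv6_hdr(skb)->payload_len);
                break;
        default:
                return 0;       /* nothing to trim for other protocols */
        }

        return pskb_trim_rcsum(skb, len);
}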

>  #ifdef CONFIG_NF_NAT_NEEDED
>  /* Modelled after nf_nat_ipv[46]_fn().
>   * range is only used for new, uninitialized NAT state.
> @@ -715,8 +742,12 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct,
>  {
> int hooknum, nh_off, err = NF_ACCEPT;
>
> +   /* The nat module expects to be working at L3. */
> nh_off = skb_network_offset(skb);
> skb_pull_rcsum(skb, nh_off);
> +   err = skb_trim_l3(skb);
> +   if (err)
> +   return err;
>
ct-nat is executed within the ct action, so I do not see why you call
skb_trim_l3() again from ovs_ct_nat_execute().
ovs_ct_execute() trim should take care of the skb.

> /* See HOOK2MANIP(). */
> if (maniptype == NF_NAT_MANIP_SRC)
> @@ -1111,6 +1142,9 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
> /* The conntrack module expects to be working at L3. */
> nh_ofs = skb_network_offset(skb);
> skb_pull_rcsum(skb, nh_ofs);
> +   err = skb_trim_l3(skb);
> +   if (err)
> +   return err;
>
> if (key->ip.frag != OVS_FRAG_TYPE_NONE) {
> err = handle_fragments(net, key, info->zone.id, skb);
> --
> 1.9.1
>


Re: [PATCH RFC 12/18] r8168: switch to phy_mii_ioctl

2017-12-22 Thread Heiner Kallweit
Am 22.12.2017 um 11:00 schrieb Andrew Lunn:
>>  static int rtl8168_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
>>  {
>> -struct rtl8168_private *tp = netdev_priv(dev);
>> -struct mii_ioctl_data *data = if_mii(ifr);
>> +if (!netif_running(dev))
>> +return -ENODEV;
> 
> It is sometimes useful to be able to prod and poke the PHY when the
> MAC is not running. Is this limit because of power saving?
> 
Actually I'm not 100% sure, I just kept the existing logic.
When checking other users of phy_mii_ioctl I saw that it is
a typical pattern to return an error if the interface is down.
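
For reference, the common pattern looks roughly like this (a generic
sketch, not the r8168 code; some drivers return -EBUSY or -EOPNOTSUPP
instead of -ENODEV):

static int example_ioctl(struct net_device *dev, struct ifreq *ifr, int cmd)
{
        if (!netif_running(dev))
                return -ENODEV;

        if (!dev->phydev)
                return -EOPNOTSUPP;

        return phy_mii_ioctl(dev->phydev, ifr, cmd);
}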

> Otherwise, this patch looks good.
> 
> Thanks
>   Andrew
> 



Re: thunderx sgmii interface hang

2017-12-22 Thread David Daney

On 12/22/2017 02:19 PM, Tim Harvey wrote:

On Tue, Dec 19, 2017 at 12:52 PM, Andrew Lunn  wrote:

On Mon, Dec 18, 2017 at 01:53:47PM -0800, Tim Harvey wrote:

On Wed, Dec 13, 2017 at 11:43 AM, Andrew Lunn  wrote:

The nic appears to work fine (pings, TCP etc) up until a performance
test is attempted.
When an iperf bandwidth test is attempted the nic ends up in a state
where truncated-ip packets are being sent out (per a tcpdump from
another board):


Hi Tim

Are pause frames supported? Have you tried turning them off?

Can you reproduce the issue with UDP? Or is it TCP only?



Andrew,

Pause frames don't appear to be supported yet and the issue occurs
when using UDP as well as TCP. I'm not clear what the best way to
troubleshoot this is.


Hi Tim

Is pause being negotiated? In theory, it should not be. The PHY should
not offer it, if the MAC has not enabled it. But some PHY drivers are
probably broken and offer pause when they should not.

Also, can you trigger the issue using UDP at, say, 75% of the maximum
bandwidth. That should be low enough that the peer never even tries to
use pause.

All this pause stuff is just a stab in the dark. Something else to try
is to turn off various forms of acceleration, ethtool -K, and see if
that makes a difference.



Andrew,

Currently I'm not using the DP83867_PHY driver (after verifying the
issue occurs with or without that driver).

It does not occur if I limit UDP (i.e. to 950 Mbps). I disabled all offloads
and the issue still occurs.

I have found that once the issue occurs I can recover to a working
state by clearing/setting BGX_CMRX_CFG[BGX_EN] and once I encounter
the issue and recover with that, I can never trigger the issue again.
If I toggle that register bit upon power-up before the issue occurs it
will still occur.

The CN80XX reference manual describes BGX_CMRX_CFG[BGX_EN] as:
- when cleared all dedicated BGX context state for LMAC (state
machine, FIFOs, counters etc) are reset and LMAC access to shared BGX
resources (data path, serdes lanes) is disabled
- when set LMAC operation is enabled (link bring-up, sync, and tx/rx
of idles and fault sequences)


You could try looking at
BGXX_GMP_PCS_INTX
BGXX_GMP_GMI_RXX_INT
BGXX_GMP_GMI_TXX_INT

Those are all W1C registers that should contain all zeros.  If they 
don't, just write back to them to clear before running a test.


If there are bits asserting in these when the thing gets wedged up, it 
might point to a possible cause.
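
Something along these lines could clear them from the driver before a
run (a sketch only; the accessors mirror the thunder BGX driver style
and the offset macro name is assumed, so check the header for the exact
define):

static void bgx_clear_lmac_w1c(struct bgx *bgx, int lmacid)
{
        u64 val;

        /* W1C semantics: reading returns the latched bits, writing
         * the same value back clears them.
         */
        val = bgx_reg_read(bgx, lmacid, BGX_GMP_PCS_INT);
        if (val)
                bgx_reg_write(bgx, lmacid, BGX_GMP_PCS_INT, val);
}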


You could also look at these RO registers:
BGXX_GMP_PCS_TXX_STATES
BGXX_GMP_PCS_RXX_STATES






I'm told that the particular Cavium reference board with an SGMII phy
doesn't show this issue (I don't have that specific board to do my own
testing or comparisons against our board) so I'm inclined to think it
has something to do with an interaction with the DP83867 PHY. I would
like to start poking at PHY registers to see if I can find anything
unusual. The best way to do that from userspace is via
SIOCGMIIREG/SIOCSMIIREG right? The thunderx nic doesn't currently
support ioctl's so I guess I'll have to add that support unless
there's a way to get at phy registers from userspace through a phy
driver?

Regards,

Tim





Re: thunderx sgmii interface hang

2017-12-22 Thread Andrew Lunn
> Currently I'm not using the DP83867_PHY driver (after verifying the
> issue occurs with or without that driver).
> 
> It does not occur if I limit UDP (i.e. to 950 Mbps). I disabled all offloads
> and the issue still occurs.

> I'm told that the particular Cavium reference board with an SGMII phy
> doesn't show this issue (I don't have that specific board to do my own
> testing or comparisons against our board) so I'm inclined to think it
> has something to do with an interaction with the DP83867 PHY. I would
> like to start poking at PHY registers to see if I can find anything
> unusual. The best way to do that from userspace is via
> SIOCGMIIREG/SIOCSMIIREG right? The thunderx nic doesn't currently
> support ioctl's so I guess I'll have to add that support unless
> there's a way to get at phy registers from userspace through a phy
> driver?

phy_mii_ioctl() does what you need, and is simple to use.

mii-tool will then give you access to the standard PHY registers.
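
If you'd rather script it than use mii-tool, the userspace side is just
a couple of ioctls (a minimal sketch with error handling trimmed; it
only works once the driver implements the MII ioctls):

#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <linux/mii.h>
#include <linux/sockios.h>

static int read_phy_reg(const char *ifname, int reg)
{
        struct ifreq ifr;
        struct mii_ioctl_data *mii = (struct mii_ioctl_data *)&ifr.ifr_data;
        int ret = -1;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        if (fd < 0)
                return -1;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);

        /* SIOCGMIIPHY fills in mii->phy_id for the attached PHY */
        if (ioctl(fd, SIOCGMIIPHY, &ifr) == 0) {
                mii->reg_num = reg;
                if (ioctl(fd, SIOCGMIIREG, &ifr) == 0)
                        ret = mii->val_out;
        }

        close(fd);
        return ret;
}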

 Andrew


Re: [PATCH v3 next-queue 08/10] ixgbe: process the Tx ipsec offload

2017-12-22 Thread Shannon Nelson

On 12/22/2017 12:24 AM, Yanjun Zhu wrote:

On 2017/12/20 8:00, Shannon Nelson wrote:

If the skb has a security association referenced in the skb, then
set up the Tx descriptor with the ipsec offload bits.  While we're
here, we fix an oddly named field in the context descriptor struct.



[...]


+int ixgbe_ipsec_tx(struct ixgbe_ring *tx_ring,
+   struct ixgbe_tx_buffer *first,
+   struct ixgbe_ipsec_tx_data *itd)
+{
+    struct ixgbe_adapter *adapter = netdev_priv(tx_ring->netdev);
+    struct ixgbe_ipsec *ipsec = adapter->ipsec;
+    struct xfrm_state *xs;
+    struct tx_sa *tsa;
+
+    if (!first->skb->sp->len) {

Hi, Nelson

The function ixgbe_ipsec_tx is called in the tx fast path. Can we add
unlikely() as below:

if (unlikely(!first->skb->sp->len)) ?

If I am wrong, please correct me.

Thanks a lot.
Zhu Yanjun


Yes, we can probably throw those in.  I'll be working on this code in 
the new year to get the checksum and TSO bits working and can add these 
at that time.


sln




Re: [PATCH RFC 09/18] r8168: use genphy_soft_reset instead of open coding the soft reset

2017-12-22 Thread Heiner Kallweit
Am 22.12.2017 um 10:57 schrieb Andrew Lunn:
> On Thu, Dec 21, 2017 at 09:50:28PM +0100, Heiner Kallweit wrote:
>> Use genphy_soft_reset instead of open coding the soft reset.
> 
> Hi Heiner
> 
> At this point, you have swapped over the phylib. Does one of the
> drivers in drivers/net/phy now take control of the PHY? Does the PHY
> ID match one of those in realtek.c?
> 
> The PHY driver and phylib should be responsible for resetting the
> PHY. The MAC driver should not need to do this.
> 
In my case the PHY ID matches the existing RTL8211E driver in
realtek.c and the MAC driver wouldn't have to do any soft reset.

However there may be chips with a PHY ID not yet supported
by realtek.c. AFAICS in such a case phylib uses genphy_driver,
which has a no-op soft reset.
Therefore I'd tend to say that we can remove the PHY soft reset
from the MAC driver only after being sure that all PHY IDs of
supported NICs have a driver in realtek.c.

>  Andrew
> 
Heiner

By the way: Thanks for the review comments.


Re: thunderx sgmii interface hang

2017-12-22 Thread Tim Harvey
On Tue, Dec 19, 2017 at 12:52 PM, Andrew Lunn  wrote:
> On Mon, Dec 18, 2017 at 01:53:47PM -0800, Tim Harvey wrote:
>> On Wed, Dec 13, 2017 at 11:43 AM, Andrew Lunn  wrote:
>> >> The nic appears to work fine (pings, TCP etc) up until a performance
>> >> test is attempted.
>> >> When an iperf bandwidth test is attempted the nic ends up in a state
>> >> where truncated-ip packets are being sent out (per a tcpdump from
>> >> another board):
>> >
>> > Hi Tim
>> >
>> > Are pause frames supported? Have you tried turning them off?
>> >
>> > Can you reproduce the issue with UDP? Or is it TCP only?
>> >
>>
>> Andrew,
>>
>> Pause frames don't appear to be supported yet and the issue occurs
>> when using UDP as well as TCP. I'm not clear what the best way to
>> troubleshoot this is.
>
> Hi Tim
>
> Is pause being negotiated? In theory, it should not be. The PHY should
> not offer it, if the MAC has not enabled it. But some PHY drivers are
> probably broken and offer pause when they should not.
>
> Also, can you trigger the issue using UDP at, say, 75% of the maximum
> bandwidth. That should be low enough that the peer never even tries to
> use pause.
>
> All this pause stuff is just a stab in the dark. Something else to try
> is to turn off various forms of acceleration, ethtool -K, and see if
> that makes a difference.
>

Andrew,

Currently I'm not using the DP83867_PHY driver (after verifying the
issue occurs with or without that driver).

It does not occur if I limit UDP (i.e. to 950 Mbps). I disabled all offloads
and the issue still occurs.

I have found that once the issue occurs I can recover to a working
state by clearing/setting BGX_CMRX_CFG[BGX_EN] and once I encounter
the issue and recover with that, I can never trigger the issue again.
If I toggle that register bit upon power-up before the issue occurs it
will still occur.

The CN80XX reference manual describes BGX_CMRX_CFG[BGX_EN] as:
- when cleared all dedicated BGX context state for LMAC (state
machine, FIFOs, counters etc) are reset and LMAC access to shared BGX
resources (data path, serdes lanes) is disabled
- when set LMAC operation is enabled (link bring-up, sync, and tx/rx
of idles and fault sequences)

I'm told that the particular Cavium reference board with an SGMII phy
doesn't show this issue (I don't have that specific board to do my own
testing or comparisons against our board) so I'm inclined to think it
has something to do with an interaction with the DP83867 PHY. I would
like to start poking at PHY registers to see if I can find anything
unusual. The best way to do that from userspace is via
SIOCGMIIREG/SIOCSMIIREG right? The thunderx nic doesn't currently
support ioctl's so I guess I'll have to add that support unless
there's a way to get at phy registers from userspace through a phy
driver?

Regards,

Tim


RE: [Intel-wired-lan] [PATCH] e1000e: Fix e1000_check_for_copper_link_ich8lan return value.

2017-12-22 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Neftin, Sasha
> Sent: Wednesday, December 20, 2017 10:57 PM
> To: Benjamin Poirier ; Kirsher, Jeffrey T
> 
> Cc: Ben Hutchings ; Gabriel C
> ; netdev@vger.kernel.org; Christian Hesse
> ; sta...@vger.kernel.org; linux-ker...@vger.kernel.org;
> intel-wired-...@lists.osuosl.org
> Subject: Re: [Intel-wired-lan] [PATCH] e1000e: Fix
> e1000_check_for_copper_link_ich8lan return value.
> 
> On 11/12/2017 9:26, Benjamin Poirier wrote:
> > e1000e_check_for_copper_link() and e1000_check_for_copper_link_ich8lan()
> > are the two functions that may be assigned to mac.ops.check_for_link when
> > phy.media_type == e1000_media_type_copper. Commit 19110cfbb34d ("e1000e:
> > Separate signaling for link check/link up") changed the meaning of the
> > return value of check_for_link for copper media but only adjusted the first
> > function. This patch adjusts the second function likewise.
> >
> > Reported-by: Christian Hesse 
> > Reported-by: Gabriel C 
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=198047
> > Fixes: 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
> > Tested-by: Christian Hesse 
> > Signed-off-by: Benjamin Poirier 
> > ---
> >   drivers/net/ethernet/intel/e1000e/ich8lan.c | 11 ---
> >   1 file changed, 8 insertions(+), 3 deletions(-)
> >

Tested-by: Aaron Brown 

> Acked by sasha.nef...@intel.com
> 
> ___
> Intel-wired-lan mailing list
> intel-wired-...@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan


[PATCH 3/3] qemu: add linkspeed and duplex settings to virtio-net

2017-12-22 Thread Jason Baron
Although linkspeed and duplex can be set in a Linux guest via 'ethtool -s',
this requires custom ethtool commands for virtio-net by default.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Linkspeed and duplex settings can be set as:
'-device virtio-net,speed=1,duplex=full'

where speed is [-1...INT_MAX], and duplex is ["half"|"full"].

Signed-off-by: Jason Baron 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
---
 hw/net/virtio-net.c | 29 +
 include/hw/virtio/virtio-net.h  |  3 +++
 include/standard-headers/linux/virtio_net.h |  4 
 3 files changed, 36 insertions(+)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index adc20df..7fafe6e 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -40,6 +40,12 @@
 #define VIRTIO_NET_RX_QUEUE_MIN_SIZE VIRTIO_NET_RX_QUEUE_DEFAULT_SIZE
 #define VIRTIO_NET_TX_QUEUE_MIN_SIZE VIRTIO_NET_TX_QUEUE_DEFAULT_SIZE
 
+/* duplex and speed defines */
+#define DUPLEX_UNKNOWN  0xff
+#define DUPLEX_HALF 0x00
+#define DUPLEX_FULL 0x01
+#define SPEED_UNKNOWN   -1
+
 /*
  * Calculate the number of bytes up to and including the given 'field' of
  * 'container'.
@@ -61,6 +67,8 @@ static VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
 {.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
+{.flags = 1ULL << VIRTIO_NET_F_SPEED_DUPLEX,
+ .end = endof(struct virtio_net_config, duplex)},
 {}
 };
 
@@ -88,6 +96,8 @@ static void virtio_net_get_config(VirtIODevice *vdev, uint8_t *config)
 virtio_stw_p(vdev, &netcfg.status, n->status);
 virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
 virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
+virtio_stl_p(vdev, &netcfg.speed, n->net_conf.speed);
+netcfg.duplex = n->net_conf.duplex;
 memcpy(netcfg.mac, n->mac, ETH_ALEN);
 memcpy(config, &netcfg, n->config_size);
 }
@@ -1941,6 +1951,23 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }
 
+n->host_features |= (1ULL << VIRTIO_NET_F_SPEED_DUPLEX);
+if (n->net_conf.duplex_str) {
+if (strncmp(n->net_conf.duplex_str, "half", 5) == 0) {
+n->net_conf.duplex = DUPLEX_HALF;
+} else if (strncmp(n->net_conf.duplex_str, "full", 5) == 0) {
+n->net_conf.duplex = DUPLEX_FULL;
+} else {
+error_setg(errp, "'duplex' must be 'half' or 'full'");
+}
+} else {
+n->net_conf.duplex = DUPLEX_UNKNOWN;
+}
+if (n->net_conf.speed < SPEED_UNKNOWN) {
+error_setg(errp, "'speed' must be between -1 (SPEED_UNKNOWN) and "
+   "INT_MAX");
+}
+
 virtio_net_set_config_size(n, n->host_features);
 virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
 
@@ -2160,6 +2187,8 @@ static Property virtio_net_properties[] = {
 DEFINE_PROP_UINT16("host_mtu", VirtIONet, net_conf.mtu, 0),
 DEFINE_PROP_BOOL("x-mtu-bypass-backend", VirtIONet, mtu_bypass_backend,
  true),
+DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
+DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-net.h
index e7634c9..02484dc 100644
--- a/include/hw/virtio/virtio-net.h
+++ b/include/hw/virtio/virtio-net.h
@@ -38,6 +38,9 @@ typedef struct virtio_net_conf
 uint16_t rx_queue_size;
 uint16_t tx_queue_size;
 uint16_t mtu;
+int32_t speed;
+char *duplex_str;
+uint8_t duplex;
 } virtio_net_conf;
 
 /* Maximum packet size we can receive from tap device: header + 64k */
diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
index 30ff249..0ff1447 100644
--- a/include/standard-headers/linux/virtio_net.h
+++ b/include/standard-headers/linux/virtio_net.h
@@ -36,6 +36,7 @@
 #define VIRTIO_NET_F_GUEST_CSUM 1  /* Guest handles pkts w/ partial csum */
 #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */
 #define VIRTIO_NET_F_MTU 3 /* Initial MTU advice */
+#define VIRTIO_NET_F_SPEED_DUPLEX 4 /* Host set linkspeed and duplex */
 #define VIRTIO_NET_F_MAC 5 /* Host has given MAC address. */
 #define VIRTIO_NET_F_GUEST_TSO4 7  /* Guest can handle TSOv4 in. */
 #define VIRTIO_NET_F_GUEST_TSO6 8  /* Guest can handle TSOv6 in. */
@@ -76,6 +77,9 @@ struct virtio_net_config {
uint16_t max_virtqueue_pairs;
/* Default maximum transmit unit advice */
uint16_t mtu;
+   /* Host exported linkspeed and duplex */
+   uint32_t speed;
+   uint8_t duplex;

[PATCH 2/3] qemu: use 64-bit values for feature flags in virtio-net

2017-12-22 Thread Jason Baron
In preparation for using some of the high order feature bits, make sure that
virtio-net uses 64-bit values everywhere.

Signed-off-by: Jason Baron 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
---
 hw/net/virtio-net.c| 54 +-
 include/hw/virtio/virtio-net.h |  2 +-
 2 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index 38674b0..adc20df 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -48,18 +48,18 @@
 (offsetof(container, field) + sizeof(((container *)0)->field))
 
 typedef struct VirtIOFeature {
-uint32_t flags;
+uint64_t flags;
 size_t end;
 } VirtIOFeature;
 
 static VirtIOFeature feature_sizes[] = {
-{.flags = 1 << VIRTIO_NET_F_MAC,
+{.flags = 1ULL << VIRTIO_NET_F_MAC,
  .end = endof(struct virtio_net_config, mac)},
-{.flags = 1 << VIRTIO_NET_F_STATUS,
+{.flags = 1ULL << VIRTIO_NET_F_STATUS,
  .end = endof(struct virtio_net_config, status)},
-{.flags = 1 << VIRTIO_NET_F_MQ,
+{.flags = 1ULL << VIRTIO_NET_F_MQ,
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
-{.flags = 1 << VIRTIO_NET_F_MTU,
+{.flags = 1ULL << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
 {}
 };
@@ -1938,7 +1938,7 @@ static void virtio_net_device_realize(DeviceState *dev, Error **errp)
 int i;
 
 if (n->net_conf.mtu) {
-n->host_features |= (0x1 << VIRTIO_NET_F_MTU);
+n->host_features |= (1ULL << VIRTIO_NET_F_MTU);
 }
 
 virtio_net_set_config_size(n, n->host_features);
@@ -2109,45 +2109,45 @@ static const VMStateDescription vmstate_virtio_net = {
 };
 
 static Property virtio_net_properties[] = {
-DEFINE_PROP_BIT("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, true),
-DEFINE_PROP_BIT("guest_csum", VirtIONet, host_features,
+DEFINE_PROP_BIT64("csum", VirtIONet, host_features, VIRTIO_NET_F_CSUM, 
true),
+DEFINE_PROP_BIT64("guest_csum", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_CSUM, true),
-DEFINE_PROP_BIT("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
-DEFINE_PROP_BIT("guest_tso4", VirtIONet, host_features,
+DEFINE_PROP_BIT64("gso", VirtIONet, host_features, VIRTIO_NET_F_GSO, true),
+DEFINE_PROP_BIT64("guest_tso4", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_TSO4, true),
-DEFINE_PROP_BIT("guest_tso6", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_tso6", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_TSO6, true),
-DEFINE_PROP_BIT("guest_ecn", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_ecn", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_ECN, true),
-DEFINE_PROP_BIT("guest_ufo", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_ufo", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_UFO, true),
-DEFINE_PROP_BIT("guest_announce", VirtIONet, host_features,
+DEFINE_PROP_BIT64("guest_announce", VirtIONet, host_features,
 VIRTIO_NET_F_GUEST_ANNOUNCE, true),
-DEFINE_PROP_BIT("host_tso4", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_tso4", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_TSO4, true),
-DEFINE_PROP_BIT("host_tso6", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_tso6", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_TSO6, true),
-DEFINE_PROP_BIT("host_ecn", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_ecn", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_ECN, true),
-DEFINE_PROP_BIT("host_ufo", VirtIONet, host_features,
+DEFINE_PROP_BIT64("host_ufo", VirtIONet, host_features,
 VIRTIO_NET_F_HOST_UFO, true),
-DEFINE_PROP_BIT("mrg_rxbuf", VirtIONet, host_features,
+DEFINE_PROP_BIT64("mrg_rxbuf", VirtIONet, host_features,
 VIRTIO_NET_F_MRG_RXBUF, true),
-DEFINE_PROP_BIT("status", VirtIONet, host_features,
+DEFINE_PROP_BIT64("status", VirtIONet, host_features,
 VIRTIO_NET_F_STATUS, true),
-DEFINE_PROP_BIT("ctrl_vq", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_vq", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_VQ, true),
-DEFINE_PROP_BIT("ctrl_rx", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_rx", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_RX, true),
-DEFINE_PROP_BIT("ctrl_vlan", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_vlan", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_VLAN, true),
-DEFINE_PROP_BIT("ctrl_rx_extra", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_rx_extra", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_RX_EXTRA, true),
-DEFINE_PROP_BIT("ctrl_mac_addr", VirtIONet, host_features,
+DEFINE_PROP_BIT64("ctrl_mac_addr", VirtIONet, host_features,
 VIRTIO_NET_F_CTRL_MAC_ADDR, true),

[PATCH net-next v2 1/3] virtio_net: propagate linkspeed/duplex settings from the hypervisor

2017-12-22 Thread Jason Baron
The ability to set speed and duplex for virtio_net is useful in various
scenarios as described here:

16032be virtio_net: add ethtool support for set and get of settings

However, it would be nice to be able to set this from the hypervisor,
such that virtio_net doesn't require custom guest ethtool commands.

Introduce a new feature flag, VIRTIO_NET_F_SPEED_DUPLEX, which allows
the hypervisor to export a linkspeed and duplex setting. The user can
subsequently overwrite it later if desired via: 'ethtool -s'.

Signed-off-by: Jason Baron 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
---
 drivers/net/virtio_net.c| 17 -
 include/uapi/linux/virtio_net.h |  5 +
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 511f833..4168d82 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2621,6 +2621,20 @@ static int virtnet_probe(struct virtio_device *vdev)
netif_set_real_num_rx_queues(dev, vi->curr_queue_pairs);
 
virtnet_init_settings(dev);
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_SPEED_DUPLEX)) {
+   u32 speed;
+   u8 duplex;
+
+   speed = virtio_cread32(vdev, offsetof(struct virtio_net_config,
+  speed));
+   if (ethtool_validate_speed(speed))
+   vi->speed = speed;
+   duplex = virtio_cread8(vdev,
+  offsetof(struct virtio_net_config,
+  duplex));
+   if (ethtool_validate_duplex(duplex))
+   vi->duplex = duplex;
+   }
 
err = register_netdev(dev);
if (err) {
@@ -2746,7 +2760,8 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_CTRL_RX, VIRTIO_NET_F_CTRL_VLAN, \
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
-   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
+   VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
+   VIRTIO_NET_F_SPEED_DUPLEX
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index fc353b5..0f1548e 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,8 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Host set linkspeed and duplex */
+
 #ifndef VIRTIO_NET_NO_LEGACY
 #define VIRTIO_NET_F_GSO   6   /* Host handles pkts w/ any GSO type */
 #endif /* VIRTIO_NET_NO_LEGACY */
@@ -76,6 +78,9 @@ struct virtio_net_config {
__u16 max_virtqueue_pairs;
/* Default maximum transmit unit advice */
__u16 mtu;
+   /* Host exported linkspeed and duplex */
+   __u32 speed;
+   __u8 duplex;
 } __attribute__((packed));
 
 /*
-- 
2.6.1



[PATCH v2 0/3] virtio_net: allow hypervisor to indicate linkspeed and duplex setting

2017-12-22 Thread Jason Baron
We have found it useful to be able to set the linkspeed and duplex
settings from the host-side for virtio_net. This obviates the need
for guest changes and settings for these fields, and does not require
custom ethtool commands for virtio_net.

  
The ability to set linkspeed and duplex is useful in various cases
as described here:

16032be virtio_net: add ethtool support for set and get of settings

Using 'ethtool -s' continues to over-write the linkspeed/duplex
settings with this patch.

The 1/3 patch is against net-next, while patches 2-3/3 are the associated
qemu changes that would go in afterwards, as update-linux-headers.sh should
be run first. So the qemu patches are a demonstration of how I intend this
to work.

Thanks,

-Jason

 linux changes:

Jason Baron (1):
  virtio_net: propagate linkspeed/duplex settings from the hypervisor

 drivers/net/virtio_net.c| 17 -
 include/uapi/linux/virtio_net.h |  5 +
 2 files changed, 21 insertions(+), 1 deletion(-)

  

  
 qemu changes:

Jason Baron (2):
  qemu: virtio-net: use 64-bit values for feature flags
  qemu: add linkspeed and duplex settings to virtio-net

 hw/net/virtio-net.c | 83 +++--
 include/hw/virtio/virtio-net.h  |  5 +-
 include/standard-headers/linux/virtio_net.h |  4 ++
 3 files changed, 64 insertions(+), 28 deletions(-)


Re: [Patch net-next] net_sched: call qdisc_reset() with qdisc lock

2017-12-22 Thread Cong Wang
On Thu, Dec 21, 2017 at 7:36 PM, John Fastabend
 wrote:
> On 12/21/2017 04:03 PM, Cong Wang wrote:
>> qdisc_reset() should always be called with qdisc spinlock
>> and with BH disabled, otherwise qdisc ->reset() could race
>> with TX BH.
>>
> hmm I don't see how this fixes the issue. pfifo_fast is no longer
> using the qdisc lock so that doesn't help.  And it is only a
> local_bh_disable.


First of all, this is to fix the non-pfifo_fast qdiscs, which you seem
to have totally forgotten.


>
>
>> Fixes: 7bbde83b1860 ("net: sched: drop qdisc_reset from dev_graft_qdisc")
>> Reported-by: Jakub Kicinski 
>> Cc: John Fastabend 
>> Signed-off-by: Cong Wang 
>> ---
>>  net/sched/sch_generic.c | 5 -
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 10aaa3b615ce..00ddb5f8f430 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -1097,8 +1097,11 @@ static void dev_qdisc_reset(struct net_device *dev,
>>  {
>>   struct Qdisc *qdisc = dev_queue->qdisc_sleeping;
>>
>> - if (qdisc)
>> + if (qdisc) {
>> + spin_lock_bh(qdisc_lock(qdisc));
>>   qdisc_reset(qdisc);
>> + spin_unlock_bh(qdisc_lock(qdisc));
>> + }
>>  }
>>
>>  /**
>>
>
> OK first the cases to get to qdisc_reset that I've tracked
> down are,
>
> dev_shutdown()
>   qdisc_destroy()
>
> dev_deactivate_many()
>   dev_qdisc_reset() <- for each txq
>  qdisc_reset()
>
> chained calls from qdisc_reset ops
>
> At the moment all the lockless qdiscs don't care about chained
> calls so we can ignore that, but would be nice to keep in mind.
>
> Next, qdisc_reset() is doing a couple of things: calling the qdisc
> ops reset call but also walking gso_skb and skb_bad_txq. The
> 'unlink' operations there are not safe to be called while an
> enqueue/dequeue op is in-flight. Also pfifo_fast's reset op
> is not safe to be called with enqueue/dequeue ops in-flight.

This is why I sent two patches instead just this one.


>
> So I've made the assumption that qdisc_reset is _only_ ever
> called after a qdisc is no longer attached on the enqueue
> dev_xmit side and also any in-progress tx_action calls are
> completed. For what its worth this has always been the assumption
> AFAIK.
>
> So those are the assumptions what did I miss?


Speaking of this, qdisc_reset() is also called in dev_deactivate_queue()
even before synchronize_net()...

>
> The biggest gap I see is dev_deactivate_many() is supposed
> to wait for all tx_action calls to complete, this bit:

In some_qdisc_is_busy() you hold qdisc spinlock for non-lockless
ones, I don't understand why you don't want it in dev_qdisc_reset().


[PATCH bpf-next 1/2] bpf: fix maximum stack depth tracking logic

2017-12-22 Thread Alexei Starovoitov
Instead of computing the max stack depth for the current call chain
only, track the maximum possible stack depth of any function at a given
frame position. Such an algorithm is simple and fast, but conservative,
since it overestimates the amount of stack used. Consider:
main() // stack 32
{
  A();
  B();
}

A(){} // stack 256

B()  // stack 64
{
  A();
}

since A() is called at frame[1] and frame[2], the algorithm
will estimate the max stack depth as 32 + 256 + 256 = 544 and will
reject such a program, though the real max is 32 + 64 + 256 = 352.

Fortunately the algorithm is good enough in practice. The alternative
would be to track the max stack of every function in the fast pass
through the verifier and then do an additional CFG walk just to compute
the max total.
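
To make the per-frame bookkeeping concrete, here is how
frame_stack_depth[] evolves for the example above (my reading of the
algorithm, values in bytes):

/*
 * main() verified at frame 0: frame_stack_depth[0] = 32
 * A() from main() at frame 1: frame_stack_depth[1] = 256
 * B() from main() at frame 1: max(256, 64), stays 256
 * A() from B()    at frame 2: frame_stack_depth[2] = 256
 *
 * total = 32 + 256 + 256 = 544 > 512 -> rejected, even though the
 * deepest real chain main->B->A only needs 32 + 64 + 256 = 352.
 */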

Fixes: f4d7e40a5b71 ("bpf: introduce function calls (verification)")
Reported-by: Jann Horn 
Signed-off-by: Alexei Starovoitov 
---
 include/linux/bpf_verifier.h | 17 +
 kernel/bpf/verifier.c| 15 +--
 2 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index aaac589e490c..adc65ef093d1 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -194,7 +194,24 @@ struct bpf_verifier_env {
struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
struct bpf_verifer_log log;
u32 subprog_starts[BPF_MAX_SUBPROGS];
+   /* computes the stack depth of each bpf function */
u16 subprog_stack_depth[BPF_MAX_SUBPROGS + 1];
+   /* gradually computes the maximum stack depth out of all functions
+* called at this frame position. Example:
+* main()
+* {
+*   a();
+*   b();
+* }
+* b()
+* {
+*   a();
+* }
+* frame_stack_depth[0] = stack depth of main()
+* frame_stack_depth[1] = max { stack depth of a(), stack depth of b() }
+* frame_stack_depth[2] = stack depth of a()
+*/
+   u16 frame_stack_depth[MAX_CALL_FRAMES];
u32 subprog_cnt;
 };
 
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4ae46b2cba88..096681bcb312 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -1434,15 +1434,18 @@ static int update_stack_depth(struct bpf_verifier_env *env,
struct bpf_verifier_state *cur = env->cur_state;
int i;
 
+   if (stack < -off)
+   /* update known max for given subprogram */
+   env->subprog_stack_depth[func->subprogno] = -off;
+
+   stack = env->frame_stack_depth[func->frameno];
if (stack >= -off)
return 0;
+   env->frame_stack_depth[func->frameno] = -off;
 
-   /* update known max for given subprogram */
-   env->subprog_stack_depth[func->subprogno] = -off;
-
-   /* compute the total for current call chain */
-   for (i = 0; i <= cur->curframe; i++) {
-   u32 depth = env->subprog_stack_depth[cur->frame[i]->subprogno];
+   /* some frame changed its max, recompute the total */
+   for (i = 0; i < MAX_CALL_FRAMES; i++) {
+   u32 depth = env->frame_stack_depth[i];
 
/* round up to 32-bytes, since this is granularity
 * of interpreter stack sizes
-- 
2.9.5



[PATCH bpf-next 2/2] selftests/bpf: additional stack depth tests

2017-12-22 Thread Alexei Starovoitov
to test the inner logic of stack depth tracking

Signed-off-by: Alexei Starovoitov 
---
 tools/testing/selftests/bpf/test_verifier.c | 50 +
 1 file changed, 50 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 71fb0be81b78..e0b21c95c3bc 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -8764,6 +8764,56 @@ static struct bpf_test tests[] = {
.result = REJECT,
},
{
+   "calls: stack depth check using three frames. test1",
+   .insns = {
+   /* main */
+   BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 4), /* call A */
+   BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 5), /* call B */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -32, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   /* A */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -256, 0),
+   BPF_EXIT_INSN(),
+   /* B */
+   BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, -3), /* call A */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -64, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_XDP,
+   /* though stack_main=32, stack_A=256, stack_B=64
+* and max(main+A, main+A+B) < 512 the program is rejected
+* since stack tracking is conservative
+* frame[0]=32, frame[1]=256, frame[2]=256
+*/
+   .errstr = "combined stack size",
+   .result = REJECT,
+   },
+   {
+   "calls: stack depth check using three frames. test2",
+   .insns = {
+   /* main */
+   BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 4), /* call A */
+   BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 5), /* call B */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -32, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   /* A */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -64, 0),
+   BPF_EXIT_INSN(),
+   /* B */
+   BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, -3), /* call A */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -256, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_XDP,
+   /* though stack_main=32, stack_A=64, stack_B=256
+* and max(main+A, main+A+B) < 512 the program is accepted
+* since frame[0]=32, frame[1]=256, frame[2]=64
+*/
+   .result = ACCEPT,
+   },
+   {
"calls: spill into caller stack frame",
.insns = {
BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
-- 
2.9.5



[PATCH bpf-next 0/2] bpf: stack depth tracking fix

2017-12-22 Thread Alexei Starovoitov
Jann reported an issue with stack depth tracking.
Fix it and add tests.

Alexei Starovoitov (2):
  bpf: fix maximum stack depth tracking logic
  selftests/bpf: additional stack depth tests

 include/linux/bpf_verifier.h| 17 ++
 kernel/bpf/verifier.c   | 15 +
 tools/testing/selftests/bpf/test_verifier.c | 50 +
 3 files changed, 76 insertions(+), 6 deletions(-)

-- 
2.9.5



Re: [PATCH v2 bpf-next 1/2] tools/bpftool: use version from the kernel source tree

2017-12-22 Thread Jakub Kicinski
On Fri, 22 Dec 2017 16:11:51 +, Roman Gushchin wrote:
> Bpftool determines its own version based on the kernel
> version, which is picked from the linux/version.h header.
> 
> It's strange to use the version of the installed kernel
> headers, and it makes much more sense to use the version
> of the actual source tree, where the bpftool sources are.
> 
> Fix this by building the kernelversion target and using
> the resulting string as the bpftool version.
> 
> Example:
> before:
> 
> $ bpftool version
> bpftool v4.14.6
> 
> after:
> $ bpftool version
> bpftool v4.15.0-rc3
> 
> $ bpftool version --json
> {"version":"4.15.0-rc3"}
> 
> Signed-off-by: Roman Gushchin 
> Cc: Jakub Kicinski 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 

Reviewed-by: Jakub Kicinski 

Thanks Roman!


[PATCH net v2] netns, rtnetlink: fix struct net reference leak

2017-12-22 Thread Craig Gallek
From: Craig Gallek 

netns ids were added in commit 0c7aecd4bde4 and defined as signed
integers in both the kernel datastructures and the netlink interface.
However, the semantics of the implementation assume that the ids
are always greater than or equal to zero, except for an internal
sentinel value NETNSA_NSID_NOT_ASSIGNED.

Several subsequent patches carried this pattern forward.  This patch
updates all of the netlink input paths of this value to enforce the
'greater than or equal to zero' constraint.

This issue was discovered by syzkaller.  It would set a negative
value for a netns id and then repeatedly call RTM_GETLINK.
The put_net call in that path would not trigger for negative netns ids,
causing a reference count leak that eventually overflowed.  There are
probably additional error paths that do not handle this situation
correctly, but this was the only one through which I was able to
trigger a real issue.

Fixes: 0c7aecd4bde4 ("netns: add rtnl cmd to add and get peer netns ids")
Fixes: 317f4810e45e ("rtnl: allow to create device with IFLA_LINK_NETNSID set")
Fixes: 79e1ad148c84 ("rtnetlink: use netnsid to query interface")
CC: Jiri Benc 
CC: Nicolas Dichtel 
CC: Jason A. Donenfeld 
Signed-off-by: Craig Gallek 
---
 net/core/net_namespace.c |  2 ++
 net/core/rtnetlink.c | 17 +
 2 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 60a71be75aea..4b7ea33f5705 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -627,6 +627,8 @@ static int rtnl_net_newid(struct sk_buff *skb, struct nlmsghdr *nlh,
return -EINVAL;
}
nsid = nla_get_s32(tb[NETNSA_NSID]);
+   if (nsid < 0)
+   return -EINVAL;
 
if (tb[NETNSA_PID]) {
peer = get_net_ns_by_pid(nla_get_u32(tb[NETNSA_PID]));
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index dabba2a91fc8..a928b8f081b8 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1733,10 +1733,12 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
ifla_policy, NULL) >= 0) {
if (tb[IFLA_IF_NETNSID]) {
netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
-   tgt_net = get_target_net(skb, netnsid);
-   if (IS_ERR(tgt_net)) {
-   tgt_net = net;
-   netnsid = -1;
+   if (netnsid >= 0) {
+   tgt_net = get_target_net(skb, netnsid);
+   if (IS_ERR(tgt_net)) {
+   tgt_net = net;
+   netnsid = -1;
+   }
}
}
 
@@ -2792,6 +2794,11 @@ static int rtnl_newlink(struct sk_buff *skb, struct nlmsghdr *nlh,
if (tb[IFLA_LINK_NETNSID]) {
int id = nla_get_s32(tb[IFLA_LINK_NETNSID]);
 
+   if (id < 0) {
+   err =  -EINVAL;
+   goto out;
+   }
+
link_net = get_net_ns_by_id(dest_net, id);
if (!link_net) {
err =  -EINVAL;
@@ -2883,6 +2890,8 @@ static int rtnl_getlink(struct sk_buff *skb, struct nlmsghdr *nlh,
 
if (tb[IFLA_IF_NETNSID]) {
netnsid = nla_get_s32(tb[IFLA_IF_NETNSID]);
+   if (netnsid < 0)
+   return -EINVAL;
tgt_net = get_target_net(skb, netnsid);
if (IS_ERR(tgt_net))
return PTR_ERR(tgt_net);
-- 
2.15.1.620.gb9897f4670-goog



Re: [Patch net-next] net_sched: remove the unsafe __skb_array_empty()

2017-12-22 Thread Cong Wang
On Thu, Dec 21, 2017 at 7:06 PM, John Fastabend
 wrote:
> On 12/21/2017 04:03 PM, Cong Wang wrote:
>> __skb_array_empty() is only safe if array is never resized.
>> pfifo_fast_dequeue() is called in TX BH context and without
>> qdisc lock, so even after we disable BH on ->reset() path
>> we can still race with other CPU's.
>>
>> Fixes: c5ad119fb6c0 ("net: sched: pfifo_fast use skb_array")
>> Reported-by: Jakub Kicinski 
>> Cc: John Fastabend 
>> Signed-off-by: Cong Wang 
>> ---
>>  net/sched/sch_generic.c | 3 ---
>>  1 file changed, 3 deletions(-)
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 00ddb5f8f430..9279258ce060 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -622,9 +622,6 @@ static struct sk_buff *pfifo_fast_dequeue(struct Qdisc *qdisc)
>>   for (band = 0; band < PFIFO_FAST_BANDS && !skb; band++) {
>>   struct skb_array *q = band2list(priv, band);
>>
>> - if (__skb_array_empty(q))
>> - continue;
>> -
>>   skb = skb_array_consume_bh(q);
>>   }
>>   if (likely(skb)) {
>>
>
>
> So this is a performance thing: we don't want to grab the consumer lock on
> empty bands, which can be fairly common depending on traffic patterns.


I understand why you had it, but it is just not safe. You don't want
to achieve a performance gain by crashing the system, right?

>
> Although it's not logical IMO to have both reset and dequeue running at
> the same time. Some skbs would get dropped, others would get sent; sort
> of a mess. I don't see how it can be an issue. The "never resized" bit
> in the documentation is referring to resizing the ring size, _not_ popping
> off elements of the ring. array_empty just reads the consumer head.
> The only ring resizing in pfifo_fast should be at init and destroy, by
> which point enqueue/dequeue should be disconnected. Although based on the
> trace I missed a case.


Both pfifo_fast_reset() and pfifo_fast_dequeue() call
skb_array_consume_bh(), so there is no difference w.r.t. resizing.

And ->reset() is called in qdisc_graft() too. Let's say we have htb+pfifo_fast,
htb_graft() calls qdisc_replace() which calls qdisc_reset() on pfifo_fast,
so clearly pfifo_fast_reset() can run with pfifo_fast_dequeue()
concurrently.


>
> I think the right fix is to only call reset/destroy patterns after
> waiting a grace period and for all tx_action calls in-flight to
> complete. This is also better going forward for more complex qdiscs.

But we don't even have rcu read lock in TX BH, do we?

Also, people certainly don't like yet another synchronize_net()...


[PATCH net-next 0/4] kcm: Fix two locking issues

2017-12-22 Thread Tom Herbert
One issue is lockdep warnings when sock_owned_by_user returns true
in strparser. The fix is to add and call sock_owned_by_user_nocheck,
since the check for owned-by-user is not an error condition in this case.

The other issue is a potential deadlock between the TX and RX paths.

The KCM socket lock and the psock socket lock are acquired in both
the RX and TX paths, however they take the locks in opposite order,
which can lead to deadlock. The fix is to add try_lock_sock to see
if the psock socket lock can be acquired in the TX path with the KCM
lock held. If not, then the KCM socket lock is released and the psock
socket lock and KCM socket lock are acquired in the same order as the
RX path.
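
In pseudo-form, the TX-side locking works like this (a simplified
sketch of the idea, not the exact kcm_write_msgs code):

static void kcm_tx_take_locks(struct kcm_sock *kcm, struct kcm_psock *psock)
{
        /* Called with lock_sock(&kcm->sk) already held by the TX path. */
        if (!try_lock_sock(psock->sk)) {
                /* Contended: fall back to the RX path's order,
                 * psock lock first, then the KCM lock again.
                 */
                release_sock(&kcm->sk);
                lock_sock(psock->sk);
                lock_sock(&kcm->sk);
        }
        /* Both locks held here, acquired in a deadlock-free order. */
}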

Tested:

Ran KCM traffic without incident.

Tom Herbert (4):
  sock: Add sock_owned_by_user_nocheck
  strparser: Call sock_owned_by_user_nocheck
  sock_lock: Add try_sock_lock
  kcm: Address deadlock between TX and RX paths

 include/net/kcm.h |  1 +
 include/net/sock.h| 12 +
 net/core/sock.c   | 20 +++
 net/kcm/kcmsock.c | 64 ++-
 net/strparser/strparser.c |  2 +-
 5 files changed, 81 insertions(+), 18 deletions(-)

-- 
2.11.0



[PATCH net-next 4/4] kcm: Address deadlock between TX and RX paths

2017-12-22 Thread Tom Herbert
Both the transmit and receive paths of KCM need to take both the
KCM socket lock and the psock socket lock, however they take the
locks in opposite order which can lead to deadlock.

This patch changes the transmit path (kcm_write_msgs to be specific)
so the locks are taken in the proper order. try_lock_sock is first used
to get the lower socket lock; if that is successful then sending data
can proceed without dropping the KCM lock. If try_lock_sock fails, the
KCM lock is released and lock_sock is done on the lower socket, followed
by lock_sock on the KCM sock.

A doing_write_msgs flag has been added to the kcm structure to prevent
multiple threads doing write_msgs when the KCM lock is dropped.
kernel_sendpage_locked is now called to send the data with the lock
already held.

Signed-off-by: Tom Herbert 
---
 include/net/kcm.h |  1 +
 net/kcm/kcmsock.c | 64 ---
 2 files changed, 48 insertions(+), 17 deletions(-)

diff --git a/include/net/kcm.h b/include/net/kcm.h
index 2a8965819db0..22bd7dd3eedb 100644
--- a/include/net/kcm.h
+++ b/include/net/kcm.h
@@ -78,6 +78,7 @@ struct kcm_sock {
/* Don't use bit fields here, these are set under different locks */
bool tx_wait;
bool tx_wait_more;
+   bool doing_write_msgs;
 
/* Receive */
struct kcm_psock *rx_psock;
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index d4e98f20fc2a..3eb3179b96b3 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -574,13 +574,19 @@ static void kcm_report_tx_retry(struct kcm_sock *kcm)
 static int kcm_write_msgs(struct kcm_sock *kcm)
 {
struct sock *sk = &kcm->sk;
-   struct kcm_psock *psock;
-   struct sk_buff *skb, *head;
-   struct kcm_tx_msg *txm;
struct sk_buff *head = skb_peek(&sk->sk_write_queue);
unsigned short fragidx, frag_offset;
unsigned int sent, total_sent = 0;
+   struct kcm_psock *psock;
+   struct kcm_tx_msg *txm;
+   struct sk_buff *skb;
int ret = 0;
 
+   if (unlikely(kcm->doing_write_msgs))
+   return 0;
+
+   kcm->doing_write_msgs = true;
+
kcm->tx_wait_more = false;
psock = kcm->tx_psock;
if (unlikely(psock && psock->tx_stopped)) {
@@ -598,15 +604,36 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
return 0;
}
 
+try_again:
+   psock = reserve_psock(kcm);
+   if (!psock)
+   goto out_no_release_psock;
+
+   /* Get lock for lower sock */
+   if (!try_lock_sock(psock->sk)) {
+   /* Someone  else is holding the lower sock lock. We need to
+* release the KCM lock and get the psock lock first. This is
+* needed since the receive path obtains the locks in reverse
+* order and we want to avoid deadlock. Note that
+* write_msgs can't be reentered when we drop the KCM lock
+* since doing_write_msgs is set.
+*/
+   release_sock(&kcm->sk);
+
+   /* Take locks in order that receive path does */
+   lock_sock(psock->sk);
+   lock_sock(&kcm->sk);
+   }
+
+   /* At this point we have a reserved psock and its lower socket is
+* locked.
+*/
+
head = skb_peek(&sk->sk_write_queue);
txm = kcm_tx_msg(head);
 
if (txm->sent) {
/* Send of first skbuff in queue already in progress */
-   if (WARN_ON(!psock)) {
-   ret = -EINVAL;
-   goto out;
-   }
sent = txm->sent;
frag_offset = txm->frag_offset;
fragidx = txm->fragidx;
@@ -615,11 +642,6 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
goto do_frag;
}
 
-try_again:
-   psock = reserve_psock(kcm);
-   if (!psock)
-   goto out;
-
do {
skb = head;
txm = kcm_tx_msg(head);
@@ -643,11 +665,12 @@ static int kcm_write_msgs(struct kcm_sock *kcm)
goto out;
}
 
-   ret = kernel_sendpage(psock->sk->sk_socket,
- frag->page.p,
- frag->page_offset + frag_offset,
- frag->size - frag_offset,
- MSG_DONTWAIT);
+   ret = kernel_sendpage_locked(psock->sk, frag->page.p,
+frag->page_offset + frag_offset,
+frag->size - frag_offset,
+MSG_DONTWAIT);
if (ret <= 0) {
  

[PATCH net-next 3/4] sock_lock: Add try_sock_lock

2017-12-22 Thread Tom Herbert
try_lock_sock is an opportunistic attempt to acquire a socket lock
without blocking or sleeping. If the socket lock is acquired, true is
returned, else false is returned.

Signed-off-by: Tom Herbert 
---
 include/net/sock.h |  7 +++
 net/core/sock.c| 20 
 2 files changed, 27 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3b4ca2046f8c..69fdd1a89591 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1462,6 +1462,13 @@ static inline void lock_sock(struct sock *sk)
lock_sock_nested(sk, 0);
 }
 
+bool try_lock_sock_nested(struct sock *sk, int subclass);
+
+static inline bool try_lock_sock(struct sock *sk)
+{
+   return try_lock_sock_nested(sk, 0);
+}
+
 void release_sock(struct sock *sk);
 
 /* BH context may only use the following locking interface. */
diff --git a/net/core/sock.c b/net/core/sock.c
index 72d14b221784..40fb772e2d52 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2782,6 +2782,26 @@ void lock_sock_nested(struct sock *sk, int subclass)
 }
 EXPORT_SYMBOL(lock_sock_nested);
 
+bool try_lock_sock_nested(struct sock *sk, int subclass)
+{
+   spin_lock_bh(&sk->sk_lock.slock);
+   if (sk->sk_lock.owned) {
+   spin_unlock_bh(&sk->sk_lock.slock);
+   return false;
+   }
+
+   sk->sk_lock.owned = 1;
+   spin_unlock(&sk->sk_lock.slock);
+
+   /* The sk_lock has mutex_lock() semantics here: */
+
+   mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
+   local_bh_enable();
+
+   return true;
+}
+EXPORT_SYMBOL(try_lock_sock_nested);
+
 void release_sock(struct sock *sk)
 {
spin_lock_bh(&sk->sk_lock.slock);
-- 
2.11.0



[PATCH net-next 2/4] strparser: Call sock_owned_by_user_nocheck

2017-12-22 Thread Tom Herbert
strparser wants to check socket ownership without producing any
warnings. As indicated by the comment in the code, it is permissible
for owned_by_user to return true.

Fixes: 43a0c6751a322847 ("strparser: Stream parser for messages")
Reported-by: syzbot 
Signed-off-by: Tom Herbert 
---
 net/strparser/strparser.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/strparser/strparser.c b/net/strparser/strparser.c
index c5fda15ba319..1fdab5c4eda8 100644
--- a/net/strparser/strparser.c
+++ b/net/strparser/strparser.c
@@ -401,7 +401,7 @@ void strp_data_ready(struct strparser *strp)
 * allows a thread in BH context to safely check if the process
 * lock is held. In this case, if the lock is held, queue work.
 */
-   if (sock_owned_by_user(strp->sk)) {
+   if (sock_owned_by_user_nocheck(strp->sk)) {
queue_work(strp_wq, >work);
return;
}
-- 
2.11.0



[PATCH net-next 1/4] sock: Add sock_owned_by_user_nocheck

2017-12-22 Thread Tom Herbert
This allows checking socket lock ownership without producing lockdep
warnings.

Signed-off-by: Tom Herbert 
---
 include/net/sock.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 0a32f3ce381c..3b4ca2046f8c 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1515,6 +1515,11 @@ static inline bool sock_owned_by_user(const struct sock *sk)
return sk->sk_lock.owned;
 }
 
+static inline bool sock_owned_by_user_nocheck(const struct sock *sk)
+{
+   return sk->sk_lock.owned;
+}
+
 /* no reclassification while locks are held */
 static inline bool sock_allow_reclassification(const struct sock *csk)
 {
-- 
2.11.0



[PATCH bpf 2/2] tools: bpftool: protect against races with disappearing objects

2017-12-22 Thread Jakub Kicinski
On program/map show we may get an ID of an object from GETNEXT,
but the object may disappear before we call GET_FD_BY_ID.  If
that happens, ignore the object and continue.

Fixes: 71bb428fe2c1 ("tools: bpf: add bpftool")
Signed-off-by: Jakub Kicinski 
Acked-by: Quentin Monnet 
---
 tools/bpf/bpftool/map.c  | 2 ++
 tools/bpf/bpftool/prog.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index 8368b7ea31b5..a8c3a33dd185 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -528,6 +528,8 @@ static int do_show(int argc, char **argv)
 
fd = bpf_map_get_fd_by_id(id);
if (fd < 0) {
+   if (errno == ENOENT)
+   continue;
p_err("can't get map by id (%u): %s",
  id, strerror(errno));
break;
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index ad619b96c276..dded77345bfb 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -382,6 +382,8 @@ static int do_show(int argc, char **argv)
 
fd = bpf_prog_get_fd_by_id(id);
if (fd < 0) {
+   if (errno == ENOENT)
+   continue;
p_err("can't get prog by id (%u): %s",
  id, strerror(errno));
err = -1;
-- 
2.15.1



[PATCH bpf 1/2] tools: bpftool: maps: close json array on error paths of show

2017-12-22 Thread Jakub Kicinski
We can't return from the middle of do_show(), because
json_array will not be closed.  Break out of the loop.
Note that the error handling after the loop depends on
errno, so no need to set err.

Fixes: 831a0aafe5c3 ("tools: bpftool: add JSON output for `bpftool map *` commands")
Signed-off-by: Jakub Kicinski 
Acked-by: Quentin Monnet 
---
 tools/bpf/bpftool/map.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/map.c b/tools/bpf/bpftool/map.c
index e2450c8e88e6..8368b7ea31b5 100644
--- a/tools/bpf/bpftool/map.c
+++ b/tools/bpf/bpftool/map.c
@@ -523,21 +523,21 @@ static int do_show(int argc, char **argv)
break;
p_err("can't get next map: %s%s", strerror(errno),
  errno == EINVAL ? " -- kernel too old?" : "");
-   return -1;
+   break;
}
 
fd = bpf_map_get_fd_by_id(id);
if (fd < 0) {
p_err("can't get map by id (%u): %s",
  id, strerror(errno));
-   return -1;
+   break;
}
 
err = bpf_obj_get_info_by_fd(fd, , );
if (err) {
p_err("can't get map info: %s", strerror(errno));
close(fd);
-   return -1;
+   break;
}
 
if (json_output)
-- 
2.15.1



[PATCH bpf 0/2] tools: bpftool: fix unlikely race and JSON output on error path

2017-12-22 Thread Jakub Kicinski
Hi!

Two small fixes here to listing maps and programs.  The loop for showing
maps is written slightly differently from the one for programs, which was
missed in JSON output support, and output would be broken if any of the
system calls failed.  The second fix is for the very unlikely case that a
program or map disappears after we get its ID; we should just skip over
that object instead of failing.

Jakub Kicinski (2):
  tools: bpftool: maps: close json array on error paths of show
  tools: bpftool: protect against races with disappearing objects

 tools/bpf/bpftool/map.c  | 8 +---
 tools/bpf/bpftool/prog.c | 2 ++
 2 files changed, 7 insertions(+), 3 deletions(-)

-- 
2.15.1



[PATCH nf-next,v3 6/7] netfilter: nf_tables: flow offload expression

2017-12-22 Thread Pablo Neira Ayuso
Add a new instruction for the nf_tables VM that allows us to specify what
flows are offloaded into a given flow table via its name. This new
instruction creates the flow entry and adds it to the flow table.

Only established flows, i.e. where we have seen traffic in both directions, are
added to the flow table. You can still decide to offload entries at a
later stage via packet counting or checking the ct status in case you
want to offload assured conntracks.

This new extension depends on the conntrack subsystem.

Signed-off-by: Pablo Neira Ayuso 
---
 include/uapi/linux/netfilter/nf_tables.h |  11 ++
 net/netfilter/Kconfig|   7 +
 net/netfilter/Makefile   |   1 +
 net/netfilter/nft_flow_offload.c | 268 +++
 4 files changed, 287 insertions(+)
 create mode 100644 net/netfilter/nft_flow_offload.c

diff --git a/include/uapi/linux/netfilter/nf_tables.h b/include/uapi/linux/netfilter/nf_tables.h
index 9ba0f4c13de6..528d832fefb4 100644
--- a/include/uapi/linux/netfilter/nf_tables.h
+++ b/include/uapi/linux/netfilter/nf_tables.h
@@ -954,6 +954,17 @@ enum nft_ct_attributes {
 };
 #define NFTA_CT_MAX(__NFTA_CT_MAX - 1)
 
+/**
+ * enum nft_flow_attributes - ct offload expression attributes
+ * @NFTA_FLOW_TABLE_NAME: flow table name (NLA_STRING)
+ */
+enum nft_offload_attributes {
+   NFTA_FLOW_UNSPEC,
+   NFTA_FLOW_TABLE_NAME,
+   __NFTA_FLOW_MAX,
+};
+#define NFTA_FLOW_MAX  (__NFTA_FLOW_MAX - 1)
+
 enum nft_limit_type {
NFT_LIMIT_PKTS,
NFT_LIMIT_PKT_BYTES
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 0c6256db5a6c..1ada46345f3c 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -497,6 +497,13 @@ config NFT_CT
  This option adds the "ct" expression that you can use to match
  connection tracking information such as the flow state.
 
+config NFT_FLOW_OFFLOAD
+   depends on NF_CONNTRACK
+   tristate "Netfilter nf_tables hardware flow offload module"
+   help
+ This option adds the "flow_offload" expression that you can use to
+ choose what flows are placed into the hardware.
+
 config NFT_SET_RBTREE
tristate "Netfilter nf_tables rbtree set module"
help
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 1f7d92bd571a..2c1b8de922f2 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -83,6 +83,7 @@ obj-$(CONFIG_NFT_META)+= nft_meta.o
 obj-$(CONFIG_NFT_RT)   += nft_rt.o
 obj-$(CONFIG_NFT_NUMGEN)   += nft_numgen.o
 obj-$(CONFIG_NFT_CT)   += nft_ct.o
+obj-$(CONFIG_NFT_FLOW_OFFLOAD) += nft_flow_offload.o
 obj-$(CONFIG_NFT_LIMIT)+= nft_limit.o
 obj-$(CONFIG_NFT_NAT)  += nft_nat.o
 obj-$(CONFIG_NFT_OBJREF)   += nft_objref.o
diff --git a/net/netfilter/nft_flow_offload.c b/net/netfilter/nft_flow_offload.c
new file mode 100644
index ..4f16c37acaa3
--- /dev/null
+++ b/net/netfilter/nft_flow_offload.c
@@ -0,0 +1,268 @@
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/netlink.h>
+#include <linux/netfilter.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include <linux/netfilter/nf_tables.h>
+#include <net/ip.h> /* for ipv4 options. */
+#include <net/netfilter/nf_tables.h>
+#include <net/netfilter/nf_tables_core.h>
+#include <net/netfilter/nf_conntrack_core.h>
+#include <linux/netfilter/nf_conntrack_common.h>
+#include <net/netfilter/nf_flow_table.h>
+
+struct nft_flow_offload {
+   struct nft_flowtable*flowtable;
+};
+
+static int nft_flow_route(const struct nft_pktinfo *pkt,
+ const struct nf_conn *ct,
+ struct nf_flow_route *route,
+ enum ip_conntrack_dir dir)
+{
+   struct dst_entry *this_dst = skb_dst(pkt->skb);
+   struct dst_entry *other_dst;
+   const struct nf_afinfo *ai;
+   struct flowi fl;
+
+   memset(&fl, 0, sizeof(fl));
+   switch (nft_pf(pkt)) {
+   case NFPROTO_IPV4:
+   fl.u.ip4.daddr = ct->tuplehash[!dir].tuple.dst.u3.ip;
+   break;
+   case NFPROTO_IPV6:
+   fl.u.ip6.daddr = ct->tuplehash[!dir].tuple.dst.u3.in6;
+   break;
+   }
+
+   ai = nf_get_afinfo(nft_pf(pkt));
+   if (ai) {
+   ai->route(nft_net(pkt), &other_dst, &fl, false);
+   if (!other_dst)
+   return -ENOENT;
+   }
+
+   route->tuple[dir].dst   = this_dst;
+   route->tuple[dir].ifindex   = nft_in(pkt)->ifindex;
+   route->tuple[!dir].dst  = other_dst;
+   route->tuple[!dir].ifindex  = nft_out(pkt)->ifindex;
+
+   return 0;
+}
+
+static bool nft_flow_offload_skip(struct sk_buff *skb)
+{
+   struct ip_options *opt  = &(IPCB(skb)->opt);
+
+   if (unlikely(opt->optlen))
+   return true;
+   if (skb_sec_path(skb))
+   return true;
+
+   return false;
+}
+
+static void nft_flow_offload_eval(const struct nft_expr *expr,
+ struct nft_regs *regs,
+ const struct nft_pktinfo *pkt)
+{
+   struct nft_flow_offload *priv = nft_expr_priv(expr);

[PATCH nf-next,v3 3/7] netfilter: flow table support for IPv4

2017-12-22 Thread Pablo Neira Ayuso
This patch adds the IPv4 flow table type, which implements the datapath
flow table to forward IPv4 traffic. The rationale is:

1) Look up for the packet in the flow table, from the ingress hook.
2) If there's a hit, decrement ttl and pass it on to the neighbour layer
   for transmission.
3) If there's a miss, packet is passed up to the classic forwarding
   path.

This patch also supports layer 3 source and destination NAT.
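
The resulting fast path, condensed into a sketch (helper names such as
nf_flow_tuple_ip() and nf_flow_nat_ip() follow this patch; the full hook
is nf_flow_offload_ip_hook() below):

static unsigned int flow_ipv4_fastpath_sketch(struct nf_flowtable *flow_table,
                                              struct sk_buff *skb,
                                              struct net_device *outdev)
{
        struct flow_offload_tuple_rhash *tuplehash;
        struct flow_offload_tuple tuple = {};
        enum flow_offload_tuple_dir dir;
        struct flow_offload *flow;
        const struct rtable *rt;
        unsigned int thoff;
        __be32 nexthop;

        /* 1) look up the packet in the flow table */
        if (nf_flow_tuple_ip(skb, &tuple) < 0)
                return NF_ACCEPT;
        tuplehash = flow_offload_lookup(flow_table, &tuple);
        if (!tuplehash)
                return NF_ACCEPT;       /* 3) miss: classic forwarding path */

        dir = tuplehash->tuple.dir;
        flow = container_of(tuplehash, struct flow_offload, tuplehash[dir]);
        rt = (const struct rtable *)flow->tuplehash[dir].tuple.dst_cache;

        /* apply NAT if the flow was set up with the SNAT/DNAT flags */
        thoff = ip_hdr(skb)->ihl * 4;
        if (nf_flow_nat_ip(flow, skb, thoff, dir) < 0)
                return NF_DROP;

        /* 2) hit: decrement ttl and hand off to the neighbour layer */
        ip_decrease_ttl(ip_hdr(skb));
        nexthop = rt_nexthop(rt, flow->tuplehash[!dir].tuple.src_v4.s_addr);
        skb->dev = outdev;
        neigh_xmit(NEIGH_ARP_TABLE, outdev, &nexthop, skb);

        return NF_STOLEN;
}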

Signed-off-by: Pablo Neira Ayuso 
---
 net/ipv4/netfilter/Kconfig  |   8 +
 net/ipv4/netfilter/Makefile |   3 +
 net/ipv4/netfilter/nf_flow_table_ipv4.c | 283 
 3 files changed, 294 insertions(+)
 create mode 100644 net/ipv4/netfilter/nf_flow_table_ipv4.c

diff --git a/net/ipv4/netfilter/Kconfig b/net/ipv4/netfilter/Kconfig
index c11eb1744ab1..7270771f9565 100644
--- a/net/ipv4/netfilter/Kconfig
+++ b/net/ipv4/netfilter/Kconfig
@@ -77,6 +77,14 @@ config NF_TABLES_ARP
 
 endif # NF_TABLES
 
+config NF_FLOW_TABLE_IPV4
+   select NF_FLOW_TABLE
+   tristate "Netfilter flow table IPv4 module"
+   help
+ This option adds the flow table IPv4 support.
+
+ To compile it as a module, choose M here.
+
 config NF_DUP_IPV4
tristate "Netfilter IPv4 packet duplication to alternate destination"
depends on !NF_CONNTRACK || NF_CONNTRACK
diff --git a/net/ipv4/netfilter/Makefile b/net/ipv4/netfilter/Makefile
index f462fee66ac8..116745275dc0 100644
--- a/net/ipv4/netfilter/Makefile
+++ b/net/ipv4/netfilter/Makefile
@@ -42,6 +42,9 @@ obj-$(CONFIG_NFT_REDIR_IPV4) += nft_redir_ipv4.o
 obj-$(CONFIG_NFT_DUP_IPV4) += nft_dup_ipv4.o
 obj-$(CONFIG_NF_TABLES_ARP) += nf_tables_arp.o
 
+# flow table support
+obj-$(CONFIG_NF_FLOW_TABLE_IPV4) += nf_flow_table_ipv4.o
+
 # generic IP tables 
 obj-$(CONFIG_IP_NF_IPTABLES) += ip_tables.o
 
diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c b/net/ipv4/netfilter/nf_flow_table_ipv4.c
new file mode 100644
index ..ac56c0f0492a
--- /dev/null
+++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c
@@ -0,0 +1,283 @@
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/netfilter.h>
+#include <linux/rhashtable.h>
+#include <linux/ip.h>
+#include <linux/netdevice.h>
+#include <net/ip.h>
+#include <net/neighbour.h>
+#include <net/netfilter/nf_flow_table.h>
+#include <net/netfilter/nf_tables.h>
+/* For layer 4 checksum field offset. */
+#include <linux/tcp.h>
+#include <linux/udp.h>
+
+static int nf_flow_nat_ip_tcp(struct sk_buff *skb, unsigned int thoff,
+ __be32 addr, __be32 new_addr)
+{
+   struct tcphdr *tcph;
+
+   if (!pskb_may_pull(skb, thoff + sizeof(*tcph)) ||
+   skb_try_make_writable(skb, thoff + sizeof(*tcph)))
+   return -1;
+
+   tcph = (void *)(skb_network_header(skb) + thoff);
+   inet_proto_csum_replace4(&tcph->check, skb, addr, new_addr, true);
+
+   return 0;
+}
+
+static int nf_flow_nat_ip_udp(struct sk_buff *skb, unsigned int thoff,
+ __be32 addr, __be32 new_addr)
+{
+   struct udphdr *udph;
+
+   if (!pskb_may_pull(skb, thoff + sizeof(*udph)) ||
+   skb_try_make_writable(skb, thoff + sizeof(*udph)))
+   return -1;
+
+   udph = (void *)(skb_network_header(skb) + thoff);
+   if (udph->check || skb->ip_summed == CHECKSUM_PARTIAL) {
+   inet_proto_csum_replace4(&udph->check, skb, addr,
+new_addr, true);
+   if (!udph->check)
+   udph->check = CSUM_MANGLED_0;
+   }
+
+   return 0;
+}
+
+static int nf_flow_nat_ip_l4proto(struct sk_buff *skb, struct iphdr *iph,
+ unsigned int thoff, __be32 addr,
+ __be32 new_addr)
+{
+   switch (iph->protocol) {
+   case IPPROTO_TCP:
+   if (nf_flow_nat_ip_tcp(skb, thoff, addr, new_addr) < 0)
+   return NF_DROP;
+   break;
+   case IPPROTO_UDP:
+   if (nf_flow_nat_ip_udp(skb, thoff, addr, new_addr) < 0)
+   return NF_DROP;
+   break;
+   }
+
+   return 0;
+}
+
+static int nf_flow_snat_ip(const struct flow_offload *flow, struct sk_buff *skb,
+  struct iphdr *iph, unsigned int thoff,
+  enum flow_offload_tuple_dir dir)
+{
+   __be32 addr, new_addr;
+
+   switch (dir) {
+   case FLOW_OFFLOAD_DIR_ORIGINAL:
+   addr = iph->saddr;
+   new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_REPLY].tuple.dst_v4.s_addr;
+   iph->saddr = new_addr;
+   break;
+   case FLOW_OFFLOAD_DIR_REPLY:
+   addr = iph->daddr;
+   new_addr = flow->tuplehash[FLOW_OFFLOAD_DIR_ORIGINAL].tuple.src_v4.s_addr;
+   iph->daddr = new_addr;
+   break;
+   default:
+   return -1;
+   }
+   csum_replace4(&iph->check, addr, new_addr);
+
+   return nf_flow_nat_ip_l4proto(skb, iph, thoff, addr, new_addr);
+}
+
+static int nf_flow_dnat_ip(const struct flow_offload *flow, struct sk_buff 

[PATCH nf-next,v3 5/7] netfilter: flow table support for the mixed IPv4/IPv6 family

2017-12-22 Thread Pablo Neira Ayuso
This patch adds the mixed IPv4/IPv6 flow table type, which allows a single
flow table to forward both IPv4 and IPv6 traffic.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_flow_table.h   |  5 
 net/ipv4/netfilter/nf_flow_table_ipv4.c |  3 ++-
 net/ipv6/netfilter/nf_flow_table_ipv6.c |  3 ++-
 net/netfilter/Kconfig   |  8 ++
 net/netfilter/nf_flow_table_inet.c  | 48 +
 5 files changed, 65 insertions(+), 2 deletions(-)
 create mode 100644 net/netfilter/nf_flow_table_inet.c

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index 161f71ca78a0..b22b22082733 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -111,6 +111,11 @@ struct flow_ports {
__be16 source, dest;
 };
 
+unsigned int nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
+const struct nf_hook_state *state);
+unsigned int nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
+  const struct nf_hook_state *state);
+
 #define MODULE_ALIAS_NF_FLOWTABLE(family)  \
MODULE_ALIAS("nf-flowtable-" __stringify(family))
 
diff --git a/net/ipv4/netfilter/nf_flow_table_ipv4.c b/net/ipv4/netfilter/nf_flow_table_ipv4.c
index ac56c0f0492a..b2d01eb25f2c 100644
--- a/net/ipv4/netfilter/nf_flow_table_ipv4.c
+++ b/net/ipv4/netfilter/nf_flow_table_ipv4.c
@@ -202,7 +202,7 @@ static bool nf_flow_exceeds_mtu(struct sk_buff *skb, const struct rtable *rt)
return false;
 }
 
-static unsigned int
+unsigned int
 nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
const struct nf_hook_state *state)
 {
@@ -254,6 +254,7 @@ nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 
return NF_STOLEN;
 }
+EXPORT_SYMBOL_GPL(nf_flow_offload_ip_hook);
 
 static struct nf_flowtable_type flowtable_ipv4 = {
.family = NFPROTO_IPV4,
diff --git a/net/ipv6/netfilter/nf_flow_table_ipv6.c b/net/ipv6/netfilter/nf_flow_table_ipv6.c
index ab78703154d8..021209be0c3c 100644
--- a/net/ipv6/netfilter/nf_flow_table_ipv6.c
+++ b/net/ipv6/netfilter/nf_flow_table_ipv6.c
@@ -196,7 +196,7 @@ static bool nf_flow_exceeds_mtu(struct sk_buff *skb, const struct rt6_info *rt)
return false;
 }
 
-static unsigned int
+unsigned int
 nf_flow_ipv6_offload_hook(void *priv, struct sk_buff *skb,
 const struct nf_hook_state *state)
 {
@@ -248,6 +248,7 @@ nf_flow_ipv6_offload_hook(void *priv, struct sk_buff *skb,
 
return NF_STOLEN;
 }
+EXPORT_SYMBOL_GPL(nf_flow_ipv6_offload_hook);
 
 static struct nf_flowtable_type flowtable_ipv6 = {
.family = NFPROTO_IPV6,
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index af0f58322515..0c6256db5a6c 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -649,6 +649,14 @@ endif # NF_TABLES_NETDEV
 
 endif # NF_TABLES
 
+config NF_FLOW_TABLE_INET
+   select NF_FLOW_TABLE
+   tristate "Netfilter flow table mixed IPv4/IPv6 module"
+   help
+  This option adds the flow table mixed IPv4/IPv6 support.
+
+ To compile it as a module, choose M here.
+
 config NF_FLOW_TABLE
tristate "Netfilter flow table module"
help
diff --git a/net/netfilter/nf_flow_table_inet.c b/net/netfilter/nf_flow_table_inet.c
new file mode 100644
index ..281209aeba8f
--- /dev/null
+++ b/net/netfilter/nf_flow_table_inet.c
@@ -0,0 +1,48 @@
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/netfilter.h>
+#include <linux/rhashtable.h>
+#include <net/netfilter/nf_flow_table.h>
+#include <net/netfilter/nf_tables.h>
+
+static unsigned int
+nf_flow_offload_inet_hook(void *priv, struct sk_buff *skb,
+ const struct nf_hook_state *state)
+{
+   switch (skb->protocol) {
+   case htons(ETH_P_IP):
+   return nf_flow_offload_ip_hook(priv, skb, state);
+   case htons(ETH_P_IPV6):
+   return nf_flow_offload_ipv6_hook(priv, skb, state);
+   }
+
+   return NF_ACCEPT;
+}
+
+static struct nf_flowtable_type flowtable_inet = {
+   .family = NFPROTO_INET,
+   .params = &nf_flow_offload_rhash_params,
+   .gc = nf_flow_offload_work_gc,
+   .hook   = nf_flow_offload_inet_hook,
+   .owner  = THIS_MODULE,
+};
+
+static int __init nf_flow_inet_module_init(void)
+{
+   nft_register_flowtable_type(&flowtable_inet);
+
+   return 0;
+}
+
+static void __exit nf_flow_inet_module_exit(void)
+{
+   nft_unregister_flowtable_type(&flowtable_inet);
+}
+
+module_init(nf_flow_inet_module_init);
+module_exit(nf_flow_inet_module_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Pablo Neira Ayuso ");
+MODULE_ALIAS_NF_FLOWTABLE(1); /* NFPROTO_INET */
-- 
2.11.0



[PATCH RFC nf-next,v3 7/7] netfilter: nf_flow_table: add hardware offload support

2017-12-22 Thread Pablo Neira Ayuso
This patch adds the infrastructure to offload flows to hardware, in case
the nic/switch comes with built-in flow table capabilities.

If the hardware comes with no flow tables, or its flow tables have
feature limitations, this falls back to the generic software flow table
implementation.

The software flow table garbage collector skips entries that reside in
the hardware, so the hardware is responsible for releasing these flow
table entries too, via flow_offload_dead(). On the next garbage collector
run, the entries are then removed from both the software and hardware
flow tables, from user context.
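
A minimal sketch of the driver side; foo_fte_insert()/foo_fte_remove()
are hypothetical helpers standing in for whatever programs the device's
flow table:

static int foo_ndo_flow_offload(enum flow_offload_type type,
                                struct flow_offload *flow)
{
        switch (type) {
        case FLOW_OFFLOAD_ADD:
                /* program both tuple directions into the NIC/switch
                 * flow table; foo_fte_insert() is hypothetical */
                return foo_fte_insert(flow);
        case FLOW_OFFLOAD_DEL:
                /* foo_fte_remove() is hypothetical as well */
                return foo_fte_remove(flow);
        }
        return -EOPNOTSUPP;
}

static const struct net_device_ops foo_netdev_ops = {
        /* ... other ndo callbacks ... */
        .ndo_flow_offload = foo_ndo_flow_offload,
};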

Signed-off-by: Pablo Neira Ayuso 
---
 include/linux/netdevice.h |   9 +++
 include/net/netfilter/nf_flow_table.h |   6 ++
 net/netfilter/Kconfig |   9 +++
 net/netfilter/Makefile|   1 +
 net/netfilter/nf_flow_table.c |  13 
 net/netfilter/nf_flow_table_hw.c  | 127 ++
 net/netfilter/nf_tables_api.c |   2 +
 net/netfilter/nft_flow_offload.c  |   4 ++
 8 files changed, 171 insertions(+)
 create mode 100644 net/netfilter/nf_flow_table_hw.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f535779d9dc1..5f2919775632 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -826,6 +826,13 @@ struct xfrmdev_ops {
 };
 #endif
 
+struct flow_offload;
+
+enum flow_offload_type {
+   FLOW_OFFLOAD_ADD= 0,
+   FLOW_OFFLOAD_DEL,
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1281,6 +1288,8 @@ struct net_device_ops {
int (*ndo_bridge_dellink)(struct net_device *dev,
  struct nlmsghdr *nlh,
  u16 flags);
+   int (*ndo_flow_offload)(enum flow_offload_type type,
+   struct flow_offload *flow);
int (*ndo_change_carrier)(struct net_device *dev,
  bool new_carrier);
int (*ndo_get_phys_port_id)(struct net_device *dev,
diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index b22b22082733..02ac8c7e4f7f 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -23,6 +23,7 @@ struct nf_flowtable {
struct rhashtable   rhashtable;
const struct nf_flowtable_type  *type;
struct delayed_work gc_work;
+   possible_net_t  ft_net;
 };
 
 enum flow_offload_tuple_dir {
@@ -65,6 +66,7 @@ struct flow_offload_tuple_rhash {
 #define FLOW_OFFLOAD_SNAT  0x1
 #define FLOW_OFFLOAD_DNAT  0x2
 #define FLOW_OFFLOAD_DYING 0x4
+#define FLOW_OFFLOAD_HW0x8
 
 struct flow_offload {
struct flow_offload_tuple_rhash tuplehash[FLOW_OFFLOAD_DIR_MAX];
@@ -116,6 +118,10 @@ unsigned int nf_flow_offload_ip_hook(void *priv, struct sk_buff *skb,
 unsigned int nf_flow_offload_ipv6_hook(void *priv, struct sk_buff *skb,
   const struct nf_hook_state *state);
 
+void flow_offload_hw_add(struct net *net, struct flow_offload *flow,
+struct nf_conn *ct);
+void flow_offload_hw_del(struct net *net, struct flow_offload *flow);
+
 #define MODULE_ALIAS_NF_FLOWTABLE(family)  \
MODULE_ALIAS("nf-flowtable-" __stringify(family))
 
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 1ada46345f3c..cc25876cf223 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -671,6 +671,15 @@ config NF_FLOW_TABLE
 
  To compile it as a module, choose M here.
 
+config NF_FLOW_TABLE_HW
+   tristate "Netfilter flow table hardware offload module"
+   depends on NF_FLOW_TABLE
+   help
+ This option adds hardware offload support for the flow table core
+ infrastructure.
+
+ To compile it as a module, choose M here.
+
 config NETFILTER_XTABLES
tristate "Netfilter Xtables support (required for ip_tables)"
default m if NETFILTER_ADVANCED=n
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 2c1b8de922f2..1a97a47ad4e8 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -109,6 +109,7 @@ obj-$(CONFIG_NFT_FWD_NETDEV)+= nft_fwd_netdev.o
 
 # flow table infrastructure
 obj-$(CONFIG_NF_FLOW_TABLE)+= nf_flow_table.o
+obj-$(CONFIG_NF_FLOW_TABLE_HW) += nf_flow_table_hw.o
 
 # generic X tables 
 obj-$(CONFIG_NETFILTER_XTABLES) += x_tables.o xt_tcpudp.o
diff --git a/net/netfilter/nf_flow_table.c b/net/netfilter/nf_flow_table.c
index e1024b17b910..a505351980fd 100644
--- a/net/netfilter/nf_flow_table.c
+++ b/net/netfilter/nf_flow_table.c

[PATCH nf-next,v3 4/7] netfilter: flow table support for IPv6

2017-12-22 Thread Pablo Neira Ayuso
This patch adds the IPv6 flow table type, that implements the datapath
flow table to forward IPv6 traffic.

This patch exports ip6_dst_mtu_forward(), which is required to check the
MTU so that packets in need of PMTUD handling are passed up to the classic
forwarding path.
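
The check itself is small; roughly (a sketch only — the patch's actual
helper is nf_flow_exceeds_mtu() below):

static bool flow_mtu_exceeded_sketch(const struct sk_buff *skb,
                                     const struct rt6_info *rt)
{
        unsigned int mtu = ip6_dst_mtu_forward(&rt->dst);

        /* oversized and not GSO: hand the packet back to the classic
         * forwarding path, where PMTUD (ICMPv6 Packet Too Big) runs */
        return skb->len > mtu && !skb_is_gso(skb);
}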

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/ipv6.h  |   2 +
 net/ipv6/ip6_output.c   |   3 +-
 net/ipv6/netfilter/Kconfig  |   8 +
 net/ipv6/netfilter/Makefile |   3 +
 net/ipv6/netfilter/nf_flow_table_ipv6.c | 277 
 5 files changed, 292 insertions(+), 1 deletion(-)
 create mode 100644 net/ipv6/netfilter/nf_flow_table_ipv6.c

diff --git a/include/net/ipv6.h b/include/net/ipv6.h
index 6eac5cf8f1e6..ff069a8e0cde 100644
--- a/include/net/ipv6.h
+++ b/include/net/ipv6.h
@@ -912,6 +912,8 @@ static inline struct sk_buff *ip6_finish_skb(struct sock *sk)
  &inet6_sk(sk)->cork);
 }
 
+unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst);
+
 int ip6_dst_lookup(struct net *net, struct sock *sk, struct dst_entry **dst,
   struct flowi6 *fl6);
 struct dst_entry *ip6_dst_lookup_flow(const struct sock *sk, struct flowi6 
*fl6,
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 43ca864327c7..5ccd082ce182 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -362,7 +362,7 @@ static inline int ip6_forward_finish(struct net *net, struct sock *sk,
return dst_output(net, sk, skb);
 }
 
-static unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
+unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
 {
unsigned int mtu;
struct inet6_dev *idev;
@@ -382,6 +382,7 @@ static unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
 
return mtu;
 }
+EXPORT_SYMBOL_GPL(ip6_dst_mtu_forward);
 
 static bool ip6_pkt_too_big(const struct sk_buff *skb, unsigned int mtu)
 {
diff --git a/net/ipv6/netfilter/Kconfig b/net/ipv6/netfilter/Kconfig
index 6acb2eecd986..806e95375ec8 100644
--- a/net/ipv6/netfilter/Kconfig
+++ b/net/ipv6/netfilter/Kconfig
@@ -71,6 +71,14 @@ config NFT_FIB_IPV6
 endif # NF_TABLES_IPV6
 endif # NF_TABLES
 
+config NF_FLOW_TABLE_IPV6
+   select NF_FLOW_TABLE
+   tristate "Netfilter flow table IPv6 module"
+   help
+ This option adds the flow table IPv6 support.
+
+ To compile it as a module, choose M here.
+
 config NF_DUP_IPV6
tristate "Netfilter IPv6 packet duplication to alternate destination"
depends on !NF_CONNTRACK || NF_CONNTRACK
diff --git a/net/ipv6/netfilter/Makefile b/net/ipv6/netfilter/Makefile
index fe180c96040e..7dceadbb9eea 100644
--- a/net/ipv6/netfilter/Makefile
+++ b/net/ipv6/netfilter/Makefile
@@ -44,6 +44,9 @@ obj-$(CONFIG_NFT_REDIR_IPV6) += nft_redir_ipv6.o
 obj-$(CONFIG_NFT_DUP_IPV6) += nft_dup_ipv6.o
 obj-$(CONFIG_NFT_FIB_IPV6) += nft_fib_ipv6.o
 
+# flow table support
+obj-$(CONFIG_NF_FLOW_TABLE_IPV6) += nf_flow_table_ipv6.o
+
 # matches
 obj-$(CONFIG_IP6_NF_MATCH_AH) += ip6t_ah.o
 obj-$(CONFIG_IP6_NF_MATCH_EUI64) += ip6t_eui64.o
diff --git a/net/ipv6/netfilter/nf_flow_table_ipv6.c b/net/ipv6/netfilter/nf_flow_table_ipv6.c
new file mode 100644
index ..ab78703154d8
--- /dev/null
+++ b/net/ipv6/netfilter/nf_flow_table_ipv6.c
@@ -0,0 +1,277 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+/* For layer 4 checksum field offset. */
+#include 
+#include 
+
+static int nf_flow_nat_ipv6_tcp(struct sk_buff *skb, unsigned int thoff,
+   struct in6_addr *addr,
+   struct in6_addr *new_addr)
+{
+   struct tcphdr *tcph;
+
+   if (!pskb_may_pull(skb, thoff + sizeof(*tcph)) ||
+   skb_try_make_writable(skb, thoff + sizeof(*tcph)))
+   return -1;
+
+   tcph = (void *)(skb_network_header(skb) + thoff);
+   inet_proto_csum_replace16(&tcph->check, skb, addr->s6_addr32,
+ new_addr->s6_addr32, true);
+
+   return 0;
+}
+
+static int nf_flow_nat_ipv6_udp(struct sk_buff *skb, unsigned int thoff,
+   struct in6_addr *addr,
+   struct in6_addr *new_addr)
+{
+   struct udphdr *udph;
+
+   if (!pskb_may_pull(skb, thoff + sizeof(*udph)) ||
+   skb_try_make_writable(skb, thoff + sizeof(*udph)))
+   return -1;
+
+   udph = (void *)(skb_network_header(skb) + thoff);
+   if (udph->check || skb->ip_summed == CHECKSUM_PARTIAL) {
+   inet_proto_csum_replace16(&udph->check, skb, addr->s6_addr32,
+ new_addr->s6_addr32, true);
+   if (!udph->check)
+   udph->check = CSUM_MANGLED_0;
+   }
+
+   return 0;
+}
+
+static int nf_flow_nat_ipv6_l4proto(struct sk_buff *skb, struct ipv6hdr *ip6h,

[PATCH nf-next,v3 2/7] netfilter: add generic flow table infrastructure

2017-12-22 Thread Pablo Neira Ayuso
This patch defines the API to interact with flow tables: it allows one
to add, delete and look up entries in the flow table. It also adds the
generic garbage collection code that removes entries that have expired,
i.e. for which no traffic has been seen for a while.

Users of the flow table infrastructure can delete entries via
flow_offload_dead(), which sets the dying bit; this signals the garbage
collector to release the entry from user context.
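
Putting the API together, a caller (such as the flow_offload expression
later in this series) uses it roughly as follows; error handling is
trimmed to a sketch:

static int flow_offload_example(struct nf_flowtable *flow_table,
                                struct nf_conn *ct,
                                struct nf_flow_route *route)
{
        struct flow_offload *flow;

        flow = flow_offload_alloc(ct, route);   /* snapshot the conntrack */
        if (!flow)
                return -ENOMEM;

        if (flow_offload_add(flow_table, flow) < 0) {   /* insert both tuples */
                flow_offload_free(flow);
                return -ENOENT;
        }

        /* later, e.g. when the conntrack goes away, flow_offload_dead(flow)
         * sets the dying bit and the garbage collector releases the entry
         * from user context. */
        return 0;
}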

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_flow_table.h |  94 
 net/netfilter/Kconfig |   7 +
 net/netfilter/Makefile|   3 +
 net/netfilter/nf_flow_table.c | 434 ++
 4 files changed, 538 insertions(+)
 create mode 100644 net/netfilter/nf_flow_table.c

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
index 3a0779589281..161f71ca78a0 100644
--- a/include/net/netfilter/nf_flow_table.h
+++ b/include/net/netfilter/nf_flow_table.h
@@ -1,7 +1,12 @@
 #ifndef _NF_FLOW_TABLE_H
 #define _NF_FLOW_TABLE_H
 
+#include <linux/in.h>
+#include <linux/in6.h>
+#include <linux/netdevice.h>
 #include <linux/rhashtable.h>
+#include <linux/rcupdate.h>
+#include <net/dst.h>
 
 struct nf_flowtable;
 
@@ -20,4 +25,93 @@ struct nf_flowtable {
struct delayed_work gc_work;
 };
 
+enum flow_offload_tuple_dir {
+   FLOW_OFFLOAD_DIR_ORIGINAL,
+   FLOW_OFFLOAD_DIR_REPLY,
+   __FLOW_OFFLOAD_DIR_MAX  = FLOW_OFFLOAD_DIR_REPLY,
+};
+#define FLOW_OFFLOAD_DIR_MAX   (__FLOW_OFFLOAD_DIR_MAX + 1)
+
+struct flow_offload_tuple {
+   union {
+   struct in_addr  src_v4;
+   struct in6_addr src_v6;
+   };
+   union {
+   struct in_addr  dst_v4;
+   struct in6_addr dst_v6;
+   };
+   struct {
+   __be16  src_port;
+   __be16  dst_port;
+   };
+
+   int iifidx;
+
+   u8  l3proto;
+   u8  l4proto;
+   u8  dir;
+
+   int oifidx;
+
+   struct dst_entry*dst_cache;
+};
+
+struct flow_offload_tuple_rhash {
+   struct rhash_head   node;
+   struct flow_offload_tuple   tuple;
+};
+
+#define FLOW_OFFLOAD_SNAT  0x1
+#define FLOW_OFFLOAD_DNAT  0x2
+#define FLOW_OFFLOAD_DYING 0x4
+
+struct flow_offload {
+   struct flow_offload_tuple_rhash tuplehash[FLOW_OFFLOAD_DIR_MAX];
+   u32 flags;
+   union {
+   /* Your private driver data here. */
+   u32 timeout;
+   };
+};
+
+#define NF_FLOW_TIMEOUT (30 * HZ)
+
+struct nf_flow_route {
+   struct {
+   struct dst_entry*dst;
+   int ifindex;
+   } tuple[FLOW_OFFLOAD_DIR_MAX];
+};
+
+struct flow_offload *flow_offload_alloc(struct nf_conn *ct,
+   struct nf_flow_route *route);
+void flow_offload_free(struct flow_offload *flow);
+
+int flow_offload_add(struct nf_flowtable *flow_table, struct flow_offload *flow);
+void flow_offload_del(struct nf_flowtable *flow_table, struct flow_offload *flow);
+struct flow_offload_tuple_rhash *flow_offload_lookup(struct nf_flowtable *flow_table,
+struct flow_offload_tuple *tuple);
+int nf_flow_table_iterate(struct nf_flowtable *flow_table,
+ void (*iter)(struct flow_offload *flow, void *data),
+ void *data);
+void nf_flow_offload_work_gc(struct work_struct *work);
+extern const struct rhashtable_params nf_flow_offload_rhash_params;
+
+void flow_offload_dead(struct flow_offload *flow);
+
+int nf_flow_snat_port(const struct flow_offload *flow,
+ struct sk_buff *skb, unsigned int thoff,
+ u8 protocol, enum flow_offload_tuple_dir dir);
+int nf_flow_dnat_port(const struct flow_offload *flow,
+ struct sk_buff *skb, unsigned int thoff,
+ u8 protocol, enum flow_offload_tuple_dir dir);
+
+struct flow_ports {
+   __be16 source, dest;
+};
+
+#define MODULE_ALIAS_NF_FLOWTABLE(family)  \
+   MODULE_ALIAS("nf-flowtable-" __stringify(family))
+
 #endif /* _FLOW_OFFLOAD_H */
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index e4a13cc8a2e7..af0f58322515 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -649,6 +649,13 @@ endif # NF_TABLES_NETDEV
 
 endif # NF_TABLES
 
+config NF_FLOW_TABLE
+   tristate "Netfilter flow table module"
+   help
+ This option adds the flow table core infrastructure.
+
+ To compile it as a module, choose M here.
+
 config NETFILTER_XTABLES
tristate "Netfilter Xtables support (required for ip_tables)"
default m if 

[PATCH nf-next,v3 1/7] netfilter: nf_tables: add flow table netlink frontend

2017-12-22 Thread Pablo Neira Ayuso
This patch introduces a netlink control plane to create, delete and dump
flow tables. Flow tables are identified by name; this name is used from
rules to refer to a specific flow table. Flow tables use the rhashtable
class and a generic garbage collector to remove expired entries.

This also adds the infrastructure to add different flow table types, so
we can add one for each layer 3 protocol family.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/netfilter/nf_flow_table.h|  23 +
 include/net/netfilter/nf_tables.h|  48 ++
 include/uapi/linux/netfilter/nf_tables.h |  53 +++
 net/netfilter/nf_tables_api.c| 747 ++-
 4 files changed, 870 insertions(+), 1 deletion(-)
 create mode 100644 include/net/netfilter/nf_flow_table.h

diff --git a/include/net/netfilter/nf_flow_table.h b/include/net/netfilter/nf_flow_table.h
new file mode 100644
index ..3a0779589281
--- /dev/null
+++ b/include/net/netfilter/nf_flow_table.h
@@ -0,0 +1,23 @@
+#ifndef _NF_FLOW_TABLE_H
+#define _NF_FLOW_TABLE_H
+
+#include <linux/rhashtable.h>
+
+struct nf_flowtable;
+
+struct nf_flowtable_type {
+   struct list_headlist;
+   int family;
+   void(*gc)(struct work_struct *work);
+   const struct rhashtable_params  *params;
+   nf_hookfn   *hook;
+   struct module   *owner;
+};
+
+struct nf_flowtable {
+   struct rhashtable   rhashtable;
+   const struct nf_flowtable_type  *type;
+   struct delayed_work gc_work;
+};
+
+#endif /* _FLOW_OFFLOAD_H */
diff --git a/include/net/netfilter/nf_tables.h b/include/net/netfilter/nf_tables.h
index 0f5b12a4ad09..624928d22589 100644
--- a/include/net/netfilter/nf_tables.h
+++ b/include/net/netfilter/nf_tables.h
@@ -8,6 +8,7 @@
 #include 
 #include 
 #include 
+#include <net/netfilter/nf_flow_table.h>
 #include 
 
 #define NFT_JUMP_STACK_SIZE16
@@ -942,6 +943,7 @@ unsigned int nft_do_chain(struct nft_pktinfo *pkt, void *priv);
  * @chains: chains in the table
  * @sets: sets in the table
  * @objects: stateful objects in the table
+ * @flowtables: flow tables in the table
  * @hgenerator: handle generator state
  * @use: number of chain references to this table
  * @flags: table flag (see enum nft_table_flags)
@@ -953,6 +955,7 @@ struct nft_table {
struct list_headchains;
struct list_headsets;
struct list_headobjects;
+   struct list_headflowtables;
u64 hgenerator;
u32 use;
u16 flags:14,
@@ -1091,6 +1094,44 @@ int nft_register_obj(struct nft_object_type *obj_type);
 void nft_unregister_obj(struct nft_object_type *obj_type);
 
 /**
+ * struct nft_flowtable - nf_tables flow table
+ *
+ * @list: flow table list node in table list
+ * @table: the table the flow table is contained in
+ * @name: name of this flow table
+ * @hooknum: hook number
+ * @priority: hook priority
+ * @ops_len: number of hooks in array
+ * @genmask: generation mask
+ * @use: number of references to this flow table
+ * @data: rhashtable and garbage collector
+ * @ops: array of hooks
+ */
+struct nft_flowtable {
+   struct list_headlist;
+   struct nft_table*table;
+   char*name;
+   int hooknum;
+   int priority;
+   int ops_len;
+   u32 genmask:2,
+   use:30;
+   /* runtime data below here */
+   struct nf_hook_ops  *ops ____cacheline_aligned;
+   struct nf_flowtable data;
+};
+
+struct nft_flowtable *nf_tables_flowtable_lookup(const struct nft_table *table,
+const struct nlattr *nla,
+u8 genmask);
+void nft_flow_table_iterate(struct net *net,
+   void (*iter)(struct nf_flowtable *flowtable, void *data),
+   void *data);
+
+void nft_register_flowtable_type(struct nf_flowtable_type *type);
+void nft_unregister_flowtable_type(struct nf_flowtable_type *type);
+
+/**
  * struct nft_traceinfo - nft tracing information and state
  *
  * @pkt: pktinfo currently processed
@@ -1326,4 +1367,11 @@ struct nft_trans_obj {
 #define nft_trans_obj(trans)   \
(((struct nft_trans_obj *)trans->data)->obj)
 
+struct nft_trans_flowtable {
+   struct nft_flowtable*flowtable;
+};
+
+#define nft_trans_flowtable(trans) \
+   (((struct nft_trans_flowtable *)trans->data)->flowtable)
+
 #endif /* _NET_NF_TABLES_H */
diff --git 

[PATCH nf-next,v3 0/7] Flow offload infrastructure

2017-12-22 Thread Pablo Neira Ayuso
Hi,

This is a new round of the patchset to add the flow offload
infrastructure [1][2].

This round comes with IPv6 and mixed IPv4/IPv6 support, hardware offload
support in a separated nf_flow_table_hw module, port translation, net
namespace support and several bugfixes.

Patch 7/7 has been tagged as RFC, I will keep this one back until
there's an initial driver that introduces flow table offload support,
likely in a branch in nf-next.git once this new infrastructure gets
merged upstream.

Comments welcome, thanks.

[1] https://lwn.net/Articles/738214/
[2] https://marc.info/?l=netfilter-devel&m=151266258119014&w=2

Pablo Neira Ayuso (7):
  netfilter: nf_tables: add flow table netlink frontend
  netfilter: add generic flow table infrastructure
  netfilter: flow table support for IPv4
  netfilter: flow table support for IPv6
  netfilter: flow table support for the mixed IPv4/IPv6 family
  netfilter: nf_tables: flow offload expression
  netfilter: nf_flow_table: add hardware offload support

 include/linux/netdevice.h|   9 +
 include/net/ipv6.h   |   2 +
 include/net/netfilter/nf_flow_table.h| 128 ++
 include/net/netfilter/nf_tables.h|  48 ++
 include/uapi/linux/netfilter/nf_tables.h |  64 +++
 net/ipv4/netfilter/Kconfig   |   8 +
 net/ipv4/netfilter/Makefile  |   3 +
 net/ipv4/netfilter/nf_flow_table_ipv4.c  | 284 
 net/ipv6/ip6_output.c|   3 +-
 net/ipv6/netfilter/Kconfig   |   8 +
 net/ipv6/netfilter/Makefile  |   3 +
 net/ipv6/netfilter/nf_flow_table_ipv6.c  | 278 
 net/netfilter/Kconfig|  31 ++
 net/netfilter/Makefile   |   5 +
 net/netfilter/nf_flow_table.c| 447 ++
 net/netfilter/nf_flow_table_hw.c | 127 ++
 net/netfilter/nf_flow_table_inet.c   |  48 ++
 net/netfilter/nf_tables_api.c| 749 ++-
 net/netfilter/nft_flow_offload.c | 272 +++
 19 files changed, 2515 insertions(+), 2 deletions(-)
 create mode 100644 include/net/netfilter/nf_flow_table.h
 create mode 100644 net/ipv4/netfilter/nf_flow_table_ipv4.c
 create mode 100644 net/ipv6/netfilter/nf_flow_table_ipv6.c
 create mode 100644 net/netfilter/nf_flow_table.c
 create mode 100644 net/netfilter/nf_flow_table_hw.c
 create mode 100644 net/netfilter/nf_flow_table_inet.c
 create mode 100644 net/netfilter/nft_flow_offload.c

-- 
2.11.0



Re: [PATCH net] rtnetlink: fix struct net reference leak

2017-12-22 Thread Craig Gallek
On Fri, Dec 22, 2017 at 8:59 AM, Craig Gallek  wrote:
> On Fri, Dec 22, 2017 at 3:11 AM, Nicolas Dichtel
>  wrote:
>> On 21/12/2017 at 23:18, Craig Gallek wrote:
>>> From: Craig Gallek 
>>>
>>> The below referenced commit extended the RTM_GETLINK interface to
>>> allow querying by netns id.  The netnsid property was previously
>>> defined as a signed integer, but this patch assumes that the user
>>> always passes a positive integer.  syzkaller discovered this problem
>>> by setting a negative netnsid and then calling the get-link path
>>> in a tight loop.  This surprisingly quickly overflows the reference
>>> count on the associated struct net, potentially destroying it.  When the
>>> default namespace is used, the machine crashes in strange and interesting
>>> ways.
>>>
>>> Unfortunately, this is not easy to reproduce with just the ip tool
>>> as it enforces unsigned integer parsing despite the interface interpreting
>>> the NETNSID attribute as signed.
>>>
>>> I'm not sure why this attribute is signed in the first place, but
>>> the first commit that introduced it (6621dd29eb9b) is in v4.15-rc4,
>>> so I assume it's too late to change.
>> A valid (assigned) nsid is always >= 0.
>>
>>>
>>> This patch removes the positive netns id assumption, but adds another
>>> assumption that the netns id 0 is always the 'self' identifying id (for
>>> which an additional struct net reference is not necessary).
>> We cannot make this assumption, this is wrong. nsids may be automatically
>> allocated by the kernel, and they start at 0.
>> The current netns can be identified by NETNSA_NSID_NOT_ASSIGNED, i.e. -1.
> Thank you, I'll respin this with NETNSA_NSID_NOT_ASSIGNED as the sentinel 
> value.

Looking at the netns id code more closely, there are several places
that assume ids will never be negative (short of the sentinel).  I think
the only simple fix here is to update the netlink interfaces to not
accept negative values as input.  I'm going to send that patch
instead...
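
(For illustration only, the kind of check being floated at the attribute
boundary — the helper name and its placement are hypothetical, not the
actual patch:)

static int rtnl_validate_nsid_sketch(struct nlattr *tb[])
{
        if (tb[NETNSA_NSID]) {
                s32 nsid = nla_get_s32(tb[NETNSA_NSID]);

                /* a valid (assigned) nsid is always >= 0; reject
                 * negative user input instead of trusting it downstream */
                if (nsid < 0)
                        return -EINVAL;
        }
        return 0;
}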


Re: [PATCH] bpf: selftest for late caller stack size increase

2017-12-22 Thread Alexei Starovoitov
On Fri, Dec 22, 2017 at 07:12:35PM +0100, Jann Horn wrote:
> This checks that it is not possible to bypass the total stack size check in
> update_stack_depth() by calling a function that uses a large amount of
> stack memory *before* using a large amount of stack memory in the caller.
> 
> Currently, the first added testcase causes a rejection as expected, but
> the second testcase is (AFAICS incorrectly) accepted:
> 
> [...]
> #483/p calls: stack overflow using two frames (post-call access) FAIL
> Unexpected success to load!
> 0: (85) call pc+2
> caller:
>  R10=fp0,call_-1
> callee:
>  frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_0
> 3: (72) *(u8 *)(r10 -300) = 0
> 4: (b7) r0 = 0
> 5: (95) exit
> returning from callee:
>  frame1: R0_w=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0,call_0
> to caller at 1:
>  R0_w=inv0 R10=fp0,call_-1
> 
> from 5 to 1: R0=inv0 R10=fp0,call_-1
> 1: (72) *(u8 *)(r10 -300) = 0
> 2: (95) exit
> processed 6 insns, stack depth 300+300

got it. thanks for the test!
working on a fix.


Re: INFO: task hung in bpf_exit_net

2017-12-22 Thread Marcelo Ricardo Leitner
On Fri, Dec 22, 2017 at 04:28:07PM -0200, Marcelo Ricardo Leitner wrote:
> On Fri, Dec 22, 2017 at 11:58:08AM +0100, Dmitry Vyukov wrote:
> ...
> > > Same with this one, perhaps related to / fixed by:
> > > http://patchwork.ozlabs.org/patch/850957/
> > >
> > 
> > 
> > 
> > Looking at the log, this one seems to be an infinite loop in SCTP code
> > with console output in it. Kernel is busy printing gazilion of:
> > 
> > [  176.491099] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
> > low, using default minimum of 512
> > ** 110 printk messages dropped **
> > [  176.503409] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
> > low, using default minimum of 512
> > ** 103 printk messages dropped **
> > ...
> > [  246.742374] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
> > low, using default minimum of 512
> > [  246.742484] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
> > low, using default minimum of 512
> > [  246.742590] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
> > low, using default minimum of 512
> > 
> > Looks like a different issue.
> > 
> 
> Oh. I guess this is caused by the interface having an MTU smaller than
> SCTP_DEFAULT_MINSEGMENT (512), as the icmp frag needed handler
> (sctp_icmp_frag_needed) will trigger an instant retransmission.
> But as the MTU is smaller, SCTP won't update it, but will issue the
> retransmission anyway.
> 
> I will test this soon. Should be fairly easy to trigger it.

Reproduced it.

netns A veth0(1500) - veth1(1500) B veth2(508) - veth3(508) C

When A sends an sctp packet bigger than 508 bytes, it triggers the issue:
B replies with an icmp frag needed carrying a size that sctp won't accept,
yet sctp issues the retransmission anyway.
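
One plausible shape for a fix (a sketch only — it assumes
sctp_transport_update_pmtu() is changed to report whether the new pmtu
was actually accepted, which it does not do today):

void sctp_icmp_frag_needed(struct sock *sk, struct sctp_association *asoc,
                           struct sctp_transport *t, __u32 pmtu)
{
        if (!t || t->pathmtu <= pmtu)
                return;

        if (!sctp_transport_update_pmtu(t, pmtu))
                return; /* pmtu below SCTP_DEFAULT_MINSEGMENT was refused;
                         * retransmitting would re-trigger the same ICMP */

        sctp_retransmit(&asoc->outqueue, t, SCTP_RTXR_PMTUD);
}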

  Marcelo


Re: [PATCH v2 bpf-next 2/2] tools/bpftool: fix bpftool build with bintutils >= 2.8

2017-12-22 Thread Quentin Monnet
Hi Roman,

2017-12-22 16:11 UTC+ ~ Roman Gushchin 
> Bpftool build is broken with binutils version 2.28 and later.

Could you check the binutils version? I believe it changed in 2.29
instead of 2.28. Could you update your commit log and subject
accordingly, please?

> The cause is commit 003ca0fd2286 ("Refactor disassembler selection")
> in the binutils repo, which changed the disassembler() function
> signature.
> 
> Fix this by adding a new "feature" to the tools/build/features
> infrastructure and making it responsible for deciding which
> disassembler() function signature to use.
> 
> Signed-off-by: Roman Gushchin 
> Cc: Jakub Kicinski 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> ---
>  tools/bpf/Makefile| 29 +++
>  tools/bpf/bpf_jit_disasm.c|  7 ++
>  tools/bpf/bpftool/Makefile| 24 +++
>  tools/bpf/bpftool/jit_disasm.c|  7 ++
>  tools/build/feature/Makefile  |  4 
>  tools/build/feature/test-disassembler-four-args.c | 15 
>  6 files changed, 86 insertions(+)
>  create mode 100644 tools/build/feature/test-disassembler-four-args.c
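
(The 15-line feature test itself is not shown in this excerpt; presumably
it is something along these lines, probing the binutils >= 2.29
four-argument disassembler() signature — a sketch, not the actual file:)

#include <bfd.h>
#include <dis-asm.h>

int main(void)
{
        bfd *abfd = bfd_openr(NULL, NULL);

        /* only needs to compile and link against the new signature */
        disassembler(bfd_get_arch(abfd),
                     bfd_big_endian(abfd),
                     bfd_get_mach(abfd),
                     abfd);

        return 0;
}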
> 
> diff --git a/tools/bpf/Makefile b/tools/bpf/Makefile
> index 07a6697466ef..c8ec0ae16bf0 100644
> --- a/tools/bpf/Makefile
> +++ b/tools/bpf/Makefile
> @@ -9,6 +9,35 @@ MAKE = make
>  CFLAGS += -Wall -O2
>  CFLAGS += -D__EXPORTED_HEADERS__ -I../../include/uapi -I../../include
>  
> +ifeq ($(srctree),)
> +srctree := $(patsubst %/,%,$(dir $(CURDIR)))
> +srctree := $(patsubst %/,%,$(dir $(srctree)))
> +endif
> +
> +FEATURE_USER = .bpf
> +FEATURE_TESTS = libbfd disassembler-four-args
> +FEATURE_DISPLAY = libbfd disassembler-four-args

Thanks for adding libbfd as I requested. However, you do not use it in
the Makefile to prevent compilation if the feature is not detected (see
"bpfdep" or "elfdep" in tools/lib/bpf/Makefile). Sorry, I should have
pointed it out in my previous review.

But actually, I have another issue related to the libbfd feature: since
commit 280e7c48c3b8 ("perf tools: fix BFD detection on opensuse") it
requires libiberty so that libbfd is correctly detected, but libiberty
is not needed on all distros (at least Ubuntu can have libbfd without
libiberty). Typically, detection fails on my setup, although I do have
libbfd installed. So forcing libbfd feature here may eventually force
users to install libraries they do not need to compile bpftool, which is
not what we want.

I do not have a clean work around to suggest. Maybe have one
"libbfd-something" feature that tries to compile without libiberty, then
another one that tries with it, and compile the tools if at least one of
them succeeds. But it's probably for another patch series. In the
meantime, would you please simply remove libbfd detection here and
accept my apologies for suggesting to add it in the previous review?

> +
> +check_feat := 1
> +NON_CHECK_FEAT_TARGETS := clean bpftool_clean
> +ifdef MAKECMDGOALS
> +ifeq ($(filter-out $(NON_CHECK_FEAT_TARGETS),$(MAKECMDGOALS)),)
> +  check_feat := 0
> +endif
> +endif
> +
> +ifeq ($(check_feat),1)
> +ifeq ($(FEATURES_DUMP),)
> +include $(srctree)/tools/build/Makefile.feature
> +else
> +include $(FEATURES_DUMP)
> +endif
> +endif
> +
> +ifeq ($(feature-disassembler-four-args), 1)
> +CFLAGS += -DDISASM_FOUR_ARGS_SIGNATURE
> +endif
> +
>  %.yacc.c: %.y
>   $(YACC) -o $@ -d $<
>  
> diff --git a/tools/bpf/bpf_jit_disasm.c b/tools/bpf/bpf_jit_disasm.c
> index 75bf526a0168..30044bc4f389 100644
> --- a/tools/bpf/bpf_jit_disasm.c
> +++ b/tools/bpf/bpf_jit_disasm.c
> @@ -72,7 +72,14 @@ static void get_asm_insns(uint8_t *image, size_t len, int opcodes)
>  
>   disassemble_init_for_target();
>  
> +#ifdef DISASM_FOUR_ARGS_SIGNATURE
> + disassemble = disassembler(info.arch,
> +bfd_big_endian(bfdf),
> +info.mach,
> +bfdf);
> +#else
>   disassemble = disassembler(bfdf);
> +#endif
>   assert(disassemble);
>  
>   do {
> diff --git a/tools/bpf/bpftool/Makefile b/tools/bpf/bpftool/Makefile
> index f8f31a8d9269..2237bc43f71c 100644
> --- a/tools/bpf/bpftool/Makefile
> +++ b/tools/bpf/bpftool/Makefile
> @@ -46,6 +46,30 @@ LIBS = -lelf -lbfd -lopcodes $(LIBBPF)
>  INSTALL ?= install
>  RM ?= rm -f
>  
> +FEATURE_USER = .bpftool
> +FEATURE_TESTS = libbfd disassembler-four-args
> +FEATURE_DISPLAY = libbfd disassembler-four-args
> +
> +check_feat := 1
> +NON_CHECK_FEAT_TARGETS := clean uninstall doc doc-clean doc-install doc-uninstall

Nit: exclude "install" as well? I know libbpf does not exclude it, but
if the user runs `make` then `make install` we do not need to check
again the features for binary installation.

> +ifdef MAKECMDGOALS
> +ifeq ($(filter-out 

Re: INFO: task hung in bpf_exit_net

2017-12-22 Thread Marcelo Ricardo Leitner
On Fri, Dec 22, 2017 at 11:58:08AM +0100, Dmitry Vyukov wrote:
...
> > Same with this one, perhaps related to / fixed by:
> > http://patchwork.ozlabs.org/patch/850957/
> >
> 
> 
> 
> Looking at the log, this one seems to be an infinite loop in SCTP code
> with console output in it. Kernel is busy printing gazilion of:
> 
> [  176.491099] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
> low, using default minimum of 512
> ** 110 printk messages dropped **
> [  176.503409] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
> low, using default minimum of 512
> ** 103 printk messages dropped **
> ...
> [  246.742374] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
> low, using default minimum of 512
> [  246.742484] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
> low, using default minimum of 512
> [  246.742590] sctp: sctp_transport_update_pmtu: Reported pmtu 508 too
> low, using default minimum of 512
> 
> Looks like a different issue.
> 

Oh. I guess this is caused by the interface having an MTU smaller than
SCTP_DEFAULT_MINSEGMENT (512), as the icmp frag needed handler
(sctp_icmp_frag_needed) will trigger an instant retransmission.
But as the MTU is smaller, SCTP won't update it, but will issue the
retransmission anyway.

I will test this soon. Should be fairly easy to trigger it.

  Marcelo


Re: [PATCH V2 net-next 3/3] rds: tcp: cleanup if kmem_cache_alloc fails in rds_tcp_conn_alloc()

2017-12-22 Thread Santosh Shilimkar

On 12/22/2017 9:39 AM, Sowmini Varadhan wrote:

If kmem_cache_alloc() fails in the middle of the for() loop,
cleanup anything that might have been allocated so far.

Signed-off-by: Sowmini Varadhan 
---
v2: target net-next, not net


Acked-by: Santosh Shilimkar 


Re: correctness of BPF stack size checking logic for multi-function programs?

2017-12-22 Thread Jann Horn
On Fri, Dec 22, 2017 at 4:37 AM, Alexei Starovoitov
 wrote:
> On Fri, Dec 22, 2017 at 02:14:45AM +0100, Jann Horn wrote:
>> Hi!
>>
>> I saw the recently-added support for multiple functions in a single
>> program in BPF. I've stumbled over something that looks like it might
>> be a bug; I haven't verified it yet, but I thought I should give you a
>> heads-up before this lands in a release in case I'm right. If I'm
>> wrong, it might be worth adding a comment to stacksafe() that explains
>> why.
[...]
> but I will rewrite a test case for it unless you beat me to it :)

I just sent a failing test case for the case I'm talking about, subject
"[PATCH] bpf: selftest for late caller stack size increase".


[PATCH v2] sctp: Replace use of sockets_allocated with specified macro.

2017-12-22 Thread Tonghao Zhang
Commit 180d8cd942ce replaced all uses of the struct sock fields
memory_pressure, memory_allocated, sockets_allocated, and sysctl_mem
with accessor macros, but the sockets_allocated counter of the sctp
sock was not converted at all. Replace it now to unify the code.

Fixes: 180d8cd942ce ("foundations of per-cgroup memory pressure controlling.")
Cc: Glauber Costa 
Signed-off-by: Tonghao Zhang 
---
fix typo.
---
 net/sctp/socket.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index aadcd4244d9b..a5e2150ab013 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4569,7 +4569,7 @@ static int sctp_init_sock(struct sock *sk)
SCTP_DBG_OBJCNT_INC(sock);
 
local_bh_disable();
-   percpu_counter_inc(&sctp_sockets_allocated);
+   sk_sockets_allocated_inc(sk);
sock_prot_inuse_add(net, sk->sk_prot, 1);
 
/* Nothing can fail after this block, otherwise
@@ -4613,7 +4613,7 @@ static void sctp_destroy_sock(struct sock *sk)
}
sctp_endpoint_free(sp->ep);
local_bh_disable();
-   percpu_counter_dec(&sctp_sockets_allocated);
+   sk_sockets_allocated_dec(sk);
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
local_bh_enable();
 }
-- 
2.13.6



Re: [PATCH V2 net-next 2/3] rds: tcp: initialize t_tcp_detached to false

2017-12-22 Thread Santosh Shilimkar

On 12/22/2017 9:39 AM, Sowmini Varadhan wrote:

Commit f10b4cff98c6 ("rds: tcp: atomically purge entries from
rds_tcp_conn_list during netns delete") adds the field t_tcp_detached,
but this needs to be initialized explicitly to false.

Signed-off-by: Sowmini Varadhan 
---
v2: target net-next, not net


Acked-by: Santosh Shilimkar 


Re: [PATCH V2 net-next 1/3] rds: Reset rs->rs_bound_addr in rds_add_bound() failure path

2017-12-22 Thread Santosh Shilimkar

On 12/22/2017 9:38 AM, Sowmini Varadhan wrote:

If the rds_sock is not added to the bind_hash_table, we must
reset rs_bound_addr so that rds_remove_bound will not trip on
this rds_sock.

rds_add_bound() does a rds_sock_put() in this failure path, so
failing to reset rs_bound_addr will result in a socket refcount
bug, and will trigger a WARN_ON with the stack shown below when
the application subsequently tries to close the PF_RDS socket.

  WARNING: CPU: 20 PID: 19499 at net/rds/af_rds.c:496 \
rds_sock_destruct+0x15/0x30 [rds]
:
  __sk_destruct+0x21/0x190
  rds_remove_bound.part.13+0xb6/0x140 [rds]
  rds_release+0x71/0x120 [rds]
  sock_release+0x1a/0x70
  sock_close+0xe/0x20
  __fput+0xd5/0x210
  task_work_run+0x82/0xa0
  do_exit+0x2ce/0xb30
  ? syscall_trace_enter+0x1cc/0x2b0
  do_group_exit+0x39/0xa0
  SyS_exit_group+0x10/0x10
  do_syscall_64+0x61/0x1a0

Signed-off-by: Sowmini Varadhan 
---
v2: target net-next, not net


Acked-by: Santosh Shilimkar 


Re: [PATCH net-next v5 0/5] Introduce NETIF_F_GRO_HW

2017-12-22 Thread Alexander Duyck
On Fri, Dec 22, 2017 at 6:57 AM, Sabrina Dubroca  wrote:
> Hello,
>
> Sorry for commenting late.
>
> 2017-12-16, 03:09:39 -0500, Michael Chan wrote:
>> Introduce NETIF_F_GRO_HW feature flag and convert drivers that support
>> hardware GRO to use the new flag.
>>
>> v5:
>> - Documentation changes requested by Alexander Duyck.
>> - bnx2x changes requested by Manish Chopra to enable LRO by default, and
>> disable GRO_HW if disable_tpa module parameter is set.
>>
>> v4:
>> - more changes requested by Alexander Duyck:
>> - check GRO_HW/GRO dependency in drivers's ndo_fix_features().
>> - Reverse the order of RXCSUM and GRO_HW dependency check in
>> netdev_fix_features().
>> - No propagation in netdev_disable_gro_hw().
>
> IIUC, with the patches that were applied, each driver can define
> whether GRO_HW depends on GRO? From a user's perspective, this
> inconsistent behavior is going to be quite confusing.
>
> Worse than inconsistent behavior, it looks like a driver deciding that
> GRO_HW doesn't depend on GRO is going to introduce a change of
> behavior.  Previously, when GRO was disabled, there wouldn't be any
> packet over MTU handed to the network stack.  Now, even if GRO is
> disabled, GRO_HW might still be enabled, so we might get over-MTU
> packets because of hardware GRO.

This isn't actually true. LRO was still handling packets larger than
MTU even when GRO was disabled.

> I don't think drivers should be allowed to say "GRO_HW doesn't depend
> on GRO".

Why not, it doesn't. In my mind GRO_HW is closer to LRO than it is to
GRO. The only ugly bit as I see it is that these devices were exposing
the feature via the GRO flag in the first place. So for the sake of
legacy they might want to carry around the dependency.
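
For reference, the per-driver dependency check mentioned in the v4 notes
looks roughly like this in a driver's ndo_fix_features() (a sketch;
foo_fix_features is a placeholder name):

static netdev_features_t foo_fix_features(struct net_device *dev,
                                          netdev_features_t features)
{
        /* GRO_HW depends on GRO: if GRO is being turned off,
         * turn hardware GRO off as well */
        if (!(features & NETIF_F_GRO))
                features &= ~NETIF_F_GRO_HW;

        return features;
}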

> I think it's reasonable to be able to disable software GRO even if
> hardware GRO is enabled. Thus, I would propose:
> - keep the current GRO flag
> - add a GRO_HW flag, depending on GRO, enforced by the core as in
>   earlier versions of these patches
> - add a GRO_SW flag, also depending on GRO

This seems like a bunch of extra overhead for not much gain. Do we
really need to fork GRO into 3 bits? I would argue that GRO_HW really
should have been branded something like FORWARDABLE_LRO, but nobody
wanted to touch the name LRO since it apparently has some negative
stigma to it. If we had used a name like that we probably wouldn't be
going through all these extra hoops. The only real reason why this is
even being associated with GRO in the first place is that is how this
feature was hidden by the drivers so they got around having to deal
with the LRO being disabled for routing/forwarding issue. Those are
the parts that want to keep it associated with GRO since that is how
they exposed it in their devices originally.

- Alex


[PATCH] bpf: selftest for late caller stack size increase

2017-12-22 Thread Jann Horn
This checks that it is not possible to bypass the total stack size check in
update_stack_depth() by calling a function that uses a large amount of
stack memory *before* using a large amount of stack memory in the caller.

Currently, the first added testcase causes a rejection as expected, but
the second testcase is (AFAICS incorrectly) accepted:

[...]
#483/p calls: stack overflow using two frames (post-call access) FAIL
Unexpected success to load!
0: (85) call pc+2
caller:
 R10=fp0,call_-1
callee:
 frame1: R1=ctx(id=0,off=0,imm=0) R10=fp0,call_0
3: (72) *(u8 *)(r10 -300) = 0
4: (b7) r0 = 0
5: (95) exit
returning from callee:
 frame1: R0_w=inv0 R1=ctx(id=0,off=0,imm=0) R10=fp0,call_0
to caller at 1:
 R0_w=inv0 R10=fp0,call_-1

from 5 to 1: R0=inv0 R10=fp0,call_-1
1: (72) *(u8 *)(r10 -300) = 0
2: (95) exit
processed 6 insns, stack depth 300+300
[...]
Summary: 704 PASSED, 1 FAILED

AFAICS the JIT-generated code for the second testcase shows that this
really causes the stack pointer to be decremented by 300+300:

first function:
  55push rbp
0001  4889E5mov rbp,rsp
0004  4881EC5801sub rsp,0x158
000B  4883ED28  sub rbp,byte +0x28
[...]
0025  E89AB3AFE5call 0xe5afb3c4
002A  C685D4FE00mov byte [rbp-0x12c],0x0
[...]
0041  4883C528  add rbp,byte +0x28
0045  C9leave
0046  C3ret

second function:
  55push rbp
0001  4889E5mov rbp,rsp
0004  4881EC5801sub rsp,0x158
000B  4883ED28  sub rbp,byte +0x28
[...]
0025  C685D4FE00mov byte [rbp-0x12c],0x0
[...]
003E  4883C528  add rbp,byte +0x28
0042  C9leave
0043  C3ret

Signed-off-by: Jann Horn 
---
 tools/testing/selftests/bpf/test_verifier.c | 34 +
 1 file changed, 34 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3bacff0d6f91..71fb0be81b78 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -8729,6 +8729,40 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_XDP,
.result = ACCEPT,
},
+   {
+   "calls: stack overflow using two frames (pre-call access)",
+   .insns = {
+   /* prog 1 */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -300, 0),
+   BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 1),
+   BPF_EXIT_INSN(),
+
+   /* prog 2 */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -300, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_XDP,
+   .errstr = "combined stack size",
+   .result = REJECT,
+   },
+   {
+   "calls: stack overflow using two frames (post-call access)",
+   .insns = {
+   /* prog 1 */
+   BPF_RAW_INSN(BPF_JMP|BPF_CALL, 0, 1, 0, 2),
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -300, 0),
+   BPF_EXIT_INSN(),
+
+   /* prog 2 */
+   BPF_ST_MEM(BPF_B, BPF_REG_10, -300, 0),
+   BPF_MOV64_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .prog_type = BPF_PROG_TYPE_XDP,
+   .errstr = "combined stack size",
+   .result = REJECT,
+   },
{
"calls: spill into caller stack frame",
.insns = {
-- 
2.15.1.620.gb9897f4670-goog



[PATCH V2 net-next 3/3] rds: tcp: cleanup if kmem_cache_alloc fails in rds_tcp_conn_alloc()

2017-12-22 Thread Sowmini Varadhan
If kmem_cache_alloc() fails in the middle of the for() loop,
cleanup anything that might have been allocated so far.

Signed-off-by: Sowmini Varadhan 
---
v2: target net-next, not net

 net/rds/tcp.c |   46 ++
 1 files changed, 26 insertions(+), 20 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index a61a498..2e554ef 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -270,16 +270,33 @@ static int rds_tcp_laddr_check(struct net *net, __be32 addr)
return -EADDRNOTAVAIL;
 }
 
+static void rds_tcp_conn_free(void *arg)
+{
+   struct rds_tcp_connection *tc = arg;
+   unsigned long flags;
+
+   rdsdebug("freeing tc %p\n", tc);
+
+   spin_lock_irqsave(&rds_tcp_conn_lock, flags);
+   if (!tc->t_tcp_node_detached)
+   list_del(&tc->t_tcp_node);
+   spin_unlock_irqrestore(&rds_tcp_conn_lock, flags);
+
+   kmem_cache_free(rds_tcp_conn_slab, tc);
+}
+
 static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
 {
struct rds_tcp_connection *tc;
-   int i;
+   int i, j;
+   int ret = 0;
 
for (i = 0; i < RDS_MPATH_WORKERS; i++) {
tc = kmem_cache_alloc(rds_tcp_conn_slab, gfp);
-   if (!tc)
-   return -ENOMEM;
-
+   if (!tc) {
+   ret = -ENOMEM;
+   break;
+   }
mutex_init(&tc->t_conn_path_lock);
tc->t_sock = NULL;
tc->t_tinc = NULL;
@@ -296,22 +313,11 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
rdsdebug("rds_conn_path [%d] tc %p\n", i,
 conn->c_path[i].cp_transport_data);
}
-
-   return 0;
-}
-
-static void rds_tcp_conn_free(void *arg)
-{
-   struct rds_tcp_connection *tc = arg;
-   unsigned long flags;
-   rdsdebug("freeing tc %p\n", tc);
-
-   spin_lock_irqsave(&rds_tcp_conn_lock, flags);
-   if (!tc->t_tcp_node_detached)
-   list_del(&tc->t_tcp_node);
-   spin_unlock_irqrestore(&rds_tcp_conn_lock, flags);
-
-   kmem_cache_free(rds_tcp_conn_slab, tc);
+   if (ret) {
+   for (j = 0; j < i; j++)
+   rds_tcp_conn_free(conn->c_path[j].cp_transport_data);
+   }
+   return ret;
 }
 
 static bool list_has_conn(struct list_head *list, struct rds_connection *conn)
-- 
1.7.1



[PATCH V2 net-next 1/3] rds: Reset rs->rs_bound_addr in rds_add_bound() failure path

2017-12-22 Thread Sowmini Varadhan
If the rds_sock is not added to the bind_hash_table, we must
reset rs_bound_addr so that rds_remove_bound will not trip on
this rds_sock.

rds_add_bound() does a rds_sock_put() in this failure path, so
failing to reset rs_bound_addr will result in a socket refcount
bug, and will trigger a WARN_ON with the stack shown below when
the application subsequently tries to close the PF_RDS socket.

 WARNING: CPU: 20 PID: 19499 at net/rds/af_rds.c:496 \
rds_sock_destruct+0x15/0x30 [rds]
   :
 __sk_destruct+0x21/0x190
 rds_remove_bound.part.13+0xb6/0x140 [rds]
 rds_release+0x71/0x120 [rds]
 sock_release+0x1a/0x70
 sock_close+0xe/0x20
 __fput+0xd5/0x210
 task_work_run+0x82/0xa0
 do_exit+0x2ce/0xb30
 ? syscall_trace_enter+0x1cc/0x2b0
 do_group_exit+0x39/0xa0
 SyS_exit_group+0x10/0x10
 do_syscall_64+0x61/0x1a0

Signed-off-by: Sowmini Varadhan 
---
v2: target net-next, not net

 net/rds/bind.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/rds/bind.c b/net/rds/bind.c
index 75d43dc..5aa3a64 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -114,6 +114,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, __be16 *port)
  rs, &addr, (int)ntohs(*port));
break;
} else {
+   rs->rs_bound_addr = 0;
rds_sock_put(rs);
ret = -ENOMEM;
break;
-- 
1.7.1



[PATCH V2 net-next 0/3] rds bug fixes

2017-12-22 Thread Sowmini Varadhan
Ran into pre-existing bugs when working on the fix for
   https://www.spinics.net/lists/netdev/msg472849.html

The bugs fixed in this patchset are unrelated to the syzbot 
failure (which I'm still testing and trying to reproduce) but 
meanwhile, let's get these fixes out of the way.

V2: target net-next (rds:tcp patches have a dependency on 
changes that are in net-next, but not yet in net)

Sowmini Varadhan (3):
  rds: Reset rs->rs_bound_addr in rds_add_bound() failure path
  rds: tcp: initialize t_tcp_detached to false
  rds: tcp: cleanup if kmem_cache_alloc fails in rds_tcp_conn_alloc()

 net/rds/bind.c |1 +
 net/rds/tcp.c  |   47 +++
 2 files changed, 28 insertions(+), 20 deletions(-)



[PATCH V2 net-next 2/3] rds: tcp: initialize t_tcp_detached to false

2017-12-22 Thread Sowmini Varadhan
Commit f10b4cff98c6 ("rds: tcp: atomically purge entries from
rds_tcp_conn_list during netns delete") adds the field t_tcp_detached,
but this needs to be initialized explicitly to false.

Signed-off-by: Sowmini Varadhan 
---
v2: target net-next, not net

 net/rds/tcp.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 39f502d..a61a498 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -290,6 +290,7 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
tc->t_cpath = &conn->c_path[i];
 
+   spin_lock_irq(&rds_tcp_conn_lock);
+   tc->t_tcp_node_detached = false;
+   list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list);
+   spin_unlock_irq(&rds_tcp_conn_lock);
rdsdebug("rds_conn_path [%d] tc %p\n", i,
-- 
1.7.1



Re: [PATCH v3 1/4] security: Add support for SCTP security hooks

2017-12-22 Thread Marcelo Ricardo Leitner
On Fri, Dec 22, 2017 at 09:20:45AM -0800, Casey Schaufler wrote:
> On 12/22/2017 5:05 AM, Marcelo Ricardo Leitner wrote:
> > From: Richard Haines 
> >
> > The SCTP security hooks are explained in:
> > Documentation/security/LSM-sctp.rst

Thanks Casey for your comments. However, I'm not that acquainted with
this area of the code and I cannot work on it. I'll just wait for
Richard then.

  Marcelo


Re: [PATCH net 2/3] rds: tcp: initialize t_tcp_detached to false

2017-12-22 Thread Sowmini Varadhan
On (12/22/17 08:14), Sowmini Varadhan wrote:
> Commit f10b4cff98c6 ("rds: tcp: atomically purge entries from
> rds_tcp_conn_list during netns delete") adds the field t_tcp_detached,
> but this needs to be initialized explicitly to false.

I just realized that t_tcp_detached (and the above commit) has not 
made its way to net yet, so patch 2 should apply to net-next (not net).

Please ignore this patch series; I'll submit a V2.
Apologies for the confusion.

--Sowmini


Re: [PATCH v3 1/4] security: Add support for SCTP security hooks

2017-12-22 Thread Casey Schaufler
On 12/22/2017 5:05 AM, Marcelo Ricardo Leitner wrote:
> From: Richard Haines 
>
> The SCTP security hooks are explained in:
> Documentation/security/LSM-sctp.rst
>
> Signed-off-by: Richard Haines 
> Acked-by: Marcelo Ricardo Leitner 
> ---
>  Documentation/security/LSM-sctp.rst | 194 
> 
>  include/linux/lsm_hooks.h   |  35 +++
>  include/linux/security.h|  25 +
>  security/security.c |  22 
>  4 files changed, 276 insertions(+)
>  create mode 100644 Documentation/security/LSM-sctp.rst
>
> diff --git a/Documentation/security/LSM-sctp.rst 
> b/Documentation/security/LSM-sctp.rst
> new file mode 100644
> index 
> ..61373672ce9f63bbd52d953500f44cdf3427c3f0
> --- /dev/null
> +++ b/Documentation/security/LSM-sctp.rst
> @@ -0,0 +1,194 @@
> +SCTP LSM Support
> +
> +
> +For security module support, three sctp specific hooks have been 
> implemented::
> +
> +security_sctp_assoc_request()
> +security_sctp_bind_connect()
> +security_sctp_sk_clone()
> +
> +Also the following security hook has been utilised::
> +
> +security_inet_conn_established()
> +
> +The usage of these hooks are described below with the SELinux implementation
> +described in ``Documentation/security/SELinux-sctp.rst``
> +
> +
> +security_sctp_assoc_request()
> +-
> +This new hook passes the ``@ep`` and ``@chunk->skb`` (the association INIT
> +packet) to the security module. Returns 0 on success, error on failure.
> +::
> +
> +@ep - pointer to sctp endpoint structure.
> +@skb - pointer to skbuff of association packet.
> +
> +The security module performs the following operations:
> + IF this is the first association on ``@ep->base.sk``, then set the peer
> + sid to that in ``@skb``. This will ensure there is only one peer sid
> + assigned to ``@ep->base.sk`` that may support multiple associations.
> +
> + ELSE validate the ``@ep->base.sk peer_sid`` against the ``@skb peer 
> sid``
> + to determine whether the association should be allowed or denied.
> +
> + Set the sctp ``@ep sid`` to socket's sid (from ``ep->base.sk``) with
> + MLS portion taken from ``@skb peer sid``. This will be used by SCTP
> + TCP style sockets and peeled off connections as they cause a new socket
> + to be generated.
> +
> + If IP security options are configured (CIPSO/CALIPSO), then the ip
> + options are set on the socket.

Please! Basing the documentation for the infrastructure behavior
on a specific security module implementation makes it *really rough*
to adopt it to a different module. It makes it doubly difficult to
define how it will work with multiple modules. Take the SELinux specifics
out of the documentation for the hooks. Describe the general intention,
not how SELinux uses it.

> +
> +
> +security_sctp_bind_connect()
> +-
> +This new hook passes one or more ipv4/ipv6 addresses to the security module
> +for validation based on the ``@optname`` that will result in either a bind or
> +connect service as shown in the permission check tables below.
> +Returns 0 on success, error on failure.
> +::
> +
> +@sk  - Pointer to sock structure.
> +@optname - Name of the option to validate.
> +@address - One or more ipv4 / ipv6 addresses.
> +@addrlen - The total length of address(s). This is calculated on each
> +   ipv4 or ipv6 address using sizeof(struct sockaddr_in) or
> +   sizeof(struct sockaddr_in6).
> +
> +  --
> +  | BIND Type Checks   |
> +  |   @optname | @address contains |
> +  ||---|
> +  | SCTP_SOCKOPT_BINDX_ADD | One or more ipv4 / ipv6 addresses |
> +  | SCTP_PRIMARY_ADDR  | Single ipv4 or ipv6 address   |
> +  | SCTP_SET_PEER_PRIMARY_ADDR | Single ipv4 or ipv6 address   |
> +  --
> +
> +  --
> +  |   CONNECT Type Checks  |
> +  |   @optname | @address contains |
> +  ||---|
> +  | SCTP_SOCKOPT_CONNECTX  | One or more ipv4 / ipv6 addresses |
> +  | SCTP_PARAM_ADD_IP  | One or more ipv4 / ipv6 addresses |
> +  | SCTP_SENDMSG_CONNECT   | Single ipv4 or ipv6 address   |
> +  | SCTP_PARAM_SET_PRIMARY | Single ipv4 or ipv6 address   |
> +  --
> +
> +A summary of the ``@optname`` entries is as follows::
> +
> + 

[bpf-next V2 PATCH 14/14] samples/bpf: program demonstrating access to xdp_rxq_info

2017-12-22 Thread Jesper Dangaard Brouer
This sample program can be used for monitoring and reporting how many
packets per sec (pps) are received per NIC RX queue index and which
CPU processed the packet. In itself it is a useful tool for quickly
identifying RSS imbalance issues, see below.

The default XDP action is XDP_PASS in order to provide a monitor
mode. For benchmarking purposes it is possible to specify other XDP
actions on the command line via --action.
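For instance, a hypothetical drop-benchmark invocation (the device name
here is just an example) could be: ./xdp_rxq_info --dev i40e1 --action XDP_DROP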

Output below shows an imbalanced RSS case where most RXQs deliver to
CPU-0 while CPU-2 only gets packets from a single RXQ.  Looking at
things at the CPU level, the two CPUs process approx the same
amount, BUT looking at the rx_queue_index level it is clear that
RXQ-2 receives much better service than the other RXQs, which all share
CPU-0. (A sketch of the userspace map-reading side follows the output.)

Running XDP on dev:i40e1 (ifindex:3) action:XDP_PASS
XDP stats   CPU pps issue-pps
XDP-RX CPU  0   900,473 0
XDP-RX CPU  2   906,921 0
XDP-RX CPU  total   1,807,395

RXQ stats   RXQ:CPU pps issue-pps
rx_queue_index0:0   180,098 0
rx_queue_index0:sum 180,098
rx_queue_index1:0   180,098 0
rx_queue_index1:sum 180,098
rx_queue_index2:2   906,921 0
rx_queue_index2:sum 906,921
rx_queue_index3:0   180,098 0
rx_queue_index3:sum 180,098
rx_queue_index4:0   180,082 0
rx_queue_index4:sum 180,082
rx_queue_index5:0   180,093 0
rx_queue_index5:sum 180,093
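
As a rough sketch of the userspace side (not the actual
xdp_rxq_info_user.c; the function name and map-fd plumbing below are
assumptions), summing the per-CPU datarec values for one RXQ key could
look like:

/* Hedged userspace sketch: sum per-CPU stats for one rx_queue_index key.
 * Assumes a bpf map fd for rx_queue_index_map and the struct datarec
 * layout from xdp_rxq_info_kern.c; bpf_map_lookup_elem() is the plain
 * bpf syscall wrapper used by samples/bpf.
 */
#include <string.h>
#include <linux/types.h>
#include "libbpf.h"

struct datarec {
	__u64 processed;
	__u64 issue;
};

static __u64 rxq_sum_processed(int map_fd, __u32 rxq_key, int nr_cpus)
{
	struct datarec values[nr_cpus];	/* one record per possible CPU */
	__u64 sum = 0;
	int i;

	memset(values, 0, sizeof(struct datarec) * nr_cpus);
	if (bpf_map_lookup_elem(map_fd, &rxq_key, values) != 0)
		return 0;	/* key not populated */

	for (i = 0; i < nr_cpus; i++)	/* PERCPU map: one value per CPU */
		sum += values[i].processed;
	return sum;
}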

Signed-off-by: Jesper Dangaard Brouer 
---
 samples/bpf/Makefile|4 
 samples/bpf/xdp_rxq_info_kern.c |   96 +++
 samples/bpf/xdp_rxq_info_user.c |  532 +++
 3 files changed, 632 insertions(+)
 create mode 100644 samples/bpf/xdp_rxq_info_kern.c
 create mode 100644 samples/bpf/xdp_rxq_info_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index 4fb944a7ecf8..3ff7a05bea9a 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -41,6 +41,7 @@ hostprogs-y += xdp_redirect
 hostprogs-y += xdp_redirect_map
 hostprogs-y += xdp_redirect_cpu
 hostprogs-y += xdp_monitor
+hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
 
 # Libbpf dependencies
@@ -90,6 +91,7 @@ xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
 xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
 xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
+xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
 
 # Tell kbuild to always build the programs
@@ -139,6 +141,7 @@ always += xdp_redirect_kern.o
 always += xdp_redirect_map_kern.o
 always += xdp_redirect_cpu_kern.o
 always += xdp_monitor_kern.o
+always += xdp_rxq_info_kern.o
 always += syscall_tp_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
@@ -182,6 +185,7 @@ HOSTLOADLIBES_xdp_redirect += -lelf
 HOSTLOADLIBES_xdp_redirect_map += -lelf
 HOSTLOADLIBES_xdp_redirect_cpu += -lelf
 HOSTLOADLIBES_xdp_monitor += -lelf
+HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
 
 # Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on 
cmdline:
diff --git a/samples/bpf/xdp_rxq_info_kern.c b/samples/bpf/xdp_rxq_info_kern.c
new file mode 100644
index ..19cbd662202c
--- /dev/null
+++ b/samples/bpf/xdp_rxq_info_kern.c
@@ -0,0 +1,96 @@
+/* Example howto extract XDP RX-queue info
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+#include 
+#include "bpf_helpers.h"
+
+/* Config setup from userspace
+ *
+ * Userspace sets up the ifindex in config_map, to verify that
+ * ctx->ingress_ifindex is correct (against the configured ifindex)
+ */
+struct config {
+   __u32 action;
+   int ifindex;
+};
+struct bpf_map_def SEC("maps") config_map = {
+   .type   = BPF_MAP_TYPE_ARRAY,
+   .key_size   = sizeof(int),
+   .value_size = sizeof(struct config),
+   .max_entries= 1,
+};
+
+/* Common stats data record (shared with userspace) */
+struct datarec {
+   __u64 processed;
+   __u64 issue;
+};
+
+struct bpf_map_def SEC("maps") stats_global_map = {
+   .type   = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size   = sizeof(u32),
+   .value_size = sizeof(struct datarec),
+   .max_entries= 1,
+};
+
+#define MAX_RXQs 64
+
+/* Stats per rx_queue_index (per CPU) */
+struct bpf_map_def SEC("maps") rx_queue_index_map = {
+   .type   = BPF_MAP_TYPE_PERCPU_ARRAY,
+   .key_size   = sizeof(u32),
+   .value_size = sizeof(struct datarec),
+   .max_entries= MAX_RXQs + 1,
+};
+
+SEC("xdp_prog0")
+int  xdp_prognum0(struct xdp_md *ctx)
+{
+   void *data_end = (void *)(long)ctx->data_end;
+   void *data = (void *)(long)ctx->data;
+   struct datarec *rec, *rxq_rec;
+   int ingress_ifindex;
+   struct 

[bpf-next V2 PATCH 10/14] tun: setup xdp_rxq_info

2017-12-22 Thread Jesper Dangaard Brouer
Driver hook points for xdp_rxq_info:
 * reg  : tun_attach
 * unreg: __tun_detach

I've done some manual testing of this tun driver, but I would
appreciate a good review and someone else running their use-case tests,
as I'm not 100% sure I understand the tfile->detached semantics.

V2: Removed the skb_array_cleanup() call from V1 by request from Jason Wang.

Cc: Jason Wang 
Cc: "Michael S. Tsirkin" 
Cc: Willem de Bruijn 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/tun.c |   24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e367d6310353..e7c5f4b2a9a6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -180,6 +180,7 @@ struct tun_file {
struct list_head next;
struct tun_struct *detached;
struct skb_array tx_array;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 struct tun_flow_entry {
@@ -687,8 +688,10 @@ static void __tun_detach(struct tun_file *tfile, bool 
clean)
tun->dev->reg_state == NETREG_REGISTERED)
unregister_netdevice(tun->dev);
}
-   if (tun)
+   if (tun) {
skb_array_cleanup(&tfile->tx_array);
+   xdp_rxq_info_unreg(&tfile->xdp_rxq);
+   }
sock_put(&tfile->sk);
}
 }
@@ -728,11 +731,13 @@ static void tun_detach_all(struct net_device *dev)
tun_napi_del(tun, tfile);
/* Drop read queue */
tun_queue_purge(tfile);
+   xdp_rxq_info_unreg(&tfile->xdp_rxq);
sock_put(&tfile->sk);
}
list_for_each_entry_safe(tfile, tmp, &tun->disabled, next) {
tun_enable_queue(tfile);
tun_queue_purge(tfile);
+   xdp_rxq_info_unreg(&tfile->xdp_rxq);
sock_put(&tfile->sk);
}
BUG_ON(tun->numdisabled != 0);
@@ -784,6 +789,22 @@ static int tun_attach(struct tun_struct *tun, struct file 
*file,
 
tfile->queue_index = tun->numqueues;
tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
+
+   if (tfile->detached) {
+   /* Re-attach detached tfile, updating XDP queue_index */
+   WARN_ON(!xdp_rxq_info_is_reg(&tfile->xdp_rxq));
+
+   if (tfile->xdp_rxq.queue_index != tfile->queue_index)
+   tfile->xdp_rxq.queue_index = tfile->queue_index;
+   } else {
+   /* Setup XDP RX-queue info, for new tfile getting attached */
+   err = xdp_rxq_info_reg(&tfile->xdp_rxq,
+  tun->dev, tfile->queue_index);
+   if (err < 0)
+   goto out;
+   err = 0;
+   }
+
rcu_assign_pointer(tfile->tun, tun);
rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
tun->numqueues++;
@@ -1508,6 +1529,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
xdp.data = buf + pad;
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + len;
+   xdp.rxq = &tfile->xdp_rxq;
orig_data = xdp.data;
act = bpf_prog_run_xdp(xdp_prog, &xdp);
 



[bpf-next V2 PATCH 12/14] xdp: generic XDP handling of xdp_rxq_info

2017-12-22 Thread Jesper Dangaard Brouer
Hook points for xdp_rxq_info:
 * reg  : netif_alloc_rx_queues
 * unreg: netif_free_rx_queues

The net_device has some members (num_rx_queues + real_num_rx_queues)
and a data area (dev->_rx with struct netdev_rx_queue's) that were
primarily used for exporting information about RPS (CONFIG_RPS) queues
to sysfs (CONFIG_SYSFS).

For generic XDP extend struct netdev_rx_queue with the xdp_rxq_info,
and remove some of the CONFIG_SYSFS ifdefs.

Signed-off-by: Jesper Dangaard Brouer 
---
 include/linux/netdevice.h |2 +
 net/core/dev.c|   69 ++---
 2 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cc4ce7456e38..43595b037872 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -44,6 +44,7 @@
 #include 
 #endif
 #include 
+#include 
 
 #include 
 #include 
@@ -686,6 +687,7 @@ struct netdev_rx_queue {
 #endif
struct kobject  kobj;
struct net_device   *dev;
+   struct xdp_rxq_info xdp_rxq;
 } cacheline_aligned_in_smp;
 
 /*
diff --git a/net/core/dev.c b/net/core/dev.c
index b0eee49a2489..24d7bbdf3758 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3889,9 +3889,33 @@ static int enqueue_to_backlog(struct sk_buff *skb, int 
cpu,
return NET_RX_DROP;
 }
 
+static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
+{
+   struct net_device *dev = skb->dev;
+   struct netdev_rx_queue *rxqueue;
+
+   rxqueue = dev->_rx;
+
+   if (skb_rx_queue_recorded(skb)) {
+   u16 index = skb_get_rx_queue(skb);
+
+   if (unlikely(index >= dev->real_num_rx_queues)) {
+   WARN_ONCE(dev->real_num_rx_queues > 1,
+ "%s received packet on queue %u, but number "
+ "of RX queues is %u\n",
+ dev->name, index, dev->real_num_rx_queues);
+
+   return rxqueue; /* Return first rxqueue */
+   }
+   rxqueue += index;
+   }
+   return rxqueue;
+}
+
 static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 struct bpf_prog *xdp_prog)
 {
+   struct netdev_rx_queue *rxqueue;
u32 metalen, act = XDP_DROP;
struct xdp_buff xdp;
void *orig_data;
@@ -3935,6 +3959,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
xdp.data_hard_start = skb->data - skb_headroom(skb);
orig_data = xdp.data;
 
+   rxqueue = netif_get_rxqueue(skb);
+   xdp.rxq = &rxqueue->xdp_rxq;
+
act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
off = xdp.data - orig_data;
@@ -7557,12 +7584,12 @@ void netif_stacked_transfer_operstate(const struct 
net_device *rootdev,
 }
 EXPORT_SYMBOL(netif_stacked_transfer_operstate);
 
-#ifdef CONFIG_SYSFS
 static int netif_alloc_rx_queues(struct net_device *dev)
 {
unsigned int i, count = dev->num_rx_queues;
struct netdev_rx_queue *rx;
size_t sz = count * sizeof(*rx);
+   int err = 0;
 
BUG_ON(count < 1);
 
@@ -7572,11 +7599,39 @@ static int netif_alloc_rx_queues(struct net_device *dev)
 
dev->_rx = rx;
 
-   for (i = 0; i < count; i++)
+   for (i = 0; i < count; i++) {
rx[i].dev = dev;
+
+   /* XDP RX-queue setup */
+   err = xdp_rxq_info_reg(&rx[i].xdp_rxq, dev, i);
+   if (err < 0)
+   goto err_rxq_info;
+   }
return 0;
+
+err_rxq_info:
+   /* Rollback successful reg's and free other resources */
+   while (i--)
+   xdp_rxq_info_unreg(&rx[i].xdp_rxq);
+   kfree(dev->_rx);
+   dev->_rx = NULL;
+   return err;
+}
+
+static void netif_free_rx_queues(struct net_device *dev)
+{
+   unsigned int i, count = dev->num_rx_queues;
+   struct netdev_rx_queue *rx;
+
+   /* netif_alloc_rx_queues alloc failed, resources have been unreg'ed */
+   if (!dev->_rx)
+   return;
+
+   rx = dev->_rx;
+
+   for (i = 0; i < count; i++)
xdp_rxq_info_unreg(&rx[i].xdp_rxq);
 }
-#endif
 
 static void netdev_init_one_queue(struct net_device *dev,
  struct netdev_queue *queue, void *_unused)
@@ -8137,12 +8192,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
return NULL;
}
 
-#ifdef CONFIG_SYSFS
if (rxqs < 1) {
pr_err("alloc_netdev: Unable to allocate device with zero RX 
queues\n");
return NULL;
}
-#endif
 
alloc_size = sizeof(struct net_device);
if (sizeof_priv) {
@@ -8199,12 +8252,10 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, 
const char *name,
if (netif_alloc_netdev_queues(dev))
goto free_all;
 
-#ifdef CONFIG_SYSFS
dev->num_rx_queues = 

[bpf-next V2 PATCH 04/14] ixgbe: setup xdp_rxq_info

2017-12-22 Thread Jesper Dangaard Brouer
Driver hook points for xdp_rxq_info:
 * reg  : ixgbe_setup_rx_resources()
 * unreg: ixgbe_free_rx_resources()

Tested on actual hardware.

V2: Fix ixgbe_set_ringparam, clear xdp_rxq_info in temp_ring

Cc: intel-wired-...@lists.osuosl.org
Cc: Jeff Kirsher 
Cc: Alexander Duyck 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h |2 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |4 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c|   10 +-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 468c3555a629..8611763d6129 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -53,6 +53,7 @@
 #include 
 #endif
 
+#include 
 #include 
 
 /* common prefix used by pr_<> macros */
@@ -371,6 +372,7 @@ struct ixgbe_ring {
struct ixgbe_tx_queue_stats tx_stats;
struct ixgbe_rx_queue_stats rx_stats;
};
+   struct xdp_rxq_info xdp_rxq;
 } cacheline_internodealigned_in_smp;
 
 enum ixgbe_ring_f_enum {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 0aad1c2a3667..0aaf70b3cfcd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -1156,6 +1156,10 @@ static int ixgbe_set_ringparam(struct net_device *netdev,
memcpy(&temp_ring[i], adapter->rx_ring[i],
   sizeof(struct ixgbe_ring));
 
+   /* Clear copied XDP RX-queue info */
+   memset(&temp_ring[i].xdp_rxq, 0,
+  sizeof(temp_ring[i].xdp_rxq));
+
temp_ring[i].count = new_rx_count;
err = ixgbe_setup_rx_resources(adapter, &temp_ring[i]);
if (err) {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 7737a05c717c..95aba975b391 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2318,12 +2318,14 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector 
*q_vector,
 #endif /* IXGBE_FCOE */
u16 cleaned_count = ixgbe_desc_unused(rx_ring);
bool xdp_xmit = false;
+   struct xdp_buff xdp;
+
+   xdp.rxq = &rx_ring->xdp_rxq;
 
while (likely(total_rx_packets < budget)) {
union ixgbe_adv_rx_desc *rx_desc;
struct ixgbe_rx_buffer *rx_buffer;
struct sk_buff *skb;
-   struct xdp_buff xdp;
unsigned int size;
 
/* return some buffers to hardware, one at a time is too slow */
@@ -6444,6 +6446,11 @@ int ixgbe_setup_rx_resources(struct ixgbe_adapter 
*adapter,
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;
 
+   /* XDP RX-queue info */
+   if (xdp_rxq_info_reg(&rx_ring->xdp_rxq, adapter->netdev,
+rx_ring->queue_index) < 0)
+   goto err;
+
rx_ring->xdp_prog = adapter->xdp_prog;
 
return 0;
@@ -6541,6 +6548,7 @@ void ixgbe_free_rx_resources(struct ixgbe_ring *rx_ring)
ixgbe_clean_rx_ring(rx_ring);
 
rx_ring->xdp_prog = NULL;
+   xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
vfree(rx_ring->rx_buffer_info);
rx_ring->rx_buffer_info = NULL;
 



[bpf-next V2 PATCH 09/14] thunderx: setup xdp_rxq_info

2017-12-22 Thread Jesper Dangaard Brouer
This driver uses a bool scheme for "enable"/"disable" when setting up
different resources.  Thus, the hook points for xdp_rxq_info is done
in the same function call nicvf_rcv_queue_config().  This is activated
through enable/disable via nicvf_config_data_transfer(), which is tied
into nicvf_stop()/nicvf_open().

Extend the driver packet handler call path nicvf_rcv_pkt_handler()
with a pointer to the given struct rcv_queue, in order to access the
xdp_rxq_info data area (in nicvf_xdp_rx()).

V2: Driver has no proper error path for a failed XDP RX-queue info reg,
as nicvf_rcv_queue_config is a void function.

Cc: linux-arm-ker...@lists.infradead.org
Cc: Sunil Goutham 
Cc: Robert Richter 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   11 +++
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |4 
 drivers/net/ethernet/cavium/thunder/nicvf_queues.h |2 ++
 3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 52b3a6044f85..21618d0d694f 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -521,7 +521,7 @@ static void nicvf_unmap_page(struct nicvf *nic, struct page 
*page, u64 dma_addr)
 
 static inline bool nicvf_xdp_rx(struct nicvf *nic, struct bpf_prog *prog,
struct cqe_rx_t *cqe_rx, struct snd_queue *sq,
-   struct sk_buff **skb)
+   struct rcv_queue *rq, struct sk_buff **skb)
 {
struct xdp_buff xdp;
struct page *page;
@@ -545,6 +545,7 @@ static inline bool nicvf_xdp_rx(struct nicvf *nic, struct 
bpf_prog *prog,
xdp.data = (void *)cpu_addr;
xdp_set_data_meta_invalid();
xdp.data_end = xdp.data + len;
+   xdp.rxq = &rq->xdp_rxq;
orig_data = xdp.data;
 
rcu_read_lock();
@@ -698,7 +699,8 @@ static inline void nicvf_set_rxhash(struct net_device 
*netdev,
 
 static void nicvf_rcv_pkt_handler(struct net_device *netdev,
  struct napi_struct *napi,
- struct cqe_rx_t *cqe_rx, struct snd_queue *sq)
+ struct cqe_rx_t *cqe_rx,
+ struct snd_queue *sq, struct rcv_queue *rq)
 {
struct sk_buff *skb = NULL;
struct nicvf *nic = netdev_priv(netdev);
@@ -724,7 +726,7 @@ static void nicvf_rcv_pkt_handler(struct net_device *netdev,
/* For XDP, ignore pkts spanning multiple pages */
if (nic->xdp_prog && (cqe_rx->rb_cnt == 1)) {
/* Packet consumed by XDP */
-   if (nicvf_xdp_rx(snic, nic->xdp_prog, cqe_rx, sq, ))
+   if (nicvf_xdp_rx(snic, nic->xdp_prog, cqe_rx, sq, rq, ))
return;
} else {
skb = nicvf_get_rcv_skb(snic, cqe_rx,
@@ -781,6 +783,7 @@ static int nicvf_cq_intr_handler(struct net_device *netdev, 
u8 cq_idx,
struct cqe_rx_t *cq_desc;
struct netdev_queue *txq;
struct snd_queue *sq = &qs->sq[cq_idx];
+   struct rcv_queue *rq = &qs->rq[cq_idx];
unsigned int tx_pkts = 0, tx_bytes = 0, txq_idx;
 
spin_lock_bh(&cq->lock);
@@ -811,7 +814,7 @@ static int nicvf_cq_intr_handler(struct net_device *netdev, 
u8 cq_idx,
 
switch (cq_desc->cqe_type) {
case CQE_TYPE_RX:
-   nicvf_rcv_pkt_handler(netdev, napi, cq_desc, sq);
+   nicvf_rcv_pkt_handler(netdev, napi, cq_desc, sq, rq);
work_done++;
break;
case CQE_TYPE_SEND:
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
index f38ea349aa00..14e62c6ac342 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_queues.c
@@ -760,6 +760,7 @@ static void nicvf_rcv_queue_config(struct nicvf *nic, 
struct queue_set *qs,
 
if (!rq->enable) {
nicvf_reclaim_rcv_queue(nic, qs, qidx);
+   xdp_rxq_info_unreg(&rq->xdp_rxq);
return;
}
 
@@ -772,6 +773,9 @@ static void nicvf_rcv_queue_config(struct nicvf *nic, 
struct queue_set *qs,
/* all writes of RBDR data to be loaded into L2 Cache as well*/
rq->caching = 1;
 
+   /* Driver has no proper error path for failed XDP RX-queue info reg */
+   WARN_ON(xdp_rxq_info_reg(&rq->xdp_rxq, nic->netdev, qidx) < 0);
+
/* Send a mailbox msg to PF to config RQ */
mbx.rq.msg = NIC_MBOX_MSG_RQ_CFG;
mbx.rq.qs_num = qs->vnic_id;
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_queues.h 
b/drivers/net/ethernet/cavium/thunder/nicvf_queues.h
index 178ab6e8e3c5..7d1e4e2aaad0 100644
--- 

[bpf-next V2 PATCH 11/14] virtio_net: setup xdp_rxq_info

2017-12-22 Thread Jesper Dangaard Brouer
The virtio_net driver doesn't dynamically change the RX-ring queue
layout and backing pages, but instead rejects XDP setup if all the
conditions for XDP are not met.  Thus, the xdp_rxq_info also remains
fairly static.  This allows us to simply add the reg/unreg to the
net_device open/close functions.

Driver hook points for xdp_rxq_info:
 * reg  : virtnet_open
 * unreg: virtnet_close

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: virtualizat...@lists.linux-foundation.org
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/virtio_net.c |   12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 6fb7b658a6cc..6d45d687ec6a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -115,6 +116,8 @@ struct receive_queue {
 
/* Name of this receive queue: input.$index */
char name[40];
+
+   struct xdp_rxq_info xdp_rxq;
 };
 
 struct virtnet_info {
@@ -559,6 +562,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
xdp.data = xdp.data_hard_start + xdp_headroom;
xdp_set_data_meta_invalid();
xdp.data_end = xdp.data + len;
+   xdp.rxq = &rq->xdp_rxq;
orig_data = xdp.data;
act = bpf_prog_run_xdp(xdp_prog, );
 
@@ -1225,13 +1229,18 @@ static int virtnet_poll(struct napi_struct *napi, int 
budget)
 static int virtnet_open(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
-   int i;
+   int i, err;
 
for (i = 0; i < vi->max_queue_pairs; i++) {
if (i < vi->curr_queue_pairs)
/* Make sure we have some buffers: if oom use wq. */
if (!try_fill_recv(vi, &vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);
+
+   err = xdp_rxq_info_reg(&vi->rq[i].xdp_rxq, dev, i);
+   if (err < 0)
+   return err;
+
virtnet_napi_enable(vi->rq[i].vq, &vi->rq[i].napi);
virtnet_napi_tx_enable(vi, vi->sq[i].vq, &vi->sq[i].napi);
}
@@ -1560,6 +1569,7 @@ static int virtnet_close(struct net_device *dev)
cancel_delayed_work_sync(&vi->refill);
 
for (i = 0; i < vi->max_queue_pairs; i++) {
+   xdp_rxq_info_unreg(&vi->rq[i].xdp_rxq);
napi_disable(&vi->rq[i].napi);
virtnet_napi_tx_disable(&vi->sq[i].napi);
}



[bpf-next V2 PATCH 07/14] bnxt_en: setup xdp_rxq_info

2017-12-22 Thread Jesper Dangaard Brouer
Driver hook points for xdp_rxq_info:
 * reg  : bnxt_alloc_rx_rings
 * unreg: bnxt_free_rx_rings

This driver should be updated to re-register when changing
allocation mode of RX rings.

Tested on actual hardware.

Cc: Andy Gospodarek 
Cc: Michael Chan 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c |   10 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt.h |2 ++
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c |1 +
 3 files changed, 13 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 1d865ae201db..e380f1a039ca 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -2247,6 +2247,9 @@ static void bnxt_free_rx_rings(struct bnxt *bp)
if (rxr->xdp_prog)
bpf_prog_put(rxr->xdp_prog);
 
+   if (xdp_rxq_info_is_reg(&rxr->xdp_rxq))
+   xdp_rxq_info_unreg(&rxr->xdp_rxq);
+
kfree(rxr->rx_tpa);
rxr->rx_tpa = NULL;
 
@@ -2280,6 +2283,10 @@ static int bnxt_alloc_rx_rings(struct bnxt *bp)
 
ring = &rxr->rx_ring_struct;
 
+   rc = xdp_rxq_info_reg(&rxr->xdp_rxq, bp->dev, i);
+   if (rc < 0)
+   return rc;
+
rc = bnxt_alloc_ring(bp, ring);
if (rc)
return rc;
@@ -2834,6 +2841,9 @@ void bnxt_set_ring_params(struct bnxt *bp)
bp->cp_ring_mask = bp->cp_bit - 1;
 }
 
+/* Changing allocation mode of RX rings.
+ * TODO: Update when extending xdp_rxq_info to support allocation modes.
+ */
 int bnxt_set_rx_skb_mode(struct bnxt *bp, bool page_mode)
 {
if (page_mode) {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
index 5359a1f0045f..2d268fc26f5e 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct tx_bd {
__le32 tx_bd_len_flags_type;
@@ -664,6 +665,7 @@ struct bnxt_rx_ring_info {
 
struct bnxt_ring_struct rx_ring_struct;
struct bnxt_ring_struct rx_agg_ring_struct;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 struct bnxt_cp_ring_info {
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
index 261e5847557a..1389ab5e05df 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c
@@ -96,6 +96,7 @@ bool bnxt_rx_xdp(struct bnxt *bp, struct bnxt_rx_ring_info 
*rxr, u16 cons,
xdp.data = *data_ptr;
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = *data_ptr + *len;
+   xdp.rxq = &rxr->xdp_rxq;
orig_data = xdp.data;
mapping = rx_buf->mapping - bp->rx_dma_offset;
 



[bpf-next V2 PATCH 13/14] bpf: finally expose xdp_rxq_info to XDP bpf-programs

2017-12-22 Thread Jesper Dangaard Brouer
Now all XDP drivers have been updated to set up xdp_rxq_info and assign
it to xdp_buff->rxq.  Thus, it is now safe to enable access to some
of the xdp_rxq_info struct members.

This patch extends xdp_md and exposes UAPI to userspace for
ingress_ifindex and rx_queue_index.  Access happens via bpf
instruction rewrite, which loads data directly from struct
xdp_rxq_info; a BPF-side usage sketch follows the list below.

* ingress_ifindex maps to xdp_rxq_info->dev->ifindex
* rx_queue_index  maps to xdp_rxq_info->queue_index
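
As a rough BPF-side illustration (not part of this patch; the map name,
section name, and queue bound below are hypothetical), a program
counting packets per RX queue via the new fields could look like:

/* Hedged sketch: per-RXQ packet counter using the new xdp_md fields */
#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") rxq_cnt_map = {
	.type        = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(__u64),
	.max_entries = 64,	/* assumed upper bound on RX queues */
};

SEC("xdp_rxq_demo")
int xdp_rxq_demo_prog(struct xdp_md *ctx)
{
	/* Instruction rewrite turns this into a load of rxq->queue_index */
	__u32 key = ctx->rx_queue_index;
	__u64 *cnt;

	/* ctx->ingress_ifindex likewise loads rxq->dev->ifindex */
	cnt = bpf_map_lookup_elem(&rxq_cnt_map, &key);
	if (cnt)	/* NULL if key >= max_entries */
		*cnt += 1;

	return XDP_PASS;
}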

Signed-off-by: Jesper Dangaard Brouer 
---
 include/uapi/linux/bpf.h |3 +++
 net/core/filter.c|   19 +++
 2 files changed, 22 insertions(+)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 69eabfcb9bdb..a6000a95d40e 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -899,6 +899,9 @@ struct xdp_md {
__u32 data;
__u32 data_end;
__u32 data_meta;
+   /* Below access go though struct xdp_rxq_info */
+   __u32 ingress_ifindex; /* rxq->dev->ifindex */
+   __u32 rx_queue_index;  /* rxq->queue_index  */
 };
 
 enum sk_action {
diff --git a/net/core/filter.c b/net/core/filter.c
index 130b842c3a15..acdb94c0e97f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4304,6 +4304,25 @@ static u32 xdp_convert_ctx_access(enum bpf_access_type 
type,
  si->dst_reg, si->src_reg,
  offsetof(struct xdp_buff, data_end));
break;
+   case offsetof(struct xdp_md, ingress_ifindex):
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
+ si->dst_reg, si->src_reg,
+ offsetof(struct xdp_buff, rxq));
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_rxq_info, 
dev),
+ si->dst_reg, si->dst_reg,
+ offsetof(struct xdp_rxq_info, dev));
+   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+ bpf_target_off(struct net_device,
+ifindex, 4, target_size));
+   break;
+   case offsetof(struct xdp_md, rx_queue_index):
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
+ si->dst_reg, si->src_reg,
+ offsetof(struct xdp_buff, rxq));
+   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+ bpf_target_off(struct xdp_rxq_info,
+   queue_index, 4, target_size));
+   break;
}
 
return insn - insn_buf;



[bpf-next V2 PATCH 08/14] nfp: setup xdp_rxq_info

2017-12-22 Thread Jesper Dangaard Brouer
Driver hook points for xdp_rxq_info:
 * reg  : nfp_net_rx_ring_alloc
 * unreg: nfp_net_rx_ring_free

In struct nfp_net_rx_ring, move member @size into a hole on 64-bit.
Thus, the struct size remains the same after adding member @xdp_rxq.

Cc: oss-driv...@netronome.com
Cc: Jakub Kicinski 
Cc: Simon Horman 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |5 -
 .../net/ethernet/netronome/nfp/nfp_net_common.c|   10 --
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 3801c52098d5..0e564cfabe7e 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -47,6 +47,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "nfp_net_ctrl.h"
 
@@ -350,6 +351,7 @@ struct nfp_net_rx_buf {
  * @rxds:   Virtual address of FL/RX ring in host memory
  * @dma:DMA address of the FL/RX ring
  * @size:   Size, in bytes, of the FL/RX ring (needed to free)
+ * @xdp_rxq:RX-ring info avail for XDP
  */
 struct nfp_net_rx_ring {
struct nfp_net_r_vector *r_vec;
@@ -361,13 +363,14 @@ struct nfp_net_rx_ring {
u32 idx;
 
int fl_qcidx;
+   unsigned int size;
u8 __iomem *qcp_fl;
 
struct nfp_net_rx_buf *rxbufs;
struct nfp_net_rx_desc *rxds;
 
dma_addr_t dma;
-   unsigned int size;
+   struct xdp_rxq_info xdp_rxq;
 } cacheline_aligned;
 
 /**
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 0add4870ce2e..45b8cae937be 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1608,11 +1608,13 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, 
int budget)
unsigned int true_bufsz;
struct sk_buff *skb;
int pkts_polled = 0;
+   struct xdp_buff xdp;
int idx;
 
rcu_read_lock();
xdp_prog = READ_ONCE(dp->xdp_prog);
true_bufsz = xdp_prog ? PAGE_SIZE : dp->fl_bufsz;
+   xdp.rxq = &rx_ring->xdp_rxq;
tx_ring = r_vec->xdp_ring;
 
while (pkts_polled < budget) {
@@ -1703,7 +1705,6 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, 
int budget)
  dp->bpf_offload_xdp) && !meta.portid) {
void *orig_data = rxbuf->frag + pkt_off;
unsigned int dma_off;
-   struct xdp_buff xdp;
int act;
 
xdp.data_hard_start = rxbuf->frag + 
NFP_NET_RX_BUF_HEADROOM;
@@ -2252,6 +2253,7 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring 
*rx_ring)
struct nfp_net_r_vector *r_vec = rx_ring->r_vec;
struct nfp_net_dp *dp = &r_vec->nfp_net->dp;
 
+   xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
kfree(rx_ring->rxbufs);
 
if (rx_ring->rxds)
@@ -2275,7 +2277,11 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring 
*rx_ring)
 static int
 nfp_net_rx_ring_alloc(struct nfp_net_dp *dp, struct nfp_net_rx_ring *rx_ring)
 {
-   int sz;
+   int sz, err;
+
+   err = xdp_rxq_info_reg(&rx_ring->xdp_rxq, dp->netdev, rx_ring->idx);
+   if (err < 0)
+   return err;
 
rx_ring->cnt = dp->rxd_cnt;
rx_ring->size = sizeof(*rx_ring->rxds) * rx_ring->cnt;



[bpf-next V2 PATCH 03/14] i40e: setup xdp_rxq_info

2017-12-22 Thread Jesper Dangaard Brouer
The i40e driver has a special "FDIR" RX-ring (I40E_VSI_FDIR) which is
a sideband channel for configuring/updating the flow director tables.
This (i40e_vsi_)type does not invoke XDP-ebpf code. Thus, mark this
type as an XDP RX-queue of type RXQ_TYPE_SINK.

Driver hook points for xdp_rxq_info:
 * reg  : i40e_setup_rx_descriptors (via i40e_vsi_setup_rx_resources)
 * unreg: i40e_free_rx_resources(via i40e_vsi_free_rx_resources)

Tested on actual hardware with samples/bpf program.

V2: Fixed bug in i40e_set_ringparam

Cc: intel-wired-...@lists.osuosl.org
Cc: Björn Töpel 
Cc: Jeff Kirsher 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |2 ++
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|   18 --
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|3 +++
 3 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 5f6cf7212d4f..cfd788b4fd7a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -1585,6 +1585,8 @@ static int i40e_set_ringparam(struct net_device *netdev,
 */
rx_rings[i].desc = NULL;
rx_rings[i].rx_bi = NULL;
+   /* Clear cloned XDP RX-queue info before setup call */
+   memset(&rx_rings[i].xdp_rxq, 0,
+  sizeof(rx_rings[i].xdp_rxq));
/* this is to allow wr32 to have something to write to
 * during early allocation of Rx buffers
 */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 4566d66ffc7c..2a8a85e3ae8f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "i40e.h"
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
@@ -1236,6 +1237,8 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
 void i40e_free_rx_resources(struct i40e_ring *rx_ring)
 {
i40e_clean_rx_ring(rx_ring);
+   if (rx_ring->vsi->type == I40E_VSI_MAIN)
+   xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
rx_ring->xdp_prog = NULL;
kfree(rx_ring->rx_bi);
rx_ring->rx_bi = NULL;
@@ -1256,6 +1259,7 @@ void i40e_free_rx_resources(struct i40e_ring *rx_ring)
 int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
 {
struct device *dev = rx_ring->dev;
+   int err = -ENOMEM;
int bi_size;
 
/* warn if we are about to overwrite the pointer */
@@ -1283,13 +1287,21 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
rx_ring->next_to_clean = 0;
rx_ring->next_to_use = 0;
 
+   /* XDP RX-queue info only needed for RX rings exposed to XDP */
+   if (rx_ring->vsi->type == I40E_VSI_MAIN) {
+   err = xdp_rxq_info_reg(&rx_ring->xdp_rxq, rx_ring->netdev,
+  rx_ring->queue_index);
+   if (err < 0)
+   goto err;
+   }
+
rx_ring->xdp_prog = rx_ring->vsi->xdp_prog;
 
return 0;
 err:
kfree(rx_ring->rx_bi);
rx_ring->rx_bi = NULL;
-   return -ENOMEM;
+   return err;
 }
 
 /**
@@ -2068,11 +2080,13 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, 
int budget)
struct sk_buff *skb = rx_ring->skb;
u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
bool failure = false, xdp_xmit = false;
+   struct xdp_buff xdp;
+
+   xdp.rxq = &rx_ring->xdp_rxq;
 
while (likely(total_rx_packets < (unsigned int)budget)) {
struct i40e_rx_buffer *rx_buffer;
union i40e_rx_desc *rx_desc;
-   struct xdp_buff xdp;
unsigned int size;
u16 vlan_tag;
u8 rx_ptype;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index fbae1182e2ea..2d08760fc4ce 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -27,6 +27,8 @@
 #ifndef _I40E_TXRX_H_
 #define _I40E_TXRX_H_
 
+#include 
+
 /* Interrupt Throttling and Rate Limiting Goodies */
 
 #define I40E_MAX_ITR   0x0FF0  /* reg uses 2 usec resolution */
@@ -428,6 +430,7 @@ struct i40e_ring {
 */
 
struct i40e_channel *ch;
+   struct xdp_rxq_info xdp_rxq;
 } cacheline_internodealigned_in_smp;
 
 static inline bool ring_uses_build_skb(struct i40e_ring *ring)



[bpf-next V2 PATCH 05/14] xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg

2017-12-22 Thread Jesper Dangaard Brouer
The driver code qede_free_fp_array() depends on kfree() being callable
with a NULL pointer. This stems from the qede_alloc_fp_array()
function, which (kz)allocs memory for either fp->txq or fp->rxq.
This also simplifies error handling code in case of memory allocation
failures, but xdp_rxq_info_unreg needs to know the difference.

Introduce xdp_rxq_info_is_reg() to handle the case where a memory
allocation fails; detect that we are in the failure path by seeing
that the xdp_rxq_info was not registered yet, which first happens
after successful allocation in qede_init_fp().

Driver hook points for xdp_rxq_info:
 * reg  : qede_init_fp
 * unreg: qede_free_fp_array

Tested on actual hardware with samples/bpf program.

V2: Driver has no proper error path for a failed XDP RX-queue info reg, as
qede_init_fp() is a void function.

Cc: everest-linux...@cavium.com
Cc: Ariel Elior 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/qlogic/qede/qede.h  |2 ++
 drivers/net/ethernet/qlogic/qede/qede_fp.c   |1 +
 drivers/net/ethernet/qlogic/qede/qede_main.c |   10 ++
 include/net/xdp.h|1 +
 net/core/xdp.c   |6 ++
 5 files changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qede/qede.h 
b/drivers/net/ethernet/qlogic/qede/qede.h
index a3a70ade411f..62f47736511b 100644
--- a/drivers/net/ethernet/qlogic/qede/qede.h
+++ b/drivers/net/ethernet/qlogic/qede/qede.h
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #ifdef CONFIG_RFS_ACCEL
@@ -345,6 +346,7 @@ struct qede_rx_queue {
u64 xdp_no_pass;
 
void *handle;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 union db_prod {
diff --git a/drivers/net/ethernet/qlogic/qede/qede_fp.c 
b/drivers/net/ethernet/qlogic/qede/qede_fp.c
index 48ec4c56cddf..dafc079ab6b9 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_fp.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_fp.c
@@ -1006,6 +1006,7 @@ static bool qede_rx_xdp(struct qede_dev *edev,
xdp.data = xdp.data_hard_start + *data_offset;
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + *len;
+   xdp.rxq = &rxq->xdp_rxq;
 
/* Queues always have a full reset currently, so for the time
 * being until there's atomic program replace just mark read
diff --git a/drivers/net/ethernet/qlogic/qede/qede_main.c 
b/drivers/net/ethernet/qlogic/qede/qede_main.c
index 57332b3e5e64..24160f1fd0e5 100644
--- a/drivers/net/ethernet/qlogic/qede/qede_main.c
+++ b/drivers/net/ethernet/qlogic/qede/qede_main.c
@@ -762,6 +762,12 @@ static void qede_free_fp_array(struct qede_dev *edev)
fp = &edev->fp_array[i];
 
kfree(fp->sb_info);
+   /* Handle mem alloc failure case where qede_init_fp
+* didn't register xdp_rxq_info yet.
+* Implicit only (fp->type & QEDE_FASTPATH_RX)
+*/
+   if (fp->rxq && xdp_rxq_info_is_reg(&fp->rxq->xdp_rxq))
+   xdp_rxq_info_unreg(&fp->rxq->xdp_rxq);
kfree(fp->rxq);
kfree(fp->xdp_tx);
kfree(fp->txq);
@@ -1498,6 +1504,10 @@ static void qede_init_fp(struct qede_dev *edev)
else
fp->rxq->data_direction = DMA_FROM_DEVICE;
fp->rxq->dev = &edev->pdev->dev;
+
+   /* Driver has no error path from here */
+   WARN_ON(xdp_rxq_info_reg(&fp->rxq->xdp_rxq, edev->ndev,
+fp->rxq->rxq_id) < 0);
}
 
if (fp->type & QEDE_FASTPATH_TX) {
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 8c6e83293bba..9b29e06e4687 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -43,5 +43,6 @@ int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 struct net_device *dev, u32 queue_index);
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
 void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
+bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
 
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/net/core/xdp.c b/net/core/xdp.c
index dde541b71aaa..5af17c44b92d 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -66,3 +66,9 @@ void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq)
xdp_rxq->reg_state = REG_STATE_UNUSED;
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_unused);
+
+bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq)
+{
+   return (xdp_rxq->reg_state == REG_STATE_REGISTERED);
+}
+EXPORT_SYMBOL_GPL(xdp_rxq_info_is_reg);



[bpf-next V2 PATCH 06/14] mlx4: setup xdp_rxq_info

2017-12-22 Thread Jesper Dangaard Brouer
Driver hook points for xdp_rxq_info:
 * reg  : mlx4_en_create_rx_ring
 * unreg: mlx4_en_destroy_rx_ring

Tested on actual hardware.

Cc: Tariq Toukan 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |3 ++-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |   13 ++---
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |4 +++-
 3 files changed, 15 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 99051a294fa6..0cfcf3089ae4 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2172,8 +2172,9 @@ static int mlx4_en_alloc_resources(struct mlx4_en_priv 
*priv)
 
if (mlx4_en_create_rx_ring(priv, &priv->rx_ring[i],
   prof->rx_ring_size, priv->stride,
-  node))
+  node, i))
goto err;
+
}
 
 #ifdef CONFIG_RFS_ACCEL
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 85e28efcda33..48bbfc1ad391 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -262,7 +262,7 @@ void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev)
 
 int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
   struct mlx4_en_rx_ring **pring,
-  u32 size, u16 stride, int node)
+  u32 size, u16 stride, int node, int queue_index)
 {
struct mlx4_en_dev *mdev = priv->mdev;
struct mlx4_en_rx_ring *ring;
@@ -286,6 +286,9 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
ring->log_stride = ffs(ring->stride) - 1;
ring->buf_size = ring->size * ring->stride + TXBB_SIZE;
 
+   if (xdp_rxq_info_reg(&ring->xdp_rxq, priv->dev, queue_index) < 0)
+   goto err_ring;
+
tmp = size * roundup_pow_of_two(MLX4_EN_MAX_RX_FRAGS *
sizeof(struct mlx4_en_rx_alloc));
ring->rx_info = vzalloc_node(tmp, node);
@@ -293,7 +296,7 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
ring->rx_info = vzalloc(tmp);
if (!ring->rx_info) {
err = -ENOMEM;
-   goto err_ring;
+   goto err_xdp_info;
}
}
 
@@ -317,6 +320,8 @@ int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
 err_info:
vfree(ring->rx_info);
ring->rx_info = NULL;
+err_xdp_info:
+   xdp_rxq_info_unreg(&ring->xdp_rxq);
 err_ring:
kfree(ring);
*pring = NULL;
@@ -440,6 +445,7 @@ void mlx4_en_destroy_rx_ring(struct mlx4_en_priv *priv,
lockdep_is_held(&mdev->state_lock));
if (old_prog)
bpf_prog_put(old_prog);
+   xdp_rxq_info_unreg(&ring->xdp_rxq);
mlx4_free_hwq_res(mdev->dev, &ring->wqres, size * stride + TXBB_SIZE);
vfree(ring->rx_info);
ring->rx_info = NULL;
@@ -650,6 +656,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
int cq_ring = cq->ring;
bool doorbell_pending;
struct mlx4_cqe *cqe;
+   struct xdp_buff xdp;
int polled = 0;
int index;
 
@@ -664,6 +671,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
/* Protect accesses to: ring->xdp_prog, priv->mac_hash list */
rcu_read_lock();
xdp_prog = rcu_dereference(ring->xdp_prog);
+   xdp.rxq = &ring->xdp_rxq;
doorbell_pending = 0;
 
/* We assume a 1:1 mapping between CQEs and Rx descriptors, so Rx
@@ -748,7 +756,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct 
mlx4_en_cq *cq, int bud
 * read bytes but not past the end of the frag.
 */
if (xdp_prog) {
-   struct xdp_buff xdp;
dma_addr_t dma;
void *orig_data;
u32 act;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h 
b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 2b72677eccd4..39f3f1b1ab0a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -46,6 +46,7 @@
 #endif
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -356,6 +357,7 @@ struct mlx4_en_rx_ring {
unsigned long dropped;
int hwtstamp_rx_filter;
cpumask_var_t affinity_mask;
+   struct xdp_rxq_info xdp_rxq;
 };
 
 struct mlx4_en_cq {
@@ -719,7 +721,7 @@ void mlx4_en_set_num_rx_rings(struct mlx4_en_dev *mdev);
 void mlx4_en_recover_from_oom(struct mlx4_en_priv *priv);
 int mlx4_en_create_rx_ring(struct mlx4_en_priv *priv,
   

[bpf-next V2 PATCH 02/14] xdp/mlx5: setup xdp_rxq_info

2017-12-22 Thread Jesper Dangaard Brouer
The mlx5 driver has a special drop-RQ queue (one per interface) that
simply drops all incoming traffic. It helps the driver keep other HW
objects (flow steering) alive upon down/up operations.  It is
temporarily pointed to by flow steering objects during interface
setup, and when the interface is down. It lacks many fields that are set
in a regular RQ (for example, its state is never switched to
MLX5_RQC_STATE_RDY). (Thanks to Tariq Toukan for the explanation.)

The XDP RX-queue info for this drop-RQ is marked as unused, which
allows us to use the same takedown/free code path as for other RX-queues.

Driver hook points for xdp_rxq_info:
 * reg   : mlx5e_alloc_rq()
 * unused: mlx5e_alloc_drop_rq()
 * unreg : mlx5e_free_rq()

Tested on actual hardware with samples/bpf program

Cc: Saeed Mahameed 
Cc: Matan Barak 
Cc: Tariq Toukan 
Signed-off-by: Jesper Dangaard Brouer 
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |4 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |9 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |1 +
 3 files changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index c0872b3284cb..405b09db2a84 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -46,6 +46,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "wq.h"
 #include "mlx5_core.h"
 #include "en_stats.h"
@@ -568,6 +569,9 @@ struct mlx5e_rq {
u32rqn;
struct mlx5_core_dev  *mdev;
struct mlx5_core_mkey  umr_mkey;
+
+   /* XDP read-mostly */
+   struct xdp_rxq_infoxdp_rxq;
 } cacheline_aligned_in_smp;
 
 struct mlx5e_channel {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 0f5c012de52e..128cb9a32549 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -589,6 +589,9 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
goto err_rq_wq_destroy;
}
 
+   if (xdp_rxq_info_reg(&rq->xdp_rxq, rq->netdev, rq->ix) < 0)
+   goto err_rq_wq_destroy;
+
rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
rq->buff.headroom = params->rq_headroom;
 
@@ -695,6 +698,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 err_rq_wq_destroy:
if (rq->xdp_prog)
bpf_prog_put(rq->xdp_prog);
+   xdp_rxq_info_unreg(&rq->xdp_rxq);
mlx5_wq_destroy(&rq->wq_ctrl);
 
return err;
@@ -707,6 +711,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
if (rq->xdp_prog)
bpf_prog_put(rq->xdp_prog);
 
+   xdp_rxq_info_unreg(&rq->xdp_rxq);
+
switch (rq->wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
mlx5e_rq_free_mpwqe_info(rq);
@@ -2768,6 +2774,9 @@ static int mlx5e_alloc_drop_rq(struct mlx5_core_dev *mdev,
if (err)
return err;
 
+   /* Mark as unused given "Drop-RQ" packets never reach XDP */
+   xdp_rxq_info_unused(&rq->xdp_rxq);
+
rq->mdev = mdev;
 
return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 5b499c7a698f..7b38480811d4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -812,6 +812,7 @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
xdp_set_data_meta_invalid(&xdp);
xdp.data_end = xdp.data + *len;
xdp.data_hard_start = va;
+   xdp.rxq = &rq->xdp_rxq;
 
act = bpf_prog_run_xdp(prog, &xdp);
switch (act) {



[bpf-next V2 PATCH 01/14] xdp: base API for new XDP rx-queue info concept

2017-12-22 Thread Jesper Dangaard Brouer
This patch only introduces the core data structures and API functions.
All XDP-enabled drivers must use the API before this info can be used.

There is a need for XDP to know more about the RX-queue a given XDP
frame has arrived on.  This holds for both the XDP bpf-prog and the kernel side.

Instead of extending xdp_buff each time new info is needed, the patch
creates a separate read-mostly struct xdp_rxq_info, which contains this
info.  We stress this data/cache-line is for read-only info.  This is
NOT for dynamic per-packet info; use the data_meta for such use-cases.

The performance advantage is that this info can be set up at RX-ring init
time, instead of updating N members in xdp_buff.  A possible (driver
level) micro-optimization is that the xdp_buff->rxq assignment could be
done once per XDP/NAPI loop.  The extra pointer deref only happens for
programs needing access to this info (thus, no slowdown to existing
use-cases).  A hedged driver-side usage sketch follows.
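
As a rough driver-side sketch (the struct and function names here are
hypothetical, not taken from any driver patch in this series), the
intended usage of the register API is:

/* Hedged sketch of driver usage; my_rx_ring is an invented example */
#include <linux/netdevice.h>
#include <net/xdp.h>

struct my_rx_ring {
	struct net_device *netdev;
	u32 queue_index;
	struct xdp_rxq_info xdp_rxq;	/* read-mostly, one per RX-ring */
};

static int my_ring_init(struct my_rx_ring *ring)
{
	/* Register once at RX-ring init time; xdp_rxq_info_unreg() is
	 * mandatory before the ring memory is purged or reallocated.
	 */
	return xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
				ring->queue_index);
}

In the NAPI poll loop the driver then assigns xdp.rxq = &ring->xdp_rxq
(possibly once per loop), as the individual driver patches in this
series do.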

Signed-off-by: Jesper Dangaard Brouer 
---
 include/linux/filter.h |2 +
 include/net/xdp.h  |   47 +
 net/core/Makefile  |2 +
 net/core/xdp.c |   68 
 4 files changed, 118 insertions(+), 1 deletion(-)
 create mode 100644 include/net/xdp.h
 create mode 100644 net/core/xdp.c

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 2b0df2703671..425056c7f96c 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -20,6 +20,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 #include 
@@ -503,6 +504,7 @@ struct xdp_buff {
void *data_end;
void *data_meta;
void *data_hard_start;
+   struct xdp_rxq_info *rxq;
 };
 
 /* Compute the linear packet data range [data, data_end) which
diff --git a/include/net/xdp.h b/include/net/xdp.h
new file mode 100644
index ..8c6e83293bba
--- /dev/null
+++ b/include/net/xdp.h
@@ -0,0 +1,47 @@
+/* include/net/xdp.h
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+#ifndef __LINUX_NET_XDP_H__
+#define __LINUX_NET_XDP_H__
+
+/**
+ * DOC: XDP RX-queue information
+ *
+ * The XDP RX-queue info (xdp_rxq_info) is associated with the driver
+ * level RX-ring queues.  It is information that is specific to how
+ * the driver has configured a given RX-ring queue.
+ *
+ * Each xdp_buff frame received in the driver carries a (pointer)
+ * reference to this xdp_rxq_info structure.  This provides the XDP
+ * data-path read-access to RX-info for both kernel and bpf-side
+ * (limited subset).
+ *
+ * For now, direct access is only safe while running in NAPI/softirq
+ * context.
+ *
+ * The driver usage API is an init, register and unregister API.
+ *
+ * The struct is not directly tied to the XDP prog.  A new XDP prog
+ * can be attached as long as it doesn't change the underlying
+ * RX-ring.  If the RX-ring does change significantly, the NIC driver
+ * naturally needs to stop the RX-ring before purging and reallocating
+ * memory.  In that process the driver MUST call unregister (which
+ * also applies for driver shutdown and unload).  The init and register
+ * API is also mandatory during RX-ring setup.
+ */
+
+struct xdp_rxq_info {
+   struct net_device *dev;
+   u32 queue_index;
+   u32 reg_state;
+} cacheline_aligned; /* perf critical, avoid false-sharing */
+
+void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq);
+int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
+struct net_device *dev, u32 queue_index);
+void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
+void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
+
+#endif /* __LINUX_NET_XDP_H__ */
diff --git a/net/core/Makefile b/net/core/Makefile
index 1fd0a9c88b1b..6dbbba8c57ae 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
 obj-y   += dev.o ethtool.o dev_addr_lists.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
-   fib_notifier.o
+   fib_notifier.o xdp.o
 
 obj-y += net-sysfs.o
 obj-$(CONFIG_PROC_FS) += net-procfs.o
diff --git a/net/core/xdp.c b/net/core/xdp.c
new file mode 100644
index ..dde541b71aaa
--- /dev/null
+++ b/net/core/xdp.c
@@ -0,0 +1,68 @@
+/* net/core/xdp.c
+ *
+ * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
+ * Released under terms in GPL version 2.  See COPYING.
+ */
+#include 
+#include 
+
+#include 
+
+#define REG_STATE_NEW  0x0
+#define REG_STATE_REGISTERED   0x1
+#define REG_STATE_UNREGISTERED 0x2
+#define REG_STATE_UNUSED   0x3
+
+void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
+{
+   /* Simplify driver cleanup code paths, allow unreg "unused" */
+   if (xdp_rxq->reg_state == REG_STATE_UNUSED)
+   return;
+
+   

[bpf-next V2 PATCH 00/14] xdp: new XDP rx-queue info concept

2017-12-22 Thread Jesper Dangaard Brouer
V2:
* Changed API exposed to drivers
  - Removed invocation of "init" in drivers, and only call "reg"
(Suggested by Saeed)
  - Allow "reg" to fail and handle this in drivers
(Suggested by David Ahern)
* Removed the SINKQ qtype, instead allow to register as "unused"
* Also fixed some drivers during testing on actual HW (noted in patches)

There is a need for XDP to know more about the RX-queue a given XDP
frame has arrived on.  This holds for both the XDP bpf-prog and the kernel side.

Instead of extending struct xdp_buff each time new info is needed,
this patchset takes a different approach.  Struct xdp_buff is only
extended with a pointer to a struct xdp_rxq_info (allowing for easier
extension later).  This xdp_rxq_info contains information related
to how the driver has set up the individual RX-queues.  This is
read-mostly information, and all xdp_buff frames (in drivers
napi_poll) point to the same xdp_rxq_info (per RX-queue).

We stress this data/cache-line is for read-mostly info.  This is NOT
for dynamic per packet info, use the data_meta for such use-cases.

This patchset starts out small, and only exposes ingress_ifindex and the
RX-queue index to the XDP/BPF program. Access to tangible info like
the ingress ifindex and RX queue index is fairly easy to comprehend.
Other future use-cases could allow XDP frames to be recycled back
to the originating device driver, by providing info on the RX device and
queue number.

As XDP doesn't have driver feature flags, and eBPF code (due to
bpf tail-calls) cannot determine which XDP driver invokes it, this
patchset has to update every driver that supports XDP.

For driver developers (review individual driver patches!):

The xdp_rxq_info is tied to the driver's RX-ring(s). Whenever an RX-ring
modification requires (temporarily) stopping RX frames, the
xdp_rxq_info should (likely) also be unregistered and re-registered,
especially if the pages in the ring are reallocated. Make sure ethtool
set_channels does the right thing; a sketch of this pattern follows
below. When replacing the XDP prog, re-register the xdp_rxq_info if
and only if the RX-ring needs to be changed.
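
A hedged sketch of that re-register pattern (all names hypothetical,
reusing the invented my_rx_ring from the sketch under patch 01):

#include <linux/netdevice.h>
#include <net/xdp.h>

struct my_rx_ring {
	struct net_device *netdev;
	u32 queue_index;
	struct xdp_rxq_info xdp_rxq;
};

/* RX traffic must already be stopped when this runs */
static int my_ring_reconfigure(struct my_rx_ring *ring)
{
	xdp_rxq_info_unreg(&ring->xdp_rxq);

	/* ... purge and reallocate ring pages/descriptors here ... */

	/* Re-register before RX is restarted */
	return xdp_rxq_info_reg(&ring->xdp_rxq, ring->netdev,
				ring->queue_index);
}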

I'm Cc'ing the individual driver patches to the registered maintainers.

Testing:

I've only tested the NIC drivers I have hardware for.  The general
test procedure (DUT = Device Under Test) is:
 (1) run pktgen script pktgen_sample04_many_flows.sh   (against DUT)
 (2) run samples/bpf program xdp_rxq_info --dev $DEV   (on DUT)
 (3) runtime modify the number of NIC queues via ethtool -L (on DUT)
 (4) runtime modify the NIC ring size via ethtool -G   (on DUT)

Patch based on git tree bpf-next (at commit c475ffad58a8a2):
 https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/

---

Jesper Dangaard Brouer (14):
  xdp: base API for new XDP rx-queue info concept
  xdp/mlx5: setup xdp_rxq_info
  i40e: setup xdp_rxq_info
  ixgbe: setup xdp_rxq_info
  xdp/qede: setup xdp_rxq_info and intro xdp_rxq_info_is_reg
  mlx4: setup xdp_rxq_info
  bnxt_en: setup xdp_rxq_info
  nfp: setup xdp_rxq_info
  thunderx: setup xdp_rxq_info
  tun: setup xdp_rxq_info
  virtio_net: setup xdp_rxq_info
  xdp: generic XDP handling of xdp_rxq_info
  bpf: finally expose xdp_rxq_info to XDP bpf-programs
  samples/bpf: program demonstrating access to xdp_rxq_info


 drivers/net/ethernet/broadcom/bnxt/bnxt.c  |   10 
 drivers/net/ethernet/broadcom/bnxt/bnxt.h  |2 
 drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c  |1 
 drivers/net/ethernet/cavium/thunder/nicvf_main.c   |   11 
 drivers/net/ethernet/cavium/thunder/nicvf_queues.c |4 
 drivers/net/ethernet/cavium/thunder/nicvf_queues.h |2 
 drivers/net/ethernet/intel/i40e/i40e_ethtool.c |2 
 drivers/net/ethernet/intel/i40e/i40e_txrx.c|   18 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.h|3 
 drivers/net/ethernet/intel/ixgbe/ixgbe.h   |2 
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c   |4 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  |   10 
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |3 
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |   13 
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h   |4 
 drivers/net/ethernet/mellanox/mlx5/core/en.h   |4 
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  |9 
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c|1 
 drivers/net/ethernet/netronome/nfp/nfp_net.h   |5 
 .../net/ethernet/netronome/nfp/nfp_net_common.c|   10 
 drivers/net/ethernet/qlogic/qede/qede.h|2 
 drivers/net/ethernet/qlogic/qede/qede_fp.c |1 
 drivers/net/ethernet/qlogic/qede/qede_main.c   |   10 
 drivers/net/tun.c  |   24 +
 drivers/net/virtio_net.c   |   12 
 include/linux/filter.h |2 
 include/linux/netdevice.h  |2 
 include/net/xdp.h

Re: [PATCH net-next] net/trace: fix printk format in inet_sock_set_state

2017-12-22 Thread Sergei Shtylyov

Hello!

On 12/22/2017 06:37 PM, Yafang Shao wrote:


> There's a space character missed in the printk messages.
> This error should be prevented with checkscript.pl, but it couldn't caught

   ^ be?

> by running with "checkscript.pl -f .patch", that's what I had run
> before.
> What a carelessness.

   You generally don't need to break up messages that violate the
80-column limit, and checkpatch.pl should be aware of this...

> Fixes: 563e0bb0dc74("net: tracepoint: replace tcp_set_state tracepoint
> with inet_sock_set_state tracepoint")
> Signed-off-by: Yafang Shao 

[...]

MBR, Sergei


[PATCH net 0/3] rds bug fixes

2017-12-22 Thread Sowmini Varadhan
Ran into pre-existing bugs when working on the fix for
   https://www.spinics.net/lists/netdev/msg472849.html

The bugs fixed in this patchset are unrelated to the syzbot 
failure (which I'm still testing and trying to reproduce) but 
meanwhile, let's get these fixes out of the way.

Sowmini Varadhan (3):
  rds: Reset rs->rs_bound_addr in rds_add_bound() failure path
  rds: tcp: initialize t_tcp_detached to false
  rds: tcp: cleanup if kmem_cache_alloc fails in rds_tcp_conn_alloc()

 net/rds/bind.c |1 +
 net/rds/tcp.c  |   47 +++
 2 files changed, 28 insertions(+), 20 deletions(-)



Re: [QUESTION] Doubt about NAPI_GRO_CB(skb)->is_atomic in tcpv4 gro process

2017-12-22 Thread Alexander Duyck
On Fri, Dec 22, 2017 at 12:49 AM, Yunsheng Lin  wrote:
> Hi, Alexander
>
> On 2017/12/22 0:29, Alexander Duyck wrote:
>> On Thu, Dec 21, 2017 at 1:16 AM, Yunsheng Lin  wrote:
>>> Hi, Alexander
>>>
>>> On 2017/12/21 0:24, Alexander Duyck wrote:
 On Wed, Dec 20, 2017 at 1:09 AM, Yunsheng Lin  
 wrote:
> Hi, all
> I have some doubt about NAPI_GRO_CB(skb)->is_atomic when
> analyzing the tcpv4 gro process:
>
> Firstly we set NAPI_GRO_CB(skb)->is_atomic to 1 in dev_gro_receive:
> https://elixir.free-electrons.com/linux/v4.15-rc4/source/net/core/dev.c#L4838
>
> And then in inet_gro_receive, we check the NAPI_GRO_CB(skb)->is_atomic
> before setting NAPI_GRO_CB(skb)->is_atomic according to IP_DF bit in the 
> ip header:
> https://elixir.free-electrons.com/linux/v4.15-rc4/source/net/ipv4/af_inet.c#L1319
>
> struct sk_buff **inet_gro_receive(struct sk_buff **head, struct sk_buff 
> *skb)
> {
> .
> for (p = *head; p; p = p->next) {
> 
>
> /* If the previous IP ID value was based on an atomic
>  * datagram we can overwrite the value and ignore it.
>  */
> if (NAPI_GRO_CB(skb)->is_atomic)  
> //we check it here
> NAPI_GRO_CB(p)->flush_id = flush_id;
> else
> NAPI_GRO_CB(p)->flush_id |= flush_id;
> }
>
> NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF));  
> //we set it here
> NAPI_GRO_CB(skb)->flush |= flush;
> skb_set_network_header(skb, off);
> 
> }
>
> My question is whether we should check NAPI_GRO_CB(skb)->is_atomic or
> NAPI_GRO_CB(p)->is_atomic?
> If we should check NAPI_GRO_CB(skb)->is_atomic, then maybe it is
> unnecessary because it is always true.
> If we should check NAPI_GRO_CB(p)->is_atomic, maybe there is a bug here.
>
> So what is the logic here? I have just started analyzing GRO, so maybe
> I am missing something obvious here.

 The logic there is to address the multiple-IP-header case, where there
 are 2 or more IP headers due to things like VXLAN or GRE tunnels.  So
 what will happen is that an outer IP header will end up being sent
 with DF not set and will clear the is_atomic value; then we want to OR
 in the next header that is applied.  It defaults to assignment on
 is_atomic because the first IP header will encounter flush_id with no
 previous configuration occupying it.
>>>
>>> I see your point now.
>>>
>>> But for the same flow of tunneled packets, must the outer and inner IP
>>> headers have the same fixed ID or incrementing ID?
>>>
>>> For example, if we have a flow of tunneled packets which has a fixed ID
>>> in the outer header and an incrementing ID in the inner header (the inner
>>> header does have the DF flag set):
>
> Sorry, a typo error here. I meant the inner header does *not* have DF flag 
> set here.
>
>>>
>>> 1. For the first packet, NAPI_GRO_CB(skb)->is_atomic will be set to zero
>>> when inet_gro_receive is processing the inner IP header.
>>>
>>> 2. For the second packet, when inet_gro_receive is processing the outer
>>> IP header which has a fixed ID, NAPI_GRO_CB(p)->is_atomic is zero
>>> according to [1], so NAPI_GRO_CB(p)->flush_id will be set to 0xFFFF, and
>>> then the second packet will not be merged into the first packet in
>>> tcp_gro_receive.
>>
>> I'm not sure how valid your case here is. The is_atomic is only really
>> meant to apply to the inner-most header.
>
> For the new skb, NAPI_GRO_CB(skb)->is_atomic is indeed applied to the
> inner-most header.
>
> What about NAPI_GRO_CB(p)->is_atomic, where p is the skb flow already
> merged by GRO?
>
> Let me check if I understand it correctly:
> when there is only one skb merged in p, then NAPI_GRO_CB(p)->is_atomic
> is set according to the first skb's inner-most IP DF flag.
>
> When the second skb comes, and inet_gro_receive is processing the
> outer-most IP header, then in the code below NAPI_GRO_CB(p)->is_atomic
> reflects the first skb's inner-most IP DF flag, and "iph->frag_off &
> htons(IP_DF)" reflects the second skb's outer-most IP DF flag.
> Why don't we use the first skb's outer-most IP DF flag here? I think it
> would be more logical to use it here, but the first skb's outer-most IP
> DF flag is lost by the time we get here, right?
>
> if (!NAPI_GRO_CB(p)->is_atomic ||
> !(iph->frag_off & htons(IP_DF))) {
> flush_id ^= NAPI_GRO_CB(p)->count;
> flush_id = flush_id ? 0xFFFF : 0;
> }

We already did, in a way.  If you look further up, we will set flush if
the DF bit of this packet and the
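
A compact, standalone way to state the assignment-vs-OR rule from this
thread; the field names mirror NAPI_GRO_CB, but this is a toy userspace
model, not the kernel code:

#include <stdbool.h>
#include <stdint.h>

/* Toy model: "p" is the already-merged flow, "skb" the packet whose IP
 * headers are being walked (outer first, then inner).  dev_gro_receive
 * starts each skb with is_atomic = true.
 */
struct gro_cb {
	uint16_t flush_id;
	bool is_atomic;
};

/* called once per IP header of the new skb */
static void account_ip_header(struct gro_cb *p, struct gro_cb *skb,
			      uint16_t flush_id, bool df_set)
{
	if (skb->is_atomic)
		p->flush_id = flush_id;   /* previous header atomic: overwrite */
	else
		p->flush_id |= flush_id;  /* previous header not atomic: accumulate */

	/* record this header's DF bit for the next (inner) header */
	skb->is_atomic = df_set;
}

The outer-most header is thus always accounted by plain assignment
(is_atomic starts out true), and each subsequent header sees only the
DF bit of the header before it -- which matches the "first skb's
outer-most DF bit is lost" observation above.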
