Re: lib: Introduce priority array area manager

2017-02-22 Thread Geert Uytterhoeven
Hi Jiri,

On Wed, Feb 22, 2017 at 8:02 PM, Linux Kernel Mailing List wrote:
> Web:
> https://git.kernel.org/torvalds/c/44091d29f2075972aede47ef17e1e70db3d51190
> Commit: 44091d29f2075972aede47ef17e1e70db3d51190
> Parent: b862815c3ee7b49ec20a9ab25da55a5f0bcbb95e
> Refname: refs/heads/master
> Author: Jiri Pirko 
> AuthorDate: Fri Feb 3 10:29:06 2017 +0100
> Committer:  David S. Miller 
> CommitDate: Fri Feb 3 16:35:42 2017 -0500
>
> lib: Introduce priority array area manager
>
> This introduces an infrastructure for the management of linear priority
> areas. Priority order in an array matters; however, the order of items
> inside a priority group does not matter.
>
> As an initial implementation, the L-sort algorithm is used. It is quite
> trivial. A more advanced algorithm called P-sort will be introduced as a
> follow-up. The infrastructure is prepared for other algorithms.
>
> Alongside this, a testing module is introduced as well.
>
> Signed-off-by: Jiri Pirko 
> Signed-off-by: David S. Miller 
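
For reference, a minimal consumer of the new parman API looks roughly like
this (a sketch based on the ops/handle model described above; the callback
bodies and error handling are illustrative only):

#include <linux/parman.h>

/* Called back when the area must grow or shrink to new_count items. */
static int my_resize(void *priv, unsigned long new_count)
{
        /* reallocate the backing array here, return 0 on success */
        return 0;
}

/* Called back to move count items between indexes in the array. */
static void my_move(void *priv, unsigned long from_index,
                    unsigned long to_index, unsigned long count)
{
        /* memmove() the corresponding backing array entries here */
}

static const struct parman_ops my_parman_ops = {
        .base_count     = 16,
        .resize_step    = 16,
        .resize         = my_resize,
        .move           = my_move,
        .algo           = PARMAN_ALGO_TYPE_LSORT,
};

static int my_setup(void *priv, struct parman **parman,
                    struct parman_prio *prio, struct parman_item *item)
{
        *parman = parman_create(&my_parman_ops, priv);
        if (!*parman)
                return -ENOMEM;
        /* items are grouped by priority; order within a group is not kept */
        parman_prio_init(*parman, prio, 10);
        return parman_item_add(*parman, prio, item);
}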

> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -550,4 +550,7 @@ config STACKDEPOT
>  config SBITMAP
> bool
>
> +config PARMAN
> +   tristate "parman"

| parman (PARMAN) [N/m/y] (NEW) ?
|
| There is no help available for this option.

Can you please add a description for this option?
Or drop the "parman" string if this is always selected by its kernel users, and
never intended to be enabled by the end user.
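
Something along these lines would address both points (an untested sketch
only; upstream may of course choose differently):

config PARMAN
	tristate "parman" if COMPILE_TEST
	help
	  Provides a generic priority array area manager, used by drivers
	  that need to maintain priority-sorted linear arrays (e.g. for
	  hardware tables). Normally selected by its kernel users.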

Thanks!

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [RFC v3 01/11] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) documentation

2017-02-22 Thread Vishwanathapura, Niranjana

On Wed, Feb 08, 2017 at 05:00:45PM +, Bart Van Assche wrote:

On Tue, 2017-02-07 at 12:23 -0800, Vishwanathapura, Niranjana wrote:

Please elaborate this section. What is a virtual Ethernet switch? Is it a
software entity or something that is implemented in hardware? Also, how are
these independent Ethernet networks identified on the wire? The Linux kernel
already supports IB partitions and Ethernet VLANs. How do these independent
Ethernet networks compare to IB partitions and Ethernet VLANs? Which wire-
level header contains the identity of these Ethernet networks? Is it
possible to query from user space which Ethernet network a VNIC belongs to?
If so, with which API and which tools?



I have added the VNIC packet format and some related information to the 
documentation in the PATCH series I just sent out.



Thanks,



[PATCH 00/11] Omni-Path Virtual Network Interface Controller (VNIC)

2017-02-22 Thread Vishwanathapura, Niranjana
Intel Omni-Path (OPA) Virtual Network Interface Controller (VNIC) feature
supports Ethernet functionality over Omni-Path fabric by encapsulating
the Ethernet packets between HFI nodes.

Architecture
============
The patterns of exchange of Omni-Path encapsulated Ethernet packets
involve one or more virtual Ethernet switches overlaid on the Omni-Path
fabric topology. A subset of HFI nodes on the Omni-Path fabric are
permitted to exchange encapsulated Ethernet packets across a particular
virtual Ethernet switch. The virtual Ethernet switches are logical
abstractions achieved by configuring the HFI nodes on the fabric for
header generation and processing. In the simplest configuration, all HFI
nodes across the fabric exchange encapsulated Ethernet packets over a
single virtual Ethernet switch. A virtual Ethernet switch is effectively
an independent Ethernet network. The configuration is performed by an
Ethernet Manager (EM), which is part of the trusted Fabric Manager (FM)
application. HFI nodes can have multiple VNICs, each connected to a
different virtual Ethernet switch. The diagram below presents a case
of two virtual Ethernet switches with two HFI nodes.

                               +-------------------+
                               |      Subnet/      |
                               |     Ethernet      |
                               |      Manager      |
                               +-------------------+
                                  /          /
                                /           /
                              /            /
                            /             /
+-----------------------------+  +------------------------------+
|  Virtual Ethernet Switch    |  |  Virtual Ethernet Switch     |
|  +---------+    +---------+ |  | +---------+    +---------+   |
|  | VPORT   |    |  VPORT  | |  | |  VPORT  |    |  VPORT  |   |
+--+---------+----+---------+-+  +-+---------+----+---------+---+
         |                 \        /                 |
         |                   \    /                   |
         |                     \/                     |
         |                    /  \                    |
         |                  /      \                  |
 +-----------+------------+  +------------+----------+
 |   VNIC    |    VNIC    |  |    VNIC    |   VNIC   |
 +-----------+------------+  +------------+----------+
 |          HFI           |  |          HFI          |
 +------------------------+  +-----------------------+


The Omni-Path encapsulated Ethernet packet format is as described below.

Bits  Field

Quad Word 0:
0-19  SLID (lower 20 bits)
20-30 Length (in Quad Words)
31BECN bit
32-51 DLID (lower 20 bits)
52-56 SC (Service Class)
57-59 RC (Routing Control)
60FECN bit
61-62 L2 (=10, 16B format)
63LT (=1, Link Transfer Head Flit)

Quad Word 1:
0-7   L4 type (=0x78 ETHERNET)
8-11  SLID[23:20]
12-15 DLID[23:20]
16-31 PKEY
32-47 Entropy
48-63 Reserved

Quad Word 2:
0-15  Reserved
16-31 L4 header
32-63 Ethernet Packet

Quad Words 3 to N-1:
0-63  Ethernet packet (pad extended)

Quad Word N (last):
0-23  Ethernet packet (pad extended)
24-55 ICRC
56-61 Tail
62-63 LT (=01, Link Transfer Tail Flit)

The Ethernet packet is padded on the transmit side to ensure that the
VNIC OPA packet is quad word aligned. The 'Tail' field contains the
number of bytes padded. On the receive side the 'Tail' field is read and
the padding is removed (along with the ICRC, Tail and OPA header) before
passing the packet up the network stack.

The L4 header field contains the id of the virtual Ethernet switch the
VNIC port belongs to. On the receive side, this field is used to
de-multiplex received VNIC packets to the different VNIC ports.
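
To make the bit layout concrete, here is an illustrative decode of a few
of the fields above, assuming each quad word is held in a host-order u64
with bit 0 as the least significant bit (helper names are hypothetical):

#include <linux/types.h>

static inline u32 opa16b_slid_low20(u64 qw0)
{
        return qw0 & 0xfffff;           /* QW0 bits 0-19: SLID (lower 20 bits) */
}

static inline u32 opa16b_len_qw(u64 qw0)
{
        return (qw0 >> 20) & 0x7ff;     /* QW0 bits 20-30: length in quad words */
}

static inline u16 opa16b_vesw_id(u64 qw2)
{
        return (qw2 >> 16) & 0xffff;    /* QW2 bits 16-31: L4 header (vesw id) */
}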

Driver Design
=============
Intel OPA VNIC software design is presented in the below diagram.
OPA VNIC functionality has a HW dependent component and a HW
independent component.

Support has been added for an IB device to allocate and free RDMA
netdev devices. The RDMA netdev supports interfacing with the network
stack, thus creating standard network interfaces. OPA_VNIC is an RDMA
netdev device type.

The HW dependent VNIC functionality is part of the HFI1 driver. It
implements the verbs to allocate and free the OPA_VNIC RDMA netdev.
It involves HW resource allocation/management for VNIC functionality.
It interfaces with the network stack and implements the required
net_device_ops functions. It expects Omni-Path encapsulated Ethernet
packets in the transmit path and provides HW access to them. It strips
the Omni-Path header from the received packets before passing them up
the network stack. It also implements the RDMA netdev control operations.

The OPA VNIC module implements the HW independent VNIC functionality.
It consists of two parts. The VNIC Ethernet Management Agent (VEMA)
registers itself with IB core as 

[PATCH 05/11] IB/opa-vnic: VNIC statistics support

2017-02-22 Thread Vishwanathapura, Niranjana
OPA VNIC driver statistics support maintains various counters, including
the standard netdev counters and the Ethernet-manager-defined counters.
Add the ethtool hook to read the counters.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
---
 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c | 110 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h|   4 +
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  |  20 
 3 files changed, 134 insertions(+)

diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
index b74f6ad..a98948c 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
@@ -53,9 +53,119 @@
 
 #include "opa_vnic_internal.h"
 
+enum {NETDEV_STATS, VNIC_STATS};
+
+struct vnic_stats {
+   char stat_string[ETH_GSTRING_LEN];
+   struct {
+   int sizeof_stat;
+   int stat_offset;
+   };
+};
+
+#define VNIC_STAT(m){ FIELD_SIZEOF(struct opa_vnic_stats, m),   \
+ offsetof(struct opa_vnic_stats, m) }
+
+static struct vnic_stats vnic_gstrings_stats[] = {
+   /* NETDEV stats */
+   {"rx_packets", VNIC_STAT(netstats.rx_packets)},
+   {"tx_packets", VNIC_STAT(netstats.tx_packets)},
+   {"rx_bytes", VNIC_STAT(netstats.rx_bytes)},
+   {"tx_bytes", VNIC_STAT(netstats.tx_bytes)},
+   {"rx_errors", VNIC_STAT(netstats.rx_errors)},
+   {"tx_errors", VNIC_STAT(netstats.tx_errors)},
+   {"rx_dropped", VNIC_STAT(netstats.rx_dropped)},
+   {"tx_dropped", VNIC_STAT(netstats.tx_dropped)},
+
+   /* SUMMARY counters */
+   {"tx_unicast", VNIC_STAT(tx_grp.unicast)},
+   {"tx_mcastbcast", VNIC_STAT(tx_grp.mcastbcast)},
+   {"tx_untagged", VNIC_STAT(tx_grp.untagged)},
+   {"tx_vlan", VNIC_STAT(tx_grp.vlan)},
+
+   {"tx_64_size", VNIC_STAT(tx_grp.s_64)},
+   {"tx_65_127", VNIC_STAT(tx_grp.s_65_127)},
+   {"tx_128_255", VNIC_STAT(tx_grp.s_128_255)},
+   {"tx_256_511", VNIC_STAT(tx_grp.s_256_511)},
+   {"tx_512_1023", VNIC_STAT(tx_grp.s_512_1023)},
+   {"tx_1024_1518", VNIC_STAT(tx_grp.s_1024_1518)},
+   {"tx_1519_max", VNIC_STAT(tx_grp.s_1519_max)},
+
+   {"rx_unicast", VNIC_STAT(rx_grp.unicast)},
+   {"rx_mcastbcast", VNIC_STAT(rx_grp.mcastbcast)},
+   {"rx_untagged", VNIC_STAT(rx_grp.untagged)},
+   {"rx_vlan", VNIC_STAT(rx_grp.vlan)},
+
+   {"rx_64_size", VNIC_STAT(rx_grp.s_64)},
+   {"rx_65_127", VNIC_STAT(rx_grp.s_65_127)},
+   {"rx_128_255", VNIC_STAT(rx_grp.s_128_255)},
+   {"rx_256_511", VNIC_STAT(rx_grp.s_256_511)},
+   {"rx_512_1023", VNIC_STAT(rx_grp.s_512_1023)},
+   {"rx_1024_1518", VNIC_STAT(rx_grp.s_1024_1518)},
+   {"rx_1519_max", VNIC_STAT(rx_grp.s_1519_max)},
+
+   /* ERROR counters */
+   {"rx_fifo_errors", VNIC_STAT(netstats.rx_fifo_errors)},
+   {"rx_length_errors", VNIC_STAT(netstats.rx_length_errors)},
+
+   {"tx_fifo_errors", VNIC_STAT(netstats.tx_fifo_errors)},
+   {"tx_carrier_errors", VNIC_STAT(netstats.tx_carrier_errors)},
+
+   {"tx_dlid_zero", VNIC_STAT(tx_dlid_zero)},
+   {"tx_drop_state", VNIC_STAT(tx_drop_state)},
+   {"rx_drop_state", VNIC_STAT(rx_drop_state)},
+   {"rx_oversize", VNIC_STAT(rx_oversize)},
+   {"rx_runt", VNIC_STAT(rx_runt)},
+};
+
+#define VNIC_STATS_LEN  ARRAY_SIZE(vnic_gstrings_stats)
+
+/* vnic_get_sset_count - get string set count */
+static int vnic_get_sset_count(struct net_device *netdev, int sset)
+{
+   return (sset == ETH_SS_STATS) ? VNIC_STATS_LEN : -EOPNOTSUPP;
+}
+
+/* vnic_get_ethtool_stats - get statistics */
+static void vnic_get_ethtool_stats(struct net_device *netdev,
+  struct ethtool_stats *stats, u64 *data)
+{
+   struct opa_vnic_adapter *adapter = opa_vnic_priv(netdev);
+   struct opa_vnic_stats vstats;
+   int i;
+
+   memset(&vstats, 0, sizeof(vstats));
+   mutex_lock(&adapter->stats_lock);
+   adapter->rn_ops->ndo_get_stats64(netdev, &vstats);
+   for (i = 0; i < VNIC_STATS_LEN; i++) {
+   char *p = (char *)&vstats + vnic_gstrings_stats[i].stat_offset;
+
+   data[i] = (vnic_gstrings_stats[i].sizeof_stat ==
+  sizeof(u64)) ? *(u64 *)p : *(u32 *)p;
+   }
+   mutex_unlock(&adapter->stats_lock);
+}
+
+/* vnic_get_strings - get strings */
+static void vnic_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
+{
+   int i;
+
+   if (stringset != ETH_SS_STATS)
+   return;
+
+   for (i = 0; i < VNIC_STATS_LEN; i++)
+   memcpy(data + i * ETH_GSTRING_LEN,
+  vnic_gstrings_stats[i].stat_string,
+  ETH_GSTRING_LEN);
+}
+
 /* ethtool ops */
 static const struct ethtool_ops 

[PATCH 04/11] IB/opa-vnic: VNIC Ethernet Management (EM) structure definitions

2017-02-22 Thread Vishwanathapura, Niranjana
Define VNIC EM MAD structures and the associated macros. These structures
are used for information exchange between the VNIC EM agent (EMA) on the
host and the Ethernet manager. They include the virtual Ethernet switch
(vesw) port information, vesw port MAC table, summary and error counters,
vesw port interface MAC lists and the EMA trap.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
Signed-off-by: Tanya K Jajodia 
---
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h   | 423 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h|  33 ++
 2 files changed, 456 insertions(+)
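
As a quick illustration, a consumer of the dlid_sd accessor macros defined
in this patch might decode a MAC table entry as follows (a sketch only;
the surrounding code is hypothetical):

static u32 example_get_dlid(const struct opa_veswport_mactable_entry *entry)
{
        u32 dlid_sd = be32_to_cpu(entry->dlid_sd);

        if (OPA_VNIC_DLID_SD_IS_SRC_MAC(dlid_sd))
                return 0;       /* DLID is derived from the source MAC address */

        return OPA_VNIC_DLID_SD_GET_DLID(dlid_sd);      /* upper 24 bits */
}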

diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
index 176fca9..c025cde 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
@@ -52,6 +52,28 @@
  * and decapsulation of Ethernet packets
  */
 
+#include 
+#include 
+
+/* EMA class version */
+#define OPA_EMA_CLASS_VERSION   0x80
+
+/*
+ * Define the Intel vendor management class for OPA
+ * ETHERNET MANAGEMENT
+ */
+#define OPA_MGMT_CLASS_INTEL_EMA0x34
+
+/* EM attribute IDs */
+#define OPA_EM_ATTR_CLASS_PORT_INFO 0x0001
+#define OPA_EM_ATTR_VESWPORT_INFO   0x0011
+#define OPA_EM_ATTR_VESWPORT_MAC_ENTRIES0x0012
+#define OPA_EM_ATTR_IFACE_UCAST_MACS0x0013
+#define OPA_EM_ATTR_IFACE_MCAST_MACS0x0014
+#define OPA_EM_ATTR_DELETE_VESW 0x0015
+#define OPA_EM_ATTR_VESWPORT_SUMMARY_COUNTERS   0x0020
+#define OPA_EM_ATTR_VESWPORT_ERROR_COUNTERS 0x0022
+
 /* VNIC configured and operational state values */
 #define OPA_VNIC_STATE_DROP_ALL        0x1
 #define OPA_VNIC_STATE_FORWARDING  0x3
@@ -59,4 +81,405 @@
 #define OPA_VESW_MAX_NUM_DEF_PORT   16
 #define OPA_VNIC_MAX_NUM_PCP8
 
+#define OPA_VNIC_EMA_DATA       (OPA_MGMT_MAD_SIZE - IB_MGMT_VENDOR_HDR)
+
+/* Defines for vendor specific notice(trap) attributes */
+#define OPA_INTEL_EMA_NOTICE_TYPE_INFO 0x04
+
+/* INTEL OUI */
+#define INTEL_OUI_1 0x00
+#define INTEL_OUI_2 0x06
+#define INTEL_OUI_3 0x6a
+
+/* Trap opcodes sent from VNIC */
+#define OPA_VESWPORT_TRAP_IFACE_UCAST_MAC_CHANGE 0x1
+#define OPA_VESWPORT_TRAP_IFACE_MCAST_MAC_CHANGE 0x2
+#define OPA_VESWPORT_TRAP_ETH_LINK_STATUS_CHANGE 0x3
+
+#define OPA_VNIC_DLID_SD_IS_SRC_MAC(dlid_sd)  (!!((dlid_sd) & 0x20))
+#define OPA_VNIC_DLID_SD_GET_DLID(dlid_sd)((dlid_sd) >> 8)
+
+/**
+ * struct opa_vesw_info - OPA vnic switch information
+ * @fabric_id: 10-bit fabric id
+ * @vesw_id: 12-bit virtual ethernet switch id
+ * @def_port_mask: bitmask of default ports
+ * @pkey: partition key
+ * @u_mcast_dlid: unknown multicast dlid
+ * @u_ucast_dlid: array of unknown unicast dlids
+ * @eth_mtu: MTUs for each vlan PCP
+ * @eth_mtu_non_vlan: MTU for non vlan packets
+ */
+struct opa_vesw_info {
+   __be16  fabric_id;
+   __be16  vesw_id;
+
+   u8  rsvd0[6];
+   __be16  def_port_mask;
+
+   u8  rsvd1[2];
+   __be16  pkey;
+
+   u8  rsvd2[4];
+   __be32  u_mcast_dlid;
+   __be32  u_ucast_dlid[OPA_VESW_MAX_NUM_DEF_PORT];
+
+   u8  rsvd3[44];
+   __be16  eth_mtu[OPA_VNIC_MAX_NUM_PCP];
+   __be16  eth_mtu_non_vlan;
+   u8  rsvd4[2];
+} __packed;
+
+/**
+ * struct opa_per_veswport_info - OPA vnic per port information
+ * @port_num: port number
+ * @eth_link_status: current ethernet link state
+ * @base_mac_addr: base mac address
+ * @config_state: configured port state
+ * @oper_state: operational port state
+ * @max_mac_tbl_ent: max number of mac table entries
+ * @max_smac_ent: max smac entries in mac table
+ * @mac_tbl_digest: mac table digest
+ * @encap_slid: base slid for the port
+ * @pcp_to_sc_uc: sc by pcp index for unicast ethernet packets
+ * @pcp_to_vl_uc: vl by pcp index for unicast ethernet packets
+ * @pcp_to_sc_mc: sc by pcp index for multicast ethernet packets
+ * @pcp_to_vl_mc: vl by pcp index for multicast ethernet packets
+ * @non_vlan_sc_uc: sc for non-vlan unicast ethernet packets
+ * @non_vlan_vl_uc: vl for non-vlan unicast ethernet packets
+ * @non_vlan_sc_mc: sc for non-vlan multicast ethernet packets
+ * @non_vlan_vl_mc: vl for non-vlan multicast ethernet packets
+ * @uc_macs_gen_count: generation count for unicast macs list
+ * @mc_macs_gen_count: generation count for multicast macs list
+ */
+struct opa_per_veswport_info {
+   __be32  port_num;
+
+   u8  eth_link_status;
+   u8  rsvd0[3];
+
+   u8  base_mac_addr[ETH_ALEN];
+   u8  config_state;
+   u8  oper_state;
+
+   __be16  max_mac_tbl_ent;
+   __be16  max_smac_ent;
+   __be32  mac_tbl_digest;

[PATCH 03/11] IB/opa-vnic: Virtual Network Interface Controller (VNIC) netdev

2017-02-22 Thread Vishwanathapura, Niranjana
OPA VNIC netdev function supports Ethernet functionality over Omni-Path
fabric by encapsulating Ethernet packets inside an Omni-Path packet header.
It allocates an RDMA netdev device and interfaces with the network stack to
provide standard Ethernet network interfaces. It overrides the HFI1 device's
netdev operations where required.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
Signed-off-by: Sudeep Dutt 
Signed-off-by: Tanya K Jajodia 
Signed-off-by: Andrzej Kacprowski 
---
 MAINTAINERS|   7 +
 drivers/infiniband/Kconfig |   1 +
 drivers/infiniband/ulp/Makefile|   1 +
 drivers/infiniband/ulp/opa_vnic/Kconfig|   8 +
 drivers/infiniband/ulp/opa_vnic/Makefile   |   6 +
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c   | 239 +
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h   |  62 ++
 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c |  65 ++
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h| 186 
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  | 225 +++
 10 files changed, 800 insertions(+)
 create mode 100644 drivers/infiniband/ulp/opa_vnic/Kconfig
 create mode 100644 drivers/infiniband/ulp/opa_vnic/Makefile
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 468d2e8..7f0a07d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5775,6 +5775,13 @@ F:   drivers/block/cciss*
 F: include/linux/cciss_ioctl.h
 F: include/uapi/linux/cciss_ioctl.h
 
+OPA-VNIC DRIVER
+M: Dennis Dalessandro 
+M: Niranjana Vishwanathapura 
+L: linux-r...@vger.kernel.org
+S: Supported
+F: drivers/infiniband/ulp/opa_vnic
+
 HFI1 DRIVER
 M: Mike Marciniszyn 
 M: Dennis Dalessandro 
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 66f8602..234fe01 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -85,6 +85,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
 source "drivers/infiniband/ulp/iser/Kconfig"
 source "drivers/infiniband/ulp/isert/Kconfig"
 
+source "drivers/infiniband/ulp/opa_vnic/Kconfig"
 source "drivers/infiniband/sw/rdmavt/Kconfig"
 source "drivers/infiniband/sw/rxe/Kconfig"
 
diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
index f3c7dcf..c28af18 100644
--- a/drivers/infiniband/ulp/Makefile
+++ b/drivers/infiniband/ulp/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_INFINIBAND_SRP)+= srp/
 obj-$(CONFIG_INFINIBAND_SRPT)  += srpt/
 obj-$(CONFIG_INFINIBAND_ISER)  += iser/
 obj-$(CONFIG_INFINIBAND_ISERT) += isert/
+obj-$(CONFIG_INFINIBAND_OPA_VNIC)  += opa_vnic/
diff --git a/drivers/infiniband/ulp/opa_vnic/Kconfig b/drivers/infiniband/ulp/opa_vnic/Kconfig
new file mode 100644
index 000..48132ab
--- /dev/null
+++ b/drivers/infiniband/ulp/opa_vnic/Kconfig
@@ -0,0 +1,8 @@
+config INFINIBAND_OPA_VNIC
+   tristate "Intel OPA VNIC support"
+   depends on X86_64 && INFINIBAND
+   ---help---
+   This is the Omni-Path (OPA) Virtual Network Interface Controller
+   (VNIC) driver for the Ethernet over Omni-Path feature. It implements
+   the HW independent VNIC functionality. It interfaces with the Linux
+   stack for the data path and with IB MAD for the control path.
diff --git a/drivers/infiniband/ulp/opa_vnic/Makefile b/drivers/infiniband/ulp/opa_vnic/Makefile
new file mode 100644
index 000..975c313
--- /dev/null
+++ b/drivers/infiniband/ulp/opa_vnic/Makefile
@@ -0,0 +1,6 @@
+# Makefile - Intel Omni-Path Virtual Network Controller driver
+# Copyright(c) 2017, Intel Corporation.
+#
+obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic.o
+
+opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
new file mode 100644
index 000..c74d02a
--- /dev/null
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
@@ -0,0 +1,239 @@
+/*
+ * Copyright(c) 2017 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This 

[PATCH 01/11] IB/opa-vnic: Virtual Network Interface Controller (VNIC) documentation

2017-02-22 Thread Vishwanathapura, Niranjana
Add OPA VNIC design document explaining the VNIC architecture and the
driver design.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
---
 Documentation/infiniband/opa_vnic.txt | 153 ++
 1 file changed, 153 insertions(+)
 create mode 100644 Documentation/infiniband/opa_vnic.txt

diff --git a/Documentation/infiniband/opa_vnic.txt b/Documentation/infiniband/opa_vnic.txt
new file mode 100644
index 000..282e17b
--- /dev/null
+++ b/Documentation/infiniband/opa_vnic.txt
@@ -0,0 +1,153 @@
+Intel Omni-Path (OPA) Virtual Network Interface Controller (VNIC) feature
+supports Ethernet functionality over Omni-Path fabric by encapsulating
+the Ethernet packets between HFI nodes.
+
+Architecture
+============
+The patterns of exchange of Omni-Path encapsulated Ethernet packets
+involve one or more virtual Ethernet switches overlaid on the Omni-Path
+fabric topology. A subset of HFI nodes on the Omni-Path fabric are
+permitted to exchange encapsulated Ethernet packets across a particular
+virtual Ethernet switch. The virtual Ethernet switches are logical
+abstractions achieved by configuring the HFI nodes on the fabric for
+header generation and processing. In the simplest configuration, all HFI
+nodes across the fabric exchange encapsulated Ethernet packets over a
+single virtual Ethernet switch. A virtual Ethernet switch is effectively
+an independent Ethernet network. The configuration is performed by an
+Ethernet Manager (EM), which is part of the trusted Fabric Manager (FM)
+application. HFI nodes can have multiple VNICs, each connected to a
+different virtual Ethernet switch. The diagram below presents a case
+of two virtual Ethernet switches with two HFI nodes.
+
+                               +-------------------+
+                               |      Subnet/      |
+                               |     Ethernet      |
+                               |      Manager      |
+                               +-------------------+
+                                  /          /
+                                /           /
+                              /            /
+                            /             /
++-----------------------------+  +------------------------------+
+|  Virtual Ethernet Switch    |  |  Virtual Ethernet Switch     |
+|  +---------+    +---------+ |  | +---------+    +---------+   |
+|  | VPORT   |    |  VPORT  | |  | |  VPORT  |    |  VPORT  |   |
++--+---------+----+---------+-+  +-+---------+----+---------+---+
+         |                 \        /                 |
+         |                   \    /                   |
+         |                     \/                     |
+         |                    /  \                    |
+         |                  /      \                  |
+ +-----------+------------+  +------------+----------+
+ |   VNIC    |    VNIC    |  |    VNIC    |   VNIC   |
+ +-----------+------------+  +------------+----------+
+ |          HFI           |  |          HFI          |
+ +------------------------+  +-----------------------+
+
+
+The Omni-Path encapsulated Ethernet packet format is as described below.
+
+Bits  Field
+
+Quad Word 0:
+0-19  SLID (lower 20 bits)
+20-30 Length (in Quad Words)
+31BECN bit
+32-51 DLID (lower 20 bits)
+52-56 SC (Service Class)
+57-59 RC (Routing Control)
+60FECN bit
+61-62 L2 (=10, 16B format)
+63LT (=1, Link Transfer Head Flit)
+
+Quad Word 1:
+0-7   L4 type (=0x78 ETHERNET)
+8-11  SLID[23:20]
+12-15 DLID[23:20]
+16-31 PKEY
+32-47 Entropy
+48-63 Reserved
+
+Quad Word 2:
+0-15  Reserved
+16-31 L4 header
+32-63 Ethernet Packet
+
+Quad Words 3 to N-1:
+0-63  Ethernet packet (pad extended)
+
+Quad Word N (last):
+0-23  Ethernet packet (pad extended)
+24-55 ICRC
+56-61 Tail
+62-63 LT (=01, Link Transfer Tail Flit)
+
+The Ethernet packet is padded on the transmit side to ensure that the
+VNIC OPA packet is quad word aligned. The 'Tail' field contains the
+number of bytes padded. On the receive side the 'Tail' field is read and
+the padding is removed (along with the ICRC, Tail and OPA header) before
+passing the packet up the network stack.
+
+The L4 header field contains the id of the virtual Ethernet switch the
+VNIC port belongs to. On the receive side, this field is used to
+de-multiplex received VNIC packets to the different VNIC ports.
+
+Driver Design
+=============
+Intel OPA VNIC software design is presented in the below diagram.
+OPA VNIC functionality has a HW dependent component and a HW
+independent component.
+
+Support has been added for an IB device to allocate and free RDMA
+netdev devices. The RDMA netdev supports interfacing with the network
+stack, thus creating standard network interfaces. 

[PATCH 10/11] IB/hfi1: Virtual Network Interface Controller (VNIC) HW support

2017-02-22 Thread Vishwanathapura, Niranjana
HFI1 HW specific support for VNIC functionality.
Dynamically allocate a set of contexts for VNIC when the first vnic
port is instantiated. Allocate VNIC contexts from the user contexts pool
and return them to the same pool when freed. Set aside enough MSI-X
interrupts for VNIC contexts and assign them when the contexts are
allocated. On the receive side, use an RSM rule to spread TCP/UDP
streams among the VNIC contexts.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Andrzej Kacprowski 
---
 drivers/infiniband/hw/hfi1/aspm.h |  15 +-
 drivers/infiniband/hw/hfi1/chip.c | 293 +-
 drivers/infiniband/hw/hfi1/chip.h |   4 +-
 drivers/infiniband/hw/hfi1/debugfs.c  |   8 +-
 drivers/infiniband/hw/hfi1/driver.c   |  52 --
 drivers/infiniband/hw/hfi1/file_ops.c |  27 ++-
 drivers/infiniband/hw/hfi1/hfi.h  |  29 ++-
 drivers/infiniband/hw/hfi1/init.c |  29 +--
 drivers/infiniband/hw/hfi1/mad.c  |  10 +-
 drivers/infiniband/hw/hfi1/pio.c  |  19 +-
 drivers/infiniband/hw/hfi1/pio.h  |   8 +-
 drivers/infiniband/hw/hfi1/sysfs.c|   4 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |   8 +-
 drivers/infiniband/hw/hfi1/user_pages.c   |   5 +-
 drivers/infiniband/hw/hfi1/verbs.c|   8 +-
 drivers/infiniband/hw/hfi1/vnic.h |   3 +
 drivers/infiniband/hw/hfi1/vnic_main.c| 245 -
 include/rdma/opa_port_info.h  |   4 +-
 18 files changed, 663 insertions(+), 108 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/aspm.h b/drivers/infiniband/hw/hfi1/aspm.h
index 0d58fe3..794e681 100644
--- a/drivers/infiniband/hw/hfi1/aspm.h
+++ b/drivers/infiniband/hw/hfi1/aspm.h
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015, 2016 Intel Corporation.
+ * Copyright(c) 2015-2017 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -229,14 +229,17 @@ static inline void aspm_ctx_timer_function(unsigned long data)
 spin_unlock_irqrestore(&rcd->aspm_lock, flags);
 }
 
-/* Disable interrupt processing for verbs contexts when PSM contexts are open */
+/*
+ * Disable interrupt processing for verbs contexts when PSM or VNIC contexts
+ * are open.
+ */
 static inline void aspm_disable_all(struct hfi1_devdata *dd)
 {
struct hfi1_ctxtdata *rcd;
unsigned long flags;
unsigned i;
 
-   for (i = 0; i < dd->first_user_ctxt; i++) {
+   for (i = 0; i < dd->first_dyn_alloc_ctxt; i++) {
rcd = dd->rcd[i];
del_timer_sync(&rcd->aspm_timer);
spin_lock_irqsave(&rcd->aspm_lock, flags);
@@ -260,7 +263,7 @@ static inline void aspm_enable_all(struct hfi1_devdata *dd)
if (aspm_mode != ASPM_MODE_DYNAMIC)
return;
 
-   for (i = 0; i < dd->first_user_ctxt; i++) {
+   for (i = 0; i < dd->first_dyn_alloc_ctxt; i++) {
rcd = dd->rcd[i];
spin_lock_irqsave(&rcd->aspm_lock, flags);
rcd->aspm_intr_enable = true;
@@ -276,7 +279,7 @@ static inline void aspm_ctx_init(struct hfi1_ctxtdata *rcd)
(unsigned long)rcd);
rcd->aspm_intr_supported = rcd->dd->aspm_supported &&
aspm_mode == ASPM_MODE_DYNAMIC &&
-   rcd->ctxt < rcd->dd->first_user_ctxt;
+   rcd->ctxt < rcd->dd->first_dyn_alloc_ctxt;
 }
 
 static inline void aspm_init(struct hfi1_devdata *dd)
@@ -286,7 +289,7 @@ static inline void aspm_init(struct hfi1_devdata *dd)
spin_lock_init(&dd->aspm_lock);
dd->aspm_supported = aspm_hw_l1_supported(dd);
 
-   for (i = 0; i < dd->first_user_ctxt; i++)
+   for (i = 0; i < dd->first_dyn_alloc_ctxt; i++)
aspm_ctx_init(dd->rcd[i]);
 
/* Start with ASPM disabled */
diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index 121a4c9..f97fccb 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015, 2016 Intel Corporation.
+ * Copyright(c) 2015-2017 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -125,9 +125,16 @@ struct flag_table {
 #define DEFAULT_KRCVQS   2
 #define MIN_KERNEL_KCTXTS 2
 #define FIRST_KERNEL_KCTXT        1
-/* sizes for both the QP and RSM map tables */
-#define NUM_MAP_ENTRIES        256
-#define NUM_MAP_REGS 32
+
+/*
+ * RSM instance allocation
+ *   0 - Verbs
+ *   1 - User Fecn Handling
+ *   2 - Vnic
+ */
+#define RSM_INS_VERBS 0
+#define RSM_INS_FECN  1
+#define RSM_INS_VNIC  2
 

[PATCH 08/11] IB/opa-vnic: VNIC Ethernet Management Agent (VEMA) function

2017-02-22 Thread Vishwanathapura, Niranjana
OPA VEMA function interfaces with the Infiniband MAD stack to exchange the
management information packets with the Ethernet Manager (EM).
It interfaces with the OPA VNIC netdev function to SET/GET the management
information. The information exchanged with the EM includes class port
details, encapsulation configuration, various counters, unicast and
multicast MAC list and the MAC table. It also supports sending traps
to the EM.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Sadanand Warrier 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Tanya K Jajodia 
Signed-off-by: Sudeep Dutt 
---
 drivers/infiniband/ulp/opa_vnic/Makefile   |2 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c |   12 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h|   17 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c| 1071 
 .../infiniband/ulp/opa_vnic/opa_vnic_vema_iface.c  |2 +-
 5 files changed, 1099 insertions(+), 5 deletions(-)
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c
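
The registration with the IB MAD stack presumably follows the usual
vendor-class pattern; a rough sketch using the constants from
opa_vnic_encap.h (the handler names and other details are hypothetical):

static int vema_register_sketch(struct opa_vnic_ctrl_port *cport, u8 port)
{
        struct ib_mad_reg_req reg_req = {
                .mgmt_class             = OPA_MGMT_CLASS_INTEL_EMA,
                .mgmt_class_version     = OPA_EMA_CLASS_VERSION,
        };
        struct ib_mad_agent *agent;

        /* listen for all methods of the Intel EMA vendor class */
        bitmap_fill(reg_req.method_mask, IB_MGMT_MAX_METHODS);

        agent = ib_register_mad_agent(cport->ibdev, port, IB_QPT_GSI,
                                      &reg_req, 0, vema_send_sketch,
                                      vema_recv_sketch, cport, 0);
        return PTR_ERR_OR_ZERO(agent);
}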

diff --git a/drivers/infiniband/ulp/opa_vnic/Makefile b/drivers/infiniband/ulp/opa_vnic/Makefile
index e8d1ea1..8061b28 100644
--- a/drivers/infiniband/ulp/opa_vnic/Makefile
+++ b/drivers/infiniband/ulp/opa_vnic/Makefile
@@ -4,4 +4,4 @@
 obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic.o
 
 opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o \
-  opa_vnic_vema_iface.o
+  opa_vnic_vema.o opa_vnic_vema_iface.o
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
index a98948c..d66540e 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
@@ -120,6 +120,17 @@ struct vnic_stats {
 
 #define VNIC_STATS_LEN  ARRAY_SIZE(vnic_gstrings_stats)
 
+/* vnic_get_drvinfo - get driver info */
+static void vnic_get_drvinfo(struct net_device *netdev,
+struct ethtool_drvinfo *drvinfo)
+{
+   strlcpy(drvinfo->driver, opa_vnic_driver_name, sizeof(drvinfo->driver));
+   strlcpy(drvinfo->version, opa_vnic_driver_version,
+   sizeof(drvinfo->version));
+   strlcpy(drvinfo->bus_info, dev_name(netdev->dev.parent),
+   sizeof(drvinfo->bus_info));
+}
+
 /* vnic_get_sset_count - get string set count */
 static int vnic_get_sset_count(struct net_device *netdev, int sset)
 {
@@ -162,6 +173,7 @@ static void vnic_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
 
 /* ethtool ops */
 static const struct ethtool_ops opa_vnic_ethtool_ops = {
+   .get_drvinfo = vnic_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_strings = vnic_get_strings,
.get_sset_count = vnic_get_sset_count,
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
index b49f5d7..6bba886 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
@@ -164,10 +164,12 @@ struct __opa_veswport_trap {
  * struct opa_vnic_ctrl_port - OPA virtual NIC control port
  * @ibdev: pointer to ib device
  * @ops: opa vnic control operations
+ * @num_ports: number of opa ports
  */
 struct opa_vnic_ctrl_port {
struct ib_device   *ibdev;
struct opa_vnic_ctrl_ops   *ops;
+   u8  num_ports;
 };
 
 /**
@@ -187,6 +189,8 @@ struct opa_vnic_ctrl_port {
  * @mactbl_lock: mac table lock
  * @stats_lock: statistics lock
  * @flow_tbl: flow to default port redirection table
+ * @trap_timeout: trap timeout
+ * @trap_count: no. of traps allowed within timeout period
  */
 struct opa_vnic_adapter {
struct net_device *netdev;
@@ -213,6 +217,9 @@ struct opa_vnic_adapter {
struct mutex stats_lock;
 
u8 flow_tbl[OPA_VNIC_FLOW_TBL_SIZE];
+
+   unsigned long trap_timeout;
+   u8trap_count;
 };
 
 /* Same as opa_veswport_mactable_entry, but without bitwise attribute */
@@ -247,6 +254,8 @@ struct opa_vnic_mac_tbl_node {
dev_err(&cport->ibdev->dev, format, ## arg)
 #define c_info(format, arg...) \
dev_info(&cport->ibdev->dev, format, ## arg)
+#define c_dbg(format, arg...) \
+   dev_dbg(&cport->ibdev->dev, format, ## arg)
 
 /* The maximum allowed entries in the mac table */
 #define OPA_VNIC_MAC_TBL_MAX_ENTRIES  2048
@@ -281,6 +290,9 @@ struct opa_vnic_mac_tbl_node {
!obj && (bkt) < OPA_VNIC_MAC_TBL_SIZE; (bkt)++)   \
hlist_for_each_entry(obj, &name[bkt], member)
 
+extern char opa_vnic_driver_name[];
+extern const char opa_vnic_driver_version[];
+
 struct opa_vnic_adapter *opa_vnic_add_netdev(struct ib_device *ibdev,

[PATCH 02/11] IB/opa-vnic: Virtual Network Interface Controller (VNIC) interface

2017-02-22 Thread Vishwanathapura, Niranjana
Add rdma netdev interface to ib device structure allowing rdma netdev
devices to be allocated by ib clients.
Define OPA VNIC interface between hardware independent VNIC
functionality and the hardware dependent VNIC functionality.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
---
 include/rdma/ib_verbs.h |  27 +
 include/rdma/opa_vnic.h | 143 
 2 files changed, 170 insertions(+)
 create mode 100644 include/rdma/opa_vnic.h
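
A client would then allocate an OPA VNIC netdev through the new verb
roughly as follows (a sketch; everything except the verb itself is
hypothetical):

static struct net_device *sketch_alloc_vnic(struct ib_device *ibdev, u8 port)
{
        if (!ibdev->alloc_rdma_netdev)
                return ERR_PTR(-EOPNOTSUPP);

        /* returns a netdev whose private area embeds struct rdma_netdev */
        return ibdev->alloc_rdma_netdev(ibdev, port, RDMA_NETDEV_OPA_VNIC,
                                        "veth%d", NET_NAME_UNKNOWN,
                                        ether_setup);
}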

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 8c61532..16ad142 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -55,6 +55,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -221,6 +222,7 @@ enum ib_device_cap_flags {
IB_DEVICE_SG_GAPS_REG   = (1ULL << 32),
IB_DEVICE_VIRTUAL_FUNCTION  = (1ULL << 33),
IB_DEVICE_RAW_SCATTER_FCS   = (1ULL << 34),
+   IB_DEVICE_RDMA_NETDEV_OPA_VNIC  = (1ULL << 35),
 };
 
 enum ib_signature_prot_cap {
@@ -1858,6 +1860,22 @@ struct ib_port_immutable {
u32   max_mad_size;
 };
 
+/* rdma netdev type - specifies protocol type */
+enum rdma_netdev_t {
+   RDMA_NETDEV_OPA_VNIC
+};
+
+/**
+ * struct rdma_netdev - rdma netdev
+ * For cases where netstack interfacing is required.
+ */
+struct rdma_netdev {
+   void *clnt_priv;
+
+   /* control functions */
+   void (*set_id)(struct net_device *netdev, int id);
+};
+
 struct ib_device {
struct device*dma_device;
 
@@ -2110,6 +2128,15 @@ struct ib_device {
   struct ib_rwq_ind_table_init_attr *init_attr,
   struct ib_udata *udata);
	int (*destroy_rwq_ind_table)(struct ib_rwq_ind_table *wq_ind_table);
+   /* rdma netdev operations */
+   struct net_device *(*alloc_rdma_netdev)(
+   struct ib_device *device,
+   u8 port_num,
+   enum rdma_netdev_t type,
+   const char *name,
+   unsigned char name_assign_type,
+   void (*setup)(struct net_device *));
+   void (*free_rdma_netdev)(struct net_device *netdev);
struct ib_dma_mapping_ops   *dma_ops;
 
struct module   *owner;
diff --git a/include/rdma/opa_vnic.h b/include/rdma/opa_vnic.h
new file mode 100644
index 000..68315cc
--- /dev/null
+++ b/include/rdma/opa_vnic.h
@@ -0,0 +1,143 @@
+#ifndef _OPA_VNIC_H
+#define _OPA_VNIC_H
+/*
+ * Copyright(c) 2017 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * 

[PATCH 07/11] IB/opa-vnic: VNIC Ethernet Management Agent (VEMA) interface

2017-02-22 Thread Vishwanathapura, Niranjana
OPA VNIC EMA interface functions are the management interfaces to the OPA
VNIC netdev. Add support to add and remove VNIC ports. Implement the
required GET/SET management interface functions and processing of new
management information. Add support to send trap notifications upon various
events like interface status change, unicast/multicast mac list update and
mac address change.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
Signed-off-by: Tanya K Jajodia 
---
 drivers/infiniband/ulp/opa_vnic/Makefile   |   3 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h   |   4 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h|  44 +++
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  | 142 +++-
 .../infiniband/ulp/opa_vnic/opa_vnic_vema_iface.c  | 390 +
 5 files changed, 581 insertions(+), 2 deletions(-)
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_vema_iface.c

diff --git a/drivers/infiniband/ulp/opa_vnic/Makefile 
b/drivers/infiniband/ulp/opa_vnic/Makefile
index 975c313..e8d1ea1 100644
--- a/drivers/infiniband/ulp/opa_vnic/Makefile
+++ b/drivers/infiniband/ulp/opa_vnic/Makefile
@@ -3,4 +3,5 @@
 #
 obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic.o
 
-opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o
+opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o \
+  opa_vnic_vema_iface.o
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h 
b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
index c025cde..4c434b9 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
@@ -99,6 +99,10 @@
 #define OPA_VNIC_DLID_SD_IS_SRC_MAC(dlid_sd)  (!!((dlid_sd) & 0x20))
 #define OPA_VNIC_DLID_SD_GET_DLID(dlid_sd)((dlid_sd) >> 8)
 
+/* VNIC Ethernet link status */
+#define OPA_VNIC_ETH_LINK_UP 1
+#define OPA_VNIC_ETH_LINK_DOWN   2
+
 /**
  * struct opa_vesw_info - OPA vnic switch information
  * @fabric_id: 10-bit fabric id
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h 
b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
index bec4866..b49f5d7 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
@@ -161,14 +161,28 @@ struct __opa_veswport_trap {
 } __packed;
 
 /**
+ * struct opa_vnic_ctrl_port - OPA virtual NIC control port
+ * @ibdev: pointer to ib device
+ * @ops: opa vnic control operations
+ */
+struct opa_vnic_ctrl_port {
+   struct ib_device   *ibdev;
+   struct opa_vnic_ctrl_ops   *ops;
+};
+
+/**
  * struct opa_vnic_adapter - OPA VNIC netdev private data structure
  * @netdev: pointer to associated netdev
  * @ibdev: ib device
+ * @cport: pointer to opa vnic control port
  * @rn_ops: rdma netdev's net_device_ops
  * @port_num: OPA port number
  * @vport_num: vesw port number
  * @lock: adapter lock
  * @info: virtual ethernet switch port information
+ * @vema_mac_addr: mac address configured by vema
+ * @umac_hash: unicast maclist hash
+ * @mmac_hash: multicast maclist hash
  * @mactbl: hash table of MAC entries
  * @mactbl_lock: mac table lock
  * @stats_lock: statistics lock
@@ -177,6 +191,7 @@ struct __opa_veswport_trap {
 struct opa_vnic_adapter {
struct net_device *netdev;
struct ib_device  *ibdev;
+   struct opa_vnic_ctrl_port *cport;
const struct net_device_ops   *rn_ops;
 
u8 port_num;
@@ -186,6 +201,9 @@ struct opa_vnic_adapter {
struct mutex lock;
 
struct __opa_veswport_info  info;
+   u8  vema_mac_addr[ETH_ALEN];
+   u32 umac_hash;
+   u32 mmac_hash;
struct hlist_head  __rcu   *mactbl;
 
/* Lock used to protect updates to mac table */
@@ -225,6 +243,11 @@ struct opa_vnic_mac_tbl_node {
 #define v_warn(format, arg...) \
netdev_warn(adapter->netdev, format, ## arg)
 
+#define c_err(format, arg...) \
+   dev_err(&cport->ibdev->dev, format, ## arg)
+#define c_info(format, arg...) \
+   dev_info(&cport->ibdev->dev, format, ## arg)
+
 /* The maximum allowed entries in the mac table */
 #define OPA_VNIC_MAC_TBL_MAX_ENTRIES  2048
 /* Limit of smac entries in mac table */
@@ -264,11 +287,32 @@ struct opa_vnic_adapter *opa_vnic_add_netdev(struct ib_device *ibdev,
 void opa_vnic_encap_skb(struct opa_vnic_adapter *adapter, struct sk_buff *skb);
 u8 opa_vnic_get_vl(struct opa_vnic_adapter *adapter, struct sk_buff *skb);
u8 opa_vnic_calc_entropy(struct opa_vnic_adapter *adapter, struct sk_buff *skb);
+void opa_vnic_process_vema_config(struct opa_vnic_adapter *adapter);
 void opa_vnic_release_mac_tbl(struct opa_vnic_adapter *adapter);
 void 

[PATCH 09/11] IB/hfi1: OPA_VNIC RDMA netdev support

2017-02-22 Thread Vishwanathapura, Niranjana
Add support to create and free OPA_VNIC rdma netdev devices.
Implement netstack interface functionality, including xmit_skb and
receive-side NAPI. Also implement the rdma netdev control functions.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Andrzej Kacprowski 
---
 drivers/infiniband/hw/hfi1/Makefile|   2 +-
 drivers/infiniband/hw/hfi1/driver.c|  25 +-
 drivers/infiniband/hw/hfi1/hfi.h   |  27 +-
 drivers/infiniband/hw/hfi1/init.c  |   9 +-
 drivers/infiniband/hw/hfi1/vnic.h  | 153 
 drivers/infiniband/hw/hfi1/vnic_main.c | 646 +
 6 files changed, 855 insertions(+), 7 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/vnic.h
 create mode 100644 drivers/infiniband/hw/hfi1/vnic_main.c
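
The receive path presumably follows the standard NAPI shape; a generic
sketch (not the patch's actual functions):

static int vnic_napi_poll_sketch(struct napi_struct *napi, int budget)
{
        int work_done = 0;

        /*
         * Drain up to budget packets from the VNIC rx queue here,
         * handing each skb to napi_gro_receive().
         */

        if (work_done < budget)
                napi_complete(napi);    /* then re-enable rx interrupts */
        return work_done;
}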

diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile
index 0cf97a0..2280538 100644
--- a/drivers/infiniband/hw/hfi1/Makefile
+++ b/drivers/infiniband/hw/hfi1/Makefile
@@ -12,7 +12,7 @@ hfi1-y := affinity.o chip.o device.o driver.o efivar.o \
init.o intr.o mad.o mmu_rb.o pcie.o pio.o pio_copy.o platform.o \
qp.o qsfp.o rc.o ruc.o sdma.o sysfs.o trace.o \
uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs.o \
-   verbs_txreq.o
+   verbs_txreq.o vnic_main.o
 hfi1-$(CONFIG_DEBUG_FS) += debugfs.o
 
 CFLAGS_trace.o = -I$(src)
diff --git a/drivers/infiniband/hw/hfi1/driver.c b/drivers/infiniband/hw/hfi1/driver.c
index 3881c95..4969b88 100644
--- a/drivers/infiniband/hw/hfi1/driver.c
+++ b/drivers/infiniband/hw/hfi1/driver.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015, 2016 Intel Corporation.
+ * Copyright(c) 2015-2017 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -59,6 +59,7 @@
 #include "trace.h"
 #include "qp.h"
 #include "sdma.h"
+#include "vnic.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) DRIVER_NAME ": " fmt
@@ -1372,15 +1373,31 @@ int process_receive_ib(struct hfi1_packet *packet)
return RHF_RCV_CONTINUE;
 }
 
+static inline bool hfi1_is_vnic_packet(struct hfi1_packet *packet)
+{
+   /* Packet received in VNIC context via RSM */
+   if (packet->rcd->is_vnic)
+   return true;
+
+   if ((HFI1_GET_L2_TYPE(packet->ebuf) == OPA_VNIC_L2_TYPE) &&
+   (HFI1_GET_L4_TYPE(packet->ebuf) == OPA_VNIC_L4_ETHR))
+   return true;
+
+   return false;
+}
+
 int process_receive_bypass(struct hfi1_packet *packet)
 {
struct hfi1_devdata *dd = packet->rcd->dd;
 
-   if (unlikely(rhf_err_flags(packet->rhf)))
+   if (unlikely(rhf_err_flags(packet->rhf))) {
handle_eflags(packet);
+   } else if (hfi1_is_vnic_packet(packet)) {
+   hfi1_vnic_bypass_rcv(packet);
+   return RHF_RCV_CONTINUE;
+   }
 
-   dd_dev_err(dd,
-  "Bypass packets are not supported in normal operation. Dropping\n");
+   dd_dev_err(dd, "Unsupported bypass packet. Dropping\n");
incr_cntr64(&dd->sw_rcv_bypass_packet_errors);
if (!(dd->err_info_rcvport.status_and_code & OPA_EI_STATUS_SMASK)) {
u64 *flits = packet->ebuf;
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index 0808e3c3..66fb9e4 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1,7 +1,7 @@
 #ifndef _HFI1_KERNEL_H
 #define _HFI1_KERNEL_H
 /*
- * Copyright(c) 2015, 2016 Intel Corporation.
+ * Copyright(c) 2015-2017 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license.  When using or
  * redistributing this file, you may do so under either license.
@@ -337,6 +337,12 @@ struct hfi1_ctxtdata {
 * packets with the wrong interrupt handler.
 */
int (*do_interrupt)(struct hfi1_ctxtdata *rcd, int threaded);
+
+   /* Indicates that this is vnic context */
+   bool is_vnic;
+
+   /* vnic queue index this context is mapped to */
+   u8 vnic_q_idx;
 };
 
 /*
@@ -808,6 +814,19 @@ struct hfi1_asic_data {
struct hfi1_i2c_bus *i2c_bus1;
 };
 
+/*
+ * Number of VNIC contexts used. Ensure it is less than or equal to
+ * max queues supported by VNIC (HFI1_VNIC_MAX_QUEUE).
+ */
+#define HFI1_NUM_VNIC_CTXT   8
+
+/* Virtual NIC information */
+struct hfi1_vnic_data {
+   struct idr vesw_idr;
+};
+
+struct hfi1_vnic_vport_info;
+
 /* device data struct now contains only "general per-device" info.
  * fields related to a physical IB port are in a hfi1_pportdata struct.
  */
@@ -1115,6 +1134,9 @@ struct hfi1_devdata {
send_routine process_dma_send;
void (*pio_inline_send)(struct hfi1_devdata *dd, struct pio_buf *pbuf,
u64 pbc, const void *from, size_t 

[PATCH 11/11] IB/hfi1: VNIC SDMA support

2017-02-22 Thread Vishwanathapura, Niranjana
HFI1 VNIC SDMA support enables transmission of VNIC packets over SDMA.
Map VNIC queues to SDMA engines and support halting and wakeup of the
VNIC queues.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
---
 drivers/infiniband/hw/hfi1/Makefile|   2 +-
 drivers/infiniband/hw/hfi1/hfi.h   |   1 +
 drivers/infiniband/hw/hfi1/init.c  |   1 +
 drivers/infiniband/hw/hfi1/vnic.h  |  28 +++
 drivers/infiniband/hw/hfi1/vnic_main.c |  24 ++-
 drivers/infiniband/hw/hfi1/vnic_sdma.c | 323 +
 6 files changed, 376 insertions(+), 3 deletions(-)
 create mode 100644 drivers/infiniband/hw/hfi1/vnic_sdma.c
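
The halt/wakeup flow maps onto the usual stop-queue/wake-queue pattern; a
rough sketch built around the helper this patch declares
(hfi1_vnic_sdma_write_avail); the surrounding logic is illustrative only:

/* Transmit side: stop the subqueue when the SDMA engine has no room. */
static void sketch_maybe_stop_tx(struct hfi1_vnic_vport_info *vinfo, u8 q_idx)
{
        netif_stop_subqueue(vinfo->netdev, q_idx);
        /* re-check to close the race against a concurrent wakeup */
        if (hfi1_vnic_sdma_write_avail(vinfo, q_idx))
                netif_start_subqueue(vinfo->netdev, q_idx);
}

/* Wakeup side: restart the subqueue once descriptors free up. */
static void sketch_sdma_wakeup(struct hfi1_vnic_vport_info *vinfo, u8 q_idx)
{
        if (__netif_subqueue_stopped(vinfo->netdev, q_idx))
                netif_wake_subqueue(vinfo->netdev, q_idx);
}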

diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile
index 2280538..88085f6 100644
--- a/drivers/infiniband/hw/hfi1/Makefile
+++ b/drivers/infiniband/hw/hfi1/Makefile
@@ -12,7 +12,7 @@ hfi1-y := affinity.o chip.o device.o driver.o efivar.o \
init.o intr.o mad.o mmu_rb.o pcie.o pio.o pio_copy.o platform.o \
qp.o qsfp.o rc.o ruc.o sdma.o sysfs.o trace.o \
uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs.o \
-   verbs_txreq.o vnic_main.o
+   verbs_txreq.o vnic_main.o vnic_sdma.o
 hfi1-$(CONFIG_DEBUG_FS) += debugfs.o
 
 CFLAGS_trace.o = -I$(src)
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index ac31b23..b57b88a 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -834,6 +834,7 @@ struct hfi1_asic_data {
 /* Virtual NIC information */
 struct hfi1_vnic_data {
struct hfi1_ctxtdata *ctxt[HFI1_NUM_VNIC_CTXT];
+   struct kmem_cache *txreq_cache;
u8 num_vports;
struct idr vesw_idr;
u8 rmt_start;
diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c
index 1ecccaa..3fc7984 100644
--- a/drivers/infiniband/hw/hfi1/init.c
+++ b/drivers/infiniband/hw/hfi1/init.c
@@ -681,6 +681,7 @@ int hfi1_init(struct hfi1_devdata *dd, int reinit)
dd->process_pio_send = hfi1_verbs_send_pio;
dd->process_dma_send = hfi1_verbs_send_dma;
dd->pio_inline_send = pio_copy;
+   dd->process_vnic_dma_send = hfi1_vnic_send_dma;
 
if (is_ax(dd)) {
atomic_set(>drop_packet, DROP_PACKET_ON);
diff --git a/drivers/infiniband/hw/hfi1/vnic.h b/drivers/infiniband/hw/hfi1/vnic.h
index d620aec..36996f0 100644
--- a/drivers/infiniband/hw/hfi1/vnic.h
+++ b/drivers/infiniband/hw/hfi1/vnic.h
@@ -49,6 +49,7 @@
 
 #include 
 #include "hfi.h"
+#include "sdma.h"
 
 #define HFI1_VNIC_MAX_TXQ 16
 #define HFI1_VNIC_MAX_PAD 12
@@ -85,6 +86,26 @@
 #define HFI1_VNIC_MAX_QUEUE 16
 
 /**
+ * struct hfi1_vnic_sdma - VNIC per Tx ring SDMA information
+ * @dd - device data pointer
+ * @sde - sdma engine
+ * @vinfo - vnic info pointer
+ * @wait - iowait structure
+ * @stx - sdma tx request
+ * @state - vnic Tx ring SDMA state
+ * @q_idx - vnic Tx queue index
+ */
+struct hfi1_vnic_sdma {
+   struct hfi1_devdata *dd;
+   struct sdma_engine  *sde;
+   struct hfi1_vnic_vport_info *vinfo;
+   struct iowait wait;
+   struct sdma_txreq stx;
+   unsigned int state;
+   u8 q_idx;
+};
+
+/**
  * struct hfi1_vnic_rx_queue - HFI1 VNIC receive queue
  * @idx: queue index
  * @vinfo: pointer to vport information
@@ -111,6 +132,7 @@ struct hfi1_vnic_rx_queue {
  * @vesw_id: virtual switch id
  * @rxq: Array of receive queues
  * @stats: per queue stats
+ * @sdma: VNIC SDMA structure per TXQ
  */
 struct hfi1_vnic_vport_info {
struct hfi1_devdata *dd;
@@ -126,6 +148,7 @@ struct hfi1_vnic_vport_info {
struct hfi1_vnic_rx_queue rxq[HFI1_NUM_VNIC_CTXT];
 
struct opa_vnic_stats  stats[HFI1_VNIC_MAX_QUEUE];
+   struct hfi1_vnic_sdma  sdma[HFI1_VNIC_MAX_TXQ];
 };
 
 #define v_dbg(format, arg...) \
@@ -138,8 +161,13 @@ struct hfi1_vnic_vport_info {
 /* vnic hfi1 internal functions */
 void hfi1_vnic_setup(struct hfi1_devdata *dd);
 void hfi1_vnic_cleanup(struct hfi1_devdata *dd);
+int hfi1_vnic_txreq_init(struct hfi1_devdata *dd);
+void hfi1_vnic_txreq_deinit(struct hfi1_devdata *dd);
 
 void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet);
+void hfi1_vnic_sdma_init(struct hfi1_vnic_vport_info *vinfo);
+bool hfi1_vnic_sdma_write_avail(struct hfi1_vnic_vport_info *vinfo,
+   u8 q_idx);
 
 /* vnic rdma netdev operations */
 struct net_device *hfi1_vnic_alloc_rn(struct ib_device *device,
diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c b/drivers/infiniband/hw/hfi1/vnic_main.c
index 4a9bb8c..8f354e7 100644
--- a/drivers/infiniband/hw/hfi1/vnic_main.c
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -408,6 +408,10 @@ static void hfi1_vnic_maybe_stop_tx(struct hfi1_vnic_vport_info *vinfo,
u8 q_idx)
 {
netif_stop_subqueue(vinfo->netdev, 

[PATCH 06/11] IB/opa-vnic: VNIC MAC table support

2017-02-22 Thread Vishwanathapura, Niranjana
OPA VNIC MAC table contains the MAC address to DLID mappings provided by
the Ethernet manager. During transmission, the MAC table provides the MAC
address to DLID translation. Implement the MAC table using a simple hash
list. Also provide support for the Ethernet manager to update/query the
MAC table.

Reviewed-by: Dennis Dalessandro 
Reviewed-by: Ira Weiny 
Signed-off-by: Niranjana Vishwanathapura 
Signed-off-by: Sadanand Warrier 
---
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c   | 236 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h|  51 +
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  |   4 +
 3 files changed, 291 insertions(+)
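
The update procedure described below is the classic RCU
replace-and-reclaim pattern; its core, in outline (a sketch mirroring
opa_vnic_release_mac_tbl in the diff):

static void sketch_swap_mactbl(struct opa_vnic_adapter *adapter,
                               struct hlist_head *new_mactbl)
{
        struct hlist_head *old_mactbl;

        mutex_lock(&adapter->mactbl_lock);
        old_mactbl = rcu_access_pointer(adapter->mactbl);
        rcu_assign_pointer(adapter->mactbl, new_mactbl);
        synchronize_rcu();      /* wait out readers of the old table */
        opa_vnic_free_mac_tbl(old_mactbl);
        mutex_unlock(&adapter->mactbl_lock);
}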

diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
index c74d02a..2e8fee9 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
@@ -96,6 +96,238 @@ static inline void opa_vnic_make_header(u8 *hdr, u32 slid, u32 dlid, u16 len,
memcpy(hdr, h, OPA_VNIC_HDR_LEN);
 }
 
+/*
+ * The mac table is implemented as a simple hash table, keyed by the last
+ * octet of the mac address.
+ */
+static void opa_vnic_free_mac_tbl(struct hlist_head *mactbl)
+{
+   struct opa_vnic_mac_tbl_node *node;
+   struct hlist_node *tmp;
+   int bkt;
+
+   if (!mactbl)
+   return;
+
+   vnic_hash_for_each_safe(mactbl, bkt, tmp, node, hlist) {
+   hash_del(&node->hlist);
+   kfree(node);
+   }
+   kfree(mactbl);
+}
+
+static struct hlist_head *opa_vnic_alloc_mac_tbl(void)
+{
+   u32 size = sizeof(struct hlist_head) * OPA_VNIC_MAC_TBL_SIZE;
+   struct hlist_head *mactbl;
+
+   mactbl = kzalloc(size, GFP_KERNEL);
+   if (!mactbl)
+   return ERR_PTR(-ENOMEM);
+
+   vnic_hash_init(mactbl);
+   return mactbl;
+}
+
+/* opa_vnic_release_mac_tbl - empty and free the mac table */
+void opa_vnic_release_mac_tbl(struct opa_vnic_adapter *adapter)
+{
+   struct hlist_head *mactbl;
+
+   mutex_lock(&adapter->mactbl_lock);
+   mactbl = rcu_access_pointer(adapter->mactbl);
+   rcu_assign_pointer(adapter->mactbl, NULL);
+   synchronize_rcu();
+   opa_vnic_free_mac_tbl(mactbl);
+   mutex_unlock(&adapter->mactbl_lock);
+}
+
+/*
+ * opa_vnic_query_mac_tbl - query the mac table for a section
+ *
+ * This function implements a query of a specific section of the mac table.
+ * The function also expects the requested range to be valid.
+ */
+void opa_vnic_query_mac_tbl(struct opa_vnic_adapter *adapter,
+   struct opa_veswport_mactable *tbl)
+{
+   struct opa_vnic_mac_tbl_node *node;
+   struct hlist_head *mactbl;
+   int bkt;
+   u16 loffset, lnum_entries;
+
+   rcu_read_lock();
+   mactbl = rcu_dereference(adapter->mactbl);
+   if (!mactbl)
+   goto get_mac_done;
+
+   loffset = be16_to_cpu(tbl->offset);
+   lnum_entries = be16_to_cpu(tbl->num_entries);
+
+   vnic_hash_for_each(mactbl, bkt, node, hlist) {
+   struct __opa_vnic_mactable_entry *nentry = &node->entry;
+   struct opa_veswport_mactable_entry *entry;
+
+   if ((node->index < loffset) ||
+   (node->index >= (loffset + lnum_entries)))
+   continue;
+
+   /* populate entry in the tbl corresponding to the index */
+   entry = &tbl->tbl_entries[node->index - loffset];
+   memcpy(entry->mac_addr, nentry->mac_addr,
+  ARRAY_SIZE(entry->mac_addr));
+   memcpy(entry->mac_addr_mask, nentry->mac_addr_mask,
+  ARRAY_SIZE(entry->mac_addr_mask));
+   entry->dlid_sd = cpu_to_be32(nentry->dlid_sd);
+   }
+   tbl->mac_tbl_digest = cpu_to_be32(adapter->info.vport.mac_tbl_digest);
+get_mac_done:
+   rcu_read_unlock();
+}
+
+/*
+ * opa_vnic_update_mac_tbl - update mac table section
+ *
+ * This function updates the specified section of the mac table.
+ * The procedure includes following steps.
+ *  - Allocate a new mac (hash) table.
+ *  - Add the specified entries to the new table.
+ *(except the ones that are requested to be deleted).
+ *  - Add all the other entries from the old mac table.
+ *  - If there is a failure, free the new table and return.
+ *  - Switch to the new table.
+ *  - Free the old table and return.
+ *
+ * The function also expects the requested range to be valid.
+ */
+int opa_vnic_update_mac_tbl(struct opa_vnic_adapter *adapter,
+   struct opa_veswport_mactable *tbl)
+{
+   struct opa_vnic_mac_tbl_node *node, *new_node;
+   struct hlist_head *new_mactbl, *old_mactbl;
+   int i, bkt, rc = 0;
+   u8 key;
+   u16 loffset, lnum_entries;
+
+   mutex_lock(&adapter->mactbl_lock);
+   /* allocate new mac 

[PATCH net-next v3 3/3] A Sample of using socket cookie and uid for traffic monitoring

2017-02-22 Thread Chenbo Feng
From: Chenbo Feng 

Add a sample program to demonstrate the possible usage of the
get_socket_cookie and get_socket_uid helper functions. The program will
store the bytes and packets counts of in/out traffic monitored by iptables
and store the stats in a bpf map on a per-socket basis. The owner uid of
the socket will be stored as part of the data entry. A shell script for
running the program is also included.

Change since V2:
Add the example code and the shell script to run the program.

Signed-off-by: Chenbo Feng 
---
 samples/bpf/cookie_uid_helper_example.c  | 225 +++
 samples/bpf/run_cookie_uid_helper_example.sh |  14 ++
 2 files changed, 239 insertions(+)
 create mode 100644 samples/bpf/cookie_uid_helper_example.c
 create mode 100755 samples/bpf/run_cookie_uid_helper_example.sh

diff --git a/samples/bpf/cookie_uid_helper_example.c b/samples/bpf/cookie_uid_helper_example.c
new file mode 100644
index 000..ffa4740
--- /dev/null
+++ b/samples/bpf/cookie_uid_helper_example.c
@@ -0,0 +1,225 @@
+/* This test is a demo of using get_socket_uid and get_socket_cookie
+ * helper function to do per socket based network traffic monitoring.
+ * It requires iptables version 1.6.1 or higher to load a pinned eBPF
+ * program into the xt_bpf match.
+ *
+ * Compile:
+ * gcc -I ../../usr/include -I ../../tools/lib -I ../../tools/include \
+ * -I ./ -Wall cookie_uid_helper_example.c ../../tools/lib/bpf/bpf.c -o \
+ * perSocketStats_example
+ *
+ * TEST:
+ * ./run_cookie_uid_helper_example.sh
+ * Then generate some traffic in various ways. ping 0 -c 10 would work
+ * but the cookie and uid in this case could both be 0. A sample output
+ * with some traffic generated by web browser is shown below:
+ *
+ * cookie: 877, uid: 0x3e8, Packet Count: 20, Bytes Count: 11058
+ *
+ * cookie: 132, uid: 0x0, Packet Count: 2, Bytes Count: 286
+ * cookie: 812, uid: 0x3e8, Packet Count: 3, Bytes Count: 1726
+ * cookie: 802, uid: 0x3e8, Packet Count: 2, Bytes Count: 104
+ * cookie: 877, uid: 0x3e8, Packet Count: 20, Bytes Count: 11058
+ * cookie: 831, uid: 0x3e8, Packet Count: 2, Bytes Count: 104
+ * cookie: 0, uid: 0x0, Packet Count: 6, Bytes Count: 712
+ * cookie: 880, uid: 0xfffe, Packet Count: 1, Bytes Count: 70
+ *
+ * Clean up: if using the shell script, the script will delete the iptables
+ * rule and unmount the bpf program on exit. Otherwise the iptables rule needs
+ * to be deleted using:
+ *   iptables -D INPUT -m bpf --object-pinned ${mnt_dir}/bpf_prog -j ACCEPT
+ */
+
+#define _GNU_SOURCE
+
+#define offsetof(type, member) __builtin_offsetof(type, member)
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "libbpf.h"
+
+struct stats {
+   uint32_t uid;
+   uint64_t packets;
+   uint64_t bytes;
+};
+
+static int map_fd, prog_fd;
+
+static void maps_create(void)
+{
+   map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(uint32_t),
+   sizeof(struct stats), 100, 0);
+   if (map_fd < 0)
+   error(1, errno, "map create failed!\n");
+}
+
+static void prog_load(void)
+{
+   static char log_buf[1 << 16];
+
+   struct bpf_insn prog[] = {
+   /*
+* pc0: R1 holds the sk_buff pointer; save it for future usage.
+* Values stored in R6 to R10 will not be reset after a bpf
+* helper function call.
+*/
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+   /*
+* pc1: BPF_FUNC_get_socket_cookie takes one parameter,
+* R1: sk_buff
+*/
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+   BPF_FUNC_get_socket_cookie),
+   /* pc2-4: save &cookie to r7 for future usage */
+   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
+   BPF_MOV64_REG(BPF_REG_7, BPF_REG_10),
+   BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8),
+   /*
+* pc5-8: set up the registers for BPF_FUNC_map_lookup_elem,
+* it takes two parameters (R1: map_fd, R2: &cookie)
+*/
+   BPF_LD_MAP_FD(BPF_REG_1, map_fd),
+   BPF_MOV64_REG(BPF_REG_2, BPF_REG_7),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+   BPF_FUNC_map_lookup_elem),
+   /*
+* pc9. if r0 != 0x0, go to pc+14, since we have the cookie
+* stored already
+* Otherwise do pc10-22 to setup a new data entry.
+*/
+   BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 14),
+   BPF_MOV64_REG(BPF_REG_1, BPF_REG_6),
+   BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
+   BPF_FUNC_get_socket_uid),
+   /*
+* Place a struct 

[PATCH net-next v3 0/3] net: core: Two Helper function about socket information

2017-02-22 Thread Chenbo Feng
From: Chenbo Feng 

Introduce two eBPF helper functions to get the socket cookie and
socket uid for each packet. The helper functions are useful when
the *sk field inside sk_buff is not empty. These helper functions
can be used in socket- and uid-based traffic monitoring programs.

Change since V2:
* Add a sample program to demonstrate the usage of the helper functions.
* Moved the helper function proto invoking place.
* Add function header into tools/include
* Apply sk_to_full_sk() before getting uid.

Change since V1:
* Removed the unnecessary declarations and export command
* resolved conflict with master branch. 
* Examine if the socket is a full socket before getting the uid.

Chenbo Feng (3):
  Add a helper function to get socket cookie in eBPF
  Add a eBPF helper function to retrieve socket uid
  A Sample of using socket cookie and uid for traffic monitoring

 include/linux/sock_diag.h|   1 +
 include/uapi/linux/bpf.h |  16 +-
 net/core/filter.c|  36 +
 net/core/sock_diag.c |   2 +-
 samples/bpf/cookie_uid_helper_example.c  | 225 +++
 samples/bpf/run_cookie_uid_helper_example.sh |  14 ++
 tools/include/uapi/linux/bpf.h   |   4 +-
 7 files changed, 295 insertions(+), 3 deletions(-)
 create mode 100644 samples/bpf/cookie_uid_helper_example.c
 create mode 100755 samples/bpf/run_cookie_uid_helper_example.sh

-- 
2.7.4



[PATCH net-next v3 1/3] Add a helper function to get socket cookie in eBPF

2017-02-22 Thread Chenbo Feng
From: Chenbo Feng 

Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff
with a known socket. Generates a new cookie if one was not yet set. If
the socket pointer inside sk_buff is NULL, 0 is returned. The helper
function could be useful for monitoring per-socket networking traffic
statistics and provides a unique socket identifier per namespace.
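
For illustration, a socket filter written in restricted C could use the
helper roughly like this (a sketch assuming the samples/bpf build
environment, its bpf_helpers.h, and a declaration for the new helper
there; not part of this patch):

	#include <uapi/linux/bpf.h>
	#include "bpf_helpers.h"

	SEC("socket")
	int log_cookie(struct __sk_buff *skb)
	{
		char fmt[] = "cookie %llu\n";
		__u64 cookie = bpf_get_socket_cookie(skb);

		if (cookie)	/* 0 means skb->sk was NULL */
			bpf_trace_printk(fmt, sizeof(fmt), cookie);
		return skb->len;	/* keep the whole packet */
	}
	char _license[] SEC("license") = "GPL";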

Change since V2:
Moved the helper function from bpf_base_func_proto() to both
sk_filter_func_proto() and tc_cls_act_func_proto(). Add function name
to uapi header file under tools/include.

Change since V1:
Removed the unnecessary declarations and export command, resolved
conflict with master branch.

Signed-off-by: Chenbo Feng 
---
 include/linux/sock_diag.h  |  1 +
 include/uapi/linux/bpf.h   |  9 -
 net/core/filter.c  | 17 +
 net/core/sock_diag.c   |  2 +-
 tools/include/uapi/linux/bpf.h |  3 ++-
 5 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h
index a0596ca0..a2f8109 100644
--- a/include/linux/sock_diag.h
+++ b/include/linux/sock_diag.h
@@ -24,6 +24,7 @@ void sock_diag_unregister(const struct sock_diag_handler *h);
 void sock_diag_register_inet_compat(int (*fn)(struct sk_buff *skb, struct nlmsghdr *nlh));
 void sock_diag_unregister_inet_compat(int (*fn)(struct sk_buff *skb, struct nlmsghdr *nlh));
 
+u64 sock_gen_cookie(struct sock *sk);
 int sock_diag_check_cookie(struct sock *sk, const __u32 *cookie);
 void sock_diag_save_cookie(struct sock *sk, __u32 *cookie);
 
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 0539a0c..dc81a9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -456,6 +456,12 @@ union bpf_attr {
  * Return:
  *   > 0 length of the string including the trailing NUL on success
  *   < 0 error
+ *
+ * u64 bpf_get_socket_cookie(skb)
+ * Get the cookie for the socket stored inside sk_buff.
+ * @skb: pointer to skb
+ * Return: 8-byte non-decreasing number on success or 0 if the socket
+ * field is missing inside sk_buff
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -503,7 +509,8 @@ union bpf_attr {
FN(get_numa_node_id),   \
FN(skb_change_head),\
FN(xdp_adjust_head),\
-   FN(probe_read_str),
+   FN(probe_read_str), \
+   FN(get_socket_cookie),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index e466e004..06263c0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -2599,6 +2600,18 @@ static const struct bpf_func_proto bpf_xdp_event_output_proto = {
.arg5_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb)
+{
+   return skb->sk ? sock_gen_cookie(skb->sk) : 0;
+}
+
+static const struct bpf_func_proto bpf_get_socket_cookie_proto = {
+   .func   = bpf_get_socket_cookie,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -2633,6 +2646,8 @@ sk_filter_func_proto(enum bpf_func_id func_id)
switch (func_id) {
case BPF_FUNC_skb_load_bytes:
return &bpf_skb_load_bytes_proto;
+   case BPF_FUNC_get_socket_cookie:
+   return &bpf_get_socket_cookie_proto;
default:
return bpf_base_func_proto(func_id);
}
@@ -2692,6 +2707,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_get_smp_processor_id_proto;
case BPF_FUNC_skb_under_cgroup:
return &bpf_skb_under_cgroup_proto;
+   case BPF_FUNC_get_socket_cookie:
+   return &bpf_get_socket_cookie_proto;
default:
return bpf_base_func_proto(func_id);
}
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 6b10573..acd2a6c 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -19,7 +19,7 @@ static int (*inet_rcv_compat)(struct sk_buff *skb, struct nlmsghdr *nlh);
 static DEFINE_MUTEX(sock_diag_table_mutex);
 static struct workqueue_struct *broadcast_wq;
 
-static u64 sock_gen_cookie(struct sock *sk)
+u64 sock_gen_cookie(struct sock *sk)
 {
while (1) {
u64 res = atomic64_read(&sk->sk_cookie);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 0539a0c..a94bdd3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -503,7 +503,8 @@ union bpf_attr {
FN(get_numa_node_id),   \
FN(skb_change_head),\


[PATCH net-next v3 2/3] Add a eBPF helper function to retrieve socket uid

2017-02-22 Thread Chenbo Feng
From: Chenbo Feng 

Returns the owner uid of the socket inside a sk_buff. This is useful to
perform per-UID accounting of network traffic or per-UID packet
filtering. The socket needs to be a fullsock, otherwise 0 is returned.
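
As an illustration of the per-UID accounting use case, a tc classifier
could key a hash map by the returned uid (a sketch assuming the usual
samples/bpf scaffolding; map and function names are made up):

	#include <uapi/linux/bpf.h>
	#include <uapi/linux/pkt_cls.h>
	#include "bpf_helpers.h"

	struct bpf_map_def SEC("maps") uid_bytes = {
		.type = BPF_MAP_TYPE_HASH,
		.key_size = sizeof(__u32),
		.value_size = sizeof(__u64),
		.max_entries = 1024,
	};

	SEC("classifier")
	int count_by_uid(struct __sk_buff *skb)
	{
		__u32 uid = bpf_get_socket_uid(skb);
		__u64 init = skb->len, *bytes;

		bytes = bpf_map_lookup_elem(&uid_bytes, &uid);
		if (bytes)
			__sync_fetch_and_add(bytes, skb->len); /* atomic add */
		else
			bpf_map_update_elem(&uid_bytes, &uid, &init, BPF_NOEXIST);
		return TC_ACT_OK;
	}
	char _license[] SEC("license") = "GPL";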

Change since V2:
Add a sk_to_full_sk() check before retrieving the uid. Moved the helper
function from bpf_base_func_proto() to both sk_filter_func_proto() and
tc_cls_act_func_proto(). Add function name to uapi header file under
tools/include

Change since V1:
Removed the unnecessary declarations and export command, resolved
conflict with master branch. Examine if the socket is a full socket
before getting the uid.

Signed-off-by: Chenbo Feng 
---
 include/uapi/linux/bpf.h   |  9 -
 net/core/filter.c  | 19 +++
 tools/include/uapi/linux/bpf.h |  3 ++-
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index dc81a9f..ff42111 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -462,6 +462,12 @@ union bpf_attr {
  * @skb: pointer to skb
  * Return: 8-byte non-decreasing number on success or 0 if the socket
  * field is missing inside sk_buff
+ *
+ * u32 bpf_get_socket_uid(skb)
+ * Get the owner uid of the socket stored inside sk_buff.
+ * @skb: pointer to skb
+ * Return: uid of the socket owner on success or 0 if the socket pointer
+ * inside sk_buff is NULL
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -510,7 +516,8 @@ union bpf_attr {
FN(skb_change_head),\
FN(xdp_adjust_head),\
FN(probe_read_str), \
-   FN(get_socket_cookie),
+   FN(get_socket_cookie),  \
+   FN(get_socket_uid),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/net/core/filter.c b/net/core/filter.c
index 06263c0..53c4afc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2612,6 +2612,21 @@ static const struct bpf_func_proto bpf_get_socket_cookie_proto = {
.arg1_type  = ARG_PTR_TO_CTX,
 };
 
+BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb)
+{
+   struct sock *sk = sk_to_full_sk(skb->sk);
+   kuid_t kuid = sock_net_uid(dev_net(skb->dev), sk);
+
+   return (u32)kuid.val;
+}
+
+static const struct bpf_func_proto bpf_get_socket_uid_proto = {
+   .func   = bpf_get_socket_uid,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@ -2648,6 +2663,8 @@ sk_filter_func_proto(enum bpf_func_id func_id)
return &bpf_skb_load_bytes_proto;
case BPF_FUNC_get_socket_cookie:
return &bpf_get_socket_cookie_proto;
+   case BPF_FUNC_get_socket_uid:
+   return &bpf_get_socket_uid_proto;
default:
return bpf_base_func_proto(func_id);
}
@@ -2709,6 +2726,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id)
return &bpf_skb_under_cgroup_proto;
case BPF_FUNC_get_socket_cookie:
return &bpf_get_socket_cookie_proto;
+   case BPF_FUNC_get_socket_uid:
+   return &bpf_get_socket_uid_proto;
default:
return bpf_base_func_proto(func_id);
}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index a94bdd3..4a2d56d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -504,7 +504,8 @@ union bpf_attr {
FN(skb_change_head),\
FN(xdp_adjust_head),\
FN(probe_read_str), \
-   FN(get_socket_cookie),
+   FN(get_socket_cookie),  \
+   FN(get_socket_uid),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
-- 
2.7.4



Re: VXLAN RCU error

2017-02-22 Thread Jakub Kicinski
On Wed, 22 Feb 2017 14:27:45 -0800, Jakub Kicinski wrote:
> Hi Roopa!

Ah, sorry, it seems like this splat may be coming all the way from
c6fcc4fc5f8b ("vxlan: avoid using stale vxlan socket.").
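
For context: drivers/net/vxlan.c:2111 dereferences the vxlan socket with
plain rcu_dereference(), but on this path only rcu_read_lock_bh() (taken
in __dev_queue_xmit()) is held, and plain rcu_dereference() does not
satisfy lockdep there. Schematically, either form below would quiet the
warning quoted below (a sketch, not the actual fix):

	/* match the lock flavor actually held on the xmit path ... */
	sock4 = rcu_dereference_bh(vxlan->vn4_sock);

	/* ... or annotate that the _bh read side is also acceptable */
	sock4 = rcu_dereference_check(vxlan->vn4_sock,
				      rcu_read_lock_bh_held());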

> I get this RCU error on net 12d656af4e3d2781b9b9f52538593e1717e7c979:
> 
> [ 1571.067134] ===
> [ 1571.071842] [ ERR: suspicious RCU usage.  ]
> [ 1571.076546] 4.10.0-debug-03232-g12d656af4e3d #1 Tainted: GW  O   
> [ 1571.084166] ---
> [ 1571.088867] ../drivers/net/vxlan.c:2111 suspicious rcu_dereference_check() usage!
> [ 1571.097286] 
> [ 1571.097286] other info that might help us debug this:
> [ 1571.097286] 
> [ 1571.106305] 
> [ 1571.106305] rcu_scheduler_active = 2, debug_locks = 1
> [ 1571.113654] 3 locks held by ping/13826:
> [ 1571.117968]  #0:  (sk_lock-AF_INET){+.+.+.}, at: [] raw_sendmsg+0x14e2/0x2e40
> [ 1571.127758]  #1:  (rcu_read_lock_bh){..}, at: [] ip_finish_output2+0x274/0x1390
> [ 1571.138135]  #2:  (rcu_read_lock_bh){..}, at: [] __dev_queue_xmit+0x1ec/0x2750
> [ 1571.148408] 
> [ 1571.148408] stack backtrace:
> [ 1571.153326] CPU: 10 PID: 13826 Comm: ping Tainted: GW  O 4.10.0-debug-03232-g12d656af4e3d #1
> [ 1571.163877] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016
> [ 1571.172290] Call Trace:
> [ 1571.175053]  dump_stack+0xcd/0x134
> [ 1571.178881]  ? _atomic_dec_and_lock+0xcc/0xcc
> [ 1571.183782]  ? print_lock+0xb2/0xb5
> [ 1571.187711]  lockdep_rcu_suspicious+0x123/0x170
> [ 1571.192807]  vxlan_xmit_one+0x1931/0x4270 [vxlan]
> [ 1571.198126]  ? encap_bypass_if_local+0x380/0x380 [vxlan]
> [ 1571.204109]  ? sched_clock+0x9/0x10
> [ 1571.208034]  ? sched_clock_cpu+0x20/0x2c0
> [ 1571.212541]  ? unwind_get_return_address+0x1b8/0x2b0
> [ 1571.218132]  ? __lock_acquire+0x6d6/0x3160
> [ 1571.222740]  vxlan_xmit+0x756/0x4f90 [vxlan]
> [ 1571.227541]  ? vxlan_xmit_one+0x4270/0x4270 [vxlan]
> [ 1571.233014]  ? netif_skb_features+0x2be/0xba0
> [ 1571.237919]  dev_hard_start_xmit+0x1ab/0xa70
> [ 1571.242724]  __dev_queue_xmit+0x137b/0x2750
> [ 1571.247425]  ? __dev_queue_xmit+0x1ec/0x2750
> [ 1571.252228]  ? netdev_pick_tx+0x330/0x330
> [ 1571.256735]  ? debug_smp_processor_id+0x17/0x20
> [ 1571.261826]  ? get_lock_stats+0x1d/0x160
> [ 1571.266241]  ? mark_held_locks+0x105/0x280
> [ 1571.270850]  ? memcpy+0x45/0x50
> [ 1571.274391]  dev_queue_xmit+0x10/0x20
> [ 1571.278511]  neigh_resolve_output+0x43e/0x7f0
> [ 1571.283405]  ? ip_finish_output2+0x69d/0x1390
> [ 1571.288308]  ip_finish_output2+0x69d/0x1390
> [ 1571.293008]  ? ip_finish_output2+0x274/0x1390
> [ 1571.297909]  ? ip_copy_metadata+0x7e0/0x7e0
> [ 1571.302610]  ? get_lock_stats+0x1d/0x160
> [ 1571.307027]  ip_finish_output+0x598/0xc50
> [ 1571.311537]  ip_output+0x371/0x630
> [ 1571.315362]  ? ip_output+0x1dc/0x630
> [ 1571.319383]  ? ip_mc_output+0xe70/0xe70
> [ 1571.323694]  ? kfree+0x372/0x5a0
> [ 1571.327325]  ? mark_held_locks+0x105/0x280
> [ 1571.331933]  ? __ip_make_skb+0xdd1/0x2200
> [ 1571.336457]  ip_local_out+0x8f/0x180
> [ 1571.340480]  ip_send_skb+0x44/0xf0
> [ 1571.344306]  ip_push_pending_frames+0x5a/0x80
> [ 1571.349203]  raw_sendmsg+0x164d/0x2e40
> [ 1571.353422]  ? debug_check_no_locks_freed+0x350/0x350
> [ 1571.359099]  ? dst_output+0x1b0/0x1b0
> [ 1571.363217]  ? get_lock_stats+0x1d/0x160
> [ 1571.367640]  ? __might_fault+0x199/0x230
> [ 1571.372052]  ? kasan_check_write+0x14/0x20
> [ 1571.382002]  ? _copy_from_user+0xb9/0x130
> [ 1571.386513]  ? rw_copy_check_uvector+0x8d/0x490
> [ 1571.391609]  ? import_iovec+0xae/0x5d0
> [ 1571.395826]  ? push_pipe+0xd00/0xd00
> [ 1571.399847]  ? kasan_check_write+0x14/0x20
> [ 1571.404450]  ? _copy_from_user+0xb9/0x130
> [ 1571.408960]  inet_sendmsg+0x19f/0x5f0
> [ 1571.413071]  ? inet_recvmsg+0x980/0x980
> [ 1571.417386]  sock_sendmsg+0xe2/0x170
> [ 1571.421408]  ___sys_sendmsg+0x66e/0x960
> [ 1571.425726]  ? mem_cgroup_commit_charge+0x144/0x2720
> [ 1571.431303]  ? copy_msghdr_from_user+0x610/0x610
> [ 1571.436495]  ? debug_smp_processor_id+0x17/0x20
> [ 1571.441584]  ? get_lock_stats+0x1d/0x160
> [ 1571.445995]  ? mem_cgroup_uncharge_swap+0x250/0x250
> [ 1571.451474]  ? page_add_new_anon_rmap+0x173/0x3a0
> [ 1571.456762]  ? handle_mm_fault+0x1589/0x3820
> [ 1571.461566]  ? handle_mm_fault+0x1589/0x3820
> [ 1571.466362]  ? handle_mm_fault+0x191/0x3820
> [ 1571.471070]  ? __fdget+0x13/0x20
> [ 1571.474702]  ? get_lock_stats+0x1d/0x160
> [ 1571.479116]  __sys_sendmsg+0xc6/0x150
> [ 1571.483234]  ? SyS_shutdown+0x1b0/0x1b0
> [ 1571.487551]  ? __do_page_fault+0x556/0xe50
> [ 1571.492158]  ? trace_hardirqs_on_thunk+0x1a/0x1c
> [ 1571.497340]  SyS_sendmsg+0x12/0x20
> [ 1571.501166]  entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1571.506354] RIP: 0033:0x7fca2d0384a0
> [ 1571.510374] RSP: 002b:7ffd18d7fe88 EFLAGS: 0246 ORIG_RAX: 002e
> [ 1571.518886] RAX: ffda RBX: 0040 RCX: 7fca2d0384a0
> [ 1571.526889] RDX: 

Re: [PATCH net-next 2/2] sctp: add support for MSG_MORE

2017-02-22 Thread Xin Long
On Tue, Feb 21, 2017 at 10:27 PM, David Laight  wrote:
> From: Xin Long
>> Sent: 18 February 2017 17:53
>> This patch is to add support for MSG_MORE on sctp.
>>
>> It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after
>> creating datamsg according to the send flag. sctp_packet_can_append_data
>> then uses it to decide if the chunks of this msg will be sent at once or
>> delay it.
>>
>> Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of
>> in assoc. As sctp enqueues the chunks first, then dequeue them one by
>> one. If it's saved in assoc, the current msg's send flag (MSG_MORE) may
>> affect other chunks' bundling.
>
> I thought about that and decided that the MSG_MORE flag on the last data
> chunk was the only one that mattered.
> Indeed looking at any others is broken.
>
> Consider what happens if you have two small chunks queued, the first
> with MSG_MORE set, the second with it clear.
>
> I think that sctp_outq_flush() will look at the first chunk and decide it
> doesn't need to do anything because sctp_packet_transmit_chunk()
> returns SCTP_XMIT_DELAY.
> The data chunk with MSG_MORE clear won't even be looked at.
> So the data will never be sent.
It's not as bad as you thought; in sctp_packet_can_append_data(),
when inflight == 0 || sctp_sk(asoc->base.sk)->nodelay, the chunks
will still be sent out.

What the MSG_MORE flag actually does is ignore inflight == 0 and
sctp_sk(asoc->base.sk)->nodelay so that the chunks can be delayed,
but it still has to respect the original logic (like
!chunk->msg->can_delay || !sctp_packet_empty(packet) || ...).

Delaying the chunks with MSG_MORE set even when inflight is 0
is especially important here for users.
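
From userspace this is the usual MSG_MORE pattern (a sketch, assuming a
connected SCTP socket):

	/* queue small messages without forcing packets onto the wire ... */
	send(fd, buf1, len1, MSG_MORE);
	send(fd, buf2, len2, MSG_MORE);
	/* ... and let the final send flush the bundled chunks */
	send(fd, buf3, len3, 0);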

>
> I wouldn't worry about having messages queued that have MSG_MORE clean
> when the final message has it set.
Yeah, it's an old optimization for bundling. MSG_MORE should NOT
break that.

> While it might be 'nice' to send the data (would have to be tx credit)
> waiting for the next data chunk shouldn't be a problem.
Sorry, you mean it shouldn't send the data whenever it's waiting for
the next data?

>
> I'm not sure I even want to test the current patch!
>
> David
>


Re: [PATCH net v5] bpf: add helper to compare network namespaces

2017-02-22 Thread David Ahern
On 2/19/17 9:17 PM, Eric W. Biederman wrote:
>>> @@ -2597,6 +2598,39 @@ static const struct bpf_func_proto bpf_xdp_event_output_proto = {
>>> .arg5_type  = ARG_CONST_STACK_SIZE,
>>>   };
>>>
>>> +BPF_CALL_3(bpf_sk_netns_cmp, struct sock *, sk,  u64, ns_dev, u64, ns_ino)
>>> +{
>>> +   return netns_cmp(sock_net(sk), ns_dev, ns_ino);
>>> +}
>>
>> Is there anything that speaks against doing the comparison itself
>> outside of the helper? Meaning, the helper would get a buffer
>> passed from stack f.e. struct foo { u64 ns_dev; u64 ns_ino; }
>> and fills both out with the netns info belonging to the sk/skb.
> 
> Yes.  The dev/ino pair is not necessarily unique so it is not at all
> clear that the returned value would be what the program is expecting.

How does the comparison inside a helper change the fact that a dev and
inode number are compared? I.e., inside or outside of a helper, the end
result is that a bpf program has a dev/inode pair that is compared to
that of a socket or skb.

Ideally, it would be nice to have a bpf equivalent to net_eq(), but it
is not possible from a practical perspective to have bpf programs load a
namespace reference (address really) from a given pid or fd.
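
For completeness, the dev/ino pair such a helper compares against is what
userspace reads off the nsfs inode (a sketch):

	#include <sys/stat.h>

	struct stat st;

	if (stat("/proc/self/ns/net", &st) == 0) {
		/* st.st_dev and st.st_ino identify this netns; hand
		 * them to the bpf program, e.g. via a map */
		__u64 ns_dev = st.st_dev;
		__u64 ns_ino = st.st_ino;
	}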


Re: [PATCH net-next] virtio-net: switch to use build_skb() for small buffer

2017-02-22 Thread Jason Wang



On 2017-02-23 01:17, John Fastabend wrote:

On 17-02-21 12:46 AM, Jason Wang wrote:

This patch switches to use build_skb() for small buffers, which can
have better performance for both TCP and XDP (since we can work at the
page level before skb creation). It also removes lots of XDP code since
both mergeable and small buffers use page frags during refill now.

Before   | After
XDP_DROP(xdp1) 64B  :  11.1Mpps | 14.4Mpps

Tested with xdp1/xdp2/xdp_ip_tx_tunnel and netperf.

When you do the xdp tests are you generating packets with pktgen on the
corresponding tap devices?


Yes, pktgen on the tap directly.



Also another thought, have you looked at using some of the buffer recycling
techniques used in the hardware drivers such as ixgbe and with Eric's latest
patches mlx? I have seen significant performance increases for some
workloads doing this. I wanted to try something like this out on virtio
but haven't had time yet.


Yes, this is in TODO list. Will pick some time to do this.

Thanks




Signed-off-by: Jason Wang 
---

[...]


  static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq,
 gfp_t gfp)
  {
-   int headroom = GOOD_PACKET_LEN + virtnet_get_headroom(vi);
+   struct page_frag *alloc_frag = &rq->alloc_frag;
+   char *buf;
unsigned int xdp_headroom = virtnet_get_headroom(vi);
-   struct sk_buff *skb;
-   struct virtio_net_hdr_mrg_rxbuf *hdr;
+   int len = vi->hdr_len + VIRTNET_RX_PAD + GOOD_PACKET_LEN + xdp_headroom;
int err;
  
-	skb = __netdev_alloc_skb_ip_align(vi->dev, headroom, gfp);

-   if (unlikely(!skb))
+   len = SKB_DATA_ALIGN(len) +
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
return -ENOMEM;
  
-	skb_put(skb, headroom);

-
-   hdr = skb_vnet_hdr(skb);
-   sg_init_table(rq->sg, 2);
-   sg_set_buf(rq->sg, hdr, vi->hdr_len);
-   skb_to_sgvec(skb, rq->sg + 1, xdp_headroom, skb->len - xdp_headroom);
-
-   err = virtqueue_add_inbuf(rq->vq, rq->sg, 2, skb, gfp);
+   buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
+   get_page(alloc_frag->page);
+   alloc_frag->offset += len;
+   sg_init_one(rq->sg, buf + VIRTNET_RX_PAD + xdp_headroom,
+   vi->hdr_len + GOOD_PACKET_LEN);
+   err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);

Nice this cleans up a lot of the branching code. Thanks.

Acked-by: John Fastabend 




Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Alexander Duyck
On Wed, Feb 22, 2017 at 6:06 PM, Eric Dumazet  wrote:
> On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:
>
>>
>> Right but you were talking about using both halves one after the
>> other.  If that occurs you have nothing left that you can reuse.  That
>> was what I was getting at.  If you use up both halves you end up
>> having to unmap the page.
>>
>
> You must have misunderstood me.
>
> Once we use both halves of a page, we _keep_ the page, we do not unmap
> it.
>
> We save the page pointer in a ring buffer of pages.
> Call it the 'quarantine'
>
> When we _need_ to replenish the RX desc, we take a look at the oldest
> entry in the quarantine ring.
>
> If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
> this saved page.
>
> If not, _then_ we unmap and release the page.

Okay, that was what I was referring to when I mentioned a "hybrid
between the mlx5 and the Intel approach".  Makes sense.

> Note that we would have received 4096 frames before looking at the page
> count, so there is high chance both halves were consumed.
>
> To recap on x86 :
>
> 2048 active pages would be visible by the device, because 4096 RX desc
> would contain dma addresses pointing to the 4096 halves.
>
> And 2048 pages would be in the reserve.

The buffer info layout for something like that would probably be
pretty interesting.  Basically you would be doubling up the ring so
that you handle 2 Rx descriptors per a single buffer info since you
would automatically know that it would be an even/odd setup in terms
of the buffer offsets.

If you get a chance to do something like that I would love to know the
result.  Otherwise if I get a chance I can try messing with i40e or
ixgbe some time and see what kind of impact it has.

>> The whole idea behind using only half the page per descriptor is to
>> allow us to loop through the ring before we end up reusing it again.
>> That buys us enough time that usually the stack has consumed the frame
>> before we need it again.
>
>
> The same will happen really.
>
> Best maybe is for me to send the patch ;)

I think I have the idea now.  However patches are always welcome..  :-)


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Eric Dumazet
On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote:

> 
> Right but you were talking about using both halves one after the
> other.  If that occurs you have nothing left that you can reuse.  That
> was what I was getting at.  If you use up both halves you end up
> having to unmap the page.
> 

You must have misunderstood me.

Once we use both halves of a page, we _keep_ the page, we do not unmap
it.

We save the page pointer in a ring buffer of pages.
Call it the 'quarantine'

When we _need_ to replenish the RX desc, we take a look at the oldest
entry in the quarantine ring.

If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse
this saved page.

If not, _then_ we unmap and release the page.

Note that we would have received 4096 frames before looking at the page
count, so there is high chance both halves were consumed.

To recap on x86 :

2048 active pages would be visible by the device, because 4096 RX desc
would contain dma addresses pointing to the 4096 halves.

And 2048 pages would be in the reserve.
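
In code, the refill path would then look roughly like this (a sketch of
the scheme above, field names made up, not a tested patch):

	/* oldest page leaves the quarantine ring at refill time */
	page = ring->quarantine[ring->q_tail++ & ring->q_mask];

	if (page_ref_count(page) == 1) {
		/* both halves already released by the stack: reuse the
		 * page as-is, its DMA mapping is still valid */
	} else {
		/* stack still holds references: retire and replace */
		dma_unmap_page(dev, dma, PAGE_SIZE, DMA_FROM_DEVICE);
		put_page(page);
		page = alloc_page(GFP_ATOMIC);
		dma = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	}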


> The whole idea behind using only half the page per descriptor is to
> allow us to loop through the ring before we end up reusing it again.
> That buys us enough time that usually the stack has consumed the frame
> before we need it again.


The same will happen really.

Best maybe is for me to send the patch ;)




Re: [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages

2017-02-22 Thread Willem de Bruijn
>>
>> - page = alloc_page(gfp_mask);
>> + page = skb_frag_page(f);
>> + if (page_count(page) == 1) {
>> + skb_frag_ref(skb, i);
>
> This could be : get_page(page);

Ah, indeed. Thanks.

>
>> + goto copy_done;
>> + }
>> +
>> + if (f->size > PAGE_SIZE) {
>> + order = get_order(f->size);
>> + mask |= __GFP_COMP;
>
> Note that this would probably fail under memory pressure.
>
> We could instead try to explode the few segments into order-0 only
> pages.

Good point. I'll revise to use only order-0 here.
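
A rough shape of that order-0 fallback, copying each (possibly compound)
frag one page at a time (a sketch, not the revised patch):

	/* split an f->size byte frag across order-0 pages */
	for (off = 0; off < f->size; off += copy) {
		copy = min_t(u32, f->size - off, PAGE_SIZE);
		page = alloc_page(gfp_mask);
		if (!page)
			return -ENOMEM;
		vaddr = kmap_atomic(skb_frag_page(f));
		memcpy(page_address(page), vaddr + f->page_offset + off, copy);
		kunmap_atomic(vaddr);
		/* ... install page as a new order-0 frag ... */
	}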


Re: [PATCH] uapi: fix linux/rds.h userspace compilation errors

2017-02-22 Thread Santosh Shilimkar

On 2/22/2017 5:13 PM, Dmitry V. Levin wrote:

Consistently use types from linux/types.h to fix the following
linux/rds.h userspace compilation errors:

/usr/include/linux/rds.h:198:2: error: unknown type name 'u8'
  u8 rx_traces;
/usr/include/linux/rds.h:199:2: error: unknown type name 'u8'
  u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
/usr/include/linux/rds.h:203:2: error: unknown type name 'u8'
  u8 rx_traces;
/usr/include/linux/rds.h:204:2: error: unknown type name 'u8'
  u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
/usr/include/linux/rds.h:205:2: error: unknown type name 'u64'
  u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];

Fixes: 3289025a ("RDS: add receive message trace used by application")
Signed-off-by: Dmitry V. Levin 
---

This was part of the patch I submitted the other day with
the rest of the clean-up. Thanks, Dmitry.

Acked-by: Santosh Shilimkar 



[PATCH] uapi: fix linux/seg6.h and linux/seg6_iptunnel.h userspace compilation errors

2017-02-22 Thread Dmitry V. Levin
Include <linux/in6.h> in uapi/linux/seg6.h to fix the following
linux/seg6.h userspace compilation error:

/usr/include/linux/seg6.h:31:18: error: array type has incomplete element type 'struct in6_addr'
  struct in6_addr segments[0];

Include <linux/seg6.h> in uapi/linux/seg6_iptunnel.h to fix
the following linux/seg6_iptunnel.h userspace compilation error:

/usr/include/linux/seg6_iptunnel.h:26:21: error: array type has incomplete element type 'struct ipv6_sr_hdr'
  struct ipv6_sr_hdr srh[0];

Fixes: a50a05f4 ("ipv6: sr: add missing Kbuild export for header files")
Signed-off-by: Dmitry V. Levin 
---
 include/uapi/linux/seg6.h  | 1 +
 include/uapi/linux/seg6_iptunnel.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/uapi/linux/seg6.h b/include/uapi/linux/seg6.h
index 61df8d3..7278511 100644
--- a/include/uapi/linux/seg6.h
+++ b/include/uapi/linux/seg6.h
@@ -15,6 +15,7 @@
 #define _UAPI_LINUX_SEG6_H
 
 #include <linux/types.h>
+#include <linux/in6.h> /* For struct in6_addr. */
 
 /*
  * SRH
diff --git a/include/uapi/linux/seg6_iptunnel.h b/include/uapi/linux/seg6_iptunnel.h
index 7a7183d..b6e5a0a 100644
--- a/include/uapi/linux/seg6_iptunnel.h
+++ b/include/uapi/linux/seg6_iptunnel.h
@@ -14,6 +14,8 @@
 #ifndef _UAPI_LINUX_SEG6_IPTUNNEL_H
 #define _UAPI_LINUX_SEG6_IPTUNNEL_H
 
+#include <linux/seg6.h> /* For struct ipv6_sr_hdr. */
+
 enum {
SEG6_IPTUNNEL_UNSPEC,
SEG6_IPTUNNEL_SRH,
-- 
ldv


[PATCH] uapi: fix linux/rds.h userspace compilation errors

2017-02-22 Thread Dmitry V. Levin
Consistently use types from linux/types.h to fix the following
linux/rds.h userspace compilation errors:

/usr/include/linux/rds.h:198:2: error: unknown type name 'u8'
  u8 rx_traces;
/usr/include/linux/rds.h:199:2: error: unknown type name 'u8'
  u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
/usr/include/linux/rds.h:203:2: error: unknown type name 'u8'
  u8 rx_traces;
/usr/include/linux/rds.h:204:2: error: unknown type name 'u8'
  u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
/usr/include/linux/rds.h:205:2: error: unknown type name 'u64'
  u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];

Fixes: 3289025a ("RDS: add receive message trace used by application")
Signed-off-by: Dmitry V. Levin 
---
 include/uapi/linux/rds.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 47c03ca..198892b 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -195,14 +195,14 @@ enum rds_message_rxpath_latency {
 };
 
 struct rds_rx_trace_so {
-   u8 rx_traces;
-   u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
+   __u8 rx_traces;
+   __u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
 };
 
 struct rds_cmsg_rx_trace {
-   u8 rx_traces;
-   u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
-   u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
+   __u8 rx_traces;
+   __u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
+   __u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
 };
 
 /*
-- 
ldv


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Alexander Duyck
On Wed, Feb 22, 2017 at 10:21 AM, Eric Dumazet  wrote:
> On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote:
>> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet  wrote:
>> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
>> >> Use of order-3 pages is problematic in some cases.
>> >>
>> >> This patch might add three kinds of regression :
>> >>
>> >> 1) a CPU performance regression, but we will add later page
>> >> recycling and performance should be back.
>> >>
>> >> 2) TCP receiver could grow its receive window slightly slower,
>> >>because skb->len/skb->truesize ratio will decrease.
>> >>This is mostly ok, we prefer being conservative to not risk OOM,
>> >>and eventually tune TCP better in the future.
>> >>This is consistent with other drivers using 2048 per ethernet frame.
>> >>
>> >> 3) Because we allocate one page per RX slot, we consume more
>> >>memory for the ring buffers. XDP already had this constraint anyway.
>> >>
>> >> Signed-off-by: Eric Dumazet 
>> >> ---
>> >
>> > Note that we also could use a different strategy.
>> >
>> > Assume RX rings of 4096 entries/slots.
>> >
>> > With this patch, mlx4 gets the strategy used by Alexander in Intel
>> > drivers :
>> >
>> > Each RX slot has an allocated page, and uses half of it, flipping to the
>> > other half every time the slot is used.
>> >
>> > So a ring buffer of 4096 slots allocates 4096 pages.
>> >
>> > When we receive a packet train for the same flow, GRO builds an skb with
>> > ~45 page frags, all from different pages.
>> >
>> > The put_page() done from skb_release_data() touches ~45 different struct
>> > page cache lines, and show a high cost. (compared to the order-3 used
>> > today by mlx4, this adds extra cache line misses and stalls for the
>> > consumer)
>> >
>> > If we instead try to use the two halves of one page on consecutive RX
>> > slots, we might instead cook skb with the same number of MSS (45), but
>> > half the number of cache lines for put_page(), so we should speed up the
>> > consumer.
>>
>> So there is a problem that is being overlooked here.  That is the cost
>> of the DMA map/unmap calls.  The problem is many PowerPC systems have
>> an IOMMU that you have to work around, and that IOMMU comes at a heavy
>> cost for every map/unmap call.  So unless you are saying you want to
>> set up a hybrid between the mlx5 and this approach where we have a page
>> cache that these all fall back into, you will take a heavy cost for
>> having to map and unmap pages.
>>
>> The whole reason why I implemented the Intel page reuse approach the
>> way I did is to try and mitigate the IOMMU issue, it wasn't so much to
>> resolve allocator/freeing expense.  Basically the allocator scales,
>> the IOMMU does not.  So any solution would require making certain that
>> we can leave the pages pinned in the DMA to avoid having to take the
>> global locks involved in accessing the IOMMU.
>
>
> I do not see any difference for the fact that we keep pages mapped the
> same way.
>
> mlx4_en_complete_rx_desc() will still use the :
>
> dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
>   frag_size, priv->dma_dir);
>
> for every single MSS we receive.
>
> This wont change.

Right but you were talking about using both halves one after the
other.  If that occurs you have nothing left that you can reuse.  That
was what I was getting at.  If you use up both halves you end up
having to unmap the page.

The whole idea behind using only half the page per descriptor is to
allow us to loop through the ring before we end up reusing it again.
That buys us enough time that usually the stack has consumed the frame
before we need it again.
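
Roughly, per descriptor it goes like this (a sketch loosely after the
ixgbe logic; reuse_rx_page() stands in for the driver's recycling helper):

	/* use the other half of the page on the next pass ... */
	rx_buffer->page_offset ^= PAGE_SIZE / 2;

	/* ... and only recycle once the stack has dropped its ref */
	if (likely(page_ref_count(page) == 1)) {
		page_ref_inc(page);	/* re-arm our reference */
		reuse_rx_page(rx_ring, rx_buffer);
	} else {
		dma_unmap_page(rx_ring->dev, rx_buffer->dma,
			       PAGE_SIZE, DMA_FROM_DEVICE);
	}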

- Alex


[PATCH] bpf: fix spelling mistake: "proccessed" -> "processed"

2017-02-22 Thread Colin King
From: Colin Ian King 

trivial fix to spelling mistake in verbose log message

Signed-off-by: Colin Ian King 
---
 kernel/bpf/verifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d2bded2..3fc6e39 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2776,7 +2776,7 @@ static int do_check(struct bpf_verifier_env *env)
class = BPF_CLASS(insn->code);
 
if (++insn_processed > BPF_COMPLEXITY_LIMIT_INSNS) {
-   verbose("BPF program is too large. Proccessed %d 
insn\n",
+   verbose("BPF program is too large. Processed %d insn\n",
insn_processed);
return -E2BIG;
}
-- 
2.10.2



Re: [PATCH V5 2/2] qedf: Add QLogic FastLinQ offload FCoE driver framework.

2017-02-22 Thread Martin K. Petersen
> "Chad" == Dupuis, Chad  writes:

Chad> The QLogic FastLinQ Driver for FCoE (qedf) is the FCoE specific
Chad> module for 41000 Series Converged Network Adapters by QLogic. This
Chad> patch consists of following changes:

Now that Linus pulled Dave's tree I have gone ahead and merged this into
4.11/scsi-fixes.

-- 
Martin K. Petersen  Oracle Linux Engineering


RE: create drivers/net/mdio and move mdio drivers into it

2017-02-22 Thread YUAN Linyu


> -Original Message-
> From: Andrew Lunn [mailto:and...@lunn.ch]
> Sent: Wednesday, February 22, 2017 6:21 PM
> To: YUAN Linyu
> Cc: Florian Fainelli; David S . Miller; netdev@vger.kernel.org; cug...@163.com
> Subject: Re: create drivers/net/mdio and move mdio drivers into it
> 
> On Wed, Feb 22, 2017 at 05:38:49AM +, YUAN Linyu wrote:
> > Hi Florian,
> >
> > 1.
> > Let's go back to original topic,
> > Can we move all mdio drivers into drivers/net/mdio ?
> 
> Hi Yuan
> 
> Please could you explain what benefit this brings. Please also list
> all the downsides for such a move. As Florian said, we need to ensure
> such a move adds more value than it removes.
At the beginning I thought mdio and phy were two different things;
mdio should have its own home.

> 
> > Per may understanding,
> > I don't know why create a struct mii_bus instance to represent a mdio device
> in current mdio driver.
> > Why not create a struct mdio_device instance, it's easy to understand.
> > (We can move part of member of mii_bus to mdio_device).
> 
> Please take a step back. What are you trying to achieve? What is the
> big picture? What can you not do with the current design?
The big picture is that we can remove struct mii_bus and use struct
mdio_device/mdio_driver for the mdio controller.

> 
> Andrew


[PATCH] rtlwifi: fix spelling mistake: "conuntry" -> "country"

2017-02-22 Thread Colin King
From: Colin Ian King 

trivial fix to spelling mistake in RT_TRACE message

Signed-off-by: Colin Ian King 
---
 drivers/net/wireless/realtek/rtlwifi/regd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/realtek/rtlwifi/regd.c b/drivers/net/wireless/realtek/rtlwifi/regd.c
index 558c31b..1bf3eb2 100644
--- a/drivers/net/wireless/realtek/rtlwifi/regd.c
+++ b/drivers/net/wireless/realtek/rtlwifi/regd.c
@@ -435,7 +435,7 @@ int rtl_regd_init(struct ieee80211_hw *hw,
channel_plan_to_country_code(rtlpriv->efuse.channel_plan);
 
RT_TRACE(rtlpriv, COMP_REGD, DBG_DMESG,
-"rtl: EEPROM regdomain: 0x%0x conuntry code: %d\n",
+"rtl: EEPROM regdomain: 0x%0x country code: %d\n",
 rtlpriv->efuse.channel_plan, rtlpriv->regd.country_code);
 
if (rtlpriv->regd.country_code >= COUNTRY_CODE_MAX) {
-- 
2.10.2



Re: linux-next: build failure after merge of the net-next tree

2017-02-22 Thread Stephen Rothwell
Hi all,

On Tue, 10 Jan 2017 10:59:27 +1100 Stephen Rothwell wrote:
>
> After merging the net-next tree, today's linux-next build (x86_64
> allmodconfig) failed like this:
> 
> net/smc/af_smc.c: In function 'smc_splice_read':
> net/smc/af_smc.c:1258:39: error: passing argument 1 of 'smc->clcsock->ops->splice_read' from incompatible pointer type [-Werror=incompatible-pointer-types]
>rc = smc->clcsock->ops->splice_read(smc->clcsock, ppos,
>^
> net/smc/af_smc.c:1258:39: note: expected 'struct file *' but argument is of type 'struct socket *'
> net/smc/af_smc.c: At top level:
> net/smc/af_smc.c:1288:17: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
>   .splice_read = smc_splice_read,   
>  ^
> net/smc/af_smc.c:1288:17: note: (near initialization for 'smc_sock_ops.splice_read')
> 
> Caused by commit
> 
>   ac7138746e14 ("smc: establish new socket family")
> 
> interacting with commit
> 
>   15a8f657c71d ("switch socket ->splice_read() to struct file *")
> 
> from the vfs tree.
> 
> I applied the following merge fix patch which could well be incorrect ...
> 
> From: Stephen Rothwell 
> Date: Tue, 10 Jan 2017 10:52:38 +1100
> Subject: [PATCH] smc: merge fix for "switch socket ->splice_read() to struct file *"
> 
> Signed-off-by: Stephen Rothwell 
> ---
>  net/smc/af_smc.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
> index 5d4208ad029e..4875e65f0c4a 100644
> --- a/net/smc/af_smc.c
> +++ b/net/smc/af_smc.c
> @@ -1242,10 +1242,11 @@ static ssize_t smc_sendpage(struct socket *sock, struct page *page,
>   return rc;
>  }
>  
> -static ssize_t smc_splice_read(struct socket *sock, loff_t *ppos,
> +static ssize_t smc_splice_read(struct file *file, loff_t *ppos,
>  struct pipe_inode_info *pipe, size_t len,
>   unsigned int flags)
>  {
> + struct socket *sock = file->private_data;
>   struct sock *sk = sock->sk;
>   struct smc_sock *smc;
>   int rc = -ENOTCONN;
> @@ -1255,7 +1256,7 @@ static ssize_t smc_splice_read(struct socket *sock, loff_t *ppos,
>   if ((sk->sk_state != SMC_ACTIVE) && (sk->sk_state != SMC_CLOSED))
>   goto out;
>   if (smc->use_fallback) {
> - rc = smc->clcsock->ops->splice_read(smc->clcsock, ppos,
> + rc = smc->clcsock->ops->splice_read(file, ppos,
>   pipe, len, flags);
>   } else {
>   rc = -EOPNOTSUPP;
> -- 
> 2.10.2

This fix up is now needed when the vfs tree is merged with Linus' tree.

-- 
Cheers,
Stephen Rothwell


[PATCH] net: realtek: 8139too: use new api ethtool_{get|set}_link_ksettings

2017-02-22 Thread Philippe Reynes
The ethtool API {get|set}_settings is deprecated.
We move this driver to the new API {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if
someone could test this patch.

Signed-off-by: Philippe Reynes 
---
 drivers/net/ethernet/realtek/8139too.c |   14 --
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/realtek/8139too.c 
b/drivers/net/ethernet/realtek/8139too.c
index 8963175..ca22f28 100644
--- a/drivers/net/ethernet/realtek/8139too.c
+++ b/drivers/net/ethernet/realtek/8139too.c
@@ -2384,21 +2384,23 @@ static void rtl8139_get_drvinfo(struct net_device *dev, 
struct ethtool_drvinfo *
strlcpy(info->bus_info, pci_name(tp->pci_dev), sizeof(info->bus_info));
 }
 
-static int rtl8139_get_settings(struct net_device *dev, struct ethtool_cmd 
*cmd)
+static int rtl8139_get_link_ksettings(struct net_device *dev,
+ struct ethtool_link_ksettings *cmd)
 {
struct rtl8139_private *tp = netdev_priv(dev);
spin_lock_irq(&tp->lock);
-   mii_ethtool_gset(&tp->mii, cmd);
+   mii_ethtool_get_link_ksettings(&tp->mii, cmd);
spin_unlock_irq(&tp->lock);
return 0;
 }
 
-static int rtl8139_set_settings(struct net_device *dev, struct ethtool_cmd 
*cmd)
+static int rtl8139_set_link_ksettings(struct net_device *dev,
+ const struct ethtool_link_ksettings *cmd)
 {
struct rtl8139_private *tp = netdev_priv(dev);
int rc;
spin_lock_irq(&tp->lock);
-   rc = mii_ethtool_sset(&tp->mii, cmd);
+   rc = mii_ethtool_set_link_ksettings(&tp->mii, cmd);
spin_unlock_irq(&tp->lock);
return rc;
 }
@@ -2480,8 +2482,6 @@ static void rtl8139_get_strings(struct net_device *dev, 
u32 stringset, u8 *data)
 
 static const struct ethtool_ops rtl8139_ethtool_ops = {
.get_drvinfo= rtl8139_get_drvinfo,
-   .get_settings   = rtl8139_get_settings,
-   .set_settings   = rtl8139_set_settings,
.get_regs_len   = rtl8139_get_regs_len,
.get_regs   = rtl8139_get_regs,
.nway_reset = rtl8139_nway_reset,
@@ -2493,6 +2493,8 @@ static void rtl8139_get_strings(struct net_device *dev, 
u32 stringset, u8 *data)
.get_strings= rtl8139_get_strings,
.get_sset_count = rtl8139_get_sset_count,
.get_ethtool_stats  = rtl8139_get_ethtool_stats,
+   .get_link_ksettings = rtl8139_get_link_ksettings,
+   .set_link_ksettings = rtl8139_set_link_ksettings,
 };
 
 static int netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
-- 
1.7.4.4



[PATCH] uapi: fix linux/llc.h userspace compilation error

2017-02-22 Thread Dmitry V. Levin
Include <linux/if.h> to fix the following linux/llc.h userspace
compilation error:

/usr/include/linux/llc.h:26:27: error: 'IFHWADDRLEN' undeclared here (not in a 
function)
  unsigned char   sllc_mac[IFHWADDRLEN];

Signed-off-by: Dmitry V. Levin 
---
 include/uapi/linux/llc.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/llc.h b/include/uapi/linux/llc.h
index 9c987a4..a6c17f6 100644
--- a/include/uapi/linux/llc.h
+++ b/include/uapi/linux/llc.h
@@ -14,6 +14,7 @@
 #define _UAPI__LINUX_LLC_H
 
 #include <linux/socket.h>
+#include <linux/if.h>  /* For IFHWADDRLEN. */
 
 #define __LLC_SOCK_SIZE__ 16   /* sizeof(sockaddr_llc), word align. */
 struct sockaddr_llc {
-- 
ldv


[PATCH] uapi: fix linux/ip6_tunnel.h userspace compilation errors

2017-02-22 Thread Dmitry V. Levin
Include <linux/if.h> and <linux/in6.h> to fix the following
linux/ip6_tunnel.h userspace compilation errors:

/usr/include/linux/ip6_tunnel.h:23:12: error: 'IFNAMSIZ' undeclared here (not 
in a function)
  char name[IFNAMSIZ]; /* name of tunnel device */
/usr/include/linux/ip6_tunnel.h:30:18: error: field 'laddr' has incomplete type
  struct in6_addr laddr; /* local tunnel end-point address */

Signed-off-by: Dmitry V. Levin 
---
 include/uapi/linux/ip6_tunnel.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/ip6_tunnel.h b/include/uapi/linux/ip6_tunnel.h
index 48af63c..425926c 100644
--- a/include/uapi/linux/ip6_tunnel.h
+++ b/include/uapi/linux/ip6_tunnel.h
@@ -2,6 +2,8 @@
 #define _IP6_TUNNEL_H
 
 #include <linux/types.h>
+#include <linux/if.h>  /* For IFNAMSIZ. */
+#include <linux/in6.h> /* For struct in6_addr. */
 
 #define IPV6_TLV_TNL_ENCAP_LIMIT 4
 #define IPV6_DEFAULT_TNL_ENCAP_LIMIT 4
-- 
ldv


VXLAN RCU error

2017-02-22 Thread Jakub Kicinski
Hi Roopa!

I get this RCU error on net 12d656af4e3d2781b9b9f52538593e1717e7c979:

[ 1571.067134] ===
[ 1571.071842] [ ERR: suspicious RCU usage.  ]
[ 1571.076546] 4.10.0-debug-03232-g12d656af4e3d #1 Tainted: GW  O   
[ 1571.084166] ---
[ 1571.088867] ../drivers/net/vxlan.c:2111 suspicious rcu_dereference_check() 
usage!
[ 1571.097286] 
[ 1571.097286] other info that might help us debug this:
[ 1571.097286] 
[ 1571.106305] 
[ 1571.106305] rcu_scheduler_active = 2, debug_locks = 1
[ 1571.113654] 3 locks held by ping/13826:
[ 1571.117968]  #0:  (sk_lock-AF_INET){+.+.+.}, at: [] 
raw_sendmsg+0x14e2/0x2e40
[ 1571.127758]  #1:  (rcu_read_lock_bh){..}, at: [] 
ip_finish_output2+0x274/0x1390
[ 1571.138135]  #2:  (rcu_read_lock_bh){..}, at: [] 
__dev_queue_xmit+0x1ec/0x2750
[ 1571.148408] 
[ 1571.148408] stack backtrace:
[ 1571.153326] CPU: 10 PID: 13826 Comm: ping Tainted: GW  O
4.10.0-debug-03232-g12d656af4e3d #1
[ 1571.163877] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 
11/08/2016
[ 1571.172290] Call Trace:
[ 1571.175053]  dump_stack+0xcd/0x134
[ 1571.178881]  ? _atomic_dec_and_lock+0xcc/0xcc
[ 1571.183782]  ? print_lock+0xb2/0xb5
[ 1571.187711]  lockdep_rcu_suspicious+0x123/0x170
[ 1571.192807]  vxlan_xmit_one+0x1931/0x4270 [vxlan]
[ 1571.198126]  ? encap_bypass_if_local+0x380/0x380 [vxlan]
[ 1571.204109]  ? sched_clock+0x9/0x10
[ 1571.208034]  ? sched_clock_cpu+0x20/0x2c0
[ 1571.212541]  ? unwind_get_return_address+0x1b8/0x2b0
[ 1571.218132]  ? __lock_acquire+0x6d6/0x3160
[ 1571.222740]  vxlan_xmit+0x756/0x4f90 [vxlan]
[ 1571.227541]  ? vxlan_xmit_one+0x4270/0x4270 [vxlan]
[ 1571.233014]  ? netif_skb_features+0x2be/0xba0
[ 1571.237919]  dev_hard_start_xmit+0x1ab/0xa70
[ 1571.242724]  __dev_queue_xmit+0x137b/0x2750
[ 1571.247425]  ? __dev_queue_xmit+0x1ec/0x2750
[ 1571.252228]  ? netdev_pick_tx+0x330/0x330
[ 1571.256735]  ? debug_smp_processor_id+0x17/0x20
[ 1571.261826]  ? get_lock_stats+0x1d/0x160
[ 1571.266241]  ? mark_held_locks+0x105/0x280
[ 1571.270850]  ? memcpy+0x45/0x50
[ 1571.274391]  dev_queue_xmit+0x10/0x20
[ 1571.278511]  neigh_resolve_output+0x43e/0x7f0
[ 1571.283405]  ? ip_finish_output2+0x69d/0x1390
[ 1571.288308]  ip_finish_output2+0x69d/0x1390
[ 1571.293008]  ? ip_finish_output2+0x274/0x1390
[ 1571.297909]  ? ip_copy_metadata+0x7e0/0x7e0
[ 1571.302610]  ? get_lock_stats+0x1d/0x160
[ 1571.307027]  ip_finish_output+0x598/0xc50
[ 1571.311537]  ip_output+0x371/0x630
[ 1571.315362]  ? ip_output+0x1dc/0x630
[ 1571.319383]  ? ip_mc_output+0xe70/0xe70
[ 1571.323694]  ? kfree+0x372/0x5a0
[ 1571.327325]  ? mark_held_locks+0x105/0x280
[ 1571.331933]  ? __ip_make_skb+0xdd1/0x2200
[ 1571.336457]  ip_local_out+0x8f/0x180
[ 1571.340480]  ip_send_skb+0x44/0xf0
[ 1571.344306]  ip_push_pending_frames+0x5a/0x80
[ 1571.349203]  raw_sendmsg+0x164d/0x2e40
[ 1571.353422]  ? debug_check_no_locks_freed+0x350/0x350
[ 1571.359099]  ? dst_output+0x1b0/0x1b0
[ 1571.363217]  ? get_lock_stats+0x1d/0x160
[ 1571.367640]  ? __might_fault+0x199/0x230
[ 1571.372052]  ? kasan_check_write+0x14/0x20
[ 1571.382002]  ? _copy_from_user+0xb9/0x130
[ 1571.386513]  ? rw_copy_check_uvector+0x8d/0x490
[ 1571.391609]  ? import_iovec+0xae/0x5d0
[ 1571.395826]  ? push_pipe+0xd00/0xd00
[ 1571.399847]  ? kasan_check_write+0x14/0x20
[ 1571.404450]  ? _copy_from_user+0xb9/0x130
[ 1571.408960]  inet_sendmsg+0x19f/0x5f0
[ 1571.413071]  ? inet_recvmsg+0x980/0x980
[ 1571.417386]  sock_sendmsg+0xe2/0x170
[ 1571.421408]  ___sys_sendmsg+0x66e/0x960
[ 1571.425726]  ? mem_cgroup_commit_charge+0x144/0x2720
[ 1571.431303]  ? copy_msghdr_from_user+0x610/0x610
[ 1571.436495]  ? debug_smp_processor_id+0x17/0x20
[ 1571.441584]  ? get_lock_stats+0x1d/0x160
[ 1571.445995]  ? mem_cgroup_uncharge_swap+0x250/0x250
[ 1571.451474]  ? page_add_new_anon_rmap+0x173/0x3a0
[ 1571.456762]  ? handle_mm_fault+0x1589/0x3820
[ 1571.461566]  ? handle_mm_fault+0x1589/0x3820
[ 1571.466362]  ? handle_mm_fault+0x191/0x3820
[ 1571.471070]  ? __fdget+0x13/0x20
[ 1571.474702]  ? get_lock_stats+0x1d/0x160
[ 1571.479116]  __sys_sendmsg+0xc6/0x150
[ 1571.483234]  ? SyS_shutdown+0x1b0/0x1b0
[ 1571.487551]  ? __do_page_fault+0x556/0xe50
[ 1571.492158]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[ 1571.497340]  SyS_sendmsg+0x12/0x20
[ 1571.501166]  entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1571.506354] RIP: 0033:0x7fca2d0384a0
[ 1571.510374] RSP: 002b:7ffd18d7fe88 EFLAGS: 0246 ORIG_RAX: 
002e
[ 1571.518886] RAX: ffda RBX: 0040 RCX: 7fca2d0384a0
[ 1571.526889] RDX:  RSI: 0060a300 RDI: 0003
[ 1571.534892] RBP: 0046 R08: 0020 R09: 003e
[ 1571.542897] R10: 7ffd18d7fc50 R11: 0246 R12: 00c0
[ 1571.550900] R13: 0004 R14: 7ffd18d81608 R15: 7ffd18d810b0

Some of Netronome's VXLAN tests are also failing but I need to dig a
bit to see what's wrong.
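
For illustration, the class of bug this splat points to comes down to using the
wrong rcu_dereference() flavor for the read-side lock actually held. A minimal
sketch with invented struct and field names (this is not vxlan's actual code):

/* Hypothetical sketch of the lockdep complaint above -- not vxlan code. */
struct cfg { int mtu; };
struct priv { struct cfg __rcu *cfg; };

static int read_mtu(struct priv *p)
{
	struct cfg *c;
	int mtu;

	rcu_read_lock_bh();
	/* Plain rcu_dereference() asserts rcu_read_lock_held(), but only
	 * the _bh flavor of the read lock is held here, so lockdep prints
	 * "suspicious rcu_dereference_check() usage!".
	 */
	c = rcu_dereference(p->cfg);	/* triggers the splat */
	c = rcu_dereference_bh(p->cfg);	/* the matching accessor */
	mtu = c ? c->mtu : -1;
	rcu_read_unlock_bh();
	return mtu;
}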

Re: Focusing the XDP project

2017-02-22 Thread Tom Herbert
On Wed, Feb 22, 2017 at 1:43 PM, Jesper Dangaard Brouer
 wrote:
> On Wed, 22 Feb 2017 09:22:53 -0800
> Tom Herbert  wrote:
>
>> On Wed, Feb 22, 2017 at 1:43 AM, Jesper Dangaard Brouer
>>  wrote:
>> >
>> > On Tue, 21 Feb 2017 14:54:35 -0800 Tom Herbert  
>> > wrote:
>> >> On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed 
>> >>  wrote:
>> > [...]
>> >> > The only complexity XDP is adding to the drivers is the constraints on
>> >> > RX memory management and memory model, calling the XDP program itself
>> >> > and handling the action is really a simple thing once you have the
>> >> > correct memory model.
>> >
>> > Exactly, that is why I've been looking at introducing a generic
>> > facility for a memory model for drivers.  This should help simplify
>> > drivers.  Due to performance needs this needs to be a very thin API layer
>> > on top of the page allocator. (That's why I'm working with Mel Gorman
>> > to get closer integration with the page allocator, e.g. a bulking
>> > facility).
>> >
>> >> > Who knows! maybe someday XDP will define one unified RX API for all
>> >> > drivers and it even will handle normal stack delivery it self :).
>> >> >
>> >> That's exactly the point and what we need for TXDP. I'm missing why
>> >> doing this is such rocket science other than the fact that all these
>> >> drivers are vastly different and changing the existing API is
>> >> unpleasant. The only functional complexity I see in creating a generic
>> >> batching interface is handling return codes asynchronously. This is
>> >> entirely feasible though...
>> >
>> > I'll be happy as long as we get a batching interface, then we can
>> > incrementally do the optimizations later.
>> >
>> > In the future, I do hope (like Saeed) this RX API will evolve into
>> > delivering (a bulk of) raw-packet-pages into the netstack, this should
>> > simplify drivers, and we can keep the complexity and SKB allocations
>> > out of the drivers.
>> > To start with, we can play with doing this delivering (a bulk of)
>> > raw-packet-pages into Tom's TXDP engine/system?
>> >
>> Hi Jesper,
>>
>> Maybe we can start to narrow in on what a batching API might look like.
>>
>> Looking at mlx5 (as a model of how XDP is implemented) the main RX
>> loop in mlx5e_poll_rx_cq calls the backend handler in one indirect
>> function call. The XDP path goes through mlx5e_handle_rx_cqe,
>> skb_from_cqe, and mlx5e_xdp_handle. The first two deal a lot with
>> building the skbuff. As a prerequisite to RX batching it would be
>> helpful if this could be flattened so that most of the logic is obvious
>> in the main RX loop.
>
> I fully agree here, it would be helpful to flatten out.  The mlx5
> driver is a bit hard to follow in that respect.  Saeed has already
> sent me some off-list patches, where some of this code gets
> restructured. In one of the patches the RX stages get flattened out
> some more.  We are currently benchmarking this patchset, and depending
> on CPU it is either a small win or a small (7ns) regression (on the newest
> CPUs).
>
Cool!

>
>> The model of RX batching seems straightforward enough-- pull packets
>> from the ring, save xdp_data information in a vector, periodically
>> call into the stack to handle a batch where one argument is the vector of
>> packets and another argument is an output vector that gives return
>> codes (XDP actions), then process each return code for each packet in
>> the driver accordingly.
>
> Yes, exactly.  I did imagine that (maybe) the input vector of packets
> could have room for the return codes (XDP actions) next to the packet
> pointer?
>
Whichever way is more efficient, I suppose. The important point is
that the return code should be the only thing returned to the
driver.

>
>> Presumably, there is a maximum allowed batch
>> that may or may not be the same as the NAPI budget, so the
>> batching call needs to be done when the limit is reached and also before
>> exiting NAPI.
>
> In my PoC code that Saeed is working on, we have a smaller batch
> size (10), and prefetch to L2 cache (like DPDK does), based on the
> theory that we don't want to stress the L2 cache usage, and that these
> CPUs usually have a Line Fill Buffer (LFB) that is limited to 10
> outstanding cache-lines.
>
> I don't know if this artificially smaller batch size is the right thing,
> as DPDK always prefetches to L2 cache all 32 packets on RX.  And snabb
> uses batches of 100 packets per "breath".
>
Maybe make it configurable :-)

>
>> For each packet the stack can return an XDP code,
>> XDP_PASS in this case could be interpreted as being consumed by the
>> stack; this would be used in the case the stack creates an skbuff for
>> the packet. The stack on its part can process the batch how it sees
>> fit: it can process each packet individually in the canonical model, or
>> we can continue processing a batch in a VPP-like fashion.
>
> Agree.
>

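To make the shape of the interface discussed in this thread concrete, here is a
rough sketch of a driver-side batched RX loop. Every name in it (rx_batch,
stack_rx_batch, the ring_* helpers, recycle_rx_page) is hypothetical and not an
existing kernel API; only the XDP_* action codes are real:

#define RX_BATCH_MAX	16

struct rx_batch {
	void		*data[RX_BATCH_MAX];	/* packet start pointers */
	unsigned int	len[RX_BATCH_MAX];	/* packet lengths */
	int		verdict[RX_BATCH_MAX];	/* XDP action per packet */
	int		count;
};

static void rx_batch_flush(struct rx_batch *b)
{
	int i;

	stack_rx_batch(b);	/* hypothetical single call into the stack */

	for (i = 0; i < b->count; i++) {
		switch (b->verdict[i]) {
		case XDP_PASS:	/* consumed: the stack built an skb */
			break;
		case XDP_TX:
			ring_xmit(b->data[i], b->len[i]);	/* hypothetical */
			break;
		default:	/* XDP_DROP, XDP_ABORTED, ... */
			recycle_rx_page(b->data[i]);		/* hypothetical */
		}
	}
	b->count = 0;
}

/* NAPI poll: fill the vector, flush when full and before exiting NAPI. */
static int batch_poll(struct napi_struct *napi, int budget)
{
	struct rx_batch batch = { .count = 0 };
	int done = 0;

	while (done < budget && ring_has_work()) {	/* hypothetical */
		batch.data[batch.count] =
			ring_next_packet(&batch.len[batch.count]);
		if (++batch.count == RX_BATCH_MAX)
			rx_batch_flush(&batch);
		done++;
	}
	rx_batch_flush(&batch);	/* don't leave packets behind */
	return done;
}
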
Re: Questions on XDP

2017-02-22 Thread Jesper Dangaard Brouer
On Wed, 22 Feb 2017 09:08:53 -0800
John Fastabend  wrote:

> > GSO/TSO is getting into advanced stuff I would rather not have to get
> > into right now.  I figure we need to take this portion one step at a
> > time.  To support GSO we need more information like the mss.
> >   
> 
> Agreed lets get the driver support for basic things first. But this
> is on my list. I'm just repeating myself but VM to VM performance uses
> TSO/LRO heavily.

Sorry, but I get annoyed every time I hear we need to support
TSO/LRO/GRO for performance reasons.  If you take one step back, you
are actually saying we need bulking for better performance.  And the
bulking you are proposing is a TCP protocol specific bulking mechanism.

What I'm saying is: let's make bulking protocol-agnostic, by doing it at the
packet level.  And once the bulk enters the VM, by all means it should
construct a GRO packet it can send into its own network stack.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH net V2 0/5] mlx4 misc fixes

2017-02-22 Thread David Miller
From: Tariq Toukan 
Date: Wed, 22 Feb 2017 18:25:24 +0200

> This patchset contains misc bug fixes from Eric Dumazet and our team
> to the mlx4 Core and Eth drivers.
> 
> Series generated against net commit:
> 00ea1ceebe0d ipv6: release dst on error in ip6_dst_lookup_tail
> 
> Thanks,
> Tariq.
> 
> v2:
> * Added Eric's fix (patch 5/5).

This doesn't apply cleanly to the net tree, please respin.

Thanks.


Re: [PATCH net-next] net/gtp: Add udp source port generation according to flow hash

2017-02-22 Thread Tom Herbert
On Wed, Feb 22, 2017 at 1:29 PM, Or Gerlitz  wrote:
> On Thu, Feb 16, 2017 at 11:58 PM, Andreas Schultz  wrote:
>> Hi Or,
>> - On Feb 16, 2017, at 3:59 PM, Or Gerlitz ogerl...@mellanox.com wrote:
>>
>>> Generate the source udp header according to the flow represented by
>>> the packet we are encapsulating, as done for other udp tunnels. This
>>> helps on the receiver side to apply RSS spreading.
>>
>> This might work for GTPv0-U, However, for GTPv1-U this could interfere
>> with error handling in the user space control process when the UDP port
>> extension  header is used in error indications.
>
>
> in the document you posted there's this quote "The source IP and port
> have no meaning and can change at any time" -- I assume it refers to
> v0? can we identify in the kernel code that we're on v0 and have the
> patch come into play?
>
>> 3GPP TS 29.281 Rel 13, section 5.2.2.1 defines the UDP port extension and
>> section 7.3.1 says that the UDP source port extension can be used to
>> mitigate DOS attacks. This would IMHO imply that the user space control
>> process needs to know the TEID to UDP source port mapping.
>
>> The other question is, on what is this actually hashing. If I understand
>> the code correctly, this will hash on the source/destination of the original
>> flow. I would expect that a SGSN/SGW/eNodeB would like to keep flow
>> processing on a per-TEID basis, so the port hashing should be based on
>> the TEID.
>
> is it possible for packets belonging to the same TCP session or UDP
> "pseudo session" (given pair of src/dst ip/port) to be encapsulated
> using different TEID?
>
> hashing on the TEID imposes a harder requirement on the NIC HW vs.
> just UDP based RSS.

This shouldn't be taken as a HW requirement and it's unlikely we'd add
explicit GTP support in flow_dissector. If we can't get entropy in the
UDP source port then IPv6 flow label is a potential alternative (so
that should be supported in NICs for RSS).

I'll also reiterate my previous point about the need for GTP testing--
in order for us to be able to evaluate the GTP datapath for things
like performance or how they withstand against DDOS we really need an
easy way to isolate the datapath.

Tom
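
For reference, generating the source port from the flow hash is the same
mechanism other UDP tunnels already use. A sketch built on the existing
udp_flow_src_port() helper from include/net/udp.h (the wrapper function and
its name are illustrative, not the patch's actual code):

/* Sketch: derive the outer UDP source port from the flow hash so that
 * receivers can apply RSS on the outer header, as vxlan does.
 */
static __be16 tunnel_pick_sport(struct net_device *dev, struct sk_buff *skb)
{
	/* min/max of 0 selects the kernel's ephemeral port range;
	 * use_eth=true hashes the Ethernet header if the skb carries
	 * no flow hash yet.
	 */
	return udp_flow_src_port(dev_net(dev), skb, 0, 0, true);
}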


Re: [PATCH] fjes: Move fjes driver info message into fjes_acpi_add()

2017-02-22 Thread Yasuaki Ishimatsu

Thank you for the quick response.
I'll think of another solution.

Thanks,
Yasuaki Ishimatsu

On 02/22/2017 03:45 PM, David Miller wrote:

From: Yasuaki Ishimatsu 
Date: Wed, 22 Feb 2017 15:40:49 -0500


To avoid the confusion, the patch moves the message into
fjes_acpi_add() so that it is shown only when fjes_acpi_add()
succeeds.


This change means it'll never be printed for platform driver matches,
which is even worse than what we have now.



Re: Focusing the XDP project

2017-02-22 Thread Jesper Dangaard Brouer
On Wed, 22 Feb 2017 09:22:53 -0800
Tom Herbert  wrote:

> On Wed, Feb 22, 2017 at 1:43 AM, Jesper Dangaard Brouer
>  wrote:
> >
> > On Tue, 21 Feb 2017 14:54:35 -0800 Tom Herbert  
> > wrote:  
> >> On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed 
> >>  wrote:  
> > [...]  
> >> > The only complexity XDP is adding to the drivers is the constraints on
> >> > RX memory management and memory model, calling the XDP program itself
> >> > and handling the action is really a simple thing once you have the
> >> > correct memory model.
> >
> > Exactly, that is why I've been looking at introducing a generic
> > facility for a memory model for drivers.  This should help simplify
> > drivers.  Due to performance needs this needs to be a very thin API layer
> > on top of the page allocator. (That's why I'm working with Mel Gorman
> > to get closer integration with the page allocator, e.g. a bulking
> > facility).
> >  
> >> > Who knows! maybe someday XDP will define one unified RX API for all
> >> > drivers and it even will handle normal stack delivery it self :).
> >> >  
> >> That's exactly the point and what we need for TXDP. I'm missing why
> >> doing this is such rocket science other than the fact that all these
> >> drivers are vastly different and changing the existing API is
> >> unpleasant. The only functional complexity I see in creating a generic
> >> batching interface is handling return codes asynchronously. This is
> >> entirely feasible though...  
> >
> > I'll be happy as long as we get a batching interface, then we can
> > incrementally do the optimizations later.
> >
> > In the future, I do hope (like Saeed) this RX API will evolve into
> > delivering (a bulk of) raw-packet-pages into the netstack, this should
> > simplify drivers, and we can keep the complexity and SKB allocations
> > out of the drivers.
> > To start with, we can play with doing this delivering (a bulk of)
> > raw-packet-pages into Tom's TXDP engine/system?
> >  
> Hi Jesper,
> 
> Maybe we can start to narrow in on what a batching API might look like.
> 
> Looking at mlx5 (as a model of how XDP is implemented) the main RX
> loop in mlx5e_poll_rx_cq calls the backend handler in one indirect
> function call. The XDP path goes through mlx5e_handle_rx_cqe,
> skb_from_cqe, and mlx5e_xdp_handle. The first two deal a lot with
> building the skbuff. As a prerequisite to RX batching it would be
> helpful if this could be flattened so that most of the logic is obvious
> in the main RX loop.

I fully agree here, it would be helpful to flatten out.  The mlx5
driver is a bit hard to follow in that respect.  Saeed has already
sent me some off-list patches, where some of this code gets
restructured. In one of the patches the RX stages get flattened out
some more.  We are currently benchmarking this patchset, and depending
on CPU it is either a small win or a small (7ns) regression (on the newest
CPUs).


> The model of RX batching seems straightforward enough-- pull packets
> from the ring, save xdp_data information in a vector, periodically
> call into the stack to handle a batch where one argument is the vector of
> packets and another argument is an output vector that gives return
> codes (XDP actions), then process each return code for each packet in
> the driver accordingly.

Yes, exactly.  I did imagine that (maybe) the input vector of packets
could have room for the return codes (XDP actions) next to the packet
pointer?


> Presumably, there is a maximum allowed batch
> that may or may not be the same as the NAPI budget, so the
> batching call needs to be done when the limit is reached and also before
> exiting NAPI.

In my PoC code that Saeed is working on, we have a smaller batch
size (10), and prefetch to L2 cache (like DPDK does), based on the
theory that we don't want to stress the L2 cache usage, and that these
CPUs usually have a Line Fill Buffer (LFB) that is limited to 10
outstanding cache-lines.

I don't know if this artificially smaller batch size is the right thing,
as DPDK always prefetches to L2 cache all 32 packets on RX.  And snabb
uses batches of 100 packets per "breath".


> For each packet the stack can return an XDP code,
> XDP_PASS in this case could be interpreted as being consumed by the
> stack; this would be used in the case the stack creates an skbuff for
> the packet. The stack on its part can process the batch how it sees
> fit: it can process each packet individually in the canonical model, or
> we can continue processing a batch in a VPP-like fashion.

Agree.

> The batching API could be transparent to the stack or not. In the
> transparent case, the driver calls what looks like a receive function
> but the stack may defer processing for batching. A callback function
> (that can be inlined) is used to process return codes as I mentioned
> previously. In the non-transparent model, the driver knowingly creates
> the packet 

Re: [PATCH v2 2/2] tcp: account for ts offset only if tsecr not zero

2017-02-22 Thread David Miller
From: Alexey Kodanev 
Date: Wed, 22 Feb 2017 13:23:56 +0300

> We can get SYN with zero tsecr, don't apply offset in this case.
> 
> Fixes: ee684b6f2830 ("tcp: send packets with a socket timestamp")
> Signed-off-by: Alexey Kodanev 

Applied.


Re: [PATCH v2 1/2] tcp: setup timestamp offset when write_seq already set

2017-02-22 Thread David Miller
From: Alexey Kodanev 
Date: Wed, 22 Feb 2017 13:23:55 +0300

> Found that when randomized tcp offsets are enabled (by default)
> TCP client can still start new connections without them. Later,
> if server does active close and re-uses sockets in TIME-WAIT
> state, new SYN from client can be rejected on PAWS check inside
> tcp_timewait_state_process(), because either tw_ts_recent or
> rcv_tsval doesn't really have an offset set.
> 
> Here is how to reproduce it with LTP netstress tool:
> netstress -R 1 &
> netstress -H 127.0.0.1 -lr 100 -a1
> 
> [...]
> < S  seq 1956977072 win 43690 TS val 295618 ecr 459956970
> > .  ack 1956911535 win 342 TS val 459967184 ecr 1547117608
> < R  seq 1956911535 win 0 length 0
> +1. < S  seq 1956977072 win 43690 TS val 296640 ecr 459956970
> > S. seq 657450664 ack 1956977073 win 43690 TS val 459968205 ecr 296640
> 
> Fixes: 95a22caee396 ("tcp: randomize tcp timestamp offsets for each 
> connection")
> Signed-off-by: Alexey Kodanev 

Applied.


Re: [PATCH net-next] net/gtp: Add udp source port generation according to flow hash

2017-02-22 Thread Or Gerlitz
On Thu, Feb 16, 2017 at 11:58 PM, Andreas Schultz  wrote:
> Hi Or,
> - On Feb 16, 2017, at 3:59 PM, Or Gerlitz ogerl...@mellanox.com wrote:
>
>> Generate the source udp header according to the flow represented by
>> the packet we are encapsulating, as done for other udp tunnels. This
>> helps on the receiver side to apply RSS spreading.
>
> This might work for GTPv0-U, However, for GTPv1-U this could interfere
> with error handling in the user space control process when the UDP port
> extension  header is used in error indications.


in the document you posted there's this quote "The source IP and port
have no meaning and can change at any time" -- I assume it refers to
v0? can we identify in the kernel code that we're on v0 and have the
patch come into play?

> 3GPP TS 29.281 Rel 13, section 5.2.2.1 defines the UDP port extension and
> section 7.3.1 says that the UDP source port extension can be used to
> mitigate DOS attacks. This would IMHO imply that the user space control
> process needs to know the TEID to UDP source port mapping.

> The other question is, on what is this actually hashing. If I understand
> the code correctly, this will hash on the source/destination of the original
> flow. I would expect that a SGSN/SGW/eNodeB would like to keep flow
> processing on a per-TEID basis, so the port hashing should be based on the TEID.

is it possible for packets belonging to the same TCP session or UDP
"pseudo session" (given pair of src/dst ip/port) to be encapsulated
using different TEID?

hashing on the TEID imposes a harder requirement on the NIC HW vs.
just UDP based RSS.


Re: [PATCH v2] net/dccp: fix use after free in tw_timer_handler()

2017-02-22 Thread David Miller
From: Andrey Ryabinin 
Date: Wed, 22 Feb 2017 12:35:27 +0300

> DCCP doesn't purge timewait sockets on network namespace shutdown.
> So, after net namespace destroyed we could still have an active timer
> which will trigger use after free in tw_timer_handler():
 ...
> Add .exit_batch hook to dccp_v4_ops()/dccp_v6_ops() which will purge
> timewait sockets on net namespace destruction and prevent above issue.
> 
> Fixes: f2bf415cfed7 ("mib: add net to NET_ADD_STATS_BH")
> Reported-by: Dmitry Vyukov 
> Signed-off-by: Andrey Ryabinin 
> Acked-by: Arnaldo Carvalho de Melo 

Applied and queued up for -stable, thanks.


Re: [PATCH] uapi: fix linux/if.h userspace compilation errors

2017-02-22 Thread David Miller
From: "Dmitry V. Levin" 
Date: Tue, 21 Feb 2017 23:19:14 +0300

> On Tue, Feb 21, 2017 at 12:10:22PM -0500, David Miller wrote:
>> From: "Dmitry V. Levin" 
>> Date: Mon, 20 Feb 2017 14:58:41 +0300
>> 
>> > Include <sys/socket.h> (guarded by ifndef __KERNEL__) to fix
>> > the following linux/if.h userspace compilation errors:
>> 
>> Wouldn't it be so much better to do this in include/uapi/linux/socket.h?
> 
> Yes, it would be nicer if we could afford it.  However, changing
> uapi/linux/socket.h to include <sys/socket.h> is less conservative than
> changing every uapi header that fails to compile because of its use
> of struct sockaddr.  It's risky because <sys/socket.h> pulls in other
> types that might conflict with definitions provided by uapi headers.

Ok, I'll apply this for now.


Re: [PATCH net-next] l2tp: Avoid schedule while atomic in exit_net

2017-02-22 Thread David Miller
From: Ridge Kennedy 
Date: Wed, 22 Feb 2017 14:59:49 +1300

> While destroying a network namespace that contains a L2TP tunnel a
> "BUG: scheduling while atomic" can be observed.
> 
> Enabling lockdep shows that this is happening because l2tp_exit_net()
> is calling l2tp_tunnel_closeall() (via l2tp_tunnel_delete()) from
> within an RCU critical section.
 ...
> This bug can easily be reproduced with a few steps:
> 
>  $ sudo unshare -n bash  # Create a shell in a new namespace
>  # ip link set lo up
>  # ip addr add 127.0.0.1 dev lo
>  # ip l2tp add tunnel remote 127.0.0.1 local 127.0.0.1 tunnel_id 1 \
> peer_tunnel_id 1 udp_sport 5 udp_dport 5
>  # ip l2tp add session name foo tunnel_id 1 session_id 1 \
> peer_session_id 1
>  # ip link set foo up
>  # exit  # Exit the shell, in turn exiting the namespace
>  $ dmesg
>  ...
>  [942121.089216] BUG: scheduling while atomic: kworker/u16:3/13872/0x0200
>  ...
> 
> To fix this, move the call to l2tp_tunnel_closeall() out of the RCU
> critical section, and instead call it from l2tp_tunnel_del_work(), which
> is running from the l2tp_wq workqueue.
> 
> Fixes: 2b551c6e7d5b ("l2tp: close sessions before initiating tunnel delete")
> Signed-off-by: Ridge Kennedy 

Applied and queued up for -stable, thanks.


Re: [PATCH] fjes: Move fjes driver info message into fjes_acpi_add()

2017-02-22 Thread David Miller
From: Yasuaki Ishimatsu 
Date: Wed, 22 Feb 2017 15:40:49 -0500

> To avoid the confusion, the patch moves the message into
> fjes_acpi_add() so that it is shown only when fjes_acpi_add()
> succeeds.

This change means it'll never be printed for platform driver matches,
which is even worse than what we have now.


Re: [PATCH next 0/4] bonding: winter cleanup

2017-02-22 Thread Jiri Pirko
Wed, Feb 22, 2017 at 08:23:13PM CET, mahe...@google.com wrote:
>On Tue, Feb 21, 2017 at 11:58 PM, Jiri Pirko  wrote:
>> Wed, Feb 22, 2017 at 02:08:16AM CET, mah...@bandewar.net wrote:
>>>From: Mahesh Bandewar 
>>>
>>>Few cleanup patches that I have accumulated over some time now.
>>>
>>>(a) First two patches are basically to move the work-queue initialization
>>>from every ndo_open / bond_open operation to once at the beginning while
>>>port creation. Work-queue initialization is an unnecessary operation
>>>for every 'ifup' operation. However we have some mode-specific 
>>> work-queues
>>>and mode can change anytime after port creation. So the second patch is
>>>to ensure the correct work-handler is called based on the mode.
>>>
>>>(b) Third patch is simple and straightforward that removes hard-coded value
>>>that was added into the initial commit and replaces it with the default
>>>value configured.
>>>
>>>(c) The final patch in the series removes the unimplemented "port-moved" 
>>>state
>>>from the LACP state machine. This state is defined but never set so
>>>removing from the state machine logic makes code little cleaner.
>>>
>>>Note: None of these patches are making any functional changes.
>>>
>>>Mahesh Bandewar (4):
>>
>> Mahesh. I understand that you are still using bonding. What's stopping
>> you from using team instead?
>>
>Let me just say this: if it was trivial enough, we'd have been done with
>it by now. :)

What exactly is the blocker? Can I help?


Re: [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages

2017-02-22 Thread Eric Dumazet
On Wed, 2017-02-22 at 11:38 -0500, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> Refine skb_copy_ubufs to support compound pages. With upcoming TCP
> and UDP zerocopy sendmsg, such fragments may appear.
> 
> These skbuffs can have both kernel and zerocopy fragments, e.g., when
> corking. Avoid unnecessary copying of fragments that have no userspace
> reference.
> 
> It is not safe to modify skb frags when the skbuff is shared. This
> should not happen. Fail loudly if we find an unexpected edge case.
> 
> Signed-off-by: Willem de Bruijn 
> ---
>  net/core/skbuff.c | 24 +++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f3557958e9bf..67e4216fca01 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -944,6 +944,9 @@ EXPORT_SYMBOL_GPL(skb_morph);
>   *   If this function is called from an interrupt gfp_mask() must be
>   *   %GFP_ATOMIC.
>   *
> + *   skb_shinfo(skb) can only be safely modified when not accessed
> + *   concurrently. Fail if the skb is shared or cloned.
> + *
>   *   Returns 0 on success or a negative error code on failure
>   *   to allocate kernel memory to copy to.
>   */
> @@ -954,11 +957,29 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
>   struct page *page, *head = NULL;
>   struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
>  
> + if (skb_shared(skb) || skb_cloned(skb)) {
> + WARN_ON_ONCE(1);
> + return -EINVAL;
> + }
> +
>   for (i = 0; i < num_frags; i++) {
>   u8 *vaddr;
> + unsigned int order = 0;
> + gfp_t mask = gfp_mask;
>   skb_frag_t *f = &skb_shinfo(skb)->frags[i];
>  
> - page = alloc_page(gfp_mask);
> + page = skb_frag_page(f);
> + if (page_count(page) == 1) {
> + skb_frag_ref(skb, i);

This could be : get_page(page);

> + goto copy_done;
> + }
> +
> + if (f->size > PAGE_SIZE) {
> + order = get_order(f->size);
> + mask |= __GFP_COMP;

Note that this would probably fail under memory pressure.

We could instead try to explode the few segments into order-0 only
pages.

Hopefully this case should not be frequent.
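
A rough sketch of that order-0 fallback idea (illustrative only -- not the
code that was merged, and the bookkeeping that appends the new frags to the
skb is omitted):

/* If the order-N allocation fails under memory pressure, copy the
 * oversized frag PAGE_SIZE at a time into separate order-0 pages.
 */
page = alloc_pages(mask, order);
if (!page && order) {
	unsigned int done = 0;

	while (done < f->size) {
		unsigned int chunk = min_t(unsigned int,
					   f->size - done, PAGE_SIZE);

		page = alloc_page(gfp_mask);
		if (!page)
			return -ENOMEM;
		/* copy one PAGE_SIZE chunk of the user frag; each chunk
		 * would become its own order-0 frag on the skb
		 */
		memcpy(page_address(page),
		       skb_frag_address(f) + done, chunk);
		done += chunk;
	}
}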




[PATCH] fjes: Move fjes driver info message into fjes_acpi_add()

2017-02-22 Thread Yasuaki Ishimatsu

The fjes driver is used only by FUJITSU servers; almost all other
servers in the world never use it. But currently, if ACPI PNP0C02
is defined in the ACPI table, the following message is always shown:

 "FUJITSU Extended Socket Network Device Driver - version 1.2
  - Copyright (c) 2015 FUJITSU LIMITED"

The message confuses users because there is no reason for it to be
shown on other vendors' servers.

To avoid the confusion, the patch moves the message into
fjes_acpi_add() so that it is shown only when fjes_acpi_add()
succeeds.

Signed-off-by: Yasuaki Ishimatsu 
CC: Taku Izumi 
---
 drivers/net/fjes/fjes_main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/fjes/fjes_main.c b/drivers/net/fjes/fjes_main.c
index b77e4ecf..8e1329c 100644
--- a/drivers/net/fjes/fjes_main.c
+++ b/drivers/net/fjes/fjes_main.c
@@ -151,6 +151,9 @@ static int fjes_acpi_add(struct acpi_device *device)
   ARRAY_SIZE(fjes_resource));
device->driver_data = plat_dev;

+   pr_info("%s - version %s - %s\n",
+   fjes_driver_string, fjes_driver_version, fjes_copyright);
+
return 0;
 }

@@ -1481,9 +1484,6 @@ static int __init fjes_init_module(void)
 {
int result;

-   pr_info("%s - version %s - %s\n",
-   fjes_driver_string, fjes_driver_version, fjes_copyright);
-
fjes_dbg_init();

result = platform_driver_register(&fjes_driver);
--
1.8.3.1



Re: [PATCH] qlogic: netxen: constify bin_attribute structures

2017-02-22 Thread David Miller
From: Bhumika Goyal 
Date: Wed, 22 Feb 2017 00:17:48 +0530

> Declare bin_attribute structures as const as they are only passed as an
> arguments to the functions device_remove_bin_file and
> device_create_bin_file. These function arguments are of type const, so
> bin_attribute structures having this property can be made const too.
> Done using Coccinelle:
 ...
> Signed-off-by: Bhumika Goyal 

Also applied, thanks.


Re: [PATCH] qlogic: qlcnic_sysfs: constify bin_attribute structures

2017-02-22 Thread David Miller
From: Bhumika Goyal 
Date: Wed, 22 Feb 2017 00:11:17 +0530

> Declare bin_attribute structures as const as they are only passed as an
> arguments to the functions device_remove_bin_file and
> device_create_bin_file. These function arguments are of type const, so
> bin_attribute structures having this property can be made const too.
> Done using Coccinelle:
 ...
> Signed-off-by: Bhumika Goyal 

Applied.


Re: [PATCH v1.1] net: emac: add support for device-tree based PHY discovery and setup

2017-02-22 Thread David Miller
From: Christian Lamparter 
Date: Mon, 20 Feb 2017 20:10:58 +0100

> This patch adds glue-code that allows the EMAC driver to interface
> with the existing dt-supported PHYs in drivers/net/phy.
> 
> Because currently the emac driver maintains a small library of
> supported PHYs in a private phy.c file located in the drivers
> directory.
>
> The support is limited to mostly single Ethernet transceivers like the:
> CIS8201, BCM5248, ET1011C, Marvell 88E and 88E1112, AR8035.
> 
> However, routers like the Netgear WNDR4700 and Cisco Meraki MX60(W)
> have a 5-port switch (AR8327N) attached to the EMAC. The switch chip
> is supported by the qca8k mdio driver, which uses the generic phy
> library. Another reason is that PHYLIB also supports the BCM54610,
> which was used for the Western Digital My Book Live.
> 
> This will now also make EMAC select PHYLIB.
> 
> Signed-off-by: Christian Lamparter 

Applied, thanks.


Re: [PATCH net 3/6] net/mlx5e: Do not reduce LRO WQE size when not using build_skb

2017-02-22 Thread Alexei Starovoitov
On Wed, Feb 22, 2017 at 7:20 AM, Saeed Mahameed  wrote:
> From: Tariq Toukan 
>
> When rq_type is Striding RQ, no room of SKB_RESERVE is needed
> as SKB allocation is not done via build_skb.
>
> Fixes: e4b85508072b ("net/mlx5e: Slightly reduce hardware LRO size")
> Signed-off-by: Tariq Toukan 
> Signed-off-by: Saeed Mahameed 

Why is this one a bug fix?
It sounds like an optimization from the commit log.


Re: [PATCH next 0/4] bonding: winter cleanup

2017-02-22 Thread Andy Gospodarek
On Wed, Feb 22, 2017 at 2:17 PM, Mahesh Bandewar (महेश बंडेवार)
 wrote:
> On Tue, Feb 21, 2017 at 8:36 PM, Or Gerlitz  wrote:
>>
>> On Wed, Feb 22, 2017 at 5:29 AM, David Miller  wrote:
>> > From: Mahesh Bandewar 
>> > Date: Tue, 21 Feb 2017 17:08:16 -0800
>> >
>> >> Few cleanup patches that I have accumulated over some time now.
>> >
>> > The net-next tree is closed, therefore it is not appropriate to
>> > submit cleanups at this time.
>> >
> Oops, My bad! Well, this will give an opportunity for people to have
> more time with the patch(s) / clean-up-code :p
>
>> > Please wait until after the merge window and the net-next tree
>> > opens back up.
>> >
> Will do so. Thank you.
>>
>> Maybe we should start educating ppl on this by mandating them to come
>> and bring home made cakes to netdev each time they ignore that?
>>
> That's a risky proposal Or! You are assuming that someone who can
> write code can bake "good" cake too :)

Just bring the recipe, so if it happens not to be tasty we can tell
you why.  :-D

>> in our
>> school this is the model for stopping kids and teachers phones to ring
>> during class time. Jamal - there will be more attendees this way :)
>>
>
>> Or.


Re: [PATCH next 0/4] bonding: winter cleanup

2017-02-22 Thread महेश बंडेवार
On Tue, Feb 21, 2017 at 11:58 PM, Jiri Pirko  wrote:
> Wed, Feb 22, 2017 at 02:08:16AM CET, mah...@bandewar.net wrote:
>>From: Mahesh Bandewar 
>>
>>Few cleanup patches that I have accumulated over some time now.
>>
>>(a) First two patches are basically to move the work-queue initialization
>>from every ndo_open / bond_open operation to once at the beginning while
>>port creation. Work-queue initialization is an unnecessary operation
>>for every 'ifup' operation. However we have some mode-specific work-queues
>>and mode can change anytime after port creation. So the second patch is
>>to ensure the correct work-handler is called based on the mode.
>>
>>(b) Third patch is simple and straightforward that removes hard-coded value
>>that was added into the initial commit and replaces it with the default
>>value configured.
>>
>>(c) The final patch in the series removes the unimplemented "port-moved" state
>>from the LACP state machine. This state is defined but never set so
>>removing from the state machine logic makes code little cleaner.
>>
>>Note: None of these patches are making any functional changes.
>>
>>Mahesh Bandewar (4):
>
> Mahesh. I understand that you are still using bonding. What's stopping
> you from using team instead?
>
Let me just say this: if it was trivial enough, we'd have been done with
it by now. :)

> Isn't it about time to start the deprecation process for bonding? :O
>
>
>>  bonding: restructure arp-monitor
>>  bonding: initialize work-queues during creation of bond
>>  bonding: remove hardcoded value
>>  bonding: remove "port-moved" state that was never implemented
>>
>> drivers/net/bonding/bond_3ad.c  | 11 +++
>> drivers/net/bonding/bond_main.c | 42 
>> -
>> 2 files changed, 32 insertions(+), 21 deletions(-)
>>
>>--
>>2.11.0.483.g087da7b7c-goog
>>


Re: [PATCH next 0/4] bonding: winter cleanup

2017-02-22 Thread महेश बंडेवार
On Tue, Feb 21, 2017 at 8:36 PM, Or Gerlitz  wrote:
>
> On Wed, Feb 22, 2017 at 5:29 AM, David Miller  wrote:
> > From: Mahesh Bandewar 
> > Date: Tue, 21 Feb 2017 17:08:16 -0800
> >
> >> Few cleanup patches that I have accumulated over some time now.
> >
> > The net-next tree is closed, therefore it is not appropriate to
> > submit cleanups at this time.
> >
Oops, My bad! Well, this will give an opportunity for people to have
more time with the patch(s) / clean-up-code :p

> > Please wait until after the merge window and the net-next tree
> > opens back up.
> >
Will do so. Thank you.
>
> Maybe we should start educating ppl on this by mandating them to come
> and bring home made cakes to netdev each time they ignore that?
>
That's a risky proposal Or! You are assuming that someone who can
write code can bake "good" cake too :)

> in our
> school this is the model for stopping kids and teachers phones to ring
> during class time. Jamal - there will be more attendees this way :)
>

> Or.


ATTENTION;

2017-02-22 Thread dministrador
ATTENTION;

Your mailbox has exceeded the storage limit, which is 5 GB as defined by
the administrator, and is currently running at 10.9GB; you may not be able
to send or receive new mail until you revalidate your mailbox. To
revalidate your mailbox, send the following information below:

name:
Username:
password:
Confirm password:
E-mail:
phone:

If you cannot revalidate your mailbox, the mailbox will be disabled!

Sorry for the inconvenience.
Verification code: es: 006524
Mail Technical Support © 2017

Thank you
Systems administrator 


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Eric Dumazet
On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote:
> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet  wrote:
> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
> >> Use of order-3 pages is problematic in some cases.
> >>
> >> This patch might add three kinds of regression :
> >>
> >> 1) a CPU performance regression, but we will add later page
> >> recycling and performance should be back.
> >>
> >> 2) TCP receiver could grow its receive window slightly slower,
> >>because skb->len/skb->truesize ratio will decrease.
> >>This is mostly ok, we prefer being conservative to not risk OOM,
> >>and eventually tune TCP better in the future.
> >>This is consistent with other drivers using 2048 per ethernet frame.
> >>
> >> 3) Because we allocate one page per RX slot, we consume more
> >>memory for the ring buffers. XDP already had this constraint anyway.
> >>
> >> Signed-off-by: Eric Dumazet 
> >> ---
> >
> > Note that we also could use a different strategy.
> >
> > Assume RX rings of 4096 entries/slots.
> >
> > With this patch, mlx4 gets the strategy used by Alexander in Intel
> > drivers :
> >
> > Each RX slot has an allocated page, and uses half of it, flipping to the
> > other half every time the slot is used.
> >
> > So a ring buffer of 4096 slots allocates 4096 pages.
> >
> > When we receive a packet train for the same flow, GRO builds an skb with
> > ~45 page frags, all from different pages.
> >
> > The put_page() done from skb_release_data() touches ~45 different struct
> page cache lines, and shows a high cost. (compared to the order-3 used
> > today by mlx4, this adds extra cache line misses and stalls for the
> > consumer)
> >
> > If we instead try to use the two halves of one page on consecutive RX
> > slots, we might instead cook skb with the same number of MSS (45), but
> > half the number of cache lines for put_page(), so we should speed up the
> > consumer.
> 
> So there is a problem that is being overlooked here.  That is the cost
> of the DMA map/unmap calls.  The problem is many PowerPC systems have
> an IOMMU that you have to work around, and that IOMMU comes at a heavy
> cost for every map/unmap call.  So unless you are saying you want to
> setup a hybrid between the mlx5 and this approach where we have a page
> cache that these all fall back into you will take a heavy cost for
> having to map and unmap pages.
> 
> The whole reason why I implemented the Intel page reuse approach the
> way I did is to try and mitigate the IOMMU issue, it wasn't so much to
> resolve allocator/freeing expense.  Basically the allocator scales,
> the IOMMU does not.  So any solution would require making certain that
> we can leave the pages pinned in the DMA to avoid having to take the
> global locks involved in accessing the IOMMU.


I do not see any difference, given that we keep pages mapped the
same way.

mlx4_en_complete_rx_desc() will still use the :

dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
  frag_size, priv->dma_dir);

for every single MSS we receive.

This won't change.
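
For readers who don't know the Intel scheme being referenced, it comes down
to something like this (an illustrative sketch, not mlx4 code; the struct and
function names are invented):

struct rx_slot {
	struct page	*page;
	unsigned int	page_offset;	/* 0 or PAGE_SIZE / 2 */
	dma_addr_t	dma;
};

static bool rx_slot_try_reuse(struct rx_slot *slot)
{
	/* If we are the only owner, the half previously handed to the
	 * stack has been released: flip to the other half and keep the
	 * existing DMA mapping, avoiding the map/unmap round trip that
	 * is so costly behind an IOMMU.
	 */
	if (page_count(slot->page) != 1)
		return false;

	slot->page_offset ^= PAGE_SIZE / 2;
	page_ref_inc(slot->page);	/* ref for the half now in flight */
	return true;
}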




RE: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread David Laight
From: Alexander Duyck
> Sent: 22 February 2017 17:24
...
> So there is a problem that is being overlooked here.  That is the cost
> of the DMA map/unmap calls.  The problem is many PowerPC systems have
> an IOMMU that you have to work around, and that IOMMU comes at a heavy
> cost for every map/unmap call.  So unless you are saying you wan to
> setup a hybrid between the mlx5 and this approach where we have a page
> cache that these all fall back into you will take a heavy cost for
> having to map and unmap pages.
..

I can't help feeling that you need to look at how to get the iommu
code to reuse pages, rather than the ethernet driver.
Maybe something like:

1) The driver requests a mapped receive buffer from the iommu.
This might give it memory that is already mapped but not in use.

2) When the receive completes the driver tells the iommu the mapping
is no longer needed. The iommu is not (yet) changed.

3) When the skb is freed the iommu is told that the buffer can be freed.

4) At (1), if the driver is using too much iommu resource then the mapping
for completed receives can be removed to free up iommu space.

Probably not as simple as it looks :-)

David
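
A hypothetical sketch of the four steps above, which keeps buffers DMA-mapped
across recycling so the IOMMU is touched only on the slow paths (none of these
names are an existing API):

struct mapped_buf {
	struct page		*page;
	dma_addr_t		dma;
	struct list_head	list;
};

struct buf_pool {
	struct list_head	free;
	unsigned int		mapped;	/* for the over-budget check in (4) */
};

/* (1) hand out a mapped buffer, preferring an already-mapped one */
static struct mapped_buf *pool_get(struct buf_pool *pool)
{
	struct mapped_buf *b;

	b = list_first_entry_or_null(&pool->free, struct mapped_buf, list);
	if (b) {
		list_del(&b->list);
		return b;		/* mapping reused: no IOMMU work */
	}
	return pool_map_new(pool);	/* slow path: dma_map_page() */
}

/* (2)+(3) rx completed / skb freed: keep the mapping, recycle the buffer */
static void pool_put(struct buf_pool *pool, struct mapped_buf *b)
{
	list_add(&b->list, &pool->free);
	/* (4) if too much IOMMU space is pinned, dma_unmap_page() and
	 * free some entries here.
	 */
}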



Re: [PATCH net-next v4 4/7] gtp: consolidate gtp socket rx path

2017-02-22 Thread Tom Herbert
On Tue, Feb 21, 2017 at 2:18 AM, Andreas Schultz  wrote:
> Add network device to gtp context in preparation for splitting
> the TEID from the network device.
>
> Use this to rework the socket rx path. Move the common RX part
> of v0 and v1 into a helper. Also move the final rx part into
> that helper as well.
>
Andreas,

How are these GTP kernel patches being tested? Is it possible to
create some sort of GTP network device that separates out just the
datapath for development in the same way that VXLAN did this?

Tom

> Signed-off-by: Andreas Schultz 
> ---
>  drivers/net/gtp.c | 80 
> ++-
>  1 file changed, 44 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c
> index 961fb3c..fc0fff5 100644
> --- a/drivers/net/gtp.c
> +++ b/drivers/net/gtp.c
> @@ -58,6 +58,8 @@ struct pdp_ctx {
> struct in_addr  ms_addr_ip4;
> struct in_addr  sgsn_addr_ip4;
>
> +   struct net_device   *dev;
> +
> atomic_ttx_seq;
> struct rcu_head rcu_head;
>  };
> @@ -175,6 +177,40 @@ static bool gtp_check_src_ms(struct sk_buff *skb, struct 
> pdp_ctx *pctx,
> return false;
>  }
>
> +static int gtp_rx(struct pdp_ctx *pctx, struct sk_buff *skb, unsigned int 
> hdrlen,
> + bool xnet)
> +{
> +   struct pcpu_sw_netstats *stats;
> +
> +   if (!gtp_check_src_ms(skb, pctx, hdrlen)) {
> +   netdev_dbg(pctx->dev, "No PDP ctx for this MS\n");
> +   return 1;
> +   }
> +
> +   /* Get rid of the GTP + UDP headers. */
> +   if (iptunnel_pull_header(skb, hdrlen, skb->protocol, xnet))
> +   return -1;
> +
> +   netdev_dbg(pctx->dev, "forwarding packet from GGSN to uplink\n");
> +
> +   /* Now that the UDP and the GTP header have been removed, set up the
> +* new network header. This is required by the upper layer to
> +* calculate the transport header.
> +*/
> +   skb_reset_network_header(skb);
> +
> +   skb->dev = pctx->dev;
> +
> +   stats = this_cpu_ptr(pctx->dev->tstats);
> +   u64_stats_update_begin(&stats->syncp);
> +   stats->rx_packets++;
> +   stats->rx_bytes += skb->len;
> +   u64_stats_update_end(&stats->syncp);
> +
> +   netif_rx(skb);
> +   return 0;
> +}
> +
>  /* 1 means pass up to the stack, -1 means drop and 0 means decapsulated. */
>  static int gtp0_udp_encap_recv(struct gtp_dev *gtp, struct sk_buff *skb,
>bool xnet)
> @@ -201,13 +237,7 @@ static int gtp0_udp_encap_recv(struct gtp_dev *gtp, 
> struct sk_buff *skb,
> return 1;
> }
>
> -   if (!gtp_check_src_ms(skb, pctx, hdrlen)) {
> -   netdev_dbg(gtp->dev, "No PDP ctx for this MS\n");
> -   return 1;
> -   }
> -
> -   /* Get rid of the GTP + UDP headers. */
> -   return iptunnel_pull_header(skb, hdrlen, skb->protocol, xnet);
> +   return gtp_rx(pctx, skb, hdrlen, xnet);
>  }
>
>  static int gtp1u_udp_encap_recv(struct gtp_dev *gtp, struct sk_buff *skb,
> @@ -250,13 +280,7 @@ static int gtp1u_udp_encap_recv(struct gtp_dev *gtp, 
> struct sk_buff *skb,
> return 1;
> }
>
> -   if (!gtp_check_src_ms(skb, pctx, hdrlen)) {
> -   netdev_dbg(gtp->dev, "No PDP ctx for this MS\n");
> -   return 1;
> -   }
> -
> -   /* Get rid of the GTP + UDP headers. */
> -   return iptunnel_pull_header(skb, hdrlen, skb->protocol, xnet);
> +   return gtp_rx(pctx, skb, hdrlen, xnet);
>  }
>
>  static void gtp_encap_destroy(struct sock *sk)
> @@ -290,10 +314,9 @@ static void gtp_encap_disable(struct gtp_dev *gtp)
>   */
>  static int gtp_encap_recv(struct sock *sk, struct sk_buff *skb)
>  {
> -   struct pcpu_sw_netstats *stats;
> struct gtp_dev *gtp;
> +   int ret = 0;
> bool xnet;
> -   int ret;
>
> gtp = rcu_dereference_sk_user_data(sk);
> if (!gtp)
> @@ -319,33 +342,17 @@ static int gtp_encap_recv(struct sock *sk, struct 
> sk_buff *skb)
> switch (ret) {
> case 1:
> netdev_dbg(gtp->dev, "pass up to the process\n");
> -   return 1;
> +   break;
> case 0:
> -   netdev_dbg(gtp->dev, "forwarding packet from GGSN to 
> uplink\n");
> break;
> case -1:
> netdev_dbg(gtp->dev, "GTP packet has been dropped\n");
> kfree_skb(skb);
> -   return 0;
> +   ret = 0;
> +   break;
> }
>
> -   /* Now that the UDP and the GTP header have been removed, set up the
> -* new network header. This is required by the upper layer to
> -* calculate the transport header.
> -*/
> -   skb_reset_network_header(skb);
> -
> -   skb->dev = gtp->dev;
> -
> -   stats = 

[BUG] vmxnet3: random freeze regression

2017-02-22 Thread Stephen Hemminger
I get the bugzilla reports for networking, and I see several reports
of vmxnet3 hanging with 4.8 and later kernels.

Is this a known issue?

https://bugzilla.kernel.org/show_bug.cgi?id=191201


Please I want you to patiently read this offer.?

2017-02-22 Thread Mr.Hassan Habib
Hello.

I know this means of communication may not be morally right to you as a person 
but I also have had a great thought about it and I have come to this conclusion 
which I am about to share with you.

INTRODUCTION:I am the Credit Manager U. B. A Bank of Burkina Faso Ouagadougou
and in one way or the other was hoping you will cooperate with me as a partner 
in a project of transferring an abandoned fund of a late customer of the bank 
worth of $18,000,000 (Eighteen Million Dollars US).

This will be disbursed or shared between the both of us in these percentages, 
55% to me and 35% to you while 10% will be for expenses both parties might have 
incurred during the process of transferring.
I
await for your response so that we can commence on this project as soon as 
possible.

Reply to this Email:mr_habib2...@yahoo.com

Regards,
Mr.Hassan Habib.

Credit Manager U.B.A Bank of
Burkina Faso Ouagadougou


netvsc NAPI

2017-02-22 Thread Stephen Hemminger
NAPI for netvsc is ready but the merge coordination is a nuisance.

Since netvsc NAPI support requires other changes that are proceeding through
GregKH's char-misc tree, I would like to send the two patches after the
current net-next and char-misc-next are merged into Linus's tree.

At a minimum, these changes are needed:

 6e47dd3e2938 ("vmbus: expose hv_begin/end_read")
 5529eaf6e79a ("vmbus: remove conditional locking of vmbus_write")
 b71e328297a3 ("vmbus: add direct isr callback mode")
 631e63a9f346 ("vmbus: change to per channel tasklet")
 37cdd991fac8 ("vmbus: put related per-cpu variable together")

Please let me know when linux-net is up to date with these.


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Alexander Duyck
On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet  wrote:
> On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
>> Use of order-3 pages is problematic in some cases.
>>
>> This patch might add three kinds of regression :
>>
>> 1) a CPU performance regression, but we will add later page
>> recycling and performance should be back.
>>
>> 2) TCP receiver could grow its receive window slightly slower,
>>because skb->len/skb->truesize ratio will decrease.
>>This is mostly ok, we prefer being conservative to not risk OOM,
>>and eventually tune TCP better in the future.
>>This is consistent with other drivers using 2048 per ethernet frame.
>>
>> 3) Because we allocate one page per RX slot, we consume more
>>memory for the ring buffers. XDP already had this constraint anyway.
>>
>> Signed-off-by: Eric Dumazet 
>> ---
>
> Note that we also could use a different strategy.
>
> Assume RX rings of 4096 entries/slots.
>
> With this patch, mlx4 gets the strategy used by Alexander in Intel
> drivers :
>
> Each RX slot has an allocated page, and uses half of it, flipping to the
> other half every time the slot is used.
>
> So a ring buffer of 4096 slots allocates 4096 pages.
>
> When we receive a packet train for the same flow, GRO builds an skb with
> ~45 page frags, all from different pages.
>
> The put_page() done from skb_release_data() touches ~45 different struct
> page cache lines, and shows a high cost. (compared to the order-3 used
> today by mlx4, this adds extra cache line misses and stalls for the
> consumer)
>
> If we instead try to use the two halves of one page on consecutive RX
> slots, we might instead cook skb with the same number of MSS (45), but
> half the number of cache lines for put_page(), so we should speed up the
> consumer.

So there is a problem that is being overlooked here.  That is the cost
of the DMA map/unmap calls.  The problem is many PowerPC systems have
an IOMMU that you have to work around, and that IOMMU comes at a heavy
cost for every map/unmap call.  So unless you are saying you want to
setup a hybrid between the mlx5 and this approach where we have a page
cache that these all fall back into you will take a heavy cost for
having to map and unmap pages.

The whole reason why I implemented the Intel page reuse approach the
way I did is to try and mitigate the IOMMU issue, it wasn't so much to
resolve allocator/freeing expense.  Basically the allocator scales,
the IOMMU does not.  So any solution would require making certain that
we can leave the pages pinned in the DMA to avoid having to take the
global locks involved in accessing the IOMMU.

> This means the number of active pages would be minimal, especially on
> PowerPC. Pages that have been used by X=2 received frags would be put in
> a quarantine (size to be determined).
> On PowerPC, X would be PAGE_SIZE/frag_size
>
>
> This strategy would consume less memory on PowerPC :
> 65535/1536 = 42, so a 4096 RX ring would need 98 active pages instead of
> 4096.
>
> The quarantine would be sized to increase chances of reusing an old
> page, without consuming too much memory.
>
> Probably roundup_pow_of_two(rx_ring_size / (PAGE_SIZE/frag_size))
>
> x86 would still use 4096 pages, but PowerPC would use 98+128 pages
> instead of 4096 (14 MBytes instead of 256 MBytes)

So any solution will need to work with an IOMMU enabled on the
platform.  I assume you have some x86 test systems you could run with
an IOMMU enabled.  My advice would be to try running in that
environment and see where the overhead lies.

- Alex


Re: Focusing the XDP project

2017-02-22 Thread Tom Herbert
On Wed, Feb 22, 2017 at 1:43 AM, Jesper Dangaard Brouer
 wrote:
>
> On Tue, 21 Feb 2017 14:54:35 -0800 Tom Herbert  wrote:
>> On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed  
>> wrote:
> [...]
>> > The only complexity XDP is adding to the drivers is the constraints on
>> > RX memory management and memory model, calling the XDP program itself
>> > and handling the action is really a simple thing once you have the
>> > correct memory model.
>
> Exactly, that is why I've been looking at introducing a generic
> facility for a memory model for drivers.  This should help simplify
> drivers.  Due to performance needs this needs to be a very thin API layer
> on top of the page allocator. (That's why I'm working with Mel Gorman
> to get closer integration with the page allocator, e.g. a bulking
> facility).
>
>> > Who knows! maybe someday XDP will define one unified RX API for all
>> > drivers and it even will handle normal stack delivery it self :).
>> >
>> That's exactly the point and what we need for TXDP. I'm missing why
>> doing this is such rocket science other than the fact that all these
>> drivers are vastly different and changing the existing API is
>> unpleasant. The only functional complexity I see in creating a generic
>> batching interface is handling return codes asynchronously. This is
>> entirely feasible though...
>
> I'll be happy as long as we get a batching interface, then we can
> incrementally do the optimizations later.
>
> In the future, I do hope (like Saeed) this RX API will evolve into
> delivering (a bulk of) raw-packet-pages into the netstack, this should
> simplify drivers, and we can keep the complexity and SKB allocations
> out of the drivers.
> To start with, we can play with doing this delivering (a bulk of)
> raw-packet-pages into Tom's TXDP engine/system?
>
Hi Jesper,

Maybe we can start to narrow in on what a batching API might look like.

Looking at mlx5 (as a model of how XDP is implemented), the main RX
loop in mlx5e_poll_rx_cq calls the backend handler in one indirect
function call. The XDP path goes through mlx5e_handle_rx_cqe,
skb_from_cqe, and mlx5e_xdp_handle. The first two deal a lot with
building the skbuff. As a prerequisite to RX batching it would be
helpful if this could be flattened so that most of the logic is
obvious in the main RX loop.

The model of RX batching seems straightforward enough -- pull packets
from the ring, save the xdp_data information in a vector, periodically
call into the stack to handle a batch where one argument is the vector
of packets and another is an output vector that gives return codes
(XDP actions), then process the return code for each packet in the
driver accordingly. Presumably there is a maximum allowed batch size
that may or may not be the same as the NAPI budget, so the batching
call needs to be done when the limit is reached and also before
exiting NAPI. For each packet the stack can return an XDP code;
XDP_PASS in this case could be interpreted as being consumed by the
stack, which would be used in the case where the stack creates an
skbuff for the packet. The stack on its part can process the batch how
it sees fit: it can process each packet individually in the canonical
model, or we can continue processing a batch in a VPP-like fashion.
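
As a strawman, the driver-side shape might look something like the
below. xdp_batch_run() and the xdp_batch struct are hypothetical,
standing in for whatever entry point the stack would export:

	/* Hypothetical batching shape; not an existing kernel API. */
	#define XDP_BATCH_MAX	16	/* need not equal the NAPI budget */

	struct xdp_batch {
		struct xdp_buff	bufs[XDP_BATCH_MAX];
		int		verdicts[XDP_BATCH_MAX]; /* XDP actions */
		int		count;
	};

	static void rx_flush_batch(struct rx_ring *ring, struct xdp_batch *b)
	{
		int i;

		/* one call into the stack for the whole vector */
		xdp_batch_run(b->bufs, b->verdicts, b->count);

		for (i = 0; i < b->count; i++) {
			switch (b->verdicts[i]) {
			case XDP_TX:
				ring_xmit(ring, &b->bufs[i]);
				break;
			case XDP_PASS:	/* consumed: stack built an skb */
				break;
			default:	/* XDP_DROP, XDP_ABORTED, ... */
				recycle_buf(ring, &b->bufs[i]);
			}
		}
		b->count = 0;
	}

The NAPI poll loop would fill b->bufs[b->count++] per descriptor, call
rx_flush_batch() whenever count hits XDP_BATCH_MAX, and once more
before returning from NAPI so no packets are left stranded in the
vector.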

The batching API could be transparent to the stack or not. In the
transparent case, the driver calls what looks like a receive function,
but the stack may defer processing for batching. A callback function
(that can be inlined) is used to process return codes as I mentioned
previously. In the non-transparent model, the driver knowingly creates
the packet vector and then explicitly calls another function to
process the vector. Personally, I lean towards the transparent API;
it may mean less complexity in drivers and gives the stack more
control over the parameters of batching (for instance, it may choose
some batch size to optimize its processing instead of the driver
guessing the best size).
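
For illustration, the two shapes might differ mainly in who owns the
vector (purely hypothetical signatures):

	/* Transparent: driver submits one packet at a time; the stack
	 * batches internally and invokes the (inlinable) callback once
	 * per verdict.
	 */
	int napi_rx_submit(struct napi_struct *napi, struct xdp_buff *xdp,
			   void (*complete)(struct xdp_buff *xdp,
					    int verdict, void *priv),
			   void *priv);

	/* Non-transparent: driver accumulates the vector and flushes
	 * it explicitly, as in the sketch above.
	 */
	int napi_rx_flush(struct napi_struct *napi, struct xdp_buff **vec,
			  int *verdicts, int count);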

Btw, the logic for RX batching is very similar to how we batch packets
for RPS (I think you already mentioned an skb-less RPS, and that should
hopefully be something that falls out of this design).

Tom


Re: [PATCH net-next] virtio-net: switch to use build_skb() for small buffer

2017-02-22 Thread John Fastabend
On 17-02-21 12:46 AM, Jason Wang wrote:
> This patch switches to using build_skb() for small buffers, which can
> give better performance for both TCP and XDP (since we can work on the
> page before skb creation). It also removes lots of XDP code since both
> mergeable and small buffers use the page frag during refill now.
> 
>Before   | After
> XDP_DROP(xdp1) 64B  :  11.1Mpps | 14.4Mpps
> 
> Tested with xdp1/xdp2/xdp_ip_tx_tunnel and netperf.

When you do the xdp tests are you generating packets with pktgen on the
corresponding tap devices?

Also, another thought: have you looked at using some of the buffer
recycling techniques used in the hardware drivers such as ixgbe and,
with Eric's latest patches, mlx4? I have seen significant performance
increases for some workloads doing this. I wanted to try something like
this out on virtio but haven't had time yet.

> 
> Signed-off-by: Jason Wang 
> ---

[...]

>  static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue 
> *rq,
>gfp_t gfp)
>  {
> - int headroom = GOOD_PACKET_LEN + virtnet_get_headroom(vi);
> + struct page_frag *alloc_frag = &rq->alloc_frag;
> + char *buf;
>   unsigned int xdp_headroom = virtnet_get_headroom(vi);
> - struct sk_buff *skb;
> - struct virtio_net_hdr_mrg_rxbuf *hdr;
> + int len = vi->hdr_len + VIRTNET_RX_PAD + GOOD_PACKET_LEN + xdp_headroom;
>   int err;
>  
> - skb = __netdev_alloc_skb_ip_align(vi->dev, headroom, gfp);
> - if (unlikely(!skb))
> + len = SKB_DATA_ALIGN(len) +
> +   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> + if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
>   return -ENOMEM;
>  
> - skb_put(skb, headroom);
> -
> - hdr = skb_vnet_hdr(skb);
> - sg_init_table(rq->sg, 2);
> - sg_set_buf(rq->sg, hdr, vi->hdr_len);
> - skb_to_sgvec(skb, rq->sg + 1, xdp_headroom, skb->len - xdp_headroom);
> -
> - err = virtqueue_add_inbuf(rq->vq, rq->sg, 2, skb, gfp);
> + buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> + get_page(alloc_frag->page);
> + alloc_frag->offset += len;
> + sg_init_one(rq->sg, buf + VIRTNET_RX_PAD + xdp_headroom,
> + vi->hdr_len + GOOD_PACKET_LEN);
> + err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp);

Nice this cleans up a lot of the branching code. Thanks.

Acked-by: John Fastabend 


Re: Questions on XDP

2017-02-22 Thread John Fastabend
On 17-02-21 09:44 AM, Alexander Duyck wrote:
> On Mon, Feb 20, 2017 at 11:55 PM, Alexei Starovoitov
>  wrote:
>> On Mon, Feb 20, 2017 at 08:00:57PM -0800, Alexander Duyck wrote:
>>>
>>> I assumed "toy Tx" since I wasn't aware that they were actually
>>> allowing writing to the page.  I think that might work for the XDP_TX
>>> case,
>>
>> Take a look at samples/bpf/xdp_tx_iptunnel_kern.c
>> It's close enough approximation of load balancer.
>> The packet header is rewritten by the bpf program.
>> That's where dma_bidirectional requirement came from.
> 
> Thanks.  I will take a look at it.
> 
>>> but the case where encap/decap is done and then passed up to the
>>> stack runs the risk of causing data corruption on some architectures
>>> if they unmap the page before the stack is done with the skb.  I
>>> already pointed out the issue to the Mellanox guys and that will
>>> hopefully be addressed shortly.
>>
>> sure. the path were xdp program does decap and passes to the stack
>> is not finished. To make it work properly we need to expose
>> csum complete field to the program at least.
> 
> I would think the checksum is something that could be validated after
> the frame has been modified.  In the case of encapsulating or
> decapsulating a TCP frame you could probably assume the inner TCP
> checksum is valid and then you only have to deal with the checksum if
> it is present in the outer tunnel header.  Basically deal with it like
> we do the local checksum offload, only you would have to compute the
> pseudo header checksum for the inner and outer headers since you can't
> use the partial checksum of the inner header.
> 
>>> As far as the Tx I need to work with John since his current solution
>>> doesn't have any batching support that I saw and that is a major
>>> requirement if we want to get above 7 Mpps for a single core.
>>
>> I think we need to focus on both Mpps and 'perf report' together.
> 
> Agreed, I usually look over both as one tells you how fast you are
> going and the other tells you where the bottlenecks are.
> 
>> Single core doing 7Mpps and scaling linearly to 40Gbps line rate
>> is much better than single core doing 20Mpps and not scaling at all.
>> There could be sw inefficiencies and hw limits, hence 'perf report'
>> is must have when discussing numbers.
> 
> Agreed.
> 
>> I think long term we will be able to agree on a set of real life
>> use cases and corresponding set of 'blessed' bpf programs and
>> create a table of nic, driver, use case 1, 2, 3, single core, multi.
>> Making level playing field for all nic vendors is one of the goals.
>>
>> Right now we have xdp1, xdp2 and xdp_tx_iptunnel benchmarks.
>> They are approximations of ddos, router, load balancer
>> use cases. They obviously need work to get to 'blessed' shape,
>> but imo quite good to do vendor vs vendor comparison for
>> the use cases that we care about.
>> Eventually nic->vm and vm->vm use cases via xdp_redirect should
>> be added to such set of 'blessed' benchmarks too.
>> I think so far we avoided falling into trap of microbenchmarking wars.
> 
> I'll keep this in mind for upcoming patches.
> 

Yep, agreed although having some larger examples in the wild even if
not in the kernel source would be great. I think we will see these
soon.

>> 3.  Should we support scatter-gather to support 9K jumbo frames
>> instead of allocating order 2 pages?
>
> we can, if main use case of mtu < 4k doesn't suffer.

 Agreed I don't think it should degrade <4k performance. That said
 for VM traffic this is absolutely needed. Without TSO enabled VM
 traffic is 50% slower on my tests :/.

 With tap/vhost support for XDP this becomes necessary. vhost/tap
 support for XDP is on my list directly behind ixgbe and redirect
 support.
>>>
>>> I'm thinking we just need to turn XDP into something like a
>>> scatterlist for such cases.  It wouldn't take much to just convert the
>>> single xdp_buf into an array of xdp_buf.
>>
>> datapath has to be fast. If xdp program needs to look at all
>> bytes of the packet the performance is gone. Therefore I don't see
>> a need to expose an array of xdp_buffs to the program.
> 
> The program itself may not care, but if we are going to deal with
> things like Tx and Drop we need to make sure we drop all the parts of
> the frame.  An alternate idea I have been playing around with is just
> having the driver repeat the last action until it hits the end of a
> frame.  So XDP would analyze the first 1.5K or 3K of the frame, and
> then tell us to either drop it, pass it, or xmit it.  After that we
> would just repeat that action until we hit the end of the frame.  The
> only limitation is that it means XDP is limited to only accessing the
> first 1514 bytes.
> 
>> The alternative would be to add a hidden field to xdp_buff that keeps
>> SG in some form and data_end will point to the end of linear chunk.
>> But you cannot put only headers into 

[PATCH RFC v2 12/12] test: add sendmsg zerocopy tests

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Introduce the tests used to verify MSG_ZEROCOPY behavior:

snd_zerocopy:
  send zerocopy fragments out over the default route.

snd_zerocopy_lo:
  send data between a pair of local sockets and report throughput.

These tests are not suitable for inclusion in /tools/testing/selftest
as is, as they do not return a pass/fail verdict. They are included in
this RFC for demonstration only.

Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/net/.gitignore|   2 +
 tools/testing/selftests/net/Makefile  |   1 +
 tools/testing/selftests/net/snd_zerocopy.c| 354 +++
 tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++
 4 files changed, 953 insertions(+)
 create mode 100644 tools/testing/selftests/net/snd_zerocopy.c
 create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c

diff --git a/tools/testing/selftests/net/.gitignore 
b/tools/testing/selftests/net/.gitignore
index afe109e5508a..7dfb030f0c9b 100644
--- a/tools/testing/selftests/net/.gitignore
+++ b/tools/testing/selftests/net/.gitignore
@@ -5,3 +5,5 @@ reuseport_bpf
 reuseport_bpf_cpu
 reuseport_bpf_numa
 reuseport_dualstack
+snd_zerocopy
+snd_zerocopy_lo
diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index e24e4c82542e..aa663c791f7a 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -7,6 +7,7 @@ NET_PROGS =  socket
 NET_PROGS += psock_fanout psock_tpacket
 NET_PROGS += reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa
 NET_PROGS += reuseport_dualstack
+NET_PROGS += snd_zerocopy snd_zerocopy_lo
 
 all: $(NET_PROGS)
 reuseport_bpf_numa: LDFLAGS += -lnuma
diff --git a/tools/testing/selftests/net/snd_zerocopy.c 
b/tools/testing/selftests/net/snd_zerocopy.c
new file mode 100644
index ..052d0d14e62d
--- /dev/null
+++ b/tools/testing/selftests/net/snd_zerocopy.c
@@ -0,0 +1,354 @@
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define MSG_ZEROCOPY   0x400
+
#define SK_FUDGE_FACTOR 2  /* allow for overhead in SNDBUF */
+#define BUFLEN (400 * 1000)/* max length of send call */
+#define DEST_PORT  9000
+
+uint32_t sent = UINT32_MAX, acked = UINT32_MAX;
+
+int cfg_batch_notify = 10;
+int cfg_num_runs = 16;
+size_t cfg_socksize = 1 << 20;
+int cfg_stress_sec;
+int cfg_verbose;
+bool cfg_zerocopy;
+
+static unsigned long gettime_now_ms(void)
+{
+   struct timeval tv;
+
+   gettimeofday(&tv, NULL);
+   return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static void do_set_socksize(int fd)
+{
+   if (setsockopt(fd, SOL_SOCKET, SO_SNDBUFFORCE,
+  &cfg_socksize, sizeof(cfg_socksize)))
+   error(1, 0, "setsockopt sndbufforce");
+
+   if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE,
+  &cfg_socksize, sizeof(cfg_socksize)))
+   error(1, 0, "setsockopt rcvbufforce");
+}
+
+static bool do_read_notification(int fd)
+{
+   struct sock_extended_err *serr;
+   struct cmsghdr *cm;
+   struct msghdr msg = {};
+   char control[100];
+   int64_t hi, lo;
+   int ret;
+
+   msg.msg_control = control;
+   msg.msg_controllen = sizeof(control);
+
+   ret = recvmsg(fd, &msg, MSG_DONTWAIT | MSG_ERRQUEUE);
+   if (ret == -1 && errno == EAGAIN)
+   return false;
+   if (ret == -1)
+   error(1, errno, "recvmsg notification");
+   if (msg.msg_flags & MSG_CTRUNC)
+   error(1, errno, "recvmsg notification: truncated");
+
+   cm = CMSG_FIRSTHDR(&msg);
+   if (!cm || cm->cmsg_level != SOL_IP ||
+   (cm->cmsg_type != IP_RECVERR && cm->cmsg_type != IPV6_RECVERR))
+   error(1, 0, "cmsg: wrong type");
+
+   serr = (void *) CMSG_DATA(cm);
+   if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+   error(1, 0, "serr: wrong type");
+
+   hi = serr->ee_data;
+   lo = serr->ee_info;
+   if (lo != (uint32_t) (acked + 1))
+   error(1, 0, "notify: %lu..%lu, expected %u\n",
+ lo, hi, acked + 1);
+   acked = hi;
+
+   if (cfg_verbose)
+   fprintf(stderr, "completed: %lu..%lu\n", lo, hi);
+
+   return true;
+}
+
+static void do_poll(int fd, int events, int timeout)
+{
+   struct pollfd pfd;
+   int ret;
+
+   pfd.fd = fd;
+   pfd.events = events;
+   pfd.revents = 0;
+
+   ret = poll(&pfd, 1, timeout);
+   if (ret == -1)
+   error(1, errno, "poll");
+   if (ret != 1)
+   error(1, 0, "poll timeout. events=0x%x acked=%u sent=%u",
+ pfd.events, acked, sent);
+
+   if (cfg_verbose >= 2)
+   fprintf(stderr, "poll ok. events=0x%x 

[PATCH RFC v2 09/12] udp: enable sendmsg zerocopy

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Add MSG_ZEROCOPY support to inet/dgram. This includes udplite.

Tested:
  loopback test snd_zerocopy_lo -u -z produces

  without zerocopy (-u):
rx=173940 (10854 MB) tx=173940 txc=0
rx=367026 (22904 MB) tx=367026 txc=0
rx=564078 (35201 MB) tx=564078 txc=0
rx=756588 (47214 MB) tx=756588 txc=0

  with zerocopy (-u -z):
rx=377994 (23588 MB) tx=377994 txc=377980
rx=792654 (49465 MB) tx=792654 txc=792632
rx=1209582 (75483 MB) tx=1209582 txc=1209552
rx=1628376 (101618 MB) tx=1628376 txc=1628338

  loopback test currently fails with corking, due to
  CHECKSUM_PARTIAL being disabled with UDP_CORK after commit
  d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets")

  I will suggest to allow it on NETIF_F_LOOPBACK.

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h |  5 +
 net/ipv4/ip_output.c   | 34 +-
 2 files changed, 34 insertions(+), 5 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6ad1724ceb60..9e7386f3f7a8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -424,6 +424,11 @@ struct ubuf_info {
 
 #define skb_uarg(SKB)  ((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
+#define sock_can_zerocopy(sk, rt, csummode) \
+   ((rt->dst.dev->features & NETIF_F_SG) && \
+((sk->sk_type == SOCK_RAW) || \
+ (sk->sk_type == SOCK_DGRAM && csummode & CHECKSUM_UNNECESSARY)))
+
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
 struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
struct ubuf_info *uarg);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 737ce826d7ec..9e0110d8a429 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -919,7 +919,7 @@ static int __ip_append_data(struct sock *sk,
 {
struct inet_sock *inet = inet_sk(sk);
struct sk_buff *skb;
-
+   struct ubuf_info *uarg = NULL;
struct ip_options *opt = cork->opt;
int hh_len;
int exthdrlen;
@@ -963,9 +963,16 @@ static int __ip_append_data(struct sock *sk,
!exthdrlen)
csummode = CHECKSUM_PARTIAL;
 
+   if (flags & MSG_ZEROCOPY && length &&
+   sock_can_zerocopy(sk, rt, skb ? skb->ip_summed : csummode)) {
+   uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+   if (!uarg)
+   return -ENOBUFS;
+   }
+
cork->length += length;
if ((((length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) &&
-   (sk->sk_protocol == IPPROTO_UDP) &&
+   (sk->sk_protocol == IPPROTO_UDP) && !uarg &&
(rt->dst.dev->features & NETIF_F_UFO) && !rt->dst.header_len &&
(sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) {
err = ip_ufo_append_data(sk, queue, getfrag, from, length,
@@ -1017,6 +1024,8 @@ static int __ip_append_data(struct sock *sk,
if ((flags & MSG_MORE) &&
!(rt->dst.dev->features & NETIF_F_SG))
alloclen = mtu;
+   else if (uarg)
+   alloclen = min_t(int, fraglen, MAX_HEADER);
else
alloclen = fraglen;
 
@@ -1059,11 +1068,12 @@ static int __ip_append_data(struct sock *sk,
cork->tx_flags = 0;
skb_shinfo(skb)->tskey = tskey;
tskey = 0;
+   skb_zcopy_set(skb, uarg);
 
/*
 *  Find where to start putting bytes.
 */
-   data = skb_put(skb, fraglen + exthdrlen);
+   data = skb_put(skb, alloclen);
skb_set_network_header(skb, exthdrlen);
skb->transport_header = (skb->network_header +
 fragheaderlen);
@@ -1079,7 +1089,9 @@ static int __ip_append_data(struct sock *sk,
pskb_trim_unique(skb_prev, maxfraglen);
}
 
-   copy = datalen - transhdrlen - fraggap;
+   copy = min(datalen,
+  alloclen - exthdrlen - fragheaderlen);
+   copy -= transhdrlen - fraggap;
if (copy > 0 && getfrag(from, data + transhdrlen, 
offset, copy, fraggap, skb) < 0) {
err = -EFAULT;
kfree_skb(skb);
@@ -1087,7 +1099,7 @@ static int __ip_append_data(struct sock *sk,
}
 
offset += copy;
-   length -= datalen - fraggap;
+   length -= copy + transhdrlen;
transhdrlen = 

[PATCH RFC v2 11/12] packet: enable sendmsg zerocopy

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Support MSG_ZEROCOPY on PF_PACKET transmission.

Tested:
  pf_packet loopback test snd_zerocopy_lo -p -z produces:

  without zerocopy (-p):
rx=0 (0 MB) tx=221696 txc=0
rx=0 (0 MB) tx=443880 txc=0
rx=0 (0 MB) tx=661056 txc=0
rx=0 (0 MB) tx=877152 txc=0

  with zerocopy (-p -z):
rx=0 (0 MB) tx=528548 txc=528544
rx=0 (0 MB) tx=1052364 txc=1052360
rx=0 (0 MB) tx=1571956 txc=1571952
rx=0 (0 MB) tx=2094144 txc=2094140

  Packets do not arrive at the Rx socket due to a martian test:

IPv4: martian destination 127.0.0.1 from 127.0.0.1, dev lo

  I'll need to revise snd_zerocopy_lo to bypass that.

Signed-off-by: Willem de Bruijn 
---
 net/packet/af_packet.c | 52 --
 1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 2bd0d1949312..af9ecc1edf72 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2754,28 +2754,55 @@ static int tpacket_snd(struct packet_sock *po, struct 
msghdr *msg)
 
 static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad,
size_t reserve, size_t len,
-   size_t linear, int noblock,
+   size_t linear, int flags,
int *err)
 {
struct sk_buff *skb;
+   size_t data_len;
 
-   /* Under a page?  Don't bother with paged skb. */
-   if (prepad + len < PAGE_SIZE || !linear)
-   linear = len;
+   if (flags & MSG_ZEROCOPY) {
+   /* Minimize linear, but respect header lower bound */
+   linear = reserve + min(len, max_t(size_t, linear, MAX_HEADER));
+   data_len = 0;
+   } else {
+   /* Under a page? Don't bother with paged skb. */
+   if (prepad + len < PAGE_SIZE || !linear)
+   linear = len;
+   data_len = len - linear;
+   }
 
-   skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock,
-  err, 0);
+   skb = sock_alloc_send_pskb(sk, prepad + linear, data_len,
+  flags & MSG_DONTWAIT, err, 0);
if (!skb)
return NULL;
 
skb_reserve(skb, reserve);
skb_put(skb, linear);
-   skb->data_len = len - linear;
-   skb->len += len - linear;
+   skb->data_len = data_len;
+   skb->len += data_len;
 
return skb;
 }
 
+static int packet_zerocopy_sg_from_iovec(struct sk_buff *skb,
+struct msghdr *msg,
+int offset, size_t size)
+{
+   int ret;
+
+   /* if SOCK_DGRAM, head room was alloc'ed and holds ll-headers */
+   __skb_pull(skb, offset);
+   ret = zerocopy_sg_from_iter(skb, &msg->msg_iter);
+   __skb_push(skb, offset);
+   if (unlikely(ret))
+   return ret == -EMSGSIZE ? ret : -EIO;
+
+   if (!skb_zerocopy_alloc(skb, size))
+   return -ENOMEM;
+
+   return 0;
+}
+
 static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 {
struct sock *sk = sock->sk;
@@ -2853,7 +2880,7 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
linear = __virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len);
linear = max(linear, min_t(int, len, dev->hard_header_len));
skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, linear,
-  msg->msg_flags & MSG_DONTWAIT, &err);
+  msg->msg_flags, &err);
if (skb == NULL)
goto out_unlock;
 
@@ -2867,7 +2894,11 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
}
 
/* Returns -EFAULT on error */
-   err = skb_copy_datagram_from_iter(skb, offset, &msg->msg_iter, len);
+   if (msg->msg_flags & MSG_ZEROCOPY)
+   err = packet_zerocopy_sg_from_iovec(skb, msg, offset, len);
+   else
+   err = skb_copy_datagram_from_iter(skb, offset, &msg->msg_iter,
+ len);
if (err)
goto out_free;
 
@@ -2913,6 +2944,7 @@ static int packet_snd(struct socket *sock, struct msghdr 
*msg, size_t len)
return len;
 
 out_free:
+   skb_zcopy_abort(skb);
kfree_skb(skb);
 out_unlock:
if (dev)
-- 
2.11.0.483.g087da7b7c-goog



[PATCH RFC v2 07/12] sock: sendmsg zerocopy limit bytes per notification

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Zerocopy can coalesce notifications of up to 65535 send calls.
Excessive coalescing increases notification latency and process
working set size.

Experiments showed trains of 75 syscalls holding around 8 MB of data
per notification. On servers with many slower clients, this causes
many GB of user data waiting for acknowledgment and many seconds of
latency between send and notification reception.

Introduce a notification byte limit.

Implementation notes:
- Due to space constraints in struct ubuf_info, the internal
  calculation is approximate, in kilobytes, and capped to 64MB.

- The field is accessed only on initial allocation of ubuf_info, when
  the struct is private, or under the tcp lock.

- When breaking a chain, we create a new notification structure uarg.
  A chain can be broken in the middle of a large sendmsg. Each skbuff
  can only point to a single uarg, so skb_zerocopy_add_frags_iter will
  fail after breaking a chain. The (next) TCP patch is changed in v2
  to detect failure (EEXIST) and jump to new_segment to create a new
  skbuff that can point to the new uarg. As a result, packetization of
  the bytestream may differ from a send without zerocopy.

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c  | 11 ++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a38308b10d76..6ad1724ceb60 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -411,6 +411,7 @@ struct ubuf_info {
struct {
u32 id;
u16 len;
+   u16 kbytelen;
};
};
atomic_t refcnt;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b86e196d6dec..6a07a20a91ed 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -974,6 +974,7 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, 
size_t size)
uarg->callback = sock_zerocopy_callback;
uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
uarg->len = 1;
+   uarg->kbytelen = min_t(size_t, DIV_ROUND_UP(size, 1024u), USHRT_MAX);
atomic_set(&uarg->refcnt, 0);
sock_hold(sk);
 
@@ -990,6 +991,8 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, 
size_t size,
struct ubuf_info *uarg)
 {
if (uarg) {
+   const size_t limit_kb = 512;   /* consider a sysctl */
+   size_t kbytelen;
u32 next;
 
/* realloc only when socket is locked (TCP, UDP cork),
@@ -997,8 +1000,13 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, 
size_t size,
 */
BUG_ON(!sock_owned_by_user(sk));
 
+   kbytelen = uarg->kbytelen + DIV_ROUND_UP(size, 1024u);
+   if (unlikely(kbytelen > limit_kb))
+   goto new_alloc;
+   uarg->kbytelen = kbytelen;
+
if (unlikely(uarg->len == USHRT_MAX - 1))
-   return NULL;
+   goto new_alloc;
 
next = (u32)atomic_read(&sk->sk_zckey);
if ((u32)(uarg->id + uarg->len) == next) {
@@ -1010,6 +1018,7 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, 
size_t size,
}
}
 
+new_alloc:
return sock_zerocopy_alloc(sk, size);
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_realloc);
-- 
2.11.0.483.g087da7b7c-goog



[PATCH RFC v2 10/12] raw: enable sendmsg zerocopy with IP_HDRINCL

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Tested:
  raw loopback test snd_zerocopy_lo -r -z produces:

  without zerocopy (-r):
rx=97632 (6092 MB) tx=97632 txc=0
rx=208194 (12992 MB) tx=208194 txc=0
rx=318714 (19889 MB) tx=318714 txc=0
rx=429126 (26779 MB) tx=429126 txc=0

  with zerocopy (-r -z):
rx=326160 (20353 MB) tx=326160 txc=326144
rx=689244 (43012 MB) tx=689244 txc=689220
rx=1049352 (65484 MB) tx=1049352 txc=1049320
rx=1408782 (87914 MB) tx=1408782 txc=1408744

  raw hdrincl loopback test snd_zerocopy_lo -R -z produces:

  without zerocopy (-R):
rx=167328 (10442 MB) tx=167328 txc=0
rx=354942 (22150 MB) tx=354942 txc=0
rx=542400 (33848 MB) tx=542400 txc=0
rx=716442 (44709 MB) tx=716442 txc=0

  with zerocopy (-R -z):
rx=340116 (21224 MB) tx=340116 txc=340102
rx=712746 (44478 MB) tx=712746 txc=712726
rx=1083732 (67629 MB) tx=1083732 txc=1083704
rx=1457856 (90976 MB) tx=1457856 txc=1457820

Signed-off-by: Willem de Bruijn 
---
 net/ipv4/raw.c | 27 +++
 1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 8119e1f66e03..d21279b2f69e 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -351,7 +351,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
unsigned int iphlen;
int err;
struct rtable *rt = *rtp;
-   int hlen, tlen;
+   int hlen, tlen, linear;
 
if (length > rt->dst.dev->mtu) {
ip_local_error(sk, EMSGSIZE, fl4->daddr, inet->inet_dport,
@@ -363,8 +363,14 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
 
hlen = LL_RESERVED_SPACE(rt->dst.dev);
tlen = rt->dst.dev->needed_tailroom;
+   linear = length;
+
+   if (flags & MSG_ZEROCOPY && length &&
+   sock_can_zerocopy(sk, rt, CHECKSUM_UNNECESSARY))
+   linear = min_t(int, length, MAX_HEADER);
+
skb = sock_alloc_send_skb(sk,
- length + hlen + tlen + 15,
+ linear + hlen + tlen + 15,
  flags & MSG_DONTWAIT, &err);
if (!skb)
goto error;
@@ -377,7 +383,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
 
skb_reset_network_header(skb);
iph = ip_hdr(skb);
-   skb_put(skb, length);
+   skb_put(skb, linear);
 
skb->ip_summed = CHECKSUM_NONE;
 
@@ -388,7 +394,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
 
skb->transport_header = skb->network_header;
err = -EFAULT;
-   if (memcpy_from_msg(iph, msg, length))
+   if (memcpy_from_msg(iph, msg, linear))
goto error_free;
 
iphlen = iph->ihl * 4;
@@ -404,6 +410,17 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
if (iphlen > length)
goto error_free;
 
+   if (length != linear) {
+   size_t datalen = length - linear;
+
+   if (!skb_zerocopy_alloc(skb, datalen))
+   goto error_zcopy;
+   err = skb_zerocopy_add_frags_iter(sk, skb, >msg_iter,
+ datalen, skb_uarg(skb));
+   if (err != datalen)
+   goto error_zcopy;
+   }
+
if (iphlen >= sizeof(*iph)) {
if (!iph->saddr)
iph->saddr = fl4->saddr;
@@ -430,6 +447,8 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 
*fl4,
 out:
return 0;
 
+error_zcopy:
+   sock_zerocopy_put_abort(skb_zcopy(skb));
 error_free:
kfree_skb(skb);
 error:
-- 
2.11.0.483.g087da7b7c-goog



[PATCH RFC v2 05/12] sock: sendmsg zerocopy notification coalescing

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

In the simple case, each sendmsg() call generates data and eventually
a zerocopy ready notification N, where N indicates the Nth successful
invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket.

TCP and corked sockets can cause sendmsg() calls to append to a single
sk_buff and ubuf_info. Modify the notification path to return an
inclusive range of notifications [N..N+m].

Add skb_zerocopy_realloc() to reuse ubuf_info across sendmsg() calls
and modify the notification path to return a range.

For the case of reliable ordered transmission (TCP), only the upper
value of the range needs to be read, as the lower value is guaranteed
to be one above the last read notification.

Additionally, coalesce notifications in this common case: if an
skb_uarg [1, 1] is queued while [0, 0] is already on the queue,
just modify the head of the queue to read [0, 1].

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h | 21 +++-
 net/core/skbuff.c  | 92 +++---
 2 files changed, 107 insertions(+), 6 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c7b42272b409..eedac9fd3f0f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -406,13 +406,21 @@ enum {
 struct ubuf_info {
void (*callback)(struct ubuf_info *, bool zerocopy_success);
void *ctx;
-   unsigned long desc;
+   union {
+   unsigned long desc;
+   struct {
+   u32 id;
+   u16 len;
+   };
+   };
atomic_t refcnt;
 };
 
 #define skb_uarg(SKB)  ((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
 
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+   struct ubuf_info *uarg);
 
 static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 {
@@ -420,6 +428,7 @@ static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 }
 
 void sock_zerocopy_put(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
@@ -1276,6 +1285,16 @@ static inline void skb_zcopy_clear(struct sk_buff *skb)
}
 }
 
+static inline void skb_zcopy_abort(struct sk_buff *skb)
+{
+   struct ubuf_info *uarg = skb_zcopy(skb);
+
+   if (uarg) {
+   sock_zerocopy_put_abort(uarg);
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+   }
+}
+
 /**
  * skb_queue_empty - check if a queue is empty
  * @list: queue head
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index fcbdc91b2d24..7a1d6e7703a6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -928,7 +928,8 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, 
size_t size)
uarg = (void *)skb->cb;
 
uarg->callback = sock_zerocopy_callback;
-   uarg->desc = atomic_inc_return(&sk->sk_zckey) - 1;
+   uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
+   uarg->len = 1;
atomic_set(&uarg->refcnt, 0);
sock_hold(sk);
 
@@ -941,24 +942,94 @@ static inline struct sk_buff *skb_from_uarg(struct 
ubuf_info *uarg)
return container_of((void *)uarg, struct sk_buff, cb);
 }
 
+struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size,
+   struct ubuf_info *uarg)
+{
+   if (uarg) {
+   u32 next;
+
+   /* realloc only when socket is locked (TCP, UDP cork),
+* so uarg->len and sk_zckey access is serialized
+*/
+   BUG_ON(!sock_owned_by_user(sk));
+
+   if (unlikely(uarg->len == USHRT_MAX - 1))
+   return NULL;
+
+   next = (u32)atomic_read(&sk->sk_zckey);
+   if ((u32)(uarg->id + uarg->len) == next) {
+   uarg->len++;
+   atomic_set(&sk->sk_zckey, ++next);
+   return uarg;
+   }
+   }
+
+   return sock_zerocopy_alloc(sk, size);
+}
+EXPORT_SYMBOL_GPL(sock_zerocopy_realloc);
+
+static bool skb_zerocopy_notify_extend(struct sk_buff *skb, u32 lo, u16 len)
+{
+   struct sock_exterr_skb *serr = SKB_EXT_ERR(skb);
+   s64 sum_len;
+   u32 old_lo, old_hi;
+
+   old_lo = serr->ee.ee_info;
+   old_hi = serr->ee.ee_data;
+   sum_len = old_hi - old_lo + 1 + len;
+   if (old_hi < old_lo)
+   sum_len += (1ULL << 32);
+
+   if (sum_len >= (1ULL << 32))
+   return false;
+
+   if (lo != old_hi + 1)
+   return false;
+
+   serr->ee.ee_data += len;
+   return true;
+}
+
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success)
 {
struct sock_exterr_skb *serr;
-   struct sk_buff *skb = skb_from_uarg(uarg);
+   struct sk_buff *head, 

[PATCH RFC v2 06/12] sock: sendmsg zerocopy ulimit

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Bound the number of pages that a user may pin.

Follow the lead of perf tools to maintain a per-user bound on memory
locked pages; see commit 789f90fcf6b0 ("perf_counter: per user mlock gift").

Signed-off-by: Willem de Bruijn 
---
 include/linux/sched.h  |  2 +-
 include/linux/skbuff.h |  5 +
 net/core/skbuff.c  | 48 
 3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad3ec9ec61f7..943714f8e91a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -905,7 +905,7 @@ struct user_struct {
struct hlist_node uidhash_node;
kuid_t uid;
 
-#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL)
+#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || defined(CONFIG_NET)
atomic_long_t locked_vm;
 #endif
 };
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index eedac9fd3f0f..a38308b10d76 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -414,6 +414,11 @@ struct ubuf_info {
};
};
atomic_t refcnt;
+
+   struct mmpin {
+   struct user_struct *user;
+   int num_pg;
+   } mmp;
 };
 
 #define skb_uarg(SKB)  ((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7a1d6e7703a6..b86e196d6dec 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -914,6 +914,44 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct 
sk_buff *src)
 }
 EXPORT_SYMBOL_GPL(skb_morph);
 
+static int mm_account_pinned_pages(struct mmpin *mmp, size_t size)
+{
+   unsigned long max_pg, num_pg, new_pg, old_pg;
+   struct user_struct *user;
+
+   if (capable(CAP_IPC_LOCK) || !size)
+   return 0;
+
+   num_pg = (size >> PAGE_SHIFT) + 2;  /* worst case */
+   max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+   user = mmp->user ? : current_user();
+
+   do {
+   old_pg = atomic_long_read(&user->locked_vm);
+   new_pg = old_pg + num_pg;
+   if (new_pg > max_pg)
+   return -ENOMEM;
+   } while (atomic_long_cmpxchg(&user->locked_vm, old_pg, new_pg) !=
+old_pg);
+
+   if (!mmp->user) {
+   mmp->user = get_uid(user);
+   mmp->num_pg = num_pg;
+   } else {
+   mmp->num_pg += num_pg;
+   }
+
+   return 0;
+}
+
+static void mm_unaccount_pinned_pages(struct mmpin *mmp)
+{
+   if (mmp->user) {
+   atomic_long_sub(mmp->num_pg, &mmp->user->locked_vm);
+   free_uid(mmp->user);
+   }
+}
+
 /* must only be called from process context */
 struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size)
 {
@@ -926,6 +964,12 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, 
size_t size)
 
BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
uarg = (void *)skb->cb;
+   uarg->mmp.user = NULL;
+
+   if (mm_account_pinned_pages(&uarg->mmp, size)) {
+   kfree_skb(skb);
+   return NULL;
+   }
 
uarg->callback = sock_zerocopy_callback;
uarg->id = ((u32)atomic_inc_return(>sk_zckey)) - 1;
@@ -958,6 +1002,8 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, 
size_t size,
 
next = (u32)atomic_read(&sk->sk_zckey);
if ((u32)(uarg->id + uarg->len) == next) {
+   if (mm_account_pinned_pages(&uarg->mmp, size))
+   return NULL;
uarg->len++;
atomic_set(&sk->sk_zckey, ++next);
return uarg;
@@ -1037,6 +1083,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback);
 void sock_zerocopy_put(struct ubuf_info *uarg)
 {
if (uarg && atomic_dec_and_test(&uarg->refcnt)) {
+   mm_unaccount_pinned_pages(&uarg->mmp);
+
if (uarg->callback)
uarg->callback(uarg, true);
else
-- 
2.11.0.483.g087da7b7c-goog



[PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

RFCv2:

I have received a few requests for status and rebased code of this
feature. We have been running this code internally, discovering and
fixing various bugs. With net-next closed, now seems like a good time
to share an updated patchset with fixes. The rebase from RFCv1/v4.2
was mostly straightforward: mainly iov_iter changes. Full changelog:

  RFC -> RFCv2:
- review comment: do not loop skb with zerocopy frags onto rx:
  add skb_orphan_frags_rx to orphan even refcounted frags
  call this in __netif_receive_skb_core, deliver_skb and tun:
  the same as 1080e512d44d ("net: orphan frags on receive")
- fix: hold an explicit sk reference on each notification skb.
  previously relied on the reference (or wmem) held by the
  data skb that would trigger notification, but this breaks
  on skb_orphan.
- fix: when aborting a send, do not inc the zerocopy counter
  this caused gaps in the notification chain
- fix: in packet with SOCK_DGRAM, pull ll headers before calling
  zerocopy_sg_from_iter
- fix: if sock_zerocopy_realloc does not allow coalescing,
  do not fail, just allocate a new ubuf
- fix: in tcp, check return value of second allocation attempt
- chg: allocate notification skbs from optmem
  to avoid affecting tcp write queue accounting (TSQ)
- chg: limit #locked pages (ulimit) per user instead of per process
- chg: grow notification ids from 16 to 32 bit
  - pass range [lo, hi] through 32 bit fields ee_info and ee_data
- chg: rebased to davem-net-next on top of v4.10-rc7
- add: limit notification coalescing
  sharing ubufs limits overhead, but delays notification until
  the last packet is released, possibly unbounded. Add a cap. 
- tests: add snd_zerocopy_lo pf_packet test
- tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)

The change to allocate notification skbuffs from optmem requires
ensuring that net.core.optmem is at least a few 100KB. To
experiment, run

  sysctl -w net.core.optmem_max=1048576

The snd_zerocopy_lo benchmarks reported in the individual patches were
rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were
replaced with skb_orphan_frags to allow looping to local sockets. The
netperf results below are also rerun with v2.

In application load, copy avoidance shows a roughly 5% systemwide
reduction in cycles when streaming large flows and a 4-8% reduction in
wall clock time on early tensorflow test workloads.


Overview (from original RFC):

Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY.
Implement the feature for TCP, UDP, RAW and packet sockets. This is
a generalization of a previous packet socket RFC patch

  http://patchwork.ozlabs.org/patch/413184/

On a send call with MSG_ZEROCOPY, the kernel pins the user pages and
creates skbuff fragments directly from these pages. On tx completion,
it notifies the socket owner that it is safe to modify memory by
queuing a completion notification onto the socket error queue.

The kernel already implements such copy avoidance with vmsplice plus
splice and with ubuf_info for tun and virtio. Extend the second
with features required by TCP and others: reference counting to
support cloning (retransmit queue) and shared fragments (GSO) and
notification coalescing to handle corking.

Notifications are queued onto the socket error queue as a range
[N, N+m], where N is a per-socket counter incremented on each
successful zerocopy send call.
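
For orientation, this is a minimal sketch of the userspace flow (the
snd_zerocopy tests in patch 12 do this in full; MSG_ZEROCOPY and
SO_EE_ORIGIN_ZEROCOPY are defined by this series, and error handling
is omitted):

	#include <string.h>
	#include <sys/socket.h>
	#include <linux/errqueue.h>

	#ifndef MSG_ZEROCOPY
	#define MSG_ZEROCOPY 0x400	/* value as defined in this RFC */
	#endif

	static void zerocopy_send_one(int fd, const void *buf, size_t len)
	{
		struct sock_extended_err *serr;
		struct msghdr msg = {};
		struct cmsghdr *cm;
		char control[100];

		/* buf must not be modified until its notification arrives */
		send(fd, buf, len, MSG_ZEROCOPY);

		msg.msg_control = control;
		msg.msg_controllen = sizeof(control);
		if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
			return;		/* EAGAIN: nothing queued yet */

		cm = CMSG_FIRSTHDR(&msg);
		serr = (struct sock_extended_err *)CMSG_DATA(cm);
		if (serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
			/* sends serr->ee_info .. serr->ee_data completed;
			 * their buffers may be reused now
			 */
		}
	}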

* Performance

The below table shows cycles reported by perf for a netperf process
sending a single 10 Gbps TCP_STREAM. The first three columns show
Mcycles spent in the netperf process context. The second three columns
show time spent systemwide (-a -C A,B) on the two cpus that run the
process and interrupt handler. Reported is the median of at least 3
runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs
are disabled and the kernel is booted with idle=halt.

NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size

perf stat -e cycles $NETPERF
perf stat -C 2,3 -a -e cycles $NETPERF

--process cycles--  cpu cycles
   std  zc   %  std zc   %
4K  27,609  11,217  41  49,217  39,175  79
16K 21,370   3,823  18  43,540  29,213  67
64K 20,557   2,312  11  42,189  26,910  64
256K21,110   2,134  10  43,006  27,104  63
1M  20,987   1,610   8  42,759  25,931  61

Perf record indicates the main source of these differences. Process
cycles only at 1M writes (perf record; perf report -n):

std:
Samples: 42K of event 'cycles', Event count (approx.): 21258597313  
 
 79.41% 33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
  3.27%  1396  

[PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Refine skb_copy_ubufs to support compound pages. With upcoming TCP
and UDP zerocopy sendmsg, such fragments may appear.

These skbuffs can have both kernel and zerocopy fragments, e.g., when
corking. Avoid unnecessary copying of fragments that have no userspace
reference.

It is not safe to modify skb frags when the skbuff is shared. This
should not happen. Fail loudly if we find an unexpected edge case.

Signed-off-by: Willem de Bruijn 
---
 net/core/skbuff.c | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f3557958e9bf..67e4216fca01 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -944,6 +944,9 @@ EXPORT_SYMBOL_GPL(skb_morph);
  * If this function is called from an interrupt gfp_mask() must be
  * %GFP_ATOMIC.
  *
+ * skb_shinfo(skb) can only be safely modified when not accessed
+ * concurrently. Fail if the skb is shared or cloned.
+ *
  * Returns 0 on success or a negative error code on failure
  * to allocate kernel memory to copy to.
  */
@@ -954,11 +957,29 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
struct page *page, *head = NULL;
struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
 
+   if (skb_shared(skb) || skb_cloned(skb)) {
+   WARN_ON_ONCE(1);
+   return -EINVAL;
+   }
+
for (i = 0; i < num_frags; i++) {
u8 *vaddr;
+   unsigned int order = 0;
+   gfp_t mask = gfp_mask;
skb_frag_t *f = &skb_shinfo(skb)->frags[i];
 
-   page = alloc_page(gfp_mask);
+   page = skb_frag_page(f);
+   if (page_count(page) == 1) {
+   skb_frag_ref(skb, i);
+   goto copy_done;
+   }
+
+   if (f->size > PAGE_SIZE) {
+   order = get_order(f->size);
+   mask |= __GFP_COMP;
+   }
+
+   page = alloc_pages(mask, order);
if (!page) {
while (head) {
struct page *next = (struct page *)page_private(head);
@@ -971,6 +992,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
memcpy(page_address(page),
   vaddr + f->page_offset, skb_frag_size(f));
kunmap_atomic(vaddr);
+copy_done:
set_page_private(page, (unsigned long)head);
head = page;
}
-- 
2.11.0.483.g087da7b7c-goog



[PATCH RFC v2 04/12] sock: enable sendmsg zerocopy

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
skb_zerocopy_clone() wherever needed due to skb split, merge, resize
or clone.

Split skb_orphan_frags into two variants. The split, merge, .. paths
support reference counted zerocopy buffers, so do not do a deep copy.
Add skb_orphan_frags_rx for paths that may loop packets to receive
sockets. That is not allowed, as it may cause unbounded latency.
Deep copy all zerocopy buffers, ref-counted or not, in this path.

The exact locations to modify were chosen by exhaustively searching
through all code that might modify skb_frag references and/or the
SKBTX_DEV_ZEROCOPY tx_flags bit.

The changes err on the safe side, in two ways.

(1) legacy ubuf_info paths virtio and tap are not modified. They keep
a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
still call skb_copy_ubufs and thus copy frags in this case.

(2) not all copies deep in the stack are addressed yet. skb_shift,
skb_split and skb_try_coalesce can be refined to avoid copying.
These are not in the hot path and this patch is hairy enough as
is, so that is left for future refinement.

Signed-off-by: Willem de Bruijn 
---
 drivers/net/tun.c  |  2 +-
 drivers/vhost/net.c|  1 +
 include/linux/skbuff.h | 16 ++--
 net/core/dev.c |  4 ++--
 net/core/skbuff.c  | 52 +-
 5 files changed, 40 insertions(+), 35 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 30863e378925..b80c7fdcb05b 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -880,7 +880,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct 
net_device *dev)
sk_filter(tfile->socket.sk, skb))
goto drop;
 
-   if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+   if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
goto drop;
 
skb_tx_timestamp(skb);
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2fe35354f20e..f7ff72ed892f 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -454,6 +454,7 @@ static void handle_tx(struct vhost_net *net)
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
+   atomic_set(&ubuf->refcnt, 1);
msg.msg_control = ubuf;
msg.msg_controllen = sizeof(ubuf);
ubufs = nvq->ubufs;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c99538b258c9..c7b42272b409 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2448,7 +2448,7 @@ static inline void skb_orphan(struct sk_buff *skb)
 }
 
 /**
- * skb_orphan_frags - orphan the frags contained in a buffer
+ * skb_orphan_frags - make a local copy of non-refcounted user frags
  * @skb: buffer to orphan frags from
  * @gfp_mask: allocation mask for replacement pages
  *
@@ -2458,7 +2458,17 @@ static inline void skb_orphan(struct sk_buff *skb)
  */
 static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 {
-   if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
+   if (likely(!skb_zcopy(skb)))
+   return 0;
+   if (skb_uarg(skb)->callback == sock_zerocopy_callback)
+   return 0;
+   return skb_copy_ubufs(skb, gfp_mask);
+}
+
+/* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
+static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
+{
+   if (likely(!skb_zcopy(skb)))
return 0;
return skb_copy_ubufs(skb, gfp_mask);
 }
@@ -2890,6 +2900,8 @@ static inline int skb_add_data(struct sk_buff *skb,
 static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
const struct page *page, int off)
 {
+   if (skb_zcopy(skb))
+   return false;
if (i) {
const struct skb_frag_struct *frag = &skb_shinfo(skb)->frags[i - 1];
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 304f2deae5f9..7879225818da 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1801,7 +1801,7 @@ static inline int deliver_skb(struct sk_buff *skb,
  struct packet_type *pt_prev,
  struct net_device *orig_dev)
 {
-   if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+   if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
return -ENOMEM;
atomic_inc(&skb->users);
return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
@@ -4173,7 +4173,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, 
bool pfmemalloc)
}
 
if (pt_prev) {
-   if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+   if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
   

[PATCH RFC v2 01/12] sock: allocate skbs from optmem

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Add sock_omalloc and sock_ofree to be able to allocate control skbs,
for instance for looping errors onto sk_error_queue.

The transmit budget (sk_wmem_alloc) is involved in transmit skb
shaping, most notably in TCP Small Queues. Using this budget for
control packets would impact transmission.

Signed-off-by: Willem de Bruijn 
---
 include/net/sock.h |  2 ++
 net/core/sock.c| 27 +++
 2 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 9ccefa5c5487..c1a8b2cbc75e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1531,6 +1531,8 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned 
long size, int force,
 gfp_t priority);
 void __sock_wfree(struct sk_buff *skb);
 void sock_wfree(struct sk_buff *skb);
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+gfp_t priority);
 void skb_orphan_partial(struct sk_buff *skb);
 void sock_rfree(struct sk_buff *skb);
 void sock_efree(struct sk_buff *skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index e7d74940e863..57a7da46ac52 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1772,6 +1772,33 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned 
long size, int force,
 }
 EXPORT_SYMBOL(sock_wmalloc);
 
+static void sock_ofree(struct sk_buff *skb)
+{
+   struct sock *sk = skb->sk;
+
+   atomic_sub(skb->truesize, &sk->sk_omem_alloc);
+}
+
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+gfp_t priority)
+{
+   struct sk_buff *skb;
+
+   /* small safe race: SKB_TRUESIZE may differ from final skb->truesize */
+   if (atomic_read(&sk->sk_omem_alloc) + SKB_TRUESIZE(size) >
+   sysctl_optmem_max)
+   return NULL;
+
+   skb = alloc_skb(size, priority);
+   if (!skb)
+   return NULL;
+
+   atomic_add(skb->truesize, &sk->sk_omem_alloc);
+   skb->sk = sk;
+   skb->destructor = sock_ofree;
+   return skb;
+}
+
 /*
  * Allocate a memory block from the socket's option memory buffer.
  */
-- 
2.11.0.483.g087da7b7c-goog



[PATCH RFC v2 08/12] tcp: enable sendmsg zerocopy

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

Enable support for MSG_ZEROCOPY in the TCP stack. Data that is
sent to a remote host is transmitted without copying. TSO and GSO are
supported.

Tested:
  A 10x TCP_STREAM between two hosts showed a reduction in netserver
  process cycles by up to 70%, depending on packet size. Systemwide,
  savings are of course much less pronounced, at up to 20% best case.

  loopback test snd_zerocopy_lo -t -z produced:

  without zerocopy (-t):
rx=102852 (6418 MB) tx=102852 txc=0
rx=213216 (13305 MB) tx=213216 txc=0
rx=325266 (20298 MB) tx=325266 txc=0
rx=437082 (27275 MB) tx=437082 txc=0

  with zerocopy (-t -z):
rx=238446 (14880 MB) tx=238446 txc=238434
rx=500076 (31207 MB) tx=500076 txc=500060
rx=763728 (47660 MB) tx=763728 txc=763706
rx=1028184 (64163 MB) tx=1028184 txc=1028156

  This test opens a pair of local sockets; one calls sendmsg with
  64KB and optionally MSG_ZEROCOPY, and the other reads the initial
  bytes. The receiver truncates, so this is strictly an upper bound on
  what is achievable. It is more representative of sending data out of
  a physical NIC (when the payload is not touched, either).

Signed-off-by: Willem de Bruijn 
---
 net/ipv4/tcp.c | 37 ++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index da385ae997a3..4884f4ff14d2 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1051,13 +1051,17 @@ static int linear_payload_sz(bool first_skb)
return 0;
 }
 
-static int select_size(const struct sock *sk, bool sg, bool first_skb)
+static int select_size(const struct sock *sk, bool sg, bool first_skb,
+  bool zerocopy)
 {
const struct tcp_sock *tp = tcp_sk(sk);
int tmp = tp->mss_cache;
 
if (sg) {
if (sk_can_gso(sk)) {
+   if (zerocopy)
+   return 0;
+
tmp = linear_payload_sz(first_skb);
} else {
int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER);
@@ -1121,6 +1125,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
struct tcp_sock *tp = tcp_sk(sk);
struct sk_buff *skb;
struct sockcm_cookie sockc;
+   struct ubuf_info *uarg = NULL;
int flags, err, copied = 0;
int mss_now = 0, size_goal, copied_syn = 0;
bool process_backlog = false;
@@ -1190,6 +1195,21 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
 
sg = !!(sk->sk_route_caps & NETIF_F_SG);
 
+   if (sg && (flags & MSG_ZEROCOPY) && size && !uarg) {
+   skb = tcp_send_head(sk) ? tcp_write_queue_tail(sk) : NULL;
+   uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
+   if (!uarg) {
+   if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
+   goto out_err;
+   uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb));
+   if (!uarg) {
+   err = -ENOBUFS;
+   goto out_err;
+   }
+   }
+   sock_zerocopy_get(uarg);
+   }
+
while (msg_data_left(msg)) {
int copy = 0;
int max = size_goal;
@@ -1217,7 +1237,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
}
first_skb = skb_queue_empty(>sk_write_queue);
skb = sk_stream_alloc_skb(sk,
- select_size(sk, sg, first_skb),
+ select_size(sk, sg, first_skb, uarg),
  sk->sk_allocation,
  first_skb);
if (!skb)
@@ -1253,7 +1273,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
err = skb_add_data_nocache(sk, skb, &msg->msg_iter, copy);
if (err)
goto do_fault;
-   } else {
+   } else if (!uarg) {
bool merge = true;
int i = skb_shinfo(skb)->nr_frags;
struct page_frag *pfrag = sk_page_frag(sk);
@@ -1291,6 +1311,15 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, 
size_t size)
page_ref_inc(pfrag->page);
}
pfrag->offset += copy;
+   } else {
+   err = skb_zerocopy_add_frags_iter(sk, skb,
+ &msg->msg_iter,
+ copy, uarg);
+   if (err == -EMSGSIZE || err == -EEXIST)
+

[PATCH RFC v2 03/12] sock: add generic socket zerocopy

2017-02-22 Thread Willem de Bruijn
From: Willem de Bruijn 

The kernel supports zerocopy sendmsg in virtio and tap. Expand the
infrastructure to support other socket types. Introduce a completion
notification channel over the socket error queue. Notifications are
returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid
blocking the send/recv path on receiving notifications.

Add reference counting, to support the skb split, merge, resize and
clone operations possible with SOCK_STREAM and other socket types.

The patch does not yet modify any datapaths.

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h|  46 
 include/linux/socket.h|   1 +
 include/net/sock.h|   2 +
 include/uapi/linux/errqueue.h |   1 +
 net/core/datagram.c   |  35 
 net/core/skbuff.c | 120 ++
 net/core/sock.c   |   2 +
 7 files changed, 196 insertions(+), 11 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 69ccd2636911..c99538b258c9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -390,6 +390,7 @@ enum {
SKBTX_SCHED_TSTAMP = 1 << 6,
 };
 
+#define SKBTX_ZEROCOPY_FRAG(SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG)
 #define SKBTX_ANY_SW_TSTAMP(SKBTX_SW_TSTAMP| \
 SKBTX_SCHED_TSTAMP)
 #define SKBTX_ANY_TSTAMP   (SKBTX_HW_TSTAMP | SKBTX_ANY_SW_TSTAMP)
@@ -406,8 +407,27 @@ struct ubuf_info {
void (*callback)(struct ubuf_info *, bool zerocopy_success);
void *ctx;
unsigned long desc;
+   atomic_t refcnt;
 };
 
+#define skb_uarg(SKB)  ((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg))
+
+struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size);
+
+static inline void sock_zerocopy_get(struct ubuf_info *uarg)
+{
+   atomic_inc(&uarg->refcnt);
+}
+
+void sock_zerocopy_put(struct ubuf_info *uarg);
+
+void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
+
+bool skb_zerocopy_alloc(struct sk_buff *skb, size_t size);
+int skb_zerocopy_add_frags_iter(struct sock *sk, struct sk_buff *skb,
+   struct iov_iter *iter, int len,
+   struct ubuf_info *uarg);
+
 /* This data is invariant across clones and lives at
  * the end of the header data, ie. at skb->end.
  */
@@ -1230,6 +1250,32 @@ static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
	return &skb_shinfo(skb)->hwtstamps;
 }
 
+static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb)
+{
+   bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY;
+
+   return is_zcopy ? skb_uarg(skb) : NULL;
+}
+
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+{
+   if (uarg) {
+   sock_zerocopy_get(uarg);
+   skb_shinfo(skb)->destructor_arg = uarg;
+   skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
+   }
+}
+
+static inline void skb_zcopy_clear(struct sk_buff *skb)
+{
+   struct ubuf_info *uarg = skb_zcopy(skb);
+
+   if (uarg) {
+   sock_zerocopy_put(uarg);
+   skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY;
+   }
+}
+
 /**
  * skb_queue_empty - check if a queue is empty
  * @list: queue head
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 082027457825..c2d6ec354bee 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -287,6 +287,7 @@ struct ucred {
#define MSG_BATCH	0x40000	/* sendmmsg(): more messages coming */
#define MSG_EOF		MSG_FIN

+#define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
#define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
#define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exec for file
					   descriptor received through
diff --git a/include/net/sock.h b/include/net/sock.h
index c1a8b2cbc75e..74ad7d7c5eed 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -288,6 +288,7 @@ struct sock_common {
  *	@sk_stamp: time stamp of last packet received
  *	@sk_tsflags: SO_TIMESTAMPING socket options
  *	@sk_tskey: counter to disambiguate concurrent tstamp requests
+  *	@sk_zckey: counter to order MSG_ZEROCOPY notifications
  *	@sk_socket: Identd and reporting IO signals
  *	@sk_user_data: RPC layer private data
  *	@sk_frag: cached page frag
@@ -455,6 +456,7 @@ struct sock {
u16 sk_tsflags;
u8  sk_shutdown;
u32 sk_tskey;
+	atomic_t		sk_zckey;
struct socket   *sk_socket;
void*sk_user_data;
 #ifdef CONFIG_SECURITY
diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h
index 07bdce1f444a..0f15a77c9e39 100644
--- a/include/uapi/linux/errqueue.h
+++ b/include/uapi/linux/errqueue.h
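
(Editorial sketch, not from the patch: how the new reference count is
meant to pair up across skbs, using the helpers declared above; the
callback firing on the last put is the intended design.)

#include <linux/skbuff.h>
#include <net/sock.h>

static void zerocopy_ref_example(struct sock *sk, struct sk_buff *skb1,
				 struct sk_buff *skb2, size_t size)
{
	struct ubuf_info *uarg = sock_zerocopy_alloc(sk, size);

	if (!uarg)			/* allocation failed */
		return;
	/* refcnt == 1 after alloc */

	skb_zcopy_set(skb1, uarg);	/* refcnt == 2 */
	skb_zcopy_set(skb2, uarg);	/* e.g. after a clone: refcnt == 3 */

	sock_zerocopy_put(uarg);	/* drop the allocation reference */
	skb_zcopy_clear(skb1);
	skb_zcopy_clear(skb2);		/* last ref: sock_zerocopy_callback()
					 * runs and queues the notification */
}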

[PATCH net V2 5/5] net/mlx4_en: Use __skb_fill_page_desc()

2017-02-22 Thread Tariq Toukan
From: Eric Dumazet 

Use __skb_fill_page_desc() so that the page's pfmemalloc state is
propagated to the skb; otherwise we might miss the fact that a page was
allocated from memory reserves.
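
(Editorial context: __skb_fill_page_desc() matters here because it
propagates the page's pfmemalloc state to the skb, which the open-coded
assignments removed in the diff below skipped.  Simplified from
include/linux/skbuff.h of this era:)

static inline void __skb_fill_page_desc(struct sk_buff *skb, int i,
					struct page *page, int off, int size)
{
	skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

	frag->page.p	  = page;
	frag->page_offset = off;
	skb_frag_size_set(frag, size);

	/* this is the part the open-coded version missed */
	page = compound_head(page);
	if (page_is_pfmemalloc(page))
		skb->pfmemalloc = true;
}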

Fixes: dceeab0e5258 ("mlx4: support __GFP_MEMALLOC for rx")
Signed-off-by: Eric Dumazet 
Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index cc003fdf0ed9..eca31f443909 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -603,10 +603,10 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv,
dma_sync_single_for_cpu(priv->ddev, dma, frag_info->frag_size,
DMA_FROM_DEVICE);
 
-   /* Save page reference in skb */
-	__skb_frag_set_page(&skb_frags_rx[nr], frags[nr].page);
-	skb_frag_size_set(&skb_frags_rx[nr], frag_info->frag_size);
-   skb_frags_rx[nr].page_offset = frags[nr].page_offset;
+   __skb_fill_page_desc(skb, nr, frags[nr].page,
+frags[nr].page_offset,
+frag_info->frag_size);
+
skb->truesize += frag_info->frag_stride;
frags[nr].page = NULL;
}
-- 
1.8.3.1



[PATCH net V2 1/5] net/mlx4: Change ENOTSUPP to EOPNOTSUPP

2017-02-22 Thread Tariq Toukan
From: Or Gerlitz 

As ENOTSUPP is specific to NFS, change the return error value to
EOPNOTSUPP in various places in the mlx4 driver.
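
(Editorial note: ENOTSUPP is a kernel-internal value (524) with no
userspace definition, so callers see an unknown error; EOPNOTSUPP is the
standard errno 95.  A minimal userspace illustration of the difference:)

#include <stdio.h>
#include <string.h>

int main(void)
{
	printf("%s\n", strerror(524));	/* ENOTSUPP: "Unknown error 524" */
	printf("%s\n", strerror(95));	/* EOPNOTSUPP: "Operation not supported" */
	return 0;
}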

Signed-off-by: Or Gerlitz 
Suggested-by: Yotam Gigi 
Reviewed-by: Matan Barak 
Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c| 2 +-
 drivers/net/ethernet/mellanox/mlx4/fw.c   | 2 +-
 drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +-
 drivers/net/ethernet/mellanox/mlx4/main.c | 6 +++---
 drivers/net/ethernet/mellanox/mlx4/mr.c   | 2 +-
 drivers/net/ethernet/mellanox/mlx4/qp.c   | 2 +-
 drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 2 +-
 7 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c 
b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
index b04760a5034b..1dae8e40fb25 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c
@@ -319,7 +319,7 @@ static int mlx4_en_ets_validate(struct mlx4_en_priv *priv, struct ieee_ets *ets)
default:
en_err(priv, "TC[%d]: Not supported TSA: %d\n",
i, ets->tc_tsa[i]);
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
}
}
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c 
b/drivers/net/ethernet/mellanox/mlx4/fw.c
index 84bab9f0732e..34a0c24e6844 100644
--- a/drivers/net/ethernet/mellanox/mlx4/fw.c
+++ b/drivers/net/ethernet/mellanox/mlx4/fw.c
@@ -2436,7 +2436,7 @@ int mlx4_config_dev_retrieval(struct mlx4_dev *dev,
 #define CONFIG_DEV_RX_CSUM_MODE_PORT2_BIT_OFFSET   4
 
if (!(dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_CONFIG_DEV))
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 
	err = mlx4_CONFIG_DEV_get(dev, &config_dev);
if (err)
diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c 
b/drivers/net/ethernet/mellanox/mlx4/intf.c
index 8258d08acd8c..e00f627331cb 100644
--- a/drivers/net/ethernet/mellanox/mlx4/intf.c
+++ b/drivers/net/ethernet/mellanox/mlx4/intf.c
@@ -136,7 +136,7 @@ int mlx4_do_bond(struct mlx4_dev *dev, bool enable)
LIST_HEAD(bond_list);
 
if (!(dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_PORT_REMAP))
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 
ret = mlx4_disable_rx_port_check(dev, enable);
if (ret) {
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index bffa6f345f2f..55e4be51ee5a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -1447,7 +1447,7 @@ int mlx4_port_map_set(struct mlx4_dev *dev, struct mlx4_port_map *v2p)
int err;
 
if (!(dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_PORT_REMAP))
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 
	mutex_lock(&priv->bond_mutex);
 
@@ -1884,7 +1884,7 @@ int mlx4_get_internal_clock_params(struct mlx4_dev *dev,
struct mlx4_priv *priv = mlx4_priv(dev);
 
if (mlx4_is_slave(dev))
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 
if (!params)
return -EINVAL;
@@ -2384,7 +2384,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev)
 
/* Query CONFIG_DEV parameters */
	err = mlx4_config_dev_retrieval(dev, &params);
-   if (err && err != -ENOTSUPP) {
+   if (err && err != -EOPNOTSUPP) {
mlx4_err(dev, "Failed to query CONFIG_DEV parameters\n");
} else if (!err) {
		dev->caps.rx_checksum_flags_port[1] = params.rx_csum_flags_port_1;
diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c 
b/drivers/net/ethernet/mellanox/mlx4/mr.c
index 395b5463cfd9..db65f72879e9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx4/mr.c
@@ -823,7 +823,7 @@ int mlx4_mw_alloc(struct mlx4_dev *dev, u32 pd, enum mlx4_mw_type type,
 !(dev->caps.flags & MLX4_DEV_CAP_FLAG_MEM_WINDOW)) ||
 (type == MLX4_MW_TYPE_2 &&
 !(dev->caps.bmme_flags & MLX4_BMME_FLAG_TYPE_2_WIN)))
-   return -ENOTSUPP;
+   return -EOPNOTSUPP;
 
index = mlx4_mpt_reserve(dev);
if (index == -1)
diff --git a/drivers/net/ethernet/mellanox/mlx4/qp.c 
b/drivers/net/ethernet/mellanox/mlx4/qp.c
index d1cd9c32a9ae..2d6abd4662b1 100644
--- a/drivers/net/ethernet/mellanox/mlx4/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx4/qp.c
@@ -447,7 +447,7 @@ int mlx4_update_qp(struct mlx4_dev *dev, u32 qpn,
  & MLX4_DEV_CAP_FLAG2_UPDATE_QP_SRC_CHECK_LB)) {
		mlx4_warn(dev,
			  "Trying to set src check LB, but it isn't supported\n");
-		err = -ENOTSUPP;
+		err = -EOPNOTSUPP;

[PATCH net V2 4/5] net/mlx4_core: Use cq quota in SRIOV when creating completion EQs

2017-02-22 Thread Tariq Toukan
From: Jack Morgenstein 

When creating EQs to handle CQ completion events for the PF
or for VFs, we create enough EQE entries to handle completions
for the max number of CQs that can use that EQ.

When SRIOV is activated, the max number of CQs a VF (or the PF) can
obtain is its CQ quota (determined by the Hypervisor resource tracker).
Therefore, when creating an EQ, the number of EQE entries that the VF
should request for that EQ is the CQ quota value (and not the total
number of CQs available in the FW).

Under SRIOV, the PF also must use its CQ quota, because
the resource tracker also controls how many CQs the PF can obtain.

Using the FW total number of CQs instead of the CQ quota when creating
EQs resulted in wasted MTT entries, due to allocating more EQEs than were
needed.
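
As an illustration (hypothetical numbers): if the FW reports tens of
thousands of CQs in total but the resource tracker grants a function a
quota of only 128, the fixed code sizes that function's completion EQs
for 128 + MLX4_NUM_SPARE_EQE EQEs rather than the FW total, with a
correspondingly smaller MTT allocation.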

Fixes: 5a0d0a6161ae ("mlx4: Structures and init/teardown for VF resource quotas")
Signed-off-by: Jack Morgenstein 
Reported-by: Dexuan Cui 
Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/eq.c   | 5 ++---
 drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c 
b/drivers/net/ethernet/mellanox/mlx4/eq.c
index 0509996957d9..232f46db0dce 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -1256,9 +1256,8 @@ int mlx4_init_eq_table(struct mlx4_dev *dev)
mlx4_warn(dev, "Failed adding irq 
rmap\n");
}
 #endif
-   err = mlx4_create_eq(dev, dev->caps.num_cqs -
- dev->caps.reserved_cqs +
- MLX4_NUM_SPARE_EQE,
+   err = mlx4_create_eq(dev, dev->quotas.cq +
+MLX4_NUM_SPARE_EQE,
 (dev->flags & MLX4_FLAG_MSI_X) ?
 i + 1 - !!(i > MLX4_EQ_ASYNC) : 0,
 eq);
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index 7a030d10ff3e..094cfd8a1a18 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -3501,6 +3501,8 @@ static int mlx4_load_one(struct pci_dev *pdev, int pci_dev_data,
goto err_disable_msix;
}
 
+   mlx4_init_quotas(dev);
+
err = mlx4_setup_hca(dev);
if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X) &&
!mlx4_is_mfunc(dev)) {
@@ -3513,7 +3515,6 @@ static int mlx4_load_one(struct pci_dev *pdev, int pci_dev_data,
if (err)
goto err_steer;
 
-   mlx4_init_quotas(dev);
/* When PF resources are ready arm its comm channel to enable
 * getting commands
 */
-- 
1.8.3.1



[PATCH net V2 2/5] net/mlx4: Spoofcheck and zero MAC can't coexist

2017-02-22 Thread Tariq Toukan
From: Eugenia Emantayev 

Spoofcheck can't be enabled if the VF MAC is zero.
Vice versa, the MAC can't be zeroed while spoofcheck is on.
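
For example (using iproute2's standard VF options; the device name is
hypothetical): "ip link set dev eth2 vf 0 mac 52:54:00:aa:bb:cc" must be
done before "ip link set dev eth2 vf 0 spoofchk on" can succeed, and once
spoofcheck is on, "ip link set dev eth2 vf 0 mac 00:00:00:00:00:00" is
rejected with EPERM.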

Fixes: 8f7ba3ca12f6 ('net/mlx4: Add set VF mac address support')
Signed-off-by: Eugenia Emantayev 
Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c   | 22 --
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  6 +-
 include/linux/mlx4/cmd.h   |  2 +-
 include/linux/mlx4/driver.h| 10 ++
 4 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index a49072b4fa52..e8c105164931 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include <linux/etherdevice.h>
 
 #include 
 
@@ -2955,7 +2956,7 @@ static bool mlx4_valid_vf_state_change(struct mlx4_dev *dev, int port,
return false;
 }
 
-int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u64 mac)
+int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u8 *mac)
 {
struct mlx4_priv *priv = mlx4_priv(dev);
struct mlx4_vport_state *s_info;
@@ -2964,13 +2965,22 @@ int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u64 mac)
if (!mlx4_is_master(dev))
return -EPROTONOSUPPORT;
 
+   if (is_multicast_ether_addr(mac))
+   return -EINVAL;
+
slave = mlx4_get_slave_indx(dev, vf);
if (slave < 0)
return -EINVAL;
 
port = mlx4_slaves_closest_port(dev, slave, port);
	s_info = &priv->mfunc.master.vf_admin[slave].vport[port];
-   s_info->mac = mac;
+
+   if (s_info->spoofchk && is_zero_ether_addr(mac)) {
+   mlx4_info(dev, "MAC invalidation is not allowed when spoofchk 
is on\n");
+   return -EPERM;
+   }
+
+   s_info->mac = mlx4_mac_to_u64(mac);
mlx4_info(dev, "default mac on vf %d port %d to %llX will take effect 
only after vf restart\n",
  vf, port, s_info->mac);
return 0;
@@ -3143,6 +3153,7 @@ int mlx4_set_vf_spoofchk(struct mlx4_dev *dev, int port, int vf, bool setting)
struct mlx4_priv *priv = mlx4_priv(dev);
struct mlx4_vport_state *s_info;
int slave;
+   u8 mac[ETH_ALEN];
 
if ((!mlx4_is_master(dev)) ||
!(dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_FSM))
@@ -3154,6 +3165,13 @@ int mlx4_set_vf_spoofchk(struct mlx4_dev *dev, int port, int vf, bool setting)
 
port = mlx4_slaves_closest_port(dev, slave, port);
	s_info = &priv->mfunc.master.vf_admin[slave].vport[port];
+
+   mlx4_u64_to_mac(mac, s_info->mac);
+   if (setting && !is_valid_ether_addr(mac)) {
+   mlx4_info(dev, "Illegal MAC with spoofchk\n");
+   return -EPERM;
+   }
+
s_info->spoofchk = setting;
 
return 0;
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c 
b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 3b4961a8e8e4..9a86dd397315 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2475,12 +2475,8 @@ static int mlx4_en_set_vf_mac(struct net_device *dev, int queue, u8 *mac)
 {
struct mlx4_en_priv *en_priv = netdev_priv(dev);
struct mlx4_en_dev *mdev = en_priv->mdev;
-   u64 mac_u64 = mlx4_mac_to_u64(mac);
 
-   if (is_multicast_ether_addr(mac))
-   return -EINVAL;
-
-   return mlx4_set_vf_mac(mdev->dev, en_priv->port, queue, mac_u64);
+   return mlx4_set_vf_mac(mdev->dev, en_priv->port, queue, mac);
 }
 
static int mlx4_en_set_vf_vlan(struct net_device *dev, int vf, u16 vlan, u8 qos,
diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h
index 1f3568694a57..7b74afcbbab2 100644
--- a/include/linux/mlx4/cmd.h
+++ b/include/linux/mlx4/cmd.h
@@ -308,7 +308,7 @@ int mlx4_get_counter_stats(struct mlx4_dev *dev, int counter_index,
 int mlx4_get_vf_stats(struct mlx4_dev *dev, int port, int vf_idx,
  struct ifla_vf_stats *vf_stats);
 u32 mlx4_comm_get_version(void);
-int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u64 mac);
+int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u8 *mac);
 int mlx4_set_vf_vlan(struct mlx4_dev *dev, int port, int vf, u16 vlan,
 u8 qos, __be16 proto);
 int mlx4_set_vf_rate(struct mlx4_dev *dev, int port, int vf, int min_tx_rate,
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index bd0e7075ea6d..e965e5090d96 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -104,4 +104,14 @@ static inline u64 mlx4_mac_to_u64(u8 *addr)
return mac;
 }
 
+static inline void mlx4_u64_to_mac(u8 *addr, u64 mac)
+{
+   int i;
+
+	for (i = ETH_ALEN; i > 0; i--) {
+		addr[i - 1] = mac & 0xff;
+		mac >>= 8;
+	}
+}

[PATCH net V2 0/5] mlx4 misc fixes

2017-02-22 Thread Tariq Toukan
Hi Dave,

This patchset contains misc bug fixes from Eric Dumazet and our team
to the mlx4 Core and Eth drivers.

Series generated against net commit:
00ea1ceebe0d ipv6: release dst on error in ip6_dst_lookup_tail

Thanks,
Tariq.

v2:
* Added Eric's fix (patch 5/5).

Eric Dumazet (1):
  net/mlx4_en: Use __skb_fill_page_desc()

Eugenia Emantayev (1):
  net/mlx4: Spoofcheck and zero MAC can't coexist

Jack Morgenstein (1):
  net/mlx4_core: Use cq quota in SRIOV when creating completion EQs

Majd Dibbiny (1):
  net/mlx4_core: Fix VF overwrite of module param which disables DMFS on
new probed PFs

Or Gerlitz (1):
  net/mlx4: Change ENOTSUPP to EOPNOTSUPP

 drivers/net/ethernet/mellanox/mlx4/cmd.c   | 22 --
 drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_netdev.c |  6 +-
 drivers/net/ethernet/mellanox/mlx4/en_rx.c |  8 
 drivers/net/ethernet/mellanox/mlx4/eq.c|  5 ++---
 drivers/net/ethernet/mellanox/mlx4/fw.c|  2 +-
 drivers/net/ethernet/mellanox/mlx4/intf.c  |  2 +-
 drivers/net/ethernet/mellanox/mlx4/main.c  | 11 +--
 drivers/net/ethernet/mellanox/mlx4/mr.c|  2 +-
 drivers/net/ethernet/mellanox/mlx4/qp.c|  2 +-
 .../net/ethernet/mellanox/mlx4/resource_tracker.c  |  2 +-
 include/linux/mlx4/cmd.h   |  2 +-
 include/linux/mlx4/driver.h| 10 ++
 13 files changed, 49 insertions(+), 27 deletions(-)

-- 
1.8.3.1



[PATCH net V2 3/5] net/mlx4_core: Fix VF overwrite of module param which disables DMFS on new probed PFs

2017-02-22 Thread Tariq Toukan
From: Majd Dibbiny 

In the VF driver, module parameter mlx4_log_num_mgm_entry_size was
mistakenly overwritten -- and in a manner which overrode the
device-managed flow steering option encoded in the parameter.

log_num_mgm_entry_size is a global module parameter which
affects all ConnectX-3 PFs installed on that host.
If a VF changes log_num_mgm_entry_size, this will affect all PFs
which are probed subsequent to the change (by disabling DMFS for
those PFs).
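
For illustration: an administrator who loaded mlx4_core with a
log_num_mgm_entry_size value requesting device-managed flow steering
would, before this fix, silently lose DMFS on any PF probed after a VF
probe had rewritten the global parameter.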

Fixes: 3c439b5586e9 ("mlx4_core: Allow choosing flow steering mode")
Signed-off-by: Majd Dibbiny 
Reviewed-by: Jack Morgenstein 
Signed-off-by: Tariq Toukan 
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c 
b/drivers/net/ethernet/mellanox/mlx4/main.c
index 55e4be51ee5a..7a030d10ff3e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -841,8 +841,6 @@ static int mlx4_slave_cap(struct mlx4_dev *dev)
return -ENOSYS;
}
 
-   mlx4_log_num_mgm_entry_size = hca_param.log_mc_entry_sz;
-
dev->caps.hca_core_clock = hca_param.hca_core_clock;
 
memset(_cap, 0, sizeof(dev_cap));
-- 
1.8.3.1



Re: [PATCH net 0/4] mlx4 misc fixes

2017-02-22 Thread Tariq Toukan



On 22/02/2017 2:33 PM, Tariq Toukan wrote:

Hi Dave,

This patchset contains misc bug fixes from the team
to the mlx4 Core and Eth drivers.

Series generated against net commit:
00ea1ceebe0d ipv6: release dst on error in ip6_dst_lookup_tail

Thanks,
Tariq.



Please ignore this one.
I am submitting V2 with an additional patch.

Thanks,
Tariq


Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

2017-02-22 Thread Eric Dumazet
On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
> Use of order-3 pages is problematic in some cases.
> 
> This patch might add three kinds of regression :
> 
> 1) a CPU performance regression, but we will add later page
> recycling and performance should be back.
> 
> 2) TCP receiver could grow its receive window slightly slower,
>because skb->len/skb->truesize ratio will decrease.
>This is mostly ok, we prefer being conservative to not risk OOM,
>and eventually tune TCP better in the future.
>This is consistent with other drivers using 2048 per ethernet frame.
> 
> 3) Because we allocate one page per RX slot, we consume more
>memory for the ring buffers. XDP already had this constraint anyway.
> 
> Signed-off-by: Eric Dumazet 
> ---

Note that we also could use a different strategy.

Assume RX rings of 4096 entries/slots.

With this patch, mlx4 gets the strategy used by Alexander in Intel
drivers : 

Each RX slot has an allocated page, and uses half of it, flipping to the
other half every time the slot is used.

So a ring buffer of 4096 slots allocates 4096 pages.

When we receive a packet train for the same flow, GRO builds an skb with
~45 page frags, all from different pages.

The put_page() done from skb_release_data() touches ~45 different struct
page cache lines, and shows a high cost (compared to the order-3 pages
used today by mlx4, this adds extra cache line misses and stalls for the
consumer).

If we instead try to use the two halves of one page on consecutive RX
slots, we might instead cook skb with the same number of MSS (45), but
half the number of cache lines for put_page(), so we should speed up the
consumer.

This means the number of active pages would be minimal, especially on
PowerPC. Pages that have been used by X=2 received frags would be put in
a quarantine (size to be determined).
On PowerPC, X would be PAGE_SIZE/frag_size


This strategy would consume less memory on PowerPC:
65536/1536 = 42, so a 4096-slot RX ring would need 98 active pages
instead of 4096.

The quarantine would be sized to increase chances of reusing an old
page, without consuming too much memory.

Probably roundup_pow_of_two(rx_ring_size / (PAGE_SIZE/frag_size))

x86 would still use 4096 pages, but PowerPC would use 98+128 pages
instead of 4096 (14 MBytes instead of 256 MBytes).
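
A small editorial sketch of the arithmetic above; the function and its
parameters are illustrative, not mlx4 fields:

#include <linux/kernel.h>
#include <linux/log2.h>
#include <linux/mm.h>

/* Editorial sketch: pages needed when consecutive RX slots share a page. */
static unsigned int rx_pages_needed(unsigned int ring_size,
				    unsigned int frag_size)
{
	unsigned int frags_per_page = PAGE_SIZE / frag_size;	/* 65536/1536 = 42 on ppc64 */
	unsigned int active = DIV_ROUND_UP(ring_size, frags_per_page);	/* 4096/42 -> 98 */
	unsigned int quarantine = roundup_pow_of_two(ring_size / frags_per_page); /* -> 128 */

	return active + quarantine;	/* 226 pages (~14 MB) vs 4096 (256 MB) */
}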





[PATCH net 2/6] net/mlx5e: Register/unregister vport representors on interface attach/detach

2017-02-22 Thread Saeed Mahameed
Currently vport representors are added only on driver load and removed on
driver unload.  Apparently we forgot to handle them when we added the
seamless reset flow feature.  This left the representor netdevs alive and
active, with open HW resources, on pci shutdown and on error reset flows.

To overcome this, we move their handling to interface attach/detach, so
they are cleaned up on shutdown and recreated on reset flows.
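
With this change the lifecycle is symmetric: mlx5e_attach() registers the
representors once the netdev resources are ready, and mlx5e_detach()
unregisters them before teardown, as the diff below shows.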

Fixes: 26e59d8077a3 ("net/mlx5e: Implement mlx5e interface attach/detach callbacks")
Signed-off-by: Saeed Mahameed 
Reviewed-by: Hadar Hen Zion 
Reviewed-by: Roi Dayan 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 3cce6281e075..c24366868b39 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3970,6 +3970,19 @@ static void mlx5e_register_vport_rep(struct mlx5_core_dev *mdev)
}
 }
 
+static void mlx5e_unregister_vport_rep(struct mlx5_core_dev *mdev)
+{
+   struct mlx5_eswitch *esw = mdev->priv.eswitch;
+   int total_vfs = MLX5_TOTAL_VPORTS(mdev);
+   int vport;
+
+   if (!MLX5_CAP_GEN(mdev, vport_group_manager))
+   return;
+
+   for (vport = 1; vport < total_vfs; vport++)
+   mlx5_eswitch_unregister_vport_rep(esw, vport);
+}
+
 void mlx5e_detach_netdev(struct mlx5_core_dev *mdev, struct net_device *netdev)
 {
struct mlx5e_priv *priv = netdev_priv(netdev);
@@ -4016,6 +4029,7 @@ static int mlx5e_attach(struct mlx5_core_dev *mdev, void *vpriv)
return err;
}
 
+   mlx5e_register_vport_rep(mdev);
return 0;
 }
 
@@ -4027,6 +4041,7 @@ static void mlx5e_detach(struct mlx5_core_dev *mdev, void *vpriv)
if (!netif_device_present(netdev))
return;
 
+   mlx5e_unregister_vport_rep(mdev);
mlx5e_detach_netdev(mdev, netdev);
mlx5e_destroy_mdev_resources(mdev);
 }
@@ -4045,8 +4060,6 @@ static void *mlx5e_add(struct mlx5_core_dev *mdev)
if (err)
return NULL;
 
-   mlx5e_register_vport_rep(mdev);
-
if (MLX5_CAP_GEN(mdev, vport_group_manager))
		ppriv = &esw->offloads.vport_reps[0];
 
@@ -4098,13 +4111,7 @@ void mlx5e_destroy_netdev(struct mlx5_core_dev *mdev, struct mlx5e_priv *priv)
 
 static void mlx5e_remove(struct mlx5_core_dev *mdev, void *vpriv)
 {
-   struct mlx5_eswitch *esw = mdev->priv.eswitch;
-   int total_vfs = MLX5_TOTAL_VPORTS(mdev);
struct mlx5e_priv *priv = vpriv;
-   int vport;
-
-   for (vport = 1; vport < total_vfs; vport++)
-   mlx5_eswitch_unregister_vport_rep(esw, vport);
 
unregister_netdev(priv->netdev);
mlx5e_detach(mdev, vpriv);
-- 
2.11.0



[PATCH net 1/6] net/mlx5e: s390 system compilation fix

2017-02-22 Thread Saeed Mahameed
From: Mohamad Haj Yahia 

Add necessary headers include for s390 arch compilation.

Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files")
Fixes: d605d6686dc7 ("net/mlx5e: Add support for ethtool self..")
Signed-off-by: Mohamad Haj Yahia 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b039b87742a6..9fad22768aab 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -30,6 +30,7 @@
  * SOFTWARE.
  */
 
+#include <linux/prefetch.h>
 #include 
 #include 
 #include 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
index 65442c36a6e1..31e3cb7ee5fe 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c
@@ -30,6 +30,7 @@
  * SOFTWARE.
  */
 
+#include <linux/prefetch.h>
 #include 
 #include 
 #include 
-- 
2.11.0



[PATCH net 5/6] net/mlx5e: Update MPWQE stride size when modifying CQE compress state

2017-02-22 Thread Saeed Mahameed
When the admin enables/disables cqe compression, updating
mpwqe stride size is required:
CQE compress ON  ==> stride size = 256B
CQE compress OFF ==> stride size = 64B

This is already done on driver load via mlx5e_set_rq_type_params; all we
need is to call it on any admin change of the cqe compression state via
priv flags, or when changing the timestamping state (which is mutually
exclusive with cqe compression).

This bug introduces no functional damage; it only makes cqe compression
occur less often, since on ConnectX-4 Lx CQE compression is performed
only on packets smaller than the stride size.
Tested:
 ethtool --set-priv-flags ethxx rx_cqe_compress on
 pktgen with  64 < pkt size < 256 and netperf TCP_STREAM (IPv4/IPv6)
 verify `ethtool -S ethxx | grep compress` are advancing more often
 (rapidly)

Fixes: 7219ab34f184 ("net/mlx5e: CQE compression")
Signed-off-by: Saeed Mahameed 
Reviewed-by: Tariq Toukan 
Cc: kernel-t...@fb.com
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 1 +
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c| 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c  | 1 +
 4 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 95ca03c0d9f5..f6a6ded204f6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -816,6 +816,7 @@ int mlx5e_get_max_linkspeed(struct mlx5_core_dev *mdev, u32 *speed);
 
 void mlx5e_set_rx_cq_mode_params(struct mlx5e_params *params,
 u8 cq_period_mode);
+void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type);
 
 static inline void mlx5e_tx_notify_hw(struct mlx5e_sq *sq,
  struct mlx5_wqe_ctrl_seg *ctrl, int bf_sz)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
index cc80522b5854..a004a5a1a4c2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
@@ -1487,6 +1487,7 @@ static int set_pflag_rx_cqe_compress(struct net_device *netdev,
 
mlx5e_modify_rx_cqe_compression_locked(priv, enable);
priv->params.rx_cqe_compress_def = enable;
+   mlx5e_set_rq_type_params(priv, priv->params.rq_wq_type);
 
return 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index dc621bc4e173..8ef64c4db2c2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -79,7 +79,7 @@ static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
MLX5_CAP_ETH(mdev, reg_umr_sq);
 }
 
-static void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type)
+void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type)
 {
priv->params.rq_wq_type = rq_type;
priv->params.lro_wqe_sz = MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 9fad22768aab..d5ce20db3f0b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -172,6 +172,7 @@ void mlx5e_modify_rx_cqe_compression_locked(struct mlx5e_priv *priv, bool val)
mlx5e_close_locked(priv->netdev);
 
MLX5E_SET_PFLAG(priv, MLX5E_PFLAG_RX_CQE_COMPRESS, val);
+   mlx5e_set_rq_type_params(priv, priv->params.rq_wq_type);
 
if (was_opened)
mlx5e_open_locked(priv->netdev);
-- 
2.11.0


