Re: lib: Introduce priority array area manager
Hi Jiri,

On Wed, Feb 22, 2017 at 8:02 PM, Linux Kernel Mailing List wrote:
> Web:        https://git.kernel.org/torvalds/c/44091d29f2075972aede47ef17e1e70db3d51190
> Commit:     44091d29f2075972aede47ef17e1e70db3d51190
> Parent:     b862815c3ee7b49ec20a9ab25da55a5f0bcbb95e
> Refname:    refs/heads/master
> Author:     Jiri Pirko
> AuthorDate: Fri Feb 3 10:29:06 2017 +0100
> Committer:  David S. Miller
> CommitDate: Fri Feb 3 16:35:42 2017 -0500
>
>     lib: Introduce priority array area manager
>
>     This introduces an infrastructure for management of linear priority
>     areas. Priority order in an array matters, however the order of items
>     inside a priority group does not matter.
>
>     As an initial implementation, the L-sort algorithm is used. It is
>     quite trivial. A more advanced algorithm called P-sort will be
>     introduced as a follow-up. The infrastructure is prepared for other
>     algorithms.
>
>     Alongside this, a testing module is introduced as well.
>
>     Signed-off-by: Jiri Pirko
>     Signed-off-by: David S. Miller
> ---
> --- a/lib/Kconfig
> +++ b/lib/Kconfig
> @@ -550,4 +550,7 @@ config STACKDEPOT
>  config SBITMAP
>  	bool
>
> +config PARMAN
> +	tristate "parman"

    parman (PARMAN) [N/m/y] (NEW) ?

There is no help available for this option.

Can you please add a description for this option? Or drop the "parman"
string if this is always selected by its kernel users, and never intended
to be enabled by the end user.

Thanks!

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker.
But when I'm talking to journalists I just say "programmer" or something
like that.
						-- Linus Torvalds
Re: [RFC v3 01/11] IB/hfi-vnic: Virtual Network Interface Controller (VNIC) documentation
On Wed, Feb 08, 2017 at 05:00:45PM +, Bart Van Assche wrote:
> On Tue, 2017-02-07 at 12:23 -0800, Vishwanathapura, Niranjana wrote:
> Please elaborate this section. What is a virtual Ethernet switch? Is it
> a software entity or something that is implemented in hardware? Also,
> how are these independent Ethernet networks identified on the wire? The
> Linux kernel already supports IB partitions and Ethernet VLANs. How do
> these independent Ethernet networks compare to IB partitions and
> Ethernet VLANs? Which wire-level header contains the identity of these
> Ethernet networks? Is it possible to query from user space which
> Ethernet network a VNIC belongs to? If so, with which API and which
> tools?

I have added the VNIC packet format and some related information to the
documentation in the PATCH series I just sent out.

Thanks,
[PATCH 00/11] Omni-Path Virtual Network Interface Controller (VNIC)
Intel Omni-Path (OPA) Virtual Network Interface Controller (VNIC) feature
supports Ethernet functionality over Omni-Path fabric by encapsulating
the Ethernet packets between HFI nodes.

Architecture
============
The patterns of exchanges of Omni-Path encapsulated Ethernet packets
involve one or more virtual Ethernet switches overlaid on the Omni-Path
fabric topology. A subset of HFI nodes on the Omni-Path fabric are
permitted to exchange encapsulated Ethernet packets across a particular
virtual Ethernet switch. The virtual Ethernet switches are logical
abstractions achieved by configuring the HFI nodes on the fabric for
header generation and processing. In the simplest configuration all HFI
nodes across the fabric exchange encapsulated Ethernet packets over a
single virtual Ethernet switch. A virtual Ethernet switch is effectively
an independent Ethernet network. The configuration is performed by an
Ethernet Manager (EM) which is part of the trusted Fabric Manager (FM)
application. HFI nodes can have multiple VNICs, each connected to a
different virtual Ethernet switch. The diagram below presents a case
of two virtual Ethernet switches with two HFI nodes.

                           +-------------------+
                           |      Subnet/      |
                           |     Ethernet      |
                           |      Manager      |
                           +-------------------+
                              /             /
                             /             /
                            /             /
                           /             /
     +-----------------------------+  +-----------------------------+
     |  Virtual Ethernet Switch    |  |  Virtual Ethernet Switch    |
     |  +---------+   +---------+  |  |  +---------+   +---------+  |
     |  |  VPORT  |   |  VPORT  |  |  |  |  VPORT  |   |  VPORT  |  |
     +--+---------+---+---------+--+  +--+---------+---+---------+--+
            |              \                 /              |
            |               \               /               |
            |                \             /                |
            |                 \           /                 |
            |                  \         /                  |
            |                   \       /                   |
            |                    \     /                    |
            |                     \   /                     |
            |                      \ /                      |
            |                      / \                      |
            |                     /   \                     |
        +---+----+------+        /     \        +------+---+----+
        |  VNIC  | VNIC |-------+       +-------| VNIC |  VNIC  |
        +--------+------+                       +------+--------+
        |      HFI      |                       |      HFI      |
        +---------------+                       +---------------+

The Omni-Path encapsulated Ethernet packet format is as described below.

Bits          Field

Quad Word 0:
0-19          SLID (lower 20 bits)
20-30         Length (in Quad Words)
31            BECN bit
32-51         DLID (lower 20 bits)
52-56         SC (Service Class)
57-59         RC (Routing Control)
60            FECN bit
61-62         L2 (=10, 16B format)
63            LT (=1, Link Transfer Head Flit)

Quad Word 1:
0-7           L4 type (=0x78 ETHERNET)
8-11          SLID[23:20]
12-15         DLID[23:20]
16-31         PKEY
32-47         Entropy
48-63         Reserved

Quad Word 2:
0-15          Reserved
16-31         L4 header
32-63         Ethernet Packet

Quad Words 3 to N-1:
0-63          Ethernet packet (pad extended)

Quad Word N (last):
0-23          Ethernet packet (pad extended)
24-55         ICRC
56-61         Tail
62-63         LT (=01, Link Transfer Tail Flit)

The Ethernet packet is padded on the transmit side to ensure that the
VNIC OPA packet is quad word aligned. The 'Tail' field contains the
number of bytes padded. On the receive side the 'Tail' field is read
and the padding is removed (along with ICRC, Tail and OPA header)
before passing the packet up the network stack.

The L4 header field contains the id of the virtual Ethernet switch the
VNIC port belongs to. On the receive side, this field is used to
de-multiplex the received VNIC packets to different VNIC ports.

Driver Design
=============
Intel OPA VNIC software design is presented in the diagram below.
OPA VNIC functionality has a HW dependent component and a HW
independent component.

Support has been added for an IB device to allocate and free RDMA
netdev devices. The RDMA netdev supports interfacing with the network
stack, thus creating standard network interfaces. OPA_VNIC is an RDMA
netdev device type.

The HW dependent VNIC functionality is part of the HFI1 driver. It
implements the verbs to allocate and free the OPA_VNIC RDMA netdev.
It involves HW resource allocation/management for VNIC functionality.
It interfaces with the network stack and implements the required
net_device_ops functions. It expects Omni-Path encapsulated Ethernet
packets in the transmit path and provides HW access to them. It strips
the Omni-Path header from the received packets before passing them up
the network stack. It also implements the RDMA netdev control
operations.

The OPA VNIC module implements the HW independent VNIC functionality.
It consists of two parts. The VNIC Ethernet Management Agent (VEMA)
registers itself with IB core as
[PATCH 05/11] IB/opa-vnic: VNIC statistics support
OPA VNIC driver statistics support maintains various counters including
the standard netdev counters and the Ethernet manager defined counters.
Add the Ethtool hook to read the counters.

Reviewed-by: Dennis Dalessandro
Reviewed-by: Ira Weiny
Signed-off-by: Niranjana Vishwanathapura
---
 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c | 110 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h    |   4 +
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  |  20
 3 files changed, 134 insertions(+)

diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
index b74f6ad..a98948c 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
@@ -53,9 +53,119 @@
 
 #include "opa_vnic_internal.h"
 
+enum {NETDEV_STATS, VNIC_STATS};
+
+struct vnic_stats {
+	char stat_string[ETH_GSTRING_LEN];
+	struct {
+		int sizeof_stat;
+		int stat_offset;
+	};
+};
+
+#define VNIC_STAT(m)	{ FIELD_SIZEOF(struct opa_vnic_stats, m),   \
+			  offsetof(struct opa_vnic_stats, m) }
+
+static struct vnic_stats vnic_gstrings_stats[] = {
+	/* NETDEV stats */
+	{"rx_packets", VNIC_STAT(netstats.rx_packets)},
+	{"tx_packets", VNIC_STAT(netstats.tx_packets)},
+	{"rx_bytes", VNIC_STAT(netstats.rx_bytes)},
+	{"tx_bytes", VNIC_STAT(netstats.tx_bytes)},
+	{"rx_errors", VNIC_STAT(netstats.rx_errors)},
+	{"tx_errors", VNIC_STAT(netstats.tx_errors)},
+	{"rx_dropped", VNIC_STAT(netstats.rx_dropped)},
+	{"tx_dropped", VNIC_STAT(netstats.tx_dropped)},
+
+	/* SUMMARY counters */
+	{"tx_unicast", VNIC_STAT(tx_grp.unicast)},
+	{"tx_mcastbcast", VNIC_STAT(tx_grp.mcastbcast)},
+	{"tx_untagged", VNIC_STAT(tx_grp.untagged)},
+	{"tx_vlan", VNIC_STAT(tx_grp.vlan)},
+
+	{"tx_64_size", VNIC_STAT(tx_grp.s_64)},
+	{"tx_65_127", VNIC_STAT(tx_grp.s_65_127)},
+	{"tx_128_255", VNIC_STAT(tx_grp.s_128_255)},
+	{"tx_256_511", VNIC_STAT(tx_grp.s_256_511)},
+	{"tx_512_1023", VNIC_STAT(tx_grp.s_512_1023)},
+	{"tx_1024_1518", VNIC_STAT(tx_grp.s_1024_1518)},
+	{"tx_1519_max", VNIC_STAT(tx_grp.s_1519_max)},
+
+	{"rx_unicast", VNIC_STAT(rx_grp.unicast)},
+	{"rx_mcastbcast", VNIC_STAT(rx_grp.mcastbcast)},
+	{"rx_untagged", VNIC_STAT(rx_grp.untagged)},
+	{"rx_vlan", VNIC_STAT(rx_grp.vlan)},
+
+	{"rx_64_size", VNIC_STAT(rx_grp.s_64)},
+	{"rx_65_127", VNIC_STAT(rx_grp.s_65_127)},
+	{"rx_128_255", VNIC_STAT(rx_grp.s_128_255)},
+	{"rx_256_511", VNIC_STAT(rx_grp.s_256_511)},
+	{"rx_512_1023", VNIC_STAT(rx_grp.s_512_1023)},
+	{"rx_1024_1518", VNIC_STAT(rx_grp.s_1024_1518)},
+	{"rx_1519_max", VNIC_STAT(rx_grp.s_1519_max)},
+
+	/* ERROR counters */
+	{"rx_fifo_errors", VNIC_STAT(netstats.rx_fifo_errors)},
+	{"rx_length_errors", VNIC_STAT(netstats.rx_length_errors)},
+
+	{"tx_fifo_errors", VNIC_STAT(netstats.tx_fifo_errors)},
+	{"tx_carrier_errors", VNIC_STAT(netstats.tx_carrier_errors)},
+
+	{"tx_dlid_zero", VNIC_STAT(tx_dlid_zero)},
+	{"tx_drop_state", VNIC_STAT(tx_drop_state)},
+	{"rx_drop_state", VNIC_STAT(rx_drop_state)},
+	{"rx_oversize", VNIC_STAT(rx_oversize)},
+	{"rx_runt", VNIC_STAT(rx_runt)},
+};
+
+#define VNIC_STATS_LEN  ARRAY_SIZE(vnic_gstrings_stats)
+
+/* vnic_get_sset_count - get string set count */
+static int vnic_get_sset_count(struct net_device *netdev, int sset)
+{
+	return (sset == ETH_SS_STATS) ? VNIC_STATS_LEN : -EOPNOTSUPP;
+}
+
+/* vnic_get_ethtool_stats - get statistics */
+static void vnic_get_ethtool_stats(struct net_device *netdev,
+				   struct ethtool_stats *stats, u64 *data)
+{
+	struct opa_vnic_adapter *adapter = opa_vnic_priv(netdev);
+	struct opa_vnic_stats vstats;
+	int i;
+
+	memset(&vstats, 0, sizeof(vstats));
+	mutex_lock(&adapter->stats_lock);
+	adapter->rn_ops->ndo_get_stats64(netdev, &vstats.netstats);
+	for (i = 0; i < VNIC_STATS_LEN; i++) {
+		char *p = (char *)&vstats + vnic_gstrings_stats[i].stat_offset;
+
+		data[i] = (vnic_gstrings_stats[i].sizeof_stat ==
+			   sizeof(u64)) ? *(u64 *)p : *(u32 *)p;
+	}
+	mutex_unlock(&adapter->stats_lock);
+}
+
+/* vnic_get_strings - get strings */
+static void vnic_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
+{
+	int i;
+
+	if (stringset != ETH_SS_STATS)
+		return;
+
+	for (i = 0; i < VNIC_STATS_LEN; i++)
+		memcpy(data + i * ETH_GSTRING_LEN,
+		       vnic_gstrings_stats[i].stat_string,
+		       ETH_GSTRING_LEN);
+}
+
 /* ethtool ops */
 static const struct ethtool_ops
[PATCH 04/11] IB/opa-vnic: VNIC Ethernet Management (EM) structure definitions
Define VNIC EM MAD structures and the associated macros. These structures
are used for information exchange between the VNIC EM agent (EMA) on the
host and the Ethernet manager. These include the virtual Ethernet switch
(vesw) port information, vesw port mac table, summary and error counters,
vesw port interface mac lists and the EMA trap.

Reviewed-by: Dennis Dalessandro
Reviewed-by: Ira Weiny
Signed-off-by: Niranjana Vishwanathapura
Signed-off-by: Sadanand Warrier
Signed-off-by: Tanya K Jajodia
---
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h   | 423 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h    |  33 ++
 2 files changed, 456 insertions(+)

diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
index 176fca9..c025cde 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
@@ -52,6 +52,28 @@
  * and decapsulation of Ethernet packets
  */
 
+#include
+#include
+
+/* EMA class version */
+#define OPA_EMA_CLASS_VERSION		0x80
+
+/*
+ * Define the Intel vendor management class for OPA
+ * ETHERNET MANAGEMENT
+ */
+#define OPA_MGMT_CLASS_INTEL_EMA	0x34
+
+/* EM attribute IDs */
+#define OPA_EM_ATTR_CLASS_PORT_INFO		0x0001
+#define OPA_EM_ATTR_VESWPORT_INFO		0x0011
+#define OPA_EM_ATTR_VESWPORT_MAC_ENTRIES	0x0012
+#define OPA_EM_ATTR_IFACE_UCAST_MACS		0x0013
+#define OPA_EM_ATTR_IFACE_MCAST_MACS		0x0014
+#define OPA_EM_ATTR_DELETE_VESW			0x0015
+#define OPA_EM_ATTR_VESWPORT_SUMMARY_COUNTERS	0x0020
+#define OPA_EM_ATTR_VESWPORT_ERROR_COUNTERS	0x0022
+
 /* VNIC configured and operational state values */
 #define OPA_VNIC_STATE_DROP_ALL		0x1
 #define OPA_VNIC_STATE_FORWARDING	0x3
@@ -59,4 +81,405 @@
 #define OPA_VESW_MAX_NUM_DEF_PORT	16
 #define OPA_VNIC_MAX_NUM_PCP		8
 
+#define OPA_VNIC_EMA_DATA	(OPA_MGMT_MAD_SIZE - IB_MGMT_VENDOR_HDR)
+
+/* Defines for vendor specific notice(trap) attributes */
+#define OPA_INTEL_EMA_NOTICE_TYPE_INFO	0x04
+
+/* INTEL OUI */
+#define INTEL_OUI_1	0x00
+#define INTEL_OUI_2	0x06
+#define INTEL_OUI_3	0x6a
+
+/* Trap opcodes sent from VNIC */
+#define OPA_VESWPORT_TRAP_IFACE_UCAST_MAC_CHANGE	0x1
+#define OPA_VESWPORT_TRAP_IFACE_MCAST_MAC_CHANGE	0x2
+#define OPA_VESWPORT_TRAP_ETH_LINK_STATUS_CHANGE	0x3
+
+#define OPA_VNIC_DLID_SD_IS_SRC_MAC(dlid_sd)	(!!((dlid_sd) & 0x20))
+#define OPA_VNIC_DLID_SD_GET_DLID(dlid_sd)	((dlid_sd) >> 8)
+
+/**
+ * struct opa_vesw_info - OPA vnic switch information
+ * @fabric_id: 10-bit fabric id
+ * @vesw_id: 12-bit virtual ethernet switch id
+ * @def_port_mask: bitmask of default ports
+ * @pkey: partition key
+ * @u_mcast_dlid: unknown multicast dlid
+ * @u_ucast_dlid: array of unknown unicast dlids
+ * @eth_mtu: MTUs for each vlan PCP
+ * @eth_mtu_non_vlan: MTU for non vlan packets
+ */
+struct opa_vesw_info {
+	__be16 fabric_id;
+	__be16 vesw_id;
+
+	u8     rsvd0[6];
+	__be16 def_port_mask;
+
+	u8     rsvd1[2];
+	__be16 pkey;
+
+	u8     rsvd2[4];
+	__be32 u_mcast_dlid;
+	__be32 u_ucast_dlid[OPA_VESW_MAX_NUM_DEF_PORT];
+
+	u8     rsvd3[44];
+	__be16 eth_mtu[OPA_VNIC_MAX_NUM_PCP];
+	__be16 eth_mtu_non_vlan;
+	u8     rsvd4[2];
+} __packed;
+
+/**
+ * struct opa_per_veswport_info - OPA vnic per port information
+ * @port_num: port number
+ * @eth_link_status: current ethernet link state
+ * @base_mac_addr: base mac address
+ * @config_state: configured port state
+ * @oper_state: operational port state
+ * @max_mac_tbl_ent: max number of mac table entries
+ * @max_smac_ent: max smac entries in mac table
+ * @mac_tbl_digest: mac table digest
+ * @encap_slid: base slid for the port
+ * @pcp_to_sc_uc: sc by pcp index for unicast ethernet packets
+ * @pcp_to_vl_uc: vl by pcp index for unicast ethernet packets
+ * @pcp_to_sc_mc: sc by pcp index for multicast ethernet packets
+ * @pcp_to_vl_mc: vl by pcp index for multicast ethernet packets
+ * @non_vlan_sc_uc: sc for non-vlan unicast ethernet packets
+ * @non_vlan_vl_uc: vl for non-vlan unicast ethernet packets
+ * @non_vlan_sc_mc: sc for non-vlan multicast ethernet packets
+ * @non_vlan_vl_mc: vl for non-vlan multicast ethernet packets
+ * @uc_macs_gen_count: generation count for unicast macs list
+ * @mc_macs_gen_count: generation count for multicast macs list
+ */
+struct opa_per_veswport_info {
+	__be32 port_num;
+
+	u8     eth_link_status;
+	u8     rsvd0[3];
+
+	u8     base_mac_addr[ETH_ALEN];
+	u8     config_state;
+	u8     oper_state;
+
+	__be16 max_mac_tbl_ent;
+	__be16 max_smac_ent;
+	__be32 mac_tbl_digest;
[PATCH 03/11] IB/opa-vnic: Virtual Network Interface Controller (VNIC) netdev
OPA VNIC netdev function supports Ethernet functionality over Omni-Path
fabric by encapsulating Ethernet packets inside an Omni-Path packet
header. It allocates an rdma netdev device and interfaces with the
network stack to provide standard Ethernet network interfaces. It
overrides the HFI1 device's netdev operations where required.

Reviewed-by: Dennis Dalessandro
Reviewed-by: Ira Weiny
Signed-off-by: Niranjana Vishwanathapura
Signed-off-by: Sadanand Warrier
Signed-off-by: Sudeep Dutt
Signed-off-by: Tanya K Jajodia
Signed-off-by: Andrzej Kacprowski
---
 MAINTAINERS                                        |   7 +
 drivers/infiniband/Kconfig                         |   1 +
 drivers/infiniband/ulp/Makefile                    |   1 +
 drivers/infiniband/ulp/opa_vnic/Kconfig            |   8 +
 drivers/infiniband/ulp/opa_vnic/Makefile           |   6 +
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c   | 239 +
 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h   |  62 ++
 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c |  65 ++
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h    | 186 
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c  | 225 +++
 10 files changed, 800 insertions(+)
 create mode 100644 drivers/infiniband/ulp/opa_vnic/Kconfig
 create mode 100644 drivers/infiniband/ulp/opa_vnic/Makefile
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 468d2e8..7f0a07d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5775,6 +5775,13 @@ F:	drivers/block/cciss*
 F:	include/linux/cciss_ioctl.h
 F:	include/uapi/linux/cciss_ioctl.h
 
+OPA-VNIC DRIVER
+M:	Dennis Dalessandro
+M:	Niranjana Vishwanathapura
+L:	linux-r...@vger.kernel.org
+S:	Supported
+F:	drivers/infiniband/ulp/opa_vnic
+
 HFI1 DRIVER
 M:	Mike Marciniszyn
 M:	Dennis Dalessandro
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index 66f8602..234fe01 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -85,6 +85,7 @@ source "drivers/infiniband/ulp/srpt/Kconfig"
 source "drivers/infiniband/ulp/iser/Kconfig"
 source "drivers/infiniband/ulp/isert/Kconfig"
+source "drivers/infiniband/ulp/opa_vnic/Kconfig"
 
 source "drivers/infiniband/sw/rdmavt/Kconfig"
 source "drivers/infiniband/sw/rxe/Kconfig"
diff --git a/drivers/infiniband/ulp/Makefile b/drivers/infiniband/ulp/Makefile
index f3c7dcf..c28af18 100644
--- a/drivers/infiniband/ulp/Makefile
+++ b/drivers/infiniband/ulp/Makefile
@@ -3,3 +3,4 @@ obj-$(CONFIG_INFINIBAND_SRP)	+= srp/
 obj-$(CONFIG_INFINIBAND_SRPT)	+= srpt/
 obj-$(CONFIG_INFINIBAND_ISER)	+= iser/
 obj-$(CONFIG_INFINIBAND_ISERT)	+= isert/
+obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic/
diff --git a/drivers/infiniband/ulp/opa_vnic/Kconfig b/drivers/infiniband/ulp/opa_vnic/Kconfig
new file mode 100644
index 000..48132ab
--- /dev/null
+++ b/drivers/infiniband/ulp/opa_vnic/Kconfig
@@ -0,0 +1,8 @@
+config INFINIBAND_OPA_VNIC
+	tristate "Intel OPA VNIC support"
+	depends on X86_64 && INFINIBAND
+	---help---
+	  This is Omni-Path (OPA) Virtual Network Interface Controller (VNIC)
+	  driver for Ethernet over Omni-Path feature. It implements the HW
+	  independent VNIC functionality. It interfaces with Linux stack for
+	  data path and IB MAD for the control path.
diff --git a/drivers/infiniband/ulp/opa_vnic/Makefile b/drivers/infiniband/ulp/opa_vnic/Makefile
new file mode 100644
index 000..975c313
--- /dev/null
+++ b/drivers/infiniband/ulp/opa_vnic/Makefile
@@ -0,0 +1,6 @@
+# Makefile - Intel Omni-Path Virtual Network Controller driver
+# Copyright(c) 2017, Intel Corporation.
+#
+obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic.o
+
+opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
new file mode 100644
index 000..c74d02a
--- /dev/null
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c
@@ -0,0 +1,239 @@
+/*
+ * Copyright(c) 2017 Intel Corporation.
+ *
+ * This file is provided under a dual BSD/GPLv2 license. When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * This
[PATCH 01/11] IB/opa-vnic: Virtual Network Interface Controller (VNIC) documentation
Add OPA VNIC design document explaining the VNIC architecture and the
driver design.

Reviewed-by: Dennis Dalessandro
Reviewed-by: Ira Weiny
Signed-off-by: Niranjana Vishwanathapura
---
 Documentation/infiniband/opa_vnic.txt | 153 ++
 1 file changed, 153 insertions(+)
 create mode 100644 Documentation/infiniband/opa_vnic.txt

diff --git a/Documentation/infiniband/opa_vnic.txt b/Documentation/infiniband/opa_vnic.txt
new file mode 100644
index 000..282e17b
--- /dev/null
+++ b/Documentation/infiniband/opa_vnic.txt
@@ -0,0 +1,153 @@
+Intel Omni-Path (OPA) Virtual Network Interface Controller (VNIC) feature
+supports Ethernet functionality over Omni-Path fabric by encapsulating
+the Ethernet packets between HFI nodes.
+
+Architecture
+============
+The patterns of exchanges of Omni-Path encapsulated Ethernet packets
+involve one or more virtual Ethernet switches overlaid on the Omni-Path
+fabric topology. A subset of HFI nodes on the Omni-Path fabric are
+permitted to exchange encapsulated Ethernet packets across a particular
+virtual Ethernet switch. The virtual Ethernet switches are logical
+abstractions achieved by configuring the HFI nodes on the fabric for
+header generation and processing. In the simplest configuration all HFI
+nodes across the fabric exchange encapsulated Ethernet packets over a
+single virtual Ethernet switch. A virtual Ethernet switch is effectively
+an independent Ethernet network. The configuration is performed by an
+Ethernet Manager (EM) which is part of the trusted Fabric Manager (FM)
+application. HFI nodes can have multiple VNICs, each connected to a
+different virtual Ethernet switch. The diagram below presents a case
+of two virtual Ethernet switches with two HFI nodes.
+
+                           +-------------------+
+                           |      Subnet/      |
+                           |     Ethernet      |
+                           |      Manager      |
+                           +-------------------+
+                              /             /
+                             /             /
+                            /             /
+                           /             /
+     +-----------------------------+  +-----------------------------+
+     |  Virtual Ethernet Switch    |  |  Virtual Ethernet Switch    |
+     |  +---------+   +---------+  |  |  +---------+   +---------+  |
+     |  |  VPORT  |   |  VPORT  |  |  |  |  VPORT  |   |  VPORT  |  |
+     +--+---------+---+---------+--+  +--+---------+---+---------+--+
+            |              \                 /              |
+            |               \               /               |
+            |                \             /                |
+            |                 \           /                 |
+            |                  \         /                  |
+            |                   \       /                   |
+            |                    \     /                    |
+            |                     \   /                     |
+            |                      \ /                      |
+            |                      / \                      |
+            |                     /   \                     |
+        +---+----+------+        /     \        +------+---+----+
+        |  VNIC  | VNIC |-------+       +-------| VNIC |  VNIC  |
+        +--------+------+                       +------+--------+
+        |      HFI      |                       |      HFI      |
+        +---------------+                       +---------------+
+
+
+The Omni-Path encapsulated Ethernet packet format is as described below.
+
+Bits          Field
+
+Quad Word 0:
+0-19          SLID (lower 20 bits)
+20-30         Length (in Quad Words)
+31            BECN bit
+32-51         DLID (lower 20 bits)
+52-56         SC (Service Class)
+57-59         RC (Routing Control)
+60            FECN bit
+61-62         L2 (=10, 16B format)
+63            LT (=1, Link Transfer Head Flit)
+
+Quad Word 1:
+0-7           L4 type (=0x78 ETHERNET)
+8-11          SLID[23:20]
+12-15         DLID[23:20]
+16-31         PKEY
+32-47         Entropy
+48-63         Reserved
+
+Quad Word 2:
+0-15          Reserved
+16-31         L4 header
+32-63         Ethernet Packet
+
+Quad Words 3 to N-1:
+0-63          Ethernet packet (pad extended)
+
+Quad Word N (last):
+0-23          Ethernet packet (pad extended)
+24-55         ICRC
+56-61         Tail
+62-63         LT (=01, Link Transfer Tail Flit)
+
+The Ethernet packet is padded on the transmit side to ensure that the
+VNIC OPA packet is quad word aligned. The 'Tail' field contains the
+number of bytes padded. On the receive side the 'Tail' field is read
+and the padding is removed (along with ICRC, Tail and OPA header)
+before passing the packet up the network stack.
+
+The L4 header field contains the id of the virtual Ethernet switch the
+VNIC port belongs to. On the receive side, this field is used to
+de-multiplex the received VNIC packets to different VNIC ports.
+
+Driver Design
+=============
+Intel OPA VNIC software design is presented in the diagram below.
+OPA VNIC functionality has a HW dependent component and a HW
+independent component.
+
+Support has been added for an IB device to allocate and free RDMA
+netdev devices. The RDMA netdev supports interfacing with the network
+stack, thus creating standard network interfaces.
[PATCH 10/11] IB/hfi1: Virtual Network Interface Controller (VNIC) HW support
HFI1 HW specific support for VNIC functionality.

Dynamically allocate a set of contexts for VNIC when the first vnic port
is instantiated. Allocate VNIC contexts from the user contexts pool and
return them to the same pool when freeing them up. Set aside enough
MSI-X interrupts for VNIC contexts and assign them when the contexts are
allocated. On the receive side, use an RSM rule to spread TCP/UDP
streams among VNIC contexts.

Reviewed-by: Dennis Dalessandro
Reviewed-by: Ira Weiny
Signed-off-by: Niranjana Vishwanathapura
Signed-off-by: Andrzej Kacprowski
---
 drivers/infiniband/hw/hfi1/aspm.h         |  15 +-
 drivers/infiniband/hw/hfi1/chip.c         | 293 +-
 drivers/infiniband/hw/hfi1/chip.h         |   4 +-
 drivers/infiniband/hw/hfi1/debugfs.c      |   8 +-
 drivers/infiniband/hw/hfi1/driver.c       |  52 --
 drivers/infiniband/hw/hfi1/file_ops.c     |  27 ++-
 drivers/infiniband/hw/hfi1/hfi.h          |  29 ++-
 drivers/infiniband/hw/hfi1/init.c         |  29 +--
 drivers/infiniband/hw/hfi1/mad.c          |  10 +-
 drivers/infiniband/hw/hfi1/pio.c          |  19 +-
 drivers/infiniband/hw/hfi1/pio.h          |   8 +-
 drivers/infiniband/hw/hfi1/sysfs.c        |   4 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c |   8 +-
 drivers/infiniband/hw/hfi1/user_pages.c   |   5 +-
 drivers/infiniband/hw/hfi1/verbs.c        |   8 +-
 drivers/infiniband/hw/hfi1/vnic.h         |   3 +
 drivers/infiniband/hw/hfi1/vnic_main.c    | 245 -
 include/rdma/opa_port_info.h              |   4 +-
 18 files changed, 663 insertions(+), 108 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/aspm.h b/drivers/infiniband/hw/hfi1/aspm.h
index 0d58fe3..794e681 100644
--- a/drivers/infiniband/hw/hfi1/aspm.h
+++ b/drivers/infiniband/hw/hfi1/aspm.h
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015, 2016 Intel Corporation.
+ * Copyright(c) 2015-2017 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license. When using or
  * redistributing this file, you may do so under either license.
@@ -229,14 +229,17 @@ static inline void aspm_ctx_timer_function(unsigned long data)
 	spin_unlock_irqrestore(&rcd->aspm_lock, flags);
 }
 
-/* Disable interrupt processing for verbs contexts when PSM contexts are open */
+/*
+ * Disable interrupt processing for verbs contexts when PSM or VNIC contexts
+ * are open.
+ */
 static inline void aspm_disable_all(struct hfi1_devdata *dd)
 {
 	struct hfi1_ctxtdata *rcd;
 	unsigned long flags;
 	unsigned i;
 
-	for (i = 0; i < dd->first_user_ctxt; i++) {
+	for (i = 0; i < dd->first_dyn_alloc_ctxt; i++) {
 		rcd = dd->rcd[i];
 		del_timer_sync(&rcd->aspm_timer);
 		spin_lock_irqsave(&rcd->aspm_lock, flags);
@@ -260,7 +263,7 @@ static inline void aspm_enable_all(struct hfi1_devdata *dd)
 	if (aspm_mode != ASPM_MODE_DYNAMIC)
 		return;
 
-	for (i = 0; i < dd->first_user_ctxt; i++) {
+	for (i = 0; i < dd->first_dyn_alloc_ctxt; i++) {
 		rcd = dd->rcd[i];
 		spin_lock_irqsave(&rcd->aspm_lock, flags);
 		rcd->aspm_intr_enable = true;
@@ -276,7 +279,7 @@ static inline void aspm_ctx_init(struct hfi1_ctxtdata *rcd)
 		    (unsigned long)rcd);
 	rcd->aspm_intr_supported = rcd->dd->aspm_supported &&
 		aspm_mode == ASPM_MODE_DYNAMIC &&
-		rcd->ctxt < rcd->dd->first_user_ctxt;
+		rcd->ctxt < rcd->dd->first_dyn_alloc_ctxt;
 }
 
 static inline void aspm_init(struct hfi1_devdata *dd)
@@ -286,7 +289,7 @@ static inline void aspm_init(struct hfi1_devdata *dd)
 	spin_lock_init(&dd->aspm_lock);
 	dd->aspm_supported = aspm_hw_l1_supported(dd);
 
-	for (i = 0; i < dd->first_user_ctxt; i++)
+	for (i = 0; i < dd->first_dyn_alloc_ctxt; i++)
 		aspm_ctx_init(dd->rcd[i]);
 
 	/* Start with ASPM disabled */
diff --git a/drivers/infiniband/hw/hfi1/chip.c b/drivers/infiniband/hw/hfi1/chip.c
index 121a4c9..f97fccb 100644
--- a/drivers/infiniband/hw/hfi1/chip.c
+++ b/drivers/infiniband/hw/hfi1/chip.c
@@ -1,5 +1,5 @@
 /*
- * Copyright(c) 2015, 2016 Intel Corporation.
+ * Copyright(c) 2015-2017 Intel Corporation.
  *
  * This file is provided under a dual BSD/GPLv2 license. When using or
  * redistributing this file, you may do so under either license.
@@ -125,9 +125,16 @@ struct flag_table {
 #define DEFAULT_KRCVQS		  2
 #define MIN_KERNEL_KCTXTS         2
 #define FIRST_KERNEL_KCTXT        1
-/* sizes for both the QP and RSM map tables */
-#define NUM_MAP_ENTRIES		256
-#define NUM_MAP_REGS              32
+
+/*
+ * RSM instance allocation
+ *   0 - Verbs
+ *   1 - User Fecn Handling
+ *   2 - Vnic
+ */
+#define RSM_INS_VERBS             0
+#define RSM_INS_FECN              1
+#define RSM_INS_VNIC              2
[PATCH 08/11] IB/opa-vnic: VNIC Ethernet Management Agent (VEMA) function
OPA VEMA function interfaces with the Infiniband MAD stack to exchange
the management information packets with the Ethernet Manager (EM). It
interfaces with the OPA VNIC netdev function to SET/GET the management
information. The information exchanged with the EM includes the class
port details, encapsulation configuration, various counters, unicast
and multicast MAC lists and the MAC table. It also supports sending
traps to the EM.

Reviewed-by: Dennis Dalessandro
Reviewed-by: Ira Weiny
Signed-off-by: Sadanand Warrier
Signed-off-by: Niranjana Vishwanathapura
Signed-off-by: Tanya K Jajodia
Signed-off-by: Sudeep Dutt
---
 drivers/infiniband/ulp/opa_vnic/Makefile           |    2 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c |   12 +
 .../infiniband/ulp/opa_vnic/opa_vnic_internal.h    |   17 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c    | 1071 
 .../infiniband/ulp/opa_vnic/opa_vnic_vema_iface.c  |    2 +-
 5 files changed, 1099 insertions(+), 5 deletions(-)
 create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_vema.c

diff --git a/drivers/infiniband/ulp/opa_vnic/Makefile b/drivers/infiniband/ulp/opa_vnic/Makefile
index e8d1ea1..8061b28 100644
--- a/drivers/infiniband/ulp/opa_vnic/Makefile
+++ b/drivers/infiniband/ulp/opa_vnic/Makefile
@@ -4,4 +4,4 @@ obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic.o
 
 opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o \
-	      opa_vnic_vema_iface.o
+	      opa_vnic_vema.o opa_vnic_vema_iface.o
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
index a98948c..d66540e 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_ethtool.c
@@ -120,6 +120,17 @@ struct vnic_stats {
 
 #define VNIC_STATS_LEN  ARRAY_SIZE(vnic_gstrings_stats)
 
+/* vnic_get_drvinfo - get driver info */
+static void vnic_get_drvinfo(struct net_device *netdev,
+			     struct ethtool_drvinfo *drvinfo)
+{
+	strlcpy(drvinfo->driver, opa_vnic_driver_name, sizeof(drvinfo->driver));
+	strlcpy(drvinfo->version, opa_vnic_driver_version,
+		sizeof(drvinfo->version));
+	strlcpy(drvinfo->bus_info, dev_name(netdev->dev.parent),
+		sizeof(drvinfo->bus_info));
+}
+
 /* vnic_get_sset_count - get string set count */
 static int vnic_get_sset_count(struct net_device *netdev, int sset)
 {
@@ -162,6 +173,7 @@ static void vnic_get_strings(struct net_device *netdev, u32 stringset, u8 *data)
 
 /* ethtool ops */
 static const struct ethtool_ops opa_vnic_ethtool_ops = {
+	.get_drvinfo = vnic_get_drvinfo,
 	.get_link = ethtool_op_get_link,
 	.get_strings = vnic_get_strings,
 	.get_sset_count = vnic_get_sset_count,
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
index b49f5d7..6bba886 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h
@@ -164,10 +164,12 @@ struct __opa_veswport_trap {
  * struct opa_vnic_ctrl_port - OPA virtual NIC control port
  * @ibdev: pointer to ib device
  * @ops: opa vnic control operations
+ * @num_ports: number of opa ports
  */
 struct opa_vnic_ctrl_port {
 	struct ib_device *ibdev;
 	struct opa_vnic_ctrl_ops *ops;
+	u8 num_ports;
 };
 
 /**
@@ -187,6 +189,8 @@ struct opa_vnic_ctrl_port {
  * @mactbl_lock: mac table lock
  * @stats_lock: statistics lock
  * @flow_tbl: flow to default port redirection table
+ * @trap_timeout: trap timeout
+ * @trap_count: no. of traps allowed within timeout period
  */
 struct opa_vnic_adapter {
 	struct net_device *netdev;
@@ -213,6 +217,9 @@ struct opa_vnic_adapter {
 	struct mutex stats_lock;
 
 	u8 flow_tbl[OPA_VNIC_FLOW_TBL_SIZE];
+
+	unsigned long trap_timeout;
+	u8            trap_count;
 };
 
 /* Same as opa_veswport_mactable_entry, but without bitwise attribute */
@@ -247,6 +254,8 @@ struct opa_vnic_mac_tbl_node {
 	dev_err(&cport->ibdev->dev, format, ## arg)
 #define c_info(format, arg...) \
 	dev_info(&cport->ibdev->dev, format, ## arg)
+#define c_dbg(format, arg...) \
+	dev_dbg(&cport->ibdev->dev, format, ## arg)
 
 /* The maximum allowed entries in the mac table */
 #define OPA_VNIC_MAC_TBL_MAX_ENTRIES  2048
@@ -281,6 +290,9 @@ struct opa_vnic_mac_tbl_node {
 		!obj && (bkt) < OPA_VNIC_MAC_TBL_SIZE; (bkt)++) \
 		hlist_for_each_entry(obj, &name[bkt], member)
 
+extern char opa_vnic_driver_name[];
+extern const char opa_vnic_driver_version[];
+
 struct opa_vnic_adapter *opa_vnic_add_netdev(struct ib_device *ibdev,
[PATCH 02/11] IB/opa-vnic: Virtual Network Interface Controller (VNIC) interface
Add rdma netdev interface to ib device structure allowing rdma netdev devices to be allocated by ib clients. Define OPA VNIC interface between hardware independent VNIC functionality and the hardware dependent VNIC functionality. Reviewed-by: Dennis DalessandroReviewed-by: Ira Weiny Signed-off-by: Niranjana Vishwanathapura --- include/rdma/ib_verbs.h | 27 + include/rdma/opa_vnic.h | 143 2 files changed, 170 insertions(+) create mode 100644 include/rdma/opa_vnic.h diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 8c61532..16ad142 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -55,6 +55,7 @@ #include #include #include +#include #include #include @@ -221,6 +222,7 @@ enum ib_device_cap_flags { IB_DEVICE_SG_GAPS_REG = (1ULL << 32), IB_DEVICE_VIRTUAL_FUNCTION = (1ULL << 33), IB_DEVICE_RAW_SCATTER_FCS = (1ULL << 34), + IB_DEVICE_RDMA_NETDEV_OPA_VNIC = (1ULL << 35), }; enum ib_signature_prot_cap { @@ -1858,6 +1860,22 @@ struct ib_port_immutable { u32 max_mad_size; }; +/* rdma netdev type - specifies protocol type */ +enum rdma_netdev_t { + RDMA_NETDEV_OPA_VNIC +}; + +/** + * struct rdma_netdev - rdma netdev + * For cases where netstack interfacing is required. 
+ */ +struct rdma_netdev { + void *clnt_priv; + + /* control functions */ + void (*set_id)(struct net_device *netdev, int id); +}; + struct ib_device { struct device*dma_device; @@ -2110,6 +2128,15 @@ struct ib_device { struct ib_rwq_ind_table_init_attr *init_attr, struct ib_udata *udata); int(*destroy_rwq_ind_table)(struct ib_rwq_ind_table *wq_ind_table); + /* rdma netdev operations */ + struct net_device *(*alloc_rdma_netdev)( + struct ib_device *device, + u8 port_num, + enum rdma_netdev_t type, + const char *name, + unsigned char name_assign_type, + void (*setup)(struct net_device *)); + void (*free_rdma_netdev)(struct net_device *netdev); struct ib_dma_mapping_ops *dma_ops; struct module *owner; diff --git a/include/rdma/opa_vnic.h b/include/rdma/opa_vnic.h new file mode 100644 index 000..68315cc --- /dev/null +++ b/include/rdma/opa_vnic.h @@ -0,0 +1,143 @@ +#ifndef _OPA_VNIC_H +#define _OPA_VNIC_H +/* + * Copyright(c) 2017 Intel Corporation. + * + * This file is provided under a dual BSD/GPLv2 license. When using or + * redistributing this file, you may do so under either license. + * + * GPL LICENSE SUMMARY + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * BSD LICENSE + * + * Redistribution and use in source and binary forms, with or without + * modification, are permitted provided that the following conditions + * are met: + * + * - Redistributions of source code must retain the above copyright + *notice, this list of conditions and the following disclaimer. 
+ * - Redistributions in binary form must reproduce the above copyright + *notice, this list of conditions and the following disclaimer in + *the documentation and/or other materials provided with the + *distribution. + * - Neither the name of Intel Corporation nor the names of its + *contributors may be used to endorse or promote products derived + *from this software without specific prior written permission. + * + * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS + * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT + * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR + * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT + * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, + * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT + * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, + * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY + * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT + *
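Patch 02 introduces struct rdma_netdev with a clnt_priv pointer and control callbacks such as set_id, placed at the start of the netdev private area. A userspace sketch of that callback pattern is below; struct net_device, the priv layout, and vnic_priv are stand-ins invented for illustration, not the kernel's real types.

```c
#include <assert.h>
#include <stdlib.h>

struct net_device {
	void *priv;  /* stand-in for the netdev_priv() area */
};

/* Mirrors the shape of the kernel's struct rdma_netdev: client
 * private data plus control function pointers. */
struct rdma_netdev {
	void *clnt_priv;                                    /* ULP data   */
	void (*set_id)(struct net_device *netdev, int id);  /* control op */
};

struct vnic_priv {   /* hypothetical client (ULP) private data */
	int vesw_id;
};

static void vnic_set_id(struct net_device *netdev, int id)
{
	struct rdma_netdev *rn = netdev->priv;
	struct vnic_priv *vp = rn->clnt_priv;

	vp->vesw_id = id;  /* record the virtual switch id */
}

/* Rough analogue of what a driver's alloc_rdma_netdev op sets up. */
static struct net_device *alloc_rdma_netdev_sketch(struct vnic_priv *vp)
{
	struct net_device *nd = calloc(1, sizeof(*nd));
	struct rdma_netdev *rn = calloc(1, sizeof(*rn));

	rn->clnt_priv = vp;
	rn->set_id = vnic_set_id;
	nd->priv = rn;
	return nd;
}
```

The client calls the device's alloc_rdma_netdev, then drives the netdev through the function pointers without knowing the hardware-specific type behind clnt_priv.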
[PATCH 07/11] IB/opa-vnic: VNIC Ethernet Management Agent (VEMA) interface
OPA VNIC EMA interface functions are the management interfaces to the OPA VNIC netdev. Add support to add and remove VNIC ports. Implement the required GET/SET management interface functions and processing of new management information. Add support to send trap notifications upon various events like interface status change, unicast/multicast mac list update and mac address change. Reviewed-by: Dennis DalessandroReviewed-by: Ira Weiny Signed-off-by: Niranjana Vishwanathapura Signed-off-by: Sadanand Warrier Signed-off-by: Tanya K Jajodia --- drivers/infiniband/ulp/opa_vnic/Makefile | 3 +- drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h | 4 + .../infiniband/ulp/opa_vnic/opa_vnic_internal.h| 44 +++ drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c | 142 +++- .../infiniband/ulp/opa_vnic/opa_vnic_vema_iface.c | 390 + 5 files changed, 581 insertions(+), 2 deletions(-) create mode 100644 drivers/infiniband/ulp/opa_vnic/opa_vnic_vema_iface.c diff --git a/drivers/infiniband/ulp/opa_vnic/Makefile b/drivers/infiniband/ulp/opa_vnic/Makefile index 975c313..e8d1ea1 100644 --- a/drivers/infiniband/ulp/opa_vnic/Makefile +++ b/drivers/infiniband/ulp/opa_vnic/Makefile @@ -3,4 +3,5 @@ # obj-$(CONFIG_INFINIBAND_OPA_VNIC) += opa_vnic.o -opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o +opa_vnic-y := opa_vnic_netdev.o opa_vnic_encap.o opa_vnic_ethtool.o \ + opa_vnic_vema_iface.o diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h index c025cde..4c434b9 100644 --- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h +++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.h @@ -99,6 +99,10 @@ #define OPA_VNIC_DLID_SD_IS_SRC_MAC(dlid_sd) (!!((dlid_sd) & 0x20)) #define OPA_VNIC_DLID_SD_GET_DLID(dlid_sd)((dlid_sd) >> 8) +/* VNIC Ethernet link status */ +#define OPA_VNIC_ETH_LINK_UP 1 +#define OPA_VNIC_ETH_LINK_DOWN 2 + /** * struct opa_vesw_info - OPA vnic switch information * @fabric_id: 10-bit fabric id diff --git 
a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h index bec4866..b49f5d7 100644 --- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h +++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_internal.h @@ -161,14 +161,28 @@ struct __opa_veswport_trap { } __packed; /** + * struct opa_vnic_ctrl_port - OPA virtual NIC control port + * @ibdev: pointer to ib device + * @ops: opa vnic control operations + */ +struct opa_vnic_ctrl_port { + struct ib_device *ibdev; + struct opa_vnic_ctrl_ops *ops; +}; + +/** * struct opa_vnic_adapter - OPA VNIC netdev private data structure * @netdev: pointer to associated netdev * @ibdev: ib device + * @cport: pointer to opa vnic control port * @rn_ops: rdma netdev's net_device_ops * @port_num: OPA port number * @vport_num: vesw port number * @lock: adapter lock * @info: virtual ethernet switch port information + * @vema_mac_addr: mac address configured by vema + * @umac_hash: unicast maclist hash + * @mmac_hash: multicast maclist hash * @mactbl: hash table of MAC entries * @mactbl_lock: mac table lock * @stats_lock: statistics lock @@ -177,6 +191,7 @@ struct __opa_veswport_trap { struct opa_vnic_adapter { struct net_device *netdev; struct ib_device *ibdev; + struct opa_vnic_ctrl_port *cport; const struct net_device_ops *rn_ops; u8 port_num; @@ -186,6 +201,9 @@ struct opa_vnic_adapter { struct mutex lock; struct __opa_veswport_info info; + u8 vema_mac_addr[ETH_ALEN]; + u32 umac_hash; + u32 mmac_hash; struct hlist_head __rcu *mactbl; /* Lock used to protect updates to mac table */ @@ -225,6 +243,11 @@ struct opa_vnic_mac_tbl_node { #define v_warn(format, arg...) \ netdev_warn(adapter->netdev, format, ## arg) +#define c_err(format, arg...) \ + dev_err(>ibdev->dev, format, ## arg) +#define c_info(format, arg...) 
\ + dev_info(>ibdev->dev, format, ## arg) + /* The maximum allowed entries in the mac table */ #define OPA_VNIC_MAC_TBL_MAX_ENTRIES 2048 /* Limit of smac entries in mac table */ @@ -264,11 +287,32 @@ struct opa_vnic_adapter *opa_vnic_add_netdev(struct ib_device *ibdev, void opa_vnic_encap_skb(struct opa_vnic_adapter *adapter, struct sk_buff *skb); u8 opa_vnic_get_vl(struct opa_vnic_adapter *adapter, struct sk_buff *skb); u8 opa_vnic_calc_entropy(struct opa_vnic_adapter *adapter, struct sk_buff *skb); +void opa_vnic_process_vema_config(struct opa_vnic_adapter *adapter); void opa_vnic_release_mac_tbl(struct opa_vnic_adapter *adapter); void
[PATCH 09/11] IB/hfi1: OPA_VNIC RDMA netdev support
Add support to create and free OPA_VNIC rdma netdev devices. Implement netstack interface functionality including xmit_skb, receive side NAPI etc. Also implement rdma netdev control functions. Reviewed-by: Dennis DalessandroReviewed-by: Ira Weiny Signed-off-by: Niranjana Vishwanathapura Signed-off-by: Andrzej Kacprowski --- drivers/infiniband/hw/hfi1/Makefile| 2 +- drivers/infiniband/hw/hfi1/driver.c| 25 +- drivers/infiniband/hw/hfi1/hfi.h | 27 +- drivers/infiniband/hw/hfi1/init.c | 9 +- drivers/infiniband/hw/hfi1/vnic.h | 153 drivers/infiniband/hw/hfi1/vnic_main.c | 646 + 6 files changed, 855 insertions(+), 7 deletions(-) create mode 100644 drivers/infiniband/hw/hfi1/vnic.h create mode 100644 drivers/infiniband/hw/hfi1/vnic_main.c diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile index 0cf97a0..2280538 100644 --- a/drivers/infiniband/hw/hfi1/Makefile +++ b/drivers/infiniband/hw/hfi1/Makefile @@ -12,7 +12,7 @@ hfi1-y := affinity.o chip.o device.o driver.o efivar.o \ init.o intr.o mad.o mmu_rb.o pcie.o pio.o pio_copy.o platform.o \ qp.o qsfp.o rc.o ruc.o sdma.o sysfs.o trace.o \ uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs.o \ - verbs_txreq.o + verbs_txreq.o vnic_main.o hfi1-$(CONFIG_DEBUG_FS) += debugfs.o CFLAGS_trace.o = -I$(src) diff --git a/drivers/infiniband/hw/hfi1/driver.c b/drivers/infiniband/hw/hfi1/driver.c index 3881c95..4969b88 100644 --- a/drivers/infiniband/hw/hfi1/driver.c +++ b/drivers/infiniband/hw/hfi1/driver.c @@ -1,5 +1,5 @@ /* - * Copyright(c) 2015, 2016 Intel Corporation. + * Copyright(c) 2015-2017 Intel Corporation. * * This file is provided under a dual BSD/GPLv2 license. When using or * redistributing this file, you may do so under either license. 
@@ -59,6 +59,7 @@ #include "trace.h" #include "qp.h" #include "sdma.h" +#include "vnic.h" #undef pr_fmt #define pr_fmt(fmt) DRIVER_NAME ": " fmt @@ -1372,15 +1373,31 @@ int process_receive_ib(struct hfi1_packet *packet) return RHF_RCV_CONTINUE; } +static inline bool hfi1_is_vnic_packet(struct hfi1_packet *packet) +{ + /* Packet received in VNIC context via RSM */ + if (packet->rcd->is_vnic) + return true; + + if ((HFI1_GET_L2_TYPE(packet->ebuf) == OPA_VNIC_L2_TYPE) && + (HFI1_GET_L4_TYPE(packet->ebuf) == OPA_VNIC_L4_ETHR)) + return true; + + return false; +} + int process_receive_bypass(struct hfi1_packet *packet) { struct hfi1_devdata *dd = packet->rcd->dd; - if (unlikely(rhf_err_flags(packet->rhf))) + if (unlikely(rhf_err_flags(packet->rhf))) { handle_eflags(packet); + } else if (hfi1_is_vnic_packet(packet)) { + hfi1_vnic_bypass_rcv(packet); + return RHF_RCV_CONTINUE; + } - dd_dev_err(dd, - "Bypass packets are not supported in normal operation. Dropping\n"); + dd_dev_err(dd, "Unsupported bypass packet. Dropping\n"); incr_cntr64(>sw_rcv_bypass_packet_errors); if (!(dd->err_info_rcvport.status_and_code & OPA_EI_STATUS_SMASK)) { u64 *flits = packet->ebuf; diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h index 0808e3c3..66fb9e4 100644 --- a/drivers/infiniband/hw/hfi1/hfi.h +++ b/drivers/infiniband/hw/hfi1/hfi.h @@ -1,7 +1,7 @@ #ifndef _HFI1_KERNEL_H #define _HFI1_KERNEL_H /* - * Copyright(c) 2015, 2016 Intel Corporation. + * Copyright(c) 2015-2017 Intel Corporation. * * This file is provided under a dual BSD/GPLv2 license. When using or * redistributing this file, you may do so under either license. @@ -337,6 +337,12 @@ struct hfi1_ctxtdata { * packets with the wrong interrupt handler. 
*/ int (*do_interrupt)(struct hfi1_ctxtdata *rcd, int threaded); + + /* Indicates that this is vnic context */ + bool is_vnic; + + /* vnic queue index this context is mapped to */ + u8 vnic_q_idx; }; /* @@ -808,6 +814,19 @@ struct hfi1_asic_data { struct hfi1_i2c_bus *i2c_bus1; }; +/* + * Number of VNIC contexts used. Ensure it is less than or equal to + * max queues supported by VNIC (HFI1_VNIC_MAX_QUEUE). + */ +#define HFI1_NUM_VNIC_CTXT 8 + +/* Virtual NIC information */ +struct hfi1_vnic_data { + struct idr vesw_idr; +}; + +struct hfi1_vnic_vport_info; + /* device data struct now contains only "general per-device" info. * fields related to a physical IB port are in a hfi1_pportdata struct. */ @@ -1115,6 +1134,9 @@ struct hfi1_devdata { send_routine process_dma_send; void (*pio_inline_send)(struct hfi1_devdata *dd, struct pio_buf *pbuf, u64 pbc, const void *from, size_t
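The hfi1_is_vnic_packet() logic added in patch 09 classifies a bypass packet as VNIC either because it arrived on a context reserved for VNIC via RSM, or because its L2/L4 types match the OPA VNIC encapsulation. A standalone sketch of that two-condition check is below; the type values and the flattened packet struct are illustrative stand-ins, not the real header layout or extraction macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OPA_VNIC_L2_TYPE 0x2   /* assumed value for this sketch */
#define OPA_VNIC_L4_ETHR 0x78  /* assumed value for this sketch */

struct fake_packet {
	bool ctx_is_vnic;  /* receive context dedicated to VNIC via RSM */
	uint8_t l2_type;   /* stands in for HFI1_GET_L2_TYPE(ebuf) */
	uint8_t l4_type;   /* stands in for HFI1_GET_L4_TYPE(ebuf) */
};

static bool is_vnic_packet(const struct fake_packet *pkt)
{
	/* Packet received in a VNIC context: no header inspection needed. */
	if (pkt->ctx_is_vnic)
		return true;

	/* Otherwise classify by the encapsulation type fields. */
	return pkt->l2_type == OPA_VNIC_L2_TYPE &&
	       pkt->l4_type == OPA_VNIC_L4_ETHR;
}
```

Packets that match neither condition fall through to the "Unsupported bypass packet" error path in the patch.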
[PATCH 11/11] IB/hfi1: VNIC SDMA support
HFI1 VNIC SDMA support enables transmission of VNIC packets over SDMA. Map VNIC queues to SDMA engines and support halting and wakeup of the VNIC queues. Reviewed-by: Dennis DalessandroReviewed-by: Ira Weiny Signed-off-by: Niranjana Vishwanathapura --- drivers/infiniband/hw/hfi1/Makefile| 2 +- drivers/infiniband/hw/hfi1/hfi.h | 1 + drivers/infiniband/hw/hfi1/init.c | 1 + drivers/infiniband/hw/hfi1/vnic.h | 28 +++ drivers/infiniband/hw/hfi1/vnic_main.c | 24 ++- drivers/infiniband/hw/hfi1/vnic_sdma.c | 323 + 6 files changed, 376 insertions(+), 3 deletions(-) create mode 100644 drivers/infiniband/hw/hfi1/vnic_sdma.c diff --git a/drivers/infiniband/hw/hfi1/Makefile b/drivers/infiniband/hw/hfi1/Makefile index 2280538..88085f6 100644 --- a/drivers/infiniband/hw/hfi1/Makefile +++ b/drivers/infiniband/hw/hfi1/Makefile @@ -12,7 +12,7 @@ hfi1-y := affinity.o chip.o device.o driver.o efivar.o \ init.o intr.o mad.o mmu_rb.o pcie.o pio.o pio_copy.o platform.o \ qp.o qsfp.o rc.o ruc.o sdma.o sysfs.o trace.o \ uc.o ud.o user_exp_rcv.o user_pages.o user_sdma.o verbs.o \ - verbs_txreq.o vnic_main.o + verbs_txreq.o vnic_main.o vnic_sdma.o hfi1-$(CONFIG_DEBUG_FS) += debugfs.o CFLAGS_trace.o = -I$(src) diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h index ac31b23..b57b88a 100644 --- a/drivers/infiniband/hw/hfi1/hfi.h +++ b/drivers/infiniband/hw/hfi1/hfi.h @@ -834,6 +834,7 @@ struct hfi1_asic_data { /* Virtual NIC information */ struct hfi1_vnic_data { struct hfi1_ctxtdata *ctxt[HFI1_NUM_VNIC_CTXT]; + struct kmem_cache *txreq_cache; u8 num_vports; struct idr vesw_idr; u8 rmt_start; diff --git a/drivers/infiniband/hw/hfi1/init.c b/drivers/infiniband/hw/hfi1/init.c index 1ecccaa..3fc7984 100644 --- a/drivers/infiniband/hw/hfi1/init.c +++ b/drivers/infiniband/hw/hfi1/init.c @@ -681,6 +681,7 @@ int hfi1_init(struct hfi1_devdata *dd, int reinit) dd->process_pio_send = hfi1_verbs_send_pio; dd->process_dma_send = hfi1_verbs_send_dma; dd->pio_inline_send = 
pio_copy; + dd->process_vnic_dma_send = hfi1_vnic_send_dma; if (is_ax(dd)) { atomic_set(>drop_packet, DROP_PACKET_ON); diff --git a/drivers/infiniband/hw/hfi1/vnic.h b/drivers/infiniband/hw/hfi1/vnic.h index d620aec..36996f0 100644 --- a/drivers/infiniband/hw/hfi1/vnic.h +++ b/drivers/infiniband/hw/hfi1/vnic.h @@ -49,6 +49,7 @@ #include #include "hfi.h" +#include "sdma.h" #define HFI1_VNIC_MAX_TXQ 16 #define HFI1_VNIC_MAX_PAD 12 @@ -85,6 +86,26 @@ #define HFI1_VNIC_MAX_QUEUE 16 /** + * struct hfi1_vnic_sdma - VNIC per Tx ring SDMA information + * @dd - device data pointer + * @sde - sdma engine + * @vinfo - vnic info pointer + * @wait - iowait structure + * @stx - sdma tx request + * @state - vnic Tx ring SDMA state + * @q_idx - vnic Tx queue index + */ +struct hfi1_vnic_sdma { + struct hfi1_devdata *dd; + struct sdma_engine *sde; + struct hfi1_vnic_vport_info *vinfo; + struct iowait wait; + struct sdma_txreq stx; + unsigned int state; + u8 q_idx; +}; + +/** * struct hfi1_vnic_rx_queue - HFI1 VNIC receive queue * @idx: queue index * @vinfo: pointer to vport information @@ -111,6 +132,7 @@ struct hfi1_vnic_rx_queue { * @vesw_id: virtual switch id * @rxq: Array of receive queues * @stats: per queue stats + * @sdma: VNIC SDMA structure per TXQ */ struct hfi1_vnic_vport_info { struct hfi1_devdata *dd; @@ -126,6 +148,7 @@ struct hfi1_vnic_vport_info { struct hfi1_vnic_rx_queue rxq[HFI1_NUM_VNIC_CTXT]; struct opa_vnic_stats stats[HFI1_VNIC_MAX_QUEUE]; + struct hfi1_vnic_sdma sdma[HFI1_VNIC_MAX_TXQ]; }; #define v_dbg(format, arg...) 
\ @@ -138,8 +161,13 @@ struct hfi1_vnic_vport_info { /* vnic hfi1 internal functions */ void hfi1_vnic_setup(struct hfi1_devdata *dd); void hfi1_vnic_cleanup(struct hfi1_devdata *dd); +int hfi1_vnic_txreq_init(struct hfi1_devdata *dd); +void hfi1_vnic_txreq_deinit(struct hfi1_devdata *dd); void hfi1_vnic_bypass_rcv(struct hfi1_packet *packet); +void hfi1_vnic_sdma_init(struct hfi1_vnic_vport_info *vinfo); +bool hfi1_vnic_sdma_write_avail(struct hfi1_vnic_vport_info *vinfo, + u8 q_idx); /* vnic rdma netdev operations */ struct net_device *hfi1_vnic_alloc_rn(struct ib_device *device, diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c b/drivers/infiniband/hw/hfi1/vnic_main.c index 4a9bb8c..8f354e7 100644 --- a/drivers/infiniband/hw/hfi1/vnic_main.c +++ b/drivers/infiniband/hw/hfi1/vnic_main.c @@ -408,6 +408,10 @@ static void hfi1_vnic_maybe_stop_tx(struct hfi1_vnic_vport_info *vinfo, u8 q_idx) { netif_stop_subqueue(vinfo->netdev,
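Patch 11 keeps one struct hfi1_vnic_sdma per TX queue and maps VNIC queues onto SDMA engines. A minimal sketch of such a queue-to-engine assignment is below; the real driver selects engines through its own SDMA selection logic, so the simple modulo round-robin here is purely an assumption for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define HFI1_VNIC_MAX_TXQ 16

/* Reduced stand-in for struct hfi1_vnic_sdma: just the queue index
 * and the engine it is bound to. */
struct vnic_sdma_sketch {
	uint8_t q_idx;   /* VNIC TX queue index */
	uint8_t engine;  /* SDMA engine assigned to this queue */
};

/* Spread num_txq queues across num_engines engines round-robin. */
static void map_queues(struct vnic_sdma_sketch *sdma, int num_txq,
		       int num_engines)
{
	for (int i = 0; i < num_txq; i++) {
		sdma[i].q_idx = (uint8_t)i;
		sdma[i].engine = (uint8_t)(i % num_engines);
	}
}
```

With a per-queue binding like this, halting and waking a queue (the iowait handling in the patch) only ever involves one engine's submission ring.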
[PATCH 06/11] IB/opa-vnic: VNIC MAC table support
OPA VNIC MAC table contains the MAC address to DLID mappings provided by the Ethernet manager. During transmission, the MAC table provides the MAC address to DLID translation. Implement MAC table using simple hash list. Also provide support to update/query the MAC table by Ethernet manager. Reviewed-by: Dennis DalessandroReviewed-by: Ira Weiny Signed-off-by: Niranjana Vishwanathapura Signed-off-by: Sadanand Warrier --- drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c | 236 + .../infiniband/ulp/opa_vnic/opa_vnic_internal.h| 51 + drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c | 4 + 3 files changed, 291 insertions(+) diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c index c74d02a..2e8fee9 100644 --- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c +++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_encap.c @@ -96,6 +96,238 @@ static inline void opa_vnic_make_header(u8 *hdr, u32 slid, u32 dlid, u16 len, memcpy(hdr, h, OPA_VNIC_HDR_LEN); } +/* + * Using a simple hash table for mac table implementation with the last octet + * of mac address as a key. 
+ */ +static void opa_vnic_free_mac_tbl(struct hlist_head *mactbl) +{ + struct opa_vnic_mac_tbl_node *node; + struct hlist_node *tmp; + int bkt; + + if (!mactbl) + return; + + vnic_hash_for_each_safe(mactbl, bkt, tmp, node, hlist) { + hash_del(>hlist); + kfree(node); + } + kfree(mactbl); +} + +static struct hlist_head *opa_vnic_alloc_mac_tbl(void) +{ + u32 size = sizeof(struct hlist_head) * OPA_VNIC_MAC_TBL_SIZE; + struct hlist_head *mactbl; + + mactbl = kzalloc(size, GFP_KERNEL); + if (!mactbl) + return ERR_PTR(-ENOMEM); + + vnic_hash_init(mactbl); + return mactbl; +} + +/* opa_vnic_release_mac_tbl - empty and free the mac table */ +void opa_vnic_release_mac_tbl(struct opa_vnic_adapter *adapter) +{ + struct hlist_head *mactbl; + + mutex_lock(>mactbl_lock); + mactbl = rcu_access_pointer(adapter->mactbl); + rcu_assign_pointer(adapter->mactbl, NULL); + synchronize_rcu(); + opa_vnic_free_mac_tbl(mactbl); + mutex_unlock(>mactbl_lock); +} + +/* + * opa_vnic_query_mac_tbl - query the mac table for a section + * + * This function implements query of specific function of the mac table. + * The function also expects the requested range to be valid. 
+ */ +void opa_vnic_query_mac_tbl(struct opa_vnic_adapter *adapter, + struct opa_veswport_mactable *tbl) +{ + struct opa_vnic_mac_tbl_node *node; + struct hlist_head *mactbl; + int bkt; + u16 loffset, lnum_entries; + + rcu_read_lock(); + mactbl = rcu_dereference(adapter->mactbl); + if (!mactbl) + goto get_mac_done; + + loffset = be16_to_cpu(tbl->offset); + lnum_entries = be16_to_cpu(tbl->num_entries); + + vnic_hash_for_each(mactbl, bkt, node, hlist) { + struct __opa_vnic_mactable_entry *nentry = >entry; + struct opa_veswport_mactable_entry *entry; + + if ((node->index < loffset) || + (node->index >= (loffset + lnum_entries))) + continue; + + /* populate entry in the tbl corresponding to the index */ + entry = >tbl_entries[node->index - loffset]; + memcpy(entry->mac_addr, nentry->mac_addr, + ARRAY_SIZE(entry->mac_addr)); + memcpy(entry->mac_addr_mask, nentry->mac_addr_mask, + ARRAY_SIZE(entry->mac_addr_mask)); + entry->dlid_sd = cpu_to_be32(nentry->dlid_sd); + } + tbl->mac_tbl_digest = cpu_to_be32(adapter->info.vport.mac_tbl_digest); +get_mac_done: + rcu_read_unlock(); +} + +/* + * opa_vnic_update_mac_tbl - update mac table section + * + * This function updates the specified section of the mac table. + * The procedure includes following steps. + * - Allocate a new mac (hash) table. + * - Add the specified entries to the new table. + *(except the ones that are requested to be deleted). + * - Add all the other entries from the old mac table. + * - If there is a failure, free the new table and return. + * - Switch to the new table. + * - Free the old table and return. + * + * The function also expects the requested range to be valid. + */ +int opa_vnic_update_mac_tbl(struct opa_vnic_adapter *adapter, + struct opa_veswport_mactable *tbl) +{ + struct opa_vnic_mac_tbl_node *node, *new_node; + struct hlist_head *new_mactbl, *old_mactbl; + int i, bkt, rc = 0; + u8 key; + u16 loffset, lnum_entries; + + mutex_lock(>mactbl_lock); + /* allocate new mac
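Patch 06 describes the MAC table as "a simple hash table ... with the last octet of mac address as a key". The keying scheme can be sketched standalone as below; the bucket count is an assumption, and the kernel's hlist buckets, RCU protection, and table-swap update procedure are deliberately omitted.

```c
#include <assert.h>
#include <stdint.h>

#define MAC_TBL_SIZE 256  /* assumed: one bucket per possible last octet */

/* The last octet of the MAC address selects the hash bucket, so
 * addresses differing only in their OUI collide into one bucket and
 * are distinguished by a walk of that bucket's list. */
static unsigned int mac_tbl_bucket(const uint8_t mac[6])
{
	return mac[5] % MAC_TBL_SIZE;
}
```

This makes lookups on transmit cheap: one array index plus a short list scan, which is why the update path (opa_vnic_update_mac_tbl) goes to the trouble of building a whole new table and switching over under RCU rather than mutating buckets in place.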
[PATCH net-next v3 3/3] A Sample of using socket cookie and uid for traffic monitoring
From: Chenbo Feng

Add a sample program to demonstrate the possible usage of the get_socket_cookie and get_socket_uid helper functions. The program will store byte and packet counts of in/out traffic monitored by iptables and store the stats in a bpf map on a per-socket basis. The owner uid of the socket will be stored as part of the data entry. A shell script for running the program is also included.

Change since V2: Add the example code and the shell script to run the program.

Signed-off-by: Chenbo Feng --- samples/bpf/cookie_uid_helper_example.c | 225 +++ samples/bpf/run_cookie_uid_helper_example.sh | 14 ++ 2 files changed, 239 insertions(+) create mode 100644 samples/bpf/cookie_uid_helper_example.c create mode 100755 samples/bpf/run_cookie_uid_helper_example.sh diff --git a/samples/bpf/cookie_uid_helper_example.c b/samples/bpf/cookie_uid_helper_example.c new file mode 100644 index 000..ffa4740 --- /dev/null +++ b/samples/bpf/cookie_uid_helper_example.c @@ -0,0 +1,225 @@ +/* This test is a demo of using the get_socket_uid and get_socket_cookie + * helper functions to do per-socket based network traffic monitoring. + * It requires iptables version higher than 1.6.1 to load a pinned eBPF + * program into the xt_bpf match. + * + * Compile: + * gcc -I ../../usr/include -I ../../tools/lib -I ../../tools/include \ + * -I ./ -Wall cookie_uid_helper_example.c ../../tools/lib/bpf/bpf.c -o \ + * perSocketStats_example + * + * TEST: + * ./run_cookie_uid_helper_example.sh + * Then generate some traffic in various ways. ping 0 -c 10 would work + * but the cookie and uid in this case could both be 0.
A sample output + * with some traffic generated by a web browser is shown below: + * + * cookie: 877, uid: 0x3e8, Packet Count: 20, Bytes Count: 11058 + * + * cookie: 132, uid: 0x0, Packet Count: 2, Bytes Count: 286 + * cookie: 812, uid: 0x3e8, Packet Count: 3, Bytes Count: 1726 + * cookie: 802, uid: 0x3e8, Packet Count: 2, Bytes Count: 104 + * cookie: 877, uid: 0x3e8, Packet Count: 20, Bytes Count: 11058 + * cookie: 831, uid: 0x3e8, Packet Count: 2, Bytes Count: 104 + * cookie: 0, uid: 0x0, Packet Count: 6, Bytes Count: 712 + * cookie: 880, uid: 0xfffe, Packet Count: 1, Bytes Count: 70 + * + * Clean up: if using the shell script, the script file will delete the iptables + * rule and unmount the bpf program on exit. Otherwise the iptables rule needs + * to be deleted using: + * iptables -D INPUT -m bpf --object-pinned ${mnt_dir}/bpf_prog -j ACCEPT + */ + +#define _GNU_SOURCE + +#define offsetof(type, member) __builtin_offsetof(type, member) +#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x))) + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "libbpf.h" + +struct stats { + uint32_t uid; + uint64_t packets; + uint64_t bytes; +}; + +static int map_fd, prog_fd; + +static void maps_create(void) +{ + map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(uint32_t), + sizeof(struct stats), 100, 0); + if (map_fd < 0) + error(1, errno, "map create failed!\n"); +} + +static void prog_load(void) +{ + static char log_buf[1 << 16]; + + struct bpf_insn prog[] = { + /* +* it for future usage. value stored in R6 to R10 will not be +* reset after a bpf helper function call.
+*/ + BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), + /* +* pc1: BPF_FUNC_get_socket_cookie takes one parameter, +* R1: sk_buff +*/ + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, + BPF_FUNC_get_socket_cookie), + /* pc2-4: save to r7 for future usage*/ + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8), + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + /* +* pc5-8: set up the registers for BPF_FUNC_map_lookup_elem, +* it takes two parameters (R1: map_fd, R2: _cookie) +*/ + BPF_LD_MAP_FD(BPF_REG_1, map_fd), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, + BPF_FUNC_map_lookup_elem), + /* +* pc9. if r0 != 0x0, go to pc+14, since we have the cookie +* stored already +* Otherwise do pc10-22 to setup a new data entry. +*/ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 14), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_6), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, + BPF_FUNC_get_socket_uid), + /* +* Place a struct
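The sample BPF program above maintains a map from socket cookie to a {uid, packets, bytes} entry: look the cookie up, create the entry on first sight, then bump the counters per packet. The same bookkeeping can be modeled in plain userspace C, with a tiny linear-scan table standing in for the BPF hash map (the real map is BPF_MAP_TYPE_HASH with a u32 key, as in maps_create()).

```c
#include <assert.h>
#include <stdint.h>

struct stats {
	uint32_t uid;
	uint64_t packets;
	uint64_t bytes;
};

#define MAX_ENTRIES 100  /* matches the sample's map size */

struct stats_map {
	uint32_t keys[MAX_ENTRIES];  /* socket cookies (u32 key, as in
	                              * the sample's map) */
	struct stats vals[MAX_ENTRIES];
	int n;
};

/* Mirror of the BPF program's per-packet work: update an existing
 * entry, or insert a fresh one keyed by the cookie. */
static void account(struct stats_map *m, uint32_t cookie, uint32_t uid,
		    uint64_t len)
{
	for (int i = 0; i < m->n; i++) {
		if (m->keys[i] == cookie) {  /* existing entry: update */
			m->vals[i].packets++;
			m->vals[i].bytes += len;
			return;
		}
	}
	if (m->n < MAX_ENTRIES) {        /* first packet for this cookie */
		m->keys[m->n] = cookie;
		m->vals[m->n] = (struct stats){ .uid = uid,
						.packets = 1,
						.bytes = len };
		m->n++;
	}
}
```

In the eBPF version the lookup, branch, and insert are spelled out as raw BPF_RAW_INSN/BPF_JMP instructions around bpf_map_lookup_elem and bpf_map_update_elem; the control flow is the same.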
[PATCH net-next v3 0/3] net: core: Two helper functions for socket information
From: Chenbo Feng

Introduce two eBPF helper functions to get the socket cookie and socket uid for each packet. The helper functions are useful when the *sk field inside sk_buff is not empty. These helper functions can be used in socket- and uid-based traffic monitoring programs.

Change since V2: * Add a sample program to demonstrate the usage of the helper functions. * Moved the helper function proto invoking place. * Add function header into tools/include. * Apply sk_to_full_sk() before getting uid.

Change since V1: * Removed the unnecessary declarations and export command. * Resolved conflict with master branch. * Examine if the socket is a full socket before getting the uid.

Chenbo Feng (3): Add a helper function to get socket cookie in eBPF Add a eBPF helper function to retrieve socket uid A Sample of using socket cookie and uid for traffic monitoring

include/linux/sock_diag.h | 1 + include/uapi/linux/bpf.h | 16 +- net/core/filter.c | 36 + net/core/sock_diag.c | 2 +- samples/bpf/cookie_uid_helper_example.c | 225 +++ samples/bpf/run_cookie_uid_helper_example.sh | 14 ++ tools/include/uapi/linux/bpf.h | 4 +- 7 files changed, 295 insertions(+), 3 deletions(-) create mode 100644 samples/bpf/cookie_uid_helper_example.c create mode 100755 samples/bpf/run_cookie_uid_helper_example.sh -- 2.7.4
[PATCH net-next v3 1/3] Add a helper function to get socket cookie in eBPF
From: Chenbo FengRetrieve the socket cookie generated by sock_gen_cookie() from a sk_buff with a known socket. Generates a new cookie if one was not yet set.If the socket pointer inside sk_buff is NULL, 0 is returned. The helper function coud be useful in monitoring per socket networking traffic statistics and provide a unique socket identifier per namespace. Change since V2: Moved the helper function from bpf_base_func_proto() to both sk_filter_func_proto() and tc_cls_act_func_proto(). Add function name to uapi header file under tools/include. Change since V1: Removed the unnecessary declarations and export command, resolved conflict with master branch. Signed-off-by: Chenbo Feng --- include/linux/sock_diag.h | 1 + include/uapi/linux/bpf.h | 9 - net/core/filter.c | 17 + net/core/sock_diag.c | 2 +- tools/include/uapi/linux/bpf.h | 3 ++- 5 files changed, 29 insertions(+), 3 deletions(-) diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h index a0596ca0..a2f8109 100644 --- a/include/linux/sock_diag.h +++ b/include/linux/sock_diag.h @@ -24,6 +24,7 @@ void sock_diag_unregister(const struct sock_diag_handler *h); void sock_diag_register_inet_compat(int (*fn)(struct sk_buff *skb, struct nlmsghdr *nlh)); void sock_diag_unregister_inet_compat(int (*fn)(struct sk_buff *skb, struct nlmsghdr *nlh)); +u64 sock_gen_cookie(struct sock *sk); int sock_diag_check_cookie(struct sock *sk, const __u32 *cookie); void sock_diag_save_cookie(struct sock *sk, __u32 *cookie); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0539a0c..dc81a9f 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -456,6 +456,12 @@ union bpf_attr { * Return: * > 0 length of the string including the trailing NUL on success * < 0 error + * + * u64 bpf_bpf_get_socket_cookie(skb) + * Get the cookie for the socket stored inside sk_buff. 
+ * @skb: pointer to skb + * Return: 8 Bytes non-decreasing number on success or 0 if the socket + * field is missing inside sk_buff */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -503,7 +509,8 @@ union bpf_attr { FN(get_numa_node_id), \ FN(skb_change_head),\ FN(xdp_adjust_head),\ - FN(probe_read_str), + FN(probe_read_str), \ + FN(get_socket_cookie), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/net/core/filter.c b/net/core/filter.c index e466e004..06263c0 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include #include @@ -2599,6 +2600,18 @@ static const struct bpf_func_proto bpf_xdp_event_output_proto = { .arg5_type = ARG_CONST_SIZE, }; +BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb) +{ + return skb->sk ? sock_gen_cookie(skb->sk) : 0; +} + +static const struct bpf_func_proto bpf_get_socket_cookie_proto = { + .func = bpf_get_socket_cookie, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + static const struct bpf_func_proto * bpf_base_func_proto(enum bpf_func_id func_id) { @@ -2633,6 +2646,8 @@ sk_filter_func_proto(enum bpf_func_id func_id) switch (func_id) { case BPF_FUNC_skb_load_bytes: return _skb_load_bytes_proto; + case BPF_FUNC_get_socket_cookie: + return _get_socket_cookie_proto; default: return bpf_base_func_proto(func_id); } @@ -2692,6 +2707,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id) return _get_smp_processor_id_proto; case BPF_FUNC_skb_under_cgroup: return _skb_under_cgroup_proto; + case BPF_FUNC_get_socket_cookie: + return _get_socket_cookie_proto; default: return bpf_base_func_proto(func_id); } diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c index 6b10573..acd2a6c 100644 --- a/net/core/sock_diag.c +++ b/net/core/sock_diag.c @@ -19,7 +19,7 @@ static int (*inet_rcv_compat)(struct sk_buff *skb, struct nlmsghdr *nlh); static 
DEFINE_MUTEX(sock_diag_table_mutex); static struct workqueue_struct *broadcast_wq; -static u64 sock_gen_cookie(struct sock *sk) +u64 sock_gen_cookie(struct sock *sk) { while (1) { u64 res = atomic64_read(>sk_cookie); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 0539a0c..a94bdd3 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -503,7 +503,8 @@ union bpf_attr { FN(get_numa_node_id), \ FN(skb_change_head),\
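The sock_gen_cookie() function exported by this patch assigns each socket, at most once, a value from a global non-decreasing counter, retrying on races. The pattern can be modeled standalone with C11 atomics as below; fake_sock and the exact loop shape are stand-ins for the kernel's struct sock and atomic64 API, so treat this as a sketch of the idea, not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t sock_net_cookie = 1;  /* global generator */

struct fake_sock {
	_Atomic uint64_t sk_cookie;  /* 0 means "not yet assigned" */
};

static uint64_t gen_cookie(struct fake_sock *sk)
{
	for (;;) {
		uint64_t res = atomic_load(&sk->sk_cookie);

		if (res)
			return res;  /* already assigned: stable from now on */

		uint64_t val = atomic_fetch_add(&sock_net_cookie, 1);
		uint64_t expected = 0;

		/* Only one racing caller wins the CAS; losers loop and
		 * read back the winner's value. */
		atomic_compare_exchange_strong(&sk->sk_cookie, &expected, val);
	}
}
```

This is why the commit message for patch 1/3 can promise a non-decreasing, per-namespace-unique identifier: the generator only ever moves forward, and the lazy CAS assignment guarantees each socket's cookie never changes once read.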
[PATCH net-next v3 3/3] A Sample of using socket cookie and uid for traffic monitoring
From: Chenbo Feng

Add a sample program to demonstrate the possible usage of the get_socket_cookie and get_socket_uid helper functions. The program stores byte and packet counts of in/out traffic monitored by iptables in a bpf map on a per-socket basis. The owner uid of the socket is stored as part of the data entry. A shell script for running the program is also included. Change since V2: Add the example code and the shell script to run the program. Signed-off-by: Chenbo Feng --- samples/bpf/cookie_uid_helper_example.c | 225 +++ samples/bpf/run_cookie_uid_helper_example.sh | 14 ++ 2 files changed, 239 insertions(+) create mode 100644 samples/bpf/cookie_uid_helper_example.c create mode 100755 samples/bpf/run_cookie_uid_helper_example.sh diff --git a/samples/bpf/cookie_uid_helper_example.c b/samples/bpf/cookie_uid_helper_example.c new file mode 100644 index 000..ffa4740 --- /dev/null +++ b/samples/bpf/cookie_uid_helper_example.c @@ -0,0 +1,225 @@ +/* This test is a demo of using the get_socket_uid and get_socket_cookie + * helper functions to do per-socket based network traffic monitoring. + * It requires iptables version higher than 1.6.1 to load a pinned eBPF + * program into the xt_bpf match. + * + * Compile: + * gcc -I ../../usr/include -I ../../tools/lib -I ../../tools/include \ + * -I ./ -Wall cookie_uid_helper_example.c ../../tools/lib/bpf/bpf.c -o \ + * perSocketStats_example + * + * TEST: + * ./run_cookie_uid_helper_example.sh + * Then generate some traffic in various ways. ping 0 -c 10 would work + * but the cookie and uid in this case could both be 0.
A sample output + * with some traffic generated by a web browser is shown below: + * + * cookie: 877, uid: 0x3e8, Packet Count: 20, Bytes Count: 11058 + * + * cookie: 132, uid: 0x0, Packet Count: 2, Bytes Count: 286 + * cookie: 812, uid: 0x3e8, Packet Count: 3, Bytes Count: 1726 + * cookie: 802, uid: 0x3e8, Packet Count: 2, Bytes Count: 104 + * cookie: 877, uid: 0x3e8, Packet Count: 20, Bytes Count: 11058 + * cookie: 831, uid: 0x3e8, Packet Count: 2, Bytes Count: 104 + * cookie: 0, uid: 0x0, Packet Count: 6, Bytes Count: 712 + * cookie: 880, uid: 0xfffe, Packet Count: 1, Bytes Count: 70 + * + * Clean up: if using the shell script, the script will delete the iptables + * rule and unmount the bpf program on exit. Otherwise the iptables rule needs + * to be deleted using: + * iptables -D INPUT -m bpf --object-pinned ${mnt_dir}/bpf_prog -j ACCEPT + */ + +#define _GNU_SOURCE + +#define offsetof(type, member) __builtin_offsetof(type, member) +#define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x))) + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "libbpf.h" + +struct stats { + uint32_t uid; + uint64_t packets; + uint64_t bytes; +}; + +static int map_fd, prog_fd; + +static void maps_create(void) +{ + map_fd = bpf_create_map(BPF_MAP_TYPE_HASH, sizeof(uint32_t), + sizeof(struct stats), 100, 0); + if (map_fd < 0) + error(1, errno, "map create failed!\n"); +} + +static void prog_load(void) +{ + static char log_buf[1 << 16]; + + struct bpf_insn prog[] = { + /* +* save the sk_buff in R6 for future usage. Values stored in R6 to R10 will not be +* reset after a bpf helper function call. 
+*/ + BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), + /* +* pc1: BPF_FUNC_get_socket_cookie takes one parameter, +* R1: sk_buff +*/ + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, + BPF_FUNC_get_socket_cookie), + /* pc2-4: save to r7 for future usage*/ + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8), + BPF_MOV64_REG(BPF_REG_7, BPF_REG_10), + BPF_ALU64_IMM(BPF_ADD, BPF_REG_7, -8), + /* +* pc5-8: set up the registers for BPF_FUNC_map_lookup_elem, +* it takes two parameters (R1: map_fd, R2: _cookie) +*/ + BPF_LD_MAP_FD(BPF_REG_1, map_fd), + BPF_MOV64_REG(BPF_REG_2, BPF_REG_7), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, + BPF_FUNC_map_lookup_elem), + /* +* pc9. if r0 != 0x0, go to pc+14, since we have the cookie +* stored already +* Otherwise do pc10-22 to setup a new data entry. +*/ + BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 14), + BPF_MOV64_REG(BPF_REG_1, BPF_REG_6), + BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, + BPF_FUNC_get_socket_uid), + /* +* Place a struct
[PATCH net-next v3 2/3] Add a eBPF helper function to retrieve socket uid
From: Chenbo Feng

Returns the owner uid of the socket inside a sk_buff. This is useful to perform per-UID accounting of network traffic or per-UID packet filtering. The socket needs to be a fullsock, otherwise 0 is returned. Change since V2: Add a sk_to_full_sk() check before retrieving the uid. Moved the helper function from bpf_base_func_proto() to both sk_filter_func_proto() and tc_cls_act_func_proto(). Add the function name to the uapi header file under tools/include. Change since V1: Removed the unnecessary declarations and export command, resolved conflict with the master branch. Examine if the socket is a full socket before getting the uid. Signed-off-by: Chenbo Feng --- include/uapi/linux/bpf.h | 9 - net/core/filter.c | 19 +++ tools/include/uapi/linux/bpf.h | 3 ++- 3 files changed, 29 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index dc81a9f..ff42111 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -462,6 +462,12 @@ union bpf_attr { * @skb: pointer to skb * Return: 8 Bytes non-decreasing number on success or 0 if the socket * field is missing inside sk_buff + * + * u32 bpf_get_socket_uid(skb) + * Get the owner uid of the socket stored inside sk_buff. 
+ * @skb: pointer to skb + * Return: uid of the socket owner on success or 0 if the socket pointer + * inside sk_buff is NULL */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -510,7 +516,8 @@ union bpf_attr { FN(skb_change_head),\ FN(xdp_adjust_head),\ FN(probe_read_str), \ - FN(get_socket_cookie), + FN(get_socket_cookie), \ + FN(get_socket_uid), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/net/core/filter.c b/net/core/filter.c index 06263c0..53c4afc 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -2612,6 +2612,21 @@ static const struct bpf_func_proto bpf_get_socket_cookie_proto = { .arg1_type = ARG_PTR_TO_CTX, }; +BPF_CALL_1(bpf_get_socket_uid, struct sk_buff *, skb) +{ + struct sock *sk = sk_to_full_sk(skb->sk); + kuid_t kuid = sock_net_uid(dev_net(skb->dev), sk); + + return (u32)kuid.val; +} + +static const struct bpf_func_proto bpf_get_socket_uid_proto = { + .func = bpf_get_socket_uid, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + static const struct bpf_func_proto * bpf_base_func_proto(enum bpf_func_id func_id) { @@ -2648,6 +2663,8 @@ sk_filter_func_proto(enum bpf_func_id func_id) return _skb_load_bytes_proto; case BPF_FUNC_get_socket_cookie: return _get_socket_cookie_proto; + case BPF_FUNC_get_socket_uid: + return _get_socket_uid_proto; default: return bpf_base_func_proto(func_id); } @@ -2709,6 +2726,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id) return _skb_under_cgroup_proto; case BPF_FUNC_get_socket_cookie: return _get_socket_cookie_proto; + case BPF_FUNC_get_socket_uid: + return _get_socket_uid_proto; default: return bpf_base_func_proto(func_id); } diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index a94bdd3..4a2d56d 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -504,7 +504,8 @@ union bpf_attr { FN(skb_change_head),\ FN(xdp_adjust_head),\ 
FN(probe_read_str), \ - FN(get_socket_cookie), + FN(get_socket_cookie), \ + FN(get_socket_uid), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call -- 2.7.4
[PATCH net-next v3 1/3] Add a helper function to get socket cookie in eBPF
From: Chenbo Feng

Retrieve the socket cookie generated by sock_gen_cookie() from a sk_buff with a known socket. Generates a new cookie if one was not yet set. If the socket pointer inside sk_buff is NULL, 0 is returned. The helper function could be useful in monitoring per-socket networking traffic statistics and provides a unique socket identifier per namespace. Change since V2: Moved the helper function from bpf_base_func_proto() to both sk_filter_func_proto() and tc_cls_act_func_proto(). Add the function name to the uapi header file under tools/include. Change since V1: Removed the unnecessary declarations and export command, resolved conflict with the master branch. Signed-off-by: Chenbo Feng --- include/linux/sock_diag.h | 1 + include/uapi/linux/bpf.h | 9 - net/core/filter.c | 17 + net/core/sock_diag.c | 2 +- tools/include/uapi/linux/bpf.h | 3 ++- 5 files changed, 29 insertions(+), 3 deletions(-) diff --git a/include/linux/sock_diag.h b/include/linux/sock_diag.h index a0596ca0..a2f8109 100644 --- a/include/linux/sock_diag.h +++ b/include/linux/sock_diag.h @@ -24,6 +24,7 @@ void sock_diag_unregister(const struct sock_diag_handler *h); void sock_diag_register_inet_compat(int (*fn)(struct sk_buff *skb, struct nlmsghdr *nlh)); void sock_diag_unregister_inet_compat(int (*fn)(struct sk_buff *skb, struct nlmsghdr *nlh)); +u64 sock_gen_cookie(struct sock *sk); int sock_diag_check_cookie(struct sock *sk, const __u32 *cookie); void sock_diag_save_cookie(struct sock *sk, __u32 *cookie); diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 0539a0c..dc81a9f 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -456,6 +456,12 @@ union bpf_attr { * Return: * > 0 length of the string including the trailing NUL on success * < 0 error + * + * u64 bpf_get_socket_cookie(skb) + * Get the cookie for the socket stored inside sk_buff. 
+ * @skb: pointer to skb + * Return: 8 Bytes non-decreasing number on success or 0 if the socket + * field is missing inside sk_buff */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -503,7 +509,8 @@ union bpf_attr { FN(get_numa_node_id), \ FN(skb_change_head),\ FN(xdp_adjust_head),\ - FN(probe_read_str), + FN(probe_read_str), \ + FN(get_socket_cookie), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends to call diff --git a/net/core/filter.c b/net/core/filter.c index e466e004..06263c0 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -26,6 +26,7 @@ #include #include #include +#include #include #include #include @@ -2599,6 +2600,18 @@ static const struct bpf_func_proto bpf_xdp_event_output_proto = { .arg5_type = ARG_CONST_SIZE, }; +BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb) +{ + return skb->sk ? sock_gen_cookie(skb->sk) : 0; +} + +static const struct bpf_func_proto bpf_get_socket_cookie_proto = { + .func = bpf_get_socket_cookie, + .gpl_only = false, + .ret_type = RET_INTEGER, + .arg1_type = ARG_PTR_TO_CTX, +}; + static const struct bpf_func_proto * bpf_base_func_proto(enum bpf_func_id func_id) { @@ -2633,6 +2646,8 @@ sk_filter_func_proto(enum bpf_func_id func_id) switch (func_id) { case BPF_FUNC_skb_load_bytes: return _skb_load_bytes_proto; + case BPF_FUNC_get_socket_cookie: + return _get_socket_cookie_proto; default: return bpf_base_func_proto(func_id); } @@ -2692,6 +2707,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id) return _get_smp_processor_id_proto; case BPF_FUNC_skb_under_cgroup: return _skb_under_cgroup_proto; + case BPF_FUNC_get_socket_cookie: + return _get_socket_cookie_proto; default: return bpf_base_func_proto(func_id); } diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c index 6b10573..acd2a6c 100644 --- a/net/core/sock_diag.c +++ b/net/core/sock_diag.c @@ -19,7 +19,7 @@ static int (*inet_rcv_compat)(struct sk_buff *skb, struct nlmsghdr *nlh); static 
DEFINE_MUTEX(sock_diag_table_mutex); static struct workqueue_struct *broadcast_wq; -static u64 sock_gen_cookie(struct sock *sk) +u64 sock_gen_cookie(struct sock *sk) { while (1) { u64 res = atomic64_read(>sk_cookie); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 0539a0c..a94bdd3 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -503,7 +503,8 @@ union bpf_attr { FN(get_numa_node_id), \ FN(skb_change_head),\
[PATCH net-next v3 0/3] net: core: Two Helper function about socket information
From: Chenbo Feng

Introduce two eBPF helper functions to get the socket cookie and socket uid for each packet. The helper functions are useful when the *sk field inside sk_buff is not empty. These helper functions can be used in socket- and uid-based traffic monitoring programs. Change since V2: * Add a sample program to demonstrate the usage of the helper functions. * Moved the helper function proto invoking place. * Add function header into tools/include. * Apply sk_to_full_sk() before getting the uid. Change since V1: * Removed the unnecessary declarations and export command * resolved conflict with master branch. * Examine if the socket is a full socket before getting the uid. Chenbo Feng (3): Add a helper function to get socket cookie in eBPF Add a eBPF helper function to retrieve socket uid A Sample of using socket cookie and uid for traffic monitoring include/linux/sock_diag.h| 1 + include/uapi/linux/bpf.h | 16 +- net/core/filter.c| 36 + net/core/sock_diag.c | 2 +- samples/bpf/cookie_uid_helper_example.c | 225 +++ samples/bpf/run_cookie_uid_helper_example.sh | 14 ++ tools/include/uapi/linux/bpf.h | 4 +- 7 files changed, 295 insertions(+), 3 deletions(-) create mode 100644 samples/bpf/cookie_uid_helper_example.c create mode 100755 samples/bpf/run_cookie_uid_helper_example.sh -- 2.7.4
Re: VXLAN RCU error
On Wed, 22 Feb 2017 14:27:45 -0800, Jakub Kicinski wrote: > Hi Roopa! Ah, sorry, it seems like this splat may be coming all the way from c6fcc4fc5f8b ("vxlan: avoid using stale vxlan socket."). > I get this RCU error on net 12d656af4e3d2781b9b9f52538593e1717e7c979: > > [ 1571.067134] === > [ 1571.071842] [ ERR: suspicious RCU usage. ] > [ 1571.076546] 4.10.0-debug-03232-g12d656af4e3d #1 Tainted: GW O > [ 1571.084166] --- > [ 1571.088867] ../drivers/net/vxlan.c:2111 suspicious rcu_dereference_check() > usage! > [ 1571.097286] > [ 1571.097286] other info that might help us debug this: > [ 1571.097286] > [ 1571.106305] > [ 1571.106305] rcu_scheduler_active = 2, debug_locks = 1 > [ 1571.113654] 3 locks held by ping/13826: > [ 1571.117968] #0: (sk_lock-AF_INET){+.+.+.}, at: [] > raw_sendmsg+0x14e2/0x2e40 > [ 1571.127758] #1: (rcu_read_lock_bh){..}, at: [] > ip_finish_output2+0x274/0x1390 > [ 1571.138135] #2: (rcu_read_lock_bh){..}, at: [] > __dev_queue_xmit+0x1ec/0x2750 > [ 1571.148408] > [ 1571.148408] stack backtrace: > [ 1571.153326] CPU: 10 PID: 13826 Comm: ping Tainted: GW O > 4.10.0-debug-03232-g12d656af4e3d #1 > [ 1571.163877] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 > 11/08/2016 > [ 1571.172290] Call Trace: > [ 1571.175053] dump_stack+0xcd/0x134 > [ 1571.178881] ? _atomic_dec_and_lock+0xcc/0xcc > [ 1571.183782] ? print_lock+0xb2/0xb5 > [ 1571.187711] lockdep_rcu_suspicious+0x123/0x170 > [ 1571.192807] vxlan_xmit_one+0x1931/0x4270 [vxlan] > [ 1571.198126] ? encap_bypass_if_local+0x380/0x380 [vxlan] > [ 1571.204109] ? sched_clock+0x9/0x10 > [ 1571.208034] ? sched_clock_cpu+0x20/0x2c0 > [ 1571.212541] ? unwind_get_return_address+0x1b8/0x2b0 > [ 1571.218132] ? __lock_acquire+0x6d6/0x3160 > [ 1571.222740] vxlan_xmit+0x756/0x4f90 [vxlan] > [ 1571.227541] ? vxlan_xmit_one+0x4270/0x4270 [vxlan] > [ 1571.233014] ? 
netif_skb_features+0x2be/0xba0 > [ 1571.237919] dev_hard_start_xmit+0x1ab/0xa70 > [ 1571.242724] __dev_queue_xmit+0x137b/0x2750 > [ 1571.247425] ? __dev_queue_xmit+0x1ec/0x2750 > [ 1571.252228] ? netdev_pick_tx+0x330/0x330 > [ 1571.256735] ? debug_smp_processor_id+0x17/0x20 > [ 1571.261826] ? get_lock_stats+0x1d/0x160 > [ 1571.266241] ? mark_held_locks+0x105/0x280 > [ 1571.270850] ? memcpy+0x45/0x50 > [ 1571.274391] dev_queue_xmit+0x10/0x20 > [ 1571.278511] neigh_resolve_output+0x43e/0x7f0 > [ 1571.283405] ? ip_finish_output2+0x69d/0x1390 > [ 1571.288308] ip_finish_output2+0x69d/0x1390 > [ 1571.293008] ? ip_finish_output2+0x274/0x1390 > [ 1571.297909] ? ip_copy_metadata+0x7e0/0x7e0 > [ 1571.302610] ? get_lock_stats+0x1d/0x160 > [ 1571.307027] ip_finish_output+0x598/0xc50 > [ 1571.311537] ip_output+0x371/0x630 > [ 1571.315362] ? ip_output+0x1dc/0x630 > [ 1571.319383] ? ip_mc_output+0xe70/0xe70 > [ 1571.323694] ? kfree+0x372/0x5a0 > [ 1571.327325] ? mark_held_locks+0x105/0x280 > [ 1571.331933] ? __ip_make_skb+0xdd1/0x2200 > [ 1571.336457] ip_local_out+0x8f/0x180 > [ 1571.340480] ip_send_skb+0x44/0xf0 > [ 1571.344306] ip_push_pending_frames+0x5a/0x80 > [ 1571.349203] raw_sendmsg+0x164d/0x2e40 > [ 1571.353422] ? debug_check_no_locks_freed+0x350/0x350 > [ 1571.359099] ? dst_output+0x1b0/0x1b0 > [ 1571.363217] ? get_lock_stats+0x1d/0x160 > [ 1571.367640] ? __might_fault+0x199/0x230 > [ 1571.372052] ? kasan_check_write+0x14/0x20 > [ 1571.382002] ? _copy_from_user+0xb9/0x130 > [ 1571.386513] ? rw_copy_check_uvector+0x8d/0x490 > [ 1571.391609] ? import_iovec+0xae/0x5d0 > [ 1571.395826] ? push_pipe+0xd00/0xd00 > [ 1571.399847] ? kasan_check_write+0x14/0x20 > [ 1571.404450] ? _copy_from_user+0xb9/0x130 > [ 1571.408960] inet_sendmsg+0x19f/0x5f0 > [ 1571.413071] ? inet_recvmsg+0x980/0x980 > [ 1571.417386] sock_sendmsg+0xe2/0x170 > [ 1571.421408] ___sys_sendmsg+0x66e/0x960 > [ 1571.425726] ? mem_cgroup_commit_charge+0x144/0x2720 > [ 1571.431303] ? 
copy_msghdr_from_user+0x610/0x610 > [ 1571.436495] ? debug_smp_processor_id+0x17/0x20 > [ 1571.441584] ? get_lock_stats+0x1d/0x160 > [ 1571.445995] ? mem_cgroup_uncharge_swap+0x250/0x250 > [ 1571.451474] ? page_add_new_anon_rmap+0x173/0x3a0 > [ 1571.456762] ? handle_mm_fault+0x1589/0x3820 > [ 1571.461566] ? handle_mm_fault+0x1589/0x3820 > [ 1571.466362] ? handle_mm_fault+0x191/0x3820 > [ 1571.471070] ? __fdget+0x13/0x20 > [ 1571.474702] ? get_lock_stats+0x1d/0x160 > [ 1571.479116] __sys_sendmsg+0xc6/0x150 > [ 1571.483234] ? SyS_shutdown+0x1b0/0x1b0 > [ 1571.487551] ? __do_page_fault+0x556/0xe50 > [ 1571.492158] ? trace_hardirqs_on_thunk+0x1a/0x1c > [ 1571.497340] SyS_sendmsg+0x12/0x20 > [ 1571.501166] entry_SYSCALL_64_fastpath+0x23/0xc6 > [ 1571.506354] RIP: 0033:0x7fca2d0384a0 > [ 1571.510374] RSP: 002b:7ffd18d7fe88 EFLAGS: 0246 ORIG_RAX: > 002e > [ 1571.518886] RAX: ffda RBX: 0040 RCX: > 7fca2d0384a0 > [ 1571.526889] RDX:
Re: [PATCH net-next 2/2] sctp: add support for MSG_MORE
On Tue, Feb 21, 2017 at 10:27 PM, David Laight wrote: > From: Xin Long >> Sent: 18 February 2017 17:53 >> This patch is to add support for MSG_MORE on sctp. >> >> It adds force_delay in sctp_datamsg to save MSG_MORE, and sets it after >> creating datamsg according to the send flag. sctp_packet_can_append_data >> then uses it to decide if the chunks of this msg will be sent at once or >> delayed. >> >> Note that unlike [1], this patch saves MSG_MORE in datamsg, instead of >> in assoc. As sctp enqueues the chunks first, then dequeues them one by >> one. If it's saved in assoc, the current msg's send flag (MSG_MORE) may >> affect other chunks' bundling. > > I thought about that and decided that the MSG_MORE flag on the last data > chunk was the only one that mattered. > Indeed looking at any others is broken. > > Consider what happens if you have two small chunks queued, the first > with MSG_MORE set, the second with it clear. > > I think that sctp_outq_flush() will look at the first chunk and decide it > doesn't need to do anything because sctp_packet_transmit_chunk() > returns SCTP_XMIT_DELAY. > The data chunk with MSG_MORE clear won't even be looked at. > So the data will never be sent. It's not as bad as you thought; in sctp_packet_can_append_data(): when inflight == 0 || sctp_sk(asoc->base.sk)->nodelay, the chunks would still be sent out. What the MSG_MORE flag actually does is ignore inflight == 0 and sctp_sk(asoc->base.sk)->nodelay to delay the chunks, but it still has to respect the original logic (like !chunk->msg->can_delay || !sctp_packet_empty(packet) || ...) Delaying the chunks with MSG_MORE set even when inflight is 0 is especially important here for users. > > I wouldn't worry about having messages queued that have MSG_MORE clear > when the final message has it set. Yeah, it's an old optimization for bundling. MSG_MORE should NOT break that. 
> While it might be 'nice' to send the data (would have to be tx credit) > waiting for the next data chunk shouldn't be a problem. Sorry, do you mean it shouldn't send the data whenever it's waiting for the next data chunk? > > I'm not sure I even want to test the current patch! > > David >
Re: [PATCH net v5] bpf: add helper to compare network namespaces
On 2/19/17 9:17 PM, Eric W. Biederman wrote: >>> @@ -2597,6 +2598,39 @@ static const struct bpf_func_proto >>> bpf_xdp_event_output_proto = { >>> .arg5_type = ARG_CONST_STACK_SIZE, >>> }; >>> >>> +BPF_CALL_3(bpf_sk_netns_cmp, struct sock *, sk, u64, ns_dev, u64, ns_ino) >>> +{ >>> + return netns_cmp(sock_net(sk), ns_dev, ns_ino); >>> +} >> >> Is there anything that speaks against doing the comparison itself >> outside of the helper? Meaning, the helper would get a buffer >> passed from stack f.e. struct foo { u64 ns_dev; u64 ns_ino; } >> and fills both out with the netns info belonging to the sk/skb. > > Yes. The dev/ino pair is not necessarily unique so it is not at all > clear that the returned value would be what the program is expecting. How does the comparison inside a helper change the fact that a dev and inode number are compared? ie., inside or outside of a helper, the end result is that a bpf program has a dev/inode pair that is compared to that of a socket or skb. Ideally, it would be nice to have a bpf equivalent to net_eq(), but it is not possible from a practical perspective to have bpf programs load a namespace reference (address really) from a given pid or fd.
Re: [PATCH net-next] virtio-net: switch to use build_skb() for small buffer
On Feb 23, 2017 at 01:17, John Fastabend wrote: On 17-02-21 12:46 AM, Jason Wang wrote: This patch switches to using build_skb() for small buffers, which can have better performance for both TCP and XDP (since we can work at the page level before skb creation). It also removes lots of XDP code since both mergeable and small buffers use a page frag during refill now. Before | After XDP_DROP(xdp1) 64B : 11.1Mpps | 14.4Mpps Tested with xdp1/xdp2/xdp_ip_tx_tunnel and netperf. When you do the xdp tests are you generating packets with pktgen on the corresponding tap devices? Yes, pktgen on the tap directly. Also another thought, have you looked at using some of the buffer recycling techniques used in the hardware drivers such as ixgbe and, with Eric's latest patches, mlx? I have seen significant performance increases for some workloads doing this. I wanted to try something like this out on virtio but haven't had time yet. Yes, this is on the TODO list. Will pick some time to do this. Thanks Signed-off-by: Jason Wang --- [...] 
static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue *rq, gfp_t gfp) { - int headroom = GOOD_PACKET_LEN + virtnet_get_headroom(vi); + struct page_frag *alloc_frag = >alloc_frag; + char *buf; unsigned int xdp_headroom = virtnet_get_headroom(vi); - struct sk_buff *skb; - struct virtio_net_hdr_mrg_rxbuf *hdr; + int len = vi->hdr_len + VIRTNET_RX_PAD + GOOD_PACKET_LEN + xdp_headroom; int err; - skb = __netdev_alloc_skb_ip_align(vi->dev, headroom, gfp); - if (unlikely(!skb)) + len = SKB_DATA_ALIGN(len) + + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); + if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp))) return -ENOMEM; - skb_put(skb, headroom); - - hdr = skb_vnet_hdr(skb); - sg_init_table(rq->sg, 2); - sg_set_buf(rq->sg, hdr, vi->hdr_len); - skb_to_sgvec(skb, rq->sg + 1, xdp_headroom, skb->len - xdp_headroom); - - err = virtqueue_add_inbuf(rq->vq, rq->sg, 2, skb, gfp); + buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; + get_page(alloc_frag->page); + alloc_frag->offset += len; + sg_init_one(rq->sg, buf + VIRTNET_RX_PAD + xdp_headroom, + vi->hdr_len + GOOD_PACKET_LEN); + err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp); Nice this cleans up a lot of the branching code. Thanks. Acked-by: John Fastabend
Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
On Wed, Feb 22, 2017 at 6:06 PM, Eric Dumazetwrote: > On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote: > >> >> Right but you were talking about using both halves one after the >> other. If that occurs you have nothing left that you can reuse. That >> was what I was getting at. If you use up both halves you end up >> having to unmap the page. >> > > You must have misunderstood me. > > Once we use both halves of a page, we _keep_ the page, we do not unmap > it. > > We save the page pointer in a ring buffer of pages. > Call it the 'quarantine' > > When we _need_ to replenish the RX desc, we take a look at the oldest > entry in the quarantine ring. > > If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse > this saved page. > > If not, _then_ we unmap and release the page. Okay, that was what I was referring to when I mentioned a "hybrid between the mlx5 and the Intel approach". Makes sense. > Note that we would have received 4096 frames before looking at the page > count, so there is high chance both halves were consumed. > > To recap on x86 : > > 2048 active pages would be visible by the device, because 4096 RX desc > would contain dma addresses pointing to the 4096 halves. > > And 2048 pages would be in the reserve. The buffer info layout for something like that would probably be pretty interesting. Basically you would be doubling up the ring so that you handle 2 Rx descriptors per a single buffer info since you would automatically know that it would be an even/odd setup in terms of the buffer offsets. If you get a chance to do something like that I would love to know the result. Otherwise if I get a chance I can try messing with i40e or ixgbe some time and see what kind of impact it has. >> The whole idea behind using only half the page per descriptor is to >> allow us to loop through the ring before we end up reusing it again. >> That buys us enough time that usually the stack has consumed the frame >> before we need it again. 
> > > The same will happen really. > > Best maybe is for me to send the patch ;) I think I have the idea now. However patches are always welcome.. :-)
Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
On Wed, 2017-02-22 at 17:08 -0800, Alexander Duyck wrote: > > Right but you were talking about using both halves one after the > other. If that occurs you have nothing left that you can reuse. That > was what I was getting at. If you use up both halves you end up > having to unmap the page. > You must have misunderstood me. Once we use both halves of a page, we _keep_ the page, we do not unmap it. We save the page pointer in a ring buffer of pages. Call it the 'quarantine' When we _need_ to replenish the RX desc, we take a look at the oldest entry in the quarantine ring. If page count is 1 (or pagecnt_bias if needed) -> we immediately reuse this saved page. If not, _then_ we unmap and release the page. Note that we would have received 4096 frames before looking at the page count, so there is high chance both halves were consumed. To recap on x86 : 2048 active pages would be visible by the device, because 4096 RX desc would contain dma addresses pointing to the 4096 halves. And 2048 pages would be in the reserve. > The whole idea behind using only half the page per descriptor is to > allow us to loop through the ring before we end up reusing it again. > That buys us enough time that usually the stack has consumed the frame > before we need it again. The same will happen really. Best maybe is for me to send the patch ;)
Re: [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages
>> >> - page = alloc_page(gfp_mask); >> + page = skb_frag_page(f); >> + if (page_count(page) == 1) { >> + skb_frag_ref(skb, i); > > This could be : get_page(page); Ah, indeed. Thanks. > >> + goto copy_done; >> + } >> + >> + if (f->size > PAGE_SIZE) { >> + order = get_order(f->size); >> + mask |= __GFP_COMP; > > Note that this would probably fail under memory pressure. > > We could instead try to explode the few segments into order-0 only > pages. Good point. I'll revise to use only order-0 here.
Re: [PATCH] uapi: fix linux/rds.h userspace compilation errors
On 2/22/2017 5:13 PM, Dmitry V. Levin wrote:
> Consistently use types from linux/types.h to fix the following
> linux/rds.h userspace compilation errors:
>
> /usr/include/linux/rds.h:198:2: error: unknown type name 'u8'
>    u8 rx_traces;
> /usr/include/linux/rds.h:199:2: error: unknown type name 'u8'
>    u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
> /usr/include/linux/rds.h:203:2: error: unknown type name 'u8'
>    u8 rx_traces;
> /usr/include/linux/rds.h:204:2: error: unknown type name 'u8'
>    u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
> /usr/include/linux/rds.h:205:2: error: unknown type name 'u64'
>    u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
>
> Fixes: 3289025a ("RDS: add receive message trace used by application")
> Signed-off-by: Dmitry V. Levin
> ---

This was part of the patch I submitted the other day with the rest of the clean-up. Thanks Dmitry.

Acked-by: Santosh Shilimkar
[PATCH] uapi: fix linux/seg6.h and linux/seg6_iptunnel.h userspace compilation errors
Include <linux/in6.h> in uapi/linux/seg6.h to fix the following linux/seg6.h userspace compilation error:

/usr/include/linux/seg6.h:31:18: error: array type has incomplete element type 'struct in6_addr'
  struct in6_addr segments[0];

Include <linux/seg6.h> in uapi/linux/seg6_iptunnel.h to fix the following linux/seg6_iptunnel.h userspace compilation error:

/usr/include/linux/seg6_iptunnel.h:26:21: error: array type has incomplete element type 'struct ipv6_sr_hdr'
  struct ipv6_sr_hdr srh[0];

Fixes: a50a05f4 ("ipv6: sr: add missing Kbuild export for header files")
Signed-off-by: Dmitry V. Levin
---
 include/uapi/linux/seg6.h          | 1 +
 include/uapi/linux/seg6_iptunnel.h | 2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/uapi/linux/seg6.h b/include/uapi/linux/seg6.h
index 61df8d3..7278511 100644
--- a/include/uapi/linux/seg6.h
+++ b/include/uapi/linux/seg6.h
@@ -15,6 +15,7 @@
 #define _UAPI_LINUX_SEG6_H
 
 #include <linux/types.h>
+#include <linux/in6.h>		/* For struct in6_addr. */
 
 /*
  * SRH
diff --git a/include/uapi/linux/seg6_iptunnel.h b/include/uapi/linux/seg6_iptunnel.h
index 7a7183d..b6e5a0a 100644
--- a/include/uapi/linux/seg6_iptunnel.h
+++ b/include/uapi/linux/seg6_iptunnel.h
@@ -14,6 +14,8 @@
 #ifndef _UAPI_LINUX_SEG6_IPTUNNEL_H
 #define _UAPI_LINUX_SEG6_IPTUNNEL_H
 
+#include <linux/seg6.h>		/* For struct ipv6_sr_hdr. */
+
 enum {
 	SEG6_IPTUNNEL_UNSPEC,
 	SEG6_IPTUNNEL_SRH,
-- 
ldv
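The error class being fixed here is worth a tiny standalone demonstration: C only allows an array member when its element type is *complete* at the point of declaration. The sketch below mimics the SRH layout with an invented struct name, and uses `<netinet/in.h>` as the userspace analogue of the kernel's `linux/in6.h` to make `struct in6_addr` complete; remove that include and you get exactly the "array type has incomplete element type" error quoted in the patch:

```c
#include <assert.h>
#include <netinet/in.h>		/* makes struct in6_addr complete */

/* Illustrative shape only, not the real linux/seg6.h definition. */
struct srh_like {
	unsigned char type;
	unsigned char segments_left;
	struct in6_addr segments[];	/* needs a complete element type */
};
```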
[PATCH] uapi: fix linux/rds.h userspace compilation errors
Consistently use types from linux/types.h to fix the following linux/rds.h userspace compilation errors:

/usr/include/linux/rds.h:198:2: error: unknown type name 'u8'
  u8 rx_traces;
/usr/include/linux/rds.h:199:2: error: unknown type name 'u8'
  u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
/usr/include/linux/rds.h:203:2: error: unknown type name 'u8'
  u8 rx_traces;
/usr/include/linux/rds.h:204:2: error: unknown type name 'u8'
  u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
/usr/include/linux/rds.h:205:2: error: unknown type name 'u64'
  u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];

Fixes: 3289025a ("RDS: add receive message trace used by application")
Signed-off-by: Dmitry V. Levin
---
 include/uapi/linux/rds.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 47c03ca..198892b 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -195,14 +195,14 @@ enum rds_message_rxpath_latency {
 };
 
 struct rds_rx_trace_so {
-	u8 rx_traces;
-	u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
+	__u8 rx_traces;
+	__u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
 };
 
 struct rds_cmsg_rx_trace {
-	u8 rx_traces;
-	u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
-	u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
+	__u8 rx_traces;
+	__u8 rx_trace_pos[RDS_MSG_RX_DGRAM_TRACE_MAX];
+	__u64 rx_trace[RDS_MSG_RX_DGRAM_TRACE_MAX];
 };
 
 /*
-- 
ldv
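The underlying rule: headers under include/uapi are compiled by ordinary userspace programs, where only the double-underscore fixed-width types exported by `<linux/types.h>` exist; plain `u8`/`u64` are kernel-internal typedefs. A minimal userspace sketch (the struct and `TRACE_MAX` are illustrative stand-ins, not the real rds.h definitions):

```c
#include <assert.h>
#include <linux/types.h>	/* exports __u8/__u64 to userspace */

#define TRACE_MAX 4		/* stands in for RDS_MSG_RX_DGRAM_TRACE_MAX */

/* Same shape as struct rds_cmsg_rx_trace after the fix; replacing
 * __u8/__u64 with u8/u64 here would fail to compile in userspace. */
struct rx_trace_like {
	__u8  rx_traces;
	__u8  rx_trace_pos[TRACE_MAX];
	__u64 rx_trace[TRACE_MAX];
};
```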
Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
On Wed, Feb 22, 2017 at 10:21 AM, Eric Dumazet wrote: > On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote: >> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet wrote: >> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote: >> >> Use of order-3 pages is problematic in some cases. >> >> >> >> This patch might add three kinds of regression : >> >> >> >> 1) a CPU performance regression, but we will add later page >> >> recycling and performance should be back. >> >> >> >> 2) TCP receiver could grow its receive window slightly slower, >> >>because skb->len/skb->truesize ratio will decrease. >> >>This is mostly ok, we prefer being conservative to not risk OOM, >> >>and eventually tune TCP better in the future. >> >>This is consistent with other drivers using 2048 per ethernet frame. >> >> >> >> 3) Because we allocate one page per RX slot, we consume more >> >>memory for the ring buffers. XDP already had this constraint anyway. >> >> >> >> Signed-off-by: Eric Dumazet >> >> --- >> > >> > Note that we also could use a different strategy. >> > >> > Assume RX rings of 4096 entries/slots. >> > >> > With this patch, mlx4 gets the strategy used by Alexander in Intel >> > drivers : >> > >> > Each RX slot has an allocated page, and uses half of it, flipping to the >> > other half every time the slot is used. >> > >> > So a ring buffer of 4096 slots allocates 4096 pages. >> > >> > When we receive a packet train for the same flow, GRO builds an skb with >> > ~45 page frags, all from different pages. >> > >> > The put_page() done from skb_release_data() touches ~45 different struct >> > page cache lines, and show a high cost. 
(compared to the order-3 used >> > today by mlx4, this adds extra cache line misses and stalls for the >> > consumer) >> > >> > If we instead try to use the two halves of one page on consecutive RX >> > slots, we might instead cook skb with the same number of MSS (45), but >> > half the number of cache lines for put_page(), so we should speed up the >> > consumer. >> >> So there is a problem that is being overlooked here. That is the cost >> of the DMA map/unmap calls. The problem is many PowerPC systems have >> an IOMMU that you have to work around, and that IOMMU comes at a heavy >> cost for every map/unmap call. So unless you are saying you want to >> setup a hybrid between the mlx5 and this approach where we have a page >> cache that these all fall back into you will take a heavy cost for >> having to map and unmap pages. >> >> The whole reason why I implemented the Intel page reuse approach the >> way I did is to try and mitigate the IOMMU issue, it wasn't so much to >> resolve allocator/freeing expense. Basically the allocator scales, >> the IOMMU does not. So any solution would require making certain that >> we can leave the pages pinned in the DMA to avoid having to take the >> global locks involved in accessing the IOMMU. > > > I do not see any difference for the fact that we keep pages mapped the > same way. > > mlx4_en_complete_rx_desc() will still use the : > > dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset, > frag_size, priv->dma_dir); > > for every single MSS we receive. > > This wont change. Right but you were talking about using both halves one after the other. If that occurs you have nothing left that you can reuse. That was what I was getting at. If you use up both halves you end up having to unmap the page. The whole idea behind using only half the page per descriptor is to allow us to loop through the ring before we end up reusing it again. 
That buys us enough time that usually the stack has consumed the frame before we need it again. - Alex
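The per-slot half-page flipping Alexander describes can be modeled with a few lines of userspace C. Everything here is an invented stand-in (field names, `PAGE_SZ`), with the `page_refcount` field playing the role of `page_count()`:

```c
#include <assert.h>

#define PAGE_SZ 4096

struct rx_slot {
	int page_refcount;	/* stands in for page_count() */
	unsigned int offset;	/* which half receives the next packet */
};

/* After a packet lands in the current half, flip to the other half.
 * Returns 1 if the page can be reused on the next lap of the ring
 * (the stack has already dropped its reference), 0 if the driver must
 * unmap the page and allocate a fresh one. */
int rx_slot_flip_and_check(struct rx_slot *slot)
{
	slot->offset ^= PAGE_SZ / 2;	/* 0 <-> 2048 */
	return slot->page_refcount == 1;
}
```

The whole ring acts as the grace period: by the time the slot comes around again, the stack has usually consumed the frame in the other half.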
[PATCH] bpf: fix spelling mistake: "proccessed" -> "processed"
From: Colin Ian King

trivial fix to spelling mistake in verbose log message

Signed-off-by: Colin Ian King
---
 kernel/bpf/verifier.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index d2bded2..3fc6e39 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2776,7 +2776,7 @@ static int do_check(struct bpf_verifier_env *env)
 		class = BPF_CLASS(insn->code);
 
 		if (++insn_processed > BPF_COMPLEXITY_LIMIT_INSNS) {
-			verbose("BPF program is too large. Proccessed %d insn\n",
+			verbose("BPF program is too large. Processed %d insn\n",
 				insn_processed);
 			return -E2BIG;
 		}
-- 
2.10.2
Re: [PATCH V5 2/2] qedf: Add QLogic FastLinQ offload FCoE driver framework.
> "Chad" == Dupuis, Chadwrites: Chad> The QLogic FastLinQ Driver for FCoE (qedf) is the FCoE specific Chad> module for 41000 Series Converged Network Adapters by QLogic. This Chad> patch consists of following changes: Now that Linus pulled Dave's tree I have gone ahead and merged this into 4.11/scsi-fixes. -- Martin K. Petersen Oracle Linux Engineering
RE: create drivers/net/mdio and move mdio drivers into it
> -----Original Message-----
> From: Andrew Lunn [mailto:and...@lunn.ch]
> Sent: Wednesday, February 22, 2017 6:21 PM
> To: YUAN Linyu
> Cc: Florian Fainelli; David S . Miller; netdev@vger.kernel.org; cug...@163.com
> Subject: Re: create drivers/net/mdio and move mdio drivers into it
>
> On Wed, Feb 22, 2017 at 05:38:49AM +0000, YUAN Linyu wrote:
> > Hi Florian,
> >
> > 1.
> > Let's go back to original topic,
> > Can we move all mdio drivers into drivers/net/mdio ?
>
> Hi Yuan
>
> Please could you explain what benefit this brings. Please also list
> all the downsides for such a move. As Florian said, we need to ensure
> such a move adds more value than it removes.

At the beginning I thought mdio and phy were two different things, so mdio should have its own home.

> > Per my understanding,
> > I don't know why create a struct mii_bus instance to represent a mdio device
> > in current mdio driver.
> > Why not create a struct mdio_device instance, it's easy to understand.
> > (We can move part of the members of mii_bus to mdio_device).
>
> Please take a step back. What are you trying to achieve. What is the
> big picture. What cannot you do with the current design?

The big picture is we can remove struct mii_bus, and use struct mdio_device/driver for the mdio controller.

> Andrew
[PATCH] rtlwifi: fix spelling mistake: "conuntry" -> "country"
From: Colin Ian King

trivial fix to spelling mistake in RT_TRACE message

Signed-off-by: Colin Ian King
---
 drivers/net/wireless/realtek/rtlwifi/regd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/wireless/realtek/rtlwifi/regd.c b/drivers/net/wireless/realtek/rtlwifi/regd.c
index 558c31b..1bf3eb2 100644
--- a/drivers/net/wireless/realtek/rtlwifi/regd.c
+++ b/drivers/net/wireless/realtek/rtlwifi/regd.c
@@ -435,7 +435,7 @@ int rtl_regd_init(struct ieee80211_hw *hw,
 		channel_plan_to_country_code(rtlpriv->efuse.channel_plan);
 
 	RT_TRACE(rtlpriv, COMP_REGD, DBG_DMESG,
-		 "rtl: EEPROM regdomain: 0x%0x conuntry code: %d\n",
+		 "rtl: EEPROM regdomain: 0x%0x country code: %d\n",
 		 rtlpriv->efuse.channel_plan, rtlpriv->regd.country_code);
 
 	if (rtlpriv->regd.country_code >= COUNTRY_CODE_MAX) {
-- 
2.10.2
Re: linux-next: build failure after merge of the net-next tree
Hi all, On Tue, 10 Jan 2017 10:59:27 +1100 Stephen Rothwell wrote: > > After merging the net-next tree, today's linux-next build (x86_64 > allmodconfig) failed like this: > > net/smc/af_smc.c: In function 'smc_splice_read': > net/smc/af_smc.c:1258:39: error: passing argument 1 of > 'smc->clcsock->ops->splice_read' from incompatible pointer type > [-Werror=incompatible-pointer-types] >rc = smc->clcsock->ops->splice_read(smc->clcsock, ppos, >^ > net/smc/af_smc.c:1258:39: note: expected 'struct file *' but argument is of > type 'struct socket *' > net/smc/af_smc.c: At top level: > net/smc/af_smc.c:1288:17: error: initialization from incompatible pointer > type [-Werror=incompatible-pointer-types] > .splice_read = smc_splice_read, > ^ > net/smc/af_smc.c:1288:17: note: (near initialization for > 'smc_sock_ops.splice_read') > > Caused by commit > > ac7138746e14 ("smc: establish new socket family") > > interacting with commit > > 15a8f657c71d ("switch socket ->splice_read() to struct file *") > > from the vfs tree. > > I applied the following merge fix patch which could well be incorrect ... 
> > From: Stephen Rothwell > Date: Tue, 10 Jan 2017 10:52:38 +1100 > Subject: [PATCH] smc: merge fix for "switch socket ->splice_read() to struct > file *" > > Signed-off-by: Stephen Rothwell > --- > net/smc/af_smc.c | 5 +++-- > 1 file changed, 3 insertions(+), 2 deletions(-) > > diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c > index 5d4208ad029e..4875e65f0c4a 100644 > --- a/net/smc/af_smc.c > +++ b/net/smc/af_smc.c > @@ -1242,10 +1242,11 @@ static ssize_t smc_sendpage(struct socket *sock, > struct page *page, > return rc; > } > > -static ssize_t smc_splice_read(struct socket *sock, loff_t *ppos, > +static ssize_t smc_splice_read(struct file *file, loff_t *ppos, > struct pipe_inode_info *pipe, size_t len, > unsigned int flags) > { > + struct socket *sock = file->private_data; > struct sock *sk = sock->sk; > struct smc_sock *smc; > int rc = -ENOTCONN; > @@ -1255,7 +1256,7 @@ static ssize_t smc_splice_read(struct socket *sock, > loff_t *ppos, > if ((sk->sk_state != SMC_ACTIVE) && (sk->sk_state != SMC_CLOSED)) > goto out; > if (smc->use_fallback) { > - rc = smc->clcsock->ops->splice_read(smc->clcsock, ppos, > + rc = smc->clcsock->ops->splice_read(file, ppos, > pipe, len, flags); > } else { > rc = -EOPNOTSUPP; > -- > 2.10.2 This fix up is now needed when the vfs tree is merged with Linus' tree. -- Cheers, Stephen Rothwell
[PATCH] net: realtek: 8139too: use new api ethtool_{get|set}_link_ksettings
The ethtool api {get|set}_settings is deprecated. We move this driver to new api {get|set}_link_ksettings.

As I don't have the hardware, I'd be very pleased if someone may test this patch.

Signed-off-by: Philippe Reynes
---
 drivers/net/ethernet/realtek/8139too.c | 14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/realtek/8139too.c b/drivers/net/ethernet/realtek/8139too.c
index 8963175..ca22f28 100644
--- a/drivers/net/ethernet/realtek/8139too.c
+++ b/drivers/net/ethernet/realtek/8139too.c
@@ -2384,21 +2384,23 @@ static void rtl8139_get_drvinfo(struct net_device *dev, struct ethtool_drvinfo *
 	strlcpy(info->bus_info, pci_name(tp->pci_dev), sizeof(info->bus_info));
 }
 
-static int rtl8139_get_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int rtl8139_get_link_ksettings(struct net_device *dev,
+				      struct ethtool_link_ksettings *cmd)
 {
 	struct rtl8139_private *tp = netdev_priv(dev);
 
 	spin_lock_irq(&tp->lock);
-	mii_ethtool_gset(&tp->mii, cmd);
+	mii_ethtool_get_link_ksettings(&tp->mii, cmd);
 	spin_unlock_irq(&tp->lock);
 
 	return 0;
 }
 
-static int rtl8139_set_settings(struct net_device *dev, struct ethtool_cmd *cmd)
+static int rtl8139_set_link_ksettings(struct net_device *dev,
+				      const struct ethtool_link_ksettings *cmd)
 {
 	struct rtl8139_private *tp = netdev_priv(dev);
 	int rc;
 
 	spin_lock_irq(&tp->lock);
-	rc = mii_ethtool_sset(&tp->mii, cmd);
+	rc = mii_ethtool_set_link_ksettings(&tp->mii, cmd);
 	spin_unlock_irq(&tp->lock);
 
 	return rc;
 }
@@ -2480,8 +2482,6 @@ static void rtl8139_get_strings(struct net_device *dev, u32 stringset, u8 *data)
 
 static const struct ethtool_ops rtl8139_ethtool_ops = {
 	.get_drvinfo		= rtl8139_get_drvinfo,
-	.get_settings		= rtl8139_get_settings,
-	.set_settings		= rtl8139_set_settings,
 	.get_regs_len		= rtl8139_get_regs_len,
 	.get_regs		= rtl8139_get_regs,
 	.nway_reset		= rtl8139_nway_reset,
@@ -2493,6 +2493,8 @@ static void rtl8139_get_strings(struct net_device *dev, u32 stringset, u8 *data)
 	.get_strings		= rtl8139_get_strings,
 	.get_sset_count		= rtl8139_get_sset_count,
 	.get_ethtool_stats	= rtl8139_get_ethtool_stats,
+	.get_link_ksettings	= rtl8139_get_link_ksettings,
+	.set_link_ksettings	= rtl8139_set_link_ksettings,
 };
 
 static int netdev_ioctl(struct net_device *dev, struct ifreq *rq, int cmd)
-- 
1.7.4.4
[PATCH] uapi: fix linux/llc.h userspace compilation error
Include <linux/if.h> to fix the following linux/llc.h userspace compilation error:

/usr/include/linux/llc.h:26:27: error: 'IFHWADDRLEN' undeclared here (not in a function)
  unsigned char sllc_mac[IFHWADDRLEN];

Signed-off-by: Dmitry V. Levin
---
 include/uapi/linux/llc.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/uapi/linux/llc.h b/include/uapi/linux/llc.h
index 9c987a4..a6c17f6 100644
--- a/include/uapi/linux/llc.h
+++ b/include/uapi/linux/llc.h
@@ -14,6 +14,7 @@
 #define _UAPI__LINUX_LLC_H
 
 #include <linux/socket.h>
+#include <linux/if.h>		/* For IFHWADDRLEN. */
 
 #define __LLC_SOCK_SIZE__ 16	/* sizeof(sockaddr_llc), word align. */
 
 struct sockaddr_llc {
-- 
ldv
[PATCH] uapi: fix linux/ip6_tunnel.h userspace compilation errors
Include <linux/if.h> and <linux/in6.h> to fix the following linux/ip6_tunnel.h userspace compilation errors:

/usr/include/linux/ip6_tunnel.h:23:12: error: 'IFNAMSIZ' undeclared here (not in a function)
  char name[IFNAMSIZ];	/* name of tunnel device */
/usr/include/linux/ip6_tunnel.h:30:18: error: field 'laddr' has incomplete type
  struct in6_addr laddr;	/* local tunnel end-point address */

Signed-off-by: Dmitry V. Levin
---
 include/uapi/linux/ip6_tunnel.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/ip6_tunnel.h b/include/uapi/linux/ip6_tunnel.h
index 48af63c..425926c 100644
--- a/include/uapi/linux/ip6_tunnel.h
+++ b/include/uapi/linux/ip6_tunnel.h
@@ -2,6 +2,8 @@
 #define _IP6_TUNNEL_H
 
 #include <linux/types.h>
+#include <linux/if.h>		/* For IFNAMSIZ. */
+#include <linux/in6.h>		/* For struct in6_addr. */
 
 #define IPV6_TLV_TNL_ENCAP_LIMIT 4
 #define IPV6_DEFAULT_TNL_ENCAP_LIMIT 4
-- 
ldv
VXLAN RCU error
Hi Roopa! I get this RCU error on net 12d656af4e3d2781b9b9f52538593e1717e7c979: [ 1571.067134] === [ 1571.071842] [ ERR: suspicious RCU usage. ] [ 1571.076546] 4.10.0-debug-03232-g12d656af4e3d #1 Tainted: GW O [ 1571.084166] --- [ 1571.088867] ../drivers/net/vxlan.c:2111 suspicious rcu_dereference_check() usage! [ 1571.097286] [ 1571.097286] other info that might help us debug this: [ 1571.097286] [ 1571.106305] [ 1571.106305] rcu_scheduler_active = 2, debug_locks = 1 [ 1571.113654] 3 locks held by ping/13826: [ 1571.117968] #0: (sk_lock-AF_INET){+.+.+.}, at: [] raw_sendmsg+0x14e2/0x2e40 [ 1571.127758] #1: (rcu_read_lock_bh){..}, at: [] ip_finish_output2+0x274/0x1390 [ 1571.138135] #2: (rcu_read_lock_bh){..}, at: [] __dev_queue_xmit+0x1ec/0x2750 [ 1571.148408] [ 1571.148408] stack backtrace: [ 1571.153326] CPU: 10 PID: 13826 Comm: ping Tainted: GW O 4.10.0-debug-03232-g12d656af4e3d #1 [ 1571.163877] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.3.4 11/08/2016 [ 1571.172290] Call Trace: [ 1571.175053] dump_stack+0xcd/0x134 [ 1571.178881] ? _atomic_dec_and_lock+0xcc/0xcc [ 1571.183782] ? print_lock+0xb2/0xb5 [ 1571.187711] lockdep_rcu_suspicious+0x123/0x170 [ 1571.192807] vxlan_xmit_one+0x1931/0x4270 [vxlan] [ 1571.198126] ? encap_bypass_if_local+0x380/0x380 [vxlan] [ 1571.204109] ? sched_clock+0x9/0x10 [ 1571.208034] ? sched_clock_cpu+0x20/0x2c0 [ 1571.212541] ? unwind_get_return_address+0x1b8/0x2b0 [ 1571.218132] ? __lock_acquire+0x6d6/0x3160 [ 1571.222740] vxlan_xmit+0x756/0x4f90 [vxlan] [ 1571.227541] ? vxlan_xmit_one+0x4270/0x4270 [vxlan] [ 1571.233014] ? netif_skb_features+0x2be/0xba0 [ 1571.237919] dev_hard_start_xmit+0x1ab/0xa70 [ 1571.242724] __dev_queue_xmit+0x137b/0x2750 [ 1571.247425] ? __dev_queue_xmit+0x1ec/0x2750 [ 1571.252228] ? netdev_pick_tx+0x330/0x330 [ 1571.256735] ? debug_smp_processor_id+0x17/0x20 [ 1571.261826] ? get_lock_stats+0x1d/0x160 [ 1571.266241] ? mark_held_locks+0x105/0x280 [ 1571.270850] ? 
memcpy+0x45/0x50 [ 1571.274391] dev_queue_xmit+0x10/0x20 [ 1571.278511] neigh_resolve_output+0x43e/0x7f0 [ 1571.283405] ? ip_finish_output2+0x69d/0x1390 [ 1571.288308] ip_finish_output2+0x69d/0x1390 [ 1571.293008] ? ip_finish_output2+0x274/0x1390 [ 1571.297909] ? ip_copy_metadata+0x7e0/0x7e0 [ 1571.302610] ? get_lock_stats+0x1d/0x160 [ 1571.307027] ip_finish_output+0x598/0xc50 [ 1571.311537] ip_output+0x371/0x630 [ 1571.315362] ? ip_output+0x1dc/0x630 [ 1571.319383] ? ip_mc_output+0xe70/0xe70 [ 1571.323694] ? kfree+0x372/0x5a0 [ 1571.327325] ? mark_held_locks+0x105/0x280 [ 1571.331933] ? __ip_make_skb+0xdd1/0x2200 [ 1571.336457] ip_local_out+0x8f/0x180 [ 1571.340480] ip_send_skb+0x44/0xf0 [ 1571.344306] ip_push_pending_frames+0x5a/0x80 [ 1571.349203] raw_sendmsg+0x164d/0x2e40 [ 1571.353422] ? debug_check_no_locks_freed+0x350/0x350 [ 1571.359099] ? dst_output+0x1b0/0x1b0 [ 1571.363217] ? get_lock_stats+0x1d/0x160 [ 1571.367640] ? __might_fault+0x199/0x230 [ 1571.372052] ? kasan_check_write+0x14/0x20 [ 1571.382002] ? _copy_from_user+0xb9/0x130 [ 1571.386513] ? rw_copy_check_uvector+0x8d/0x490 [ 1571.391609] ? import_iovec+0xae/0x5d0 [ 1571.395826] ? push_pipe+0xd00/0xd00 [ 1571.399847] ? kasan_check_write+0x14/0x20 [ 1571.404450] ? _copy_from_user+0xb9/0x130 [ 1571.408960] inet_sendmsg+0x19f/0x5f0 [ 1571.413071] ? inet_recvmsg+0x980/0x980 [ 1571.417386] sock_sendmsg+0xe2/0x170 [ 1571.421408] ___sys_sendmsg+0x66e/0x960 [ 1571.425726] ? mem_cgroup_commit_charge+0x144/0x2720 [ 1571.431303] ? copy_msghdr_from_user+0x610/0x610 [ 1571.436495] ? debug_smp_processor_id+0x17/0x20 [ 1571.441584] ? get_lock_stats+0x1d/0x160 [ 1571.445995] ? mem_cgroup_uncharge_swap+0x250/0x250 [ 1571.451474] ? page_add_new_anon_rmap+0x173/0x3a0 [ 1571.456762] ? handle_mm_fault+0x1589/0x3820 [ 1571.461566] ? handle_mm_fault+0x1589/0x3820 [ 1571.466362] ? handle_mm_fault+0x191/0x3820 [ 1571.471070] ? __fdget+0x13/0x20 [ 1571.474702] ? 
get_lock_stats+0x1d/0x160 [ 1571.479116] __sys_sendmsg+0xc6/0x150 [ 1571.483234] ? SyS_shutdown+0x1b0/0x1b0 [ 1571.487551] ? __do_page_fault+0x556/0xe50 [ 1571.492158] ? trace_hardirqs_on_thunk+0x1a/0x1c [ 1571.497340] SyS_sendmsg+0x12/0x20 [ 1571.501166] entry_SYSCALL_64_fastpath+0x23/0xc6 [ 1571.506354] RIP: 0033:0x7fca2d0384a0 [ 1571.510374] RSP: 002b:7ffd18d7fe88 EFLAGS: 0246 ORIG_RAX: 002e [ 1571.518886] RAX: ffda RBX: 0040 RCX: 7fca2d0384a0 [ 1571.526889] RDX: RSI: 0060a300 RDI: 0003 [ 1571.534892] RBP: 0046 R08: 0020 R09: 003e [ 1571.542897] R10: 7ffd18d7fc50 R11: 0246 R12: 00c0 [ 1571.550900] R13: 0004 R14: 7ffd18d81608 R15: 7ffd18d810b0 Some of Netronome's VXLAN tests are also failing but I need to dig a bit to see what's wrong
Re: Focusing the XDP project
On Wed, Feb 22, 2017 at 1:43 PM, Jesper Dangaard Brouer wrote: > On Wed, 22 Feb 2017 09:22:53 -0800 > Tom Herbert wrote: > >> On Wed, Feb 22, 2017 at 1:43 AM, Jesper Dangaard Brouer >> wrote: >> > >> > On Tue, 21 Feb 2017 14:54:35 -0800 Tom Herbert >> > wrote: >> >> On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed >> >> wrote: >> > [...] >> >> > The only complexity XDP is adding to the drivers is the constrains on >> >> > RX memory management and memory model, calling the XDP program itself >> >> > and handling the action is really a simple thing once you have the >> >> > correct memory model. >> > >> > Exactly, that is why I've been looking at introducing a generic >> > facility for a memory model for drivers. This should help simply >> > drivers. Due to performance needs this need to be a very thin API layer >> > on top of the page allocator. (That's why I'm working with Mel Gorman >> > to get more close integration with the page allocator e.g. a bulking >> > facility). >> > >> >> > Who knows! maybe someday XDP will define one unified RX API for all >> >> > drivers and it even will handle normal stack delivery it self :). >> >> > >> >> That's exactly the point and what we need for TXDP. I'm missing why >> >> doing this is such rocket science other than the fact that all these >> >> drivers are vastly different and changing the existing API is >> >> unpleasant. The only functional complexity I see in creating a generic >> >> batching interface is handling return codes asynchronously. This is >> >> entirely feasible though... >> > >> > I'll be happy as long as we get a batching interface, then we can >> > incrementally do the optimizations later. >> > >> > In the future, I do hope (like Saeed) this RX API will evolve into >> > delivering (a bulk of) raw-packet-pages into the netstack, this should >> > simplify drivers, and we can keep the complexity and SKB allocations >> > out of the drivers. 
>> > To start with, we can play with doing this delivering (a bulk of) >> > raw-packet-pages into Tom's TXDP engine/system? >> > >> Hi Jesper, >> >> Maybe we can start to narrow in on what a batching API might look like. >> >> Looking at mlx5 (as a model of how XDP is implemented) the main RX >> loop in ml5e_poll_rx_cq calls the backend handler in one indirect >> function call. The XDP path goes through mlx5e_handle_rx_cqe, >> skb_from_cqe, and mlx5e_xdp_handle. The first two deal a lot with >> building the skbuf. As a prerequisite to RX batching it would be >> helpful if this could be flatten so that most of the logic is obvious >> in the main RX loop. > > I fully agree here, it would be helpful to flatten out. The mlx5 > driver is a bit hard to follow in that respect. Saeed have already > send me some offlist patches, where some of this code gets > restructured. In one of the patches the RX-stages does get flatten out > some more. We are currently benchmarking this patchset, and depending > on CPU it is either a small win or a small (7ns) regressing (on the newest > CPUs). > Cool! > >> The model of RX batching seems straightforward enough-- pull packets >> from the ring, save xdp_data information in a vector, periodically >> call into the stack to handle a batch where argument is the vector of >> packets and another argument is an output vector that gives return >> codes (XDP actions), process the each return code for each packet in >> the driver accordingly. > Yes, exactly. I did imagine that (maybe), the input vector of packets > could have a room for the return codes (XDP actions) next to the packet > pointer? > Which ever way is more efficient I suppose. The important point is that the return code should be the only thing returned to the driver. 
> >> Presumably, there is a maximum allowed batch >> that may or may not be the same as the NAPI budget so the so the >> batching call needs to be done when the limit is reach and also before >> exiting NAPI. > > In my PoC code that Saeed is working on, we have a smaller batch > size(10), and prefetch to L2 cache (like DPDK does), based on the > theory that we don't want to stress the L2 cache usage, and that these > CPUs usually have a Line Feed Buffer (LFB) that is limited to 10 > outstanding cache-lines. > > I don't know if this artifically smaller batch size is the right thing, > as DPDK always prefetch to L2 cache all 32 packets on RX. And snabb > uses batches of 100 packets per "breath". > Maybe make it configurable :-) > >> For each packet the stack can return an XDP code, >> XDP_PASS in this case could be interpreted as being consumed by the >> stack; this would be used in the case the stack creates an skbuff for >> the packet. The stack on it's part can process the batch how it sees >> fit, it can process each packet individual in the canonical model, or >> we can continue processing a batch in a VPP-like fashion. > > Agree. >
Re: Questions on XDP
On Wed, 22 Feb 2017 09:08:53 -0800 John Fastabend wrote:
> > GSO/TSO is getting into advanced stuff I would rather not have to get
> > into right now. I figure we need to take this portion one step at a
> > time. To support GSO we need more information like the mss.
>
> Agreed lets get the driver support for basic things first. But this
> is on my list. I'm just repeating myself but VM to VM performance uses
> TSO/LRO heavily.

Sorry, but I get annoyed every time I hear we need to support TSO/LRO/GRO for performance reasons. If you take one step back, you are actually saying we need bulking for better performance. And the bulking you are proposing is a TCP protocol specific bulking mechanism.

I'm saying let's make bulking protocol agnostic, by doing it at the packet level. And once the bulk enters the VM, by all means it should construct a GRO packet it can send into its own network stack.

-- 
Best regards, Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net V2 0/5] mlx4 misc fixes
From: Tariq Toukan
Date: Wed, 22 Feb 2017 18:25:24 +0200

> This patchset contains misc bug fixes from Eric Dumazet and our team
> to the mlx4 Core and Eth drivers.
>
> Series generated against net commit:
> 00ea1ceebe0d ipv6: release dst on error in ip6_dst_lookup_tail
>
> Thanks,
> Tariq.
>
> v2:
> * Added Eric's fix (patch 5/5).

This doesn't apply cleanly to the net tree, please respin.

Thanks.
Re: [PATCH net-next] net/gtp: Add udp source port generation according to flow hash
On Wed, Feb 22, 2017 at 1:29 PM, Or Gerlitz wrote: > On Thu, Feb 16, 2017 at 11:58 PM, Andreas Schultz wrote: >> Hi Or, >> - On Feb 16, 2017, at 3:59 PM, Or Gerlitz ogerl...@mellanox.com wrote: >> >>> Generate the source udp header according to the flow represented by >>> the packet we are encapsulating, as done for other udp tunnels. This >>> helps on the receiver side to apply RSS spreading. >> >> This might work for GTPv0-U, However, for GTPv1-U this could interfere >> with error handling in the user space control process when the UDP port >> extension header is used in error indications. > > > in the document you posted there's this quote "The source IP and port > have no meaning and can change at any time" -- I assume it refers to > v0? can we identify in the kernel code that we're on v0 and have the > patch come into play? > >> 3GPP TS 29.281 Rel 13, section 5.2.2.1 defines the UDP port extension and >> section 7.3.1 says that the UDP source port extension can be used to >> mitigate DOS attacks. This would IMHO imply that the user space control >> process needs to know the TEID to UDP source port mapping. > >> The other question is, on what is this actually hashing. When I understand >> the code correctly, this will hash on the source/destination of the orignal >> flow. I would expect that a SGSN/SGW/eNodeB would like the keep flow >> processing on a per TEID base, so the port hashing should be base on the >> TEID. > > is it possible for packets belonging to the same TCP session or UDP > "pseudo session" (given pair of src/dst ip/port) to be encapsulated > using different TEID? > > hashing on the TEID imposes a harder requirement on the NIC HW vs. > just UDP based RSS. This shouldn't be taken as a HW requirement and it's unlikely we'd add explicit GTP support in flow_dissector. If we can't get entropy in the UDP source port then IPv6 flow label is a potential alternative (so that should be supported in NICs for RSS). 
I'll also reiterate my previous point about the need for GTP testing-- in order for us to be able to evaluate the GTP datapath for things like performance or how they withstand against DDOS we really need an easy way to isolate the datapath. Tom
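The "entropy in the UDP source port" idea can be sketched concretely. The userspace C below is a toy in the spirit of the kernel's `udp_flow_src_port()` helper, not its actual implementation: fold the 32-bit flow hash to 16 bits, then scale it into an ephemeral range [min, max) so that receivers can spread the tunneled flows with plain outer-UDP RSS:

```c
#include <assert.h>
#include <stdint.h>

/* Derive a deterministic source port for an encapsulated flow from its
 * flow hash; same flow hash -> same port, so a flow never reorders. */
uint16_t flow_src_port(uint32_t hash, uint16_t min, uint16_t max)
{
	/* Fold the upper half into the lower half to keep all the
	 * entropy, then scale into [min, max) without a modulo bias. */
	uint16_t folded = (uint16_t)(hash ^ (hash >> 16));

	return (uint16_t)(min + (((uint32_t)folded *
				  (uint32_t)(max - min)) >> 16));
}
```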
Re: [PATCH] fjes: Move fjes driver info message into fjes_acpi_add()
Thank you for the quick response. I'll think of another solution.

Thanks,
Yasuaki Ishimatsu

On 02/22/2017 03:45 PM, David Miller wrote:
> From: Yasuaki Ishimatsu
> Date: Wed, 22 Feb 2017 15:40:49 -0500
>
>> To avoid the confusion, the patch moves the message into fjes_acpi_add()
>> so that it is shown only when fjes_acpi_add() succeeded.
>
> This change means it'll never be printed for platform driver matches,
> which is even worse than what we have now.
Re: Focusing the XDP project
On Wed, 22 Feb 2017 09:22:53 -0800 Tom Herbert wrote: > On Wed, Feb 22, 2017 at 1:43 AM, Jesper Dangaard Brouer > wrote: > > > > On Tue, 21 Feb 2017 14:54:35 -0800 Tom Herbert > > wrote: > >> On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed > >> wrote: > > [...] > >> > The only complexity XDP is adding to the drivers is the constrains on > >> > RX memory management and memory model, calling the XDP program itself > >> > and handling the action is really a simple thing once you have the > >> > correct memory model. > > > > Exactly, that is why I've been looking at introducing a generic > > facility for a memory model for drivers. This should help simply > > drivers. Due to performance needs this need to be a very thin API layer > > on top of the page allocator. (That's why I'm working with Mel Gorman > > to get more close integration with the page allocator e.g. a bulking > > facility). > > > >> > Who knows! maybe someday XDP will define one unified RX API for all > >> > drivers and it even will handle normal stack delivery it self :). > >> > > >> That's exactly the point and what we need for TXDP. I'm missing why > >> doing this is such rocket science other than the fact that all these > >> drivers are vastly different and changing the existing API is > >> unpleasant. The only functional complexity I see in creating a generic > >> batching interface is handling return codes asynchronously. This is > >> entirely feasible though... > > > > I'll be happy as long as we get a batching interface, then we can > > incrementally do the optimizations later. > > > > In the future, I do hope (like Saeed) this RX API will evolve into > > delivering (a bulk of) raw-packet-pages into the netstack, this should > > simplify drivers, and we can keep the complexity and SKB allocations > > out of the drivers. 
> Hi Jesper,
>
> Maybe we can start to narrow in on what a batching API might look like.
>
> Looking at mlx5 (as a model of how XDP is implemented) the main RX
> loop in mlx5e_poll_rx_cq calls the backend handler in one indirect
> function call. The XDP path goes through mlx5e_handle_rx_cqe,
> skb_from_cqe, and mlx5e_xdp_handle. The first two deal a lot with
> building the skbuf. As a prerequisite to RX batching it would be
> helpful if this could be flattened so that most of the logic is obvious
> in the main RX loop.

I fully agree here, it would be helpful to flatten this out. The mlx5
driver is a bit hard to follow in that respect. Saeed has already sent
me some offlist patches, where some of this code gets restructured. In
one of the patches the RX stages do get flattened out some more. We are
currently benchmarking this patchset, and depending on CPU it is either
a small win or a small (7ns) regression (on the newest CPUs).

> The model of RX batching seems straightforward enough -- pull packets
> from the ring, save xdp_data information in a vector, periodically
> call into the stack to handle a batch where one argument is the vector
> of packets and another argument is an output vector that gives return
> codes (XDP actions), then process the return code for each packet in
> the driver accordingly.

Yes, exactly. I did imagine that (maybe) the input vector of packets
could have room for the return codes (XDP actions) next to the packet
pointer?

> Presumably, there is a maximum allowed batch
> that may or may not be the same as the NAPI budget, so the
> batching call needs to be done when the limit is reached and also
> before exiting NAPI.

In my PoC code that Saeed is working on, we have a smaller batch size
(10), and prefetch to L2 cache (like DPDK does), based on the theory
that we don't want to stress the L2 cache usage, and that these CPUs
usually have a Line Fill Buffer (LFB) that is limited to 10 outstanding
cache-lines.

I don't know if this artificially smaller batch size is the right
thing, as DPDK always prefetches all 32 RX packets to L2 cache. And
snabb uses batches of 100 packets per "breath".

> For each packet the stack can return an XDP code;
> XDP_PASS in this case could be interpreted as being consumed by the
> stack; this would be used in the case the stack creates an skbuff for
> the packet. The stack on its part can process the batch how it sees
> fit, it can process each packet individually in the canonical model,
> or we can continue processing a batch in a VPP-like fashion.

Agree.

> The batching API could be transparent to the stack or not. In the
> transparent case, the driver calls what looks like a receive function
> but the stack may defer processing for batching. A callback function
> (that can be inlined) is used to process return codes as I mentioned
> previously. In the non-transparent model, the driver knowingly creates
> the packet
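The batched RX interface discussed in this thread can be sketched in a few lines of C. This is purely illustrative — `xdp_batch`, `xdp_batch_add`, `xdp_batch_flush`, and the action names are assumptions of mine, not an existing kernel API. Each slot pairs the packet pointer with room for the XDP action, as Jesper suggests, the handler is called once per full batch, and a flush is required before exiting NAPI so no packets are stranded:

```c
#include <stddef.h>

#define XDP_BATCH_MAX 10   /* small, prefetch-friendly batch as in the PoC */

enum xdp_action_sketch { XDP_SKETCH_DROP = 0, XDP_SKETCH_PASS = 1 };

struct xdp_pkt_slot {
	void *data;     /* start of packet data */
	size_t len;
	int action;     /* filled in by the stack-side handler */
};

struct xdp_batch {
	struct xdp_pkt_slot slot[XDP_BATCH_MAX];
	int count;
};

/* Stack-side handler: fills in one action per packet in the batch. */
typedef void (*xdp_batch_handler_t)(struct xdp_batch *b);

/* Example handler for testing: pass every packet up the stack. */
static void pass_all_handler(struct xdp_batch *b)
{
	for (int i = 0; i < b->count; i++)
		b->slot[i].action = XDP_SKETCH_PASS;
}

/* Queue one packet; returns number of packets handed to the stack
 * (0 while still batching, batch size when the batch was flushed). */
static int xdp_batch_add(struct xdp_batch *b, void *data, size_t len,
			 xdp_batch_handler_t handler)
{
	b->slot[b->count].data = data;
	b->slot[b->count].len = len;
	b->slot[b->count].action = XDP_SKETCH_DROP;
	b->count++;
	if (b->count < XDP_BATCH_MAX)
		return 0;
	handler(b);             /* batch full: one call covers all slots */
	int n = b->count;
	b->count = 0;
	return n;
}

/* Must also be called before exiting NAPI, even on a partial batch. */
static int xdp_batch_flush(struct xdp_batch *b, xdp_batch_handler_t handler)
{
	if (!b->count)
		return 0;
	handler(b);
	int n = b->count;
	b->count = 0;
	return n;
}
```

The driver would then inspect `slot[i].action` after each flush and recycle, forward, or drop the corresponding RX buffer — the "process the return code for each packet in the driver accordingly" step from Tom's description.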
Re: [PATCH v2 2/2] tcp: account for ts offset only if tsecr not zero
From: Alexey Kodanev
Date: Wed, 22 Feb 2017 13:23:56 +0300

> We can get SYN with zero tsecr, don't apply offset in this case.
>
> Fixes: ee684b6f2830 ("tcp: send packets with a socket timestamp")
> Signed-off-by: Alexey Kodanev

Applied.
Re: [PATCH v2 1/2] tcp: setup timestamp offset when write_seq already set
From: Alexey Kodanev
Date: Wed, 22 Feb 2017 13:23:55 +0300

> Found that when randomized tcp offsets are enabled (by default)
> TCP client can still start new connections without them. Later,
> if server does active close and re-uses sockets in TIME-WAIT
> state, new SYN from client can be rejected on PAWS check inside
> tcp_timewait_state_process(), because either tw_ts_recent or
> rcv_tsval doesn't really have an offset set.
>
> Here is how to reproduce it with LTP netstress tool:
>
>   netstress -R 1 &
>   netstress -H 127.0.0.1 -lr 100 -a1
>
> [...]
> < S seq 1956977072 win 43690 TS val 295618 ecr 459956970
> > . ack 1956911535 win 342 TS val 459967184 ecr 1547117608
> < R seq 1956911535 win 0 length 0
> +1. < S seq 1956977072 win 43690 TS val 296640 ecr 459956970
> > S. seq 657450664 ack 1956977073 win 43690 TS val 459968205 ecr 296640
>
> Fixes: 95a22caee396 ("tcp: randomize tcp timestamp offsets for each connection")
> Signed-off-by: Alexey Kodanev

Applied.
Re: [PATCH net-next] net/gtp: Add udp source port generation according to flow hash
On Thu, Feb 16, 2017 at 11:58 PM, Andreas Schultz wrote:
> Hi Or,
>
> ----- On Feb 16, 2017, at 3:59 PM, Or Gerlitz ogerl...@mellanox.com wrote:
>
>> Generate the source udp header according to the flow represented by
>> the packet we are encapsulating, as done for other udp tunnels. This
>> helps on the receiver side to apply RSS spreading.
>
> This might work for GTPv0-U. However, for GTPv1-U this could interfere
> with error handling in the user space control process when the UDP port
> extension header is used in error indications.

In the document you posted there's this quote: "The source IP and port
have no meaning and can change at any time" -- I assume it refers to
v0? Can we identify in the kernel code that we're on v0 and have the
patch come into play only then?

> 3GPP TS 29.281 Rel 13, section 5.2.2.1 defines the UDP port extension and
> section 7.3.1 says that the UDP source port extension can be used to
> mitigate DOS attacks. This would IMHO imply that the user space control
> process needs to know the TEID to UDP source port mapping.
>
> The other question is, what is this actually hashing on? If I understand
> the code correctly, this will hash on the source/destination of the
> original flow. I would expect that a SGSN/SGW/eNodeB would like to keep
> flow processing on a per-TEID base, so the port hashing should be based
> on the TEID.

Is it possible for packets belonging to the same TCP session or UDP
"pseudo session" (given pair of src/dst ip/port) to be encapsulated
using different TEIDs? Hashing on the TEID imposes a harder requirement
on the NIC HW vs. just UDP based RSS.
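The mechanism under discussion — deriving the encapsulation UDP source port from a hash of the inner flow — can be sketched as follows. This is an illustrative stand-in, not the patch itself: the kernel computes the flow hash with `skb_get_hash()` (exposed to tunnels via the `udp_flow_src_port()` helper), whereas here a simple FNV-1a over the inner 4-tuple is assumed, and `encap_src_port` / `flow_hash` are made-up names:

```c
#include <stdint.h>

/* Stand-in for skb_get_hash(): FNV-1a over the inner 4-tuple, so the
 * same flow always yields the same hash (and thus the same port). */
static uint32_t flow_hash(uint32_t saddr, uint32_t daddr,
			  uint16_t sport, uint16_t dport)
{
	uint32_t h = 2166136261u;
	uint32_t words[3] = { saddr, daddr,
			      ((uint32_t)sport << 16) | dport };
	for (int i = 0; i < 3; i++) {
		const unsigned char *p = (const unsigned char *)&words[i];
		for (int j = 0; j < 4; j++) {
			h ^= p[j];
			h *= 16777619u;
		}
	}
	return h;
}

/* Fold the 32-bit flow hash into [min_port, max_port], the usual
 * ephemeral-range trick used when picking a tunnel source port. */
static uint16_t encap_src_port(uint32_t saddr, uint32_t daddr,
			       uint16_t sport, uint16_t dport,
			       uint16_t min_port, uint16_t max_port)
{
	uint32_t h = flow_hash(saddr, daddr, sport, dport);
	uint32_t range = (uint32_t)(max_port - min_port) + 1;

	return min_port + (uint16_t)(((uint64_t)h * range) >> 32);
}
```

The receiver's RSS then spreads tunneled flows across queues using only the outer UDP header. Andreas's objection translates directly: hashing on the TEID instead would mean replacing the 4-tuple input above with the TEID, which plain UDP-based RSS hardware cannot see.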
Re: [PATCH v2] net/dccp: fix use after free in tw_timer_handler()
From: Andrey Ryabinin
Date: Wed, 22 Feb 2017 12:35:27 +0300

> DCCP doesn't purge timewait sockets on network namespace shutdown.
> So, after the net namespace is destroyed we could still have an active
> timer which will trigger use after free in tw_timer_handler():
...
> Add .exit_batch hook to dccp_v4_ops()/dccp_v6_ops() which will purge
> timewait sockets on net namespace destruction and prevent the above
> issue.
>
> Fixes: f2bf415cfed7 ("mib: add net to NET_ADD_STATS_BH")
> Reported-by: Dmitry Vyukov
> Signed-off-by: Andrey Ryabinin
> Acked-by: Arnaldo Carvalho de Melo

Applied and queued up for -stable, thanks.
Re: [PATCH] uapi: fix linux/if.h userspace compilation errors
From: "Dmitry V. Levin"
Date: Tue, 21 Feb 2017 23:19:14 +0300

> On Tue, Feb 21, 2017 at 12:10:22PM -0500, David Miller wrote:
>> From: "Dmitry V. Levin"
>> Date: Mon, 20 Feb 2017 14:58:41 +0300
>>
>> > Include <sys/socket.h> (guarded by ifndef __KERNEL__) to fix
>> > the following linux/if.h userspace compilation errors:
>>
>> Wouldn't it be so much better to do this in include/uapi/linux/socket.h?
>
> Yes, it would be nicer if we could afford it. However, changing
> uapi/linux/socket.h to include <sys/socket.h> is less conservative than
> changing every uapi header that fails to compile because of its use
> of struct sockaddr. It's risky because <sys/socket.h> pulls in other
> types that might conflict with definitions provided by uapi headers.

Ok, I'll apply this for now.
Re: [PATCH net-next] l2tp: Avoid schedule while atomic in exit_net
From: Ridge Kennedy
Date: Wed, 22 Feb 2017 14:59:49 +1300

> While destroying a network namespace that contains a L2TP tunnel a
> "BUG: scheduling while atomic" can be observed.
>
> Enabling lockdep shows that this is happening because l2tp_exit_net()
> is calling l2tp_tunnel_closeall() (via l2tp_tunnel_delete()) from
> within an RCU critical section.
...
> This bug can easily be reproduced with a few steps:
>
>  $ sudo unshare -n bash   # Create a shell in a new namespace
>  # ip link set lo up
>  # ip addr add 127.0.0.1 dev lo
>  # ip l2tp add tunnel remote 127.0.0.1 local 127.0.0.1 tunnel_id 1 \
>      peer_tunnel_id 1 udp_sport 5 udp_dport 5
>  # ip l2tp add session name foo tunnel_id 1 session_id 1 \
>      peer_session_id 1
>  # ip link set foo up
>  # exit                   # Exit the shell, in turn exiting the namespace
>  $ dmesg
>  ...
>  [942121.089216] BUG: scheduling while atomic: kworker/u16:3/13872/0x0200
>  ...
>
> To fix this, move the call to l2tp_tunnel_closeall() out of the RCU
> critical section, and instead call it from l2tp_tunnel_del_work(), which
> is running from the l2tp_wq workqueue.
>
> Fixes: 2b551c6e7d5b ("l2tp: close sessions before initiating tunnel delete")
> Signed-off-by: Ridge Kennedy

Applied and queued up for -stable, thanks.
Re: [PATCH] fjes: Move fjes driver info message into fjes_acpi_add()
From: Yasuaki Ishimatsu
Date: Wed, 22 Feb 2017 15:40:49 -0500

> To avoid the confusion, the patch moves the message into
> fjes_acpi_add() so that it is shown only when fjes_acpi_add()
> succeeded.

This change means it'll never be printed for platform driver matches,
which is even worse than what we have now.
Re: [PATCH next 0/4] bonding: winter cleanup
Wed, Feb 22, 2017 at 08:23:13PM CET, mahe...@google.com wrote:
> On Tue, Feb 21, 2017 at 11:58 PM, Jiri Pirko wrote:
>> Wed, Feb 22, 2017 at 02:08:16AM CET, mah...@bandewar.net wrote:
>>> From: Mahesh Bandewar
>>>
>>> Few cleanup patches that I have accumulated over some time now.
>>>
>>> (a) First two patches are basically to move the work-queue initialization
>>> from every ndo_open / bond_open operation to once at the beginning during
>>> port creation. Work-queue initialization is an unnecessary operation
>>> for every 'ifup' operation. However we have some mode-specific work-queues
>>> and mode can change anytime after port creation. So the second patch is
>>> to ensure the correct work-handler is called based on the mode.
>>>
>>> (b) Third patch is simple and straightforward; it removes a hard-coded value
>>> that was added in the initial commit and replaces it with the default
>>> value configured.
>>>
>>> (c) The final patch in the series removes the unimplemented "port-moved"
>>> state from the LACP state machine. This state is defined but never set, so
>>> removing it from the state machine logic makes the code a little cleaner.
>>>
>>> Note: None of these patches are making any functional changes.
>>>
>>> Mahesh Bandewar (4):
>>
>> Mahesh. I understand that you are still using bonding. What's stopping
>> you from using team instead?
>>
> Let me just say this, if it was trivial enough, we'd have done with it
> by now. :)

What exactly is the blocker? Can I help?
Re: [PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages
On Wed, 2017-02-22 at 11:38 -0500, Willem de Bruijn wrote:
> From: Willem de Bruijn
>
> Refine skb_copy_ubufs to support compound pages. With upcoming TCP
> and UDP zerocopy sendmsg, such fragments may appear.
>
> These skbuffs can have both kernel and zerocopy fragments, e.g., when
> corking. Avoid unnecessary copying of fragments that have no userspace
> reference.
>
> It is not safe to modify skb frags when the skbuff is shared. This
> should not happen. Fail loudly if we find an unexpected edge case.
>
> Signed-off-by: Willem de Bruijn
> ---
>  net/core/skbuff.c | 24 +++-
>  1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f3557958e9bf..67e4216fca01 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -944,6 +944,9 @@ EXPORT_SYMBOL_GPL(skb_morph);
>   *	If this function is called from an interrupt gfp_mask() must be
>   *	%GFP_ATOMIC.
>   *
> + *	skb_shinfo(skb) can only be safely modified when not accessed
> + *	concurrently. Fail if the skb is shared or cloned.
> + *
>   *	Returns 0 on success or a negative error code on failure
>   *	to allocate kernel memory to copy to.
>   */
> @@ -954,11 +957,29 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
>  	struct page *page, *head = NULL;
>  	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
>
> +	if (skb_shared(skb) || skb_cloned(skb)) {
> +		WARN_ON_ONCE(1);
> +		return -EINVAL;
> +	}
> +
>  	for (i = 0; i < num_frags; i++) {
>  		u8 *vaddr;
> +		unsigned int order = 0;
> +		gfp_t mask = gfp_mask;
>  		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
>
> -		page = alloc_page(gfp_mask);
> +		page = skb_frag_page(f);
> +		if (page_count(page) == 1) {
> +			skb_frag_ref(skb, i);

This could be:

			get_page(page);

> +			goto copy_done;
> +		}
> +
> +		if (f->size > PAGE_SIZE) {
> +			order = get_order(f->size);
> +			mask |= __GFP_COMP;

Note that this would probably fail under memory pressure.

We could instead try to explode the few segments into order-0 only
pages. Hopefully this case should not be frequent.
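Eric's fallback suggestion — copying an oversized fragment into several order-0 pages instead of one order-N compound page — can be sketched in userspace terms. This is a hedged illustration, not the kernel patch: `malloc` stands in for `alloc_page()`, `SKETCH_PAGE_SIZE` for `PAGE_SIZE`, and `copy_frag_to_order0`, `out`, and `npages` are invented names:

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096

/* Copy a frag of 'size' bytes into ceil(size / PAGE_SIZE) order-0
 * pages. Order-0 allocations are far more likely to succeed under
 * memory pressure than a single __GFP_COMP high-order allocation. */
static int copy_frag_to_order0(const unsigned char *frag, size_t size,
			       unsigned char **out, size_t *npages)
{
	size_t n = (size + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;

	for (size_t i = 0; i < n; i++) {
		size_t off = i * SKETCH_PAGE_SIZE;
		size_t len = size - off < SKETCH_PAGE_SIZE ?
			     size - off : SKETCH_PAGE_SIZE;

		out[i] = malloc(SKETCH_PAGE_SIZE);  /* stands in for alloc_page() */
		if (!out[i]) {
			while (i--)             /* unwind on failure */
				free(out[i]);
			return -1;
		}
		memcpy(out[i], frag + off, len);
	}
	*npages = n;
	return 0;
}
```

In the real skb path each output page would become its own `skb_frag_t`, so one oversized zerocopy fragment turns into several order-0 fragments rather than failing the copy outright.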
[PATCH] fjes: Move fjes driver info message into fjes_acpi_add()
The fjes driver is used only by FUJITSU servers, and almost all other
servers in the world never use it. But currently, if ACPI PNP0C02 is
defined in the ACPI table, the following message is always shown:

 "FUJITSU Extended Socket Network Device Driver - version 1.2
  - Copyright (c) 2015 FUJITSU LIMITED"

The message confuses users because there is no reason for it to be
shown on other vendors' servers. To avoid the confusion, the patch
moves the message into fjes_acpi_add() so that it is shown only when
fjes_acpi_add() succeeded.

Signed-off-by: Yasuaki Ishimatsu
CC: Taku Izumi
---
 drivers/net/fjes/fjes_main.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/fjes/fjes_main.c b/drivers/net/fjes/fjes_main.c
index b77e4ecf..8e1329c 100644
--- a/drivers/net/fjes/fjes_main.c
+++ b/drivers/net/fjes/fjes_main.c
@@ -151,6 +151,9 @@ static int fjes_acpi_add(struct acpi_device *device)
 					      ARRAY_SIZE(fjes_resource));
 	device->driver_data = plat_dev;
 
+	pr_info("%s - version %s - %s\n",
+		fjes_driver_string, fjes_driver_version, fjes_copyright);
+
 	return 0;
 }
 
@@ -1481,9 +1484,6 @@ static int __init fjes_init_module(void)
 {
 	int result;
 
-	pr_info("%s - version %s - %s\n",
-		fjes_driver_string, fjes_driver_version, fjes_copyright);
-
 	fjes_dbg_init();
 
 	result = platform_driver_register(&fjes_driver);
--
1.8.3.1
Re: [PATCH] qlogic: netxen: constify bin_attribute structures
From: Bhumika Goyal
Date: Wed, 22 Feb 2017 00:17:48 +0530

> Declare bin_attribute structures as const as they are only passed as
> arguments to the functions device_remove_bin_file and
> device_create_bin_file. These function arguments are of type const, so
> bin_attribute structures having this property can be made const too.
> Done using Coccinelle:
...
> Signed-off-by: Bhumika Goyal

Also applied, thanks.
Re: [PATCH] qlogic: qlcnic_sysfs: constify bin_attribute structures
From: Bhumika Goyal
Date: Wed, 22 Feb 2017 00:11:17 +0530

> Declare bin_attribute structures as const as they are only passed as
> arguments to the functions device_remove_bin_file and
> device_create_bin_file. These function arguments are of type const, so
> bin_attribute structures having this property can be made const too.
> Done using Coccinelle:
...
> Signed-off-by: Bhumika Goyal

Applied.
Re: [PATCH v1.1] net: emac: add support for device-tree based PHY discovery and setup
From: Christian Lamparter
Date: Mon, 20 Feb 2017 20:10:58 +0100

> This patch adds glue-code that allows the EMAC driver to interface
> with the existing dt-supported PHYs in drivers/net/phy.
>
> Currently, the emac driver maintains a small library of
> supported phys in a private phy.c file located in the drivers
> directory.
>
> The support is limited to mostly single ethernet transceivers like the:
> CIS8201, BCM5248, ET1011C, Marvell 88E and 88E1112, AR8035.
>
> However, routers like the Netgear WNDR4700 and Cisco Meraki MX60(W)
> have a 5-port switch (AR8327N) attached to the EMAC. The switch chip
> is supported by the qca8k mdio driver, which uses the generic phy
> library. Another reason is that PHYLIB also supports the BCM54610,
> which was used for the Western Digital My Book Live.
>
> This will now also make EMAC select PHYLIB.
>
> Signed-off-by: Christian Lamparter

Applied, thanks.
Re: [PATCH net 3/6] net/mlx5e: Do not reduce LRO WQE size when not using build_skb
On Wed, Feb 22, 2017 at 7:20 AM, Saeed Mahameed wrote:
> From: Tariq Toukan
>
> When rq_type is Striding RQ, no room for SKB_RESERVE is needed,
> as SKB allocation is not done via build_skb.
>
> Fixes: e4b85508072b ("net/mlx5e: Slightly reduce hardware LRO size")
> Signed-off-by: Tariq Toukan
> Signed-off-by: Saeed Mahameed

Why is this one a bug fix? It sounds like an optimization from the
commit log.
Re: [PATCH next 0/4] bonding: winter cleanup
On Wed, Feb 22, 2017 at 2:17 PM, Mahesh Bandewar (महेश बंडेवार) wrote:
> On Tue, Feb 21, 2017 at 8:36 PM, Or Gerlitz wrote:
>>
>> On Wed, Feb 22, 2017 at 5:29 AM, David Miller wrote:
>> > From: Mahesh Bandewar
>> > Date: Tue, 21 Feb 2017 17:08:16 -0800
>> >
>> >> Few cleanup patches that I have accumulated over some time now.
>> >
>> > The net-next tree is closed, therefore it is not appropriate to
>> > submit cleanups at this time.
>> >
> Oops, My bad! Well, this will give an opportunity for people to have
> more time with the patch(s) / clean-up-code :p
>
>> > Please wait until after the merge window and the net-next tree
>> > opens back up.
>> >
> Will do so. Thank you.
>
>> Maybe we should start educating ppl on this by mandating them to come
>> and bring home made cakes to netdev each time they ignore that?
>
> That's a risky proposal Or! You are assuming that someone who can
> write code can bake "good" cake too :)

Just bring the recipe, so if it happens not to be tasty we can tell
you why. :-D

>> in our
>> school this is the model for stopping kids and teachers phones to ring
>> during class time. Jamal - there will be more attendees this way :)
>
>> Or.
Re: [PATCH next 0/4] bonding: winter cleanup
On Tue, Feb 21, 2017 at 11:58 PM, Jiri Pirko wrote:
> Wed, Feb 22, 2017 at 02:08:16AM CET, mah...@bandewar.net wrote:
>> From: Mahesh Bandewar
>>
>> Few cleanup patches that I have accumulated over some time now.
>>
>> (a) First two patches are basically to move the work-queue initialization
>> from every ndo_open / bond_open operation to once at the beginning during
>> port creation. Work-queue initialization is an unnecessary operation
>> for every 'ifup' operation. However we have some mode-specific work-queues
>> and mode can change anytime after port creation. So the second patch is
>> to ensure the correct work-handler is called based on the mode.
>>
>> (b) Third patch is simple and straightforward; it removes a hard-coded value
>> that was added in the initial commit and replaces it with the default
>> value configured.
>>
>> (c) The final patch in the series removes the unimplemented "port-moved"
>> state from the LACP state machine. This state is defined but never set, so
>> removing it from the state machine logic makes the code a little cleaner.
>>
>> Note: None of these patches are making any functional changes.
>>
>> Mahesh Bandewar (4):
>
> Mahesh. I understand that you are still using bonding. What's stopping
> you from using team instead?
>
Let me just say this, if it was trivial enough, we'd have done with it
by now. :)

> Isn't it about time to start the deprecation process of bonding? :O

>> bonding: restructure arp-monitor
>> bonding: initialize work-queues during creation of bond
>> bonding: remove hardcoded value
>> bonding: remove "port-moved" state that was never implemented
>>
>> drivers/net/bonding/bond_3ad.c  | 11 +++----
>> drivers/net/bonding/bond_main.c | 42 -
>> 2 files changed, 32 insertions(+), 21 deletions(-)
>>
>> --
>> 2.11.0.483.g087da7b7c-goog
Re: [PATCH next 0/4] bonding: winter cleanup
On Tue, Feb 21, 2017 at 8:36 PM, Or Gerlitz wrote:
>
> On Wed, Feb 22, 2017 at 5:29 AM, David Miller wrote:
> > From: Mahesh Bandewar
> > Date: Tue, 21 Feb 2017 17:08:16 -0800
> >
> >> Few cleanup patches that I have accumulated over some time now.
> >
> > The net-next tree is closed, therefore it is not appropriate to
> > submit cleanups at this time.
> >
Oops, My bad! Well, this will give an opportunity for people to have
more time with the patch(s) / clean-up-code :p

> > Please wait until after the merge window and the net-next tree
> > opens back up.
> >
Will do so. Thank you.

> Maybe we should start educating ppl on this by mandating them to come
> and bring home made cakes to netdev each time they ignore that?
>
That's a risky proposal Or! You are assuming that someone who can
write code can bake "good" cake too :)

> in our
> school this is the model for stopping kids and teachers phones to ring
> during class time. Jamal - there will be more attendees this way :)
>
> Or.
Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
On Wed, 2017-02-22 at 09:23 -0800, Alexander Duyck wrote:
> On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazet wrote:
> > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote:
> >> Use of order-3 pages is problematic in some cases.
> >>
> >> This patch might add three kinds of regression :
> >>
> >> 1) a CPU performance regression, but we will add later page
> >>    recycling and performance should be back.
> >>
> >> 2) TCP receiver could grow its receive window slightly slower,
> >>    because skb->len/skb->truesize ratio will decrease.
> >>    This is mostly ok, we prefer being conservative to not risk OOM,
> >>    and eventually tune TCP better in the future.
> >>    This is consistent with other drivers using 2048 per ethernet frame.
> >>
> >> 3) Because we allocate one page per RX slot, we consume more
> >>    memory for the ring buffers. XDP already had this constraint anyway.
> >>
> >> Signed-off-by: Eric Dumazet
> >> ---
> >
> > Note that we also could use a different strategy.
> >
> > Assume RX rings of 4096 entries/slots.
> >
> > With this patch, mlx4 gets the strategy used by Alexander in Intel
> > drivers :
> >
> > Each RX slot has an allocated page, and uses half of it, flipping to the
> > other half every time the slot is used.
> >
> > So a ring buffer of 4096 slots allocates 4096 pages.
> >
> > When we receive a packet train for the same flow, GRO builds an skb with
> > ~45 page frags, all from different pages.
> >
> > The put_page() done from skb_release_data() touches ~45 different struct
> > page cache lines, and shows a high cost. (compared to the order-3 used
> > today by mlx4, this adds extra cache line misses and stalls for the
> > consumer)
> >
> > If we instead try to use the two halves of one page on consecutive RX
> > slots, we might instead cook skbs with the same number of MSS (45), but
> > half the number of cache lines for put_page(), so we should speed up the
> > consumer.
>
> So there is a problem that is being overlooked here. That is the cost
> of the DMA map/unmap calls. The problem is many PowerPC systems have
> an IOMMU that you have to work around, and that IOMMU comes at a heavy
> cost for every map/unmap call. So unless you are saying you want to
> set up a hybrid between the mlx5 and this approach, where we have a page
> cache that these all fall back into, you will take a heavy cost for
> having to map and unmap pages.
>
> The whole reason why I implemented the Intel page reuse approach the
> way I did is to try and mitigate the IOMMU issue, it wasn't so much to
> resolve allocator/freeing expense. Basically the allocator scales,
> the IOMMU does not. So any solution would require making certain that
> we can leave the pages pinned in the DMA to avoid having to take the
> global locks involved in accessing the IOMMU.

I do not see any difference, for the fact that we keep pages mapped
the same way.

mlx4_en_complete_rx_desc() will still use the:

    dma_sync_single_range_for_cpu(priv->ddev, dma, frags->page_offset,
                                  frag_size, priv->dma_dir);

for every single MSS we receive.

This wont change.
RE: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
From: Alexander Duyck
> Sent: 22 February 2017 17:24
...
> So there is a problem that is being overlooked here. That is the cost
> of the DMA map/unmap calls. The problem is many PowerPC systems have
> an IOMMU that you have to work around, and that IOMMU comes at a heavy
> cost for every map/unmap call. So unless you are saying you want to
> set up a hybrid between the mlx5 and this approach, where we have a page
> cache that these all fall back into, you will take a heavy cost for
> having to map and unmap pages.
...

I can't help feeling that you need to look at how to get the iommu
code to reuse pages, rather than the ethernet driver.

Maybe something like:
1) The driver requests a mapped receive buffer from the iommu.
   This might give it memory that is already mapped but not in use.
2) When the receive completes the driver tells the iommu the mapping
   is no longer needed. The iommu is not (yet) changed.
3) When the skb is freed the iommu is told that the buffer can be freed.
4) At (1), if the driver is using too much iommu resource then the
   mappings for completed receives can be removed to free up iommu space.

Probably not as simple as it looks :-)

	David
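The four steps above amount to a small state machine per buffer: free, mapped-but-idle, and in-use. A toy model can make the lifetime explicit. Everything here is illustrative — `buf_pool`, `pool_get`, `pool_put`, and the fake bus address are assumptions, not any real IOMMU or DMA API:

```c
#include <stddef.h>

enum buf_state { BUF_FREE, BUF_MAPPED_IDLE, BUF_IN_USE };

struct mapped_buf {
	void *cpu_addr;
	unsigned long dma_addr;   /* pretend bus address */
	enum buf_state state;
};

struct buf_pool {
	struct mapped_buf *bufs;
	int n;
	int max_mapped;           /* iommu resource budget */
	int mapped;               /* buffers currently holding a mapping */
};

/* Step 1: hand out a buffer, preferring one that is already mapped so
 * no iommu map call (and no global iommu lock) is needed. */
static struct mapped_buf *pool_get(struct buf_pool *p)
{
	for (int i = 0; i < p->n; i++)
		if (p->bufs[i].state == BUF_MAPPED_IDLE) {
			p->bufs[i].state = BUF_IN_USE;
			return &p->bufs[i];           /* mapping reused */
		}
	for (int i = 0; i < p->n; i++)
		if (p->bufs[i].state == BUF_FREE) {
			if (p->mapped >= p->max_mapped)
				return NULL;          /* step 4 would reclaim here */
			p->bufs[i].dma_addr = 0x1000 + i;  /* "map" it */
			p->bufs[i].state = BUF_IN_USE;
			p->mapped++;
			return &p->bufs[i];
		}
	return NULL;
}

/* Steps 2 and 3 collapsed: the buffer is done, but the mapping is
 * deliberately kept so the next pool_get() can reuse it. */
static void pool_put(struct buf_pool *p, struct mapped_buf *b)
{
	b->state = BUF_MAPPED_IDLE;
}
```

The point of the model is the one David makes: tearing down the mapping is decoupled from completing the receive, so steady-state RX touches the iommu zero times per packet.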
Re: [PATCH net-next v4 4/7] gtp: consolidate gtp socket rx path
On Tue, Feb 21, 2017 at 2:18 AM, Andreas Schultz wrote:
> Add network device to gtp context in preparation for splitting
> the TEID from the network device.
>
> Use this to rework the socket rx path. Move the common RX part
> of v0 and v1 into a helper. Also move the final rx part into
> that helper as well.
>
Andreas,

How are these GTP kernel patches being tested? Is it possible to
create some sort of GTP network device that separates out just the
datapath for development, in the same way that VXLAN did this?

Tom

> Signed-off-by: Andreas Schultz
> ---
> drivers/net/gtp.c | 80 ++-
> 1 file changed, 44 insertions(+), 36 deletions(-)
>
> diff --git a/drivers/net/gtp.c b/drivers/net/gtp.c
> index 961fb3c..fc0fff5 100644
> --- a/drivers/net/gtp.c
> +++ b/drivers/net/gtp.c
> @@ -58,6 +58,8 @@ struct pdp_ctx {
> 	struct in_addr ms_addr_ip4;
> 	struct in_addr sgsn_addr_ip4;
>
> +	struct net_device *dev;
> +
> 	atomic_t tx_seq;
> 	struct rcu_head rcu_head;
> };
> @@ -175,6 +177,40 @@ static bool gtp_check_src_ms(struct sk_buff *skb, struct pdp_ctx *pctx,
> 	return false;
> }
>
> +static int gtp_rx(struct pdp_ctx *pctx, struct sk_buff *skb, unsigned int hdrlen,
> +		  bool xnet)
> +{
> +	struct pcpu_sw_netstats *stats;
> +
> +	if (!gtp_check_src_ms(skb, pctx, hdrlen)) {
> +		netdev_dbg(pctx->dev, "No PDP ctx for this MS\n");
> +		return 1;
> +	}
> +
> +	/* Get rid of the GTP + UDP headers. */
> +	if (iptunnel_pull_header(skb, hdrlen, skb->protocol, xnet))
> +		return -1;
> +
> +	netdev_dbg(pctx->dev, "forwarding packet from GGSN to uplink\n");
> +
> +	/* Now that the UDP and the GTP header have been removed, set up the
> +	 * new network header. This is required by the upper layer to
> +	 * calculate the transport header.
> +	 */
> +	skb_reset_network_header(skb);
> +
> +	skb->dev = pctx->dev;
> +
> +	stats = this_cpu_ptr(pctx->dev->tstats);
> +	u64_stats_update_begin(&stats->syncp);
> +	stats->rx_packets++;
> +	stats->rx_bytes += skb->len;
> +	u64_stats_update_end(&stats->syncp);
> +
> +	netif_rx(skb);
> +	return 0;
> +}
> +
> /* 1 means pass up to the stack, -1 means drop and 0 means decapsulated. */
> static int gtp0_udp_encap_recv(struct gtp_dev *gtp, struct sk_buff *skb,
> 			       bool xnet)
> @@ -201,13 +237,7 @@ static int gtp0_udp_encap_recv(struct gtp_dev *gtp, struct sk_buff *skb,
> 		return 1;
> 	}
>
> -	if (!gtp_check_src_ms(skb, pctx, hdrlen)) {
> -		netdev_dbg(gtp->dev, "No PDP ctx for this MS\n");
> -		return 1;
> -	}
> -
> -	/* Get rid of the GTP + UDP headers. */
> -	return iptunnel_pull_header(skb, hdrlen, skb->protocol, xnet);
> +	return gtp_rx(pctx, skb, hdrlen, xnet);
> }
>
> static int gtp1u_udp_encap_recv(struct gtp_dev *gtp, struct sk_buff *skb,
> @@ -250,13 +280,7 @@ static int gtp1u_udp_encap_recv(struct gtp_dev *gtp, struct sk_buff *skb,
> 		return 1;
> 	}
>
> -	if (!gtp_check_src_ms(skb, pctx, hdrlen)) {
> -		netdev_dbg(gtp->dev, "No PDP ctx for this MS\n");
> -		return 1;
> -	}
> -
> -	/* Get rid of the GTP + UDP headers. */
> -	return iptunnel_pull_header(skb, hdrlen, skb->protocol, xnet);
> +	return gtp_rx(pctx, skb, hdrlen, xnet);
> }
>
> static void gtp_encap_destroy(struct sock *sk)
> @@ -290,10 +314,9 @@ static void gtp_encap_disable(struct gtp_dev *gtp)
>  */
> static int gtp_encap_recv(struct sock *sk, struct sk_buff *skb)
> {
> -	struct pcpu_sw_netstats *stats;
> 	struct gtp_dev *gtp;
> +	int ret = 0;
> 	bool xnet;
> -	int ret;
>
> 	gtp = rcu_dereference_sk_user_data(sk);
> 	if (!gtp)
> @@ -319,33 +342,17 @@ static int gtp_encap_recv(struct sock *sk, struct sk_buff *skb)
> 	switch (ret) {
> 	case 1:
> 		netdev_dbg(gtp->dev, "pass up to the process\n");
> -		return 1;
> +		break;
> 	case 0:
> -		netdev_dbg(gtp->dev, "forwarding packet from GGSN to uplink\n");
> 		break;
> 	case -1:
> 		netdev_dbg(gtp->dev, "GTP packet has been dropped\n");
> 		kfree_skb(skb);
> -		return 0;
> +		ret = 0;
> +		break;
> 	}
>
> -	/* Now that the UDP and the GTP header have been removed, set up the
> -	 * new network header. This is required by the upper layer to
> -	 * calculate the transport header.
> -	 */
> -	skb_reset_network_header(skb);
> -
> -	skb->dev = gtp->dev;
> -
> -	stats =
[BUG] vmxnet3: random freeze regression
I get the bugzilla reports for networking, and I see several reports of
vmxnet3 hanging with 4.8 and later kernels. Is this a known issue?

https://bugzilla.kernel.org/show_bug.cgi?id=191201
netvsc NAPI
NAPI support for netvsc is ready, but the merge coordination is a
nuisance. Since netvsc NAPI support requires other changes that are
proceeding through GregKH's char-misc tree, I would like to send the
two patches after the current net-next and char-misc-next trees are
merged into Linus's tree.

At a minimum, these changes are needed:

6e47dd3e2938 ("vmbus: expose hv_begin/end_read")
5529eaf6e79a ("vmbus: remove conditional locking of vmbus_write")
b71e328297a3 ("vmbus: add direct isr callback mode")
631e63a9f346 ("vmbus: change to per channel tasklet")
37cdd991fac8 ("vmbus: put related per-cpu variable together")

Please let me know when linux-net is up to date with these.
Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
On Wed, Feb 22, 2017 at 8:22 AM, Eric Dumazetwrote: > On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote: >> Use of order-3 pages is problematic in some cases. >> >> This patch might add three kinds of regression : >> >> 1) a CPU performance regression, but we will add later page >> recycling and performance should be back. >> >> 2) TCP receiver could grow its receive window slightly slower, >>because skb->len/skb->truesize ratio will decrease. >>This is mostly ok, we prefer being conservative to not risk OOM, >>and eventually tune TCP better in the future. >>This is consistent with other drivers using 2048 per ethernet frame. >> >> 3) Because we allocate one page per RX slot, we consume more >>memory for the ring buffers. XDP already had this constraint anyway. >> >> Signed-off-by: Eric Dumazet >> --- > > Note that we also could use a different strategy. > > Assume RX rings of 4096 entries/slots. > > With this patch, mlx4 gets the strategy used by Alexander in Intel > drivers : > > Each RX slot has an allocated page, and uses half of it, flipping to the > other half every time the slot is used. > > So a ring buffer of 4096 slots allocates 4096 pages. > > When we receive a packet train for the same flow, GRO builds an skb with > ~45 page frags, all from different pages. > > The put_page() done from skb_release_data() touches ~45 different struct > page cache lines, and show a high cost. (compared to the order-3 used > today by mlx4, this adds extra cache line misses and stalls for the > consumer) > > If we instead try to use the two halves of one page on consecutive RX > slots, we might instead cook skb with the same number of MSS (45), but > half the number of cache lines for put_page(), so we should speed up the > consumer. So there is a problem that is being overlooked here. That is the cost of the DMA map/unmap calls. The problem is many PowerPC systems have an IOMMU that you have to work around, and that IOMMU comes at a heavy cost for every map/unmap call. 
So unless you are saying you want to set up a hybrid between the mlx5 approach and this one, where we have a page cache that these all fall back into, you will take a heavy cost for having to map and unmap pages. The whole reason why I implemented the Intel page reuse approach the way I did is to try and mitigate the IOMMU issue; it wasn't so much to resolve allocator/freeing expense. Basically the allocator scales, the IOMMU does not. So any solution would require making certain that we can leave the pages pinned in the DMA to avoid having to take the global locks involved in accessing the IOMMU. > This means the number of active pages would be minimal, especially on > PowerPC. Pages that have been used by X=2 received frags would be put in > a quarantine (size to be determined). > On PowerPC, X would be PAGE_SIZE/frag_size > > > This strategy would consume less memory on PowerPC : > 65535/1536 = 42, so a 4096 RX ring would need 98 active pages instead of > 4096. > > The quarantine would be sized to increase chances of reusing an old > page, without consuming too much memory. > > Probably roundup_pow_of_two(rx_ring_size / (PAGE_SIZE/frag_size)) > > x86 would still use 4096 pages, but PowerPC would use 98+128 pages > instead of 4096 (14 MBytes instead of 256 MBytes). So any solution will need to work with an IOMMU enabled on the platform. I assume you have some x86 test systems you could run with an IOMMU enabled. My advice would be to try running in that environment and see where the overhead lies. - Alex
Re: Focusing the XDP project
On Wed, Feb 22, 2017 at 1:43 AM, Jesper Dangaard Brouer wrote: > > On Tue, 21 Feb 2017 14:54:35 -0800 Tom Herbert wrote: >> On Tue, Feb 21, 2017 at 2:29 PM, Saeed Mahameed >> wrote: > [...] >> > The only complexity XDP is adding to the drivers is the constraints on >> > RX memory management and memory model, calling the XDP program itself >> > and handling the action is really a simple thing once you have the >> > correct memory model. > > Exactly, that is why I've been looking at introducing a generic > facility for a memory model for drivers. This should help simplify > drivers. Due to performance needs this needs to be a very thin API layer > on top of the page allocator. (That's why I'm working with Mel Gorman > to get closer integration with the page allocator, e.g. a bulking > facility). > >> > Who knows! maybe someday XDP will define one unified RX API for all >> > drivers and it even will handle normal stack delivery itself :). >> > >> That's exactly the point and what we need for TXDP. I'm missing why >> doing this is such rocket science other than the fact that all these >> drivers are vastly different and changing the existing API is >> unpleasant. The only functional complexity I see in creating a generic >> batching interface is handling return codes asynchronously. This is >> entirely feasible though... > > I'll be happy as long as we get a batching interface, then we can > incrementally do the optimizations later. > > In the future, I do hope (like Saeed) this RX API will evolve into > delivering (a bulk of) raw-packet-pages into the netstack; this should > simplify drivers, and we can keep the complexity and SKB allocations > out of the drivers. > To start with, we can play with delivering (a bulk of) > raw-packet-pages into Tom's TXDP engine/system? > Hi Jesper, Maybe we can start to narrow in on what a batching API might look like.
Looking at mlx5 (as a model of how XDP is implemented), the main RX loop in mlx5e_poll_rx_cq calls the backend handler in one indirect function call. The XDP path goes through mlx5e_handle_rx_cqe, skb_from_cqe, and mlx5e_xdp_handle. The first two deal a lot with building the skbuff. As a prerequisite to RX batching, it would be helpful if this could be flattened so that most of the logic is obvious in the main RX loop. The model of RX batching seems straightforward enough: pull packets from the ring, save xdp_data information in a vector, and periodically call into the stack to handle a batch, where one argument is the vector of packets and another argument is an output vector that gives return codes (XDP actions); then process each return code for each packet in the driver accordingly. Presumably, there is a maximum allowed batch size that may or may not be the same as the NAPI budget, so the batching call needs to be made when the limit is reached and also before exiting NAPI. For each packet the stack can return an XDP code; XDP_PASS in this case could be interpreted as being consumed by the stack, which would be used in the case where the stack creates an skbuff for the packet. The stack, on its part, can process the batch however it sees fit: it can process each packet individually in the canonical model, or we can continue processing a batch in a VPP-like fashion. The batching API could be transparent to the stack or not. In the transparent case, the driver calls what looks like a receive function, but the stack may defer processing for batching. A callback function (that can be inlined) is used to process return codes, as I mentioned previously. In the non-transparent model, the driver knowingly creates the packet vector and then explicitly calls another function to process the vector.
Personally, I lean towards the transparent API; it may mean less complexity in drivers and gives the stack more control over the parameters of batching (for instance, it may choose a batch size that optimizes its processing instead of the driver guessing the best size). Btw, the logic for RX batching is very similar to how we batch packets for RPS (I think you already mentioned an skb-less RPS, and that should hopefully be something that falls out of this design). Tom
Re: [PATCH net-next] virtio-net: switch to use build_skb() for small buffer
On 17-02-21 12:46 AM, Jason Wang wrote: > This patch switch to use build_skb() for small buffer which can have > better performance for both TCP and XDP (since we can work at page > before skb creation). It also remove lots of XDP codes since both > mergeable and small buffer use page frag during refill now. > >Before | After > XDP_DROP(xdp1) 64B : 11.1Mpps | 14.4Mpps > > Tested with xdp1/xdp2/xdp_ip_tx_tunnel and netperf. When you do the xdp tests are you generating packets with pktgen on the corresponding tap devices? Also another thought, have you looked at using some of the buffer recycling techniques used in the hardware drivers such as ixgbe and with Eric's latest patches mlx? I have seen significant performance increases for some workloads doing this. I wanted to try something like this out on virtio but haven't had time yet. > > Signed-off-by: Jason Wang> --- [...] > static int add_recvbuf_small(struct virtnet_info *vi, struct receive_queue > *rq, >gfp_t gfp) > { > - int headroom = GOOD_PACKET_LEN + virtnet_get_headroom(vi); > + struct page_frag *alloc_frag = >alloc_frag; > + char *buf; > unsigned int xdp_headroom = virtnet_get_headroom(vi); > - struct sk_buff *skb; > - struct virtio_net_hdr_mrg_rxbuf *hdr; > + int len = vi->hdr_len + VIRTNET_RX_PAD + GOOD_PACKET_LEN + xdp_headroom; > int err; > > - skb = __netdev_alloc_skb_ip_align(vi->dev, headroom, gfp); > - if (unlikely(!skb)) > + len = SKB_DATA_ALIGN(len) + > + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)); > + if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp))) > return -ENOMEM; > > - skb_put(skb, headroom); > - > - hdr = skb_vnet_hdr(skb); > - sg_init_table(rq->sg, 2); > - sg_set_buf(rq->sg, hdr, vi->hdr_len); > - skb_to_sgvec(skb, rq->sg + 1, xdp_headroom, skb->len - xdp_headroom); > - > - err = virtqueue_add_inbuf(rq->vq, rq->sg, 2, skb, gfp); > + buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; > + get_page(alloc_frag->page); > + alloc_frag->offset += len; > + 
sg_init_one(rq->sg, buf + VIRTNET_RX_PAD + xdp_headroom, > + vi->hdr_len + GOOD_PACKET_LEN); > + err = virtqueue_add_inbuf(rq->vq, rq->sg, 1, buf, gfp); Nice this cleans up a lot of the branching code. Thanks. Acked-by: John Fastabend
Re: Questions on XDP
On 17-02-21 09:44 AM, Alexander Duyck wrote: > On Mon, Feb 20, 2017 at 11:55 PM, Alexei Starovoitov >wrote: >> On Mon, Feb 20, 2017 at 08:00:57PM -0800, Alexander Duyck wrote: >>> >>> I assumed "toy Tx" since I wasn't aware that they were actually >>> allowing writing to the page. I think that might work for the XDP_TX >>> case, >> >> Take a look at samples/bpf/xdp_tx_iptunnel_kern.c >> It's close enough approximation of load balancer. >> The packet header is rewritten by the bpf program. >> That's where dma_bidirectional requirement came from. > > Thanks. I will take a look at it. > >>> but the case where encap/decap is done and then passed up to the >>> stack runs the risk of causing data corruption on some architectures >>> if they unmap the page before the stack is done with the skb. I >>> already pointed out the issue to the Mellanox guys and that will >>> hopefully be addressed shortly. >> >> sure. the path were xdp program does decap and passes to the stack >> is not finished. To make it work properly we need to expose >> csum complete field to the program at least. > > I would think the checksum is something that could be validated after > the frame has been modified. In the case of encapsulating or > decapsulating a TCP frame you could probably assume the inner TCP > checksum is valid and then you only have to deal with the checksum if > it is present in the outer tunnel header. Basically deal with it like > we do the local checksum offload, only you would have to compute the > pseudo header checksum for the inner and outer headers since you can't > use the partial checksum of the inner header. > >>> As far as the Tx I need to work with John since his current solution >>> doesn't have any batching support that I saw and that is a major >>> requirement if we want to get above 7 Mpps for a single core. >> >> I think we need to focus on both Mpps and 'perf report' together. 
> > Agreed, I usually look over both as one tells you how fast you are > going and the other tells you where the bottlenecks are. > >> Single core doing 7Mpps and scaling linearly to 40Gbps line rate >> is much better than single core doing 20Mpps and not scaling at all. >> There could be sw inefficiencies and hw limits, hence 'perf report' >> is must have when discussing numbers. > > Agreed. > >> I think long term we will be able to agree on a set of real life >> use cases and corresponding set of 'blessed' bpf programs and >> create a table of nic, driver, use case 1, 2, 3, single core, multi. >> Making level playing field for all nic vendors is one of the goals. >> >> Right now we have xdp1, xdp2 and xdp_tx_iptunnel benchmarks. >> They are approximations of ddos, router, load balancer >> use cases. They obviously need work to get to 'blessed' shape, >> but imo quite good to do vendor vs vendor comparison for >> the use cases that we care about. >> Eventually nic->vm and vm->vm use cases via xdp_redirect should >> be added to such set of 'blessed' benchmarks too. >> I think so far we avoided falling into trap of microbenchmarking wars. > > I'll keep this in mind for upcoming patches. > Yep, agreed although having some larger examples in the wild even if not in the kernel source would be great. I think we will see these soon. >> 3. Should we support scatter-gather to support 9K jumbo frames >> instead of allocating order 2 pages? > > we can, if main use case of mtu < 4k doesn't suffer. Agreed I don't think it should degrade <4k performance. That said for VM traffic this is absolutely needed. Without TSO enabled VM traffic is 50% slower on my tests :/. With tap/vhost support for XDP this becomes necessary. vhost/tap support for XDP is on my list directly behind ixgbe and redirect support. >>> >>> I'm thinking we just need to turn XDP into something like a >>> scatterlist for such cases. 
It wouldn't take much to just convert the >>> single xdp_buf into an array of xdp_buf. >> >> datapath has to be fast. If xdp program needs to look at all >> bytes of the packet the performance is gone. Therefore I don't see >> a need to expose an array of xdp_buffs to the program. > > The program itself may not care, but if we are going to deal with > things like Tx and Drop we need to make sure we drop all the parts of > the frame. An alternate idea I have been playing around with is just > having the driver repeat the last action until it hits the end of a > frame. So XDP would analyze the first 1.5K or 3K of the frame, and > then tell us to either drop it, pass it, or xmit it. After that we > would just repeat that action until we hit the end of the frame. The > only limitation is that it means XDP is limited to only accessing the > first 1514 bytes. > >> The alternative would be to add a hidden field to xdp_buff that keeps >> SG in some form and data_end will point to the end of linear chunk. >> But you cannot put only headers into
[PATCH RFC v2 12/12] test: add sendmsg zerocopy tests
From: Willem de BruijnIntroduce the tests uses to verify MSG_ZEROCOPY behavior: snd_zerocopy: send zerocopy fragments out over the default route. snd_zerocopy_lo: send data between a pair of local sockets and report throughput. These tests are not suitable for inclusion in /tools/testing/selftest as is, as they do not return a pass/fail verdict. Including them in this RFC for demonstration, only. Signed-off-by: Willem de Bruijn --- tools/testing/selftests/net/.gitignore| 2 + tools/testing/selftests/net/Makefile | 1 + tools/testing/selftests/net/snd_zerocopy.c| 354 +++ tools/testing/selftests/net/snd_zerocopy_lo.c | 596 ++ 4 files changed, 953 insertions(+) create mode 100644 tools/testing/selftests/net/snd_zerocopy.c create mode 100644 tools/testing/selftests/net/snd_zerocopy_lo.c diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore index afe109e5508a..7dfb030f0c9b 100644 --- a/tools/testing/selftests/net/.gitignore +++ b/tools/testing/selftests/net/.gitignore @@ -5,3 +5,5 @@ reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa reuseport_dualstack +snd_zerocopy +snd_zerocopy_lo diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index e24e4c82542e..aa663c791f7a 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -7,6 +7,7 @@ NET_PROGS = socket NET_PROGS += psock_fanout psock_tpacket NET_PROGS += reuseport_bpf reuseport_bpf_cpu reuseport_bpf_numa NET_PROGS += reuseport_dualstack +NET_PROGS += snd_zerocopy snd_zerocopy_lo all: $(NET_PROGS) reuseport_bpf_numa: LDFLAGS += -lnuma diff --git a/tools/testing/selftests/net/snd_zerocopy.c b/tools/testing/selftests/net/snd_zerocopy.c new file mode 100644 index ..052d0d14e62d --- /dev/null +++ b/tools/testing/selftests/net/snd_zerocopy.c @@ -0,0 +1,354 @@ +#define _GNU_SOURCE + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include 
+#include +#include +#include + +#define MSG_ZEROCOPY 0x400 + +#define SK_FUDGE_FACTOR 2 /* allow for overhead in SNDBUF */ +#define BUFLEN (400 * 1000) /* max length of send call */ +#define DEST_PORT 9000 + +uint32_t sent = UINT32_MAX, acked = UINT32_MAX; + +int cfg_batch_notify = 10; +int cfg_num_runs = 16; +size_t cfg_socksize = 1 << 20; +int cfg_stress_sec; +int cfg_verbose; +bool cfg_zerocopy; + +static unsigned long gettime_now_ms(void) +{ + struct timeval tv; + + gettimeofday(&tv, NULL); + return (tv.tv_sec * 1000) + (tv.tv_usec / 1000); +} + +static void do_set_socksize(int fd) +{ + if (setsockopt(fd, SOL_SOCKET, SO_SNDBUFFORCE, + &cfg_socksize, sizeof(cfg_socksize))) + error(1, 0, "setsockopt sndbufforce"); + + if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFFORCE, + &cfg_socksize, sizeof(cfg_socksize))) + error(1, 0, "setsockopt rcvbufforce"); +} + +static bool do_read_notification(int fd) +{ + struct sock_extended_err *serr; + struct cmsghdr *cm; + struct msghdr msg = {}; + char control[100]; + int64_t hi, lo; + int ret; + + msg.msg_control = control; + msg.msg_controllen = sizeof(control); + + ret = recvmsg(fd, &msg, MSG_DONTWAIT | MSG_ERRQUEUE); + if (ret == -1 && errno == EAGAIN) + return false; + if (ret == -1) + error(1, errno, "recvmsg notification"); + if (msg.msg_flags & MSG_CTRUNC) + error(1, errno, "recvmsg notification: truncated"); + + cm = CMSG_FIRSTHDR(&msg); + if (!cm || cm->cmsg_level != SOL_IP || + (cm->cmsg_type != IP_RECVERR && cm->cmsg_type != IPV6_RECVERR)) + error(1, 0, "cmsg: wrong type"); + + serr = (void *) CMSG_DATA(cm); + if (serr->ee_errno != 0 || serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) + error(1, 0, "serr: wrong type"); + + hi = serr->ee_data; + lo = serr->ee_info; + if (lo != (uint32_t) (acked + 1)) + error(1, 0, "notify: %lu..%lu, expected %u\n", + lo, hi, acked + 1); + acked = hi; + + if (cfg_verbose) + fprintf(stderr, "completed: %lu..%lu\n", lo, hi); + + return true; +} + +static void do_poll(int fd, int events, int timeout) +{ + struct pollfd
pfd; + int ret; + + pfd.fd = fd; + pfd.events = events; + pfd.revents = 0; + + ret = poll(&pfd, 1, timeout); + if (ret == -1) + error(1, errno, "poll"); + if (ret != 1) + error(1, 0, "poll timeout. events=0x%x acked=%u sent=%u", + pfd.events, acked, sent); + + if (cfg_verbose >= 2) + fprintf(stderr, "poll ok. events=0x%x
[PATCH RFC v2 09/12] udp: enable sendmsg zerocopy
From: Willem de BruijnAdd MSG_ZEROCOPY support to inet/dgram. This includes udplite. Tested: loopback test snd_zerocopy_lo -u -z produces without zerocopy (-u): rx=173940 (10854 MB) tx=173940 txc=0 rx=367026 (22904 MB) tx=367026 txc=0 rx=564078 (35201 MB) tx=564078 txc=0 rx=756588 (47214 MB) tx=756588 txc=0 with zerocopy (-u -z): rx=377994 (23588 MB) tx=377994 txc=377980 rx=792654 (49465 MB) tx=792654 txc=792632 rx=1209582 (75483 MB) tx=1209582 txc=1209552 rx=1628376 (101618 MB) tx=1628376 txc=1628338 loopback test currently fails with corking, due to CHECKSUM_PARTIAL being disabled with UDP_CORK after commit d749c9cbffd6 ("ipv4: no CHECKSUM_PARTIAL on MSG_MORE corked sockets") I will suggest to allow it on NETIF_F_LOOPBACK. Signed-off-by: Willem de Bruijn --- include/linux/skbuff.h | 5 + net/ipv4/ip_output.c | 34 +- 2 files changed, 34 insertions(+), 5 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 6ad1724ceb60..9e7386f3f7a8 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -424,6 +424,11 @@ struct ubuf_info { #define skb_uarg(SKB) ((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg)) +#define sock_can_zerocopy(sk, rt, csummode) \ + ((rt->dst.dev->features & NETIF_F_SG) && \ +((sk->sk_type == SOCK_RAW) || \ + (sk->sk_type == SOCK_DGRAM && csummode & CHECKSUM_UNNECESSARY))) + struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size); struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size, struct ubuf_info *uarg); diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index 737ce826d7ec..9e0110d8a429 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -919,7 +919,7 @@ static int __ip_append_data(struct sock *sk, { struct inet_sock *inet = inet_sk(sk); struct sk_buff *skb; - + struct ubuf_info *uarg = NULL; struct ip_options *opt = cork->opt; int hh_len; int exthdrlen; @@ -963,9 +963,16 @@ static int __ip_append_data(struct sock *sk, !exthdrlen) csummode = CHECKSUM_PARTIAL; + 
if (flags & MSG_ZEROCOPY && length && + sock_can_zerocopy(sk, rt, skb ? skb->ip_summed : csummode)) { + uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb)); + if (!uarg) + return -ENOBUFS; + } + cork->length += length; if length + fragheaderlen) > mtu) || (skb && skb_is_gso(skb))) && - (sk->sk_protocol == IPPROTO_UDP) && + (sk->sk_protocol == IPPROTO_UDP) && !uarg && (rt->dst.dev->features & NETIF_F_UFO) && !rt->dst.header_len && (sk->sk_type == SOCK_DGRAM) && !sk->sk_no_check_tx) { err = ip_ufo_append_data(sk, queue, getfrag, from, length, @@ -1017,6 +1024,8 @@ static int __ip_append_data(struct sock *sk, if ((flags & MSG_MORE) && !(rt->dst.dev->features_F_SG)) alloclen = mtu; + else if (uarg) + alloclen = min_t(int, fraglen, MAX_HEADER); else alloclen = fraglen; @@ -1059,11 +1068,12 @@ static int __ip_append_data(struct sock *sk, cork->tx_flags = 0; skb_shinfo(skb)->tskey = tskey; tskey = 0; + skb_zcopy_set(skb, uarg); /* * Find where to start putting bytes. */ - data = skb_put(skb, fraglen + exthdrlen); + data = skb_put(skb, alloclen); skb_set_network_header(skb, exthdrlen); skb->transport_header = (skb->network_header + fragheaderlen); @@ -1079,7 +1089,9 @@ static int __ip_append_data(struct sock *sk, pskb_trim_unique(skb_prev, maxfraglen); } - copy = datalen - transhdrlen - fraggap; + copy = min(datalen, + alloclen - exthdrlen - fragheaderlen); + copy -= transhdrlen - fraggap; if (copy > 0 && getfrag(from, data + transhdrlen, offset, copy, fraggap, skb) < 0) { err = -EFAULT; kfree_skb(skb); @@ -1087,7 +1099,7 @@ static int __ip_append_data(struct sock *sk, } offset += copy; - length -= datalen - fraggap; + length -= copy + transhdrlen; transhdrlen =
[PATCH RFC v2 11/12] packet: enable sendmsg zerocopy
From: Willem de BruijnSupport MSG_ZEROCOPY on PF_PACKET transmission. Tested: pf_packet loopback test snd_zerocopy_lo -p -z produces: without zerocopy (-p): rx=0 (0 MB) tx=221696 txc=0 rx=0 (0 MB) tx=443880 txc=0 rx=0 (0 MB) tx=661056 txc=0 rx=0 (0 MB) tx=877152 txc=0 with zerocopy (-p -z): rx=0 (0 MB) tx=528548 txc=528544 rx=0 (0 MB) tx=1052364 txc=1052360 rx=0 (0 MB) tx=1571956 txc=1571952 rx=0 (0 MB) tx=2094144 txc=2094140 Packets do not arrive at the Rx socket due to a martian test: IPv4: martian destination 127.0.0.1 from 127.0.0.1, dev lo I'll need to revise snd_zerocopy_lo to bypass that. Signed-off-by: Willem de Bruijn --- net/packet/af_packet.c | 52 -- 1 file changed, 42 insertions(+), 10 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 2bd0d1949312..af9ecc1edf72 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2754,28 +2754,55 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) static struct sk_buff *packet_alloc_skb(struct sock *sk, size_t prepad, size_t reserve, size_t len, - size_t linear, int noblock, + size_t linear, int flags, int *err) { struct sk_buff *skb; + size_t data_len; - /* Under a page? Don't bother with paged skb. */ - if (prepad + len < PAGE_SIZE || !linear) - linear = len; + if (flags & MSG_ZEROCOPY) { + /* Minimize linear, but respect header lower bound */ + linear = reserve + min(len, max_t(size_t, linear, MAX_HEADER)); + data_len = 0; + } else { + /* Under a page? Don't bother with paged skb. 
*/ + if (prepad + len < PAGE_SIZE || !linear) + linear = len; + data_len = len - linear; + } - skb = sock_alloc_send_pskb(sk, prepad + linear, len - linear, noblock, - err, 0); + skb = sock_alloc_send_pskb(sk, prepad + linear, data_len, + flags & MSG_DONTWAIT, err, 0); if (!skb) return NULL; skb_reserve(skb, reserve); skb_put(skb, linear); - skb->data_len = len - linear; - skb->len += len - linear; + skb->data_len = data_len; + skb->len += data_len; return skb; } +static int packet_zerocopy_sg_from_iovec(struct sk_buff *skb, +struct msghdr *msg, +int offset, size_t size) +{ + int ret; + + /* if SOCK_DGRAM, head room was alloc'ed and holds ll-headers */ + __skb_pull(skb, offset); + ret = zerocopy_sg_from_iter(skb, >msg_iter); + __skb_push(skb, offset); + if (unlikely(ret)) + return ret == -EMSGSIZE ? ret : -EIO; + + if (!skb_zerocopy_alloc(skb, size)) + return -ENOMEM; + + return 0; +} + static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) { struct sock *sk = sock->sk; @@ -2853,7 +2880,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) linear = __virtio16_to_cpu(vio_le(), vnet_hdr.hdr_len); linear = max(linear, min_t(int, len, dev->hard_header_len)); skb = packet_alloc_skb(sk, hlen + tlen, hlen, len, linear, - msg->msg_flags & MSG_DONTWAIT, ); + msg->msg_flags, ); if (skb == NULL) goto out_unlock; @@ -2867,7 +2894,11 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) } /* Returns -EFAULT on error */ - err = skb_copy_datagram_from_iter(skb, offset, >msg_iter, len); + if (msg->msg_flags & MSG_ZEROCOPY) + err = packet_zerocopy_sg_from_iovec(skb, msg, offset, len); + else + err = skb_copy_datagram_from_iter(skb, offset, >msg_iter, + len); if (err) goto out_free; @@ -2913,6 +2944,7 @@ static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len) return len; out_free: + skb_zcopy_abort(skb); kfree_skb(skb); out_unlock: if (dev) -- 2.11.0.483.g087da7b7c-goog
[PATCH RFC v2 07/12] sock: sendmsg zerocopy limit bytes per notification
From: Willem de BruijnZerocopy can coalesce notifications of up to 65535 send calls. Excessive coalescing increases notification latency and process working set size. Experiments showed trains of 75 syscalls holding around 8 MB of data per notification. On servers with many slower clients, this causes many GB of user data waiting for acknowledgment and many seconds of latency between send and notification reception. Introduce a notification byte limit. Implementation notes: - Due to space constraints in struct ubuf_info, the internal calculation is approximate, in Kilobytes and capped to 64MB. - The field is accessed only on initial allocation of ubuf_info, when the struct is private, or under the tcp lock. - When breaking a chain, we create a new notification structure uarg. A chain can be broken in the middle of a large sendmsg. Each skbuff can only point to a single uarg, so skb_zerocopy_add_frags_iter will fail after breaking a chain. The (next) TCP patch is changed in v2 to detect failure (EEXIST) and jump to new_segment to create a new skbuff that can point to the new uarg. As a result, packetization of the bytestream may differ from a send without zerocopy. 
Signed-off-by: Willem de Bruijn --- include/linux/skbuff.h | 1 + net/core/skbuff.c | 11 ++- 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index a38308b10d76..6ad1724ceb60 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -411,6 +411,7 @@ struct ubuf_info { struct { u32 id; u16 len; + u16 kbytelen; }; }; atomic_t refcnt; diff --git a/net/core/skbuff.c b/net/core/skbuff.c index b86e196d6dec..6a07a20a91ed 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -974,6 +974,7 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size) uarg->callback = sock_zerocopy_callback; uarg->id = ((u32)atomic_inc_return(>sk_zckey)) - 1; uarg->len = 1; + uarg->kbytelen = min_t(size_t, DIV_ROUND_UP(size, 1024u), USHRT_MAX); atomic_set(>refcnt, 0); sock_hold(sk); @@ -990,6 +991,8 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size, struct ubuf_info *uarg) { if (uarg) { + const size_t limit_kb = 512;/* consider a sysctl */ + size_t kbytelen; u32 next; /* realloc only when socket is locked (TCP, UDP cork), @@ -997,8 +1000,13 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size, */ BUG_ON(!sock_owned_by_user(sk)); + kbytelen = uarg->kbytelen + DIV_ROUND_UP(size, 1024u); + if (unlikely(kbytelen > limit_kb)) + goto new_alloc; + uarg->kbytelen = kbytelen; + if (unlikely(uarg->len == USHRT_MAX - 1)) - return NULL; + goto new_alloc; next = (u32)atomic_read(>sk_zckey); if ((u32)(uarg->id + uarg->len) == next) { @@ -1010,6 +1018,7 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size, } } +new_alloc: return sock_zerocopy_alloc(sk, size); } EXPORT_SYMBOL_GPL(sock_zerocopy_realloc); -- 2.11.0.483.g087da7b7c-goog
[PATCH RFC v2 10/12] raw: enable sendmsg zerocopy with IP_HDRINCL
From: Willem de BruijnTested: raw loopback test snd_zerocopy_lo -r -z produces: without zerocopy (-r): rx=97632 (6092 MB) tx=97632 txc=0 rx=208194 (12992 MB) tx=208194 txc=0 rx=318714 (19889 MB) tx=318714 txc=0 rx=429126 (26779 MB) tx=429126 txc=0 with zerocopy (-r -z): rx=326160 (20353 MB) tx=326160 txc=326144 rx=689244 (43012 MB) tx=689244 txc=689220 rx=1049352 (65484 MB) tx=1049352 txc=1049320 rx=1408782 (87914 MB) tx=1408782 txc=1408744 raw hdrincl loopback test snd_zerocopy_lo -R -z produces: without zerocopy (-R): rx=167328 (10442 MB) tx=167328 txc=0 rx=354942 (22150 MB) tx=354942 txc=0 rx=542400 (33848 MB) tx=542400 txc=0 rx=716442 (44709 MB) tx=716442 txc=0 with zerocopy (-R -z): rx=340116 (21224 MB) tx=340116 txc=340102 rx=712746 (44478 MB) tx=712746 txc=712726 rx=1083732 (67629 MB) tx=1083732 txc=1083704 rx=1457856 (90976 MB) tx=1457856 txc=1457820 Signed-off-by: Willem de Bruijn --- net/ipv4/raw.c | 27 +++ 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c index 8119e1f66e03..d21279b2f69e 100644 --- a/net/ipv4/raw.c +++ b/net/ipv4/raw.c @@ -351,7 +351,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4, unsigned int iphlen; int err; struct rtable *rt = *rtp; - int hlen, tlen; + int hlen, tlen, linear; if (length > rt->dst.dev->mtu) { ip_local_error(sk, EMSGSIZE, fl4->daddr, inet->inet_dport, @@ -363,8 +363,14 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4, hlen = LL_RESERVED_SPACE(rt->dst.dev); tlen = rt->dst.dev->needed_tailroom; + linear = length; + + if (flags & MSG_ZEROCOPY && length && + sock_can_zerocopy(sk, rt, CHECKSUM_UNNECESSARY)) + linear = min_t(int, length, MAX_HEADER); + skb = sock_alloc_send_skb(sk, - length + hlen + tlen + 15, + linear + hlen + tlen + 15, flags & MSG_DONTWAIT, ); if (!skb) goto error; @@ -377,7 +383,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4, skb_reset_network_header(skb); iph = ip_hdr(skb); - skb_put(skb, length); + 
skb_put(skb, linear); skb->ip_summed = CHECKSUM_NONE; @@ -388,7 +394,7 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4, skb->transport_header = skb->network_header; err = -EFAULT; - if (memcpy_from_msg(iph, msg, length)) + if (memcpy_from_msg(iph, msg, linear)) goto error_free; iphlen = iph->ihl * 4; @@ -404,6 +410,17 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4, if (iphlen > length) goto error_free; + if (length != linear) { + size_t datalen = length - linear; + + if (!skb_zerocopy_alloc(skb, datalen)) + goto error_zcopy; + err = skb_zerocopy_add_frags_iter(sk, skb, >msg_iter, + datalen, skb_uarg(skb)); + if (err != datalen) + goto error_zcopy; + } + if (iphlen >= sizeof(*iph)) { if (!iph->saddr) iph->saddr = fl4->saddr; @@ -430,6 +447,8 @@ static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4, out: return 0; +error_zcopy: + sock_zerocopy_put_abort(skb_zcopy(skb)); error_free: kfree_skb(skb); error: -- 2.11.0.483.g087da7b7c-goog
[PATCH RFC v2 05/12] sock: sendmsg zerocopy notification coalescing
From: Willem de BruijnIn the simple case, each sendmsg() call generates data and eventually a zerocopy ready notification N, where N indicates the Nth successful invocation of sendmsg() with the MSG_ZEROCOPY flag on this socket. TCP and corked sockets can cause sendmsg() calls to append to a single sk_buff and ubuf_info. Modify the notification path to return an inclusive range of notifications [N..N+m]. Add skb_zerocopy_realloc() to reuse ubuf_info across sendmsg() calls and modify the notification path to return a range. For the case of reliable ordered transmission (TCP), only the upper value of the range to be read, as the lower value is guaranteed to be 1 above the last read notification. Additionally, coalesce notifications in this common case: if an skb_uarg [1, 1] is queued while [0, 0] is already on the queue, just modify the head of the queue to read [0, 1]. Signed-off-by: Willem de Bruijn --- include/linux/skbuff.h | 21 +++- net/core/skbuff.c | 92 +++--- 2 files changed, 107 insertions(+), 6 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index c7b42272b409..eedac9fd3f0f 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -406,13 +406,21 @@ enum { struct ubuf_info { void (*callback)(struct ubuf_info *, bool zerocopy_success); void *ctx; - unsigned long desc; + union { + unsigned long desc; + struct { + u32 id; + u16 len; + }; + }; atomic_t refcnt; }; #define skb_uarg(SKB) ((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg)) struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size); +struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size, + struct ubuf_info *uarg); static inline void sock_zerocopy_get(struct ubuf_info *uarg) { @@ -420,6 +428,7 @@ static inline void sock_zerocopy_get(struct ubuf_info *uarg) } void sock_zerocopy_put(struct ubuf_info *uarg); +void sock_zerocopy_put_abort(struct ubuf_info *uarg); void sock_zerocopy_callback(struct ubuf_info *uarg, bool success); @@ -1276,6 
+1285,16 @@ static inline void skb_zcopy_clear(struct sk_buff *skb) } } +static inline void skb_zcopy_abort(struct sk_buff *skb) +{ + struct ubuf_info *uarg = skb_zcopy(skb); + + if (uarg) { + sock_zerocopy_put_abort(uarg); + skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; + } +} + /** * skb_queue_empty - check if a queue is empty * @list: queue head diff --git a/net/core/skbuff.c b/net/core/skbuff.c index fcbdc91b2d24..7a1d6e7703a6 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -928,7 +928,8 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size) uarg = (void *)skb->cb; uarg->callback = sock_zerocopy_callback; - uarg->desc = atomic_inc_return(>sk_zckey) - 1; + uarg->id = ((u32)atomic_inc_return(>sk_zckey)) - 1; + uarg->len = 1; atomic_set(>refcnt, 0); sock_hold(sk); @@ -941,24 +942,94 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info *uarg) return container_of((void *)uarg, struct sk_buff, cb); } +struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size, + struct ubuf_info *uarg) +{ + if (uarg) { + u32 next; + + /* realloc only when socket is locked (TCP, UDP cork), +* so uarg->len and sk_zckey access is serialized +*/ + BUG_ON(!sock_owned_by_user(sk)); + + if (unlikely(uarg->len == USHRT_MAX - 1)) + return NULL; + + next = (u32)atomic_read(>sk_zckey); + if ((u32)(uarg->id + uarg->len) == next) { + uarg->len++; + atomic_set(>sk_zckey, ++next); + return uarg; + } + } + + return sock_zerocopy_alloc(sk, size); +} +EXPORT_SYMBOL_GPL(sock_zerocopy_realloc); + +static bool skb_zerocopy_notify_extend(struct sk_buff *skb, u32 lo, u16 len) +{ + struct sock_exterr_skb *serr = SKB_EXT_ERR(skb); + s64 sum_len; + u32 old_lo, old_hi; + + old_lo = serr->ee.ee_info; + old_hi = serr->ee.ee_data; + sum_len = old_hi - old_lo + 1 + len; + if (old_hi < old_lo) + sum_len += (1ULL << 32); + + if (sum_len >= (1ULL << 32)) + return false; + + if (lo != old_hi + 1) + return false; + + serr->ee.ee_data += len; + return true; +} + void 
sock_zerocopy_callback(struct ubuf_info *uarg, bool success) { struct sock_exterr_skb *serr; - struct sk_buff *skb = skb_from_uarg(uarg); + struct sk_buff *head,
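The coalescing rule described above — fold a new notification into the queued one only when the ranges are contiguous and the merged length still fits in the 32-bit id space — can be sketched as a small userspace model. `struct notif` and `notif_extend` are illustrative names, not the kernel's types, but the arithmetic mirrors skb_zerocopy_notify_extend() from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace model of skb_zerocopy_notify_extend(): a queued
 * notification covers zerocopy ids [lo, hi] (32-bit, may wrap);
 * a new range starting at 'lo' with 'len' ids is folded into it
 * only if it is contiguous and the merged length fits in 2^32. */
struct notif {
	uint32_t lo, hi;
};

static bool notif_extend(struct notif *n, uint32_t lo, uint16_t len)
{
	/* unsigned 32-bit subtraction handles a wrapped [lo, hi] */
	uint64_t sum_len = (uint64_t)(uint32_t)(n->hi - n->lo) + 1 + len;

	if (sum_len >= (1ULL << 32))
		return false;	/* merged range would overflow the id space */
	if (lo != (uint32_t)(n->hi + 1))
		return false;	/* not contiguous: must queue separately */
	n->hi += len;
	return true;
}
```

On the TCP path this is what lets many small sendmsg() calls collapse into a single [0, m] notification at the head of the error queue.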
[PATCH RFC v2 06/12] sock: sendmsg zerocopy ulimit
From: Willem de BruijnBound the number of pages that a user may pin. Follow the lead of perf tools to maintain a per-user bound on memory locked pages commit 789f90fcf6b0 ("perf_counter: per user mlock gift") Signed-off-by: Willem de Bruijn --- include/linux/sched.h | 2 +- include/linux/skbuff.h | 5 + net/core/skbuff.c | 48 3 files changed, 54 insertions(+), 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index ad3ec9ec61f7..943714f8e91a 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -905,7 +905,7 @@ struct user_struct { struct hlist_node uidhash_node; kuid_t uid; -#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) +#if defined(CONFIG_PERF_EVENTS) || defined(CONFIG_BPF_SYSCALL) || defined(CONFIG_NET) atomic_long_t locked_vm; #endif }; diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index eedac9fd3f0f..a38308b10d76 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -414,6 +414,11 @@ struct ubuf_info { }; }; atomic_t refcnt; + + struct mmpin { + struct user_struct *user; + int num_pg; + } mmp; }; #define skb_uarg(SKB) ((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg)) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 7a1d6e7703a6..b86e196d6dec 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -914,6 +914,44 @@ struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src) } EXPORT_SYMBOL_GPL(skb_morph); +static int mm_account_pinned_pages(struct mmpin *mmp, size_t size) +{ + unsigned long max_pg, num_pg, new_pg, old_pg; + struct user_struct *user; + + if (capable(CAP_IPC_LOCK) || !size) + return 0; + + num_pg = (size >> PAGE_SHIFT) + 2; /* worst case */ + max_pg = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; + user = mmp->user ? 
: current_user(); + + do { + old_pg = atomic_long_read(>locked_vm); + new_pg = old_pg + num_pg; + if (new_pg > max_pg) + return -ENOMEM; + } while (atomic_long_cmpxchg(>locked_vm, old_pg, new_pg) != +old_pg); + + if (!mmp->user) { + mmp->user = get_uid(user); + mmp->num_pg = num_pg; + } else { + mmp->num_pg += num_pg; + } + + return 0; +} + +static void mm_unaccount_pinned_pages(struct mmpin *mmp) +{ + if (mmp->user) { + atomic_long_sub(mmp->num_pg, >user->locked_vm); + free_uid(mmp->user); + } +} + /* must only be called from process context */ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size) { @@ -926,6 +964,12 @@ struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size) BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb)); uarg = (void *)skb->cb; + uarg->mmp.user = NULL; + + if (mm_account_pinned_pages(>mmp, size)) { + kfree_skb(skb); + return NULL; + } uarg->callback = sock_zerocopy_callback; uarg->id = ((u32)atomic_inc_return(>sk_zckey)) - 1; @@ -958,6 +1002,8 @@ struct ubuf_info *sock_zerocopy_realloc(struct sock *sk, size_t size, next = (u32)atomic_read(>sk_zckey); if ((u32)(uarg->id + uarg->len) == next) { + if (mm_account_pinned_pages(>mmp, size)) + return NULL; uarg->len++; atomic_set(>sk_zckey, ++next); return uarg; @@ -1037,6 +1083,8 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_callback); void sock_zerocopy_put(struct ubuf_info *uarg) { if (uarg && atomic_dec_and_test(>refcnt)) { + mm_unaccount_pinned_pages(>mmp); + if (uarg->callback) uarg->callback(uarg, true); else -- 2.11.0.483.g087da7b7c-goog
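The retry loop in mm_account_pinned_pages() above can be modeled in userspace with C11 atomics. The function name, the `_Atomic long` counter and the `max_pg` parameter stand in for `user->locked_vm` and the RLIMIT_MEMLOCK-derived bound, and are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace model of mm_account_pinned_pages(): optimistically
 * compute the new locked-page count and retry if another thread
 * raced in between, refusing any charge that would exceed the
 * per-user limit. No lock is taken at any point. */
static bool account_pinned(_Atomic long *locked_vm, long num_pg, long max_pg)
{
	long old_pg, new_pg;

	do {
		old_pg = atomic_load(locked_vm);
		new_pg = old_pg + num_pg;
		if (new_pg > max_pg)
			return false;	/* over the memlock bound: reject */
	} while (!atomic_compare_exchange_weak(locked_vm, &old_pg, new_pg));

	return true;
}
```

The compare-and-exchange replaces the kernel's atomic_long_cmpxchg(): if another charger slipped in, the loop recomputes against the fresh value rather than blocking.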
[PATCH RFC v2 00/12] socket sendmsg MSG_ZEROCOPY
From: Willem de Bruijn

RFCv2:

I have received a few requests for status and rebased code of this feature. We have been running this code internally, discovering and fixing various bugs. With net-next closed, now seems like a good time to share an updated patchset with fixes. The rebase from RFCv1/v4.2 was mostly straightforward: mainly iov_iter changes. Full changelog:

RFC -> RFCv2:
  - review comment: do not loop skb with zerocopy frags onto rx:
      add skb_orphan_frags_rx to orphan even refcounted frags
      call this in __netif_receive_skb_core, deliver_skb and tun:
      the same as 1080e512d44d ("net: orphan frags on receive")
  - fix: hold an explicit sk reference on each notification skb.
      previously relied on the reference (or wmem) held by the data
      skb that would trigger notification, but this breaks on skb_orphan.
  - fix: when aborting a send, do not inc the zerocopy counter
      this caused gaps in the notification chain
  - fix: in packet with SOCK_DGRAM, pull ll headers before calling
      zerocopy_sg_from_iter
  - fix: if sock_zerocopy_realloc does not allow coalescing, do not
      fail, just allocate a new ubuf
  - fix: in tcp, check return value of second allocation attempt
  - chg: allocate notification skbs from optmem
      to avoid affecting tcp write queue accounting (TSQ)
  - chg: limit #locked pages (ulimit) per user instead of per process
  - chg: grow notification ids from 16 to 32 bit
      - pass range [lo, hi] through 32 bit fields ee_info and ee_data
  - chg: rebased to davem-net-next on top of v4.10-rc7
  - add: limit notification coalescing
      sharing ubufs limits overhead, but delays notification until
      the last packet is released, possibly unbounded. Add a cap.
  - tests: add snd_zerocopy_lo pf_packet test
  - tests: two bugfixes (add do_flush_tcp, ++sent not only in debug)

The change to allocate notification skbuffs from optmem requires ensuring that net.core.optmem is at least a few 100KB.
To experiment, run:

  sysctl -w net.core.optmem_max=1048576

The snd_zerocopy_lo benchmarks reported in the individual patches were rerun for RFCv2. To make them work, calls to skb_orphan_frags_rx were replaced with skb_orphan_frags to allow looping to local sockets. The netperf results below are also rerun with v2.

In application load, copy avoidance shows a roughly 5% systemwide reduction in cycles when streaming large flows and a 4-8% reduction in wall clock time on early tensorflow test workloads.

Overview (from original RFC):

Add zerocopy socket sendmsg() support with flag MSG_ZEROCOPY. Implement the feature for TCP, UDP, RAW and packet sockets. This is a generalization of a previous packet socket RFC patch
http://patchwork.ozlabs.org/patch/413184/

On a send call with MSG_ZEROCOPY, the kernel pins the user pages and creates skbuff fragments directly from these pages. On tx completion, it notifies the socket owner that it is safe to modify memory by queuing a completion notification onto the socket error queue.

The kernel already implements such copy avoidance with vmsplice plus splice and with ubuf_info for tun and virtio. Extend the second with features required by TCP and others: reference counting to support cloning (retransmit queue) and shared fragments (GSO) and notification coalescing to handle corking.

Notifications are queued onto the socket error queue as a range [N, N+m], where N is a per-socket counter incremented on each successful zerocopy send call.

* Performance

The below table shows cycles reported by perf for a netperf process sending a single 10 Gbps TCP_STREAM. The first three columns show Mcycles spent in the netperf process context. The second three columns show time spent systemwide (-a -C A,B) on the two cpus that run the process and interrupt handler. Reported is the median of at least 3 runs. std is a standard netperf, zc uses zerocopy and % is the ratio.
Netperf is pinned to cpu 2, network interrupts to cpu3, rps and rfs are disabled and the kernel is booted with idle=halt.

  NETPERF=./netperf -t TCP_STREAM -H $host -T 2 -l 30 -- -m $size

  perf stat -e cycles $NETPERF
  perf stat -C 2,3 -a -e cycles $NETPERF

           --process cycles--        ----cpu cycles----
           std      zc       %       std      zc       %
  4K       27,609   11,217   41      49,217   39,175   79
  16K      21,370    3,823   18      43,540   29,213   67
  64K      20,557    2,312   11      42,189   26,910   64
  256K     21,110    2,134   10      43,006   27,104   63
  1M       20,987    1,610    8      42,759   25,931   61

Perf record indicates the main source of these differences. Process cycles only at 1M writes (perf record; perf report -n):

std:
  Samples: 42K of event 'cycles', Event count (approx.): 21258597313
   79.41%  33884  netperf  [kernel.kallsyms]  [k] copy_user_generic_string
    3.27%   1396
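The changelog notes that the range [lo, hi] is passed through the 32-bit ee_info and ee_data fields of the extended error. A wrap-aware reader-side count of completed sends can be sketched as follows; `struct zc_range` and the helper are illustrative stand-ins, not the uapi types:

```c
#include <stdint.h>

/* Model of decoding one coalesced completion read off the error
 * queue: ee_info carries the low id, ee_data the high id. The
 * ids are 32-bit counters and may wrap around zero; unsigned
 * 32-bit subtraction makes the count correct across the wrap. */
struct zc_range {
	uint32_t ee_info;	/* lo */
	uint32_t ee_data;	/* hi */
};

static uint64_t zc_range_count(const struct zc_range *r)
{
	return (uint64_t)(uint32_t)(r->ee_data - r->ee_info) + 1;
}
```

For reliable ordered transports (TCP) the application only needs ee_data, since the next range's low end is always one past the last notification read.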
[PATCH RFC v2 02/12] sock: skb_copy_ubufs support for compound pages
From: Willem de Bruijn

Refine skb_copy_ubufs to support compound pages. With upcoming TCP and UDP zerocopy sendmsg, such fragments may appear. These skbuffs can have both kernel and zerocopy fragments, e.g., when corking. Avoid unnecessary copying of fragments that have no userspace reference.

It is not safe to modify skb frags when the skbuff is shared. This should not happen. Fail loudly if we find an unexpected edge case.

Signed-off-by: Willem de Bruijn
---
 net/core/skbuff.c | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f3557958e9bf..67e4216fca01 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -944,6 +944,9 @@ EXPORT_SYMBOL_GPL(skb_morph);
  * If this function is called from an interrupt gfp_mask() must be
  * %GFP_ATOMIC.
  *
+ * skb_shinfo(skb) can only be safely modified when not accessed
+ * concurrently. Fail if the skb is shared or cloned.
+ *
  * Returns 0 on success or a negative error code on failure
  * to allocate kernel memory to copy to.
 */
@@ -954,11 +957,29 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 	struct page *page, *head = NULL;
 	struct ubuf_info *uarg = skb_shinfo(skb)->destructor_arg;
 
+	if (skb_shared(skb) || skb_cloned(skb)) {
+		WARN_ON_ONCE(1);
+		return -EINVAL;
+	}
+
 	for (i = 0; i < num_frags; i++) {
 		u8 *vaddr;
+		unsigned int order = 0;
+		gfp_t mask = gfp_mask;
 		skb_frag_t *f = &skb_shinfo(skb)->frags[i];
 
-		page = alloc_page(gfp_mask);
+		page = skb_frag_page(f);
+		if (page_count(page) == 1) {
+			skb_frag_ref(skb, i);
+			goto copy_done;
+		}
+
+		if (f->size > PAGE_SIZE) {
+			order = get_order(f->size);
+			mask |= __GFP_COMP;
+		}
+
+		page = alloc_pages(mask, order);
 		if (!page) {
 			while (head) {
 				struct page *next = (struct page *)page_private(head);
@@ -971,6 +992,7 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask)
 		memcpy(page_address(page), vaddr + f->page_offset,
 		       skb_frag_size(f));
 		kunmap_atomic(vaddr);
+copy_done:
 		set_page_private(page, (unsigned long)head);
 		head = page;
 	}
-- 
2.11.0.483.g087da7b7c-goog
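The allocation-size choice in the patched skb_copy_ubufs() — an order-0 page when the fragment fits one page, a compound allocation (`__GFP_COMP` in the patch) otherwise — reduces to picking the smallest covering order. A userspace sketch, with an illustrative page-size constant standing in for the kernel's PAGE_SIZE:

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL	/* illustrative; the kernel uses PAGE_SIZE */

/* Model of get_order() as used in the patch: the smallest order
 * whose block (PAGE_SIZE << order) covers the fragment size. */
static int frag_alloc_order(size_t size)
{
	int order = 0;

	while ((MODEL_PAGE_SIZE << order) < size)
		order++;
	return order;
}
```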
[PATCH RFC v2 04/12] sock: enable sendmsg zerocopy
From: Willem de Bruijn

Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with skb_zerocopy_clone() wherever needed due to skb split, merge, resize or clone.

Split skb_orphan_frags into two variants. The split, merge, .. paths support reference counted zerocopy buffers, so do not do a deep copy. Add skb_orphan_frags_rx for paths that may loop packets to receive sockets. That is not allowed, as it may cause unbounded latency. Deep copy all zerocopy copy buffers, ref-counted or not, in this path.

The exact locations to modify were chosen by exhaustively searching through all code that might modify skb_frag references and/or the SKBTX_DEV_ZEROCOPY tx_flags bit.

The changes err on the safe side, in two ways.

(1) legacy ubuf_info paths virtio and tap are not modified. They keep a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags still call skb_copy_ubufs and thus copy frags in this case.

(2) not all copies deep in the stack are addressed yet. skb_shift, skb_split and skb_try_coalesce can be refined to avoid copying. These are not in the hot path and this patch is hairy enough as is, so that is left for future refinement.
Signed-off-by: Willem de Bruijn --- drivers/net/tun.c | 2 +- drivers/vhost/net.c| 1 + include/linux/skbuff.h | 16 ++-- net/core/dev.c | 4 ++-- net/core/skbuff.c | 52 +- 5 files changed, 40 insertions(+), 35 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 30863e378925..b80c7fdcb05b 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -880,7 +880,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) sk_filter(tfile->socket.sk, skb)) goto drop; - if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) + if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC))) goto drop; skb_tx_timestamp(skb); diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 2fe35354f20e..f7ff72ed892f 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -454,6 +454,7 @@ static void handle_tx(struct vhost_net *net) ubuf->callback = vhost_zerocopy_callback; ubuf->ctx = nvq->ubufs; ubuf->desc = nvq->upend_idx; + atomic_set(>refcnt, 1); msg.msg_control = ubuf; msg.msg_controllen = sizeof(ubuf); ubufs = nvq->ubufs; diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index c99538b258c9..c7b42272b409 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -2448,7 +2448,7 @@ static inline void skb_orphan(struct sk_buff *skb) } /** - * skb_orphan_frags - orphan the frags contained in a buffer + * skb_orphan_frags - make a local copy of non-refcounted user frags * @skb: buffer to orphan frags from * @gfp_mask: allocation mask for replacement pages * @@ -2458,7 +2458,17 @@ static inline void skb_orphan(struct sk_buff *skb) */ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask) { - if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY))) + if (likely(!skb_zcopy(skb))) + return 0; + if (skb_uarg(skb)->callback == sock_zerocopy_callback) + return 0; + return skb_copy_ubufs(skb, gfp_mask); +} + +/* Frags must be orphaned, even if refcounted, if skb might loop to rx path */ +static inline int 
skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask) +{ + if (likely(!skb_zcopy(skb))) return 0; return skb_copy_ubufs(skb, gfp_mask); } @@ -2890,6 +2900,8 @@ static inline int skb_add_data(struct sk_buff *skb, static inline bool skb_can_coalesce(struct sk_buff *skb, int i, const struct page *page, int off) { + if (skb_zcopy(skb)) + return false; if (i) { const struct skb_frag_struct *frag = _shinfo(skb)->frags[i - 1]; diff --git a/net/core/dev.c b/net/core/dev.c index 304f2deae5f9..7879225818da 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1801,7 +1801,7 @@ static inline int deliver_skb(struct sk_buff *skb, struct packet_type *pt_prev, struct net_device *orig_dev) { - if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) + if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC))) return -ENOMEM; atomic_inc(>users); return pt_prev->func(skb, skb->dev, pt_prev, orig_dev); @@ -4173,7 +4173,7 @@ static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc) } if (pt_prev) { - if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) + if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
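The tx/rx split above amounts to a small decision table. A sketch with illustrative names (the real helpers test SKBTX_DEV_ZEROCOPY and whether the ubuf callback is the refcounted sock_zerocopy_callback):

```c
#include <stdbool.h>

/* Model of skb_orphan_frags() vs skb_orphan_frags_rx(): on the
 * tx path, refcounted (generic MSG_ZEROCOPY) ubufs may be shared
 * safely, so only the legacy non-refcounted ubufs (virtio, tap)
 * get a deep copy; on a loop back toward receive sockets, every
 * zerocopy frag must be copied to avoid unbounded latency for
 * the sending process. */
static bool must_copy_tx(bool zerocopy, bool refcounted)
{
	return zerocopy && !refcounted;
}

static bool must_copy_rx(bool zerocopy)
{
	return zerocopy;
}
```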
[PATCH RFC v2 01/12] sock: allocate skbs from optmem
From: Willem de Bruijn

Add sock_omalloc and sock_ofree to be able to allocate control skbs, for instance for looping errors onto sk_error_queue.

The transmit budget (sk_wmem_alloc) is involved in transmit skb shaping, most notably in TCP Small Queues. Using this budget for control packets would impact transmission.

Signed-off-by: Willem de Bruijn
---
 include/net/sock.h |  2 ++
 net/core/sock.c    | 27 +++
 2 files changed, 29 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 9ccefa5c5487..c1a8b2cbc75e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1531,6 +1531,8 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 			     gfp_t priority);
 void __sock_wfree(struct sk_buff *skb);
 void sock_wfree(struct sk_buff *skb);
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+			     gfp_t priority);
 void skb_orphan_partial(struct sk_buff *skb);
 void sock_rfree(struct sk_buff *skb);
 void sock_efree(struct sk_buff *skb);
diff --git a/net/core/sock.c b/net/core/sock.c
index e7d74940e863..57a7da46ac52 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1772,6 +1772,33 @@ struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
 }
 EXPORT_SYMBOL(sock_wmalloc);
 
+static void sock_ofree(struct sk_buff *skb)
+{
+	struct sock *sk = skb->sk;
+
+	atomic_sub(skb->truesize, &sk->sk_omem_alloc);
+}
+
+struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
+			     gfp_t priority)
+{
+	struct sk_buff *skb;
+
+	/* small safe race: SKB_TRUESIZE may differ from final skb->truesize */
+	if (atomic_read(&sk->sk_omem_alloc) + SKB_TRUESIZE(size) >
+	    sysctl_optmem_max)
+		return NULL;
+
+	skb = alloc_skb(size, priority);
+	if (!skb)
+		return NULL;
+
+	atomic_add(skb->truesize, &sk->sk_omem_alloc);
+	skb->sk = sk;
+	skb->destructor = sock_ofree;
+	return skb;
+}
+
 /*
  * Allocate a memory block from the socket's option memory buffer.
  */
-- 
2.11.0.483.g087da7b7c-goog
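The budget check in sock_omalloc() — admit an allocation only if the current optmem charge plus the buffer's estimated size stays under optmem_max, then charge the real size — can be modeled in userspace with C11 atomics. Names and parameters are illustrative:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the sock_omalloc()/sock_ofree() accounting: the check
 * precedes the atomic add, so a small overshoot race between two
 * concurrent chargers is tolerated, just as the patch's comment
 * about SKB_TRUESIZE notes. */
static bool omem_charge(_Atomic long *omem_alloc, long truesize, long optmem_max)
{
	if (atomic_load(omem_alloc) + truesize > optmem_max)
		return false;	/* would blow the option-memory budget */
	atomic_fetch_add(omem_alloc, truesize);
	return true;
}

static void omem_uncharge(_Atomic long *omem_alloc, long truesize)
{
	atomic_fetch_sub(omem_alloc, truesize);	/* the sock_ofree() side */
}
```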
[PATCH RFC v2 08/12] tcp: enable sendmsg zerocopy
From: Willem de Bruijn

Enable support for MSG_ZEROCOPY in the TCP stack. Data that is sent to a remote host will be zerocopy. TSO and GSO are supported.

Tested:
  A 10x TCP_STREAM between two hosts showed a reduction in netserver
  process cycles by up to 70%, depending on packet size. Systemwide,
  savings are of course much less pronounced, at up to 20% best case.

  loopback test snd_zerocopy_lo -t -z produced:

  without zerocopy (-t):
    rx=102852 (6418 MB) tx=102852 txc=0
    rx=213216 (13305 MB) tx=213216 txc=0
    rx=325266 (20298 MB) tx=325266 txc=0
    rx=437082 (27275 MB) tx=437082 txc=0

  with zerocopy (-t -z):
    rx=238446 (14880 MB) tx=238446 txc=238434
    rx=500076 (31207 MB) tx=500076 txc=500060
    rx=763728 (47660 MB) tx=763728 txc=763706
    rx=1028184 (64163 MB) tx=1028184 txc=1028156

  This test opens a pair of local sockets, on one calls sendmsg with
  64KB and optionally MSG_ZEROCOPY and on the other reads the initial
  bytes. The receiver truncates, so this is strictly an upper bound
  on what is achievable. It is more representative of sending data
  out of a physical NIC (when payload is not touched, either).
Signed-off-by: Willem de Bruijn --- net/ipv4/tcp.c | 37 ++--- 1 file changed, 34 insertions(+), 3 deletions(-) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index da385ae997a3..4884f4ff14d2 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1051,13 +1051,17 @@ static int linear_payload_sz(bool first_skb) return 0; } -static int select_size(const struct sock *sk, bool sg, bool first_skb) +static int select_size(const struct sock *sk, bool sg, bool first_skb, + bool zerocopy) { const struct tcp_sock *tp = tcp_sk(sk); int tmp = tp->mss_cache; if (sg) { if (sk_can_gso(sk)) { + if (zerocopy) + return 0; + tmp = linear_payload_sz(first_skb); } else { int pgbreak = SKB_MAX_HEAD(MAX_TCP_HEADER); @@ -1121,6 +1125,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; struct sockcm_cookie sockc; + struct ubuf_info *uarg = NULL; int flags, err, copied = 0; int mss_now = 0, size_goal, copied_syn = 0; bool process_backlog = false; @@ -1190,6 +1195,21 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) sg = !!(sk->sk_route_caps & NETIF_F_SG); + if (sg && (flags & MSG_ZEROCOPY) && size && !uarg) { + skb = tcp_send_head(sk) ? 
tcp_write_queue_tail(sk) : NULL; + uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb)); + if (!uarg) { + if ((err = sk_stream_wait_memory(sk, )) != 0) + goto out_err; + uarg = sock_zerocopy_realloc(sk, size, skb_zcopy(skb)); + if (!uarg) { + err = -ENOBUFS; + goto out_err; + } + } + sock_zerocopy_get(uarg); + } + while (msg_data_left(msg)) { int copy = 0; int max = size_goal; @@ -1217,7 +1237,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) } first_skb = skb_queue_empty(>sk_write_queue); skb = sk_stream_alloc_skb(sk, - select_size(sk, sg, first_skb), + select_size(sk, sg, first_skb, uarg), sk->sk_allocation, first_skb); if (!skb) @@ -1253,7 +1273,7 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) err = skb_add_data_nocache(sk, skb, >msg_iter, copy); if (err) goto do_fault; - } else { + } else if (!uarg) { bool merge = true; int i = skb_shinfo(skb)->nr_frags; struct page_frag *pfrag = sk_page_frag(sk); @@ -1291,6 +1311,15 @@ int tcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t size) page_ref_inc(pfrag->page); } pfrag->offset += copy; + } else { + err = skb_zerocopy_add_frags_iter(sk, skb, + >msg_iter, + copy, uarg); + if (err == -EMSGSIZE || err == -EEXIST) +
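The select_size() change in this patch can be reduced to one decision: under scatter-gather with GSO, a zerocopy send reserves no linear payload at all, so every byte lands in fragments built from pinned user pages. A sketch with illustrative parameters (`linear_sz` stands for the value linear_payload_sz() would return):

```c
#include <stdbool.h>

/* Model of the patched select_size(): only the sk_can_gso()
 * branch changes — with zerocopy it returns 0 so the skb head
 * carries no copied payload; other branches keep their
 * pre-patch sizing (represented here by linear_sz). */
static int select_linear_size(bool can_gso, bool zerocopy, int linear_sz)
{
	if (can_gso)
		return zerocopy ? 0 : linear_sz;
	return linear_sz;	/* placeholder for the unchanged non-GSO path */
}
```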
[PATCH RFC v2 03/12] sock: add generic socket zerocopy
From: Willem de BruijnThe kernel supports zerocopy sendmsg in virtio and tap. Expand the infrastructure to support other socket types. Introduce a completion notification channel over the socket error queue. Notifications are returned with ee_origin SO_EE_ORIGIN_ZEROCOPY. ee_errno is 0 to avoid blocking the send/recv path on receiving notifications. Add reference counting, to support the skb split, merge, resize and clone operations possible with SOCK_STREAM and other socket types. The patch does not yet modify any datapaths. Signed-off-by: Willem de Bruijn --- include/linux/skbuff.h| 46 include/linux/socket.h| 1 + include/net/sock.h| 2 + include/uapi/linux/errqueue.h | 1 + net/core/datagram.c | 35 net/core/skbuff.c | 120 ++ net/core/sock.c | 2 + 7 files changed, 196 insertions(+), 11 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 69ccd2636911..c99538b258c9 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -390,6 +390,7 @@ enum { SKBTX_SCHED_TSTAMP = 1 << 6, }; +#define SKBTX_ZEROCOPY_FRAG(SKBTX_DEV_ZEROCOPY | SKBTX_SHARED_FRAG) #define SKBTX_ANY_SW_TSTAMP(SKBTX_SW_TSTAMP| \ SKBTX_SCHED_TSTAMP) #define SKBTX_ANY_TSTAMP (SKBTX_HW_TSTAMP | SKBTX_ANY_SW_TSTAMP) @@ -406,8 +407,27 @@ struct ubuf_info { void (*callback)(struct ubuf_info *, bool zerocopy_success); void *ctx; unsigned long desc; + atomic_t refcnt; }; +#define skb_uarg(SKB) ((struct ubuf_info *)(skb_shinfo(SKB)->destructor_arg)) + +struct ubuf_info *sock_zerocopy_alloc(struct sock *sk, size_t size); + +static inline void sock_zerocopy_get(struct ubuf_info *uarg) +{ + atomic_inc(>refcnt); +} + +void sock_zerocopy_put(struct ubuf_info *uarg); + +void sock_zerocopy_callback(struct ubuf_info *uarg, bool success); + +bool skb_zerocopy_alloc(struct sk_buff *skb, size_t size); +int skb_zerocopy_add_frags_iter(struct sock *sk, struct sk_buff *skb, + struct iov_iter *iter, int len, + struct ubuf_info *uarg); + /* This data is invariant across clones and lives at * 
the end of the header data, ie. at skb->end. */ @@ -1230,6 +1250,32 @@ static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb) return _shinfo(skb)->hwtstamps; } +static inline struct ubuf_info *skb_zcopy(struct sk_buff *skb) +{ + bool is_zcopy = skb && skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY; + + return is_zcopy ? skb_uarg(skb) : NULL; +} + +static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg) +{ + if (uarg) { + sock_zerocopy_get(uarg); + skb_shinfo(skb)->destructor_arg = uarg; + skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG; + } +} + +static inline void skb_zcopy_clear(struct sk_buff *skb) +{ + struct ubuf_info *uarg = skb_zcopy(skb); + + if (uarg) { + sock_zerocopy_put(uarg); + skb_shinfo(skb)->tx_flags &= ~SKBTX_DEV_ZEROCOPY; + } +} + /** * skb_queue_empty - check if a queue is empty * @list: queue head diff --git a/include/linux/socket.h b/include/linux/socket.h index 082027457825..c2d6ec354bee 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -287,6 +287,7 @@ struct ucred { #define MSG_BATCH 0x4 /* sendmmsg(): more messages coming */ #define MSG_EOF MSG_FIN +#define MSG_ZEROCOPY 0x400 /* Use user data in kernel path */ #define MSG_FASTOPEN 0x2000 /* Send data in TCP SYN */ #define MSG_CMSG_CLOEXEC 0x4000/* Set close_on_exec for file descriptor received through diff --git a/include/net/sock.h b/include/net/sock.h index c1a8b2cbc75e..74ad7d7c5eed 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -288,6 +288,7 @@ struct sock_common { *@sk_stamp: time stamp of last packet received *@sk_tsflags: SO_TIMESTAMPING socket options *@sk_tskey: counter to disambiguate concurrent tstamp requests + *@sk_zckey: counter to order MSG_ZEROCOPY notifications *@sk_socket: Identd and reporting IO signals *@sk_user_data: RPC layer private data *@sk_frag: cached page frag @@ -455,6 +456,7 @@ struct sock { u16 sk_tsflags; u8 sk_shutdown; u32 sk_tskey; + atomic_tsk_zckey; struct socket *sk_socket; 
void*sk_user_data; #ifdef CONFIG_SECURITY diff --git a/include/uapi/linux/errqueue.h b/include/uapi/linux/errqueue.h index 07bdce1f444a..0f15a77c9e39 100644 --- a/include/uapi/linux/errqueue.h +++
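The reference counting this patch introduces — each skb sharing the buffer takes a reference via skb_zcopy_set()/sock_zerocopy_get() and drops it on clear, with the completion callback firing only when the last user is gone — can be modeled in userspace as follows. `struct ubuf_model` is illustrative, and `completed` stands in for firing sock_zerocopy_callback():

```c
#include <stdatomic.h>

/* Userspace model of the ubuf_info refcount lifecycle: clones on
 * the retransmit queue and shared GSO fragments each hold a
 * reference, so completion is deferred until every holder is done. */
struct ubuf_model {
	_Atomic int refcnt;
	int completed;	/* stands in for sock_zerocopy_callback() */
};

static void ubuf_get(struct ubuf_model *u)
{
	atomic_fetch_add(&u->refcnt, 1);
}

static void ubuf_put(struct ubuf_model *u)
{
	if (atomic_fetch_sub(&u->refcnt, 1) == 1)
		u->completed = 1;	/* last reference: notify sender */
}
```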
[PATCH net V2 5/5] net/mlx4_en: Use __skb_fill_page_desc()
From: Eric DumazetOr we might miss the fact that a page was allocated from memory reserves. Fixes: dceeab0e5258 ("mlx4: support __GFP_MEMALLOC for rx") Signed-off-by: Eric Dumazet Signed-off-by: Tariq Toukan --- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c index cc003fdf0ed9..eca31f443909 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c @@ -603,10 +603,10 @@ static int mlx4_en_complete_rx_desc(struct mlx4_en_priv *priv, dma_sync_single_for_cpu(priv->ddev, dma, frag_info->frag_size, DMA_FROM_DEVICE); - /* Save page reference in skb */ - __skb_frag_set_page(_frags_rx[nr], frags[nr].page); - skb_frag_size_set(_frags_rx[nr], frag_info->frag_size); - skb_frags_rx[nr].page_offset = frags[nr].page_offset; + __skb_fill_page_desc(skb, nr, frags[nr].page, +frags[nr].page_offset, +frag_info->frag_size); + skb->truesize += frag_info->frag_stride; frags[nr].page = NULL; } -- 1.8.3.1
[PATCH net V2 1/5] net/mlx4: Change ENOTSUPP to EOPNOTSUPP
From: Or GerlitzAs ENOTSUPP is specific to NFS, change the return error value to EOPNOTSUPP in various places in the mlx4 driver. Signed-off-by: Or Gerlitz Suggested-by: Yotam Gigi Reviewed-by: Matan Barak Signed-off-by: Tariq Toukan --- drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c| 2 +- drivers/net/ethernet/mellanox/mlx4/fw.c | 2 +- drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +- drivers/net/ethernet/mellanox/mlx4/main.c | 6 +++--- drivers/net/ethernet/mellanox/mlx4/mr.c | 2 +- drivers/net/ethernet/mellanox/mlx4/qp.c | 2 +- drivers/net/ethernet/mellanox/mlx4/resource_tracker.c | 2 +- 7 files changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c index b04760a5034b..1dae8e40fb25 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c @@ -319,7 +319,7 @@ static int mlx4_en_ets_validate(struct mlx4_en_priv *priv, struct ieee_ets *ets) default: en_err(priv, "TC[%d]: Not supported TSA: %d\n", i, ets->tc_tsa[i]); - return -ENOTSUPP; + return -EOPNOTSUPP; } } diff --git a/drivers/net/ethernet/mellanox/mlx4/fw.c b/drivers/net/ethernet/mellanox/mlx4/fw.c index 84bab9f0732e..34a0c24e6844 100644 --- a/drivers/net/ethernet/mellanox/mlx4/fw.c +++ b/drivers/net/ethernet/mellanox/mlx4/fw.c @@ -2436,7 +2436,7 @@ int mlx4_config_dev_retrieval(struct mlx4_dev *dev, #define CONFIG_DEV_RX_CSUM_MODE_PORT2_BIT_OFFSET 4 if (!(dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_CONFIG_DEV)) - return -ENOTSUPP; + return -EOPNOTSUPP; err = mlx4_CONFIG_DEV_get(dev, _dev); if (err) diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c b/drivers/net/ethernet/mellanox/mlx4/intf.c index 8258d08acd8c..e00f627331cb 100644 --- a/drivers/net/ethernet/mellanox/mlx4/intf.c +++ b/drivers/net/ethernet/mellanox/mlx4/intf.c @@ -136,7 +136,7 @@ int mlx4_do_bond(struct mlx4_dev *dev, bool enable) LIST_HEAD(bond_list); if (!(dev->caps.flags2 & 
MLX4_DEV_CAP_FLAG2_PORT_REMAP)) - return -ENOTSUPP; + return -EOPNOTSUPP; ret = mlx4_disable_rx_port_check(dev, enable); if (ret) { diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c index bffa6f345f2f..55e4be51ee5a 100644 --- a/drivers/net/ethernet/mellanox/mlx4/main.c +++ b/drivers/net/ethernet/mellanox/mlx4/main.c @@ -1447,7 +1447,7 @@ int mlx4_port_map_set(struct mlx4_dev *dev, struct mlx4_port_map *v2p) int err; if (!(dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_PORT_REMAP)) - return -ENOTSUPP; + return -EOPNOTSUPP; mutex_lock(>bond_mutex); @@ -1884,7 +1884,7 @@ int mlx4_get_internal_clock_params(struct mlx4_dev *dev, struct mlx4_priv *priv = mlx4_priv(dev); if (mlx4_is_slave(dev)) - return -ENOTSUPP; + return -EOPNOTSUPP; if (!params) return -EINVAL; @@ -2384,7 +2384,7 @@ static int mlx4_init_hca(struct mlx4_dev *dev) /* Query CONFIG_DEV parameters */ err = mlx4_config_dev_retrieval(dev, ); - if (err && err != -ENOTSUPP) { + if (err && err != -EOPNOTSUPP) { mlx4_err(dev, "Failed to query CONFIG_DEV parameters\n"); } else if (!err) { dev->caps.rx_checksum_flags_port[1] = params.rx_csum_flags_port_1; diff --git a/drivers/net/ethernet/mellanox/mlx4/mr.c b/drivers/net/ethernet/mellanox/mlx4/mr.c index 395b5463cfd9..db65f72879e9 100644 --- a/drivers/net/ethernet/mellanox/mlx4/mr.c +++ b/drivers/net/ethernet/mellanox/mlx4/mr.c @@ -823,7 +823,7 @@ int mlx4_mw_alloc(struct mlx4_dev *dev, u32 pd, enum mlx4_mw_type type, !(dev->caps.flags & MLX4_DEV_CAP_FLAG_MEM_WINDOW)) || (type == MLX4_MW_TYPE_2 && !(dev->caps.bmme_flags & MLX4_BMME_FLAG_TYPE_2_WIN))) - return -ENOTSUPP; + return -EOPNOTSUPP; index = mlx4_mpt_reserve(dev); if (index == -1) diff --git a/drivers/net/ethernet/mellanox/mlx4/qp.c b/drivers/net/ethernet/mellanox/mlx4/qp.c index d1cd9c32a9ae..2d6abd4662b1 100644 --- a/drivers/net/ethernet/mellanox/mlx4/qp.c +++ b/drivers/net/ethernet/mellanox/mlx4/qp.c @@ -447,7 +447,7 @@ int mlx4_update_qp(struct mlx4_dev *dev, u32 
qpn, & MLX4_DEV_CAP_FLAG2_UPDATE_QP_SRC_CHECK_LB)) { mlx4_warn(dev, "Trying to set src check LB, but it isn't supported\n"); - err = -ENOTSUPP; +
[PATCH net V2 4/5] net/mlx4_core: Use cq quota in SRIOV when creating completion EQs
From: Jack Morgenstein

When creating EQs to handle CQ completion events for the PF or for VFs, we create enough EQE entries to handle completions for the max number of CQs that can use that EQ.

When SRIOV is activated, the max number of CQs a VF (or the PF) can obtain is its CQ quota (determined by the Hypervisor resource tracker). Therefore, when creating an EQ, the number of EQE entries that the VF should request for that EQ is the CQ quota value (and not the total number of CQs available in the FW).

Under SRIOV, the PF also must use its CQ quota, because the resource tracker also controls how many CQs the PF can obtain.

Using the FW total CQs instead of the CQ quota when creating EQs resulted in wasting MTT entries, due to allocating more EQEs than were needed.

Fixes: 5a0d0a6161ae ("mlx4: Structures and init/teardown for VF resource quotas")
Signed-off-by: Jack Morgenstein
Reported-by: Dexuan Cui
Signed-off-by: Tariq Toukan
---
 drivers/net/ethernet/mellanox/mlx4/eq.c   | 5 ++---
 drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index 0509996957d9..232f46db0dce 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -1256,9 +1256,8 @@ int mlx4_init_eq_table(struct mlx4_dev *dev)
 			mlx4_warn(dev, "Failed adding irq rmap\n");
 	}
 #endif
-		err = mlx4_create_eq(dev, dev->caps.num_cqs -
-					  dev->caps.reserved_cqs +
-					  MLX4_NUM_SPARE_EQE,
+		err = mlx4_create_eq(dev, dev->quotas.cq +
+					  MLX4_NUM_SPARE_EQE,
 				     (dev->flags & MLX4_FLAG_MSI_X) ?
i + 1 - !!(i > MLX4_EQ_ASYNC) : 0, eq); diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c index 7a030d10ff3e..094cfd8a1a18 100644 --- a/drivers/net/ethernet/mellanox/mlx4/main.c +++ b/drivers/net/ethernet/mellanox/mlx4/main.c @@ -3501,6 +3501,8 @@ static int mlx4_load_one(struct pci_dev *pdev, int pci_dev_data, goto err_disable_msix; } + mlx4_init_quotas(dev); + err = mlx4_setup_hca(dev); if (err == -EBUSY && (dev->flags & MLX4_FLAG_MSI_X) && !mlx4_is_mfunc(dev)) { @@ -3513,7 +3515,6 @@ static int mlx4_load_one(struct pci_dev *pdev, int pci_dev_data, if (err) goto err_steer; - mlx4_init_quotas(dev); /* When PF resources are ready arm its comm channel to enable * getting commands */ -- 1.8.3.1
[PATCH net V2 2/5] net/mlx4: Spoofcheck and zero MAC can't coexist
From: Eugenia Emantayev
Spoofcheck can't be enabled if the VF MAC is zero; conversely, the MAC can't be zeroed while spoofcheck is on. Fixes: 8f7ba3ca12f6 ('net/mlx4: Add set VF mac address support') Signed-off-by: Eugenia Emantayev Signed-off-by: Tariq Toukan --- drivers/net/ethernet/mellanox/mlx4/cmd.c | 22 -- drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 6 +- include/linux/mlx4/cmd.h | 2 +- include/linux/mlx4/driver.h| 10 ++ 4 files changed, 32 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c index a49072b4fa52..e8c105164931 100644 --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c @@ -43,6 +43,7 @@ #include #include #include +#include #include @@ -2955,7 +2956,7 @@ static bool mlx4_valid_vf_state_change(struct mlx4_dev *dev, int port, return false; } -int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u64 mac) +int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u8 *mac) { struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_vport_state *s_info; @@ -2964,13 +2965,22 @@ int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u64 mac) if (!mlx4_is_master(dev)) return -EPROTONOSUPPORT; + if (is_multicast_ether_addr(mac)) + return -EINVAL; + slave = mlx4_get_slave_indx(dev, vf); if (slave < 0) return -EINVAL; port = mlx4_slaves_closest_port(dev, slave, port); s_info = &priv->mfunc.master.vf_admin[slave].vport[port]; - s_info->mac = mac; + + if (s_info->spoofchk && is_zero_ether_addr(mac)) { + mlx4_info(dev, "MAC invalidation is not allowed when spoofchk is on\n"); + return -EPERM; + } + + s_info->mac = mlx4_mac_to_u64(mac); mlx4_info(dev, "default mac on vf %d port %d to %llX will take effect only after vf restart\n", vf, port, s_info->mac); return 0; @@ -3143,6 +3153,7 @@ int mlx4_set_vf_spoofchk(struct mlx4_dev *dev, int port, int vf, bool setting) struct mlx4_priv *priv = mlx4_priv(dev); struct mlx4_vport_state *s_info; int slave; + u8 
mac[ETH_ALEN]; if ((!mlx4_is_master(dev)) || !(dev->caps.flags2 & MLX4_DEV_CAP_FLAG2_FSM)) @@ -3154,6 +3165,13 @@ int mlx4_set_vf_spoofchk(struct mlx4_dev *dev, int port, int vf, bool setting) port = mlx4_slaves_closest_port(dev, slave, port); s_info = &priv->mfunc.master.vf_admin[slave].vport[port]; + + mlx4_u64_to_mac(mac, s_info->mac); + if (setting && !is_valid_ether_addr(mac)) { + mlx4_info(dev, "Illegal MAC with spoofchk\n"); + return -EPERM; + } + s_info->spoofchk = setting; return 0; diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c index 3b4961a8e8e4..9a86dd397315 100644 --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c @@ -2475,12 +2475,8 @@ static int mlx4_en_set_vf_mac(struct net_device *dev, int queue, u8 *mac) { struct mlx4_en_priv *en_priv = netdev_priv(dev); struct mlx4_en_dev *mdev = en_priv->mdev; - u64 mac_u64 = mlx4_mac_to_u64(mac); - if (is_multicast_ether_addr(mac)) - return -EINVAL; - - return mlx4_set_vf_mac(mdev->dev, en_priv->port, queue, mac_u64); + return mlx4_set_vf_mac(mdev->dev, en_priv->port, queue, mac); } static int mlx4_en_set_vf_vlan(struct net_device *dev, int vf, u16 vlan, u8 qos, diff --git a/include/linux/mlx4/cmd.h b/include/linux/mlx4/cmd.h index 1f3568694a57..7b74afcbbab2 100644 --- a/include/linux/mlx4/cmd.h +++ b/include/linux/mlx4/cmd.h @@ -308,7 +308,7 @@ int mlx4_get_counter_stats(struct mlx4_dev *dev, int counter_index, int mlx4_get_vf_stats(struct mlx4_dev *dev, int port, int vf_idx, struct ifla_vf_stats *vf_stats); u32 mlx4_comm_get_version(void); -int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u64 mac); +int mlx4_set_vf_mac(struct mlx4_dev *dev, int port, int vf, u8 *mac); int mlx4_set_vf_vlan(struct mlx4_dev *dev, int port, int vf, u16 vlan, u8 qos, __be16 proto); int mlx4_set_vf_rate(struct mlx4_dev *dev, int port, int vf, int min_tx_rate, diff --git a/include/linux/mlx4/driver.h 
b/include/linux/mlx4/driver.h index bd0e7075ea6d..e965e5090d96 100644 --- a/include/linux/mlx4/driver.h +++ b/include/linux/mlx4/driver.h @@ -104,4 +104,14 @@ static inline u64 mlx4_mac_to_u64(u8 *addr) return mac; } +static inline void mlx4_u64_to_mac(u8 *addr, u64 mac) +{ + int i; + + for (i = ETH_ALEN; i > 0; i--) { + addr[i - 1] = mac & 0xFF; + mac >>= 8; + } +}
[PATCH net V2 0/5] mlx4 misc fixes
Hi Dave, This patchset contains misc bug fixes from Eric Dumazet and our team to the mlx4 Core and Eth drivers. Series generated against net commit: 00ea1ceebe0d ipv6: release dst on error in ip6_dst_lookup_tail Thanks, Tariq. v2: * Added Eric's fix (patch 5/5). Eric Dumazet (1): net/mlx4_en: Use __skb_fill_page_desc() Eugenia Emantayev (1): net/mlx4: Spoofcheck and zero MAC can't coexist Jack Morgenstein (1): net/mlx4_core: Use cq quota in SRIOV when creating completion EQs Majd Dibbiny (1): net/mlx4_core: Fix VF overwrite of module param which disables DMFS on new probed PFs Or Gerlitz (1): net/mlx4: Change ENOTSUPP to EOPNOTSUPP drivers/net/ethernet/mellanox/mlx4/cmd.c | 22 -- drivers/net/ethernet/mellanox/mlx4/en_dcb_nl.c | 2 +- drivers/net/ethernet/mellanox/mlx4/en_netdev.c | 6 +- drivers/net/ethernet/mellanox/mlx4/en_rx.c | 8 drivers/net/ethernet/mellanox/mlx4/eq.c| 5 ++--- drivers/net/ethernet/mellanox/mlx4/fw.c| 2 +- drivers/net/ethernet/mellanox/mlx4/intf.c | 2 +- drivers/net/ethernet/mellanox/mlx4/main.c | 11 +-- drivers/net/ethernet/mellanox/mlx4/mr.c| 2 +- drivers/net/ethernet/mellanox/mlx4/qp.c| 2 +- .../net/ethernet/mellanox/mlx4/resource_tracker.c | 2 +- include/linux/mlx4/cmd.h | 2 +- include/linux/mlx4/driver.h| 10 ++ 13 files changed, 49 insertions(+), 27 deletions(-) -- 1.8.3.1
[PATCH net V2 3/5] net/mlx4_core: Fix VF overwrite of module param which disables DMFS on new probed PFs
From: Majd Dibbiny
In the VF driver, module parameter mlx4_log_num_mgm_entry_size was mistakenly overwritten -- and in a manner which overrode the device-managed flow steering option encoded in the parameter. log_num_mgm_entry_size is a global module parameter which affects all ConnectX-3 PFs installed on that host. If a VF changes log_num_mgm_entry_size, this will affect all PFs which are probed subsequent to the change (by disabling DMFS for those PFs). Fixes: 3c439b5586e9 ("mlx4_core: Allow choosing flow steering mode") Signed-off-by: Majd Dibbiny Reviewed-by: Jack Morgenstein Signed-off-by: Tariq Toukan --- drivers/net/ethernet/mellanox/mlx4/main.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c index 55e4be51ee5a..7a030d10ff3e 100644 --- a/drivers/net/ethernet/mellanox/mlx4/main.c +++ b/drivers/net/ethernet/mellanox/mlx4/main.c @@ -841,8 +841,6 @@ static int mlx4_slave_cap(struct mlx4_dev *dev) return -ENOSYS; } - mlx4_log_num_mgm_entry_size = hca_param.log_mc_entry_sz; - dev->caps.hca_core_clock = hca_param.hca_core_clock; memset(&dev_cap, 0, sizeof(dev_cap)); -- 1.8.3.1
Re: [PATCH net 0/4] mlx4 misc fixes
On 22/02/2017 2:33 PM, Tariq Toukan wrote: Hi Dave, This patchset contains misc bug fixes from the team to the mlx4 Core and Eth drivers. Series generated against net commit: 00ea1ceebe0d ipv6: release dst on error in ip6_dst_lookup_tail Thanks, Tariq. Please ignore this one. I am submitting V2 with an additional patch. Thanks, Tariq
Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX
On Mon, 2017-02-13 at 11:58 -0800, Eric Dumazet wrote: > Use of order-3 pages is problematic in some cases. > > This patch might add three kinds of regression : > > 1) a CPU performance regression, but we will add later page > recycling and performance should be back. > > 2) TCP receiver could grow its receive window slightly slower, >because skb->len/skb->truesize ratio will decrease. >This is mostly ok, we prefer being conservative to not risk OOM, >and eventually tune TCP better in the future. >This is consistent with other drivers using 2048 per ethernet frame. > > 3) Because we allocate one page per RX slot, we consume more >memory for the ring buffers. XDP already had this constraint anyway. > > Signed-off-by: Eric Dumazet> --- Note that we also could use a different strategy. Assume RX rings of 4096 entries/slots. With this patch, mlx4 gets the strategy used by Alexander in Intel drivers : Each RX slot has an allocated page, and uses half of it, flipping to the other half every time the slot is used. So a ring buffer of 4096 slots allocates 4096 pages. When we receive a packet train for the same flow, GRO builds an skb with ~45 page frags, all from different pages. The put_page() done from skb_release_data() touches ~45 different struct page cache lines, and show a high cost. (compared to the order-3 used today by mlx4, this adds extra cache line misses and stalls for the consumer) If we instead try to use the two halves of one page on consecutive RX slots, we might instead cook skb with the same number of MSS (45), but half the number of cache lines for put_page(), so we should speed up the consumer. This means the number of active pages would be minimal, especially on PowerPC. Pages that have been used by X=2 received frags would be put in a quarantine (size to be determined). On PowerPC, X would be PAGE_SIZE/frag_size This strategy would consume less memory on PowerPC : 65535/1536 = 42, so a 4096 RX ring would need 98 active pages instead of 4096. 
The quarantine would be sized to increase chances of reusing an old page, without consuming too much memory. Probably roundup_pow_of_two(rx_ring_size / (PAGE_SIZE/frag_size)) x86 would still use 4096 pages, but PowerPC would use 98+128 pages instead of 4096) (14 MBytes instead of 256 MBytes)
[PATCH net 2/6] net/mlx5e: Register/unregister vport representors on interface attach/detach
Currently vport representors are added only on driver load and removed on driver unload. Apparently we forgot to handle them when we added the seamless reset flow feature. This caused the representor netdevs to be left alive and active, with open HW resources, on pci shutdown and on error reset flows. To overcome this we move their handling to interface attach/detach, so they would be cleaned up on shutdown and recreated on reset flows. Fixes: 26e59d8077a3 ("net/mlx5e: Implement mlx5e interface attach/detach callbacks") Signed-off-by: Saeed Mahameed Reviewed-by: Hadar Hen Zion Reviewed-by: Roi Dayan --- drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index 3cce6281e075..c24366868b39 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -3970,6 +3970,19 @@ static void mlx5e_register_vport_rep(struct mlx5_core_dev *mdev) } } +static void mlx5e_unregister_vport_rep(struct mlx5_core_dev *mdev) +{ + struct mlx5_eswitch *esw = mdev->priv.eswitch; + int total_vfs = MLX5_TOTAL_VPORTS(mdev); + int vport; + + if (!MLX5_CAP_GEN(mdev, vport_group_manager)) + return; + + for (vport = 1; vport < total_vfs; vport++) + mlx5_eswitch_unregister_vport_rep(esw, vport); +} + void mlx5e_detach_netdev(struct mlx5_core_dev *mdev, struct net_device *netdev) { struct mlx5e_priv *priv = netdev_priv(netdev); @@ -4016,6 +4029,7 @@ static int mlx5e_attach(struct mlx5_core_dev *mdev, void *vpriv) return err; } + mlx5e_register_vport_rep(mdev); return 0; } @@ -4027,6 +4041,7 @@ static void mlx5e_detach(struct mlx5_core_dev *mdev, void *vpriv) if (!netif_device_present(netdev)) return; + mlx5e_unregister_vport_rep(mdev); mlx5e_detach_netdev(mdev, netdev); mlx5e_destroy_mdev_resources(mdev); } @@ -4045,8 +4060,6 @@ static void *mlx5e_add(struct mlx5_core_dev 
*mdev) if (err) return NULL; - mlx5e_register_vport_rep(mdev); - if (MLX5_CAP_GEN(mdev, vport_group_manager)) ppriv = &esw->offloads.vport_reps[0]; @@ -4098,13 +4111,7 @@ void mlx5e_destroy_netdev(struct mlx5_core_dev *mdev, struct mlx5e_priv *priv) static void mlx5e_remove(struct mlx5_core_dev *mdev, void *vpriv) { - struct mlx5_eswitch *esw = mdev->priv.eswitch; - int total_vfs = MLX5_TOTAL_VPORTS(mdev); struct mlx5e_priv *priv = vpriv; - int vport; - - for (vport = 1; vport < total_vfs; vport++) - mlx5_eswitch_unregister_vport_rep(esw, vport); unregister_netdev(priv->netdev); mlx5e_detach(mdev, vpriv); -- 2.11.0
[PATCH net 1/6] net/mlx5e: s390 system compilation fix
From: Mohamad Haj Yahia
Add the necessary header includes for s390 arch compilation. Fixes: e586b3b0baee ("net/mlx5: Ethernet Datapath files") Fixes: d605d6686dc7 ("net/mlx5e: Add support for ethtool self..") Signed-off-by: Mohamad Haj Yahia Signed-off-by: Saeed Mahameed --- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 1 + drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c | 1 + 2 files changed, 2 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index b039b87742a6..9fad22768aab 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include #include diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c index 65442c36a6e1..31e3cb7ee5fe 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_selftest.c @@ -30,6 +30,7 @@ * SOFTWARE. */ +#include #include #include #include -- 2.11.0
[PATCH net 5/6] net/mlx5e: Update MPWQE stride size when modifying CQE compress state
When the admin enables/disables CQE compression, updating the MPWQE stride size is required: CQE compress ON ==> stride size = 256B CQE compress OFF ==> stride size = 64B This is already done on driver load via mlx5e_set_rq_type_params; all we need is to call it on arbitrary admin changes of CQE compression state via priv flags or when changing timestamping state (as it is mutually exclusive with CQE compression). This bug introduces no functional damage; it only makes CQE compression occur less often, since in ConnectX4-LX CQE compression is performed only on packets smaller than the stride size. Tested: ethtool --set-priv-flags ethxx rx_cqe_compress on pktgen with 64 < pkt size < 256 and netperf TCP_STREAM (IPv4/IPv6) verify `ethtool -S ethxx | grep compress` are advancing more often (rapidly) Fixes: 7219ab34f184 ("net/mlx5e: CQE compression") Signed-off-by: Saeed Mahameed Reviewed-by: Tariq Toukan Cc: kernel-t...@fb.com --- drivers/net/ethernet/mellanox/mlx5/core/en.h | 1 + drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c | 1 + drivers/net/ethernet/mellanox/mlx5/core/en_main.c| 2 +- drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 1 + 4 files changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index 95ca03c0d9f5..f6a6ded204f6 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -816,6 +816,7 @@ int mlx5e_get_max_linkspeed(struct mlx5_core_dev *mdev, u32 *speed); void mlx5e_set_rx_cq_mode_params(struct mlx5e_params *params, u8 cq_period_mode); +void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type); static inline void mlx5e_tx_notify_hw(struct mlx5e_sq *sq, struct mlx5_wqe_ctrl_seg *ctrl, int bf_sz) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c index cc80522b5854..a004a5a1a4c2 100644 --- 
a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c @@ -1487,6 +1487,7 @@ static int set_pflag_rx_cqe_compress(struct net_device *netdev, mlx5e_modify_rx_cqe_compression_locked(priv, enable); priv->params.rx_cqe_compress_def = enable; + mlx5e_set_rq_type_params(priv, priv->params.rq_wq_type); return 0; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c index dc621bc4e173..8ef64c4db2c2 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c @@ -79,7 +79,7 @@ static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev) MLX5_CAP_ETH(mdev, reg_umr_sq); } -static void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type) +void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type) { priv->params.rq_wq_type = rq_type; priv->params.lro_wqe_sz = MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 9fad22768aab..d5ce20db3f0b 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -172,6 +172,7 @@ void mlx5e_modify_rx_cqe_compression_locked(struct mlx5e_priv *priv, bool val) mlx5e_close_locked(priv->netdev); MLX5E_SET_PFLAG(priv, MLX5E_PFLAG_RX_CQE_COMPRESS, val); + mlx5e_set_rq_type_params(priv, priv->params.rq_wq_type); if (was_opened) mlx5e_open_locked(priv->netdev); -- 2.11.0