Re: [PATCH net] ebpf: fix bpf_msg_pull_data
On 08/30/2018 02:21 AM, Tushar Dave wrote: > On 08/29/2018 05:07 PM, Tushar Dave wrote: >> While doing some preliminary testing it is found that bpf helper >> bpf_msg_pull_data does not calculate the data and data_end offset >> correctly. Fix it! >> >> Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data") >> Signed-off-by: Tushar Dave >> Acked-by: Sowmini Varadhan >> --- >> net/core/filter.c | 38 +- >> 1 file changed, 25 insertions(+), 13 deletions(-) >> >> diff --git a/net/core/filter.c b/net/core/filter.c >> index c25eb36..3eeb3d6 100644 >> --- a/net/core/filter.c >> +++ b/net/core/filter.c >> @@ -2285,7 +2285,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff >> *msg) >> BPF_CALL_4(bpf_msg_pull_data, >> struct sk_msg_buff *, msg, u32, start, u32, end, u64, flags) >> { >> - unsigned int len = 0, offset = 0, copy = 0; >> + unsigned int len = 0, offset = 0, copy = 0, off = 0; >> struct scatterlist *sg = msg->sg_data; >> int first_sg, last_sg, i, shift; >> unsigned char *p, *to, *from; >> @@ -2299,22 +2299,30 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff >> *msg) >> i = msg->sg_start; >> do { >> len = sg[i].length; >> - offset += len; >> if (start < offset + len) >> break; >> + offset += len; >> i++; >> if (i == MAX_SKB_FRAGS) >> i = 0; >> - } while (i != msg->sg_end); >> + } while (i <= msg->sg_end); I don't think this condition is correct, msg operates as a scatterlist ring, so sg_end may very well be < current i when there's a wrap-around in the traversal ... more below. >> + /* return error if start is out of range */ >> if (unlikely(start >= offset + len)) >> return -EINVAL; >> - if (!msg->sg_copy[i] && bytes <= len) >> - goto out; >> + /* return error if i is last entry in sglist and end is out of range */ >> + if (msg->sg_copy[i] && end > offset + len) >> + return -EINVAL; >> first_sg = i; >> + /* if i is not last entry in sg list and end (i.e start + bytes) is >> + * within this sg[i] then goto out and calculate data and data_end >> + */ >> + if (!msg->sg_copy[i] && end <= offset + len) >> + goto out; >> + >> /* At this point we need to linearize multiple scatterlist >> * elements or a single shared page. Either way we need to >> * copy into a linear buffer exclusively owned by BPF. Then >> @@ -2330,9 +2338,14 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff >> *msg) >> i++; >> if (i == MAX_SKB_FRAGS) >> i = 0; >> - if (bytes < copy) >> + if (end < copy) >> break; >> - } while (i != msg->sg_end); >> + } while (i <= msg->sg_end); >> + >> + /* return error if i is last entry in sglist and end is out of range */ >> + if (i > msg->sg_end && end > offset + copy) >> + return -EINVAL; >> + >> last_sg = i; >> if (unlikely(copy < end - start)) >> @@ -2342,23 +2355,22 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff >> *msg) >> if (unlikely(!page)) >> return -ENOMEM; >> p = page_address(page); >> - offset = 0; >> i = first_sg; >> do { >> from = sg_virt(&sg[i]); >> len = sg[i].length; >> - to = p + offset; >> + to = p + off; >> memcpy(to, from, len); >> - offset += len; >> + off += len; >> sg[i].length = 0; >> put_page(sg_page(&sg[i])); >> i++; >> if (i == MAX_SKB_FRAGS) >> i = 0; >> - } while (i != last_sg); >> + } while (i < last_sg); >> sg[first_sg].length = copy; >> sg_set_page(&sg[first_sg], page, copy, 0); >> @@ -2380,7 +2392,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff >> *msg) >> else >> move_from = i + shift; >> - if (move_from == msg->sg_end) >> + if (move_from > msg->sg_end) >> break; >> sg[i] = sg[move_from]; >> @@ -2396,7 +2408,7 @@ struct sock *do_msg_redirect_map(struct sk_msg_buff >> *msg) >> if (msg->sg_end < 0) >> msg->sg_end += MAX_SKB_FRAGS; >> out: >> - msg->data = sg_virt(&sg[i]) + start - offset; >> + msg->data = sg_virt(&sg[first_sg]) + start - offset; >> msg->data_end = msg->data + bytes; >> return 0; >> > > Please discard this patch. I just noticed that Daniel Borkmann sent some > similar fixes for bpf_msg_pull_data. Yeah I've been looking at these recently as well, please make sure you test with the below fixes included to see if there's anything left: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=5b24109b0563d45094c470684c1f8cea1af269f8 https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/net/core/filter.c?id=0e06b227c
Re: [PATCH] ieee802154: mcr20a: read out of bounds in mcr20a_set_channel()
On Wed, Aug 29, 2018 at 07:50:51PM +0200, Xue Liu wrote: > Hello Dan, > > > On Wed, 29 Aug 2018 at 16:49, Dan Carpenter > wrote: > > > The "channel" variable can be any u8 value. We need to make sure we > > don't read outside of the PLL_INT[] or PLL_FRAC[] arrays. > > > I think the “channel” variable can not be any u8 value. This values is > already checked before set_channel function is called. > https://github.com/torvalds/linux/blob/master/net/ieee802154/nl802154.c#L978 Oh... Hm... I should have reviewed more carefully. This patch isn't correct. But there may still be a bug. What Smatch is worried about is the other call tree: ieee802154_start_req() calls (struct ieee802154_mlme_ops)->start_req -> mac802154_mlme_start_req() -> mac802154_dev_set_page_channel() -> drv_set_channel() calls local->ops->set_channel(&local->hw, page, channel); -> mcr20a_set_channel() So maybe we could move the check from nl802154_set_channel() to drv_set_channel() so channel is checked on both call trees. regards, dan carpenter
[PATCH bpf-next] xsk: include XDP meta data in AF_XDP frames
From: Björn Töpel Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did not include XDP meta data in the data buffers copied out to the user application. In this commit, we check if meta data is available, and if so, it is prepended to the frame. Signed-off-by: Björn Töpel --- net/xdp/xsk.c | 37 - 1 file changed, 28 insertions(+), 9 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 4e937cd7c17d..817e4cee1540 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr); static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len) { - void *buffer; + void *to_buf, *from_buf; + u32 metalen; u64 addr; int err; if (!xskq_peek_addr(xs->umem->fq, &addr) || - len > xs->umem->chunk_size_nohr) { + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) { xs->rx_dropped++; return -ENOSPC; } addr += xs->umem->headroom; - buffer = xdp_umem_get_data(xs->umem, addr); - memcpy(buffer, xdp->data, len); + if (xdp_data_meta_unsupported(xdp)) { + from_buf = xdp->data; + metalen = 0; + } else { + from_buf = xdp->data_meta; + metalen = xdp->data - xdp->data_meta; + } + + to_buf = xdp_umem_get_data(xs->umem, addr); + memcpy(to_buf, from_buf, len + metalen); + addr += metalen; err = xskq_produce_batch_desc(xs->rx, addr, len); if (!err) { xskq_discard_addr(xs->umem->fq); @@ -111,8 +121,8 @@ void xsk_flush(struct xdp_sock *xs) int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) { - u32 len = xdp->data_end - xdp->data; - void *buffer; + u32 metalen, len = xdp->data_end - xdp->data; + void *to_buf, *from_buf; u64 addr; int err; @@ -120,15 +130,24 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) return -EINVAL; if (!xskq_peek_addr(xs->umem->fq, &addr) || - len > xs->umem->chunk_size_nohr) { + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) { xs->rx_dropped++; return -ENOSPC; } addr += xs->umem->headroom; - buffer = xdp_umem_get_data(xs->umem, addr); - memcpy(buffer, xdp->data, len); + if (xdp_data_meta_unsupported(xdp)) { + from_buf = xdp->data; + metalen = 0; + } else { + from_buf = xdp->data_meta; + metalen = xdp->data - xdp->data_meta; + } + + to_buf = xdp_umem_get_data(xs->umem, addr); + memcpy(to_buf, from_buf, len + metalen); + addr += metalen; err = xskq_produce_batch_desc(xs->rx, addr, len); if (!err) { xskq_discard_addr(xs->umem->fq); -- 2.17.1
Re: mlx5 driver loading failing on v4.19 / net-next / bpf-next
On 29/08/2018 6:05 PM, Jesper Dangaard Brouer wrote: Hi Saeed, I'm having issues loading mlx5 driver on v4.19 kernels (tested both net-next and bpf-next), while kernel v4.18 seems to work. It happens with a Mellanox ConnectX-5 NIC (and also a CX4-Lx but I removed that from the system now). Hi Jesper, Thanks for your report! We are working to analyze and debug the issue. One pain point is very long boot-time, caused by some timeout code in the driver. The kernel console log (dmesg) says: [5.763330] mlx5_core :03:00.0: firmware version: 16.22.1002 [5.769367] mlx5_core :03:00.0: 126.016 Gb/s available PCIe bandwidth, limited by 8 GT/s x16 link at :00:02.0 (capable of 252.048 Gb/s with 16 GT/s x16 link) (...) other drivers loading [ 66.816635] mlx5_core :03:00.0: wait_func:964:(pid 112): ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource [ 66.828123] mlx5_core :03:00.0: enable hca failed [ 66.845516] mlx5_core :03:00.0: mlx5_load_one failed with error code -110 [ 66.852802] mlx5_core: probe of :03:00.0 failed with error -110 [ 66.859347] mlx5_core :03:00.1: firmware version: 16.22.1002 [ 66.865388] mlx5_core :03:00.1: 126.016 Gb/s available PCIe bandwidth, limited by 8 GT/s x16 link at :00:02.0 (capable of 252.048 Gb/s with 16 GT/s x16 link) [ 125.787395] XFS (sda3): Mounting V5 Filesystem [ 125.848509] XFS (sda3): Ending clean mount [ 127.984784] mlx5_core :03:00.1: wait_func:964:(pid 5): ENABLE_HCA(0x104) timeout. Will cause a leak of a command resource [ 127.996090] mlx5_core :03:00.1: enable hca failed [ 128.013819] mlx5_core :03:00.1: mlx5_load_one failed with error code -110 [ 128.021076] mlx5_core: probe of :03:00.1 failed with error -110 Do you have any idea what could be causing this? We'll update regarding any progress. Thanks!
Re: [PATCH 0/3] IB/ipoib: Use dev_port to disambiguate port numbers
On Thu, Aug 30, 2018 at 08:43:30AM +0300, Leon Romanovsky wrote: > On Wed, Aug 29, 2018 at 12:01:14AM +0300, Arseny Maslennikov wrote: > > Pre-3.15 userspace had trouble distinguishing different ports > > of a NIC on a single PCI bus/device/function. To solve this, > > a sysfs field `dev_port' was introduced quite a while ago > > (commit v3.14-rc3-739-g3f85944fe207), and some relevant device > > drivers were fixed to use it, but not in case of IPoIB. > > > > The convention for some reason never got documented in the kernel, but > > was immediately adopted by userspace (notably udev[1][2], biosdevname[3]) > > > > 3/3 documents the sysfs field — that's why I'm CC-ing netdev. > > > > This series was tested on and applies to 4.19-rc1. > > > > [1] > > https://lists.freedesktop.org/archives/systemd-devel/2014-June/020788.html > > [2] > > https://lists.freedesktop.org/archives/systemd-devel/2014-July/020804.html > > [3] > > https://github.com/CloudAutomationNTools/biosdevname/blob/c795d51dd93a5309652f0d635f12a3ecfabfaa72/src/eths.c#L38 > > > > Arseny Maslennikov (3): > > IB/ipoib: Use dev_port to expose network interface port numbers > > IB/ipoib: Stop using dev_id to expose port numbers > > I completely agree with previous Yuval's comment, it makes no sense to > start separate commits for every line. > > Please decide what is best and right behavior and do it, instead of > pushing it up to be the maintainer's problem. > No problem; will squash those two in v2, then. > Thanks > > > Documentation/ABI: document /sys/class/net/*/dev_port > > > > Documentation/ABI/testing/sysfs-class-net | 10 ++ > > drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +- > > 2 files changed, 11 insertions(+), 1 deletion(-) > > > > -- > > 2.18.0 > > signature.asc Description: PGP signature
Re: [PATCH 0/4] clk-pmc-atom + r8169: Add ether_clk handling to fix suspend issues
On Wed, Aug 29, 2018 at 07:06:09PM +0200, Hans de Goede wrote: > > What a nice stuff, thanks! > > > > Reviewed-by: Andy Shevchenko > > Thank you (and also thank you for the other reviews) > > > Btw, you probably better to refer to > > https://bugzilla.kernel.org/show_bug.cgi?id=196861 which is dedicated for > > S0ix > > issue. > > Ok, I've added a link to that. I've also added: > > Reported-by: Johannes Stezenbach > > To honor Johannes for figuring out that leaving the clocks enabled > was a problem in the first place. > > This will all be included in v2 of the series when I send it out. Forgot to mention, instead of Irina, which is not longer working for Intel for more than a year, put Pierre's address there. -- With Best Regards, Andy Shevchenko
Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e
Den ons 29 aug. 2018 kl 18:12 skrev Daniel Borkmann : > > On 08/28/2018 02:44 PM, Björn Töpel wrote: > > From: Björn Töpel > > > > This patch set introduces zero-copy AF_XDP support for Intel's i40e > > driver. In the first preparatory patch we also add support for > > XDP_REDIRECT for zero-copy allocated frames so that XDP programs can > > redirect them. This was a ToDo from the first AF_XDP zero-copy patch > > set from early June. Special thanks to Alex Duyck and Jesper Dangaard > > Brouer for reviewing earlier versions of this patch set. > > > > The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note > > that in the interest of time, to get an AF_XDP zero-copy implementation > > out there for people to try, some code paths have been copied from the > > XDP path to the zero-copy path. It is out goal to merge the two paths > > in later patch sets. > > > > In contrast to the implementation from beginning of June, this patch > > set does not require any extra HW queues for AF_XDP zero-copy > > TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and > > AF_XDP zero-copy TX. > > > > Jeff, given that most of changes are in i40e, it is up to you how you > > would like to route these patches. The set is tagged bpf-next, but > > if taking it via the Intel driver tree is easier, let us know. > > > > We have run some benchmarks on a dual socket system with two Broadwell > > E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14 > > cores which gives a total of 28, but only two cores are used in these > > experiments. One for TR/RX and one for the user space application. The > > memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is > > 8192MB and with 8 of those DIMMs in the system we have 64 GB of total > > memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The > > NIC is Intel I40E 40Gbit/s using the i40e driver. > > > > Below are the results in Mpps of the I40E NIC benchmark runs for 64 > > and 1500 byte packets, generated by a commercial packet generator HW > > outputing packets at full 40 Gbit/s line rate. The results are with > > retpoline and all other spectre and meltdown fixes, so these results > > are not comparable to the ones from the zero-copy patch set in June. > > > > AF_XDP performance 64 byte packets. > > Benchmark XDP_SKBXDP_DRVXDP_DRV with zerocopy > > rxdrop 2.68.2 15.0 > > txpush 2.2- 21.9 > > l2fwd1.72.3 11.3 > > > > AF_XDP performance 1500 byte packets: > > Benchmark XDP_SKB XDP_DRV XDP_DRV with zerocopy > > rxdrop 2.03.3 3.3 > > l2fwd1.31.7 3.1 > > > > XDP performance on our system as a base line: > > > > 64 byte packets: > > XDP stats CPU pps issue-pps > > XDP-RX CPU 16 18.4M 0 > > > > 1500 byte packets: > > XDP stats CPU pps issue-pps > > XDP-RX CPU 16 3.3M0 > > > > The structure of the patch set is as follows: > > > > Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames > > Patches 2-4: Preparatory patches to common xsk and net code > > Patches 5-7: Preparatory patches to i40e driver code for RX > > Patch 8: i40e zero-copy support for RX > > Patch 9: Preparatory patch to i40e driver code for TX > > Patch 10: i40e zero-copy support for TX > > Patch 11: Add flags to sample application to force zero-copy/copy mode > > > > We based this patch set on bpf-next commit 050cdc6c9501 ("Merge > > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net") > > > > > > Magnus & Björn > > > > Björn Töpel (8): > > xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY > > xdp: export xdp_rxq_info_unreg_mem_model > > xsk: expose xdp_umem_get_{data,dma} to drivers > > i40e: added queue pair disable/enable functions > > i40e: refactor Rx path for re-use > > i40e: move common Rx functions to i40e_txrx_common.h > > i40e: add AF_XDP zero-copy Rx support > > samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock > > > > Magnus Karlsson (3): > > net: add napi_if_scheduled_mark_missed > > i40e: move common Tx functions to i40e_txrx_common.h > > i40e: add AF_XDP zero-copy Tx support > > Thanks for working on this, LGTM! Are you also planning to get ixgbe > out after that? > Yes, the plan is to get ixgbe out as the next driver, but we'll focus on the existing i40e issues first. Thanks, Björn > For the series: > > Acked-by: Daniel Borkmann > > Thanks, > Daniel
[PATCH net-next] packet: add sockopt to ignore outgoing packets
Currently, the only way to ignore outgoing packets on a packet socket is via the BPF filter. With MSG_ZEROCOPY, packets that are looped into AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even if the filter run from packet_rcv() would reject them. So the presence of a packet socket on the interface takes away the benefits of MSG_ZEROCOPY, even if the packet socket is not interested in outgoing packets. (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily cloned, but the cost for that is much lower.) Add a socket option to allow AF_PACKET sockets to ignore outgoing packets to solve this. Note that the *BSDs already have something similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT. The first intended user is lldpd. Signed-off-by: Vincent Whitchurch --- include/linux/netdevice.h | 1 + include/uapi/linux/if_packet.h | 1 + net/core/dev.c | 3 +++ net/packet/af_packet.c | 15 +++ 4 files changed, 20 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index ca5ab98053c8..8ef14d9edc58 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2317,6 +2317,7 @@ static inline struct sk_buff *call_gro_receive_sk(gro_receive_sk_t cb, struct packet_type { __be16 type; /* This is really htons(ether_type). */ + boolignore_outgoing; struct net_device *dev; /* NULL is wildcarded here */ int (*func) (struct sk_buff *, struct net_device *, diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h index 67b61d91d89b..467b654bd4c7 100644 --- a/include/uapi/linux/if_packet.h +++ b/include/uapi/linux/if_packet.h @@ -57,6 +57,7 @@ struct sockaddr_ll { #define PACKET_QDISC_BYPASS20 #define PACKET_ROLLOVER_STATS 21 #define PACKET_FANOUT_DATA 22 +#define PACKET_IGNORE_OUTGOING 23 #define PACKET_FANOUT_HASH 0 #define PACKET_FANOUT_LB 1 diff --git a/net/core/dev.c b/net/core/dev.c index 325fc5088370..0addb4f0abfe 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1947,6 +1947,9 @@ static inline bool skb_loop_sk(struct packet_type *ptype, struct sk_buff *skb) if (!ptype->af_packet_priv || !skb->sk) return false; + if (ptype->ignore_outgoing) + return true; + if (ptype->id_match) return ptype->id_match(ptype, skb->sk); else if ((struct sock *)ptype->af_packet_priv == skb->sk) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 5610061e7f2e..37bbafad724a 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -3805,6 +3805,18 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv return fanout_set_data(po, optval, optlen); } + case PACKET_IGNORE_OUTGOING: + { + int val; + + if (optlen != sizeof(val)) + return -EINVAL; + if (copy_from_user(&val, optval, sizeof(val))) + return -EFAULT; + + po->prot_hook.ignore_outgoing = !!val; + return 0; + } case PACKET_TX_HAS_OFF: { unsigned int val; @@ -3928,6 +3940,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname, ((u32)po->fanout->flags << 24)) : 0); break; + case PACKET_IGNORE_OUTGOING: + val = po->prot_hook.ignore_outgoing; + break; case PACKET_ROLLOVER_STATS: if (!po->rollover) return -EINVAL; -- 2.11.0
Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute
Hi Jiri, The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs file as DEVTYPE= information. To avoid any kind of race conditions between netlink messages and reading from sysfs, it is useful to add the same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK messages. For network managing daemons that have to classify ARPHRD_ETHER network devices into different types (like Wireless LAN, Bluetooth etc.), this avoids the extra round trip to sysfs and parsing of the uevent file. Signed-off-by: Marcel Holtmann --- include/uapi/linux/if_link.h | 2 ++ net/core/rtnetlink.c | 12 2 files changed, 14 insertions(+) diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h index 43391e2d1153..781294972bb4 100644 --- a/include/uapi/linux/if_link.h +++ b/include/uapi/linux/if_link.h @@ -166,6 +166,8 @@ enum { IFLA_NEW_IFINDEX, IFLA_MIN_MTU, IFLA_MAX_MTU, + IFLA_DEVTYPE, /* Name value from SET_NETDEV_DEVTYPE */ >>> >>> This is not something netdev-related. dev->dev.type is struct device_type. >>> This is a generic "device" thing. Incorrect to expose over >>> netdev-specific API. Please use "device" API for this. >> >> it is not just "device" related since this is a sub-classification of >> ARPHRD_ETHER type as a wrote above. Don't get hang up that this information >> is part of struct device. The dev->dev.type contains strings like "wlan", >> "bluetooth", "wimax", "gadget" etc. so that you can tell what kind of >> Ethernet card you have. > > I understand. But using dev->dev.type for that purpose seems like an > abuse. If this is subtype of netdev, it should not be stored in parent > device. I believe that you need to re-introduce this as a part of struct > net_device - preferably an enum - and that you can expose over rtnl (and > net-sysfs). it is not stored in the parent device. It is stored in the device of the netdev. As Stephen said, there is no device API. And why would I add the same info again and bloat netdev? drivers/net/bonding/bond_main.c:SET_NETDEV_DEVTYPE(bond_dev, &bond_type); drivers/net/geneve.c: SET_NETDEV_DEVTYPE(dev, &geneve_type); drivers/net/macsec.c: SET_NETDEV_DEVTYPE(dev, &macsec_type); drivers/net/ppp/ppp_generic.c: SET_NETDEV_DEVTYPE(dev, &ppp_type); drivers/net/usb/hso.c: SET_NETDEV_DEVTYPE(net, &hso_type); drivers/net/usb/usbnet.c: SET_NETDEV_DEVTYPE(net, &wlan_type); drivers/net/usb/usbnet.c: SET_NETDEV_DEVTYPE(net, &wwan_type); drivers/net/vxlan.c:SET_NETDEV_DEVTYPE(dev, &vxlan_type); drivers/net/wimax/i2400m/usb.c: SET_NETDEV_DEVTYPE(net_dev, &i2400mu_type); drivers/net/wireless/intersil/prism54/islpci_dev.c: SET_NETDEV_DEVTYPE(ndev, &wlan_type); drivers/staging/gdm724x/gdm_lte.c: SET_NETDEV_DEVTYPE(net, &wwan_type); drivers/usb/gadget/function/u_ether.c: SET_NETDEV_DEVTYPE(net, &gadget_type); drivers/usb/gadget/function/u_ether.c: SET_NETDEV_DEVTYPE(net, &gadget_type); net/8021q/vlan_dev.c: SET_NETDEV_DEVTYPE(dev, &vlan_type); net/bluetooth/6lowpan.c:SET_NETDEV_DEVTYPE(netdev, &bt_type); net/bluetooth/bnep/core.c: SET_NETDEV_DEVTYPE(dev, &bnep_type); net/bridge/br_device.c: SET_NETDEV_DEVTYPE(dev, &br_type); net/dsa/slave.c:SET_NETDEV_DEVTYPE(slave_dev, &dsa_type); net/hsr/hsr_device.c: SET_NETDEV_DEVTYPE(dev, &hsr_type); net/l2tp/l2tp_eth.c:SET_NETDEV_DEVTYPE(dev, &l2tpeth_type); net/wireless/core.c:SET_NETDEV_DEVTYPE(dev, &wiphy_type); So for all these devices we have to add duplicate information now? I actually don't like that idea. I can rename it to IFLA_FOO or any other proposal, but adding the same information twice is pointless. >> We can revive the patches for adding this information as IFLA_INFO_KIND, but >> last time they where reverted since NetworkManager did all the wrong >> decisions based on that. That part is fixed now and we can put that back and >> declare the sub-type classification of Ethernet device in two places. Or we >> just take the information that we added a long time ago for exactly this >> sub-classification and provide them to userspace via RTNL. > > IFLA_INFO_KIND serves for different purpose. It is for drivers that > register rtnl_link_ops. I don't think it would be wise to mix it with > this. We could register rtnl_link_ops for these Ethernet drivers, but then we are duplicating the information as well. Regards Marcel
Re: [PATCH bpf-next 11/11] samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
Den ons 29 aug. 2018 kl 14:44 skrev Jesper Dangaard Brouer : > > On Tue, 28 Aug 2018 14:44:35 +0200 > Björn Töpel wrote: > > > From: Björn Töpel > > > > The -c/--copy -z/--zero-copy flags enforces either copy or zero-copy > > mode. > > Nice, thanks for adding this. It allows me to quickly test the > difference between normal-copy vs zero-copy modes. > (Kernel bpf-next without RETPOLINE). > > AF_XDP RX-drop: > Normal-copy mode: rx 13,070,318 pps - 76.5 ns > Zero-copy mode: rx 26,132,328 pps - 38.3 ns > > Compare to XDP_DROP: 34,251,464 pps - 29.2 ns >XDP_DROP + read : 30,756,664 pps - 32.5 ns > > The normal-copy mode is surprisingly fast (and it works for every > driver implemeting the regular XDP_REDIRECT action). It is still > faster to do in-kernel XDP_DROP than AF_XDP zero-copy mode dropping, > which was expected given frames travel to a remote CPU before returned > (don't think remote CPU reads payload?). The gap in nanosec is > actually quite small, thus I'm impressed by the SPSC-queue > implementation working across these CPUs. > > > AF_XDP layer2-fwd: > Normal-copy mode: rx 3,200,885 tx 3,200,892 > Zero-copy mode: rx 17,026,300 tx 17,026,269 > > Compare to XDP_TX: rx 14,529,079 tx 14,529,850 - 68.82 ns > XDP_REDIRECT: rx 13,235,785 tx 13,235,784 - 75.55 ns > > The copy-mode is slow because it allocates SKBs internally (I do > wonder if we could speed it up by using ndo_xdp_xmit + disable-BH). > More intersting is that the zero-copy is faster than XDP_TX and > XDP_REDIRECT. I think the speedup comes from avoiding some DMA mapping > calls with ZC. > > Side-note: XDP_TX vs. REDIRECT: 75.55 - 68.82 = 6.73 ns. The cost of > going through the xdp_do_redirect_map core is actually quite small :-) > (I have some micro optimizations that should help ~2ns). > > > AF_XDP TX-only: > Normal-copy mode: tx 2,853,461 pps > Zero-copy mode: tx 22,255,311 pps > > (There is not XDP mode that does TX to compare against) > Kudos for doing the in-depth benchmarking! Thanks! Björn > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute
Thu, Aug 30, 2018 at 12:13:40PM CEST, mar...@holtmann.org wrote: >Hi Jiri, > > The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs > file as DEVTYPE= information. To avoid any kind of race conditions > between netlink messages and reading from sysfs, it is useful to add the > same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK > messages. > > For network managing daemons that have to classify ARPHRD_ETHER network > devices into different types (like Wireless LAN, Bluetooth etc.), this > avoids the extra round trip to sysfs and parsing of the uevent file. > > Signed-off-by: Marcel Holtmann > --- > include/uapi/linux/if_link.h | 2 ++ > net/core/rtnetlink.c | 12 > 2 files changed, 14 insertions(+) > > diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h > index 43391e2d1153..781294972bb4 100644 > --- a/include/uapi/linux/if_link.h > +++ b/include/uapi/linux/if_link.h > @@ -166,6 +166,8 @@ enum { > IFLA_NEW_IFINDEX, > IFLA_MIN_MTU, > IFLA_MAX_MTU, > + IFLA_DEVTYPE, /* Name value from SET_NETDEV_DEVTYPE */ This is not something netdev-related. dev->dev.type is struct device_type. This is a generic "device" thing. Incorrect to expose over netdev-specific API. Please use "device" API for this. >>> >>> it is not just "device" related since this is a sub-classification of >>> ARPHRD_ETHER type as a wrote above. Don't get hang up that this information >>> is part of struct device. The dev->dev.type contains strings like "wlan", >>> "bluetooth", "wimax", "gadget" etc. so that you can tell what kind of >>> Ethernet card you have. >> >> I understand. But using dev->dev.type for that purpose seems like an >> abuse. If this is subtype of netdev, it should not be stored in parent >> device. I believe that you need to re-introduce this as a part of struct >> net_device - preferably an enum - and that you can expose over rtnl (and >> net-sysfs). > >it is not stored in the parent device. It is stored in the device of the >netdev. Yeah. That is what I ment. "Device struct" associated with the netdevice. > >As Stephen said, there is no device API. And why would I add the same info >again and bloat netdev? > >drivers/net/bonding/bond_main.c:SET_NETDEV_DEVTYPE(bond_dev, >&bond_type); >drivers/net/geneve.c: SET_NETDEV_DEVTYPE(dev, &geneve_type); >drivers/net/macsec.c: SET_NETDEV_DEVTYPE(dev, &macsec_type); >drivers/net/ppp/ppp_generic.c: SET_NETDEV_DEVTYPE(dev, &ppp_type); >drivers/net/usb/hso.c: SET_NETDEV_DEVTYPE(net, &hso_type); >drivers/net/usb/usbnet.c: SET_NETDEV_DEVTYPE(net, &wlan_type); >drivers/net/usb/usbnet.c: SET_NETDEV_DEVTYPE(net, &wwan_type); >drivers/net/vxlan.c:SET_NETDEV_DEVTYPE(dev, &vxlan_type); >drivers/net/wimax/i2400m/usb.c: SET_NETDEV_DEVTYPE(net_dev, &i2400mu_type); >drivers/net/wireless/intersil/prism54/islpci_dev.c: >SET_NETDEV_DEVTYPE(ndev, &wlan_type); >drivers/staging/gdm724x/gdm_lte.c: SET_NETDEV_DEVTYPE(net, >&wwan_type); >drivers/usb/gadget/function/u_ether.c: SET_NETDEV_DEVTYPE(net, &gadget_type); >drivers/usb/gadget/function/u_ether.c: SET_NETDEV_DEVTYPE(net, &gadget_type); >net/8021q/vlan_dev.c: SET_NETDEV_DEVTYPE(dev, &vlan_type); >net/bluetooth/6lowpan.c:SET_NETDEV_DEVTYPE(netdev, &bt_type); >net/bluetooth/bnep/core.c: SET_NETDEV_DEVTYPE(dev, &bnep_type); >net/bridge/br_device.c: SET_NETDEV_DEVTYPE(dev, &br_type); >net/dsa/slave.c:SET_NETDEV_DEVTYPE(slave_dev, &dsa_type); >net/hsr/hsr_device.c: SET_NETDEV_DEVTYPE(dev, &hsr_type); >net/l2tp/l2tp_eth.c:SET_NETDEV_DEVTYPE(dev, &l2tpeth_type); >net/wireless/core.c:SET_NETDEV_DEVTYPE(dev, &wiphy_type); > >So for all these devices we have to add duplicate information now? The fact the things are done now it a wrong way does not justify extending UAPI in the same wrong way. So yes, change them all if you need it. The "ethernet-subtype" should not an attribute of "struct device" but rather "struct net_device". > >I actually don't like that idea. I can rename it to IFLA_FOO or any other >proposal, but adding the same information twice is pointless. So don't expose the "struct device" attribute over rtnetlink. Use "struct device"-specific interface for that. > >>> We can revive the patches for adding this information as IFLA_INFO_KIND, >>> but last time they where reverted since NetworkManager did all the wrong >>> decisions based on that. That part is fixed now and we can put that back >>> and declare the sub-type classification of Ethernet device in two places. >>> Or we just take the information that we added a long time ago for exactly >>> this sub-classification and provide them to userspace via RTNL. >> >> IFLA_INFO_KIND serves for different purpose. It is for drivers that >> register rtnl_link_ops. I
[RFT net-next] net: stmmac: Rework coalesce timer and fix multi-queue races
[ As for now this is only for testing! ] This follows David Miller advice and tries to fix coalesce timer in multi-queue scenarios. We are now using per-queue coalesce values and per-queue TX timer. This assumes that tx_queues == rx_queues, which can not be necessarly true. Official patch will need to have this fixed. Coalesce timer default values was changed to 1ms and the coalesce frames to 25. Tested in B2B setup between XGMAC2 and GMAC5. Signed-off-by: Jose Abreu Cc: Jerome Brunet Cc: Martin Blumenstingl Cc: David S. Miller Cc: Joao Pinto Cc: Giuseppe Cavallaro Cc: Alexandre Torgue --- drivers/net/ethernet/stmicro/stmmac/common.h | 4 +- drivers/net/ethernet/stmicro/stmmac/stmmac.h | 6 +- drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 117 +- 3 files changed, 75 insertions(+), 52 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h index 1854f270ad66..b1b305f8f414 100644 --- a/drivers/net/ethernet/stmicro/stmmac/common.h +++ b/drivers/net/ethernet/stmicro/stmmac/common.h @@ -258,10 +258,10 @@ struct stmmac_safety_stats { #define MAX_DMA_RIWT 0xff #define MIN_DMA_RIWT 0x20 /* Tx coalesce parameters */ -#define STMMAC_COAL_TX_TIMER 4 +#define STMMAC_COAL_TX_TIMER 1000 #define STMMAC_MAX_COAL_TX_TICK10 #define STMMAC_TX_MAX_FRAMES 256 -#define STMMAC_TX_FRAMES 64 +#define STMMAC_TX_FRAMES 25 /* Packets types */ enum packets_types { diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h index 76649adf8fb0..db42a5f03a5f 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h @@ -48,6 +48,9 @@ struct stmmac_tx_info { /* Frequently used values are kept adjacent for cache effect */ struct stmmac_tx_queue { + u32 tx_count_frames; + int tx_timer_active; + struct timer_list txtimer; u32 queue_index; struct stmmac_priv *priv_data; struct dma_extended_desc *dma_etx cacheline_aligned_in_smp; @@ -109,15 +112,12 @@ struct stmmac_pps_cfg { struct stmmac_priv { /* Frequently used values are kept adjacent for cache effect */ - u32 tx_count_frames; u32 tx_coal_frames; u32 tx_coal_timer; - bool tx_timer_armed; int tx_coalesce; int hwts_tx_en; bool tx_path_in_lpi_mode; - struct timer_list txtimer; bool tso; unsigned int dma_buf_sz; diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c index ff1ffb46198a..ae26a6e8608e 100644 --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c @@ -2034,7 +2034,6 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv) u32 channels_to_check = tx_channel_count > rx_channel_count ? tx_channel_count : rx_channel_count; u32 chan; - bool poll_scheduled = false; int status[max_t(u32, MTL_MAX_TX_QUEUES, MTL_MAX_RX_QUEUES)]; /* Make sure we never check beyond our status buffer. */ @@ -2058,7 +2057,6 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv) if (likely(napi_schedule_prep(&rx_q->napi))) { stmmac_disable_dma_irq(priv, priv->ioaddr, chan); __napi_schedule(&rx_q->napi); - poll_scheduled = true; } } } @@ -2067,21 +2065,13 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv) * If we didn't schedule poll, see if any DMA channel (used by tx) has a * completed transmission, if so, call stmmac_poll (once). */ - if (!poll_scheduled) { - for (chan = 0; chan < tx_channel_count; chan++) { - if (status[chan] & handle_tx) { - /* It doesn't matter what rx queue we choose -* here. We use 0 since it always exists. -*/ - struct stmmac_rx_queue *rx_q = - &priv->rx_queue[0]; + for (chan = 0; chan < tx_channel_count; chan++) { + if (status[chan] & handle_tx) { + struct stmmac_rx_queue *rx_q = &priv->rx_queue[chan]; - if (likely(napi_schedule_prep(&rx_q->napi))) { - stmmac_disable_dma_irq(priv, - priv->ioaddr, chan); - __napi_schedule(&rx_q->napi); - } - break; + if (likely(napi_schedule_prep(&rx_q->napi))) { +
[PATCH net-next 2/2] netlink: ipv6 MLD join notifications
Some userspace applications need to know about MLD joins from the kernel for 2 reasons: 1. To allow the programming of multicast MAC filters in hardware 2. To form a multicast FORUS list for non link-local multicast groups to be sent to the kernel and from there to the interested party. (1) can be fulfilled but simply sending the hardware multicast MAC address to be programmed but (2) requires the L3 address to be sent since this cannot be constructed from the MAC address whereas the reverse translation is a standard library function. This commit provides addition and deletion of multicast addresses using the RTM_NEWADDR and RTM_DELADDR messages. It also provides the RTM_GETADDR extension to allow multicast join state to be read from the kernel. Signed-off-by: Patrick Ruddy --- net/ipv6/addrconf.c | 44 +- net/ipv6/mcast.c| 66 + 2 files changed, 98 insertions(+), 12 deletions(-) diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c index d51a8c0b3372..e4e7362f9298 100644 --- a/net/ipv6/addrconf.c +++ b/net/ipv6/addrconf.c @@ -4855,11 +4855,13 @@ static int inet6_fill_ifaddr(struct sk_buff *skb, struct inet6_ifaddr *ifa, } static int inet6_fill_ifmcaddr(struct sk_buff *skb, struct ifmcaddr6 *ifmca, - u32 portid, u32 seq, int event, u16 flags) + u32 portid, u32 seq, int event, u16 flags) { struct nlmsghdr *nlh; u8 scope = RT_SCOPE_UNIVERSE; int ifindex = ifmca->idev->dev->ifindex; + int addr_type = (event == RTM_GETMULTICAST) ? IFA_MULTICAST : + IFA_ADDRESS; if (ipv6_addr_scope(&ifmca->mca_addr) & IFA_SITE) scope = RT_SCOPE_SITE; @@ -4869,7 +4871,7 @@ static int inet6_fill_ifmcaddr(struct sk_buff *skb, struct ifmcaddr6 *ifmca, return -EMSGSIZE; put_ifaddrmsg(nlh, 128, IFA_F_PERMANENT, scope, ifindex); - if (nla_put_in6_addr(skb, IFA_MULTICAST, &ifmca->mca_addr) < 0 || + if (nla_put_in6_addr(skb, addr_type, &ifmca->mca_addr) < 0 || put_cacheinfo(skb, ifmca->mca_cstamp, ifmca->mca_tstamp, INFINITY_LIFE_TIME, INFINITY_LIFE_TIME) < 0) { nlmsg_cancel(skb, nlh); @@ -4916,7 +4918,7 @@ enum addr_type_t { /* called with rcu_read_lock() */ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb, struct netlink_callback *cb, enum addr_type_t type, - int s_ip_idx, int *p_ip_idx) + int s_ip_idx, int *p_ip_idx, int msg_type) { struct ifmcaddr6 *ifmca; struct ifacaddr6 *ifaca; @@ -4935,7 +4937,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb, err = inet6_fill_ifaddr(skb, ifa, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, - RTM_NEWADDR, + msg_type, NLM_F_MULTI); if (err < 0) break; @@ -4952,7 +4954,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb, err = inet6_fill_ifmcaddr(skb, ifmca, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, - RTM_GETMULTICAST, + msg_type, NLM_F_MULTI); if (err < 0) break; @@ -4967,7 +4969,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb, err = inet6_fill_ifacaddr(skb, ifaca, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, - RTM_GETANYCAST, + msg_type, NLM_F_MULTI); if (err < 0) break; @@ -4982,7 +4984,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb, } static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb, - enum addr_type_t type) + enum addr_type_t type, int msg_type) { struct net *net = sock_net(skb->sk); int h, s_h; @@ -5012,7 +5014,7 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb, goto cont; if (in6_dump_addrs(idev, skb, cb, type, -
kernels >= v4.12 oops/crash with ipsec-traffic: partly bisected
Hello, kernels >= 4.12 do not work on one of our main routers. They crash as soon as ipsec-tunnels are configured and ipsec-traffic actually flows. Just configuring ipsec (that is starting strongswan) does not trigger the oops. I finally found time to bisect that. Though I have not completed that yet, I already narrowed it down to the following commits good: d24406c85d123df773bc4df88ad5da2233896919 udp: call dst_hold_safe() in udp_sk_rx_set_dst() bad: 5b7c9a8ff828287af5aebe93e707271bf1a82cc3 net: remove dst gc related code Commits in between are almost all changes to remove dst gc. Now we have other machines which run just fine with the very same kernels doing ipsec. They differ insofar as they have much less cores, do not use the ixgbe driver, do not have 10G and terminate only a few tunnels instead of hundreds. I already tested distribution kernels > 4.12 from debian, they also crash. Regards, -- Wolfgang Walter Studentenwerk München Anstalt des öffentlichen Rechts
Re: [Patch net-nnext] net_sched: add missing tcf_lock for act_connmark
On Wed 29 Aug 2018 at 17:15, Cong Wang wrote: > According to the new locking rule, we have to take tcf_lock > for both ->init() and ->dump(), as RTNL will be removed. > However, it is missing for act_connmark. Thank you for finding and fixing this! > > Cc: Vlad Buslov > Signed-off-by: Cong Wang > --- > net/sched/act_connmark.c | 11 --- > 1 file changed, 8 insertions(+), 3 deletions(-) > > diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c > index e869c0ee63c8..8475913f2070 100644 > --- a/net/sched/act_connmark.c > +++ b/net/sched/act_connmark.c > @@ -143,8 +143,10 @@ static int tcf_connmark_init(struct net *net, struct > nlattr *nla, > return -EEXIST; > } > /* replacing action and zone */ > + spin_lock_bh(&ci->tcf_lock); > ci->tcf_action = parm->action; > ci->zone = parm->zone; > + spin_unlock_bh(&ci->tcf_lock); > ret = 0; > } > > @@ -156,16 +158,16 @@ static inline int tcf_connmark_dump(struct sk_buff > *skb, struct tc_action *a, > { > unsigned char *b = skb_tail_pointer(skb); > struct tcf_connmark_info *ci = to_connmark(a); > - > struct tc_connmark opt = { > .index = ci->tcf_index, > .refcnt = refcount_read(&ci->tcf_refcnt) - ref, > .bindcnt = atomic_read(&ci->tcf_bindcnt) - bind, > - .action = ci->tcf_action, > - .zone = ci->zone, > }; > struct tcf_t t; > > + spin_lock_bh(&ci->tcf_lock); > + opt.action = ci->tcf_action; > + opt.zone = ci->zone; > if (nla_put(skb, TCA_CONNMARK_PARMS, sizeof(opt), &opt)) > goto nla_put_failure; > > @@ -173,9 +175,12 @@ static inline int tcf_connmark_dump(struct sk_buff *skb, > struct tc_action *a, > if (nla_put_64bit(skb, TCA_CONNMARK_TM, sizeof(t), &t, > TCA_CONNMARK_PAD)) > goto nla_put_failure; > + spin_unlock_bh(&ci->tcf_lock); > > return skb->len; > + > nla_put_failure: > + spin_unlock_bh(&ci->tcf_lock); > nlmsg_trim(skb, b); > return -1; > }
[PATCH net-next 1/2] netlink: ipv4 IGMP join notifications
Some userspace applications need to know about IGMP joins from the kernel for 2 reasons 1. To allow the programming of multicast MAC filters in hardware 2. To form a multicast FORUS list for non link-local multicast groups to be sent to the kernel and from there to the interested party. (1) can be fulfilled but simply sending the hardware multicast MAC address to be programmed but (2) requires the L3 address to be sent since this cannot be constructed from the MAC address whereas the reverse translation is a standard library function. This commit provides addition and deletion of multicast addresses using the RTM_NEWADDR and RTM_DELADDR messages. It also provides the RTM_GETADDR extension to allow multicast join state to be read from the kernel. Signed-off-by: Patrick Ruddy --- include/linux/igmp.h | 2 + net/ipv4/devinet.c | 39 +-- net/ipv4/igmp.c | 90 3 files changed, 120 insertions(+), 11 deletions(-) diff --git a/include/linux/igmp.h b/include/linux/igmp.h index 119f53941c12..1fb417865e7d 100644 --- a/include/linux/igmp.h +++ b/include/linux/igmp.h @@ -130,6 +130,8 @@ extern void ip_mc_unmap(struct in_device *); extern void ip_mc_remap(struct in_device *); extern void ip_mc_dec_group(struct in_device *in_dev, __be32 addr); extern void ip_mc_inc_group(struct in_device *in_dev, __be32 addr); +extern int ip_mc_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb, +struct net_device *dev); int ip_mc_check_igmp(struct sk_buff *skb, struct sk_buff **skb_trimmed); #endif diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index ea4bd8a52422..42f7dcc4fb5e 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -57,6 +57,7 @@ #endif #include #include +#include #include #include @@ -1651,6 +1652,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb) int h, s_h; int idx, s_idx; int ip_idx, s_ip_idx; + int multicast, mcast_idx; struct net_device *dev; struct in_device *in_dev; struct in_ifaddr *ifa; @@ -1659,6 +1661,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb) s_h = cb->args[0]; s_idx = idx = cb->args[1]; s_ip_idx = ip_idx = cb->args[2]; + multicast = cb->args[3]; + mcast_idx = cb->args[4]; for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) { idx = 0; @@ -1675,18 +1679,29 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb) if (!in_dev) goto cont; - for (ifa = in_dev->ifa_list, ip_idx = 0; ifa; -ifa = ifa->ifa_next, ip_idx++) { - if (ip_idx < s_ip_idx) - continue; - if (inet_fill_ifaddr(skb, ifa, -NETLINK_CB(cb->skb).portid, -cb->nlh->nlmsg_seq, -RTM_NEWADDR, NLM_F_MULTI) < 0) { - rcu_read_unlock(); - goto done; + if (!multicast) { + for (ifa = in_dev->ifa_list, ip_idx = 0; ifa; +ifa = ifa->ifa_next, ip_idx++) { + if (ip_idx < s_ip_idx) + continue; + if (inet_fill_ifaddr(skb, ifa, + NETLINK_CB(cb->skb).portid, +cb->nlh->nlmsg_seq, +RTM_NEWADDR, +NLM_F_MULTI) < 0) { + rcu_read_unlock(); + goto done; + } + nl_dump_check_consistent(cb, + nlmsg_hdr(skb)); } - nl_dump_check_consistent(cb, nlmsg_hdr(skb)); + /* set for multicast loop */ + multicast++; + } + /* loop over multicast addresses */ + if (ip_mc_dump_ifaddr(skb, cb, dev) < 0) { + rcu_read_unlock(); + goto done; } cont: idx++; @@ -1698,6 +1713,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb) cb->args[0]
Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support
On 2018-08-29 21:14, Jakub Kicinski wrote: > On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote: >> From: Björn Töpel >> >> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of >> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are >> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain >> queue. >> >> All AF_XDP specific functions are added to a new file, i40e_xsk.c. >> >> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS >> will allocate a new buffer and copy the zero-copy frame prior passing >> it to the kernel stack. >> >> Signed-off-by: Björn Töpel > > Mm.. I'm surprised you don't run into buffer reuse issues that I had > when playing with AF_XDP. What happens in i40e if someone downs the > interface? Will UMEMs get destroyed? Will the RX buffers get freed? > The UMEM will linger in the driver until the sockets are dead. > I'll shortly send an RFC with my quick and dirty RX buffer reuse queue, > FWIW. > Some background for folks that don't know the details: A zero-copy capable driver picks buffers off the fill ring and places them on the hardware Rx ring to be completed at a later point when DMA is complete. Analogous for the Tx side; The driver picks buffers off the Tx ring and places them on the Tx hardware ring. In the typical flow, the Rx buffer will be placed onto an Rx ring (completed to the user), and the Tx buffer will be placed on the completion ring to notify the user that the transfer is done. However, if the driver needs to tear down the hardware rings for some reason (interface goes down, reconfiguration and such), what should be done with the Rx and Tx buffers that has been given to the driver? So, to frame the problem: What should a driver do when this happens, so that buffers aren't leaked? Note that when the UMEM is going down, there's no need to complete anything, since the sockets are dying/dead already. This is, as you state, a missing piece in the implementation and needs to be fixed. Now on to possible solutions: 1. Complete the buffers back to the user. For Tx, this is probably the best way -- just place the buffers onto the completion ring. For Rx, we can give buffers back to user space by setting the length in the Rx descriptor to zero And putting them on the Rx ring. However, one complication here is that we do not have any back-pressure mechanism for the Rx side like we have on Tx. If the Rx ring(s) is (are) full the kernel will have to leak them or implement a retry mechanism (ugly and should be avoided). Another option to solve this without needing any retry or leaking for Rx is to implement the same back-pressure mechanism that we have on the Tx path in the Rx path. In the Tx path, the driver will only get a Tx packet to send if there is space for it in the completion ring. On Rx, this would be that the driver would only get a buffer from the fill ring if there is space for it to put it on the Rx ring. The drawback of this is that it would likely impact performance negatively since the Rx ring would have to be touch one more time (in the Tx path, it increased performance since it made it possible to implement the Tx path without any buffering), but it would guarantee that all buffers can always be returned to user space making solution this a viable option. 2. Store the buffers internally in the driver, and make sure that they are inserted into the "normal flow" again. For Rx that would be putting the buffers back into the allocation scheme that the driver is using. For Tx, placing the buffers back onto the Tx HW ring (plus all the logic for making sure that all corner cases work). 3. Mark the socket(s) as in error state, en require the user to redo the setup. This is bit harsh... For i40e I think #2 for Rx (buffers reside in kernel, return to allocator) and #1 for Tx (complete to userland). Your RFC is plumbing to implement #2 for Rx in a driver. I'm not a fan of extending the umem with the "reuse queue". This decision is really up the driver. Some driver might prefer another scheme, or simply prefer storing the buffers in another manner. Looking forward, as both you and Jesper has alluded to, we need a proper allocator for zero-copy. Then it would be a matter of injecting the Rx buffers back to the allocator. I'll send out a patch, so that we don't leak the buffers, but I'd like to hear your thoughts on what the behavior should be. And what should the behavior be when the netdev is removed from the kernel? And thanks for looking into the code! Björn
[PATCH net] tcp: do not restart timewait timer on rst reception
RFC 1337 says: ''Ignore RST segments in TIME-WAIT state. If the 2 minute MSL is enforced, this fix avoids all three hazards.'' So with net.ipv4.tcp_rfc1337=1, expected behaviour is to have TIME-WAIT sk expire rather than removing it instantly when a reset is received. However, Linux will also re-start the TIME-WAIT timer. This causes connect to fail when tying to re-use ports or very long delays (until syn retry interval exceeds MSL). packetdrill test case: // Demonstrate bogus rearming of TIME-WAIT timer in rfc1337 mode. `sysctl net.ipv4.tcp_rfc1337=1` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 0.000 bind(3, ..., ...) = 0 0.000 listen(3, 1) = 0 0.100 < S 0:0(0) win 29200 0.100 > S. 0:0(0) ack 1 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 // Receive first segment 0.310 < P. 1:1001(1000) ack 1 win 46 // Send one ACK 0.310 > . 1:1(0) ack 1001 // read 1000 byte 0.310 read(4, ..., 1000) = 1000 // Application writes 100 bytes 0.350 write(4, ..., 100) = 100 0.350 > P. 1:101(100) ack 1001 // ACK 0.500 < . 1001:1001(0) ack 101 win 257 // close the connection 0.600 close(4) = 0 0.600 > F. 101:101(0) ack 1001 win 244 // Our side is in FIN_WAIT_1 & waits for ack to fin 0.7 < . 1001:1001(0) ack 102 win 244 // Our side is in FIN_WAIT_2 with no outstanding data. 0.8 < F. 1001:1001(0) ack 102 win 244 0.8 > . 102:102(0) ack 1002 win 244 // Our side is now in TIME_WAIT state, send ack for fin. 0.9 < F. 1002:1002(0) ack 102 win 244 0.9 > . 102:102(0) ack 1002 win 244 // Peer reopens with in-window SYN: 1.000 < S 1000:1000(0) win 9200 // Therefore, reply with ACK. 1.000 > . 102:102(0) ack 1002 win 244 // Peer sends RST for this ACK. Normally this RST results // in tw socket removal, but rfc1337=1 setting prevents this. 1.100 < R 1002:1002(0) win 244 // second syn. Due to rfc1337=1 expect another pure ACK. 31.0 < S 1000:1000(0) win 9200 31.0 > . 102:102(0) ack 1002 win 244 // .. and another RST from peer. 31.1 < R 1002:1002(0) win 244 31.2 `echo no timer restart;ss -m -e -a -i -n -t -o state TIME-WAIT` // third syn after one minute. Time-Wait socket should have expired by now. 63.0 < S 1000:1000(0) win 9200 // so we expect a syn-ack & 3whs to proceed from here on. 63.0 > S. 0:0(0) ack 1 Without this patch, 'ss' shows restarts of tw timer and last packet is thus just another pure ack, more than one minute later. This restores the original code from commit 283fd6cf0be690a83 ("Merge in ANK networking jumbo patch") in netdev-vger-cvs.git . For some reason the else branch was removed/lost in 1f28b683339f7 ("Merge in TCP/UDP optimizations and [..]") and timer restart became unconditional. Reported-by: Michal Tesar Signed-off-by: Florian Westphal --- net/ipv4/tcp_minisocks.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index 1dda1341a223..b690132f5da2 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -184,8 +184,9 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb, inet_twsk_deschedule_put(tw); return TCP_TW_SUCCESS; } + } else { + inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN); } - inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN); if (tmp_opt.saw_tstamp) { tcptw->tw_ts_recent = tmp_opt.rcv_tsval; -- 2.16.4
Re: [PATCH bpf-next 0/3] bpf: implement percpu map pretty print for bpffs and bpftool
On 08/29/2018 11:43 PM, Yonghong Song wrote: > Commit a26ca7c982cb ("bpf: btf: Add pretty print support to the > basic arraymap") and Commit 699c86d6ec21 ("bpf: btf: add pretty print > for hash/lru_hash maps") added bpffs pretty print for array, hash and > lru hash maps. The pretty print gives users a structurally formatted > dump for keys/values which much easy to understand than raw bytes. > > This patch set implemented bpffs pretty print support for > percpu arraymap, percpu hashmap and percpu lru hashmap. > For complex key/value types, the pretty print here is even more useful > due to > . large volumne of data making it even harder to correlate bytes > to a particular field in a particular cpu. > . kernel rounds the value size for each cpu to multiple of 8. > User has to be aware of this otherwise wrong value may be > derived from cpu 1/2/... > > For example, we may have a bpffs pretty print like below: >43602: { > cpu0: > {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} > cpu1: > {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} > cpu2: > {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} > cpu3: > {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO} >} > for a percpu map. > > This patch also added percpu formatted print on bpftool. For example, > bpftool may print like below: > { > "key": 0, > "values": [{ > "cpu": 0, > "value": { > "ui32": 0, > "ui16": 0, > } > },{ > "cpu": 1, > "value": { > "ui32": 1, > "ui16": 0, > } > },{ > "cpu": 2, > "value": { > "ui32": 2, > "ui16": 0, > } > },{ > "cpu": 3, > "value": { > "ui32": 3, > "ui16": 0, > } > } > ] > } > > Patch #1 implemented bpffs pretty print for percpu arraymap/hash/lru_hash > in kernel. Patch #2 added the test case in tools bpf selftest test_btf. > Patch #3 added percpu map btf based dump. > > Yonghong Song (3): > bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash > tools/bpf: add bpffs percpu map pretty print tests in test_btf > tools/bpf: bpftool: add btf percpu map formated dump > > kernel/bpf/arraymap.c | 24 + > kernel/bpf/hashtab.c | 31 ++ > tools/bpf/bpftool/map.c| 33 +- > tools/testing/selftests/bpf/test_btf.c | 179 > ++--- > 4 files changed, 230 insertions(+), 37 deletions(-) Applied to bpf-next, thanks Yonghong!
Re: [PATCH bpf-next] xsk: include XDP meta data in AF_XDP frames
On 08/30/2018 10:09 AM, Björn Töpel wrote: > From: Björn Töpel > > Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did > not include XDP meta data in the data buffers copied out to the user > application. > > In this commit, we check if meta data is available, and if so, it is > prepended to the frame. > > Signed-off-by: Björn Töpel > --- > net/xdp/xsk.c | 37 - > 1 file changed, 28 insertions(+), 9 deletions(-) > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > index 4e937cd7c17d..817e4cee1540 100644 > --- a/net/xdp/xsk.c > +++ b/net/xdp/xsk.c > @@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr); > > static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len) > { > - void *buffer; > + void *to_buf, *from_buf; > + u32 metalen; > u64 addr; > int err; > > if (!xskq_peek_addr(xs->umem->fq, &addr) || > - len > xs->umem->chunk_size_nohr) { > + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) { > xs->rx_dropped++; > return -ENOSPC; > } > > addr += xs->umem->headroom; > > - buffer = xdp_umem_get_data(xs->umem, addr); > - memcpy(buffer, xdp->data, len); > + if (xdp_data_meta_unsupported(xdp)) { Nit: Probably makes sense to wrap with unlikely() since we expect all drivers to implement this. > + from_buf = xdp->data; > + metalen = 0; > + } else { > + from_buf = xdp->data_meta; > + metalen = xdp->data - xdp->data_meta; > + } > + > + to_buf = xdp_umem_get_data(xs->umem, addr); > + memcpy(to_buf, from_buf, len + metalen); > + addr += metalen; > err = xskq_produce_batch_desc(xs->rx, addr, len); > if (!err) { > xskq_discard_addr(xs->umem->fq); > @@ -111,8 +121,8 @@ void xsk_flush(struct xdp_sock *xs) > > int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) > { > - u32 len = xdp->data_end - xdp->data; > - void *buffer; > + u32 metalen, len = xdp->data_end - xdp->data; > + void *to_buf, *from_buf; > u64 addr; > int err; > > @@ -120,15 +130,24 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct > xdp_buff *xdp) > return -EINVAL; > > if (!xskq_peek_addr(xs->umem->fq, &addr) || > - len > xs->umem->chunk_size_nohr) { > + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) { > xs->rx_dropped++; > return -ENOSPC; > } > > addr += xs->umem->headroom; > > - buffer = xdp_umem_get_data(xs->umem, addr); > - memcpy(buffer, xdp->data, len); > + if (xdp_data_meta_unsupported(xdp)) { > + from_buf = xdp->data; > + metalen = 0; Note that this condition should be dead code. netif_receive_generic_xdp() sets xdp->data_meta to xdp->data, so all good here and above is never hit. > + } else { > + from_buf = xdp->data_meta; > + metalen = xdp->data - xdp->data_meta; > + } > + > + to_buf = xdp_umem_get_data(xs->umem, addr); > + memcpy(to_buf, from_buf, len + metalen); > + addr += metalen; > err = xskq_produce_batch_desc(xs->rx, addr, len); > if (!err) { > xskq_discard_addr(xs->umem->fq); >
Re: [PATCH bpf-next] xsk: include XDP meta data in AF_XDP frames
Den tors 30 aug. 2018 kl 14:37 skrev Daniel Borkmann : > > On 08/30/2018 10:09 AM, Björn Töpel wrote: > > From: Björn Töpel > > > > Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did > > not include XDP meta data in the data buffers copied out to the user > > application. > > > > In this commit, we check if meta data is available, and if so, it is > > prepended to the frame. > > > > Signed-off-by: Björn Töpel > > --- > > net/xdp/xsk.c | 37 - > > 1 file changed, 28 insertions(+), 9 deletions(-) > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c > > index 4e937cd7c17d..817e4cee1540 100644 > > --- a/net/xdp/xsk.c > > +++ b/net/xdp/xsk.c > > @@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr); > > > > static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len) > > { > > - void *buffer; > > + void *to_buf, *from_buf; > > + u32 metalen; > > u64 addr; > > int err; > > > > if (!xskq_peek_addr(xs->umem->fq, &addr) || > > - len > xs->umem->chunk_size_nohr) { > > + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) { > > xs->rx_dropped++; > > return -ENOSPC; > > } > > > > addr += xs->umem->headroom; > > > > - buffer = xdp_umem_get_data(xs->umem, addr); > > - memcpy(buffer, xdp->data, len); > > + if (xdp_data_meta_unsupported(xdp)) { > > Nit: Probably makes sense to wrap with unlikely() since we expect all > drivers to implement this. > I'll send a v2 fixing this... > > + from_buf = xdp->data; > > + metalen = 0; > > + } else { > > + from_buf = xdp->data_meta; > > + metalen = xdp->data - xdp->data_meta; > > + } > > + > > + to_buf = xdp_umem_get_data(xs->umem, addr); > > + memcpy(to_buf, from_buf, len + metalen); > > + addr += metalen; > > err = xskq_produce_batch_desc(xs->rx, addr, len); > > if (!err) { > > xskq_discard_addr(xs->umem->fq); > > @@ -111,8 +121,8 @@ void xsk_flush(struct xdp_sock *xs) > > > > int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) > > { > > - u32 len = xdp->data_end - xdp->data; > > - void *buffer; > > + u32 metalen, len = xdp->data_end - xdp->data; > > + void *to_buf, *from_buf; > > u64 addr; > > int err; > > > > @@ -120,15 +130,24 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct > > xdp_buff *xdp) > > return -EINVAL; > > > > if (!xskq_peek_addr(xs->umem->fq, &addr) || > > - len > xs->umem->chunk_size_nohr) { > > + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) { > > xs->rx_dropped++; > > return -ENOSPC; > > } > > > > addr += xs->umem->headroom; > > > > - buffer = xdp_umem_get_data(xs->umem, addr); > > - memcpy(buffer, xdp->data, len); > > + if (xdp_data_meta_unsupported(xdp)) { > > + from_buf = xdp->data; > > + metalen = 0; > > Note that this condition should be dead code. netif_receive_generic_xdp() > sets xdp->data_meta to xdp->data, so all good here and above is never hit. > ...and this! Thanks for pointing this out! Björn > > + } else { > > + from_buf = xdp->data_meta; > > + metalen = xdp->data - xdp->data_meta; > > + } > > + > > + to_buf = xdp_umem_get_data(xs->umem, addr); > > + memcpy(to_buf, from_buf, len + metalen); > > + addr += metalen; > > err = xskq_produce_batch_desc(xs->rx, addr, len); > > if (!err) { > > xskq_discard_addr(xs->umem->fq); > > >
[PATCH 2/2] vti6: do not check for ignore_df in order to update pmtu
Before commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching and reporting on xmit"), skb was scrubbed before checking for ignore_df. The scrubbing meant ignore_df was false, making the check irrelevant. Now that the scrubbing happens after that, some packets might fail the checking and dst will not have its pmtu updated. Not only that, but too big skb will be potentially passed down to __xfrm6_output, causing it to fail to transmit but not free the skb, causing a leak of skb, and consequentially a leak of dst references. After running pmtu.sh, that shows as failure to unregister devices in a namespace: [ 311.397671] unregister_netdevice: waiting for veth_b to become free. Usage count = 1 Signed-off-by: Thadeu Lima de Souza Cascardo --- net/ipv6/ip6_vti.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c index c72ae3a4fe09..fbd3752ea587 100644 --- a/net/ipv6/ip6_vti.c +++ b/net/ipv6/ip6_vti.c @@ -481,7 +481,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl) } mtu = dst_mtu(dst); - if (!skb->ignore_df && skb->len > mtu) { + if (skb->len > mtu) { skb_dst_update_pmtu(skb, mtu); if (skb->protocol == htons(ETH_P_IPV6)) { -- 2.17.1
[PATCH 1/2] xfrm6: call kfree_skb when skb is toobig
After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching and reporting on xmit"), some too big skbs might be potentially passed down to __xfrm6_output, causing it to fail to transmit but not free the skb, causing a leak of skb, and consequentially a leak of dst references. After running pmtu.sh, that shows as failure to unregister devices in a namespace: [ 311.397671] unregister_netdevice: waiting for veth_b to become free. Usage count = 1 The fix is to call kfree_skb in case of transmit failures. Signed-off-by: Thadeu Lima de Souza Cascardo --- net/ipv6/xfrm6_output.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c index 5959ce9620eb..6a74080005cf 100644 --- a/net/ipv6/xfrm6_output.c +++ b/net/ipv6/xfrm6_output.c @@ -170,9 +170,11 @@ static int __xfrm6_output(struct net *net, struct sock *sk, struct sk_buff *skb) if (toobig && xfrm6_local_dontfrag(skb)) { xfrm6_local_rxpmtu(skb, mtu); + kfree_skb(skb); return -EMSGSIZE; } else if (!skb->ignore_df && toobig && skb->sk) { xfrm_local_error(skb, mtu); + kfree_skb(skb); return -EMSGSIZE; } -- 2.17.1
[PATCH bpf-next v2] xsk: include XDP meta data in AF_XDP frames
From: Björn Töpel Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did not include XDP meta data in the data buffers copied out to the user application. In this commit, we check if meta data is available, and if so, it is prepended to the frame. Signed-off-by: Björn Töpel --- net/xdp/xsk.c | 24 ++-- 1 file changed, 18 insertions(+), 6 deletions(-) diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 4e937cd7c17d..569048e299df 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr); static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len) { - void *buffer; + void *to_buf, *from_buf; + u32 metalen; u64 addr; int err; if (!xskq_peek_addr(xs->umem->fq, &addr) || - len > xs->umem->chunk_size_nohr) { + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) { xs->rx_dropped++; return -ENOSPC; } addr += xs->umem->headroom; - buffer = xdp_umem_get_data(xs->umem, addr); - memcpy(buffer, xdp->data, len); + if (unlikely(xdp_data_meta_unsupported(xdp))) { + from_buf = xdp->data; + metalen = 0; + } else { + from_buf = xdp->data_meta; + metalen = xdp->data - xdp->data_meta; + } + + to_buf = xdp_umem_get_data(xs->umem, addr); + memcpy(to_buf, from_buf, len + metalen); + addr += metalen; err = xskq_produce_batch_desc(xs->rx, addr, len); if (!err) { xskq_discard_addr(xs->umem->fq); @@ -111,6 +121,7 @@ void xsk_flush(struct xdp_sock *xs) int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) { + u32 metalen = xdp->data - xdp->data_meta; u32 len = xdp->data_end - xdp->data; void *buffer; u64 addr; @@ -120,7 +131,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) return -EINVAL; if (!xskq_peek_addr(xs->umem->fq, &addr) || - len > xs->umem->chunk_size_nohr) { + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) { xs->rx_dropped++; return -ENOSPC; } @@ -128,7 +139,8 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) addr += xs->umem->headroom; buffer = xdp_umem_get_data(xs->umem, addr); - memcpy(buffer, xdp->data, len); + memcpy(buffer, xdp->data_meta, len + metalen); + addr += metalen; err = xskq_produce_batch_desc(xs->rx, addr, len); if (!err) { xskq_discard_addr(xs->umem->fq); -- 2.17.1
Re: [PATCH 1/2] xfrm6: call kfree_skb when skb is toobig
2018-08-30, 09:58:16 -0300, Thadeu Lima de Souza Cascardo wrote: > After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching > and reporting on xmit"), some too big skbs might be potentially passed down to > __xfrm6_output, causing it to fail to transmit but not free the skb, causing a > leak of skb, and consequentially a leak of dst references. > > After running pmtu.sh, that shows as failure to unregister devices in a > namespace: > > [ 311.397671] unregister_netdevice: waiting for veth_b to become free. Usage > count = 1 > > The fix is to call kfree_skb in case of transmit failures. > > Signed-off-by: Thadeu Lima de Souza Cascardo Reviewed-by: Sabrina Dubroca I was about to post the same patch. Arguably, the commit introducing this bug is the one that added those "return -EMSGSIZE" to __xfrm6_output without freeing. Either way, it's missing a Fixes: tag, which should be one of those, or both: Fixes: d6990976af7c ("vti6: fix PMTU caching and reporting on xmit") Fixes: dd767856a36e ("xfrm6: Don't call icmpv6_send on local error") -- Sabrina
Re: [PATCH bpf-next v2] xsk: include XDP meta data in AF_XDP frames
On 08/30/2018 03:12 PM, Björn Töpel wrote: > From: Björn Töpel > > Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did > not include XDP meta data in the data buffers copied out to the user > application. > > In this commit, we check if meta data is available, and if so, it is > prepended to the frame. > > Signed-off-by: Björn Töpel Applied to bpf-next, thanks Björn!
Re: [PATCH 00/15] soc: octeontx2: Add RVU admin function driver
> > > My feeling overall is that we need a review from the network driver > > > folks more than the arm-soc team etc, and that maybe the driver > > > as a whole should go into drivers/net/ethernet. > > > > This driver doesn't handle any network IO and moreever this driver has to > > handle > > configuration requests from crypto driver as well. There will be > > separate network and > > crypto drivers which will be upstreamed into drivers/net/ethernet and > > drivers/crypto. > > And in future silicons there will be different types of functional > > blocks which will be > > added into this resource virtualization unit (RVU). Hence i thought > > this driver is not a > > right fit in drivers/net/ethernet. Hi Sunil Do you have a git branch for everything? I would like to look at the actual Ethernet driver, and the full API this driver exports to other drivers. I think there real question here is, do you have split between this driver and the actual device drivers in the right place? For me, link up/down detection should be in the Ethernet driver, since it is not shared with the crypto driver. Thanks Andrew
[PATCH bpf-next 0/3] xsk: misc code cleanup
This patch set cleans up two code style issues with the xsk zero-copy code. The resulting code is smaller and simpler. Patch 1: Removes a potential compiler warning reported by the Intel 0-DAY kernel test infrastructure. Patches 2-3: Removes the xdp_umem_props structure. At some point, it was used to break a dependency, but the members are these days much better off in the xdp_umem since the dependency does not exist anymore. I based this patch set on bpf-next commit 234dbe3dc1db ("Merge branch 'verifier-liveness-simplification'") Thanks: Magnus Magnus Karlsson (3): i40e: fix possible compiler warning in xsk TX path xsk: get rid of useless struct xdp_umem_props i40e: adapt driver to new xdp_umem structure drivers/net/ethernet/intel/i40e/i40e_xsk.c | 10 -- include/net/xdp_sock.h | 8 ++-- net/xdp/xdp_umem.c | 4 ++-- net/xdp/xdp_umem_props.h | 14 -- net/xdp/xsk.c | 10 ++ net/xdp/xsk_queue.c| 5 +++-- net/xdp/xsk_queue.h| 13 +++-- 7 files changed, 24 insertions(+), 40 deletions(-) delete mode 100644 net/xdp/xdp_umem_props.h -- 2.7.4
[PATCH bpf-next 2/3] xsk: get rid of useless struct xdp_umem_props
This commit gets rid of the structure xdp_umem_props. It was there to be able to break a dependency at one point, but this is no longer needed. The values in the struct are instead stored directly in the xdp_umem structure. This simplifies the xsk code as well as af_xdp zero-copy drivers and as a bonus gets rid of one internal header file. Signed-off-by: Magnus Karlsson --- include/net/xdp_sock.h | 8 ++-- net/xdp/xdp_umem.c | 4 ++-- net/xdp/xdp_umem_props.h | 14 -- net/xdp/xsk.c| 10 ++ net/xdp/xsk_queue.c | 5 +++-- net/xdp/xsk_queue.h | 13 +++-- 6 files changed, 20 insertions(+), 34 deletions(-) delete mode 100644 net/xdp/xdp_umem_props.h diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h index 56994ad..932ca0d 100644 --- a/include/net/xdp_sock.h +++ b/include/net/xdp_sock.h @@ -16,11 +16,6 @@ struct net_device; struct xsk_queue; -struct xdp_umem_props { - u64 chunk_mask; - u64 size; -}; - struct xdp_umem_page { void *addr; dma_addr_t dma; @@ -30,7 +25,8 @@ struct xdp_umem { struct xsk_queue *fq; struct xsk_queue *cq; struct xdp_umem_page *pages; - struct xdp_umem_props props; + u64 chunk_mask; + u64 size; u32 headroom; u32 chunk_size_nohr; struct user_struct *user; diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index bfe2dbe..2471614 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -314,8 +314,8 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr) umem->pid = get_task_pid(current, PIDTYPE_PID); umem->address = (unsigned long)addr; - umem->props.chunk_mask = ~((u64)chunk_size - 1); - umem->props.size = size; + umem->chunk_mask = ~((u64)chunk_size - 1); + umem->size = size; umem->headroom = headroom; umem->chunk_size_nohr = chunk_size - headroom; umem->npgs = size / PAGE_SIZE; diff --git a/net/xdp/xdp_umem_props.h b/net/xdp/xdp_umem_props.h deleted file mode 100644 index 40eab10..000 --- a/net/xdp/xdp_umem_props.h +++ /dev/null @@ -1,14 +0,0 @@ -/* SPDX-License-Identifier: GPL-2.0 */ -/* XDP user-space packet buffer - * Copyright(c) 2018 Intel Corporation. - */ - -#ifndef XDP_UMEM_PROPS_H_ -#define XDP_UMEM_PROPS_H_ - -struct xdp_umem_props { - u64 chunk_mask; - u64 size; -}; - -#endif /* XDP_UMEM_PROPS_H_ */ diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c index 4e937cd..acbe5b5 100644 --- a/net/xdp/xsk.c +++ b/net/xdp/xsk.c @@ -458,8 +458,10 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len) goto out_unlock; } else { /* This xsk has its own umem. */ - xskq_set_umem(xs->umem->fq, &xs->umem->props); - xskq_set_umem(xs->umem->cq, &xs->umem->props); + xskq_set_umem(xs->umem->fq, xs->umem->size, + xs->umem->chunk_mask); + xskq_set_umem(xs->umem->cq, xs->umem->size, + xs->umem->chunk_mask); err = xdp_umem_assign_dev(xs->umem, dev, qid, flags); if (err) @@ -469,8 +471,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len) xs->dev = dev; xs->zc = xs->umem->zc; xs->queue_id = qid; - xskq_set_umem(xs->rx, &xs->umem->props); - xskq_set_umem(xs->tx, &xs->umem->props); + xskq_set_umem(xs->rx, xs->umem->size, xs->umem->chunk_mask); + xskq_set_umem(xs->tx, xs->umem->size, xs->umem->chunk_mask); xdp_add_sk_umem(xs->umem, xs); out_unlock: diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c index 6c32e92..2dc1384d 100644 --- a/net/xdp/xsk_queue.c +++ b/net/xdp/xsk_queue.c @@ -7,12 +7,13 @@ #include "xsk_queue.h" -void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props) +void xskq_set_umem(struct xsk_queue *q, u64 size, u64 chunk_mask) { if (!q) return; - q->umem_props = *umem_props; + q->size = size; + q->chunk_mask = chunk_mask; } static u32 xskq_umem_get_ring_size(struct xsk_queue *q) diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h index 8a64b15..82252cc 100644 --- a/net/xdp/xsk_queue.h +++ b/net/xdp/xsk_queue.h @@ -31,7 +31,8 @@ struct xdp_umem_ring { }; struct xsk_queue { - struct xdp_umem_props umem_props; + u64 chunk_mask; + u64 size; u32 ring_mask; u32 nentries; u32 prod_head; @@ -78,7 +79,7 @@ static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt) static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr) { - if (addr >= q->umem_props.size) { + if (addr >= q->size) { q->invalid_descs++; return false; } @@ -92,7 +93,7 @@ static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr) str
[PATCH bpf-next 1/3] i40e: fix possible compiler warning in xsk TX path
With certain gcc versions, it was possible to get the warning "'tx_desc' may be used uninitialized in this function" for the i40e_xmit_zc. This was not possible, however this commit simplifies the code path so that this warning is no longer emitted. Signed-off-by: Magnus Karlsson --- drivers/net/ethernet/intel/i40e/i40e_xsk.c | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c index 94947a8..41ca7e1 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c @@ -668,9 +668,8 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget) **/ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget) { - unsigned int total_packets = 0; + struct i40e_tx_desc *tx_desc = NULL; struct i40e_tx_buffer *tx_bi; - struct i40e_tx_desc *tx_desc; bool work_done = true; dma_addr_t dma; u32 len; @@ -697,14 +696,13 @@ static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget) build_ctob(I40E_TX_DESC_CMD_ICRC | I40E_TX_DESC_CMD_EOP, 0, len, 0); - total_packets++; xdp_ring->next_to_use++; if (xdp_ring->next_to_use == xdp_ring->count) xdp_ring->next_to_use = 0; } - if (total_packets > 0) { + if (tx_desc) { /* Request an interrupt for the last frame and bump tail ptr. */ tx_desc->cmd_type_offset_bsz |= (I40E_TX_DESC_CMD_RS << I40E_TXD_QW1_CMD_SHIFT); -- 2.7.4
[PATCH bpf-next 3/3] i40e: adapt driver to new xdp_umem structure
The struct xdp_umem_props was removed in the xsk code and this commit adapts the i40e af_xdp zero-copy driver code to the new xdp_umem structure. Signed-off-by: Magnus Karlsson --- drivers/net/ethernet/intel/i40e/i40e_xsk.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c index 41ca7e1..2ebfc78 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c @@ -442,7 +442,7 @@ static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring, struct i40e_rx_buffer *old_bi) { struct i40e_rx_buffer *new_bi = &rx_ring->rx_bi[rx_ring->next_to_alloc]; - unsigned long mask = (unsigned long)rx_ring->xsk_umem->props.chunk_mask; + unsigned long mask = (unsigned long)rx_ring->xsk_umem->chunk_mask; u64 hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM; u16 nta = rx_ring->next_to_alloc; @@ -477,7 +477,7 @@ void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle) rx_ring = container_of(alloc, struct i40e_ring, zca); hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM; - mask = rx_ring->xsk_umem->props.chunk_mask; + mask = rx_ring->xsk_umem->chunk_mask; nta = rx_ring->next_to_alloc; bi = &rx_ring->rx_bi[nta]; -- 2.7.4
[PATCH net 1/2] selftests: pmtu: maximum MTU for vti4 is 2^16-1-20
Since commit 82612de1c98e ("ip_tunnel: restore binding to ifaces with a large mtu"), the maximum MTU for vti4 is based on IP_MAX_MTU instead of the mysterious constant 0xFFF8. This makes this selftest fail. Fixes: 82612de1c98e ("ip_tunnel: restore binding to ifaces with a large mtu") Signed-off-by: Sabrina Dubroca Acked-by: Stefano Brivio --- tools/testing/selftests/net/pmtu.sh | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh index f8cc38afffa2..0ecf2609b9a4 100755 --- a/tools/testing/selftests/net/pmtu.sh +++ b/tools/testing/selftests/net/pmtu.sh @@ -334,7 +334,7 @@ test_pmtu_vti4_link_add_mtu() { fail=0 min=68 - max=$((65528 - 20)) + max=$((65535 - 20)) # Check invalid values first for v in $((min - 1)) $((max + 1)); do ${ns_a} ip link add vti4_a mtu ${v} type vti local ${veth4_a_addr} remote ${veth4_b_addr} key 10 2>/dev/null -- 2.18.0
[PATCH net 2/2] selftests: pmtu: detect correct binary to ping ipv6 addresses
Some systems don't have the ping6 binary anymore, and use ping for everything. Detect the absence of ping6 and try to use ping instead. Fixes: d1f1b9cbf34c ("selftests: net: Introduce first PMTU test") Signed-off-by: Sabrina Dubroca Acked-by: Stefano Brivio --- tools/testing/selftests/net/pmtu.sh | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/net/pmtu.sh b/tools/testing/selftests/net/pmtu.sh index 0ecf2609b9a4..cc2798a0a2d7 100755 --- a/tools/testing/selftests/net/pmtu.sh +++ b/tools/testing/selftests/net/pmtu.sh @@ -46,6 +46,9 @@ # Kselftest framework requirement - SKIP code is 4. ksft_skip=4 +# Some systems don't have a ping6 binary anymore +which ping6 > /dev/null 2>&1 && ping6=$(which ping6) || ping6=$(which ping) + tests=" pmtu_vti6_exception vti6: PMTU exceptions pmtu_vti4_exception vti4: PMTU exceptions @@ -274,7 +277,7 @@ test_pmtu_vti6_exception() { mtu "${ns_b}" veth_b 4000 mtu "${ns_a}" vti6_a 5000 mtu "${ns_b}" vti6_b 5000 - ${ns_a} ping6 -q -i 0.1 -w 2 -s 6 ${vti6_b_addr} > /dev/null + ${ns_a} ${ping6} -q -i 0.1 -w 2 -s 6 ${vti6_b_addr} > /dev/null # Check that exception was created if [ "$(route_get_dst_pmtu_from_exception "${ns_a}" ${vti6_b_addr})" = "" ]; then -- 2.18.0
[PATCH net-next] net: stmmac: Add CBS support in XGMAC2
XGMAC2 uses the same CBS mechanism as GMAC5, only registers offset changes. Lets use the same TC callbacks and implement the .config_cbs callback in XGMAC2 core. Signed-off-by: Jose Abreu Cc: David S. Miller Cc: Joao Pinto Cc: Giuseppe Cavallaro Cc: Alexandre Torgue --- drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h | 12 drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c | 19 ++- drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c | 19 +++ drivers/net/ethernet/stmicro/stmmac/hwif.c | 2 +- 4 files changed, 50 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h index 0a80fa25afe3..d6bb953685fa 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h +++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2.h @@ -119,11 +119,23 @@ #define XGMAC_MTL_TXQ_OPMODE(x)(0x1100 + (0x80 * (x))) #define XGMAC_TQS GENMASK(25, 16) #define XGMAC_TQS_SHIFT16 +#define XGMAC_Q2TCMAP GENMASK(10, 8) +#define XGMAC_Q2TCMAP_SHIFT8 #define XGMAC_TTC GENMASK(6, 4) #define XGMAC_TTC_SHIFT4 #define XGMAC_TXQENGENMASK(3, 2) #define XGMAC_TXQEN_SHIFT 2 #define XGMAC_TSF BIT(1) +#define XGMAC_MTL_TCx_ETS_CONTROL(x) (0x1110 + (0x80 * (x))) +#define XGMAC_MTL_TCx_QUANTUM_WEIGHT(x)(0x1118 + (0x80 * (x))) +#define XGMAC_MTL_TCx_SENDSLOPE(x) (0x111c + (0x80 * (x))) +#define XGMAC_MTL_TCx_HICREDIT(x) (0x1120 + (0x80 * (x))) +#define XGMAC_MTL_TCx_LOCREDIT(x) (0x1124 + (0x80 * (x))) +#define XGMAC_CC BIT(3) +#define XGMAC_TSA GENMASK(1, 0) +#define XGMAC_SP (0x0 << 0) +#define XGMAC_CBS (0x1 << 0) +#define XGMAC_ETS (0x2 << 0) #define XGMAC_MTL_RXQ_OPMODE(x)(0x1140 + (0x80 * (x))) #define XGMAC_RQS GENMASK(25, 16) #define XGMAC_RQS_SHIFT16 diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c index d182f82f7b58..64b8cb88ea45 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_core.c @@ -177,6 +177,23 @@ static void dwxgmac2_map_mtl_to_dma(struct mac_device_info *hw, u32 queue, writel(value, ioaddr + reg); } +static void dwxgmac2_config_cbs(struct mac_device_info *hw, + u32 send_slope, u32 idle_slope, + u32 high_credit, u32 low_credit, u32 queue) +{ + void __iomem *ioaddr = hw->pcsr; + u32 value; + + writel(send_slope, ioaddr + XGMAC_MTL_TCx_SENDSLOPE(queue)); + writel(idle_slope, ioaddr + XGMAC_MTL_TCx_QUANTUM_WEIGHT(queue)); + writel(high_credit, ioaddr + XGMAC_MTL_TCx_HICREDIT(queue)); + writel(low_credit, ioaddr + XGMAC_MTL_TCx_LOCREDIT(queue)); + + value = readl(ioaddr + XGMAC_MTL_TCx_ETS_CONTROL(queue)); + value |= XGMAC_CC | XGMAC_CBS; + writel(value, ioaddr + XGMAC_MTL_TCx_ETS_CONTROL(queue)); +} + static int dwxgmac2_host_irq_status(struct mac_device_info *hw, struct stmmac_extra_stats *x) { @@ -316,7 +333,7 @@ const struct stmmac_ops dwxgmac210_ops = { .prog_mtl_tx_algorithms = dwxgmac2_prog_mtl_tx_algorithms, .set_mtl_tx_queue_weight = NULL, .map_mtl_to_dma = dwxgmac2_map_mtl_to_dma, - .config_cbs = NULL, + .config_cbs = dwxgmac2_config_cbs, .dump_regs = NULL, .host_irq_status = dwxgmac2_host_irq_status, .host_mtl_irq_status = dwxgmac2_host_mtl_irq_status, diff --git a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c index 20909036e002..6c5092e7771c 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwxgmac2_dma.c @@ -182,6 +182,9 @@ static void dwxgmac2_dma_tx_mode(void __iomem *ioaddr, int mode, value |= 0x7 << XGMAC_TTC_SHIFT; } + /* Use static TC to Queue mapping */ + value |= (channel << XGMAC_Q2TCMAP_SHIFT) & XGMAC_Q2TCMAP; + value &= ~XGMAC_TXQEN; if (qmode != MTL_QUEUE_AVB) value |= 0x2 << XGMAC_TXQEN_SHIFT; @@ -374,6 +377,21 @@ static void dwxgmac2_enable_tso(void __iomem *ioaddr, bool en, u32 chan) writel(value, ioaddr + XGMAC_DMA_CH_TX_CONTROL(chan)); } +static void dwxgmac2_qmode(void __iomem *ioaddr, u32 channel, u8 qmode) +{ + u32 value = readl(ioaddr + XGMAC_MTL_TXQ_OPMODE(channel)); + + value &= ~XGMAC_TXQEN; + if (qmode != MTL_QUEUE_AVB) { + value |= 0x2 << XGMAC_TXQEN_SHIFT
[PATCH net-next] net/sched: fix type of htb statistics
tokens and ctokens are defined as s64 in htb_class structure, and clamped to 32bits value during netlink dumps: cl->xstats.tokens = clamp_t(s64, PSCHED_NS2TICKS(cl->tokens), INT_MIN, INT_MAX); Defining it as u32 is working since userspace (tc) is printing it as signed int, but a correct definition from the beginning is probably better. In the same time, 'giants' structure member is unused since years, so update the comment to mark it unused. Signed-off-by: Florent Fourcot --- include/uapi/linux/pkt_sched.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h index 8975fd1a1421..e9b7244ac381 100644 --- a/include/uapi/linux/pkt_sched.h +++ b/include/uapi/linux/pkt_sched.h @@ -395,9 +395,9 @@ enum { struct tc_htb_xstats { __u32 lends; __u32 borrows; - __u32 giants; /* too big packets (rate will not be accurate) */ - __u32 tokens; - __u32 ctokens; + __u32 giants; /* unused since 'Make HTB scheduler work with TSO.' */ + __s32 tokens; + __s32 ctokens; }; /* HFSC section */ -- 2.11.0
[PATCH iproute2] tc/htb: remove unused variable
Since introduction of htb module, this variable has never been used. Signed-off-by: Florent Fourcot --- tc/q_htb.c | 9 - 1 file changed, 9 deletions(-) diff --git a/tc/q_htb.c b/tc/q_htb.c index b93d31d4..c8b2941d 100644 --- a/tc/q_htb.c +++ b/tc/q_htb.c @@ -109,7 +109,6 @@ static int htb_parse_opt(struct qdisc_util *qu, int argc, static int htb_parse_class_opt(struct qdisc_util *qu, int argc, char **argv, struct nlmsghdr *n, const char *dev) { - int ok = 0; struct tc_htb_opt opt = {}; __u32 rtab[256], ctab[256]; unsigned buffer = 0, cbuffer = 0; @@ -127,7 +126,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int argc, char **argv, str if (get_u32(&opt.prio, *argv, 10)) { explain1("prio"); return -1; } - ok++; } else if (matches(*argv, "mtu") == 0) { NEXT_ARG(); if (get_u32(&mtu, *argv, 10)) { @@ -161,7 +159,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int argc, char **argv, str explain1("buffer"); return -1; } - ok++; } else if (matches(*argv, "cburst") == 0 || strcmp(*argv, "cbuffer") == 0 || strcmp(*argv, "cmaxburst") == 0) { @@ -170,7 +167,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int argc, char **argv, str explain1("cbuffer"); return -1; } - ok++; } else if (strcmp(*argv, "ceil") == 0) { NEXT_ARG(); if (ceil64) { @@ -186,7 +182,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int argc, char **argv, str explain1("ceil"); return -1; } - ok++; } else if (strcmp(*argv, "rate") == 0) { NEXT_ARG(); if (rate64) { @@ -202,7 +197,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int argc, char **argv, str explain1("rate"); return -1; } - ok++; } else if (strcmp(*argv, "help") == 0) { explain(); return -1; @@ -214,9 +208,6 @@ static int htb_parse_class_opt(struct qdisc_util *qu, int argc, char **argv, str argc--; argv++; } - /* if (!ok) - return 0;*/ - if (!rate64) { fprintf(stderr, "\"rate\" is required.\n"); return -1; -- 2.11.0
Re: [iproute PATCH] iprule: Fix for incorrect space between dst and prefix
On Wed, 29 Aug 2018 15:52:57 +0200 Phil Sutter wrote: > This was added by accident when introducing JSON support. > > Fixes: 0dd4ccc56c0e3 ("iprule: add json support") > Signed-off-by: Phil Sutter Applied, thanks.
Re: [Patch iproute2 v2] ss: add UNIX_DIAG_VFS and UNIX_DIAG_ICONS for unix sockets
On Wed, 29 Aug 2018 10:09:27 -0700 Cong Wang wrote: > UNIX_DIAG_VFS and UNIX_DIAG_ICONS are never used by ss, > make them available in ss -e output. > > Cc: Stephen Hemminger > Signed-off-by: Cong Wang Applied, thanks
Re: [PATCHv3 iproute2 0/2] clang + misc changes
On Wed, 22 Aug 2018 18:01:30 -0700 Mahesh Bandewar wrote: > From: Mahesh Bandewar > > The primary theme is to make clang compile the iproute2 package without > warnings. Along with this there are two other misc patches in the series. > > First patch uses the preferred_family when operating with maddr feature. > Prior to this patch, it would always open an AF_INET socket irrespective > of the family that is preferred via command-line. > > Second patch mostly adds format attributes to make the c-lang compiler > happy and not throw the warning messages. > > Mahesh Bandewar (2): > ipmaddr: use preferred_family when given > iproute: make clang happy with iproute2 package > > include/json_writer.h | 3 +-- > ip/iplink_can.c | 19 --- > ip/ipmaddr.c | 13 - > lib/color.c | 1 + > lib/json_print.c | 1 + > lib/json_writer.c | 15 +-- > misc/ss.c | 3 ++- > tc/m_ematch.c | 1 + > tc/m_ematch.h | 1 + > 9 files changed, 32 insertions(+), 25 deletions(-) > > -- > 2.18.0.1017.ga543ac7ca45-goog > Applied, thanks
Re: [PATCH iproute2] tc/htb: remove unused variable
On Thu, 30 Aug 2018 16:38:54 +0200 Florent Fourcot wrote: > Since introduction of htb module, this variable has never been used. > > Signed-off-by: Florent Fourcot Looks good. Applied
Re: [PATCH net] tcp: do not restart timewait timer on rst reception
On 08/30/2018 05:24 AM, Florian Westphal wrote: > RFC 1337 says: > ''Ignore RST segments in TIME-WAIT state. >If the 2 minute MSL is enforced, this fix avoids all three hazards.'' > > So with net.ipv4.tcp_rfc1337=1, expected behaviour is to have TIME-WAIT sk > expire rather than removing it instantly when a reset is received. > > However, Linux will also re-start the TIME-WAIT timer. > > This causes connect to fail when tying to re-use ports or very long > delays (until syn retry interval exceeds MSL). > > Reported-by: Michal Tesar > Signed-off-by: Florian Westphal > --- SGTM, thanks. Signed-off-by: Eric Dumazet
[PATCH v3 bpf-next 2/2] new sample bpf prog
sample program which shows TCP_SAVE_SYN/TCP_SAVED_SYN usage example: bpf's program which is doing TOS/TCLASS reflection (server would reply with a same TOS/TCLASS as client) Signed-off-by: Nikita V. Shirokov --- samples/bpf/Makefile | 1 + samples/bpf/tcp_tos_reflect_kern.c | 87 ++ 2 files changed, 88 insertions(+) create mode 100644 samples/bpf/tcp_tos_reflect_kern.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index 36f9f41d094b..be0a961450bc 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -153,6 +153,7 @@ always += tcp_cong_kern.o always += tcp_iw_kern.o always += tcp_clamp_kern.o always += tcp_basertt_kern.o +always += tcp_tos_reflect_kern.o always += xdp_redirect_kern.o always += xdp_redirect_map_kern.o always += xdp_redirect_cpu_kern.o diff --git a/samples/bpf/tcp_tos_reflect_kern.c b/samples/bpf/tcp_tos_reflect_kern.c new file mode 100644 index ..d51dab19eca6 --- /dev/null +++ b/samples/bpf/tcp_tos_reflect_kern.c @@ -0,0 +1,87 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2018 Facebook + * + * BPF program to automatically reflect TOS option from received syn packet + * + * Use load_sock_ops to load this BPF program. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include "bpf_helpers.h" +#include "bpf_endian.h" + +#define DEBUG 1 + +#define bpf_printk(fmt, ...) \ +({ \ + char fmt[] = fmt;\ + bpf_trace_printk(fmt, sizeof(fmt), \ + ##__VA_ARGS__); \ +}) + +SEC("sockops") +int bpf_basertt(struct bpf_sock_ops *skops) +{ + char header[sizeof(struct ipv6hdr)]; + struct ipv6hdr *hdr6; + struct iphdr *hdr; + int hdr_size = 0; + int save_syn = 1; + int tos = 0; + int rv = 0; + int op; + + op = (int) skops->op; + +#ifdef DEBUG + bpf_printk("BPF command: %d\n", op); +#endif + switch (op) { + case BPF_SOCK_OPS_TCP_LISTEN_CB: + rv = bpf_setsockopt(skops, SOL_TCP, TCP_SAVE_SYN, + &save_syn, sizeof(save_syn)); + break; + case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: + if (skops->family == AF_INET) + hdr_size = sizeof(struct iphdr); + else + hdr_size = sizeof(struct ipv6hdr); + rv = bpf_getsockopt(skops, SOL_TCP, TCP_SAVED_SYN, + header, hdr_size); + if (!rv) { + if (skops->family == AF_INET) { + hdr = (struct iphdr *) header; + tos = hdr->tos; + if (tos != 0) + bpf_setsockopt(skops, SOL_IP, IP_TOS, + &tos, sizeof(tos)); + } else { + hdr6 = (struct ipv6hdr *) header; + tos = ((hdr6->priority) << 4 | + (hdr6->flow_lbl[0]) >> 4); + if (tos) + bpf_setsockopt(skops, SOL_IPV6, + IPV6_TCLASS, + &tos, sizeof(tos)); + } + rv = 0; + } + break; + default: + rv = -1; + } +#ifdef DEBUG + bpf_printk("Returning %d\n", rv); +#endif + skops->reply = rv; + return 1; +} +char _license[] SEC("license") = "GPL"; -- 2.17.1
[PATCH v3 bpf-next 1/2] new options for bpf_(set|get)sockopt
adding support for two new bpf's get/set sockopts: TCP_SAVE_SYN (set) and TCP_SAVED_SYN (get). this would allow for bpf program to build logic based on data from ingress SYN packet Signed-off-by: Nikita V. Shirokov --- net/core/filter.c | 25 + 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index c25eb36f1320..feb578506009 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4007,6 +4007,12 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock, tp->snd_ssthresh = val; } break; + case TCP_SAVE_SYN: + if (val < 0 || val > 1) + ret = -EINVAL; + else + tp->save_syn = val; + break; default: ret = -EINVAL; } @@ -4032,21 +4038,32 @@ static const struct bpf_func_proto bpf_setsockopt_proto = { BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, bpf_sock, int, level, int, optname, char *, optval, int, optlen) { + struct inet_connection_sock *icsk; struct sock *sk = bpf_sock->sk; + struct tcp_sock *tp; if (!sk_fullsock(sk)) goto err_clear; - #ifdef CONFIG_INET if (level == SOL_TCP && sk->sk_prot->getsockopt == tcp_getsockopt) { - if (optname == TCP_CONGESTION) { - struct inet_connection_sock *icsk = inet_csk(sk); + switch (optname) { + case TCP_CONGESTION: + icsk = inet_csk(sk); if (!icsk->icsk_ca_ops || optlen <= 1) goto err_clear; strncpy(optval, icsk->icsk_ca_ops->name, optlen); optval[optlen - 1] = 0; - } else { + break; + case TCP_SAVED_SYN: + tp = tcp_sk(sk); + + if (optlen <= 0 || !tp->saved_syn || + optlen > tp->saved_syn[0]) + goto err_clear; + memcpy(optval, tp->saved_syn + 1, optlen); + break; + default: goto err_clear; } } else if (level == SOL_IP) { -- 2.17.1
[PATCH v3 bpf-next 0/2] bpf tcp save syn set/get sockoptions
adding supprot for two new bpf's tcp sockopts: TCP_SAVE_SYN (set) and TCP_SAVED_SYN (get) this would allow for tcp-bpf program to build some logic based on fields from ingress syn packet (e.g. doing tcp's tos/tclass reflection (see sample prog)) and do it transparently from userspace program point of view v2->v3: - make patch series public v1->v2: - adding proper SPDX license Nikita V. Shirokov (2): new options for bpf_(set|get)sockopt new sample bpf prog net/core/filter.c | 25 +++-- samples/bpf/Makefile | 1 + samples/bpf/tcp_tos_reflect_kern.c | 87 ++ 3 files changed, 109 insertions(+), 4 deletions(-) create mode 100644 samples/bpf/tcp_tos_reflect_kern.c -- 2.17.1
Re: [pull request][net-next 00/10] Mellanox, mlx5 and devlink updates 2018-07-31
I'm dropping all the old comments since the conversation was flattened and only has one level of marks for everything. On Thu, Aug 30, 2018 at 7:43 AM Alex Vesker wrote: > To which devlink interfaces are you referring? All of them. Not just the ones in this patch. If you are exposing an interface to the user you should have documentation for it somewhere. You should probably look at adding a patch to make certain you have all the existing devlink interfaces in the driver documented. I would like to see something added to the documentation folder that explains what all the DEVLINK_PARAM_GENERIC interfaces are expected to do, and maybe why I would use them. Then in addition I would like to see per-driver documentation added for the DEVLINK_PARAM_DRIVER calls. So for example I can't find any documentation in the kernel on what enable_64b_cqe_eqe or enable_4k_uar do in mlx4 or why I would need them, but you have them exposed as interfaces to userspace. > There are 3 patches here that provide the crdump capability, > these are the patches I would like to resubmit. > > net/mlx5: Add Vendor Specific Capability access gateway: > This is needed to read from the VSC by only the driver to collect a dump You should probably work with the linux-pci mailing list on this bit since you are exposing a new capability and they can probably point you in the direction of how they want to deal with any potential races in terms of access to the device versus your capability which you are adding support for dumping via devlink. > net/mlx5: Add Crdump FW snapshot support > This is code that collects the dump and registers a region called crdump > net/mlx5: Use devlink region_snapshot parameter > Here I use an already implemented global param that specifies whether > snapshots are supported. > > The devlink region feature is well documented. Where? > can it be that you referring to devlink region called "crdump" which mlx5 > exposes? I don't care about the internals. I care about user available documentation for the interface that is exposed. How do you expect the user to use this functionality? That is what I want documented. > Will it be sufficient to prevent setcpi access using "pci_cfg_access_lock - > any userspace reads or writes to config space and concurrent lock requests > will sleep" > otherwise do you have a different solution? That sounds like a step in the right direction, but that is something you should work with the linux-pci list on. My main concern is that I don't want us being able to come at this interface from multiple directions and screw things up.
Re: WARNING in handle_irq (3)
On Thu, Aug 30, 2018 at 8:31 AM, syzbot wrote: > Hello, > > syzbot found the following crash on: > > HEAD commit:58c3f14f86c9 Merge tag 'riscv-for-linus-4.19-rc2' of git:/.. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=10be176a40 > kernel config: https://syzkaller.appspot.com/x/.config?x=531a917630d2a492 > dashboard link: https://syzkaller.appspot.com/bug?extid=a58b558e3e62d0604e5c > compiler: gcc (GCC) 8.0.1 20180413 (experimental) > > Unfortunately, I don't have any reproducer for this crash yet. +bpf maintainers Looks suspiciously similar to: https://groups.google.com/d/msg/syzkaller-bugs/4v7MtbIT1hY/A87hInzyAwAJ Note this commit seems to already have "bpf, sockmap: fix sock_hash_alloc and reject zero-sized keys ". Tentative reproducer from the log is: 14:08:59 executing program 5: socketpair(0x2, 0x0, 0x0, &(0x7f000140)) r0 = socket$inet6_tcp(0xa, 0x1, 0x0) r1 = socket$inet6_tcp(0xa, 0x1, 0x0) bind$inet6(r1, &(0x7fc0)={0xa, 0x4e22}, 0x1c) listen(r1, 0x0) sendto$inet6(r0, &(0x7f000140), 0x2d6, 0x2004, &(0x7f80)={0xa, 0x10004e22, 0x0, @loopback}, 0x1c) setsockopt$inet6_tcp_TCP_ULP(r0, 0x6, 0x1f, &(0x7f80)='tls\x00', 0x152) r2 = bpf$MAP_CREATE(0x0, &(0x7f000280)={0xf, 0x4, 0x4, 0x70}, 0x2c) bpf$MAP_UPDATE_ELEM(0x2, &(0x7f000180)={r2, &(0x7f00), &(0x7f000140)}, 0x20) Which does not create a 0-key map. > IMPORTANT: if you fix the bug, please add the following tag to the commit: > Reported-by: syzbot+a58b558e3e62d0604...@syzkaller.appspotmail.com > > TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending > cookies. Check SNMP counters. > [ cut here ] > do_IRQ(): syz-executor5 has overflown the kernel stack > (cur:88018aec,sp:88018aeb0e18,irq stk > top-bottom:8801db80-8801db008000,exception stk > top-bottom:fe007080-fe011000,ip:lock_is_held_type+0x18b/0x210) > WARNING: CPU: 0 PID: 13805 at arch/x86/kernel/irq_64.c:64 > stack_overflow_check arch/x86/kernel/irq_64.c:61 [inline] > WARNING: CPU: 0 PID: 13805 at arch/x86/kernel/irq_64.c:64 > handle_irq+0x1fb/0x2e7 arch/x86/kernel/irq_64.c:73 > Kernel panic - not syncing: panic_on_warn set ... > > CPU: 0 PID: 13805 Comm: syz-executor5 Not tainted 4.19.0-rc1+ #215 > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > Google 01/01/2011 > Call Trace: > > __dump_stack lib/dump_stack.c:77 [inline] > dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113 > panic+0x238/0x4e7 kernel/panic.c:184 > __warn.cold.8+0x163/0x1ba kernel/panic.c:536 > report_bug+0x252/0x2d0 lib/bug.c:186 > fixup_bug arch/x86/kernel/traps.c:178 [inline] > do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296 > do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316 > invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:993 > RIP: 0010:stack_overflow_check arch/x86/kernel/irq_64.c:61 [inline] > RIP: 0010:handle_irq+0x1fb/0x2e7 arch/x86/kernel/irq_64.c:73 > Code: 00 00 ff b6 80 00 00 00 48 c7 c7 80 ca 24 87 41 54 41 55 65 48 8b 04 > 25 40 ee 01 00 48 05 68 06 00 00 48 89 c6 e8 95 c4 1c 00 <0f> 0b 48 83 c4 18 > e9 3f ff ff ff 48 89 75 e0 e8 c1 fe 90 00 48 8b > RSP: 0018:8801db007f58 EFLAGS: 00010082 > RAX: RBX: 8801cee0ad80 RCX: > RDX: 0001 RSI: 8163ac01 RDI: 0001 > RBP: 8801db007fb0 R08: 8801c9b4a700 R09: ed003b603eca > R10: ed003b603eca R11: 8801db01f657 R12: fe011000 > R13: fe007080 R14: 002a R15: > do_IRQ+0x80/0x1a0 arch/x86/kernel/irq.c:246 > common_interrupt+0xf/0xf arch/x86/entry/entry_64.S:643 > > Dumping ftrace buffer: >(ftrace buffer empty) > Kernel Offset: disabled > Rebooting in 86400 seconds.. > > > --- > This bug is generated by a bot. It may contain errors. > See https://goo.gl/tpsmEJ for more information about syzbot. > syzbot engineers can be reached at syzkal...@googlegroups.com. > > syzbot will keep track of this bug report. See: > https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with > syzbot.
Re: [PATCH net 1/2] selftests: pmtu: maximum MTU for vti4 is 2^16-1-20
Le 30/08/2018 à 16:01, Sabrina Dubroca a écrit : > Since commit 82612de1c98e ("ip_tunnel: restore binding to ifaces with a > large mtu"), the maximum MTU for vti4 is based on IP_MAX_MTU instead of > the mysterious constant 0xFFF8. This makes this selftest fail. > > Fixes: 82612de1c98e ("ip_tunnel: restore binding to ifaces with a large mtu") > Signed-off-by: Sabrina Dubroca > Acked-by: Stefano Brivio Thanks for fixing this. Acked-by: Nicolas Dichtel
[PATCH net] ipv6: don't get lwtstate twice in ip6_rt_copy_init()
Commit 80f1a0f4e0cd ("net/ipv6: Put lwtstate when destroying fib6_info") partially fixed the kmemleak [1], lwtstate can be copied from fib6_info, with ip6_rt_copy_init(), and it should be done only once there. rt->dst.lwtstate is set by ip6_rt_init_dst(), at the start of the function ip6_rt_copy_init(), so there is no need to get it again at the end. With this patch, lwtstate also isn't copied from RTF_REJECT routes. [1]: unreferenced object 0x880b6aaa14e0 (size 64): comm "ip", pid 10577, jiffies 4295149341 (age 1273.903s) hex dump (first 32 bytes): 01 00 04 00 04 00 00 00 10 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 backtrace: [<18664623>] lwtunnel_build_state+0x1bc/0x420 [] ip6_route_info_create+0x9f7/0x1fd0 [ ] ip6_route_add+0x14/0x70 [<8537b55c>] inet6_rtm_newroute+0xd9/0xe0 [<2acc50f5>] rtnetlink_rcv_msg+0x66f/0x8e0 [<8d9cd381>] netlink_rcv_skb+0x268/0x3b0 [<4c893c76>] netlink_unicast+0x417/0x5a0 [ ] netlink_sendmsg+0x70b/0xc30 [<890ff0aa>] sock_sendmsg+0xb1/0xf0 [ ] ___sys_sendmsg+0x659/0x950 [<1e7426c8>] __sys_sendmsg+0xde/0x170 [ ] do_syscall_64+0x9f/0x4a0 [<1be7b28b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe [<6d21f353>] 0x Fixes: 6edb3c96a5f0 ("net/ipv6: Defer initialization of dst to data path") Signed-off-by: Alexey Kodanev --- net/ipv6/route.c | 1 - 1 file changed, 1 deletion(-) diff --git a/net/ipv6/route.c b/net/ipv6/route.c index 8e08a91..9f27ada 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -996,7 +996,6 @@ static void ip6_rt_copy_init(struct rt6_info *rt, struct fib6_info *ort) rt->rt6i_src = ort->fib6_src; #endif rt->rt6i_prefsrc = ort->fib6_prefsrc; - rt->dst.lwtstate = lwtstate_get(ort->fib6_nh.nh_lwtstate); } static struct fib6_node* fib6_backtrack(struct fib6_node *fn, -- 1.8.3.1
Re: [PATCH net] ipv6: don't get lwtstate twice in ip6_rt_copy_init()
On 8/30/18 10:11 AM, Alexey Kodanev wrote: > Commit 80f1a0f4e0cd ("net/ipv6: Put lwtstate when destroying fib6_info") > partially fixed the kmemleak [1], lwtstate can be copied from fib6_info, > with ip6_rt_copy_init(), and it should be done only once there. > > rt->dst.lwtstate is set by ip6_rt_init_dst(), at the start of the function > ip6_rt_copy_init(), so there is no need to get it again at the end. > > With this patch, lwtstate also isn't copied from RTF_REJECT routes. Those should not have lwtstate set. > > [1]: > unreferenced object 0x880b6aaa14e0 (size 64): > comm "ip", pid 10577, jiffies 4295149341 (age 1273.903s) > hex dump (first 32 bytes): > 01 00 04 00 04 00 00 00 10 00 00 00 00 00 00 00 > 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > backtrace: > [<18664623>] lwtunnel_build_state+0x1bc/0x420 > [] ip6_route_info_create+0x9f7/0x1fd0 > [ ] ip6_route_add+0x14/0x70 > [<8537b55c>] inet6_rtm_newroute+0xd9/0xe0 > [<2acc50f5>] rtnetlink_rcv_msg+0x66f/0x8e0 > [<8d9cd381>] netlink_rcv_skb+0x268/0x3b0 > [<4c893c76>] netlink_unicast+0x417/0x5a0 > [ ] netlink_sendmsg+0x70b/0xc30 > [<890ff0aa>] sock_sendmsg+0xb1/0xf0 > [ ] ___sys_sendmsg+0x659/0x950 > [<1e7426c8>] __sys_sendmsg+0xde/0x170 > [ ] do_syscall_64+0x9f/0x4a0 > [<1be7b28b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe > [<6d21f353>] 0x What test did you run to uncover this? Curious as to why my testing that found the need for 80f1a0f4e0cd did not hit this. > > Fixes: 6edb3c96a5f0 ("net/ipv6: Defer initialization of dst to data path") > Signed-off-by: Alexey Kodanev > --- > net/ipv6/route.c | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/net/ipv6/route.c b/net/ipv6/route.c > index 8e08a91..9f27ada 100644 > --- a/net/ipv6/route.c > +++ b/net/ipv6/route.c > @@ -996,7 +996,6 @@ static void ip6_rt_copy_init(struct rt6_info *rt, struct > fib6_info *ort) > rt->rt6i_src = ort->fib6_src; > #endif > rt->rt6i_prefsrc = ort->fib6_prefsrc; > - rt->dst.lwtstate = lwtstate_get(ort->fib6_nh.nh_lwtstate); > } > > static struct fib6_node* fib6_backtrack(struct fib6_node *fn, > Thanks for the patch. Reviewed-by: David Ahern
Re: [PATCH net-next 1/2] netlink: ipv4 IGMP join notifications
Don't know what happened to the 0/2 cover for this series so here it is: This patch is an update to https://patchwork.ozlabs.org/patch/571127/. The previous patch was based on sending multicast MAC addresses in the netlink messages to allow the programming of hardware. It was agreed to rework this to use RTM_NEW/DELLINK messages which were more appropriate for layer 2 addresses. In the interim period it has become apparent that the applications actually needs to see the L3 multicast addresses which are joined for FORUS processing so this patch has been reworked to send the L3 multicast addresses using RTM_NEW/DELADDR. These new multicast L3 netlink notifications should use the IFA_MULTICAST address type but this has been dropped in favour of IFA_ADDRESS as during testing it was noticed that some applications - notably getaddrinfo in lib6c assume that there is an IFA_ADDRESS in a RTM_NEW/DELADDR and blindly dereference it. Finally the RTM_GETADDR for both address families has been modified to include the multicast l3 addresses. Patrick Ruddy (2): netlink: ipv4 IGMP join notifications netlink: ipv6 MLD join notifications include/linux/igmp.h | 2 + net/ipv4/devinet.c | 39 +-- net/ipv4/igmp.c | 90 net/ipv6/addrconf.c | 44 -- net/ipv6/mcast.c | 66 5 files changed, 218 insertions(+), 23 deletions(-) -- 2.17.1 On Thu, 2018-08-30 at 10:35 +0100, Patrick Ruddy wrote: > Some userspace applications need to know about IGMP joins from the kernel > for 2 reasons > 1. To allow the programming of multicast MAC filters in hardware > 2. To form a multicast FORUS list for non link-local multicast >groups to be sent to the kernel and from there to the interested >party. > (1) can be fulfilled but simply sending the hardware multicast MAC > address to be programmed but (2) requires the L3 address to be sent > since this cannot be constructed from the MAC address whereas the > reverse translation is a standard library function. > > This commit provides addition and deletion of multicast addresses > using the RTM_NEWADDR and RTM_DELADDR messages. It also provides > the RTM_GETADDR extension to allow multicast join state to be read > from the kernel. > > Signed-off-by: Patrick Ruddy > --- > include/linux/igmp.h | 2 + > net/ipv4/devinet.c | 39 +-- > net/ipv4/igmp.c | 90 > 3 files changed, 120 insertions(+), 11 deletions(-) > > diff --git a/include/linux/igmp.h b/include/linux/igmp.h > index 119f53941c12..1fb417865e7d 100644 > --- a/include/linux/igmp.h > +++ b/include/linux/igmp.h > @@ -130,6 +130,8 @@ extern void ip_mc_unmap(struct in_device *); > extern void ip_mc_remap(struct in_device *); > extern void ip_mc_dec_group(struct in_device *in_dev, __be32 addr); > extern void ip_mc_inc_group(struct in_device *in_dev, __be32 addr); > +extern int ip_mc_dump_ifaddr(struct sk_buff *skb, struct netlink_callback > *cb, > + struct net_device *dev); > int ip_mc_check_igmp(struct sk_buff *skb, struct sk_buff **skb_trimmed); > > #endif > diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c > index ea4bd8a52422..42f7dcc4fb5e 100644 > --- a/net/ipv4/devinet.c > +++ b/net/ipv4/devinet.c > @@ -57,6 +57,7 @@ > #endif > #include > #include > +#include > > #include > #include > @@ -1651,6 +1652,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct > netlink_callback *cb) > int h, s_h; > int idx, s_idx; > int ip_idx, s_ip_idx; > + int multicast, mcast_idx; > struct net_device *dev; > struct in_device *in_dev; > struct in_ifaddr *ifa; > @@ -1659,6 +1661,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct > netlink_callback *cb) > s_h = cb->args[0]; > s_idx = idx = cb->args[1]; > s_ip_idx = ip_idx = cb->args[2]; > + multicast = cb->args[3]; > + mcast_idx = cb->args[4]; > > for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) { > idx = 0; > @@ -1675,18 +1679,29 @@ static int inet_dump_ifaddr(struct sk_buff *skb, > struct netlink_callback *cb) > if (!in_dev) > goto cont; > > - for (ifa = in_dev->ifa_list, ip_idx = 0; ifa; > - ifa = ifa->ifa_next, ip_idx++) { > - if (ip_idx < s_ip_idx) > - continue; > - if (inet_fill_ifaddr(skb, ifa, > - NETLINK_CB(cb->skb).portid, > - cb->nlh->nlmsg_seq, > - RTM_NEWADDR, NLM_F_MULTI) < 0) { > - rcu_read_unlock(); > - goto done; > + if (!multicast)
Re: [PATCH 2/4] r8169: Get and enable optional ether_clk clock
Quoting Hans de Goede (2018-08-29 10:09:57) > Hi, > > On 27-08-18 21:14, Stephen Boyd wrote: > > Quoting Hans de Goede (2018-08-27 11:53:19) > >> On 27-08-18 20:47, Stephen Boyd wrote: > >>> How would you know that a clk device driver hasn't probed yet and isn't > >>> the driver that's actually providing the clk to this device on x86 > >>> systems? With DT systems we can figure that out by looking at the DT and > >>> seeing if the device driver requesting the clk has the clocks property. > >>> On x86 systems it's all clkdev which doesn't really lend itself to > >>> solving this problem. > >> > >> Right on x86 the assumption is that the clk driver will be builtin and > >> will probe before the consumer. In this case that is true as the > >> pmc-atom-clk driver can only be builtin and its platform device is > >> instantiated from the acpi_lpss code and acpi init happens before > >> the PCI bus is scanned. > > > > If we can go with this assumption then we can make the optional clk API > > work even on clkdev based systems. Maybe if x86 had some way of > > indicating that all builtin clks are registered? > > Unfortunately there is no such thing I'm afraid. Ugh! > > > That might work but > > it's not very clean. Or if we could check to see if we're running on an > > ACPI based system in clkdev we could use that to assume that clk_get() > > will only be called after all providers have registered their lookups. > > Yes some check for x86 + ACPI (ARM also uses ACPI, but there we > should no do this AFAICT) is probably best. That or not use the > new optional clk API on x86, but that means that any cross platform > driver cannot use it, which would be a pain. Right. The optional clk API will be not so great until we can get ACPI to move way from clkdev. > > BTW does your Acked-by indicate you are ok with merging this series > through the netdev tree as I suggested in the cover-letter? If so > can I also add your Acked-by to the 3th patch ? > Yep, I thought I did that but now I've really done it.
Re: [PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle
On 08/29/2018 07:47 PM, Tony Lindgren wrote: * Grygorii Strashko [180830 00:12]: Hi Tony, On 08/29/2018 10:00 AM, Tony Lindgren wrote: The current cpsw usage for cpsw-phy-sel is undocumented but is used for all the boards using cpsw. And cpsw-phy-sel is not really a child of the cpsw device, it lives in the system control module instead. Let's document the existing usage, and improve it a bit where we prefer to use a phandle instead of a child device for it. That way we can properly describe the hardware in dts files for things like genpd. I'm ok with this series, but I really don't like cpsw-phy-sel in general. Yeah this binding predates any standards. This series only fixes the nasty issue of cpsw claiming a module as a child that's outside it's IO range. It was introduced long time back and now I'm thinking about possibility to replace it with one of current generic interfaces - for example mux-controller. Each port will control up to 3 muxes (port mode, idmode and rmii_ext_clk) and transform phy-mode => mux states. What do you think? Sure a mux-controller here makes sense. Another option is to use phy, but it'd be complicated. For the port muxes, how about a phy driver just using a pinctrl driver? In general, it seems cpsw is just an interconnect instance (L4_FAST) with a control module (CPSW_WR) and a pile of independent other modules. That's described nicely in am437x TRM chapter "2.1.4 L4 Fast Peripheral Memory Map". So from that point of view the binding reg entries right now are all wrong :) TRM not consistent - for am5 it's one MMIO region. In the long run cpsw should be really treated as an interconnect instance with it's control module providing standard Linux framework services such as clock / regulator / phy / pinctrl / iio whatever for the other modules. Just my 2c based on looking at the interconnect, I'm not too familiar with cpsw otherwise. It's not separate modules. this is composite module which have only one fck/ick and most of blocks can't even function without each other. Above might be the case for Keystone 2, but not omap CPSW. Keystone 2 - has packet processor, security accelerator, queue manager in addition to its basic switch block. -- regards, -grygorii
Re: [PATCH v2 iproute2-next 0/3] support delivering packets in
On 8/26/18 8:42 PM, Yousuk Seung wrote: > This series adds support for the new "slot" netem parameter for > slotting. Slotting is an approximation of shared media that gather up > packets within a varying delay window before delivering them nearly at > once. > > Dave Taht (2): > tc: support conversions to or from 64 bit nanosecond-based time > q_netem: support delivering packets in delayed time slots > > Yousuk Seung (1): > q_netem: slotting with non-uniform distribution > > include/utils.h | 12 + > lib/utils.c | 104 +++ > man/man8/tc-netem.8 | 40 ++- > tc/q_netem.c| 115 +++- > tc/tc_cbq.c | 1 + > tc/tc_core.c| 1 + > tc/tc_core.h| 2 - > tc/tc_estimator.c | 1 + > tc/tc_util.c| 46 -- > tc/tc_util.h| 3 -- > 10 files changed, 272 insertions(+), 53 deletions(-) > applied to iproute2-next after fixing up a whitespace issue and 2 checkpatch errors in patch 2.
[PATCH net] ibmvnic: Include missing return code checks in reset function
Check the return codes of these functions and halt reset in case of failure. The driver will remain in a dormant state until the next reset event, when device initialization will be re-attempted. Signed-off-by: Thomas Falcon --- drivers/net/ethernet/ibm/ibmvnic.c | 12 +--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c index ffe7acb..d834308 100644 --- a/drivers/net/ethernet/ibm/ibmvnic.c +++ b/drivers/net/ethernet/ibm/ibmvnic.c @@ -1841,11 +1841,17 @@ static int do_reset(struct ibmvnic_adapter *adapter, adapter->map_id = 1; release_rx_pools(adapter); release_tx_pools(adapter); - init_rx_pools(netdev); - init_tx_pools(netdev); + rc = init_rx_pools(netdev); + if (rc) + return rc; + rc = init_tx_pools(netdev); + if (rc) + return rc; release_napi(adapter); - init_napi(adapter); + rc = init_napi(adapter); + if (rc) + return rc; } else { rc = reset_tx_pools(adapter); if (rc) -- 1.8.3.1
[bpf-next 0/3] Introduce eBPF flow dissector
From: Petar Penkov This patch series hardens the RX stack by allowing flow dissection in BPF, as previously discussed [1]. Because of the rigorous checks of the BPF verifier, this provides significant security guarantees. In particular, the BPF flow dissector cannot get inside of an infinite loop, as with CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot read outside of packet bounds, because all memory accesses are checked. Also, with BPF the administrator can decide which protocols to support, reducing potential attack surface. Rarely encountered protocols can be excluded from dissection and the program can be updated without kernel recompile or reboot if a bug is discovered. Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect. This includes a new BPF program and attach type. Patch 2 adds a flow dissector program in BPF. This parses most protocols in __skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports, and address types). Patch 3 adds a selftest that attaches the BPF program to the flow dissector and sends traffic with different levels of encapsulation. Performance Evaluation: The in-kernel implementation was compared against the demo program from patch 2 using the test in patch 3 with IPv4/UDP traffic over 10 seconds. $perf record -a -C 4 taskset -c 4 ./test_flow_dissector -i 4 -f 8 \ -t 10 In-kernel Dissector: __skb_flow_dissect overhead: 2.12% Total Packets: 3,272,597 (from output of ./test_flow_dissector) BPF Dissector: __skb_flow_dissect overhead: 1.63% Total Packets: 3,232,356 (from output of ./test_flow_dissector) No-op Dissector: __skb_flow_dissect overhead: 1.52% Total Packets: 3,330,635 (from output of ./test_flow_dissector) Changes since RFC: 1/ (Patch 1) Flow dissector hook is no longer global. Instead, it is per-netns 2/ (Patch 1) Defined struct bpf_flow_keys to be used in BPF flow dissector programs instead of exposing the internal flow keys layout. Added a function to translate from bpf_flow_keys to the internal layout after BPF dissection is complete. The pointer to this struct is stored in qdisc_skb_cb rather than inside of the 20 byte control block which simplifies verification and allows access to all 20 bytes of the cb. 3/ (Patch 2) Removed GUE parsing as it relied on a hardcoded port 4/ (Patch 2) MPLS parsing now stops at the first label which is consistent with the in-kernel flow dissector 5/ (Patch 2) Refactored to use direct packet access and to write out to struct bpf_flow_keys [1] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf Petar Penkov (3): flow_dissector: implements flow dissector BPF hook flow_dissector: implements eBPF parser selftests/bpf: test bpf flow dissection include/linux/bpf.h | 1 + include/linux/bpf_types.h | 1 + include/linux/skbuff.h| 7 + include/net/net_namespace.h | 3 + include/net/sch_generic.h | 12 +- include/uapi/linux/bpf.h | 25 + kernel/bpf/syscall.c | 8 + kernel/bpf/verifier.c | 33 + net/core/filter.c | 67 ++ net/core/flow_dissector.c | 136 +++ tools/bpf/bpftool/prog.c | 1 + tools/include/uapi/linux/bpf.h| 25 + tools/lib/bpf/libbpf.c| 2 + tools/testing/selftests/bpf/.gitignore| 2 + tools/testing/selftests/bpf/Makefile | 8 +- tools/testing/selftests/bpf/bpf_flow.c| 390 + tools/testing/selftests/bpf/config| 1 + .../selftests/bpf/flow_dissector_load.c | 140 .../selftests/bpf/test_flow_dissector.c | 782 ++ .../selftests/bpf/test_flow_dissector.sh | 115 +++ tools/testing/selftests/bpf/with_addr.sh | 54 ++ tools/testing/selftests/bpf/with_tunnels.sh | 36 + 22 files changed, 1843 insertions(+), 6 deletions(-) create mode 100644 tools/testing/selftests/bpf/bpf_flow.c create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh create mode 100755 tools/testing/selftests/bpf/with_addr.sh create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh -- 2.19.0.rc0.228.g281dcd1b4d0-goog
[PATCH v2 1/2] IB/ipoib: Use dev_port to expose network interface port numbers
Some InfiniBand network devices have multiple ports on the same PCI function. This initializes the `dev_port' sysfs field of those network interfaces with their port number. Prior to this the kernel erroneously used the `dev_id' sysfs field of those network interfaces to convey the port number to userspace. The use of `dev_id' was considered correct until Linux 3.15, when another field, `dev_port', was defined for this particular purpose and `dev_id' was reserved for distinguishing stacked ifaces (e.g: VLANs) with the same hardware address as their parent device. Similar fixes to net/mlx4_en and many other drivers, which started exporting this information through `dev_id' before 3.15, were accepted into the kernel 4 years ago. See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs'). Signed-off-by: Arseny Maslennikov --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index e3d28f9ad9c0..ba16a63ee303 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1880,7 +1880,7 @@ static int ipoib_parent_init(struct net_device *ndev) sizeof(union ib_gid)); SET_NETDEV_DEV(priv->dev, priv->ca->dev.parent); - priv->dev->dev_id = priv->port - 1; + priv->dev->dev_port = priv->port - 1; return 0; } -- 2.18.0
[PATCH v2 2/2] Documentation/ABI: document /sys/class/net/*/dev_port
The sysfs field was introduced 4 years ago along with fixes to various drivers that erroneously used `dev_id' for that purpose, but it was not properly documented anywhere. See commit v3.14-rc3-739-g3f85944fe207. Signed-off-by: Arseny Maslennikov --- Documentation/ABI/testing/sysfs-class-net | 10 ++ 1 file changed, 10 insertions(+) diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net index 2f1788111cd9..1593d8997ade 100644 --- a/Documentation/ABI/testing/sysfs-class-net +++ b/Documentation/ABI/testing/sysfs-class-net @@ -91,6 +91,16 @@ Description: stacked (e.g: VLAN interfaces) but still have the same MAC address as their parent device. +What: /sys/class/net//dev_port +Date: February 2014 +KernelVersion: 3.15 +Contact: netdev@vger.kernel.org +Description: + Indicates the port number of this network device, formatted + as a decimal value. Some NICs have multiple independent ports + on the same PCI bus, device and function. This field allows + userspace to distinguish the respective interfaces. + What: /sys/class/net//dormant Date: March 2006 KernelVersion: 2.6.17 -- 2.18.0
[PATCH v2 0/2] IB/ipoib: Use dev_port to disambiguate port numbers
Pre-3.15 userspace had trouble distinguishing different ports of a NIC on a single PCI bus/device/function. To solve this, a sysfs field `dev_port' was introduced quite a while ago (commit v3.14-rc3-739-g3f85944fe207), and some relevant device drivers were fixed to use it, but not in case of IPoIB. The convention for some reason never got documented in the kernel, but was immediately adopted by userspace (notably udev[1][2], biosdevname[3]) 3/3 documents the sysfs field — that's why I'm CC-ing netdev. This series was tested on and applies to 4.19-rc1. [1] https://lists.freedesktop.org/archives/systemd-devel/2014-June/020788.html [2] https://lists.freedesktop.org/archives/systemd-devel/2014-July/020804.html [3] https://github.com/CloudAutomationNTools/biosdevname/blob/c795d51dd93a5309652f0d635f12a3ecfabfaa72/src/eths.c#L38 v1->v2: replace a line instead of inserting and then removing. Arseny Maslennikov (2): IB/ipoib: Use dev_port to expose network interface port numbers Documentation/ABI: document /sys/class/net/*/dev_port Documentation/ABI/testing/sysfs-class-net | 10 ++ drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +- 2 files changed, 11 insertions(+), 1 deletion(-) -- 2.18.0
[bpf-next 2/3] flow_dissector: implements eBPF parser
From: Petar Penkov This eBPF program extracts basic/control/ip address/ports keys from incoming packets. It supports recursive parsing for IP encapsulation, and VLAN, along with IPv4/IPv6 and extension headers. This program is meant to show how flow dissection and key extraction can be done in eBPF. Link: http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf Signed-off-by: Petar Penkov Signed-off-by: Willem de Bruijn --- tools/testing/selftests/bpf/Makefile | 2 +- tools/testing/selftests/bpf/bpf_flow.c | 390 + 2 files changed, 391 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/bpf/bpf_flow.c diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index fff7fb1285fc..e65f50f9185e 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \ test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \ get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \ - test_skb_cgroup_id_kern.o + test_skb_cgroup_id_kern.o bpf_flow.o # Order correspond to 'make run_tests' order TEST_PROGS := test_kmod.sh \ diff --git a/tools/testing/selftests/bpf/bpf_flow.c b/tools/testing/selftests/bpf/bpf_flow.c new file mode 100644 index ..6d8388735d1d --- /dev/null +++ b/tools/testing/selftests/bpf/bpf_flow.c @@ -0,0 +1,390 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "bpf_helpers.h" +#include "bpf_endian.h" + +int _version SEC("version") = 1; +#define PROG(F) SEC(#F) int bpf_func_##F + +/* These are the identifiers of the BPF programs that will be used in tail + * calls. Name is limited to 16 characters, with the terminating character and + * bpf_func_ above, we have only 6 to work with, anything after will be cropped. + */ +enum { + IP, + IPV6, + IPV6OP, /* Destination/Hop-by-Hop Options IPv6 Extension header */ + IPV6FR, /* Fragmentation IPv6 Extension Header */ + MPLS, + VLAN, +}; + +#define IP_MF 0x2000 +#define IP_OFFSET 0x1FFF +#define IP6_MF 0x0001 +#define IP6_OFFSET 0xFFF8 + +struct vlan_hdr { + __be16 h_vlan_TCI; + __be16 h_vlan_encapsulated_proto; +}; + +struct gre_hdr { + __be16 flags; + __be16 proto; +}; + +struct frag_hdr { + __u8 nexthdr; + __u8 reserved; + __be16 frag_off; + __be32 identification; +}; + +struct bpf_map_def SEC("maps") jmp_table = { + .type = BPF_MAP_TYPE_PROG_ARRAY, + .key_size = sizeof(__u32), + .value_size = sizeof(__u32), + .max_entries = 8 +}; + +struct bpf_dissect_cb { + __u16 nhoff; +}; + +static __always_inline void *bpf_flow_dissect_get_header(struct __sk_buff *skb, +__u16 hdr_size) +{ + struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb); + void *data_end = (__u8 *)(long)skb->data_end; + void *data = (__u8 *)(long)skb->data; + __u8 *hdr; + + /* Verifies this variable offset does not overflow */ + if (cb->nhoff > (USHRT_MAX - hdr_size)) + return NULL; + + hdr = data + cb->nhoff; + if (hdr + hdr_size > data_end) + return NULL; + + return hdr; +} + +/* Dispatches on ETHERTYPE */ +static __always_inline int parse_eth_proto(struct __sk_buff *skb, __be16 proto) +{ + struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys); + keys->n_proto = proto; + + switch (proto) { + case bpf_htons(ETH_P_IP): + bpf_tail_call(skb, &jmp_table, IP); + break; + case bpf_htons(ETH_P_IPV6): + bpf_tail_call(skb, &jmp_table, IPV6); + break; + case bpf_htons(ETH_P_MPLS_MC): + case bpf_htons(ETH_P_MPLS_UC): + bpf_tail_call(skb, &jmp_table, MPLS); + break; + case bpf_htons(ETH_P_8021Q): + case bpf_htons(ETH_P_8021AD): + bpf_tail_call(skb, &jmp_table, VLAN); + break; + default: + /* Protocol not supported */ + return BPF_DROP; + } + + return BPF_DROP; +} + +static __always_inline void write_ports(struct __sk_buff *skb) +{ + struct bpf_flow_keys *keys = (struct bpf_flow_keys *)(long)(skb->flow_keys); + struct bpf_dissect_cb *cb = (struct bpf_dissect_cb *)(skb->cb); + void *data = (__u8 *)(long)skb->data; + __be16 *ports = data + cb->nhoff; /* TCP/UDP start with the ports */ + + keys->sport = ports[0];
[bpf-next 1/3] flow_dissector: implements flow dissector BPF hook
From: Petar Penkov Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector path. The BPF program is per-network namespace. Signed-off-by: Petar Penkov Signed-off-by: Willem de Bruijn --- include/linux/bpf.h| 1 + include/linux/bpf_types.h | 1 + include/linux/skbuff.h | 7 ++ include/net/net_namespace.h| 3 + include/net/sch_generic.h | 12 ++- include/uapi/linux/bpf.h | 25 ++ kernel/bpf/syscall.c | 8 ++ kernel/bpf/verifier.c | 33 net/core/filter.c | 67 net/core/flow_dissector.c | 136 + tools/bpf/bpftool/prog.c | 1 + tools/include/uapi/linux/bpf.h | 25 ++ tools/lib/bpf/libbpf.c | 2 + 13 files changed, 318 insertions(+), 3 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 523481a3471b..988a00797bcd 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -212,6 +212,7 @@ enum bpf_reg_type { PTR_TO_PACKET_META, /* skb->data - meta_len */ PTR_TO_PACKET, /* reg points to skb->data */ PTR_TO_PACKET_END, /* skb->data + headlen */ + PTR_TO_FLOW_KEYS,/* reg points to bpf_flow_keys */ }; /* The information passed from prog-specific *_is_valid_access diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h index cd26c090e7c0..22083712dd18 100644 --- a/include/linux/bpf_types.h +++ b/include/linux/bpf_types.h @@ -32,6 +32,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2) #ifdef CONFIG_INET BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport) #endif +BPF_PROG_TYPE(BPF_PROG_TYPE_FLOW_DISSECTOR, flow_dissector) BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops) BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 17a13e4785fc..ce0e863f02a2 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -243,6 +243,8 @@ struct scatterlist; struct pipe_inode_info; struct iov_iter; struct napi_struct; +struct bpf_prog; +union bpf_attr; #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE) struct nf_conntrack { @@ -1192,6 +1194,11 @@ void skb_flow_dissector_init(struct flow_dissector *flow_dissector, const struct flow_dissector_key *key, unsigned int key_count); +int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr, + struct bpf_prog *prog); + +int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr); + bool __skb_flow_dissect(const struct sk_buff *skb, struct flow_dissector *flow_dissector, void *target_container, diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h index 9b5fdc50519a..99d4148e0f90 100644 --- a/include/net/net_namespace.h +++ b/include/net/net_namespace.h @@ -43,6 +43,7 @@ struct ctl_table_header; struct net_generic; struct uevent_sock; struct netns_ipvs; +struct bpf_prog; #define NETDEV_HASHBITS8 @@ -145,6 +146,8 @@ struct net { #endif struct net_generic __rcu*gen; + struct bpf_prog __rcu *flow_dissector_prog; + /* Note : following structs are cache line aligned */ #ifdef CONFIG_XFRM struct netns_xfrm xfrm; diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h index a6d00093f35e..1b81ba85fd2d 100644 --- a/include/net/sch_generic.h +++ b/include/net/sch_generic.h @@ -19,6 +19,7 @@ struct Qdisc_ops; struct qdisc_walker; struct tcf_walker; struct module; +struct bpf_flow_keys; typedef int tc_setup_cb_t(enum tc_setup_type type, void *type_data, void *cb_priv); @@ -307,9 +308,14 @@ struct tcf_proto { }; struct qdisc_skb_cb { - unsigned intpkt_len; - u16 slave_dev_queue_mapping; - u16 tc_classid; + union { + struct { + unsigned intpkt_len; + u16 slave_dev_queue_mapping; + u16 tc_classid; + }; + struct bpf_flow_keys *flow_keys; + }; #define QDISC_CB_PRIV_LEN 20 unsigned char data[QDISC_CB_PRIV_LEN]; }; diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 66917a4eba27..3064706fcaaa 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -152,6 +152,7 @@ enum bpf_prog_type { BPF_PROG_TYPE_LWT_SEG6LOCAL, BPF_PROG_TYPE_LIRC_MODE2, BPF_PROG_TYPE_SK_REUSEPORT, + BPF_PROG_TYPE_FLOW_DISSECTOR, }; enum bpf_attach_type { @@ -172,6 +173,7 @@ enum bpf_attach_type { BPF_CGROUP_UDP4_SENDMSG, BPF_CGROUP_UDP6_SEN
[bpf-next 3/3] selftests/bpf: test bpf flow dissection
From: Petar Penkov Adds a test that sends different types of packets over multiple tunnels and verifies that valid packets are dissected correctly. To do so, a tc-flower rule is added to drop packets on UDP src port 9, and packets are sent from ports 8, 9, and 10. Only the packets on port 9 should be dropped. Because tc-flower relies on the flow dissector to match flows, correct classification demonstrates correct dissection. Also add support logic to load the BPF program and to inject the test packets. Signed-off-by: Petar Penkov Signed-off-by: Willem de Bruijn --- tools/testing/selftests/bpf/.gitignore| 2 + tools/testing/selftests/bpf/Makefile | 6 +- tools/testing/selftests/bpf/config| 1 + .../selftests/bpf/flow_dissector_load.c | 140 .../selftests/bpf/test_flow_dissector.c | 782 ++ .../selftests/bpf/test_flow_dissector.sh | 115 +++ tools/testing/selftests/bpf/with_addr.sh | 54 ++ tools/testing/selftests/bpf/with_tunnels.sh | 36 + 8 files changed, 1134 insertions(+), 2 deletions(-) create mode 100644 tools/testing/selftests/bpf/flow_dissector_load.c create mode 100644 tools/testing/selftests/bpf/test_flow_dissector.c create mode 100755 tools/testing/selftests/bpf/test_flow_dissector.sh create mode 100755 tools/testing/selftests/bpf/with_addr.sh create mode 100755 tools/testing/selftests/bpf/with_tunnels.sh diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore index 49938d72cf63..e61a85ac4b79 100644 --- a/tools/testing/selftests/bpf/.gitignore +++ b/tools/testing/selftests/bpf/.gitignore @@ -19,3 +19,5 @@ test_btf test_sockmap test_lirc_mode2_user get_cgroup_id_user +test_flow_dissector +flow_dissector_load diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index e65f50f9185e..fd3851d5c079 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -47,10 +47,12 @@ TEST_PROGS := test_kmod.sh \ test_tunnel.sh \ test_lwt_seg6local.sh \ test_lirc_mode2.sh \ - test_skb_cgroup_id.sh + test_skb_cgroup_id.sh \ + test_flow_dissector.sh # Compile but not part of 'make run_tests' -TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user +TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr test_skb_cgroup_id_user \ + flow_dissector_load test_flow_dissector include ../lib.mk diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config index b4994a94968b..3655508f95fd 100644 --- a/tools/testing/selftests/bpf/config +++ b/tools/testing/selftests/bpf/config @@ -18,3 +18,4 @@ CONFIG_CRYPTO_HMAC=m CONFIG_CRYPTO_SHA256=m CONFIG_VXLAN=y CONFIG_GENEVE=y +CONFIG_NET_CLS_FLOWER=m diff --git a/tools/testing/selftests/bpf/flow_dissector_load.c b/tools/testing/selftests/bpf/flow_dissector_load.c new file mode 100644 index ..d3273b5b3173 --- /dev/null +++ b/tools/testing/selftests/bpf/flow_dissector_load.c @@ -0,0 +1,140 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +const char *cfg_pin_path = "/sys/fs/bpf/flow_dissector"; +const char *cfg_map_name = "jmp_table"; +bool cfg_attach = true; +char *cfg_section_name; +char *cfg_path_name; + +static void load_and_attach_program(void) +{ + struct bpf_program *prog, *main_prog; + struct bpf_map *prog_array; + int i, fd, prog_fd, ret; + struct bpf_object *obj; + int prog_array_fd; + + ret = bpf_prog_load(cfg_path_name, BPF_PROG_TYPE_FLOW_DISSECTOR, &obj, + &prog_fd); + if (ret) + error(1, 0, "bpf_prog_load %s", cfg_path_name); + + main_prog = bpf_object__find_program_by_title(obj, cfg_section_name); + if (!main_prog) + error(1, 0, "bpf_object__find_program_by_title %s", + cfg_section_name); + + prog_fd = bpf_program__fd(main_prog); + if (prog_fd < 0) + error(1, 0, "bpf_program__fd"); + + prog_array = bpf_object__find_map_by_name(obj, cfg_map_name); + if (!prog_array) + error(1, 0, "bpf_object__find_map_by_name %s", cfg_map_name); + + prog_array_fd = bpf_map__fd(prog_array); + if (prog_array_fd < 0) + error(1, 0, "bpf_map__fd %s", cfg_map_name); + + i = 0; + bpf_object__for_each_program(prog, obj) { + fd = bpf_program__fd(prog); + if (fd < 0) + error(1, 0, "bpf_program__fd"); + + if (fd != prog_fd) { + printf("%d: %s\n", i, bpf_program__title(prog, false)); + bpf_map_update_elem(prog_array_fd, &i, &fd, BPF_ANY); + ++i; + } + } + + ret = bpf_prog_attach(prog_fd, 0
Re: kernels > v4.12 oops/crash with ipsec-traffic: bisected to b838d5e1c5b6e57b10ec8af2268824041e3ea911: ipv4: mark DST_NOGC and remove the operation of dst_free()
Hello, kernels > 4.12 do not work on one of our main routers. They crash as soon as ipsec-tunnels are configured and ipsec-traffic actually flows. Just configuring ipsec (that is starting strongswan) does not trigger the oops. I finally found time to bisect that. It bisected down to b838d5e1c5b6e57b10ec8af2268824041e3ea911 ipv4: mark DST_NOGC and remove the operation of dst_free() Now we have other machines which run just fine with the very same kernels doing ipsec. They differ insofar as they have much less cores, do not use the ixgbe driver, do not have 10G and terminate only a few tunnels instead of hundreds. I already tested distribution kernels > 4.12 from debian, they also crash. All kernels I created in the bisection run fine if I didn't use ipsec. The bad ones all oopsed/crashed exactly as vanilla 4.14 described above. Here is the bisect-log: # bad: [bebc6082da0a9f5d47a1ea2edc099bf671058bd4] Linux 4.14 # good: [69973b830859bc6529a7a0468ba0d80ee5117826] Linux 4.9 git bisect start 'v4.14' 'v4.9' # good: [d82dd0e34d0347be201fd274dc84cd645dccc064] raid1: prefer disk without bad blocks git bisect good d82dd0e34d0347be201fd274dc84cd645dccc064 # bad: [9967468c0a109644e4a1f5b39b39bf86fe7507a7] Merge branch 'akpm' (patches from Andrew) git bisect bad 9967468c0a109644e4a1f5b39b39bf86fe7507a7 # bad: [17d9aa66b08de445645bd0688fc1635bed77a57b] Merge tag 'iwlwifi-next-for-kalle-2017-06-30' of git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next git bisect bad 17d9aa66b08de445645bd0688fc1635bed77a57b # good: [de4d195308ad589626571dbe5789cebf9695a204] Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good de4d195308ad589626571dbe5789cebf9695a204 # good: [9376906c17fa975bf6a7ea9dd124be697bcda289] Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good 9376906c17fa975bf6a7ea9dd124be697bcda289 # good: [40e86a3619a1e84ad73c716c943f65fc38eb1e28] iwlwifi: mvm: use scnprintf() instead of snprintf() git bisect good 40e86a3619a1e84ad73c716c943f65fc38eb1e28 # bad: [c66f2091c9248ddf42504c74cd327ae8619b04a4] net/mlx5e: Prevent PFC call for non ethernet ports git bisect bad c66f2091c9248ddf42504c74cd327ae8619b04a4 # good: [a090bd4ff8387c409732a8e059fbf264ea0bdd56] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net git bisect good a090bd4ff8387c409732a8e059fbf264ea0bdd56 # good: [1947030645b6012aeee98da764d6dd47071a6aad] Merge branch 'dsa-prefix-Global-macros' git bisect good 1947030645b6012aeee98da764d6dd47071a6aad # good: [69137ea60c9dad58773a1918de6c1b00b088520c] pktgen: Specify num packets per thread git bisect good 69137ea60c9dad58773a1918de6c1b00b088520c # good: [d24406c85d123df773bc4df88ad5da2233896919] udp: call dst_hold_safe() in udp_sk_rx_set_dst() git bisect good d24406c85d123df773bc4df88ad5da2233896919 # bad: [5b7c9a8ff828287af5aebe93e707271bf1a82cc3] net: remove dst gc related code git bisect bad 5b7c9a8ff828287af5aebe93e707271bf1a82cc3 # bad: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() git bisect bad b838d5e1c5b6e57b10ec8af2268824041e3ea911 # good: [4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36] net: introduce a new function dst_dev_put() git bisect good 4a6ce2b6f2ecabbddcfe47e7cf61dd0f00b10e36 # good: [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly git bisect good 95c47f9cf5e028d1ae77dc6c767c1edc8a18025b # good: [9df16efadd2a8a82731dc76ff656c771e261827f] ipv4: call dst_hold_safe() properly git bisect good 9df16efadd2a8a82731dc76ff656c771e261827f # first bad commit: [b838d5e1c5b6e57b10ec8af2268824041e3ea911] ipv4: mark DST_NOGC and remove the operation of dst_free() In my first email I wrote >= 4.12, but I think 4.12 works. I bisected between 4.9 and 4.14 as we actually run 4.9 on the machine with the problem and 4.14 on most other routers. I also tested 4.18.5 and it still shows this bug. Regards, -- Wolfgang Walter Studentenwerk München Anstalt des öffentlichen Rechts
Re: [PATCH net-next] packet: add sockopt to ignore outgoing packets
On Thu, Aug 30, 2018 at 6:12 AM Vincent Whitchurch wrote: > > Currently, the only way to ignore outgoing packets on a packet socket is > via the BPF filter. With MSG_ZEROCOPY, packets that are looped into > AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even > if the filter run from packet_rcv() would reject them. So the presence > of a packet socket on the interface takes away the benefits of > MSG_ZEROCOPY, even if the packet socket is not interested in outgoing > packets. (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily > cloned, but the cost for that is much lower.) > > Add a socket option to allow AF_PACKET sockets to ignore outgoing > packets to solve this. Note that the *BSDs already have something > similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT. > > The first intended user is lldpd. Clear description of the use case, thanks. I don't see a simple alternative to introducing a new socket option, either (a new ETH_P_xx protocol wildcard different from ETH_P_ALL, perhaps). > Signed-off-by: Vincent Whitchurch > --- > include/linux/netdevice.h | 1 + > include/uapi/linux/if_packet.h | 1 + > net/core/dev.c | 3 +++ > net/packet/af_packet.c | 15 +++ > 4 files changed, 20 insertions(+) > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index ca5ab98053c8..8ef14d9edc58 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -2317,6 +2317,7 @@ static inline struct sk_buff > *call_gro_receive_sk(gro_receive_sk_t cb, > > struct packet_type { > __be16 type; /* This is really htons(ether_type). > */ > + boolignore_outgoing; > struct net_device *dev; /* NULL is wildcarded here > */ > int (*func) (struct sk_buff *, > struct net_device *, > diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h > index 67b61d91d89b..467b654bd4c7 100644 > --- a/include/uapi/linux/if_packet.h > +++ b/include/uapi/linux/if_packet.h > @@ -57,6 +57,7 @@ struct sockaddr_ll { > #define PACKET_QDISC_BYPASS20 > #define PACKET_ROLLOVER_STATS 21 > #define PACKET_FANOUT_DATA 22 > +#define PACKET_IGNORE_OUTGOING 23 > > #define PACKET_FANOUT_HASH 0 > #define PACKET_FANOUT_LB 1 > diff --git a/net/core/dev.c b/net/core/dev.c > index 325fc5088370..0addb4f0abfe 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1947,6 +1947,9 @@ static inline bool skb_loop_sk(struct packet_type > *ptype, struct sk_buff *skb) > if (!ptype->af_packet_priv || !skb->sk) > return false; > > + if (ptype->ignore_outgoing) > + return true; > + This probably does not belong in skb_loop_sk, but in dev_queue_xmit_nit directly. > if (ptype->id_match) > return ptype->id_match(ptype, skb->sk); > else if ((struct sock *)ptype->af_packet_priv == skb->sk)
[RFC PATCH net-next 0/4] net: batched receive in GRO path
This series listifies part of GRO processing, in a manner which allows those packets which are not GROed (i.e. for which dev_gro_receive returns GRO_NORMAL) to be passed on to the listified regular receive path. I have not listified dev_gro_receive() itself, or the per-protocol GRO callback, since GRO's need to hold packets on lists under napi->gro_hash makes keeping the packets on other lists awkward, and since the GRO control block state of held skbs can refer only to one 'new' skb at a time. Nonetheless the batching of the calling code yields some performance gains in the GRO case as well. Herewith the performance figures obtained in a NetPerf TCP stream test (with four streams, and irqs bound to a single core): net-next: 7.166 Gbit/s (sigma 0.435) after #2: 7.715 Gbit/s (sigma 0.145) = datum + 7.7% after #4: 7.890 Gbit/s (sigma 0.217) = datum + 10.1% (Note that the 'net-next' results were distinctly bimodal, with two results of about 8 Gbit/s and the remaining ten around 7 Gbit/s. I don't have a good explanation for this.) And with GRO disabled through ethtool -K (thus simulating traffic which is not GRO-able but, being TCP, is still passed to the GRO entry point): net-next: 4.756 Gbit/s (sigma 0.240) after #4: 5.355 Gbit/s (sigma 0.232) = datum + 12.6% Edward Cree (4): net: introduce list entry point for GRO sfc: use batched receive for GRO net: make listified RX functions return number of good packets net/core: handle GRO_NORMAL skbs as a list in napi_gro_receive_list drivers/net/ethernet/sfc/efx.c| 11 +++- drivers/net/ethernet/sfc/net_driver.h | 1 + drivers/net/ethernet/sfc/rx.c | 16 +- include/linux/netdevice.h | 6 +- include/net/ip.h | 4 +- include/net/ipv6.h| 4 +- net/core/dev.c| 104 ++ net/ipv4/ip_input.c | 39 - net/ipv6/ip6_input.c | 37 +++- 9 files changed, 157 insertions(+), 65 deletions(-)
[RFC PATCH net-next 1/4] net: introduce list entry point for GRO
Also export napi_frags_skb() so that drivers using the napi_gro_frags() interface can prepare their SKBs properly for submitting on such a list. Signed-off-by: Edward Cree --- include/linux/netdevice.h | 2 ++ net/core/dev.c| 28 +++- 2 files changed, 29 insertions(+), 1 deletion(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index ca5ab98053c8..a4f55142d3eb 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -3521,8 +3521,10 @@ int netif_receive_skb(struct sk_buff *skb); int netif_receive_skb_core(struct sk_buff *skb); void netif_receive_skb_list(struct list_head *head); gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb); +int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head); void napi_gro_flush(struct napi_struct *napi, bool flush_old); struct sk_buff *napi_get_frags(struct napi_struct *napi); +struct sk_buff *napi_frags_skb(struct napi_struct *napi); gro_result_t napi_gro_frags(struct napi_struct *napi); struct packet_offload *gro_find_receive_by_type(__be16 type); struct packet_offload *gro_find_complete_by_type(__be16 type); diff --git a/net/core/dev.c b/net/core/dev.c index 325fc5088370..d1e55bef84eb 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5596,6 +5596,31 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb) } EXPORT_SYMBOL(napi_gro_receive); +/* Returns the number of SKBs on the list successfully received */ +int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head) +{ + struct sk_buff *skb, *next; + gro_result_t result; + int kept = 0; + + list_for_each_entry(skb, head, list) { + skb_mark_napi_id(skb, napi); + trace_napi_gro_receive_entry(skb); + skb_gro_reset_offset(skb); + } + + list_for_each_entry_safe(skb, next, head, list) { + list_del(&skb->list); + skb->next = NULL; + result = dev_gro_receive(napi, skb); + result = napi_skb_finish(result, skb); + if (result != GRO_DROP) + kept++; + } + return kept; +} +EXPORT_SYMBOL(napi_gro_receive_list); + static void napi_reuse_skb(struct napi_struct *napi, struct sk_buff *skb) { if (unlikely(skb->pfmemalloc)) { @@ -5667,7 +5692,7 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi, * Drivers could call both napi_gro_frags() and napi_gro_receive() * We copy ethernet header into skb->data to have a common layout. */ -static struct sk_buff *napi_frags_skb(struct napi_struct *napi) +struct sk_buff *napi_frags_skb(struct napi_struct *napi) { struct sk_buff *skb = napi->skb; const struct ethhdr *eth; @@ -5703,6 +5728,7 @@ static struct sk_buff *napi_frags_skb(struct napi_struct *napi) return skb; } +EXPORT_SYMBOL(napi_frags_skb); gro_result_t napi_gro_frags(struct napi_struct *napi) {
[RFC PATCH net-next 2/4] sfc: use batched receive for GRO
Signed-off-by: Edward Cree --- drivers/net/ethernet/sfc/efx.c| 11 +-- drivers/net/ethernet/sfc/net_driver.h | 1 + drivers/net/ethernet/sfc/rx.c | 16 +--- 3 files changed, 23 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/sfc/efx.c b/drivers/net/ethernet/sfc/efx.c index 330233286e78..dba13a28014c 100644 --- a/drivers/net/ethernet/sfc/efx.c +++ b/drivers/net/ethernet/sfc/efx.c @@ -263,9 +263,9 @@ static int efx_check_disabled(struct efx_nic *efx) */ static int efx_process_channel(struct efx_channel *channel, int budget) { + struct list_head rx_list, gro_list; struct efx_tx_queue *tx_queue; - struct list_head rx_list; - int spent; + int spent, gro_count; if (unlikely(!channel->enabled)) return 0; @@ -275,6 +275,10 @@ static int efx_process_channel(struct efx_channel *channel, int budget) INIT_LIST_HEAD(&rx_list); channel->rx_list = &rx_list; + EFX_WARN_ON_PARANOID(channel->gro_list != NULL); + INIT_LIST_HEAD(&gro_list); + channel->gro_list = &gro_list; + efx_for_each_channel_tx_queue(tx_queue, channel) { tx_queue->pkts_compl = 0; tx_queue->bytes_compl = 0; @@ -300,6 +304,9 @@ static int efx_process_channel(struct efx_channel *channel, int budget) /* Receive any packets we queued up */ netif_receive_skb_list(channel->rx_list); channel->rx_list = NULL; + gro_count = napi_gro_receive_list(&channel->napi_str, channel->gro_list); + channel->irq_mod_score += gro_count * 2; + channel->gro_list = NULL; return spent; } diff --git a/drivers/net/ethernet/sfc/net_driver.h b/drivers/net/ethernet/sfc/net_driver.h index 961b92979640..72addac7a84a 100644 --- a/drivers/net/ethernet/sfc/net_driver.h +++ b/drivers/net/ethernet/sfc/net_driver.h @@ -502,6 +502,7 @@ struct efx_channel { unsigned int rx_pkt_index; struct list_head *rx_list; + struct list_head *gro_list; struct efx_rx_queue rx_queue; struct efx_tx_queue tx_queue[EFX_TXQ_TYPES]; diff --git a/drivers/net/ethernet/sfc/rx.c b/drivers/net/ethernet/sfc/rx.c index 396ff01298cd..0534a54048c6 100644 --- a/drivers/net/ethernet/sfc/rx.c +++ b/drivers/net/ethernet/sfc/rx.c @@ -453,9 +453,19 @@ efx_rx_packet_gro(struct efx_channel *channel, struct efx_rx_buffer *rx_buf, skb_record_rx_queue(skb, channel->rx_queue.core_index); - gro_result = napi_gro_frags(napi); - if (gro_result != GRO_DROP) - channel->irq_mod_score += 2; + /* Pass the packet up */ + if (channel->gro_list != NULL) { + /* Clear napi->skb and prepare skb for GRO */ + skb = napi_frags_skb(napi); + if (skb) + /* Add to list, will pass up later */ + list_add_tail(&skb->list, channel->gro_list); + } else { + /* No list, so pass it up now */ + gro_result = napi_gro_frags(napi); + if (gro_result != GRO_DROP) + channel->irq_mod_score += 2; + } } /* Allocate and construct an SKB around page fragments */
[RFC PATCH net-next 3/4] net: make listified RX functions return number of good packets
Signed-off-by: Edward Cree --- include/linux/netdevice.h | 4 +-- include/net/ip.h | 4 +-- include/net/ipv6.h| 4 +-- net/core/dev.c| 63 +-- net/ipv4/ip_input.c | 39 ++--- net/ipv6/ip6_input.c | 37 +--- 6 files changed, 92 insertions(+), 59 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index a4f55142d3eb..ba5f985688eb 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2322,7 +2322,7 @@ struct packet_type { struct net_device *, struct packet_type *, struct net_device *); - void(*list_func) (struct list_head *, + int (*list_func) (struct list_head *, struct packet_type *, struct net_device *); bool(*id_match)(struct packet_type *ptype, @@ -3519,7 +3519,7 @@ int netif_rx(struct sk_buff *skb); int netif_rx_ni(struct sk_buff *skb); int netif_receive_skb(struct sk_buff *skb); int netif_receive_skb_core(struct sk_buff *skb); -void netif_receive_skb_list(struct list_head *head); +int netif_receive_skb_list(struct list_head *head); gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb); int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head); void napi_gro_flush(struct napi_struct *napi, bool flush_old); diff --git a/include/net/ip.h b/include/net/ip.h index e44b1a44f67a..aab1f7eea1e1 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -152,8 +152,8 @@ int ip_build_and_send_pkt(struct sk_buff *skb, const struct sock *sk, struct ip_options_rcu *opt); int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev); -void ip_list_rcv(struct list_head *head, struct packet_type *pt, -struct net_device *orig_dev); +int ip_list_rcv(struct list_head *head, struct packet_type *pt, + struct net_device *orig_dev); int ip_local_deliver(struct sk_buff *skb); int ip_mr_input(struct sk_buff *skb); int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb); diff --git a/include/net/ipv6.h b/include/net/ipv6.h index ff33f498c137..f15651eabfe0 100644 --- a/include/net/ipv6.h +++ b/include/net/ipv6.h @@ -914,8 +914,8 @@ static inline __be32 flowi6_get_flowlabel(const struct flowi6 *fl6) int ipv6_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt, struct net_device *orig_dev); -void ipv6_list_rcv(struct list_head *head, struct packet_type *pt, - struct net_device *orig_dev); +int ipv6_list_rcv(struct list_head *head, struct packet_type *pt, + struct net_device *orig_dev); int ip6_rcv_finish(struct net *net, struct sock *sk, struct sk_buff *skb); diff --git a/net/core/dev.c b/net/core/dev.c index d1e55bef84eb..ac9741273c62 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -4920,24 +4920,27 @@ int netif_receive_skb_core(struct sk_buff *skb) } EXPORT_SYMBOL(netif_receive_skb_core); -static inline void __netif_receive_skb_list_ptype(struct list_head *head, - struct packet_type *pt_prev, - struct net_device *orig_dev) +static inline int __netif_receive_skb_list_ptype(struct list_head *head, +struct packet_type *pt_prev, +struct net_device *orig_dev) { struct sk_buff *skb, *next; + int kept = 0; if (!pt_prev) - return; + return 0; if (list_empty(head)) - return; + return 0; if (pt_prev->list_func != NULL) - pt_prev->list_func(head, pt_prev, orig_dev); + kept = pt_prev->list_func(head, pt_prev, orig_dev); else list_for_each_entry_safe(skb, next, head, list) - pt_prev->func(skb, skb->dev, pt_prev, orig_dev); + if (pt_prev->func(skb, skb->dev, pt_prev, orig_dev) == NET_RX_SUCCESS) + kept++; + return kept; } -static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc) +static int __netif_receive_skb_list_core(struct list_head *head, bool pfmemalloc) { /* Fast-path assumptions: * - There is no RX handler. @@ -4954,6 +4957,7 @@ static void __netif_receive_skb_list_core(struct list_head *head, bool pfmemallo struct net_device *od_curr = NULL; struct list_head sublist; struct sk_buff *skb, *next; + int kept = 0, ret;
[RFC PATCH net-next 4/4] net/core: handle GRO_NORMAL skbs as a list in napi_gro_receive_list
Allows GRO-using drivers to get the benefits of batching for non-GROable traffic. Signed-off-by: Edward Cree --- net/core/dev.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index ac9741273c62..f9391f76feb7 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5615,6 +5615,7 @@ EXPORT_SYMBOL(napi_gro_receive); int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head) { struct sk_buff *skb, *next; + struct list_head sublist; gro_result_t result; int kept = 0; @@ -5624,14 +5625,26 @@ int napi_gro_receive_list(struct napi_struct *napi, struct list_head *head) skb_gro_reset_offset(skb); } + INIT_LIST_HEAD(&sublist); list_for_each_entry_safe(skb, next, head, list) { list_del(&skb->list); skb->next = NULL; result = dev_gro_receive(napi, skb); - result = napi_skb_finish(result, skb); - if (result != GRO_DROP) - kept++; + if (result == GRO_NORMAL) { + list_add_tail(&skb->list, &sublist); + continue; + } else { + if (!list_empty(&sublist)) { + /* Handle the GRO_NORMAL skbs to prevent OoO */ + kept += netif_receive_skb_list_internal(&sublist); + INIT_LIST_HEAD(&sublist); + } + result = napi_skb_finish(result, skb); + if (result != GRO_DROP) + kept++; + } } + kept += netif_receive_skb_list_internal(&sublist); return kept; } EXPORT_SYMBOL(napi_gro_receive_list);
Re: [PATCH v2 1/2] IB/ipoib: Use dev_port to expose network interface port numbers
On Thu, 2018-08-30 at 21:22 +0300, Arseny Maslennikov wrote: > Some InfiniBand network devices have multiple ports on the same PCI > function. This initializes the `dev_port' sysfs field of those > network interfaces with their port number. > > Prior to this the kernel erroneously used the `dev_id' sysfs > field of those network interfaces to convey the port number to userspace. > > The use of `dev_id' was considered correct until Linux 3.15, > when another field, `dev_port', was defined for this particular > purpose and `dev_id' was reserved for distinguishing stacked ifaces > (e.g: VLANs) with the same hardware address as their parent device. > > Similar fixes to net/mlx4_en and many other drivers, which started > exporting this information through `dev_id' before 3.15, were accepted > into the kernel 4 years ago. > See 76a066f2a2a0 (`net/mlx4_en: Expose port number through sysfs'). > > Signed-off-by: Arseny Maslennikov > --- > drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c > b/drivers/infiniband/ulp/ipoib/ipoib_main.c > index e3d28f9ad9c0..ba16a63ee303 100644 > --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c > +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c > @@ -1880,7 +1880,7 @@ static int ipoib_parent_init(struct net_device *ndev) > sizeof(union ib_gid)); > > SET_NETDEV_DEV(priv->dev, priv->ca->dev.parent); > - priv->dev->dev_id = priv->port - 1; > + priv->dev->dev_port = priv->port - 1; I don't know that we can't do this. At least not yet. Expose the new item to make us compliant with the new docs, and deprecate the old sysfs item, but we can't just yank the old item. Existing tools/scripts might (probably) rely on it (existing tools already special case IPoIB interfaces and we'll need to make sure they don't special case this element too). -- Doug Ledford GPG KeyID: B826A3330E572FDD Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD signature.asc Description: This is a digitally signed message part
[PATCH net-next 0/2] netlink: multicast join notifications
This patch is an update to https://patchwork.ozlabs.org/patch/571127/. The previous patch was based on sending multicast MAC addresses in the netlink messages to allow the programming of hardware. It was agreed to rework this to use RTM_NEW/DELLINK messages which were more appropriate for layer 2 addresses. In the interim period it has become apparent that the applications actually needs to see the L3 multicast addresses which are joined for FORUS processing so this patch has been reworked to send the L3 multicast addresses using RTM_NEW/DELADDR. These new multicast L3 netlink notifications should use the IFA_MULTICAST address type but this has been dropped in favour of IFA_ADDRESS as during testing it was noticed that some applications - notably getaddrinfo in lib6c assume that there is an IFA_ADDRESS in a RTM_NEW/DELADDR and blindly dereference it. Finally the RTM_GETADDR for both address families has been modified to include the multicast l3 addresses. Patrick Ruddy (2): netlink: ipv4 IGMP join notifications netlink: ipv6 MLD join notifications include/linux/igmp.h | 2 + net/ipv4/devinet.c | 39 +-- net/ipv4/igmp.c | 90 net/ipv6/addrconf.c | 44 -- net/ipv6/mcast.c | 66 5 files changed, 218 insertions(+), 23 deletions(-) -- 2.17.1
[net-next v2 00/15][pull request] 40GbE Intel Wired LAN Driver Updates 2018-08-30
This series contains updates to i40e, i40evf and virtchnl. Jake implements helper functions to use an array to handle the queue stats which reduces the boiler plate code as well as keep the complexity localized to a few functions. Paweł adds the ability to change a VF's MAC address from the host side without having to reload the VF driver on the guest side. Paul adds a check to ensure that the number of queues that the PF sends to the VF is equal to or less than the maximum number of queues the VF can support. Mitch fixes an issue caught by GCC 8, where we need to not include the terminating null in the length of the string for strncpy(). Lihong fixes a VF issue to ensure that it does not enter into promiscuous mode when macvlan is added to the VF. Fixed a potential crash after a VF is removed, since the workqueue sync for the adminq task was not being cancelled. Harshitha fixes the type for field_flags in the virtchnl_filter struct. Martyna removes an unnecessary check in a conditional if statement. Björn fixes an issue reported by Jesper Dangaard Brouer, where the driver was reporting incorrect statistics when XDP was enabled. Jan fixes the potential reporting of incorrect speed settings. Patryk fixed an issue where the flag I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING was getting set when any offload is set via ethtool. Resolved by only setting this flag when VLAN offload is enabled. Also ensure we hold the rtnl lock when we are clearing the interrupt scheme. Added a check when deleting the MAC address from the VF to ensure that the MAC address was not set by the PF and if it was, do not delete it. v2: updated patch 2 in the series based on community feedback from David Miller to inline a function The following are changes since commit f0259b6ac4a3d27f6b7e938d6fce367cea377063: Merge tag 'mac80211-next-for-davem-2018-08-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next and are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE Björn Töpel (1): i40e: report correct statistics when XDP is enabled Harshitha Ramamurthy (1): virtchnl: use u8 type for a field in the virtchnl_filter struct Jacob Keller (3): i40e: convert queue stats to i40e_stats array i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h i40evf: update ethtool stats code and use helper functions Jan Sokolowski (1): i40e: Check and correct speed values for link on open Lihong Yang (2): i40evf: set IFF_UNICAST_FLT flag for the VF i40evf: cancel workqueue sync for adminq when a VF is removed Martyna Szapar (1): i40e: static analysis report from community Mitch Williams (1): i40e: use correct length for strncpy Patryk Małek (3): i40evf: Don't enable vlan stripping when rx offload is turned on i40e: hold the rtnl lock on clearing interrupt scheme i40e: Prevent deleting MAC address from VF when set by PF Paul M Stillwell Jr (1): i40evf: Validate the number of queues a PF sends Paweł Jabłoński (1): i40evf: Change a VF mac without reloading the VF driver .../net/ethernet/intel/i40e/i40e_ethtool.c| 238 -- .../ethernet/intel/i40e/i40e_ethtool_stats.h | 221 drivers/net/ethernet/intel/i40e/i40e_main.c | 61 +++-- drivers/net/ethernet/intel/i40e/i40e_ptp.c| 3 +- .../ethernet/intel/i40e/i40e_virtchnl_pf.c| 18 +- .../intel/i40evf/i40e_ethtool_stats.h | 221 .../ethernet/intel/i40evf/i40evf_ethtool.c| 137 +- .../net/ethernet/intel/i40evf/i40evf_main.c | 15 +- .../ethernet/intel/i40evf/i40evf_virtchnl.c | 43 +++- include/linux/avf/virtchnl.h | 2 +- 10 files changed, 678 insertions(+), 281 deletions(-) create mode 100644 drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h create mode 100644 drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h -- 2.17.1
[net-next v2 03/15] i40evf: update ethtool stats code and use helper functions
From: Jacob Keller Fix a bug in the way we handled VF queues, by always showing stats for the maximum number of queues, even if they aren't allocated. It is not safe to change the number of strings reported to ethtool, as grabbing statistics occurs over multiple ethtool ops for which the rtnl_lock() cannot be held the entire time. Avoid this by always reporting queue stats for the maximum number of queues in the netdevice. Share some of the helper functionality for adding stats with the PF code in i40e_ethtool_stats.h This should reduce the chance of potential future bugs, and make adding new statistics easier. Note for the queue stats, unlike the PF driver we do not keep an array of queue pointers, but an array of queues, so care must be taken to avoid accessing queue memory that hasn't yet been allocated. Signed-off-by: Jacob Keller Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- .../intel/i40evf/i40e_ethtool_stats.h | 221 ++ .../ethernet/intel/i40evf/i40evf_ethtool.c| 137 ++- 2 files changed, 298 insertions(+), 60 deletions(-) create mode 100644 drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h diff --git a/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h b/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h new file mode 100644 index ..60b595dd8c39 --- /dev/null +++ b/drivers/net/ethernet/intel/i40evf/i40e_ethtool_stats.h @@ -0,0 +1,221 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* Copyright(c) 2013 - 2018 Intel Corporation. */ + +/* ethtool statistics helpers */ + +/** + * struct i40e_stats - definition for an ethtool statistic + * @stat_string: statistic name to display in ethtool -S output + * @sizeof_stat: the sizeof() the stat, must be no greater than sizeof(u64) + * @stat_offset: offsetof() the stat from a base pointer + * + * This structure defines a statistic to be added to the ethtool stats buffer. + * It defines a statistic as offset from a common base pointer. Stats should + * be defined in constant arrays using the I40E_STAT macro, with every element + * of the array using the same _type for calculating the sizeof_stat and + * stat_offset. + * + * The @sizeof_stat is expected to be sizeof(u8), sizeof(u16), sizeof(u32) or + * sizeof(u64). Other sizes are not expected and will produce a WARN_ONCE from + * the i40e_add_ethtool_stat() helper function. + * + * The @stat_string is interpreted as a format string, allowing formatted + * values to be inserted while looping over multiple structures for a given + * statistics array. Thus, every statistic string in an array should have the + * same type and number of format specifiers, to be formatted by variadic + * arguments to the i40e_add_stat_string() helper function. + **/ +struct i40e_stats { + char stat_string[ETH_GSTRING_LEN]; + int sizeof_stat; + int stat_offset; +}; + +/* Helper macro to define an i40e_stat structure with proper size and type. + * Use this when defining constant statistics arrays. Note that @_type expects + * only a type name and is used multiple times. + */ +#define I40E_STAT(_type, _name, _stat) { \ + .stat_string = _name, \ + .sizeof_stat = FIELD_SIZEOF(_type, _stat), \ + .stat_offset = offsetof(_type, _stat) \ +} + +/* Helper macro for defining some statistics directly copied from the netdev + * stats structure. + */ +#define I40E_NETDEV_STAT(_net_stat) \ + I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat) + +/* Helper macro for defining some statistics related to queues */ +#define I40E_QUEUE_STAT(_name, _stat) \ + I40E_STAT(struct i40e_ring, _name, _stat) + +/* Stats associated with a Tx or Rx ring */ +static const struct i40e_stats i40e_gstrings_queue_stats[] = { + I40E_QUEUE_STAT("%s-%u.packets", stats.packets), + I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes), +}; + +/** + * i40evf_add_one_ethtool_stat - copy the stat into the supplied buffer + * @data: location to store the stat value + * @pointer: basis for where to copy from + * @stat: the stat definition + * + * Copies the stat data defined by the pointer and stat structure pair into + * the memory supplied as data. Used to implement i40e_add_ethtool_stats and + * i40evf_add_queue_stats. If the pointer is null, data will be zero'd. + */ +static inline void +i40evf_add_one_ethtool_stat(u64 *data, void *pointer, + const struct i40e_stats *stat) +{ + char *p; + + if (!pointer) { + /* ensure that the ethtool data buffer is zero'd for any stats +* which don't have a valid pointer. +*/ + *data = 0; + return; + } + + p = (char *)pointer + stat->stat_offset; + switch (stat->sizeof_stat) { + case sizeof(u64): + *data = *((u64 *)p); + break; + case sizeof(u32): + *data = *((u32 *)p); + break; + case sizeof(u16): +
[net-next v2 09/15] i40e: static analysis report from community
From: Martyna Szapar Static analysis tools report a problem from original driver submission. Removing unnecessary check in condition. Signed-off-by: Martyna Szapar Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index ac685ad4d877..62f2b5bce6bb 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -13033,7 +13033,7 @@ struct i40e_veb *i40e_veb_setup(struct i40e_pf *pf, u16 flags, for (vsi_idx = 0; vsi_idx < pf->num_alloc_vsi; vsi_idx++) if (pf->vsi[vsi_idx] && pf->vsi[vsi_idx]->seid == vsi_seid) break; - if (vsi_idx >= pf->num_alloc_vsi && vsi_seid != 0) { + if (vsi_idx == pf->num_alloc_vsi && vsi_seid != 0) { dev_info(&pf->pdev->dev, "vsi seid %d not found\n", vsi_seid); return NULL; -- 2.17.1
[net-next v2 12/15] i40evf: Don't enable vlan stripping when rx offload is turned on
From: Patryk Małek With current implementation of i40evf_set_features when user sets any offload via ethtool we set I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING as a required aq which triggers driver to call i40evf_enable_vlan_stripping. This shouldn't take place. This patches fixes it by setting the flag only when VLAN offload is turned on. Signed-off-by: Patryk Małek Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40evf/i40evf_main.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c index 07f187b5bcac..c7048cf484fb 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c @@ -3120,18 +3120,19 @@ static int i40evf_set_features(struct net_device *netdev, { struct i40evf_adapter *adapter = netdev_priv(netdev); - /* Don't allow changing VLAN_RX flag when VLAN is set for VF -* and return an error in this case + /* Don't allow changing VLAN_RX flag when adapter is not capable +* of VLAN offload */ - if (VLAN_ALLOWED(adapter)) { + if (!VLAN_ALLOWED(adapter)) { + if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX) + return -EINVAL; + } else if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX) { if (features & NETIF_F_HW_VLAN_CTAG_RX) adapter->aq_required |= I40EVF_FLAG_AQ_ENABLE_VLAN_STRIPPING; else adapter->aq_required |= I40EVF_FLAG_AQ_DISABLE_VLAN_STRIPPING; - } else if ((netdev->features ^ features) & NETIF_F_HW_VLAN_CTAG_RX) { - return -EINVAL; } return 0; -- 2.17.1
[net-next v2 14/15] i40evf: cancel workqueue sync for adminq when a VF is removed
From: Lihong Yang If a VF is being removed, there is no need to continue with the workqueue sync for the adminq task, thus cancel it. Without this call, when VFs are created and removed right away, there might be a chance for the driver to crash with events stuck in the adminq. Signed-off-by: Lihong Yang Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c index c7048cf484fb..174d1da2857b 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c @@ -3910,6 +3910,8 @@ static void i40evf_remove(struct pci_dev *pdev) if (adapter->watchdog_timer.function) del_timer_sync(&adapter->watchdog_timer); + cancel_work_sync(&adapter->adminq_task); + i40evf_free_rss(adapter); if (hw->aq.asq.count) -- 2.17.1
[net-next v2 05/15] i40evf: Validate the number of queues a PF sends
From: Paul M Stillwell Jr A PF can send any number of queues to the VF and the VF may not be able to support that many. Check to see that the number of queues is less than or equal to the max number of queues the VF can have. Signed-off-by: Paul M Stillwell Jr Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- .../ethernet/intel/i40evf/i40evf_virtchnl.c | 32 +++ 1 file changed, 32 insertions(+) diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c index 79b7be83b5c3..6579dabab78c 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c @@ -153,6 +153,32 @@ int i40evf_send_vf_config_msg(struct i40evf_adapter *adapter) NULL, 0); } +/** + * i40evf_validate_num_queues + * @adapter: adapter structure + * + * Validate that the number of queues the PF has sent in + * VIRTCHNL_OP_GET_VF_RESOURCES is not larger than the VF can handle. + **/ +static void i40evf_validate_num_queues(struct i40evf_adapter *adapter) +{ + if (adapter->vf_res->num_queue_pairs > I40EVF_MAX_REQ_QUEUES) { + struct virtchnl_vsi_resource *vsi_res; + int i; + + dev_info(&adapter->pdev->dev, "Received %d queues, but can only have a max of %d\n", +adapter->vf_res->num_queue_pairs, +I40EVF_MAX_REQ_QUEUES); + dev_info(&adapter->pdev->dev, "Fixing by reducing queues to %d\n", +I40EVF_MAX_REQ_QUEUES); + adapter->vf_res->num_queue_pairs = I40EVF_MAX_REQ_QUEUES; + for (i = 0; i < adapter->vf_res->num_vsis; i++) { + vsi_res = &adapter->vf_res->vsi_res[i]; + vsi_res->num_queue_pairs = I40EVF_MAX_REQ_QUEUES; + } + } +} + /** * i40evf_get_vf_config * @adapter: private adapter structure @@ -195,6 +221,11 @@ int i40evf_get_vf_config(struct i40evf_adapter *adapter) err = (i40e_status)le32_to_cpu(event.desc.cookie_low); memcpy(adapter->vf_res, event.msg_buf, min(event.msg_len, len)); + /* some PFs send more queues than we should have so validate that +* we aren't getting too many queues +*/ + if (!err) + i40evf_validate_num_queues(adapter); i40e_vf_parse_hw_config(hw, adapter->vf_res); out_alloc: kfree(event.msg_buf); @@ -1329,6 +1360,7 @@ void i40evf_virtchnl_completion(struct i40evf_adapter *adapter, I40E_MAX_VF_VSI * sizeof(struct virtchnl_vsi_resource); memcpy(adapter->vf_res, msg, min(msglen, len)); + i40evf_validate_num_queues(adapter); i40e_vf_parse_hw_config(&adapter->hw, adapter->vf_res); if (is_zero_ether_addr(adapter->hw.mac.addr)) { /* restore current mac address */ -- 2.17.1
[net-next v2 02/15] i40e: move ethtool stats boiler plate code to i40e_ethtool_stats.h
From: Jacob Keller Move the boiler plate structures and helper functions we recently added into their own header file, so that the complete collection is located together. Signed-off-by: Jacob Keller Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- v2: Based on feedback from David Miller, inline the __i40e_add_stat_strings function .../net/ethernet/intel/i40e/i40e_ethtool.c| 182 +-- .../ethernet/intel/i40e/i40e_ethtool_stats.h | 221 ++ 2 files changed, 222 insertions(+), 181 deletions(-) create mode 100644 drivers/net/ethernet/intel/i40e/i40e_ethtool_stats.h diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 235012b3bd42..d7d3974beca2 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -6,25 +6,8 @@ #include "i40e.h" #include "i40e_diag.h" -struct i40e_stats { - /* The stat_string is expected to be a format string formatted using -* vsnprintf by i40e_add_stat_strings. Every member of a stats array -* should use the same format specifiers as they will be formatted -* using the same variadic arguments. -*/ - char stat_string[ETH_GSTRING_LEN]; - int sizeof_stat; - int stat_offset; -}; - -#define I40E_STAT(_type, _name, _stat) { \ - .stat_string = _name, \ - .sizeof_stat = FIELD_SIZEOF(_type, _stat), \ - .stat_offset = offsetof(_type, _stat) \ -} +#include "i40e_ethtool_stats.h" -#define I40E_NETDEV_STAT(_net_stat) \ - I40E_STAT(struct rtnl_link_stats64, #_net_stat, _net_stat) #define I40E_PF_STAT(_name, _stat) \ I40E_STAT(struct i40e_pf, _name, _stat) #define I40E_VSI_STAT(_name, _stat) \ @@ -173,11 +156,6 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] = { I40E_PFC_STAT("port.rx_priority_%u_xon_2_xoff", priority_xon_2_xoff), }; -static const struct i40e_stats i40e_gstrings_queue_stats[] = { - I40E_QUEUE_STAT("%s-%u.packets", stats.packets), - I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes), -}; - #define I40E_NETDEV_STATS_LEN ARRAY_SIZE(i40e_gstrings_net_stats) #define I40E_MISC_STATS_LENARRAY_SIZE(i40e_gstrings_misc_stats) @@ -1749,128 +1727,6 @@ static int i40e_get_sset_count(struct net_device *netdev, int sset) } } -/** - * i40e_add_one_ethtool_stat - copy the stat into the supplied buffer - * @data: location to store the stat value - * @pointer: basis for where to copy from - * @stat: the stat definition - * - * Copies the stat data defined by the pointer and stat structure pair into - * the memory supplied as data. Used to implement i40e_add_ethtool_stats and - * i40e_add_queue_stats. If the pointer is null, data will be zero'd. - */ -static inline void -i40e_add_one_ethtool_stat(u64 *data, void *pointer, - const struct i40e_stats *stat) -{ - char *p; - - if (!pointer) { - /* ensure that the ethtool data buffer is zero'd for any stats -* which don't have a valid pointer. -*/ - *data = 0; - return; - } - - p = (char *)pointer + stat->stat_offset; - switch (stat->sizeof_stat) { - case sizeof(u64): - *data = *((u64 *)p); - break; - case sizeof(u32): - *data = *((u32 *)p); - break; - case sizeof(u16): - *data = *((u16 *)p); - break; - case sizeof(u8): - *data = *((u8 *)p); - break; - default: - WARN_ONCE(1, "unexpected stat size for %s", - stat->stat_string); - *data = 0; - } -} - -/** - * __i40e_add_ethtool_stats - copy stats into the ethtool supplied buffer - * @data: ethtool stats buffer - * @pointer: location to copy stats from - * @stats: array of stats to copy - * @size: the size of the stats definition - * - * Copy the stats defined by the stats array using the pointer as a base into - * the data buffer supplied by ethtool. Updates the data pointer to point to - * the next empty location for successive calls to __i40e_add_ethtool_stats. - * If pointer is null, set the data values to zero and update the pointer to - * skip these stats. - **/ -static inline void -__i40e_add_ethtool_stats(u64 **data, void *pointer, -const struct i40e_stats stats[], -const unsigned int size) -{ - unsigned int i; - - for (i = 0; i < size; i++) - i40e_add_one_ethtool_stat((*data)++, pointer, &stats[i]); -} - -/** - * i40e_add_ethtool_stats - copy stats into ethtool supplied buffer - * @data: ethtool stats buffer - * @pointer: location where stats are stored - * @stats: static const array of stat definitions - * - * Macro to ease the use of __i40e_add_ethtool_stats by taking a static - * constant stat
[net-next v2 06/15] i40e: use correct length for strncpy
From: Mitch Williams Caught by GCC 8. When we provide a length for strncpy, we should not include the terminating null. So we must tell it one less than the size of the destination buffer. Signed-off-by: Mitch Williams Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_ptp.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_ptp.c b/drivers/net/ethernet/intel/i40e/i40e_ptp.c index 35f2866b38c6..1199f0502d6d 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ptp.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ptp.c @@ -694,7 +694,8 @@ static long i40e_ptp_create_clock(struct i40e_pf *pf) if (!IS_ERR_OR_NULL(pf->ptp_clock)) return 0; - strncpy(pf->ptp_caps.name, i40e_driver_name, sizeof(pf->ptp_caps.name)); + strncpy(pf->ptp_caps.name, i40e_driver_name, + sizeof(pf->ptp_caps.name) - 1); pf->ptp_caps.owner = THIS_MODULE; pf->ptp_caps.max_adj = 9; pf->ptp_caps.n_ext_ts = 0; -- 2.17.1
[net-next v2 10/15] i40e: report correct statistics when XDP is enabled
From: Björn Töpel When XDP is enabled, the driver will report incorrect statistics. Received frames will reported as transmitted frames. This commits fixes the i40e implementation of ndo_get_stats64 (struct net_device_ops), so that iproute2 will report correct statistics (e.g. when running "ip -stats link show dev eth0") even when XDP is enabled. Reported-by: Jesper Dangaard Brouer Fixes: 74608d17fe29 ("i40e: add support for XDP_TX action") Signed-off-by: Björn Töpel Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 24 +++-- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 62f2b5bce6bb..6cb4076e8fba 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -420,9 +420,9 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev, struct rtnl_link_stats64 *stats) { struct i40e_netdev_priv *np = netdev_priv(netdev); - struct i40e_ring *tx_ring, *rx_ring; struct i40e_vsi *vsi = np->vsi; struct rtnl_link_stats64 *vsi_stats = i40e_get_vsi_stats_struct(vsi); + struct i40e_ring *ring; int i; if (test_bit(__I40E_VSI_DOWN, vsi->state)) @@ -436,24 +436,26 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev, u64 bytes, packets; unsigned int start; - tx_ring = READ_ONCE(vsi->tx_rings[i]); - if (!tx_ring) + ring = READ_ONCE(vsi->tx_rings[i]); + if (!ring) continue; - i40e_get_netdev_stats_struct_tx(tx_ring, stats); + i40e_get_netdev_stats_struct_tx(ring, stats); - rx_ring = &tx_ring[1]; + if (i40e_enabled_xdp_vsi(vsi)) { + ring++; + i40e_get_netdev_stats_struct_tx(ring, stats); + } + ring++; do { - start = u64_stats_fetch_begin_irq(&rx_ring->syncp); - packets = rx_ring->stats.packets; - bytes = rx_ring->stats.bytes; - } while (u64_stats_fetch_retry_irq(&rx_ring->syncp, start)); + start = u64_stats_fetch_begin_irq(&ring->syncp); + packets = ring->stats.packets; + bytes = ring->stats.bytes; + } while (u64_stats_fetch_retry_irq(&ring->syncp, start)); stats->rx_packets += packets; stats->rx_bytes += bytes; - if (i40e_enabled_xdp_vsi(vsi)) - i40e_get_netdev_stats_struct_tx(&rx_ring[1], stats); } rcu_read_unlock(); -- 2.17.1
[net-next v2 08/15] virtchnl: use u8 type for a field in the virtchnl_filter struct
From: Harshitha Ramamurthy The virtchnl_filter struct has a field called field_flags. A previous commit mistakenly had the type to be a __u8. What we want is for the field to be an unsigned 8 bit value, so let's just use the existing kernel type u8 for that. Signed-off-by: Harshitha Ramamurthy Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- include/linux/avf/virtchnl.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/avf/virtchnl.h b/include/linux/avf/virtchnl.h index 212b3822d180..b41f7bc958ef 100644 --- a/include/linux/avf/virtchnl.h +++ b/include/linux/avf/virtchnl.h @@ -573,7 +573,7 @@ struct virtchnl_filter { enumvirtchnl_flow_type flow_type; enumvirtchnl_action action; u32 action_meta; - __u8field_flags; + u8 field_flags; }; VIRTCHNL_CHECK_STRUCT_LEN(272, virtchnl_filter); -- 2.17.1
[net-next v2 01/15] i40e: convert queue stats to i40e_stats array
From: Jacob Keller Use an i40e_stats array to handle the queue stats, instead of coding similar functionality separately. Because of how the queue stats are accessed on some kernels, we can't easily use i40e_add_ethtool_stats. Instead, implement a separate helper, i40e_add_queue_stats, which we'll use instead. This helper will correctly implement the u64_stats_fetch_begin_irq logic and allow retries until successful. We share the most complex code by re-using i40e_add_one_ethtool_stat. This logic additionally easily supports skipping disabled rings by using a ternary operator before calling the u64_stats_fetch_begin_irq() function, so that we correctly zero-out the stats values without having to perform two separate sections of code. This significantly reduces the boiler plate code in i40e_get_ethtool_stats, and helps keep the complex logic contained to as few functions as possible. With this patch, we've finally converted all the statistics to use the helpers and the i40e_stats function. Signed-off-by: Jacob Keller Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- .../net/ethernet/intel/i40e/i40e_ethtool.c| 148 +++--- 1 file changed, 89 insertions(+), 59 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c index 5ff6caa83948..235012b3bd42 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c +++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c @@ -33,6 +33,8 @@ struct i40e_stats { I40E_STAT(struct i40e_veb, _name, _stat) #define I40E_PFC_STAT(_name, _stat) \ I40E_STAT(struct i40e_pfc_stats, _name, _stat) +#define I40E_QUEUE_STAT(_name, _stat) \ + I40E_STAT(struct i40e_ring, _name, _stat) static const struct i40e_stats i40e_gstrings_net_stats[] = { I40E_NETDEV_STAT(rx_packets), @@ -171,20 +173,16 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] = { I40E_PFC_STAT("port.rx_priority_%u_xon_2_xoff", priority_xon_2_xoff), }; -/* We use num_tx_queues here as a proxy for the maximum number of queues - * available because we always allocate queues symmetrically. - */ -#define I40E_MAX_NUM_QUEUES(n) ((n)->num_tx_queues) -#define I40E_QUEUE_STATS_LEN(n) \ - (I40E_MAX_NUM_QUEUES(n) \ - * 2 /* Tx and Rx together */ \ - * (sizeof(struct i40e_queue_stats) / sizeof(u64))) -#define I40E_GLOBAL_STATS_LEN ARRAY_SIZE(i40e_gstrings_stats) +static const struct i40e_stats i40e_gstrings_queue_stats[] = { + I40E_QUEUE_STAT("%s-%u.packets", stats.packets), + I40E_QUEUE_STAT("%s-%u.bytes", stats.bytes), +}; + #define I40E_NETDEV_STATS_LEN ARRAY_SIZE(i40e_gstrings_net_stats) + #define I40E_MISC_STATS_LENARRAY_SIZE(i40e_gstrings_misc_stats) -#define I40E_VSI_STATS_LEN(n) (I40E_NETDEV_STATS_LEN + \ -I40E_MISC_STATS_LEN + \ -I40E_QUEUE_STATS_LEN((n))) + +#define I40E_VSI_STATS_LEN (I40E_NETDEV_STATS_LEN + I40E_MISC_STATS_LEN) #define I40E_PFC_STATS_LEN (ARRAY_SIZE(i40e_gstrings_pfc_stats) * \ I40E_MAX_USER_PRIORITY) @@ -193,10 +191,15 @@ static const struct i40e_stats i40e_gstrings_pfc_stats[] = { (ARRAY_SIZE(i40e_gstrings_veb_tc_stats) * \ I40E_MAX_TRAFFIC_CLASS)) -#define I40E_PF_STATS_LEN(n) (I40E_GLOBAL_STATS_LEN + \ +#define I40E_GLOBAL_STATS_LEN ARRAY_SIZE(i40e_gstrings_stats) + +#define I40E_PF_STATS_LEN (I40E_GLOBAL_STATS_LEN + \ I40E_PFC_STATS_LEN + \ I40E_VEB_STATS_LEN + \ -I40E_VSI_STATS_LEN((n))) +I40E_VSI_STATS_LEN) + +/* Length of stats for a single queue */ +#define I40E_QUEUE_STATS_LEN ARRAY_SIZE(i40e_gstrings_queue_stats) enum i40e_ethtool_test_id { I40E_ETH_TEST_REG = 0, @@ -1701,11 +1704,30 @@ static int i40e_get_stats_count(struct net_device *netdev) struct i40e_netdev_priv *np = netdev_priv(netdev); struct i40e_vsi *vsi = np->vsi; struct i40e_pf *pf = vsi->back; + int stats_len; if (vsi == pf->vsi[pf->lan_vsi] && pf->hw.partition_id == 1) - return I40E_PF_STATS_LEN(netdev); + stats_len = I40E_PF_STATS_LEN; else - return I40E_VSI_STATS_LEN(netdev); + stats_len = I40E_VSI_STATS_LEN; + + /* The number of stats reported for a given net_device must remain +* constant throughout the life of that device. +* +* This is because the API for obtaining the size, strings, and stats +* is spread out over three separate ethtool ioctls. There is no safe +* way to lock the number of stats across these calls, so we must +
[net-next v2 07/15] i40evf: set IFF_UNICAST_FLT flag for the VF
From: Lihong Yang Set IFF_UNICAST_FLT flag for the VF to prevent it from entering promiscuous mode when macvlan is added to the VF. Signed-off-by: Lihong Yang Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40evf/i40evf_main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_main.c b/drivers/net/ethernet/intel/i40evf/i40evf_main.c index 5906c1c1d19d..07f187b5bcac 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_main.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_main.c @@ -3358,6 +3358,8 @@ int i40evf_process_config(struct i40evf_adapter *adapter) if (vfres->vf_cap_flags & VIRTCHNL_VF_OFFLOAD_VLAN) netdev->features |= NETIF_F_HW_VLAN_CTAG_FILTER; + netdev->priv_flags |= IFF_UNICAST_FLT; + /* Do not turn on offloads when they are requested to be turned off. * TSO needs minimum 576 bytes to work correctly. */ -- 2.17.1
[net-next v2 11/15] i40e: Check and correct speed values for link on open
From: Jan Sokolowski If our card has been put in an unstable state due to other drivers interacting with it, speed settings might be incorrect. If incorrect, forcefully reset them on open to known default values. Signed-off-by: Jan Sokolowski Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 27 ++--- 1 file changed, 24 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 6cb4076e8fba..c30feb27e1c0 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -6570,6 +6570,24 @@ static i40e_status i40e_force_link_state(struct i40e_pf *pf, bool is_up) struct i40e_hw *hw = &pf->hw; i40e_status err; u64 mask; + u8 speed; + + /* Card might've been put in an unstable state by other drivers +* and applications, which causes incorrect speed values being +* set on startup. In order to clear speed registers, we call +* get_phy_capabilities twice, once to get initial state of +* available speeds, and once to get current PHY config. +*/ + err = i40e_aq_get_phy_capabilities(hw, false, true, &abilities, + NULL); + if (err) { + dev_err(&pf->pdev->dev, + "failed to get phy cap., ret = %s last_status = %s\n", + i40e_stat_str(hw, err), + i40e_aq_str(hw, hw->aq.asq_last_status)); + return err; + } + speed = abilities.link_speed; /* Get the current phy config */ err = i40e_aq_get_phy_capabilities(hw, false, false, &abilities, @@ -6583,9 +6601,9 @@ static i40e_status i40e_force_link_state(struct i40e_pf *pf, bool is_up) } /* If link needs to go up, but was not forced to go down, -* no need for a flap +* and its speed values are OK, no need for a flap */ - if (is_up && abilities.phy_type != 0) + if (is_up && abilities.phy_type != 0 && abilities.link_speed != 0) return I40E_SUCCESS; /* To force link we need to set bits for all supported PHY types, @@ -6597,7 +6615,10 @@ static i40e_status i40e_force_link_state(struct i40e_pf *pf, bool is_up) config.phy_type_ext = is_up ? (u8)((mask >> 32) & 0xff) : 0; /* Copy the old settings, except of phy_type */ config.abilities = abilities.abilities; - config.link_speed = abilities.link_speed; + if (abilities.link_speed != 0) + config.link_speed = abilities.link_speed; + else + config.link_speed = speed; config.eee_capability = abilities.eee_capability; config.eeer = abilities.eeer_val; config.low_power_ctrl = abilities.d3_lpan; -- 2.17.1
[net-next v2 13/15] i40e: hold the rtnl lock on clearing interrupt scheme
From: Patryk Małek Hold the rtnl lock when we're clearing interrupt scheme in i40e_shutdown and in i40e_remove. Signed-off-by: Patryk Małek Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_main.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index c30feb27e1c0..112245f32d7d 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -14182,6 +14182,7 @@ static void i40e_remove(struct pci_dev *pdev) mutex_destroy(&hw->aq.asq_mutex); /* Clear all dynamic memory lists of rings, q_vectors, and VSIs */ + rtnl_lock(); i40e_clear_interrupt_scheme(pf); for (i = 0; i < pf->num_alloc_vsi; i++) { if (pf->vsi[i]) { @@ -14190,6 +14191,7 @@ static void i40e_remove(struct pci_dev *pdev) pf->vsi[i] = NULL; } } + rtnl_unlock(); for (i = 0; i < I40E_MAX_VEB; i++) { kfree(pf->veb[i]); @@ -14401,7 +14403,13 @@ static void i40e_shutdown(struct pci_dev *pdev) wr32(hw, I40E_PFPM_WUFC, (pf->wol_en ? I40E_PFPM_WUFC_MAG_MASK : 0)); + /* Since we're going to destroy queues during the +* i40e_clear_interrupt_scheme() we should hold the RTNL lock for this +* whole section +*/ + rtnl_lock(); i40e_clear_interrupt_scheme(pf); + rtnl_unlock(); if (system_state == SYSTEM_POWER_OFF) { pci_wake_from_d3(pdev, pf->wol_en); -- 2.17.1
[net-next v2 15/15] i40e: Prevent deleting MAC address from VF when set by PF
From: Patryk Małek To prevent VF from deleting MAC address that was assigned by the PF we need to check for that scenario when we try to delete a MAC address from a VF. Signed-off-by: Patryk Małek Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c index f56c645588f3..3e707c7e6782 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c @@ -2569,6 +2569,16 @@ static int i40e_vc_del_mac_addr_msg(struct i40e_vf *vf, u8 *msg, u16 msglen) ret = I40E_ERR_INVALID_MAC_ADDR; goto error_param; } + + if (vf->pf_set_mac && + ether_addr_equal(al->list[i].addr, +vf->default_lan_addr.addr)) { + dev_err(&pf->pdev->dev, + "MAC addr %pM has been set by PF, cannot delete it for VF %d, reset VF to change MAC addr\n", + vf->default_lan_addr.addr, vf->vf_id); + ret = I40E_ERR_PARAM; + goto error_param; + } } vsi = pf->vsi[vf->lan_vsi_idx]; -- 2.17.1
[net-next v2 04/15] i40evf: Change a VF mac without reloading the VF driver
From: Paweł Jabłoński Add possibility to change a VF mac address from host side without reloading the VF driver on the guest side. Without this patch it is not possible to change the VF mac because executing i40evf_virtchnl_completion function with VIRTCHNL_OP_GET_VF_RESOURCES opcode resets the VF mac address to previous value. Signed-off-by: Paweł Jabłoński Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 8 +--- drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c | 11 +-- 2 files changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c index c6d24eaede18..f56c645588f3 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c +++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c @@ -2458,7 +2458,7 @@ static inline int i40e_check_vf_permission(struct i40e_vf *vf, !is_multicast_ether_addr(addr) && vf->pf_set_mac && !ether_addr_equal(addr, vf->default_lan_addr.addr)) { dev_err(&pf->pdev->dev, - "VF attempting to override administratively set MAC address, reload the VF driver to resume normal operation\n"); + "VF attempting to override administratively set MAC address, bring down and up the VF interface to resume normal operation\n"); return -EPERM; } } @@ -3873,9 +3873,11 @@ int i40e_ndo_set_vf_mac(struct net_device *netdev, int vf_id, u8 *mac) mac, vf_id); } - /* Force the VF driver stop so it has to reload with new MAC address */ + /* Force the VF interface down so it has to bring up with new MAC +* address +*/ i40e_vc_disable_vf(vf); - dev_info(&pf->pdev->dev, "Reload the VF driver to make this change effective.\n"); + dev_info(&pf->pdev->dev, "Bring down and up the VF interface to make this change effective.\n"); error_param: return ret; diff --git a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c index 565677de5ba3..79b7be83b5c3 100644 --- a/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c +++ b/drivers/net/ethernet/intel/i40evf/i40evf_virtchnl.c @@ -1330,8 +1330,15 @@ void i40evf_virtchnl_completion(struct i40evf_adapter *adapter, sizeof(struct virtchnl_vsi_resource); memcpy(adapter->vf_res, msg, min(msglen, len)); i40e_vf_parse_hw_config(&adapter->hw, adapter->vf_res); - /* restore current mac address */ - ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr); + if (is_zero_ether_addr(adapter->hw.mac.addr)) { + /* restore current mac address */ + ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr); + } else { + /* refresh current mac address if changed */ + ether_addr_copy(netdev->dev_addr, adapter->hw.mac.addr); + ether_addr_copy(netdev->perm_addr, + adapter->hw.mac.addr); + } i40evf_process_config(adapter); } break; -- 2.17.1
[PATCH net] net/ipv6: Only update MTU metric if it set
From: David Ahern Jan reported a regression after an update to 4.18.5. In this case ipv6 default route is setup by systemd-networkd based on data from an RA. The RA contains an MTU of 1492 which is used when the route is first inserted but then systemd-networkd pushes down updates to the default route without the mtu set. Prior to the change to fib6_info, metrics such as MTU were held in the dst_entry and rt6i_pmtu in rt6_info contained an update to the mtu if any. ip6_mtu would look at rt6i_pmtu first and use it if set. If not, the value from the metrics is used if it is set and finally falling back to the idev value. After the fib6_info change metrics are contained in the fib6_info struct and there is no equivalent to rt6i_pmtu. To maintain consistency with the old behavior the new code should only reset the MTU in the metrics if the route update has it set. Fixes: d4ead6b34b67 ("net/ipv6: move metrics from dst to rt6_info") Reported-by: Jan Janssen Signed-off-by: David Ahern --- net/ipv6/ip6_fib.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index d212738e9d10..f43d278e0040 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -987,7 +987,10 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct fib6_info *rt, fib6_clean_expires(iter); else fib6_set_expires(iter, rt->expires); - fib6_metric_set(iter, RTAX_MTU, rt->fib6_pmtu); + + if (rt->fib6_pmtu) + fib6_metric_set(iter, RTAX_MTU, + rt->fib6_pmtu); return -EEXIST; } /* If we have the same destination and the same metric, -- 2.15.2 (Apple Git-101.1)
Re: [PATCH net] net/ipv6: Only update MTU metric if it set
On Donnerstag, 30. August 2018 23:15:43 CEST dsah...@kernel.org wrote: > From: David Ahern > > Jan reported a regression after an update to 4.18.5. In this case ipv6 > default route is setup by systemd-networkd based on data from an RA. The > RA contains an MTU of 1492 which is used when the route is first inserted > but then systemd-networkd pushes down updates to the default route > without the mtu set. > > Prior to the change to fib6_info, metrics such as MTU were held in the > dst_entry and rt6i_pmtu in rt6_info contained an update to the mtu if > any. ip6_mtu would look at rt6i_pmtu first and use it if set. If not, > the value from the metrics is used if it is set and finally falling > back to the idev value. > > After the fib6_info change metrics are contained in the fib6_info struct > and there is no equivalent to rt6i_pmtu. To maintain consistency with > the old behavior the new code should only reset the MTU in the metrics > if the route update has it set. > > Fixes: d4ead6b34b67 ("net/ipv6: move metrics from dst to rt6_info") > Reported-by: Jan Janssen > Signed-off-by: David Ahern > --- > net/ipv6/ip6_fib.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c > index d212738e9d10..f43d278e0040 100644 > --- a/net/ipv6/ip6_fib.c > +++ b/net/ipv6/ip6_fib.c > @@ -987,7 +987,10 @@ static int fib6_add_rt2node(struct fib6_node *fn, > struct fib6_info *rt, fib6_clean_expires(iter); > else > fib6_set_expires(iter, rt->expires); > - fib6_metric_set(iter, RTAX_MTU, rt->fib6_pmtu); > + > + if (rt->fib6_pmtu) > + fib6_metric_set(iter, RTAX_MTU, > + rt->fib6_pmtu); > return -EEXIST; > } > /* If we have the same destination and the same metric, I just tested this and can confirm that it fixes my problem. You mentioned that systemd-networkd is doing something silly here, so I was wondering if you could make a bug report to fix the underlying issue? I don't mind doing it myself but you're the expert and can explain things better. Thank you for fixing this.
Re: [PATCH 1/2] dt-bindings: net: cpsw: Document cpsw-phy-sel usage but prefer phandle
* Grygorii Strashko [180830 17:08]: > On 08/29/2018 07:47 PM, Tony Lindgren wrote: > > In general, it seems cpsw is just an interconnect instance > > (L4_FAST) with a control module (CPSW_WR) and a pile of > > independent other modules. That's described nicely in > > am437x TRM chapter "2.1.4 L4 Fast Peripheral Memory Map". > > So from that point of view the binding reg entries right > > now are all wrong :) > > TRM not consistent - for am5 it's one MMIO region. Well that same information is there in 57xx TRM in chapter "Table 26-1454. GMAC_SW Instance Summary". But yeah, all the cpsw internal devices are stuffed into a single interconnect target module. > > In the long run cpsw should be really treated as an > > interconnect instance with it's control module providing > > standard Linux framework services such as clock / > > regulator / phy / pinctrl / iio whatever for the other > > modules. > > > > Just my 2c based on looking at the interconnect, I'm > > not too familiar with cpsw otherwise. > > It's not separate modules. this is composite module which have only one > fck/ick and most of blocks can't even function without each other. > Above might be the case for Keystone 2, but not omap CPSW. > Keystone 2 - has packet processor, security accelerator, queue manager in > addition to its basic switch block. Yeah there's just one fck/ick as it's all in a single interconnect module. But you might want to look at the CPSW_WR device registers and see what gate clocks and other Linux generic subsystem services CPSW_WR could provide for the other cpsw internal devices. It might just make your life easier maintaining all these variants ;) Regards, Tony
Re: [PATCH net] net/ipv6: Only update MTU metric if it set
On 8/30/18 3:49 PM, Jan Janssen wrote: > You mentioned that systemd-networkd is doing something silly here, so I was > wondering if you could make a bug report to fix the underlying issue? I don't > mind doing it myself but you're the expert and can explain things better. It is silly in the sense that systemd-networkd installs a default route with an MTU and then pushes down updates (route replace) without the MTU.
[bpf PATCH] bpf: avoid kcm psock use and tcp_bpf from colliding
Currently we check sk_user_data is non NULL to determine if the sk exists in a map. However, this is not sufficient to ensure the psock is not in use by another (non-ULP TCP) user, such as kcm. To avoid this when adding a sock to a map also verify it is of the correct ULP type. Signed-off-by: John Fastabend --- 0 files changed diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c index ce63e58..1c05794 100644 --- a/kernel/bpf/sockmap.c +++ b/kernel/bpf/sockmap.c @@ -1808,6 +1808,11 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key) return 0; } +static bool psock_is_smap_sk(struct sock *sk) +{ + return inet_csk(sk)->icsk_ulp_ops == &bpf_tcp_ulp_ops; +} + /* Locking notes: Concurrent updates, deletes, and lookups are allowed and are * done inside rcu critical sections. This ensures on updates that the psock * will not be released via smap_release_sock() until concurrent updates/deletes @@ -1892,6 +1897,10 @@ static int __sock_map_ctx_update_elem(struct bpf_map *map, * doesn't update user data. */ if (psock) { + if (!psock_is_smap_sk(sock)) { + err = -EBUSY; + goto out_progs; + } if (READ_ONCE(psock->bpf_parse) && parse) { err = -EBUSY; goto out_progs;
[RFC PATCH] bpf: test case for kcm + sockmap
This finds the issue with kcm and sockmap where when a kcm socket is added to a sockmap we fail to notice it is a kcm sock and not a tcp bpf ULP sock. This results in a refcount_inc_not_zero() being called on the psock which is assigned to a smap_psock incorrectly. On my system it just happens to fail there and returns EAGAIN. With the fix, just submitted for bpf tree, we get the correct EBUSY error. https://patchwork.ozlabs.org/project/netdev/list/?series=63393 RFC for now and will submit properly when bpf fix makes its way into bpf-next. But, wanted to get the test on the list for folks to look at. --- John Fastabend (1): bpf: test_maps add a test to catch kcm + sockmap tools/testing/selftests/bpf/Makefile|2 - tools/testing/selftests/bpf/test_maps.c | 64 ++- 2 files changed, 63 insertions(+), 3 deletions(-) -- Signature
[RFC PATCH] bpf: test_maps add a test to catch kcm + sockmap
RFC for now, once the fix is in bpf-next will submit there. Adding a socket to both sockmap and kcm is not supported due to collision on sk_user_data usage. At some point we will fix this but until then lets ensure we don't corrupt things. Now if selftests is run without KCM support we will issue a warning and continue with the tests. Signed-off-by: John Fastabend --- tools/testing/selftests/bpf/Makefile|2 - tools/testing/selftests/bpf/test_maps.c | 64 ++- 2 files changed, 63 insertions(+), 3 deletions(-) diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index fff7fb1..16f5dcc 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \ test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o test_lirc_mode2_kern.o \ get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \ - test_skb_cgroup_id_kern.o + test_skb_cgroup_id_kern.o sockmap_kcm.o # Order correspond to 'make run_tests' order TEST_PROGS := test_kmod.sh \ diff --git a/tools/testing/selftests/bpf/test_maps.c b/tools/testing/selftests/bpf/test_maps.c index 6f54f84..473b690 100644 --- a/tools/testing/selftests/bpf/test_maps.c +++ b/tools/testing/selftests/bpf/test_maps.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -479,14 +480,16 @@ static void test_devmap(int task, void *data) #define SOCKMAP_PARSE_PROG "./sockmap_parse_prog.o" #define SOCKMAP_VERDICT_PROG "./sockmap_verdict_prog.o" #define SOCKMAP_TCP_MSG_PROG "./sockmap_tcp_msg_prog.o" +#define KCM_PROG "./sockmap_kcm.o" static void test_sockmap(int tasks, void *data) { struct bpf_map *bpf_map_rx, *bpf_map_tx, *bpf_map_msg, *bpf_map_break; - int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break; + int map_fd_msg = 0, map_fd_rx = 0, map_fd_tx = 0, map_fd_break, kcm; int ports[] = {50200, 50201, 50202, 50204}; int err, i, fd, udp, sfd[6] = {0xdeadbeef}; u8 buf[20] = {0x0, 0x5, 0x3, 0x2, 0x1, 0x0}; - int parse_prog, verdict_prog, msg_prog; + int parse_prog, verdict_prog, msg_prog, kcm_prog; + struct kcm_attach attach_info; struct sockaddr_in addr; int one = 1, s, sc, rc; struct bpf_object *obj; @@ -740,6 +743,62 @@ static void test_sockmap(int tasks, void *data) goto out_sockmap; } + /* Test adding a KCM socket into map */ +#define AF_KCM 41 + kcm = socket(AF_KCM, SOCK_DGRAM, KCMPROTO_CONNECTED); + if (kcm == -1) { + printf("Warning, KCM+Sockmap could not be tested.\n"); + goto skip_kcm; + } + + err = bpf_prog_load(KCM_PROG, + BPF_PROG_TYPE_SOCKET_FILTER, + &obj, &kcm_prog); + if (err) { + printf("Failed to load SK_SKB parse prog\n"); + goto out_sockmap; + } + + i = 2; + memset(&attach_info, 0, sizeof(attach_info)); + attach_info.fd = sfd[i]; + attach_info.bpf_fd = kcm_prog; + err = ioctl(kcm, SIOCKCMATTACH, &attach_info); + if (!err) { + perror("Failed KCM attached to sockmap fd: "); + goto out_sockmap; + } + + err = bpf_map_delete_elem(fd, &i); + if (err) { + printf("Failed delete sockmap from empty map %i %i\n", err, errno); + goto out_sockmap; + } + + err = ioctl(kcm, SIOCKCMATTACH, &attach_info); + if (err) { + perror("Failed KCM attach"); + goto out_sockmap; + } + + err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY); + if (!err) { + printf("Failed sockmap attached KCM sock!\n"); + goto out_sockmap; + } + err = ioctl(kcm, SIOCKCMUNATTACH, &attach_info); + if (err) { + printf("Failed detattch KCM sock!\n"); + goto out_sockmap; + } + + err = bpf_map_update_elem(fd, &i, &sfd[i], BPF_ANY); + if (err) { + printf("Failed post-kcm update sockmap '%i:%i'\n", + i, sfd[i]); + goto out_sockmap; + } + /* Test map update elem afterwards fd lives in fd and map_fd */ for (i = 0; i < 6; i++) { err = bpf_map_update_elem(map_fd_rx, &i, &sfd[i], BPF_ANY); @@ -772,6 +831,7 @@ static void test_sockmap(int tasks, void *data) } } +skip_kcm: /* Put sfd[2] (sending fd below) into msg map to test sendmsg bpf */ i = 0; err = bpf_map_update_elem(map_fd_msg, &i, &sfd[2], BPF_ANY);
Re: [bpf PATCH] bpf: avoid kcm psock use and tcp_bpf from colliding
On 08/30/2018 03:33 PM, John Fastabend wrote: > Currently we check sk_user_data is non NULL to determine if the sk > exists in a map. However, this is not sufficient to ensure the psock > is not in use by another (non-ULP TCP) user, such as kcm. To avoid > this when adding a sock to a map also verify it is of the correct ULP > type. > > Signed-off-by: John Fastabend > --- I'll send a v2 of this we have one more spot we need to check the psock type in the error path. Thanks, John
[PATCH bpf-next] xsk: remove unnecessary assignment
Since xdp_umem_query() was added one assignment of bpf.command was missed from cleanup. Removing the assignment statement. Fixes: 84c6b86875e01a0 ("xsk: don't allow umem replace at stack level") Signed-off-by: Prashant Bhole --- net/xdp/xdp_umem.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c index bfe2dbea480b..d179732617dc 100644 --- a/net/xdp/xdp_umem.c +++ b/net/xdp/xdp_umem.c @@ -76,8 +76,6 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev, if (!dev->netdev_ops->ndo_bpf || !dev->netdev_ops->ndo_xsk_async_xmit) return force_zc ? -EOPNOTSUPP : 0; /* fail or fallback */ - bpf.command = XDP_QUERY_XSK_UMEM; - rtnl_lock(); err = xdp_umem_query(dev, queue_id); if (err) { -- 2.17.1
[PATCH bpf-next] samples/bpf: xdpsock, minor fixes
- xsks_map size was fixed to 4, changed it MAX_SOCKS - Remove redundant definition of MAX_SOCKS in xdpsock_user.c - In dump_stats(), add NULL check for xsks[i] Signed-off-by: Prashant Bhole --- samples/bpf/xdpsock_kern.c | 2 +- samples/bpf/xdpsock_user.c | 3 +-- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c index d8806c41362e..b8ccd0802b3f 100644 --- a/samples/bpf/xdpsock_kern.c +++ b/samples/bpf/xdpsock_kern.c @@ -16,7 +16,7 @@ struct bpf_map_def SEC("maps") xsks_map = { .type = BPF_MAP_TYPE_XSKMAP, .key_size = sizeof(int), .value_size = sizeof(int), - .max_entries = 4, + .max_entries = MAX_SOCKS, }; struct bpf_map_def SEC("maps") rr_map = { diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c index b3906111bedb..57ecadc58403 100644 --- a/samples/bpf/xdpsock_user.c +++ b/samples/bpf/xdpsock_user.c @@ -118,7 +118,6 @@ struct xdpsock { unsigned long prev_tx_npkts; }; -#define MAX_SOCKS 4 static int num_socks; struct xdpsock *xsks[MAX_SOCKS]; @@ -596,7 +595,7 @@ static void dump_stats(void) prev_time = now; - for (i = 0; i < num_socks; i++) { + for (i = 0; i < num_socks && xsks[i]; i++) { char *fmt = "%-15s %'-11.0f %'-11lu\n"; double rx_pps, tx_pps; -- 2.17.1
Re: [PATCH bpf-next 0/3] xsk: misc code cleanup
On Thu, Aug 30, 2018 at 03:56:40PM +0200, Magnus Karlsson wrote: > This patch set cleans up two code style issues with the xsk zero-copy > code. The resulting code is smaller and simpler. > > Patch 1: Removes a potential compiler warning reported by the Intel > 0-DAY kernel test infrastructure. > Patches 2-3: Removes the xdp_umem_props structure. At some point, it > was used to break a dependency, but the members are these > days much better off in the xdp_umem since the dependency > does not exist anymore. > > I based this patch set on bpf-next commit 234dbe3dc1db ("Merge branch > 'verifier-liveness-simplification'") Applied, Thanks
Re: [PATCH v3 bpf-next 0/2] bpf tcp save syn set/get sockoptions
On Thu, Aug 30, 2018 at 07:51:52AM -0700, Nikita V. Shirokov wrote: > > adding supprot for two new bpf's tcp sockopts: > TCP_SAVE_SYN (set) and TCP_SAVED_SYN (get) > this would allow for tcp-bpf program to build some logic based on fields from > ingress syn packet (e.g. doing tcp's tos/tclass reflection (see sample prog)) > and do it transparently from userspace program point of view Applied, Thanks but please convert the sample code into selftest.
[PATCH net-next] net: remove duplicated include from net_failover.c
Remove duplicated include. Signed-off-by: YueHaibing --- drivers/net/net_failover.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c index 7ae1856..192ae1c 100644 --- a/drivers/net/net_failover.c +++ b/drivers/net/net_failover.c @@ -19,7 +19,6 @@ #include #include #include -#include #include #include #include
[PATCH net-next] failover: remove set but not used variable 'primary_dev'
Fixes gcc '-Wunused-but-set-variable' warning: drivers/net/net_failover.c: In function 'net_failover_slave_unregister': drivers/net/net_failover.c:598:35: warning: variable 'primary_dev' set but not used [-Wunused-but-set-variable] Signed-off-by: YueHaibing --- drivers/net/net_failover.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/net/net_failover.c b/drivers/net/net_failover.c index 7ae1856..e103c94e 100644 --- a/drivers/net/net_failover.c +++ b/drivers/net/net_failover.c @@ -595,12 +595,11 @@ static int net_failover_slave_pre_unregister(struct net_device *slave_dev, static int net_failover_slave_unregister(struct net_device *slave_dev, struct net_device *failover_dev) { - struct net_device *standby_dev, *primary_dev; + struct net_device *standby_dev; struct net_failover_info *nfo_info; bool slave_is_standby; nfo_info = netdev_priv(failover_dev); - primary_dev = rtnl_dereference(nfo_info->primary_dev); standby_dev = rtnl_dereference(nfo_info->standby_dev); vlan_vids_del_by_dev(slave_dev, failover_dev);