Re: [PATCH 00/15] ARM: pxa: switch to DMA slave maps
On Thu, Apr 5, 2018 at 8:29 AM, Ulf Hansson wrote: > On 4 April 2018 at 21:56, Boris Brezillon wrote: >> On Wed, 04 Apr 2018 21:49:26 +0200 >> Robert Jarzmik wrote: >> >>> Ulf Hansson writes: >>> >>> > On 2 April 2018 at 16:26, Robert Jarzmik wrote: >>> >> Hi, >>> >> >>> >> This serie is aimed at removing the dmaengine slave compat use, and >>> >> transfer >>> >> knowledge of the DMA requestors into architecture code. >>> >> As this looks like a patch bomb, each maintainer expressing for his tree >>> >> either >>> >> an Ack or "I want to take through my tree" will be spared in the next >>> >> iterations >>> >> of this serie. >>> > >>> > Perhaps an option is to send this hole series as PR for 3.17 rc1, that >>> > would removed some churns and make this faster/easier? Well, if you >>> > receive the needed acks of course. >>> For 3.17-rc1 it looks a bit optimistic with the review time ... If I have >>> all >> >> Especially since 3.17-rc1 has been released more than 3 years ago :-), >> but I guess you meant 4.17-rc1. > > Yeah, I realize that I was a bit lost in time yesterday. Even more > people have been having fun about it (me too). :-) I occasionally still type 2.6.17 when I mean 4.17. Arnd
Re: [PATCH 00/15] ARM: pxa: switch to DMA slave maps
On 4 April 2018 at 21:56, Boris Brezillon wrote: > On Wed, 04 Apr 2018 21:49:26 +0200 > Robert Jarzmik wrote: > >> Ulf Hansson writes: >> >> > On 2 April 2018 at 16:26, Robert Jarzmik wrote: >> >> Hi, >> >> >> >> This serie is aimed at removing the dmaengine slave compat use, and >> >> transfer >> >> knowledge of the DMA requestors into architecture code. >> >> As this looks like a patch bomb, each maintainer expressing for his tree >> >> either >> >> an Ack or "I want to take through my tree" will be spared in the next >> >> iterations >> >> of this serie. >> > >> > Perhaps an option is to send this hole series as PR for 3.17 rc1, that >> > would removed some churns and make this faster/easier? Well, if you >> > receive the needed acks of course. >> For 3.17-rc1 it looks a bit optimistic with the review time ... If I have all > > Especially since 3.17-rc1 has been released more than 3 years ago :-), > but I guess you meant 4.17-rc1. Yeah, I realize that I was a bit lost in time yesterday. Even more people have been having fun about it (me too). :-) Kind regards Uffe
[PATCH ethtool] ethtool: Add register dump support for MICROCHIP LAN78xx
This patch adds support for Microchip's lan78xx families of USB Ethernet controllers to ethtool's dump registers command. This patch is for use with the lan78xx driver. Signed-off-by: Raghuram Chary J --- Makefile.am | 2 +- ethtool.c | 1 + internal.h | 4 +++ lan78xx.c | 87 + 4 files changed, 93 insertions(+), 1 deletion(-) create mode 100644 lan78xx.c diff --git a/Makefile.am b/Makefile.am index edbda57..14f79b6 100644 --- a/Makefile.am +++ b/Makefile.am @@ -14,7 +14,7 @@ ethtool_SOURCES += \ pcnet32.c realtek.c tg3.c marvell.c vioc.c\ smsc911x.c at76c50x-usb.c sfc.c stmmac.c \ sff-common.c sff-common.h sfpid.c sfpdiag.c \ - ixgbevf.c tse.c vmxnet3.c qsfp.c qsfp.h fjes.c + ixgbevf.c tse.c vmxnet3.c qsfp.c qsfp.h fjes.c lan78xx.c endif TESTS = test-cmdline test-features diff --git a/ethtool.c b/ethtool.c index da7421c..3494402 100644 --- a/ethtool.c +++ b/ethtool.c @@ -1160,6 +1160,7 @@ static const struct { { "altera_tse", altera_tse_dump_regs }, { "vmxnet3", vmxnet3_dump_regs }, { "fjes", fjes_dump_regs }, + { "lan78xx", lan78xx_dump_regs }, #endif }; diff --git a/internal.h b/internal.h index 913f4eb..b239dc7 100644 --- a/internal.h +++ b/internal.h @@ -350,4 +350,8 @@ void sff8636_show_all(const __u8 *id, __u32 eeprom_len); /* FUJITSU Extended Socket network device */ int fjes_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs); + +/* MICROCHIP LAN78XX USB ETHERNET Controller */ +int lan78xx_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs); + #endif /* ETHTOOL_INTERNAL_H__ */ diff --git a/lan78xx.c b/lan78xx.c new file mode 100644 index 000..bb64e80 --- /dev/null +++ b/lan78xx.c @@ -0,0 +1,87 @@ +#include +#include +#include "internal.h" + +int lan78xx_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs) +{ + unsigned int *lan78xx_reg = (unsigned int *)regs->data; + + fprintf(stdout, "LAN78xx Registers:\n"); + fprintf(stdout, "--\n"); + fprintf(stdout, "ID_REV = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "INT_STS = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "HW_CFG = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "PMT_CTRL = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "E2P_CMD = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "E2P_DATA = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "USB_STATUS = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "VLAN_TYPE= 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "\n"); + + fprintf(stdout, "MAC Registers:\n"); + fprintf(stdout, "--\n"); + fprintf(stdout, "MAC_CR = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "MAC_RX = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "MAC_TX = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "FLOW = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "ERR_STS= 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "MII_ACC= 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "MII_DATA = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "EEE_TX_LPI_REQ_DLY = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "EEE_TW_TX_SYS = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "EEE_TX_LPI_REM_DLY = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "WUCSR = 0x%08X\n", *lan78xx_reg++); + fprintf(stdout, "\n"); + + fprintf(stdout, "PHY Registers:\n"); + fprintf(stdout, "--\n"); + fprintf(stdout, "Mode Control = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout, "Mode Status = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout, "Device identifier1 = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout, "Device identifier2 = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout, "Auto-Neg Advertisement = 0x%04X\n", + *lan78xx_reg++); + fprintf(stdout, "Auto-Neg Link Partner Ability = 0x%04X\n", + *lan78xx_reg++); + fprintf(stdout, "Auto-Neg Expansion = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout, "Auto-Neg Next Page TX = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout, "Auto-Neg Link Partner Next Page RX = 0x%04X\n", + *lan78xx_reg++); + fprintf(stdout, "1000BASE-T Control = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout, "1000BASE-T Status = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout, "Reserved = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout, "Reserved = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout, "MMD Access Control = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout, "MMD Access Address/Data = 0x%04X\n", *lan78xx_reg++); + fprintf(stdout,
Re: [PATCH net-next 09/11] devlink: convert occ_get op to separate registration
Thu, Apr 05, 2018 at 01:00:18AM CEST, dsah...@gmail.com wrote: >On 4/4/18 4:59 PM, Jakub Kicinski wrote: >> On Wed, 4 Apr 2018 08:25:11 +0200, Jiri Pirko wrote: Jiri, I am not aware of any other API where a driver registers with it yet doesn't want the handler to be called so either waits to register >>> >>> Again, the thing is, this is kind of unusual because of the reload >>> thing. >> >> FWIW my knee jerk thought is that it's strange that devlink ops can >> be executed at all while reload is happening (incl. another reload >> request). I'm probably missing the real issue.. >> > >Just responding with the same question ... > >Why are you not unregistering resources on a reload? Because you need the use the values of course!
Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications
On Thu, Apr 05, 2018 at 12:58:46AM +0100, Edward Cree wrote: > On 04/04/18 00:37, Alexei Starovoitov wrote: > > hmm. that doesn't fail for me and any other bots didn't complain. > > Are you sure you're running the latest kernel and tests? > Ah, test_progs isn't actually rebuilding because __NR_bpf is undeclared; > something must be going wrong with header files. > Never mind. > > > hmm. what's wrong with bsearch? It's trivial and fast. > bsearch is O(log n), and the sort() call on the subprog_starts (which happens > every time add_subprog() is called) is O(n log n). > Whereas reading aux->subprogno is O(1). > As far as I'm concerned, that's a sign that the latter data structure is the > appropriate one. only if it can be done as separate _first_ pass before cfg. > > Even if we don't see the solution today we have to work towards it. > I guess I'm just not confident "towards" is the direction you think it is. > > > Compiler designers could have combined multiple of such passes into > > fewer ones, but it's not done, because it increases complexity and > > causes tough bugs where pass is partially complete. > I'm not trying to combine together multiple 'for bb in prog/for insn in bb'- > type passes. The combining I was doing was more on 'for all possible > execution paths'-type passes, because it's those that explode > combinatorially. > Happily I think we can go a long way towards getting rid of them; but while I > think we can get down to only having 1, I don't think we can reach 0. we have to. That's the point. 'explore all possible paths' must be removed. New code should not be relying on that in a way that it's difficult to remove later. subprog boundaries is a typical example where doing it as part of 'explore all paths' is harmful long term. Regardless of extra O(num_of_subrogs) complexity it brings short term. > > The prime example where more than 4k instructions and loops are mandatory > > is user space stack analysis inside the program. Like walking python stack > > requires non-trival pointer chasing. With 'pragma unroll' the stack depth > > limit today is ~18. That's not really usable. Looping through 100 python > > frames would require about 16k bpf assembler instructions. > But this would be solved by having support for bounded loops, and I think I've > successfully shown that this is not inherently incompatible with a do_check() > style walk. No. It's worse. Your proposed approach to bounded loops completely relying on 'explore all paths' whereas John's does not. Can 'explore all paths' work with 1M bpf insns? No. Whereas an approach that builds dom graph, detects natural loops and loop invariants - can. > > Hence do_check approach must go. The rough idea is to compute per basic > > block a set of INs (registers and stack) that basic block needs > > to see to be safe and corresponding set of OUTs. > > Then propagate this knowledge across cfg edges. > > Once we have such set per bpf function, it will essentially answer the > > question > > 'what arguments this function needs to see to be safe and what it returns' > > To make bpf libraries scale we'd need to keep such information > > around after the verification, so dynamic linking and indirect calls > > are fast to verify. > > It's very high level obviously. There are many gotchas to resolve. > I agree that if we can do this it'll be ideal. But that's a big 'if'; my > example code was intended to demonstrate that the "set of INs bb/func needs > to > see to be safe" can be an arbitrarily complicated disjunction, and that > instead > of a combinatorially exploding number of paths to walk (the do_check() > approach) > you now have combinatorially exploding IN-constraints to propagate backwards. This sounds like arbitrary assumptions what this new approach would do. Instead of rejecting it outright try to solve this very hard problem. Before this discussion gets carried away too far let's get back to this patch set. Having seen all arguments I still think that only patch 3 is worth pursuing further.
Re: [PATCH net-next RFC 0/5] ipv6: sr: introduce seg6local End.BPF action
On Wed, Apr 4, 2018 at 2:34 AM, Mathieu Xhonneux wrote: > 2018-04-03 16:25 GMT+02:00 David Lebrun : >> Actually I'm wrong here. dst_input() will call either ip6_input() or >> ip6_forward(), not ipv6_rcv(). Both functions expect IP6CB() to be set, >> so using skb->cb here will interfere with them. >> >> What about saving and restoring the IPv6 CB, similarly to what TCP does with >> tcp_v6_restore_cb() ? > > > Yes. I can change the call to bpf_prog_run_save_cb to bpf_prog_run_clear_cb, > and then manually save/restore the IPv6 CB in input_action_end_bpf. > > Or is there maybe a better solution to share some state between the bpf caller > and helpers, that does not need access to skb->cb ? I think per-cpu scratch buffer approach can work for this situation. Similar to one used by do_redirect and sockmap.
Re: [PATCH v3 2/4] bus: fsl-mc: add restool userspace support
On Wed, Apr 4, 2018 at 7:42 AM, Andrew Lunn wrote: >> I hear you. It is more complicated this way...having all these individual >> objects vs just a single "bundle" of them that represents a NIC. But, that's >> the way the DPAA2 hardware is, and we're implementing kernel support for >> the hardware as it is. > > Hi Stuart > > I see we are not making any progress here. > > So what i suggest is you post the kernel code and configuration tool > concept to netdev for a full review. You want reviews from David > Miller, Jiri Pirko, Jakub Kicinski, David Ahern, etc. I know Ioana has other feedback she is addressing so a respin will be coming soon, and she can include those additional reviewers. Thanks, Stuart
Re: [PATCH net-next] vxlan: add ttl inherit support
On Wed, Apr 04, 2018 at 12:03:39PM -0400, David Miller wrote: > The net-next tree is closed, please resubmit this when the net-next tree opens > back up. Sorry, I didn't pay much attention on the net-next open/close cycle before. After re-read netdev-FAQ.txt. Now I understand the steps and I need send net-next patch after rc1 released. Sorry to bother you again... Regards Hangbin
marvell switch
Hello, I am trying to use marvell switch in linux, Is it that the kernel drivers from marvell switch are used just to enable all ports, or do they also provide APIs to userspace to enable specific ports only. I have not find examples or wiki for marvell switch, so I am not too sure as what are the drivers meant for. Thank you, ranran
Re: [PATCH net] net: dsa: Discard frames from unused ports
On Wed, Apr 04, 2018 at 05:49:10PM -0700, Florian Fainelli wrote: > On 04/04/2018 04:56 PM, Andrew Lunn wrote: > > The Marvell switches under some conditions will pass a frame to the > > host with the port being the CPU port. Such frames are invalid, and > > should be dropped. Not dropping them can result in a crash when > > incrementing the receive statistics for an invalid port. > > > > Reported-by: Chris Healy > > Fixes: 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics") > > Are you sure this is the commit that introduced the problem? Hi Florian Well, the problem is it crashes when trying to update the statistics. The CPU port is not allocated a p->stats64, only slave ports get those. So before this patch, there was no crash and the frame would be delivered to the master interface. This in itself is probably not correct, but also not fatal. Talking to Chris, it seems this behaviour has existing for a long while. I needed to use lldpd to trigger the issue, because i assume the Marvell switch sees these as special frames and forwards them to the CPU. The other thing is, the code got refactored recently. So this fix will not rebase to too many earlier versions. It needs a fix per tagging protocol for before the common dsa_master_find_slave() was added. > > - return ds->ports[port].slave; > > + slave_port = &ds->ports[port]; > > + > > + if (slave_port->type != DSA_PORT_TYPE_USER) > > Can we optimize this with an unlikely()? Yes, that would make sense. Andrew
Re: [rtlwifi-btcoex] Suspicious code in halbtc8821a1ant driver
On Thu, 2018-04-05 at 01:25 +, Gustavo A. R. Silva wrote: > Hi all, > > While doing some static analysis I came across the following piece of code at > drivers/net/wireless/realtek/rtlwifi/btcoexist/halbtc8821a1ant.c:1581: > > 1581 static void btc8821a1ant_act_bt_sco_hid_only_busy(struct btc_coexist > *btcoexist, > 1582 u8 wifi_status) > 1583 { > 1584 /* tdma and coex table */ > 1585 btc8821a1ant_ps_tdma(btcoexist, NORMAL_EXEC, true, 5); > 1586 > 1587 if (BT_8821A_1ANT_WIFI_STATUS_NON_CONNECTED_ASSO_AUTH_SCAN == > 1588 wifi_status) > 1589 btc8821a1ant_coex_table_with_type(btcoexist, > NORMAL_EXEC, 1); > 1590 else > 1591 btc8821a1ant_coex_table_with_type(btcoexist, > NORMAL_EXEC, 1); > 1592 } > > The issue here is that the code for both branches of the if-else statement is > identical. > > The if-else was introduced a year ago in this commit c6821613e653 > > I wonder if an argument should be changed in any of the calls to > btc8821a1ant_coex_table_with_type? > > It looks weird. Since we're in spring vacation, I'll check my colleague next Monday. PK
[PULL] fwcfg, vhost: features and fixes
The following changes since commit 0c8efd610b58cb23cefdfa12015799079aef94ae: Linux 4.16-rc5 (2018-03-11 17:25:09 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus for you to fetch changes up to dc32bb678e103afbcfa4d814489af0566307f528: vhost: add vsock compat ioctl (2018-03-20 03:17:42 +0200) fw_cfg, vhost: features fixes This cleans up the qemu fw cfg device driver. On top of this, vmcore is dumped there on crash to help debugging witH kASLR enabled. Also included are some fixes in vhost. Signed-off-by: Michael S. Tsirkin Marc-André Lureau (10): fw_cfg: fix sparse warnings in fw_cfg_sel_endianness() fw_cfg: fix sparse warnings with fw_cfg_file fw_cfg: fix sparse warning reading FW_CFG_ID fw_cfg: fix sparse warnings around FW_CFG_FILE_DIR read fw_cfg: remove inline from fw_cfg_read_blob() fw_cfg: handle fw_cfg_read_blob() error fw_cfg: add a public uapi header fw_cfg: add DMA register crash: export paddr_vmcoreinfo_note() fw_cfg: write vmcoreinfo details Michael S. Tsirkin (1): ptr_ring: fix build Sonny Rao (2): vhost: fix vhost ioctl signature to build with clang vhost: add vsock compat ioctl MAINTAINERS | 1 + drivers/firmware/qemu_fw_cfg.c | 291 +++ drivers/vhost/vhost.c| 2 +- drivers/vhost/vhost.h| 4 +- drivers/vhost/vsock.c| 11 ++ include/uapi/linux/qemu_fw_cfg.h | 97 + kernel/crash_core.c | 1 + tools/virtio/ringtest/ptr_ring.c | 5 + 8 files changed, 348 insertions(+), 64 deletions(-) create mode 100644 include/uapi/linux/qemu_fw_cfg.h
[rtlwifi-btcoex] Suspicious code in halbtc8821a1ant driver
Hi all, While doing some static analysis I came across the following piece of code at drivers/net/wireless/realtek/rtlwifi/btcoexist/halbtc8821a1ant.c:1581: 1581 static void btc8821a1ant_act_bt_sco_hid_only_busy(struct btc_coexist *btcoexist, 1582 u8 wifi_status) 1583 { 1584 /* tdma and coex table */ 1585 btc8821a1ant_ps_tdma(btcoexist, NORMAL_EXEC, true, 5); 1586 1587 if (BT_8821A_1ANT_WIFI_STATUS_NON_CONNECTED_ASSO_AUTH_SCAN == 1588 wifi_status) 1589 btc8821a1ant_coex_table_with_type(btcoexist, NORMAL_EXEC, 1); 1590 else 1591 btc8821a1ant_coex_table_with_type(btcoexist, NORMAL_EXEC, 1); 1592 } The issue here is that the code for both branches of the if-else statement is identical. The if-else was introduced a year ago in this commit c6821613e653 I wonder if an argument should be changed in any of the calls to btc8821a1ant_coex_table_with_type? What do you think? Thanks -- Gustavo
RE: [Intel-wired-lan] [next-queue PATCH v6 08/10] igb: Add MAC address support for ethtool nftuple filters
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On > Behalf Of Vinicius Costa Gomes > Sent: Thursday, March 29, 2018 2:08 PM > To: intel-wired-...@lists.osuosl.org > Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus palen...@intel.com> > Subject: [Intel-wired-lan] [next-queue PATCH v6 08/10] igb: Add MAC > address support for ethtool nftuple filters > > This adds the capability of configuring the queue steering of arriving > packets based on their source and destination MAC addresses. > > In practical terms this adds support for the following use cases, > characterized by these examples: > > $ ethtool -N eth0 flow-type ether dst aa:aa:aa:aa:aa:aa action 0 > (this will direct packets with destination address "aa:aa:aa:aa:aa:aa" > to the RX queue 0) This is now working for me, testing with the dst MAC being the MAC on the i210. I set the filter and all the traffic to the destination MAC address gets routed to the chosen RX queue. > $ ethtool -N eth0 flow-type ether src 44:44:44:44:44:44 action 3 > (this will direct packets with source address "44:44:44:44:44:44" to > the RX queue 3) However, I am still not getting the raw ethernet source filter to work. Even back to back with no other system to "confuse" the stream, I set the filter so the source MAC is the same as the MAC on the link partner, send traffic and the traffic bounces around the queues as if the filter is not set. > > Signed-off-by: Vinicius Costa Gomes > --- > drivers/net/ethernet/intel/igb/igb_ethtool.c | 35 > > 1 file changed, 31 insertions(+), 4 deletions(-)
Re: linux-next: manual merge of the net-next tree with the pci tree
Hi all, On Tue, 3 Apr 2018 13:14:54 +1000 Stephen Rothwell wrote: > > Today's linux-next merge of the net-next tree got a conflict in: > > drivers/net/ethernet/mellanox/mlx5/core/en_main.c > > between commit: > > 2907938d2375 ("net/mlx5e: Use pcie_bandwidth_available() to compute > bandwidth") > > from the pci tree and commit: > > 0608d4dbaf4e ("net/mlx5e: Unify slow PCI heuristic") > > from the net-next tree. > > I fixed it up (see below) and can carry the fix as necessary. This > is now fixed as far as linux-next is concerned, but any non trivial > conflicts should be mentioned to your upstream maintainer when your tree > is submitted for merging. You may also want to consider cooperating > with the maintainer of the conflicting tree to minimise any particularly > complex conflicts. > > -- > Cheers, > Stephen Rothwell > > diff --cc drivers/net/ethernet/mellanox/mlx5/core/en_main.c > index 884337f88589,0aab3afc6885.. > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c > @@@ -3880,16 -4026,50 +4033,20 @@@ void mlx5e_build_default_indir_rqt(u32 > indirection_rqt[i] = i % num_channels; > } > > - static bool cqe_compress_heuristic(u32 link_speed, u32 pci_bw) > -static int mlx5e_get_pci_bw(struct mlx5_core_dev *mdev, u32 *pci_bw) > -{ > -enum pcie_link_width width; > -enum pci_bus_speed speed; > -int err = 0; > - > -err = pcie_get_minimum_link(mdev->pdev, &speed, &width); > -if (err) > -return err; > - > -if (speed == PCI_SPEED_UNKNOWN || width == PCIE_LNK_WIDTH_UNKNOWN) > -return -EINVAL; > - > -switch (speed) { > -case PCIE_SPEED_2_5GT: > -*pci_bw = 2500 * width; > -break; > -case PCIE_SPEED_5_0GT: > -*pci_bw = 5000 * width; > -break; > -case PCIE_SPEED_8_0GT: > -*pci_bw = 8000 * width; > -break; > -default: > -return -EINVAL; > -} > - > -return 0; > -} > - > + static bool slow_pci_heuristic(struct mlx5_core_dev *mdev) > { > - return (link_speed && pci_bw && > - (pci_bw < 4) && (pci_bw < link_speed)); > - } > + u32 link_speed = 0; > + u32 pci_bw = 0; > > - static bool hw_lro_heuristic(u32 link_speed, u32 pci_bw) > - { > - return !(link_speed && pci_bw && > - (pci_bw <= 16000) && (pci_bw < link_speed)); > + mlx5e_get_max_linkspeed(mdev, &link_speed); > -mlx5e_get_pci_bw(mdev, &pci_bw); > ++pci_bw = pcie_bandwidth_available(mdev->pdev, NULL, NULL, NULL); > + mlx5_core_dbg_once(mdev, "Max link speed = %d, PCI BW = %d\n", > +link_speed, pci_bw); > + > + #define MLX5E_SLOW_PCI_RATIO (2) > + > + return link_speed && pci_bw && > + link_speed > MLX5E_SLOW_PCI_RATIO * pci_bw; > } > > void mlx5e_set_tx_cq_mode_params(struct mlx5e_params *params, u8 > cq_period_mode) This is now a conflict between the pci tree and Linus' tree. -- Cheers, Stephen Rothwell pgpWGHhkC1zBc.pgp Description: OpenPGP digital signature
Re: [PATCH net] route: check sysctl_fib_multipath_use_neigh earlier than hash
Hi Xin Long. [This is an automated email] This commit has been processed because it contains a "Fixes:" tag. fixing commit: a6db4494d218 net: ipv4: Consider failed nexthops in multipath routes. The bot has also determined it's probably a bug fixing patch. (score: 58.0140) The bot has tested the following trees: v4.15.15, v4.14.32, v4.9.92, v4.15.15: Build OK! v4.14.32: Build OK! v4.9.92: Build OK! -- Thanks. Sasha
Re: [PATCH v2 net-next 01/10] mlxsw: spectrum_acl: Fix flex actions header ifndef define construct
Hi Ido Schimmel Jiri Pirko. [This is an automated email] This commit has been processed by the -stable helper bot and determined to be a high probability candidate for -stable trees. (score: 2.6781) The bot has tested the following trees: v4.15.15, v4.14.32, v4.9.92, v4.4.126, v4.15.15: Build OK! v4.14.32: Failed to apply! Possible dependencies: d3b939b8f9a5: ("mlxsw: spectrum: Move ACL flexible actions instance to spectrum") e2b2d35a052d: ("mlxsw: spectrum: Change init order") v4.9.92: Failed to apply! Possible dependencies: d3b939b8f9a5: ("mlxsw: spectrum: Move ACL flexible actions instance to spectrum") 38ebc0f45474: ("mlxsw: spectrum_router: Add mlxsw_sp_ipip_ops") ff7b0d27208b: ("mlxsw: spectrum: Add support for counter allocator") 7aa0f5aa9030: ("mlxsw: spectrum: Implement TC flower offload") 22a677661f56: ("mlxsw: spectrum: Introduce ACL core with simple TCAM implementation") 1d20d23c59c9: ("mlxsw: Move PCI id table definitions into driver modules") 62e86f9e8281: ("mlxsw: pci: Rename header with HW definitions") 98d0f7b9acda: ("mlxsw: spectrum: Add packet sample offloading support") 65acb5d0827c: ("mlxsw: spectrum: Make the add_matchall_tc_entry symmetric") 5724b8b56947: ("net/sched: tc_mirred: Rename public predicates 'is_tcf_mirred_redirect' and 'is_tcf_mirred_mirror'") v4.4.126: Failed to apply! Possible dependencies: d3b939b8f9a5: ("mlxsw: spectrum: Move ACL flexible actions instance to spectrum") 38ebc0f45474: ("mlxsw: spectrum_router: Add mlxsw_sp_ipip_ops") ff7b0d27208b: ("mlxsw: spectrum: Add support for counter allocator") 7aa0f5aa9030: ("mlxsw: spectrum: Implement TC flower offload") b090ef068645: ("mlxsw: Introduce simplistic KVD linear area manager") 464dce188487: ("mlxsw: spectrum_router: Add basic ipv4 router initialization") f00817df2b42: ("mlxsw: spectrum: Introduce support for Data Center Bridging (DCB)") 90183b980d0a: ("mlxsw: spectrum: Initialize egress scheduling") 7f71eb46a485: ("mlxsw: spectrum: Split vFID range in two") bd40e9d6d538: ("mlxsw: spectrum: Allocate active VLANs only for port netdevs") 0d65fc13042f: ("mlxsw: spectrum: Implement LAG port join/leave") c4745500e988: ("mlxsw: Implement devlink interface") 89309da39f55: ("mlxsw: core: Implement temperature hwmon interface") 7f71eb46a485: ("mlxsw: spectrum: Split vFID range in two") bd40e9d6d538: ("mlxsw: spectrum: Allocate active VLANs only for port netdevs") 0d65fc13042f: ("mlxsw: spectrum: Implement LAG port join/leave") 18f1e70c4137: ("mlxsw: spectrum: Introduce port splitting") 3e9b27b8fc8b: ("mlxsw: spectrum: Unmap local port from module during teardown") bd40e9d6d538: ("mlxsw: spectrum: Allocate active VLANs only for port netdevs") c4745500e988: ("mlxsw: Implement devlink interface") Please let us know if you'd like to have this patch included in a stable tree. -- Thanks. Sasha
Re: [PATCH v2 net-next 06/10] mlxsw: core: Fix arg name of MLXSW_CORE_RES_VALID and MLXSW_CORE_RES_GET
Hi Ido Schimmel Jiri Pirko. [This is an automated email] This commit has been processed by the -stable helper bot and determined to be a high probability candidate for -stable trees. (score: 1.5151) The bot has tested the following trees: v4.15.15, v4.14.32, v4.9.92, v4.4.126, v4.15.15: Build OK! v4.14.32: Build OK! v4.9.92: Failed to apply! Possible dependencies: c1a3831121f6: ("mlxsw: Convert resources into array") v4.4.126: Failed to apply! Possible dependencies: 403547d38d0b: ("mlxsw: profile: Add KVD resources to profile config") 489107bda1d1: ("mlxsw: Add KVD sizes configuration into profile") 57d316ba2017: ("mlxsw: pci: Add resources query implementation.") 8060646a0fd1: ("mlxsw: core: Add support for packets received from LAG port") 89309da39f55: ("mlxsw: core: Implement temperature hwmon interface") 89309da39f55: ("mlxsw: core: Implement temperature hwmon interface") 932762b69a28: ("mlxsw: Move devlink port registration into common core code") 8060646a0fd1: ("mlxsw: core: Add support for packets received from LAG port") 89309da39f55: ("mlxsw: core: Implement temperature hwmon interface") 90183b980d0a: ("mlxsw: spectrum: Initialize egress scheduling") 7f71eb46a485: ("mlxsw: spectrum: Split vFID range in two") bd40e9d6d538: ("mlxsw: spectrum: Allocate active VLANs only for port netdevs") 0d65fc13042f: ("mlxsw: spectrum: Implement LAG port join/leave") c4745500e988: ("mlxsw: Implement devlink interface") 89309da39f55: ("mlxsw: core: Implement temperature hwmon interface") 7f71eb46a485: ("mlxsw: spectrum: Split vFID range in two") Please let us know if you'd like to have this patch included in a stable tree. -- Thanks. Sasha
Re: [PATCH net] netns: filter uevents correctly
On Wed, Apr 04, 2018 at 05:38:02PM -0500, Eric W. Biederman wrote: > Christian Brauner writes: > > > On Wed, Apr 04, 2018 at 09:48:57PM +0200, Christian Brauner wrote: > >> commit 07e98962fa77 ("kobject: Send hotplug events in all network > >> namespaces") > >> > >> enabled sending hotplug events into all network namespaces back in 2010. > >> Over time the set of uevents that get sent into all network namespaces has > >> shrunk. We have now reached the point where hotplug events for all devices > >> that carry a namespace tag are filtered according to that namespace. > >> > >> Specifically, they are filtered whenever the namespace tag of the kobject > >> does not match the namespace tag of the netlink socket. One example are > >> network devices. Uevents for network devices only show up in the network > >> namespaces these devices are moved to or created in. > >> > >> However, any uevent for a kobject that does not have a namespace tag > >> associated with it will not be filtered and we will *try* to broadcast it > >> into all network namespaces. > >> > >> The original patchset was written in 2010 before user namespaces were a > >> thing. With the introduction of user namespaces sending out uevents became > >> partially isolated as they were filtered by user namespaces: > >> > >> net/netlink/af_netlink.c:do_one_broadcast() > >> > >> if (!net_eq(sock_net(sk), p->net)) { > >> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID)) > >> return; > >> > >> if (!peernet_has_id(sock_net(sk), p->net)) > >> return; > >> > >> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns, > >> CAP_NET_BROADCAST)) > >> j return; > >> } > >> > >> The file_ns_capable() check will check whether the caller had > >> CAP_NET_BROADCAST at the time of opening the netlink socket in the user > >> namespace of interest. This check is fine in general but seems insufficient > >> to me when paired with uevents. The reason is that devices always belong to > >> the initial user namespace so uevents for kobjects that do not carry a > >> namespace tag should never be sent into another user namespace. This has > >> been the intention all along. But there's one case where this breaks, > >> namely if a new user namespace is created by root on the host and an > >> identity mapping is established between root on the host and root in the > >> new user namespace. Here's a reproducer: > >> > >> sudo unshare -U --map-root > >> udevadm monitor -k > >> # Now change to initial user namespace and e.g. do > >> modprobe kvm > >> # or > >> rmmod kvm > >> > >> will allow the non-initial user namespace to retrieve all uevents from the > >> host. This seems very anecdotal given that in the general case user > >> namespaces do not see any uevents and also can't really do anything useful > >> with them. > >> > >> Additionally, it is now possible to send uevents from userspace. As such we > >> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user > >> namespace of the network namespace of the netlink socket) userspace process > >> make a decision what uevents should be sent. > >> > >> This makes me think that we should simply ensure that uevents for kobjects > >> that do not carry a namespace tag are *always* filtered by user namespace > >> in kobj_bcast_filter(). Specifically: > >> - If the owning user namespace of the uevent socket is not init_user_ns the > >> event will always be filtered. > >> - If the network namespace the uevent socket belongs to was created in the > >> initial user namespace but was opened from a non-initial user namespace > >> the event will be filtered as well. > >> Put another way, uevents for kobjects not carrying a namespace tag are now > >> always only sent to the initial user namespace. The regression potential > >> for this is near to non-existent since user namespaces can't really do > >> anything with interesting devices. > >> > >> Signed-off-by: Christian Brauner > > > > That was supposed to be [PATCH net] not [PATCH net-next] which is > > obviously closed. Sorry about that. > > This does not appear to be a fix. > This looks like feature work. > The motivation appears to be that looks wrong let's change it. Hm, it looked like an oversight an therefore seems like a bug which is why I thought would be a good candidate for net. Recent patches to the semantics here just make it more obvious and provide a better argument to fix it in the current release rather then defer it to the next one. But I'm happy to leave this for net-next. I don't want to rush things if this change in semantics is not trivial enough. For the record, I'm merely fixing/expanding on the current status quo. David, is it ok to queue this or would you prefer I resend when net-next reopens? > > So let's please leave this for when net-next opens again so we can > have time to fully consider a change in semantics. Sure, if we a
Re: [PATCH net 2/2] net: bgmac: Fix endian access in bgmac_dma_tx_ring_free()
Hi Florian Fainelli. [This is an automated email] This commit has been processed because it contains a "Fixes:" tag. fixing commit: 9cde94506eac bgmac: implement scatter/gather support. The bot has also determined it's probably a bug fixing patch. (score: 45.9160) The bot has tested the following trees: v4.15.15, v4.14.32, v4.9.92, v4.4.126, v4.15.15: Build OK! v4.14.32: Build OK! v4.9.92: Build OK! v4.4.126: Build OK! -- Thanks. Sasha
Re: [PATCH net v5 3/3] ipv6: udp6: set dst cache for a connected sk if current not valid
Hi Alexey Kodanev. [This is an automated email] This commit has been processed because it contains a "Fixes:" tag. fixing commit: 33c162a980fe ipv6: datagram: Update dst cache of a connected datagram sk during pmtu update. The bot has tested the following trees: v4.15.15, v4.14.32, v4.9.92, v4.15.15: Failed to apply! Possible dependencies: 22b12d2bf404: ("ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()") v4.14.32: Failed to apply! Possible dependencies: 22b12d2bf404: ("ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()") v4.9.92: Failed to apply! Possible dependencies: 22b12d2bf404: ("ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()") -- Thanks. Sasha
Re: [PATCH net 1/2] net: bgmac: Correctly annotate register space
Hi Florian Fainelli. [This is an automated email] This commit has been processed because it contains a "Fixes:" tag. fixing commit: f6a95a24957a net: ethernet: bgmac: Add platform device support. The bot has also determined it's probably a bug fixing patch. (score: 67.8237) The bot has tested the following trees: v4.15.15, v4.14.32, v4.9.92, v4.15.15: Build OK! v4.14.32: Build OK! v4.9.92: Failed to apply! Possible dependencies: dd5c5d037f5e: ("net: ethernet: bgmac: add NS2 support") 1676aba5ef7e: ("net: ethernet: bgmac: device tree phy enablement") -- Thanks. Sasha
Re: [PATCH net] net: dsa: Discard frames from unused ports
On 04/04/2018 04:56 PM, Andrew Lunn wrote: > The Marvell switches under some conditions will pass a frame to the > host with the port being the CPU port. Such frames are invalid, and > should be dropped. Not dropping them can result in a crash when > incrementing the receive statistics for an invalid port. > > Reported-by: Chris Healy > Fixes: 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics") Are you sure this is the commit that introduced the problem? > Signed-off-by: Andrew Lunn > --- > net/dsa/dsa_priv.h | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h > index 70de7895e5b8..6c1bf3d9f652 100644 > --- a/net/dsa/dsa_priv.h > +++ b/net/dsa/dsa_priv.h > @@ -126,6 +126,7 @@ static inline struct net_device > *dsa_master_find_slave(struct net_device *dev, > struct dsa_port *cpu_dp = dev->dsa_ptr; > struct dsa_switch_tree *dst = cpu_dp->dst; > struct dsa_switch *ds; > + struct dsa_port *slave_port; > > if (device < 0 || device >= DSA_MAX_SWITCHES) > return NULL; > @@ -137,7 +138,12 @@ static inline struct net_device > *dsa_master_find_slave(struct net_device *dev, > if (port < 0 || port >= ds->num_ports) > return NULL; > > - return ds->ports[port].slave; > + slave_port = &ds->ports[port]; > + > + if (slave_port->type != DSA_PORT_TYPE_USER) Can we optimize this with an unlikely()? > + return NULL; > + > + return slave_port->slave; > } > > /* port.c */ > -- Florian
[PATCH] brcm80211: brcmsmac: phy_lcn: remove duplicate code
Remove and refactor some code in order to avoid having identical code for different branches. Notice that this piece of code hasn't been modified since 2011. Addresses-Coverity-ID: 1226756 ("Identical code for different branches") Signed-off-by: Gustavo A. R. Silva --- drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_lcn.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_lcn.c b/drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_lcn.c index 93d4cde..9d830d2 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_lcn.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_lcn.c @@ -3388,13 +3388,8 @@ void wlc_lcnphy_deaf_mode(struct brcms_phy *pi, bool mode) u8 phybw40; phybw40 = CHSPEC_IS40(pi->radio_chanspec); - if (LCNREV_LT(pi->pubpi.phy_rev, 2)) { - mod_phy_reg(pi, 0x4b0, (0x1 << 5), (mode) << 5); - mod_phy_reg(pi, 0x4b1, (0x1 << 9), 0 << 9); - } else { - mod_phy_reg(pi, 0x4b0, (0x1 << 5), (mode) << 5); - mod_phy_reg(pi, 0x4b1, (0x1 << 9), 0 << 9); - } + mod_phy_reg(pi, 0x4b0, (0x1 << 5), (mode) << 5); + mod_phy_reg(pi, 0x4b1, (0x1 << 9), 0 << 9); if (phybw40 == 0) { mod_phy_reg((pi), 0x410, -- 2.7.4
Re: [RFC] net: bump the default number of RSS queues
On Tue, 3 Apr 2018 17:20:49 -0700, Eric Dumazet wrote: > On 04/03/2018 05:14 PM, Jakub Kicinski wrote: > > Some popular NIC vendors are not adhering to > > netif_get_num_default_rss_queues() which leads to users being > > surprised and filing bugs :) Bump the number of default RX > > queues to something more reasonable for modern machines. > > > > Signed-off-by: Jakub Kicinski > > --- > > I'm mostly wondering what's the policy on this default? When > > should it be applied? Why was 8 chosen as the default? We > > can abandon using netif_get_num_default_rss_queues() for the > > nfp but I wonder what's the correct course of action here... > > Should new drivers use netif_get_num_default_rss_queues() for > > example? > > > > include/linux/netdevice.h | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > > index 2a2d9cf50aa2..26fe145ada2a 100644 > > --- a/include/linux/netdevice.h > > +++ b/include/linux/netdevice.h > > @@ -3260,7 +3260,7 @@ static inline unsigned int get_netdev_rx_queue_index( > > } > > #endif > > > > -#define DEFAULT_MAX_NUM_RSS_QUEUES (8) > > +#define DEFAULT_MAX_NUM_RSS_QUEUES (64) > > int netif_get_num_default_rss_queues(void); > > There is no evidence having so many queues is beneficial. > > Too many queues -> lots of overhead in many cases. > > So I would rather not touch this, unless you can present good numbers ;) Thank you for the comment! I don't have convincing number it was more of a matter of consistency :) Now I think I forgot about aRFS, when aRFS support for the nfp is added we will probably start ignoring the default as well.
Re: [PATCH v2 bpf-next 0/3] bpf/verifier: subprog/func_call simplifications
On 04/04/18 00:37, Alexei Starovoitov wrote: > hmm. that doesn't fail for me and any other bots didn't complain. > Are you sure you're running the latest kernel and tests? Ah, test_progs isn't actually rebuilding because __NR_bpf is undeclared; something must be going wrong with header files. Never mind. > hmm. what's wrong with bsearch? It's trivial and fast. bsearch is O(log n), and the sort() call on the subprog_starts (which happens every time add_subprog() is called) is O(n log n). Whereas reading aux->subprogno is O(1). As far as I'm concerned, that's a sign that the latter data structure is the appropriate one. > Even if we don't see the solution today we have to work towards it. I guess I'm just not confident "towards" is the direction you think it is. > Compiler designers could have combined multiple of such passes into > fewer ones, but it's not done, because it increases complexity and > causes tough bugs where pass is partially complete. I'm not trying to combine together multiple 'for bb in prog/for insn in bb'- type passes. The combining I was doing was more on 'for all possible execution paths'-type passes, because it's those that explode combinatorially. Happily I think we can go a long way towards getting rid of them; but while I think we can get down to only having 1, I don't think we can reach 0. > The prime example where more than 4k instructions and loops are mandatory > is user space stack analysis inside the program. Like walking python stack > requires non-trival pointer chasing. With 'pragma unroll' the stack depth > limit today is ~18. That's not really usable. Looping through 100 python > frames would require about 16k bpf assembler instructions. But this would be solved by having support for bounded loops, and I think I've successfully shown that this is not inherently incompatible with a do_check() style walk. > Hence do_check approach must go. The rough idea is to compute per basic > block a set of INs (registers and stack) that basic block needs > to see to be safe and corresponding set of OUTs. > Then propagate this knowledge across cfg edges. > Once we have such set per bpf function, it will essentially answer the > question > 'what arguments this function needs to see to be safe and what it returns' > To make bpf libraries scale we'd need to keep such information > around after the verification, so dynamic linking and indirect calls > are fast to verify. > It's very high level obviously. There are many gotchas to resolve. I agree that if we can do this it'll be ideal. But that's a big 'if'; my example code was intended to demonstrate that the "set of INs bb/func needs to see to be safe" can be an arbitrarily complicated disjunction, and that instead of a combinatorially exploding number of paths to walk (the do_check() approach) you now have combinatorially exploding IN-constraints to propagate backwards. > Please do, since that's my concern with tsort. > The verifier is the key piece of bpf infra and to be effective maintainers > we need to thoroughly understand the verifier code. > We cannot just take the patch based on the cover letter. The author may > disappear tomorrow and what we're going to do with the code? I entirely accept this argument. Unfortunately, when writing explanations, it's difficult to know when one has reached something that will be understood, so inevitably there will have to be a few iterations to get there ;-)
[PATCH net] net: dsa: Discard frames from unused ports
The Marvell switches under some conditions will pass a frame to the host with the port being the CPU port. Such frames are invalid, and should be dropped. Not dropping them can result in a crash when incrementing the receive statistics for an invalid port. Reported-by: Chris Healy Fixes: 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics") Signed-off-by: Andrew Lunn --- net/dsa/dsa_priv.h | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/dsa/dsa_priv.h b/net/dsa/dsa_priv.h index 70de7895e5b8..6c1bf3d9f652 100644 --- a/net/dsa/dsa_priv.h +++ b/net/dsa/dsa_priv.h @@ -126,6 +126,7 @@ static inline struct net_device *dsa_master_find_slave(struct net_device *dev, struct dsa_port *cpu_dp = dev->dsa_ptr; struct dsa_switch_tree *dst = cpu_dp->dst; struct dsa_switch *ds; + struct dsa_port *slave_port; if (device < 0 || device >= DSA_MAX_SWITCHES) return NULL; @@ -137,7 +138,12 @@ static inline struct net_device *dsa_master_find_slave(struct net_device *dev, if (port < 0 || port >= ds->num_ports) return NULL; - return ds->ports[port].slave; + slave_port = &ds->ports[port]; + + if (slave_port->type != DSA_PORT_TYPE_USER) + return NULL; + + return slave_port->slave; } /* port.c */ -- 2.16.3
Re: [PATCH iproute2] ip/l2tp: remove offset and peer-offset options
On Tue, 3 Apr 2018 17:39:54 +0200 Guillaume Nault wrote: > Ignore options "peer-offset" and "offset" when creating sessions. Keep > them when dumping sessions in order to avoid breaking external scripts. > > "peer-offset" has always been a noop in iproute2. "offset" is now > ignored in Linux 4.16 (and was broken before that). > > Signed-off-by: Guillaume Nault Sure, this makes sense applied. In theory, you could have just dropped them from the JSON output.
Re: [PATCH iproute2 rdma: Ignore unknown netlink attributes
On Tue, 3 Apr 2018 10:28:42 +0300 Leon Romanovsky wrote: > From: Leon Romanovsky > > The check if netlink attributes supplied more than maximum supported > is to strict and may lead to backward compatibility issues with old > application with a newer kernel that supports new attribute. > > CC: Steve Wise > Fixes: 74bd75c2b68d ("rdma: Add basic infrastructure for RDMA tool") > Signed-off-by: Leon Romanovsky Applied
Re: [PATCH iproute2-next] tc: Correct json output for actions
On Wed, 04 Apr 2018 17:09:04 -0400 Roman Mashak wrote: > Yuval Mintz writes: > > > Commit 9fd3f0b255d9 ("tc: enable json output for actions") added JSON > > support for tc-actions at the expense of breaking other use cases that > > reach tc_print_action(), as the latter don't expect the 'actions' array > > to be a new object. > > > > Consider the following taken duringrun of tc_chain.sh selftest, > > and see the latter command output is broken: > > > > $ ./tc/tc -j -p actions list action gact | grep -C 3 actions > > [ { > > "total acts": 1 > > },{ > > "actions": [ { > > "order": 0, > > > > $ ./tc/tc -p -j -s filter show dev enp3s0np2 ingress | grep -C 3 actions > > }, > > "skip_hw": true, > > "not_in_hw": true,{ > > "actions": [ { > > "order": 1, > > "kind": "gact", > > "control_action": { > > > > Relocate the open/close of the JSON object to declare the object only > > for the case that needs it. > > > > Signed-off-by: Yuval Mintz > > [...] > > > Good catch, thanks Yuval. > > Tested-by: Roman Mashak Applied
Re: [PATCH net-next 09/11] devlink: convert occ_get op to separate registration
On 4/4/18 4:59 PM, Jakub Kicinski wrote: > On Wed, 4 Apr 2018 08:25:11 +0200, Jiri Pirko wrote: >>> Jiri, I am not aware of any other API where a driver registers with it >>> yet doesn't want the handler to be called so either waits to register >> >> Again, the thing is, this is kind of unusual because of the reload >> thing. > > FWIW my knee jerk thought is that it's strange that devlink ops can > be executed at all while reload is happening (incl. another reload > request). I'm probably missing the real issue.. > Just responding with the same question ... Why are you not unregistering resources on a reload?
Re: [PATCH net-next 09/11] devlink: convert occ_get op to separate registration
On Wed, 4 Apr 2018 08:25:11 +0200, Jiri Pirko wrote: > >Jiri, I am not aware of any other API where a driver registers with it > >yet doesn't want the handler to be called so either waits to register > > Again, the thing is, this is kind of unusual because of the reload > thing. FWIW my knee jerk thought is that it's strange that devlink ops can be executed at all while reload is happening (incl. another reload request). I'm probably missing the real issue..
Re: [PATCH net] netns: filter uevents correctly
Christian Brauner writes: > On Wed, Apr 04, 2018 at 09:48:57PM +0200, Christian Brauner wrote: >> commit 07e98962fa77 ("kobject: Send hotplug events in all network >> namespaces") >> >> enabled sending hotplug events into all network namespaces back in 2010. >> Over time the set of uevents that get sent into all network namespaces has >> shrunk. We have now reached the point where hotplug events for all devices >> that carry a namespace tag are filtered according to that namespace. >> >> Specifically, they are filtered whenever the namespace tag of the kobject >> does not match the namespace tag of the netlink socket. One example are >> network devices. Uevents for network devices only show up in the network >> namespaces these devices are moved to or created in. >> >> However, any uevent for a kobject that does not have a namespace tag >> associated with it will not be filtered and we will *try* to broadcast it >> into all network namespaces. >> >> The original patchset was written in 2010 before user namespaces were a >> thing. With the introduction of user namespaces sending out uevents became >> partially isolated as they were filtered by user namespaces: >> >> net/netlink/af_netlink.c:do_one_broadcast() >> >> if (!net_eq(sock_net(sk), p->net)) { >> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID)) >> return; >> >> if (!peernet_has_id(sock_net(sk), p->net)) >> return; >> >> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns, >> CAP_NET_BROADCAST)) >> j return; >> } >> >> The file_ns_capable() check will check whether the caller had >> CAP_NET_BROADCAST at the time of opening the netlink socket in the user >> namespace of interest. This check is fine in general but seems insufficient >> to me when paired with uevents. The reason is that devices always belong to >> the initial user namespace so uevents for kobjects that do not carry a >> namespace tag should never be sent into another user namespace. This has >> been the intention all along. But there's one case where this breaks, >> namely if a new user namespace is created by root on the host and an >> identity mapping is established between root on the host and root in the >> new user namespace. Here's a reproducer: >> >> sudo unshare -U --map-root >> udevadm monitor -k >> # Now change to initial user namespace and e.g. do >> modprobe kvm >> # or >> rmmod kvm >> >> will allow the non-initial user namespace to retrieve all uevents from the >> host. This seems very anecdotal given that in the general case user >> namespaces do not see any uevents and also can't really do anything useful >> with them. >> >> Additionally, it is now possible to send uevents from userspace. As such we >> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user >> namespace of the network namespace of the netlink socket) userspace process >> make a decision what uevents should be sent. >> >> This makes me think that we should simply ensure that uevents for kobjects >> that do not carry a namespace tag are *always* filtered by user namespace >> in kobj_bcast_filter(). Specifically: >> - If the owning user namespace of the uevent socket is not init_user_ns the >> event will always be filtered. >> - If the network namespace the uevent socket belongs to was created in the >> initial user namespace but was opened from a non-initial user namespace >> the event will be filtered as well. >> Put another way, uevents for kobjects not carrying a namespace tag are now >> always only sent to the initial user namespace. The regression potential >> for this is near to non-existent since user namespaces can't really do >> anything with interesting devices. >> >> Signed-off-by: Christian Brauner > > That was supposed to be [PATCH net] not [PATCH net-next] which is > obviously closed. Sorry about that. This does not appear to be a fix. This looks like feature work. The motivation appears to be that looks wrong let's change it. So let's please leave this for when net-next opens again so we can have time to fully consider a change in semantics. Thank you, Eric
RE: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
> >> >One issue with the above proposed change to use TP_STATUS_IN_PROGRESS > >> >is that the documentation of the tp_status field is somewhat > >> >inconsistent. In some places it's described as TP_STATUS_KERNEL(0) > >> >meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0) > >> >meaning the entry is owned by user space. In other places ownership > >> >by user space is defined by the TP_STATUS_USER(1) bit being set. > >> > >> But indeed this example in packet_mmap.txt is problematic > >> > >> if (status == TP_STATUS_KERNEL) > >> retval = poll(&pfd, 1, timeout); > >> > >> It does not really matter whether the docs are possibly inconsistent and > >> which one is authoritative. Examples like the above make it likely that > >> some user code expects such code to work. > > > > Yes, that's exactly my concern. Yet another troubling example seems to be > > lipbcap which also is looking specifically for status to be anything other > > than > > TP_STATUS_KERNEL(0) to indicate a frame is available in user space. > > Good catch. If pcap-linux.c relies on this then the status field > cannot be changed. Other fields can be modified freely while tp_status > remains 0, perhaps that's an option. Possibly. Someone else suggested something similar but in at least the one example we thought through it still seemed like it didn't address the problem. For example, let's say we used tp_len == -1 to indicate to other kernel threads that the entry was already in progress. This would require that user space never set tp_len = -1 before returning the entry back to the kernel. If it did then no kernel thread would ever claim ownership and the ring would hang. Now, it seems pretty unlikely that user space would do such a thing so maybe we could look past that, but then we run into the issue that there is still a window of opportunity for other kernel threads to come in and wrap the ring. The reason is we can't set tp_len to the correct length after setting tp_status because user space could grab the entry and see tp_len == -1 so we have to set tp_len before we set tp_status. This means that there is still a window where other kernel threads could come in and see tp_len as something other than -1 and a tp_status of TP_STATUS_KERNEL and think it's ok to allocate the entry. This puts us back to where we are today (arguably with a smaller window, but a window none the less). Alternatively we could reacquire the spin_lock to then set tp_len followed by tp_status. This would give the necessary indivisibility in the kernel while preserving proper order as made visible to user space, but it comes at the cost of another spin_lock. Thanks for the suggestion. If you can think of a way around this I'm all ears. I'll think on this some more but so far I'm stuck on how to get past having to broaden the scope of the spin_lock, reacquire the spin_lock, or use some sort of atomic construct along with a parallel shadow ring structure (still thinking through that one as well).
Re: [RFC PATCH] packet: mark ring entry as in-use inside spin_lock to prevent RX ring overrun
>> >One issue with the above proposed change to use TP_STATUS_IN_PROGRESS >> >is that the documentation of the tp_status field is somewhat >> >inconsistent. In some places it's described as TP_STATUS_KERNEL(0) >> >meaning the entry is owned by the kernel and !TP_STATUS_KERNEL(0) >> >meaning the entry is owned by user space. In other places ownership >> >by user space is defined by the TP_STATUS_USER(1) bit being set. >> >> But indeed this example in packet_mmap.txt is problematic >> >> if (status == TP_STATUS_KERNEL) >> retval = poll(&pfd, 1, timeout); >> >> It does not really matter whether the docs are possibly inconsistent and >> which one is authoritative. Examples like the above make it likely that >> some user code expects such code to work. > > Yes, that's exactly my concern. Yet another troubling example seems to be > lipbcap which also is looking specifically for status to be anything other > than > TP_STATUS_KERNEL(0) to indicate a frame is available in user space. Good catch. If pcap-linux.c relies on this then the status field cannot be changed. Other fields can be modified freely while tp_status remains 0, perhaps that's an option.
Re: [PATCH iproute2-next] tc: Correct json output for actions
Yuval Mintz writes: > Commit 9fd3f0b255d9 ("tc: enable json output for actions") added JSON > support for tc-actions at the expense of breaking other use cases that > reach tc_print_action(), as the latter don't expect the 'actions' array > to be a new object. > > Consider the following taken duringrun of tc_chain.sh selftest, > and see the latter command output is broken: > > $ ./tc/tc -j -p actions list action gact | grep -C 3 actions > [ { > "total acts": 1 > },{ > "actions": [ { > "order": 0, > > $ ./tc/tc -p -j -s filter show dev enp3s0np2 ingress | grep -C 3 actions > }, > "skip_hw": true, > "not_in_hw": true,{ > "actions": [ { > "order": 1, > "kind": "gact", > "control_action": { > > Relocate the open/close of the JSON object to declare the object only > for the case that needs it. > > Signed-off-by: Yuval Mintz [...] Good catch, thanks Yuval. Tested-by: Roman Mashak
Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
On Wed, Apr 4, 2018 at 2:16 AM, Jesper Dangaard Brouer wrote: > > On Sun, 1 Apr 2018 20:47:28 -0400 Md. Islam" wrote: > >> [...] More specifically, header parsing and fib >> lookup only takes around 82 ns. This shows that this could be used to >> implement linerate packet forwarding in kernel. > > I cannot resist correcting you... > > You didn't specify the link speed, but assuming 10Gbit/s, then the > linerate is 14.88Mpps, which is 67.2 ns between arriving packets. Thus, > if the lookup cost is 82 ns, thus you cannot claim linerate performance > with these numbers. > > > Details: > > This is calculated based on the the minimum Ethernet frame size > 84-bytes, see https://en.wikipedia.org/wiki/Ethernet_frame for why this > is the minimum size. > > 10*10^9/(84*8) = 14,880,952 pps > 1/last*10^9= 67.2 ns > Yes, it's not actually line-rate forwarding, but it shows the intent towards that. Currently we are doing many things in fib_table_lookup() that can be simplified for a router. fib_get_table() and FIB_RES_DEV() would be simplified if we disable IP_ROUTE_MULTIPATH and IP_MULTIPLE_TABLES. We can increase throughput by doing less :-) Moreover if a network mostly carries larger packets (for instance, a network exclusively used for video streaming), then a 40Gb NIC produces packets in every 300ns. 40*10^9/(1500*8) = 3.4mpps 1/last*10^9= 300 ns > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH net] netns: filter uevents correctly
On Wed, Apr 04, 2018 at 09:48:57PM +0200, Christian Brauner wrote: > commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces") > > enabled sending hotplug events into all network namespaces back in 2010. > Over time the set of uevents that get sent into all network namespaces has > shrunk. We have now reached the point where hotplug events for all devices > that carry a namespace tag are filtered according to that namespace. > > Specifically, they are filtered whenever the namespace tag of the kobject > does not match the namespace tag of the netlink socket. One example are > network devices. Uevents for network devices only show up in the network > namespaces these devices are moved to or created in. > > However, any uevent for a kobject that does not have a namespace tag > associated with it will not be filtered and we will *try* to broadcast it > into all network namespaces. > > The original patchset was written in 2010 before user namespaces were a > thing. With the introduction of user namespaces sending out uevents became > partially isolated as they were filtered by user namespaces: > > net/netlink/af_netlink.c:do_one_broadcast() > > if (!net_eq(sock_net(sk), p->net)) { > if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID)) > return; > > if (!peernet_has_id(sock_net(sk), p->net)) > return; > > if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns, > CAP_NET_BROADCAST)) > j return; > } > > The file_ns_capable() check will check whether the caller had > CAP_NET_BROADCAST at the time of opening the netlink socket in the user > namespace of interest. This check is fine in general but seems insufficient > to me when paired with uevents. The reason is that devices always belong to > the initial user namespace so uevents for kobjects that do not carry a > namespace tag should never be sent into another user namespace. This has > been the intention all along. But there's one case where this breaks, > namely if a new user namespace is created by root on the host and an > identity mapping is established between root on the host and root in the > new user namespace. Here's a reproducer: > > sudo unshare -U --map-root > udevadm monitor -k > # Now change to initial user namespace and e.g. do > modprobe kvm > # or > rmmod kvm > > will allow the non-initial user namespace to retrieve all uevents from the > host. This seems very anecdotal given that in the general case user > namespaces do not see any uevents and also can't really do anything useful > with them. > > Additionally, it is now possible to send uevents from userspace. As such we > can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user > namespace of the network namespace of the netlink socket) userspace process > make a decision what uevents should be sent. > > This makes me think that we should simply ensure that uevents for kobjects > that do not carry a namespace tag are *always* filtered by user namespace > in kobj_bcast_filter(). Specifically: > - If the owning user namespace of the uevent socket is not init_user_ns the > event will always be filtered. > - If the network namespace the uevent socket belongs to was created in the > initial user namespace but was opened from a non-initial user namespace > the event will be filtered as well. > Put another way, uevents for kobjects not carrying a namespace tag are now > always only sent to the initial user namespace. The regression potential > for this is near to non-existent since user namespaces can't really do > anything with interesting devices. > > Signed-off-by: Christian Brauner That was supposed to be [PATCH net] not [PATCH net-next] which is obviously closed. Sorry about that. Christian > --- > lib/kobject_uevent.c | 10 +- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c > index 15ea216a67ce..cb98cddb6e3b 100644 > --- a/lib/kobject_uevent.c > +++ b/lib/kobject_uevent.c > @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct > sk_buff *skb, void *data) > return sock_ns != ns; > } > > - return 0; > + /* > + * The kobject does not carry a namespace tag so filter by user > + * namespace below. > + */ > + if (sock_net(dsk)->user_ns != &init_user_ns) > + return 1; > + > + /* Check if socket was opened from non-initial user namespace. */ > + return sk_user_ns(dsk) != &init_user_ns; > } > #endif > > -- > 2.15.1 >
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
> Networking vendors have out of tree kernel modules. Those modules use a > netdev (call it a master netdev, a control netdev, cpu port, whatever) > to pull packets from the ASIC and deliver to virtual netdevices > representing physical ports. Sounds a lot like DSA. Please ask the vendor to contribute the drivers :-) > The master netdev should not be mucked with by a user. It should be > ignored by certain s/w with lldpd as just an *example*. I have come across occasional problems with the master device in DSA. But nothing too serious. Generally the switch will just toss frames it gets which don't have the needed header, when they come direct from the master device, rather than via the slave devices. Andrew
[jkirsher/next-queue, RFC PATCH 3/3] net-sysfs: Add interface for Rx queue map per Tx queue
Extend transmit queue sysfs attribute to configure Rx queue map per Tx queue. By default no receive queues are configured for the Tx queue. - /sys/class/net/eth0/queues/tx-*/xps_rxqs Signed-off-by: Amritha Nambiar --- net/core/net-sysfs.c | 81 ++ 1 file changed, 81 insertions(+) diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index d7abd33..0654243 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -1283,6 +1283,86 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue, static struct netdev_queue_attribute xps_cpus_attribute __ro_after_init = __ATTR_RW(xps_cpus); + +static ssize_t xps_rxqs_show(struct netdev_queue *queue, char *buf) +{ + struct net_device *dev = queue->dev; + struct xps_dev_maps *dev_maps; + unsigned long *mask, index; + int j, len, num_tc = 1, tc = 0; + + mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long), + GFP_KERNEL); + if (!mask) + return -ENOMEM; + + index = get_netdev_queue_index(queue); + + if (dev->num_tc) { + num_tc = dev->num_tc; + tc = netdev_txq_to_tc(dev, index); + if (tc < 0) + return -EINVAL; + } + + rcu_read_lock(); + dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_RXQS]); + if (dev_maps) { + for (j = -1; j = attrmask_next(j, NULL, dev->num_rx_queues), +j < dev->num_rx_queues;) { + int i, tci = j * num_tc + tc; + struct xps_map *map; + + map = rcu_dereference(dev_maps->attr_map[tci]); + if (!map) + continue; + + for (i = map->len; i--;) { + if (map->queues[i] == index) { + set_bit(j, mask); + break; + } + } + } + } + + len = bitmap_print_to_pagebuf(false, buf, mask, dev->num_rx_queues); + rcu_read_unlock(); + kfree(mask); + + return len < PAGE_SIZE ? len : -EINVAL; +} + +static ssize_t xps_rxqs_store(struct netdev_queue *queue, const char *buf, + size_t len) +{ + struct net_device *dev = queue->dev; + unsigned long *mask, index; + int err; + + if (!capable(CAP_NET_ADMIN)) + return -EPERM; + + mask = kcalloc(BITS_TO_LONGS(dev->num_rx_queues), sizeof(long), + GFP_KERNEL); + if (!mask) + return -ENOMEM; + + index = get_netdev_queue_index(queue); + + err = bitmap_parse(buf, len, mask, dev->num_rx_queues); + if (err) { + kfree(mask); + return err; + } + + err = __netif_set_xps_queue(dev, mask, index, XPS_MAP_RXQS); + kfree(mask); + return err ? : len; +} + +static struct netdev_queue_attribute xps_rxqs_attribute __ro_after_init + = __ATTR_RW(xps_rxqs); #endif /* CONFIG_XPS */ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = { @@ -1290,6 +1370,7 @@ static struct attribute *netdev_queue_default_attrs[] __ro_after_init = { &queue_traffic_class.attr, #ifdef CONFIG_XPS &xps_cpus_attribute.attr, + &xps_rxqs_attribute.attr, &queue_tx_maxrate.attr, #endif NULL
[jkirsher/next-queue, RFC PATCH 0/3] Symmetric queue selection using XPS for Rx queues
This patch series implements support for Tx queue selection based on Rx queue map. This is done by configuring Rx queue map per Tx-queue using sysfs attribute. If the user configuration for Rx queues does not apply, then the Tx queue selection falls back to XPS using CPUs and finally to hashing. XPS is refactored to support Tx queue selection based on either the CPU map or the Rx-queue map. The config option CONFIG_XPS needs to be enabled. By default no receive queues are configured for the Tx queue. - /sys/class/net/eth0/queues/tx-*/xps_rxqs This is to enable sending packets on the same Tx-Rx queue pair as this is useful for busy polling multi-threaded workloads where it is not possible to pin the threads to a CPU. This is a rework of Sridhar's patch for symmetric queueing via socket option: https://www.spinics.net/lists/netdev/msg453106.html --- Amritha Nambiar (3): net: Refactor XPS for CPUs and Rx queues net: Enable Tx queue selection based on Rx queues net-sysfs: Add interface for Rx queue map per Tx queue include/linux/netdevice.h | 82 +++ include/net/sock.h| 18 +++ net/core/dev.c| 242 +++-- net/core/net-sysfs.c | 85 +++- net/core/sock.c |5 + net/ipv4/tcp_input.c |7 + net/ipv4/tcp_ipv4.c |1 net/ipv4/tcp_minisocks.c |1 8 files changed, 360 insertions(+), 81 deletions(-) --
[jkirsher/next-queue, RFC PATCH 2/3] net: Enable Tx queue selection based on Rx queues
This patch adds support to pick Tx queue based on the Rx queue map configuration set by the admin through the sysfs attribute for each Tx queue. If the user configuration for receive queue map does not apply, then the Tx queue selection falls back to CPU map based selection and finally to hashing. Signed-off-by: Amritha Nambiar Signed-off-by: Sridhar Samudrala --- include/net/sock.h | 18 ++ net/core/dev.c | 36 ++-- net/core/sock.c |5 + net/ipv4/tcp_input.c |7 +++ net/ipv4/tcp_ipv4.c |1 + net/ipv4/tcp_minisocks.c |1 + 6 files changed, 62 insertions(+), 6 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index 49bd2c1..53d58bc 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -139,6 +139,8 @@ typedef __u64 __bitwise __addrpair; * @skc_node: main hash linkage for various protocol lookup tables * @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol * @skc_tx_queue_mapping: tx queue number for this connection + * @skc_rx_queue_mapping: rx queue number for this connection + * @skc_rx_ifindex: rx ifindex for this connection * @skc_flags: place holder for sk_flags * %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE, * %SO_OOBINLINE settings, %SO_TIMESTAMPING settings @@ -215,6 +217,10 @@ struct sock_common { struct hlist_nulls_node skc_nulls_node; }; int skc_tx_queue_mapping; +#ifdef CONFIG_XPS + int skc_rx_queue_mapping; + int skc_rx_ifindex; +#endif union { int skc_incoming_cpu; u32 skc_rcv_wnd; @@ -326,6 +332,10 @@ struct sock { #define sk_nulls_node __sk_common.skc_nulls_node #define sk_refcnt __sk_common.skc_refcnt #define sk_tx_queue_mapping__sk_common.skc_tx_queue_mapping +#ifdef CONFIG_XPS +#define sk_rx_queue_mapping__sk_common.skc_rx_queue_mapping +#define sk_rx_ifindex __sk_common.skc_rx_ifindex +#endif #define sk_dontcopy_begin __sk_common.skc_dontcopy_begin #define sk_dontcopy_end__sk_common.skc_dontcopy_end @@ -1691,6 +1701,14 @@ static inline int sk_tx_queue_get(const struct sock *sk) return sk ? sk->sk_tx_queue_mapping : -1; } +static inline void sk_mark_rx_queue(struct sock *sk, struct sk_buff *skb) +{ +#ifdef CONFIG_XPS + sk->sk_rx_ifindex = skb->skb_iif; + sk->sk_rx_queue_mapping = skb_get_rx_queue(skb); +#endif +} + static inline void sk_set_socket(struct sock *sk, struct socket *sock) { sk_tx_queue_clear(sk); diff --git a/net/core/dev.c b/net/core/dev.c index 4cfc179..d43f1c2 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3457,18 +3457,14 @@ sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev) } #endif /* CONFIG_NET_EGRESS */ -static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) +static int __get_xps_queue_idx(struct net_device *dev, struct sk_buff *skb, + struct xps_dev_maps *dev_maps, unsigned int tci) { #ifdef CONFIG_XPS - struct xps_dev_maps *dev_maps; struct xps_map *map; int queue_index = -1; - rcu_read_lock(); - dev_maps = rcu_dereference(dev->xps_maps[XPS_MAP_CPUS]); if (dev_maps) { - unsigned int tci = skb->sender_cpu - 1; - if (dev->num_tc) { tci *= dev->num_tc; tci += netdev_get_prio_tc_map(dev, skb->priority); @@ -3485,6 +3481,34 @@ static inline int get_xps_queue(struct net_device *dev, struct sk_buff *skb) queue_index = -1; } } + return queue_index; +#else + return -1; +#endif +} + +static int get_xps_queue(struct net_device *dev, struct sk_buff *skb) +{ +#ifdef CONFIG_XPS + enum xps_map_type i = XPS_MAP_RXQS; + struct xps_dev_maps *dev_maps; + struct sock *sk = skb->sk; + int queue_index = -1; + unsigned int tci = 0; + + if (sk && sk->sk_rx_queue_mapping <= dev->real_num_rx_queues && + dev->ifindex == sk->sk_rx_ifindex) + tci = sk->sk_rx_queue_mapping; + + rcu_read_lock(); + while (queue_index < 0 && i < __XPS_MAP_MAX) { + if (i == XPS_MAP_CPUS) + tci = skb->sender_cpu - 1; + dev_maps = rcu_dereference(dev->xps_maps[i]); + queue_index = __get_xps_queue_idx(dev, skb, dev_maps, tci); + i++; + } + rcu_read_unlock(); return queue_index; diff --git a/net/core/sock.c b/net/core/sock.c index 6444525..bd053db 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -2817,6 +2817,11 @@ void sock_init_data(struct socket *sock, struct sock *sk) sk->sk_pacing_rate
[jkirsher/next-queue, RFC PATCH 1/3] net: Refactor XPS for CPUs and Rx queues
Refactor XPS code to support Tx queue selection based on CPU map or Rx queue map. Signed-off-by: Amritha Nambiar --- include/linux/netdevice.h | 82 +- net/core/dev.c| 208 ++--- net/core/net-sysfs.c |4 - 3 files changed, 218 insertions(+), 76 deletions(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index cf44503..37dbffe 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -730,10 +730,21 @@ struct xps_map { */ struct xps_dev_maps { struct rcu_head rcu; - struct xps_map __rcu *cpu_map[0]; + struct xps_map __rcu *attr_map[0]; }; -#define XPS_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \ + +#define XPS_CPU_DEV_MAPS_SIZE(_tcs) (sizeof(struct xps_dev_maps) + \ (nr_cpu_ids * (_tcs) * sizeof(struct xps_map *))) + +#define XPS_RXQ_DEV_MAPS_SIZE(_tcs, _rxqs) (sizeof(struct xps_dev_maps) +\ + (_rxqs * (_tcs) * sizeof(struct xps_map *))) + +enum xps_map_type { + XPS_MAP_RXQS, + XPS_MAP_CPUS, + __XPS_MAP_MAX +}; + #endif /* CONFIG_XPS */ #define TC_MAX_QUEUE 16 @@ -1867,7 +1878,7 @@ struct net_device { int watchdog_timeo; #ifdef CONFIG_XPS - struct xps_dev_maps __rcu *xps_maps; + struct xps_dev_maps __rcu *xps_maps[__XPS_MAP_MAX]; #endif #ifdef CONFIG_NET_CLS_ACT struct mini_Qdisc __rcu *miniq_egress; @@ -3204,6 +3215,71 @@ static inline void netif_wake_subqueue(struct net_device *dev, u16 queue_index) #ifdef CONFIG_XPS int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask, u16 index); +int __netif_set_xps_queue(struct net_device *dev, const unsigned long *mask, + u16 index, enum xps_map_type type); + +static inline bool attr_test_mask(unsigned long j, const unsigned long *mask, + unsigned int nr_bits) +{ +#ifdef CONFIG_DEBUG_PER_CPU_MAPS + WARN_ON_ONCE(j >= nr_bits); +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */ + return test_bit(j, mask); +} + +static inline bool attr_test_online(unsigned long j, + const unsigned long *online_mask, + unsigned int nr_bits) +{ +#ifdef CONFIG_DEBUG_PER_CPU_MAPS + WARN_ON_ONCE(j >= nr_bits); +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */ + + if (online_mask) + return test_bit(j, online_mask); + + if (j >= 0 && j < nr_bits) + return true; + + return false; +} + +static inline unsigned int attrmask_next(int n, const unsigned long *srcp, +unsigned int nr_bits) +{ + /* -1 is a legal arg here. */ + if (n != -1) { +#ifdef CONFIG_DEBUG_PER_CPU_MAPS + WARN_ON_ONCE(n >= nr_bits); +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */ + } + + if (srcp) + return find_next_bit(srcp, nr_bits, n + 1); + + return n + 1; +} + +static inline int attrmask_next_and(int n, const unsigned long *src1p, + const unsigned long *src2p, + unsigned int nr_bits) +{ + /* -1 is a legal arg here. */ + if (n != -1) { +#ifdef CONFIG_DEBUG_PER_CPU_MAPS + WARN_ON_ONCE(n >= nr_bits); +#endif /* CONFIG_DEBUG_PER_CPU_MAPS */ + } + + if (src1p && src2p) + return find_next_and_bit(src1p, src2p, nr_bits, n + 1); + else if (src1p) + return find_next_bit(src1p, nr_bits, n + 1); + else if (src2p) + return find_next_bit(src2p, nr_bits, n + 1); + + return n + 1; +} #else static inline int netif_set_xps_queue(struct net_device *dev, const struct cpumask *mask, diff --git a/net/core/dev.c b/net/core/dev.c index 9b04a9f..4cfc179 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -2091,7 +2091,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps, int pos; if (dev_maps) - map = xmap_dereference(dev_maps->cpu_map[tci]); + map = xmap_dereference(dev_maps->attr_map[tci]); if (!map) return false; @@ -2104,7 +2104,7 @@ static bool remove_xps_queue(struct xps_dev_maps *dev_maps, break; } - RCU_INIT_POINTER(dev_maps->cpu_map[tci], NULL); + RCU_INIT_POINTER(dev_maps->attr_map[tci], NULL); kfree_rcu(map, rcu); return false; } @@ -2137,30 +2137,49 @@ static bool remove_xps_queue_cpu(struct net_device *dev, static void netif_reset_xps_queues(struct net_device *dev, u16 offset, u16 count) { + const unsigned long *possible_mask = NULL; + enum xps_map_type type = XPS_MAP_RXQS; struct xps_dev_maps *dev_maps; - int cpu, i; bool a
Re: [PATCH 00/15] ARM: pxa: switch to DMA slave maps
On Wed, 04 Apr 2018 21:49:26 +0200 Robert Jarzmik wrote: > Ulf Hansson writes: > > > On 2 April 2018 at 16:26, Robert Jarzmik wrote: > >> Hi, > >> > >> This serie is aimed at removing the dmaengine slave compat use, and > >> transfer > >> knowledge of the DMA requestors into architecture code. > >> As this looks like a patch bomb, each maintainer expressing for his tree > >> either > >> an Ack or "I want to take through my tree" will be spared in the next > >> iterations > >> of this serie. > > > > Perhaps an option is to send this hole series as PR for 3.17 rc1, that > > would removed some churns and make this faster/easier? Well, if you > > receive the needed acks of course. > For 3.17-rc1 it looks a bit optimistic with the review time ... If I have all Especially since 3.17-rc1 has been released more than 3 years ago :-), but I guess you meant 4.17-rc1. > acks, I'll queue it into my pxa tree. If at least one maintainer withholds his > ack, the end of the serie (phase 3) won't be applied until it is sorted out. > > Cheers. > > -- > Robert
Re: [PATCH 00/15] ARM: pxa: switch to DMA slave maps
Ulf Hansson writes: > On 2 April 2018 at 16:26, Robert Jarzmik wrote: >> Hi, >> >> This serie is aimed at removing the dmaengine slave compat use, and transfer >> knowledge of the DMA requestors into architecture code. >> As this looks like a patch bomb, each maintainer expressing for his tree >> either >> an Ack or "I want to take through my tree" will be spared in the next >> iterations >> of this serie. > > Perhaps an option is to send this hole series as PR for 3.17 rc1, that > would removed some churns and make this faster/easier? Well, if you > receive the needed acks of course. For 3.17-rc1 it looks a bit optimistic with the review time ... If I have all acks, I'll queue it into my pxa tree. If at least one maintainer withholds his ack, the end of the serie (phase 3) won't be applied until it is sorted out. Cheers. -- Robert
[PATCH net-next] netns: filter uevents correctly
commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces") enabled sending hotplug events into all network namespaces back in 2010. Over time the set of uevents that get sent into all network namespaces has shrunk. We have now reached the point where hotplug events for all devices that carry a namespace tag are filtered according to that namespace. Specifically, they are filtered whenever the namespace tag of the kobject does not match the namespace tag of the netlink socket. One example are network devices. Uevents for network devices only show up in the network namespaces these devices are moved to or created in. However, any uevent for a kobject that does not have a namespace tag associated with it will not be filtered and we will *try* to broadcast it into all network namespaces. The original patchset was written in 2010 before user namespaces were a thing. With the introduction of user namespaces sending out uevents became partially isolated as they were filtered by user namespaces: net/netlink/af_netlink.c:do_one_broadcast() if (!net_eq(sock_net(sk), p->net)) { if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID)) return; if (!peernet_has_id(sock_net(sk), p->net)) return; if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns, CAP_NET_BROADCAST)) j return; } The file_ns_capable() check will check whether the caller had CAP_NET_BROADCAST at the time of opening the netlink socket in the user namespace of interest. This check is fine in general but seems insufficient to me when paired with uevents. The reason is that devices always belong to the initial user namespace so uevents for kobjects that do not carry a namespace tag should never be sent into another user namespace. This has been the intention all along. But there's one case where this breaks, namely if a new user namespace is created by root on the host and an identity mapping is established between root on the host and root in the new user namespace. Here's a reproducer: sudo unshare -U --map-root udevadm monitor -k # Now change to initial user namespace and e.g. do modprobe kvm # or rmmod kvm will allow the non-initial user namespace to retrieve all uevents from the host. This seems very anecdotal given that in the general case user namespaces do not see any uevents and also can't really do anything useful with them. Additionally, it is now possible to send uevents from userspace. As such we can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of the network namespace of the netlink socket) userspace process make a decision what uevents should be sent. This makes me think that we should simply ensure that uevents for kobjects that do not carry a namespace tag are *always* filtered by user namespace in kobj_bcast_filter(). Specifically: - If the owning user namespace of the uevent socket is not init_user_ns the event will always be filtered. - If the network namespace the uevent socket belongs to was created in the initial user namespace but was opened from a non-initial user namespace the event will be filtered as well. Put another way, uevents for kobjects not carrying a namespace tag are now always only sent to the initial user namespace. The regression potential for this is near to non-existent since user namespaces can't really do anything with interesting devices. Signed-off-by: Christian Brauner --- lib/kobject_uevent.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c index 15ea216a67ce..cb98cddb6e3b 100644 --- a/lib/kobject_uevent.c +++ b/lib/kobject_uevent.c @@ -251,7 +251,15 @@ static int kobj_bcast_filter(struct sock *dsk, struct sk_buff *skb, void *data) return sock_ns != ns; } - return 0; + /* +* The kobject does not carry a namespace tag so filter by user +* namespace below. +*/ + if (sock_net(dsk)->user_ns != &init_user_ns) + return 1; + + /* Check if socket was opened from non-initial user namespace. */ + return sk_user_ns(dsk) != &init_user_ns; } #endif -- 2.15.1
Re: possible deadlock in skb_queue_tail
On Wed, Apr 4, 2018 at 7:08 AM, Cong Wang wrote: > On Tue, Apr 3, 2018 at 4:42 AM, Kirill Tkhai wrote: >> On 03.04.2018 14:25, Dmitry Vyukov wrote: >>> On Tue, Apr 3, 2018 at 11:50 AM, Kirill Tkhai wrote: sk_diag_dump_icons() dumps only sockets in TCP_LISTEN state. TCP_LISTEN state may be assigned in only place in net/unix/af_unix.c: it's unix_listen(). The function is applied to stream and seqpacket socket types. It can't be stream because of the second stack, and seqpacket also can't, as I don't think it's possible for gcc to inline unix_seqpacket_sendmsg() in the way, we don't see it in the stack. So, this is looks like false positive result for me. Kirill >>> >>> Do you mean that these &(&u->lock)->rlock/1 referenced in 2 stacks are >>> always different? >> >> In these 2 particular stacks they have to be different. > > So actually my patch could fix this false positive? I thought it couldn't. > https://patchwork.ozlabs.org/patch/894342/ You know better! If you suspect it can fix this report, and nobody has better proposals, then we can just mark this as being fixed with your commit and then see if it triggers again with your commit or not.
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
Wed, Apr 04, 2018 at 07:37:49PM CEST, da...@davemloft.net wrote: >From: David Ahern >Date: Wed, 4 Apr 2018 11:21:54 -0600 > >> It is a netdev so there is no reason to have a separate ip command to >> inspect it. 'ip link' is the right place. > >I agree on this. > >What I really don't understand still is the use case... really. > >So there are control netdevs, what exactly is the problem with that? > >Are we not exporting enough information for applications to handle >these devices sanely? If so, then's let add that information. > >We can set netdev->type to ETH_P_LINUXCONTROL or something like that. > >Another alternative is to add an interface flag like IFF_CONTROL or >similar, and that probably is much nicer. > >Hiding the devices means that we acknowledge that applications are >currently broken with control netdevs... and we want them to stay >broken! > >That doesn't sound like a good plan to me. > >So let's fix handling of control netdevs instead of hiding them. Exactly. Don't workaround userspace issues by kernel patches.
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
On Wed, Apr 4, 2018 at 10:21 AM, David Ahern wrote: > On 4/4/18 1:36 AM, Siwei Liu wrote: >> On Tue, Apr 3, 2018 at 6:04 PM, David Ahern wrote: >>> On 4/3/18 9:42 AM, Jiri Pirko wrote: > > There are other use cases that want to hide a device from userspace. I What usecases do you have in mind? >>> >>> As mentioned in a previous response some kernel drivers create control >>> netdevs. Just as in this case users should not be mucking with it, and >>> S/W like lldpd should ignore it. >>> > would prefer a better solution than playing games with name prefixes and > one that includes an API for users to list all devices -- even ones > hidden by default. Netdevice hiding feels a bit scarry for me. This smells like a workaround for userspace issues. Why can't the netdevice be visible always and userspace would know what is it and what should it do with it? Once we start with hiding, there are other things related to that which appear. Like who can see what, levels of visibility etc... >>> >>> I would not advocate for any API that does not allow users to have full >>> introspection. The intent is to hide the netdev by default but have an >>> option to see it. >> >> I'm fine with having a link dump API to inspect the hidden netdev. As >> said, the name for hidden netdevs should be in a separate device >> namespace, and we did not even get closer to what it should look like >> as I don't want to make it just an option for ip link. Perhaps a new >> set of sub-commands of, say, 'ip device'. > > It is a netdev so there is no reason to have a separate ip command to > inspect it. 'ip link' is the right place. If you're still thinking the visibility is part of link attribute rather than a separate namespace, I'd say we are trying to solve essentially different problems, and I really don't understand your proposal in solving that problem to be honest. So, let's step back on studying your case if that's the right thing and let's talk about concrete examples. -Siwei
[GIT] Networking
This fixes some fallout from the net-next merge the other day, plus some non-merge-window-related bug fixes: 1) Fix sparse warnings in bcmgenet,systemport, b53, and mt7530, from Florian Fainelli. 2) pptp does a bogus dst_release() on a route we have a single refcount on, and attached to a socket, which needs that refcount. From Eric Dumazet. 3) UDP connected sockets on ipv6 can race with route update handling, resulting in a pre-PMTU update route still stuck on the socket and thus continuing to get ICMPV6_PKT_TOOBIG errors. We end up never seeing the updated route. Fix from Alexey Kodanev. 4) Missing list initializer(s) in TIPC, from Jon Maloy. 5) Connect phy early to prevent crashes in lan78xx driver, from Alexander Graf. 6) Fix build with modular NVMEM, from Arnd Bergmann. 7) netdevsim canot mark nsim_devlink_net_ops and nsim_fib_net_ops as __net_initdata, as these are references from module unload unconditionally. From Arnd Bergmann. Please pull, thanks a lot! The following changes since commit 17dec0a949153d9ac00760ba2f5b78cb583e995f: Merge branch 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace (2018-04-03 19:15:32 -0700) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git for you to fetch changes up to 87248d31d1055b56e01a62d9320b4e118bc84e0e: netdevsim: remove incorrect __net_initdata annotations (2018-04-04 12:53:37 -0400) Alexander Graf (1): lan78xx: Connect phy early Alexey Kodanev (4): ipv6: add a wrapper for ip6_dst_store() with flowi6 checks ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow() ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg() ipv6: udp: set dst cache for a connected sk if current not valid Arnd Bergmann (2): nvmem: disallow modular CONFIG_NVMEM netdevsim: remove incorrect __net_initdata annotations Bert Kenward (1): sfc: remove ctpio_dmabuf_start from stats Cong Wang (1): af_unix: remove redundant lockdep class David Howells (1): rxrpc: Fix undefined packet handling David S. Miller (2): Merge branch 'net-Broadcom-drivers-sparse-fixes' Merge branch 'ipv6-udp-set-dst-cache-for-a-connected-sk-if-current-not-valid' Dirk van der Merwe (1): nfp: use full 40 bits of the NSP buffer address Eric Dumazet (2): pptp: remove a buggy dst release in pptp_connect() inet: frags: fix ip6frag_low_thresh boundary Florian Fainelli (4): net: bcmgenet: Fix sparse warnings in bcmgenet_put_tx_csum() net: systemport: Fix sparse warnings in bcm_sysport_insert_tsb() net: dsa: b53: Fix sparse warnings in b53_mmap.c net: dsa: mt7530: Use NULL instead of plain integer GhantaKrishnamurthy MohanKrishna (1): tipc: Fix namespace violation in tipc_sk_fill_sock_diag Jakub Kicinski (1): nfp: add a separate counter for packets with CHECKSUM_COMPLETE Jon Maloy (1): tipc: Fix missing list initializations in struct tipc_subscription Paolo Abeni (1): net: avoid unneeded atomic operation in ip*_append_data() Russell King (1): net: phy: marvell10g: add thermal hwmon device Tan Xiaojun (1): net: hns3: fix length overflow when CONFIG_ARM64_64K_PAGES drivers/net/dsa/b53/b53_mmap.c | 33 +- drivers/net/dsa/mt7530.c | 6 ++-- drivers/net/ethernet/broadcom/bcmsysport.c | 11 +++--- drivers/net/ethernet/broadcom/genet/bcmgenet.c | 11 +++--- drivers/net/ethernet/hisilicon/hns3/hns3_enet.h | 2 +- drivers/net/ethernet/netronome/nfp/nfp_net.h | 4 ++- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +- drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 16 + drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp.c | 9 ++--- drivers/net/ethernet/sfc/ef10.c | 2 -- drivers/net/ethernet/sfc/nic.h | 1 - drivers/net/netdevsim/devlink.c | 2 +- drivers/net/netdevsim/fib.c | 2 +- drivers/net/phy/marvell10g.c | 184 ++-- drivers/net/ppp/pptp.c | 1 - drivers/net/usb/lan78xx.c| 34 +- drivers/nvmem/Kconfig| 2 +- include/net/ip6_route.h | 3 ++ include/net/ipv6.h | 3 +- net/ieee802154/6lowpan/reassembly.c | 2 -- net/ipv4/ip_fragment.c | 5 ++- net/ipv4/ip_output.c | 3 +- net/ipv6/datagram.c | 9 + net/
Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
On Wed, Apr 4, 2018 at 1:41 PM Yuchung Cheng wrote: > On Wed, Apr 4, 2018 at 10:22 AM, Neal Cardwell wrote: > > n Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng wrote: > >> Agreed. That's a good point. And I would much preferred to rename that > >> to FLAG_ORIG_PROGRESS (w/ updated comment). > > > >> so I think we're in agreement to use existing patch w/ the new name > >> FLAG_ORIG_PROGRESS > > > > Yes, SGTM. > > > > I guess this "prevent bogus FRTO undos" patch would go to "net" branch and > > the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next" > > branch? > huh? why not one patch ... this is getting close to patch-split paralyses. The flag rename seemed like a cosmetic issue that was not needed for the fix. Smelled like net-next to me. But I don't feel strongly. However you guys want to package it is fine with me. :-) neal
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
On Wed, 4 Apr 2018 11:37:52 -0600 David Ahern wrote: > Networking vendors have out of tree kernel modules. Those modules use a > netdev (call it a master netdev, a control netdev, cpu port, whatever) > to pull packets from the ASIC and deliver to virtual netdevices > representing physical ports. The master netdev should not be mucked with > by a user. It should be ignored by certain s/w with lldpd as just an > *example*. Sorry, the linux kernel maintainers have a clear well defined attitude about out of tree kernel modules...
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: David Ahern Date: Wed, 4 Apr 2018 11:37:52 -0600 > Networking vendors have out of tree kernel modules. Those modules use a > netdev (call it a master netdev, a control netdev, cpu port, whatever) > to pull packets from the ASIC and deliver to virtual netdevices > representing physical ports. The master netdev should not be mucked with > by a user. It should be ignored by certain s/w with lldpd as just an > *example*. Two approaches: 1) Add an IFF_CONTROL and make userspace understand this. It is probably long overdue. 2) Design the driver properly. Have a non-netdev master device like mlxsw does, and control it using devlink or similar. This is exactly how this stuff was meant to be architected. > From there I think you are confusing my intentions: I fundamentally do > not believe the kernel should be hiding anything from an admin. Not > showing data by default is completely different than not showing that > data at all. It is the same David. It measn we have no intention of fixing applications to properly know what to do with and how to handle these devices. If you hide these objects, we are basically giving up on fixing the tools and or the drivers themselves to be architected differently (see #2 above). That really isn't acceptable in my opinion. > The intention of my patch with the IFF_HIDDEN attribute is: > 1. it is a netdev attribute > > 2. that attribute can be used by userpsace to indicate to the kernel I > want all or I want the default > > 3. that attribute can be controlled by an admin. > > The patches go beyond my specific use case (preventing a user from > modifying a netdev it should not be touching) but to defining the > semantics of a generic capability which is what the kernel should have. "Teach, do not hide!" -Yoda
Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
On Wed, Apr 4, 2018 at 10:22 AM, Neal Cardwell wrote: > n Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng wrote: >> Agreed. That's a good point. And I would much preferred to rename that >> to FLAG_ORIG_PROGRESS (w/ updated comment). > >> so I think we're in agreement to use existing patch w/ the new name >> FLAG_ORIG_PROGRESS > > Yes, SGTM. > > I guess this "prevent bogus FRTO undos" patch would go to "net" branch and > the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next" > branch? huh? why not one patch ... this is getting close to patch-split paralyses. > > neal
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
[ dropping virtio-...@lists.oasis-open.org since it is a closed list and I am tired of deleting bounces ] On 4/4/18 2:28 AM, Siwei Liu wrote: > On Tue, Apr 3, 2018 at 6:04 PM, David Ahern wrote: >> On 4/3/18 9:42 AM, Jiri Pirko wrote: There are other use cases that want to hide a device from userspace. I >>> >>> What usecases do you have in mind? >> >> As mentioned in a previous response some kernel drivers create control >> netdevs. Just as in this case users should not be mucking with it, and >> S/W like lldpd should ignore it. > > I'm still not sure I understand your case: why you want to hide the > control netdev, as I assume those devices could choose either to > silently ignore the request, or fail loudly against user operations? > Is it creating issues already, or what problem you want to solve if > not making the netdev invisible. Why couldn't lldpd check some > specific flag and ignore the control netdevice (can you please give an > example of a concrete driver for control netdevice *in tree*). > > And I'm completely lost why you want an API to make a hidden netdev > visible again for these control devices. Networking vendors have out of tree kernel modules. Those modules use a netdev (call it a master netdev, a control netdev, cpu port, whatever) to pull packets from the ASIC and deliver to virtual netdevices representing physical ports. The master netdev should not be mucked with by a user. It should be ignored by certain s/w with lldpd as just an *example*. The short of it is that you have your reasons for wanting to hide the virtio bypass device; other users have other arguments for wanting a similar capability. -- >From there I think you are confusing my intentions: I fundamentally do not believe the kernel should be hiding anything from an admin. Not showing data by default is completely different than not showing that data at all. The intention of my patch with the IFF_HIDDEN attribute is: 1. it is a netdev attribute 2. that attribute can be used by userpsace to indicate to the kernel I want all or I want the default 3. that attribute can be controlled by an admin. The patches go beyond my specific use case (preventing a user from modifying a netdev it should not be touching) but to defining the semantics of a generic capability which is what the kernel should have.
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: David Ahern Date: Wed, 4 Apr 2018 11:21:54 -0600 > It is a netdev so there is no reason to have a separate ip command to > inspect it. 'ip link' is the right place. I agree on this. What I really don't understand still is the use case... really. So there are control netdevs, what exactly is the problem with that? Are we not exporting enough information for applications to handle these devices sanely? If so, then's let add that information. We can set netdev->type to ETH_P_LINUXCONTROL or something like that. Another alternative is to add an interface flag like IFF_CONTROL or similar, and that probably is much nicer. Hiding the devices means that we acknowledge that applications are currently broken with control netdevs... and we want them to stay broken! That doesn't sound like a good plan to me. So let's fix handling of control netdevs instead of hiding them. Thanks.
Re: [Intel-wired-lan] [iwl next-queue PATCH 02/10] macvlan: Rename fwd_priv to accel_priv and add accessor function
On Wed, Apr 4, 2018 at 9:53 AM, Shannon Nelson wrote: > On 4/3/2018 2:16 PM, Alexander Duyck wrote: > > [...] >> >> +static inline void *macvlan_accel_priv(struct net_device *dev) >> +{ >> + struct macvlan_dev *macvlan = netdev_priv(dev); >> + >> + return macvlan->accel_priv; > > > Perhaps a check for (macvlan == NULL) before using it? > sln > > The macvlan pointer cannot be NULL.The netdev_priv() function adds an offset to the dev pointer. So I would have to be checking for a NULL netdev. If the netdev was NULL then there was probably no point in calling this function in the first place. - Alex
Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
n Wed, Apr 4, 2018 at 1:13 PM Yuchung Cheng wrote: > Agreed. That's a good point. And I would much preferred to rename that > to FLAG_ORIG_PROGRESS (w/ updated comment). > so I think we're in agreement to use existing patch w/ the new name > FLAG_ORIG_PROGRESS Yes, SGTM. I guess this "prevent bogus FRTO undos" patch would go to "net" branch and the s/FLAG_ORIG_SACK_ACKED/FLAG_ORIG_PROGRESS/ would go in "net-next" branch? neal
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
On 4/4/18 1:36 AM, Siwei Liu wrote: > On Tue, Apr 3, 2018 at 6:04 PM, David Ahern wrote: >> On 4/3/18 9:42 AM, Jiri Pirko wrote: There are other use cases that want to hide a device from userspace. I >>> >>> What usecases do you have in mind? >> >> As mentioned in a previous response some kernel drivers create control >> netdevs. Just as in this case users should not be mucking with it, and >> S/W like lldpd should ignore it. >> >>> would prefer a better solution than playing games with name prefixes and one that includes an API for users to list all devices -- even ones hidden by default. >>> >>> Netdevice hiding feels a bit scarry for me. This smells like a workaround >>> for userspace issues. Why can't the netdevice be visible always and >>> userspace would know what is it and what should it do with it? >>> >>> Once we start with hiding, there are other things related to that which >>> appear. Like who can see what, levels of visibility etc... >>> >> >> I would not advocate for any API that does not allow users to have full >> introspection. The intent is to hide the netdev by default but have an >> option to see it. > > I'm fine with having a link dump API to inspect the hidden netdev. As > said, the name for hidden netdevs should be in a separate device > namespace, and we did not even get closer to what it should look like > as I don't want to make it just an option for ip link. Perhaps a new > set of sub-commands of, say, 'ip device'. It is a netdev so there is no reason to have a separate ip command to inspect it. 'ip link' is the right place.
[PATCH iproute2-next 1/1] tc: jsonify tunnel_key action
Signed-off-by: Roman Mashak --- tc/m_tunnel_key.c | 36 +--- 1 file changed, 25 insertions(+), 11 deletions(-) diff --git a/tc/m_tunnel_key.c b/tc/m_tunnel_key.c index bac3c07fa90b..0fa461549ad9 100644 --- a/tc/m_tunnel_key.c +++ b/tc/m_tunnel_key.c @@ -221,7 +221,13 @@ static void tunnel_key_print_ip_addr(FILE *f, const char *name, else return; - fprintf(f, "\n\t%s %s", name, rt_addr_n2a_rta(family, attr)); + print_string(PRINT_FP, NULL, "%s", _SL_); + if (matches(name, "src_ip") == 0) + print_string(PRINT_ANY, "src_ip", "\tsrc_ip %s", +rt_addr_n2a_rta(family, attr)); + else if (matches(name, "dst_ip") == 0) + print_string(PRINT_ANY, "dst_ip", "\tdst_ip %s", +rt_addr_n2a_rta(family, attr)); } static void tunnel_key_print_key_id(FILE *f, const char *name, @@ -229,7 +235,8 @@ static void tunnel_key_print_key_id(FILE *f, const char *name, { if (!attr) return; - fprintf(f, "\n\t%s %d", name, rta_getattr_be32(attr)); + print_string(PRINT_FP, NULL, "%s", _SL_); + print_uint(PRINT_ANY, "key_id", "\tkey_id %u", rta_getattr_be32(attr)); } static void tunnel_key_print_dst_port(FILE *f, char *name, @@ -237,7 +244,9 @@ static void tunnel_key_print_dst_port(FILE *f, char *name, { if (!attr) return; - fprintf(f, "\n\t%s %d", name, rta_getattr_be16(attr)); + print_string(PRINT_FP, NULL, "%s", _SL_); + print_uint(PRINT_ANY, "dst_port", "\tdst_port %u", + rta_getattr_be16(attr)); } static void tunnel_key_print_flag(FILE *f, const char *name_on, @@ -246,7 +255,9 @@ static void tunnel_key_print_flag(FILE *f, const char *name_on, { if (!attr) return; - fprintf(f, "\n\t%s", rta_getattr_u8(attr) ? name_on : name_off); + print_string(PRINT_FP, NULL, "%s", _SL_); + print_string(PRINT_ANY, "flag", "\t%s", +rta_getattr_u8(attr) ? name_on : name_off); } static int print_tunnel_key(struct action_util *au, FILE *f, struct rtattr *arg) @@ -260,19 +271,20 @@ static int print_tunnel_key(struct action_util *au, FILE *f, struct rtattr *arg) parse_rtattr_nested(tb, TCA_TUNNEL_KEY_MAX, arg); if (!tb[TCA_TUNNEL_KEY_PARMS]) { - fprintf(f, "[NULL tunnel_key parameters]"); + print_string(PRINT_FP, NULL, "%s", +"[NULL tunnel_key parameters]"); return -1; } parm = RTA_DATA(tb[TCA_TUNNEL_KEY_PARMS]); - fprintf(f, "tunnel_key"); + print_string(PRINT_ANY, "kind", "%s ", "tunnel_key"); switch (parm->t_action) { case TCA_TUNNEL_KEY_ACT_RELEASE: - fprintf(f, " unset"); + print_string(PRINT_ANY, "mode", " %s", "unset"); break; case TCA_TUNNEL_KEY_ACT_SET: - fprintf(f, " set"); + print_string(PRINT_ANY, "mode", " %s", "set"); tunnel_key_print_ip_addr(f, "src_ip", tb[TCA_TUNNEL_KEY_ENC_IPV4_SRC]); tunnel_key_print_ip_addr(f, "dst_ip", @@ -291,8 +303,10 @@ static int print_tunnel_key(struct action_util *au, FILE *f, struct rtattr *arg) } print_action_control(f, " ", parm->action, ""); - fprintf(f, "\n\tindex %d ref %d bind %d", parm->index, parm->refcnt, - parm->bindcnt); + print_string(PRINT_FP, NULL, "%s", _SL_); + print_uint(PRINT_ANY, "index", "\t index %u", parm->index); + print_int(PRINT_ANY, "ref", " ref %d", parm->refcnt); + print_int(PRINT_ANY, "bind", " bind %d", parm->bindcnt); if (show_stats) { if (tb[TCA_TUNNEL_KEY_TM]) { @@ -302,7 +316,7 @@ static int print_tunnel_key(struct action_util *au, FILE *f, struct rtattr *arg) } } - fprintf(f, "\n "); + print_string(PRINT_FP, NULL, "%s", _SL_); return 0; } -- 2.7.4
Re: [PATCH v3 net 1/5] tcp: feed correct number of pkts acked to cc modules also in recovery
On Wed, Apr 4, 2018 at 6:42 AM Ilpo Järvinen wrote: > On Wed, 28 Mar 2018, Yuchung Cheng wrote: ... > > Your patch looks good. Some questions / comments: The patch looks good to me as well (I read the Mar 28 version). Thanks for this fix, Ilpo. > Just to be sure that we understand each other the correct way around this > time, I guess it resolved both of your "disagree" points above? > > 1. Why only apply to CA_Loss and not also CA_Recovery? this may > > mitigate GRO issue I noted in other thread. > Hmm, that's indeed a good idea. I'll give it some more thought but my > initial impression is that it should work. Great. I would agree with Yuchung's suggestion to apply this fix to CA_Recovery as well. > > 2. minor style nit: can we adjust tp->delivered directly in > > tcp_clean_rtx_queue(). I am personally against calling "non-sack" reno > > (e.g. tcp_*reno*()) because its really confusing w/ Reno congestion > > control when SACK is used. One day I'd like to rename all these *reno* > > to _nonsack_. > That's what I actually did first but put ended up putting it into own > function because of line lengths (not a particularly good reason). > ...So yes, I can put it into the tcp_clean_rtx_queue in the next version. > I'll also try to keep that "reno" thing in my mind. Either approach here sounds fine to me. Though personally I find it more readable to keep all the non-SACK helpers together and consistently named, including this one (even if the name is confusing, at least it is consistent... :-). neal
Re: [PATCH iproute2-next] tc: Correct json output for actions
On 4/4/18 6:24 AM, Yuval Mintz wrote: > Commit 9fd3f0b255d9 ("tc: enable json output for actions") added JSON > support for tc-actions at the expense of breaking other use cases that > reach tc_print_action(), as the latter don't expect the 'actions' array > to be a new object. > > Consider the following taken duringrun of tc_chain.sh selftest, > and see the latter command output is broken: > > $ ./tc/tc -j -p actions list action gact | grep -C 3 actions > [ { > "total acts": 1 > },{ > "actions": [ { > "order": 0, > > $ ./tc/tc -p -j -s filter show dev enp3s0np2 ingress | grep -C 3 actions > }, > "skip_hw": true, > "not_in_hw": true,{ > "actions": [ { > "order": 1, > "kind": "gact", > "control_action": { > > Relocate the open/close of the JSON object to declare the object only > for the case that needs it. > > Signed-off-by: Yuval Mintz > --- > tc/m_action.c | 5 ++--- > 1 file changed, 2 insertions(+), 3 deletions(-) This is a bug fix so it should go through master. Added Roman so he is aware of the mistake.
Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
On Wed, Apr 4, 2018 at 9:33 AM, Neal Cardwell wrote: > > On Wed, Apr 4, 2018 at 6:35 AM Ilpo Järvinen > wrote: > > > On Wed, 28 Mar 2018, Yuchung Cheng wrote: > > > > On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen > > > wrote: > > > > > > > > If SACK is not enabled and the first cumulative ACK after the RTO > > > > retransmission covers more than the retransmitted skb, a spurious > > > > FRTO undo will trigger (assuming FRTO is enabled for that RTO). > > > > The reason is that any non-retransmitted segment acknowledged will > > > > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is > > > > no indication that it would have been delivered for real (the > > > > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK > > > > case so the check for that bit won't help like it does with SACK). > > > > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo > > > > in tcp_process_loss. > > > > > > > > We need to use more strict condition for non-SACK case and check > > > > that none of the cumulatively ACKed segments were retransmitted > > > > to prove that progress is due to original transmissions. Only then > > > > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in > > > > non-SACK case. > > > > > > > > Signed-off-by: Ilpo Järvinen > > > > --- > > > > net/ipv4/tcp_input.c | 9 + > > > > 1 file changed, 9 insertions(+) > > > > > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > > > index 4a26c09..c60745c 100644 > > > > --- a/net/ipv4/tcp_input.c > > > > +++ b/net/ipv4/tcp_input.c > > > > @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock > *sk, u32 prior_fack, > > > > pkts_acked = rexmit_acked + > newdata_acked; > > > > > > > > tcp_remove_reno_sacks(sk, pkts_acked); > > > > + > > > > + /* If any of the cumulatively ACKed segments > was > > > > +* retransmitted, non-SACK case cannot > confirm that > > > > +* progress was due to original transmission > due to > > > > +* lack of TCPCB_SACKED_ACKED bits even if > some of > > > > +* the packets may have been never > retransmitted. > > > > +*/ > > > > + if (flag & FLAG_RETRANS_DATA_ACKED) > > > > + flag &= ~FLAG_ORIG_SACK_ACKED; > > FWIW I'd vote for this version. > > > Of course I could put the back there but I really like the new place more > > (which was a result of your suggestion to place the code elsewhere). > > IMHO, it makes more sense to have it in tcp_clean_rtx_queue() because we > > weren't successful in proving (there in tcp_clean_rtx_queue) that progress > > was due original transmission and thus I would not want falsely indicate > > it with that flag. And there's the non-SACK related block anyway already > > there so it keeps the non-SACK "pollution" off from the SACK code paths. > > I think that's a compelling argument. In particular, in terms of long-term > maintenance it seems risky to allow an unsound/buggy FLAG_ORIG_SACK_ACKED > bit be returned by tcp_clean_rtx_queue(). If we return an > incorrect/imcomplete FLAG_ORIG_SACK_ACKED bit then I worry that one day we > will forget that for non-SACK flows that bit is incorrect/imcomplete, and > we will add code using that bit but forgetting to check (tcp_is_sack(tp) || > !FLAG_RETRANS_DATA_ACKED). Agreed. That's a good point. And I would much preferred to rename that to FLAG_ORIG_PROGRESS (w/ updated comment). so I think we're in agreement to use existing patch w/ the new name FLAG_ORIG_PROGRESS > > > (In addition, I'd actually also like to rename FLAG_ORIG_SACK_ACKED to > > FLAG_ORIG_PROGRESS, the latter is more descriptive about the condition > > we're after regardless of SACK and less ambiguous in non-SACK case). > > I'm neutral on this. Not sure if the extra clarity is worth the code churn. > > cheers, > neal
Re: [PATCH net] arp: fix arp_filter on l3slave devices
On 4/4/18 4:13 AM, Miguel Fadon Perlines wrote: > arp_filter performs an ip_route_output search for arp source address and > > checks if output device is the same where the arp request was received, > > if it is not, the arp request is not answered. > > > > This route lookup is always done on main route table so l3slave devices > > never find the proper route and arp is not answered. > > > > Passing l3mdev_master_ifindex_rcu(dev) return value as oif fixes the > > lookup for l3slave devices while maintaining same behavior for non > > l3slave devices as this function returns 0 in that case. > > > > Signed-off-by: Miguel Fadon Perlines > > --- Miguel: The change looks fine to me. Your mail showed up as html; it needs to be sent as plain text only.
Re: [Intel-wired-lan] [iwl next-queue PATCH 03/10] macvlan: Use software path for offloaded local, broadcast, and multicast traffic
On Wed, Apr 4, 2018 at 9:53 AM, Shannon Nelson wrote: > On 4/3/2018 2:16 PM, Alexander Duyck wrote: >> >> This change makes it so that we use a software path for packets that are >> going to be locally switched between two macvlan interfaces on the same >> device. In addition we resort to software replication of broadcast and >> multicast packets instead of offloading that to hardware. >> >> The general idea is that using the device for east/west traffic local to >> the system is extremely inefficient. We can only support up to whatever >> the >> PCIe limit is for any given device so this caps us at somewhere around 20G >> for devices supported by ixgbe. This is compounded even further when you >> take broadcast and multicast into account as a single 10G port can come to >> a crawl as a packet is replicated up to 60+ times in some cases. In order >> to get away from that I am implementing changes so that we handle >> broadcast/multicast replication and east/west local traffic all in >> software. >> >> Signed-off-by: Alexander Duyck >> --- > > > [...] > >> @@ -4225,6 +4226,13 @@ static void ixgbe_configure_virtualization(struct >> ixgbe_adapter *adapter) >> vmdctl |= IXGBE_VT_CTL_REPLEN; >> IXGBE_WRITE_REG(hw, IXGBE_VT_CTL, vmdctl); >> + /* accept untagged packets until a vlan tag is >> +* specifically set for the VMDQ queue/pool >> +*/ >> + vmolr = IXGBE_VMOLR_AUPE; >> + while (pool--) >> + IXGBE_WRITE_REG(hw, IXGBE_VMOLR(VMDQ_P(pool)), vmolr); >> + > > > It took me a bit to figure out what untagged packets have to do with macvlan > config. You might add to the comment that "no multicast or broadcast is > configured for these queues". Yeah. I can probably address that in the next round of patches. Specifically here all we are accepting are untagged unicast packets. We cannot perform the offload through a VLAN since it would require passing the VLAN along with the L2 MAC address. It is something I plan to address in the near future so that macvlan interfaces stacked on top of a VLAN can still be offloaded. > The driver part of this might be separated into a follow-on patch to make it > clearer that this is done as a consequence of the macvlan.c changes. Or > just roll them into your 4/10 patch. > > sln Actually part of the reason for keeping this all as one thing is because if I split the macvlan bits from the driver bits then things are broken until both patches are applied. I had actually split patch 4/10 out from this patch since it is one of the few things that could safely be pulled out in order to try and reduce the overhead for this patch.
Re: [PATCH] netdevsim: remove incorrect __net_initdata annotations
From: Arnd Bergmann Date: Wed, 4 Apr 2018 18:03:41 +0200 > On Wed, Apr 4, 2018 at 5:52 PM, David Miller wrote: >> From: Arnd Bergmann >> Date: Wed, 4 Apr 2018 14:12:39 +0200 >> >>> The __net_initdata section cannot currently be used for structures that >>> get cleaned up in an exitcall using unregister_pernet_operations: >>> >>> WARNING: vmlinux.o(.text+0x868c34): Section mismatch in reference from the >>> function nsim_devlink_exit() to the (unknown reference) .init.data:(unknown) >>> The function nsim_devlink_exit() references >>> the (unknown reference) __initdata (unknown). >>> This is often because nsim_devlink_exit lacks a __initdata >>> annotation or the annotation of (unknown) is wrong. >>> WARNING: vmlinux.o(.text+0x868c64): Section mismatch in reference from the >>> function nsim_devlink_init() to the (unknown reference) .init.data:(unknown) >>> WARNING: vmlinux.o(.text+0x8692bc): Section mismatch in reference from the >>> function nsim_fib_exit() to the (unknown reference) .init.data:(unknown) >>> WARNING: vmlinux.o(.text+0x869300): Section mismatch in reference from the >>> function nsim_fib_init() to the (unknown reference) .init.data:(unknown) >>> >>> As that warning tells us, discarding the structure after a module is >>> loaded would lead to a undefined behavior when that module is removed. >>> >>> It might be possible to change that annotation so it has no effect for >>> loadable modules, but I have not figured out exactly how to do that, and >>> we want this to be fixed in -rc1. >>> >>> This just removes the annotations, just like we do for all other such >>> modules. >>> >>> Fixes: 37923ed6b8ce ("netdevsim: Add simple FIB resource controller via >>> devlink") >>> Signed-off-by: Arnd Bergmann >> >> Hmmm... >> >> __net_initdata defines to nothing if network namespaces are enabled. >> >> That's the whole point. >> >> Therefore, if NETNS is disabled, "unregister_pernet_operations" cannot >> happen and this is the only time when __net_initdata actually does >> something. > > The problem happens with CONFIG_NETNS=n: __net_initdata turns into > __initdata, and nsim_fib_net_ops is referenced from a function that is > not marked __init (i.e. nsim_fib_exit). > > The logic to turn __net_initdata into __init only makes sense for > subsystems that have no way to be unregistered, i.e. code which is > always built-in, or non-unloadable modules. Ok, I see it. I'll apply this for now, thanks.
Re: [Intel-wired-lan] [iwl next-queue PATCH 02/10] macvlan: Rename fwd_priv to accel_priv and add accessor function
On 4/3/2018 2:16 PM, Alexander Duyck wrote: [...] +static inline void *macvlan_accel_priv(struct net_device *dev) +{ + struct macvlan_dev *macvlan = netdev_priv(dev); + + return macvlan->accel_priv; Perhaps a check for (macvlan == NULL) before using it? sln +} #endif /* _LINUX_IF_MACVLAN_H */ ___ Intel-wired-lan mailing list intel-wired-...@osuosl.org https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
Re: [Intel-wired-lan] [iwl next-queue PATCH 03/10] macvlan: Use software path for offloaded local, broadcast, and multicast traffic
On 4/3/2018 2:16 PM, Alexander Duyck wrote: This change makes it so that we use a software path for packets that are going to be locally switched between two macvlan interfaces on the same device. In addition we resort to software replication of broadcast and multicast packets instead of offloading that to hardware. The general idea is that using the device for east/west traffic local to the system is extremely inefficient. We can only support up to whatever the PCIe limit is for any given device so this caps us at somewhere around 20G for devices supported by ixgbe. This is compounded even further when you take broadcast and multicast into account as a single 10G port can come to a crawl as a packet is replicated up to 60+ times in some cases. In order to get away from that I am implementing changes so that we handle broadcast/multicast replication and east/west local traffic all in software. Signed-off-by: Alexander Duyck --- [...] @@ -4225,6 +4226,13 @@ static void ixgbe_configure_virtualization(struct ixgbe_adapter *adapter) vmdctl |= IXGBE_VT_CTL_REPLEN; IXGBE_WRITE_REG(hw, IXGBE_VT_CTL, vmdctl); + /* accept untagged packets until a vlan tag is +* specifically set for the VMDQ queue/pool +*/ + vmolr = IXGBE_VMOLR_AUPE; + while (pool--) + IXGBE_WRITE_REG(hw, IXGBE_VMOLR(VMDQ_P(pool)), vmolr); + It took me a bit to figure out what untagged packets have to do with macvlan config. You might add to the comment that "no multicast or broadcast is configured for these queues". The driver part of this might be separated into a follow-on patch to make it clearer that this is done as a consequence of the macvlan.c changes. Or just roll them into your 4/10 patch. sln
Re: [PATCH v3 net 2/5] tcp: prevent bogus FRTO undos with non-SACK flows
On Wed, Apr 4, 2018 at 6:35 AM Ilpo Järvinen wrote: > On Wed, 28 Mar 2018, Yuchung Cheng wrote: > > On Tue, Mar 13, 2018 at 3:25 AM, Ilpo Järvinen > > wrote: > > > > > > If SACK is not enabled and the first cumulative ACK after the RTO > > > retransmission covers more than the retransmitted skb, a spurious > > > FRTO undo will trigger (assuming FRTO is enabled for that RTO). > > > The reason is that any non-retransmitted segment acknowledged will > > > set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is > > > no indication that it would have been delivered for real (the > > > scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK > > > case so the check for that bit won't help like it does with SACK). > > > Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo > > > in tcp_process_loss. > > > > > > We need to use more strict condition for non-SACK case and check > > > that none of the cumulatively ACKed segments were retransmitted > > > to prove that progress is due to original transmissions. Only then > > > keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in > > > non-SACK case. > > > > > > Signed-off-by: Ilpo Järvinen > > > --- > > > net/ipv4/tcp_input.c | 9 + > > > 1 file changed, 9 insertions(+) > > > > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > > > index 4a26c09..c60745c 100644 > > > --- a/net/ipv4/tcp_input.c > > > +++ b/net/ipv4/tcp_input.c > > > @@ -3166,6 +3166,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, u32 prior_fack, > > > pkts_acked = rexmit_acked + newdata_acked; > > > > > > tcp_remove_reno_sacks(sk, pkts_acked); > > > + > > > + /* If any of the cumulatively ACKed segments was > > > +* retransmitted, non-SACK case cannot confirm that > > > +* progress was due to original transmission due to > > > +* lack of TCPCB_SACKED_ACKED bits even if some of > > > +* the packets may have been never retransmitted. > > > +*/ > > > + if (flag & FLAG_RETRANS_DATA_ACKED) > > > + flag &= ~FLAG_ORIG_SACK_ACKED; FWIW I'd vote for this version. > Of course I could put the back there but I really like the new place more > (which was a result of your suggestion to place the code elsewhere). > IMHO, it makes more sense to have it in tcp_clean_rtx_queue() because we > weren't successful in proving (there in tcp_clean_rtx_queue) that progress > was due original transmission and thus I would not want falsely indicate > it with that flag. And there's the non-SACK related block anyway already > there so it keeps the non-SACK "pollution" off from the SACK code paths. I think that's a compelling argument. In particular, in terms of long-term maintenance it seems risky to allow an unsound/buggy FLAG_ORIG_SACK_ACKED bit be returned by tcp_clean_rtx_queue(). If we return an incorrect/imcomplete FLAG_ORIG_SACK_ACKED bit then I worry that one day we will forget that for non-SACK flows that bit is incorrect/imcomplete, and we will add code using that bit but forgetting to check (tcp_is_sack(tp) || !FLAG_RETRANS_DATA_ACKED). > (In addition, I'd actually also like to rename FLAG_ORIG_SACK_ACKED to > FLAG_ORIG_PROGRESS, the latter is more descriptive about the condition > we're after regardless of SACK and less ambiguous in non-SACK case). I'm neutral on this. Not sure if the extra clarity is worth the code churn. cheers, neal
Re: [PATCH net] sfc: remove ctpio_dmabuf_start from stats
From: Bert Kenward Date: Wed, 4 Apr 2018 16:40:30 +0100 > The ctpio_dmabuf_start entry is not actually a stat and shouldn't > be exposed to ethtool. > > Fixes: 2c0b6ee837db ("sfc: expose CTPIO stats on NICs that support > them") Please don't break up a long Fixes tag line into multiple lines. I corrected it this time. > Signed-off-by: Bert Kenward Applied, thanks.
Commits going to plain 'net' tree now.
Therefore, please submit bug fixes relative to 'net', thank you.
Re: [PATCH] netdevsim: remove incorrect __net_initdata annotations
On Wed, Apr 4, 2018 at 5:52 PM, David Miller wrote: > From: Arnd Bergmann > Date: Wed, 4 Apr 2018 14:12:39 +0200 > >> The __net_initdata section cannot currently be used for structures that >> get cleaned up in an exitcall using unregister_pernet_operations: >> >> WARNING: vmlinux.o(.text+0x868c34): Section mismatch in reference from the >> function nsim_devlink_exit() to the (unknown reference) .init.data:(unknown) >> The function nsim_devlink_exit() references >> the (unknown reference) __initdata (unknown). >> This is often because nsim_devlink_exit lacks a __initdata >> annotation or the annotation of (unknown) is wrong. >> WARNING: vmlinux.o(.text+0x868c64): Section mismatch in reference from the >> function nsim_devlink_init() to the (unknown reference) .init.data:(unknown) >> WARNING: vmlinux.o(.text+0x8692bc): Section mismatch in reference from the >> function nsim_fib_exit() to the (unknown reference) .init.data:(unknown) >> WARNING: vmlinux.o(.text+0x869300): Section mismatch in reference from the >> function nsim_fib_init() to the (unknown reference) .init.data:(unknown) >> >> As that warning tells us, discarding the structure after a module is >> loaded would lead to a undefined behavior when that module is removed. >> >> It might be possible to change that annotation so it has no effect for >> loadable modules, but I have not figured out exactly how to do that, and >> we want this to be fixed in -rc1. >> >> This just removes the annotations, just like we do for all other such >> modules. >> >> Fixes: 37923ed6b8ce ("netdevsim: Add simple FIB resource controller via >> devlink") >> Signed-off-by: Arnd Bergmann > > Hmmm... > > __net_initdata defines to nothing if network namespaces are enabled. > > That's the whole point. > > Therefore, if NETNS is disabled, "unregister_pernet_operations" cannot > happen and this is the only time when __net_initdata actually does > something. The problem happens with CONFIG_NETNS=n: __net_initdata turns into __initdata, and nsim_fib_net_ops is referenced from a function that is not marked __init (i.e. nsim_fib_exit). The logic to turn __net_initdata into __init only makes sense for subsystems that have no way to be unregistered, i.e. code which is always built-in, or non-unloadable modules. Arnd
Re: [PATCH net-next 2/2] inet: frags: fix ip6frag_low_thresh boundary
From: Eric Dumazet Date: Wed, 4 Apr 2018 08:35:10 -0700 > Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches, > since linker might place next to it a non zero value preventing a change > to ip6frag_low_thresh. > > ip6frag_low_thresh is not used anymore in the kernel, but we do not > want to prematuraly break user scripts wanting to change it. > > Since specifying a minimal value of 0 for proc_doulongvec_minmax() > is moot, let's remove these zero values in all defrag units. > > Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh") > Signed-off-by: Eric Dumazet > Reported-by: Maciej Żenczykowski Applied, thanks Eric.
Re: [PATCH net-next] vxlan: add ttl inherit support
From: Hangbin Liu Date: Wed, 4 Apr 2018 21:26:21 +0800 > Like tos inherit, ttl inherit should also means inherit the inner protocol's > ttl values, which actually not implemented in vxlan yet. > > But we could not treat ttl == 0 as "use the inner TTL", because that would be > used also when the "ttl" option is not specified and that would be a behavior > change, and breaking real use cases. > > So add a different attribute IFLA_VXLAN_TTL_INHERIT when "ttl inherit" is > specified with ip cmd. > > Reported-by: Jianlin Shi > Suggested-by: Jiri Benc > Signed-off-by: Hangbin Liu The net-next tree is closed, please resubmit this when the net-next tree opens back up. Thank you.
Re: WARNING: refcount bug in should_fail
Al Viro writes: > On Mon, Apr 02, 2018 at 10:59:34PM +0100, Al Viro wrote: > >> FWIW, I'm going through the ->kill_sb() instances, fixing that sort >> of bugs (most of them preexisting, but I should've checked instead >> of assuming that everything's fine). Will push out later tonight. > > OK, see vfs.git#for-linus. Caught: 4 old bugs (allocation failure > in fill_super oopses ->kill_sb() in hypfs, jffs2 and orangefs resp. > and double-dput in late failure exit in rpc_fill_super()) > and 5 regressions from register_shrinker() failure recovery. One issue with your vfs.git#for-linus branch. It is missing Fixes tags and Cc: stable on those patches. As the bug came in v4.15 those tags would really help the stable maintainers get the recent regression fixes applied. Eric
Re: [net-next 1/1] tipc: Fix namespace violation in tipc_sk_fill_sock_diag
From: GhantaKrishnamurthy MohanKrishna Date: Wed, 4 Apr 2018 14:49:47 +0200 > To fetch UID info for socket diagnostics, we determine the > namespace of user context using tipc socket instance. This > may cause namespace violation, as the kernel will remap based > on UID. > > We fix this by fetching namespace info using the calling userspace > netlink socket. > > Fixes: c30b70deb5f4 (tipc: implement socket diagnostics for AF_TIPC) > Reported-by: syzbot+326e587eff1074657...@syzkaller.appspotmail.com > Acked-by: Jon Maloy > Signed-off-by: GhantaKrishnamurthy MohanKrishna > Applied, thank you.
Re: [PATCH net-next] net: avoid unneeded atomic operation in ip*_append_data()
From: Paolo Abeni Date: Wed, 4 Apr 2018 14:30:01 +0200 > After commit 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates > done by __ip_append_data()") and commit 1f4c6eb24029 ("ipv6: > factorize sk_wmem_alloc updates done by __ip6_append_data()"), > when transmitting sub MTU datagram, an addtional, unneeded atomic > operation is performed in ip*_append_data() to update wmem_alloc: > in the above condition the delta is 0. > > The above cause small but measurable performance regression in UDP > xmit tput test with packet size below MTU. > > This change avoids such overhead updating wmem_alloc only if > wmem_alloc_delta is non zero. > > The error path is left intentionally unmodified: it's a slow path > and simplicity is preferred to performances. > > Fixes: 694aba690de0 ("ipv4: factorize sk_wmem_alloc updates done by > __ip_append_data()") > Fixes: 1f4c6eb24029 ("ipv6: factorize sk_wmem_alloc updates done by > __ip6_append_data()") > Signed-off-by: Paolo Abeni ... > - refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc); > + if (wmem_alloc_delta) > + refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc); ... > - refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc); > + if (wmem_alloc_delta) > + refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc); This is simple enough, so applied. But I wonder if atomic_{add,sub} and refcount_{add,sub}() should just check for zero inline, just like the {set,clear}_bit() implementations avoid the atomic operation if the bit already has the desired value.
Re: [PATCH] netdevsim: remove incorrect __net_initdata annotations
From: Arnd Bergmann Date: Wed, 4 Apr 2018 14:12:39 +0200 > The __net_initdata section cannot currently be used for structures that > get cleaned up in an exitcall using unregister_pernet_operations: > > WARNING: vmlinux.o(.text+0x868c34): Section mismatch in reference from the > function nsim_devlink_exit() to the (unknown reference) .init.data:(unknown) > The function nsim_devlink_exit() references > the (unknown reference) __initdata (unknown). > This is often because nsim_devlink_exit lacks a __initdata > annotation or the annotation of (unknown) is wrong. > WARNING: vmlinux.o(.text+0x868c64): Section mismatch in reference from the > function nsim_devlink_init() to the (unknown reference) .init.data:(unknown) > WARNING: vmlinux.o(.text+0x8692bc): Section mismatch in reference from the > function nsim_fib_exit() to the (unknown reference) .init.data:(unknown) > WARNING: vmlinux.o(.text+0x869300): Section mismatch in reference from the > function nsim_fib_init() to the (unknown reference) .init.data:(unknown) > > As that warning tells us, discarding the structure after a module is > loaded would lead to a undefined behavior when that module is removed. > > It might be possible to change that annotation so it has no effect for > loadable modules, but I have not figured out exactly how to do that, and > we want this to be fixed in -rc1. > > This just removes the annotations, just like we do for all other such > modules. > > Fixes: 37923ed6b8ce ("netdevsim: Add simple FIB resource controller via > devlink") > Signed-off-by: Arnd Bergmann Hmmm... __net_initdata defines to nothing if network namespaces are enabled. That's the whole point. Therefore, if NETNS is disabled, "unregister_pernet_operations" cannot happen and this is the only time when __net_initdata actually does something.
Re: [PATCH 40/47] netfilter: nf_tables: build-in filter chain type
On Wed, Apr 4, 2018 at 5:46 PM, Pablo Neira Ayuso wrote: > On Wed, Apr 04, 2018 at 05:38:31PM +0200, Arnd Bergmann wrote: >> On Fri, Mar 30, 2018 at 1:46 PM, Pablo Neira Ayuso >> wrote: >> > One module per supported filter chain family type takes too much memory >> > for very little code - too much modularization - place all chain filter >> > definitions in one single file. >> > >> > Signed-off-by: Pablo Neira Ayuso >> >> Hi Pablo, >> >> I've bisected a link error to this patch: >> >> net/netfilter/nft_reject_inet.o: In function `nft_reject_inet_eval': >> nft_reject_inet.c:(.text+0xa7): undefined reference to `nf_send_unreach6' >> nft_reject_inet.c:(.text+0x10c): undefined reference to `nf_send_unreach6' >> nft_reject_inet.c:(.text+0x138): undefined reference to `nf_send_reset6' >> >> Unfortunately I don't immediately see what went wrong, maybe you >> can spot it. > > Can you pass me your .config file? I will have a look at it. Thanks. > Looks like some missing Kconfig stuff. Sure, see https://pastebin.com/raw/yF9Y5Mmn Arnd
Re: [Cluster-devel] [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek
On Wed, Apr 04, 2018 at 11:46:28AM -0400, Bob Peterson wrote: > > The patches look good. The big question is whether to add them to this > merge window while it's still open. Opinions? We're still hashing out the rhashtable interface so I don't think now is the time to rush things. Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] mac80211: Fix bad line warning
From: Masanari Iida Date: Wed, 4 Apr 2018 20:53:33 +0900 > After 03fe2debbb2771fb90881e merged during 4.17 marge window, > I start to see following warning during "make xmldocs" > > ./include/net/mac80211.h:2083: warning: bad line: > > > Replace ">" with "*" fix the issue. > > Signed-off-by: Masanari Iida Adding linux-wireless to CC: > --- > include/net/mac80211.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/include/net/mac80211.h b/include/net/mac80211.h > index d2279b2d61aa..b2f3a0c018e7 100644 > --- a/include/net/mac80211.h > +++ b/include/net/mac80211.h > @@ -2080,7 +2080,7 @@ struct ieee80211_txq { > * virtual interface might not be given air time for the transmission of > * the frame, as it is not synced with the AP/P2P GO yet, and thus the > * deauthentication frame might not be transmitted. > - > > + * > * @IEEE80211_HW_DOESNT_SUPPORT_QOS_NDP: The driver (or firmware) doesn't > * support QoS NDP for AP probing - that's most likely a driver bug. > * > -- > 2.17.0.rc2.3.gc2a499e6c31e >
Re: [PATCH] [net] nvmem: disallow modular CONFIG_NVMEM
From: Arnd Bergmann Date: Wed, 4 Apr 2018 12:38:40 +0200 > The new of_get_nvmem_mac_address() helper function causes a link error > with CONFIG_NVMEM=m: > > drivers/of/of_net.o: In function `of_get_nvmem_mac_address': > of_net.c:(.text+0x168): undefined reference to `of_nvmem_cell_get' > of_net.c:(.text+0x19c): undefined reference to `nvmem_cell_read' > of_net.c:(.text+0x1a8): undefined reference to `nvmem_cell_put' > > I could not come up with a good solution for this, as the code is always > built-in. Using an #if IS_REACHABLE() check around it would solve the > link time issue but then stop it from working in that configuration. > Making of_nvmem_cell_get() an inline function could also solve that, but > seems a bit ugly since it's somewhat larger than most inline functions, > and it would just bring that problem into the callers. Splitting the > function into a separate file might be an alternative. > > This uses the big hammer by making CONFIG_NVMEM itself a 'bool' symbol, > which avoids the problem entirely but makes the vmlinux larger for anyone > that might use NVMEM support but doesn't need it built-in otherwise. > > Fixes: 9217e566bdee ("of_net: Implement of_get_nvmem_mac_address helper") > Cc: Mike Looijmans > Cc: Florian Fainelli > Cc: Andrew Lunn > Cc: David S. Miller > Cc: netdev@vger.kernel.org > Signed-off-by: Arnd Bergmann > --- > The problem arrived through the networking tree, but it's now in > mainline, so the fix could go through either tree Ok, applied, thanks Arnd.
Re: [PATCH] net: hns3: fix length overflow when CONFIG_ARM64_64K_PAGES
From: Tan Xiaojun Date: Wed, 4 Apr 2018 17:40:48 +0800 > When enable the config item "CONFIG_ARM64_64K_PAGES", the size of PAGE_SIZE > is 65536(64K). But the type of length is u16, it will overflow. So change it > to u32. > > Signed-off-by: Tan Xiaojun Applied, thanks.
Re: [PATCH 40/47] netfilter: nf_tables: build-in filter chain type
On Wed, Apr 04, 2018 at 05:38:31PM +0200, Arnd Bergmann wrote: > On Fri, Mar 30, 2018 at 1:46 PM, Pablo Neira Ayuso > wrote: > > One module per supported filter chain family type takes too much memory > > for very little code - too much modularization - place all chain filter > > definitions in one single file. > > > > Signed-off-by: Pablo Neira Ayuso > > Hi Pablo, > > I've bisected a link error to this patch: > > net/netfilter/nft_reject_inet.o: In function `nft_reject_inet_eval': > nft_reject_inet.c:(.text+0xa7): undefined reference to `nf_send_unreach6' > nft_reject_inet.c:(.text+0x10c): undefined reference to `nf_send_unreach6' > nft_reject_inet.c:(.text+0x138): undefined reference to `nf_send_reset6' > > Unfortunately I don't immediately see what went wrong, maybe you > can spot it. Can you pass me your .config file? I will have a look at it. Thanks. Looks like some missing Kconfig stuff.
Re: [PATCH net] nfp: use full 40 bits of the NSP buffer address
From: Jakub Kicinski Date: Tue, 3 Apr 2018 17:24:23 -0700 > From: Dirk van der Merwe > > The NSP default buffer is a piece of NFP memory where additional > command data can be placed. Its format has been copied from > host buffer, but the PCIe selection bits do not make sense in > this case. If those get masked out from a NFP address - writes > to random place in the chip memory may be issued and crash the > device. > > Even in the general NSP buffer case, it doesn't make sense to have the > PCIe selection bits there anymore. These are unused at the moment, and > when it becomes necessary, the PCIe selection bits should rather be > moved to another register to utilise more bits for the buffer address. > > This has never been an issue because the buffer used to be > allocated in memory with less-than-38-bit-long address but that > is about to change. > > Fixes: 1a64821c6af7 ("nfp: add support for service processor access") > Signed-off-by: Dirk van der Merwe > Reviewed-by: Jakub Kicinski Applied and queued up for -stable, thanks.
Re: [Cluster-devel] [PATCH v2 0/2] gfs2: Stop using rhashtable_walk_peek
- Original Message - > Here's a second version of the patch (now a patch set) to eliminate > rhashtable_walk_peek in gfs2. > > The first patch introduces lockref_put_not_zero, the inverse of > lockref_get_not_zero. > > The second patch eliminates rhashtable_walk_peek in gfs2. In > gfs2_glock_iter_next, the new lockref function from patch one is used to > drop a lockref count as long as the count doesn't drop to zero. This is > almost always the case; if there is a risk of dropping the last > reference, we must defer that to a work queue because dropping the last > reference may sleep. > > Thanks, > Andreas > > Andreas Gruenbacher (2): > lockref: Add lockref_put_not_zero > gfs2: Stop using rhashtable_walk_peek > > fs/gfs2/glock.c | 47 --- > include/linux/lockref.h | 1 + > lib/lockref.c | 28 > 3 files changed, 57 insertions(+), 19 deletions(-) > > -- > 2.14.3 Hi, The patches look good. The big question is whether to add them to this merge window while it's still open. Opinions? Acked-by: Bob Peterson Regards, Bob Peterson
Re: [PATCH v2] lan78xx: Connect phy early
From: Alexander Graf Date: Wed, 4 Apr 2018 00:19:35 +0200 > When using wicked with a lan78xx device attached to the system, we > end up with ethtool commands issued on the device before an ifup > got issued. That lead to the following crash: > ... > The culprit is quite simple: The driver tries to access the phy left and > right, > but only actually has a working reference to it when the device is up. > > The fix thus is quite simple too: Get a reference to the phy on probe already > and keep it even when the device is going down. > > With this patch applied, I can successfully run wicked on my system and bring > the interface up and down as many times as I want, without getting NULL > pointer > dereferences in between. > > Signed-off-by: Alexander Graf Applied, thank you.
Re: [PATCH] netdevsim: remove incorrect __net_initdata annotations
On Wed, Apr 4, 2018 at 5:33 PM, David Ahern wrote: > On 4/4/18 6:12 AM, Arnd Bergmann wrote: >> The __net_initdata section cannot currently be used for structures that >> get cleaned up in an exitcall using unregister_pernet_operations: >> > > I am confused ... I do the same thing in the VRF driver and that code > has been there since June 2017. I don't see any exit call in that driver, the only caller of unregister_pernet_operations() there is in the init function itself, and the module cannot be removed. Arnd
[PATCH net] sfc: remove ctpio_dmabuf_start from stats
The ctpio_dmabuf_start entry is not actually a stat and shouldn't be exposed to ethtool. Fixes: 2c0b6ee837db ("sfc: expose CTPIO stats on NICs that support them") Signed-off-by: Bert Kenward --- drivers/net/ethernet/sfc/ef10.c | 2 -- drivers/net/ethernet/sfc/nic.h | 1 - 2 files changed, 3 deletions(-) diff --git a/drivers/net/ethernet/sfc/ef10.c b/drivers/net/ethernet/sfc/ef10.c index c4c45c94da77..50daad0a1482 100644 --- a/drivers/net/ethernet/sfc/ef10.c +++ b/drivers/net/ethernet/sfc/ef10.c @@ -1666,7 +1666,6 @@ static const struct efx_hw_stat_desc efx_ef10_stat_desc[EF10_STAT_COUNT] = { EF10_DMA_STAT(fec_corrected_symbols_lane1, FEC_CORRECTED_SYMBOLS_LANE1), EF10_DMA_STAT(fec_corrected_symbols_lane2, FEC_CORRECTED_SYMBOLS_LANE2), EF10_DMA_STAT(fec_corrected_symbols_lane3, FEC_CORRECTED_SYMBOLS_LANE3), - EF10_DMA_STAT(ctpio_dmabuf_start, CTPIO_DMABUF_START), EF10_DMA_STAT(ctpio_vi_busy_fallback, CTPIO_VI_BUSY_FALLBACK), EF10_DMA_STAT(ctpio_long_write_success, CTPIO_LONG_WRITE_SUCCESS), EF10_DMA_STAT(ctpio_missing_dbell_fail, CTPIO_MISSING_DBELL_FAIL), @@ -1777,7 +1776,6 @@ static const struct efx_hw_stat_desc efx_ef10_stat_desc[EF10_STAT_COUNT] = { * These bits are in the second u64 of the raw mask. */ #define EF10_CTPIO_STAT_MASK ( \ - (1ULL << (EF10_STAT_ctpio_dmabuf_start - 64)) | \ (1ULL << (EF10_STAT_ctpio_vi_busy_fallback - 64)) | \ (1ULL << (EF10_STAT_ctpio_long_write_success - 64)) | \ (1ULL << (EF10_STAT_ctpio_missing_dbell_fail - 64)) | \ diff --git a/drivers/net/ethernet/sfc/nic.h b/drivers/net/ethernet/sfc/nic.h index 5640034bda10..5cca0556b47f 100644 --- a/drivers/net/ethernet/sfc/nic.h +++ b/drivers/net/ethernet/sfc/nic.h @@ -332,7 +332,6 @@ enum { EF10_STAT_fec_corrected_symbols_lane1, EF10_STAT_fec_corrected_symbols_lane2, EF10_STAT_fec_corrected_symbols_lane3, - EF10_STAT_ctpio_dmabuf_start, EF10_STAT_ctpio_vi_busy_fallback, EF10_STAT_ctpio_long_write_success, EF10_STAT_ctpio_missing_dbell_fail, -- 2.13.6
Re: [PATCH net-next 2/2] inet: frags: fix ip6frag_low_thresh boundary
On 04/04/2018 08:35 AM, Eric Dumazet wrote: > Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches, > since linker might place next to it a non zero value preventing a change > to ip6frag_low_thresh. > > ip6frag_low_thresh is not used anymore in the kernel, but we do not > want to prematuraly break user scripts wanting to change it. > > Since specifying a minimal value of 0 for proc_doulongvec_minmax() > is moot, let's remove these zero values in all defrag units. > > Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh") > Signed-off-by: Eric Dumazet > Reported-by: Maciej Żenczykowski David, please ignore the net-next 2/2 in the title, there is a single patch, targeting net tree, sorry for the confusion.
Re: [PATCH 40/47] netfilter: nf_tables: build-in filter chain type
On Fri, Mar 30, 2018 at 1:46 PM, Pablo Neira Ayuso wrote: > One module per supported filter chain family type takes too much memory > for very little code - too much modularization - place all chain filter > definitions in one single file. > > Signed-off-by: Pablo Neira Ayuso Hi Pablo, I've bisected a link error to this patch: net/netfilter/nft_reject_inet.o: In function `nft_reject_inet_eval': nft_reject_inet.c:(.text+0xa7): undefined reference to `nf_send_unreach6' nft_reject_inet.c:(.text+0x10c): undefined reference to `nf_send_unreach6' nft_reject_inet.c:(.text+0x138): undefined reference to `nf_send_reset6' Unfortunately I don't immediately see what went wrong, maybe you can spot it. Arnd
Re: [PATCH net-next] nfp: add a separate counter for packets with CHECKSUM_COMPLETE
From: Jakub Kicinski Date: Mon, 2 Apr 2018 13:31:20 -0700 > We are currently counting packets with CHECKSUM_COMPLETE as > "hw_rx_csum_ok". This is confusing. Add a new counter. > To make sure it fits in the same cacheline move the less used > error counter to a different location. > > Signed-off-by: Jakub Kicinski > Reviewed-by: Dirk van der Merwe Applied, thank you.
[PATCH net-next 2/2] inet: frags: fix ip6frag_low_thresh boundary
Giving an integer to proc_doulongvec_minmax() is dangerous on 64bit arches, since linker might place next to it a non zero value preventing a change to ip6frag_low_thresh. ip6frag_low_thresh is not used anymore in the kernel, but we do not want to prematuraly break user scripts wanting to change it. Since specifying a minimal value of 0 for proc_doulongvec_minmax() is moot, let's remove these zero values in all defrag units. Fixes: 6e00f7dd5e4e ("ipv6: frags: fix /proc/sys/net/ipv6/ip6frag_low_thresh") Signed-off-by: Eric Dumazet Reported-by: Maciej Żenczykowski --- net/ieee802154/6lowpan/reassembly.c | 2 -- net/ipv4/ip_fragment.c | 5 ++--- net/ipv6/netfilter/nf_conntrack_reasm.c | 2 -- net/ipv6/reassembly.c | 2 -- 4 files changed, 2 insertions(+), 9 deletions(-) diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c index 44f148a6bb57bb737c35a07e3f070472e209ea23..1790b65944b3ee188608b1a76d4f9a42fb1479d5 100644 --- a/net/ieee802154/6lowpan/reassembly.c +++ b/net/ieee802154/6lowpan/reassembly.c @@ -411,7 +411,6 @@ int lowpan_frag_rcv(struct sk_buff *skb, u8 frag_type) } #ifdef CONFIG_SYSCTL -static long zero; static struct ctl_table lowpan_frags_ns_ctl_table[] = { { @@ -428,7 +427,6 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = { .maxlen = sizeof(unsigned long), .mode = 0644, .proc_handler = proc_doulongvec_minmax, - .extra1 = &zero, .extra2 = &init_net.ieee802154_lowpan.frags.high_thresh }, { diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c index 994fa70a910f472ebecc336ddd62d1442014eaba..8e9528ebaa8e1af91172466cc161a83301a217ca 100644 --- a/net/ipv4/ip_fragment.c +++ b/net/ipv4/ip_fragment.c @@ -667,7 +667,7 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user) EXPORT_SYMBOL(ip_check_defrag); #ifdef CONFIG_SYSCTL -static long zero; +static int dist_min; static struct ctl_table ip4_frags_ns_ctl_table[] = { { @@ -684,7 +684,6 @@ static struct ctl_table ip4_frags_ns_ctl_table[] = { .maxlen = sizeof(unsigned long), .mode = 0644, .proc_handler = proc_doulongvec_minmax, - .extra1 = &zero, .extra2 = &init_net.ipv4.frags.high_thresh }, { @@ -700,7 +699,7 @@ static struct ctl_table ip4_frags_ns_ctl_table[] = { .maxlen = sizeof(int), .mode = 0644, .proc_handler = proc_dointvec_minmax, - .extra1 = &zero + .extra1 = &dist_min, }, { } }; diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c index 3622aac343aea3cd0d8165d8681f42595735cd90..5e0332014c1738999e680c1853829f384e880284 100644 --- a/net/ipv6/netfilter/nf_conntrack_reasm.c +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c @@ -55,7 +55,6 @@ static const char nf_frags_cache_name[] = "nf-frags"; static struct inet_frags nf_frags; #ifdef CONFIG_SYSCTL -static long zero; static struct ctl_table nf_ct_frag6_sysctl_table[] = { { @@ -71,7 +70,6 @@ static struct ctl_table nf_ct_frag6_sysctl_table[] = { .maxlen = sizeof(unsigned long), .mode = 0644, .proc_handler = proc_doulongvec_minmax, - .extra1 = &zero, .extra2 = &init_net.nf_frag.frags.high_thresh }, { diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c index 70e4a578b2fb6b2941daeebe4f476214953db047..4979610287e262a873b807a06ac980a17cd1101b 100644 --- a/net/ipv6/reassembly.c +++ b/net/ipv6/reassembly.c @@ -548,7 +548,6 @@ static const struct inet6_protocol frag_protocol = { }; #ifdef CONFIG_SYSCTL -static int zero; static struct ctl_table ip6_frags_ns_ctl_table[] = { { @@ -565,7 +564,6 @@ static struct ctl_table ip6_frags_ns_ctl_table[] = { .maxlen = sizeof(unsigned long), .mode = 0644, .proc_handler = proc_doulongvec_minmax, - .extra1 = &zero, .extra2 = &init_net.ipv6.frags.high_thresh }, { -- 2.17.0.484.g0c8726318c-goog
Re: [PATCH] netdevsim: remove incorrect __net_initdata annotations
On 4/4/18 6:12 AM, Arnd Bergmann wrote: > The __net_initdata section cannot currently be used for structures that > get cleaned up in an exitcall using unregister_pernet_operations: > > WARNING: vmlinux.o(.text+0x868c34): Section mismatch in reference from the > function nsim_devlink_exit() to the (unknown reference) .init.data:(unknown) > The function nsim_devlink_exit() references > the (unknown reference) __initdata (unknown). > This is often because nsim_devlink_exit lacks a __initdata > annotation or the annotation of (unknown) is wrong. > WARNING: vmlinux.o(.text+0x868c64): Section mismatch in reference from the > function nsim_devlink_init() to the (unknown reference) .init.data:(unknown) > WARNING: vmlinux.o(.text+0x8692bc): Section mismatch in reference from the > function nsim_fib_exit() to the (unknown reference) .init.data:(unknown) > WARNING: vmlinux.o(.text+0x869300): Section mismatch in reference from the > function nsim_fib_init() to the (unknown reference) .init.data:(unknown) > > As that warning tells us, discarding the structure after a module is > loaded would lead to a undefined behavior when that module is removed. > > It might be possible to change that annotation so it has no effect for > loadable modules, but I have not figured out exactly how to do that, and > we want this to be fixed in -rc1. > > This just removes the annotations, just like we do for all other such > modules. > > Fixes: 37923ed6b8ce ("netdevsim: Add simple FIB resource controller via > devlink") > Signed-off-by: Arnd Bergmann > --- > drivers/net/netdevsim/devlink.c | 2 +- > drivers/net/netdevsim/fib.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/netdevsim/devlink.c b/drivers/net/netdevsim/devlink.c > index 27ae05c5fdaf..1dba47936456 100644 > --- a/drivers/net/netdevsim/devlink.c > +++ b/drivers/net/netdevsim/devlink.c > @@ -267,7 +267,7 @@ static int __net_init nsim_devlink_netns_init(struct net > *net) > return 0; > } > > -static struct pernet_operations nsim_devlink_net_ops __net_initdata = { > +static struct pernet_operations nsim_devlink_net_ops = { > .init = nsim_devlink_netns_init, > .id = &nsim_devlink_id, > .size = sizeof(bool), > diff --git a/drivers/net/netdevsim/fib.c b/drivers/net/netdevsim/fib.c > index 0d105bafa261..9bfe9e151e13 100644 > --- a/drivers/net/netdevsim/fib.c > +++ b/drivers/net/netdevsim/fib.c > @@ -230,7 +230,7 @@ static int __net_init nsim_fib_netns_init(struct net *net) > return 0; > } > > -static struct pernet_operations nsim_fib_net_ops __net_initdata = { > +static struct pernet_operations nsim_fib_net_ops = { > .init = nsim_fib_netns_init, > .id = &nsim_fib_net_id, > .size = sizeof(struct nsim_fib_data), > I am confused ... I do the same thing in the VRF driver and that code has been there since June 2017.
Re: [net-next 1/1] tipc: Fix missing list initializations in struct tipc_subscription
From: Jon Maloy Date: Tue, 3 Apr 2018 19:11:19 +0200 > When an item of struct tipc_subscription is created, we fail to > initialize the two lists aggregated into the struct. This has so far > never been a problem, since the items are just added to a root > object by list_add(), which does not require the addee list to be > pre-initialized. However, syzbot is provoking situations where this > addition fails, whereupon the attempted removal if the item from > the list causes a crash. > > This problem seems to always have been around, despite that the code > for creating this object was rewritten in commit 242e82cc95f6 ("tipc: > collapse subscription creation functions"), which is still in net-next. > > We fix this for that commit by initializing the two lists properly. > > Fixes: 242e82cc95f6 ("tipc: collapse subscription creation functions") > Reported-by: syzbot+0bb443b74ce09197e...@syzkaller.appspotmail.com > Signed-off-by: Jon Maloy Applied, thanks Jon.
Re: [PATCH net v6 0/4] ipv6: udp: set dst cache for a connected sk if current not valid
From: Alexey Kodanev Date: Tue, 3 Apr 2018 15:00:06 +0300 > A new RTF_CACHE route can be created with the socket's dst cache > update between the below calls in udpv6_sendmsg(), when datagram > sending results to ICMPV6_PKT_TOOBIG error: > >dst = ip6_sk_dst_lookup_flow(...) >... > release_dst: > if (dst) { > if (connected) { > ip6_dst_store(sk, dst) > > Therefore, the new socket's dst cache reset to the old one on > "release_dst:". > > The first three patches prepare the code to store dst cache > with ip6_sk_dst_lookup_flow(): ... > The last patch contains the actual fix that removes sk dst cache > update in the end of udpv6_sendmsg(), and allows to do it in > ip6_sk_dst_lookup_flow(). Looks good, applied and queued up for -stable. Thanks!
Re: [PATCH net-next 5/5] macvlan/macvtap: Add support for SCTP checksum offload.
On Mon, 2018-04-02 at 09:40 -0400, Vladislav Yasevich wrote: > Since we now have support for software CRC32c offload, turn it on > for macvlan and macvtap devices so that guests can take advantage > of offload SCTP checksums to the host or host hardware. > > Signed-off-by: Vladislav Yasevich > --- > drivers/net/macvlan.c | 5 +++-- > drivers/net/tap.c | 8 +--- > include/uapi/linux/if_tun.h | 1 + > 3 files changed, 9 insertions(+), 5 deletions(-) > I think also ipvlan can improve its SCTP performance with a similar patch. WDYT? thanks, -- davide
Re: [PATCH v2] net: phy: marvell10g: add thermal hwmon device
From: Russell King Date: Tue, 03 Apr 2018 10:31:45 +0100 > Add a thermal monitoring device for the Marvell 88x3310, which updates > once a second. We also need to hook into the suspend/resume mechanism > to ensure that the thermal monitoring is reconfigured when we resume. > > Suggested-by: Andrew Lunn > Signed-off-by: Russell King > --- > v2: update to apply to net-next Applied, thanks for respinning.
Re: [PATCH net-next] pptp: remove a buggy dst release in pptp_connect()
From: Eric Dumazet Date: Mon, 2 Apr 2018 18:48:37 -0700 > Once dst has been cached in socket via sk_setup_caps(), > it is illegal to call ip_rt_put() (or dst_release()), > since sk_setup_caps() did not change dst refcount. > > We can still dereference it since we hold socket lock. > > Caugth by syzbot : ... > Signed-off-by: Eric Dumazet Applied and queued up for -stable, thanks Eric.
Re: [PATCH net] net: dsa: b53: Fix sparse warnings in b53_mmap.c
From: Florian Fainelli Date: Mon, 2 Apr 2018 16:17:01 -0700 > sparse complains about the following warnings: > > drivers/net/dsa/b53/b53_mmap.c:33:31: warning: incorrect type in > initializer (different address spaces) > drivers/net/dsa/b53/b53_mmap.c:33:31:expected unsigned char > [noderef] [usertype] *regs > drivers/net/dsa/b53/b53_mmap.c:33:31:got void *priv > > and indeed, while what we are doing is functional, we are dereferencing > a void * pointer into a void __iomem * which is not great. Just use the > defined b53_mmap_priv structure which holds our register base and use > that. > > Fixes: 967dd82ffc52 ("net: dsa: b53: Add support for Broadcom RoboSwitch") > Signed-off-by: Florian Fainelli Applied.