Re: [PATCH bpf 2/3] bpf: fix build issues on um due to missing bpf_perf_event.h
On Tuesday, 12 December 2017, 02:25:31 CET, Daniel Borkmann wrote:
> Since c895f6f703ad ("bpf: correct broken uapi for
> BPF_PROG_TYPE_PERF_EVENT program type") um (uml) won't build
> on i386 or x86_64:
>
>   [...]
>     CC      init/main.o
>   In file included from ../include/linux/perf_event.h:18:0,
>                    from ../include/linux/trace_events.h:10,
>                    from ../include/trace/syscall.h:7,
>                    from ../include/linux/syscalls.h:82,
>                    from ../init/main.c:20:
>   ../include/uapi/linux/bpf_perf_event.h:11:32: fatal error:
>   asm/bpf_perf_event.h: No such file or directory
>    #include
>   [...]
>
> Let's add the missing bpf_perf_event.h to the um arch as well. This
> seems to be the only one still missing.
>
> Fixes: c895f6f703ad ("bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type")
> Reported-by: Randy Dunlap
> Suggested-by: Richard Weinberger
> Signed-off-by: Daniel Borkmann
> Tested-by: Randy Dunlap
> Cc: Hendrik Brueckner
> Cc: Richard Weinberger
> Acked-by: Alexei Starovoitov
> ---
>  arch/um/include/asm/Kbuild | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
> index 50a32c3..73c57f6 100644
> --- a/arch/um/include/asm/Kbuild
> +++ b/arch/um/include/asm/Kbuild
> @@ -1,4 +1,5 @@
>  generic-y += barrier.h
> +generic-y += bpf_perf_event.h
>  generic-y += bug.h
>  generic-y += clkdev.h
>  generic-y += current.h

Acked-by: Richard Weinberger

Thanks,
//richard

-- 
sigma star gmbh - Eduard-Bodem-Gasse 6 - 6020 Innsbruck - Austria
ATU66964118 - FN 374287y
Re: [PATCH net-next 2/3] net: dsa: mediatek: combine MediaTek tag with VLAN tag
On Thu, 2017-12-07 at 16:30 +0100, Andrew Lunn wrote:
> > @@ -25,20 +28,37 @@ static struct sk_buff *mtk_tag_xmit(struct sk_buff *skb,
> >  {
> >          struct dsa_port *dp = dsa_slave_to_port(dev);
> >          u8 *mtk_tag;
> > +        bool is_vlan_skb = true;
>
> ..
>
> > +        /* Mark tag attribute on special tag insertion to notify hardware
> > +         * whether that's a combined special tag with 802.1Q header.
> > +         */
> > +        mtk_tag[0] = is_vlan_skb ? MTK_HDR_XMIT_TAGGED_TPID_8100 :
> > +                     MTK_HDR_XMIT_UNTAGGED;
> >          mtk_tag[1] = (1 << dp->index) & MTK_HDR_XMIT_DP_BIT_MASK;
> > -        mtk_tag[2] = 0;
> > -        mtk_tag[3] = 0;
> > +
> > +        /* Tag control information is kept for 802.1Q */
> > +        if (!is_vlan_skb) {
> > +                mtk_tag[2] = 0;
> > +                mtk_tag[3] = 0;
> > +        }
> >
> >          return skb;
> >  }
>
> Hi Sean
>
> So you can mark a packet for egress. What about ingress? How do you
> know the VLAN/PORT combination for packets the CPU receives? I would
> have expected a similar change to mtk_tag_rcv().
>
>    Andrew

Hi Andrew,

No extra handling is needed in mtk_tag_rcv() when a VLAN tag is present:
the VLAN tag is placed after the special tag, so the existing parsing
path handles it as before.

Sean
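For readers following the egress logic in the quoted hunk, here is a minimal user-space sketch of how the four tag bytes are filled. The constant values and the fill_xmit_tag() name are assumptions made for illustration; the real MTK_HDR_* definitions live in the driver and are not shown in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, for illustration only; not taken from this thread. */
#define XMIT_UNTAGGED          0x00
#define XMIT_TAGGED_TPID_8100  0x01
#define XMIT_DP_BIT_MASK       0x3f

/*
 * Hypothetical helper mirroring the quoted hunk: byte 0 says whether the
 * tag is combined with an 802.1Q header, byte 1 is the destination port
 * bitmap, and bytes 2..3 are cleared only when there is no VLAN tag, so
 * that existing tag control information is preserved.
 */
static void fill_xmit_tag(uint8_t tag[4], unsigned int port, bool is_vlan)
{
	tag[0] = is_vlan ? XMIT_TAGGED_TPID_8100 : XMIT_UNTAGGED;
	tag[1] = (uint8_t)((1u << port) & XMIT_DP_BIT_MASK);

	if (!is_vlan) {
		tag[2] = 0;
		tag[3] = 0;
	}
}
```

With is_vlan set, bytes 2 and 3 are left untouched, which is the "tag control information is kept" behaviour described in the patch comment.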
Re: [PATCH net-next 1/3] net: dsa: mediatek: add VLAN support for MT7530
Hi Andrew,

All of this sounds reasonable and will be fixed in the next version.

Sean

On Thu, 2017-12-07 at 16:24 +0100, Andrew Lunn wrote:
> >  static void
> > +mt7530_port_set_vlan_unware(struct dsa_switch *ds, int port)
> > +{
> > +        struct mt7530_priv *priv = ds->priv;
> > +        int i;
> > +        bool all_user_ports_removed = true;
>
> Hi Sean
>
> Reverse Christmas tree please.
>

Will be fixed.

> > +static int
> > +mt7530_vlan_cmd(struct mt7530_priv *priv, enum mt7530_vlan_cmd cmd, u16 vid)
> > +{
> > +        u32 val;
> > +        int ret;
> > +        struct mt7530_dummy_poll p;
>
> Here too.
>

Will be fixed.

> > +static int
> > +mt7530_port_vlan_prepare(struct dsa_switch *ds, int port,
> > +                         const struct switchdev_obj_port_vlan *vlan,
> > +                         struct switchdev_trans *trans)
> > +{
> > +        struct mt7530_priv *priv = ds->priv;
> > +
> > +        /* The port is being kept as VLAN-unware port when bridge is set up
> > +         * with vlan_filtering not being set, Otherwise, the port and the
> > +         * corresponding CPU port is required the setup for becoming a
> > +         * VLAN-ware port.
> > +         */
> > +        if (!priv->ports[port].vlan_filtering)
> > +                return 0;
> > +
> > +        mt7530_port_set_vlan_ware(ds, port);
> > +        mt7530_port_set_vlan_ware(ds, MT7530_CPU_PORT);
>
> A prepare function should just validate that it is possible to carry
> out the operation. It should not change any state. These two last
> lines probably don't belong here.
>

Okay, they will be moved to a proper place such as
mt7530_port_vlan_filtering.

> > +
> > +        return 0;
> > +}
> > +
> > +static void
> > +mt7530_hw_vlan_add(struct mt7530_priv *priv,
> > +                   struct mt7530_hw_vlan_entry *entry)
> > +{
> > +        u32 val;
> > +        u8 new_members;
>
> Reverse Christmas tree. Please check the whole patch.
>

Will be fixed.

> > +static inline void INIT_MT7530_HW_ENTRY(struct mt7530_hw_vlan_entry *e,
> > +                                        int port, bool untagged)
> > +{
> > +        e->port = port;
> > +        e->untagged = untagged;
> > +}
>
> All CAPITAL letters is for #defines. This is just a normal
> function. Please use lower case.
>

Will be fixed.

>    Andrew
>
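The prepare/commit point raised in the review can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct port_state, vlan_prepare() and vlan_commit() are hypothetical names, not the switchdev or mt7530 API, and the VID range check is an example validation rather than what the driver does.

```c
#include <stdbool.h>

#define VID_MAX_EXCLUSIVE 4095	/* VIDs 1..4094 are treated as usable here */

/* Hypothetical per-port state, standing in for the driver's private data. */
struct port_state {
	bool vlan_aware;
	unsigned int vid;
};

/*
 * Prepare phase: validation only. Taking a const pointer makes it a
 * compile-time error to change state from here by accident.
 */
static int vlan_prepare(const struct port_state *p, unsigned int vid)
{
	(void)p;
	if (vid == 0 || vid >= VID_MAX_EXCLUSIVE)
		return -1;
	return 0;
}

/* Commit phase: the only place where state is modified. */
static void vlan_commit(struct port_state *p, unsigned int vid)
{
	p->vlan_aware = true;
	p->vid = vid;
}
```

The const-qualified argument is one cheap way to keep a prepare step honest; the actual fix promised above is simply to move the two state-changing calls elsewhere.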
[PATCH v2 1/3] PCI: Add pcim_set_mwi(), a device-managed pci_set_mwi()
Add pcim_set_mwi(), a device-managed version of pci_set_mwi(). First user is the Realtek r8169 driver. Signed-off-by: Heiner Kallweit Acked-by: Bjorn Helgaas --- v2: - Reorder calls - Adjust and commit message --- drivers/pci/pci.c | 25 + include/linux/pci.h | 1 + 2 files changed, 26 insertions(+) diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c index 4a7c6864f..764ca7b88 100644 --- a/drivers/pci/pci.c +++ b/drivers/pci/pci.c @@ -1458,6 +1458,7 @@ struct pci_devres { unsigned int pinned:1; unsigned int orig_intx:1; unsigned int restore_intx:1; + unsigned int mwi:1; u32 region_mask; }; @@ -1476,6 +1477,9 @@ static void pcim_release(struct device *gendev, void *res) if (this->region_mask & (1 << i)) pci_release_region(dev, i); + if (this->mwi) + pci_clear_mwi(dev); + if (this->restore_intx) pci_intx(dev, this->orig_intx); @@ -3760,6 +3764,27 @@ int pci_set_mwi(struct pci_dev *dev) } EXPORT_SYMBOL(pci_set_mwi); +/** + * pcim_set_mwi - a device-managed pci_set_mwi() + * @dev: the PCI device for which MWI is enabled + * + * Managed pci_set_mwi(). + * + * RETURNS: An appropriate -ERRNO error value on error, or zero for success. 
+ */ +int pcim_set_mwi(struct pci_dev *dev) +{ + struct pci_devres *dr; + + dr = find_pci_dr(dev); + if (!dr) + return -ENOMEM; + + dr->mwi = 1; + return pci_set_mwi(dev); +} +EXPORT_SYMBOL(pcim_set_mwi); + /** * pci_try_set_mwi - enables memory-write-invalidate PCI transaction * @dev: the PCI device for which MWI is enabled diff --git a/include/linux/pci.h b/include/linux/pci.h index 978aad784..0a7ac863a 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1064,6 +1064,7 @@ int pci_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state state); int pci_set_cacheline_size(struct pci_dev *dev); #define HAVE_PCI_SET_MWI int __must_check pci_set_mwi(struct pci_dev *dev); +int __must_check pcim_set_mwi(struct pci_dev *dev); int pci_try_set_mwi(struct pci_dev *dev); void pci_clear_mwi(struct pci_dev *dev); void pci_intx(struct pci_dev *dev, int enable); -- 2.15.1
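To make the device-managed pattern in this patch easier to follow, here is a user-space model of it. Everything prefixed fake_ is hypothetical and only mimics the shape of the kernel code above (a flag recorded in the devres structure, checked again at release time); it is not the PCI core API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct pci_dev and struct pci_devres. */
struct fake_dev {
	bool mwi_supported;
	bool mwi_enabled;
};

struct fake_devres {
	unsigned int mwi:1;
};

static int fake_set_mwi(struct fake_dev *dev)
{
	if (!dev->mwi_supported)
		return -1;
	dev->mwi_enabled = true;
	return 0;
}

static void fake_clear_mwi(struct fake_dev *dev)
{
	dev->mwi_enabled = false;
}

/*
 * Managed variant: record in the devres structure that MWI must be
 * undone, then enable it. As in the patch, the flag is set before the
 * enable call, so release clears MWI even if enabling failed.
 */
static int fake_managed_set_mwi(struct fake_dev *dev, struct fake_devres *dr)
{
	if (!dr)
		return -1;	/* the patch returns -ENOMEM at this point */

	dr->mwi = 1;
	return fake_set_mwi(dev);
}

/* Release callback: undo only what was recorded. */
static void fake_release(struct fake_dev *dev, struct fake_devres *dr)
{
	if (dr->mwi)
		fake_clear_mwi(dev);
}
```

The driver-side benefit is that the probe error path and the remove callback no longer need an explicit clear call, since the release step handles it.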
[PATCH v2 3/3] r8169: remove netif_napi_del in probe error path
netif_napi_del is called implicitly by free_netdev, therefore we don't
have to do it explicitly.

When the probe error path is reached, the net_device isn't registered yet.
Therefore reordering the call to netif_napi_del shouldn't cause any issues.

Signed-off-by: Heiner Kallweit
---
v2:
- no changes
---
 drivers/net/ethernet/realtek/r8169.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 3c7d90d3a..857f67beb 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -8672,14 +8672,12 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
         tp->counters = dmam_alloc_coherent (&pdev->dev, sizeof(*tp->counters),
                                             &tp->counters_phys_addr, GFP_KERNEL);
-        if (!tp->counters) {
-                rc = -ENOMEM;
-                goto err_out_msi_5;
-        }
+        if (!tp->counters)
+                return -ENOMEM;
 
         rc = register_netdev(dev);
         if (rc < 0)
-                goto err_out_msi_5;
+                return rc;
 
         pci_set_drvdata(pdev, dev);
 
@@ -8709,11 +8707,6 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
         netif_carrier_off(dev);
 
         return 0;
-
-err_out_msi_5:
-        netif_napi_del(&tp->napi);
-
-        return rc;
 }
 
 static struct pci_driver rtl8169_pci_driver = {
-- 
2.15.1
[PATCH v2 2/3] r8169: switch to device-managed functions in probe
Simplify probe error path and remove callback by using device-managed functions. rtl_disable_msi isn't needed any longer because the release callback of pcim_enable_device does this implicitely. Signed-off-by: Heiner Kallweit --- v2: - no changes --- drivers/net/ethernet/realtek/r8169.c | 80 +--- 1 file changed, 20 insertions(+), 60 deletions(-) diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c index fc0d5fa65..3c7d90d3a 100644 --- a/drivers/net/ethernet/realtek/r8169.c +++ b/drivers/net/ethernet/realtek/r8169.c @@ -4643,16 +4643,6 @@ static void rtl8169_phy_timer(struct timer_list *t) rtl_schedule_task(tp, RTL_FLAG_TASK_PHY_PENDING); } -static void rtl8169_release_board(struct pci_dev *pdev, struct net_device *dev, - void __iomem *ioaddr) -{ - iounmap(ioaddr); - pci_release_regions(pdev); - pci_clear_mwi(pdev); - pci_disable_device(pdev); - free_netdev(dev); -} - DECLARE_RTL_COND(rtl_phy_reset_cond) { return tp->phy_reset_pending(tp); @@ -4784,14 +4774,6 @@ static int rtl_tbi_ioctl(struct rtl8169_private *tp, struct mii_ioctl_data *data return -EOPNOTSUPP; } -static void rtl_disable_msi(struct pci_dev *pdev, struct rtl8169_private *tp) -{ - if (tp->features & RTL_FEATURE_MSI) { - pci_disable_msi(pdev); - tp->features &= ~RTL_FEATURE_MSI; - } -} - static void rtl_init_mdio_ops(struct rtl8169_private *tp) { struct mdio_ops *ops = &tp->mdio_ops; @@ -8256,9 +8238,6 @@ static void rtl_remove_one(struct pci_dev *pdev) unregister_netdev(dev); - dma_free_coherent(&tp->pci_dev->dev, sizeof(*tp->counters), - tp->counters, tp->counters_phys_addr); - rtl_release_firmware(tp); if (pci_dev_run_wake(pdev)) @@ -8266,9 +8245,6 @@ static void rtl_remove_one(struct pci_dev *pdev) /* restore original MAC address */ rtl_rar_set(tp, dev->perm_addr); - - rtl_disable_msi(pdev, tp); - rtl8169_release_board(pdev, dev, tp->mmio_addr); } static const struct net_device_ops rtl_netdev_ops = { @@ -8445,11 +8421,9 @@ static int rtl_init_one(struct pci_dev 
*pdev, const struct pci_device_id *ent) MODULENAME, RTL8169_VERSION); } - dev = alloc_etherdev(sizeof (*tp)); - if (!dev) { - rc = -ENOMEM; - goto out; - } + dev = devm_alloc_etherdev(&pdev->dev, sizeof (*tp)); + if (!dev) + return -ENOMEM; SET_NETDEV_DEV(dev, &pdev->dev); dev->netdev_ops = &rtl_netdev_ops; @@ -8472,13 +8446,13 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) PCIE_LINK_STATE_CLKPM); /* enable device (incl. PCI PM wakeup and hotplug setup) */ - rc = pci_enable_device(pdev); + rc = pcim_enable_device(pdev); if (rc < 0) { netif_err(tp, probe, dev, "enable failure\n"); - goto err_out_free_dev_1; + return rc; } - if (pci_set_mwi(pdev) < 0) + if (pcim_set_mwi(pdev) < 0) netif_info(tp, probe, dev, "Mem-Wr-Inval unavailable\n"); /* make sure PCI base addr 1 is MMIO */ @@ -8486,30 +8460,28 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) netif_err(tp, probe, dev, "region #%d not an MMIO resource, aborting\n", region); - rc = -ENODEV; - goto err_out_mwi_2; + return -ENODEV; } /* check for weird/broken PCI region reporting */ if (pci_resource_len(pdev, region) < R8169_REGS_SIZE) { netif_err(tp, probe, dev, "Invalid PCI region size(s), aborting\n"); - rc = -ENODEV; - goto err_out_mwi_2; + return -ENODEV; } rc = pci_request_regions(pdev, MODULENAME); if (rc < 0) { netif_err(tp, probe, dev, "could not request regions\n"); - goto err_out_mwi_2; + return rc; } /* ioremap MMIO region */ - ioaddr = ioremap(pci_resource_start(pdev, region), R8169_REGS_SIZE); + ioaddr = devm_ioremap(&pdev->dev, pci_resource_start(pdev, region), + R8169_REGS_SIZE); if (!ioaddr) { netif_err(tp, probe, dev, "cannot remap MMIO, aborting\n"); - rc = -EIO; - goto err_out_free_res_3; + return -EIO; } tp->mmio_addr = ioaddr; @@ -8535,7 +8507,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32)); if (rc < 0) {
[PATCH v2 0/3] r8169: extend PCI core and switch to device-managed functions in probe
Probe error path and remove callback can be significantly simplified by using device-managed functions. To be able to do this in the r8169 driver we need a device-managed version of pci_set_mwi first. v2: Change patch 1 based on Björn's review comments and add his Acked-by. Heiner Kallweit (3): PCI: Add pcim_set_mwi(), a device-managed pci_set_mwi() r8169: switch to device-managed functions in probe r8169: remove netif_napi_del in probe error path drivers/net/ethernet/realtek/r8169.c | 87 +--- drivers/pci/pci.c| 25 +++ include/linux/pci.h | 1 + 3 files changed, 46 insertions(+), 67 deletions(-) -- 2.15.1
Re: [PATCH] ptr_ring: add barriers
Hi David,

On 12/11/2017 09:23 PM, David Miller wrote:
> From: "Michael S. Tsirkin"
> Date: Tue, 5 Dec 2017 21:29:37 +0200
>
>> Users of ptr_ring expect that it's safe to give the
>> data structure a pointer and have it be available
>> to consumers, but that actually requires an smp_wmb
>> or a stronger barrier.
>>
>> In absence of such barriers and on architectures that reorder writes,
>> consumer might read an uninitialized value from an skb pointer stored
>> in the skb array. This was observed causing crashes.
>>
>> To fix, add memory barriers. The barrier we use is a wmb, the
>> assumption being that producers do not need to read the value so we do
>> not need to order these reads.
>>
>> Reported-by: George Cherian
>> Suggested-by: Jason Wang
>> Signed-off-by: Michael S. Tsirkin
>
> I asked for testing feedback and did not get it in a reasonable
> amount of time.

The tests have now run for more than 48 hours without any failures.
I will not interrupt them and will let them run for longer. If any
issue shows up, I will report it.

> So I'm applying this as-is, and queueing it up for -stable.
>
> Thank you.

Regards,
-George
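For readers who want to see the ordering requirement concretely, here is a user-space analogue using C11 atomics. It is a sketch under stated assumptions, not the ptr_ring code: struct slot, slot_produce() and slot_consume() are hypothetical names, and the release store stands in for the write barrier followed by the pointer store that the patch adds on the producer side.

```c
#include <stdatomic.h>
#include <stddef.h>

struct item {
	int payload;
};

/* A one-entry "ring": NULL means empty. */
struct slot {
	_Atomic(struct item *) ptr;
};

/*
 * Producer: initialise the object first, then publish the pointer.
 * The release ordering guarantees the payload write is visible before
 * the pointer is. Without it, a consumer on a weakly ordered machine
 * could see the pointer and still read stale payload.
 */
static void slot_produce(struct slot *s, struct item *it, int payload)
{
	it->payload = payload;
	atomic_store_explicit(&s->ptr, it, memory_order_release);
}

/*
 * Consumer: take the pointer out of the slot. Acquire ordering pairs
 * with the producer's release store in this C11 model.
 */
static struct item *slot_consume(struct slot *s)
{
	return atomic_exchange_explicit(&s->ptr, (struct item *)NULL,
					memory_order_acquire);
}
```

This only models a single slot and a single producer/consumer pair; the real structure also has to deal with ring indices and locking, which are left out here.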
Re: [RFC][PATCH] new byteorder primitives - ..._{replace,get}_bits()
On Mon, Dec 11, 2017 at 08:02:24PM -0800, Jakub Kicinski wrote:
> On Mon, 11 Dec 2017 15:54:22 +, Al Viro wrote:
> > Essentially, it gives helpers for work with bitfields in fixed-endian.
> > Suppose we have e.g. a little-endian 32bit value with fixed layout;
> > expressing that as a bitfield would go like
> >         struct foo {
> >                 unsigned foo:4;         /* bits 0..3 */
> >                 unsigned :2;
> >                 unsigned bar:12;        /* bits 6..17 */
> >                 unsigned baz:14;        /* bits 18..31 */
> >         }
> > Even for host-endian it doesn't work all that well - you end up with
> > ifdefs in structure definition and generated code stinks. For fixed-endian
> > it gets really painful, and people tend to use explicit shift-and-mask
> > kind of macros for accessing the fields (and often enough get the
> > endianness conversions wrong, at that). With these primitives
> >
> > struct foo v            <=>     __le32 v
> > v.foo = i ? 1 : 2       <=>     v = le32_replace_bits(v, i ? 1 : 2, 0, 4)
> > f(4 + v.baz)            <=>     f(4 + le32_get_bits(v, 18, 14))
>
> Looks very useful. The [start bit, size] pair may not lend itself
> too nicely to creating defines, though. Which is why in
> include/linux/bitfield.h we tried to use a shifted mask and work
> backwards from that single value what the start and size are. commit
> 3e9b3112ec74 ("add basic register-field manipulation macros") has the
> description. Could a similar trick perhaps be applicable here?

Umm... What's wrong with

#define FIELD_FOO 0,4
#define FIELD_BAR 6,12
#define FIELD_BAZ 18,14

A macro can bloody well expand to any sequence of tokens -
le32_get_bits(v, FIELD_BAZ) will become le32_get_bits(v, 18, 14)
just fine. What's the problem with that?
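The macro point above is easy to check in user space. The following is a minimal sketch, assuming host-endian uint32_t values instead of __le32 and using hypothetical helper names (u32_get_bits, u32_replace_bits) rather than the primitives proposed in the RFC; only the FIELD_* defines are taken from the mail.

```c
#include <stdint.h>

/* Mask covering `bits` bits starting at bit `shift` (1 <= bits <= 32). */
static inline uint32_t u32_field_mask(unsigned int shift, unsigned int bits)
{
	uint32_t m = (bits >= 32) ? UINT32_MAX : ((UINT32_C(1) << bits) - 1);

	return m << shift;
}

/* Extract a field; hypothetical host-endian analogue of le32_get_bits(). */
static inline uint32_t u32_get_bits(uint32_t v, unsigned int shift,
				    unsigned int bits)
{
	return (v & u32_field_mask(shift, bits)) >> shift;
}

/* Return `old` with the field replaced by `val` (truncated to the field). */
static inline uint32_t u32_replace_bits(uint32_t old, uint32_t val,
					unsigned int shift, unsigned int bits)
{
	uint32_t mask = u32_field_mask(shift, bits);

	return (old & ~mask) | ((val << shift) & mask);
}

/* One define per field, each expanding to "start, size". */
#define FIELD_FOO 0, 4
#define FIELD_BAR 6, 12
#define FIELD_BAZ 18, 14
```

A call such as u32_get_bits(v, FIELD_BAZ) compiles because the helpers are real functions, so the preprocessor expands FIELD_BAZ into two arguments before the call is parsed. If the helpers were themselves function-like macros taking three parameters, the same call would be rejected, since macro arguments are counted before expansion; an extra forwarding macro would be needed in that case.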
[PATCH v5 0/3] Add andestech atcpit100 timer
Changelog v5:
  - Patch 1/3: Changes
  - Patch 2/3: New
  - Patch 3/3: Changes

[Patch 1/3] clocksource/drivers/atcpit100: Add andestech atcpit100 timer
  1. No need to split out the Makefile patch from the actual driver.
     Suggested by Arnd Bergmann
  2. Add of_clk.name = "PCLK" to be explicit on what we use.
     Suggested by Linus Walleij
  3. Remove the GENERIC_CLOCKEVENTS from Kconfig.
     Suggested by Daniel Lezcano
  4. Add depends on NDS32 || COMPILE_TEST in Kconfig.
     Suggested by Greentime Hu

[Patch 2/3] clocksource/drivers/atcpit100: VDSO support
  For why this is implemented in the timer driver, please see the details at
  https://lkml.org/lkml/2017/12/8/362
  [PATCH v3 17/33] nds32: VDSO support.
  Suggested by Mark Rutland

  Mark Rutland's suggestion was:
    You should not add properties to arbitrary DT bindings to handle a
    Linux implementation detail. Please remove this DT code, and have
    the drivers for those timer blocks export this information to your
    vdso code somehow.

[Patch 3/3] dt-bindings: timer: Add andestech atcpit100 timer binding doc
  Fix incorrect description about PCLK.
  Suggested by Linus Walleij

Rick Chen (3):
  clocksource/drivers/atcpit100: Add andestech atcpit100 timer
  clocksource/drivers/atcpit100: VDSO support
  dt-bindings: timer: Add andestech atcpit100 timer binding doc

 .../bindings/timer/andestech,atcpit100-timer.txt |  33 +++
 drivers/clocksource/Kconfig                        |   7 +
 drivers/clocksource/Makefile                       |   1 +
 drivers/clocksource/timer-atcpit100.c              | 270 +
 4 files changed, 311 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/timer/andestech,atcpit100-timer.txt
 create mode 100644 drivers/clocksource/timer-atcpit100.c

-- 
2.7.4
[PATCH v5 2/3] clocksource/drivers/atcpit100: VDSO support
VDSO needs real-time cycle count to ensure the time accuracy. Unlike others, nds32 architecture does not define clock source, hence VDSO needs atcpit100 offering real-time cycle count to derive the correct time. Signed-off-by: Vincent Chen Signed-off-by: Rick Chen Signed-off-by: Greentime Hu --- drivers/clocksource/timer-atcpit100.c | 15 +++ 1 file changed, 15 insertions(+) diff --git a/drivers/clocksource/timer-atcpit100.c b/drivers/clocksource/timer-atcpit100.c index 0077fdb..1be6c0a 100644 --- a/drivers/clocksource/timer-atcpit100.c +++ b/drivers/clocksource/timer-atcpit100.c @@ -29,6 +29,9 @@ #include #include #include "timer-of.h" +#ifdef CONFIG_NDS32 +#include +#endif /* * Definition of register offsets @@ -211,6 +214,14 @@ static u64 notrace atcpit100_timer_sched_read(void) return ~readl(timer_of_base(&to) + CH1_CNT); } +#ifdef CONFIG_NDS32 +static void fill_vdso_need_info(void) +{ + timer_info.cycle_count_down = true; + timer_info.cycle_count_reg_offset = CH1_CNT; +} +#endif + static int __init atcpit100_timer_init(struct device_node *node) { int ret; @@ -249,6 +260,10 @@ static int __init atcpit100_timer_init(struct device_node *node) val = readl(base + INT_EN); writel(val | CH0INT0EN, base + INT_EN); +#ifdef CONFIG_NDS32 + fill_vdso_need_info(); +#endif + return ret; } -- 2.7.4
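Since the patch exports cycle_count_down = true, the consumer has to turn a down-counting register into a monotonically increasing cycle value, the same way atcpit100_timer_sched_read() does with ~readl(). The sketch below shows that conversion in user space. The function names are hypothetical, and it assumes the counter is a free-running 32-bit value that wraps from 0 to 0xffffffff, which is not stated in this thread.

```c
#include <stdint.h>

/*
 * Convert a raw reading of a down-counting 32-bit register into an
 * up-counting cycle value. Bitwise NOT maps 0xffffffff -> 0,
 * 0xfffffffe -> 1, and so on.
 */
static inline uint32_t cycles_from_down_counter(uint32_t raw)
{
	return ~raw;
}

/*
 * Cycles elapsed between two raw readings. Unsigned subtraction is
 * modulo 2^32, so the result stays correct across a single wraparound
 * of the hardware counter.
 */
static inline uint32_t cycles_elapsed(uint32_t raw_then, uint32_t raw_now)
{
	return (uint32_t)(cycles_from_down_counter(raw_now) -
			  cycles_from_down_counter(raw_then));
}
```

For example, a reading of 100 followed by a reading of 90 gives 10 elapsed cycles, and so does a reading of 5 followed by 0xfffffffb after the counter has wrapped.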
[PATCH v5 3/3] dt-bindings: timer: Add andestech atcpit100 timer binding doc
Add a document to describe Andestech atcpit100 timer and binding information. Signed-off-by: Rick Chen Signed-off-by: Greentime Hu Acked-by: Rob Herring --- .../bindings/timer/andestech,atcpit100-timer.txt | 33 ++ 1 file changed, 33 insertions(+) create mode 100644 Documentation/devicetree/bindings/timer/andestech,atcpit100-timer.txt diff --git a/Documentation/devicetree/bindings/timer/andestech,atcpit100-timer.txt b/Documentation/devicetree/bindings/timer/andestech,atcpit100-timer.txt new file mode 100644 index 000..14812f68 --- /dev/null +++ b/Documentation/devicetree/bindings/timer/andestech,atcpit100-timer.txt @@ -0,0 +1,33 @@ +Andestech ATCPIT100 timer +-- +ATCPIT100 is a generic IP block from Andes Technology, embedded in +Andestech AE3XX platforms and other designs. + +This timer is a set of compact multi-function timers, which can be +used as pulse width modulators (PWM) as well as simple timers. + +It supports up to 4 PIT channels. Each PIT channel is a +multi-function timer and provide the following usage scenarios: +One 32-bit timer +Two 16-bit timers +Four 8-bit timers +One 16-bit PWM +One 16-bit timer and one 8-bit PWM +Two 8-bit timer and one 8-bit PWM + +Required properties: +- compatible : Should be "andestech,atcpit100" +- reg : Address and length of the register set +- interrupts : Reference to the timer interrupt +- clocks : a clock to provide the tick rate for "andestech,atcpit100" +- clock-names : should be "PCLK" for the peripheral clock source. + +Examples: + +timer0: timer@f040 { + compatible = "andestech,atcpit100"; + reg = <0xf040 0x1000>; + interrupts = <2 4>; + clocks = <&apb>; + clock-names = "PCLK"; +}; -- 2.7.4
[PATCH v5 1/3] clocksource/drivers/atcpit100: Add andestech atcpit100 timer
ATCPIT100 is often used on the Andes architecture, This timer provide 4 PIT channels. Each PIT channel is a multi-function timer, can be configured as 32,16,8 bit timers or PWM as well. For system timer it will set channel 1 32-bit timer0 as clock source and count downwards until underflow and restart again. It also set channel 0 32-bit timer0 as clock event and count downwards until condition match. It will generate an interrupt for handling periodically. Signed-off-by: Rick Chen Signed-off-by: Greentime Hu Reviewed-by: Linus Walleij --- drivers/clocksource/Kconfig | 7 + drivers/clocksource/Makefile | 1 + drivers/clocksource/timer-atcpit100.c | 255 ++ 3 files changed, 263 insertions(+) create mode 100644 drivers/clocksource/timer-atcpit100.c diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig index cc60620..8c57ef2 100644 --- a/drivers/clocksource/Kconfig +++ b/drivers/clocksource/Kconfig @@ -615,4 +615,11 @@ config CLKSRC_ST_LPC Enable this option to use the Low Power controller timer as clocksource. +config CLKSRC_ATCPIT100 + bool "Clocksource for AE3XX platform" + depends on NDS32 || COMPILE_TEST + depends on HAS_IOMEM + help + This option enables support for the Andestech AE3XX platform timers. 
+ endmenu diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile index 72711f1..7d072f5 100644 --- a/drivers/clocksource/Makefile +++ b/drivers/clocksource/Makefile @@ -75,3 +75,4 @@ obj-$(CONFIG_H8300_TMR16) += h8300_timer16.o obj-$(CONFIG_H8300_TPU)+= h8300_tpu.o obj-$(CONFIG_CLKSRC_ST_LPC)+= clksrc_st_lpc.o obj-$(CONFIG_X86_NUMACHIP) += numachip.o +obj-$(CONFIG_CLKSRC_ATCPIT100) += timer-atcpit100.o diff --git a/drivers/clocksource/timer-atcpit100.c b/drivers/clocksource/timer-atcpit100.c new file mode 100644 index 000..0077fdb --- /dev/null +++ b/drivers/clocksource/timer-atcpit100.c @@ -0,0 +1,255 @@ +/* + * Andestech ATCPIT100 Timer Device Driver Implementation + * + * Copyright (C) 2017 Andes Technology Corporation + * Rick Chen, Andes Technology Corporation + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include "timer-of.h" + +/* + * Definition of register offsets + */ + +/* ID and Revision Register */ +#define ID_REV 0x0 + +/* Configuration Register */ +#define CFG0x10 + +/* Interrupt Enable Register */ +#define INT_EN 0x14 +#define CH_INT_EN(c, i)((1name, timer_of_rate(&to), 300, 32, + clocksource_mmio_readl_down); + + if (ret) { + pr_err("Failed to register clocksource\n"); + return ret; + } + + /* clear channel
[PATCH net-next] tcp/dccp: avoid one atomic operation for timewait hashdance
From: Eric Dumazet First, rename __inet_twsk_hashdance() to inet_twsk_hashdance() Then, remove one inet_twsk_put() by setting tw_refcnt to 3 instead of 4, but adding a fat warning that we do not have the right to access tw anymore after inet_twsk_hashdance() Signed-off-by: Eric Dumazet --- include/net/inet_timewait_sock.h |4 ++-- net/dccp/minisocks.c |7 --- net/ipv4/inet_timewait_sock.c| 27 +-- net/ipv4/tcp_minisocks.c |7 --- 4 files changed, 23 insertions(+), 22 deletions(-) diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h index 1356fa6a7566bf8b53632215ef8de4b153848f9b..899495589a7ea2bf693cdda42f83cec160e861b5 100644 --- a/include/net/inet_timewait_sock.h +++ b/include/net/inet_timewait_sock.h @@ -93,8 +93,8 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk, struct inet_timewait_death_row *dr, const int state); -void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk, - struct inet_hashinfo *hashinfo); +void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk, +struct inet_hashinfo *hashinfo); void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm); diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c index 178bb9833311f83205317b07fe64cb2e45a9f734..37ccbe62eb1af3f9dffbf63323c008cc96cd8ea1 100644 --- a/net/dccp/minisocks.c +++ b/net/dccp/minisocks.c @@ -63,9 +63,10 @@ void dccp_time_wait(struct sock *sk, int state, int timeo) */ local_bh_disable(); inet_twsk_schedule(tw, timeo); - /* Linkage updates. */ - __inet_twsk_hashdance(tw, sk, &dccp_hashinfo); - inet_twsk_put(tw); + /* Linkage updates. +* Note that access to tw after this point is illegal. 
+*/ + inet_twsk_hashdance(tw, sk, &dccp_hashinfo); local_bh_enable(); } else { /* Sorry, if we're out of memory, just CLOSE this diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c index b563e0c46bac2362acccf38495546a8b6b726384..277ff69a312dca1d0bc04be4b0b36db133aaf63b 100644 --- a/net/ipv4/inet_timewait_sock.c +++ b/net/ipv4/inet_timewait_sock.c @@ -97,7 +97,7 @@ static void inet_twsk_add_bind_node(struct inet_timewait_sock *tw, * Essentially we whip up a timewait bucket, copy the relevant info into it * from the SK, and mess with hash chains and list linkage. */ -void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk, +void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk, struct inet_hashinfo *hashinfo) { const struct inet_sock *inet = inet_sk(sk); @@ -119,18 +119,6 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk, spin_lock(lock); - /* -* Step 2: Hash TW into tcp ehash chain. -* Notes : -* - tw_refcnt is set to 4 because : -* - We have one reference from bhash chain. -* - We have one reference from ehash chain. -* - We have one reference from timer. -* - One reference for ourself (our caller will release it). -* We can use atomic_set() because prior spin_lock()/spin_unlock() -* committed into memory all tw fields. -*/ - refcount_set(&tw->tw_refcnt, 4); inet_twsk_add_node_rcu(tw, &ehead->chain); /* Step 3: Remove SK from hash chain */ @@ -138,8 +126,19 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk, sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1); spin_unlock(lock); + + /* tw_refcnt is set to 3 because we have : +* - one reference for bhash chain. +* - one reference for ehash chain. +* - one reference for timer. +* We can use atomic_set() because prior spin_lock()/spin_unlock() +* committed into memory all tw fields. +* Also note that after this point, we lost our implicit reference +* so we are not allowed to use tw anymore. 
+*/ + refcount_set(&tw->tw_refcnt, 3); } -EXPORT_SYMBOL_GPL(__inet_twsk_hashdance); +EXPORT_SYMBOL_GPL(inet_twsk_hashdance); static void tw_timer_handler(struct timer_list *t) { diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index b079b619b60ca577d5ef20a5065fce87acecd96c..a8384b0c11f8fa589e2ed5311899b62c80a269f8 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -316,9 +316,10 @@ void tcp_time_wait(struct sock *sk, int state, int timeo) */ local_bh_disable(); inet_twsk_schedule(tw, timeo); - /* Linkage u
Re: [PATCH net] tcp md5sig: Use skb's saddr when replying to an incoming segment
On Mon, 2017-12-11 at 00:05 -0800, Christoph Paasch wrote: > The MD5-key that belongs to a connection is identified by the peer's > IP-address. When we are in tcp_v4(6)_reqsk_send_ack(), we are > replying > to an incoming segment from tcp_check_req() that failed the seq- > number > checks. > > Thus, to find the correct key, we need to use the skb's saddr and not > the daddr. > > This bug seems to have been there since quite a while, but probably > got > unnoticed because the consequences are not catastrophic. We will call > tcp_v4_reqsk_send_ack only to send a challenge-ACK back to the peer, > thus the connection doesn't really fail. > > Fixes: 9501f9722922 ("tcp md5sig: Let the caller pass appropriate key > for tcp_v{4,6}_do_calc_md5_hash().") > Signed-off-by: Christoph Paasch > --- > net/ipv4/tcp_ipv4.c | 2 +- > net/ipv6/tcp_ipv6.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) Reviewed-by: Eric Dumazet Thanks !
Re: [RFC][PATCH] new byteorder primitives - ..._{replace,get}_bits()
On Mon, 11 Dec 2017 15:54:22 +, Al Viro wrote:
> Essentially, it gives helpers for work with bitfields in fixed-endian.
> Suppose we have e.g. a little-endian 32bit value with fixed layout;
> expressing that as a bitfield would go like
>         struct foo {
>                 unsigned foo:4;         /* bits 0..3 */
>                 unsigned :2;
>                 unsigned bar:12;        /* bits 6..17 */
>                 unsigned baz:14;        /* bits 18..31 */
>         }
> Even for host-endian it doesn't work all that well - you end up with
> ifdefs in structure definition and generated code stinks. For fixed-endian
> it gets really painful, and people tend to use explicit shift-and-mask
> kind of macros for accessing the fields (and often enough get the
> endianness conversions wrong, at that). With these primitives
>
> struct foo v            <=>     __le32 v
> v.foo = i ? 1 : 2       <=>     v = le32_replace_bits(v, i ? 1 : 2, 0, 4)
> f(4 + v.baz)            <=>     f(4 + le32_get_bits(v, 18, 14))

Looks very useful. The [start bit, size] pair may not lend itself
too nicely to creating defines, though. Which is why in
include/linux/bitfield.h we tried to use a shifted mask and work
backwards from that single value what the start and size are. commit
3e9b3112ec74 ("add basic register-field manipulation macros") has the
description. Could a similar trick perhaps be applicable here?
[BUG] 3com/3c59x: two possible sleep-in-atomic bugs
According to drivers/net/ethernet/3com/3c59x.c, the driver may sleep in
its interrupt handlers. The function call paths are:

boomerang_interrupt (interrupt handler)
  vortex_error
    vortex_up
      pci_set_power_state --> may sleep
      pci_enable_device --> may sleep

vortex_interrupt (interrupt handler)
  vortex_error
    vortex_up
      pci_set_power_state --> may sleep
      pci_enable_device --> may sleep

I could not find a good way to fix them, so I am only reporting them.
These possible bugs were found by my static analysis tool (DSAC) and
confirmed by my code review.

Thanks,
Jia-Ju Bai
Setting large MTU size on slave interfaces may stall the whole system
(resend this email in text format)

Hi,

We found an issue with the bonding driver when testing Mellanox devices. The following test commands sometimes stall the whole system, with the serial console flooded with log messages from the bond_miimon_inspect() function. Setting the MTU size to 1500 seems okay, but very rarely it may hit the same problem too.

ip address flush dev ens3f0
ip link set dev ens3f0 down
ip address flush dev ens3f1
ip link set dev ens3f1 down
[root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1 updelay=500 downdelay=500
[root@ca-hcl629 etc]# ifconfig bond0 up
[root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1
[root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up

Serial console output:

** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for interface ens3f0, disabling it in 500 ms
** 5 printk messages dropped ** [ 3717.755737] bond0: link status down for interface ens3f0, disabling it in 500 ms
** 5 printk messages dropped ** [ 3717.767758] bond0: link status down for interface ens3f0, disabling it in 500 ms
** 4 printk messages dropped ** [ 3717.37] bond0: link status down for interface ens3f0, disabling it in 500 ms

or

** 4 printk messages dropped ** [274743.297863] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.307866] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.317857] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.327823] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.337817] bond0: link status down again after 500 ms for interface enp48s0f1

The root cause is the combined effect of commit 1f2cd845d3827412e82bf26dde0abca332ede402 (Revert "Merge branch 'bonding_monitor_locking'") and commit de77ecd4ef02ca783f7762e04e92b3d0964be66b ("bonding: improve link-status update in mii-monitoring"). E.g. when we revert the second commit, we don't see the problem.

It seems that when setting a large MTU size on a RoCE interface, the RTNL mutex may be held too long by the slave interface, causing bond_mii_monitor() to be called repeatedly at an interval of 1 tick (1K HZ kernel configuration) and the kernel to become unresponsive.

We found two possible solutions:

#1, don't re-arm the mii monitor thread too quickly if we cannot get the RTNL lock:

index b2db581..8fd587a 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2266,7 +2266,6 @@ static void bond_mii_monitor(struct work_struct *work)
 		/* Race avoidance with bond_close cancel of workqueue */
 		if (!rtnl_trylock()) {
-			delay = 1;
 			should_notify_peers = false;
 			goto re_arm;
 		}

#2, use printk_ratelimit() to avoid flooding the log with messages generated by bond_miimon_inspect():

index b2db581..0183b7f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2054,7 +2054,7 @@ static int bond_miimon_inspect(struct bonding *bond)
 			bond_propose_link_state(slave, BOND_LINK_FAIL);
 			commit++;
 			slave->delay = bond->params.downdelay;
-			if (slave->delay) {
+			if (slave->delay && printk_ratelimit()) {
 				netdev_info(bond->dev, "link status down for %sinterface %s, disabling it in %d ms\n",
 					    (BOND_MODE(bond) ==
 					     BOND_MODE_ACTIVEBACKUP) ?
@@ -2105,7 +2105,8 @@ static int bond_miimon_inspect(struct bonding *bond)
 		case BOND_LINK_BACK:
 			if (!link_state) {
 				bond_propose_link_state(slave, BOND_LINK_DOWN);
-				netdev_info(bond->dev, "link status down again after %d ms for interface %s\n",
+				if(printk_ratelimit())
+					netdev_info(bond->dev, "link status down again after %d ms for interface %s\n",
 					    (bond->params.updelay - slave->delay) *
 					    bond->params.miimon,
 					    slave->dev->name);

Regarding the flooding messages, the netdev_info output is misleading anyway when bond_mii_monitor() is called at a 1 tick interval due to lock contention.

Solution #1 looks simpler and cleaner to me. Are there any side effects of doing that?

Thanks,
Qing
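
To put rough numbers on the re-arm behaviour described in this report, here is a minimal sketch. It is not kernel code: it assumes the RTNL lock stays held for a fixed time, that every rtnl_trylock() attempt in that window fails, and that with solution #1 the delay falls back to the miimon interval. The function name and the 2 second hold time are hypothetical.

```c
/*
 * Illustration only, not kernel code: count how often a work item that
 * re-arms itself every delay_ms runs while a lock it needs stays held
 * for hold_ms. Assumes every attempt during that window fails.
 */
static unsigned long rearm_attempts(unsigned long hold_ms,
				    unsigned long delay_ms)
{
	if (delay_ms == 0)
		return 0;	/* a zero delay is not modelled here */
	return hold_ms / delay_ms;
}
```

With HZ=1000 one tick is 1 ms, so under these assumptions a 2 second hold means rearm_attempts(2000, 1) = 2000 failed passes with "delay = 1", against rearm_attempts(2000, 250) = 8 at the miimon=250 interval used in the test commands.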
Re: [PATCH net-next v5 2/2] net: ethernet: socionext: add AVE ethernet driver
Hi Russell,

2017-12-11 22:46 GMT+09:00 Russell King - ARM Linux :
> On Mon, Dec 11, 2017 at 10:34:17PM +0900, Masami Hiramatsu wrote:
>> IMHO, even if we use SPDX license identifier, I recommend to use
>> C-style comments as many other files do, since it is C code.
>> If SPDX identifier requires C++ style, that is SPDX parser's issue
>> and should be fixed to get it from C-style comment.
>
> See the numerous emails on this subject already. The issue of C
> vs C++ comments has come up many times by many different people, but
> the result is the same. That's not going to happen. Linux kernel
> C files are required to use "//" for the SPDX identifier by order
> of Linus Torvalds.

OK, I got it.

>
> Linus has also revealed in that discussion that he has a preference
> for "//" style commenting for single comments, so it seems that the
> kernel coding style may change - but there is no desire for patches
> to "clean up" single line comments to use "//".

Thank you for making it clear. What I'm wondering about, then, is the copyright notice lines. Those are usually treated as header lines, not as a single-line comment. So is

> +// SDPX-License-Identifier: GPL-2.0
> +// sni_ave.c - Socionext UniPhier AVE ethernet driver
> +// Copyright 2014 Panasonic Corporation
> +// Copyright 2015-2017 Socionext Inc.

acceptable? Or should we keep C-style header lines for new drivers?

> +// SDPX-License-Identifier: GPL-2.0
> +/*
> + * sni_ave.c - Socionext UniPhier AVE ethernet driver
> + * Copyright 2014 Panasonic Corporation
> + * Copyright 2015-2017 Socionext Inc.
> + */

My only concern is that those lines are not "single". That's all. :)

>
> For further information, and to see the discussion that has already
> happened, the arguments that have been made about style, see the
> threads for the patch series that tglx has been posting wrt documenting
> the SPDX stuff for the kernel.

OK, got it. https://lkml.org/lkml/2017/11/16/663

Thanks,

>
> Thanks (let's stop rehashing the same arguments.)
>

--
Masami Hiramatsu
Re: [PATCH net-next] libbpf: add function to setup XDP
On 12/10/2017 10:07 PM, David Ahern wrote: > On 12/10/17 1:34 PM, Eric Leblond wrote: >>> Would it be possible to print out or preferably return to the caller >>> the ext ack error message? A couple of drivers are using it for XDP >>> mis-configuration reporting instead of printks. We should encourage >>> other to do the same and support it in all user space since ext ack >>> msgs lead to much better user experience. >> >> I've seen the kind of messages displayed by reading at kernel log. They >> are really useful and it looks almost mandatory to be able to display >> them. >> >> Kernel code seems to not have a parser for the ext ack error message. >> Did I miss something here ? >> >> Looking at tc code, it seems it is using libmnl to parse them and I >> doubt it is a good idea to use that in libbpf as it is introducing a >> dependency. >> >> Does someone has an existing parsing code or should I write on my own ? > > I had worked on extack for libbpf but seem to have lost the changes. > > Look at the commits here: > https://github.com/dsahern/iproute2/commits/ext-ack > > I suggest using this: > > https://github.com/dsahern/iproute2/commit/b61e4c7dd54a5d3ff98640da4b480441cee497b2 > > to bring in nlattr from lib/nlattr (as I recall lib/nlattr can not be > used directly). From there, use this one: > > https://github.com/dsahern/iproute2/commit/261f7251e6704d565b91e310faa7e18d14a1 > > to see what is needed for extack support. > > Really not that much code to add. +1, ext ack support would improve troubleshooting a lot here; please add and respin. Thanks, Eric!
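
As a rough idea of how little code the extack parsing discussed above needs without libmnl, here is a minimal sketch that uses only the uapi definitions from linux/netlink.h. The function name extack_msg() is hypothetical and this is not the code from the iproute2 commits referenced in the thread; it assumes the reply was received on a socket that has extended acks enabled, so that NLM_F_ACK_TLVS is set on error messages carrying attributes.

```c
#include <linux/netlink.h>
#include <stddef.h>

/*
 * Sketch: return the extended ack string carried in an NLMSG_ERROR
 * message, or NULL if there is none. The returned pointer refers to
 * memory inside the message itself.
 */
static const char *extack_msg(const struct nlmsghdr *nlh)
{
	const char *base = (const char *)nlh;
	const struct nlmsgerr *err;
	size_t off;

	if (nlh->nlmsg_type != NLMSG_ERROR ||
	    nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*err)) ||
	    !(nlh->nlmsg_flags & NLM_F_ACK_TLVS))
		return NULL;

	err = (const struct nlmsgerr *)(base + NLMSG_HDRLEN);

	/*
	 * The attributes follow struct nlmsgerr, plus the echoed payload
	 * of the failed request unless the kernel capped the reply.
	 */
	off = (size_t)NLMSG_HDRLEN + sizeof(*err);
	if (!(nlh->nlmsg_flags & NLM_F_CAPPED)) {
		if (err->msg.nlmsg_len < (size_t)NLMSG_HDRLEN)
			return NULL;
		off += err->msg.nlmsg_len - (size_t)NLMSG_HDRLEN;
	}
	off = NLMSG_ALIGN(off);

	while (off + NLA_HDRLEN <= nlh->nlmsg_len) {
		const struct nlattr *nla =
			(const struct nlattr *)(base + off);

		if (nla->nla_len < NLA_HDRLEN ||
		    off + nla->nla_len > nlh->nlmsg_len)
			break;	/* malformed attribute, stop parsing */

		if ((nla->nla_type & NLA_TYPE_MASK) == NLMSGERR_ATTR_MSG) {
			const char *s = (const char *)nla + NLA_HDRLEN;
			size_t len = nla->nla_len - NLA_HDRLEN;

			/* Only trust strings that are NUL-terminated. */
			return (len && s[len - 1] == '\0') ? s : NULL;
		}
		off += NLA_ALIGN(nla->nla_len);
	}
	return NULL;
}
```

A caller would typically check for NLMSG_ERROR after recv(), print strerror(-err->error), and append the string from extack_msg() when it is not NULL.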
Re: [PATCH net-next v4 0/2] bpf/tracing: allow user space to query prog array on the same tp
On 12/11/2017 08:39 PM, Yonghong Song wrote: > Commit e87c6bc3852b ("bpf: permit multiple bpf attachments > for a single perf event") added support to attach multiple > bpf programs to a single perf event. Given a perf event > (kprobe, uprobe, or kernel tracepoint), the perf ioctl interface > is used to query bpf programs attached to the same trace event. > > There already exists a BPF_PROG_QUERY command for introspection > currently used by cgroup+bpf. We did have an implementation for > querying tracepoint+bpf through the same interface. However, it > looks cleaner to use ioctl() style of api here, since attaching > bpf prog to tracepoint/kuprobe is also done via ioctl. > > Patch #1 had the core implementation and patch #2 added > a test case in tools bpf selftests suite. > > Changelogs: > v3 -> v4: > - Fix a compilation error with newer gcc like 6.3.1 while > old gcc 4.8.5 is okay. I was using &uquery->ids to represent > the address to the ids array to make it explicit that the > address is passed, and this syntax is rightly rejected > by gcc 6.3.1. Series applied to bpf-next, thanks Yonghong.
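
The v3 -> v4 changelog entry above comes down to the difference between an array expression and the address of an array. A small sketch, using a hypothetical stand-in struct with a fixed-size array rather than the real uapi structure: both expressions yield the same address, but they have different types, which is why passing &uquery->ids where a pointer to the first element is expected gets flagged by newer compilers.

```c
#include <stddef.h>
#include <stdint.h>

struct query {			/* hypothetical stand-in, not the uapi struct */
	uint32_t ids_len;
	uint32_t ids[4];
};

/* Takes a pointer to the first element, as an ids parameter would. */
static uint32_t first_id(const uint32_t *ids)
{
	return ids[0];
}

/* Bytes skipped by "+ 1" on q->ids, which decays to uint32_t *. */
static size_t step_elem(const struct query *q)
{
	return (size_t)((const char *)(q->ids + 1) - (const char *)q->ids);
}

/* Bytes skipped by "+ 1" on &q->ids, a pointer to the whole array. */
static size_t step_array(const struct query *q)
{
	return (size_t)((const char *)(&q->ids + 1) -
			(const char *)&q->ids);
}
```

Calling first_id(q.ids) is fine, while first_id(&q.ids) passes a pointer of an incompatible type, so the plain q->ids form is the one that matches the parameter.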
Re: [PATCH v3 net-next 0/9] net: Generic network resolver backend and ILA resolver
From: Tom Herbert Date: Mon, 11 Dec 2017 14:16:17 -0800 > How can we build a system that allows an unlimited number of > resolutions without drop? IPV4 routing solves this with a prefixed trie, for example. The fundamental backing datastructure for the switching or whatever operation must be in-memory, in the kernel, scalable, and without a fronting "cache".
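
For readers unfamiliar with the trie approach mentioned above, here is a minimal sketch of longest-prefix matching over an in-memory binary trie. It is an uncompressed one-bit-per-level trie written for illustration, not the kernel's level-compressed implementation; the struct and function names, and the integer next-hop ids, are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

struct trie_node {
	struct trie_node *child[2];
	int has_route;		/* a prefix ends at this node */
	int nexthop;		/* hypothetical next-hop id */
};

/*
 * Insert prefix/len -> nexthop. Returns 0 on success, -1 on a bad
 * length or allocation failure.
 */
static int trie_insert(struct trie_node *root, uint32_t prefix, int len,
		       int nexthop)
{
	struct trie_node *n = root;
	int i;

	if (len < 0 || len > 32)
		return -1;

	for (i = 0; i < len; i++) {
		int bit = (prefix >> (31 - i)) & 1;

		if (!n->child[bit]) {
			n->child[bit] = calloc(1, sizeof(*n));
			if (!n->child[bit])
				return -1;
		}
		n = n->child[bit];
	}
	n->has_route = 1;
	n->nexthop = nexthop;
	return 0;
}

/*
 * Walk the address bits from the most significant end and remember the
 * deepest node that carries a route. Returns its next hop, or -1.
 */
static int trie_lookup(const struct trie_node *root, uint32_t addr)
{
	const struct trie_node *n = root;
	int best = -1;
	int i;

	for (i = 0; n; i++) {
		if (n->has_route)
			best = n->nexthop;
		if (i == 32)
			break;
		n = n->child[(addr >> (31 - i)) & 1];
	}
	return best;
}
```

The point relevant to the discussion is that the full table lives in the lookup structure itself, so a lookup never misses into a slower path the way a fronting cache can.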
Re: [PATCH v3 17/33] nds32: VDSO support
2017-12-08 20:14 GMT+08:00 Mark Rutland :
> On Fri, Dec 08, 2017 at 07:54:42PM +0800, Greentime Hu wrote:
>> 2017-12-08 18:21 GMT+08:00 Mark Rutland :
>> > On Fri, Dec 08, 2017 at 05:12:00PM +0800, Greentime Hu wrote:
>> >> +static int grab_timer_node_info(void)
>> >> +{
>> >> + struct device_node *timer_node;
>> >> +
>> >> + timer_node = of_find_node_by_name(NULL, "timer");
>> >
>> > Please use a compatible string, rather than matching the timer by name.
>> >
>> > It's plausible that you have multiple nodes called "timer" in the DT,
>> > under different parent nodes, and this might not be the device you
>> > think it is. I see your dt in patch 24 has two timer nodes.
>> >
>> > It would be best if your clocksource driver exposed some struct that you
>> > looked at here, so that you're guaranteed to use the same device.
>>
>> We'd like to use "timer" here because there are 2 different timer IPs
>> and we are sure that they won't be in the same SoC.
>> We think this implementation in VDSO should be platform independent to
>> get cycle-count register.
>> Our customer or other SoC provider who can use "timer" and define
>> cycle-count-offset or cycle-count-down then we can get the correct
>> cycle-count.
>
> This is not the right way to do things.
>
> So from a DT perspective, NAK.
>
> You should not add properties to arbitrary DT bindings to handle a Linux
> implementation detail.
>
> Please remove this DT code, and have the drivers for those timer blocks
> export this information to your vdso code somehow.
>

Hi Mark,

Based on your suggestion, we have defined a new struct timer_info so that the timer driver can record the values of cycle-count-offset and cycle-count-down in its timer_init function. That code in the timer driver is only valid when CONFIG_NDS32 is defined.

>> We sent atcpit100 patch last time along with our arch, however we'd
>> like to send it to its sub system this time and my colleague is still
>> working on it.
>> He may send the timer patch next week.
>
> I think that it would make sense for that patch to be part of the arch
> port, especially given that (AFAICT) there is no driver for the other
> timer IP that you mention.
>
> [...]
>
>> >> +int arch_setup_additional_pages(struct linux_binprm *bprm, int
>> >> uses_interp)
>> >> +{
>> >
>> >> + /*Map timer to user space */
>> >> + vdso_base += PAGE_SIZE;
>> >> + prot = __pgprot(_PAGE_V | _PAGE_M_UR_KR | _PAGE_D |
>> >> + _PAGE_G | _PAGE_C_DEV);
>> >> + ret = io_remap_pfn_range(vma, vdso_base, timer_res.start >>
>> >> PAGE_SHIFT,
>> >> + PAGE_SIZE, prot);
>> >> + if (ret)
>> >> + goto up_fail;
>> >
>> > Maybe this is fine, but it looks a bit suspicious.
>> >
>> > Is it safe to map IO memory to a userspace process like this?
>> >
>> > In general that isn't safe, since userspace could access other registers
>> > (if those exist), perform accesses that change the state of hardware, or
>> > make unsupported access types (e.g. unaligned, atomic) that result in
>> > errors the kernel can't handle.
>> >
>> > Does none of that apply here?
>>
>> We only provide read permission to this page so hardware state won't
>> be changed. It will trigger exception if we try to write.
>> We will check about the alignment/atomic issue of this region.
>

Regarding the alignment issue: we intentionally made an unaligned read from this region and got a "Segmentation fault", as expected.

Thanks,
Vincent

> Ok, thanks.
>
> This is another reason to only do this for devices/drivers that we have
> drivers for, since we can't know that this is safe in general.
>
> Thanks,
> Mark.
linux-next: build failure after merge of the mac80211-next tree
Hi Johannes, After merging the mac80211-next tree, today's linux-next build (x86_64 allmodconfig) failed like this: drivers/net/wireless/mediatek/mt76/mt76x2_main.c:539:19: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types] .wake_tx_queue = mt76_wake_tx_queue, ^ drivers/net/wireless/mediatek/mt76/mt76x2_main.c:539:19: note: (near initialization for 'mt76x2_ops.wake_tx_queue') Caused by commits 17f1de56df05 ("mt76: add common code shared between multiple chipsets") 7bc04215a66b ("mt76: add driver code for MT76x2e") from the wireless-drivers-next tree interacting with commit e937b8da5a59 ("mac80211: Add TXQ scheduling API") from the mac80211-next tree. I applied the below hack merge fix ... please let me know if something more/better is required. Someone needs to remember to tell Dave when these trees meet in his tree. From: Stephen Rothwell Date: Tue, 12 Dec 2017 12:50:40 +1100 Subject: [PATCH] mt76: fix up for "mac80211: Add TXQ scheduling API" Signed-off-by: Stephen Rothwell --- drivers/net/wireless/mediatek/mt76/mt76.h | 2 +- drivers/net/wireless/mediatek/mt76/tx.c | 10 +++--- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h b/drivers/net/wireless/mediatek/mt76/mt76.h index aa0880bbea7f..e395d3859212 100644 --- a/drivers/net/wireless/mediatek/mt76/mt76.h +++ b/drivers/net/wireless/mediatek/mt76/mt76.h @@ -338,7 +338,7 @@ void mt76_tx(struct mt76_dev *dev, struct ieee80211_sta *sta, struct mt76_wcid *wcid, struct sk_buff *skb); void mt76_txq_init(struct mt76_dev *dev, struct ieee80211_txq *txq); void mt76_txq_remove(struct mt76_dev *dev, struct ieee80211_txq *txq); -void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq); +void mt76_wake_tx_queue(struct ieee80211_hw *hw); void mt76_stop_tx_queues(struct mt76_dev *dev, struct ieee80211_sta *sta, bool send_bar); void mt76_txq_schedule(struct mt76_dev *dev, struct mt76_queue *hwq); diff --git 
a/drivers/net/wireless/mediatek/mt76/tx.c b/drivers/net/wireless/mediatek/mt76/tx.c index 4eef69bd8a9e..ad414af0750f 100644 --- a/drivers/net/wireless/mediatek/mt76/tx.c +++ b/drivers/net/wireless/mediatek/mt76/tx.c @@ -463,12 +463,16 @@ void mt76_stop_tx_queues(struct mt76_dev *dev, struct ieee80211_sta *sta, } EXPORT_SYMBOL_GPL(mt76_stop_tx_queues); -void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq) +void mt76_wake_tx_queue(struct ieee80211_hw *hw) { + struct ieee80211_txq *txq; struct mt76_dev *dev = hw->priv; - struct mt76_txq *mtxq = (struct mt76_txq *) txq->drv_priv; - struct mt76_queue *hwq = mtxq->hwq; + struct mt76_txq *mtxq; + struct mt76_queue *hwq; + txq = ieee80211_next_txq(hw); + mtxq = (struct mt76_txq *) txq->drv_priv; + hwq = mtxq->hwq; spin_lock_bh(&hwq->lock); if (list_empty(&mtxq->list)) list_add_tail(&mtxq->list, &hwq->swq); -- 2.15.0 -- Cheers, Stephen Rothwell
Re: [PATCH] selftests: bpf: Adding config fragment CONFIG_CGROUP_BPF=y
On 12/11/2017 08:25 PM, Naresh Kamboju wrote: > CONFIG_CGROUP_BPF=y is required for test_dev_cgroup test case. > > Signed-off-by: Naresh Kamboju Applied to bpf-next, thanks Naresh!
[PATCH v2] igb: Free IRQs when device is hotplugged
Recently I got a Caldigit TS3 Thunderbolt 3 dock, and noticed that upon hotplugging my kernel would immediately crash due to igb: [ 680.825801] kernel BUG at drivers/pci/msi.c:352! [ 680.828388] invalid opcode: [#1] SMP [ 680.829194] Modules linked in: igb(O) thunderbolt i2c_algo_bit joydev vfat fat btusb btrtl btbcm btintel bluetooth ecdh_generic hp_wmi sparse_keymap rfkill wmi_bmof iTCO_wdt intel_rapl x86_pkg_temp_thermal coretemp crc32_pclmul snd_pcm rtsx_pci_ms mei_me snd_timer memstick snd pcspkr mei soundcore i2c_i801 tpm_tis psmouse shpchp wmi tpm_tis_core tpm video hp_wireless acpi_pad rtsx_pci_sdmmc mmc_core crc32c_intel serio_raw rtsx_pci mfd_core xhci_pci xhci_hcd i2c_hid i2c_core [last unloaded: igb] [ 680.831085] CPU: 1 PID: 78 Comm: kworker/u16:1 Tainted: G O 4.15.0-rc3Lyude-Test+ #6 [ 680.831596] Hardware name: HP HP ZBook Studio G4/826B, BIOS P71 Ver. 01.03 06/09/2017 [ 680.832168] Workqueue: kacpi_hotplug acpi_hotplug_work_fn [ 680.832687] RIP: 0010:free_msi_irqs+0x180/0x1b0 [ 680.833271] RSP: 0018:c930fbf0 EFLAGS: 00010286 [ 680.833761] RAX: 8803405f9c00 RBX: 88033e3d2e40 RCX: 002c [ 680.834278] RDX: RSI: 00ac RDI: 880340be2178 [ 680.834832] RBP: R08: 880340be1ff0 R09: 8803405f9c00 [ 680.835342] R10: R11: 0040 R12: 88033d63a298 [ 680.835822] R13: 88033d63a000 R14: 0060 R15: 880341959000 [ 680.836332] FS: () GS:88034f44() knlGS: [ 680.836817] CS: 0010 DS: ES: CR0: 80050033 [ 680.837360] CR2: 55e64044afdf CR3: 01c09002 CR4: 003606e0 [ 680.837954] Call Trace: [ 680.838853] pci_disable_msix+0xce/0xf0 [ 680.839616] igb_reset_interrupt_capability+0x5d/0x60 [igb] [ 680.840278] igb_remove+0x9d/0x110 [igb] [ 680.840764] pci_device_remove+0x36/0xb0 [ 680.841279] device_release_driver_internal+0x157/0x220 [ 680.841739] pci_stop_bus_device+0x7d/0xa0 [ 680.842255] pci_stop_bus_device+0x2b/0xa0 [ 680.842722] pci_stop_bus_device+0x3d/0xa0 [ 680.843189] pci_stop_and_remove_bus_device+0xe/0x20 [ 680.843627] trim_stale_devices+0xf3/0x140 [ 680.844086] 
trim_stale_devices+0x94/0x140 [ 680.844532] trim_stale_devices+0xa6/0x140 [ 680.845031] ? get_slot_status+0x90/0xc0 [ 680.845536] acpiphp_check_bridge.part.5+0xfe/0x140 [ 680.846021] acpiphp_hotplug_notify+0x175/0x200 [ 680.846581] ? free_bridge+0x100/0x100 [ 680.847113] acpi_device_hotplug+0x8a/0x490 [ 680.847535] acpi_hotplug_work_fn+0x1a/0x30 [ 680.848076] process_one_work+0x182/0x3a0 [ 680.848543] worker_thread+0x2e/0x380 [ 680.848963] ? process_one_work+0x3a0/0x3a0 [ 680.849373] kthread+0x111/0x130 [ 680.849776] ? kthread_create_worker_on_cpu+0x50/0x50 [ 680.850188] ret_from_fork+0x1f/0x30 [ 680.850601] Code: 43 14 85 c0 0f 84 d5 fe ff ff 31 ed eb 0f 83 c5 01 39 6b 14 0f 86 c5 fe ff ff 8b 7b 10 01 ef e8 b7 e4 d2 ff 48 83 78 70 00 74 e3 <0f> 0b 49 8d b5 a0 00 00 00 e8 62 6f d3 ff e9 c7 fe ff ff 48 8b [ 680.851497] RIP: free_msi_irqs+0x180/0x1b0 RSP: c930fbf0 As it turns out, normally the freeing of IRQs that would fix this is called inside of the scope of __igb_close(). However, since the device is already gone by the point we try to unregister the netdevice from the driver due to a hotplug we end up seeing that the netif isn't present and thus, forget to free any of the device IRQs. So: make sure that if we're in the process of dismantling the netdev, we always allow __igb_close() to be called so that IRQs may be freed normally. Additionally, only allow igb_close() to be called from __igb_close() if it hasn't already been called for the given adapter. 
Signed-off-by: Lyude Paul Fixes: 9474933caf21 ("igb: close/suspend race in netif_device_detach") Cc: Todd Fujinaka Cc: Stephen Hemminger Cc: sta...@vger.kernel.org --- Changes since v1: - Remove code for freeing IRQs from igb_remove(), unbreak __igb_close() instead (re: Stephen Hemminger) drivers/net/ethernet/intel/igb/igb_main.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index c208753ff5b7..a1083fd074dd 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -3663,7 +3663,9 @@ static int __igb_close(struct net_device *netdev, bool suspending) if (!suspending) pm_runtime_get_sync(&pdev->dev); - igb_down(adapter); + if (!test_bit(__IGB_DOWN, &adapter->state)) + igb_down(adapter); + igb_free_irq(adapter); igb_free_all_tx_resources(adapter); @@ -3676,7 +3678,7 @@ static int __igb_close(struct net_device *netdev, bool suspending) int igb_close(struct net_device *netdev) { - if (netif_device_present(netdev)) + if (netif_
[PATCH bpf 0/3] Misc BPF fixes
Couple of outstanding fixes for BPF tree: 1) fixes a perf RB corruption, 2) and 3) fixes a few build issues from the recent bpf_perf_event.h uapi corrections. Thanks! Daniel Borkmann (3): bpf: fix corruption on concurrent perf_event_output calls bpf: fix build issues on um due to mising bpf_perf_event.h bpf: fix broken BPF selftest build arch/um/include/asm/Kbuild | 1 + kernel/trace/bpf_trace.c| 19 --- tools/include/uapi/asm/bpf_perf_event.h | 7 +++ tools/testing/selftests/bpf/Makefile| 13 + 4 files changed, 21 insertions(+), 19 deletions(-) create mode 100644 tools/include/uapi/asm/bpf_perf_event.h -- 2.9.5
[PATCH bpf 3/3] bpf: fix broken BPF selftest build
At least on x86_64, the kernel's BPF selftests seemed to have stopped to build due to 618e165b2a8e ("selftests/bpf: sync kernel headers and introduce arch support in Makefile"): [...] In file included from test_verifier.c:29:0: ../../../include/uapi/linux/bpf_perf_event.h:11:32: fatal error: asm/bpf_perf_event.h: No such file or directory #include ^ compilation terminated. [...] While pulling in tools/arch/*/include/uapi/asm/bpf_perf_event.h seems to work fine, there's no automated fall-back logic right now that would do the same out of tools/include/uapi/asm-generic/bpf_perf_event.h. The usual convention today is to add a include/[uapi/]asm/ equivalent that would pull in the correct arch header or generic one as fall-back, all ifdef'ed based on compiler target definition. It's similarly done also in other cases such as tools/include/asm/barrier.h, thus adapt the same here. Fixes: 618e165b2a8e ("selftests/bpf: sync kernel headers and introduce arch support in Makefile") Signed-off-by: Daniel Borkmann Cc: Hendrik Brueckner Cc: Arnaldo Carvalho de Melo Acked-by: Alexei Starovoitov --- tools/include/uapi/asm/bpf_perf_event.h | 7 +++ tools/testing/selftests/bpf/Makefile| 13 + 2 files changed, 8 insertions(+), 12 deletions(-) create mode 100644 tools/include/uapi/asm/bpf_perf_event.h diff --git a/tools/include/uapi/asm/bpf_perf_event.h b/tools/include/uapi/asm/bpf_perf_event.h new file mode 100644 index 000..13a5853 --- /dev/null +++ b/tools/include/uapi/asm/bpf_perf_event.h @@ -0,0 +1,7 @@ +#if defined(__aarch64__) +#include "../../arch/arm64/include/uapi/asm/bpf_perf_event.h" +#elif defined(__s390__) +#include "../../arch/s390/include/uapi/asm/bpf_perf_event.h" +#else +#include +#endif diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index 21a2d76..792af7c 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -1,19 +1,8 @@ # SPDX-License-Identifier: GPL-2.0 -ifeq ($(srctree),) -srctree 
:= $(patsubst %/,%,$(dir $(CURDIR))) -srctree := $(patsubst %/,%,$(dir $(srctree))) -srctree := $(patsubst %/,%,$(dir $(srctree))) -srctree := $(patsubst %/,%,$(dir $(srctree))) -endif -include $(srctree)/tools/scripts/Makefile.arch - -$(call detected_var,SRCARCH) - LIBDIR := ../../../lib BPFDIR := $(LIBDIR)/bpf APIDIR := ../../../include/uapi -ASMDIR:= ../../../arch/$(ARCH)/include/uapi GENDIR := ../../../../include/generated GENHDR := $(GENDIR)/autoconf.h @@ -21,7 +10,7 @@ ifneq ($(wildcard $(GENHDR)),) GENFLAGS := -DHAVE_GENHDR endif -CFLAGS += -Wall -O2 -I$(APIDIR) -I$(ASMDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include +CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) -I../../../include LDLIBS += -lcap -lelf TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test_progs \ -- 2.9.5
[PATCH bpf 2/3] bpf: fix build issues on um due to mising bpf_perf_event.h
Since c895f6f703ad ("bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type") um (uml) won't build on i386 or x86_64: [...] CC init/main.o In file included from ../include/linux/perf_event.h:18:0, from ../include/linux/trace_events.h:10, from ../include/trace/syscall.h:7, from ../include/linux/syscalls.h:82, from ../init/main.c:20: ../include/uapi/linux/bpf_perf_event.h:11:32: fatal error: asm/bpf_perf_event.h: No such file or directory #include [...] Lets add missing bpf_perf_event.h also to um arch. This seems to be the only one still missing. Fixes: c895f6f703ad ("bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type") Reported-by: Randy Dunlap Suggested-by: Richard Weinberger Signed-off-by: Daniel Borkmann Tested-by: Randy Dunlap Cc: Hendrik Brueckner Cc: Richard Weinberger Acked-by: Alexei Starovoitov --- arch/um/include/asm/Kbuild | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild index 50a32c3..73c57f6 100644 --- a/arch/um/include/asm/Kbuild +++ b/arch/um/include/asm/Kbuild @@ -1,4 +1,5 @@ generic-y += barrier.h +generic-y += bpf_perf_event.h generic-y += bug.h generic-y += clkdev.h generic-y += current.h -- 2.9.5
[PATCH bpf 1/3] bpf: fix corruption on concurrent perf_event_output calls
When tracing and networking programs are both attached in the system and both use event-output helpers that eventually call into perf_event_output(), then we could end up in a situation where the tracing attached program runs in user context while a cls_bpf program is triggered on that same CPU out of softirq context. Since both rely on the same per-cpu perf_sample_data, we could potentially corrupt it. This can only ever happen in a combination of the two types; all tracing programs use a bpf_prog_active counter to bail out in case a program is already running on that CPU out of a different context. XDP and cls_bpf programs by themselves don't have this issue as they run in the same context only. Therefore, split both perf_sample_data so they cannot be accessed from each other. Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data") Reported-by: Alexei Starovoitov Signed-off-by: Daniel Borkmann Tested-by: Song Liu Acked-by: Alexei Starovoitov --- kernel/trace/bpf_trace.c | 19 --- 1 file changed, 12 insertions(+), 7 deletions(-) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 0ce99c3..40207c2 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -343,14 +343,13 @@ static const struct bpf_func_proto bpf_perf_event_read_value_proto = { .arg4_type = ARG_CONST_SIZE, }; -static DEFINE_PER_CPU(struct perf_sample_data, bpf_sd); +static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd); static __always_inline u64 __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, - u64 flags, struct perf_raw_record *raw) + u64 flags, struct perf_sample_data *sd) { struct bpf_array *array = container_of(map, struct bpf_array, map); - struct perf_sample_data *sd = this_cpu_ptr(&bpf_sd); unsigned int cpu = smp_processor_id(); u64 index = flags & BPF_F_INDEX_MASK; struct bpf_event_entry *ee; @@ -373,8 +372,6 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, if (unlikely(event->oncpu != cpu)) 
return -EOPNOTSUPP; - perf_sample_data_init(sd, 0, 0); - sd->raw = raw; perf_event_output(event, sd, regs); return 0; } @@ -382,6 +379,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map, u64, flags, void *, data, u64, size) { + struct perf_sample_data *sd = this_cpu_ptr(&bpf_trace_sd); struct perf_raw_record raw = { .frag = { .size = size, @@ -392,7 +390,10 @@ BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map, if (unlikely(flags & ~(BPF_F_INDEX_MASK))) return -EINVAL; - return __bpf_perf_event_output(regs, map, flags, &raw); + perf_sample_data_init(sd, 0, 0); + sd->raw = &raw; + + return __bpf_perf_event_output(regs, map, flags, sd); } static const struct bpf_func_proto bpf_perf_event_output_proto = { @@ -407,10 +408,12 @@ static const struct bpf_func_proto bpf_perf_event_output_proto = { }; static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs); +static DEFINE_PER_CPU(struct perf_sample_data, bpf_misc_sd); u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy) { + struct perf_sample_data *sd = this_cpu_ptr(&bpf_misc_sd); struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs); struct perf_raw_frag frag = { .copy = ctx_copy, @@ -428,8 +431,10 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, }; perf_fetch_caller_regs(regs); + perf_sample_data_init(sd, 0, 0); + sd->raw = &raw; - return __bpf_perf_event_output(regs, map, flags, &raw); + return __bpf_perf_event_output(regs, map, flags, sd); } BPF_CALL_0(bpf_get_current_task) -- 2.9.5
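
The failure mode described in the commit message above can be sketched in plain userspace C. This is only a model under stated assumptions: the "softirq" is simulated by a direct function call at the point where the interruption would occur, and all names (struct sample, shared_sd, trace_sd, misc_sd, the two functions) are hypothetical stand-ins rather than kernel symbols.

```c
#include <string.h>

struct sample {
	char tag[16];
};

static struct sample shared_sd;			/* one buffer for both contexts */
static struct sample trace_sd, misc_sd;		/* one buffer per context */

/* Stand-in for a networking program firing in softirq context. */
static void softirq_output(struct sample *sd)
{
	strcpy(sd->tag, "net");
}

/*
 * Stand-in for a tracing program: fill the buffer, get "interrupted"
 * before the data is consumed, then report whether the data survived.
 * Returns 1 if the tracing data is intact, 0 if it was overwritten.
 */
static int trace_output(struct sample *mine, struct sample *softirq_sd)
{
	strcpy(mine->tag, "trace");
	softirq_output(softirq_sd);	/* simulated softirq on the same CPU */
	return strcmp(mine->tag, "trace") == 0;
}
```

Here trace_output(&shared_sd, &shared_sd) models the old single per-CPU buffer and reports corruption, while trace_output(&trace_sd, &misc_sd) models the split buffers and leaves the tracing data intact.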
linux-next: manual merge of the net-next tree with the net tree
Hi all, Today's linux-next merge of the net-next tree got a conflict in: drivers/net/phy/meson-gxl.c between commit: f1e2400a80ff ("net: phy: meson-gxl: detect LPA corruption") from the net tree and commit: 80274abafc60 ("net: phy: remove generic settings for callbacks config_aneg and read_status from drivers") from the net-next tree. I fixed it up (I just used the former) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc drivers/net/phy/meson-gxl.c index 77dd4be5,401e3234be58.. --- a/drivers/net/phy/meson-gxl.c +++ b/drivers/net/phy/meson-gxl.c @@@ -130,9 -58,7 +130,8 @@@ static struct phy_driver meson_gxl_phy[ .features = PHY_BASIC_FEATURES, .flags = PHY_IS_INTERNAL, .config_init= meson_gxl_config_init, - .config_aneg= genphy_config_aneg, .aneg_done = genphy_aneg_done, + .read_status= meson_gxl_read_status, .suspend= genphy_suspend, .resume = genphy_resume, },
[PATCH iproute2 net-next v2 1/4] ss: Replace printf() calls for "main" output by calls to helper
This is preparation work for output buffering, which will allow us to use optimal spacing and alignment of logical "columns". The new out() function is just a re-implementation of a typical libc's printf(), except that the return value of vfprintf() is ignored as no callers use it. This implementation will be replaced in the next patches to provide column width adjustment and adequate spacing. All printf() calls that output parts of the socket list are now replaced by calls to out(). Output of summary and version is excluded from this. No functional differences here, output not affected. Signed-off-by: Stefano Brivio Reviewed-by: Sabrina Dubroca --- v2: rebase after conflict with 00ac78d39c29 ("ss: print tcpi_rcv_ssthresh") misc/ss.c | 399 -- 1 file changed, 205 insertions(+), 194 deletions(-) diff --git a/misc/ss.c b/misc/ss.c index da52d5edeb7e..a7d3b89e1478 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -26,6 +26,7 @@ #include #include #include +#include #include "utils.h" #include "rt_names.h" @@ -823,6 +824,15 @@ static const char *vsock_netid_name(int type) } } +static void out(const char *fmt, ...) +{ + va_list args; + + va_start(args, fmt); + vfprintf(stdout, fmt, args); + va_end(args); +} + static void sock_state_print(struct sockstat *s) { const char *sock_name; @@ -863,39 +873,39 @@ static void sock_state_print(struct sockstat *s) } if (netid_width) - printf("%-*s ", netid_width, - is_sctp_assoc(s, sock_name) ? "" : sock_name); + out("%-*s ", netid_width, + is_sctp_assoc(s, sock_name) ? 
"" : sock_name); if (state_width) { if (is_sctp_assoc(s, sock_name)) - printf("`- %-*s ", state_width - 3, - sctp_sstate_name[s->state]); + out("`- %-*s ", state_width - 3, + sctp_sstate_name[s->state]); else - printf("%-*s ", state_width, sstate_name[s->state]); + out("%-*s ", state_width, sstate_name[s->state]); } - printf("%-6d %-6d %s", s->rq, s->wq, odd_width_pad); + out("%-6d %-6d %s", s->rq, s->wq, odd_width_pad); } static void sock_details_print(struct sockstat *s) { if (s->uid) - printf(" uid:%u", s->uid); + out(" uid:%u", s->uid); - printf(" ino:%u", s->ino); - printf(" sk:%llx", s->sk); + out(" ino:%u", s->ino); + out(" sk:%llx", s->sk); if (s->mark) - printf(" fwmark:0x%x", s->mark); + out(" fwmark:0x%x", s->mark); } static void sock_addr_print_width(int addr_len, const char *addr, char *delim, int port_len, const char *port, const char *ifname) { if (ifname) { - printf("%*s%%%s%s%-*s ", addr_len, addr, ifname, delim, - port_len, port); + out("%*s%%%s%s%-*s ", addr_len, addr, ifname, delim, + port_len, port); } else { - printf("%*s%s%-*s ", addr_len, addr, delim, port_len, port); + out("%*s%s%-*s ", addr_len, addr, delim, port_len, port); } } @@ -1793,12 +1803,12 @@ static void proc_ctx_print(struct sockstat *s) if (find_entry(s->ino, &buf, (show_proc_ctx & show_sock_ctx) ? 
PROC_SOCK_CTX : PROC_CTX) > 0) { - printf(" users:(%s)", buf); + out(" users:(%s)", buf); free(buf); } } else if (show_users) { if (find_entry(s->ino, &buf, USERS) > 0) { - printf(" users:(%s)", buf); + out(" users:(%s)", buf); free(buf); } } @@ -1878,51 +1888,51 @@ static char *sprint_bw(char *buf, double bw) static void sctp_stats_print(struct sctp_info *s) { if (s->sctpi_tag) - printf(" tag:%x", s->sctpi_tag); + out(" tag:%x", s->sctpi_tag); if (s->sctpi_state) - printf(" state:%s", sctp_sstate_name[s->sctpi_state]); + out(" state:%s", sctp_sstate_name[s->sctpi_state]); if (s->sctpi_rwnd) - printf(" rwnd:%d", s->sctpi_rwnd); + out(" rwnd:%d", s->sctpi_rwnd); if (s->sctpi_unackdata) - printf(" unackdata:%d", s->sctpi_unackdata); + out(" unackdata:%d", s->sctpi_unackdata); if (s->sctpi_penddata) - printf(" penddata:%d", s->sctpi_penddata); + out(" penddata:%d", s->sctpi_penddata); if (s->sctpi_instrms) - printf(" instrms:%d", s->sctpi_instrms); + out(" instrms:%d", s->sctpi_instrms);
[PATCH iproute2 net-next v2 3/4] ss: Buffer raw fields first, then render them as a table
This allows us to measure the maximum field length for each column
before printing fields and will permit us to apply optimal field
spacing and distribution.

Structure of the output buffer with chunked allocation is described in
comments.

Output is still unchanged, original spacing is used.

Running over one million sockets with -tul options by simply modifying
main() to loop 50,000 times over the *_show() functions, buffering the
whole output and rendering it at the end, with 10 UDP sockets, 10 TCP
sockets, while throwing output away, doesn't show significant changes in
execution time on my laptop with an Intel i7-6600U CPU:

- before this patch:
$ time ./ss -tul > /dev/null
real	0m29.899s
user	0m2.017s
sys	0m27.801s

- after this patch:
$ time ./ss -tul > /dev/null
real	0m29.827s
user	0m1.942s
sys	0m27.812s

Signed-off-by: Stefano Brivio
Reviewed-by: Sabrina Dubroca
---
v2: rebase after conflict with 00ac78d39c29 ("ss: print tcpi_rcv_ssthresh")

 misc/ss.c | 271 +++---
 1 file changed, 225 insertions(+), 46 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 42310ba4120d..166267974c36 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -47,6 +47,8 @@
 #include

 #define MAGIC_SEQ 123456
+#define BUF_CHUNK (1024 * 1024)
+#define LEN_ALIGN(x) (((x) + 1) & ~1)

 #define DIAG_REQUEST(_req, _r)	\
 	struct {	\
@@ -127,24 +129,45 @@ struct column {
 	const char *header;
 	const char *ldelim;
 	int width;	/* Including delimiter. -1: fit to content, 0: hide */
-	int stored;	/* Characters buffered */
-	int printed;	/* Characters printed so far */
 };

 static struct column columns[] = {
-	{ ALIGN_LEFT,	"Netid",		"",	0,	0,	0 },
-	{ ALIGN_LEFT,	"State",		" ",	0,	0,	0 },
-	{ ALIGN_LEFT,	"Recv-Q",		" ",	7,	0,	0 },
-	{ ALIGN_LEFT,	"Send-Q",		" ",	7,	0,	0 },
-	{ ALIGN_RIGHT,	"Local Address:",	" ",	0,	0,	0 },
-	{ ALIGN_LEFT,	"Port",			"",	0,	0,	0 },
-	{ ALIGN_RIGHT,	"Peer Address:",	" ",	0,	0,	0 },
-	{ ALIGN_LEFT,	"Port",			"",	0,	0,	0 },
-	{ ALIGN_LEFT,	"",			"",	-1,	0,	0 },
+	{ ALIGN_LEFT,	"Netid",		"",	0 },
+	{ ALIGN_LEFT,	"State",		" ",	0 },
+	{ ALIGN_LEFT,	"Recv-Q",		" ",	7 },
+	{ ALIGN_LEFT,	"Send-Q",		" ",	7 },
+	{ ALIGN_RIGHT,	"Local Address:",	" ",	0 },
+	{ ALIGN_LEFT,	"Port",			"",	0 },
+	{ ALIGN_RIGHT,	"Peer Address:",	" ",	0 },
+	{ ALIGN_LEFT,	"Port",			"",	0 },
+	{ ALIGN_LEFT,	"",			"",	-1 },
 };

 static struct column *current_field = columns;
-static char field_buf[BUFSIZ];
+
+/* Output buffer: chained chunks of BUF_CHUNK bytes. Each field is written to
+ * the buffer as a variable size token. A token consists of a 16 bits length
+ * field, followed by a string which is not NULL-terminated.
+ *
+ * A new chunk is allocated and linked when the current chunk doesn't have
+ * enough room to store the current token as a whole.
+ */
+struct buf_chunk {
+	struct buf_chunk *next;	/* Next chained chunk */
+	char *end;		/* Current end of content */
+	char data[0];
+};
+
+struct buf_token {
+	uint16_t len;		/* Data length, excluding length descriptor */
+	char data[0];
+};
+
+static struct {
+	struct buf_token *cur;	/* Position of current token in chunk */
+	struct buf_chunk *head;	/* First chunk */
+	struct buf_chunk *tail;	/* Current chunk */
+} buffer;

 static const char *TCP_PROTO = "tcp";
 static const char *SCTP_PROTO = "sctp";
@@ -861,25 +884,109 @@ static const char *vsock_netid_name(int type)
 	}
 }

+/* Allocate and initialize a new buffer chunk */
+static struct buf_chunk *buf_chunk_new(void)
+{
+	struct buf_chunk *new = malloc(BUF_CHUNK);
+
+	if (!new)
+		abort();
+
+	new->next = NULL;
+
+	/* This is also the last block */
+	buffer.tail = new;
+
+	/* Next token will be stored at the beginning of chunk data area, and
+	 * its initial length is zero.
+	 */
+	buffer.cur = (struct buf_token *)new->data;
+	buffer.cur->len = 0;
+
+	new->end = buffer.cur->data;
+
+	return new;
+}
+
+/* Return available tail room in given chunk */
+static int buf_chunk_avail(struct buf_chunk *chunk)
+{
+	return BUF_CHUNK - offsetof(struct buf_chunk, data) -
+	       (chunk->end - chunk->data);
+}
+
+/* Update end pointer and
[PATCH iproute2 net-next v2 4/4] ss: Implement automatic column width calculation
Group fitting fields into lines and space them equally using the
remaining screen width for each line. If columns don't fit on one line,
break them into the least possible amount of lines and keep them aligned
across lines.

This is done by:
- recording the length of the longest item in each column during
  formatting and buffering (which was added in the previous patch)
- fitting as many fields as possible on each line of output
- distributing the remaining padding space equally between the columns

Signed-off-by: Stefano Brivio
Reviewed-by: Sabrina Dubroca
---
v2: rebase after conflict with 00ac78d39c29 ("ss: print tcpi_rcv_ssthresh")

 misc/ss.c | 188 +++---
 1 file changed, 120 insertions(+), 68 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 166267974c36..9d21ed7a0705 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -128,19 +128,21 @@ struct column {
 	const enum col_align align;
 	const char *header;
 	const char *ldelim;
-	int width;	/* Including delimiter. -1: fit to content, 0: hide */
+	int disabled;
+	int width;	/* Calculated, including additional layout spacing */
+	int max_len;	/* Measured maximum field length in this column */
 };

 static struct column columns[] = {
-	{ ALIGN_LEFT,	"Netid",		"",	0 },
-	{ ALIGN_LEFT,	"State",		" ",	0 },
-	{ ALIGN_LEFT,	"Recv-Q",		" ",	7 },
-	{ ALIGN_LEFT,	"Send-Q",		" ",	7 },
-	{ ALIGN_RIGHT,	"Local Address:",	" ",	0 },
-	{ ALIGN_LEFT,	"Port",			"",	0 },
-	{ ALIGN_RIGHT,	"Peer Address:",	" ",	0 },
-	{ ALIGN_LEFT,	"Port",			"",	0 },
-	{ ALIGN_LEFT,	"",			"",	-1 },
+	{ ALIGN_LEFT,	"Netid",		"",	0, 0, 0 },
+	{ ALIGN_LEFT,	"State",		" ",	0, 0, 0 },
+	{ ALIGN_LEFT,	"Recv-Q",		" ",	0, 0, 0 },
+	{ ALIGN_LEFT,	"Send-Q",		" ",	0, 0, 0 },
+	{ ALIGN_RIGHT,	"Local Address:",	" ",	0, 0, 0 },
+	{ ALIGN_LEFT,	"Port",			"",	0, 0, 0 },
+	{ ALIGN_RIGHT,	"Peer Address:",	" ",	0, 0, 0 },
+	{ ALIGN_LEFT,	"Port",			"",	0, 0, 0 },
+	{ ALIGN_LEFT,	"",			"",	0, 0, 0 },
 };

 static struct column *current_field = columns;
@@ -960,7 +962,7 @@ static void out(const char *fmt, ...)
 	char *pos;
 	int len;

-	if (!f->width)
+	if (f->disabled)
 		return;

 	if (!buffer.head)
@@ -983,7 +985,7 @@ static int print_left_spacing(struct column *f, int stored, int printed)
 {
 	int s;

-	if (f->width < 0 || f->align == ALIGN_LEFT)
+	if (!f->width || f->align == ALIGN_LEFT)
 		return 0;

 	s = f->width - stored - printed;
@@ -1001,7 +1003,7 @@ static void print_right_spacing(struct column *f, int printed)
 {
 	int s;

-	if (f->width < 0 || f->align == ALIGN_RIGHT)
+	if (!f->width || f->align == ALIGN_RIGHT)
 		return;

 	s = f->width - printed;
@@ -1018,9 +1020,12 @@ static void field_flush(struct column *f)
 	struct buf_chunk *chunk = buffer.tail;
 	unsigned int pad = buffer.cur->len % 2;

-	if (!f->width)
+	if (f->disabled)
 		return;

+	if (buffer.cur->len > f->max_len)
+		f->max_len = buffer.cur->len;
+
 	/* We need a new chunk if we can't store the next length descriptor.
 	 * Mind the gap between end of previous token and next aligned position
 	 * for length descriptor.
@@ -1063,7 +1068,7 @@ static void field_set(enum col_id id)
 static void print_header(void)
 {
 	while (!field_is_last(current_field)) {
-		if (current_field->width)
+		if (!current_field->disabled)
 			out(current_field->header);
 		field_next();
 	}
@@ -1096,16 +1101,106 @@ static void buf_free_all(void)
 	buffer.head = NULL;
 }

+/* Calculate column width from contents length. If columns don't fit on one
+ * line, break them into the least possible amount of lines and keep them
+ * aligned across lines. Available screen space is equally spread between fields
+ * as additional spacing.
+ */
+static void render_calc_width(int screen_width)
+{
+	int first, len = 0, linecols = 0;
+	struct column *c, *eol = columns - 1;
+
+	/* First pass: set width for each column to measured content length */
+	for (first = 1, c = columns; c - columns < COL_MAX; c++) {
+		if (c->disabled)
+			continue;
+
+		if (!first && c->max_len)
+			c->width = c->max_len + strlen(c->ldelim);
+		else
+			c->width = c->max
[PATCH iproute2 net-next v2 2/4] ss: Introduce columns lightweight abstraction
Instead of embedding spacing directly while printing contents,
logically declare columns and functions to buffer their content, to
print left and right spacing around fields, to flush them to screen,
and to print headers.

This makes it a bit easier to handle layout changes and prepares for
full output buffering, needed for optimal spacing in field output
layout.

Columns are currently set up to retain exactly the same output as
before. This needs some slight adjustments of the values previously
calculated in main(), as the width value introduced here already
includes the width of left delimiters and spacing is not explicitly
printed anymore whenever a field is printed. These calculations will go
away altogether once automatic width calculation is implemented.

We can also remove explicit printing of newlines after the final
content for a given line is printed: flushing the last field on a line
will cause field_flush() to print newlines where appropriate.

No changes in output expected here.

Signed-off-by: Stefano Brivio
Reviewed-by: Sabrina Dubroca
---
v2: rebase after conflict with 00ac78d39c29 ("ss: print tcpi_rcv_ssthresh")

 misc/ss.c | 291 ++
 1 file changed, 198 insertions(+), 93 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index a7d3b89e1478..42310ba4120d 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -103,11 +103,48 @@ int show_header = 1;
 int follow_events;
 int sctp_ino;

-int netid_width;
-int state_width;
-int addr_width;
-int serv_width;
-char *odd_width_pad = "";
+enum col_id {
+	COL_NETID,
+	COL_STATE,
+	COL_RECVQ,
+	COL_SENDQ,
+	COL_ADDR,
+	COL_SERV,
+	COL_RADDR,
+	COL_RSERV,
+	COL_EXT,
+	COL_MAX
+};
+
+enum col_align {
+	ALIGN_LEFT,
+	ALIGN_CENTER,
+	ALIGN_RIGHT
+};
+
+struct column {
+	const enum col_align align;
+	const char *header;
+	const char *ldelim;
+	int width;	/* Including delimiter. -1: fit to content, 0: hide */
+	int stored;	/* Characters buffered */
+	int printed;	/* Characters printed so far */
+};
+
+static struct column columns[] = {
+	{ ALIGN_LEFT,	"Netid",		"",	0,	0,	0 },
+	{ ALIGN_LEFT,	"State",		" ",	0,	0,	0 },
+	{ ALIGN_LEFT,	"Recv-Q",		" ",	7,	0,	0 },
+	{ ALIGN_LEFT,	"Send-Q",		" ",	7,	0,	0 },
+	{ ALIGN_RIGHT,	"Local Address:",	" ",	0,	0,	0 },
+	{ ALIGN_LEFT,	"Port",			"",	0,	0,	0 },
+	{ ALIGN_RIGHT,	"Peer Address:",	" ",	0,	0,	0 },
+	{ ALIGN_LEFT,	"Port",			"",	0,	0,	0 },
+	{ ALIGN_LEFT,	"",			"",	-1,	0,	0 },
+};
+
+static struct column *current_field = columns;
+static char field_buf[BUFSIZ];

 static const char *TCP_PROTO = "tcp";
 static const char *SCTP_PROTO = "sctp";
@@ -826,13 +863,113 @@ static const char *vsock_netid_name(int type)

 static void out(const char *fmt, ...)
 {
+	struct column *f = current_field;
 	va_list args;

 	va_start(args, fmt);
-	vfprintf(stdout, fmt, args);
+	f->stored += vsnprintf(field_buf + f->stored, BUFSIZ - f->stored,
+			       fmt, args);
 	va_end(args);
 }

+static int print_left_spacing(struct column *f)
+{
+	int s;
+
+	if (f->width < 0 || f->align == ALIGN_LEFT)
+		return 0;
+
+	s = f->width - f->stored - f->printed;
+	if (f->align == ALIGN_CENTER)
+		/* If count of total spacing is odd, shift right by one */
+		s = (s + 1) / 2;
+
+	if (s > 0)
+		return printf("%*c", s, ' ');
+
+	return 0;
+}
+
+static void print_right_spacing(struct column *f)
+{
+	int s;
+
+	if (f->width < 0 || f->align == ALIGN_RIGHT)
+		return;
+
+	s = f->width - f->printed;
+	if (f->align == ALIGN_CENTER)
+		s /= 2;
+
+	if (s > 0)
+		printf("%*c", s, ' ');
+}
+
+static int field_needs_delimiter(struct column *f)
+{
+	if (!f->stored)
+		return 0;
+
+	/* Was another field already printed on this line? */
+	for (f--; f >= columns; f--)
+		if (f->width)
+			return 1;
+
+	return 0;
+}
+
+/* Flush given field to screen together with delimiter and spacing */
+static void field_flush(struct column *f)
+{
+	if (!f->width)
+		return;
+
+	if (field_needs_delimiter(f))
+		f->printed = printf("%s", f->ldelim);
+
+	f->printed += print_left_spacing(f);
+	f->printed += printf("%s", field_buf);
+	print_right_spacing(f);
+
+	*field_buf = 0;
+	f->printed = 0;
+	f->stored = 0;
+}
+
+static int field_is_last(struct column *f)
+{
+
[PATCH iproute2 net-next v2 0/4] Abstract columns, properly space and wrap fields
Currently, 'ss' simply subdivides the whole available screen width
between available columns, starting from a hardcoded amount of spacing
and growing column widths from there.

This makes the output unreadable in several cases, as it doesn't take
into account the actual content width.

Fix this by introducing a simple abstraction for columns, buffering the
output, measuring the width of the fields, grouping fields into lines as
they fit, equally distributing any remaining whitespace, and finally
rendering the result. Some examples are reported below [1].

This implementation doesn't seem to cause any significant performance
issues, as reported in 3/4.

Patch 1/4 replaces all relevant printf() calls with the out() helper,
which for now is just a thin wrapper around the usual printf()
implementation.

Patch 2/4 implements the column abstraction, with configurable column
width and delimiters, and 3/4 splits the buffering and rendering phases,
employing a simple buffering mechanism with chunked allocation and
introducing a rendering function.

Up to this point, the output is still unchanged.

Finally, 4/4 introduces field width calculation based on the content
length measured while buffering, in order to split fields onto multiple
lines and space them equally within each line.

Now that column behaviour is well-defined and more easily configurable,
it should be easier to further improve the output by splitting logically
separable information (e.g. TCP details) into additional columns.
However, this patchset keeps the full "extended" information in a single
column for the time being.

v2: rebase after conflict with 00ac78d39c29 ("ss: print tcpi_rcv_ssthresh") [1] - 80 columns terminal, ss -Z -f netlink * before: Recv-Q Send-Q Local Address:Port Peer Address:Port 0 0rtnl:evolution-calen/2075 * pr oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 0 0rtnl:abrt-applet/32700 * pr oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 0 0rtnl:firefox/21619 * pr oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 0 0rtnl:evolution-calen/32639 * p roc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 [...] * after: Recv-Q Send-Q Local Address:Port Peer Address:Port 00 rtnl:evolution-calen/2075 * proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 00 rtnl:abrt-applet/32700 * proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 00 rtnl:firefox/21619 * proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 00 rtnl:evolution-calen/32639 * proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 [...] - 80 columns terminal, ss -tunpl * before: Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port udpUNCONN 0 0 *:37732 *:* udpUNCONN 0 0 *:5353 *:* udpUNCONN 0 0 192.168.122.1:53*:* udpUNCONN 0 0 *%virbr0:67*:* [...] * after: Netid StateRecv-Q Send-Q Local Address:Port Peer Address:Port udp UNCONN 00 *:37732*:* udp UNCONN 00 *:5353 *:* udp UNCONN 00 192.168.122.1:53 *:* udp UNCONN 00 *%virbr0:67 *:* [...] - 66 columns terminal, ss -tunpl * before: Netid State Recv-Q Send-Q Local Address:Port P eer Address:Port udpUNCONN 0 0 *:37732 *:* udpUNCONN 0 0 *:5353*:* udpUNCONN 0 0 192.168.122.1:53 *:* udpUNCONN 0 0 *%virbr0:67 *:* [...] * after: Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port udp UNCONN 0 0 *:37732 *:* udp UNCONN 0 0 *:5353 *:* udp UNCONN 0 0 192.168.122.1:53*:* udp UNCONN 0 0 *%virbr0:67*:* [...] 

Stefano Brivio (4):
  ss: Replace printf() calls for "main" output by calls to helper
  ss: Introduce columns lightweight abstraction
  ss: Buffer raw fields first, then render them as a table
  ss: Implement automatic column width calculation

 misc/ss.c | 895 +++---
 1 file changed, 621 insertions(+), 274 deletions(-)

-- 
2.9.4
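
To illustrate the width calculation described in this cover letter, here is a minimal sketch in C. It is not the code from patch 4/4: struct demo_col and demo_calc_width() are hypothetical names, and the sketch makes simplifying assumptions (it only lays out the first output line and ignores delimiters and alignment). It shows the two steps named above: take columns for as long as their measured content fits, then spread the leftover space equally.

```c
#include <stddef.h>

/* Hypothetical, simplified column descriptor; not the struct from misc/ss.c. */
struct demo_col {
	int max_len;	/* longest field measured while buffering */
	int width;	/* calculated width, including extra spacing */
};

/* Lay out the first line of a screen_width wide terminal.
 * Returns the number of columns that fit on that line.
 */
static int demo_calc_width(struct demo_col *cols, size_t n, int screen_width)
{
	int used = 0, fit = 0, pad, rest;
	size_t i;

	/* First pass: take columns as long as their content still fits. */
	for (i = 0; i < n; i++) {
		if (used + cols[i].max_len > screen_width)
			break;
		cols[i].width = cols[i].max_len;
		used += cols[i].max_len;
		fit++;
	}

	if (!fit)
		return 0;

	/* Second pass: distribute the remaining space equally. If the
	 * division is not exact, the leading columns get one extra character.
	 */
	pad = (screen_width - used) / fit;
	rest = (screen_width - used) % fit;
	for (i = 0; i < (size_t)fit; i++)
		cols[i].width += pad + ((int)i < rest ? 1 : 0);

	return fit;
}
```

With measured lengths 5, 6 and 20 on a 40 character line, all three columns fit and each gets 3 characters of extra spacing. On a 20 character line only the first two fit, and the third would have to go on a following line.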
Re: [PATCH] igb: Free IRQs when device is hotplugged
On Mon, 2017-12-11 at 16:34 -0800, Stephen Hemminger wrote: > On Mon, 11 Dec 2017 18:45:02 -0500 > Lyude Paul wrote: > > > Recently I got a Caldigit TS3 Thunderbolt 3 dock, and noticed that upon > > hotplugging my kernel would immediately crash due to igb: > > > > [ 680.825801] kernel BUG at drivers/pci/msi.c:352! > > [ 680.828388] invalid opcode: [#1] SMP > > [ 680.829194] Modules linked in: igb(O) thunderbolt i2c_algo_bit joydev > > vfat fat btusb btrtl btbcm btintel bluetooth ecdh_generic hp_wmi > > sparse_keymap rfkill wmi_bmof iTCO_wdt intel_rapl x86_pkg_temp_thermal > > coretemp crc32_pclmul snd_pcm rtsx_pci_ms mei_me snd_timer memstick snd > > pcspkr mei soundcore i2c_i801 tpm_tis psmouse shpchp wmi tpm_tis_core tpm > > video hp_wireless acpi_pad rtsx_pci_sdmmc mmc_core crc32c_intel serio_raw > > rtsx_pci mfd_core xhci_pci xhci_hcd i2c_hid i2c_core [last unloaded: igb] > > [ 680.831085] CPU: 1 PID: 78 Comm: kworker/u16:1 Tainted: > > G O 4.15.0-rc3Lyude-Test+ #6 > > [ 680.831596] Hardware name: HP HP ZBook Studio G4/826B, BIOS P71 Ver. 
> > 01.03 06/09/2017 > > [ 680.832168] Workqueue: kacpi_hotplug acpi_hotplug_work_fn > > [ 680.832687] RIP: 0010:free_msi_irqs+0x180/0x1b0 > > [ 680.833271] RSP: 0018:c930fbf0 EFLAGS: 00010286 > > [ 680.833761] RAX: 8803405f9c00 RBX: 88033e3d2e40 RCX: > > 002c > > [ 680.834278] RDX: RSI: 00ac RDI: > > 880340be2178 > > [ 680.834832] RBP: R08: 880340be1ff0 R09: > > 8803405f9c00 > > [ 680.835342] R10: R11: 0040 R12: > > 88033d63a298 > > [ 680.835822] R13: 88033d63a000 R14: 0060 R15: > > 880341959000 > > [ 680.836332] FS: () GS:88034f44() > > knlGS: > > [ 680.836817] CS: 0010 DS: ES: CR0: 80050033 > > [ 680.837360] CR2: 55e64044afdf CR3: 01c09002 CR4: > > 003606e0 > > [ 680.837954] Call Trace: > > [ 680.838853] pci_disable_msix+0xce/0xf0 > > [ 680.839616] igb_reset_interrupt_capability+0x5d/0x60 [igb] > > [ 680.840278] igb_remove+0x9d/0x110 [igb] > > [ 680.840764] pci_device_remove+0x36/0xb0 > > [ 680.841279] device_release_driver_internal+0x157/0x220 > > [ 680.841739] pci_stop_bus_device+0x7d/0xa0 > > [ 680.842255] pci_stop_bus_device+0x2b/0xa0 > > [ 680.842722] pci_stop_bus_device+0x3d/0xa0 > > [ 680.843189] pci_stop_and_remove_bus_device+0xe/0x20 > > [ 680.843627] trim_stale_devices+0xf3/0x140 > > [ 680.844086] trim_stale_devices+0x94/0x140 > > [ 680.844532] trim_stale_devices+0xa6/0x140 > > [ 680.845031] ? get_slot_status+0x90/0xc0 > > [ 680.845536] acpiphp_check_bridge.part.5+0xfe/0x140 > > [ 680.846021] acpiphp_hotplug_notify+0x175/0x200 > > [ 680.846581] ? free_bridge+0x100/0x100 > > [ 680.847113] acpi_device_hotplug+0x8a/0x490 > > [ 680.847535] acpi_hotplug_work_fn+0x1a/0x30 > > [ 680.848076] process_one_work+0x182/0x3a0 > > [ 680.848543] worker_thread+0x2e/0x380 > > [ 680.848963] ? process_one_work+0x3a0/0x3a0 > > [ 680.849373] kthread+0x111/0x130 > > [ 680.849776] ? 
kthread_create_worker_on_cpu+0x50/0x50 > > [ 680.850188] ret_from_fork+0x1f/0x30 > > [ 680.850601] Code: 43 14 85 c0 0f 84 d5 fe ff ff 31 ed eb 0f 83 c5 01 39 > > 6b 14 0f 86 c5 fe ff ff 8b 7b 10 01 ef e8 b7 e4 d2 ff 48 83 78 70 00 74 e3 > > <0f> 0b 49 8d b5 a0 00 00 00 e8 62 6f d3 ff e9 c7 fe ff ff 48 8b > > [ 680.851497] RIP: free_msi_irqs+0x180/0x1b0 RSP: c930fbf0 > > > > As it turns out, normally the freeing of IRQs that would fix this is called > > inside of the scope of __igb_close(). However, since the device is > > already gone by the point we try to unregister the netdevice from the > > driver due to a hotplug we end up seeing that the netif isn't present > > and thus, forget to free any of the device IRQs. > > > > So: after unregistering the netdev in igb_remove() check whether the PCI > > device is stale and if so, free it's IRQs and tx/rx resources. > > > > Signed-off-by: Lyude Paul > > Fixes: 9474933caf21 ("igb: close/suspend race in netif_device_detach") > > Cc: Todd Fujinaka > > Cc: sta...@vger.kernel.org > > --- > > drivers/net/ethernet/intel/igb/igb_main.c | 10 ++ > > 1 file changed, 10 insertions(+) > > > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c > > b/drivers/net/ethernet/intel/igb/igb_main.c > > index c208753ff5b7..e650348b4bd7 100644 > > --- a/drivers/net/ethernet/intel/igb/igb_main.c > > +++ b/drivers/net/ethernet/intel/igb/igb_main.c > > @@ -3325,6 +3325,16 @@ static void igb_remove(struct pci_dev *pdev) > > > > unregister_netdev(netdev); > > > > + /* If the PCI device has already been physically removed (e.g. user > > +* unplugged a thunderbolt dock containing our hw) then the netif > > will > > +* already be down, so unregistering the netdev won't free the IRQs > > +*/ > > + if (!pci_device_is_present(pd
RE: [PATCH] Fix handling of verdicts after NF_QUEUE
> From: Pablo Neira Ayuso [mailto:pa...@netfilter.org]
> On Mon, Dec 11, 2017 at 06:30:24PM -0500, Debabrata Banerjee wrote:
> > +	} else {
> > +		/* Implicit handling for NF_STOLEN, as well as any other
> > +		 * non conventional verdicts.
> > +		 */
> > +		ret = 0;
>
> Another possibility (more simple?) would be this:
>
> int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state) {
>  	struct nf_hook_entry *entry;
>  	unsigned int verdict;
> -	int ret = 0;
> +	int ret;
>
>  	entry = rcu_dereference(state->hook_entries);
>  next_hook:
> +	ret = 0;
>
> Basically, make sure ret is set to zero when jumping to the next_hook label.

Many ways to fix it, but I thought including the comment was appropriate.
Happy to change it if we want simpler instead.

-Deb
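
For readers following the thread, the "reset ret at the label" variant can be sketched outside the kernel as below. This is a toy model under stated assumptions, not nf_hook_slow(): the demo_verdict values and demo_hook_slow() are hypothetical, and DEMO_REQUEUE merely stands in for a queue handler that asks the loop to continue with the next hook.

```c
/* Hypothetical verdicts for illustration only; not the netfilter values. */
enum demo_verdict { DEMO_ACCEPT, DEMO_REQUEUE, DEMO_STOLEN };

/* Walk verdicts[0..n-1]. Returns 1 if the caller should continue processing
 * the packet, 0 if a hook took ownership of it.
 */
static int demo_hook_slow(const enum demo_verdict *verdicts, int n)
{
	int i = 0;
	int ret;

next_hook:
	/* Reset on every pass, so a value left over from a previous
	 * iteration can't leak into verdicts that don't set ret themselves.
	 */
	ret = 0;

	if (i >= n)
		return ret;

	switch (verdicts[i++]) {
	case DEMO_ACCEPT:
		ret = 1;
		break;
	case DEMO_REQUEUE:
		/* Stand-in for a queue handler that asks to continue. */
		ret = 1;
		if (i < n)
			goto next_hook;
		break;
	case DEMO_STOLEN:
		/* Implicit handling: ret stays 0. */
		break;
	}

	return ret;
}
```

Without the reset at the label, a DEMO_REQUEUE followed by DEMO_STOLEN would return the stale 1 from the first pass. That is the class of bug both proposals in this thread address.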
Re: [PATCH] igb: Free IRQs when device is hotplugged
On Mon, 11 Dec 2017 18:45:02 -0500 Lyude Paul wrote: > Recently I got a Caldigit TS3 Thunderbolt 3 dock, and noticed that upon > hotplugging my kernel would immediately crash due to igb: > > [ 680.825801] kernel BUG at drivers/pci/msi.c:352! > [ 680.828388] invalid opcode: [#1] SMP > [ 680.829194] Modules linked in: igb(O) thunderbolt i2c_algo_bit joydev vfat > fat btusb btrtl btbcm btintel bluetooth ecdh_generic hp_wmi sparse_keymap > rfkill wmi_bmof iTCO_wdt intel_rapl x86_pkg_temp_thermal coretemp > crc32_pclmul snd_pcm rtsx_pci_ms mei_me snd_timer memstick snd pcspkr mei > soundcore i2c_i801 tpm_tis psmouse shpchp wmi tpm_tis_core tpm video > hp_wireless acpi_pad rtsx_pci_sdmmc mmc_core crc32c_intel serio_raw rtsx_pci > mfd_core xhci_pci xhci_hcd i2c_hid i2c_core [last unloaded: igb] > [ 680.831085] CPU: 1 PID: 78 Comm: kworker/u16:1 Tainted: G O > 4.15.0-rc3Lyude-Test+ #6 > [ 680.831596] Hardware name: HP HP ZBook Studio G4/826B, BIOS P71 Ver. 01.03 > 06/09/2017 > [ 680.832168] Workqueue: kacpi_hotplug acpi_hotplug_work_fn > [ 680.832687] RIP: 0010:free_msi_irqs+0x180/0x1b0 > [ 680.833271] RSP: 0018:c930fbf0 EFLAGS: 00010286 > [ 680.833761] RAX: 8803405f9c00 RBX: 88033e3d2e40 RCX: > 002c > [ 680.834278] RDX: RSI: 00ac RDI: > 880340be2178 > [ 680.834832] RBP: R08: 880340be1ff0 R09: > 8803405f9c00 > [ 680.835342] R10: R11: 0040 R12: > 88033d63a298 > [ 680.835822] R13: 88033d63a000 R14: 0060 R15: > 880341959000 > [ 680.836332] FS: () GS:88034f44() > knlGS: > [ 680.836817] CS: 0010 DS: ES: CR0: 80050033 > [ 680.837360] CR2: 55e64044afdf CR3: 01c09002 CR4: > 003606e0 > [ 680.837954] Call Trace: > [ 680.838853] pci_disable_msix+0xce/0xf0 > [ 680.839616] igb_reset_interrupt_capability+0x5d/0x60 [igb] > [ 680.840278] igb_remove+0x9d/0x110 [igb] > [ 680.840764] pci_device_remove+0x36/0xb0 > [ 680.841279] device_release_driver_internal+0x157/0x220 > [ 680.841739] pci_stop_bus_device+0x7d/0xa0 > [ 680.842255] pci_stop_bus_device+0x2b/0xa0 > [ 680.842722] 
pci_stop_bus_device+0x3d/0xa0 > [ 680.843189] pci_stop_and_remove_bus_device+0xe/0x20 > [ 680.843627] trim_stale_devices+0xf3/0x140 > [ 680.844086] trim_stale_devices+0x94/0x140 > [ 680.844532] trim_stale_devices+0xa6/0x140 > [ 680.845031] ? get_slot_status+0x90/0xc0 > [ 680.845536] acpiphp_check_bridge.part.5+0xfe/0x140 > [ 680.846021] acpiphp_hotplug_notify+0x175/0x200 > [ 680.846581] ? free_bridge+0x100/0x100 > [ 680.847113] acpi_device_hotplug+0x8a/0x490 > [ 680.847535] acpi_hotplug_work_fn+0x1a/0x30 > [ 680.848076] process_one_work+0x182/0x3a0 > [ 680.848543] worker_thread+0x2e/0x380 > [ 680.848963] ? process_one_work+0x3a0/0x3a0 > [ 680.849373] kthread+0x111/0x130 > [ 680.849776] ? kthread_create_worker_on_cpu+0x50/0x50 > [ 680.850188] ret_from_fork+0x1f/0x30 > [ 680.850601] Code: 43 14 85 c0 0f 84 d5 fe ff ff 31 ed eb 0f 83 c5 01 39 6b > 14 0f 86 c5 fe ff ff 8b 7b 10 01 ef e8 b7 e4 d2 ff 48 83 78 70 00 74 e3 <0f> > 0b 49 8d b5 a0 00 00 00 e8 62 6f d3 ff e9 c7 fe ff ff 48 8b > [ 680.851497] RIP: free_msi_irqs+0x180/0x1b0 RSP: c930fbf0 > > As it turns out, normally the freeing of IRQs that would fix this is called > inside of the scope of __igb_close(). However, since the device is > already gone by the point we try to unregister the netdevice from the > driver due to a hotplug we end up seeing that the netif isn't present > and thus, forget to free any of the device IRQs. > > So: after unregistering the netdev in igb_remove() check whether the PCI > device is stale and if so, free it's IRQs and tx/rx resources. 
> > Signed-off-by: Lyude Paul > Fixes: 9474933caf21 ("igb: close/suspend race in netif_device_detach") > Cc: Todd Fujinaka > Cc: sta...@vger.kernel.org > --- > drivers/net/ethernet/intel/igb/igb_main.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c > b/drivers/net/ethernet/intel/igb/igb_main.c > index c208753ff5b7..e650348b4bd7 100644 > --- a/drivers/net/ethernet/intel/igb/igb_main.c > +++ b/drivers/net/ethernet/intel/igb/igb_main.c > @@ -3325,6 +3325,16 @@ static void igb_remove(struct pci_dev *pdev) > > unregister_netdev(netdev); > > + /* If the PCI device has already been physically removed (e.g. user > + * unplugged a thunderbolt dock containing our hw) then the netif will > + * already be down, so unregistering the netdev won't free the IRQs > + */ > + if (!pci_device_is_present(pdev)) { > + igb_free_irq(adapter); > + igb_free_all_tx_resources(adapter); > + igb_free_all_rx_resources(adapter); > + } > + > igb_clear_interrupt_scheme(adapter); > > pci_iounm
[PATCH net-next v3 5/6] net: qualcomm: rmnet: Allow to configure flags for new devices
Add an option to configure the rmnet aggregation and command features
on device creation. This is achieved by using the vlan flags option.

Signed-off-by: Subash Abhinov Kasiviswanathan
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index 46bb228..7a4c26e 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -177,11 +177,20 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev,
 	if (err)
 		goto err2;

-	netdev_dbg(dev, "data format [ingress 0x%08X]\n", ingress_format);
-	port->ingress_data_format = ingress_format;
 	port->rmnet_mode = mode;

 	hlist_add_head_rcu(&ep->hlnode, &port->muxed_ep[mux_id]);
+
+	if (data[IFLA_VLAN_FLAGS]) {
+		struct ifla_vlan_flags *flags;
+
+		flags = nla_data(data[IFLA_VLAN_FLAGS]);
+		ingress_format = flags->flags & flags->mask;
+	}
+
+	netdev_dbg(dev, "data format [ingress 0x%08X]\n", ingress_format);
+	port->ingress_data_format = ingress_format;
+
 	return 0;

 err2:
@@ -313,7 +322,8 @@ static int rmnet_rtnl_validate(struct nlattr *tb[], struct nlattr *data[],

 static size_t rmnet_get_size(const struct net_device *dev)
 {
-	return nla_total_size(2); /* IFLA_VLAN_ID */
+	return nla_total_size(2) /* IFLA_VLAN_ID */ +
+	       nla_total_size(sizeof(struct ifla_vlan_flags)); /* IFLA_VLAN_FLAGS */
 }

 struct rtnl_link_ops rmnet_link_ops __read_mostly = {
-- 
1.9.1
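
The flags/mask handling in the first hunk above can be sketched in userspace as follows. This is only an illustration under stated assumptions: struct demo_vlan_flags copies the two-field layout of struct ifla_vlan_flags so that the sketch builds without kernel headers, and demo_ingress_format() and the DEMO_* bit values are hypothetical names, not part of the driver.

```c
#include <stddef.h>
#include <stdint.h>

/* Same two-field layout as struct ifla_vlan_flags, redefined here so the
 * sketch builds without kernel headers.
 */
struct demo_vlan_flags {
	uint32_t flags;
	uint32_t mask;
};

/* Hypothetical feature bits, for illustration only. */
#define DEMO_FORMAT_DEAGGREGATION	(1u << 2)
#define DEMO_FORMAT_MAP_COMMANDS	(1u << 4)

/* Return the ingress format to use: the default if no flags attribute was
 * passed, otherwise only the flag bits that are also set in the mask.
 */
static uint32_t demo_ingress_format(uint32_t def,
				    const struct demo_vlan_flags *f)
{
	if (!f)
		return def;

	return f->flags & f->mask;
}
```

Note that, as in the hunk above, a supplied attribute replaces the default outright rather than being merged into it, so a flag bit whose mask bit is clear is dropped.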
[PATCH net-next v3 1/6] net: qualcomm: rmnet: Remove the rmnet_map_results enum
Only the success and consumed entries were actually in use.
Use standard error codes instead.

Signed-off-by: Subash Abhinov Kasiviswanathan
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 15 +++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h | 9 -
 2 files changed, 3 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 08e4afc..1e1ea10 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -142,11 +142,11 @@ static int rmnet_map_egress_handler(struct sk_buff *skb,

 	skb->protocol = htons(ETH_P_MAP);

-	return RMNET_MAP_SUCCESS;
+	return 0;

 fail:
 	kfree_skb(skb);
-	return RMNET_MAP_CONSUMED;
+	return -ENOMEM;
 }

 static void
@@ -213,17 +213,8 @@ void rmnet_egress_handler(struct sk_buff *skb)
 	}

 	if (port->egress_data_format & RMNET_EGRESS_FORMAT_MAP) {
-		switch (rmnet_map_egress_handler(skb, port, mux_id, orig_dev)) {
-		case RMNET_MAP_CONSUMED:
+		if (rmnet_map_egress_handler(skb, port, mux_id, orig_dev))
 			return;
-
-		case RMNET_MAP_SUCCESS:
-			break;
-
-		default:
-			kfree_skb(skb);
-			return;
-		}
 	}

 	rmnet_vnd_tx_fixup(skb, orig_dev);
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index 3af3fe7..4df359d 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -30,15 +30,6 @@ struct rmnet_map_control_command {
 	};
 } __aligned(1);

-enum rmnet_map_results {
-	RMNET_MAP_SUCCESS,
-	RMNET_MAP_CONSUMED,
-	RMNET_MAP_GENERAL_FAILURE,
-	RMNET_MAP_NOT_ENABLED,
-	RMNET_MAP_FAILED_AGGREGATION,
-	RMNET_MAP_FAILED_MUX
-};
-
 enum rmnet_map_commands {
 	RMNET_MAP_COMMAND_NONE,
 	RMNET_MAP_COMMAND_FLOW_DISABLE,
-- 
1.9.1
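
The calling convention this patch switches to (0 on success, a negative errno after the callee has already freed the buffer) can be sketched in plain C as below. The sketch is an assumption-laden stand-in, not driver code: struct demo_buf, demo_map_egress() and demo_egress() are hypothetical, and a free counter replaces kfree_skb() so that buffer ownership can be checked.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a packet buffer; not struct sk_buff. */
struct demo_buf {
	int headroom;
};

static int demo_freed;	/* counts frees so ownership can be checked */

static void demo_free(struct demo_buf *buf)
{
	free(buf);
	demo_freed++;
}

/* Reserve 'needed' bytes of headroom. Returns 0 on success. On failure the
 * buffer is freed here and a negative errno value is returned, so the
 * caller must not touch it again.
 */
static int demo_map_egress(struct demo_buf *buf, int needed)
{
	if (buf->headroom < needed) {
		demo_free(buf);
		return -ENOMEM;
	}

	buf->headroom -= needed;
	return 0;
}

/* Returns 1 if the buffer was "transmitted", 0 if it was dropped. */
static int demo_egress(struct demo_buf *buf, int needed)
{
	if (demo_map_egress(buf, needed))
		return 0;	/* already freed by the callee */

	demo_free(buf);		/* stand-in for handing the buffer to the device */
	return 1;
}
```

Because every failure path in the callee frees the buffer, the caller can treat any non-zero return the same way, which is why the switch statement with its extra kfree_skb() in the default case is no longer needed.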
[PATCH net-next v3 0/6] net: qualcomm: rmnet: Configuration options
This series adds support for configuring features on rmnet devices.
The rmnet-specific features to be configured here are aggregation and
control commands.

Patch 1 is a cleanup of return codes in the transmit path.
Patch 2 removes some redundant ingress and egress macros.
Patch 3 restricts the creation of rmnet dev to one dev per mux id for a
given real dev.
Patch 4 adds ethernet data path support.
Patches 5-6 add support for configuring features on new and existing
rmnet devices.

v1->v2: The memory leak fixed as part of patch 1 was merged separately as
a896d94abd2c ("net: qualcomm: rmnet: Fix leak on transmit failure").
Fix a use after free in patch 4 if a packet with headroom smaller than
the ethernet header length is received.

v2->v3: Fix formatting problem in patch 5 in the return statement.

Subash Abhinov Kasiviswanathan (6):
  net: qualcomm: rmnet: Remove the rmnet_map_results enum
  net: qualcomm: rmnet: Remove the some redundant macros
  net: qualcomm: rmnet: Allow only one rmnet dev per muxid per real dev
  net: qualcomm: rmnet: Process packets over ethernet
  net: qualcomm: rmnet: Allow to configure flags for new devices
  net: qualcomm: rmnet: Allow to configure flags for existing devices

 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 64 ++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 1 -
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 42 +++---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h | 9 ---
 .../net/ethernet/qualcomm/rmnet/rmnet_private.h | 10 +---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c | 3 +
 6 files changed, 78 insertions(+), 51 deletions(-)

-- 
1.9.1
[PATCH net-next v3 2/6] net: qualcomm: rmnet: Remove the some redundant macros
Multiplexing is always enabled when transmitting from a rmnet device,
so remove the redundant egress macros. De-multiplexing is always
enabled when receiving packets from a rmnet device, so remove those
ingress macros.

Signed-off-by: Subash Abhinov Kasiviswanathan
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 10 ++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h | 1 -
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 19 +++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h | 10 ++
 4 files changed, 11 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index df21e90..46bb228 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -143,11 +143,7 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev,
 			 struct nlattr *tb[], struct nlattr *data[],
 			 struct netlink_ext_ack *extack)
 {
-	int ingress_format = RMNET_INGRESS_FORMAT_DEMUXING |
-			     RMNET_INGRESS_FORMAT_DEAGGREGATION |
-			     RMNET_INGRESS_FORMAT_MAP;
-	int egress_format = RMNET_EGRESS_FORMAT_MUXING |
-			    RMNET_EGRESS_FORMAT_MAP;
+	int ingress_format = RMNET_INGRESS_FORMAT_DEAGGREGATION;
 	struct net_device *real_dev;
 	int mode = RMNET_EPMODE_VND;
 	struct rmnet_endpoint *ep;
@@ -181,9 +177,7 @@ static int rmnet_newlink(struct net *src_net, struct net_device *dev,
 	if (err)
 		goto err2;

-	netdev_dbg(dev, "data format [ingress 0x%08X] [egress 0x%08X]\n",
-		   ingress_format, egress_format);
-	port->egress_data_format = egress_format;
+	netdev_dbg(dev, "data format [ingress 0x%08X]\n", ingress_format);
 	port->ingress_data_format = ingress_format;
 	port->rmnet_mode = mode;

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
index c19259e..2ea9fe3 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
@@ -33,7 +33,6 @@ struct rmnet_endpoint {
 struct rmnet_port {
 	struct net_device *dev;
 	u32 ingress_data_format;
-	u32 egress_data_format;
 	u8 nr_rmnet_devs;
 	u8 rmnet_mode;
 	struct hlist_head muxed_ep[RMNET_MAX_LOGICAL_EP];
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 1e1ea10..a46053c 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -133,12 +133,10 @@ static int rmnet_map_egress_handler(struct sk_buff *skb,
 	if (!map_header)
 		goto fail;

-	if (port->egress_data_format & RMNET_EGRESS_FORMAT_MUXING) {
-		if (mux_id == 0xff)
-			map_header->mux_id = 0;
-		else
-			map_header->mux_id = mux_id;
-	}
+	if (mux_id == 0xff)
+		map_header->mux_id = 0;
+	else
+		map_header->mux_id = mux_id;

 	skb->protocol = htons(ETH_P_MAP);

@@ -178,8 +176,7 @@ rx_handler_result_t rmnet_rx_handler(struct sk_buff **pskb)

 	switch (port->rmnet_mode) {
 	case RMNET_EPMODE_VND:
-		if (port->ingress_data_format & RMNET_INGRESS_FORMAT_MAP)
-			rmnet_map_ingress_handler(skb, port);
+		rmnet_map_ingress_handler(skb, port);
 		break;
 	case RMNET_EPMODE_BRIDGE:
 		rmnet_bridge_handler(skb, port->bridge_ep);
@@ -212,10 +209,8 @@ void rmnet_egress_handler(struct sk_buff *skb)
 		return;
 	}

-	if (port->egress_data_format & RMNET_EGRESS_FORMAT_MAP) {
-		if (rmnet_map_egress_handler(skb, port, mux_id, orig_dev))
-			return;
-	}
+	if (rmnet_map_egress_handler(skb, port, mux_id, orig_dev))
+		return;

 	rmnet_vnd_tx_fixup(skb, orig_dev);

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
index 49102f9..d214280 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
@@ -19,14 +19,8 @@
 #define RMNET_TX_QUEUE_LEN 1000

 /* Constants */
-#define RMNET_EGRESS_FORMAT_MAP BIT(1)
-#define RMNET_EGRESS_FORMAT_AGGREGATION BIT(2)
-#define RMNET_EGRESS_FORMAT_MUXING BIT(3)
-
-#define RMNET_INGRESS_FORMAT_MAP BIT(1)
-#define RMNET_INGRESS_FORMAT_DEAGGREGATION BIT(2)
-#define RMNET_INGRESS_FORMAT_DEMUXING BIT(3)
-#define RMNET_INGRESS_FORMAT_MAP_COMMANDS BIT(4)
+#define RMN
[PATCH net-next v3 6/6] net: qualcomm: rmnet: Allow to configure flags for existing devices
Add an option to configure the mux id, aggregation and command feature for existing rmnet devices. Implement the changelink netlink operation for this. Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 40 ++ 1 file changed, 40 insertions(+) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c index 7a4c26e..cedacdd 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c @@ -320,6 +320,45 @@ static int rmnet_rtnl_validate(struct nlattr *tb[], struct nlattr *data[], return 0; } +static int rmnet_changelink(struct net_device *dev, struct nlattr *tb[], + struct nlattr *data[], + struct netlink_ext_ack *extack) +{ + struct rmnet_priv *priv = netdev_priv(dev); + struct net_device *real_dev; + struct rmnet_endpoint *ep; + struct rmnet_port *port; + u16 mux_id; + + real_dev = __dev_get_by_index(dev_net(dev), + nla_get_u32(tb[IFLA_LINK])); + + if (!real_dev || !dev || !rmnet_is_real_dev_registered(real_dev)) + return -ENODEV; + + port = rmnet_get_port_rtnl(real_dev); + + if (data[IFLA_VLAN_ID]) { + mux_id = nla_get_u16(data[IFLA_VLAN_ID]); + ep = rmnet_get_endpoint(port, priv->mux_id); + + hlist_del_init_rcu(&ep->hlnode); + hlist_add_head_rcu(&ep->hlnode, &port->muxed_ep[mux_id]); + + ep->mux_id = mux_id; + priv->mux_id = mux_id; + } + + if (data[IFLA_VLAN_FLAGS]) { + struct ifla_vlan_flags *flags; + + flags = nla_data(data[IFLA_VLAN_FLAGS]); + port->ingress_data_format = flags->flags & flags->mask; + } + + return 0; +} + static size_t rmnet_get_size(const struct net_device *dev) { return nla_total_size(2) /* IFLA_VLAN_ID */ + @@ -335,6 +374,7 @@ struct rtnl_link_ops rmnet_link_ops __read_mostly = { .newlink= rmnet_newlink, .dellink= rmnet_dellink, .get_size = rmnet_get_size, + .changelink = rmnet_changelink, }; /* Needs either rcu_read_lock() or rtnl lock */ -- 1.9.1
[PATCH net-next v3 4/6] net: qualcomm: rmnet: Process packets over ethernet
Add support to send and receive packets over ethernet. An example of usage is testing the data path on UML. This can be achieved by setting up two UML instances in multicast mode and associating rmnet over the UML ethernet device. Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c index a46053c..0553932 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c @@ -15,6 +15,7 @@ #include #include +#include #include "rmnet_private.h" #include "rmnet_config.h" #include "rmnet_vnd.h" @@ -104,6 +105,15 @@ static void rmnet_set_skb_proto(struct sk_buff *skb) { struct sk_buff *skbn; + if (skb->dev->type == ARPHRD_ETHER) { + if (pskb_expand_head(skb, ETH_HLEN, 0, GFP_KERNEL)) { + kfree_skb(skb); + return; + } + + skb_push(skb, ETH_HLEN); + } + if (port->ingress_data_format & RMNET_INGRESS_FORMAT_DEAGGREGATION) { while ((skbn = rmnet_map_deaggregate(skb)) != NULL) __rmnet_map_ingress_handler(skbn, port); -- 1.9.1
[PATCH net-next v3 3/6] net: qualcomm: rmnet: Allow only one rmnet dev per muxid per real dev
Upon de-multiplexing data from one real dev, the packets can be sent to a unique rmnet device for a given mux id. Signed-off-by: Subash Abhinov Kasiviswanathan --- drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c index 9caa5e3..5bb29f4 100644 --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c @@ -185,6 +185,9 @@ int rmnet_vnd_newlink(u8 id, struct net_device *rmnet_dev, if (ep->egress_dev) return -EINVAL; + if (rmnet_get_endpoint(port, id)) + return -EBUSY; + rc = register_netdevice(rmnet_dev); if (!rc) { ep->egress_dev = rmnet_dev; -- 1.9.1
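The check added by the patch above can be illustrated in user space. This is only a minimal sketch under stated assumptions: struct toy_port, toy_get_endpoint() and toy_newlink() are hypothetical stand-ins rather than rmnet functions, and the fixed-size table replaces the driver's hash list.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_MUX_IDS 32 /* hypothetical table size, not RMNET_MAX_LOGICAL_EP */

/* Hypothetical stand-in for a port: one slot per mux id. */
struct toy_port {
	int in_use[TOY_MAX_MUX_IDS];
};

/* Returns non-zero when an endpoint already exists for this mux id. */
static int toy_get_endpoint(const struct toy_port *port, unsigned int id)
{
	return id < TOY_MAX_MUX_IDS && port->in_use[id];
}

/* Mirrors the ordering in the patch: refuse with -EBUSY before registering. */
static int toy_newlink(struct toy_port *port, unsigned int id)
{
	assert(port != NULL);

	if (id >= TOY_MAX_MUX_IDS)
		return -EINVAL;
	if (toy_get_endpoint(port, id))
		return -EBUSY;

	port->in_use[id] = 1; /* models the successful registration */
	return 0;
}
```

In this model a second toy_newlink() call with the same id fails with -EBUSY while a different id still succeeds, which is the one-device-per-mux-id behaviour the commit message describes.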
Re: [PATCH] Fix handling of verdicts after NF_QUEUE
Hi, Thanks for catching this, see below. On Mon, Dec 11, 2017 at 06:30:24PM -0500, Debabrata Banerjee wrote: > A verdict of NF_STOLEN after NF_QUEUE will cause an incorrect return value > and a potential kernel panic via double free of skb's > > This was broken by commit 7034b566a4e7 ("netfilter: fix nf_queue handling") > and subsequently fixed in v4.10 by commit c63cbc460419 ("netfilter: > use switch() to handle verdict cases from nf_hook_slow()"). However that > commit cannot be cleanly cherry-picked to v4.9 > > Signed-off-by: Debabrata Banerjee > > --- > > This fix is only needed for v4.9 stable since v4.10+ does not have the > issue > --- > net/netfilter/core.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/net/netfilter/core.c b/net/netfilter/core.c > index 004af030ef1a..d869ea50623e 100644 > --- a/net/netfilter/core.c > +++ b/net/netfilter/core.c > @@ -364,6 +364,11 @@ int nf_hook_slow(struct sk_buff *skb, struct > nf_hook_state *state) > ret = nf_queue(skb, state, &entry, verdict); > if (ret == 1 && entry) > goto next_hook; > + } else { > + /* Implicit handling for NF_STOLEN, as well as any other > + * non conventional verdicts. > + */ > + ret = 0; Another possibility (simpler?) would be this: int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state) { struct nf_hook_entry *entry; unsigned int verdict; - int ret = 0; + int ret; entry = rcu_dereference(state->hook_entries); next_hook: + ret = 0; Basically, make sure ret is set to zero when jumping to the next_hook label. Thanks!
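To make the stale-return-value problem concrete, here is a toy user-space model of the control flow under discussion. It is a sketch, not the kernel function: the verdict names, their values and toy_hook_slow() are all hypothetical, and queue handling is reduced to "continue with the next hook". The reset_ret flag switches between the unpatched behaviour and the "ret = 0 at the label" suggestion.

```c
#include <assert.h>
#include <stddef.h>

/* Toy verdicts; the values are illustrative, not the kernel's NF_* constants. */
enum toy_verdict { TOY_DROP, TOY_ACCEPT, TOY_STOLEN, TOY_QUEUE_BYPASS };

/*
 * Walks a list of hook verdicts. TOY_QUEUE_BYPASS models the case where
 * queueing is bypassed and traversal continues with the next hook.
 * Returns 1 if the caller still owns the packet, 0 if the packet was
 * consumed, and a negative value if it was dropped.
 */
static int toy_hook_slow(const enum toy_verdict *hooks, size_t n, int reset_ret)
{
	size_t i = 0;
	int ret = 0;

	assert(hooks != NULL || n == 0);

next_hook:
	if (reset_ret)
		ret = 0; /* the suggested fix: start every pass from zero */
	if (i >= n)
		return 1; /* ran out of hooks: treat as accept */

	switch (hooks[i++]) {
	case TOY_ACCEPT:
		ret = 1;
		break;
	case TOY_DROP:
		ret = -1;
		break;
	case TOY_QUEUE_BYPASS:
		ret = 1;
		goto next_hook;
	case TOY_STOLEN:
		/* no branch assigns ret here, so a stale 1 can leak out */
		break;
	}
	return ret;
}
```

With the hook sequence { TOY_QUEUE_BYPASS, TOY_STOLEN }, the model without the reset reports that the caller still owns a packet that was in fact stolen, which is the double-free hazard from the commit message; with the reset it reports 0.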
Re: [PATCH iproute2 net-next 0/4] Abstract columns, properly space and wrap fields
On Fri, 8 Dec 2017 18:07:19 +0100 Stefano Brivio wrote: > Currently, 'ss' simply subdivides the whole available screen width > between available columns, starting from a set of hardcoded amount > of spacing and growing column widths. > > This makes the output unreadable in several cases, as it doesn't take > into account the actual content width. > > Fix this by introducing a simple abstraction for columns, buffering > the output, measuring the width of the fields, grouping fields into > lines as they fit, equally distributing any remaining whitespace, and > finally rendering the result. Some examples are reported below [1]. > > This implementation doesn't seem to cause any significant performance > issues, as reported in 3/4. > > Patch 1/4 replaces all relevant printf() calls by the out() helper, > which simply consists of the usual printf() implementation. > > Patch 2/4 implements column abstraction, with configurable column > width and delimiters, and 3/4 splits buffering and rendering phases, > employing a simple buffering mechanism with chunked allocation and > introducing a rendering function. > > Up to this point, the output is still unchanged. > > Finally, 4/4 introduces field width calculation based on content > length measured while buffering, in order to split fields onto > multiple lines and equally space them within the single lines. > > Now that column behaviour is well-defined and more easily > configurable, it should be easier to further improve the output by > splitting logically separable information (e.g. TCP details) into > additional columns. However, this patchset keeps the full "extended" > information into a single column, for the moment being. 
> > > [1] > > - 80 columns terminal, ss -Z -f netlink > * before: > Recv-Q Send-Q Local Address:Port Peer Address:Port > > 0 0rtnl:evolution-calen/2075 * > pr > oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > 0 0rtnl:abrt-applet/32700 * > pr > oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > 0 0rtnl:firefox/21619 * > pr > oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > 0 0rtnl:evolution-calen/32639 * > p > roc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > [...] > > * after: > Recv-Q Send-Q Local Address:Port Peer Address:Port > 00 rtnl:evolution-calen/2075 * > proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > 00 rtnl:abrt-applet/32700 * > proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > 00 rtnl:firefox/21619 * > proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > 00 rtnl:evolution-calen/32639 * > proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 > [...] > > - 80 colums terminal, ss -tunpl > * before: > Netid State Recv-Q Send-Q Local Address:Port Peer > Address:Port > udpUNCONN 0 0 *:37732 *:* > udpUNCONN 0 0 *:5353 *:* > udpUNCONN 0 0 192.168.122.1:53*:* > udpUNCONN 0 0 *%virbr0:67*:* > [...] > > * after: > Netid StateRecv-Q Send-Q Local Address:Port Peer Address:Port > udp UNCONN 00 *:37732*:* > udp UNCONN 00 *:5353 *:* > udp UNCONN 00 192.168.122.1:53 *:* > udp UNCONN 00 *%virbr0:67 *:* > [...] > > - 66 columns terminal, ss -tunpl > * before: > Netid State Recv-Q Send-Q Local Address:Port P > eer Address:Port > udpUNCONN 0 0 *:37732 *:* > > udpUNCONN 0 0 *:5353*:* > > udpUNCONN 0 0 192.168.122.1:53 > *:* > udpUNCONN 0 0 *%virbr0:67 *:* > [...] > > * after: > Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port > udp UNCONN 0 0 *:37732 *:* > udp UNCONN 0 0 *:5353 *:* > udp UNCONN 0 0 192.168.122.1:53*:* > udp UNCONN 0 0 *%virbr0:67*:* > [...] 
> > > Stefano Brivio (4): > ss: Replace printf() calls for "main" output by calls to helper > ss: Introduce columns lightweight abstraction > ss: Buffer raw fields first, then render them as a table > ss: Implement automatic column width calculation I was going to apply
Re: [PATCH iproute2 1/1] ss: remove duplicate assignment
On Mon, 11 Dec 2017 16:24:31 -0500 Roman Mashak wrote: > Signed-off-by: Roman Mashak Applied and added Fixes: 8250bc9ff4e5 ("ss: Unify inet sockets output")
[PATCH] igb: Free IRQs when device is hotplugged
Recently I got a Caldigit TS3 Thunderbolt 3 dock, and noticed that upon hotplugging my kernel would immediately crash due to igb: [ 680.825801] kernel BUG at drivers/pci/msi.c:352! [ 680.828388] invalid opcode: [#1] SMP [ 680.829194] Modules linked in: igb(O) thunderbolt i2c_algo_bit joydev vfat fat btusb btrtl btbcm btintel bluetooth ecdh_generic hp_wmi sparse_keymap rfkill wmi_bmof iTCO_wdt intel_rapl x86_pkg_temp_thermal coretemp crc32_pclmul snd_pcm rtsx_pci_ms mei_me snd_timer memstick snd pcspkr mei soundcore i2c_i801 tpm_tis psmouse shpchp wmi tpm_tis_core tpm video hp_wireless acpi_pad rtsx_pci_sdmmc mmc_core crc32c_intel serio_raw rtsx_pci mfd_core xhci_pci xhci_hcd i2c_hid i2c_core [last unloaded: igb] [ 680.831085] CPU: 1 PID: 78 Comm: kworker/u16:1 Tainted: G O 4.15.0-rc3Lyude-Test+ #6 [ 680.831596] Hardware name: HP HP ZBook Studio G4/826B, BIOS P71 Ver. 01.03 06/09/2017 [ 680.832168] Workqueue: kacpi_hotplug acpi_hotplug_work_fn [ 680.832687] RIP: 0010:free_msi_irqs+0x180/0x1b0 [ 680.833271] RSP: 0018:c930fbf0 EFLAGS: 00010286 [ 680.833761] RAX: 8803405f9c00 RBX: 88033e3d2e40 RCX: 002c [ 680.834278] RDX: RSI: 00ac RDI: 880340be2178 [ 680.834832] RBP: R08: 880340be1ff0 R09: 8803405f9c00 [ 680.835342] R10: R11: 0040 R12: 88033d63a298 [ 680.835822] R13: 88033d63a000 R14: 0060 R15: 880341959000 [ 680.836332] FS: () GS:88034f44() knlGS: [ 680.836817] CS: 0010 DS: ES: CR0: 80050033 [ 680.837360] CR2: 55e64044afdf CR3: 01c09002 CR4: 003606e0 [ 680.837954] Call Trace: [ 680.838853] pci_disable_msix+0xce/0xf0 [ 680.839616] igb_reset_interrupt_capability+0x5d/0x60 [igb] [ 680.840278] igb_remove+0x9d/0x110 [igb] [ 680.840764] pci_device_remove+0x36/0xb0 [ 680.841279] device_release_driver_internal+0x157/0x220 [ 680.841739] pci_stop_bus_device+0x7d/0xa0 [ 680.842255] pci_stop_bus_device+0x2b/0xa0 [ 680.842722] pci_stop_bus_device+0x3d/0xa0 [ 680.843189] pci_stop_and_remove_bus_device+0xe/0x20 [ 680.843627] trim_stale_devices+0xf3/0x140 [ 680.844086] 
trim_stale_devices+0x94/0x140 [ 680.844532] trim_stale_devices+0xa6/0x140 [ 680.845031] ? get_slot_status+0x90/0xc0 [ 680.845536] acpiphp_check_bridge.part.5+0xfe/0x140 [ 680.846021] acpiphp_hotplug_notify+0x175/0x200 [ 680.846581] ? free_bridge+0x100/0x100 [ 680.847113] acpi_device_hotplug+0x8a/0x490 [ 680.847535] acpi_hotplug_work_fn+0x1a/0x30 [ 680.848076] process_one_work+0x182/0x3a0 [ 680.848543] worker_thread+0x2e/0x380 [ 680.848963] ? process_one_work+0x3a0/0x3a0 [ 680.849373] kthread+0x111/0x130 [ 680.849776] ? kthread_create_worker_on_cpu+0x50/0x50 [ 680.850188] ret_from_fork+0x1f/0x30 [ 680.850601] Code: 43 14 85 c0 0f 84 d5 fe ff ff 31 ed eb 0f 83 c5 01 39 6b 14 0f 86 c5 fe ff ff 8b 7b 10 01 ef e8 b7 e4 d2 ff 48 83 78 70 00 74 e3 <0f> 0b 49 8d b5 a0 00 00 00 e8 62 6f d3 ff e9 c7 fe ff ff 48 8b [ 680.851497] RIP: free_msi_irqs+0x180/0x1b0 RSP: c930fbf0 As it turns out, normally the freeing of IRQs that would fix this is called inside of the scope of __igb_close(). However, since the device is already gone by the point we try to unregister the netdevice from the driver due to a hotplug, we end up seeing that the netif isn't present and thus forget to free any of the device IRQs. So: after unregistering the netdev in igb_remove(), check whether the PCI device is stale and if so, free its IRQs and tx/rx resources. Signed-off-by: Lyude Paul Fixes: 9474933caf21 ("igb: close/suspend race in netif_device_detach") Cc: Todd Fujinaka Cc: sta...@vger.kernel.org --- drivers/net/ethernet/intel/igb/igb_main.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index c208753ff5b7..e650348b4bd7 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -3325,6 +3325,16 @@ static void igb_remove(struct pci_dev *pdev) unregister_netdev(netdev); + /* If the PCI device has already been physically removed (e.g. 
user +* unplugged a thunderbolt dock containing our hw) then the netif will +* already be down, so unregistering the netdev won't free the IRQs +*/ + if (!pci_device_is_present(pdev)) { + igb_free_irq(adapter); + igb_free_all_tx_resources(adapter); + igb_free_all_rx_resources(adapter); + } + igb_clear_interrupt_scheme(adapter); pci_iounmap(pdev, adapter->io_addr); -- 2.14.3
[PATCH net-next] tcp: allow TLP in ECN CWR
From: Neal Cardwell This patch enables tail loss probe in cwnd reduction (CWR) state to detect potential losses. Prior to this patch, since the sender uses PRR to determine the cwnd in CWR state, the combination of CWR+PRR plus tcp_tso_should_defer() could cause unnecessary stalls upon losses: PRR makes cwnd so gentle that tcp_tso_should_defer() defers sending to wait for more ACKs. The ACKs may not come due to packet losses. Disallowing TLP when there is unused cwnd had the primary effect of disallowing TLP when there is TSO deferral, Nagle deferral, or we hit the rwin limit. Because basically every application write() or incoming ACK will cause us to run tcp_write_xmit() to see if we can send more, and then if we sent something we call tcp_schedule_loss_probe() to see if we should schedule a TLP. At that point, there are a few common reasons why some cwnd budget could still be unused: (a) rwin limit (b) nagle check (c) TSO deferral (d) TSQ For (d), after the next packet tx completion the TSQ mechanism will allow us to send more packets, so we don't really need a TLP (in practice it shouldn't matter whether we schedule one or not). But for (a), (b), (c) the sender won't send any more packets until it gets another ACK. But if the whole flight was lost, or all the ACKs were lost, then we won't get any more ACKs, and ideally we should schedule and send a TLP to get more feedback. 
In particular for a long time we have wanted some kind of timer for TSO deferral, and at least this would give us some kind of timer. Reported-by: Steve Ibanez Signed-off-by: Neal Cardwell Signed-off-by: Yuchung Cheng Reviewed-by: Nandita Dukkipati Reviewed-by: Eric Dumazet --- net/ipv4/tcp_output.c | 9 +++-- 1 file changed, 3 insertions(+), 6 deletions(-) diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index a4d214c7b506..04be9f833927 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2414,15 +2414,12 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto) early_retrans = sock_net(sk)->ipv4.sysctl_tcp_early_retrans; /* Schedule a loss probe in 2*RTT for SACK capable connections -* in Open state, that are either limited by cwnd or application. +* not in loss recovery, that are either limited by cwnd or application. */ if ((early_retrans != 3 && early_retrans != 4) || !tp->packets_out || !tcp_is_sack(tp) || - icsk->icsk_ca_state != TCP_CA_Open) - return false; - - if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) && -!tcp_write_queue_empty(sk)) + (icsk->icsk_ca_state != TCP_CA_Open && +icsk->icsk_ca_state != TCP_CA_CWR)) return false; /* Probe timeout is 2*rtt. Add minimum RTO to account -- 2.15.1.424.g9478a66081-goog
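The change in the eligibility test can be compared side by side in a small user-space sketch. It covers only the conditions visible in the hunk above, not the rest of tcp_schedule_loss_probe(); struct toy_conn, the TOY_CA_* names and both predicate functions are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy congestion states; names echo TCP_CA_* but the values are illustrative. */
enum toy_ca_state {
	TOY_CA_OPEN, TOY_CA_DISORDER, TOY_CA_CWR, TOY_CA_RECOVERY, TOY_CA_LOSS
};

/* Hypothetical snapshot of the fields the hunk inspects. */
struct toy_conn {
	int early_retrans;       /* models sysctl_tcp_early_retrans */
	unsigned int packets_out;
	unsigned int in_flight;
	unsigned int snd_cwnd;
	bool sack_ok;
	bool write_queue_empty;
	enum toy_ca_state ca_state;
};

/* Conditions shared by the old and new checks. */
static bool toy_tlp_enabled(const struct toy_conn *c)
{
	assert(c != NULL);
	return (c->early_retrans == 3 || c->early_retrans == 4) &&
	       c->packets_out && c->sack_ok;
}

/* Before the patch: Open state only, and no unused cwnd with data pending. */
static bool toy_may_schedule_tlp_old(const struct toy_conn *c)
{
	if (!toy_tlp_enabled(c) || c->ca_state != TOY_CA_OPEN)
		return false;
	if (c->snd_cwnd > c->in_flight && !c->write_queue_empty)
		return false;
	return true;
}

/* After the patch: Open or CWR, and the unused-cwnd test is gone. */
static bool toy_may_schedule_tlp_new(const struct toy_conn *c)
{
	if (!toy_tlp_enabled(c))
		return false;
	return c->ca_state == TOY_CA_OPEN || c->ca_state == TOY_CA_CWR;
}
```

A connection in CWR with unused cwnd and queued data (the TSO deferral case from the commit message) is rejected by the old predicate and accepted by the new one, while a connection in Recovery is rejected by both.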
Re: [PATCH net-next v5 2/2] net: thunderx: add timestamping support
On Mon, Dec 11, 2017 at 05:14:31PM +0300, Aleksey Makarov wrote: > @@ -880,6 +889,46 @@ static void nic_pause_frame(struct nicpf *nic, int vf, > struct pfc *cfg) > } > } > > +/* Enable or disable HW timestamping by BGX for pkts received on a LMAC */ > +static void nic_config_timestamp(struct nicpf *nic, int vf, struct set_ptp > *ptp) > +{ > + struct pkind_cfg *pkind; > + u8 lmac, bgx_idx; > + u64 pkind_val, pkind_idx; > + > + if (vf >= nic->num_vf_en) > + return; > + > + bgx_idx = NIC_GET_BGX_FROM_VF_LMAC_MAP(nic->vf_lmac_map[vf]); > + lmac = NIC_GET_LMAC_FROM_VF_LMAC_MAP(nic->vf_lmac_map[vf]); > + > + pkind_idx = lmac + bgx_idx * MAX_LMAC_PER_BGX; > + pkind_val = nic_reg_read(nic, NIC_PF_PKIND_0_15_CFG | (pkind_idx << 3)); > + pkind = (struct pkind_cfg *)&pkind_val; > + > + if (ptp->enable && !pkind->hdr_sl) { > + /* Skiplen to exclude 8byte timestamp while parsing pkt > + * If not configured, will result in L2 errors. > + */ > + pkind->hdr_sl = 4; > + /* Adjust max packet length allowed */ > + pkind->maxlen += (pkind->hdr_sl * 2); > + bgx_config_timestamping(nic->node, bgx_idx, lmac, true); > + nic_reg_write(nic, > + NIC_PF_RX_ETYPE_0_7 | (1 << 3), > + (ETYPE_ALG_ENDPARSE << 16) | ETH_P_1588); don't need three lines for this function call. > + } else if (!ptp->enable && pkind->hdr_sl) { > + pkind->maxlen -= (pkind->hdr_sl * 2); > + pkind->hdr_sl = 0; > + bgx_config_timestamping(nic->node, bgx_idx, lmac, false); > + nic_reg_write(nic, > + NIC_PF_RX_ETYPE_0_7 | (1 << 3), > + (1ULL << 16) | ETH_P_8021Q); /* reset value */ here neither. Also avoid comment on the LHS. If 1<<16 means "reset" then just define a macro. > + } > + > + nic_reg_write(nic, NIC_PF_PKIND_0_15_CFG | (pkind_idx << 3), pkind_val); > +} > + Thanks, Richard
[Patch net-next] net_sched: switch to exit_batch for action pernet ops
Since we now hold RTNL lock in tc_action_net_exit(), it is good to batch them to speedup tc action dismantle. Cc: Jamal Hadi Salim Cc: Jiri Pirko Signed-off-by: Cong Wang --- include/net/act_api.h | 13 ++--- net/sched/act_bpf.c| 8 +++- net/sched/act_connmark.c | 8 +++- net/sched/act_csum.c | 8 +++- net/sched/act_gact.c | 8 +++- net/sched/act_ife.c| 8 +++- net/sched/act_ipt.c| 16 ++-- net/sched/act_mirred.c | 8 +++- net/sched/act_nat.c| 8 +++- net/sched/act_pedit.c | 8 +++- net/sched/act_police.c | 8 +++- net/sched/act_sample.c | 8 +++- net/sched/act_simple.c | 8 +++- net/sched/act_skbedit.c| 8 +++- net/sched/act_skbmod.c | 8 +++- net/sched/act_tunnel_key.c | 8 +++- net/sched/act_vlan.c | 8 +++- 17 files changed, 61 insertions(+), 88 deletions(-) diff --git a/include/net/act_api.h b/include/net/act_api.h index 02bf409140d0..6ed9692f20bd 100644 --- a/include/net/act_api.h +++ b/include/net/act_api.h @@ -120,12 +120,19 @@ int tc_action_net_init(struct tc_action_net *tn, void tcf_idrinfo_destroy(const struct tc_action_ops *ops, struct tcf_idrinfo *idrinfo); -static inline void tc_action_net_exit(struct tc_action_net *tn) +static inline void tc_action_net_exit(struct list_head *net_list, + unsigned int id) { + struct net *net; + rtnl_lock(); - tcf_idrinfo_destroy(tn->ops, tn->idrinfo); + list_for_each_entry(net, net_list, exit_list) { + struct tc_action_net *tn = net_generic(net, id); + + tcf_idrinfo_destroy(tn->ops, tn->idrinfo); + kfree(tn->idrinfo); + } rtnl_unlock(); - kfree(tn->idrinfo); } int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb, diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c index e6c477fa9ca5..b3f2c15affa7 100644 --- a/net/sched/act_bpf.c +++ b/net/sched/act_bpf.c @@ -401,16 +401,14 @@ static __net_init int bpf_init_net(struct net *net) return tc_action_net_init(tn, &act_bpf_ops); } -static void __net_exit bpf_exit_net(struct net *net) +static void __net_exit bpf_exit_net(struct list_head *net_list) { - struct tc_action_net 
*tn = net_generic(net, bpf_net_id); - - tc_action_net_exit(tn); + tc_action_net_exit(net_list, bpf_net_id); } static struct pernet_operations bpf_net_ops = { .init = bpf_init_net, - .exit = bpf_exit_net, + .exit_batch = bpf_exit_net, .id = &bpf_net_id, .size = sizeof(struct tc_action_net), }; diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c index 10b7a8855a6c..2b15ba84e0c8 100644 --- a/net/sched/act_connmark.c +++ b/net/sched/act_connmark.c @@ -209,16 +209,14 @@ static __net_init int connmark_init_net(struct net *net) return tc_action_net_init(tn, &act_connmark_ops); } -static void __net_exit connmark_exit_net(struct net *net) +static void __net_exit connmark_exit_net(struct list_head *net_list) { - struct tc_action_net *tn = net_generic(net, connmark_net_id); - - tc_action_net_exit(tn); + tc_action_net_exit(net_list, connmark_net_id); } static struct pernet_operations connmark_net_ops = { .init = connmark_init_net, - .exit = connmark_exit_net, + .exit_batch = connmark_exit_net, .id = &connmark_net_id, .size = sizeof(struct tc_action_net), }; diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c index d836f998117b..af4b8ec60d9a 100644 --- a/net/sched/act_csum.c +++ b/net/sched/act_csum.c @@ -635,16 +635,14 @@ static __net_init int csum_init_net(struct net *net) return tc_action_net_init(tn, &act_csum_ops); } -static void __net_exit csum_exit_net(struct net *net) +static void __net_exit csum_exit_net(struct list_head *net_list) { - struct tc_action_net *tn = net_generic(net, csum_net_id); - - tc_action_net_exit(tn); + tc_action_net_exit(net_list, csum_net_id); } static struct pernet_operations csum_net_ops = { .init = csum_init_net, - .exit = csum_exit_net, + .exit_batch = csum_exit_net, .id = &csum_net_id, .size = sizeof(struct tc_action_net), }; diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c index e29a48ef7fc3..9d632e92cad0 100644 --- a/net/sched/act_gact.c +++ b/net/sched/act_gact.c @@ -235,16 +235,14 @@ static __net_init int 
gact_init_net(struct net *net) return tc_action_net_init(tn, &act_gact_ops); } -static void __net_exit gact_exit_net(struct net *net) +static void __net_exit gact_exit_net(struct list_head *net_list) { - struct tc_action_net *tn = net_generic(net, gact_net_id); - - tc_action_net_exit(tn); + tc_action_net_exit(net_list, gact_net_id); } static struct pernet_operations gact_net_ops = { .in
[PATCH] Fix handling of verdicts after NF_QUEUE
A verdict of NF_STOLEN after NF_QUEUE will cause an incorrect return value and a potential kernel panic via double free of skb's This was broken by commit 7034b566a4e7 ("netfilter: fix nf_queue handling") and subsequently fixed in v4.10 by commit c63cbc460419 ("netfilter: use switch() to handle verdict cases from nf_hook_slow()"). However that commit cannot be cleanly cherry-picked to v4.9 Signed-off-by: Debabrata Banerjee --- This fix is only needed for v4.9 stable since v4.10+ does not have the issue --- net/netfilter/core.c | 5 + 1 file changed, 5 insertions(+) diff --git a/net/netfilter/core.c b/net/netfilter/core.c index 004af030ef1a..d869ea50623e 100644 --- a/net/netfilter/core.c +++ b/net/netfilter/core.c @@ -364,6 +364,11 @@ int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state) ret = nf_queue(skb, state, &entry, verdict); if (ret == 1 && entry) goto next_hook; + } else { + /* Implicit handling for NF_STOLEN, as well as any other +* non conventional verdicts. +*/ + ret = 0; } return ret; } -- 2.15.1
Re: [PATCH net-next v5 2/2] net: thunderx: add timestamping support
On Mon, Dec 11, 2017 at 05:14:31PM +0300, Aleksey Makarov wrote: > diff --git a/drivers/net/ethernet/cavium/thunder/nic.h > b/drivers/net/ethernet/cavium/thunder/nic.h > index 4a02e618e318..204b234beb9d 100644 > --- a/drivers/net/ethernet/cavium/thunder/nic.h > +++ b/drivers/net/ethernet/cavium/thunder/nic.h > @@ -263,6 +263,8 @@ struct nicvf_drv_stats { > struct u64_stats_sync syncp; > }; > > +struct cavium_ptp; > + > struct nicvf { > struct nicvf*pnicvf; > struct net_device *netdev; > @@ -312,6 +314,12 @@ struct nicvf { > struct tasklet_struct qs_err_task; > struct work_struct reset_task; > > + /* PTP timestamp */ > + struct cavium_ptp *ptp_clock; > + boolhw_rx_tstamp; > + struct sk_buff *ptp_skb; > + atomic_ttx_ptp_skbs; It is disturbing that the above two fields are set in different places. Shouldn't they be unified into one logical lock? Here you clear them together: > +static void nicvf_snd_ptp_handler(struct net_device *netdev, > + struct cqe_send_t *cqe_tx) > +{ > + struct nicvf *nic = netdev_priv(netdev); > + struct skb_shared_hwtstamps ts; > + u64 ns; > + > + nic = nic->pnicvf; > + > + /* Sync for 'ptp_skb' */ > + smp_rmb(); > + > + /* New timestamp request can be queued now */ > + atomic_set(&nic->tx_ptp_skbs, 0); > + > + /* Check for timestamp requested skb */ > + if (!nic->ptp_skb) > + return; > + > + /* Check if timestamping is timedout, which is set to 10us */ > + if (cqe_tx->send_status == CQ_TX_ERROP_TSTMP_TIMEOUT || > + cqe_tx->send_status == CQ_TX_ERROP_TSTMP_CONFLICT) > + goto no_tstamp; > + > + /* Get the timestamp */ > + memset(&ts, 0, sizeof(ts)); > + ns = cavium_ptp_tstamp2time(nic->ptp_clock, cqe_tx->ptp_timestamp); > + ts.hwtstamp = ns_to_ktime(ns); > + skb_tstamp_tx(nic->ptp_skb, &ts); > + > +no_tstamp: > + /* Free the original skb */ > + dev_kfree_skb_any(nic->ptp_skb); > + nic->ptp_skb = NULL; > + /* Sync 'ptp_skb' */ > + smp_wmb(); > +} > + but here you set the one: > @@ -657,7 +697,12 @@ static void nicvf_snd_pkt_handler(struct 
net_device > *netdev, > prefetch(skb); > (*tx_pkts)++; > *tx_bytes += skb->len; > - napi_consume_skb(skb, budget); > + /* If timestamp is requested for this skb, don't free it */ > + if (skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS && > + !nic->pnicvf->ptp_skb) > + nic->pnicvf->ptp_skb = skb; > + else > + napi_consume_skb(skb, budget); > sq->skbuff[cqe_tx->sqe_ptr] = (u64)NULL; > } else { > /* In case of SW TSO on 88xx, only last segment will have here you clear one: > @@ -1319,12 +1382,28 @@ int nicvf_stop(struct net_device *netdev) > > nicvf_free_cq_poll(nic); > > + /* Free any pending SKB saved to receive timestamp */ > + if (nic->ptp_skb) { > + dev_kfree_skb_any(nic->ptp_skb); > + nic->ptp_skb = NULL; > + } > + > /* Clear multiqset info */ > nic->pnicvf = nic; > > return 0; > } here you clear both: > @@ -1394,6 +1473,12 @@ int nicvf_open(struct net_device *netdev) > if (nic->sqs_mode) > nicvf_get_primary_vf_struct(nic); > > + /* Configure PTP timestamp */ > + if (nic->ptp_clock) > + nicvf_config_hw_rx_tstamp(nic, nic->hw_rx_tstamp); > + atomic_set(&nic->tx_ptp_skbs, 0); > + nic->ptp_skb = NULL; > + > /* Configure receive side scaling and MTU */ > if (!nic->sqs_mode) { > nicvf_rss_init(nic); here you set the other: > @@ -1385,6 +1388,29 @@ nicvf_sq_add_hdr_subdesc(struct nicvf *nic, struct > snd_queue *sq, int qentry, > hdr->inner_l3_offset = skb_network_offset(skb) - 2; > this_cpu_inc(nic->pnicvf->drv_stats->tx_tso); > } > + > + /* Check if timestamp is requested */ > + if (!(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) { > + skb_tx_timestamp(skb); > + return; > + } > + > + /* Tx timestamping not supported along with TSO, so ignore request */ > + if (skb_shinfo(skb)->gso_size) > + return; > + > + /* HW supports only a single outstanding packet to timestamp */ > + if (!atomic_add_unless(&nic->pnicvf->tx_ptp_skbs, 1, 1)) > + return; > + > + /* Mark the SKB for later reference */ > + skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; > + > + /* Finally enable 
timestamp generation > + * Since 'post_cqe' is also set, two CQEs will be posted > + * for this packet i.e CQE_TYPE_SEND and CQE_TYPE_SEND_PTP. > + */ > + hdr->tstmp = 1; > } and so it is completely non-obvious whether this is race free or not. T
[PATCH v3] net: ethernet: arc: fix error handling in emac_rockchip_probe
If clk_set_rate() fails, we should disable clk before returning. Found by Linux Driver Verification project (linuxtesting.org). Changes since v2 [1]: * Merged with latest code changes Changes since v1: Update made thanks to David's review, much appreciated David. * Improved inconsistent failure handling of clock rate setting * For completeness of usecase, added arc_emac_probe error handling Signed-off-by: Branislav Radocaj --- [1] https://marc.info/?l=linux-netdev&m=151301239802445&w=2 --- drivers/net/ethernet/arc/emac_rockchip.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/arc/emac_rockchip.c b/drivers/net/ethernet/arc/emac_rockchip.c index c6163874e4e7..16f9bee992fe 100644 --- a/drivers/net/ethernet/arc/emac_rockchip.c +++ b/drivers/net/ethernet/arc/emac_rockchip.c @@ -199,9 +199,11 @@ static int emac_rockchip_probe(struct platform_device *pdev) /* RMII interface needs always a rate of 50MHz */ err = clk_set_rate(priv->refclk, 50000000); - if (err) + if (err) { dev_err(dev, "failed to change reference clock rate (%d)\n", err); + goto out_regulator_disable; + } if (priv->soc_data->need_div_macclk) { priv->macclk = devm_clk_get(dev, "macclk"); @@ -230,12 +232,14 @@ static int emac_rockchip_probe(struct platform_device *pdev) err = arc_emac_probe(ndev, interface); if (err) { dev_err(dev, "failed to probe arc emac (%d)\n", err); - goto out_regulator_disable; + goto out_clk_disable_macclk; } return 0; + out_clk_disable_macclk: - clk_disable_unprepare(priv->macclk); + if (priv->soc_data->need_div_macclk) + clk_disable_unprepare(priv->macclk); out_regulator_disable: if (priv->regulator) regulator_disable(priv->regulator); -- 2.11.0
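The unwind order the patch establishes follows the usual goto ladder: labels appear in reverse order of acquisition, and each failure jumps to the label that releases exactly what has been acquired so far. The following user-space sketch shows the shape only; struct toy_dev, toy_probe() and the fail_at knob are hypothetical, and boolean flags stand in for the regulator and the clock.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resource flags standing in for the regulator and macclk. */
struct toy_dev {
	bool regulator_on;
	bool macclk_on;
	bool need_div_macclk;
};

/* fail_at selects the failing step: 0 = none, 1 = set rate, 2 = final probe. */
static int toy_probe(struct toy_dev *d, int fail_at)
{
	int err;

	assert(d != NULL);

	d->regulator_on = true;

	err = (fail_at == 1) ? -EINVAL : 0; /* models the clk_set_rate() step */
	if (err)
		goto out_regulator_disable;

	if (d->need_div_macclk)
		d->macclk_on = true;

	err = (fail_at == 2) ? -ENODEV : 0; /* models the arc_emac_probe() step */
	if (err)
		goto out_clk_disable_macclk;

	return 0;

out_clk_disable_macclk:
	/* only undo the clock if it was taken, as in the patched label */
	if (d->need_div_macclk)
		d->macclk_on = false;
out_regulator_disable:
	d->regulator_on = false;
	return err;
}
```

In the sketch a failure at either step leaves every flag cleared and returns the error, while success leaves the acquired resources on.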
Re: [PATCH 1/3] PCI: introduce a device-managed version of pci_set_mwi
On Sun, Dec 10, 2017 at 12:43:48AM +0100, Heiner Kallweit wrote: > Introduce a device-managed version of pci_set_mwi. First user is the > Realtek r8169 driver. > > Signed-off-by: Heiner Kallweit With the subject and changelog as follows and the code reordering below, PCI: Add pcim_set_mwi(), a device-managed pci_set_mwi() Add pcim_set_mwi(), a device-managed version of pci_set_mwi(). First user is the Realtek r8169 driver. Acked-by: Bjorn Helgaas With these changes, feel free to merge with the series via the netdev tree. > --- > drivers/pci/pci.c | 29 + > include/linux/pci.h | 1 + > 2 files changed, 30 insertions(+) > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > index 4a7c6864f..fc57c378d 100644 > --- a/drivers/pci/pci.c > +++ b/drivers/pci/pci.c > @@ -1458,6 +1458,7 @@ struct pci_devres { > unsigned int pinned:1; > unsigned int orig_intx:1; > unsigned int restore_intx:1; > + unsigned int mwi:1; > u32 region_mask; > }; > > @@ -1476,6 +1477,9 @@ static void pcim_release(struct device *gendev, void > *res) > if (this->region_mask & (1 << i)) > pci_release_region(dev, i); > > + if (this->mwi) > + pci_clear_mwi(dev); > + > if (this->restore_intx) > pci_intx(dev, this->orig_intx); > > @@ -3760,6 +3764,31 @@ int pci_set_mwi(struct pci_dev *dev) > } > EXPORT_SYMBOL(pci_set_mwi); > > +/** > + * pcim_set_mwi - Managed pci_set_mwi() > + * @dev: the PCI device for which MWI is enabled > + * > + * Managed pci_set_mwi(). > + * > + * RETURNS: An appropriate -ERRNO error value on error, or zero for success. > + */ > +int pcim_set_mwi(struct pci_dev *dev) > +{ > + struct pci_devres *dr; > + int ret; > + > + ret = pci_set_mwi(dev); > + if (ret) > + return ret; > + > + dr = find_pci_dr(dev); > + if (dr) > + dr->mwi = 1; > + > + return 0; I would rather look up the pci_devres first, e.g., dr = find_pci_dr(dev); if (!dr) return -ENOMEM; dr->mwi = 1; return pci_set_mwi(dev); That way we won't enable MWI and be unable to disable it at release-time. 
> +} > +EXPORT_SYMBOL(pcim_set_mwi); > + > /** > * pci_try_set_mwi - enables memory-write-invalidate PCI transaction > * @dev: the PCI device for which MWI is enabled > diff --git a/include/linux/pci.h b/include/linux/pci.h > index 978aad784..0a7ac863a 100644 > --- a/include/linux/pci.h > +++ b/include/linux/pci.h > @@ -1064,6 +1064,7 @@ int pci_set_pcie_reset_state(struct pci_dev *dev, enum > pcie_reset_state state); > int pci_set_cacheline_size(struct pci_dev *dev); > #define HAVE_PCI_SET_MWI > int __must_check pci_set_mwi(struct pci_dev *dev); > +int __must_check pcim_set_mwi(struct pci_dev *dev); > int pci_try_set_mwi(struct pci_dev *dev); > void pci_clear_mwi(struct pci_dev *dev); > void pci_intx(struct pci_dev *dev, int enable); > -- > 2.15.1 > >
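The ordering suggested in the review (look up the devres record first, enable MWI second) can be illustrated with a small user-space sketch. Everything named stub_* below is a hypothetical stand-in, not the kernel's struct pci_dev, find_pci_dr() or pci_set_mwi(); only the control flow is meant to match the suggestion.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_devres. */
struct stub_devres {
	unsigned int mwi:1;      /* release path must clear MWI when set */
};

/* Hypothetical stand-in for struct pci_dev. */
struct stub_dev {
	struct stub_devres *dr;  /* NULL models a device without devres */
	int mwi_enabled;         /* models the MWI bit in the command register */
	int set_mwi_result;      /* value the stubbed pci_set_mwi() returns */
};

/* Stand-in for find_pci_dr(): the devres record, or NULL. */
static struct stub_devres *stub_find_dr(struct stub_dev *dev)
{
	return dev->dr;
}

/* Stand-in for pci_set_mwi(): enables MWI unless told to fail. */
static int stub_set_mwi(struct stub_dev *dev)
{
	if (dev->set_mwi_result)
		return dev->set_mwi_result;
	dev->mwi_enabled = 1;
	return 0;
}

/* Lookup first, enable second, as proposed in the review. */
static int stub_pcim_set_mwi(struct stub_dev *dev)
{
	struct stub_devres *dr = stub_find_dr(dev);

	if (!dr)
		return -ENOMEM;      /* nothing was enabled, nothing to undo */

	dr->mwi = 1;
	return stub_set_mwi(dev);
}
```

With this order, a failed lookup returns before MWI is touched, so the device can never be left with MWI enabled and no record telling the release path to clear it.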
Re: [PATCH net-next v5 1/2] net: add support for Cavium PTP coprocessor
Sorry I didn't finish reviewing before... On Mon, Dec 11, 2017 at 05:14:30PM +0300, Aleksey Makarov wrote: > +/** > + * cavium_ptp_adjfreq() - Adjust ptp frequency > + * @ptp: PTP clock info > + * @ppb: how much to adjust by, in parts-per-billion > + */ > +static int cavium_ptp_adjfreq(struct ptp_clock_info *ptp_info, s32 ppb) adjfreq() is deprecated. See ptp_clock_kernel.h. Please re-work this to implement the adjfine() method instead. > +/** > + * cavium_ptp_enable() - Check if PTP is enabled Nit - comment is not correct. This method is for the auxiliary PHC functions. > + * @ptp: PTP clock info > + * @rq: request > + * @on: is it on > + */ > +static int cavium_ptp_enable(struct ptp_clock_info *ptp_info, > + struct ptp_clock_request *rq, int on) > +{ > + return -EOPNOTSUPP; > +} ... > +static int cavium_ptp_probe(struct pci_dev *pdev, > + const struct pci_device_id *ent) > +{ > + struct device *dev = &pdev->dev; > + struct cavium_ptp *clock; > + struct cyclecounter *cc; > + u64 clock_cfg; > + u64 clock_comp; > + int err; > + > + clock = devm_kzalloc(dev, sizeof(*clock), GFP_KERNEL); > + if (!clock) > + return -ENOMEM; > + > + clock->pdev = pdev; > + > + err = pcim_enable_device(pdev); > + if (err) > + return err; > + > + err = pcim_iomap_regions(pdev, 1 << PCI_PTP_BAR_NO, pci_name(pdev)); > + if (err) > + return err; > + > + clock->reg_base = pcim_iomap_table(pdev)[PCI_PTP_BAR_NO]; > + > + spin_lock_init(&clock->spin_lock); > + > + cc = &clock->cycle_counter; > + cc->read = cavium_ptp_cc_read; > + cc->mask = CYCLECOUNTER_MASK(64); > + cc->mult = 1; > + cc->shift = 0; > + > + timecounter_init(&clock->time_counter, &clock->cycle_counter, > + ktime_to_ns(ktime_get_real())); > + > + clock->clock_rate = ptp_cavium_clock_get(); > + > + clock->ptp_info = (struct ptp_clock_info) { > + .owner = THIS_MODULE, > + .name = "ThunderX PTP", > + .max_adj= 10ull, > + .n_ext_ts = 0, > + .n_pins = 0, > + .pps= 0, > + .adjfreq= cavium_ptp_adjfreq, > + .adjtime= cavium_ptp_adjtime, > 
+ .gettime64 = cavium_ptp_gettime, > + .settime64 = cavium_ptp_settime, > + .enable = cavium_ptp_enable, > + }; > + > + clock_cfg = readq(clock->reg_base + PTP_CLOCK_CFG); > + clock_cfg |= PTP_CLOCK_CFG_PTP_EN; > + writeq(clock_cfg, clock->reg_base + PTP_CLOCK_CFG); > + > + clock_comp = ((u64)10ull << 32) / clock->clock_rate; > + writeq(clock_comp, clock->reg_base + PTP_CLOCK_COMP); > + > + clock->ptp_clock = ptp_clock_register(&clock->ptp_info, dev); > + if (IS_ERR(clock->ptp_clock)) { You need to handle the case when ptp_clock_register() returns NULL. from ptp_clock_kernel.h: /** * ptp_clock_register() - register a PTP hardware clock driver * * @info: Structure describing the new clock. * @parent: Pointer to the parent device of the new clock. * * Returns a valid pointer on success or PTR_ERR on failure. If PHC * support is missing at the configuration level, this function * returns NULL, and drivers are expected to gracefully handle that * case separately. */ > + clock_cfg = readq(clock->reg_base + PTP_CLOCK_CFG); > + clock_cfg &= ~PTP_CLOCK_CFG_PTP_EN; > + writeq(clock_cfg, clock->reg_base + PTP_CLOCK_CFG); > + return PTR_ERR(clock->ptp_clock); > + } > + > + pci_set_drvdata(pdev, clock); > + return 0; > +} Thanks, Richard
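Two of the review points above can be sketched in plain C. The first helper shows the unit conversion a driver typically needs when moving from adjfreq() to adjfine(): scaled_ppm is parts per million with a 16-bit binary fraction, so ppb = scaled_ppm * 1000 / 2^16. The second shows the three outcomes a caller of ptp_clock_register() has to distinguish. All sketch_* names are hypothetical, and the error-pointer test only emulates the kernel's IS_ERR() convention for illustration; this is not the driver's code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Convert adjfine()'s scaled_ppm to parts per billion.
 * 1000 / 65536 reduces to 125 / 8192.
 */
static int64_t sketch_scaled_ppm_to_ppb(long scaled_ppm)
{
	return ((int64_t)scaled_ppm * 125) / 8192;
}

/* Possible results of registering a PTP clock. */
enum sketch_reg_outcome {
	SKETCH_REG_OK,       /* valid clock pointer */
	SKETCH_REG_NO_PHC,   /* NULL: PHC support not configured */
	SKETCH_REG_FAILED    /* error pointer: registration failed */
};

#define SKETCH_MAX_ERRNO 4095

/* Emulates IS_ERR(): the top 4095 addresses encode negative errnos. */
static int sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)(-(intptr_t)SKETCH_MAX_ERRNO);
}

/* Check the error-pointer case first, then NULL, then success. */
static enum sketch_reg_outcome sketch_classify_register(const void *clock)
{
	if (sketch_is_err(clock))
		return SKETCH_REG_FAILED;
	if (!clock)
		return SKETCH_REG_NO_PHC;
	return SKETCH_REG_OK;
}
```

A probe function built on this would undo its hardware setup and return the error in the FAILED case, and decide separately what to do in the NO_PHC case, since an IS_ERR() check alone lets a NULL pointer through as if registration had succeeded.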
Re: [PATCH v3 net-next 0/9] net: Generic network resolver backend and ILA resolver
On Mon, Dec 11, 2017 at 2:16 PM, Tom Herbert wrote: > On Mon, Dec 11, 2017 at 1:34 PM, David Miller wrote: >> From: Tom Herbert >> Date: Mon, 11 Dec 2017 12:38:28 -0800 >> >>> DOS mitigations: >>> >>> - The number of outstanding resolutions is limited by the size of the >>> table >>> - Timeout of pending entries limits the number of netlink resolution >>> messages >>> - Packets are not queued that are pending resolution. In the current >>> model that can be forwarded to a router that has all reachability >>> information (ILA use case for example) >> >> None of these mitigation schemes matter. >> >> If packet traffic can influence the table of entries (your cache >> or whatever), then you will be DoS'able. >> >> If you limit outstanding resolutions, you harm legitimate traffic >> whose resolutions will not be processed now too just as equally >> as you will harm "bad guy" traffic. >> > David, > Actually, please disregard. I will respin to use secure redirects. > How can we build a system that allows an unlimited number of > resolutions without drop? Unless the resolution path can handle a > higher packet load than the receive path, there will be some place in > the system where memory is allocated and that limits the amount of > pending resolutions (i.e. pending packet skbs, entry in a resolution > table, skbs on a netlink socket). > >> If you forward in the case of pending resolution, the bad guy can >> make you forward everything there. The bad guy can effectively >> make your caching node stop caching completely. >> > But a DOS attack doesn't stop fowarding, at best it forces suboptimal > forwarding. This analogous to when the SYN cache is filled up but SYN > cookies allow forward progress in a degraded operational mode. > > Thanks, > Tom
Re: Huge memory leak with 4.15.0-rc2+
On 2017-12-11 at 23:15, John Fastabend wrote: On 12/11/2017 01:48 PM, Paweł Staszewski wrote: On 2017-12-11 at 22:23, Paweł Staszewski wrote: Hi I just upgraded some testing host to 4.15.0-rc2+ kernel And after some time of traffic processing - when traffic on all ports reach about 3Mpps - memleak started. [...] Some observations - when i disable tso on all cards there is more memleak. When traffic starts to drop - there is less and less memleak below link to memory usage graph: https://ibb.co/hU97kG And there is rising slab_unrecl - Amount of unreclaimable memory used for slab kernel allocations Forgot to add that im using hfsc and qdiscs like pfifo on classes. Maybe some error case I missed in the qdisc patches I'm looking into it. Thanks, John This is what it looks like when correlated on a graph - traffic vs. memory: https://ibb.co/njpkqG Typical hfsc class + qdisc: ### Client interface vlan1616 tc qdisc del dev vlan1616 root tc qdisc add dev vlan1616 handle 1: root hfsc default 100 tc class add dev vlan1616 parent 1: classid 1:100 hfsc ls m2 200Mbit ul m2 200Mbit tc qdisc add dev vlan1616 parent 1:100 handle 100: pfifo limit 128 ### End TM for client interface tc qdisc del dev vlan1616 ingress tc qdisc add dev vlan1616 handle : ingress tc filter add dev vlan1616 parent : protocol ip prio 50 u32 match ip src 0.0.0.0/0 police rate 200Mbit burst 200M mtu 32k drop flowid 1:1 The same setup is used on about 450 vlan interfaces. The good thing is that, compared to 4.14.3, I have about 5% less CPU load on 4.15.0-rc2+. Once hfsc or tbf becomes lockless, there will be a really huge difference in CPU load on x86 when using traffic shaping - so really good job, John.
Re: Huge memory leak with 4.15.0-rc2+
On 12/11/2017 01:48 PM, Paweł Staszewski wrote: > > > On 2017-12-11 at 22:23, Paweł Staszewski wrote: >> Hi >> >> >> I just upgraded some testing host to 4.15.0-rc2+ kernel >> >> And after some time of traffic processing - when traffic on all ports >> reach about 3Mpps - memleak started. >> [...] >> Some observations - when i disable tso on all cards there is more >> memleak. >> >> >> >> >> > When traffic starts to drop - there is less and less memleak > below link to memory usage graph: > https://ibb.co/hU97kG > > And there is rising slab_unrecl - Amount of unreclaimable memory used > for slab kernel allocations > > > Forgot to add that im using hfsc and qdiscs like pfifo on classes. > > Maybe there is some error case I missed in the qdisc patches; I'm looking into it. Thanks, John
Re: [PATCH v3 net-next 0/9] net: Generic network resolver backend and ILA resolver
On Mon, Dec 11, 2017 at 1:34 PM, David Miller wrote: > From: Tom Herbert > Date: Mon, 11 Dec 2017 12:38:28 -0800 > >> DOS mitigations: >> >> - The number of outstanding resolutions is limited by the size of the >> table >> - Timeout of pending entries limits the number of netlink resolution >> messages >> - Packets are not queued that are pending resolution. In the current >> model that can be forwarded to a router that has all reachability >> information (ILA use case for example) > > None of these mitigation schemes matter. > > If packet traffic can influence the table of entries (your cache > or whatever), then you will be DoS'able. > > If you limit outstanding resolutions, you harm legitimate traffic > whose resolutions will not be processed now too just as equally > as you will harm "bad guy" traffic. > David, How can we build a system that allows an unlimited number of resolutions without drops? Unless the resolution path can handle a higher packet load than the receive path, there will be some place in the system where memory is allocated, and that limits the number of pending resolutions (i.e. pending packet skbs, entries in a resolution table, skbs on a netlink socket). > If you forward in the case of pending resolution, the bad guy can > make you forward everything there. The bad guy can effectively > make your caching node stop caching completely. > But a DOS attack doesn't stop forwarding; at best it forces suboptimal forwarding. This is analogous to when the SYN cache is filled up but SYN cookies allow forward progress in a degraded operational mode. Thanks, Tom
Re: [REGRESSION][4.13.y][4.14.y][v4.15.y] net: reduce skb_warn_bad_offload() noise
On Mon, Dec 11, 2017 at 4:44 PM, Greg Kroah-Hartman wrote: > On Mon, Dec 11, 2017 at 04:25:26PM -0500, Willem de Bruijn wrote: >> Note that UFO was removed in 4.14 and that skb_warn_bad_offload >> can happen for various types of packets, so there may be multiple >> independent bug reports. I'm investigating two other non-UFO reports >> just now. > > Meta-comment, now that UFO is gone from mainline, I'm wondering if I > should just delete it from 4.4 and 4.9 as well. Any objections for > that? I'd like to make it easy to maintain these kernels for a while, > and having them diverge like this, with all of the issues around UFO, > seems like it will just make life harder for myself if I leave it in. > > Any opinions? Some of that removal had to be reverted with commit 0c19f846d582 ("net: accept UFO datagrams from tuntap and packet") for VM live migration between kernels. Any backports probably should squash that in at the least. Just today another thread discussed that that patch may not address all open issues still, so it may be premature to backport at this point. http://lkml.kernel.org/r/
Re: [PATCH v5] leds: trigger: Introduce a NETDEV trigger
Hi Ben, Thanks for the update. On 12/10/2017 10:17 PM, Ben Whitten wrote: > This commit introduces a NETDEV trigger for named device > activity. Available triggers are link, rx, and tx. > > Signed-off-by: Ben Whitten > > --- > Changes in v5: > Adjust header comment style to be consistent > Changes in v4: > Adopt SPDX licence header > Changes in v3: > Cancel the software blink prior to a oneshot re-queue > Changes in v2: > Sort includes and redate documentation > Correct licence > Remove macro and replace with generic function using enums > Convert blink logic in stats work to use led_blink_oneshot > Uses configured brightness instead of FULL > --- > .../ABI/testing/sysfs-class-led-trigger-netdev | 45 ++ > drivers/leds/trigger/Kconfig | 7 + > drivers/leds/trigger/Makefile | 1 + > drivers/leds/trigger/ledtrig-netdev.c | 496 > + > 4 files changed, 549 insertions(+) > create mode 100644 Documentation/ABI/testing/sysfs-class-led-trigger-netdev > create mode 100644 drivers/leds/trigger/ledtrig-netdev.c > > diff --git a/Documentation/ABI/testing/sysfs-class-led-trigger-netdev > b/Documentation/ABI/testing/sysfs-class-led-trigger-netdev > new file mode 100644 > index 000..451af6d > --- /dev/null > +++ b/Documentation/ABI/testing/sysfs-class-led-trigger-netdev > @@ -0,0 +1,45 @@ > +What:/sys/class/leds//device_name > +Date:Dec 2017 > +KernelVersion: 4.16 > +Contact: linux-l...@vger.kernel.org > +Description: > + Specifies the network device name to monitor. > + > +What:/sys/class/leds//interval > +Date:Dec 2017 > +KernelVersion: 4.16 > +Contact: linux-l...@vger.kernel.org > +Description: > + Specifies the duration of the LED blink in milliseconds. > + Defaults to 50 ms. > + > +What:/sys/class/leds//link > +Date:Dec 2017 > +KernelVersion: 4.16 > +Contact: linux-l...@vger.kernel.org > +Description: > + Signal the link state of the named network device. > + If set to 0 (default), the LED's normal state is off. 
> + If set to 1, the LED's normal state reflects the link state > + of the named network device. > + Setting this value also immediately changes the LED state. > + > +What:/sys/class/leds//tx > +Date:Dec 2017 > +KernelVersion: 4.16 > +Contact: linux-l...@vger.kernel.org > +Description: > + Signal transmission of data on the named network device. > + If set to 0 (default), the LED will not blink on transmission. > + If set to 1, the LED will blink for the milliseconds specified > + in interval to signal transmission. > + > +What:/sys/class/leds//rx > +Date:Dec 2017 > +KernelVersion: 4.16 > +Contact: linux-l...@vger.kernel.org > +Description: > + Signal reception of data on the named network device. > + If set to 0 (default), the LED will not blink on reception. > + If set to 1, the LED will blink for the milliseconds specified > + in interval to signal reception. > diff --git a/drivers/leds/trigger/Kconfig b/drivers/leds/trigger/Kconfig > index 3f9ddb9..4ec1853 100644 > --- a/drivers/leds/trigger/Kconfig > +++ b/drivers/leds/trigger/Kconfig > @@ -126,4 +126,11 @@ config LEDS_TRIGGER_PANIC > a different trigger. > If unsure, say Y. > > +config LEDS_TRIGGER_NETDEV > + tristate "LED Netdev Trigger" > + depends on NET && LEDS_TRIGGERS > + help > + This allows LEDs to be controlled by network device activity. > + If unsure, say Y. 
> + > endif # LEDS_TRIGGERS > diff --git a/drivers/leds/trigger/Makefile b/drivers/leds/trigger/Makefile > index 9f2e868..59e163d 100644 > --- a/drivers/leds/trigger/Makefile > +++ b/drivers/leds/trigger/Makefile > @@ -11,3 +11,4 @@ obj-$(CONFIG_LEDS_TRIGGER_DEFAULT_ON) += > ledtrig-default-on.o > obj-$(CONFIG_LEDS_TRIGGER_TRANSIENT) += ledtrig-transient.o > obj-$(CONFIG_LEDS_TRIGGER_CAMERA)+= ledtrig-camera.o > obj-$(CONFIG_LEDS_TRIGGER_PANIC) += ledtrig-panic.o > +obj-$(CONFIG_LEDS_TRIGGER_NETDEV)+= ledtrig-netdev.o > diff --git a/drivers/leds/trigger/ledtrig-netdev.c > b/drivers/leds/trigger/ledtrig-netdev.c > new file mode 100644 > index 000..6df4781 > --- /dev/null > +++ b/drivers/leds/trigger/ledtrig-netdev.c > @@ -0,0 +1,496 @@ > +// SPDX-License-Identifier: GPL-2.0 > +// Copyright 2017 Ben Whitten > +// Copyright 2007 Oliver Jowett > +// > +// LED Kernel Netdev Trigger > +// > +// Toggles the LED to reflect the link and traffic state of a named net > device > +// > +// Derived from ledtrig-timer.c which is: > +// Copyright 2005-2006 Openedhand Ltd. > +// Author: Richard Purdie > + > +#include > +#include > +#in
Re: Huge memory leak with 4.15.0-rc2+
On 2017-12-11 at 22:23, Paweł Staszewski wrote: Hi I just upgraded some testing host to 4.15.0-rc2+ kernel And after some time of traffic processing - when traffic on all ports reach about 3Mpps - memleak started. Graph attached from memory usage: https://ibb.co/idK4zb HW config: Intel E5 8x Intel 82599 (used ixgbe driver from kernel) Interfaces with vlans attached All 8 ethernet ports are in one LAG group configured by team. With current settings (this host is acting as a router - and bgpd process is eating same amount of memory from the beginning about 5.2GB) cat /proc/meminfo MemTotal: 32770588 kB MemFree: 11342492 kB MemAvailable: 10982752 kB Buffers: 84704 kB Cached: 83180 kB SwapCached: 0 kB Active: 5105320 kB Inactive: 46252 kB Active(anon): 4985448 kB Inactive(anon): 1096 kB Active(file): 119872 kB Inactive(file): 45156 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 4005280 kB SwapFree: 4005280 kB Dirty: 236 kB Writeback: 0 kB AnonPages: 4983752 kB Mapped: 13556 kB Shmem: 2852 kB Slab: 1013124 kB SReclaimable: 45876 kB SUnreclaim: 967248 kB KernelStack: 7152 kB PageTables: 12164 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 20390572 kB Committed_AS: 396568 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 0 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB CmaTotal: 0 kB CmaFree: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB DirectMap4k: 1407572 kB DirectMap2M: 20504576 kB DirectMap1G: 13631488 kB ps aux --sort -rss USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 6758 1.8 14.9 5044996 4886964 ? Sl 01:22 23:21 /usr/local/sbin/bgpd -d -u root -g root -I --ignore_warnings root 6752 0.0 0.1 86272 61920 ? Ss 01:22 0:16 /usr/local/sbin/zebra -d -u root -g root -I --ignore_warnings root 6766 12.6 0.0 51592 29196 ? S 01:22 157:48 /usr/sbin/snmpd -p /var/run/snmpd.pid -Ln root 7494 0.0 0.0 708976 5896 ? 
Ssl 01:22 0:09 /opt/collectd/sbin/collectd root 15531 0.0 0.0 67864 5056 ? Ss 21:57 0:00 sshd: paol [priv] root 4915 0.0 0.0 271912 4904 ? Ss 01:21 0:25 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng/syslog-ng.persist --cfgfile /etc/syslog-ng/syslog-ng.conf --pidfile /run/syslog-ng.pid root 4278 0.0 0.0 37220 4164 ? Ss 01:21 0:00 /lib/systemd/systemd-udevd --daemon root 5147 0.0 0.0 32072 3232 ? Ss 01:21 0:00 /usr/sbin/sshd root 5203 0.0 0.0 28876 2436 ? S 01:21 0:00 teamd -d -f /etc/teamd.conf root 17372 0.0 0.0 17924 2388 pts/2 R+ 22:13 0:00 ps aux --sort -rss root 4789 0.0 0.0 5032 2176 ? Ss 01:21 0:00 mdadm --monitor --scan --daemonise --pid-file /var/run/mdadm.pid --syslog root 7511 0.0 0.0 12676 1920 tty4 Ss+ 01:22 0:00 /sbin/agetty 38400 tty4 linux root 7510 0.0 0.0 12676 1896 tty3 Ss+ 01:22 0:00 /sbin/agetty 38400 tty3 linux root 7512 0.0 0.0 12676 1860 tty5 Ss+ 01:22 0:00 /sbin/agetty 38400 tty5 linux root 7513 0.0 0.0 12676 1836 tty6 Ss+ 01:22 0:00 /sbin/agetty 38400 tty6 linux root 7509 0.0 0.0 12676 1832 tty2 Ss+ 01:22 0:00 /sbin/agetty 38400 tty2 linux And latest kernel that everything was working is: 4.14.3 Some observations - when i disable tso on all cards there is more memleak. When traffic starts to drop - there is less and less memleak below link to memory usage graph: https://ibb.co/hU97kG And there is rising slab_unrecl - Amount of unreclaimable memory used for slab kernel allocations Forgot to add that im using hfsc and qdiscs like pfifo on classes.
Re: [REGRESSION][4.13.y][4.14.y][v4.15.y] net: reduce skb_warn_bad_offload() noise
On Mon, Dec 11, 2017 at 04:25:26PM -0500, Willem de Bruijn wrote: > Note that UFO was removed in 4.14 and that skb_warn_bad_offload > can happen for various types of packets, so there may be multiple > independent bug reports. I'm investigating two other non-UFO reports > just now. Meta-comment, now that UFO is gone from mainline, I'm wondering if I should just delete it from 4.4 and 4.9 as well. Any objections for that? I'd like to make it easy to maintain these kernels for a while, and having them diverge like this, with all of the issues around UFO, seems like it will just make life harder for myself if I leave it in. Any opinions? thanks, greg k-h
Re: [PATCH v3 net-next 0/9] net: Generic network resolver backend and ILA resolver
From: Tom Herbert Date: Mon, 11 Dec 2017 12:38:28 -0800 > DOS mitigations: > > - The number of outstanding resolutions is limited by the size of the > table > - Timeout of pending entries limits the number of netlink resolution > messages > - Packets are not queued that are pending resolution. In the current > model that can be forwarded to a router that has all reachability > information (ILA use case for example) None of these mitigation schemes matter. If packet traffic can influence the table of entries (your cache or whatever), then you will be DoS'able. If you limit outstanding resolutions, you harm legitimate traffic whose resolutions will not be processed now too just as equally as you will harm "bad guy" traffic. If you forward in the case of pending resolution, the bad guy can make you forward everything there. The bad guy can effectively make your caching node stop caching completely. Please, learn from OVS, the ipv4 routing cache, and the IPSEC flow cache. This kind of architecture, _especially_ when the resolution is user side, is deeply flawed. We're trying to remove code that does this kind of stuff, rather than add new instances. Thank you.
Huge memory leak with 4.15.0-rc2+
Hi I just upgraded some testing host to 4.15.0-rc2+ kernel And after some time of traffic processing - when traffic on all ports reach about 3Mpps - memleak started. Graph attached from memory usage: https://ibb.co/idK4zb HW config: Intel E5 8x Intel 82599 (used ixgbe driver from kernel) Interfaces with vlans attached All 8 ethernet ports are in one LAG group configured by team. With current settings (this host is acting as a router - and bgpd process is eating same amount of memory from the beginning about 5.2GB) cat /proc/meminfo MemTotal: 32770588 kB MemFree: 11342492 kB MemAvailable: 10982752 kB Buffers: 84704 kB Cached: 83180 kB SwapCached: 0 kB Active: 5105320 kB Inactive: 46252 kB Active(anon): 4985448 kB Inactive(anon): 1096 kB Active(file): 119872 kB Inactive(file): 45156 kB Unevictable: 0 kB Mlocked: 0 kB SwapTotal: 4005280 kB SwapFree: 4005280 kB Dirty: 236 kB Writeback: 0 kB AnonPages: 4983752 kB Mapped: 13556 kB Shmem: 2852 kB Slab: 1013124 kB SReclaimable: 45876 kB SUnreclaim: 967248 kB KernelStack: 7152 kB PageTables: 12164 kB NFS_Unstable: 0 kB Bounce: 0 kB WritebackTmp: 0 kB CommitLimit: 20390572 kB Committed_AS: 396568 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 0 kB ShmemHugePages: 0 kB ShmemPmdMapped: 0 kB CmaTotal: 0 kB CmaFree: 0 kB HugePages_Total: 0 HugePages_Free: 0 HugePages_Rsvd: 0 HugePages_Surp: 0 Hugepagesize: 2048 kB DirectMap4k: 1407572 kB DirectMap2M: 20504576 kB DirectMap1G: 13631488 kB ps aux --sort -rss USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 6758 1.8 14.9 5044996 4886964 ? Sl 01:22 23:21 /usr/local/sbin/bgpd -d -u root -g root -I --ignore_warnings root 6752 0.0 0.1 86272 61920 ? Ss 01:22 0:16 /usr/local/sbin/zebra -d -u root -g root -I --ignore_warnings root 6766 12.6 0.0 51592 29196 ? S 01:22 157:48 /usr/sbin/snmpd -p /var/run/snmpd.pid -Ln root 7494 0.0 0.0 708976 5896 ? Ssl 01:22 0:09 /opt/collectd/sbin/collectd root 15531 0.0 0.0 67864 5056 ? 
Ss 21:57 0:00 sshd: paol [priv] root 4915 0.0 0.0 271912 4904 ? Ss 01:21 0:25 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng/syslog-ng.persist --cfgfile /etc/syslog-ng/syslog-ng.conf --pidfile /run/syslog-ng.pid root 4278 0.0 0.0 37220 4164 ? Ss 01:21 0:00 /lib/systemd/systemd-udevd --daemon root 5147 0.0 0.0 32072 3232 ? Ss 01:21 0:00 /usr/sbin/sshd root 5203 0.0 0.0 28876 2436 ? S 01:21 0:00 teamd -d -f /etc/teamd.conf root 17372 0.0 0.0 17924 2388 pts/2 R+ 22:13 0:00 ps aux --sort -rss root 4789 0.0 0.0 5032 2176 ? Ss 01:21 0:00 mdadm --monitor --scan --daemonise --pid-file /var/run/mdadm.pid --syslog root 7511 0.0 0.0 12676 1920 tty4 Ss+ 01:22 0:00 /sbin/agetty 38400 tty4 linux root 7510 0.0 0.0 12676 1896 tty3 Ss+ 01:22 0:00 /sbin/agetty 38400 tty3 linux root 7512 0.0 0.0 12676 1860 tty5 Ss+ 01:22 0:00 /sbin/agetty 38400 tty5 linux root 7513 0.0 0.0 12676 1836 tty6 Ss+ 01:22 0:00 /sbin/agetty 38400 tty6 linux root 7509 0.0 0.0 12676 1832 tty2 Ss+ 01:22 0:00 /sbin/agetty 38400 tty2 linux And latest kernel that everything was working is: 4.14.3 Some observations - when i disable tso on all cards there is more memleak.
Re: [REGRESSION][4.13.y][4.14.y][v4.15.y] net: reduce skb_warn_bad_offload() noise
From: Joseph Salisbury Date: Mon, 11 Dec 2017 15:35:34 -0500 > A kernel bug report was opened against Ubuntu [0]. It was found that > reverting the following commit resolved this bug: > > commit b2504a5dbef3305ef41988ad270b0e8ec289331c > Author: Eric Dumazet > Date: Tue Jan 31 10:20:32 2017 -0800 > > net: reduce skb_warn_bad_offload() noise > > > The regression was introduced as of v4.11-rc1 and still exists in > current mainline. > > I was hoping to get your feedback, since you are the patch author. Do > you think gathering any additional data will help diagnose this issue, > or would it be best to submit a revert request? > > This commit did in fact resolve another bug[1], but in the process > introduced this regression. It helps if you can consolidate the information obtained in your bug tracking here in the email so that people on this list can get an idea of what the problem scope might be without having to go to your special bug tracking site. This is really not about us being snobs about this mailing list, it's about you wanting to get a result. And you'll get a better result faster if you post the details here on the list because most developers are not going to go to your bug tracking site to read the bug comments. Also, this isn't a functional regression, it is just that we are generating warnings that we didn't before. It doesn't mean that Eric's patch is wrong, it could just be that his new check is triggering for a bug that has always been there. Scanning the bug myself it seems that the critical required component is IPSEC, and IPSEC has its own way of doing segmentation offload. Thanks.
Re: [REGRESSION][4.13.y][4.14.y][v4.15.y] net: reduce skb_warn_bad_offload() noise
On Mon, Dec 11, 2017 at 3:35 PM, Joseph Salisbury wrote: > Hi Eric, > > A kernel bug report was opened against Ubuntu [0]. It was found that > reverting the following commit resolved this bug: The recorded trace in that bug is against 4.10.0 with some backports. Given that commit b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise") is implicated, I guess that that was backported from 4.11-rc1. The WARN shows e1000e: caps=(0x0030002149a9, 0x) len=1701 data_len=1659 gso_size=1480 gso_type=2 ip_summed=0 The numbering changed in 4.14, but for this kernel SKB_GSO_UDP = 1 << 1, so this is a UFO packet with CHECKSUM_NONE. The stack shows kernel: [570943.494549] skb_warn_bad_offload+0xd1/0x120 kernel: [570943.494550] __skb_gso_segment+0x17d/0x190 kernel: [570943.494564] validate_xmit_skb+0x14f/0x2a0 kernel: [570943.494565] validate_xmit_skb_list+0x43/0x70 so if that patch has been backported, then this must trigger in __skb_gso_segment on the return path from skb_mac_gso_segment. Did you backport commit 8d63bee643f1fb53e472f0e135cae4eb99d62d19 Author: Willem de Bruijn Date: Tue Aug 8 14:22:55 2017 -0400 net: avoid skb_warn_bad_offload false positives on UFO skb_warn_bad_offload triggers a warning when an skb enters the GSO stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL checksum offload set. Commit b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise") observed that SKB_GSO_DODGY producers can trigger the check and that passing those packets through the GSO handlers will fix it up. But, the software UFO handler will set ip_summed to CHECKSUM_NONE. When __skb_gso_segment is called from the receive path, this triggers the warning again. Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On Tx these two are equivalent. On Rx, this better matches the skb state (checksum computed), as CHECKSUM_NONE here means no checksum computed. 
See also this thread for context: http://patchwork.ozlabs.org/patch/799015/ Fixes: b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise") Signed-off-by: Willem de Bruijn Signed-off-by: David S. Miller Note that UFO was removed in 4.14 and that skb_warn_bad_offload can happen for various types of packets, so there may be multiple independent bug reports. I'm investigating two other non-UFO reports just now.
[PATCH iproute2 1/1] ss: remove duplicate assignment
Signed-off-by: Roman Mashak --- misc/ss.c | 1 - 1 file changed, 1 deletion(-) diff --git a/misc/ss.c b/misc/ss.c index 90da93e..da52d5e 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -2306,7 +2306,6 @@ static void tcp_show_info(const struct nlmsghdr *nlh, struct inet_diag_msg *r, s.sacked = info->tcpi_sacked; s.fackets= info->tcpi_fackets; s.reordering = info->tcpi_reordering; - s.rcv_space = info->tcpi_rcv_space; s.rcv_ssthresh = info->tcpi_rcv_ssthresh; s.cwnd = info->tcpi_snd_cwnd; -- 2.7.4
Re: [PATCH net,stable] net: qmi_wwan: add Quectel BG96 2c7c:0296
Hi, Sorry for re-sending the patch below; it was clearly a beginner's mistake on my part not to clear my tmp/ folder. Please disregard this. Regards, Sebastian > On Dec 11, 2017, at 21:12 , ssjoh...@mac.com wrote: > > From: Sebastian Sjoholm > > Quectel BG96 is an Qualcomm MDM9206 based IoT modem, supporting both > CAT-M and NB-IoT. Tested hardware is BG96 mounted on Quectel development > board (EVB). The USB id is added to qmi_wwan.c to allow QMI > communication with the BG96. > > Signed-off-by: Sebastian Sjoholm > > --- > drivers/net/usb/qmi_wwan.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c > index 720a3a248070..c750cf7c042b 100644 > --- a/drivers/net/usb/qmi_wwan.c > +++ b/drivers/net/usb/qmi_wwan.c > @@ -1239,6 +1239,7 @@ static const struct usb_device_id products[] = { > {QMI_FIXED_INTF(0x1e0e, 0x9001, 5)},/* SIMCom 7230E */ > {QMI_QUIRK_SET_DTR(0x2c7c, 0x0125, 4)}, /* Quectel EC25, EC20 R2.0 > Mini PCIe */ > {QMI_QUIRK_SET_DTR(0x2c7c, 0x0121, 4)}, /* Quectel EC21 Mini PCIe */ > + {QMI_FIXED_INTF(0x2c7c, 0x0296, 4)},/* Quectel BG96 */ > > /* 4. Gobi 1000 devices */ > {QMI_GOBI1K_DEVICE(0x05c6, 0x9212)},/* Acer Gobi Modem Device */ > -- > 2.11.0 (Apple Git-81) >
[PATCH ipsec-next] xfrm: check for xdo_dev_state_free
The current XFRM code assumes that we've implemented the xdo_dev_state_free() callback, even if it is meaningless to the driver. This patch adds a check for it before calling, as is done in other APIs and as is already done for the xdo_state_offload_ok() callback. Also, we add a check for the required add and delete functions up front at registration time to be sure both are defined, and complain if not. Signed-off-by: Shannon Nelson --- include/net/xfrm.h | 3 ++- net/xfrm/xfrm_device.c | 18 ++ 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/include/net/xfrm.h b/include/net/xfrm.h index e015e16..dfabd04 100644 --- a/include/net/xfrm.h +++ b/include/net/xfrm.h @@ -1891,7 +1891,8 @@ static inline void xfrm_dev_state_free(struct xfrm_state *x) struct net_device *dev = xso->dev; if (dev && dev->xfrmdev_ops) { - dev->xfrmdev_ops->xdo_dev_state_free(x); + if (dev->xfrmdev_ops->xdo_dev_state_free) + dev->xfrmdev_ops->xdo_dev_state_free(x); xso->dev = NULL; dev_put(dev); } diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c index 30e5746..0df1cc2 100644 --- a/net/xfrm/xfrm_device.c +++ b/net/xfrm/xfrm_device.c @@ -144,11 +144,21 @@ EXPORT_SYMBOL_GPL(xfrm_dev_offload_ok); static int xfrm_dev_register(struct net_device *dev) { - if ((dev->features & NETIF_F_HW_ESP) && !dev->xfrmdev_ops) - return NOTIFY_BAD; - if ((dev->features & NETIF_F_HW_ESP_TX_CSUM) && - !(dev->features & NETIF_F_HW_ESP)) + if (!(dev->features & NETIF_F_HW_ESP)) { + if (dev->features & NETIF_F_HW_ESP_TX_CSUM) { + netdev_err(dev, "NETIF_F_HW_ESP_TX_CSUM without NETIF_F_HW_ESP\n"); + return NOTIFY_BAD; + } else { + return NOTIFY_DONE; + } + } + + if (!(dev->xfrmdev_ops && + dev->xfrmdev_ops->xdo_dev_state_add && + dev->xfrmdev_ops->xdo_dev_state_delete)) { + netdev_err(dev, "add or delete function missing from xfrmdev_ops\n"); return NOTIFY_BAD; + } return NOTIFY_DONE; } -- 2.7.4
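For readers less familiar with the pattern, here is a minimal userspace sketch of what the patch does: required callbacks are validated once at registration, and an optional callback is NULL-checked at the call site. Everything named demo_* is hypothetical and only stands in for xfrmdev_ops and the NOTIFY_* return codes; it is not the kernel code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel types; not the real xfrmdev_ops. */
struct demo_state {
	int id;
};

struct demo_offload_ops {
	int  (*state_add)(struct demo_state *x);    /* required */
	void (*state_delete)(struct demo_state *x); /* required */
	void (*state_free)(struct demo_state *x);   /* optional, may be NULL */
};

enum { DEMO_NOTIFY_DONE = 0, DEMO_NOTIFY_BAD = 1 };

/* Sample driver callbacks used only to populate an ops table. */
int demo_add(struct demo_state *x)
{
	return x ? 0 : -1;
}

void demo_delete(struct demo_state *x)
{
	(void)x;
}

/*
 * Registration-time check: a device that advertises the offload feature
 * must supply both required callbacks, otherwise registration is refused.
 */
int demo_register(int has_offload_feature, const struct demo_offload_ops *ops)
{
	if (!has_offload_feature)
		return DEMO_NOTIFY_DONE;

	if (!(ops && ops->state_add && ops->state_delete)) {
		fprintf(stderr, "add or delete function missing from ops\n");
		return DEMO_NOTIFY_BAD;
	}
	return DEMO_NOTIFY_DONE;
}

/*
 * Call-site check: the optional callback is invoked only when present.
 * Returns 1 if the callback ran, 0 if it was skipped.
 */
int demo_state_free(const struct demo_offload_ops *ops, struct demo_state *x)
{
	if (ops && ops->state_free) {
		ops->state_free(x);
		return 1;
	}
	return 0;
}
```

With this split, a driver that has nothing to do on free can simply leave the pointer NULL, while a driver that forgets add or delete is rejected up front rather than crashing later.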
[PATCH net,stable] net: qmi_wwan: add Sierra EM7565 1199:9091
From: Sebastian Sjoholm Sierra Wireless EM7565 is a Qualcomm MDM9x50 based M.2 modem. The USB id is added to qmi_wwan.c to allow QMI communication with the EM7565. Signed-off-by: Sebastian Sjoholm Acked-by: Bjørn Mork --- [The corresponding qcserial patch will be submitted by Reinhard Speyerer.] --- drivers/net/usb/qmi_wwan.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c index 304ec6555cd8..3cebd6683938 100644 --- a/drivers/net/usb/qmi_wwan.c +++ b/drivers/net/usb/qmi_wwan.c @@ -1204,6 +1204,7 @@ static const struct usb_device_id products[] = { {QMI_FIXED_INTF(0x1199, 0x9079, 10)}, /* Sierra Wireless EM74xx */ {QMI_FIXED_INTF(0x1199, 0x907b, 8)},/* Sierra Wireless EM74xx */ {QMI_FIXED_INTF(0x1199, 0x907b, 10)}, /* Sierra Wireless EM74xx */ + {QMI_FIXED_INTF(0x1199, 0x9091, 8)},/* Sierra Wireless EM7565 */ {QMI_FIXED_INTF(0x1bbb, 0x011e, 4)},/* Telekom Speedstick LTE II (Alcatel One Touch L100V LTE) */ {QMI_FIXED_INTF(0x1bbb, 0x0203, 2)},/* Alcatel L800MA */ {QMI_FIXED_INTF(0x2357, 0x0201, 4)},/* TP-LINK HSUPA Modem MA180 */ -- 2.14.1
[REGRESSION][4.13.y][4.14.y][v4.15.y] net: reduce skb_warn_bad_offload() noise
Hi Eric, A kernel bug report was opened against Ubuntu [0]. It was found that reverting the following commit resolved this bug: commit b2504a5dbef3305ef41988ad270b0e8ec289331c Author: Eric Dumazet Date: Tue Jan 31 10:20:32 2017 -0800 net: reduce skb_warn_bad_offload() noise The regression was introduced as of v4.11-rc1 and still exists in current mainline. I was hoping to get your feedback, since you are the patch author. Do you think gathering any additional data will help diagnose this issue, or would it be best to submit a revert request? This commit did in fact resolve another bug[1], but in the process introduced this regression. Thanks, Joe [0] http://pad.lv/1715609 [1] http://pad.lv/1705447
[PATCH v3 net-next 1/9] lwt: Add net to build_state argument
Users of LWT need to know net if they want to have per net operations in LWT. Signed-off-by: Tom Herbert --- include/net/lwtunnel.h| 6 +++--- net/core/lwt_bpf.c| 2 +- net/core/lwtunnel.c | 4 ++-- net/ipv4/fib_semantics.c | 13 - net/ipv4/ip_tunnel_core.c | 4 ++-- net/ipv6/ila/ila_lwt.c| 2 +- net/ipv6/route.c | 2 +- net/ipv6/seg6_iptunnel.c | 2 +- net/ipv6/seg6_local.c | 5 +++-- net/mpls/mpls_iptunnel.c | 2 +- 10 files changed, 23 insertions(+), 19 deletions(-) diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h index d747ef975cd8..da5e51e0d122 100644 --- a/include/net/lwtunnel.h +++ b/include/net/lwtunnel.h @@ -34,7 +34,7 @@ struct lwtunnel_state { }; struct lwtunnel_encap_ops { - int (*build_state)(struct nlattr *encap, + int (*build_state)(struct net *net, struct nlattr *encap, unsigned int family, const void *cfg, struct lwtunnel_state **ts, struct netlink_ext_ack *extack); @@ -113,7 +113,7 @@ int lwtunnel_valid_encap_type(u16 encap_type, struct netlink_ext_ack *extack); int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int len, struct netlink_ext_ack *extack); -int lwtunnel_build_state(u16 encap_type, +int lwtunnel_build_state(struct net *net, u16 encap_type, struct nlattr *encap, unsigned int family, const void *cfg, struct lwtunnel_state **lws, @@ -192,7 +192,7 @@ static inline int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int len, return 0; } -static inline int lwtunnel_build_state(u16 encap_type, +static inline int lwtunnel_build_state(struct net *net, u16 encap_type, struct nlattr *encap, unsigned int family, const void *cfg, struct lwtunnel_state **lws, diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c index e7e626fb87bb..3a3ac13fcf06 100644 --- a/net/core/lwt_bpf.c +++ b/net/core/lwt_bpf.c @@ -238,7 +238,7 @@ static const struct nla_policy bpf_nl_policy[LWT_BPF_MAX + 1] = { [LWT_BPF_XMIT_HEADROOM] = { .type = NLA_U32 }, }; -static int bpf_build_state(struct nlattr *nla, +static int bpf_build_state(struct net *net, struct nlattr 
*nla, unsigned int family, const void *cfg, struct lwtunnel_state **ts, struct netlink_ext_ack *extack) diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c index 0b171756453c..b3f2f77dfe72 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -103,7 +103,7 @@ int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops *ops, } EXPORT_SYMBOL_GPL(lwtunnel_encap_del_ops); -int lwtunnel_build_state(u16 encap_type, +int lwtunnel_build_state(struct net *net, u16 encap_type, struct nlattr *encap, unsigned int family, const void *cfg, struct lwtunnel_state **lws, struct netlink_ext_ack *extack) @@ -124,7 +124,7 @@ int lwtunnel_build_state(u16 encap_type, ops = rcu_dereference(lwtun_encaps[encap_type]); if (likely(ops && ops->build_state && try_module_get(ops->owner))) { found = true; - ret = ops->build_state(encap, family, cfg, lws, extack); + ret = ops->build_state(net, encap, family, cfg, lws, extack); if (ret) module_put(ops->owner); } diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c index f04d944f8abe..4979e5c6b9b8 100644 --- a/net/ipv4/fib_semantics.c +++ b/net/ipv4/fib_semantics.c @@ -523,6 +523,7 @@ static int fib_get_nhs(struct fib_info *fi, struct rtnexthop *rtnh, if (nla) { struct lwtunnel_state *lwtstate; struct nlattr *nla_entype; + struct net *net = cfg->fc_nlinfo.nl_net; nla_entype = nla_find(attrs, attrlen, RTA_ENCAP_TYPE); @@ -533,7 +534,7 @@ static int fib_get_nhs(struct fib_info *fi, struct rtnexthop *rtnh, goto err_inval; } - ret = lwtunnel_build_state(nla_get_u16( + ret = lwtunnel_build_state(net, nla_get_u16( nla_entype), nla, AF_INET, cfg, &lwtstate, extack); @@ -607,7 +608,7 @@ static void fib_rebalance(struct fib_info *fi) #endif /* CONFIG_IP_ROUTE_MULTIPATH */ -static int fib_encap_
[PATCH v3 net-next 7/9] ila: Resolver mechanism
Implement an ILA resolver. This uses LWT to implement the hook to a userspace resolver and tracks pending unresolved addresses using the backend net resolver.

The idea is that the kernel sets an ILA resolver route to the SIR prefix, something like:

  ip route add ::/64 encap ila-resolve \
       via 2401:db00:20:911a::27:0 dev eth0

When a packet hits the route the address is looked up in a resolver table. If the entry is created (no entry with the address already exists) then an rtnl message is generated with group RTNLGRP_ILA_NOTIFY and type RTM_ADDR_RESOLVE. A userspace daemon can listen for such messages and perform an ILA resolution protocol to determine the ILA mapping. If the mapping is resolved then a /128 ILA encap route is set so that the host can perform ILA translation and send directly to the destination.

Signed-off-by: Tom Herbert --- include/uapi/linux/ila.h | 9 ++ include/uapi/linux/lwtunnel.h | 1 + include/uapi/linux/rtnetlink.h | 8 +- net/core/lwtunnel.c| 2 + net/ipv6/Kconfig | 1 + net/ipv6/ila/Makefile | 2 +- net/ipv6/ila/ila.h | 11 ++ net/ipv6/ila/ila_lwt.c | 8 ++ net/ipv6/ila/ila_main.c| 14 +++ net/ipv6/ila/ila_resolver.c| 244 + 10 files changed, 298 insertions(+), 2 deletions(-) create mode 100644 net/ipv6/ila/ila_resolver.c diff --git a/include/uapi/linux/ila.h b/include/uapi/linux/ila.h index db45d3e49a12..66557265bf5b 100644 --- a/include/uapi/linux/ila.h +++ b/include/uapi/linux/ila.h @@ -65,4 +65,13 @@ enum { ILA_HOOK_ROUTE_INPUT, }; +enum { + ILA_NOTIFY_ATTR_UNSPEC, + ILA_NOTIFY_ATTR_TIMEOUT,/* u32 */ + + __ILA_NOTIFY_ATTR_MAX, +}; + +#define ILA_NOTIFY_ATTR_MAX(__ILA_NOTIFY_ATTR_MAX - 1) + #endif /* _UAPI_LINUX_ILA_H */ diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h index de696ca12f2c..2eac16f8323f 100644 --- a/include/uapi/linux/lwtunnel.h +++ b/include/uapi/linux/lwtunnel.h @@ -13,6 +13,7 @@ enum lwtunnel_encap_types { LWTUNNEL_ENCAP_SEG6, LWTUNNEL_ENCAP_BPF, LWTUNNEL_ENCAP_SEG6_LOCAL, + LWTUNNEL_ENCAP_ILA_NOTIFY, 
__LWTUNNEL_ENCAP_MAX, }; diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h index d8b5f80c2ea6..8d358a300d8a 100644 --- a/include/uapi/linux/rtnetlink.h +++ b/include/uapi/linux/rtnetlink.h @@ -13,7 +13,8 @@ */ #define RTNL_FAMILY_IPMR 128 #define RTNL_FAMILY_IP6MR 129 -#define RTNL_FAMILY_MAX129 +#define RTNL_FAMILY_ILA130 +#define RTNL_FAMILY_MAX130 / * Routing/neighbour discovery messages. @@ -150,6 +151,9 @@ enum { RTM_NEWCACHEREPORT = 96, #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT + RTM_ADDR_RESOLVE = 98, +#define RTM_ADDR_RESOLVE RTM_ADDR_RESOLVE + __RTM_MAX, #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1) }; @@ -676,6 +680,8 @@ enum rtnetlink_groups { #define RTNLGRP_IPV4_MROUTE_R RTNLGRP_IPV4_MROUTE_R RTNLGRP_IPV6_MROUTE_R, #define RTNLGRP_IPV6_MROUTE_R RTNLGRP_IPV6_MROUTE_R + RTNLGRP_ILA_NOTIFY, +#define RTNLGRP_ILA_NOTIFY RTNLGRP_ILA_NOTIFY __RTNLGRP_MAX }; #define RTNLGRP_MAX(__RTNLGRP_MAX - 1) diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c index b3f2f77dfe72..16b04d05e9b9 100644 --- a/net/core/lwtunnel.c +++ b/net/core/lwtunnel.c @@ -46,6 +46,8 @@ static const char *lwtunnel_encap_str(enum lwtunnel_encap_types encap_type) return "BPF"; case LWTUNNEL_ENCAP_SEG6_LOCAL: return "SEG6LOCAL"; + case LWTUNNEL_ENCAP_ILA_NOTIFY: + return "ILA-NOTIFY"; case LWTUNNEL_ENCAP_IP6: case LWTUNNEL_ENCAP_IP: case LWTUNNEL_ENCAP_NONE: diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig index ea71e4b0ab7a..5b0a6e1bd7cc 100644 --- a/net/ipv6/Kconfig +++ b/net/ipv6/Kconfig @@ -110,6 +110,7 @@ config IPV6_ILA tristate "IPv6: Identifier Locator Addressing (ILA)" depends on NETFILTER select LWTUNNEL + select NET_RESOLVER ---help--- Support for IPv6 Identifier Locator Addressing (ILA). 
diff --git a/net/ipv6/ila/Makefile b/net/ipv6/ila/Makefile index b7739aba6e68..3ec2d65ceee2 100644 --- a/net/ipv6/ila/Makefile +++ b/net/ipv6/ila/Makefile @@ -4,4 +4,4 @@ obj-$(CONFIG_IPV6_ILA) += ila.o -ila-objs := ila_main.o ila_common.o ila_lwt.o ila_xlat.o +ila-objs := ila_main.o ila_common.o ila_lwt.o ila_xlat.o ila_resolver.o diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h index 1f747bcbec29..02a800c71796 100644 --- a/net/ipv6/ila/ila.h +++ b/net/ipv6/ila/ila.h @@ -15,6 +15,7 @@ #include #include #include +#include #include #include #include @@ -112,6 +113,9 @@ struct ila_net { unsigned int locks_mask; bool hooks_registered;
[PATCH v3 net-next 9/9] ila: add netlink control ILA resolver
Add a netlink family to process netlink messages for the ILA resolver. This calls the net resolver netlink functions. Signed-off-by: Tom Herbert --- include/uapi/linux/ila.h| 11 net/ipv6/ila/ila.h | 8 ++ net/ipv6/ila/ila_main.c | 26 ++ net/ipv6/ila/ila_resolver.c | 67 - 4 files changed, 111 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/ila.h b/include/uapi/linux/ila.h index 66557265bf5b..2481dab25d57 100644 --- a/include/uapi/linux/ila.h +++ b/include/uapi/linux/ila.h @@ -19,6 +19,8 @@ enum { ILA_ATTR_CSUM_MODE, /* u8 */ ILA_ATTR_IDENT_TYPE,/* u8 */ ILA_ATTR_HOOK_TYPE, /* u8 */ + ILA_RSLV_ATTR_DST, /* IPv6 address */ + ILA_RSLV_ATTR_TIMEOUT, /* u32 */ __ILA_ATTR_MAX, }; @@ -31,6 +33,10 @@ enum { ILA_CMD_DEL, ILA_CMD_GET, ILA_CMD_FLUSH, + ILA_RSLV_CMD_ADD, + ILA_RSLV_CMD_DEL, + ILA_RSLV_CMD_GET, + ILA_RSLV_CMD_FLUSH, __ILA_CMD_MAX, }; @@ -68,10 +74,15 @@ enum { enum { ILA_NOTIFY_ATTR_UNSPEC, ILA_NOTIFY_ATTR_TIMEOUT,/* u32 */ + ILA_NOTIFY_ATTR_DST,/* Binary address */ __ILA_NOTIFY_ATTR_MAX, }; #define ILA_NOTIFY_ATTR_MAX(__ILA_NOTIFY_ATTR_MAX - 1) +/* NETLINK_GENERIC related info */ +#define ILA_RSLV_GENL_NAME "ila-rslv" +#define ILA_RSLV_GENL_VERSION 0x1 + #endif /* _UAPI_LINUX_ILA_H */ diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h index 02a800c71796..0aa99e359a38 100644 --- a/net/ipv6/ila/ila.h +++ b/net/ipv6/ila/ila.h @@ -137,6 +137,14 @@ int ila_xlat_nl_dump_start(struct netlink_callback *cb); int ila_xlat_nl_dump_done(struct netlink_callback *cb); int ila_xlat_nl_dump(struct sk_buff *skb, struct netlink_callback *cb); +int ila_rslv_nl_cmd_add(struct sk_buff *skb, struct genl_info *info); +int ila_rslv_nl_cmd_del(struct sk_buff *skb, struct genl_info *info); +int ila_rslv_nl_cmd_get(struct sk_buff *skb, struct genl_info *info); +int ila_rslv_nl_cmd_flush(struct sk_buff *skb, struct genl_info *info); +int ila_rslv_nl_dump_start(struct netlink_callback *cb); +int ila_rslv_nl_dump_done(struct netlink_callback *cb); +int ila_rslv_nl_dump(struct sk_buff 
*skb, struct netlink_callback *cb); + extern unsigned int ila_net_id; extern struct genl_family ila_nl_family; diff --git a/net/ipv6/ila/ila_main.c b/net/ipv6/ila/ila_main.c index 411d3d112157..8589d422568b 100644 --- a/net/ipv6/ila/ila_main.c +++ b/net/ipv6/ila/ila_main.c @@ -40,6 +40,32 @@ static const struct genl_ops ila_nl_ops[] = { .done = ila_xlat_nl_dump_done, .policy = ila_nl_policy, }, + { + .cmd = ILA_RSLV_CMD_ADD, + .doit = ila_rslv_nl_cmd_add, + .policy = ila_nl_policy, + .flags = GENL_ADMIN_PERM, + }, + { + .cmd = ILA_RSLV_CMD_DEL, + .doit = ila_rslv_nl_cmd_del, + .policy = ila_nl_policy, + .flags = GENL_ADMIN_PERM, + }, + { + .cmd = ILA_RSLV_CMD_FLUSH, + .doit = ila_rslv_nl_cmd_flush, + .policy = ila_nl_policy, + .flags = GENL_ADMIN_PERM, + }, + { + .cmd = ILA_RSLV_CMD_GET, + .doit = ila_rslv_nl_cmd_get, + .start = ila_rslv_nl_dump_start, + .dumpit = ila_rslv_nl_dump, + .done = ila_rslv_nl_dump_done, + .policy = ila_nl_policy, + }, }; unsigned int ila_net_id; diff --git a/net/ipv6/ila/ila_resolver.c b/net/ipv6/ila/ila_resolver.c index 2aebc0526221..3278e93bb799 100644 --- a/net/ipv6/ila/ila_resolver.c +++ b/net/ipv6/ila/ila_resolver.c @@ -209,6 +209,13 @@ static const struct lwtunnel_encap_ops ila_rslv_ops = { #define ILA_MAX_SIZE 8192 +static struct net_rslv_netlink_map ila_netlink_map = { + .dst_attr = ILA_RSLV_ATTR_DST, + .timo_attr = ILA_RSLV_ATTR_TIMEOUT, + .get_cmd = ILA_RSLV_CMD_GET, + .genl_family = &ila_nl_family, +}; + int ila_rslv_init_net(struct net *net) { struct ila_net *ilan = net_generic(net, ila_net_id); @@ -216,7 +223,7 @@ int ila_rslv_init_net(struct net *net) nrslv = net_rslv_create(sizeof(struct ila_addr), sizeof(struct ila_addr), ILA_MAX_SIZE, NULL, - NULL); + &ila_netlink_map); if (IS_ERR(nrslv)) return PTR_ERR(nrslv); @@ -234,6 +241,64 @@ void ila_rslv_exit_net(struct net *net) net_rslv_destroy(ilan->rslv.nrslv); } +/* Netlink access */ + +int ila_rslv_nl_cmd_add(struct sk_buff *skb, struct genl_info *info) +{ + struct net *net 
= sock_net(skb->sk); + struct ila_net *ilan = net_generic(net
[PATCH v3 net-next 6/9] net: Generic resolver backend
This patch implements the backend of a resolver. Specifically, it provides a means to track unresolved addresses and to expire entries based on a timeout. The resolver is mostly a frontend to an rhashtable where the key of the table is whatever address type or object is tracked.

A resolver instance is created by net_rslv_create and destroyed by net_rslv_destroy. Two functions are used to manipulate entries in the table: net_rslv_lookup_and_create and net_rslv_resolved.

net_rslv_lookup_and_create is called with an unresolved address as the argument. It returns zero on success and an error on failure. When called, a lookup is performed to see if an entry for the address is already in the table; if it is, then -EEXIST is returned. If an entry is not found, one is created and zero is returned. It is expected that when an entry is new the address resolution protocol is initiated (for instance, an RTM_ADDR_RESOLVE message may be sent to a userspace daemon, as we will do in ILA). If net_rslv_lookup_and_create returns an error other than -EEXIST, then presumably the hash table has reached the limit on the number of outstanding unresolved addresses, and the caller should take appropriate action to avoid spamming the resolution protocol.

net_rslv_resolved is called when resolution is complete (e.g. an ILA locator mapping was instantiated for a locator). The entry is removed from the hash table.

The timeout argument to net_rslv_lookup_and_create gives the time allowed for the pending resolution, in milliseconds. If the timer fires before resolution then the entry is removed from the table. Subsequently, another attempt to resolve the same address will result in a new entry in the table.

There is one callback function that can be set as an argument to net_rslv_create:

 - cmp_fn: Compare function for the hash table. Arguments are the key and an object in the table. If this is NULL then the default memcmp of rhashtable is used.
DoS mitigation is done by limiting the number of entries in the resolver table (the max_size argument of net_rslv_create) and setting a timeout. If the timeout is set then the maximum rate of new resolution requests is max_table_size / timeout. For instance, with a maximum size of 1000 entries and a timeout of 100 msecs the maximum rate of resolution requests is 10000/s. Signed-off-by: Tom Herbert --- include/net/resolver.h | 43 net/Kconfig | 1 + net/Makefile| 1 + net/resolver/Kconfig| 7 ++ net/resolver/Makefile | 8 ++ net/resolver/resolver.c | 283 6 files changed, 343 insertions(+) create mode 100644 include/net/resolver.h create mode 100644 net/resolver/Kconfig create mode 100644 net/resolver/Makefile create mode 100644 net/resolver/resolver.c diff --git a/include/net/resolver.h b/include/net/resolver.h new file mode 100644 index ..f38c7e9f1205 --- /dev/null +++ b/include/net/resolver.h @@ -0,0 +1,43 @@ +/* + * Generic network address resovler backend + * + * Copyright (c) 2017 Tom Herbert + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. 
+ */ + +#ifndef __NET_RESOLVER_H +#define __NET_RESOLVER_H + +#include +#include + +struct net_rslv; + +typedef int (*net_rslv_cmpfn)(struct net_rslv *nrslv, const void *key, + const void *object); + +struct net_rslv { + struct rhashtable rhash_table; + struct rhashtable_params params; + net_rslv_cmpfn rslv_cmp; + size_t obj_size; + spinlock_t *locks; + unsigned int locks_mask; + unsigned int hash_rnd; +}; + +struct net_rslv *net_rslv_create(size_t obj_size, size_t key_len, +size_t max_size, net_rslv_cmpfn cmp_fn); + +void net_rslv_destroy(struct net_rslv *nrslv); + +int net_rslv_lookup_and_create(struct net_rslv *nrslv, void *key, + unsigned int timeout); + +void net_rslv_resolved(struct net_rslv *nrslv, void *key); + +#endif /* __NET_RESOLVER_H */ diff --git a/net/Kconfig b/net/Kconfig index 9dba2715919d..b1e73325de6a 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -399,6 +399,7 @@ source "net/ceph/Kconfig" source "net/nfc/Kconfig" source "net/psample/Kconfig" source "net/ife/Kconfig" +source "net/resolver/Kconfig" config LWTUNNEL bool "Network light weight tunnels" diff --git a/net/Makefile b/net/Makefile index 14fede520840..6b3b0c5e676a 100644 --- a/net/Makefile +++ b/net/Makefile @@ -86,3 +86,4 @@ obj-y += l3mdev/ endif obj-$(CONFIG_QRTR) += qrtr/ obj-$(CONFIG_NET_NCSI) += ncsi/ +obj-$(CONFIG_NET_RESOLVER) += resolver/ diff --git a/net/resolver/Kconfig b/net/resolver/Kconfig new file mode 100644 index ..99eff276e
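To make the lookup-and-create semantics described in this patch concrete, here is a small userspace sketch. It is not the kernel implementation: it uses a fixed array instead of an rhashtable, expires entries lazily on lookup instead of from a timer, and the choice of -E2BIG for a full table is an assumption (the description only says "an error other than -EEXIST"). All demo_* names are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_ENTRIES 4 /* stands in for the max_size argument */

struct demo_ent {
	int used;
	uint64_t key;        /* the unresolved address being tracked */
	uint64_t expires_ms; /* absolute expiry time; 0 means no timeout */
};

struct demo_rslv {
	struct demo_ent ents[DEMO_MAX_ENTRIES];
};

void demo_rslv_init(struct demo_rslv *r)
{
	memset(r, 0, sizeof(*r));
}

/* Drop entries whose timeout has passed (the kernel uses a timer instead). */
static void demo_expire(struct demo_rslv *r, uint64_t now_ms)
{
	size_t i;

	for (i = 0; i < DEMO_MAX_ENTRIES; i++) {
		struct demo_ent *e = &r->ents[i];

		if (e->used && e->expires_ms && e->expires_ms <= now_ms)
			e->used = 0;
	}
}

/*
 * Returns 0 if a new pending entry was created (caller should start
 * resolution), -EEXIST if resolution is already pending for this key,
 * and -E2BIG (an assumed error code) if the table is full.
 */
int demo_lookup_and_create(struct demo_rslv *r, uint64_t key,
			   uint64_t now_ms, unsigned int timeout_ms)
{
	struct demo_ent *free_slot = NULL;
	size_t i;

	demo_expire(r, now_ms);

	for (i = 0; i < DEMO_MAX_ENTRIES; i++) {
		struct demo_ent *e = &r->ents[i];

		if (e->used && e->key == key)
			return -EEXIST;
		if (!e->used && !free_slot)
			free_slot = e;
	}

	if (!free_slot)
		return -E2BIG;

	free_slot->used = 1;
	free_slot->key = key;
	free_slot->expires_ms = timeout_ms ? now_ms + timeout_ms : 0;
	return 0;
}

/* Resolution finished: remove the pending entry, if any. */
void demo_resolved(struct demo_rslv *r, uint64_t key)
{
	size_t i;

	for (i = 0; i < DEMO_MAX_ENTRIES; i++) {
		if (r->ents[i].used && r->ents[i].key == key)
			r->ents[i].used = 0;
	}
}

/* Upper bound on new resolution requests per second: max_size / timeout. */
unsigned long demo_max_rate_per_sec(unsigned long max_size,
				    unsigned long timeout_ms)
{
	return max_size * 1000UL / timeout_ms;
}
```

As a worked instance of the rate bound, demo_max_rate_per_sec(1000, 100) evaluates to 10000, since 1000 entries can each turn over every 0.1 s.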
[PATCH v3 net-next 5/9] ila: Flush netlink command to clear xlat table
Add ILA_CMD_FLUSH netlink command to clear the ILA translation table. Signed-off-by: Tom Herbert --- include/uapi/linux/ila.h | 1 + net/ipv6/ila/ila.h | 1 + net/ipv6/ila/ila_main.c | 6 + net/ipv6/ila/ila_xlat.c | 62 ++-- 4 files changed, 68 insertions(+), 2 deletions(-) diff --git a/include/uapi/linux/ila.h b/include/uapi/linux/ila.h index 483b77af4eb8..db45d3e49a12 100644 --- a/include/uapi/linux/ila.h +++ b/include/uapi/linux/ila.h @@ -30,6 +30,7 @@ enum { ILA_CMD_ADD, ILA_CMD_DEL, ILA_CMD_GET, + ILA_CMD_FLUSH, __ILA_CMD_MAX, }; diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h index faba7824ea56..1f747bcbec29 100644 --- a/net/ipv6/ila/ila.h +++ b/net/ipv6/ila/ila.h @@ -123,6 +123,7 @@ void ila_xlat_exit_net(struct net *net); int ila_xlat_nl_cmd_add_mapping(struct sk_buff *skb, struct genl_info *info); int ila_xlat_nl_cmd_del_mapping(struct sk_buff *skb, struct genl_info *info); int ila_xlat_nl_cmd_get_mapping(struct sk_buff *skb, struct genl_info *info); +int ila_xlat_nl_cmd_flush(struct sk_buff *skb, struct genl_info *info); int ila_xlat_nl_dump_start(struct netlink_callback *cb); int ila_xlat_nl_dump_done(struct netlink_callback *cb); int ila_xlat_nl_dump(struct sk_buff *skb, struct netlink_callback *cb); diff --git a/net/ipv6/ila/ila_main.c b/net/ipv6/ila/ila_main.c index f6ac6b14577e..18fac76b9520 100644 --- a/net/ipv6/ila/ila_main.c +++ b/net/ipv6/ila/ila_main.c @@ -27,6 +27,12 @@ static const struct genl_ops ila_nl_ops[] = { .flags = GENL_ADMIN_PERM, }, { + .cmd = ILA_CMD_FLUSH, + .doit = ila_xlat_nl_cmd_flush, + .policy = ila_nl_policy, + .flags = GENL_ADMIN_PERM, + }, + { .cmd = ILA_CMD_GET, .doit = ila_xlat_nl_cmd_get_mapping, .start = ila_xlat_nl_dump_start, diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c index 610852b3dfa7..6bb1a081ff04 100644 --- a/net/ipv6/ila/ila_xlat.c +++ b/net/ipv6/ila/ila_xlat.c @@ -164,9 +164,9 @@ static inline void ila_release(struct ila_map *ila) kfree_rcu(ila, rcu); } -static void ila_free_cb(void *ptr, 
void *arg) +static void ila_free_node(struct ila_map *ila) { - struct ila_map *ila = (struct ila_map *)ptr, *next; + struct ila_map *next; /* Assume rcu_readlock held */ while (ila) { @@ -176,6 +176,11 @@ static void ila_free_cb(void *ptr, void *arg) } } +static void ila_free_cb(void *ptr, void *arg) +{ + ila_free_node((struct ila_map *)ptr); +} + static int ila_xlat_addr(struct sk_buff *skb, bool sir2ila); static unsigned int @@ -365,6 +370,59 @@ int ila_xlat_nl_cmd_del_mapping(struct sk_buff *skb, struct genl_info *info) return 0; } +static inline spinlock_t *lock_from_ila_map(struct ila_net *ilan, + struct ila_map *ila) +{ + return ila_get_lock(ilan, ila->xp.ip.locator_match); +} + +int ila_xlat_nl_cmd_flush(struct sk_buff *skb, struct genl_info *info) +{ + struct net *net = genl_info_net(info); + struct ila_net *ilan = net_generic(net, ila_net_id); + struct rhashtable_iter iter; + struct ila_map *ila; + spinlock_t *lock; + int ret; + + ret = rhashtable_walk_init(&ilan->xlat.rhash_table, &iter, GFP_KERNEL); + if (ret) + goto done; + + rhashtable_walk_start(&iter); + + for (;;) { + ila = rhashtable_walk_next(&iter); + + if (IS_ERR(ila)) { + if (PTR_ERR(ila) == -EAGAIN) + continue; + ret = PTR_ERR(ila); + goto done; + } else if (!ila) { + break; + } + + lock = lock_from_ila_map(ilan, ila); + + spin_lock(lock); + + ret = rhashtable_remove_fast(&ilan->xlat.rhash_table, +&ila->node, rht_params); + if (!ret) + ila_free_node(ila); + + spin_unlock(lock); + + if (ret) + break; + } + +done: + rhashtable_walk_stop(&iter); + return ret; +} + static int ila_fill_info(struct ila_map *ila, struct sk_buff *msg) { if (nla_put_u64_64bit(msg, ILA_ATTR_LOCATOR, -- 2.11.0
[PATCH v3 net-next 8/9] resolver: add netlink control
Add interfaces into resolver backend that can be used to provide netlink. The interface includes functions to support the common netlink commands (get, add, list, delete, and flush). The frontend that is using the resolver implements the actual netlink interfaces for its service and calls the backend functions to provide netlink for the resolver. Signed-off-by: Tom Herbert --- include/net/resolver.h | 26 +++- net/ipv6/ila/ila_resolver.c | 3 +- net/resolver/resolver.c | 280 +++- 3 files changed, 305 insertions(+), 4 deletions(-) diff --git a/include/net/resolver.h b/include/net/resolver.h index f38c7e9f1205..307938ad91a6 100644 --- a/include/net/resolver.h +++ b/include/net/resolver.h @@ -14,12 +14,21 @@ #include #include +#include +#include struct net_rslv; typedef int (*net_rslv_cmpfn)(struct net_rslv *nrslv, const void *key, const void *object); +struct net_rslv_netlink_map { + int dst_attr; + int timo_attr; + int get_cmd; + struct genl_family *genl_family; +}; + struct net_rslv { struct rhashtable rhash_table; struct rhashtable_params params; @@ -28,10 +37,12 @@ struct net_rslv { spinlock_t *locks; unsigned int locks_mask; unsigned int hash_rnd; + const struct net_rslv_netlink_map *nlmap; }; struct net_rslv *net_rslv_create(size_t obj_size, size_t key_len, -size_t max_size, net_rslv_cmpfn cmp_fn); +size_t max_size, net_rslv_cmpfn cmp_fn, +const struct net_rslv_netlink_map *nlmap); void net_rslv_destroy(struct net_rslv *nrslv); @@ -40,4 +51,17 @@ int net_rslv_lookup_and_create(struct net_rslv *nrslv, void *key, void net_rslv_resolved(struct net_rslv *nrslv, void *key); +int net_rslv_nl_cmd_add(struct net_rslv *nrslv, struct sk_buff *skb, + struct genl_info *info); +int net_rslv_nl_cmd_del(struct net_rslv *nrslv, struct sk_buff *skb, + struct genl_info *info); +int net_rslv_nl_cmd_get(struct net_rslv *nrslv, struct sk_buff *skb, + struct genl_info *info); +int net_rslv_nl_cmd_flush(struct net_rslv *nrslv, struct sk_buff *skb, + struct genl_info *info); +int 
net_rslv_nl_dump_start(struct net_rslv *nrslv, struct netlink_callback *cb); +int net_rslv_nl_dump_done(struct net_rslv *nrslv, struct netlink_callback *cb); +int net_rslv_nl_dump(struct net_rslv *nrslv, struct sk_buff *skb, +struct netlink_callback *cb); + #endif /* __NET_RESOLVER_H */ diff --git a/net/ipv6/ila/ila_resolver.c b/net/ipv6/ila/ila_resolver.c index 8b9a3c5305a4..2aebc0526221 100644 --- a/net/ipv6/ila/ila_resolver.c +++ b/net/ipv6/ila/ila_resolver.c @@ -215,7 +215,8 @@ int ila_rslv_init_net(struct net *net) struct net_rslv *nrslv; nrslv = net_rslv_create(sizeof(struct ila_addr), - sizeof(struct ila_addr), ILA_MAX_SIZE, NULL); + sizeof(struct ila_addr), ILA_MAX_SIZE, NULL, + NULL); if (IS_ERR(nrslv)) return PTR_ERR(nrslv); diff --git a/net/resolver/resolver.c b/net/resolver/resolver.c index 32a915ed8f93..e2496b0bf852 100644 --- a/net/resolver/resolver.c +++ b/net/resolver/resolver.c @@ -19,11 +19,13 @@ #include #include #include +#include #include #include #include #include #include +#include struct net_rslv_ent { struct rhash_head node; @@ -192,8 +194,8 @@ static int net_rslv_cmp(struct rhashtable_compare_arg *arg, #define MAX_LOCKS 1024 struct net_rslv *net_rslv_create(size_t obj_size, size_t key_len, -size_t max_size, -net_rslv_cmpfn cmp_fn) +size_t max_size, net_rslv_cmpfn cmp_fn, +const struct net_rslv_netlink_map *nlmap) { struct net_rslv *nrslv; int err; @@ -212,6 +214,7 @@ struct net_rslv *net_rslv_create(size_t obj_size, size_t key_len, nrslv->obj_size = obj_size; nrslv->rslv_cmp = cmp_fn; + nrslv->nlmap = nlmap; get_random_bytes(&nrslv->hash_rnd, sizeof(nrslv->hash_rnd)); nrslv->params.head_offset = offsetof(struct net_rslv_ent, node); @@ -278,6 +281,279 @@ void net_rslv_destroy(struct net_rslv *nrslv) } EXPORT_SYMBOL_GPL(net_rslv_destroy); +/* Netlink access utility functions and structures. 
*/ + +struct net_rslv_params { + unsigned int timeout; + __u8 key[MAX_ADDR_LEN]; + size_t keysize; +}; + +static int parse_nl_config(struct net_rslv *nrslv, struct genl_info *info, + struct net_rslv_params *np) +{ + if (!info->attrs[nrslv->nlmap->dst_attr] || + nla_len(info->attrs[nrslv->nlmap->dst_attr]) != +
[PATCH v3 net-next 3/9] ila: Call library function alloc_bucket_locks
To allocate the array of bucket locks for the hash table we now call library function alloc_bucket_spinlocks. Signed-off-by: Tom Herbert --- net/ipv6/ila/ila_xlat.c | 22 +- 1 file changed, 5 insertions(+), 17 deletions(-) diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c index 9fca75b9cab3..402193ef74c2 100644 --- a/net/ipv6/ila/ila_xlat.c +++ b/net/ipv6/ila/ila_xlat.c @@ -31,26 +31,14 @@ struct ila_net { bool hooks_registered; }; +#define MAX_LOCKS 1024 #defineLOCKS_PER_CPU 10 static int alloc_ila_locks(struct ila_net *ilan) { - unsigned int i, size; - unsigned int nr_pcpus = num_possible_cpus(); - - nr_pcpus = min_t(unsigned int, nr_pcpus, 32UL); - size = roundup_pow_of_two(nr_pcpus * LOCKS_PER_CPU); - - if (sizeof(spinlock_t) != 0) { - ilan->locks = kvmalloc(size * sizeof(spinlock_t), GFP_KERNEL); - if (!ilan->locks) - return -ENOMEM; - for (i = 0; i < size; i++) - spin_lock_init(&ilan->locks[i]); - } - ilan->locks_mask = size - 1; - - return 0; + return alloc_bucket_spinlocks(&ilan->xlat.locks, &ilan->xlat.locks_mask, + MAX_LOCKS, LOCKS_PER_CPU, + GFP_KERNEL); } static u32 hashrnd __read_mostly; @@ -629,7 +617,7 @@ static __net_exit void ila_exit_net(struct net *net) rhashtable_free_and_destroy(&ilan->rhash_table, ila_free_cb, NULL); - kvfree(ilan->locks); + free_bucket_spinlocks(ilan->xlat.locks); if (ilan->hooks_registered) nf_unregister_net_hooks(net, ila_nf_hook_ops, -- 2.11.0
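As a side note on the open-coded logic this patch removes, the sizing rule can be sketched in plain C: cap the CPU count at 32, multiply by LOCKS_PER_CPU, round up to a power of two, and keep size - 1 as the mask. The demo_* names are hypothetical, and picking a lock with hash & mask is an assumption about how locks_mask is consumed; the sketch makes no claim about the internals of alloc_bucket_spinlocks.

```c
#include <stdint.h>

#define DEMO_LOCKS_PER_CPU 10 /* same value as LOCKS_PER_CPU in the patch */
#define DEMO_MAX_CPUS      32 /* cap applied by the removed code */

/* Smallest power of two that is >= n (for n >= 1). */
unsigned int demo_roundup_pow_of_two(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* Number of bucket locks the removed alloc_ila_locks() would allocate. */
unsigned int demo_lock_array_size(unsigned int nr_cpus)
{
	if (nr_cpus > DEMO_MAX_CPUS)
		nr_cpus = DEMO_MAX_CPUS;
	return demo_roundup_pow_of_two(nr_cpus * DEMO_LOCKS_PER_CPU);
}

/*
 * Assumed lock selection: because the array size is a power of two,
 * masking the hash with (size - 1) yields a valid index.
 */
unsigned int demo_lock_index(uint32_t hash, unsigned int locks_mask)
{
	return hash & locks_mask;
}
```

On a 4-CPU machine this gives 4 * 10 = 40, rounded up to 64 locks with a mask of 63; at 32 or more CPUs it tops out at 512 locks.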
[PATCH v3 net-next 4/9] ila: create main ila source file
Create a main ila file that contains the module initialization functions as well as netlink definitions. Previously these were defined in ila_xlat and ila_common. This approach allows better extensibility. Signed-off-by: Tom Herbert --- net/ipv6/ila/Makefile | 2 +- net/ipv6/ila/ila.h| 26 - net/ipv6/ila/ila_common.c | 30 -- net/ipv6/ila/ila_main.c | 115 ++ net/ipv6/ila/ila_xlat.c | 138 +- 5 files changed, 166 insertions(+), 145 deletions(-) create mode 100644 net/ipv6/ila/ila_main.c diff --git a/net/ipv6/ila/Makefile b/net/ipv6/ila/Makefile index 4b32e5921e5c..b7739aba6e68 100644 --- a/net/ipv6/ila/Makefile +++ b/net/ipv6/ila/Makefile @@ -4,4 +4,4 @@ obj-$(CONFIG_IPV6_ILA) += ila.o -ila-objs := ila_common.o ila_lwt.o ila_xlat.o +ila-objs := ila_main.o ila_common.o ila_lwt.o ila_xlat.o diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h index 3c7a11b62334..faba7824ea56 100644 --- a/net/ipv6/ila/ila.h +++ b/net/ipv6/ila/ila.h @@ -19,6 +19,7 @@ #include #include #include +#include #include #include #include @@ -104,9 +105,30 @@ void ila_update_ipv6_locator(struct sk_buff *skb, struct ila_params *p, void ila_init_saved_csum(struct ila_params *p); +struct ila_net { + struct { + struct rhashtable rhash_table; + spinlock_t *locks; /* Bucket locks for entry manipulation */ + unsigned int locks_mask; + bool hooks_registered; + } xlat; +}; + int ila_lwt_init(void); void ila_lwt_fini(void); -int ila_xlat_init(void); -void ila_xlat_fini(void); + +int ila_xlat_init_net(struct net *net); +void ila_xlat_exit_net(struct net *net); + +int ila_xlat_nl_cmd_add_mapping(struct sk_buff *skb, struct genl_info *info); +int ila_xlat_nl_cmd_del_mapping(struct sk_buff *skb, struct genl_info *info); +int ila_xlat_nl_cmd_get_mapping(struct sk_buff *skb, struct genl_info *info); +int ila_xlat_nl_dump_start(struct netlink_callback *cb); +int ila_xlat_nl_dump_done(struct netlink_callback *cb); +int ila_xlat_nl_dump(struct sk_buff *skb, struct netlink_callback *cb); + +extern unsigned int 
ila_net_id; + +extern struct genl_family ila_nl_family; #endif /* __ILA_H */ diff --git a/net/ipv6/ila/ila_common.c b/net/ipv6/ila/ila_common.c index 8c88ecf29b93..579310466eac 100644 --- a/net/ipv6/ila/ila_common.c +++ b/net/ipv6/ila/ila_common.c @@ -154,33 +154,3 @@ void ila_update_ipv6_locator(struct sk_buff *skb, struct ila_params *p, iaddr->loc = p->locator; } -static int __init ila_init(void) -{ - int ret; - - ret = ila_lwt_init(); - - if (ret) - goto fail_lwt; - - ret = ila_xlat_init(); - if (ret) - goto fail_xlat; - - return 0; -fail_xlat: - ila_lwt_fini(); -fail_lwt: - return ret; -} - -static void __exit ila_fini(void) -{ - ila_xlat_fini(); - ila_lwt_fini(); -} - -module_init(ila_init); -module_exit(ila_fini); -MODULE_AUTHOR("Tom Herbert "); -MODULE_LICENSE("GPL"); diff --git a/net/ipv6/ila/ila_main.c b/net/ipv6/ila/ila_main.c new file mode 100644 index ..f6ac6b14577e --- /dev/null +++ b/net/ipv6/ila/ila_main.c @@ -0,0 +1,115 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include "ila.h" + +static const struct nla_policy ila_nl_policy[ILA_ATTR_MAX + 1] = { + [ILA_ATTR_LOCATOR] = { .type = NLA_U64, }, + [ILA_ATTR_LOCATOR_MATCH] = { .type = NLA_U64, }, + [ILA_ATTR_IFINDEX] = { .type = NLA_U32, }, + [ILA_ATTR_CSUM_MODE] = { .type = NLA_U8, }, + [ILA_ATTR_IDENT_TYPE] = { .type = NLA_U8, }, +}; + +static const struct genl_ops ila_nl_ops[] = { + { + .cmd = ILA_CMD_ADD, + .doit = ila_xlat_nl_cmd_add_mapping, + .policy = ila_nl_policy, + .flags = GENL_ADMIN_PERM, + }, + { + .cmd = ILA_CMD_DEL, + .doit = ila_xlat_nl_cmd_del_mapping, + .policy = ila_nl_policy, + .flags = GENL_ADMIN_PERM, + }, + { + .cmd = ILA_CMD_GET, + .doit = ila_xlat_nl_cmd_get_mapping, + .start = ila_xlat_nl_dump_start, + .dumpit = ila_xlat_nl_dump, + .done = ila_xlat_nl_dump_done, + .policy = ila_nl_policy, + }, +}; + +unsigned int ila_net_id; + +struct genl_family ila_nl_family __ro_after_init = { + .hdrsize= 0, + .name = ILA_GENL_NAME, + .version= 
ILA_GENL_VERSION, + .maxattr= ILA_ATTR_MAX, + .netnsok= true, + .parallel_ops = true, + .module = THIS_MODULE, + .ops= ila_nl_ops, + .n_ops = ARRAY_SIZE(ila_nl_ops), +}; + +static __net_init int ila_init_net(struct net *net) +{ + int err; + + err = ila_xlat_init_net(net); +
[PATCH v3 net-next 0/9] net: Generic network resolver backend and ILA resolver
This patch set implements a generic in-kernel network resolver. The idea
is that an LWT "resolver" route is set in the kernel to cover some
prefix. When a packet hits the route, a netlink message is fired to
request resolution, and pending resolutions are tracked in a table.

Route resolution works in the following manner:

Initial configuration:

  0. An ila-rslv LWT route is set for some network prefix. The route
     includes an optional timeout to expire resolution.

Resolution process:

  1. A packet is sent to a destination in the prefix being resolved.
  2. A lookup is performed on the destination address in a table of
     outstanding resolution requests. If no entry is found:
     a. A new entry is created for the destination with a timeout
        value as set in the resolver route.
     b. A netlink "RTM_ADDR_RESOLVE" message is sent to kick the
        resolution protocol or processing.
  3. The packet is forwarded per the resolver route.

When an address is resolved:

  4. At some point a route is set that resolves the outstanding
     request (for instance a host route is set for the destination).
     The entry is removed from the table. Subsequent packets to the
     destination will hit the new route rather than the resolver
     route since its prefix is longer.
  5. Resolution entries may time out, in which case the entry is
     removed from the table. A subsequent packet to the destination
     will kick off a new resolution as in #2.
  6. The resolved route might also be timed out or removed, in which
     case subsequent packets to the same destination can trigger the
     resolution process again.

DOS mitigations:

  - The number of outstanding resolutions is limited by the size of
    the table.
  - Timeout of pending entries limits the number of netlink resolution
    messages.
  - Packets that are pending resolution are not queued. In the current
    model they can be forwarded to a router that has all reachability
    information (ILA use case for example).

Possible future work:

  - An optional method to queue packets for pending resolution.
  - More DOS mitigations. It might make sense to limit the number of
    resolutions per source address etc.

This patch set implements an ILA host side resolver that uses the
generic resolver described above. It uses LWT to implement the hook to
a userspace resolver and tracks pending unresolved addresses using the
backend net resolver.

This patch set contains:

  - A generic resolver backend infrastructure. This primarily does two
    things: track unresolved addresses and implement a timeout for
    resolution not happening. These mechanisms provide rate limiting
    control over resolution requests (for instance in ILA it is used
    to rate limit requests to userspace to resolve addresses).
  - The ILA resolver. This implements the path from the kernel ILA
    implementation to a userspace daemon, signalling that an
    identifier address needs to be resolved.
  - Routing messages are used over netlink to indicate resolution
    requests.
  - Add net to ila build_state
  - Add flush command to ila_xlat
  - Fix uses for rhashtable for latest fixes

v3:
  - Moved rhashtable changes to their own patch set
  - Restructure ILA code to be more amenable to changes
  - Remove extra call back functions in resolution interface

Changes from initial RFC:
  - Added net argument to LWT build_state
  - Made resolve timeout an attribute of the LWT encap route
  - Changed ILA notifications to be regular routing messages of event
    RTM_ADDR_RESOLVE, family RTNL_FAMILY_ILA, and group
    RTNLGRP_ILA_NOTIFY

Tom Herbert (9):
  lwt: Add net to build_state argument
  ila: Fix use of rhashtable walk in ila_xlat.c
  ila: Call library function alloc_bucket_locks
  ila: create main ila source file
  ila: Flush netlink command to clear xlat table
  net: Generic resolver backend
  ila: Resolver mechanism
  resolver: add netlink control
  ila: add netlink control ILA resolver

 include/net/lwtunnel.h         |   6 +-
 include/net/resolver.h         |  67 +
 include/uapi/linux/ila.h       |  21 ++
 include/uapi/linux/lwtunnel.h  |   1 +
 include/uapi/linux/rtnetlink.h |   8 +-
 net/Kconfig                    |   1 +
 net/Makefile                   |   1 +
 net/core/lwt_bpf.c             |   2 +-
 net/core/lwtunnel.c            |   6 +-
 net/ipv4/fib_semantics.c       |  13 +-
 net/ipv4/ip_tunnel_core.c      |   4 +-
 net/ipv6/Kconfig               |   1 +
 net/ipv6/ila/Makefile          |   2 +-
 net/ipv6/ila/ila.h             |  46 +++-
 net/ipv6/ila/ila_common.c      |  30 ---
 net/ipv6/ila/ila_lwt.c         |  10 +-
 net/ipv6/ila/ila_main.c        | 161
 net/ipv6/ila/ila_resolver.c    | 310 +++
 net/ipv6/ila/ila_xlat.c        | 280 ++---
 net/ipv6/route.c               |   2 +-
 net/ipv6/seg6_iptunnel.c       |   2 +-
 net/ipv6/seg6_local.c          |   5 +-
 net/mpls/mpls_iptunnel.c       |   2 +-
 net/resolver/Kconfig           |   7 +
 net/resolver/Makefile          |   8 +
 net/resolver/resolver.c        | 559
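The resolution steps in the cover letter can be illustrated with a small user-space sketch. This is not the code from the patch set: the fixed-size array, the 64-bit destination key, the time units, and the names `pending_table`, `pending_packet` and `pending_resolved` are all assumptions made for illustration. It only shows how a bounded table with per-entry timeouts limits how often a resolution request would be sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PENDING_MAX 4	/* table size bounds outstanding resolutions */

struct pending {
	bool     used;
	uint64_t dst;		/* destination being resolved (hypothetical key) */
	uint64_t expires;	/* absolute expiry time, arbitrary units */
};

struct pending_table {
	struct pending slot[PENDING_MAX];
	uint64_t timeout;	/* resolution timeout taken from the route */
};

static void pending_init(struct pending_table *t, uint64_t timeout)
{
	assert(timeout > 0);
	memset(t, 0, sizeof(*t));
	t->timeout = timeout;
}

/* Step 2: called for each packet that hits the resolver route.
 * Returns true when a resolution request should be sent (step 2b).
 * The caller forwards the packet in every case (step 3).
 */
static bool pending_packet(struct pending_table *t, uint64_t dst, uint64_t now)
{
	struct pending *free_slot = NULL;

	for (int i = 0; i < PENDING_MAX; i++) {
		struct pending *p = &t->slot[i];

		if (p->used && now >= p->expires)
			p->used = false;	/* step 5: entry timed out */
		if (p->used && p->dst == dst)
			return false;		/* resolution already pending */
		if (!p->used && !free_slot)
			free_slot = p;
	}
	if (!free_slot)
		return false;			/* table full: no new request */

	free_slot->used = true;			/* step 2a: create the entry */
	free_slot->dst = dst;
	free_slot->expires = now + t->timeout;
	return true;
}

/* Step 4: a route resolving dst was installed, so drop the entry. */
static void pending_resolved(struct pending_table *t, uint64_t dst)
{
	for (int i = 0; i < PENDING_MAX; i++)
		if (t->slot[i].used && t->slot[i].dst == dst)
			t->slot[i].used = false;
}
```

In this sketch a second packet to the same destination within the timeout does not trigger another request, and a full table suppresses new requests, which mirrors the first two DOS mitigations listed above.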
Re: [PATCH net,stable] net: qmi_wwan: add Sierra EM7565 1199:9091
ssjoh...@mac.com writes: > From: Sebastian Sjoholm > > From: Sebastian Sjoholm > > Sierra Wireless EM7565 is an Qualcomm MDM9x50 based M.2 modem. > The USB id is added to qmi_wwan.c to allow QMI communication with the EM7565. > > Signed-off-by: Sebastian Sjoholm > --- > [The corresponding qcserial patch will be submitted by Reinhard Speyerer.] > > --- > drivers/net/usb/qmi_wwan.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c > index 304ec6555cd8..3cebd6683938 100644 > --- a/drivers/net/usb/qmi_wwan.c > +++ b/drivers/net/usb/qmi_wwan.c > @@ -1204,6 +1204,7 @@ static const struct usb_device_id products[] = { > {QMI_FIXED_INTF(0x1199, 0x9079, 10)}, /* Sierra Wireless EM74xx */ > {QMI_FIXED_INTF(0x1199, 0x907b, 8)},/* Sierra Wireless EM74xx */ > {QMI_FIXED_INTF(0x1199, 0x907b, 10)}, /* Sierra Wireless EM74xx */ > + {QMI_FIXED_INTF(0x1199, 0x9091, 8)},/* Sierra Wireless EM7565 */ > {QMI_FIXED_INTF(0x1bbb, 0x011e, 4)},/* Telekom Speedstick LTE II > (Alcatel One Touch L100V LTE) */ > {QMI_FIXED_INTF(0x1bbb, 0x0203, 2)},/* Alcatel L800MA */ > {QMI_FIXED_INTF(0x2357, 0x0201, 4)},/* TP-LINK HSUPA Modem MA180 */ Looks good except for the duplicate 'From' line. Drop that and you can add Acked-by: Bjørn Mork
[PATCH v3 net-next 2/9] ila: Fix use of rhashtable walk in ila_xlat.c
Perform better EAGAIN handling, handle the case where ila_dump_info fails
and we miss objects in the dump, and add a skip index to skip over
ila entries in a list on a rhashtable node that have already been visited
(by a previous call to ila_nl_dump).

Signed-off-by: Tom Herbert
---
 net/ipv6/ila/ila_xlat.c | 60 -
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
index 44c39c5f0638..9fca75b9cab3 100644
--- a/net/ipv6/ila/ila_xlat.c
+++ b/net/ipv6/ila/ila_xlat.c
@@ -474,24 +474,31 @@ static int ila_nl_cmd_get_mapping(struct sk_buff *skb, struct genl_info *info)
 
 struct ila_dump_iter {
 	struct rhashtable_iter rhiter;
+	int skip;
 };
 
 static int ila_nl_dump_start(struct netlink_callback *cb)
 {
 	struct net *net = sock_net(cb->skb->sk);
 	struct ila_net *ilan = net_generic(net, ila_net_id);
-	struct ila_dump_iter *iter = (struct ila_dump_iter *)cb->args[0];
+	struct ila_dump_iter *iter;
+	int ret;
 
-	if (!iter) {
-		iter = kmalloc(sizeof(*iter), GFP_KERNEL);
-		if (!iter)
-			return -ENOMEM;
+	iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+	if (!iter)
+		return -ENOMEM;
 
-		cb->args[0] = (long)iter;
+	ret = rhashtable_walk_init(&ilan->rhash_table, &iter->rhiter,
+				   GFP_KERNEL);
+	if (ret) {
+		kfree(iter);
+		return ret;
 	}
 
-	return rhashtable_walk_init(&ilan->rhash_table, &iter->rhiter,
-				    GFP_KERNEL);
+	iter->skip = 0;
+	cb->args[0] = (long)iter;
+
+	return ret;
 }
 
 static int ila_nl_dump_done(struct netlink_callback *cb)
@@ -509,37 +516,58 @@ static int ila_nl_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	struct ila_dump_iter *iter = (struct ila_dump_iter *)cb->args[0];
 	struct rhashtable_iter *rhiter = &iter->rhiter;
+	int skip = iter->skip;
 	struct ila_map *ila;
 	int ret;
 
 	rhashtable_walk_start(rhiter);
 
-	for (;;) {
-		ila = rhashtable_walk_next(rhiter);
+	/* Get first entry */
+	ila = rhashtable_walk_peek(rhiter);
 
+	for (;;) {
 		if (IS_ERR(ila)) {
-			if (PTR_ERR(ila) == -EAGAIN)
-				continue;
 			ret = PTR_ERR(ila);
-			goto done;
+			if (ret == -EAGAIN) {
+				/* Table has changed and iter has reset. Return
+				 * -EAGAIN to the application even if we have
+				 * written data to the skb. The application
+				 * needs to deal with this.
+				 */
+
+				goto out_ret;
+			} else {
+				break;
+			}
 		} else if (!ila) {
+			ret = 0;
 			break;
 		}
 
+		while (ila && skip) {
+			/* Skip over any ila entries in this list that we
+			 * have already dumped.
+			 */
+			ila = rcu_access_pointer(ila->next);
+			skip--;
+		}
 		while (ila) {
 			ret = ila_dump_info(ila, NETLINK_CB(cb->skb).portid,
 					    cb->nlh->nlmsg_seq, NLM_F_MULTI,
 					    skb, ILA_CMD_GET);
 			if (ret)
-				goto done;
+				goto out;
 
 			ila = rcu_access_pointer(ila->next);
 		}
+		ila = rhashtable_walk_next(rhiter);
 	}
 
-	ret = skb->len;
+out:
+	iter->skip = skip;
+	ret = (skb->len ? : ret);
 
-done:
+out_ret:
 	rhashtable_walk_stop(rhiter);
 	return ret;
 }
-- 
2.11.0
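The skip-index idea from the commit message can be shown in isolation with a plain linked list. The following is a simplified user-space sketch, not the kernel code: `struct node`, `struct dump_iter` and `dump_some` are hypothetical names, and the output buffer with a fixed capacity stands in for an skb that fills up. Each call resumes after the entries that earlier calls already emitted.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int value;
	struct node *next;
};

struct dump_iter {
	struct node *head;	/* chain being dumped */
	int skip;		/* entries of this chain already emitted */
};

/* Emit up to cap values from the chain into out[], resuming after the
 * entries emitted by earlier calls. Returns the number of values
 * written; 0 means the dump is complete.
 */
static int dump_some(struct dump_iter *it, int *out, int cap)
{
	struct node *n = it->head;
	int skip = it->skip;
	int written = 0;

	assert(cap > 0);

	/* Skip over entries that a previous call already dumped. */
	while (n && skip) {
		n = n->next;
		skip--;
	}

	while (n && written < cap) {
		out[written++] = n->value;
		n = n->next;
	}

	it->skip += written;	/* remember progress for the next call */
	return written;
}
```

A caller would invoke `dump_some()` repeatedly with the same iterator until it returns 0. The sketch deliberately ignores concurrent modification of the list, which the patch handles by returning -EAGAIN.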
Re: [PATCH next] ipvlan: add L2 check for packets arriving via virtual devices
From: Mahesh Bandewar (महेश बंडेवार) Date: Mon, 11 Dec 2017 11:38:04 -0800 > On Mon, Dec 11, 2017 at 8:15 AM, David Miller wrote: >> From: Mahesh Bandewar >> Date: Thu, 7 Dec 2017 15:15:43 -0800 >> >>> From: Mahesh Bandewar >>> >>> Packets that don't have dest mac as the mac of the master device should >>> not be entertained by the IPvlan rx-handler. This is mostly true as the >>> packet path mostly takes care of that, except when the master device is >>> a virtual device. As demonstrated in the following case - >> ... >>> This patch adds that missing check in the IPvlan rx-handler. >>> >>> Reported-by: Amit Sikka >>> Signed-off-by: Mahesh Bandewar >> >> Applied, but it's a shame that the data plane takes on this new MAC >> compare operation. > Your comment made me think little more about this and a discussion > with Eric kind of put things in perspective. eth_type_trans() does the > right thing and sets the packet_type correctly (when .ndo_xmit of veth > is called). However IPvlan is over-aggressive in packet scrubbing and > that scrub changes packet type. This causes the actual problem. It's > not clear to me why skb_scrub_packet() changes the packet type to > PACKET_HOST unconditionally? But that's another issue. > > I'll send another patch to remove excessive scrubbing in IPvlan and > revert of this patch so that this additional comparison (though not > expensive!) can be avoided. Thanks for looking more deeply into this.
[PATCH net,stable] net: qmi_wwan: add Sierra EM7565 1199:9091
From: Sebastian Sjoholm

From: Sebastian Sjoholm

Sierra Wireless EM7565 is a Qualcomm MDM9x50 based M.2 modem.
The USB id is added to qmi_wwan.c to allow QMI communication with the EM7565.

Signed-off-by: Sebastian Sjoholm
---
[The corresponding qcserial patch will be submitted by Reinhard Speyerer.]

---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 304ec6555cd8..3cebd6683938 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1204,6 +1204,7 @@ static const struct usb_device_id products[] = {
 	{QMI_FIXED_INTF(0x1199, 0x9079, 10)},	/* Sierra Wireless EM74xx */
 	{QMI_FIXED_INTF(0x1199, 0x907b, 8)},	/* Sierra Wireless EM74xx */
 	{QMI_FIXED_INTF(0x1199, 0x907b, 10)},	/* Sierra Wireless EM74xx */
+	{QMI_FIXED_INTF(0x1199, 0x9091, 8)},	/* Sierra Wireless EM7565 */
 	{QMI_FIXED_INTF(0x1bbb, 0x011e, 4)},	/* Telekom Speedstick LTE II (Alcatel One Touch L100V LTE) */
 	{QMI_FIXED_INTF(0x1bbb, 0x0203, 2)},	/* Alcatel L800MA */
 	{QMI_FIXED_INTF(0x2357, 0x0201, 4)},	/* TP-LINK HSUPA Modem MA180 */
-- 
2.14.1
[PATCH net,stable] net: qmi_wwan: add Quectel BG96 2c7c:0296
From: Sebastian Sjoholm

Quectel BG96 is a Qualcomm MDM9206 based IoT modem, supporting both
CAT-M and NB-IoT. Tested hardware is a BG96 mounted on the Quectel
development board (EVB). The USB id is added to qmi_wwan.c to allow
QMI communication with the BG96.

Signed-off-by: Sebastian Sjoholm
---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 720a3a248070..c750cf7c042b 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1239,6 +1239,7 @@ static const struct usb_device_id products[] = {
 	{QMI_FIXED_INTF(0x1e0e, 0x9001, 5)},	/* SIMCom 7230E */
 	{QMI_QUIRK_SET_DTR(0x2c7c, 0x0125, 4)},	/* Quectel EC25, EC20 R2.0 Mini PCIe */
 	{QMI_QUIRK_SET_DTR(0x2c7c, 0x0121, 4)},	/* Quectel EC21 Mini PCIe */
+	{QMI_FIXED_INTF(0x2c7c, 0x0296, 4)},	/* Quectel BG96 */
 
 	/* 4. Gobi 1000 devices */
 	{QMI_GOBI1K_DEVICE(0x05c6, 0x9212)},	/* Acer Gobi Modem Device */
-- 
2.11.0 (Apple Git-81)
Re: [PATCH net-next v4 1/2] bpf/tracing: allow user space to query prog array on the same tp
On Mon, Dec 11, 2017 at 11:39:02AM -0800, Yonghong Song wrote: > Commit e87c6bc3852b ("bpf: permit multiple bpf attachments > for a single perf event") added support to attach multiple > bpf programs to a single perf event. > Although this provides flexibility, users may want to know > what other bpf programs attached to the same tp interface. > Besides getting visibility for the underlying bpf system, > such information may also help consolidate multiple bpf programs, > understand potential performance issues due to a large array, > and debug (e.g., one bpf program which overwrites return code > may impact subsequent program results). > > Commit 2541517c32be ("tracing, perf: Implement BPF programs > attached to kprobes") utilized the existing perf ioctl > interface and added the command PERF_EVENT_IOC_SET_BPF > to attach a bpf program to a tracepoint. This patch adds a new > ioctl command, given a perf event fd, to query the bpf program > array attached to the same perf tracepoint event. > > The new uapi ioctl command: > PERF_EVENT_IOC_QUERY_BPF > > The new uapi/linux/perf_event.h structure: > struct perf_event_query_bpf { >__u32 ids_len; >__u32 prog_cnt; >__u32 ids[0]; > }; > > User space provides buffer "ids" for kernel to copy to. > When returning from the kernel, the number of available > programs in the array is set in "prog_cnt". > > The usage: > struct perf_event_query_bpf *query = malloc(...); > query.ids_len = ids_len; > err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, &query); > if (err == 0) { > /* query.prog_cnt is the number of available progs, > * number of progs in ids: (ids_len == 0) ? 0 : query.prog_cnt > */ > } else if (errno == ENOSPC) { > /* query.ids_len number of progs copied, > * query.prog_cnt is the number of available progs > */ > } else { > /* other errors */ > } > > Signed-off-by: Yonghong Song > Acked-by: Peter Zijlstra (Intel) Acked-by: Alexei Starovoitov
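The buffer handling described in the quoted usage can be sketched in user space as follows. This is an illustration under assumptions, not the patch's code: `struct query_bpf` is a local mirror of the proposed `struct perf_event_query_bpf`, and `query_alloc` and `query_ids_valid` are hypothetical helpers. The sketch covers only allocation and interpretation of the result, so it can be compiled without a kernel that carries the new ioctl.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Local mirror of the struct proposed in the patch, for illustration only. */
struct query_bpf {
	uint32_t ids_len;	/* in: capacity of ids[] */
	uint32_t prog_cnt;	/* out: number of attached programs */
	uint32_t ids[];		/* out: program ids */
};

/* Allocate a zeroed query buffer with room for ids_len program ids. */
static struct query_bpf *query_alloc(uint32_t ids_len)
{
	struct query_bpf *q;

	q = calloc(1, sizeof(*q) + (size_t)ids_len * sizeof(q->ids[0]));
	if (q)
		q->ids_len = ids_len;
	return q;
}

/* Number of valid entries in ids[] after the ioctl has returned.
 * enospc is nonzero when the call failed with errno == ENOSPC.
 */
static uint32_t query_ids_valid(const struct query_bpf *q, int enospc)
{
	assert(q);
	if (enospc)
		return q->ids_len;		/* buffer was filled completely */
	return q->ids_len ? q->prog_cnt : 0;	/* success case */
}
```

With a real perf event fd the call itself would presumably be `ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, q)`. Since `q` is already a pointer here, fields are accessed with `->` and the pointer is passed directly rather than its address.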
Re: [PATCH net-next v2 5/6] net: qualcomm: rmnet: Allow to configure flags for new devices
On Sat, 2017-12-09 at 13:58 -0700, Subash Abhinov Kasiviswanathan wrote: > Add an option to configure the rmnet aggregation and command features > on device creation. This is achieved by using the vlan flags option. Still seems kinda odd to overload IFLA_VLAN_FLAGS to carry RMNET_INGRESS/EGRESS_FORMAT_* flags, but I'll leave that decision to others... Dan > Signed-off-by: Subash Abhinov Kasiviswanathan > > --- > drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 16 > +--- > 1 file changed, 13 insertions(+), 3 deletions(-) > > diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c > b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c > index 5e530db..2f5f661 100644 > --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c > +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c > @@ -177,11 +177,20 @@ static int rmnet_newlink(struct net *src_net, > struct net_device *dev, > if (err) > goto err2; > > - netdev_dbg(dev, "data format [ingress 0x%08X]\n", > ingress_format); > - port->ingress_data_format = ingress_format; > port->rmnet_mode = mode; > > hlist_add_head_rcu(&ep->hlnode, &port->muxed_ep[mux_id]); > + > + if (data[IFLA_VLAN_FLAGS]) { > + struct ifla_vlan_flags *flags; > + > + flags = nla_data(data[IFLA_VLAN_FLAGS]); > + ingress_format = flags->flags & flags->mask; > + } > + > + netdev_dbg(dev, "data format [ingress 0x%08X]\n", > ingress_format); > + port->ingress_data_format = ingress_format; > + > return 0; > > err2: > @@ -312,7 +321,8 @@ static int rmnet_rtnl_validate(struct nlattr > *tb[], struct nlattr *data[], > > static size_t rmnet_get_size(const struct net_device *dev) > { > - return nla_total_size(2); /* IFLA_VLAN_ID */ > + return nla_total_size(2) /* IFLA_VLAN_ID */ + > + nla_total_size(sizeof(struct ifla_vlan_flags)); /* > IFLA_VLAN_FLAGS */ > } > > struct rtnl_link_ops rmnet_link_ops __read_mostly = {
Re: [PATCH] selftests: bpf: Adding config fragment CONFIG_CGROUP_BPF=y
Hi Naresh, Looks good! Thanks! On Tue, Dec 12, 2017 at 12:55:23AM +0530, Naresh Kamboju wrote: > CONFIG_CGROUP_BPF=y is required for test_dev_cgroup test case. > > Signed-off-by: Naresh Kamboju > --- > tools/testing/selftests/bpf/config | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/tools/testing/selftests/bpf/config > b/tools/testing/selftests/bpf/config > index 52d53ed..9d48973 100644 > --- a/tools/testing/selftests/bpf/config > +++ b/tools/testing/selftests/bpf/config > @@ -3,3 +3,4 @@ CONFIG_BPF_SYSCALL=y > CONFIG_NET_CLS_BPF=m > CONFIG_BPF_EVENTS=y > CONFIG_TEST_BPF=m > +CONFIG_CGROUP_BPF=y > -- > 2.7.4 >
Re: [PATCH] Revert "ravb: add workaround for clock when resuming with WoL enabled"
Hello! On 12/11/2017 11:54 AM, Geert Uytterhoeven wrote: This reverts commit fbf3d034f2ff6264183cfa6845770e8cc2a986c8. As of commit 560869100b99a3da ("clk: renesas: cpg-mssr: Restore module clocks during resume"), the workaround is no longer needed. Signed-off-by: Geert Uytterhoeven Acked-by: Sergei Shtylyov [...] MBR, Sergei
[PATCH net-next v4 1/2] bpf/tracing: allow user space to query prog array on the same tp
Commit e87c6bc3852b ("bpf: permit multiple bpf attachments for a single perf event") added support to attach multiple bpf programs to a single perf event. Although this provides flexibility, users may want to know what other bpf programs attached to the same tp interface. Besides getting visibility for the underlying bpf system, such information may also help consolidate multiple bpf programs, understand potential performance issues due to a large array, and debug (e.g., one bpf program which overwrites return code may impact subsequent program results). Commit 2541517c32be ("tracing, perf: Implement BPF programs attached to kprobes") utilized the existing perf ioctl interface and added the command PERF_EVENT_IOC_SET_BPF to attach a bpf program to a tracepoint. This patch adds a new ioctl command, given a perf event fd, to query the bpf program array attached to the same perf tracepoint event. The new uapi ioctl command: PERF_EVENT_IOC_QUERY_BPF The new uapi/linux/perf_event.h structure: struct perf_event_query_bpf { __u32ids_len; __u32prog_cnt; __u32ids[0]; }; User space provides buffer "ids" for kernel to copy to. When returning from the kernel, the number of available programs in the array is set in "prog_cnt". The usage: struct perf_event_query_bpf *query = malloc(...); query.ids_len = ids_len; err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, &query); if (err == 0) { /* query.prog_cnt is the number of available progs, * number of progs in ids: (ids_len == 0) ? 
0 : query.prog_cnt */ } else if (errno == ENOSPC) { /* query.ids_len number of progs copied, * query.prog_cnt is the number of available progs */ } else { /* other errors */ } Signed-off-by: Yonghong Song Acked-by: Peter Zijlstra (Intel) --- include/linux/bpf.h | 4 include/uapi/linux/perf_event.h | 22 ++ kernel/bpf/core.c | 21 + kernel/events/core.c| 3 +++ kernel/trace/bpf_trace.c| 23 +++ 5 files changed, 73 insertions(+) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index e55e425..f812ac5 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -254,6 +254,7 @@ typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src, u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size, void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy); +int bpf_event_query_prog_array(struct perf_event *event, void __user *info); int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr, union bpf_attr __user *uattr); @@ -285,6 +286,9 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs, void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs, struct bpf_prog *old_prog); +int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array, +__u32 __user *prog_ids, u32 request_cnt, +__u32 __user *prog_cnt); int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array, struct bpf_prog *exclude_prog, struct bpf_prog *include_prog, diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index b9a4953..7695336 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -418,6 +418,27 @@ struct perf_event_attr { __u16 __reserved_2; /* align to __u64 */ }; +/* + * Structure used by below PERF_EVENT_IOC_QUERY_BPF command + * to query bpf programs attached to the same perf tracepoint + * as the given perf event. 
+ */ +struct perf_event_query_bpf { + /* +* The below ids array length +*/ + __u32 ids_len; + /* +* Set by the kernel to indicate the number of +* available programs +*/ + __u32 prog_cnt; + /* +* User provided buffer to store program ids +*/ + __u32 ids[0]; +}; + #define perf_flags(attr) (*(&(attr)->read_format + 1)) /* @@ -433,6 +454,7 @@ struct perf_event_attr { #define PERF_EVENT_IOC_ID _IOR('$', 7, __u64 *) #define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32) #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32) +#define PERF_EVENT_IOC_QUERY_BPF _IOWR('$', 10, struct perf_event_query_bpf *) enum perf_event_ioc_flags { PERF_IOC_FLAG_GROUP = 1U << 0, diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index 86b50aa..b16c6f8 100644 --- a/kernel/bpf/core.c +++ b/kernel/bpf/core.c @@ -1462,6 +1462,8 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs, rcu_read_lock(); prog = rcu_dereference(progs)->progs; for (; *prog; prog++) { + if (*prog == &dummy_bpf_prog.prog) + continue;
[PATCH net-next v4 2/2] bpf/tracing: add a bpf test for new ioctl query interface
Added a subtest in test_progs. The tracepoint is sched/sched_switch. Multiple bpf programs are attached to this tracepoint and the query interface is exercised. Signed-off-by: Yonghong Song Acked-by: Alexei Starovoitov Acked-by: Peter Zijlstra (Intel) --- tools/include/uapi/linux/perf_event.h | 22 + tools/testing/selftests/bpf/Makefile | 2 +- tools/testing/selftests/bpf/test_progs.c | 134 ++ tools/testing/selftests/bpf/test_tracepoint.c | 26 + 4 files changed, 183 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/bpf/test_tracepoint.c diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h index b9a4953..7695336 100644 --- a/tools/include/uapi/linux/perf_event.h +++ b/tools/include/uapi/linux/perf_event.h @@ -418,6 +418,27 @@ struct perf_event_attr { __u16 __reserved_2; /* align to __u64 */ }; +/* + * Structure used by below PERF_EVENT_IOC_QUERY_BPF command + * to query bpf programs attached to the same perf tracepoint + * as the given perf event. 
+ */ +struct perf_event_query_bpf { + /* +* The below ids array length +*/ + __u32 ids_len; + /* +* Set by the kernel to indicate the number of +* available programs +*/ + __u32 prog_cnt; + /* +* User provided buffer to store program ids +*/ + __u32 ids[0]; +}; + #define perf_flags(attr) (*(&(attr)->read_format + 1)) /* @@ -433,6 +454,7 @@ struct perf_event_attr { #define PERF_EVENT_IOC_ID _IOR('$', 7, __u64 *) #define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32) #define PERF_EVENT_IOC_PAUSE_OUTPUT_IOW('$', 9, __u32) +#define PERF_EVENT_IOC_QUERY_BPF _IOWR('$', 10, struct perf_event_query_bpf *) enum perf_event_ioc_flags { PERF_IOC_FLAG_GROUP = 1U << 0, diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile index f309ab9..b177c55 100644 --- a/tools/testing/selftests/bpf/Makefile +++ b/tools/testing/selftests/bpf/Makefile @@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \ test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \ - sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o + sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh \ test_offload.py diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c index 6942753..1e0479a 100644 --- a/tools/testing/selftests/bpf/test_progs.c +++ b/tools/testing/selftests/bpf/test_progs.c @@ -21,8 +21,10 @@ typedef __u16 __sum16; #include #include #include +#include #include +#include #include #include #include @@ -617,6 +619,137 @@ static void test_obj_name(void) } } +static void test_tp_attach_query(void) +{ + const int num_progs = 3; + int i, j, bytes, efd, err, prog_fd[num_progs], pmu_fd[num_progs]; + __u32 duration = 0, info_len, saved_prog_ids[num_progs]; + const char *file = "./test_tracepoint.o"; + struct 
perf_event_query_bpf *query; + struct perf_event_attr attr = {}; + struct bpf_object *obj[num_progs]; + struct bpf_prog_info prog_info; + char buf[256]; + + snprintf(buf, sizeof(buf), +"/sys/kernel/debug/tracing/events/sched/sched_switch/id"); + efd = open(buf, O_RDONLY, 0); + if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno)) + return; + bytes = read(efd, buf, sizeof(buf)); + close(efd); + if (CHECK(bytes <= 0 || bytes >= sizeof(buf), + "read", "bytes %d errno %d\n", bytes, errno)) + return; + + attr.config = strtol(buf, NULL, 0); + attr.type = PERF_TYPE_TRACEPOINT; + attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN; + attr.sample_period = 1; + attr.wakeup_events = 1; + + query = (struct perf_event_query_bpf *)malloc(sizeof(struct perf_event_query_bpf) + + sizeof(__u32) * num_progs); + for (i = 0; i < num_progs; i++) { + err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj[i], + &prog_fd[i]); + if (CHECK(err, "prog_load", "err %d errno %d\n", err, errno)) + goto cleanup1; + + bzero(&prog_info, sizeof(prog_info)); + prog_info.jited_prog_len = 0; + prog_info.xlated_prog_len = 0; + prog_info.nr_map_ids = 0; + info_len = sizeof(prog_info); + err = bpf_obj_get_info_by_fd(prog_fd[i], &prog_info, &info_len); +
[PATCH net-next v4 0/2] bpf/tracing: allow user space to query prog array on the same tp
Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event. Given a perf event
(kprobe, uprobe, or kernel tracepoint), the perf ioctl interface
is used to query bpf programs attached to the same trace event.

There already exists a BPF_PROG_QUERY command for introspection,
currently used by cgroup+bpf. We did have an implementation for
querying tracepoint+bpf through the same interface. However, it
looks cleaner to use an ioctl() style of API here, since attaching
a bpf prog to a tracepoint/kuprobe is also done via ioctl.

Patch #1 has the core implementation and patch #2 adds
a test case to the tools bpf selftests suite.

Changelogs:
v3 -> v4:
  - Fix a compilation error with newer gcc such as 6.3.1, while old
    gcc 4.8.5 is okay. I was using &uquery->ids to represent
    the address of the ids array to make it explicit
    that the address is passed, and this syntax is rightly
    rejected by gcc 6.3.1.
v2 -> v3:
  - Change uapi structure perf_event_query_bpf to be clearer
    based on Peter's suggestion, and adjust other code
    accordingly.
v1 -> v2:
  - Rebase on top of net-next.
  - Use existing bpf_prog_array_length function instead of
    implementing the same functionality in function
    bpf_prog_array_copy_info.

Yonghong Song (2):
  bpf/tracing: allow user space to query prog array on the same tp
  bpf/tracing: add a bpf test for new ioctl query interface

 include/linux/bpf.h                           |   4 +
 include/uapi/linux/perf_event.h               |  22 +
 kernel/bpf/core.c                             |  21
 kernel/events/core.c                          |   3 +
 kernel/trace/bpf_trace.c                      |  23 +
 tools/include/uapi/linux/perf_event.h         |  22 +
 tools/testing/selftests/bpf/Makefile          |   2 +-
 tools/testing/selftests/bpf/test_progs.c      | 134 ++
 tools/testing/selftests/bpf/test_tracepoint.c |  26 +
 9 files changed, 256 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_tracepoint.c

-- 
2.9.5
Re: [PATCH next] ipvlan: add L2 check for packets arriving via virtual devices
On Mon, Dec 11, 2017 at 8:15 AM, David Miller wrote: > From: Mahesh Bandewar > Date: Thu, 7 Dec 2017 15:15:43 -0800 > >> From: Mahesh Bandewar >> >> Packets that don't have dest mac as the mac of the master device should >> not be entertained by the IPvlan rx-handler. This is mostly true as the >> packet path mostly takes care of that, except when the master device is >> a virtual device. As demonstrated in the following case - > ... >> This patch adds that missing check in the IPvlan rx-handler. >> >> Reported-by: Amit Sikka >> Signed-off-by: Mahesh Bandewar > > Applied, but it's a shame that the data plane takes on this new MAC > compare operation. Your comment made me think little more about this and a discussion with Eric kind of put things in perspective. eth_type_trans() does the right thing and sets the packet_type correctly (when .ndo_xmit of veth is called). However IPvlan is over-aggressive in packet scrubbing and that scrub changes packet type. This causes the actual problem. It's not clear to me why skb_scrub_packet() changes the packet type to PACKET_HOST unconditionally? But that's another issue. I'll send another patch to remove excessive scrubbing in IPvlan and revert of this patch so that this additional comparison (though not expensive!) can be avoided. Thanks, --mahesh..
Re: RFC(v2): Audit Kernel Container IDs
On Monday, December 11, 2017 11:30:57 AM EST Eric Paris wrote:
> > Because a container doesn't have to use namespaces to be a container
> > you still need a mechanism for a process to declare that it is in
> > fact in a container, and to identify the container.
>
> I like the idea but I'm still tossing it around in my head (and
> thinking about Casey's statement too). Let's say we have a 'docker-like'
> container with pid=100 netns=X,userns=Y,mountns=Z. If I'm on the host
> in all init namespaces and I run
> nsenter -t 100 -n ip link set eth0 promisc on
> How should this be logged?

If it is a normal process, then everything would match the init namespace and you wouldn't have entered a container. If it were a container, any generated event should have the container ID from registration attached to it.

> Did this command run in its own 'container' unrelated to the 'docker-like'
> container?

That should be determined by what's in the task struct.

-Steve
Re: [PATCH v2] vsock.7: document VSOCK socket address family
On 12/06/2017 03:06 PM, Jorgen S. Hansen wrote:
>
>> On Dec 5, 2017, at 11:56 AM, Stefan Hajnoczi wrote:
>>
>> The AF_VSOCK address family has been available since Linux 3.9 without a
>> corresponding man page.
>>
>> This patch adds vsock.7 and describes its use along the same lines as
>> existing ip.7, unix.7, and netlink.7 man pages.
>>
>> CC: Jorgen Hansen
>> CC: Dexuan Cui
>> Signed-off-by: Stefan Hajnoczi
>> ---
>> man7/vsock.7 | 180 +++
>> 1 file changed, 180 insertions(+)
>> create mode 100644 man7/vsock.7
>>
>> diff --git a/man7/vsock.7 b/man7/vsock.7
>> new file mode 100644
>> index 0..46dc561f5
>> --- /dev/null
>> +++ b/man7/vsock.7
>> @@ -0,0 +1,180 @@
>> +.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +vsock \- Linux VSOCK address family
>> +.SH SYNOPSIS
>> +.B #include
>> +.br
>> +.B #include
>> +.PP
>> +.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
>> +.br
>> +.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
>> +.SH DESCRIPTION
>> +The VSOCK address family facilitates communication between virtual machines and
>> +the host they are running on. This address family is used by guest agents and
>> +hypervisor services that need a communications channel that is independent of
>> +virtual machine network configuration.
>> +.PP
>> +Valid socket types are
>> +.B SOCK_STREAM
>> +and
>> +.BR SOCK_DGRAM .
>> +.B SOCK_STREAM
>> +provides connection-oriented byte streams with guaranteed, in-order delivery.
>> +.B SOCK_DGRAM
>> +provides a connectionless datagram packet service with best-effort delivery and
>> +best-effort ordering. Availability of these socket types is dependent on the
>> +underlying hypervisor.
>> +.PP
>> +A new socket is created with
>> +.PP
>> +socket(AF_VSOCK, socket_type, 0);
>> +.PP
>> +When a process wants to establish a connection it calls
>> +.BR connect (2)
>> +with a given destination socket address. The socket is automatically bound to
>> +a free port if unbound.
>> +.PP
>> +A process can listen for incoming connections by first binding to a socket
>> +address using
>> +.BR bind (2)
>> +and then calling
>> +.BR listen (2).
>> +.PP
>> +Data is transferred using the usual
>> +.BR send (2)
>> +and
>> +.BR recv (2)
>> +family of socket system calls.
>> +.SS Address format
>> +A socket address is defined as a combination of a 32-bit Context Identifier
>> +(CID) and a 32-bit port number. The CID identifies the source or destination,
>> +which is either a virtual machine or the host. The port number differentiates
>> +between multiple services running on a single machine.
>> +.PP
>> +.in +4n
>> +.EX
>> +struct sockaddr_vm {
>> +    sa_family_t    svm_family;    /* address family: AF_VSOCK */
>> +    unsigned short svm_reserved1;
>> +    unsigned int   svm_port;      /* port in native byte order */
>> +    unsigned int   svm_cid;       /* address in native byte order */
>> +};
>> +.EE
>> +.in
>> +.PP
>> +.I svm_family
>> +is always set to
>> +.BR AF_VSOCK .
>> +.I svm_reserved1
>> +is always set to 0.
>> +.I svm_port
>> +contains the port in native byte order.
>> +The port numbers below 1024 are called
>> +.IR "privileged ports" .
>> +Only a process with
>> +.B CAP_NET_BIND_SERVICE
>> +capability may
>> +.BR bind (2)
>> +to these port numbers.
>> +.PP
>> +There are several special addresses:
>> +.B VMADDR_CID_ANY
>> +(-1U)
>> +means any address for binding;
>> +.B VMADDR_CID_HYPERVISOR
>> +(0) is reserved for services built into the hypervisor;
>> +.B VMADDR_CID_RESERVED
>> +(1) must not be used;
>> +.B VMADDR_CID_HOST
>> +(2)
>> +is the well-known address of the host.
>> +.PP
>> +The special constant
>> +.B VMADDR_PORT_ANY
>> +(-1U)
>> +means any port number for binding.
>> +.SS Live migration
>> +Sockets are affected by live migration of virtual machines. Connected
>> +.B SOCK_STREAM
>> +sockets become disconnected when the virtual machine migrates to a new host.
>> +Applications must reconnect when this happens.
>> +.PP
>> +The local CID may change across live migration if the old CID is not available
>> +on the new host. Bound sockets are automatically updated to the new CID.
>> +.SS Ioctls
>> +.TP
>> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
>> +Get the CID of the local machine. The argument is a pointer to an unsigned int.
>> +.IP
>> +.in +4n
>> +.EX
>> +.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", " &cid ");"
>> +.EE
>> +.in
>> +.IP
>> +Consider using
>> +.B VMADDR_CID_ANY
>> +when binding instead of getting the local CID with
>> +.BR IOCTL_VM_SOCKETS_GET_LOCAL_CID .
>> +.SH ERRORS
>> +.TP
>> +.B EACCES
>> +Unable to bind to a privileged port without the
>> +.B CAP_NET_BIND_SERVICE
>> +capability.
>> +.TP
>> +.B EINVAL
>> +Invalid parameters. This includes:
>> +attempting to bind a socket that is already bound, providing an invalid struct
>> +.BR sockaddr_vm ,
>> +and oth
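As a companion to the address format described in the quoted page, the following sketch shows how a guest might fill in a struct sockaddr_vm and connect to a service on the host. It assumes the definitions from <linux/vm_sockets.h>; the helper names vsock_addr_init() and vsock_connect_host() are hypothetical, as is the port number in the usage note.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

/*
 * Hypothetical helper: fill in a vsock address. CID and port are stored
 * in native byte order, so no htonl()/htons() conversion is applied.
 */
void vsock_addr_init(struct sockaddr_vm *addr, unsigned int cid,
                     unsigned int port)
{
    assert(addr != NULL);

    /* Zeroing also clears svm_reserved1, which must be 0. */
    memset(addr, 0, sizeof(*addr));
    addr->svm_family = AF_VSOCK;
    addr->svm_cid = cid;
    addr->svm_port = port;
}

/*
 * Hypothetical helper: open a stream socket to the host (CID 2) on the
 * given port. Returns a connected fd, or -1 with errno set.
 */
int vsock_connect_host(unsigned int port)
{
    struct sockaddr_vm addr;
    int fd;

    fd = socket(AF_VSOCK, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    vsock_addr_init(&addr, VMADDR_CID_HOST, port);

    /* An unbound socket is bound to a free local port automatically. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A guest agent could call vsock_connect_host(1234) and then use send(2)/recv(2) on the returned descriptor. Whether the call succeeds depends on the hypervisor providing a vsock transport, as the DESCRIPTION section notes.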
Re: [PATCH v2] vsock.7: document VSOCK socket address family
Hello Stefan,

Thanks for this page! I have applied your patch, and made a few tweaks, but I have some minor questions. Please see below.

On 12/05/2017 11:56 AM, Stefan Hajnoczi wrote:
> The AF_VSOCK address family has been available since Linux 3.9 without a
> corresponding man page.
>
> This patch adds vsock.7 and describes its use along the same lines as
> existing ip.7, unix.7, and netlink.7 man pages.
>
> CC: Jorgen Hansen
> CC: Dexuan Cui
> Signed-off-by: Stefan Hajnoczi
> ---
> man7/vsock.7 | 180 +++
> 1 file changed, 180 insertions(+)
> create mode 100644 man7/vsock.7
>
> diff --git a/man7/vsock.7 b/man7/vsock.7
> new file mode 100644
> index 0..46dc561f5
> --- /dev/null
> +++ b/man7/vsock.7
> @@ -0,0 +1,180 @@
> +.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +vsock \- Linux VSOCK address family
> +.SH SYNOPSIS
> +.B #include
> +.br
> +.B #include
> +.PP
> +.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
> +.br
> +.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
> +.SH DESCRIPTION
> +The VSOCK address family facilitates communication between virtual machines and
> +the host they are running on. This address family is used by guest agents and
> +hypervisor services that need a communications channel that is independent of
> +virtual machine network configuration.
> +.PP
> +Valid socket types are
> +.B SOCK_STREAM
> +and
> +.BR SOCK_DGRAM .
> +.B SOCK_STREAM
> +provides connection-oriented byte streams with guaranteed, in-order delivery.
> +.B SOCK_DGRAM
> +provides a connectionless datagram packet service with best-effort delivery and
> +best-effort ordering. Availability of these socket types is dependent on the
> +underlying hypervisor.
> +.PP
> +A new socket is created with
> +.PP
> +socket(AF_VSOCK, socket_type, 0);
> +.PP
> +When a process wants to establish a connection it calls
> +.BR connect (2)
> +with a given destination socket address. The socket is automatically bound to
> +a free port if unbound.
> +.PP
> +A process can listen for incoming connections by first binding to a socket
> +address using
> +.BR bind (2)
> +and then calling
> +.BR listen (2).
> +.PP
> +Data is transferred using the usual
> +.BR send (2)
> +and
> +.BR recv (2)

Or equally, write(2) and read(2), right? By failing to mention those, the text subtly implies that send(2) and recv(2) are preferred, but I don't suppose that is true.

> +family of socket system calls.
> +.SS Address format
> +A socket address is defined as a combination of a 32-bit Context Identifier
> +(CID) and a 32-bit port number. The CID identifies the source or destination,
> +which is either a virtual machine or the host. The port number differentiates
> +between multiple services running on a single machine.
> +.PP
> +.in +4n
> +.EX
> +struct sockaddr_vm {
> +    sa_family_t    svm_family;    /* address family: AF_VSOCK */
> +    unsigned short svm_reserved1;
> +    unsigned int   svm_port;      /* port in native byte order */
> +    unsigned int   svm_cid;       /* address in native byte order */
> +};
> +.EE
> +.in
> +.PP
> +.I svm_family
> +is always set to
> +.BR AF_VSOCK .
> +.I svm_reserved1
> +is always set to 0.
> +.I svm_port
> +contains the port in native byte order.
> +The port numbers below 1024 are called
> +.IR "privileged ports" .
> +Only a process with
> +.B CAP_NET_BIND_SERVICE
> +capability may
> +.BR bind (2)
> +to these port numbers.
> +.PP
> +There are several special addresses:
> +.B VMADDR_CID_ANY
> +(-1U)
> +means any address for binding;
> +.B VMADDR_CID_HYPERVISOR
> +(0) is reserved for services built into the hypervisor;
> +.B VMADDR_CID_RESERVED
> +(1) must not be used;
> +.B VMADDR_CID_HOST
> +(2)
> +is the well-known address of the host.
> +.PP
> +The special constant
> +.B VMADDR_PORT_ANY
> +(-1U)
> +means any port number for binding.
> +.SS Live migration
> +Sockets are affected by live migration of virtual machines. Connected
> +.B SOCK_STREAM
> +sockets become disconnected when the virtual machine migrates to a new host.
> +Applications must reconnect when this happens.
> +.PP
> +The local CID may change across live migration if the old CID is not available
> +on the new host. Bound sockets are automatically updated to the new CID.
> +.SS Ioctls
> +.TP
> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
> +Get the CID of the local machine. The argument is a pointer to an unsigned int.
> +.IP
> +.in +4n
> +.EX
> +.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", " &cid ");"
> +.EE
> +.in
> +.IP
> +Consider using
> +.B VMADDR_CID_ANY
> +when binding instead of getting the local CID with
> +.BR IOCTL_VM_SOCKETS_GET_LOCAL_CID .
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Unable to bind to a privileged port without the
> +.B CAP_NET_BIND_SERVICE
> +capability.
> +.TP
> +.B EINVAL
> +Invalid parameters. This includes:
> +attempting to bind a socket that i