Re: [PATCH bpf 2/3] bpf: fix build issues on um due to missing bpf_perf_event.h

2017-12-11 Thread Richard Weinberger
On Tuesday, 12 December 2017, 02:25:31 CET, Daniel Borkmann wrote:
> Since c895f6f703ad ("bpf: correct broken uapi for
> BPF_PROG_TYPE_PERF_EVENT program type") um (uml) won't build
> on i386 or x86_64:
> 
>   [...]
>     CC      init/main.o
>   In file included from ../include/linux/perf_event.h:18:0,
>                    from ../include/linux/trace_events.h:10,
>                    from ../include/trace/syscall.h:7,
>                    from ../include/linux/syscalls.h:82,
>                    from ../init/main.c:20:
>   ../include/uapi/linux/bpf_perf_event.h:11:32: fatal error:
>   asm/bpf_perf_event.h: No such file or directory
>    #include <asm/bpf_perf_event.h>
>   
>   [...]
> 
> Let's add the missing bpf_perf_event.h to the um arch as well. This
> seems to be the only one still missing.
> 
> Fixes: c895f6f703ad ("bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT program type")
> Reported-by: Randy Dunlap 
> Suggested-by: Richard Weinberger 
> Signed-off-by: Daniel Borkmann 
> Tested-by: Randy Dunlap 
> Cc: Hendrik Brueckner 
> Cc: Richard Weinberger 
> Acked-by: Alexei Starovoitov 
> ---
>  arch/um/include/asm/Kbuild | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
> index 50a32c3..73c57f6 100644
> --- a/arch/um/include/asm/Kbuild
> +++ b/arch/um/include/asm/Kbuild
> @@ -1,4 +1,5 @@
>  generic-y += barrier.h
> +generic-y += bpf_perf_event.h
>  generic-y += bug.h
>  generic-y += clkdev.h
>  generic-y += current.h

Acked-by: Richard Weinberger 

Thanks,
//richard

-- 
sigma star gmbh - Eduard-Bodem-Gasse 6 - 6020 Innsbruck - Austria
ATU66964118 - FN 374287y


Re: [PATCH net-next 2/3] net: dsa: mediatek: combine MediaTek tag with VLAN tag

2017-12-11 Thread Sean Wang
On Thu, 2017-12-07 at 16:30 +0100, Andrew Lunn wrote:
> > @@ -25,20 +28,37 @@ static struct sk_buff *mtk_tag_xmit(struct sk_buff *skb,
> >  {
> > struct dsa_port *dp = dsa_slave_to_port(dev);
> > u8 *mtk_tag;
> > +   bool is_vlan_skb = true;
> 
> ..
> 
> > +   /* Mark tag attribute on special tag insertion to notify hardware
> > +* whether that's a combined special tag with 802.1Q header.
> > +*/
> > +   mtk_tag[0] = is_vlan_skb ? MTK_HDR_XMIT_TAGGED_TPID_8100 :
> > +MTK_HDR_XMIT_UNTAGGED;
> > mtk_tag[1] = (1 << dp->index) & MTK_HDR_XMIT_DP_BIT_MASK;
> > -   mtk_tag[2] = 0;
> > -   mtk_tag[3] = 0;
> > +
> > +   /* Tag control information is kept for 802.1Q */
> > +   if (!is_vlan_skb) {
> > +   mtk_tag[2] = 0;
> > +   mtk_tag[3] = 0;
> > +   }
> >  
> > return skb;
> >  }
> 
> Hi Sean
> 
> So you can mark a packet for egress. What about ingress? How do you
> know the VLAN/PORT combination for packets the CPU receives? I would
> have expected a similar change to mtk_tag_rcv().
> 
>Andrew

Hi, Andrew

No extra handling is needed in mtk_tag_rcv() when a VLAN tag is
present: the VLAN tag is placed after the special tag, so the
existing parsing path still applies.

Sean





Re: [PATCH net-next 1/3] net: dsa: mediatek: add VLAN support for MT7530

2017-12-11 Thread Sean Wang

Hi, Andrew

All sounds reasonable. All will be fixed in the next version.

Sean

On Thu, 2017-12-07 at 16:24 +0100, Andrew Lunn wrote:
> >  static void
> > +mt7530_port_set_vlan_unware(struct dsa_switch *ds, int port)
> > +{
> > +   struct mt7530_priv *priv = ds->priv;
> > +   int i;
> > +   bool all_user_ports_removed = true;
> 
> Hi Sean
> 
> Reverse Christmas tree please.
> 

will be fixed 

> > +static int
> > +mt7530_vlan_cmd(struct mt7530_priv *priv, enum mt7530_vlan_cmd cmd, u16 vid)
> > +{
> > +   u32 val;
> > +   int ret;
> > +   struct mt7530_dummy_poll p;
> 
> Here too.
> 

will be fixed

> > +static int
> > +mt7530_port_vlan_prepare(struct dsa_switch *ds, int port,
> > +const struct switchdev_obj_port_vlan *vlan,
> > +struct switchdev_trans *trans)
> > +{
> > +   struct mt7530_priv *priv = ds->priv;
> > +
> > +   /* The port is being kept as VLAN-unware port when bridge is set up
> > +* with vlan_filtering not being set, Otherwise, the port and the
> > +* corresponding CPU port is required the setup for becoming a
> > +* VLAN-ware port.
> > +*/
> > +   if (!priv->ports[port].vlan_filtering)
> > +   return 0;
> > +
> > +   mt7530_port_set_vlan_ware(ds, port);
> > +   mt7530_port_set_vlan_ware(ds, MT7530_CPU_PORT);
> 
> A prepare function should just validate that it is possible to carry
> out the operation. It should not change any state. These two last
> lines probably don't belong here.
> 

Okay, it will be moved to a more suitable place such as
mt7530_port_vlan_filtering.

> > +
> > +   return 0;
> > +}
> > +
> > +static void
> > +mt7530_hw_vlan_add(struct mt7530_priv *priv,
> > +  struct mt7530_hw_vlan_entry *entry)
> > +{
> > +   u32 val;
> > +   u8 new_members;
> 
> Reverse Christmas tree. Please check the whole patch.
> 
 will be fixed

> > +static inline void INIT_MT7530_HW_ENTRY(struct mt7530_hw_vlan_entry *e,
> > +   int port, bool untagged)
> > +{
> > +   e->port = port;
> > +   e->untagged = untagged;
> > +}
> 
> All CAPITAL letters is for #defines. This is just a normal
> function. Please use lower case.
> 

will be fixed
> Andrew
> 




[PATCH v2 1/3] PCI: Add pcim_set_mwi(), a device-managed pci_set_mwi()

2017-12-11 Thread Heiner Kallweit
Add pcim_set_mwi(), a device-managed version of pci_set_mwi().
First user is the Realtek r8169 driver.

Signed-off-by: Heiner Kallweit 
Acked-by: Bjorn Helgaas 
---
v2:
- Reorder calls
- Adjust and commit message
---
 drivers/pci/pci.c   | 25 +
 include/linux/pci.h |  1 +
 2 files changed, 26 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 4a7c6864f..764ca7b88 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1458,6 +1458,7 @@ struct pci_devres {
unsigned int pinned:1;
unsigned int orig_intx:1;
unsigned int restore_intx:1;
+   unsigned int mwi:1;
u32 region_mask;
 };
 
@@ -1476,6 +1477,9 @@ static void pcim_release(struct device *gendev, void *res)
if (this->region_mask & (1 << i))
pci_release_region(dev, i);
 
+   if (this->mwi)
+   pci_clear_mwi(dev);
+
if (this->restore_intx)
pci_intx(dev, this->orig_intx);
 
@@ -3760,6 +3764,27 @@ int pci_set_mwi(struct pci_dev *dev)
 }
 EXPORT_SYMBOL(pci_set_mwi);
 
+/**
+ * pcim_set_mwi - a device-managed pci_set_mwi()
+ * @dev: the PCI device for which MWI is enabled
+ *
+ * Managed pci_set_mwi().
+ *
+ * RETURNS: An appropriate -ERRNO error value on error, or zero for success.
+ */
+int pcim_set_mwi(struct pci_dev *dev)
+{
+   struct pci_devres *dr;
+
+   dr = find_pci_dr(dev);
+   if (!dr)
+   return -ENOMEM;
+
+   dr->mwi = 1;
+   return pci_set_mwi(dev);
+}
+EXPORT_SYMBOL(pcim_set_mwi);
+
 /**
  * pci_try_set_mwi - enables memory-write-invalidate PCI transaction
  * @dev: the PCI device for which MWI is enabled
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 978aad784..0a7ac863a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1064,6 +1064,7 @@ int pci_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state state);
 int pci_set_cacheline_size(struct pci_dev *dev);
 #define HAVE_PCI_SET_MWI
 int __must_check pci_set_mwi(struct pci_dev *dev);
+int __must_check pcim_set_mwi(struct pci_dev *dev);
 int pci_try_set_mwi(struct pci_dev *dev);
 void pci_clear_mwi(struct pci_dev *dev);
 void pci_intx(struct pci_dev *dev, int enable);
-- 
2.15.1




[PATCH v2 3/3] r8169: remove netif_napi_del in probe error path

2017-12-11 Thread Heiner Kallweit
netif_napi_del is called implicitly by free_netdev, therefore we
don't have to do it explicitly.

When the probe error path is reached, the net_device isn't
registered yet. Therefore reordering the call to netif_napi_del
shouldn't cause any issues.

Signed-off-by: Heiner Kallweit 
---
v2:
- no changes
---
 drivers/net/ethernet/realtek/r8169.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 3c7d90d3a..857f67beb 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -8672,14 +8672,12 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
tp->counters = dmam_alloc_coherent (&pdev->dev, sizeof(*tp->counters),
&tp->counters_phys_addr,
GFP_KERNEL);
-   if (!tp->counters) {
-   rc = -ENOMEM;
-   goto err_out_msi_5;
-   }
+   if (!tp->counters)
+   return -ENOMEM;
 
rc = register_netdev(dev);
if (rc < 0)
-   goto err_out_msi_5;
+   return rc;
 
pci_set_drvdata(pdev, dev);
 
@@ -8709,11 +8707,6 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
netif_carrier_off(dev);
 
return 0;
-
-err_out_msi_5:
-   netif_napi_del(&tp->napi);
-
-   return rc;
 }
 
 static struct pci_driver rtl8169_pci_driver = {
-- 
2.15.1




[PATCH v2 2/3] r8169: switch to device-managed functions in probe

2017-12-11 Thread Heiner Kallweit
Simplify probe error path and remove callback by using device-managed
functions.

rtl_disable_msi isn't needed any longer because the release callback
of pcim_enable_device does this implicitly.

Signed-off-by: Heiner Kallweit 
---
v2:
- no changes
---
 drivers/net/ethernet/realtek/r8169.c | 80 +---
 1 file changed, 20 insertions(+), 60 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index fc0d5fa65..3c7d90d3a 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -4643,16 +4643,6 @@ static void rtl8169_phy_timer(struct timer_list *t)
rtl_schedule_task(tp, RTL_FLAG_TASK_PHY_PENDING);
 }
 
-static void rtl8169_release_board(struct pci_dev *pdev, struct net_device *dev,
- void __iomem *ioaddr)
-{
-   iounmap(ioaddr);
-   pci_release_regions(pdev);
-   pci_clear_mwi(pdev);
-   pci_disable_device(pdev);
-   free_netdev(dev);
-}
-
 DECLARE_RTL_COND(rtl_phy_reset_cond)
 {
return tp->phy_reset_pending(tp);
@@ -4784,14 +4774,6 @@ static int rtl_tbi_ioctl(struct rtl8169_private *tp, struct mii_ioctl_data *data
return -EOPNOTSUPP;
 }
 
-static void rtl_disable_msi(struct pci_dev *pdev, struct rtl8169_private *tp)
-{
-   if (tp->features & RTL_FEATURE_MSI) {
-   pci_disable_msi(pdev);
-   tp->features &= ~RTL_FEATURE_MSI;
-   }
-}
-
 static void rtl_init_mdio_ops(struct rtl8169_private *tp)
 {
struct mdio_ops *ops = &tp->mdio_ops;
@@ -8256,9 +8238,6 @@ static void rtl_remove_one(struct pci_dev *pdev)
 
unregister_netdev(dev);
 
-   dma_free_coherent(&tp->pci_dev->dev, sizeof(*tp->counters),
- tp->counters, tp->counters_phys_addr);
-
rtl_release_firmware(tp);
 
if (pci_dev_run_wake(pdev))
@@ -8266,9 +8245,6 @@ static void rtl_remove_one(struct pci_dev *pdev)
 
/* restore original MAC address */
rtl_rar_set(tp, dev->perm_addr);
-
-   rtl_disable_msi(pdev, tp);
-   rtl8169_release_board(pdev, dev, tp->mmio_addr);
 }
 
 static const struct net_device_ops rtl_netdev_ops = {
@@ -8445,11 +8421,9 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
   MODULENAME, RTL8169_VERSION);
}
 
-   dev = alloc_etherdev(sizeof (*tp));
-   if (!dev) {
-   rc = -ENOMEM;
-   goto out;
-   }
+   dev = devm_alloc_etherdev(&pdev->dev, sizeof (*tp));
+   if (!dev)
+   return -ENOMEM;
 
SET_NETDEV_DEV(dev, &pdev->dev);
dev->netdev_ops = &rtl_netdev_ops;
@@ -8472,13 +8446,13 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 PCIE_LINK_STATE_CLKPM);
 
/* enable device (incl. PCI PM wakeup and hotplug setup) */
-   rc = pci_enable_device(pdev);
+   rc = pcim_enable_device(pdev);
if (rc < 0) {
netif_err(tp, probe, dev, "enable failure\n");
-   goto err_out_free_dev_1;
+   return rc;
}
 
-   if (pci_set_mwi(pdev) < 0)
+   if (pcim_set_mwi(pdev) < 0)
netif_info(tp, probe, dev, "Mem-Wr-Inval unavailable\n");
 
/* make sure PCI base addr 1 is MMIO */
@@ -8486,30 +8460,28 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
netif_err(tp, probe, dev,
  "region #%d not an MMIO resource, aborting\n",
  region);
-   rc = -ENODEV;
-   goto err_out_mwi_2;
+   return -ENODEV;
}
 
/* check for weird/broken PCI region reporting */
if (pci_resource_len(pdev, region) < R8169_REGS_SIZE) {
netif_err(tp, probe, dev,
  "Invalid PCI region size(s), aborting\n");
-   rc = -ENODEV;
-   goto err_out_mwi_2;
+   return -ENODEV;
}
 
rc = pci_request_regions(pdev, MODULENAME);
if (rc < 0) {
netif_err(tp, probe, dev, "could not request regions\n");
-   goto err_out_mwi_2;
+   return rc;
}
 
/* ioremap MMIO region */
-   ioaddr = ioremap(pci_resource_start(pdev, region), R8169_REGS_SIZE);
+   ioaddr = devm_ioremap(&pdev->dev, pci_resource_start(pdev, region),
+ R8169_REGS_SIZE);
if (!ioaddr) {
netif_err(tp, probe, dev, "cannot remap MMIO, aborting\n");
-   rc = -EIO;
-   goto err_out_free_res_3;
+   return -EIO;
}
tp->mmio_addr = ioaddr;
 
@@ -8535,7 +8507,7 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
if (rc < 0) {
  

[PATCH v2 0/3] r8169: extend PCI core and switch to device-managed functions in probe

2017-12-11 Thread Heiner Kallweit
The probe error path and the remove callback can be significantly
simplified by using device-managed functions. To be able to do this in
the r8169 driver, we first need a device-managed version of pci_set_mwi.

v2:
Change patch 1 based on Björn's review comments and add his Acked-by.

Heiner Kallweit (3):
  PCI: Add pcim_set_mwi(), a device-managed pci_set_mwi()
  r8169: switch to device-managed functions in probe
  r8169: remove netif_napi_del in probe error path

 drivers/net/ethernet/realtek/r8169.c | 87 +---
 drivers/pci/pci.c| 25 +++
 include/linux/pci.h  |  1 +
 3 files changed, 46 insertions(+), 67 deletions(-)

-- 
2.15.1
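
Editorial illustration (not part of the posted series): the device-managed idea behind pcim_set_mwi() can be modelled in plain userspace C. Everything below is a hypothetical stand-in: struct fake_dev, fake_set_mwi(), fake_managed_set_mwi() and fake_release() are invented names and call no kernel API. The sketch only shows why a "managed" flag lets the probe path return early without goto-based unwinding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI device plus its managed-resource record. */
struct fake_dev {
	bool mwi_enabled;	/* models the hardware MWI state */
	bool managed_mwi;	/* models the new "mwi" bit in struct pci_devres */
};

static int fake_set_mwi(struct fake_dev *d)
{
	d->mwi_enabled = true;
	return 0;
}

static void fake_clear_mwi(struct fake_dev *d)
{
	d->mwi_enabled = false;
}

/* Managed variant: remember that release must undo MWI, then enable it. */
static int fake_managed_set_mwi(struct fake_dev *d)
{
	d->managed_mwi = true;
	return fake_set_mwi(d);
}

/* Models the release callback that runs on probe failure or driver detach. */
static void fake_release(struct fake_dev *d)
{
	if (d->managed_mwi)
		fake_clear_mwi(d);
}

/* Probe can simply return on error; cleanup is left to fake_release(). */
static int fake_probe(struct fake_dev *d, bool fail_late)
{
	int rc;

	assert(d != NULL);
	rc = fake_managed_set_mwi(d);
	if (rc < 0)
		return rc;
	if (fail_late)
		return -1;	/* no "goto err_out_mwi" label needed */
	return 0;
}
```

In this model the caller that sees fake_probe() fail invokes fake_release() once, which is the role the driver core plays for real device-managed resources.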



Re: [PATCH] ptr_ring: add barriers

2017-12-11 Thread George Cherian

Hi David,

On 12/11/2017 09:23 PM, David Miller wrote:
> From: "Michael S. Tsirkin" 
> Date: Tue, 5 Dec 2017 21:29:37 +0200
>
>> Users of ptr_ring expect that it's safe to give the
>> data structure a pointer and have it be available
>> to consumers, but that actually requires an smp_wmb
>> or a stronger barrier.
>>
>> In the absence of such barriers, and on architectures that reorder
>> writes, a consumer might read an uninitialized value from an skb
>> pointer stored in the skb array.  This was observed causing crashes.
>>
>> To fix, add memory barriers.  The barrier we use is a wmb, the
>> assumption being that producers do not need to read the value so we do
>> not need to order these reads.
>>
>> Reported-by: George Cherian 
>> Suggested-by: Jason Wang 
>> Signed-off-by: Michael S. Tsirkin 
>
> I asked for testing feedback and did not get it in a
> reasonable amount of time.

The tests have now run for more than 48 hours without any failures.
I will not interrupt them and will let them run for longer.
If any issue shows up I will report it.

> So I'm applying this as-is, and queueing it up for -stable.
>
> Thank you.


Regards,
-George
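
Editorial illustration (not from the thread): the ordering problem described in the quoted commit message can be sketched with C11 atomics in userspace. This is an analogue, not the kernel code: the patch uses a write barrier in ptr_ring itself, whereas the sketch below uses a release store paired with an acquire load, which is at least as strong. The names struct ring, ring_produce() and ring_consume() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct item {
	int payload;
};

#define RING_SIZE 4

struct ring {
	_Atomic(struct item *) slot[RING_SIZE];	/* NULL means "empty" */
	size_t head;	/* touched by the producer only */
	size_t tail;	/* touched by the consumer only */
};

/* Producer: returns 0 on success, -1 if the next slot is still occupied. */
static int ring_produce(struct ring *r, struct item *it)
{
	assert(it != NULL);	/* NULL is reserved for empty slots */
	if (atomic_load_explicit(&r->slot[r->head], memory_order_relaxed))
		return -1;
	/*
	 * Release store: every write the producer made to *it is ordered
	 * before the pointer becomes visible to the consumer. Without this
	 * ordering the consumer could see the pointer but stale contents.
	 */
	atomic_store_explicit(&r->slot[r->head], it, memory_order_release);
	r->head = (r->head + 1) % RING_SIZE;
	return 0;
}

/* Consumer: returns the oldest item, or NULL if the ring is empty. */
static struct item *ring_consume(struct ring *r)
{
	struct item *it;

	/* Acquire load pairs with the release store in ring_produce(). */
	it = atomic_load_explicit(&r->slot[r->tail], memory_order_acquire);
	if (!it)
		return NULL;
	atomic_store_explicit(&r->slot[r->tail], (struct item *)NULL,
			      memory_order_relaxed);
	r->tail = (r->tail + 1) % RING_SIZE;
	return it;
}
```

The sketch assumes one producer thread and one consumer thread; it leaves out the locking that a multi-producer or multi-consumer ring would need.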




Re: [RFC][PATCH] new byteorder primitives - ..._{replace,get}_bits()

2017-12-11 Thread Al Viro
On Mon, Dec 11, 2017 at 08:02:24PM -0800, Jakub Kicinski wrote:
> On Mon, 11 Dec 2017 15:54:22 +0000, Al Viro wrote:
> > Essentially, it gives helpers for work with bitfields in fixed-endian.
> > Suppose we have e.g. a little-endian 32bit value with fixed layout;
> > expressing that as a bitfield would go like
> > struct foo {
> > unsigned foo:4; /* bits 0..3 */
> > unsigned :2;
> > unsigned bar:12;/* bits 6..17 */
> > unsigned baz:14;/* bits 18..31 */
> > }
> > Even for host-endian it doesn't work all that well - you end up with
> > ifdefs in structure definition and generated code stinks.  For fixed-endian
> > it gets really painful, and people tend to use explicit shift-and-mask
> > kind of macros for accessing the fields (and often enough get the
> > endianness conversions wrong, at that).  With these primitives
> > 
> > struct foo v<=> __le32 v
> > v.foo = i ? 1 : 2   <=> v = le32_replace_bits(v, i ? 1 : 2, 0, 4)
> > f(4 + v.baz)<=> f(4 + le32_get_bits(v, 18, 14))
> 
> Looks very useful.  The [start bit, size] pair may not lend itself
> too nicely to creating defines, though.  Which is why in
> include/linux/bitfield.h we tried to use a shifted mask and work
> backwards from that single value what the start and size are.  commit
> 3e9b3112ec74 ("add basic register-field manipulation macros") has the
> description.  Could a similar trick perhaps be applicable here?

Umm...  What's wrong with

#define FIELD_FOO 0,4
#define FIELD_BAR 6,12
#define FIELD_BAZ 18,14

A macro can bloody well expand to any sequence of tokens - le32_get_bits(v, FIELD_BAZ)
will become le32_get_bits(v, 18, 14) just fine.  What's the problem with that?
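
Editorial illustration (not from the thread): a minimal userspace sketch of the point above, assuming the accessors are real functions rather than function-like macros. u32_get_bits() and u32_replace_bits() are hypothetical host-endian stand-ins for the proposed le32_get_bits() and le32_replace_bits(); the endianness conversion is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Field descriptors: each expands to the token sequence "shift, width". */
#define FIELD_FOO 0, 4
#define FIELD_BAR 6, 12
#define FIELD_BAZ 18, 14

/* Mask of 'bits' ones starting at bit 'shift'. */
static inline uint32_t field_mask(int shift, int bits)
{
	assert(shift >= 0 && bits >= 1 && shift + bits <= 32);
	return (bits == 32 ? UINT32_MAX : ((UINT32_C(1) << bits) - 1)) << shift;
}

/* Extract the field and shift it down to bit 0. */
static inline uint32_t u32_get_bits(uint32_t v, int shift, int bits)
{
	return (v & field_mask(shift, bits)) >> shift;
}

/* Return 'old' with the field replaced by 'val' (truncated to the field). */
static inline uint32_t u32_replace_bits(uint32_t old, uint32_t val,
					int shift, int bits)
{
	uint32_t mask = field_mask(shift, bits);

	return (old & ~mask) | ((val << shift) & mask);
}
```

With these definitions, u32_get_bits(v, FIELD_BAZ) is preprocessed into u32_get_bits(v, 18, 14). This relies on the accessor being a function: a function-like macro would count its arguments before FIELD_BAZ is expanded and reject the call.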


[PATCH v5 0/3] Add andestech atcpit100 timer

2017-12-11 Thread Rick Chen
Changelog v5:
 - Patch 1/3: Changes
 - Patch 2/3: New
 - Patch 3/3: Changes

[Patch 1/3] clocksource/drivers/atcpit100: Add andestech atcpit100 timer
 1. Do not split the Makefile patch out from the actual driver.
    Suggested by Arnd Bergmann
 2. Add of_clk.name = "PCLK" to be explicit about what we use.
    Suggested by Linus Walleij
 3. Remove GENERIC_CLOCKEVENTS from Kconfig.
    Suggested by Daniel Lezcano
 4. Add "depends on NDS32 || COMPILE_TEST" in Kconfig.
    Suggested by Greentime Hu

[Patch 2/3] clocksource/drivers/atcpit100: VDSO support
 For why this is implemented in the timer driver, see
 https://lkml.org/lkml/2017/12/8/362
 ([PATCH v3 17/33] nds32: VDSO support), where Mark Rutland suggested:
   You should not add properties to arbitrary DT bindings to
   handle a Linux implementation detail.
   Please remove this DT code, and have the drivers for those
   timer blocks export this information to your vdso code somehow.

[Patch 3/3] dt-bindings: timer: Add andestech atcpit100 timer binding doc
 Fix the incorrect description of PCLK.
 Suggested by Linus Walleij

Rick Chen (3):
  clocksource/drivers/atcpit100: Add andestech atcpit100 timer
  clocksource/drivers/atcpit100: VDSO support
  dt-bindings: timer: Add andestech atcpit100 timer binding doc

 .../bindings/timer/andestech,atcpit100-timer.txt   |  33 +++
 drivers/clocksource/Kconfig|   7 +
 drivers/clocksource/Makefile   |   1 +
 drivers/clocksource/timer-atcpit100.c  | 270 +
 4 files changed, 311 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/timer/andestech,atcpit100-timer.txt
 create mode 100644 drivers/clocksource/timer-atcpit100.c

-- 
2.7.4



[PATCH v5 2/3] clocksource/drivers/atcpit100: VDSO support

2017-12-11 Thread Rick Chen
The VDSO needs a real-time cycle count to keep time accurate.
Unlike other architectures, nds32 does not define its own clock source,
so the VDSO needs the atcpit100 to provide the real-time cycle count
from which the correct time is derived.

Signed-off-by: Vincent Chen 
Signed-off-by: Rick Chen 
Signed-off-by: Greentime Hu 
---
 drivers/clocksource/timer-atcpit100.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/clocksource/timer-atcpit100.c b/drivers/clocksource/timer-atcpit100.c
index 0077fdb..1be6c0a 100644
--- a/drivers/clocksource/timer-atcpit100.c
+++ b/drivers/clocksource/timer-atcpit100.c
@@ -29,6 +29,9 @@
 #include 
 #include 
 #include "timer-of.h"
+#ifdef CONFIG_NDS32
+#include 
+#endif
 
 /*
  * Definition of register offsets
@@ -211,6 +214,14 @@ static u64 notrace atcpit100_timer_sched_read(void)
return ~readl(timer_of_base(&to) + CH1_CNT);
 }
 
+#ifdef CONFIG_NDS32
+static void fill_vdso_need_info(void)
+{
+   timer_info.cycle_count_down = true;
+   timer_info.cycle_count_reg_offset = CH1_CNT;
+}
+#endif
+
 static int __init atcpit100_timer_init(struct device_node *node)
 {
int ret;
@@ -249,6 +260,10 @@ static int __init atcpit100_timer_init(struct device_node *node)
val = readl(base + INT_EN);
writel(val | CH0INT0EN, base + INT_EN);
 
+#ifdef CONFIG_NDS32
+   fill_vdso_need_info();
+#endif
+
return ret;
 }
 
-- 
2.7.4



[PATCH v5 3/3] dt-bindings: timer: Add andestech atcpit100 timer binding doc

2017-12-11 Thread Rick Chen
Add a document to describe Andestech atcpit100 timer and
binding information.

Signed-off-by: Rick Chen 
Signed-off-by: Greentime Hu 
Acked-by: Rob Herring 
---
 .../bindings/timer/andestech,atcpit100-timer.txt   | 33 ++
 1 file changed, 33 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/timer/andestech,atcpit100-timer.txt

diff --git a/Documentation/devicetree/bindings/timer/andestech,atcpit100-timer.txt b/Documentation/devicetree/bindings/timer/andestech,atcpit100-timer.txt
new file mode 100644
index 000..14812f68
--- /dev/null
+++ b/Documentation/devicetree/bindings/timer/andestech,atcpit100-timer.txt
@@ -0,0 +1,33 @@
+Andestech ATCPIT100 timer
+--
+ATCPIT100 is a generic IP block from Andes Technology, embedded in
+Andestech AE3XX platforms and other designs.
+
+This timer is a set of compact multi-function timers, which can be
+used as pulse width modulators (PWM) as well as simple timers.
+
+It supports up to 4 PIT channels. Each PIT channel is a
+multi-function timer and provide the following usage scenarios:
+One 32-bit timer
+Two 16-bit timers
+Four 8-bit timers
+One 16-bit PWM
+One 16-bit timer and one 8-bit PWM
+Two 8-bit timer and one 8-bit PWM
+
+Required properties:
+- compatible   : Should be "andestech,atcpit100"
+- reg  : Address and length of the register set
+- interrupts   : Reference to the timer interrupt
+- clocks : a clock to provide the tick rate for "andestech,atcpit100"
+- clock-names : should be "PCLK" for the peripheral clock source.
+
+Examples:
+
+timer0: timer@f040 {
+   compatible = "andestech,atcpit100";
+   reg = <0xf040 0x1000>;
+   interrupts = <2 4>;
+   clocks = <&apb>;
+   clock-names = "PCLK";
+};
-- 
2.7.4



[PATCH v5 1/3] clocksource/drivers/atcpit100: Add andestech atcpit100 timer

2017-12-11 Thread Rick Chen
ATCPIT100 is often used on the Andes architecture. This timer provides
4 PIT channels. Each PIT channel is a multi-function timer that can be
configured as 32-, 16- or 8-bit timers, or as a PWM.

For the system timer, the driver sets channel 1 32-bit timer0 as the
clock source; it counts downwards until underflow and then restarts.

It also sets channel 0 32-bit timer0 as the clock event; it counts
downwards until the match condition is met and then generates an
interrupt, which is handled periodically.

Signed-off-by: Rick Chen 
Signed-off-by: Greentime Hu 
Reviewed-by: Linus Walleij 
---
 drivers/clocksource/Kconfig   |   7 +
 drivers/clocksource/Makefile  |   1 +
 drivers/clocksource/timer-atcpit100.c | 255 ++
 3 files changed, 263 insertions(+)
 create mode 100644 drivers/clocksource/timer-atcpit100.c

diff --git a/drivers/clocksource/Kconfig b/drivers/clocksource/Kconfig
index cc60620..8c57ef2 100644
--- a/drivers/clocksource/Kconfig
+++ b/drivers/clocksource/Kconfig
@@ -615,4 +615,11 @@ config CLKSRC_ST_LPC
  Enable this option to use the Low Power controller timer
  as clocksource.
 
+config CLKSRC_ATCPIT100
+   bool "Clocksource for AE3XX platform"
+   depends on NDS32 || COMPILE_TEST
+   depends on HAS_IOMEM
+   help
+ This option enables support for the Andestech AE3XX platform timers.
+
 endmenu
diff --git a/drivers/clocksource/Makefile b/drivers/clocksource/Makefile
index 72711f1..7d072f5 100644
--- a/drivers/clocksource/Makefile
+++ b/drivers/clocksource/Makefile
@@ -75,3 +75,4 @@ obj-$(CONFIG_H8300_TMR16) += h8300_timer16.o
 obj-$(CONFIG_H8300_TPU)+= h8300_tpu.o
 obj-$(CONFIG_CLKSRC_ST_LPC)+= clksrc_st_lpc.o
 obj-$(CONFIG_X86_NUMACHIP) += numachip.o
+obj-$(CONFIG_CLKSRC_ATCPIT100) += timer-atcpit100.o
diff --git a/drivers/clocksource/timer-atcpit100.c b/drivers/clocksource/timer-atcpit100.c
new file mode 100644
index 000..0077fdb
--- /dev/null
+++ b/drivers/clocksource/timer-atcpit100.c
@@ -0,0 +1,255 @@
+/*
+ *  Andestech ATCPIT100 Timer Device Driver Implementation
+ *
+ * Copyright (C) 2017 Andes Technology Corporation
+ * Rick Chen, Andes Technology Corporation 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ *
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "timer-of.h"
+
+/*
+ * Definition of register offsets
+ */
+
+/* ID and Revision Register */
+#define ID_REV 0x0
+
+/* Configuration Register */
+#define CFG0x10
+
+/* Interrupt Enable Register */
+#define INT_EN 0x14
+#define CH_INT_EN(c, i)((1name, timer_of_rate(&to), 300, 32,
+   clocksource_mmio_readl_down);
+
+   if (ret) {
+   pr_err("Failed to register clocksource\n");
+   return ret;
+   }
+
+   /* clear channel 

[PATCH net-next] tcp/dccp: avoid one atomic operation for timewait hashdance

2017-12-11 Thread Eric Dumazet
From: Eric Dumazet 

First, rename __inet_twsk_hashdance() to inet_twsk_hashdance()

Then, remove one inet_twsk_put() by setting tw_refcnt to 3 instead
of 4, but adding a fat warning that we do not have the right to access
tw anymore after inet_twsk_hashdance()

Signed-off-by: Eric Dumazet 
---
 include/net/inet_timewait_sock.h |4 ++--
 net/dccp/minisocks.c |7 ---
 net/ipv4/inet_timewait_sock.c|   27 +--
 net/ipv4/tcp_minisocks.c |7 ---
 4 files changed, 23 insertions(+), 22 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 1356fa6a7566bf8b53632215ef8de4b153848f9b..899495589a7ea2bf693cdda42f83cec160e861b5 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -93,8 +93,8 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
   struct inet_timewait_death_row *dr,
   const int state);
 
-void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
-  struct inet_hashinfo *hashinfo);
+void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
+struct inet_hashinfo *hashinfo);
 
 void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo,
  bool rearm);
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 178bb9833311f83205317b07fe64cb2e45a9f734..37ccbe62eb1af3f9dffbf63323c008cc96cd8ea1 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -63,9 +63,10 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
 */
local_bh_disable();
inet_twsk_schedule(tw, timeo);
-   /* Linkage updates. */
-   __inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
-   inet_twsk_put(tw);
+   /* Linkage updates.
+* Note that access to tw after this point is illegal.
+*/
+   inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
local_bh_enable();
} else {
/* Sorry, if we're out of memory, just CLOSE this
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index b563e0c46bac2362acccf38495546a8b6b726384..277ff69a312dca1d0bc04be4b0b36db133aaf63b 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -97,7 +97,7 @@ static void inet_twsk_add_bind_node(struct inet_timewait_sock *tw,
  * Essentially we whip up a timewait bucket, copy the relevant info into it
  * from the SK, and mess with hash chains and list linkage.
  */
-void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
+void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
   struct inet_hashinfo *hashinfo)
 {
const struct inet_sock *inet = inet_sk(sk);
@@ -119,18 +119,6 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
 
spin_lock(lock);
 
-   /*
-* Step 2: Hash TW into tcp ehash chain.
-* Notes :
-* - tw_refcnt is set to 4 because :
-* - We have one reference from bhash chain.
-* - We have one reference from ehash chain.
-* - We have one reference from timer.
-* - One reference for ourself (our caller will release it).
-* We can use atomic_set() because prior spin_lock()/spin_unlock()
-* committed into memory all tw fields.
-*/
-   refcount_set(&tw->tw_refcnt, 4);
inet_twsk_add_node_rcu(tw, &ehead->chain);
 
/* Step 3: Remove SK from hash chain */
@@ -138,8 +126,19 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
 
spin_unlock(lock);
+
+   /* tw_refcnt is set to 3 because we have :
+* - one reference for bhash chain.
+* - one reference for ehash chain.
+* - one reference for timer.
+* We can use atomic_set() because prior spin_lock()/spin_unlock()
+* committed into memory all tw fields.
+* Also note that after this point, we lost our implicit reference
+* so we are not allowed to use tw anymore.
+*/
+   refcount_set(&tw->tw_refcnt, 3);
 }
-EXPORT_SYMBOL_GPL(__inet_twsk_hashdance);
+EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
 
 static void tw_timer_handler(struct timer_list *t)
 {
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index b079b619b60ca577d5ef20a5065fce87acecd96c..a8384b0c11f8fa589e2ed5311899b62c80a269f8 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -316,9 +316,10 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
 */
local_bh_disable();
inet_twsk_schedule(tw, timeo);
-   /* Linkage u

Re: [PATCH net] tcp md5sig: Use skb's saddr when replying to an incoming segment

2017-12-11 Thread Eric Dumazet
On Mon, 2017-12-11 at 00:05 -0800, Christoph Paasch wrote:
> The MD5-key that belongs to a connection is identified by the peer's
> IP-address. When we are in tcp_v4(6)_reqsk_send_ack(), we are replying
> to an incoming segment from tcp_check_req() that failed the seq-number
> checks.
> 
> Thus, to find the correct key, we need to use the skb's saddr and not
> the daddr.
> 
> This bug seems to have been there for quite a while, but probably went
> unnoticed because the consequences are not catastrophic. We will call
> tcp_v4_reqsk_send_ack only to send a challenge-ACK back to the peer,
> thus the connection doesn't really fail.
> 
> Fixes: 9501f9722922 ("tcp md5sig: Let the caller pass appropriate key
> for tcp_v{4,6}_do_calc_md5_hash().")
> Signed-off-by: Christoph Paasch 
> ---
>  net/ipv4/tcp_ipv4.c | 2 +-
>  net/ipv6/tcp_ipv6.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Eric Dumazet 

Thanks !



Re: [RFC][PATCH] new byteorder primitives - ..._{replace,get}_bits()

2017-12-11 Thread Jakub Kicinski
On Mon, 11 Dec 2017 15:54:22 +0000, Al Viro wrote:
> Essentially, it gives helpers for work with bitfields in fixed-endian.
> Suppose we have e.g. a little-endian 32bit value with fixed layout;
> expressing that as a bitfield would go like
>   struct foo {
>   unsigned foo:4; /* bits 0..3 */
>   unsigned :2;
>   unsigned bar:12;/* bits 6..17 */
>   unsigned baz:14;/* bits 18..31 */
>   }
> Even for host-endian it doesn't work all that well - you end up with
> ifdefs in structure definition and generated code stinks.  For fixed-endian
> it gets really painful, and people tend to use explicit shift-and-mask
> kind of macros for accessing the fields (and often enough get the
> endianness conversions wrong, at that).  With these primitives
> 
> struct foo v  <=> __le32 v
> v.foo = i ? 1 : 2 <=> v = le32_replace_bits(v, i ? 1 : 2, 0, 4)
> f(4 + v.baz)  <=> f(4 + le32_get_bits(v, 18, 14))

Looks very useful.  The [start bit, size] pair may not lend itself
too nicely to creating defines, though.  Which is why in
include/linux/bitfield.h we tried to use a shifted mask and work
backwards from that single value to what the start and size are.  Commit
3e9b3112ec74 ("add basic register-field manipulation macros") has the
description.  Could a similar trick perhaps be applicable here?
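
Editorial illustration (not from the thread): the "work backwards from a shifted mask" idea can be sketched in userspace C. The helper names mask_shift(), mask_width(), field_get() and field_prep() and the three *_MASK defines are hypothetical, chosen to match the struct foo layout quoted above. The sketch assumes a GCC- or Clang-compatible compiler for __builtin_ctz() and __builtin_popcount(), and assumes each mask is a single contiguous run of bits.

```c
#include <assert.h>
#include <stdint.h>

/* One define per field: the mask alone carries both start and size. */
#define FOO_MASK UINT32_C(0x0000000f)	/* bits 0..3 */
#define BAR_MASK UINT32_C(0x0003ffc0)	/* bits 6..17 */
#define BAZ_MASK UINT32_C(0xfffc0000)	/* bits 18..31 */

/* Start bit: index of the lowest set bit in the mask. */
static inline unsigned int mask_shift(uint32_t mask)
{
	assert(mask != 0);
	return (unsigned int)__builtin_ctz(mask);
}

/* Field size: number of set bits (valid for contiguous masks only). */
static inline unsigned int mask_width(uint32_t mask)
{
	return (unsigned int)__builtin_popcount(mask);
}

/* Extract a field from a register value. */
static inline uint32_t field_get(uint32_t mask, uint32_t reg)
{
	return (reg & mask) >> mask_shift(mask);
}

/* Position a value so it can be OR-ed into a register value. */
static inline uint32_t field_prep(uint32_t mask, uint32_t val)
{
	return (val << mask_shift(mask)) & mask;
}
```

The trade-off against the "shift, width" pair is that a single constant names the field, at the cost of recovering the shift from the mask.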


[BUG] 3com/3c59x: two possible sleep-in-atomic bugs

2017-12-11 Thread Jia-Ju Bai
According to drivers/net/ethernet/3com/3c59x.c, the kernel module may
sleep in the interrupt handler.

The function call paths are:
boomerang_interrupt (interrupt handler)
  vortex_error
    vortex_up
      pci_set_power_state --> may sleep
      pci_enable_device --> may sleep

vortex_interrupt (interrupt handler)
  vortex_error
    vortex_up
      pci_set_power_state --> may sleep
      pci_enable_device --> may sleep

I have not found a good way to fix them, so I am only reporting them.
These possible bugs were found by my static analysis tool (DSAC) and
checked by my code review.



Thanks,
Jia-Ju Bai


Setting large MTU size on slave interfaces may stall the whole system

2017-12-11 Thread Qing Huang

(resend this email in text format)


Hi,

We found an issue with the bonding driver when testing Mellanox devices.
The following test commands will sometimes stall the whole system, with the
serial console flooded with log messages from the bond_miimon_inspect()
function. Setting the MTU size to 1500 seems okay, but very rarely it may
hit the same problem too.

ip address flush dev ens3f0
ip link set dev ens3f0 down
ip address flush dev ens3f1
ip link set dev ens3f1 down
[root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1
updelay=500 downdelay=500
[root@ca-hcl629 etc]# ifconfig bond0 up
[root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1
[root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up


Serial console output:

** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for
interface ens3f0, disabling it in 500 ms

** 5 printk messages dropped ** [ 3717.755737] bond0: link status down for
interface ens3f0, disabling it in 500 ms

** 5 printk messages dropped ** [ 3717.767758] bond0: link status down for
interface ens3f0, disabling it in 500 ms

** 4 printk messages dropped ** [ 3717.37] bond0: link status down for
interface ens3f0, disabling it in 500 ms

or

** 4 printk messages dropped ** [274743.297863] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.307866] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.317857] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.327823] bond0: link status down again after 500 ms for interface enp48s0f1
** 4 printk messages dropped ** [274743.337817] bond0: link status down again after 500 ms for interface enp48s0f1


The root cause is the combined effect of commit
1f2cd845d3827412e82bf26dde0abca332ede402 (Revert "Merge branch
'bonding_monitor_locking'") and commit
de77ecd4ef02ca783f7762e04e92b3d0964be66b ("bonding: improve link-status
update in mii-monitoring"). E.g., after reverting the second commit, we
don't see the problem.

It seems that when setting a large MTU size on a RoCE interface, the RTNL
mutex may be held too long by the slave interface, causing
bond_mii_monitor() to be called repeatedly at an interval of 1 tick
(1K HZ kernel configuration) and the kernel to become unresponsive.
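
To put rough numbers on this, here is a small back-of-the-envelope sketch.
It assumes HZ=1000, the miimon=250 setting from the commands above, and a
purely hypothetical 2-second RTNL hold; rearm_count() and ms_to_jiffies()
are illustrative helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Convert milliseconds to ticks for a given HZ value. */
static uint64_t ms_to_jiffies(uint64_t ms, uint64_t hz)
{
	return ms * hz / 1000;
}

/*
 * How often a work item that fails rtnl_trylock() gets re-armed while the
 * lock stays held for hold_ms, if it re-queues itself every delay_jiffies.
 */
static uint64_t rearm_count(uint64_t hold_ms, uint64_t delay_jiffies,
			    uint64_t hz)
{
	assert(delay_jiffies > 0);
	return ms_to_jiffies(hold_ms, hz) / delay_jiffies;
}
```

Under these assumptions, rearm_count(2000, 1, 1000) gives 2000 runs of the
monitor during the hold, while re-arming at the miimon interval,
rearm_count(2000, ms_to_jiffies(250, 1000), 1000), gives 8.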


We found two possible solutions:

#1, don't re-arm the mii monitor thread too quickly if we cannot get the
RTNL lock:

index b2db581..8fd587a 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2266,7 +2266,6 @@ static void bond_mii_monitor(struct work_struct 
*work)


    /* Race avoidance with bond_close cancel of workqueue */
    if (!rtnl_trylock()) {
-   delay = 1;
    should_notify_peers = false;
    goto re_arm;
    }

#2, use printk_ratelimit() to avoid flooding the log with messages
generated by bond_miimon_inspect():


index b2db581..0183b7f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2054,7 +2054,7 @@ static int bond_miimon_inspect(struct bonding *bond)
    bond_propose_link_state(slave, BOND_LINK_FAIL);
    commit++;
    slave->delay = bond->params.downdelay;
-   if (slave->delay) {
+   if (slave->delay && printk_ratelimit()) {
    netdev_info(bond->dev, "link status 
down for

%sinterface %s, disabling it in %d ms\n",
    (BOND_MODE(bond) ==
BOND_MODE_ACTIVEBACKUP) ?
@@ -2105,7 +2105,8 @@ static int bond_miimon_inspect(struct bonding *bond)
    case BOND_LINK_BACK:
    if (!link_state) {
    bond_propose_link_state(slave,
BOND_LINK_DOWN);
-   netdev_info(bond->dev, "link status down
again after %d ms for interface %s\n",
+   if(printk_ratelimit())
+   netdev_info(bond->dev, "link status
down again after %d ms for interface %s\n",
(bond->params.updelay -
slave->delay) *
bond->params.miimon,
slave->dev->name);


Regarding the flooding messages, the netdev_info output is misleading anyway
when bond_mii_monitor() is called at a 1-tick interval due to lock
contention.
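
As a rough model of what option #2 would do to the log volume, here is a
userspace sketch of an interval/burst limiter. It is not the kernel's
implementation; struct ratelimit and ratelimit_ok() are hypothetical names,
and the 5-second/10-message figures in the note below are an assumption
about the printk_ratelimit() defaults.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct ratelimit {
	uint64_t interval_ms;	/* window length */
	unsigned int burst;	/* messages allowed per window */
	uint64_t begin_ms;	/* start of the current window */
	unsigned int printed;	/* messages let through in this window */
	unsigned int missed;	/* messages suppressed so far */
};

/* Return true if a message may be emitted at time now_ms. */
static bool ratelimit_ok(struct ratelimit *rl, uint64_t now_ms)
{
	assert(rl->interval_ms > 0);

	/* Open a fresh window once the previous one has expired. */
	if (now_ms - rl->begin_ms >= rl->interval_ms) {
		rl->begin_ms = now_ms;
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

With interval_ms = 5000 and burst = 10, a monitor that logs once per
millisecond for one second gets 10 messages through and 990 suppressed.
This bounds the console flood but does not reduce how often
bond_mii_monitor() itself runs.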



Solution #1 looks simpler and cleaner to me. Are there any side effects of
doing that?


Thanks,
Qing



Re: [PATCH net-next v5 2/2] net: ethernet: socionext: add AVE ethernet driver

2017-12-11 Thread Masami Hiramatsu
Hi Russell,

2017-12-11 22:46 GMT+09:00 Russell King - ARM Linux :
> On Mon, Dec 11, 2017 at 10:34:17PM +0900, Masami Hiramatsu wrote:
>> IMHO, even if we use SPDX license identifier, I recommend to use
>> C-style comments as many other files do, since it is C code.
>> If SPDX identifier requires C++ style, that is SPDX parser's issue
>> and should be fixed to get it from C-style comment.
>
> See the numerous emails on this subject already.  The issue of C
> vs C++ comments has come up many times by many different people, but
> the result is the same.  That's not going to happen.  Linux kernel
> C files are required to use "//" for the SPDX identifier by order
> of Linus Torvalds.

OK, I got it.

>
> Linus has also revealed in that discussion that he has a preference
> for "//" style commenting for single comments, so it seems that the
> kernel coding style may change - but there is no desire for patches
> to "clean up" single line comments to use "//".

Thank you for making it clear.

Then what I'm considering is the copyright notice lines. Those are usually
treated as header lines, not as a single line. So

> +// SDPX-License-Identifier: GPL-2.0
> +// sni_ave.c - Socionext UniPhier AVE ethernet driver
> +// Copyright 2014 Panasonic Corporation
> +// Copyright 2015-2017 Socionext Inc.

is acceptable? Or should we keep C-style header lines for new drivers?

> +// SDPX-License-Identifier: GPL-2.0
> +/*
> + * sni_ave.c - Socionext UniPhier AVE ethernet driver
> + * Copyright 2014 Panasonic Corporation
> + * Copyright 2015-2017 Socionext Inc.
> + */

I'm just concerned that those lines are not "single". That's all. :)

>
> For further information, and to see the discussion that has already
> happened, the arguments that have been made about style, see the
> threads for the patch series that tglx has been posting wrt documenting
> the SPDX stuff for the kernel.

OK, got it.

https://lkml.org/lkml/2017/11/16/663


Thanks,

>
> Thanks (let's stop rehashing the same arguments.)
>


-- 
Masami Hiramatsu


Re: [PATCH net-next] libbpf: add function to setup XDP

2017-12-11 Thread Daniel Borkmann
On 12/10/2017 10:07 PM, David Ahern wrote:
> On 12/10/17 1:34 PM, Eric Leblond wrote:
>>> Would it be possible to print out or preferably return to the caller
>>> the ext ack error message?  A couple of drivers are using it for XDP
>>> mis-configuration reporting instead of printks.  We should encourage
>>> other to do the same and support it in all user space since ext ack 
>>> msgs lead to much better user experience.
>>
>> I've seen the kind of messages displayed by reading at kernel log. They
>> are really useful and it looks almost mandatory to be able to display
>> them.
>>
>> Kernel code seems to not have a parser for the ext ack error message.
>> Did I miss something here ?
>>  
>> Looking at tc code, it seems it is using libmnl to parse them and I
>> doubt it is a good idea to use that in libbpf as it is introducing a
>> dependency.
>>
>> Does someone has an existing parsing code or should I write on my own ?
> 
> I had worked on extack for libbpf but seem to have lost the changes.
> 
> Look at the commits here:
> https://github.com/dsahern/iproute2/commits/ext-ack
> 
> I suggest using this:
> 
> https://github.com/dsahern/iproute2/commit/b61e4c7dd54a5d3ff98640da4b480441cee497b2
> 
> to bring in nlattr from lib/nlattr (as I recall lib/nlattr can not be
> used directly). From there, use this one:
> 
> https://github.com/dsahern/iproute2/commit/261f7251e6704d565b91e310faa7e18d14a1
> 
> to see what is needed for extack support.
> 
> Really not that much code to add.

+1, ext ack support would improve troubleshooting a lot here; please
add and respin. Thanks, Eric!
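
For reference, a minimal sketch of pulling the extack string out of an
NLMSG_ERROR reply without libmnl, using only definitions from
<linux/netlink.h>. It assumes headers new enough to provide
NLMSGERR_ATTR_MSG, NLM_F_ACK_TLVS and NLM_F_CAPPED, and that the socket
enabled NETLINK_EXT_ACK beforehand; extack_msg() is a hypothetical helper,
not existing libbpf or iproute2 code.

```c
#include <assert.h>
#include <linux/netlink.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the extended-ack string carried in an NLMSG_ERROR message, or
 * NULL if there is none. The result points into the message buffer.
 */
static const char *extack_msg(const struct nlmsghdr *nlh)
{
	const struct nlmsgerr *err;
	const char *p, *end;
	size_t off = sizeof(*err);

	assert(nlh != NULL);
	if (nlh->nlmsg_type != NLMSG_ERROR ||
	    nlh->nlmsg_len < NLMSG_LENGTH(sizeof(*err)))
		return NULL;
	/* TLVs are only present when the kernel set this flag. */
	if (!(nlh->nlmsg_flags & NLM_F_ACK_TLVS))
		return NULL;

	err = (const struct nlmsgerr *)NLMSG_DATA(nlh);
	/* Unless capped, the payload of the failed request is echoed back. */
	if (!(nlh->nlmsg_flags & NLM_F_CAPPED)) {
		if (err->msg.nlmsg_len < NLMSG_HDRLEN)
			return NULL;
		off += err->msg.nlmsg_len - NLMSG_HDRLEN;
	}
	if (NLMSG_HDRLEN + NLMSG_ALIGN(off) > nlh->nlmsg_len)
		return NULL;

	end = (const char *)nlh + nlh->nlmsg_len;
	p = (const char *)NLMSG_DATA(nlh) + NLMSG_ALIGN(off);

	/* Walk the attributes that follow the error header. */
	while (p < end && (size_t)(end - p) >= NLA_HDRLEN) {
		const struct nlattr *nla = (const struct nlattr *)p;
		size_t len = nla->nla_len;

		if (len < NLA_HDRLEN || len > (size_t)(end - p))
			break;
		if ((nla->nla_type & NLA_TYPE_MASK) == NLMSGERR_ATTR_MSG) {
			const char *s = p + NLA_HDRLEN;

			/* Only hand back NUL-terminated strings. */
			return memchr(s, '\0', len - NLA_HDRLEN) ? s : NULL;
		}
		p += NLA_ALIGN(len);
	}
	return NULL;
}
```

A caller would run this on each NLMSG_ERROR reply and either print the
string or return it alongside the errno value.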


Re: [PATCH net-next v4 0/2] bpf/tracing: allow user space to query prog array on the same tp

2017-12-11 Thread Daniel Borkmann
On 12/11/2017 08:39 PM, Yonghong Song wrote:
> Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
> for a single perf event") added support to attach multiple
> bpf programs to a single perf event. Given a perf event
> (kprobe, uprobe, or kernel tracepoint), the perf ioctl interface
> is used to query bpf programs attached to the same trace event.
> 
> There already exists a BPF_PROG_QUERY command for introspection
> currently used by cgroup+bpf. We did have an implementation for
> querying tracepoint+bpf through the same interface. However, it
> looks cleaner to use ioctl() style of api here, since attaching
> bpf prog to tracepoint/kuprobe is also done via ioctl.
> 
> Patch #1 had the core implementation and patch #2 added
> a test case in tools bpf selftests suite.
> 
> Changelogs:
> v3 -> v4:
>   - Fix a compilation error with newer gcc like 6.3.1 while
> old gcc 4.8.5 is okay. I was using &uquery->ids to represent
> the address to the ids array to make it explicit that the
> address is passed, and this syntax is rightly rejected
> by gcc 6.3.1.

Series applied to bpf-next, thanks Yonghong.


Re: [PATCH v3 net-next 0/9] net: Generic network resolver backend and ILA resolver

2017-12-11 Thread David Miller
From: Tom Herbert 
Date: Mon, 11 Dec 2017 14:16:17 -0800

> How can we build a system that allows an unlimited number of
> resolutions without drop?

IPV4 routing solves this with a prefixed trie, for example.

The fundamental backing datastructure for the switching
or whatever operation must be in-memory, in the kernel,
scalable, and without a fronting "cache".
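
To illustrate the kind of structure meant here, a minimal sketch of
longest-prefix matching with a plain binary trie. This is a deliberately
simplified stand-in, not the kernel's fib_trie; struct trie_node,
trie_insert() and trie_lookup() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct trie_node {
	struct trie_node *child[2];
	int has_value;
	int value;		/* e.g. a next-hop index */
};

/* Map prefix/len to value. Returns 0, or -1 on allocation failure. */
static int trie_insert(struct trie_node *root, uint32_t prefix,
		       unsigned int len, int value)
{
	struct trie_node *n = root;

	assert(len <= 32);
	for (unsigned int i = 0; i < len; i++) {
		unsigned int bit = (prefix >> (31 - i)) & 1;

		if (!n->child[bit]) {
			n->child[bit] = calloc(1, sizeof(*n));
			if (!n->child[bit])
				return -1;
		}
		n = n->child[bit];
	}
	n->has_value = 1;
	n->value = value;
	return 0;
}

/* Return the value of the most specific prefix covering addr, or -1. */
static int trie_lookup(const struct trie_node *root, uint32_t addr)
{
	const struct trie_node *n = root;
	int best = -1;

	for (unsigned int i = 0; n; i++) {
		if (n->has_value)
			best = n->value;
		if (i == 32)
			break;
		n = n->child[(addr >> (31 - i)) & 1];
	}
	return best;
}
```

Every lookup walks the complete in-memory table in at most 32 steps, so
there is no cache in front of it and no miss path that could drop a
packet.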


Re: [PATCH v3 17/33] nds32: VDSO support

2017-12-11 Thread Vincent Chen
2017-12-08 20:14 GMT+08:00 Mark Rutland :
> On Fri, Dec 08, 2017 at 07:54:42PM +0800, Greentime Hu wrote:
>> 2017-12-08 18:21 GMT+08:00 Mark Rutland :
>> > On Fri, Dec 08, 2017 at 05:12:00PM +0800, Greentime Hu wrote:
>> >> +static int grab_timer_node_info(void)
>> >> +{
>> >> + struct device_node *timer_node;
>> >> +
>> >> + timer_node = of_find_node_by_name(NULL, "timer");
>> >
>> > Please use a compatible string, rather than matching the timer by name.
>> >
>> > It's plausible that you have multiple nodes called "timer" in the DT,
>> > under different parent nodes, and this might not be the device you
>> > think it is. I see your dt in patch 24 has two timer nodes.
>> >
>> > It would be best if your clocksource driver exposed some stuct that you
>> > looked at here, so that you're guaranteed to user the same device.
>>
>> We'd like to use "timer" here because there are 2 different timer IPs
>> and we are sure that they won't be in the same SoC.
>> We think this implementation in VDSO should be platform independent to
>> get cycle-count register.
>> Our customer or other SoC provider who can use "timer" and define
>> cycle-count-offset or cycle-count-down then we can get the correct
>> cycle-count.
>
> This is not the right way to do things.
>
> So from a DT perspective, NAK.
>
> You should not add properties to arbitrary DT bindings to handle a Linux
> implementation detail.
>
> Please remove this DT code, and have the drivers for those timer blocks
> export this information to your vdso code somehow.
>

Hi, Mark:
Based on your suggestion, we define a new struct timer_info to let the
timer driver record the values of cycle-count-offset and cycle-count-down
in the timer_init function. The above code in the timer driver is valid
only when CONFIG_NDS32 is defined.

>> We sent atcpit100 patch last time along with our arch, however we'd
>> like to send it to its sub system this time and my colleague is still
>> working on it.
>> He may send the timer patch next week.
>
> I think that it would make sense for that patch to be part of the arch
> port, especially given that (AFAICT) there is no dirver for the other
> timer IP that you mention.
>
> [...]
>
>> >> +int arch_setup_additional_pages(struct linux_binprm *bprm, int 
>> >> uses_interp)
>> >> +{
>> >
>> >> + /*Map timer to user space */
>> >> + vdso_base += PAGE_SIZE;
>> >> + prot = __pgprot(_PAGE_V | _PAGE_M_UR_KR | _PAGE_D |
>> >> + _PAGE_G | _PAGE_C_DEV);
>> >> + ret = io_remap_pfn_range(vma, vdso_base, timer_res.start >> 
>> >> PAGE_SHIFT,
>> >> +  PAGE_SIZE, prot);
>> >> + if (ret)
>> >> + goto up_fail;
>> >
>> > Maybe this is fine, but it looks a bit suspicious.
>> >
>> > Is it safe to map IO memory to a userspace process like this?
>> >
>> > In general that isn't safe, since userspace could access other registers
>> > (if those exist), perform accesses that change the state of hardware, or
>> > make unsupported access types (e.g. unaligned, atomic) that result in
>> > errors the kernel can't handle.
>> >
>> > Does none of that apply here?
>>
>> We only provide read permission to this page so hareware state won't
>> be chagned. It will trigger exception if we try to write.
>> We will check about the alignment/atomic issue of this region.
>

For the alignment issue, we intentionally made an unaligned read to access
this region and we got "Segmentation fault" as expected.


Thanks,
Vincent

> Ok, thanks.
>
> This is another reason to only do this for devices/drivers that we have
> drivers for, since we can't know that this is safe in general.
>
> Thanks,
> Mark.


linux-next: build failure after merge of the mac80211-next tree

2017-12-11 Thread Stephen Rothwell
Hi Johannes,

After merging the mac80211-next tree, today's linux-next build (x86_64
allmodconfig) failed like this:

drivers/net/wireless/mediatek/mt76/mt76x2_main.c:539:19: error: initialization 
from incompatible pointer type [-Werror=incompatible-pointer-types]
  .wake_tx_queue = mt76_wake_tx_queue,
   ^
drivers/net/wireless/mediatek/mt76/mt76x2_main.c:539:19: note: (near 
initialization for 'mt76x2_ops.wake_tx_queue')

Caused by commits

  17f1de56df05 ("mt76: add common code shared between multiple chipsets")
  7bc04215a66b ("mt76: add driver code for MT76x2e")

from the wireless-drivers-next tree interacting with commit

  e937b8da5a59 ("mac80211: Add TXQ scheduling API")

from the mac80211-next tree.

I applied the below hack merge fix ... please let me know if something
more/better is required.  Someone needs to remember to tell Dave when
these trees meet in his tree.

From: Stephen Rothwell 
Date: Tue, 12 Dec 2017 12:50:40 +1100
Subject: [PATCH] mt76: fix up for "mac80211: Add TXQ scheduling API"

Signed-off-by: Stephen Rothwell 
---
 drivers/net/wireless/mediatek/mt76/mt76.h |  2 +-
 drivers/net/wireless/mediatek/mt76/tx.c   | 10 +++---
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mt76.h 
b/drivers/net/wireless/mediatek/mt76/mt76.h
index aa0880bbea7f..e395d3859212 100644
--- a/drivers/net/wireless/mediatek/mt76/mt76.h
+++ b/drivers/net/wireless/mediatek/mt76/mt76.h
@@ -338,7 +338,7 @@ void mt76_tx(struct mt76_dev *dev, struct ieee80211_sta 
*sta,
 struct mt76_wcid *wcid, struct sk_buff *skb);
 void mt76_txq_init(struct mt76_dev *dev, struct ieee80211_txq *txq);
 void mt76_txq_remove(struct mt76_dev *dev, struct ieee80211_txq *txq);
-void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq);
+void mt76_wake_tx_queue(struct ieee80211_hw *hw);
 void mt76_stop_tx_queues(struct mt76_dev *dev, struct ieee80211_sta *sta,
 bool send_bar);
 void mt76_txq_schedule(struct mt76_dev *dev, struct mt76_queue *hwq);
diff --git a/drivers/net/wireless/mediatek/mt76/tx.c 
b/drivers/net/wireless/mediatek/mt76/tx.c
index 4eef69bd8a9e..ad414af0750f 100644
--- a/drivers/net/wireless/mediatek/mt76/tx.c
+++ b/drivers/net/wireless/mediatek/mt76/tx.c
@@ -463,12 +463,16 @@ void mt76_stop_tx_queues(struct mt76_dev *dev, struct 
ieee80211_sta *sta,
 }
 EXPORT_SYMBOL_GPL(mt76_stop_tx_queues);
 
-void mt76_wake_tx_queue(struct ieee80211_hw *hw, struct ieee80211_txq *txq)
+void mt76_wake_tx_queue(struct ieee80211_hw *hw)
 {
+   struct ieee80211_txq *txq;
struct mt76_dev *dev = hw->priv;
-   struct mt76_txq *mtxq = (struct mt76_txq *) txq->drv_priv;
-   struct mt76_queue *hwq = mtxq->hwq;
+   struct mt76_txq *mtxq;
+   struct mt76_queue *hwq;
 
+   txq = ieee80211_next_txq(hw);
+   mtxq = (struct mt76_txq *) txq->drv_priv;
+   hwq = mtxq->hwq;
spin_lock_bh(&hwq->lock);
if (list_empty(&mtxq->list))
list_add_tail(&mtxq->list, &hwq->swq);
-- 
2.15.0

-- 
Cheers,
Stephen Rothwell


Re: [PATCH] selftests: bpf: Adding config fragment CONFIG_CGROUP_BPF=y

2017-12-11 Thread Daniel Borkmann
On 12/11/2017 08:25 PM, Naresh Kamboju wrote:
> CONFIG_CGROUP_BPF=y is required for test_dev_cgroup test case.
> 
> Signed-off-by: Naresh Kamboju 

Applied to bpf-next, thanks Naresh!


[PATCH v2] igb: Free IRQs when device is hotplugged

2017-12-11 Thread Lyude Paul
Recently I got a Caldigit TS3 Thunderbolt 3 dock, and noticed that upon
hotplugging my kernel would immediately crash due to igb:

[  680.825801] kernel BUG at drivers/pci/msi.c:352!
[  680.828388] invalid opcode:  [#1] SMP
[  680.829194] Modules linked in: igb(O) thunderbolt i2c_algo_bit joydev vfat 
fat btusb btrtl btbcm btintel bluetooth ecdh_generic hp_wmi sparse_keymap 
rfkill wmi_bmof iTCO_wdt intel_rapl x86_pkg_temp_thermal coretemp crc32_pclmul 
snd_pcm rtsx_pci_ms mei_me snd_timer memstick snd pcspkr mei soundcore i2c_i801 
tpm_tis psmouse shpchp wmi tpm_tis_core tpm video hp_wireless acpi_pad 
rtsx_pci_sdmmc mmc_core crc32c_intel serio_raw rtsx_pci mfd_core xhci_pci 
xhci_hcd i2c_hid i2c_core [last unloaded: igb]
[  680.831085] CPU: 1 PID: 78 Comm: kworker/u16:1 Tainted: G   O 
4.15.0-rc3Lyude-Test+ #6
[  680.831596] Hardware name: HP HP ZBook Studio G4/826B, BIOS P71 Ver. 01.03 
06/09/2017
[  680.832168] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[  680.832687] RIP: 0010:free_msi_irqs+0x180/0x1b0
[  680.833271] RSP: 0018:c930fbf0 EFLAGS: 00010286
[  680.833761] RAX: 8803405f9c00 RBX: 88033e3d2e40 RCX: 002c
[  680.834278] RDX:  RSI: 00ac RDI: 880340be2178
[  680.834832] RBP:  R08: 880340be1ff0 R09: 8803405f9c00
[  680.835342] R10:  R11: 0040 R12: 88033d63a298
[  680.835822] R13: 88033d63a000 R14: 0060 R15: 880341959000
[  680.836332] FS:  () GS:88034f44() 
knlGS:
[  680.836817] CS:  0010 DS:  ES:  CR0: 80050033
[  680.837360] CR2: 55e64044afdf CR3: 01c09002 CR4: 003606e0
[  680.837954] Call Trace:
[  680.838853]  pci_disable_msix+0xce/0xf0
[  680.839616]  igb_reset_interrupt_capability+0x5d/0x60 [igb]
[  680.840278]  igb_remove+0x9d/0x110 [igb]
[  680.840764]  pci_device_remove+0x36/0xb0
[  680.841279]  device_release_driver_internal+0x157/0x220
[  680.841739]  pci_stop_bus_device+0x7d/0xa0
[  680.842255]  pci_stop_bus_device+0x2b/0xa0
[  680.842722]  pci_stop_bus_device+0x3d/0xa0
[  680.843189]  pci_stop_and_remove_bus_device+0xe/0x20
[  680.843627]  trim_stale_devices+0xf3/0x140
[  680.844086]  trim_stale_devices+0x94/0x140
[  680.844532]  trim_stale_devices+0xa6/0x140
[  680.845031]  ? get_slot_status+0x90/0xc0
[  680.845536]  acpiphp_check_bridge.part.5+0xfe/0x140
[  680.846021]  acpiphp_hotplug_notify+0x175/0x200
[  680.846581]  ? free_bridge+0x100/0x100
[  680.847113]  acpi_device_hotplug+0x8a/0x490
[  680.847535]  acpi_hotplug_work_fn+0x1a/0x30
[  680.848076]  process_one_work+0x182/0x3a0
[  680.848543]  worker_thread+0x2e/0x380
[  680.848963]  ? process_one_work+0x3a0/0x3a0
[  680.849373]  kthread+0x111/0x130
[  680.849776]  ? kthread_create_worker_on_cpu+0x50/0x50
[  680.850188]  ret_from_fork+0x1f/0x30
[  680.850601] Code: 43 14 85 c0 0f 84 d5 fe ff ff 31 ed eb 0f 83 c5 01 39 6b 
14 0f 86 c5 fe ff ff 8b 7b 10 01 ef e8 b7 e4 d2 ff 48 83 78 70 00 74 e3 <0f> 0b 
49 8d b5 a0 00 00 00 e8 62 6f d3 ff e9 c7 fe ff ff 48 8b
[  680.851497] RIP: free_msi_irqs+0x180/0x1b0 RSP: c930fbf0

As it turns out, normally the freeing of IRQs that would fix this is called
inside of the scope of __igb_close(). However, since the device is
already gone by the point we try to unregister the netdevice from the
driver due to a hotplug we end up seeing that the netif isn't present
and thus, forget to free any of the device IRQs.

So: make sure that if we're in the process of dismantling the netdev, we
always allow __igb_close() to be called so that IRQs may be freed
normally. Additionally, only allow igb_close() to be called from
__igb_close() if it hasn't already been called for the given adapter.

Signed-off-by: Lyude Paul 
Fixes: 9474933caf21 ("igb: close/suspend race in netif_device_detach")
Cc: Todd Fujinaka 
Cc: Stephen Hemminger 
Cc: sta...@vger.kernel.org
---
Changes since v1:
  - Remove code for freeing IRQs from igb_remove(), unbreak
__igb_close() instead (re: Stephen Hemminger)

 drivers/net/ethernet/intel/igb/igb_main.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index c208753ff5b7..a1083fd074dd 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3663,7 +3663,9 @@ static int __igb_close(struct net_device *netdev, bool 
suspending)
if (!suspending)
pm_runtime_get_sync(&pdev->dev);
 
-   igb_down(adapter);
+   if (!test_bit(__IGB_DOWN, &adapter->state))
+   igb_down(adapter);
+
igb_free_irq(adapter);
 
igb_free_all_tx_resources(adapter);
@@ -3676,7 +3678,7 @@ static int __igb_close(struct net_device *netdev, bool 
suspending)
 
 int igb_close(struct net_device *netdev)
 {
-   if (netif_device_present(netdev))
+   if (netif_

[PATCH bpf 0/3] Misc BPF fixes

2017-12-11 Thread Daniel Borkmann
Couple of outstanding fixes for BPF tree: 1) fixes a perf RB
corruption, 2) and 3) fixes a few build issues from the recent
bpf_perf_event.h uapi corrections. Thanks!

Daniel Borkmann (3):
  bpf: fix corruption on concurrent perf_event_output calls
  bpf: fix build issues on um due to mising bpf_perf_event.h
  bpf: fix broken BPF selftest build

 arch/um/include/asm/Kbuild  |  1 +
 kernel/trace/bpf_trace.c| 19 ---
 tools/include/uapi/asm/bpf_perf_event.h |  7 +++
 tools/testing/selftests/bpf/Makefile| 13 +
 4 files changed, 21 insertions(+), 19 deletions(-)
 create mode 100644 tools/include/uapi/asm/bpf_perf_event.h

-- 
2.9.5



[PATCH bpf 3/3] bpf: fix broken BPF selftest build

2017-12-11 Thread Daniel Borkmann
At least on x86_64, the kernel's BPF selftests seemed to have stopped
to build due to 618e165b2a8e ("selftests/bpf: sync kernel headers and
introduce arch support in Makefile"):

  [...]
  In file included from test_verifier.c:29:0:
  ../../../include/uapi/linux/bpf_perf_event.h:11:32:
 fatal error: asm/bpf_perf_event.h: No such file or directory
   #include 
^
  compilation terminated.
  [...]

While pulling in tools/arch/*/include/uapi/asm/bpf_perf_event.h seems
to work fine, there's no automated fall-back logic right now that would
do the same out of tools/include/uapi/asm-generic/bpf_perf_event.h. The
usual convention today is to add a include/[uapi/]asm/ equivalent that
would pull in the correct arch header or generic one as fall-back, all
ifdef'ed based on compiler target definition. It's similarly done also
in other cases such as tools/include/asm/barrier.h, thus adapt the same
here.

Fixes: 618e165b2a8e ("selftests/bpf: sync kernel headers and introduce arch 
support in Makefile")
Signed-off-by: Daniel Borkmann 
Cc: Hendrik Brueckner 
Cc: Arnaldo Carvalho de Melo 
Acked-by: Alexei Starovoitov 
---
 tools/include/uapi/asm/bpf_perf_event.h |  7 +++
 tools/testing/selftests/bpf/Makefile| 13 +
 2 files changed, 8 insertions(+), 12 deletions(-)
 create mode 100644 tools/include/uapi/asm/bpf_perf_event.h

diff --git a/tools/include/uapi/asm/bpf_perf_event.h 
b/tools/include/uapi/asm/bpf_perf_event.h
new file mode 100644
index 000..13a5853
--- /dev/null
+++ b/tools/include/uapi/asm/bpf_perf_event.h
@@ -0,0 +1,7 @@
+#if defined(__aarch64__)
+#include "../../arch/arm64/include/uapi/asm/bpf_perf_event.h"
+#elif defined(__s390__)
+#include "../../arch/s390/include/uapi/asm/bpf_perf_event.h"
+#else
+#include 
+#endif
diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 21a2d76..792af7c 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -1,19 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 
-ifeq ($(srctree),)
-srctree := $(patsubst %/,%,$(dir $(CURDIR)))
-srctree := $(patsubst %/,%,$(dir $(srctree)))
-srctree := $(patsubst %/,%,$(dir $(srctree)))
-srctree := $(patsubst %/,%,$(dir $(srctree)))
-endif
-include $(srctree)/tools/scripts/Makefile.arch
-
-$(call detected_var,SRCARCH)
-
 LIBDIR := ../../../lib
 BPFDIR := $(LIBDIR)/bpf
 APIDIR := ../../../include/uapi
-ASMDIR:= ../../../arch/$(ARCH)/include/uapi
 GENDIR := ../../../../include/generated
 GENHDR := $(GENDIR)/autoconf.h
 
@@ -21,7 +10,7 @@ ifneq ($(wildcard $(GENHDR)),)
   GENFLAGS := -DHAVE_GENHDR
 endif
 
-CFLAGS += -Wall -O2 -I$(APIDIR) -I$(ASMDIR) -I$(LIBDIR) -I$(GENDIR) 
$(GENFLAGS) -I../../../include
+CFLAGS += -Wall -O2 -I$(APIDIR) -I$(LIBDIR) -I$(GENDIR) $(GENFLAGS) 
-I../../../include
 LDLIBS += -lcap -lelf
 
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map 
test_progs \
-- 
2.9.5



[PATCH bpf 2/3] bpf: fix build issues on um due to mising bpf_perf_event.h

2017-12-11 Thread Daniel Borkmann
Since c895f6f703ad ("bpf: correct broken uapi for
BPF_PROG_TYPE_PERF_EVENT program type") um (uml) won't build
on i386 or x86_64:

  [...]
CC  init/main.o
  In file included from ../include/linux/perf_event.h:18:0,
   from ../include/linux/trace_events.h:10,
   from ../include/trace/syscall.h:7,
   from ../include/linux/syscalls.h:82,
   from ../init/main.c:20:
  ../include/uapi/linux/bpf_perf_event.h:11:32: fatal error:
  asm/bpf_perf_event.h: No such file or directory #include
  
  [...]

Lets add missing bpf_perf_event.h also to um arch. This seems
to be the only one still missing.

Fixes: c895f6f703ad ("bpf: correct broken uapi for BPF_PROG_TYPE_PERF_EVENT 
program type")
Reported-by: Randy Dunlap 
Suggested-by: Richard Weinberger 
Signed-off-by: Daniel Borkmann 
Tested-by: Randy Dunlap 
Cc: Hendrik Brueckner 
Cc: Richard Weinberger 
Acked-by: Alexei Starovoitov 
---
 arch/um/include/asm/Kbuild | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index 50a32c3..73c57f6 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -1,4 +1,5 @@
 generic-y += barrier.h
+generic-y += bpf_perf_event.h
 generic-y += bug.h
 generic-y += clkdev.h
 generic-y += current.h
-- 
2.9.5



[PATCH bpf 1/3] bpf: fix corruption on concurrent perf_event_output calls

2017-12-11 Thread Daniel Borkmann
When tracing and networking programs are both attached in the
system and both use event-output helpers that eventually call
into perf_event_output(), then we could end up in a situation
where the tracing attached program runs in user context while
a cls_bpf program is triggered on that same CPU out of softirq
context.

Since both rely on the same per-cpu perf_sample_data, we could
potentially corrupt it. This can only ever happen in a combination
of the two types; all tracing programs use a bpf_prog_active
counter to bail out in case a program is already running on
that CPU out of a different context. XDP and cls_bpf programs
by themselves don't have this issue as they run in the same
context only. Therefore, split both perf_sample_data so they
cannot be accessed from each other.

Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
Reported-by: Alexei Starovoitov 
Signed-off-by: Daniel Borkmann 
Tested-by: Song Liu 
Acked-by: Alexei Starovoitov 
---
 kernel/trace/bpf_trace.c | 19 ---
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 0ce99c3..40207c2 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -343,14 +343,13 @@ static const struct bpf_func_proto 
bpf_perf_event_read_value_proto = {
.arg4_type  = ARG_CONST_SIZE,
 };
 
-static DEFINE_PER_CPU(struct perf_sample_data, bpf_sd);
+static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd);
 
 static __always_inline u64
 __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
-   u64 flags, struct perf_raw_record *raw)
+   u64 flags, struct perf_sample_data *sd)
 {
struct bpf_array *array = container_of(map, struct bpf_array, map);
-   struct perf_sample_data *sd = this_cpu_ptr(&bpf_sd);
unsigned int cpu = smp_processor_id();
u64 index = flags & BPF_F_INDEX_MASK;
struct bpf_event_entry *ee;
@@ -373,8 +372,6 @@ __bpf_perf_event_output(struct pt_regs *regs, struct 
bpf_map *map,
if (unlikely(event->oncpu != cpu))
return -EOPNOTSUPP;
 
-   perf_sample_data_init(sd, 0, 0);
-   sd->raw = raw;
perf_event_output(event, sd, regs);
return 0;
 }
@@ -382,6 +379,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct 
bpf_map *map,
 BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, 
map,
   u64, flags, void *, data, u64, size)
 {
+   struct perf_sample_data *sd = this_cpu_ptr(&bpf_trace_sd);
struct perf_raw_record raw = {
.frag = {
.size = size,
@@ -392,7 +390,10 @@ BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, 
struct bpf_map *, map,
if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
return -EINVAL;
 
-   return __bpf_perf_event_output(regs, map, flags, &raw);
+   perf_sample_data_init(sd, 0, 0);
+   sd->raw = &raw;
+
+   return __bpf_perf_event_output(regs, map, flags, sd);
 }
 
 static const struct bpf_func_proto bpf_perf_event_output_proto = {
@@ -407,10 +408,12 @@ static const struct bpf_func_proto 
bpf_perf_event_output_proto = {
 };
 
 static DEFINE_PER_CPU(struct pt_regs, bpf_pt_regs);
+static DEFINE_PER_CPU(struct perf_sample_data, bpf_misc_sd);
 
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy)
 {
+   struct perf_sample_data *sd = this_cpu_ptr(&bpf_misc_sd);
struct pt_regs *regs = this_cpu_ptr(&bpf_pt_regs);
struct perf_raw_frag frag = {
.copy   = ctx_copy,
@@ -428,8 +431,10 @@ u64 bpf_event_output(struct bpf_map *map, u64 flags, void 
*meta, u64 meta_size,
};
 
perf_fetch_caller_regs(regs);
+   perf_sample_data_init(sd, 0, 0);
+   sd->raw = &raw;
 
-   return __bpf_perf_event_output(regs, map, flags, &raw);
+   return __bpf_perf_event_output(regs, map, flags, sd);
 }
 
 BPF_CALL_0(bpf_get_current_task)
-- 
2.9.5
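
As a toy userspace model of the problem described in the commit message
above: one scratch buffer shared by two contexts gets clobbered when the
second context nests inside the first, while one buffer per context does
not. All names here (struct sample, emit, net_output, trace_output) are
hypothetical stand-ins, and the "softirq" is simulated by a plain nested
call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sample {
	const char *raw;	/* stand-in for perf_sample_data's raw record */
};

static struct sample shared_sd;		/* before: one buffer for everyone */
static struct sample trace_sd, misc_sd;	/* after: one per caller class */

/* Stand-in for perf_event_output(): report which payload is emitted. */
static const char *emit(const struct sample *sd)
{
	return sd->raw;
}

/* Networking-side helper, as it would run out of softirq context. */
static const char *net_output(struct sample *sd, const char *data)
{
	sd->raw = data;
	return emit(sd);
}

/*
 * Tracing-side helper. If nested is set, a networking program "interrupts"
 * after the sample is set up but before it is emitted.
 */
static const char *trace_output(struct sample *sd, struct sample *net_sd,
				const char *data, int nested)
{
	assert(sd != NULL && net_sd != NULL);
	sd->raw = data;
	if (nested)
		net_output(net_sd, "net");
	return emit(sd);
}
```

Calling trace_output(&shared_sd, &shared_sd, "trace", 1) emits "net"
instead of "trace", whereas trace_output(&trace_sd, &misc_sd, "trace", 1)
emits the intended payload.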



linux-next: manual merge of the net-next tree with the net tree

2017-12-11 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/phy/meson-gxl.c

between commit:

  f1e2400a80ff ("net: phy: meson-gxl: detect LPA corruption")

from the net tree and commit:

  80274abafc60 ("net: phy: remove generic settings for callbacks config_aneg 
and read_status from drivers")

from the net-next tree.

I fixed it up (I just used the former) and can carry the fix as
necessary. This is now fixed as far as linux-next is concerned, but any
non trivial conflicts should be mentioned to your upstream maintainer
when your tree is submitted for merging.  You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise any
particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/phy/meson-gxl.c
index 77dd4be5,401e3234be58..
--- a/drivers/net/phy/meson-gxl.c
+++ b/drivers/net/phy/meson-gxl.c
@@@ -130,9 -58,7 +130,8 @@@ static struct phy_driver meson_gxl_phy[
.features   = PHY_BASIC_FEATURES,
.flags  = PHY_IS_INTERNAL,
.config_init= meson_gxl_config_init,
-   .config_aneg= genphy_config_aneg,
.aneg_done  = genphy_aneg_done,
 +  .read_status= meson_gxl_read_status,
.suspend= genphy_suspend,
.resume = genphy_resume,
},


[PATCH iproute2 net-next v2 1/4] ss: Replace printf() calls for "main" output by calls to helper

2017-12-11 Thread Stefano Brivio
This is preparation work for output buffering, which will allow
us to use optimal spacing and alignment of logical "columns".

The new out() function is just a re-implementation of a typical
libc's printf(), except that the return value of vfprintf() is
ignored as no callers use it. This implementation will be
replaced in the next patches to provide column width adjustment
and adequate spacing.

All printf() calls that output parts of the socket list are now
replaced by calls to out(). Output of summary and version is
excluded from this.

No functional differences here, output not affected.

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
v2: rebase after conflict with 00ac78d39c29 ("ss: print tcpi_rcv_ssthresh")

 misc/ss.c | 399 --
 1 file changed, 205 insertions(+), 194 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index da52d5edeb7e..a7d3b89e1478 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "utils.h"
 #include "rt_names.h"
@@ -823,6 +824,15 @@ static const char *vsock_netid_name(int type)
}
 }
 
+static void out(const char *fmt, ...)
+{
+   va_list args;
+
+   va_start(args, fmt);
+   vfprintf(stdout, fmt, args);
+   va_end(args);
+}
+
 static void sock_state_print(struct sockstat *s)
 {
const char *sock_name;
@@ -863,39 +873,39 @@ static void sock_state_print(struct sockstat *s)
}
 
if (netid_width)
-   printf("%-*s ", netid_width,
-  is_sctp_assoc(s, sock_name) ? "" : sock_name);
+   out("%-*s ", netid_width,
+   is_sctp_assoc(s, sock_name) ? "" : sock_name);
if (state_width) {
if (is_sctp_assoc(s, sock_name))
-   printf("`- %-*s ", state_width - 3,
-  sctp_sstate_name[s->state]);
+   out("`- %-*s ", state_width - 3,
+   sctp_sstate_name[s->state]);
else
-   printf("%-*s ", state_width, sstate_name[s->state]);
+   out("%-*s ", state_width, sstate_name[s->state]);
}
 
-   printf("%-6d %-6d %s", s->rq, s->wq, odd_width_pad);
+   out("%-6d %-6d %s", s->rq, s->wq, odd_width_pad);
 }
 
 static void sock_details_print(struct sockstat *s)
 {
if (s->uid)
-   printf(" uid:%u", s->uid);
+   out(" uid:%u", s->uid);
 
-   printf(" ino:%u", s->ino);
-   printf(" sk:%llx", s->sk);
+   out(" ino:%u", s->ino);
+   out(" sk:%llx", s->sk);
 
if (s->mark)
-   printf(" fwmark:0x%x", s->mark);
+   out(" fwmark:0x%x", s->mark);
 }
 
 static void sock_addr_print_width(int addr_len, const char *addr, char *delim,
int port_len, const char *port, const char *ifname)
 {
if (ifname) {
-   printf("%*s%%%s%s%-*s ", addr_len, addr, ifname, delim,
-   port_len, port);
+   out("%*s%%%s%s%-*s ", addr_len, addr, ifname, delim,
+   port_len, port);
} else {
-   printf("%*s%s%-*s ", addr_len, addr, delim, port_len, port);
+   out("%*s%s%-*s ", addr_len, addr, delim, port_len, port);
}
 }
 
@@ -1793,12 +1803,12 @@ static void proc_ctx_print(struct sockstat *s)
if (find_entry(s->ino, &buf,
(show_proc_ctx & show_sock_ctx) ?
PROC_SOCK_CTX : PROC_CTX) > 0) {
-   printf(" users:(%s)", buf);
+   out(" users:(%s)", buf);
free(buf);
}
} else if (show_users) {
if (find_entry(s->ino, &buf, USERS) > 0) {
-   printf(" users:(%s)", buf);
+   out(" users:(%s)", buf);
free(buf);
}
}
@@ -1878,51 +1888,51 @@ static char *sprint_bw(char *buf, double bw)
 static void sctp_stats_print(struct sctp_info *s)
 {
if (s->sctpi_tag)
-   printf(" tag:%x", s->sctpi_tag);
+   out(" tag:%x", s->sctpi_tag);
if (s->sctpi_state)
-   printf(" state:%s", sctp_sstate_name[s->sctpi_state]);
+   out(" state:%s", sctp_sstate_name[s->sctpi_state]);
if (s->sctpi_rwnd)
-   printf(" rwnd:%d", s->sctpi_rwnd);
+   out(" rwnd:%d", s->sctpi_rwnd);
if (s->sctpi_unackdata)
-   printf(" unackdata:%d", s->sctpi_unackdata);
+   out(" unackdata:%d", s->sctpi_unackdata);
if (s->sctpi_penddata)
-   printf(" penddata:%d", s->sctpi_penddata);
+   out(" penddata:%d", s->sctpi_penddata);
if (s->sctpi_instrms)
-   printf(" instrms:%d", s->sctpi_instrms);
+   out(" instrms:%d", s->sctpi_instrms);
   

[PATCH iproute2 net-next v2 3/4] ss: Buffer raw fields first, then render them as a table

2017-12-11 Thread Stefano Brivio
This allows us to measure the maximum field length for each
column before printing fields and will permit us to apply
optimal field spacing and distribution. Structure of the output
buffer with chunked allocation is described in comments.

Output is still unchanged, original spacing is used.
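
The token layout described above can be sketched roughly as follows.
This is a simplified model and not the code from this patch: all
names (struct chunk, token_append() and so on) are hypothetical, the
chunk size is tiny to make chunk chaining easy to exercise, and the
chunk bookkeeping uses a byte counter rather than an end pointer.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK_SIZE 64	/* tiny on purpose; the patch uses 1 MiB chunks */

/* Hypothetical, simplified model of a chained output buffer */
struct chunk {
	struct chunk *next;		/* Next chained chunk */
	size_t used;			/* Bytes of data[] in use */
	unsigned char data[CHUNK_SIZE];
};

struct token_buf {
	struct chunk *head;		/* First chunk */
	struct chunk *tail;		/* Chunk currently being filled */
};

static struct chunk *chunk_new(void)
{
	struct chunk *c = calloc(1, sizeof(*c));

	if (!c)
		abort();
	return c;
}

/* Append one field as a token: 16-bit length, then the bytes, without a
 * terminator. Token size is rounded up to an even number of bytes, which
 * mirrors the idea behind LEN_ALIGN(). Returns 0 on success, -1 if the
 * field can never fit into a single chunk.
 */
static int token_append(struct token_buf *b, const char *s)
{
	size_t len = strlen(s);
	size_t need = sizeof(uint16_t) + ((len + 1) & ~(size_t)1);
	uint16_t l16;

	if (len > UINT16_MAX || need > CHUNK_SIZE)
		return -1;

	if (!b->head)
		b->head = b->tail = chunk_new();

	/* A token is never split: link a new chunk if it doesn't fit */
	if (b->tail->used + need > CHUNK_SIZE) {
		b->tail->next = chunk_new();
		b->tail = b->tail->next;
	}

	l16 = (uint16_t)len;
	memcpy(b->tail->data + b->tail->used, &l16, sizeof(l16));
	memcpy(b->tail->data + b->tail->used + sizeof(l16), s, len);
	b->tail->used += need;
	return 0;
}

/* Walk all tokens and return the longest field length seen */
static size_t token_max_len(const struct token_buf *b)
{
	size_t max = 0;

	for (const struct chunk *c = b->head; c; c = c->next) {
		size_t off = 0;

		while (off < c->used) {
			uint16_t l16;

			memcpy(&l16, c->data + off, sizeof(l16));
			if (l16 > max)
				max = l16;
			off += sizeof(l16) + (((size_t)l16 + 1) & ~(size_t)1);
		}
	}
	return max;
}

/* Count chained chunks */
static size_t token_chunks(const struct token_buf *b)
{
	size_t n = 0;

	for (const struct chunk *c = b->head; c; c = c->next)
		n++;
	return n;
}

/* Release all chunks */
static void token_free(struct token_buf *b)
{
	struct chunk *c = b->head;

	while (c) {
		struct chunk *next = c->next;

		free(c);
		c = next;
	}
	b->head = b->tail = NULL;
}
```

Walking the tokens after buffering is what makes it possible to
measure the longest field of a column before anything is printed.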

To estimate the overhead, I modified main() to loop 50,000 times
over the *_show() functions, with 10 UDP and 10 TCP sockets
open (about one million sockets in total), buffering the whole
output and rendering it at the end. Running this with the -tul
options while throwing the output away doesn't show significant
changes in execution time on my laptop with an Intel i7-6600U
CPU:

- before this patch:
$ time ./ss -tul > /dev/null
real    0m29.899s
user    0m2.017s
sys     0m27.801s

- after this patch:
$ time ./ss -tul > /dev/null
real    0m29.827s
user    0m1.942s
sys     0m27.812s

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
v2: rebase after conflict with 00ac78d39c29 ("ss: print tcpi_rcv_ssthresh")

 misc/ss.c | 271 +++---
 1 file changed, 225 insertions(+), 46 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 42310ba4120d..166267974c36 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -47,6 +47,8 @@
 #include 
 
 #define MAGIC_SEQ 123456
+#define BUF_CHUNK (1024 * 1024)
+#define LEN_ALIGN(x) (((x) + 1) & ~1)
 
 #define DIAG_REQUEST(_req, _r) \
struct {\
@@ -127,24 +129,45 @@ struct column {
const char *header;
const char *ldelim;
int width;  /* Including delimiter. -1: fit to content, 0: hide */
-   int stored; /* Characters buffered */
-   int printed;/* Characters printed so far */
 };
 
 static struct column columns[] = {
-   { ALIGN_LEFT,   "Netid","", 0,  0,  0 },
-   { ALIGN_LEFT,   "State"," ",0,  0,  0 },
-   { ALIGN_LEFT,   "Recv-Q",   " ",7,  0,  0 },
-   { ALIGN_LEFT,   "Send-Q",   " ",7,  0,  0 },
-   { ALIGN_RIGHT,  "Local Address:",   " ",0,  0,  0 },
-   { ALIGN_LEFT,   "Port", "", 0,  0,  0 },
-   { ALIGN_RIGHT,  "Peer Address:"," ",0,  0,  0 },
-   { ALIGN_LEFT,   "Port", "", 0,  0,  0 },
-   { ALIGN_LEFT,   "", "", -1, 0,  0 },
+   { ALIGN_LEFT,   "Netid","", 0 },
+   { ALIGN_LEFT,   "State"," ",0 },
+   { ALIGN_LEFT,   "Recv-Q",   " ",7 },
+   { ALIGN_LEFT,   "Send-Q",   " ",7 },
+   { ALIGN_RIGHT,  "Local Address:",   " ",0 },
+   { ALIGN_LEFT,   "Port", "", 0 },
+   { ALIGN_RIGHT,  "Peer Address:"," ",0 },
+   { ALIGN_LEFT,   "Port", "", 0 },
+   { ALIGN_LEFT,   "", "", -1 },
 };
 
 static struct column *current_field = columns;
-static char field_buf[BUFSIZ];
+
+/* Output buffer: chained chunks of BUF_CHUNK bytes. Each field is written to
+ * the buffer as a variable size token. A token consists of a 16 bits length
+ * field, followed by a string which is not NULL-terminated.
+ *
+ * A new chunk is allocated and linked when the current chunk doesn't have
+ * enough room to store the current token as a whole.
+ */
+struct buf_chunk {
+   struct buf_chunk *next; /* Next chained chunk */
+   char *end;  /* Current end of content */
+   char data[0];
+};
+
+struct buf_token {
+   uint16_t len;   /* Data length, excluding length descriptor */
+   char data[0];
+};
+
+static struct {
+   struct buf_token *cur;  /* Position of current token in chunk */
+   struct buf_chunk *head; /* First chunk */
+   struct buf_chunk *tail; /* Current chunk */
+} buffer;
 
 static const char *TCP_PROTO = "tcp";
 static const char *SCTP_PROTO = "sctp";
@@ -861,25 +884,109 @@ static const char *vsock_netid_name(int type)
}
 }
 
+/* Allocate and initialize a new buffer chunk */
+static struct buf_chunk *buf_chunk_new(void)
+{
+   struct buf_chunk *new = malloc(BUF_CHUNK);
+
+   if (!new)
+   abort();
+
+   new->next = NULL;
+
+   /* This is also the last block */
+   buffer.tail = new;
+
+   /* Next token will be stored at the beginning of chunk data area, and
+* its initial length is zero.
+*/
+   buffer.cur = (struct buf_token *)new->data;
+   buffer.cur->len = 0;
+
+   new->end = buffer.cur->data;
+
+   return new;
+}
+
+/* Return available tail room in given chunk */
+static int buf_chunk_avail(struct buf_chunk *chunk)
+{
+   return BUF_CHUNK - offsetof(struct buf_chunk, data) -
+  (chunk->end - chunk->data);
+}
+
+/* Update end pointer and 

[PATCH iproute2 net-next v2 4/4] ss: Implement automatic column width calculation

2017-12-11 Thread Stefano Brivio
Group fields into lines as they fit, and space them equally using
the remaining screen width on each line. If the columns don't fit
on one line, break them into the smallest possible number of lines
and keep them aligned across lines.

This is done by:
 - recording the length of the longest item in each column during
   formatting and buffering (which was added in the previous patch)
 - fitting as many fields as possible on each line of output
 - distributing the remaining padding space equally between the
   columns
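
As a rough illustration of the three steps above, here is a minimal
sketch. It is not the render_calc_width() function from this patch:
calc_widths() is a hypothetical helper, it takes the measured maximum
lengths as a plain array, and it doesn't try to keep columns aligned
across lines.

```c
#include <stddef.h>

/* Hypothetical helper: widths[i] starts as the measured maximum field
 * length of column i and is grown in place by the extra spacing.
 * Returns the number of output lines needed per record.
 */
static int calc_widths(int *widths, size_t n, int screen_width)
{
	size_t start = 0;
	int lines = 0;

	while (start < n) {
		size_t end = start;
		int used = 0;

		/* Fit as many columns as possible, at least one per line */
		while (end < n &&
		       (end == start || used + widths[end] <= screen_width)) {
			used += widths[end];
			end++;
		}

		/* Spread leftover space equally over this line's columns */
		if (used < screen_width) {
			int extra = (screen_width - used) / (int)(end - start);

			for (size_t i = start; i < end; i++)
				widths[i] += extra;
		}

		start = end;
		lines++;
	}

	return lines;
}
```

With three columns of 30 characters each on an 80 columns terminal,
this puts two columns on the first line (40 characters each) and the
third one on a second line.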

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
v2: rebase after conflict with 00ac78d39c29 ("ss: print tcpi_rcv_ssthresh")

 misc/ss.c | 188 +++---
 1 file changed, 120 insertions(+), 68 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index 166267974c36..9d21ed7a0705 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -128,19 +128,21 @@ struct column {
const enum col_align align;
const char *header;
const char *ldelim;
-   int width;  /* Including delimiter. -1: fit to content, 0: hide */
+   int disabled;
+   int width;  /* Calculated, including additional layout spacing */
+   int max_len;/* Measured maximum field length in this column */
 };
 
 static struct column columns[] = {
-   { ALIGN_LEFT,   "Netid","", 0 },
-   { ALIGN_LEFT,   "State"," ",0 },
-   { ALIGN_LEFT,   "Recv-Q",   " ",7 },
-   { ALIGN_LEFT,   "Send-Q",   " ",7 },
-   { ALIGN_RIGHT,  "Local Address:",   " ",0 },
-   { ALIGN_LEFT,   "Port", "", 0 },
-   { ALIGN_RIGHT,  "Peer Address:"," ",0 },
-   { ALIGN_LEFT,   "Port", "", 0 },
-   { ALIGN_LEFT,   "", "", -1 },
+   { ALIGN_LEFT,   "Netid","", 0, 0, 0 },
+   { ALIGN_LEFT,   "State"," ",0, 0, 0 },
+   { ALIGN_LEFT,   "Recv-Q",   " ",0, 0, 0 },
+   { ALIGN_LEFT,   "Send-Q",   " ",0, 0, 0 },
+   { ALIGN_RIGHT,  "Local Address:",   " ",0, 0, 0 },
+   { ALIGN_LEFT,   "Port", "", 0, 0, 0 },
+   { ALIGN_RIGHT,  "Peer Address:"," ",0, 0, 0 },
+   { ALIGN_LEFT,   "Port", "", 0, 0, 0 },
+   { ALIGN_LEFT,   "", "", 0, 0, 0 },
 };
 
 static struct column *current_field = columns;
@@ -960,7 +962,7 @@ static void out(const char *fmt, ...)
char *pos;
int len;
 
-   if (!f->width)
+   if (f->disabled)
return;
 
if (!buffer.head)
@@ -983,7 +985,7 @@ static int print_left_spacing(struct column *f, int stored, int printed)
 {
int s;
 
-   if (f->width < 0 || f->align == ALIGN_LEFT)
+   if (!f->width || f->align == ALIGN_LEFT)
return 0;
 
s = f->width - stored - printed;
@@ -1001,7 +1003,7 @@ static void print_right_spacing(struct column *f, int printed)
 {
int s;
 
-   if (f->width < 0 || f->align == ALIGN_RIGHT)
+   if (!f->width || f->align == ALIGN_RIGHT)
return;
 
s = f->width - printed;
@@ -1018,9 +1020,12 @@ static void field_flush(struct column *f)
struct buf_chunk *chunk = buffer.tail;
unsigned int pad = buffer.cur->len % 2;
 
-   if (!f->width)
+   if (f->disabled)
return;
 
+   if (buffer.cur->len > f->max_len)
+   f->max_len = buffer.cur->len;
+
/* We need a new chunk if we can't store the next length descriptor.
 * Mind the gap between end of previous token and next aligned position
 * for length descriptor.
@@ -1063,7 +1068,7 @@ static void field_set(enum col_id id)
 static void print_header(void)
 {
while (!field_is_last(current_field)) {
-   if (current_field->width)
+   if (!current_field->disabled)
out(current_field->header);
field_next();
}
@@ -1096,16 +1101,106 @@ static void buf_free_all(void)
buffer.head = NULL;
 }
 
+/* Calculate column width from contents length. If columns don't fit on one
+ * line, break them into the least possible amount of lines and keep them
+ * aligned across lines. Available screen space is equally spread between fields
+ * as additional spacing.
+ */
+static void render_calc_width(int screen_width)
+{
+   int first, len = 0, linecols = 0;
+   struct column *c, *eol = columns - 1;
+
+   /* First pass: set width for each column to measured content length */
+   for (first = 1, c = columns; c - columns < COL_MAX; c++) {
+   if (c->disabled)
+   continue;
+
+   if (!first && c->max_len)
+   c->width = c->max_len + strlen(c->ldelim);
+   else
+   c->width = c->max

[PATCH iproute2 net-next v2 2/4] ss: Introduce columns lightweight abstraction

2017-12-11 Thread Stefano Brivio
Instead of embedding spacing directly while printing contents,
logically declare columns and functions to buffer their content,
to print left and right spacing around fields, to flush them to
screen, and to print headers.

This makes it a bit easier to handle layout changes and prepares
for full output buffering, needed for optimal spacing in field
output layout.

Columns are currently set up to retain exactly the same output
as before. This needs some slight adjustments of the values
previously calculated in main(), as the width value introduced
here already includes the width of left delimiters and spacing
is not explicitly printed anymore whenever a field is printed.
These calculations will go away altogether once automatic width
calculation is implemented.

We can also remove explicit printing of newlines after the final
content for a given line is printed: flushing the last field on
a line will cause field_flush() to print newlines where
appropriate.

No changes in output expected here.
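
To illustrate how left and right spacing around a field relate to the
column width and alignment, here is a small standalone sketch. The
names (field_padding(), struct pad) are hypothetical and the
arithmetic is a simplified model: unlike the helpers in this patch,
it doesn't account for the delimiter that was already printed.

```c
/* Hypothetical, simplified model of per-field spacing */
enum align { A_LEFT, A_CENTER, A_RIGHT };

struct pad {
	int left;	/* Spaces printed before the content */
	int right;	/* Spaces printed after the content */
};

static struct pad field_padding(enum align a, int width, int len)
{
	struct pad p = { 0, 0 };
	int total = width - len;

	if (total <= 0)
		return p;	/* Content fills or overflows the column */

	switch (a) {
	case A_LEFT:
		p.right = total;
		break;
	case A_RIGHT:
		p.left = total;
		break;
	case A_CENTER:
		/* If total spacing is odd, shift content right by one */
		p.left = (total + 1) / 2;
		p.right = total - p.left;
		break;
	}

	return p;
}
```

A left-aligned field only gets right spacing, a right-aligned one
only left spacing, and a field that is wider than its column gets
none at all.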

Signed-off-by: Stefano Brivio 
Reviewed-by: Sabrina Dubroca 
---
v2: rebase after conflict with 00ac78d39c29 ("ss: print tcpi_rcv_ssthresh")

 misc/ss.c | 291 ++
 1 file changed, 198 insertions(+), 93 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index a7d3b89e1478..42310ba4120d 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -103,11 +103,48 @@ int show_header = 1;
 int follow_events;
 int sctp_ino;
 
-int netid_width;
-int state_width;
-int addr_width;
-int serv_width;
-char *odd_width_pad = "";
+enum col_id {
+   COL_NETID,
+   COL_STATE,
+   COL_RECVQ,
+   COL_SENDQ,
+   COL_ADDR,
+   COL_SERV,
+   COL_RADDR,
+   COL_RSERV,
+   COL_EXT,
+   COL_MAX
+};
+
+enum col_align {
+   ALIGN_LEFT,
+   ALIGN_CENTER,
+   ALIGN_RIGHT
+};
+
+struct column {
+   const enum col_align align;
+   const char *header;
+   const char *ldelim;
+   int width;  /* Including delimiter. -1: fit to content, 0: hide */
+   int stored; /* Characters buffered */
+   int printed;/* Characters printed so far */
+};
+
+static struct column columns[] = {
+   { ALIGN_LEFT,   "Netid","", 0,  0,  0 },
+   { ALIGN_LEFT,   "State"," ",0,  0,  0 },
+   { ALIGN_LEFT,   "Recv-Q",   " ",7,  0,  0 },
+   { ALIGN_LEFT,   "Send-Q",   " ",7,  0,  0 },
+   { ALIGN_RIGHT,  "Local Address:",   " ",0,  0,  0 },
+   { ALIGN_LEFT,   "Port", "", 0,  0,  0 },
+   { ALIGN_RIGHT,  "Peer Address:"," ",0,  0,  0 },
+   { ALIGN_LEFT,   "Port", "", 0,  0,  0 },
+   { ALIGN_LEFT,   "", "", -1, 0,  0 },
+};
+
+static struct column *current_field = columns;
+static char field_buf[BUFSIZ];
 
 static const char *TCP_PROTO = "tcp";
 static const char *SCTP_PROTO = "sctp";
@@ -826,13 +863,113 @@ static const char *vsock_netid_name(int type)
 
 static void out(const char *fmt, ...)
 {
+   struct column *f = current_field;
va_list args;
 
va_start(args, fmt);
-   vfprintf(stdout, fmt, args);
+   f->stored += vsnprintf(field_buf + f->stored, BUFSIZ - f->stored,
+  fmt, args);
va_end(args);
 }
 
+static int print_left_spacing(struct column *f)
+{
+   int s;
+
+   if (f->width < 0 || f->align == ALIGN_LEFT)
+   return 0;
+
+   s = f->width - f->stored - f->printed;
+   if (f->align == ALIGN_CENTER)
+   /* If count of total spacing is odd, shift right by one */
+   s = (s + 1) / 2;
+
+   if (s > 0)
+   return printf("%*c", s, ' ');
+
+   return 0;
+}
+
+static void print_right_spacing(struct column *f)
+{
+   int s;
+
+   if (f->width < 0 || f->align == ALIGN_RIGHT)
+   return;
+
+   s = f->width - f->printed;
+   if (f->align == ALIGN_CENTER)
+   s /= 2;
+
+   if (s > 0)
+   printf("%*c", s, ' ');
+}
+
+static int field_needs_delimiter(struct column *f)
+{
+   if (!f->stored)
+   return 0;
+
+   /* Was another field already printed on this line? */
+   for (f--; f >= columns; f--)
+   if (f->width)
+   return 1;
+
+   return 0;
+}
+
+/* Flush given field to screen together with delimiter and spacing */
+static void field_flush(struct column *f)
+{
+   if (!f->width)
+   return;
+
+   if (field_needs_delimiter(f))
+   f->printed = printf("%s", f->ldelim);
+
+   f->printed += print_left_spacing(f);
+   f->printed += printf("%s", field_buf);
+   print_right_spacing(f);
+
+   *field_buf = 0;
+   f->printed = 0;
+   f->stored = 0;
+}
+
+static int field_is_last(struct column *f)
+{
+ 

[PATCH iproute2 net-next v2 0/4] Abstract columns, properly space and wrap fields

2017-12-11 Thread Stefano Brivio
Currently, 'ss' simply subdivides the whole available screen width
between the available columns, starting from a set of hardcoded
spacing amounts and growing the column widths.

This makes the output unreadable in several cases, as it doesn't take
into account the actual content width.

Fix this by introducing a simple abstraction for columns, buffering
the output, measuring the width of the fields, grouping fields into
lines as they fit, equally distributing any remaining whitespace, and
finally rendering the result. Some examples are reported below [1].

This implementation doesn't seem to cause any significant performance
issues, as reported in 3/4.

Patch 1/4 replaces all relevant printf() calls by the out() helper,
which simply consists of the usual printf() implementation.

Patch 2/4 implements column abstraction, with configurable column
width and delimiters, and 3/4 splits buffering and rendering phases,
employing a simple buffering mechanism with chunked allocation and
introducing a rendering function.

Up to this point, the output is still unchanged.

Finally, 4/4 introduces field width calculation based on content
length measured while buffering, in order to split fields onto
multiple lines and equally space them within the single lines.

Now that column behaviour is well-defined and more easily
configurable, it should be easier to further improve the output by
splitting logically separable information (e.g. TCP details) into
additional columns. However, this patchset keeps the full "extended"
information in a single column for the time being.


v2: rebase after conflict with 00ac78d39c29 ("ss: print tcpi_rcv_ssthresh")


[1]

- 80 columns terminal, ss -Z -f netlink
  * before:
Recv-Q Send-Q Local Address:Port Peer Address:Port

0  0rtnl:evolution-calen/2075   * pr
oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
0  0rtnl:abrt-applet/32700  * pr
oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
0  0rtnl:firefox/21619  * pr
oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
0  0rtnl:evolution-calen/32639   * p
roc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[...]

  * after:
Recv-Q   Send-Q Local Address:Port  Peer Address:Port
00   rtnl:evolution-calen/2075  *
 proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
00   rtnl:abrt-applet/32700 *
 proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
00   rtnl:firefox/21619 *
 proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
00   rtnl:evolution-calen/32639 *
 proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[...]

- 80 columns terminal, ss -tunpl
  * before:
Netid  State  Recv-Q Send-Q Local Address:Port   Peer 
Address:Port
udpUNCONN 0  0 *:37732 *:*
udpUNCONN 0  0 *:5353  *:*
udpUNCONN 0  0  192.168.122.1:53*:*
udpUNCONN 0  0  *%virbr0:67*:*
[...]

  * after:
Netid   StateRecv-Q   Send-Q Local Address:Port  Peer Address:Port
udp UNCONN   00  *:37732*:*
udp UNCONN   00  *:5353 *:*
udp UNCONN   00  192.168.122.1:53   *:*
udp UNCONN   00   *%virbr0:67   *:*
[...]

 - 66 columns terminal, ss -tunpl
  * before:
Netid  State  Recv-Q Send-Q Local Address:Port   P
eer Address:Port
udpUNCONN 0  0   *:37732   *:*

udpUNCONN 0  0   *:5353*:*

udpUNCONN 0  0  192.168.122.1:53
*:*
udpUNCONN 0  0  *%virbr0:67  *:*
[...]

  * after:
Netid State  Recv-Q Send-Q Local Address:Port   Peer Address:Port
udp   UNCONN 0  0  *:37732 *:*
udp   UNCONN 0  0  *:5353  *:*
udp   UNCONN 0  0  192.168.122.1:53*:*
udp   UNCONN 0  0   *%virbr0:67*:*
[...]


Stefano Brivio (4):
  ss: Replace printf() calls for "main" output by calls to helper
  ss: Introduce columns lightweight abstraction
  ss: Buffer raw fields first, then render them as a table
  ss: Implement automatic column width calculation

 misc/ss.c | 895 +++---
 1 file changed, 621 insertions(+), 274 deletions(-)

-- 
2.9.4



Re: [PATCH] igb: Free IRQs when device is hotplugged

2017-12-11 Thread Lyude Paul
On Mon, 2017-12-11 at 16:34 -0800, Stephen Hemminger wrote:
> On Mon, 11 Dec 2017 18:45:02 -0500
> Lyude Paul  wrote:
> 
> > Recently I got a Caldigit TS3 Thunderbolt 3 dock, and noticed that upon
> > hotplugging my kernel would immediately crash due to igb:
> > 
> > [  680.825801] kernel BUG at drivers/pci/msi.c:352!
> > [  680.828388] invalid opcode:  [#1] SMP
> > [  680.829194] Modules linked in: igb(O) thunderbolt i2c_algo_bit joydev
> > vfat fat btusb btrtl btbcm btintel bluetooth ecdh_generic hp_wmi
> > sparse_keymap rfkill wmi_bmof iTCO_wdt intel_rapl x86_pkg_temp_thermal
> > coretemp crc32_pclmul snd_pcm rtsx_pci_ms mei_me snd_timer memstick snd
> > pcspkr mei soundcore i2c_i801 tpm_tis psmouse shpchp wmi tpm_tis_core tpm
> > video hp_wireless acpi_pad rtsx_pci_sdmmc mmc_core crc32c_intel serio_raw
> > rtsx_pci mfd_core xhci_pci xhci_hcd i2c_hid i2c_core [last unloaded: igb]
> > [  680.831085] CPU: 1 PID: 78 Comm: kworker/u16:1 Tainted:
> > G   O 4.15.0-rc3Lyude-Test+ #6
> > [  680.831596] Hardware name: HP HP ZBook Studio G4/826B, BIOS P71 Ver.
> > 01.03 06/09/2017
> > [  680.832168] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> > [  680.832687] RIP: 0010:free_msi_irqs+0x180/0x1b0
> > [  680.833271] RSP: 0018:c930fbf0 EFLAGS: 00010286
> > [  680.833761] RAX: 8803405f9c00 RBX: 88033e3d2e40 RCX:
> > 002c
> > [  680.834278] RDX:  RSI: 00ac RDI:
> > 880340be2178
> > [  680.834832] RBP:  R08: 880340be1ff0 R09:
> > 8803405f9c00
> > [  680.835342] R10:  R11: 0040 R12:
> > 88033d63a298
> > [  680.835822] R13: 88033d63a000 R14: 0060 R15:
> > 880341959000
> > [  680.836332] FS:  () GS:88034f44()
> > knlGS:
> > [  680.836817] CS:  0010 DS:  ES:  CR0: 80050033
> > [  680.837360] CR2: 55e64044afdf CR3: 01c09002 CR4:
> > 003606e0
> > [  680.837954] Call Trace:
> > [  680.838853]  pci_disable_msix+0xce/0xf0
> > [  680.839616]  igb_reset_interrupt_capability+0x5d/0x60 [igb]
> > [  680.840278]  igb_remove+0x9d/0x110 [igb]
> > [  680.840764]  pci_device_remove+0x36/0xb0
> > [  680.841279]  device_release_driver_internal+0x157/0x220
> > [  680.841739]  pci_stop_bus_device+0x7d/0xa0
> > [  680.842255]  pci_stop_bus_device+0x2b/0xa0
> > [  680.842722]  pci_stop_bus_device+0x3d/0xa0
> > [  680.843189]  pci_stop_and_remove_bus_device+0xe/0x20
> > [  680.843627]  trim_stale_devices+0xf3/0x140
> > [  680.844086]  trim_stale_devices+0x94/0x140
> > [  680.844532]  trim_stale_devices+0xa6/0x140
> > [  680.845031]  ? get_slot_status+0x90/0xc0
> > [  680.845536]  acpiphp_check_bridge.part.5+0xfe/0x140
> > [  680.846021]  acpiphp_hotplug_notify+0x175/0x200
> > [  680.846581]  ? free_bridge+0x100/0x100
> > [  680.847113]  acpi_device_hotplug+0x8a/0x490
> > [  680.847535]  acpi_hotplug_work_fn+0x1a/0x30
> > [  680.848076]  process_one_work+0x182/0x3a0
> > [  680.848543]  worker_thread+0x2e/0x380
> > [  680.848963]  ? process_one_work+0x3a0/0x3a0
> > [  680.849373]  kthread+0x111/0x130
> > [  680.849776]  ? kthread_create_worker_on_cpu+0x50/0x50
> > [  680.850188]  ret_from_fork+0x1f/0x30
> > [  680.850601] Code: 43 14 85 c0 0f 84 d5 fe ff ff 31 ed eb 0f 83 c5 01 39
> > 6b 14 0f 86 c5 fe ff ff 8b 7b 10 01 ef e8 b7 e4 d2 ff 48 83 78 70 00 74 e3
> > <0f> 0b 49 8d b5 a0 00 00 00 e8 62 6f d3 ff e9 c7 fe ff ff 48 8b
> > [  680.851497] RIP: free_msi_irqs+0x180/0x1b0 RSP: c930fbf0
> > 
> > As it turns out, normally the freeing of IRQs that would fix this is called
> > inside of the scope of __igb_close(). However, since the device is
> > already gone by the point we try to unregister the netdevice from the
> > driver due to a hotplug we end up seeing that the netif isn't present
> > and thus, forget to free any of the device IRQs.
> > 
> > So: after unregistering the netdev in igb_remove() check whether the PCI
> > device is stale and if so, free its IRQs and tx/rx resources.
> > 
> > Signed-off-by: Lyude Paul 
> > Fixes: 9474933caf21 ("igb: close/suspend race in netif_device_detach")
> > Cc: Todd Fujinaka 
> > Cc: sta...@vger.kernel.org
> > ---
> >  drivers/net/ethernet/intel/igb/igb_main.c | 10 ++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/drivers/net/ethernet/intel/igb/igb_main.c
> > b/drivers/net/ethernet/intel/igb/igb_main.c
> > index c208753ff5b7..e650348b4bd7 100644
> > --- a/drivers/net/ethernet/intel/igb/igb_main.c
> > +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> > @@ -3325,6 +3325,16 @@ static void igb_remove(struct pci_dev *pdev)
> >  
> > unregister_netdev(netdev);
> >  
> > +   /* If the PCI device has already been physically removed (e.g. user
> > +* unplugged a thunderbolt dock containing our hw) then the netif
> > will
> > +* already be down, so unregistering the netdev won't free the IRQs
> > +*/
> > +   if (!pci_device_is_present(pd

RE: [PATCH] Fix handling of verdicts after NF_QUEUE

2017-12-11 Thread Banerjee, Debabrata
> From: Pablo Neira Ayuso [mailto:pa...@netfilter.org]
> On Mon, Dec 11, 2017 at 06:30:24PM -0500, Debabrata Banerjee wrote:
> > +   } else {
> > +   /* Implicit handling for NF_STOLEN, as well as any other
> > +* non conventional verdicts.
> > +*/
> > +   ret = 0;
> 
> Another possibility (more simple?) would be this:
> 
> int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state) {
> struct nf_hook_entry *entry;
> unsigned int verdict;
> -   int ret = 0;
> +   int ret;
> 
> entry = rcu_dereference(state->hook_entries);
> next_hook:
> +   ret = 0;
> 
> Basically, make sure ret is set to zero when jumping to the next_hook label.

Many ways to fix it, but I thought including the comment was appropriate.
Happy to change it if we want simpler instead.
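
For reference, the stale return value problem can be shown with a toy
model. This is not the netfilter code: the toy_* names and verdicts
are made up for illustration, and it assumes a structure where a
"queue" step can set ret to 1 and jump back to the label.

```c
/* Toy verdicts, hypothetical and unrelated to the real NF_* values */
enum toy_verdict { TOY_ACCEPT, TOY_QUEUE, TOY_STOLEN };

/* Returns 1 if the caller should keep processing the packet, 0 if a
 * hook took ownership of it. With reset_ret == 0, a value of ret left
 * over from an earlier hook leaks into the final return value.
 */
static int toy_hook_slow(const enum toy_verdict *hooks, int n, int reset_ret)
{
	int i = 0;
	int ret = 0;

	if (n <= 0)
		return 1;

next_hook:
	if (reset_ret)
		ret = 0;	/* Clear ret whenever we jump to the label */

	switch (hooks[i]) {
	case TOY_ACCEPT:
		if (++i < n)
			goto next_hook;
		ret = 1;
		break;
	case TOY_QUEUE:
		ret = 1;	/* Pretend the queue step let the packet pass */
		if (++i < n)
			goto next_hook;
		break;
	default:
		/* TOY_STOLEN: a hook took the packet, ret is not touched */
		break;
	}

	return ret;
}
```

With the hooks { TOY_QUEUE, TOY_STOLEN }, the version without the
reset returns 1 for a packet that was already stolen, while the
version with the reset returns 0.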

-Deb


Re: [PATCH] igb: Free IRQs when device is hotplugged

2017-12-11 Thread Stephen Hemminger
On Mon, 11 Dec 2017 18:45:02 -0500
Lyude Paul  wrote:

> Recently I got a Caldigit TS3 Thunderbolt 3 dock, and noticed that upon
> hotplugging my kernel would immediately crash due to igb:
> 
> [  680.825801] kernel BUG at drivers/pci/msi.c:352!
> [  680.828388] invalid opcode:  [#1] SMP
> [  680.829194] Modules linked in: igb(O) thunderbolt i2c_algo_bit joydev vfat 
> fat btusb btrtl btbcm btintel bluetooth ecdh_generic hp_wmi sparse_keymap 
> rfkill wmi_bmof iTCO_wdt intel_rapl x86_pkg_temp_thermal coretemp 
> crc32_pclmul snd_pcm rtsx_pci_ms mei_me snd_timer memstick snd pcspkr mei 
> soundcore i2c_i801 tpm_tis psmouse shpchp wmi tpm_tis_core tpm video 
> hp_wireless acpi_pad rtsx_pci_sdmmc mmc_core crc32c_intel serio_raw rtsx_pci 
> mfd_core xhci_pci xhci_hcd i2c_hid i2c_core [last unloaded: igb]
> [  680.831085] CPU: 1 PID: 78 Comm: kworker/u16:1 Tainted: G   O 
> 4.15.0-rc3Lyude-Test+ #6
> [  680.831596] Hardware name: HP HP ZBook Studio G4/826B, BIOS P71 Ver. 01.03 
> 06/09/2017
> [  680.832168] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
> [  680.832687] RIP: 0010:free_msi_irqs+0x180/0x1b0
> [  680.833271] RSP: 0018:c930fbf0 EFLAGS: 00010286
> [  680.833761] RAX: 8803405f9c00 RBX: 88033e3d2e40 RCX: 
> 002c
> [  680.834278] RDX:  RSI: 00ac RDI: 
> 880340be2178
> [  680.834832] RBP:  R08: 880340be1ff0 R09: 
> 8803405f9c00
> [  680.835342] R10:  R11: 0040 R12: 
> 88033d63a298
> [  680.835822] R13: 88033d63a000 R14: 0060 R15: 
> 880341959000
> [  680.836332] FS:  () GS:88034f44() 
> knlGS:
> [  680.836817] CS:  0010 DS:  ES:  CR0: 80050033
> [  680.837360] CR2: 55e64044afdf CR3: 01c09002 CR4: 
> 003606e0
> [  680.837954] Call Trace:
> [  680.838853]  pci_disable_msix+0xce/0xf0
> [  680.839616]  igb_reset_interrupt_capability+0x5d/0x60 [igb]
> [  680.840278]  igb_remove+0x9d/0x110 [igb]
> [  680.840764]  pci_device_remove+0x36/0xb0
> [  680.841279]  device_release_driver_internal+0x157/0x220
> [  680.841739]  pci_stop_bus_device+0x7d/0xa0
> [  680.842255]  pci_stop_bus_device+0x2b/0xa0
> [  680.842722]  pci_stop_bus_device+0x3d/0xa0
> [  680.843189]  pci_stop_and_remove_bus_device+0xe/0x20
> [  680.843627]  trim_stale_devices+0xf3/0x140
> [  680.844086]  trim_stale_devices+0x94/0x140
> [  680.844532]  trim_stale_devices+0xa6/0x140
> [  680.845031]  ? get_slot_status+0x90/0xc0
> [  680.845536]  acpiphp_check_bridge.part.5+0xfe/0x140
> [  680.846021]  acpiphp_hotplug_notify+0x175/0x200
> [  680.846581]  ? free_bridge+0x100/0x100
> [  680.847113]  acpi_device_hotplug+0x8a/0x490
> [  680.847535]  acpi_hotplug_work_fn+0x1a/0x30
> [  680.848076]  process_one_work+0x182/0x3a0
> [  680.848543]  worker_thread+0x2e/0x380
> [  680.848963]  ? process_one_work+0x3a0/0x3a0
> [  680.849373]  kthread+0x111/0x130
> [  680.849776]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  680.850188]  ret_from_fork+0x1f/0x30
> [  680.850601] Code: 43 14 85 c0 0f 84 d5 fe ff ff 31 ed eb 0f 83 c5 01 39 6b 
> 14 0f 86 c5 fe ff ff 8b 7b 10 01 ef e8 b7 e4 d2 ff 48 83 78 70 00 74 e3 <0f> 
> 0b 49 8d b5 a0 00 00 00 e8 62 6f d3 ff e9 c7 fe ff ff 48 8b
> [  680.851497] RIP: free_msi_irqs+0x180/0x1b0 RSP: c930fbf0
> 
> As it turns out, normally the freeing of IRQs that would fix this is called
> inside of the scope of __igb_close(). However, since the device is
> already gone by the point we try to unregister the netdevice from the
> driver due to a hotplug we end up seeing that the netif isn't present
> and thus, forget to free any of the device IRQs.
> 
> So: after unregistering the netdev in igb_remove() check whether the PCI
> device is stale and if so, free its IRQs and tx/rx resources.
> 
> Signed-off-by: Lyude Paul 
> Fixes: 9474933caf21 ("igb: close/suspend race in netif_device_detach")
> Cc: Todd Fujinaka 
> Cc: sta...@vger.kernel.org
> ---
>  drivers/net/ethernet/intel/igb/igb_main.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
> b/drivers/net/ethernet/intel/igb/igb_main.c
> index c208753ff5b7..e650348b4bd7 100644
> --- a/drivers/net/ethernet/intel/igb/igb_main.c
> +++ b/drivers/net/ethernet/intel/igb/igb_main.c
> @@ -3325,6 +3325,16 @@ static void igb_remove(struct pci_dev *pdev)
>  
>   unregister_netdev(netdev);
>  
> + /* If the PCI device has already been physically removed (e.g. user
> +  * unplugged a thunderbolt dock containing our hw) then the netif will
> +  * already be down, so unregistering the netdev won't free the IRQs
> +  */
> + if (!pci_device_is_present(pdev)) {
> + igb_free_irq(adapter);
> + igb_free_all_tx_resources(adapter);
> + igb_free_all_rx_resources(adapter);
> + }
> +
>   igb_clear_interrupt_scheme(adapter);
>  
>   pci_iounm

[PATCH net-next v3 5/6] net: qualcomm: rmnet: Allow to configure flags for new devices

2017-12-11 Thread Subash Abhinov Kasiviswanathan
Add an option to configure the rmnet aggregation and command features
on device creation. This is achieved by using the vlan flags option.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index 46bb228..7a4c26e 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -177,11 +177,20 @@ static int rmnet_newlink(struct net *src_net, struct 
net_device *dev,
if (err)
goto err2;
 
-   netdev_dbg(dev, "data format [ingress 0x%08X]\n", ingress_format);
-   port->ingress_data_format = ingress_format;
port->rmnet_mode = mode;
 
hlist_add_head_rcu(&ep->hlnode, &port->muxed_ep[mux_id]);
+
+   if (data[IFLA_VLAN_FLAGS]) {
+   struct ifla_vlan_flags *flags;
+
+   flags = nla_data(data[IFLA_VLAN_FLAGS]);
+   ingress_format = flags->flags & flags->mask;
+   }
+
+   netdev_dbg(dev, "data format [ingress 0x%08X]\n", ingress_format);
+   port->ingress_data_format = ingress_format;
+
return 0;
 
 err2:
@@ -313,7 +322,8 @@ static int rmnet_rtnl_validate(struct nlattr *tb[], struct 
nlattr *data[],
 
 static size_t rmnet_get_size(const struct net_device *dev)
 {
-   return nla_total_size(2); /* IFLA_VLAN_ID */
+   return nla_total_size(2) /* IFLA_VLAN_ID */ +
+  nla_total_size(sizeof(struct ifla_vlan_flags)); /* 
IFLA_VLAN_FLAGS */
 }
 
 struct rtnl_link_ops rmnet_link_ops __read_mostly = {
-- 
1.9.1
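
The flag handling added to rmnet_newlink() above comes down to a mask-and-select on the two words carried in IFLA_VLAN_FLAGS. A minimal user-space sketch of that selection follows. It assumes a local stand-in struct (vlan_flags_sketch) with the same two 32-bit fields as struct ifla_vlan_flags, a hypothetical helper name (effective_ingress_format), and illustrative bit positions that are not taken from the final rmnet_private.h:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for struct ifla_vlan_flags: two 32-bit words (assumption). */
struct vlan_flags_sketch {
	uint32_t flags;
	uint32_t mask;
};

/* Illustrative bit positions only; the real values live in rmnet_private.h. */
#define SKETCH_FORMAT_DEAGGREGATION (1u << 2)
#define SKETCH_FORMAT_MAP_COMMANDS  (1u << 4)

/*
 * Hypothetical helper mirroring the patch: keep the default format when the
 * attribute is absent, otherwise use only the flag bits selected by the mask.
 */
static uint32_t effective_ingress_format(const struct vlan_flags_sketch *attr,
					 uint32_t default_format)
{
	if (!attr)
		return default_format;

	return attr->flags & attr->mask;
}
```

Note that, as in the patch, a supplied attribute replaces the default format rather than being OR-ed into it, so a caller that wants deaggregation must set that bit in both flags and mask.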



[PATCH net-next v3 1/6] net: qualcomm: rmnet: Remove the rmnet_map_results enum

2017-12-11 Thread Subash Abhinov Kasiviswanathan
Only the success and consumed entries were actually in use.
Use standard error codes instead.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 15 +++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h  |  9 -
 2 files changed, 3 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 08e4afc..1e1ea10 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -142,11 +142,11 @@ static int rmnet_map_egress_handler(struct sk_buff *skb,
 
skb->protocol = htons(ETH_P_MAP);
 
-   return RMNET_MAP_SUCCESS;
+   return 0;
 
 fail:
kfree_skb(skb);
-   return RMNET_MAP_CONSUMED;
+   return -ENOMEM;
 }
 
 static void
@@ -213,17 +213,8 @@ void rmnet_egress_handler(struct sk_buff *skb)
}
 
if (port->egress_data_format & RMNET_EGRESS_FORMAT_MAP) {
-   switch (rmnet_map_egress_handler(skb, port, mux_id, orig_dev)) {
-   case RMNET_MAP_CONSUMED:
+   if (rmnet_map_egress_handler(skb, port, mux_id, orig_dev))
return;
-
-   case RMNET_MAP_SUCCESS:
-   break;
-
-   default:
-   kfree_skb(skb);
-   return;
-   }
}
 
rmnet_vnd_tx_fixup(skb, orig_dev);
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
index 3af3fe7..4df359d 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h
@@ -30,15 +30,6 @@ struct rmnet_map_control_command {
};
 }  __aligned(1);
 
-enum rmnet_map_results {
-   RMNET_MAP_SUCCESS,
-   RMNET_MAP_CONSUMED,
-   RMNET_MAP_GENERAL_FAILURE,
-   RMNET_MAP_NOT_ENABLED,
-   RMNET_MAP_FAILED_AGGREGATION,
-   RMNET_MAP_FAILED_MUX
-};
-
 enum rmnet_map_commands {
RMNET_MAP_COMMAND_NONE,
RMNET_MAP_COMMAND_FLOW_DISABLE,
-- 
1.9.1



[PATCH net-next v3 0/6] net: qualcomm: rmnet: Configuration options

2017-12-11 Thread Subash Abhinov Kasiviswanathan
This series adds support for configuring features on rmnet devices.
The rmnet specific features to be configured here are aggregation and
control commands.

Patch 1 is a cleanup of return codes in the transmit path.
Patch 2 removes some redundant ingress and egress macros.
Patch 3 restricts the creation of rmnet dev to one dev per mux id for a
given real dev.
Patch 4 adds ethernet data path support.
Patches 5-6 add support for configuring features on new and existing
rmnet devices.

v1->v2:
The memory leak fixed as part of patch 1 is merged separately as
a896d94abd2c ("net: qualcomm: rmnet: Fix leak on transmit failure").
Fix a use-after-free in patch 4 when a packet with less headroom than the
ethernet header length is received.

v2->v3: 
Fix formatting problem in patch 5 in the return statement.

Subash Abhinov Kasiviswanathan (6):
  net: qualcomm: rmnet: Remove the rmnet_map_results enum
  net: qualcomm: rmnet: Remove the some redundant macros
  net: qualcomm: rmnet: Allow only one rmnet dev per muxid per real dev
  net: qualcomm: rmnet: Process packets over ethernet
  net: qualcomm: rmnet: Allow to configure flags for new devices
  net: qualcomm: rmnet: Allow to configure flags for existing devices

 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 64 ++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h |  1 -
 .../net/ethernet/qualcomm/rmnet/rmnet_handlers.c   | 42 +++---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_map.h|  9 ---
 .../net/ethernet/qualcomm/rmnet/rmnet_private.h| 10 +---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c|  3 +
 6 files changed, 78 insertions(+), 51 deletions(-)

-- 
1.9.1



[PATCH net-next v3 2/6] net: qualcomm: rmnet: Remove the some redundant macros

2017-12-11 Thread Subash Abhinov Kasiviswanathan
Multiplexing is always enabled when transmitting from a rmnet device,
so remove the redundant egress macros. De-multiplexing is always
enabled when receiving packets from a rmnet device, so remove those
ingress macros.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c   | 10 ++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h   |  1 -
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 19 +++
 drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h  | 10 ++
 4 files changed, 11 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index df21e90..46bb228 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -143,11 +143,7 @@ static int rmnet_newlink(struct net *src_net, struct 
net_device *dev,
 struct nlattr *tb[], struct nlattr *data[],
 struct netlink_ext_ack *extack)
 {
-   int ingress_format = RMNET_INGRESS_FORMAT_DEMUXING |
-RMNET_INGRESS_FORMAT_DEAGGREGATION |
-RMNET_INGRESS_FORMAT_MAP;
-   int egress_format = RMNET_EGRESS_FORMAT_MUXING |
-   RMNET_EGRESS_FORMAT_MAP;
+   int ingress_format = RMNET_INGRESS_FORMAT_DEAGGREGATION;
struct net_device *real_dev;
int mode = RMNET_EPMODE_VND;
struct rmnet_endpoint *ep;
@@ -181,9 +177,7 @@ static int rmnet_newlink(struct net *src_net, struct 
net_device *dev,
if (err)
goto err2;
 
-   netdev_dbg(dev, "data format [ingress 0x%08X] [egress 0x%08X]\n",
-  ingress_format, egress_format);
-   port->egress_data_format = egress_format;
+   netdev_dbg(dev, "data format [ingress 0x%08X]\n", ingress_format);
port->ingress_data_format = ingress_format;
port->rmnet_mode = mode;
 
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
index c19259e..2ea9fe3 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.h
@@ -33,7 +33,6 @@ struct rmnet_endpoint {
 struct rmnet_port {
struct net_device *dev;
u32 ingress_data_format;
-   u32 egress_data_format;
u8 nr_rmnet_devs;
u8 rmnet_mode;
struct hlist_head muxed_ep[RMNET_MAX_LOGICAL_EP];
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index 1e1ea10..a46053c 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -133,12 +133,10 @@ static int rmnet_map_egress_handler(struct sk_buff *skb,
if (!map_header)
goto fail;
 
-   if (port->egress_data_format & RMNET_EGRESS_FORMAT_MUXING) {
-   if (mux_id == 0xff)
-   map_header->mux_id = 0;
-   else
-   map_header->mux_id = mux_id;
-   }
+   if (mux_id == 0xff)
+   map_header->mux_id = 0;
+   else
+   map_header->mux_id = mux_id;
 
skb->protocol = htons(ETH_P_MAP);
 
@@ -178,8 +176,7 @@ rx_handler_result_t rmnet_rx_handler(struct sk_buff **pskb)
 
switch (port->rmnet_mode) {
case RMNET_EPMODE_VND:
-   if (port->ingress_data_format & RMNET_INGRESS_FORMAT_MAP)
-   rmnet_map_ingress_handler(skb, port);
+   rmnet_map_ingress_handler(skb, port);
break;
case RMNET_EPMODE_BRIDGE:
rmnet_bridge_handler(skb, port->bridge_ep);
@@ -212,10 +209,8 @@ void rmnet_egress_handler(struct sk_buff *skb)
return;
}
 
-   if (port->egress_data_format & RMNET_EGRESS_FORMAT_MAP) {
-   if (rmnet_map_egress_handler(skb, port, mux_id, orig_dev))
-   return;
-   }
+   if (rmnet_map_egress_handler(skb, port, mux_id, orig_dev))
+   return;
 
rmnet_vnd_tx_fixup(skb, orig_dev);
 
diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
index 49102f9..d214280 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_private.h
@@ -19,14 +19,8 @@
 #define RMNET_TX_QUEUE_LEN 1000
 
 /* Constants */
-#define RMNET_EGRESS_FORMAT_MAP BIT(1)
-#define RMNET_EGRESS_FORMAT_AGGREGATION BIT(2)
-#define RMNET_EGRESS_FORMAT_MUXING  BIT(3)
-
-#define RMNET_INGRESS_FORMAT_MAPBIT(1)
-#define RMNET_INGRESS_FORMAT_DEAGGREGATION  BIT(2)
-#define RMNET_INGRESS_FORMAT_DEMUXING   BIT(3)
-#define RMNET_INGRESS_FORMAT_MAP_COMMANDS   BIT(4)
+#define RMN

[PATCH net-next v3 6/6] net: qualcomm: rmnet: Allow to configure flags for existing devices

2017-12-11 Thread Subash Abhinov Kasiviswanathan
Add an option to configure the mux id, aggregation and command features
for existing rmnet devices. Implement the changelink netlink
operation for this.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 40 ++
 1 file changed, 40 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index 7a4c26e..cedacdd 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -320,6 +320,45 @@ static int rmnet_rtnl_validate(struct nlattr *tb[], struct 
nlattr *data[],
return 0;
 }
 
+static int rmnet_changelink(struct net_device *dev, struct nlattr *tb[],
+   struct nlattr *data[],
+   struct netlink_ext_ack *extack)
+{
+   struct rmnet_priv *priv = netdev_priv(dev);
+   struct net_device *real_dev;
+   struct rmnet_endpoint *ep;
+   struct rmnet_port *port;
+   u16 mux_id;
+
+   real_dev = __dev_get_by_index(dev_net(dev),
+ nla_get_u32(tb[IFLA_LINK]));
+
+   if (!real_dev || !dev || !rmnet_is_real_dev_registered(real_dev))
+   return -ENODEV;
+
+   port = rmnet_get_port_rtnl(real_dev);
+
+   if (data[IFLA_VLAN_ID]) {
+   mux_id = nla_get_u16(data[IFLA_VLAN_ID]);
+   ep = rmnet_get_endpoint(port, priv->mux_id);
+
+   hlist_del_init_rcu(&ep->hlnode);
+   hlist_add_head_rcu(&ep->hlnode, &port->muxed_ep[mux_id]);
+
+   ep->mux_id = mux_id;
+   priv->mux_id = mux_id;
+   }
+
+   if (data[IFLA_VLAN_FLAGS]) {
+   struct ifla_vlan_flags *flags;
+
+   flags = nla_data(data[IFLA_VLAN_FLAGS]);
+   port->ingress_data_format = flags->flags & flags->mask;
+   }
+
+   return 0;
+}
+
 static size_t rmnet_get_size(const struct net_device *dev)
 {
return nla_total_size(2) /* IFLA_VLAN_ID */ +
@@ -335,6 +374,7 @@ struct rtnl_link_ops rmnet_link_ops __read_mostly = {
.newlink= rmnet_newlink,
.dellink= rmnet_dellink,
.get_size   = rmnet_get_size,
+   .changelink = rmnet_changelink,
 };
 
 /* Needs either rcu_read_lock() or rtnl lock */
-- 
1.9.1



[PATCH net-next v3 4/6] net: qualcomm: rmnet: Process packets over ethernet

2017-12-11 Thread Subash Abhinov Kasiviswanathan
Add support to send and receive packets over ethernet.
An example of usage is testing the data path on UML. This can be
achieved by setting up two UML instances in multicast mode and
associating rmnet over the UML ethernet device.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
index a46053c..0553932 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_handlers.c
@@ -15,6 +15,7 @@
 
 #include 
 #include 
+#include 
 #include "rmnet_private.h"
 #include "rmnet_config.h"
 #include "rmnet_vnd.h"
@@ -104,6 +105,15 @@ static void rmnet_set_skb_proto(struct sk_buff *skb)
 {
struct sk_buff *skbn;
 
+   if (skb->dev->type == ARPHRD_ETHER) {
+   if (pskb_expand_head(skb, ETH_HLEN, 0, GFP_KERNEL)) {
+   kfree_skb(skb);
+   return;
+   }
+
+   skb_push(skb, ETH_HLEN);
+   }
+
if (port->ingress_data_format & RMNET_INGRESS_FORMAT_DEAGGREGATION) {
while ((skbn = rmnet_map_deaggregate(skb)) != NULL)
__rmnet_map_ingress_handler(skbn, port);
-- 
1.9.1



[PATCH net-next v3 3/6] net: qualcomm: rmnet: Allow only one rmnet dev per muxid per real dev

2017-12-11 Thread Subash Abhinov Kasiviswanathan
Upon de-multiplexing data from one real dev, the packets can be sent
to a unique rmnet device for a given mux id.

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c 
b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
index 9caa5e3..5bb29f4 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_vnd.c
@@ -185,6 +185,9 @@ int rmnet_vnd_newlink(u8 id, struct net_device *rmnet_dev,
if (ep->egress_dev)
return -EINVAL;
 
+   if (rmnet_get_endpoint(port, id))
+   return -EBUSY;
+
rc = register_netdevice(rmnet_dev);
if (!rc) {
ep->egress_dev = rmnet_dev;
-- 
1.9.1



Re: [PATCH] Fix handling of verdicts after NF_QUEUE

2017-12-11 Thread Pablo Neira Ayuso
Hi,

Thanks for catching this, see below.

On Mon, Dec 11, 2017 at 06:30:24PM -0500, Debabrata Banerjee wrote:
> A verdict of NF_STOLEN after NF_QUEUE will cause an incorrect return value
> and a potential kernel panic via double free of skb's
> 
> This was broken by commit 7034b566a4e7 ("netfilter: fix nf_queue handling")
> and subsequently fixed in v4.10 by commit c63cbc460419 ("netfilter:
> use switch() to handle verdict cases from nf_hook_slow()"). However that
> commit cannot be cleanly cherry-picked to v4.9
> 
> Signed-off-by: Debabrata Banerjee 
> 
> ---
> 
> This fix is only needed for v4.9 stable since v4.10+ does not have the
> issue
> ---
>  net/netfilter/core.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
> index 004af030ef1a..d869ea50623e 100644
> --- a/net/netfilter/core.c
> +++ b/net/netfilter/core.c
> @@ -364,6 +364,11 @@ int nf_hook_slow(struct sk_buff *skb, struct 
> nf_hook_state *state)
>   ret = nf_queue(skb, state, &entry, verdict);
>   if (ret == 1 && entry)
>   goto next_hook;
> + } else {
> + /* Implicit handling for NF_STOLEN, as well as any other
> +  * non conventional verdicts.
> +  */
> + ret = 0;

Another possibility (simpler?) would be this:

int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state)
{
struct nf_hook_entry *entry;
unsigned int verdict;
-   int ret = 0;
+   int ret;

entry = rcu_dereference(state->hook_entries);
next_hook:
+   ret = 0;

Basically, make sure ret is set to zero when jumping to the next_hook
label.

Thanks!
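
To illustrate why resetting ret at the next_hook label is enough, here is a toy user-space model of the control flow. It is only a sketch under stated assumptions: the verdict enum, the function name hook_slow_sketch and the reset_at_label switch are hypothetical, and a queue verdict is assumed to behave like nf_queue() returning 1 with a further entry to run.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the real netfilter verdicts. */
enum sketch_verdict { SK_ACCEPT, SK_QUEUE, SK_STOLEN };

/*
 * Each loop iteration stands for one pass through the next_hook label.
 * reset_at_label selects between the v4.9 behaviour (0) and the variant
 * that sets ret to zero at the label (1).
 */
static int hook_slow_sketch(const enum sketch_verdict *verdicts, size_t n,
			    int reset_at_label)
{
	int ret = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (reset_at_label)
			ret = 0;

		switch (verdicts[i]) {
		case SK_ACCEPT:
			return 1;	/* caller continues with the skb */
		case SK_QUEUE:
			ret = 1;	/* models nf_queue() returning 1 */
			break;		/* "goto next_hook" */
		case SK_STOLEN:
			return ret;	/* no branch assigns ret here */
		}
	}

	return ret;
}
```

With the sequence { SK_QUEUE, SK_STOLEN } the model returns 1 when ret is not reset, so the caller would keep using an skb it no longer owns, and 0 when ret is reset at the label.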


Re: [PATCH iproute2 net-next 0/4] Abstract columns, properly space and wrap fields

2017-12-11 Thread Stephen Hemminger
On Fri,  8 Dec 2017 18:07:19 +0100
Stefano Brivio  wrote:

> Currently, 'ss' simply subdivides the whole available screen width
> between available columns, starting from a set of hardcoded amount
> of spacing and growing column widths.
> 
> This makes the output unreadable in several cases, as it doesn't take
> into account the actual content width.
> 
> Fix this by introducing a simple abstraction for columns, buffering
> the output, measuring the width of the fields, grouping fields into
> lines as they fit, equally distributing any remaining whitespace, and
> finally rendering the result. Some examples are reported below [1].
> 
> This implementation doesn't seem to cause any significant performance
> issues, as reported in 3/4.
> 
> Patch 1/4 replaces all relevant printf() calls by the out() helper,
> which simply consists of the usual printf() implementation.
> 
> Patch 2/4 implements column abstraction, with configurable column
> width and delimiters, and 3/4 splits buffering and rendering phases,
> employing a simple buffering mechanism with chunked allocation and
> introducing a rendering function.
> 
> Up to this point, the output is still unchanged.
> 
> Finally, 4/4 introduces field width calculation based on content
> length measured while buffering, in order to split fields onto
> multiple lines and equally space them within the single lines.
> 
> Now that column behaviour is well-defined and more easily
> configurable, it should be easier to further improve the output by
> splitting logically separable information (e.g. TCP details) into
> additional columns. However, this patchset keeps the full "extended"
> information into a single column, for the moment being.
> 
> 
> [1]
> 
> - 80 columns terminal, ss -Z -f netlink
>   * before:
> Recv-Q Send-Q Local Address:Port Peer Address:Port
> 
> 0  0rtnl:evolution-calen/2075   * 
> pr
> oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> 0  0rtnl:abrt-applet/32700  * 
> pr
> oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> 0  0rtnl:firefox/21619  * 
> pr
> oc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> 0  0rtnl:evolution-calen/32639   *
>  p
> roc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [...]
> 
>   * after:
> Recv-Q   Send-Q Local Address:Port  Peer Address:Port
> 00   rtnl:evolution-calen/2075  *
>  proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> 00   rtnl:abrt-applet/32700 *
>  proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> 00   rtnl:firefox/21619 *
>  proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> 00   rtnl:evolution-calen/32639 *
>  proc_ctx=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> [...]
> 
> - 80 columns terminal, ss -tunpl
>   * before:
> Netid  State  Recv-Q Send-Q Local Address:Port   Peer 
> Address:Port
> udpUNCONN 0  0 *:37732 *:*
> udpUNCONN 0  0 *:5353  *:*
> udpUNCONN 0  0  192.168.122.1:53*:*
> udpUNCONN 0  0  *%virbr0:67*:*
> [...]
> 
>   * after:
> Netid   StateRecv-Q   Send-Q Local Address:Port  Peer Address:Port
> udp UNCONN   00  *:37732*:*
> udp UNCONN   00  *:5353 *:*
> udp UNCONN   00  192.168.122.1:53   *:*
> udp UNCONN   00   *%virbr0:67   *:*
> [...]
> 
>  - 66 columns terminal, ss -tunpl
>   * before:
> Netid  State  Recv-Q Send-Q Local Address:Port   P
> eer Address:Port
> udpUNCONN 0  0   *:37732   *:*
> 
> udpUNCONN 0  0   *:5353*:*
> 
> udpUNCONN 0  0  192.168.122.1:53
> *:*
> udpUNCONN 0  0  *%virbr0:67  *:*
> [...]
> 
>   * after:
> Netid State  Recv-Q Send-Q Local Address:Port   Peer Address:Port
> udp   UNCONN 0  0  *:37732 *:*
> udp   UNCONN 0  0  *:5353  *:*
> udp   UNCONN 0  0  192.168.122.1:53*:*
> udp   UNCONN 0  0   *%virbr0:67*:*
> [...]
> 
> 
> Stefano Brivio (4):
>   ss: Replace printf() calls for "main" output by calls to helper
>   ss: Introduce columns lightweight abstraction
>   ss: Buffer raw fields first, then render them as a table
>   ss: Implement automatic column width calculation

I was going to apply

Re: [PATCH iproute2 1/1] ss: remove duplicate assignment

2017-12-11 Thread Stephen Hemminger
On Mon, 11 Dec 2017 16:24:31 -0500
Roman Mashak  wrote:

> Signed-off-by: Roman Mashak 


Applied and added Fixes: 8250bc9ff4e5 ("ss: Unify inet sockets output")


[PATCH] igb: Free IRQs when device is hotplugged

2017-12-11 Thread Lyude Paul
Recently I got a Caldigit TS3 Thunderbolt 3 dock, and noticed that upon
hotplugging my kernel would immediately crash due to igb:

[  680.825801] kernel BUG at drivers/pci/msi.c:352!
[  680.828388] invalid opcode:  [#1] SMP
[  680.829194] Modules linked in: igb(O) thunderbolt i2c_algo_bit joydev vfat 
fat btusb btrtl btbcm btintel bluetooth ecdh_generic hp_wmi sparse_keymap 
rfkill wmi_bmof iTCO_wdt intel_rapl x86_pkg_temp_thermal coretemp crc32_pclmul 
snd_pcm rtsx_pci_ms mei_me snd_timer memstick snd pcspkr mei soundcore i2c_i801 
tpm_tis psmouse shpchp wmi tpm_tis_core tpm video hp_wireless acpi_pad 
rtsx_pci_sdmmc mmc_core crc32c_intel serio_raw rtsx_pci mfd_core xhci_pci 
xhci_hcd i2c_hid i2c_core [last unloaded: igb]
[  680.831085] CPU: 1 PID: 78 Comm: kworker/u16:1 Tainted: G   O 
4.15.0-rc3Lyude-Test+ #6
[  680.831596] Hardware name: HP HP ZBook Studio G4/826B, BIOS P71 Ver. 01.03 
06/09/2017
[  680.832168] Workqueue: kacpi_hotplug acpi_hotplug_work_fn
[  680.832687] RIP: 0010:free_msi_irqs+0x180/0x1b0
[  680.833271] RSP: 0018:c930fbf0 EFLAGS: 00010286
[  680.833761] RAX: 8803405f9c00 RBX: 88033e3d2e40 RCX: 002c
[  680.834278] RDX:  RSI: 00ac RDI: 880340be2178
[  680.834832] RBP:  R08: 880340be1ff0 R09: 8803405f9c00
[  680.835342] R10:  R11: 0040 R12: 88033d63a298
[  680.835822] R13: 88033d63a000 R14: 0060 R15: 880341959000
[  680.836332] FS:  () GS:88034f44() 
knlGS:
[  680.836817] CS:  0010 DS:  ES:  CR0: 80050033
[  680.837360] CR2: 55e64044afdf CR3: 01c09002 CR4: 003606e0
[  680.837954] Call Trace:
[  680.838853]  pci_disable_msix+0xce/0xf0
[  680.839616]  igb_reset_interrupt_capability+0x5d/0x60 [igb]
[  680.840278]  igb_remove+0x9d/0x110 [igb]
[  680.840764]  pci_device_remove+0x36/0xb0
[  680.841279]  device_release_driver_internal+0x157/0x220
[  680.841739]  pci_stop_bus_device+0x7d/0xa0
[  680.842255]  pci_stop_bus_device+0x2b/0xa0
[  680.842722]  pci_stop_bus_device+0x3d/0xa0
[  680.843189]  pci_stop_and_remove_bus_device+0xe/0x20
[  680.843627]  trim_stale_devices+0xf3/0x140
[  680.844086]  trim_stale_devices+0x94/0x140
[  680.844532]  trim_stale_devices+0xa6/0x140
[  680.845031]  ? get_slot_status+0x90/0xc0
[  680.845536]  acpiphp_check_bridge.part.5+0xfe/0x140
[  680.846021]  acpiphp_hotplug_notify+0x175/0x200
[  680.846581]  ? free_bridge+0x100/0x100
[  680.847113]  acpi_device_hotplug+0x8a/0x490
[  680.847535]  acpi_hotplug_work_fn+0x1a/0x30
[  680.848076]  process_one_work+0x182/0x3a0
[  680.848543]  worker_thread+0x2e/0x380
[  680.848963]  ? process_one_work+0x3a0/0x3a0
[  680.849373]  kthread+0x111/0x130
[  680.849776]  ? kthread_create_worker_on_cpu+0x50/0x50
[  680.850188]  ret_from_fork+0x1f/0x30
[  680.850601] Code: 43 14 85 c0 0f 84 d5 fe ff ff 31 ed eb 0f 83 c5 01 39 6b 
14 0f 86 c5 fe ff ff 8b 7b 10 01 ef e8 b7 e4 d2 ff 48 83 78 70 00 74 e3 <0f> 0b 
49 8d b5 a0 00 00 00 e8 62 6f d3 ff e9 c7 fe ff ff 48 8b
[  680.851497] RIP: free_msi_irqs+0x180/0x1b0 RSP: c930fbf0

As it turns out, the IRQs are normally freed from within __igb_close().
However, after a hot-unplug the device is already gone by the time we
unregister the netdevice from the driver, so we end up seeing that the
netif isn't present and thus never free any of the device IRQs.

So: after unregistering the netdev in igb_remove() check whether the PCI
device is stale and if so, free its IRQs and tx/rx resources.

Signed-off-by: Lyude Paul 
Fixes: 9474933caf21 ("igb: close/suspend race in netif_device_detach")
Cc: Todd Fujinaka 
Cc: sta...@vger.kernel.org
---
 drivers/net/ethernet/intel/igb/igb_main.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index c208753ff5b7..e650348b4bd7 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -3325,6 +3325,16 @@ static void igb_remove(struct pci_dev *pdev)
 
unregister_netdev(netdev);
 
+   /* If the PCI device has already been physically removed (e.g. user
+* unplugged a thunderbolt dock containing our hw) then the netif will
+* already be down, so unregistering the netdev won't free the IRQs
+*/
+   if (!pci_device_is_present(pdev)) {
+   igb_free_irq(adapter);
+   igb_free_all_tx_resources(adapter);
+   igb_free_all_rx_resources(adapter);
+   }
+
igb_clear_interrupt_scheme(adapter);
 
pci_iounmap(pdev, adapter->io_addr);
-- 
2.14.3



[PATCH net-next] tcp: allow TLP in ECN CWR

2017-12-11 Thread Yuchung Cheng
From: Neal Cardwell 

This patch enables tail loss probe in cwnd reduction (CWR) state
to detect potential losses. Prior to this patch, since the sender
uses PRR to determine the cwnd in CWR state, the combination of
CWR+PRR plus tcp_tso_should_defer() could cause unnecessary stalls
upon losses: PRR makes cwnd so gentle that tcp_tso_should_defer()
defers sending to wait for more ACKs. The ACKs may not come due to
packet losses.

Disallowing TLP when there is unused cwnd had the primary effect
of disallowing TLP when there is TSO deferral, Nagle deferral,
or we hit the rwin limit. Because basically every application
write() or incoming ACK will cause us to run tcp_write_xmit()
to see if we can send more, and then if we sent something we call
tcp_schedule_loss_probe() to see if we should schedule a TLP. At
that point, there are a few common reasons why some cwnd budget
could still be unused:

(a) rwin limit
(b) nagle check
(c) TSO deferral
(d) TSQ

For (d), after the next packet tx completion the TSQ mechanism
will allow us to send more packets, so we don't really need a
TLP (in practice it shouldn't matter whether we schedule one
or not). But for (a), (b), (c) the sender won't send any more
packets until it gets another ACK. But if the whole flight was
lost, or all the ACKs were lost, then we won't get any more ACKs,
and ideally we should schedule and send a TLP to get more feedback.
In particular for a long time we have wanted some kind of timer for
TSO deferral, and at least this would give us some kind of timer.

Reported-by: Steve Ibanez 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Reviewed-by: Nandita Dukkipati 
Reviewed-by: Eric Dumazet 
---
 net/ipv4/tcp_output.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index a4d214c7b506..04be9f833927 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2414,15 +2414,12 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool 
advancing_rto)
 
early_retrans = sock_net(sk)->ipv4.sysctl_tcp_early_retrans;
/* Schedule a loss probe in 2*RTT for SACK capable connections
-* in Open state, that are either limited by cwnd or application.
+* not in loss recovery, that are either limited by cwnd or application.
 */
if ((early_retrans != 3 && early_retrans != 4) ||
!tp->packets_out || !tcp_is_sack(tp) ||
-   icsk->icsk_ca_state != TCP_CA_Open)
-   return false;
-
-   if ((tp->snd_cwnd > tcp_packets_in_flight(tp)) &&
-!tcp_write_queue_empty(sk))
+   (icsk->icsk_ca_state != TCP_CA_Open &&
+icsk->icsk_ca_state != TCP_CA_CWR))
return false;
 
/* Probe timeout is 2*rtt. Add minimum RTO to account
-- 
2.15.1.424.g9478a66081-goog
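
The effect of the hunk above on the early-return checks can be sketched as two predicates over plain values. This is a user-space model only: the struct tlp_inputs, the enum and the function names are hypothetical, and the rest of tcp_schedule_loss_probe() (timeout computation, arming the timer) is not modeled.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the relevant congestion-avoidance states. */
enum sketch_ca_state { SKETCH_CA_OPEN, SKETCH_CA_CWR, SKETCH_CA_RECOVERY };

/* Plain-value snapshot of the fields the checks look at (assumption). */
struct tlp_inputs {
	int early_retrans;
	unsigned int packets_out;
	bool sack_ok;
	enum sketch_ca_state ca_state;
	unsigned int snd_cwnd;
	unsigned int in_flight;
	bool write_queue_empty;
};

/* Conditions that are unchanged by the patch. */
static bool tlp_common_ok(const struct tlp_inputs *in)
{
	return (in->early_retrans == 3 || in->early_retrans == 4) &&
	       in->packets_out && in->sack_ok;
}

/* Before: Open state only, and no unused cwnd with data still queued. */
static bool tlp_allowed_before(const struct tlp_inputs *in)
{
	if (!tlp_common_ok(in) || in->ca_state != SKETCH_CA_OPEN)
		return false;
	if (in->snd_cwnd > in->in_flight && !in->write_queue_empty)
		return false;
	return true;
}

/* After: Open or CWR state, and the unused-cwnd check is gone. */
static bool tlp_allowed_after(const struct tlp_inputs *in)
{
	return tlp_common_ok(in) &&
	       (in->ca_state == SKETCH_CA_OPEN ||
		in->ca_state == SKETCH_CA_CWR);
}
```

In this model a connection in CWR, or one in Open with unused cwnd and data still queued (cases (a) to (c) in the commit message), is rejected by the old predicate and accepted by the new one.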



Re: [PATCH net-next v5 2/2] net: thunderx: add timestamping support

2017-12-11 Thread Richard Cochran
On Mon, Dec 11, 2017 at 05:14:31PM +0300, Aleksey Makarov wrote:
> @@ -880,6 +889,46 @@ static void nic_pause_frame(struct nicpf *nic, int vf, 
> struct pfc *cfg)
>   }
>  }
>  
> +/* Enable or disable HW timestamping by BGX for pkts received on a LMAC */
> +static void nic_config_timestamp(struct nicpf *nic, int vf, struct set_ptp 
> *ptp)
> +{
> + struct pkind_cfg *pkind;
> + u8 lmac, bgx_idx;
> + u64 pkind_val, pkind_idx;
> +
> + if (vf >= nic->num_vf_en)
> + return;
> +
> + bgx_idx = NIC_GET_BGX_FROM_VF_LMAC_MAP(nic->vf_lmac_map[vf]);
> + lmac = NIC_GET_LMAC_FROM_VF_LMAC_MAP(nic->vf_lmac_map[vf]);
> +
> + pkind_idx = lmac + bgx_idx * MAX_LMAC_PER_BGX;
> + pkind_val = nic_reg_read(nic, NIC_PF_PKIND_0_15_CFG | (pkind_idx << 3));
> + pkind = (struct pkind_cfg *)&pkind_val;
> +
> + if (ptp->enable && !pkind->hdr_sl) {
> + /* Skiplen to exclude 8byte timestamp while parsing pkt
> +  * If not configured, will result in L2 errors.
> +  */
> + pkind->hdr_sl = 4;
> + /* Adjust max packet length allowed */
> + pkind->maxlen += (pkind->hdr_sl * 2);
> + bgx_config_timestamping(nic->node, bgx_idx, lmac, true);
> + nic_reg_write(nic,
> +   NIC_PF_RX_ETYPE_0_7 | (1 << 3),
> +   (ETYPE_ALG_ENDPARSE << 16) | ETH_P_1588);

don't need three lines for this function call.

> + } else if (!ptp->enable && pkind->hdr_sl) {
> + pkind->maxlen -= (pkind->hdr_sl * 2);
> + pkind->hdr_sl = 0;
> + bgx_config_timestamping(nic->node, bgx_idx, lmac, false);
> + nic_reg_write(nic,
> +   NIC_PF_RX_ETYPE_0_7 | (1 << 3),
> +   (1ULL << 16) | ETH_P_8021Q); /* reset value */

here neither.  Also avoid comment on the LHS.  If 1<<16 means "reset"
then just define a macro.

> + }
> +
> + nic_reg_write(nic, NIC_PF_PKIND_0_15_CFG | (pkind_idx << 3), pkind_val);
> +}
> +

Thanks,
Richard


[Patch net-next] net_sched: switch to exit_batch for action pernet ops

2017-12-11 Thread Cong Wang
Since we now hold the RTNL lock in tc_action_net_exit(), it is good to
batch the exits to speed up tc action dismantle.

Cc: Jamal Hadi Salim 
Cc: Jiri Pirko 
Signed-off-by: Cong Wang 
---
 include/net/act_api.h  | 13 ++---
 net/sched/act_bpf.c|  8 +++-
 net/sched/act_connmark.c   |  8 +++-
 net/sched/act_csum.c   |  8 +++-
 net/sched/act_gact.c   |  8 +++-
 net/sched/act_ife.c|  8 +++-
 net/sched/act_ipt.c| 16 ++--
 net/sched/act_mirred.c |  8 +++-
 net/sched/act_nat.c|  8 +++-
 net/sched/act_pedit.c  |  8 +++-
 net/sched/act_police.c |  8 +++-
 net/sched/act_sample.c |  8 +++-
 net/sched/act_simple.c |  8 +++-
 net/sched/act_skbedit.c|  8 +++-
 net/sched/act_skbmod.c |  8 +++-
 net/sched/act_tunnel_key.c |  8 +++-
 net/sched/act_vlan.c   |  8 +++-
 17 files changed, 61 insertions(+), 88 deletions(-)

diff --git a/include/net/act_api.h b/include/net/act_api.h
index 02bf409140d0..6ed9692f20bd 100644
--- a/include/net/act_api.h
+++ b/include/net/act_api.h
@@ -120,12 +120,19 @@ int tc_action_net_init(struct tc_action_net *tn,
 void tcf_idrinfo_destroy(const struct tc_action_ops *ops,
 struct tcf_idrinfo *idrinfo);
 
-static inline void tc_action_net_exit(struct tc_action_net *tn)
+static inline void tc_action_net_exit(struct list_head *net_list,
+ unsigned int id)
 {
+   struct net *net;
+
rtnl_lock();
-   tcf_idrinfo_destroy(tn->ops, tn->idrinfo);
+   list_for_each_entry(net, net_list, exit_list) {
+   struct tc_action_net *tn = net_generic(net, id);
+
+   tcf_idrinfo_destroy(tn->ops, tn->idrinfo);
+   kfree(tn->idrinfo);
+   }
rtnl_unlock();
-   kfree(tn->idrinfo);
 }
 
 int tcf_generic_walker(struct tc_action_net *tn, struct sk_buff *skb,
diff --git a/net/sched/act_bpf.c b/net/sched/act_bpf.c
index e6c477fa9ca5..b3f2c15affa7 100644
--- a/net/sched/act_bpf.c
+++ b/net/sched/act_bpf.c
@@ -401,16 +401,14 @@ static __net_init int bpf_init_net(struct net *net)
return tc_action_net_init(tn, &act_bpf_ops);
 }
 
-static void __net_exit bpf_exit_net(struct net *net)
+static void __net_exit bpf_exit_net(struct list_head *net_list)
 {
-   struct tc_action_net *tn = net_generic(net, bpf_net_id);
-
-   tc_action_net_exit(tn);
+   tc_action_net_exit(net_list, bpf_net_id);
 }
 
 static struct pernet_operations bpf_net_ops = {
.init = bpf_init_net,
-   .exit = bpf_exit_net,
+   .exit_batch = bpf_exit_net,
.id   = &bpf_net_id,
.size = sizeof(struct tc_action_net),
 };
diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
index 10b7a8855a6c..2b15ba84e0c8 100644
--- a/net/sched/act_connmark.c
+++ b/net/sched/act_connmark.c
@@ -209,16 +209,14 @@ static __net_init int connmark_init_net(struct net *net)
return tc_action_net_init(tn, &act_connmark_ops);
 }
 
-static void __net_exit connmark_exit_net(struct net *net)
+static void __net_exit connmark_exit_net(struct list_head *net_list)
 {
-   struct tc_action_net *tn = net_generic(net, connmark_net_id);
-
-   tc_action_net_exit(tn);
+   tc_action_net_exit(net_list, connmark_net_id);
 }
 
 static struct pernet_operations connmark_net_ops = {
.init = connmark_init_net,
-   .exit = connmark_exit_net,
+   .exit_batch = connmark_exit_net,
.id   = &connmark_net_id,
.size = sizeof(struct tc_action_net),
 };
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index d836f998117b..af4b8ec60d9a 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -635,16 +635,14 @@ static __net_init int csum_init_net(struct net *net)
return tc_action_net_init(tn, &act_csum_ops);
 }
 
-static void __net_exit csum_exit_net(struct net *net)
+static void __net_exit csum_exit_net(struct list_head *net_list)
 {
-   struct tc_action_net *tn = net_generic(net, csum_net_id);
-
-   tc_action_net_exit(tn);
+   tc_action_net_exit(net_list, csum_net_id);
 }
 
 static struct pernet_operations csum_net_ops = {
.init = csum_init_net,
-   .exit = csum_exit_net,
+   .exit_batch = csum_exit_net,
.id   = &csum_net_id,
.size = sizeof(struct tc_action_net),
 };
diff --git a/net/sched/act_gact.c b/net/sched/act_gact.c
index e29a48ef7fc3..9d632e92cad0 100644
--- a/net/sched/act_gact.c
+++ b/net/sched/act_gact.c
@@ -235,16 +235,14 @@ static __net_init int gact_init_net(struct net *net)
return tc_action_net_init(tn, &act_gact_ops);
 }
 
-static void __net_exit gact_exit_net(struct net *net)
+static void __net_exit gact_exit_net(struct list_head *net_list)
 {
-   struct tc_action_net *tn = net_generic(net, gact_net_id);
-
-   tc_action_net_exit(tn);
+   tc_action_net_exit(net_list, gact_net_id);
 }
 
 static struct pernet_operations gact_net_ops = {
.in

[PATCH] Fix handling of verdicts after NF_QUEUE

2017-12-11 Thread Debabrata Banerjee
A verdict of NF_STOLEN after NF_QUEUE will cause an incorrect return value
and a potential kernel panic via double free of skbs.

This was broken by commit 7034b566a4e7 ("netfilter: fix nf_queue handling")
and subsequently fixed in v4.10 by commit c63cbc460419 ("netfilter:
use switch() to handle verdict cases from nf_hook_slow()"). However, that
commit cannot be cleanly cherry-picked to v4.9.

Signed-off-by: Debabrata Banerjee 

---

This fix is only needed for v4.9 stable since v4.10+ does not have the
issue
---
 net/netfilter/core.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index 004af030ef1a..d869ea50623e 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -364,6 +364,11 @@ int nf_hook_slow(struct sk_buff *skb, struct nf_hook_state *state)
ret = nf_queue(skb, state, &entry, verdict);
if (ret == 1 && entry)
goto next_hook;
+   } else {
+   /* Implicit handling for NF_STOLEN, as well as any other
+* non conventional verdicts.
+*/
+   ret = 0;
}
return ret;
 }
-- 
2.15.1



Re: [PATCH net-next v5 2/2] net: thunderx: add timestamping support

2017-12-11 Thread Richard Cochran
On Mon, Dec 11, 2017 at 05:14:31PM +0300, Aleksey Makarov wrote:
> diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h
> index 4a02e618e318..204b234beb9d 100644
> --- a/drivers/net/ethernet/cavium/thunder/nic.h
> +++ b/drivers/net/ethernet/cavium/thunder/nic.h
> @@ -263,6 +263,8 @@ struct nicvf_drv_stats {
>   struct u64_stats_sync   syncp;
>  };
>  
> +struct cavium_ptp;
> +
>  struct nicvf {
>   struct nicvf        *pnicvf;
>   struct net_device   *netdev;
> @@ -312,6 +314,12 @@ struct nicvf {
>   struct tasklet_struct   qs_err_task;
>   struct work_struct  reset_task;
>  
> + /* PTP timestamp */
> + struct cavium_ptp   *ptp_clock;
> + bool                hw_rx_tstamp;
> + struct sk_buff      *ptp_skb;
> + atomic_t            tx_ptp_skbs;

It is disturbing that the above two fields are set in different
places.  Shouldn't they be unified into one logical lock?

Here you clear them together:

> +static void nicvf_snd_ptp_handler(struct net_device *netdev,
> +   struct cqe_send_t *cqe_tx)
> +{
> + struct nicvf *nic = netdev_priv(netdev);
> + struct skb_shared_hwtstamps ts;
> + u64 ns;
> +
> + nic = nic->pnicvf;
> +
> + /* Sync for 'ptp_skb' */
> + smp_rmb();
> +
> + /* New timestamp request can be queued now */
> + atomic_set(&nic->tx_ptp_skbs, 0);
> +
> + /* Check for timestamp requested skb */
> + if (!nic->ptp_skb)
> + return;
> +
> + /* Check if timestamping is timedout, which is set to 10us */
> + if (cqe_tx->send_status == CQ_TX_ERROP_TSTMP_TIMEOUT ||
> + cqe_tx->send_status == CQ_TX_ERROP_TSTMP_CONFLICT)
> + goto no_tstamp;
> +
> + /* Get the timestamp */
> + memset(&ts, 0, sizeof(ts));
> + ns = cavium_ptp_tstamp2time(nic->ptp_clock, cqe_tx->ptp_timestamp);
> + ts.hwtstamp = ns_to_ktime(ns);
> + skb_tstamp_tx(nic->ptp_skb, &ts);
> +
> +no_tstamp:
> + /* Free the original skb */
> + dev_kfree_skb_any(nic->ptp_skb);
> + nic->ptp_skb = NULL;
> + /* Sync 'ptp_skb' */
> + smp_wmb();
> +}
> +

but here you set the one:

> @@ -657,7 +697,12 @@ static void nicvf_snd_pkt_handler(struct net_device *netdev,
>   prefetch(skb);
>   (*tx_pkts)++;
>   *tx_bytes += skb->len;
> - napi_consume_skb(skb, budget);
> + /* If timestamp is requested for this skb, don't free it */
> + if (skb_shinfo(skb)->tx_flags & SKBTX_IN_PROGRESS &&
> + !nic->pnicvf->ptp_skb)
> + nic->pnicvf->ptp_skb = skb;
> + else
> + napi_consume_skb(skb, budget);
>   sq->skbuff[cqe_tx->sqe_ptr] = (u64)NULL;
>   } else {
>   /* In case of SW TSO on 88xx, only last segment will have

here you clear one:

> @@ -1319,12 +1382,28 @@ int nicvf_stop(struct net_device *netdev)
>  
>   nicvf_free_cq_poll(nic);
>  
> + /* Free any pending SKB saved to receive timestamp */
> + if (nic->ptp_skb) {
> + dev_kfree_skb_any(nic->ptp_skb);
> + nic->ptp_skb = NULL;
> + }
> +
>   /* Clear multiqset info */
>   nic->pnicvf = nic;
>  
>   return 0;
>  }

here you clear both:

> @@ -1394,6 +1473,12 @@ int nicvf_open(struct net_device *netdev)
>   if (nic->sqs_mode)
>   nicvf_get_primary_vf_struct(nic);
>  
> + /* Configure PTP timestamp */
> + if (nic->ptp_clock)
> + nicvf_config_hw_rx_tstamp(nic, nic->hw_rx_tstamp);
> + atomic_set(&nic->tx_ptp_skbs, 0);
> + nic->ptp_skb = NULL;
> +
>   /* Configure receive side scaling and MTU */
>   if (!nic->sqs_mode) {
>   nicvf_rss_init(nic);

here you set the other:

> @@ -1385,6 +1388,29 @@ nicvf_sq_add_hdr_subdesc(struct nicvf *nic, struct snd_queue *sq, int qentry,
>   hdr->inner_l3_offset = skb_network_offset(skb) - 2;
>   this_cpu_inc(nic->pnicvf->drv_stats->tx_tso);
>   }
> +
> + /* Check if timestamp is requested */
> + if (!(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP)) {
> + skb_tx_timestamp(skb);
> + return;
> + }
> +
> + /* Tx timestamping not supported along with TSO, so ignore request */
> + if (skb_shinfo(skb)->gso_size)
> + return;
> +
> + /* HW supports only a single outstanding packet to timestamp */
> + if (!atomic_add_unless(&nic->pnicvf->tx_ptp_skbs, 1, 1))
> + return;
> +
> + /* Mark the SKB for later reference */
> + skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
> +
> + /* Finally enable timestamp generation
> +  * Since 'post_cqe' is also set, two CQEs will be posted
> +  * for this packet i.e CQE_TYPE_SEND and CQE_TYPE_SEND_PTP.
> +  */
> + hdr->tstmp = 1;
>  }

and so it is completely non-obvious whether this is race free or not.
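
One possible reading of the "unify them" suggestion, as a minimal userspace sketch rather than driver code: fold the "one outstanding request" counter and the saved skb pointer into a single atomic slot, so that claiming and releasing are each one operation. The names ptp_slot, ptp_slot_claim() and ptp_slot_take() are hypothetical, and C11 atomics stand in here for the kernel's cmpxchg()/xchg(); whether this fits the hardware's two-CQE flow is an assumption that would need checking.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical single-slot holder: NULL means "no timestamp request pending". */
struct ptp_slot {
	_Atomic(void *) pending;
};

/* Claim the slot for 'skb'; fails if another request is already outstanding. */
static bool ptp_slot_claim(struct ptp_slot *slot, void *skb)
{
	void *expected = NULL;

	return atomic_compare_exchange_strong(&slot->pending, &expected, skb);
}

/* Take ownership of the pending skb (or NULL) and free the slot in one step. */
static void *ptp_slot_take(struct ptp_slot *slot)
{
	return atomic_exchange(&slot->pending, NULL);
}
```

With a single slot there is only one place where the state is set (claim) and one where it is cleared (take), which makes the ownership hand-off easier to audit than a separate flag and pointer.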

T

[PATCH v3] net: ethernet: arc: fix error handling in emac_rockchip_probe

2017-12-11 Thread Branislav Radocaj
If clk_set_rate() fails, we should disable clk before return.
Found by Linux Driver Verification project (linuxtesting.org).

Changes since v2 [1]:
* Merged with latest code changes

Changes since v1:
Update made thanks to David's review, much appreciated David.
* Improved inconsistent failure handling of clock rate setting
* For completeness of usecase, added arc_emac_probe error handling

Signed-off-by: Branislav Radocaj 
---
[1] https://marc.info/?l=linux-netdev&m=151301239802445&w=2
---
 drivers/net/ethernet/arc/emac_rockchip.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/arc/emac_rockchip.c b/drivers/net/ethernet/arc/emac_rockchip.c
index c6163874e4e7..16f9bee992fe 100644
--- a/drivers/net/ethernet/arc/emac_rockchip.c
+++ b/drivers/net/ethernet/arc/emac_rockchip.c
@@ -199,9 +199,11 @@ static int emac_rockchip_probe(struct platform_device *pdev)
 
/* RMII interface needs always a rate of 50MHz */
err = clk_set_rate(priv->refclk, 50000000);
-   if (err)
+   if (err) {
dev_err(dev,
"failed to change reference clock rate (%d)\n", err);
+   goto out_regulator_disable;
+   }
 
if (priv->soc_data->need_div_macclk) {
priv->macclk = devm_clk_get(dev, "macclk");
@@ -230,12 +232,14 @@ static int emac_rockchip_probe(struct platform_device *pdev)
err = arc_emac_probe(ndev, interface);
if (err) {
dev_err(dev, "failed to probe arc emac (%d)\n", err);
-   goto out_regulator_disable;
+   goto out_clk_disable_macclk;
}
 
return 0;
+
 out_clk_disable_macclk:
-   clk_disable_unprepare(priv->macclk);
+   if (priv->soc_data->need_div_macclk)
+   clk_disable_unprepare(priv->macclk);
 out_regulator_disable:
if (priv->regulator)
regulator_disable(priv->regulator);
-- 
2.11.0



Re: [PATCH 1/3] PCI: introduce a device-managed version of pci_set_mwi

2017-12-11 Thread Bjorn Helgaas
On Sun, Dec 10, 2017 at 12:43:48AM +0100, Heiner Kallweit wrote:
> Introduce a device-managed version of pci_set_mwi. First user is the
> Realtek r8169 driver.
> 
> Signed-off-by: Heiner Kallweit 

With the subject and changelog as follows and the code reordering below,

  PCI: Add pcim_set_mwi(), a device-managed pci_set_mwi()

  Add pcim_set_mwi(), a device-managed version of pci_set_mwi(). First user
  is the Realtek r8169 driver.

Acked-by: Bjorn Helgaas 

With these changes, feel free to merge with the series via the netdev
tree.

> ---
>  drivers/pci/pci.c   | 29 +
>  include/linux/pci.h |  1 +
>  2 files changed, 30 insertions(+)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 4a7c6864f..fc57c378d 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -1458,6 +1458,7 @@ struct pci_devres {
>   unsigned int pinned:1;
>   unsigned int orig_intx:1;
>   unsigned int restore_intx:1;
> + unsigned int mwi:1;
>   u32 region_mask;
>  };
>  
> @@ -1476,6 +1477,9 @@ static void pcim_release(struct device *gendev, void *res)
>   if (this->region_mask & (1 << i))
>   pci_release_region(dev, i);
>  
> + if (this->mwi)
> + pci_clear_mwi(dev);
> +
>   if (this->restore_intx)
>   pci_intx(dev, this->orig_intx);
>  
> @@ -3760,6 +3764,31 @@ int pci_set_mwi(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL(pci_set_mwi);
>  
> +/**
> + * pcim_set_mwi - Managed pci_set_mwi()
> + * @dev: the PCI device for which MWI is enabled
> + *
> + * Managed pci_set_mwi().
> + *
> + * RETURNS: An appropriate -ERRNO error value on error, or zero for success.

> + */
> +int pcim_set_mwi(struct pci_dev *dev)
> +{
> + struct pci_devres *dr;
> + int ret;
> +
> + ret = pci_set_mwi(dev);
> + if (ret)
> + return ret;
> +
> + dr = find_pci_dr(dev);
> + if (dr)
> + dr->mwi = 1;
> +
> + return 0;

I would rather look up the pci_devres first, e.g.,

  dr = find_pci_dr(dev);
  if (!dr)
return -ENOMEM;

  dr->mwi = 1;
  return pci_set_mwi(dev);

That way we won't enable MWI and be unable to disable it at release-time.

> +}
> +EXPORT_SYMBOL(pcim_set_mwi);
> +
>  /**
>   * pci_try_set_mwi - enables memory-write-invalidate PCI transaction
>   * @dev: the PCI device for which MWI is enabled
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 978aad784..0a7ac863a 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1064,6 +1064,7 @@ int pci_set_pcie_reset_state(struct pci_dev *dev, enum pcie_reset_state state);
>  int pci_set_cacheline_size(struct pci_dev *dev);
>  #define HAVE_PCI_SET_MWI
>  int __must_check pci_set_mwi(struct pci_dev *dev);
> +int __must_check pcim_set_mwi(struct pci_dev *dev);
>  int pci_try_set_mwi(struct pci_dev *dev);
>  void pci_clear_mwi(struct pci_dev *dev);
>  void pci_intx(struct pci_dev *dev, int enable);
> -- 
> 2.15.1
> 
> 


Re: [PATCH net-next v5 1/2] net: add support for Cavium PTP coprocessor

2017-12-11 Thread Richard Cochran

Sorry I didn't finish reviewing before...

On Mon, Dec 11, 2017 at 05:14:30PM +0300, Aleksey Makarov wrote:
> +/**
> + * cavium_ptp_adjfreq() - Adjust ptp frequency
> + * @ptp: PTP clock info
> + * @ppb: how much to adjust by, in parts-per-billion
> + */
> +static int cavium_ptp_adjfreq(struct ptp_clock_info *ptp_info, s32 ppb)

adjfreq() is deprecated.  See ptp_clock_kernel.h.  Please re-work this
to implement the adjfine() method instead.
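
For reference, adjfine() takes a 'scaled_ppm' argument (parts per million with a 16-bit binary fractional part) instead of ppb. A minimal sketch of the unit conversion, assuming the driver still wants a ppb value internally; the helper name example_scaled_ppm_to_ppb() is hypothetical and this is not the kernel's own conversion routine:

```c
#include <stdint.h>

/*
 * scaled_ppm is ppm with a 16-bit binary fraction, so 65536 means
 * exactly 1 ppm, i.e. 1000 ppb.  Illustrative helper only.
 */
static int64_t example_scaled_ppm_to_ppb(long scaled_ppm)
{
	/* ppb = scaled_ppm * 1000 / 65536, reduced to 125 / 8192 */
	return (int64_t)scaled_ppm * 125 / 8192;
}
```

A driver may well prefer to compute its hardware compensation value directly from scaled_ppm to keep the extra resolution; the conversion above only shows how the two units relate.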

> +/**
> + * cavium_ptp_enable() - Check if PTP is enabled

Nit - comment is not correct. This method is for the auxiliary PHC
functions.

> + * @ptp: PTP clock info
> + * @rq:  request
> + * @on:  is it on
> + */
> +static int cavium_ptp_enable(struct ptp_clock_info *ptp_info,
> +  struct ptp_clock_request *rq, int on)
> +{
> + return -EOPNOTSUPP;
> +}

...

> +static int cavium_ptp_probe(struct pci_dev *pdev,
> + const struct pci_device_id *ent)
> +{
> + struct device *dev = &pdev->dev;
> + struct cavium_ptp *clock;
> + struct cyclecounter *cc;
> + u64 clock_cfg;
> + u64 clock_comp;
> + int err;
> +
> + clock = devm_kzalloc(dev, sizeof(*clock), GFP_KERNEL);
> + if (!clock)
> + return -ENOMEM;
> +
> + clock->pdev = pdev;
> +
> + err = pcim_enable_device(pdev);
> + if (err)
> + return err;
> +
> + err = pcim_iomap_regions(pdev, 1 << PCI_PTP_BAR_NO, pci_name(pdev));
> + if (err)
> + return err;
> +
> + clock->reg_base = pcim_iomap_table(pdev)[PCI_PTP_BAR_NO];
> +
> + spin_lock_init(&clock->spin_lock);
> +
> + cc = &clock->cycle_counter;
> + cc->read = cavium_ptp_cc_read;
> + cc->mask = CYCLECOUNTER_MASK(64);
> + cc->mult = 1;
> + cc->shift = 0;
> +
> + timecounter_init(&clock->time_counter, &clock->cycle_counter,
> +  ktime_to_ns(ktime_get_real()));
> +
> + clock->clock_rate = ptp_cavium_clock_get();
> +
> + clock->ptp_info = (struct ptp_clock_info) {
> + .owner  = THIS_MODULE,
> + .name   = "ThunderX PTP",
> + .max_adj= 10ull,
> + .n_ext_ts   = 0,
> + .n_pins = 0,
> + .pps= 0,
> + .adjfreq= cavium_ptp_adjfreq,
> + .adjtime= cavium_ptp_adjtime,
> + .gettime64  = cavium_ptp_gettime,
> + .settime64  = cavium_ptp_settime,
> + .enable = cavium_ptp_enable,
> + };
> +
> + clock_cfg = readq(clock->reg_base + PTP_CLOCK_CFG);
> + clock_cfg |= PTP_CLOCK_CFG_PTP_EN;
> + writeq(clock_cfg, clock->reg_base + PTP_CLOCK_CFG);
> +
> + clock_comp = ((u64)10ull << 32) / clock->clock_rate;
> + writeq(clock_comp, clock->reg_base + PTP_CLOCK_COMP);
> +
> + clock->ptp_clock = ptp_clock_register(&clock->ptp_info, dev);
> + if (IS_ERR(clock->ptp_clock)) {

You need to handle the case when ptp_clock_register() returns NULL.

from ptp_clock_kernel.h:

/**
 * ptp_clock_register() - register a PTP hardware clock driver
 *
 * @info:   Structure describing the new clock.
 * @parent: Pointer to the parent device of the new clock.
 *
 * Returns a valid pointer on success or PTR_ERR on failure.  If PHC
 * support is missing at the configuration level, this function
 * returns NULL, and drivers are expected to gracefully handle that
 * case separately.
 */
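
The three outcomes described in that comment can be sketched in userspace as follows. This is only an illustration: ex_is_err() and ex_err_ptr() are hypothetical stand-ins that mimic the kernel's IS_ERR()/ERR_PTR() convention (error codes encoded in the top 4095 addresses), and what the driver should do in the NULL case is left open here.

```c
#include <stdint.h>
#include <stddef.h>

#define EX_MAX_ERRNO 4095	/* same bound the kernel's IS_ERR() uses */

enum ex_reg_result { EX_REG_OK, EX_REG_NO_PHC, EX_REG_FAILED };

/* Userspace stand-in for IS_ERR(): error codes live in the top 4095 addresses. */
static int ex_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-EX_MAX_ERRNO;
}

/* Userspace stand-in for ERR_PTR(): encode a negative errno as a pointer. */
static void *ex_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* Classify the value a ptp_clock_register()-style call handed back. */
static enum ex_reg_result ex_classify(const void *clk)
{
	if (ex_is_err(clk))
		return EX_REG_FAILED;	/* real failure: propagate the errno */
	if (!clk)
		return EX_REG_NO_PHC;	/* PHC support not configured */
	return EX_REG_OK;
}
```

The point is that an IS_ERR() check alone lets a NULL return fall through as if registration had succeeded, so the NULL case needs its own branch.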

> + clock_cfg = readq(clock->reg_base + PTP_CLOCK_CFG);
> + clock_cfg &= ~PTP_CLOCK_CFG_PTP_EN;
> + writeq(clock_cfg, clock->reg_base + PTP_CLOCK_CFG);
> + return PTR_ERR(clock->ptp_clock);
> + }
> +
> + pci_set_drvdata(pdev, clock);
> + return 0;
> +}

Thanks,
Richard


Re: [PATCH v3 net-next 0/9] net: Generic network resolver backend and ILA resolver

2017-12-11 Thread Tom Herbert
On Mon, Dec 11, 2017 at 2:16 PM, Tom Herbert  wrote:
> On Mon, Dec 11, 2017 at 1:34 PM, David Miller  wrote:
>> From: Tom Herbert 
>> Date: Mon, 11 Dec 2017 12:38:28 -0800
>>
>>> DOS mitigations:
>>>
>>> - The number of outstanding resolutions is limited by the size of the
>>>   table
>>> - Timeout of pending entries limits the number of netlink resolution
>>>   messages
>>> - Packets are not queued that are pending resolution. In the current
>>>   model that can be forwarded to a router that has all reachability
>>>   information (ILA use case for example)
>>
>> None of these mitigation schemes matter.
>>
>> If packet traffic can influence the table of entries (your cache
>> or whatever), then you will be DoS'able.
>>
>> If you limit outstanding resolutions, you harm legitimate traffic
>> whose resolutions will not be processed now too just as equally
>> as you will harm "bad guy" traffic.
>>
> David,
>
Actually, please disregard. I will respin to use secure redirects.

> How can we build a system that allows an unlimited number of
> resolutions without drop? Unless the resolution path can handle a
> higher packet load than the receive path, there will be some place in
> the system where memory is allocated and that limits the amount of
> pending resolutions (i.e. pending packet skbs, entry in a resolution
> table, skbs on a netlink socket).
>
>> If you forward in the case of pending resolution, the bad guy can
>> make you forward everything there.  The bad guy can effectively
>> make your caching node stop caching completely.
>>
> But a DOS attack doesn't stop forwarding, at best it forces suboptimal
> forwarding. This is analogous to when the SYN cache is filled up but SYN
> cookies allow forward progress in a degraded operational mode.
>
> Thanks,
> Tom


Re: Huge memory leak with 4.15.0-rc2+

2017-12-11 Thread Paweł Staszewski



On 2017-12-11 at 23:15, John Fastabend wrote:

On 12/11/2017 01:48 PM, Paweł Staszewski wrote:


On 2017-12-11 at 22:23, Paweł Staszewski wrote:

Hi


I just upgraded some testing host to 4.15.0-rc2+ kernel

And after some time of traffic processing - when traffic on all ports
reach about 3Mpps - memleak started.



[...]


Some observations - when I disable TSO on all cards there is more
memleak.






When traffic starts to drop - there is less and less memleak
below link to memory usage graph:
https://ibb.co/hU97kG

And there is rising slab_unrecl - Amount of unreclaimable memory used
for slab kernel allocations


Forgot to add that I'm using hfsc and qdiscs like pfifo on classes.



Maybe some error case I missed in the qdisc patches I'm looking into
it.

Thanks,
John



This is how it looks when correlated on a graph - traffic vs. memory:
https://ibb.co/njpkqG

Typical hfsc class + qdisc:
### Client interface vlan1616
tc qdisc del dev vlan1616 root
tc qdisc add dev vlan1616 handle 1: root hfsc default 100
tc class add dev vlan1616 parent 1: classid 1:100 hfsc ls m2 200Mbit ul m2 200Mbit

tc qdisc add dev vlan1616 parent 1:100 handle 100: pfifo limit 128
### End TM for client interface
tc qdisc del dev vlan1616 ingress
tc qdisc add dev vlan1616 handle ffff: ingress
tc filter add dev vlan1616 parent ffff: protocol ip prio 50 u32 match ip src 0.0.0.0/0 police rate 200Mbit burst 200M mtu 32k drop flowid 1:1


And this is the same for about 450 VLAN interfaces.


The good thing is that, compared to 4.14.3, I have about 5% less CPU load
on 4.15.0-rc2+.


Once hfsc or tbf become lockless, there will be a really huge
difference in CPU load on x86 when using traffic shaping - so really
good job, John.






Re: Huge memory leak with 4.15.0-rc2+

2017-12-11 Thread John Fastabend
On 12/11/2017 01:48 PM, Paweł Staszewski wrote:
> 
> 
> On 2017-12-11 at 22:23, Paweł Staszewski wrote:
>> Hi
>>
>>
>> I just upgraded some testing host to 4.15.0-rc2+ kernel
>>
>> And after some time of traffic processing - when traffic on all ports
>> reach about 3Mpps - memleak started.
>>


[...]

>> Some observations - when I disable TSO on all cards there is more
>> memleak.
>>
>>
>>
>>
>>
> When traffic starts to drop - there is less and less memleak
> below link to memory usage graph:
> https://ibb.co/hU97kG
> 
> And there is rising slab_unrecl - Amount of unreclaimable memory used
> for slab kernel allocations
> 
> 
> Forgot to add that I'm using hfsc and qdiscs like pfifo on classes.
> 
> 

Maybe some error case I missed in the qdisc patches I'm looking into
it.

Thanks,
John



Re: [PATCH v3 net-next 0/9] net: Generic network resolver backend and ILA resolver

2017-12-11 Thread Tom Herbert
On Mon, Dec 11, 2017 at 1:34 PM, David Miller  wrote:
> From: Tom Herbert 
> Date: Mon, 11 Dec 2017 12:38:28 -0800
>
>> DOS mitigations:
>>
>> - The number of outstanding resolutions is limited by the size of the
>>   table
>> - Timeout of pending entries limits the number of netlink resolution
>>   messages
>> - Packets are not queued that are pending resolution. In the current
>>   model that can be forwarded to a router that has all reachability
>>   information (ILA use case for example)
>
> None of these mitigation schemes matter.
>
> If packet traffic can influence the table of entries (your cache
> or whatever), then you will be DoS'able.
>
> If you limit outstanding resolutions, you harm legitimate traffic
> whose resolutions will not be processed now too just as equally
> as you will harm "bad guy" traffic.
>
David,

How can we build a system that allows an unlimited number of
resolutions without drop? Unless the resolution path can handle a
higher packet load than the receive path, there will be some place in
the system where memory is allocated and that limits the amount of
pending resolutions (i.e. pending packet skbs, entry in a resolution
table, skbs on a netlink socket).

> If you forward in the case of pending resolution, the bad guy can
> make you forward everything there.  The bad guy can effectively
> make your caching node stop caching completely.
>
But a DOS attack doesn't stop forwarding, at best it forces suboptimal
forwarding. This is analogous to when the SYN cache is filled up but SYN
cookies allow forward progress in a degraded operational mode.

Thanks,
Tom


Re: [REGRESSION][4.13.y][4.14.y][v4.15.y] net: reduce skb_warn_bad_offload() noise

2017-12-11 Thread Willem de Bruijn
On Mon, Dec 11, 2017 at 4:44 PM, Greg Kroah-Hartman
 wrote:
> On Mon, Dec 11, 2017 at 04:25:26PM -0500, Willem de Bruijn wrote:
>> Note that UFO was removed in 4.14 and that skb_warn_bad_offload
>> can happen for various types of packets, so there may be multiple
>> independent bug reports. I'm investigating two other non-UFO reports
>> just now.
>
> Meta-comment, now that UFO is gone from mainline, I'm wondering if I
> should just delete it from 4.4 and 4.9 as well.  Any objections for
> that?  I'd like to make it easy to maintain these kernels for a while,
> and having them diverge like this, with all of the issues around UFO,
> seems like it will just make life harder for myself if I leave it in.
>
> Any opinions?

Some of that removal had to be reverted with commit 0c19f846d582
("net: accept UFO datagrams from tuntap and packet") for VM live
migration between kernels.

Any backports probably should squash that in at the least. Just today
another thread discussed that that patch may not address all open
issues still, so it may be premature to backport at this point.
http://lkml.kernel.org/r/


Re: [PATCH v5] leds: trigger: Introduce a NETDEV trigger

2017-12-11 Thread Jacek Anaszewski
Hi Ben,

Thanks for the update.

On 12/10/2017 10:17 PM, Ben Whitten wrote:
> This commit introduces a NETDEV trigger for named device
> activity. Available triggers are link, rx, and tx.
> 
> Signed-off-by: Ben Whitten 
> 
> ---
> Changes in v5:
> Adjust header comment style to be consistent
> Changes in v4:
> Adopt SPDX licence header
> Changes in v3:
> Cancel the software blink prior to a oneshot re-queue
> Changes in v2:
> Sort includes and redate documentation
> Correct licence
> Remove macro and replace with generic function using enums
> Convert blink logic in stats work to use led_blink_oneshot
> Uses configured brightness instead of FULL
> ---
>  .../ABI/testing/sysfs-class-led-trigger-netdev |  45 ++
>  drivers/leds/trigger/Kconfig   |   7 +
>  drivers/leds/trigger/Makefile  |   1 +
>  drivers/leds/trigger/ledtrig-netdev.c  | 496 
> +
>  4 files changed, 549 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-class-led-trigger-netdev
>  create mode 100644 drivers/leds/trigger/ledtrig-netdev.c
> 
> diff --git a/Documentation/ABI/testing/sysfs-class-led-trigger-netdev b/Documentation/ABI/testing/sysfs-class-led-trigger-netdev
> new file mode 100644
> index 000..451af6d
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-class-led-trigger-netdev
> @@ -0,0 +1,45 @@
> +What:           /sys/class/leds/<led>/device_name
> +Date:           Dec 2017
> +KernelVersion:   4.16
> +Contact: linux-l...@vger.kernel.org
> +Description:
> + Specifies the network device name to monitor.
> +
> +What:           /sys/class/leds/<led>/interval
> +Date:           Dec 2017
> +KernelVersion:   4.16
> +Contact: linux-l...@vger.kernel.org
> +Description:
> + Specifies the duration of the LED blink in milliseconds.
> + Defaults to 50 ms.
> +
> +What:           /sys/class/leds/<led>/link
> +Date:           Dec 2017
> +KernelVersion:   4.16
> +Contact: linux-l...@vger.kernel.org
> +Description:
> + Signal the link state of the named network device.
> + If set to 0 (default), the LED's normal state is off.
> + If set to 1, the LED's normal state reflects the link state
> + of the named network device.
> + Setting this value also immediately changes the LED state.
> +
> +What:           /sys/class/leds/<led>/tx
> +Date:           Dec 2017
> +KernelVersion:   4.16
> +Contact: linux-l...@vger.kernel.org
> +Description:
> + Signal transmission of data on the named network device.
> + If set to 0 (default), the LED will not blink on transmission.
> + If set to 1, the LED will blink for the milliseconds specified
> + in interval to signal transmission.
> +
> +What:           /sys/class/leds/<led>/rx
> +Date:           Dec 2017
> +KernelVersion:   4.16
> +Contact: linux-l...@vger.kernel.org
> +Description:
> + Signal reception of data on the named network device.
> + If set to 0 (default), the LED will not blink on reception.
> + If set to 1, the LED will blink for the milliseconds specified
> + in interval to signal reception.
> diff --git a/drivers/leds/trigger/Kconfig b/drivers/leds/trigger/Kconfig
> index 3f9ddb9..4ec1853 100644
> --- a/drivers/leds/trigger/Kconfig
> +++ b/drivers/leds/trigger/Kconfig
> @@ -126,4 +126,11 @@ config LEDS_TRIGGER_PANIC
> a different trigger.
> If unsure, say Y.
>  
> +config LEDS_TRIGGER_NETDEV
> + tristate "LED Netdev Trigger"
> + depends on NET && LEDS_TRIGGERS
> + help
> +   This allows LEDs to be controlled by network device activity.
> +   If unsure, say Y.
> +
>  endif # LEDS_TRIGGERS
> diff --git a/drivers/leds/trigger/Makefile b/drivers/leds/trigger/Makefile
> index 9f2e868..59e163d 100644
> --- a/drivers/leds/trigger/Makefile
> +++ b/drivers/leds/trigger/Makefile
> @@ -11,3 +11,4 @@ obj-$(CONFIG_LEDS_TRIGGER_DEFAULT_ON)   += ledtrig-default-on.o
>  obj-$(CONFIG_LEDS_TRIGGER_TRANSIENT) += ledtrig-transient.o
>  obj-$(CONFIG_LEDS_TRIGGER_CAMERA)+= ledtrig-camera.o
>  obj-$(CONFIG_LEDS_TRIGGER_PANIC) += ledtrig-panic.o
> +obj-$(CONFIG_LEDS_TRIGGER_NETDEV)+= ledtrig-netdev.o
> diff --git a/drivers/leds/trigger/ledtrig-netdev.c b/drivers/leds/trigger/ledtrig-netdev.c
> new file mode 100644
> index 000..6df4781
> --- /dev/null
> +++ b/drivers/leds/trigger/ledtrig-netdev.c
> @@ -0,0 +1,496 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright 2017 Ben Whitten 
> +// Copyright 2007 Oliver Jowett 
> +//
> +// LED Kernel Netdev Trigger
> +//
> +// Toggles the LED to reflect the link and traffic state of a named net device
> +//
> +// Derived from ledtrig-timer.c which is:
> +//  Copyright 2005-2006 Openedhand Ltd.
> +//  Author: Richard Purdie 
> +
> +#include 
> +#include 
> +#in

Re: Huge memory leak with 4.15.0-rc2+

2017-12-11 Thread Paweł Staszewski



On 2017-12-11 at 22:23, Paweł Staszewski wrote:

Hi


I just upgraded some testing host to 4.15.0-rc2+ kernel

And after some time of traffic processing - when traffic on all ports 
reach about 3Mpps - memleak started.


Graph attached from memory usage: https://ibb.co/idK4zb



HW config:

Intel E5

8x Intel 82599 (used ixgbe driver from kernel)

Interfaces with vlans attached

All 8 ethernet ports are in one LAG group configured by team.

With current settings

(this host is acting as a router, and the bgpd process has been using the
same amount of memory from the beginning, about 5.2GB)


 cat /proc/meminfo
MemTotal:   32770588 kB
MemFree:    11342492 kB
MemAvailable:   10982752 kB
Buffers:   84704 kB
Cached:    83180 kB
SwapCached:    0 kB
Active:  5105320 kB
Inactive:  46252 kB
Active(anon):    4985448 kB
Inactive(anon): 1096 kB
Active(file): 119872 kB
Inactive(file):    45156 kB
Unevictable:   0 kB
Mlocked:   0 kB
SwapTotal:   4005280 kB
SwapFree:    4005280 kB
Dirty:   236 kB
Writeback: 0 kB
AnonPages:   4983752 kB
Mapped:    13556 kB
Shmem:  2852 kB
Slab:    1013124 kB
SReclaimable:  45876 kB
SUnreclaim:   967248 kB
KernelStack:    7152 kB
PageTables:    12164 kB
NFS_Unstable:  0 kB
Bounce:    0 kB
WritebackTmp:  0 kB
CommitLimit:    20390572 kB
Committed_AS: 396568 kB
VmallocTotal:   34359738367 kB
VmallocUsed:   0 kB
VmallocChunk:  0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages:    0 kB
ShmemPmdMapped:    0 kB
CmaTotal:  0 kB
CmaFree:   0 kB
HugePages_Total:   0
HugePages_Free:    0
HugePages_Rsvd:    0
HugePages_Surp:    0
Hugepagesize:   2048 kB
DirectMap4k: 1407572 kB
DirectMap2M:    20504576 kB
DirectMap1G:    13631488 kB

ps aux --sort -rss
USER   PID %CPU %MEM    VSZ   RSS TTY  STAT START   TIME COMMAND
root  6758  1.8 14.9 5044996 4886964 ? Sl   01:22  23:21 /usr/local/sbin/bgpd -d  -u root -g root -I --ignore_warnings
root  6752  0.0  0.1  86272 61920 ?    Ss   01:22   0:16 /usr/local/sbin/zebra -d  -u root -g root -I --ignore_warnings
root  6766 12.6  0.0  51592 29196 ?    S    01:22 157:48 /usr/sbin/snmpd -p /var/run/snmpd.pid -Ln
root  7494  0.0  0.0 708976  5896 ?    Ssl  01:22   0:09 /opt/collectd/sbin/collectd
root 15531  0.0  0.0  67864  5056 ?    Ss   21:57   0:00 sshd: paol [priv]
root  4915  0.0  0.0 271912  4904 ?    Ss   01:21   0:25 /usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng/syslog-ng.persist --cfgfile /etc/syslog-ng/syslog-ng.conf --pidfile /run/syslog-ng.pid
root  4278  0.0  0.0  37220  4164 ?    Ss   01:21   0:00 /lib/systemd/systemd-udevd --daemon
root  5147  0.0  0.0  32072  3232 ?    Ss   01:21   0:00 /usr/sbin/sshd
root  5203  0.0  0.0  28876  2436 ?    S    01:21   0:00 teamd -d -f /etc/teamd.conf
root 17372  0.0  0.0  17924  2388 pts/2    R+   22:13   0:00 ps aux --sort -rss
root  4789  0.0  0.0   5032  2176 ?    Ss   01:21   0:00 mdadm --monitor --scan --daemonise --pid-file /var/run/mdadm.pid --syslog
root  7511  0.0  0.0  12676  1920 tty4 Ss+  01:22   0:00 /sbin/agetty 38400 tty4 linux
root  7510  0.0  0.0  12676  1896 tty3 Ss+  01:22   0:00 /sbin/agetty 38400 tty3 linux
root  7512  0.0  0.0  12676  1860 tty5 Ss+  01:22   0:00 /sbin/agetty 38400 tty5 linux
root  7513  0.0  0.0  12676  1836 tty6 Ss+  01:22   0:00 /sbin/agetty 38400 tty6 linux
root  7509  0.0  0.0  12676  1832 tty2 Ss+  01:22   0:00 /sbin/agetty 38400 tty2 linux


The latest kernel on which everything was working is 4.14.3.


Some observations - when I disable TSO on all cards there is more
memleak.







When traffic starts to drop - there is less and less memleak
below link to memory usage graph:
https://ibb.co/hU97kG

And there is rising slab_unrecl - Amount of unreclaimable memory used 
for slab kernel allocations



Forgot to add that I'm using hfsc and qdiscs like pfifo on classes.









Re: [REGRESSION][4.13.y][4.14.y][v4.15.y] net: reduce skb_warn_bad_offload() noise

2017-12-11 Thread Greg Kroah-Hartman
On Mon, Dec 11, 2017 at 04:25:26PM -0500, Willem de Bruijn wrote:
> Note that UFO was removed in 4.14 and that skb_warn_bad_offload
> can happen for various types of packets, so there may be multiple
> independent bug reports. I'm investigating two other non-UFO reports
> just now.

Meta-comment, now that UFO is gone from mainline, I'm wondering if I
should just delete it from 4.4 and 4.9 as well.  Any objections for
that?  I'd like to make it easy to maintain these kernels for a while,
and having them diverge like this, with all of the issues around UFO,
seems like it will just make life harder for myself if I leave it in.

Any opinions?

thanks,

greg k-h


Re: [PATCH v3 net-next 0/9] net: Generic network resolver backend and ILA resolver

2017-12-11 Thread David Miller
From: Tom Herbert 
Date: Mon, 11 Dec 2017 12:38:28 -0800

> DOS mitigations:
> 
> - The number of outstanding resolutions is limited by the size of the
>   table
> - Timeout of pending entries limits the number of netlink resolution
>   messages
> - Packets are not queued that are pending resolution. In the current
>   model that can be forwarded to a router that has all reachability
>   information (ILA use case for example)

None of these mitigation schemes matter.

If packet traffic can influence the table of entries (your cache
or whatever), then you will be DoS'able.

If you limit outstanding resolutions, you harm legitimate traffic,
whose resolutions will then not be processed either, just as much
as you harm "bad guy" traffic.

If you forward in the case of pending resolution, the bad guy can
make you forward everything there.  The bad guy can effectively
make your caching node stop caching completely.

Please, learn from OVS, the ipv4 routing cache, and the IPSEC
flow cache.  This kind of architecture, _especially_ when the
resolution is user side, is deeply flawed.

We're trying to remove code that does this kind of stuff, rather
than add new instances.

Thank you.


Huge memory leak with 4.15.0-rc2+

2017-12-11 Thread Paweł Staszewski

Hi


I just upgraded a test host to the 4.15.0-rc2+ kernel.

After some time of traffic processing, when traffic on all ports
reached about 3 Mpps, a memory leak started.


Graph attached from memory usage: https://ibb.co/idK4zb



HW config:

Intel E5

8x Intel 82599 (used ixgbe driver from kernel)

Interfaces with vlans attached

All 8 ethernet ports are in one LAG group configured by team.

With current settings

(This host is acting as a router, and the bgpd process has been using
the same amount of memory, about 5.2GB, from the beginning.)


 cat /proc/meminfo
MemTotal:   32770588 kB
MemFree:    11342492 kB
MemAvailable:   10982752 kB
Buffers:   84704 kB
Cached:    83180 kB
SwapCached:    0 kB
Active:  5105320 kB
Inactive:  46252 kB
Active(anon):    4985448 kB
Inactive(anon): 1096 kB
Active(file): 119872 kB
Inactive(file):    45156 kB
Unevictable:   0 kB
Mlocked:   0 kB
SwapTotal:   4005280 kB
SwapFree:    4005280 kB
Dirty:   236 kB
Writeback: 0 kB
AnonPages:   4983752 kB
Mapped:    13556 kB
Shmem:  2852 kB
Slab:    1013124 kB
SReclaimable:  45876 kB
SUnreclaim:   967248 kB
KernelStack:    7152 kB
PageTables:    12164 kB
NFS_Unstable:  0 kB
Bounce:    0 kB
WritebackTmp:  0 kB
CommitLimit:    20390572 kB
Committed_AS: 396568 kB
VmallocTotal:   34359738367 kB
VmallocUsed:   0 kB
VmallocChunk:  0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages:    0 kB
ShmemPmdMapped:    0 kB
CmaTotal:  0 kB
CmaFree:   0 kB
HugePages_Total:   0
HugePages_Free:    0
HugePages_Rsvd:    0
HugePages_Surp:    0
Hugepagesize:   2048 kB
DirectMap4k: 1407572 kB
DirectMap2M:    20504576 kB
DirectMap1G:    13631488 kB

ps aux --sort -rss
USER   PID %CPU %MEM    VSZ   RSS TTY  STAT START   TIME COMMAND
root  6758  1.8 14.9 5044996 4886964 ? Sl   01:22  23:21 
/usr/local/sbin/bgpd -d  -u root -g root -I --ignore_warnings
root  6752  0.0  0.1  86272 61920 ?    Ss   01:22   0:16 
/usr/local/sbin/zebra -d  -u root -g root -I --ignore_warnings
root  6766 12.6  0.0  51592 29196 ?    S    01:22 157:48 
/usr/sbin/snmpd -p /var/run/snmpd.pid -Ln
root  7494  0.0  0.0 708976  5896 ?    Ssl  01:22   0:09 
/opt/collectd/sbin/collectd
root 15531  0.0  0.0  67864  5056 ?    Ss   21:57   0:00 sshd: 
paol [priv]
root  4915  0.0  0.0 271912  4904 ?    Ss   01:21   0:25 
/usr/sbin/syslog-ng --persist-file /var/lib/syslog-ng/syslog-ng.persist 
--cfgfile /etc/syslog-ng/syslog-ng.conf --pidfile /run/syslog-ng.pid
root  4278  0.0  0.0  37220  4164 ?    Ss   01:21   0:00 
/lib/systemd/systemd-udevd --daemon
root  5147  0.0  0.0  32072  3232 ?    Ss   01:21   0:00 
/usr/sbin/sshd
root  5203  0.0  0.0  28876  2436 ?    S    01:21   0:00 teamd 
-d -f /etc/teamd.conf
root 17372  0.0  0.0  17924  2388 pts/2    R+   22:13   0:00 ps aux 
--sort -rss
root  4789  0.0  0.0   5032  2176 ?    Ss   01:21   0:00 mdadm 
--monitor --scan --daemonise --pid-file /var/run/mdadm.pid --syslog
root  7511  0.0  0.0  12676  1920 tty4 Ss+  01:22   0:00 
/sbin/agetty 38400 tty4 linux
root  7510  0.0  0.0  12676  1896 tty3 Ss+  01:22   0:00 
/sbin/agetty 38400 tty3 linux
root  7512  0.0  0.0  12676  1860 tty5 Ss+  01:22   0:00 
/sbin/agetty 38400 tty5 linux
root  7513  0.0  0.0  12676  1836 tty6 Ss+  01:22   0:00 
/sbin/agetty 38400 tty6 linux
root  7509  0.0  0.0  12676  1832 tty2 Ss+  01:22   0:00 
/sbin/agetty 38400 tty2 linux


The latest kernel on which everything worked is 4.14.3.


Some observations: when I disable TSO on all cards, the memory leak gets worse.






Re: [REGRESSION][4.13.y][4.14.y][v4.15.y] net: reduce skb_warn_bad_offload() noise

2017-12-11 Thread David Miller
From: Joseph Salisbury 
Date: Mon, 11 Dec 2017 15:35:34 -0500

> A kernel bug report was opened against Ubuntu [0].  It was found that
> reverting the following commit resolved this bug:
> 
> commit b2504a5dbef3305ef41988ad270b0e8ec289331c
> Author: Eric Dumazet 
> Date:   Tue Jan 31 10:20:32 2017 -0800
> 
>     net: reduce skb_warn_bad_offload() noise
>    
> 
> The regression was introduced as of v4.11-rc1 and still exists in
> current mainline.
>    
> I was hoping to get your feedback, since you are the patch author.  Do
> you think gathering any additional data will help diagnose this issue,
> or would it be best to submit a revert request?
>    
> This commit did in fact resolve another bug[1], but in the process
> introduced this regression.

It helps if you can consolidate the information obtained in your bug
tracking here in the email so that people on this list can get an idea
of what the problem scope might be without having to go to your
special bug tracking site.

This is really not about us being snobs about this mailing list, it's
about you wanting to get a result.  And you'll get a better result
faster if you post the details here on the list because most
developers are not going to go to your bug tracking site to read the
bug comments.

Also, this isn't a functional regression, it is just that we are
generating warnings that we didn't before.  It doesn't mean that
Eric's patch is wrong, it could just be that his new check is
triggering for a bug that has always been there.

Scanning the bug myself it seems that the critical required component
is IPSEC, and IPSEC has its own way of doing segmentation offload.

Thanks.



Re: [REGRESSION][4.13.y][4.14.y][v4.15.y] net: reduce skb_warn_bad_offload() noise

2017-12-11 Thread Willem de Bruijn
On Mon, Dec 11, 2017 at 3:35 PM, Joseph Salisbury
 wrote:
> Hi Eric,
>
> A kernel bug report was opened against Ubuntu [0].  It was found that
> reverting the following commit resolved this bug:

The recorded trace in that bug is against 4.10.0 with some backports.
Given that commit b2504a5dbef3 ("net: reduce skb_warn_bad_offload()
noise") is implicated, I guess that that was backported from 4.11-rc1.

The WARN shows

  e1000e: caps=(0x0030002149a9, 0x)
  len=1701 data_len=1659 gso_size=1480 gso_type=2 ip_summed=0

The numbering changed in 4.14, but for this kernel

  SKB_GSO_UDP = 1 << 1,

so this is a UFO packet with CHECKSUM_NONE. The stack shows

kernel: [570943.494549] skb_warn_bad_offload+0xd1/0x120
kernel: [570943.494550] __skb_gso_segment+0x17d/0x190
kernel: [570943.494564] validate_xmit_skb+0x14f/0x2a0
kernel: [570943.494565] validate_xmit_skb_list+0x43/0x70

so if that patch has been backported, then this must trigger in
__skb_gso_segment on the return path from skb_mac_gso_segment.
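
As a rough aid for decoding such a WARN line, here is a minimal userspace sketch. It assumes the pre-4.14 bit layout quoted above (SKB_GSO_UDP = 1 << 1) and the usual ip_summed numbering (CHECKSUM_NONE = 0 through CHECKSUM_PARTIAL = 3). OLD_SKB_GSO_UDP, is_ufo() and csum_name() are illustrative names, not kernel symbols.

```c
#include <stdbool.h>

/* Pre-4.14 value quoted in this thread; 4.14 renumbered the GSO bits. */
#define OLD_SKB_GSO_UDP (1u << 1)

/* ip_summed values, following the CHECKSUM_* numbering in skbuff.h. */
enum {
	CSUM_NONE = 0,
	CSUM_UNNECESSARY = 1,
	CSUM_COMPLETE = 2,
	CSUM_PARTIAL = 3,
};

/* True if gso_type marks a UFO (UDP fragmentation offload) packet. */
bool is_ufo(unsigned int gso_type)
{
	return (gso_type & OLD_SKB_GSO_UDP) != 0;
}

/* Human-readable name for an ip_summed value, or "unknown". */
const char *csum_name(unsigned int ip_summed)
{
	switch (ip_summed) {
	case CSUM_NONE:
		return "CHECKSUM_NONE";
	case CSUM_UNNECESSARY:
		return "CHECKSUM_UNNECESSARY";
	case CSUM_COMPLETE:
		return "CHECKSUM_COMPLETE";
	case CSUM_PARTIAL:
		return "CHECKSUM_PARTIAL";
	default:
		return "unknown";
	}
}
```

With the values from the trace, is_ufo(2) is true and csum_name(0) yields CHECKSUM_NONE, which matches the reading above.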

Did you backport

commit 8d63bee643f1fb53e472f0e135cae4eb99d62d19
Author: Willem de Bruijn 
Date:   Tue Aug 8 14:22:55 2017 -0400

net: avoid skb_warn_bad_offload false positives on UFO

skb_warn_bad_offload triggers a warning when an skb enters the GSO
stack at __skb_gso_segment that does not have CHECKSUM_PARTIAL
checksum offload set.

Commit b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise")
observed that SKB_GSO_DODGY producers can trigger the check and
that passing those packets through the GSO handlers will fix it
up. But, the software UFO handler will set ip_summed to
CHECKSUM_NONE.

When __skb_gso_segment is called from the receive path, this
triggers the warning again.

Make UFO set CHECKSUM_UNNECESSARY instead of CHECKSUM_NONE. On
Tx these two are equivalent. On Rx, this better matches the
skb state (checksum computed), as CHECKSUM_NONE here means no
checksum computed.

See also this thread for context:
http://patchwork.ozlabs.org/patch/799015/

Fixes: b2504a5dbef3 ("net: reduce skb_warn_bad_offload() noise")
Signed-off-by: Willem de Bruijn 
Signed-off-by: David S. Miller 

Note that UFO was removed in 4.14 and that skb_warn_bad_offload
can happen for various types of packets, so there may be multiple
independent bug reports. I'm investigating two other non-UFO reports
just now.


[PATCH iproute2 1/1] ss: remove duplicate assignment

2017-12-11 Thread Roman Mashak
Signed-off-by: Roman Mashak 
---
 misc/ss.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/misc/ss.c b/misc/ss.c
index 90da93e..da52d5e 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -2306,7 +2306,6 @@ static void tcp_show_info(const struct nlmsghdr *nlh, 
struct inet_diag_msg *r,
s.sacked = info->tcpi_sacked;
s.fackets= info->tcpi_fackets;
s.reordering = info->tcpi_reordering;
-   s.rcv_space  = info->tcpi_rcv_space;
s.rcv_ssthresh   = info->tcpi_rcv_ssthresh;
s.cwnd   = info->tcpi_snd_cwnd;
 
-- 
2.7.4



Re: [PATCH net,stable] net: qmi_wwan: add Quectel BG96 2c7c:0296

2017-12-11 Thread Sebastian Sjoholm
Hi,

Sorry for re-sending the patch below; it was clearly a beginner's mistake on
my part not to clear my tmp/ folder.

Please disregard this.

Regards,
Sebastian

> On Dec 11, 2017, at 21:12 , ssjoh...@mac.com wrote:
> 
> From: Sebastian Sjoholm 
> 
> Quectel BG96 is an Qualcomm MDM9206 based IoT modem, supporting both 
> CAT-M and NB-IoT. Tested hardware is BG96 mounted on Quectel development 
> board (EVB). The USB id is added to qmi_wwan.c to allow QMI 
> communication with the BG96.
> 
> Signed-off-by: Sebastian Sjoholm 
> 
> ---
> drivers/net/usb/qmi_wwan.c | 1 +
> 1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
> index 720a3a248070..c750cf7c042b 100644
> --- a/drivers/net/usb/qmi_wwan.c
> +++ b/drivers/net/usb/qmi_wwan.c
> @@ -1239,6 +1239,7 @@ static const struct usb_device_id products[] = {
>   {QMI_FIXED_INTF(0x1e0e, 0x9001, 5)},/* SIMCom 7230E */
>   {QMI_QUIRK_SET_DTR(0x2c7c, 0x0125, 4)}, /* Quectel EC25, EC20 R2.0  
> Mini PCIe */
>   {QMI_QUIRK_SET_DTR(0x2c7c, 0x0121, 4)}, /* Quectel EC21 Mini PCIe */
> + {QMI_FIXED_INTF(0x2c7c, 0x0296, 4)},/* Quectel BG96 */
> 
>   /* 4. Gobi 1000 devices */
>   {QMI_GOBI1K_DEVICE(0x05c6, 0x9212)},/* Acer Gobi Modem Device */
> -- 
> 2.11.0 (Apple Git-81)
> 



[PATCH ipsec-next] xfrm: check for xdo_dev_state_free

2017-12-11 Thread Shannon Nelson
The current XFRM code assumes that we've implemented the
xdo_dev_state_free() callback, even if it is meaningless to the driver.
This patch adds a check for it before calling, as is done in other APIs
and as is already done for the xdo_state_offload_ok() callback.

Also, we add a check for the required add and delete functions up front
at registration time to be sure both are defined, and complain if not.

Signed-off-by: Shannon Nelson 
---
 include/net/xfrm.h |  3 ++-
 net/xfrm/xfrm_device.c | 18 ++
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index e015e16..dfabd04 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1891,7 +1891,8 @@ static inline void xfrm_dev_state_free(struct xfrm_state 
*x)
 struct net_device *dev = xso->dev;
 
if (dev && dev->xfrmdev_ops) {
-   dev->xfrmdev_ops->xdo_dev_state_free(x);
+   if (dev->xfrmdev_ops->xdo_dev_state_free)
+   dev->xfrmdev_ops->xdo_dev_state_free(x);
xso->dev = NULL;
dev_put(dev);
}
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 30e5746..0df1cc2 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -144,11 +144,21 @@ EXPORT_SYMBOL_GPL(xfrm_dev_offload_ok);
 
 static int xfrm_dev_register(struct net_device *dev)
 {
-   if ((dev->features & NETIF_F_HW_ESP) && !dev->xfrmdev_ops)
-   return NOTIFY_BAD;
-   if ((dev->features & NETIF_F_HW_ESP_TX_CSUM) &&
-   !(dev->features & NETIF_F_HW_ESP))
+   if (!(dev->features & NETIF_F_HW_ESP)) {
+   if (dev->features & NETIF_F_HW_ESP_TX_CSUM) {
+   netdev_err(dev, "NETIF_F_HW_ESP_TX_CSUM without 
NETIF_F_HW_ESP\n");
+   return NOTIFY_BAD;
+   } else {
+   return NOTIFY_DONE;
+   }
+   }
+
+   if (!(dev->xfrmdev_ops &&
+ dev->xfrmdev_ops->xdo_dev_state_add &&
+ dev->xfrmdev_ops->xdo_dev_state_delete)) {
+   netdev_err(dev, "add or delete function missing from 
xfrmdev_ops\n");
return NOTIFY_BAD;
+   }
 
return NOTIFY_DONE;
 }
-- 
2.7.4
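
The pattern the patch applies (required callbacks validated once at registration, optional callbacks checked at each call site) can be sketched in plain userspace C. This is only an illustration under assumed names: demo_ops, demo_state and the demo_* functions are hypothetical stand-ins, not the xfrmdev_ops API.

```c
#include <stddef.h>

struct demo_state {
	int freed;
};

/* Hypothetical ops table shaped loosely like xfrmdev_ops. */
struct demo_ops {
	int (*state_add)(struct demo_state *x);		/* required */
	void (*state_delete)(struct demo_state *x);	/* required */
	void (*state_free)(struct demo_state *x);	/* optional */
};

/* Sample driver callbacks used to exercise the checks. */
int demo_add(struct demo_state *x)
{
	x->freed = 0;
	return 0;
}

void demo_delete(struct demo_state *x)
{
	(void)x;
}

void demo_free(struct demo_state *x)
{
	x->freed = 1;
}

/* Registration-time check: both required callbacks must be present. */
int demo_register(const struct demo_ops *ops)
{
	if (!(ops && ops->state_add && ops->state_delete))
		return -1;
	return 0;
}

/* Call-site check: invoke the optional callback only if it was provided. */
void demo_state_free(const struct demo_ops *ops, struct demo_state *x)
{
	if (ops && ops->state_free)
		ops->state_free(x);
}
```

Checking the required callbacks once at registration keeps the hot paths free of repeated NULL tests, while the optional callback still needs a test wherever it is called.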



[PATCH net,stable] net: qmi_wwan: add Sierra EM7565 1199:9091

2017-12-11 Thread ssjoholm
From: Sebastian Sjoholm 

Sierra Wireless EM7565 is a Qualcomm MDM9x50-based M.2 modem.
The USB id is added to qmi_wwan.c to allow QMI communication 
with the EM7565.

Signed-off-by: Sebastian Sjoholm 
Acked-by: Bjørn Mork 
---
[The corresponding qcserial patch will be submitted by Reinhard Speyerer.]
---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 304ec6555cd8..3cebd6683938 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1204,6 +1204,7 @@ static const struct usb_device_id products[] = {
{QMI_FIXED_INTF(0x1199, 0x9079, 10)},   /* Sierra Wireless EM74xx */
{QMI_FIXED_INTF(0x1199, 0x907b, 8)},/* Sierra Wireless EM74xx */
{QMI_FIXED_INTF(0x1199, 0x907b, 10)},   /* Sierra Wireless EM74xx */
+   {QMI_FIXED_INTF(0x1199, 0x9091, 8)},/* Sierra Wireless EM7565 */
{QMI_FIXED_INTF(0x1bbb, 0x011e, 4)},/* Telekom Speedstick LTE II 
(Alcatel One Touch L100V LTE) */
{QMI_FIXED_INTF(0x1bbb, 0x0203, 2)},/* Alcatel L800MA */
{QMI_FIXED_INTF(0x2357, 0x0201, 4)},/* TP-LINK HSUPA Modem MA180 */
-- 
2.14.1



[REGRESSION][4.13.y][4.14.y][v4.15.y] net: reduce skb_warn_bad_offload() noise

2017-12-11 Thread Joseph Salisbury
Hi Eric,

A kernel bug report was opened against Ubuntu [0].  It was found that
reverting the following commit resolved this bug:

commit b2504a5dbef3305ef41988ad270b0e8ec289331c
Author: Eric Dumazet 
Date:   Tue Jan 31 10:20:32 2017 -0800

    net: reduce skb_warn_bad_offload() noise
   

The regression was introduced as of v4.11-rc1 and still exists in
current mainline.
   
I was hoping to get your feedback, since you are the patch author.  Do
you think gathering any additional data will help diagnose this issue,
or would it be best to submit a revert request?
   
This commit did in fact resolve another bug[1], but in the process
introduced this regression.
  
Thanks,

Joe

[0] http://pad.lv/1715609
[1] http://pad.lv/1705447



[PATCH v3 net-next 1/9] lwt: Add net to build_state argument

2017-12-11 Thread Tom Herbert
Users of LWT need to know the net namespace if they want to perform
per-net operations in LWT.

Signed-off-by: Tom Herbert 
---
 include/net/lwtunnel.h|  6 +++---
 net/core/lwt_bpf.c|  2 +-
 net/core/lwtunnel.c   |  4 ++--
 net/ipv4/fib_semantics.c  | 13 -
 net/ipv4/ip_tunnel_core.c |  4 ++--
 net/ipv6/ila/ila_lwt.c|  2 +-
 net/ipv6/route.c  |  2 +-
 net/ipv6/seg6_iptunnel.c  |  2 +-
 net/ipv6/seg6_local.c |  5 +++--
 net/mpls/mpls_iptunnel.c  |  2 +-
 10 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
index d747ef975cd8..da5e51e0d122 100644
--- a/include/net/lwtunnel.h
+++ b/include/net/lwtunnel.h
@@ -34,7 +34,7 @@ struct lwtunnel_state {
 };
 
 struct lwtunnel_encap_ops {
-   int (*build_state)(struct nlattr *encap,
+   int (*build_state)(struct net *net, struct nlattr *encap,
   unsigned int family, const void *cfg,
   struct lwtunnel_state **ts,
   struct netlink_ext_ack *extack);
@@ -113,7 +113,7 @@ int lwtunnel_valid_encap_type(u16 encap_type,
  struct netlink_ext_ack *extack);
 int lwtunnel_valid_encap_type_attr(struct nlattr *attr, int len,
   struct netlink_ext_ack *extack);
-int lwtunnel_build_state(u16 encap_type,
+int lwtunnel_build_state(struct net *net, u16 encap_type,
 struct nlattr *encap,
 unsigned int family, const void *cfg,
 struct lwtunnel_state **lws,
@@ -192,7 +192,7 @@ static inline int lwtunnel_valid_encap_type_attr(struct 
nlattr *attr, int len,
return 0;
 }
 
-static inline int lwtunnel_build_state(u16 encap_type,
+static inline int lwtunnel_build_state(struct net *net, u16 encap_type,
   struct nlattr *encap,
   unsigned int family, const void *cfg,
   struct lwtunnel_state **lws,
diff --git a/net/core/lwt_bpf.c b/net/core/lwt_bpf.c
index e7e626fb87bb..3a3ac13fcf06 100644
--- a/net/core/lwt_bpf.c
+++ b/net/core/lwt_bpf.c
@@ -238,7 +238,7 @@ static const struct nla_policy bpf_nl_policy[LWT_BPF_MAX + 
1] = {
[LWT_BPF_XMIT_HEADROOM] = { .type = NLA_U32 },
 };
 
-static int bpf_build_state(struct nlattr *nla,
+static int bpf_build_state(struct net *net, struct nlattr *nla,
   unsigned int family, const void *cfg,
   struct lwtunnel_state **ts,
   struct netlink_ext_ack *extack)
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index 0b171756453c..b3f2f77dfe72 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -103,7 +103,7 @@ int lwtunnel_encap_del_ops(const struct lwtunnel_encap_ops 
*ops,
 }
 EXPORT_SYMBOL_GPL(lwtunnel_encap_del_ops);
 
-int lwtunnel_build_state(u16 encap_type,
+int lwtunnel_build_state(struct net *net, u16 encap_type,
 struct nlattr *encap, unsigned int family,
 const void *cfg, struct lwtunnel_state **lws,
 struct netlink_ext_ack *extack)
@@ -124,7 +124,7 @@ int lwtunnel_build_state(u16 encap_type,
ops = rcu_dereference(lwtun_encaps[encap_type]);
if (likely(ops && ops->build_state && try_module_get(ops->owner))) {
found = true;
-   ret = ops->build_state(encap, family, cfg, lws, extack);
+   ret = ops->build_state(net, encap, family, cfg, lws, extack);
if (ret)
module_put(ops->owner);
}
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index f04d944f8abe..4979e5c6b9b8 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -523,6 +523,7 @@ static int fib_get_nhs(struct fib_info *fi, struct 
rtnexthop *rtnh,
if (nla) {
struct lwtunnel_state *lwtstate;
struct nlattr *nla_entype;
+   struct net *net = cfg->fc_nlinfo.nl_net;
 
nla_entype = nla_find(attrs, attrlen,
  RTA_ENCAP_TYPE);
@@ -533,7 +534,7 @@ static int fib_get_nhs(struct fib_info *fi, struct 
rtnexthop *rtnh,
goto err_inval;
}
 
-   ret = lwtunnel_build_state(nla_get_u16(
+   ret = lwtunnel_build_state(net, nla_get_u16(
   nla_entype),
   nla,  AF_INET, cfg,
   &lwtstate, extack);
@@ -607,7 +608,7 @@ static void fib_rebalance(struct fib_info *fi)
 
 #endif /* CONFIG_IP_ROUTE_MULTIPATH */
 
-static int fib_encap_

[PATCH v3 net-next 7/9] ila: Resolver mechanism

2017-12-11 Thread Tom Herbert
Implement an ILA resolver. This uses LWT to implement the hook to a
userspace resolver and tracks pending unresolved addresses using the
backend net resolver.

The idea is that the kernel sets an ILA resolver route to the
SIR prefix, something like:

ip route add ::/64 encap ila-resolve \
 via 2401:db00:20:911a::27:0 dev eth0

When a packet hits the route the address is looked up in a resolver
table. If the entry is created (no entry with the address already
exists) then an rtnl message is generated with group
RTNLGRP_ILA_NOTIFY and type RTM_ADDR_RESOLVE. A userspace
daemon can listen for such messages and perform an ILA resolution
protocol to determine the ILA mapping. If the mapping is resolved
then a /128 ILA encap route is set so that the host can perform
ILA translation and send directly to the destination.

Signed-off-by: Tom Herbert 
---
 include/uapi/linux/ila.h   |   9 ++
 include/uapi/linux/lwtunnel.h  |   1 +
 include/uapi/linux/rtnetlink.h |   8 +-
 net/core/lwtunnel.c|   2 +
 net/ipv6/Kconfig   |   1 +
 net/ipv6/ila/Makefile  |   2 +-
 net/ipv6/ila/ila.h |  11 ++
 net/ipv6/ila/ila_lwt.c |   8 ++
 net/ipv6/ila/ila_main.c|  14 +++
 net/ipv6/ila/ila_resolver.c| 244 +
 10 files changed, 298 insertions(+), 2 deletions(-)
 create mode 100644 net/ipv6/ila/ila_resolver.c

diff --git a/include/uapi/linux/ila.h b/include/uapi/linux/ila.h
index db45d3e49a12..66557265bf5b 100644
--- a/include/uapi/linux/ila.h
+++ b/include/uapi/linux/ila.h
@@ -65,4 +65,13 @@ enum {
ILA_HOOK_ROUTE_INPUT,
 };
 
+enum {
+   ILA_NOTIFY_ATTR_UNSPEC,
+   ILA_NOTIFY_ATTR_TIMEOUT,/* u32 */
+
+   __ILA_NOTIFY_ATTR_MAX,
+};
+
+#define ILA_NOTIFY_ATTR_MAX(__ILA_NOTIFY_ATTR_MAX - 1)
+
 #endif /* _UAPI_LINUX_ILA_H */
diff --git a/include/uapi/linux/lwtunnel.h b/include/uapi/linux/lwtunnel.h
index de696ca12f2c..2eac16f8323f 100644
--- a/include/uapi/linux/lwtunnel.h
+++ b/include/uapi/linux/lwtunnel.h
@@ -13,6 +13,7 @@ enum lwtunnel_encap_types {
LWTUNNEL_ENCAP_SEG6,
LWTUNNEL_ENCAP_BPF,
LWTUNNEL_ENCAP_SEG6_LOCAL,
+   LWTUNNEL_ENCAP_ILA_NOTIFY,
__LWTUNNEL_ENCAP_MAX,
 };
 
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index d8b5f80c2ea6..8d358a300d8a 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -13,7 +13,8 @@
  */
 #define RTNL_FAMILY_IPMR   128
 #define RTNL_FAMILY_IP6MR  129
-#define RTNL_FAMILY_MAX129
+#define RTNL_FAMILY_ILA130
+#define RTNL_FAMILY_MAX130
 
 /
  * Routing/neighbour discovery messages.
@@ -150,6 +151,9 @@ enum {
RTM_NEWCACHEREPORT = 96,
 #define RTM_NEWCACHEREPORT RTM_NEWCACHEREPORT
 
+   RTM_ADDR_RESOLVE = 98,
+#define RTM_ADDR_RESOLVE RTM_ADDR_RESOLVE
+
__RTM_MAX,
 #define RTM_MAX(((__RTM_MAX + 3) & ~3) - 1)
 };
@@ -676,6 +680,8 @@ enum rtnetlink_groups {
 #define RTNLGRP_IPV4_MROUTE_R  RTNLGRP_IPV4_MROUTE_R
RTNLGRP_IPV6_MROUTE_R,
 #define RTNLGRP_IPV6_MROUTE_R  RTNLGRP_IPV6_MROUTE_R
+   RTNLGRP_ILA_NOTIFY,
+#define RTNLGRP_ILA_NOTIFY RTNLGRP_ILA_NOTIFY
__RTNLGRP_MAX
 };
 #define RTNLGRP_MAX(__RTNLGRP_MAX - 1)
diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
index b3f2f77dfe72..16b04d05e9b9 100644
--- a/net/core/lwtunnel.c
+++ b/net/core/lwtunnel.c
@@ -46,6 +46,8 @@ static const char *lwtunnel_encap_str(enum 
lwtunnel_encap_types encap_type)
return "BPF";
case LWTUNNEL_ENCAP_SEG6_LOCAL:
return "SEG6LOCAL";
+   case LWTUNNEL_ENCAP_ILA_NOTIFY:
+   return "ILA-NOTIFY";
case LWTUNNEL_ENCAP_IP6:
case LWTUNNEL_ENCAP_IP:
case LWTUNNEL_ENCAP_NONE:
diff --git a/net/ipv6/Kconfig b/net/ipv6/Kconfig
index ea71e4b0ab7a..5b0a6e1bd7cc 100644
--- a/net/ipv6/Kconfig
+++ b/net/ipv6/Kconfig
@@ -110,6 +110,7 @@ config IPV6_ILA
tristate "IPv6: Identifier Locator Addressing (ILA)"
depends on NETFILTER
select LWTUNNEL
+   select NET_RESOLVER
---help---
  Support for IPv6 Identifier Locator Addressing (ILA).
 
diff --git a/net/ipv6/ila/Makefile b/net/ipv6/ila/Makefile
index b7739aba6e68..3ec2d65ceee2 100644
--- a/net/ipv6/ila/Makefile
+++ b/net/ipv6/ila/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_IPV6_ILA) += ila.o
 
-ila-objs := ila_main.o ila_common.o ila_lwt.o ila_xlat.o
+ila-objs := ila_main.o ila_common.o ila_lwt.o ila_xlat.o ila_resolver.o
diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h
index 1f747bcbec29..02a800c71796 100644
--- a/net/ipv6/ila/ila.h
+++ b/net/ipv6/ila/ila.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -112,6 +113,9 @@ struct ila_net {
unsigned int locks_mask;
bool hooks_registered;

[PATCH v3 net-next 9/9] ila: add netlink control ILA resolver

2017-12-11 Thread Tom Herbert
Add a netlink family to process netlink messages for the ILA resolver.
This calls the net resolver netlink functions.

Signed-off-by: Tom Herbert 
---
 include/uapi/linux/ila.h| 11 
 net/ipv6/ila/ila.h  |  8 ++
 net/ipv6/ila/ila_main.c | 26 ++
 net/ipv6/ila/ila_resolver.c | 67 -
 4 files changed, 111 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/ila.h b/include/uapi/linux/ila.h
index 66557265bf5b..2481dab25d57 100644
--- a/include/uapi/linux/ila.h
+++ b/include/uapi/linux/ila.h
@@ -19,6 +19,8 @@ enum {
ILA_ATTR_CSUM_MODE, /* u8 */
ILA_ATTR_IDENT_TYPE,/* u8 */
ILA_ATTR_HOOK_TYPE, /* u8 */
+   ILA_RSLV_ATTR_DST,  /* IPv6 address */
+   ILA_RSLV_ATTR_TIMEOUT,  /* u32 */
 
__ILA_ATTR_MAX,
 };
@@ -31,6 +33,10 @@ enum {
ILA_CMD_DEL,
ILA_CMD_GET,
ILA_CMD_FLUSH,
+   ILA_RSLV_CMD_ADD,
+   ILA_RSLV_CMD_DEL,
+   ILA_RSLV_CMD_GET,
+   ILA_RSLV_CMD_FLUSH,
 
__ILA_CMD_MAX,
 };
@@ -68,10 +74,15 @@ enum {
 enum {
ILA_NOTIFY_ATTR_UNSPEC,
ILA_NOTIFY_ATTR_TIMEOUT,/* u32 */
+   ILA_NOTIFY_ATTR_DST,/* Binary address */
 
__ILA_NOTIFY_ATTR_MAX,
 };
 
 #define ILA_NOTIFY_ATTR_MAX(__ILA_NOTIFY_ATTR_MAX - 1)
 
+/* NETLINK_GENERIC related info */
+#define ILA_RSLV_GENL_NAME "ila-rslv"
+#define ILA_RSLV_GENL_VERSION  0x1
+
 #endif /* _UAPI_LINUX_ILA_H */
diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h
index 02a800c71796..0aa99e359a38 100644
--- a/net/ipv6/ila/ila.h
+++ b/net/ipv6/ila/ila.h
@@ -137,6 +137,14 @@ int ila_xlat_nl_dump_start(struct netlink_callback *cb);
 int ila_xlat_nl_dump_done(struct netlink_callback *cb);
 int ila_xlat_nl_dump(struct sk_buff *skb, struct netlink_callback *cb);
 
+int ila_rslv_nl_cmd_add(struct sk_buff *skb, struct genl_info *info);
+int ila_rslv_nl_cmd_del(struct sk_buff *skb, struct genl_info *info);
+int ila_rslv_nl_cmd_get(struct sk_buff *skb, struct genl_info *info);
+int ila_rslv_nl_cmd_flush(struct sk_buff *skb, struct genl_info *info);
+int ila_rslv_nl_dump_start(struct netlink_callback *cb);
+int ila_rslv_nl_dump_done(struct netlink_callback *cb);
+int ila_rslv_nl_dump(struct sk_buff *skb, struct netlink_callback *cb);
+
 extern unsigned int ila_net_id;
 
 extern struct genl_family ila_nl_family;
diff --git a/net/ipv6/ila/ila_main.c b/net/ipv6/ila/ila_main.c
index 411d3d112157..8589d422568b 100644
--- a/net/ipv6/ila/ila_main.c
+++ b/net/ipv6/ila/ila_main.c
@@ -40,6 +40,32 @@ static const struct genl_ops ila_nl_ops[] = {
.done = ila_xlat_nl_dump_done,
.policy = ila_nl_policy,
},
+   {
+   .cmd = ILA_RSLV_CMD_ADD,
+   .doit = ila_rslv_nl_cmd_add,
+   .policy = ila_nl_policy,
+   .flags = GENL_ADMIN_PERM,
+   },
+   {
+   .cmd = ILA_RSLV_CMD_DEL,
+   .doit = ila_rslv_nl_cmd_del,
+   .policy = ila_nl_policy,
+   .flags = GENL_ADMIN_PERM,
+   },
+   {
+   .cmd = ILA_RSLV_CMD_FLUSH,
+   .doit = ila_rslv_nl_cmd_flush,
+   .policy = ila_nl_policy,
+   .flags = GENL_ADMIN_PERM,
+   },
+   {
+   .cmd = ILA_RSLV_CMD_GET,
+   .doit = ila_rslv_nl_cmd_get,
+   .start = ila_rslv_nl_dump_start,
+   .dumpit = ila_rslv_nl_dump,
+   .done = ila_rslv_nl_dump_done,
+   .policy = ila_nl_policy,
+   },
 };
 
 unsigned int ila_net_id;
diff --git a/net/ipv6/ila/ila_resolver.c b/net/ipv6/ila/ila_resolver.c
index 2aebc0526221..3278e93bb799 100644
--- a/net/ipv6/ila/ila_resolver.c
+++ b/net/ipv6/ila/ila_resolver.c
@@ -209,6 +209,13 @@ static const struct lwtunnel_encap_ops ila_rslv_ops = {
 
 #define ILA_MAX_SIZE 8192
 
+static struct net_rslv_netlink_map ila_netlink_map = {
+   .dst_attr = ILA_RSLV_ATTR_DST,
+   .timo_attr = ILA_RSLV_ATTR_TIMEOUT,
+   .get_cmd = ILA_RSLV_CMD_GET,
+   .genl_family = &ila_nl_family,
+};
+
 int ila_rslv_init_net(struct net *net)
 {
struct ila_net *ilan = net_generic(net, ila_net_id);
@@ -216,7 +223,7 @@ int ila_rslv_init_net(struct net *net)
 
nrslv = net_rslv_create(sizeof(struct ila_addr),
sizeof(struct ila_addr), ILA_MAX_SIZE, NULL,
-   NULL);
+   &ila_netlink_map);
 
if (IS_ERR(nrslv))
return PTR_ERR(nrslv);
@@ -234,6 +241,64 @@ void ila_rslv_exit_net(struct net *net)
net_rslv_destroy(ilan->rslv.nrslv);
 }
 
+/* Netlink access */
+
+int ila_rslv_nl_cmd_add(struct sk_buff *skb, struct genl_info *info)
+{
+   struct net *net = sock_net(skb->sk);
+   struct ila_net *ilan = net_generic(net

[PATCH v3 net-next 6/9] net: Generic resolver backend

2017-12-11 Thread Tom Herbert
This patch implements the backend of a resolver, specifically it
provides a means to track unresolved addresses and expire entries
based on timeout.

The resolver is mostly a frontend to an rhashtable where the key
of the table is whatever address type or object is tracked. A resolver
instance is created by net_rslv_create. A resolver is destroyed by
net_rslv_destroy.

There are two functions that are used to manipulate entries in the
table: net_rslv_lookup_and_create and net_rslv_resolved.

net_rslv_lookup_and_create is called with an unresolved address as
the argument. It returns zero on success and an error on failure.
When called, a lookup is performed to see if an entry for the address
is already in the table; if it is, then -EEXIST is returned.  If
an entry is not found, one is created and zero is returned.  It is
expected that when an entry is new the address resolution protocol is
initiated (for instance a RTM_ADDR_RESOLVE message may be sent to a
userspace daemon as we will do in ILA). If net_rslv_lookup_and_create
returns an error other than -EEXIST then presumably the hash table
has reached its limit on the number of outstanding unresolved
addresses; the caller should take appropriate action to avoid spamming
the resolution protocol.
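
The return-value contract described above can be illustrated with a toy userspace table. This is a sketch under stated assumptions, not the rhashtable-based implementation in the patch: the toy_* names, the fixed-size array, and the use of -ENOSPC for a full table are all hypothetical choices made for the example.

```c
#include <errno.h>
#include <string.h>

#define TOY_MAX    4	/* stand-in for max_size */
#define TOY_KEYLEN 16	/* e.g. the size of an IPv6 address */

struct toy_rslv {
	unsigned char keys[TOY_MAX][TOY_KEYLEN];
	int used[TOY_MAX];
};

/*
 * Returns 0 if a new pending entry was created, -EEXIST if the key is
 * already pending, and -ENOSPC (an assumed error code) if the table is full.
 */
int toy_lookup_and_create(struct toy_rslv *t, const void *key)
{
	int free_slot = -1;

	for (int i = 0; i < TOY_MAX; i++) {
		if (t->used[i]) {
			if (memcmp(t->keys[i], key, TOY_KEYLEN) == 0)
				return -EEXIST;
		} else if (free_slot < 0) {
			free_slot = i;
		}
	}
	if (free_slot < 0)
		return -ENOSPC;

	memcpy(t->keys[free_slot], key, TOY_KEYLEN);
	t->used[free_slot] = 1;
	return 0;
}

/* Counterpart of net_rslv_resolved: drop the pending entry, if any. */
void toy_resolved(struct toy_rslv *t, const void *key)
{
	for (int i = 0; i < TOY_MAX; i++) {
		if (t->used[i] && memcmp(t->keys[i], key, TOY_KEYLEN) == 0)
			t->used[i] = 0;
	}
}

/* Caller policy: only a freshly created entry triggers a resolution request. */
int toy_should_notify(struct toy_rslv *t, const void *key)
{
	return toy_lookup_and_create(t, key) == 0;
}
```

The point of the policy function is that both -EEXIST and a full table suppress the notification, so a flood of packets to one unresolved address produces at most one request per pending entry.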

net_rslv_resolved is called when resolution is complete (e.g. an
ILA locator mapping was instantiated for a locator). The entry is
removed from the hash table.

An argument to net_rslv_create indicates a time for the pending
resolution in milliseconds. If the timer fires before resolution
then the entry is removed from the table. Subsequently, another
attempt to resolve the same address will result in a new entry in
the table.

There is one callback function that can be set as an argument in
net_rslv_create:

   - cmp_fn: Compare function for hash table. Arguments are the
   key and an object in the table. If this is NULL then the
   default memcmp of rhashtable is used.

DoS mitigation is done by limiting the number of entries in the
resolver table (the max_size argument of net_rslv_create)
and setting a timeout. If the timeout is set then the maximum rate
of new resolution requests is max_table_size / timeout. For
instance, with a maximum size of 1000 entries and a timeout of 100
msecs the maximum rate of resolution requests is 10,000/s.
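
The bound max_table_size / timeout is easy to check numerically. The helper below is a minimal sketch with a hypothetical name; it takes the timeout in milliseconds and returns requests per second, so 1000 entries with a 100 ms timeout give 1000 / 0.1 s = 10,000 requests per second.

```c
/*
 * Upper bound on new resolution requests per second for a table of
 * max_entries pending entries that each expire after timeout_ms.
 * Returns 0 when no timeout is set, since the bound does not apply then.
 */
unsigned long max_resolutions_per_sec(unsigned long max_entries,
				      unsigned long timeout_ms)
{
	if (timeout_ms == 0)
		return 0;
	return max_entries * 1000UL / timeout_ms;
}
```

Note that the bound scales inversely with the timeout: a shorter timeout frees slots sooner and therefore permits a higher request rate.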

Signed-off-by: Tom Herbert 
---
 include/net/resolver.h  |  43 
 net/Kconfig |   1 +
 net/Makefile|   1 +
 net/resolver/Kconfig|   7 ++
 net/resolver/Makefile   |   8 ++
 net/resolver/resolver.c | 283 
 6 files changed, 343 insertions(+)
 create mode 100644 include/net/resolver.h
 create mode 100644 net/resolver/Kconfig
 create mode 100644 net/resolver/Makefile
 create mode 100644 net/resolver/resolver.c

diff --git a/include/net/resolver.h b/include/net/resolver.h
new file mode 100644
index ..f38c7e9f1205
--- /dev/null
+++ b/include/net/resolver.h
@@ -0,0 +1,43 @@
+/*
+ * Generic network address resolver backend
+ *
+ * Copyright (c) 2017 Tom Herbert 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ */
+
+#ifndef __NET_RESOLVER_H
+#define __NET_RESOLVER_H
+
+#include 
+#include 
+
+struct net_rslv;
+
+typedef int (*net_rslv_cmpfn)(struct net_rslv *nrslv, const void *key,
+ const void *object);
+
+struct net_rslv {
+   struct rhashtable rhash_table;
+   struct rhashtable_params params;
+   net_rslv_cmpfn rslv_cmp;
+   size_t obj_size;
+   spinlock_t *locks;
+   unsigned int locks_mask;
+   unsigned int hash_rnd;
+};
+
+struct net_rslv *net_rslv_create(size_t obj_size, size_t key_len,
+size_t max_size, net_rslv_cmpfn cmp_fn);
+
+void net_rslv_destroy(struct net_rslv *nrslv);
+
+int net_rslv_lookup_and_create(struct net_rslv *nrslv, void *key,
+  unsigned int timeout);
+
+void net_rslv_resolved(struct net_rslv *nrslv, void *key);
+
+#endif /* __NET_RESOLVER_H */
diff --git a/net/Kconfig b/net/Kconfig
index 9dba2715919d..b1e73325de6a 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -399,6 +399,7 @@ source "net/ceph/Kconfig"
 source "net/nfc/Kconfig"
 source "net/psample/Kconfig"
 source "net/ife/Kconfig"
+source "net/resolver/Kconfig"
 
 config LWTUNNEL
bool "Network light weight tunnels"
diff --git a/net/Makefile b/net/Makefile
index 14fede520840..6b3b0c5e676a 100644
--- a/net/Makefile
+++ b/net/Makefile
@@ -86,3 +86,4 @@ obj-y += l3mdev/
 endif
 obj-$(CONFIG_QRTR) += qrtr/
 obj-$(CONFIG_NET_NCSI) += ncsi/
+obj-$(CONFIG_NET_RESOLVER) += resolver/
diff --git a/net/resolver/Kconfig b/net/resolver/Kconfig
new file mode 100644
index ..99eff276e

[PATCH v3 net-next 5/9] ila: Flush netlink command to clear xlat table

2017-12-11 Thread Tom Herbert
Add ILA_CMD_FLUSH netlink command to clear the ILA translation table.

Signed-off-by: Tom Herbert 
---
 include/uapi/linux/ila.h |  1 +
 net/ipv6/ila/ila.h   |  1 +
 net/ipv6/ila/ila_main.c  |  6 +
 net/ipv6/ila/ila_xlat.c  | 62 ++--
 4 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/ila.h b/include/uapi/linux/ila.h
index 483b77af4eb8..db45d3e49a12 100644
--- a/include/uapi/linux/ila.h
+++ b/include/uapi/linux/ila.h
@@ -30,6 +30,7 @@ enum {
ILA_CMD_ADD,
ILA_CMD_DEL,
ILA_CMD_GET,
+   ILA_CMD_FLUSH,
 
__ILA_CMD_MAX,
 };
diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h
index faba7824ea56..1f747bcbec29 100644
--- a/net/ipv6/ila/ila.h
+++ b/net/ipv6/ila/ila.h
@@ -123,6 +123,7 @@ void ila_xlat_exit_net(struct net *net);
 int ila_xlat_nl_cmd_add_mapping(struct sk_buff *skb, struct genl_info *info);
 int ila_xlat_nl_cmd_del_mapping(struct sk_buff *skb, struct genl_info *info);
 int ila_xlat_nl_cmd_get_mapping(struct sk_buff *skb, struct genl_info *info);
+int ila_xlat_nl_cmd_flush(struct sk_buff *skb, struct genl_info *info);
 int ila_xlat_nl_dump_start(struct netlink_callback *cb);
 int ila_xlat_nl_dump_done(struct netlink_callback *cb);
 int ila_xlat_nl_dump(struct sk_buff *skb, struct netlink_callback *cb);
diff --git a/net/ipv6/ila/ila_main.c b/net/ipv6/ila/ila_main.c
index f6ac6b14577e..18fac76b9520 100644
--- a/net/ipv6/ila/ila_main.c
+++ b/net/ipv6/ila/ila_main.c
@@ -27,6 +27,12 @@ static const struct genl_ops ila_nl_ops[] = {
.flags = GENL_ADMIN_PERM,
},
{
+   .cmd = ILA_CMD_FLUSH,
+   .doit = ila_xlat_nl_cmd_flush,
+   .policy = ila_nl_policy,
+   .flags = GENL_ADMIN_PERM,
+   },
+   {
.cmd = ILA_CMD_GET,
.doit = ila_xlat_nl_cmd_get_mapping,
.start = ila_xlat_nl_dump_start,
diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
index 610852b3dfa7..6bb1a081ff04 100644
--- a/net/ipv6/ila/ila_xlat.c
+++ b/net/ipv6/ila/ila_xlat.c
@@ -164,9 +164,9 @@ static inline void ila_release(struct ila_map *ila)
kfree_rcu(ila, rcu);
 }
 
-static void ila_free_cb(void *ptr, void *arg)
+static void ila_free_node(struct ila_map *ila)
 {
-   struct ila_map *ila = (struct ila_map *)ptr, *next;
+   struct ila_map *next;
 
/* Assume rcu_readlock held */
while (ila) {
@@ -176,6 +176,11 @@ static void ila_free_cb(void *ptr, void *arg)
}
 }
 
+static void ila_free_cb(void *ptr, void *arg)
+{
+   ila_free_node((struct ila_map *)ptr);
+}
+
 static int ila_xlat_addr(struct sk_buff *skb, bool sir2ila);
 
 static unsigned int
@@ -365,6 +370,59 @@ int ila_xlat_nl_cmd_del_mapping(struct sk_buff *skb, 
struct genl_info *info)
return 0;
 }
 
+static inline spinlock_t *lock_from_ila_map(struct ila_net *ilan,
+   struct ila_map *ila)
+{
+   return ila_get_lock(ilan, ila->xp.ip.locator_match);
+}
+
+int ila_xlat_nl_cmd_flush(struct sk_buff *skb, struct genl_info *info)
+{
+   struct net *net = genl_info_net(info);
+   struct ila_net *ilan = net_generic(net, ila_net_id);
+   struct rhashtable_iter iter;
+   struct ila_map *ila;
+   spinlock_t *lock;
+   int ret;
+
+   ret = rhashtable_walk_init(&ilan->xlat.rhash_table, &iter, GFP_KERNEL);
+   if (ret)
+   goto done;
+
+   rhashtable_walk_start(&iter);
+
+   for (;;) {
+   ila = rhashtable_walk_next(&iter);
+
+   if (IS_ERR(ila)) {
+   if (PTR_ERR(ila) == -EAGAIN)
+   continue;
+   ret = PTR_ERR(ila);
+   goto done;
+   } else if (!ila) {
+   break;
+   }
+
+   lock = lock_from_ila_map(ilan, ila);
+
+   spin_lock(lock);
+
+   ret = rhashtable_remove_fast(&ilan->xlat.rhash_table,
+&ila->node, rht_params);
+   if (!ret)
+   ila_free_node(ila);
+
+   spin_unlock(lock);
+
+   if (ret)
+   break;
+   }
+
+done:
+   rhashtable_walk_stop(&iter);
+   return ret;
+}
+
 static int ila_fill_info(struct ila_map *ila, struct sk_buff *msg)
 {
if (nla_put_u64_64bit(msg, ILA_ATTR_LOCATOR,
-- 
2.11.0



[PATCH v3 net-next 8/9] resolver: add netlink control

2017-12-11 Thread Tom Herbert
Add interfaces into resolver backend that can be used to provide
netlink. The interface includes functions to support the common
netlink commands (get, add, list, delete, and flush). The
frontend that is using the resolver implements the actual
netlink interfaces for its service and calls the backend functions
to provide netlink for the resolver.

Signed-off-by: Tom Herbert 
---
 include/net/resolver.h  |  26 +++-
 net/ipv6/ila/ila_resolver.c |   3 +-
 net/resolver/resolver.c | 280 +++-
 3 files changed, 305 insertions(+), 4 deletions(-)

diff --git a/include/net/resolver.h b/include/net/resolver.h
index f38c7e9f1205..307938ad91a6 100644
--- a/include/net/resolver.h
+++ b/include/net/resolver.h
@@ -14,12 +14,21 @@
 
 #include 
 #include 
+#include 
+#include 
 
 struct net_rslv;
 
 typedef int (*net_rslv_cmpfn)(struct net_rslv *nrslv, const void *key,
  const void *object);
 
+struct net_rslv_netlink_map {
+   int dst_attr;
+   int timo_attr;
+   int get_cmd;
+   struct genl_family *genl_family;
+};
+
 struct net_rslv {
struct rhashtable rhash_table;
struct rhashtable_params params;
@@ -28,10 +37,12 @@ struct net_rslv {
spinlock_t *locks;
unsigned int locks_mask;
unsigned int hash_rnd;
+   const struct net_rslv_netlink_map *nlmap;
 };
 
 struct net_rslv *net_rslv_create(size_t obj_size, size_t key_len,
-size_t max_size, net_rslv_cmpfn cmp_fn);
+size_t max_size, net_rslv_cmpfn cmp_fn,
+const struct net_rslv_netlink_map *nlmap);
 
 void net_rslv_destroy(struct net_rslv *nrslv);
 
@@ -40,4 +51,17 @@ int net_rslv_lookup_and_create(struct net_rslv *nrslv, void 
*key,
 
 void net_rslv_resolved(struct net_rslv *nrslv, void *key);
 
+int net_rslv_nl_cmd_add(struct net_rslv *nrslv, struct sk_buff *skb,
+   struct genl_info *info);
+int net_rslv_nl_cmd_del(struct net_rslv *nrslv, struct sk_buff *skb,
+   struct genl_info *info);
+int net_rslv_nl_cmd_get(struct net_rslv *nrslv, struct sk_buff *skb,
+   struct genl_info *info);
+int net_rslv_nl_cmd_flush(struct net_rslv *nrslv, struct sk_buff *skb,
+ struct genl_info *info);
+int net_rslv_nl_dump_start(struct net_rslv *nrslv, struct netlink_callback 
*cb);
+int net_rslv_nl_dump_done(struct net_rslv *nrslv, struct netlink_callback *cb);
+int net_rslv_nl_dump(struct net_rslv *nrslv, struct sk_buff *skb,
+struct netlink_callback *cb);
+
 #endif /* __NET_RESOLVER_H */
diff --git a/net/ipv6/ila/ila_resolver.c b/net/ipv6/ila/ila_resolver.c
index 8b9a3c5305a4..2aebc0526221 100644
--- a/net/ipv6/ila/ila_resolver.c
+++ b/net/ipv6/ila/ila_resolver.c
@@ -215,7 +215,8 @@ int ila_rslv_init_net(struct net *net)
struct net_rslv *nrslv;
 
nrslv = net_rslv_create(sizeof(struct ila_addr),
-   sizeof(struct ila_addr), ILA_MAX_SIZE, NULL);
+   sizeof(struct ila_addr), ILA_MAX_SIZE, NULL,
+   NULL);
 
if (IS_ERR(nrslv))
return PTR_ERR(nrslv);
diff --git a/net/resolver/resolver.c b/net/resolver/resolver.c
index 32a915ed8f93..e2496b0bf852 100644
--- a/net/resolver/resolver.c
+++ b/net/resolver/resolver.c
@@ -19,11 +19,13 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 #include 
+#include 
 
 struct net_rslv_ent {
struct rhash_head node;
@@ -192,8 +194,8 @@ static int net_rslv_cmp(struct rhashtable_compare_arg *arg,
 #define MAX_LOCKS 1024
 
 struct net_rslv *net_rslv_create(size_t obj_size, size_t key_len,
-size_t max_size,
-net_rslv_cmpfn cmp_fn)
+size_t max_size, net_rslv_cmpfn cmp_fn,
+const struct net_rslv_netlink_map *nlmap)
 {
struct net_rslv *nrslv;
int err;
@@ -212,6 +214,7 @@ struct net_rslv *net_rslv_create(size_t obj_size, size_t 
key_len,
 
nrslv->obj_size = obj_size;
nrslv->rslv_cmp = cmp_fn;
+   nrslv->nlmap = nlmap;
get_random_bytes(&nrslv->hash_rnd, sizeof(nrslv->hash_rnd));
 
nrslv->params.head_offset = offsetof(struct net_rslv_ent, node);
@@ -278,6 +281,279 @@ void net_rslv_destroy(struct net_rslv *nrslv)
 }
 EXPORT_SYMBOL_GPL(net_rslv_destroy);
 
+/* Netlink access utility functions and structures. */
+
+struct net_rslv_params {
+   unsigned int timeout;
+   __u8 key[MAX_ADDR_LEN];
+   size_t keysize;
+};
+
+static int parse_nl_config(struct net_rslv *nrslv, struct genl_info *info,
+  struct net_rslv_params *np)
+{
+   if (!info->attrs[nrslv->nlmap->dst_attr] ||
+   nla_len(info->attrs[nrslv->nlmap->dst_attr]) !=
+

[PATCH v3 net-next 3/9] ila: Call library function alloc_bucket_locks

2017-12-11 Thread Tom Herbert
To allocate the array of bucket locks for the hash table we now
call library function alloc_bucket_spinlocks.
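
For reference, the open-coded sizing that this patch removes can be
sketched in userspace as below. This mirrors the removed lines only (cap
the CPU count at 32, multiply by the locks per CPU, round up to a power of
two, keep size - 1 as the mask); it is not a claim about what
alloc_bucket_spinlocks does internally, and the function names are made up
for the example.

```c
#include <assert.h>

/* Round v up to the next power of two; v must be nonzero. */
unsigned int roundup_pow2(unsigned int v)
{
	unsigned int p = 1;

	assert(v > 0);
	while (p < v)
		p <<= 1;
	return p;
}

/* Mask for an array of bucket locks sized as in the removed code. */
unsigned int bucket_locks_mask(unsigned int nr_cpus,
			       unsigned int locks_per_cpu)
{
	if (nr_cpus > 32)
		nr_cpus = 32;	/* same cap as the removed min_t() */

	return roundup_pow2(nr_cpus * locks_per_cpu) - 1;
}

/* Pick the lock for a given hash value. */
unsigned int bucket_lock_index(unsigned int hash, unsigned int mask)
{
	return hash & mask;
}
```

Because the array size is a power of two, selecting a lock is a single
AND with the mask rather than a modulo.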

Signed-off-by: Tom Herbert 
---
 net/ipv6/ila/ila_xlat.c | 22 +-
 1 file changed, 5 insertions(+), 17 deletions(-)

diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
index 9fca75b9cab3..402193ef74c2 100644
--- a/net/ipv6/ila/ila_xlat.c
+++ b/net/ipv6/ila/ila_xlat.c
@@ -31,26 +31,14 @@ struct ila_net {
bool hooks_registered;
 };
 
+#define MAX_LOCKS 1024
 #define LOCKS_PER_CPU 10
 
 static int alloc_ila_locks(struct ila_net *ilan)
 {
-   unsigned int i, size;
-   unsigned int nr_pcpus = num_possible_cpus();
-
-   nr_pcpus = min_t(unsigned int, nr_pcpus, 32UL);
-   size = roundup_pow_of_two(nr_pcpus * LOCKS_PER_CPU);
-
-   if (sizeof(spinlock_t) != 0) {
-   ilan->locks = kvmalloc(size * sizeof(spinlock_t), GFP_KERNEL);
-   if (!ilan->locks)
-   return -ENOMEM;
-   for (i = 0; i < size; i++)
-   spin_lock_init(&ilan->locks[i]);
-   }
-   ilan->locks_mask = size - 1;
-
-   return 0;
+   return alloc_bucket_spinlocks(&ilan->xlat.locks, &ilan->xlat.locks_mask,
+ MAX_LOCKS, LOCKS_PER_CPU,
+ GFP_KERNEL);
 }
 
 static u32 hashrnd __read_mostly;
@@ -629,7 +617,7 @@ static __net_exit void ila_exit_net(struct net *net)
 
rhashtable_free_and_destroy(&ilan->rhash_table, ila_free_cb, NULL);
 
-   kvfree(ilan->locks);
+   free_bucket_spinlocks(ilan->xlat.locks);
 
if (ilan->hooks_registered)
nf_unregister_net_hooks(net, ila_nf_hook_ops,
-- 
2.11.0



[PATCH v3 net-next 4/9] ila: create main ila source file

2017-12-11 Thread Tom Herbert
Create a main ila file that contains the module initialization functions
as well as netlink definitions. Previously these were defined in
ila_xlat and ila_common. This approach allows better extensibility.

Signed-off-by: Tom Herbert 
---
 net/ipv6/ila/Makefile |   2 +-
 net/ipv6/ila/ila.h|  26 -
 net/ipv6/ila/ila_common.c |  30 --
 net/ipv6/ila/ila_main.c   | 115 ++
 net/ipv6/ila/ila_xlat.c   | 138 +-
 5 files changed, 166 insertions(+), 145 deletions(-)
 create mode 100644 net/ipv6/ila/ila_main.c

diff --git a/net/ipv6/ila/Makefile b/net/ipv6/ila/Makefile
index 4b32e5921e5c..b7739aba6e68 100644
--- a/net/ipv6/ila/Makefile
+++ b/net/ipv6/ila/Makefile
@@ -4,4 +4,4 @@
 
 obj-$(CONFIG_IPV6_ILA) += ila.o
 
-ila-objs := ila_common.o ila_lwt.o ila_xlat.o
+ila-objs := ila_main.o ila_common.o ila_lwt.o ila_xlat.o
diff --git a/net/ipv6/ila/ila.h b/net/ipv6/ila/ila.h
index 3c7a11b62334..faba7824ea56 100644
--- a/net/ipv6/ila/ila.h
+++ b/net/ipv6/ila/ila.h
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -104,9 +105,30 @@ void ila_update_ipv6_locator(struct sk_buff *skb, struct 
ila_params *p,
 
 void ila_init_saved_csum(struct ila_params *p);
 
+struct ila_net {
+   struct {
+   struct rhashtable rhash_table;
+   spinlock_t *locks; /* Bucket locks for entry manipulation */
+   unsigned int locks_mask;
+   bool hooks_registered;
+   } xlat;
+};
+
 int ila_lwt_init(void);
 void ila_lwt_fini(void);
-int ila_xlat_init(void);
-void ila_xlat_fini(void);
+
+int ila_xlat_init_net(struct net *net);
+void ila_xlat_exit_net(struct net *net);
+
+int ila_xlat_nl_cmd_add_mapping(struct sk_buff *skb, struct genl_info *info);
+int ila_xlat_nl_cmd_del_mapping(struct sk_buff *skb, struct genl_info *info);
+int ila_xlat_nl_cmd_get_mapping(struct sk_buff *skb, struct genl_info *info);
+int ila_xlat_nl_dump_start(struct netlink_callback *cb);
+int ila_xlat_nl_dump_done(struct netlink_callback *cb);
+int ila_xlat_nl_dump(struct sk_buff *skb, struct netlink_callback *cb);
+
+extern unsigned int ila_net_id;
+
+extern struct genl_family ila_nl_family;
 
 #endif /* __ILA_H */
diff --git a/net/ipv6/ila/ila_common.c b/net/ipv6/ila/ila_common.c
index 8c88ecf29b93..579310466eac 100644
--- a/net/ipv6/ila/ila_common.c
+++ b/net/ipv6/ila/ila_common.c
@@ -154,33 +154,3 @@ void ila_update_ipv6_locator(struct sk_buff *skb, struct 
ila_params *p,
iaddr->loc = p->locator;
 }
 
-static int __init ila_init(void)
-{
-   int ret;
-
-   ret = ila_lwt_init();
-
-   if (ret)
-   goto fail_lwt;
-
-   ret = ila_xlat_init();
-   if (ret)
-   goto fail_xlat;
-
-   return 0;
-fail_xlat:
-   ila_lwt_fini();
-fail_lwt:
-   return ret;
-}
-
-static void __exit ila_fini(void)
-{
-   ila_xlat_fini();
-   ila_lwt_fini();
-}
-
-module_init(ila_init);
-module_exit(ila_fini);
-MODULE_AUTHOR("Tom Herbert ");
-MODULE_LICENSE("GPL");
diff --git a/net/ipv6/ila/ila_main.c b/net/ipv6/ila/ila_main.c
new file mode 100644
index ..f6ac6b14577e
--- /dev/null
+++ b/net/ipv6/ila/ila_main.c
@@ -0,0 +1,115 @@
+// SPDX-License-Identifier: GPL-2.0
+#include 
+#include 
+#include 
+#include 
+#include "ila.h"
+
+static const struct nla_policy ila_nl_policy[ILA_ATTR_MAX + 1] = {
+   [ILA_ATTR_LOCATOR] = { .type = NLA_U64, },
+   [ILA_ATTR_LOCATOR_MATCH] = { .type = NLA_U64, },
+   [ILA_ATTR_IFINDEX] = { .type = NLA_U32, },
+   [ILA_ATTR_CSUM_MODE] = { .type = NLA_U8, },
+   [ILA_ATTR_IDENT_TYPE] = { .type = NLA_U8, },
+};
+
+static const struct genl_ops ila_nl_ops[] = {
+   {
+   .cmd = ILA_CMD_ADD,
+   .doit = ila_xlat_nl_cmd_add_mapping,
+   .policy = ila_nl_policy,
+   .flags = GENL_ADMIN_PERM,
+   },
+   {
+   .cmd = ILA_CMD_DEL,
+   .doit = ila_xlat_nl_cmd_del_mapping,
+   .policy = ila_nl_policy,
+   .flags = GENL_ADMIN_PERM,
+   },
+   {
+   .cmd = ILA_CMD_GET,
+   .doit = ila_xlat_nl_cmd_get_mapping,
+   .start = ila_xlat_nl_dump_start,
+   .dumpit = ila_xlat_nl_dump,
+   .done = ila_xlat_nl_dump_done,
+   .policy = ila_nl_policy,
+   },
+};
+
+unsigned int ila_net_id;
+
+struct genl_family ila_nl_family __ro_after_init = {
+   .hdrsize= 0,
+   .name   = ILA_GENL_NAME,
+   .version= ILA_GENL_VERSION,
+   .maxattr= ILA_ATTR_MAX,
+   .netnsok= true,
+   .parallel_ops   = true,
+   .module = THIS_MODULE,
+   .ops= ila_nl_ops,
+   .n_ops  = ARRAY_SIZE(ila_nl_ops),
+};
+
+static __net_init int ila_init_net(struct net *net)
+{
+   int err;
+
+   err = ila_xlat_init_net(net);
+ 

[PATCH v3 net-next 0/9] net: Generic network resolver backend and ILA resolver

2017-12-11 Thread Tom Herbert
This patch set implements a generic in-kernel network resolver. The idea is
that an LWT "resolver" route is set in the kernel to cover some prefix.
When a packet hits the route a netlink message is fired to request
resolution and pending resolutions are tracked in a table.

Route resolution works in the following manner:

Initial configuration:

0. An ila-rslv LWT route is set for some network prefix. The route
   includes an optional timeout to expire resolution.

Resolution process

1. Packet is sent to a destination in the prefix being resolved
2. A lookup is performed on the destination address in a table of
   outstanding resolution requests. If no entry is found:
a. A new entry is created for the destination with a timeout
   value as set in the resolver route
b. A netlink "RTM_ADDR_RESOLVE" message is sent to kick the
   resolution protocol or processing
3. The packet is forwarded per the resolver route

When an address is resolved

4. At some point a route is set that resolves the outstanding
   request (for instance a host route is set for the destination).
   The entry is removed from the table. Subsequent packets to the
   destination will hit the new route rather than the resolver
   route since its prefix is longer
5. Resolution entries may time out and be removed from the table.
   A subsequent packet to the destination will kick off a new
   resolution as in #2
6. The resolved route might also be timed out or removed, in which case
   subsequent packets to the same destination can trigger the
   resolution process
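
As a rough illustration of steps 2, 4 and 5 above, the pending-resolution
table can be modelled in userspace C. This is only a sketch under
simplifying assumptions: a small fixed array stands in for the rhashtable,
time is passed in explicitly, and all names (pending_table,
pending_lookup_and_create, pending_resolved, PENDING_MAX) are invented for
the example rather than taken from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PENDING_MAX 4	/* hypothetical table size limit */

struct pending_ent {
	bool in_use;
	uint64_t addr;		/* destination being resolved */
	uint64_t expires_ms;	/* time at which the entry times out */
};

struct pending_table {
	struct pending_ent ents[PENDING_MAX];
};

enum lookup_result {
	RSLV_PENDING,	/* entry already exists: no new request */
	RSLV_REQUEST,	/* entry created: send a resolution request */
	RSLV_FULL,	/* table full: request suppressed */
};

/* Steps 2, 2a, 2b: look up addr and create an entry if none is pending. */
enum lookup_result pending_lookup_and_create(struct pending_table *t,
					     uint64_t addr, uint64_t now_ms,
					     uint64_t timeout_ms)
{
	struct pending_ent *free_ent = NULL;
	size_t i;

	assert(t != NULL);

	for (i = 0; i < PENDING_MAX; i++) {
		struct pending_ent *e = &t->ents[i];

		/* Step 5: drop entries whose timer has fired. */
		if (e->in_use && now_ms >= e->expires_ms)
			e->in_use = false;

		if (e->in_use && e->addr == addr)
			return RSLV_PENDING;
		if (!e->in_use && !free_ent)
			free_ent = e;
	}

	if (!free_ent)
		return RSLV_FULL;

	free_ent->in_use = true;
	free_ent->addr = addr;
	free_ent->expires_ms = now_ms + timeout_ms;
	return RSLV_REQUEST;
}

/* Step 4: a route resolving addr was installed, drop the pending entry. */
void pending_resolved(struct pending_table *t, uint64_t addr)
{
	size_t i;

	assert(t != NULL);

	for (i = 0; i < PENDING_MAX; i++)
		if (t->ents[i].in_use && t->ents[i].addr == addr)
			t->ents[i].in_use = false;
}
```

In this model the caller would send the netlink notification only on
RSLV_REQUEST and forward the packet per the resolver route in every case
(step 3).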

DOS mitigations:

- The number of outstanding resolutions is limited by the size of the
  table
- Timeout of pending entries limits the number of netlink resolution
  messages
- Packets pending resolution are not queued. In the current
  model they can be forwarded to a router that has all reachability
  information (ILA use case for example)

Possible future work

- An optional method to queue packets for pending resolution
- More DOS mitigations. It might make sense to limit the number of
  resolutions per source address etc.

This patch set implements an ILA host side resolver that uses the
generic resolver described above. It uses LWT to implement the hook
to a userspace resolver and tracks pending unresolved addresses using
the backend net resolver.

This patch set contains:

- A generic resolver backend infrastructure. This primarily does two
  things: track unresolved addresses and implement a timeout for
  resolution not happening. These mechanisms provide rate limiting
  control over resolution requests (for instance in ILA it is used
  to rate limit requests to userspace to resolve addresses).
- The ILA resolver. This implements the path from the kernel ILA
  implementation to a userspace daemon to signal that an identifier
  address needs to be resolved.
- Routing messages are used over netlink to indicate resolution
  requests.
- Add net to ila build_state
- Add flush command to ila_xlat
- Fix uses for rhashtable for latest fixes

v3:
 - Removed rhashtable changes to their own patch set
 - Restructure ILA code to be more amenable to changes
 - Remove extra call back functions in resolution interface

Changes from initial RFC:

 - Added net argument to LWT build_state
 - Made resolve timeout an attribute of the LWT encap route
 - Changed ILA notifications to be regular routing messages of event
   RTM_ADDR_RESOLVE, family RTNL_FAMILY_ILA, and group
   RTNLGRP_ILA_NOTIFY


Tom Herbert (9):
  lwt: Add net to build_state argument
  ila: Fix use of rhashtable walk in ila_xlat.c
  ila: Call library function alloc_bucket_locks
  ila: create main ila source file
  ila: Flush netlink command to clear xlat table
  net: Generic resolver backend
  ila: Resolver mechanism
  resolver: add netlink control
  ila: add netlink control ILA resolver

 include/net/lwtunnel.h |   6 +-
 include/net/resolver.h |  67 +
 include/uapi/linux/ila.h   |  21 ++
 include/uapi/linux/lwtunnel.h  |   1 +
 include/uapi/linux/rtnetlink.h |   8 +-
 net/Kconfig|   1 +
 net/Makefile   |   1 +
 net/core/lwt_bpf.c |   2 +-
 net/core/lwtunnel.c|   6 +-
 net/ipv4/fib_semantics.c   |  13 +-
 net/ipv4/ip_tunnel_core.c  |   4 +-
 net/ipv6/Kconfig   |   1 +
 net/ipv6/ila/Makefile  |   2 +-
 net/ipv6/ila/ila.h |  46 +++-
 net/ipv6/ila/ila_common.c  |  30 ---
 net/ipv6/ila/ila_lwt.c |  10 +-
 net/ipv6/ila/ila_main.c| 161 
 net/ipv6/ila/ila_resolver.c| 310 +++
 net/ipv6/ila/ila_xlat.c| 280 ++---
 net/ipv6/route.c   |   2 +-
 net/ipv6/seg6_iptunnel.c   |   2 +-
 net/ipv6/seg6_local.c  |   5 +-
 net/mpls/mpls_iptunnel.c   |   2 +-
 net/resolver/Kconfig   |   7 +
 net/resolver/Makefile  |   8 +
 net/resolver/resolver.c| 559 

Re: [PATCH net,stable] net: qmi_wwan: add Sierra EM7565 1199:9091

2017-12-11 Thread Bjørn Mork
ssjoh...@mac.com writes:

> From: Sebastian Sjoholm 
>
> From: Sebastian Sjoholm 
>
> Sierra Wireless EM7565 is a Qualcomm MDM9x50 based M.2 modem.
> The USB id is added to qmi_wwan.c to allow QMI communication with the EM7565.
>
> Signed-off-by: Sebastian Sjoholm 
> ---
> [The corresponding qcserial patch will be submitted by Reinhard Speyerer.]
>
> ---
>  drivers/net/usb/qmi_wwan.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
> index 304ec6555cd8..3cebd6683938 100644
> --- a/drivers/net/usb/qmi_wwan.c
> +++ b/drivers/net/usb/qmi_wwan.c
> @@ -1204,6 +1204,7 @@ static const struct usb_device_id products[] = {
>   {QMI_FIXED_INTF(0x1199, 0x9079, 10)},   /* Sierra Wireless EM74xx */
>   {QMI_FIXED_INTF(0x1199, 0x907b, 8)},/* Sierra Wireless EM74xx */
>   {QMI_FIXED_INTF(0x1199, 0x907b, 10)},   /* Sierra Wireless EM74xx */
> + {QMI_FIXED_INTF(0x1199, 0x9091, 8)},/* Sierra Wireless EM7565 */
>   {QMI_FIXED_INTF(0x1bbb, 0x011e, 4)},/* Telekom Speedstick LTE II 
> (Alcatel One Touch L100V LTE) */
>   {QMI_FIXED_INTF(0x1bbb, 0x0203, 2)},/* Alcatel L800MA */
>   {QMI_FIXED_INTF(0x2357, 0x0201, 4)},/* TP-LINK HSUPA Modem MA180 */

Looks good except for the duplicate 'From' line.  Drop that and you can
add

Acked-by: Bjørn Mork 



[PATCH v3 net-next 2/9] ila: Fix use of rhashtable walk in ila_xlat.c

2017-12-11 Thread Tom Herbert
Perform better EAGAIN handling, handle the case where ila_dump_info
fails and we miss objects in the dump, and add a skip index
to skip over ila entries in a list on a rhashtable node that have
already been visited (by a previous call to ila_nl_dump).
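
The skip index idea can be illustrated with a small userspace sketch: a
dump that fills a bounded buffer remembers how many entries of the current
list it has already emitted, and the next call skips that many before
continuing. This is an assumption-laden model, not the patch's code: a
plain singly linked list stands in for the entries chained off one
rhashtable node, and dump_chain and struct dump_iter are hypothetical
names.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	int val;
	struct node *next;
};

struct dump_iter {
	int skip;	/* entries of the current chain already emitted */
};

/* Copy values from chain into out[], resuming after it->skip entries.
 *
 * Returns the number of values written. it->skip is advanced so that a
 * later call with the same chain continues where this one stopped; a
 * return of 0 means the chain has been fully dumped.
 */
size_t dump_chain(const struct node *chain, struct dump_iter *it,
		  int *out, size_t out_len)
{
	const struct node *n = chain;
	size_t written = 0;
	int skipped = 0;

	assert(it != NULL && out != NULL);

	/* Skip over entries that a previous call already emitted. */
	while (n && skipped < it->skip) {
		n = n->next;
		skipped++;
	}

	/* Emit as many of the remaining entries as fit in the buffer. */
	while (n && written < out_len) {
		out[written++] = n->val;
		n = n->next;
		it->skip++;
	}

	return written;
}
```

A caller would invoke dump_chain repeatedly with the same iterator, much
as netlink invokes the dump callback again for each new skb.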

Signed-off-by: Tom Herbert 
---
 net/ipv6/ila/ila_xlat.c | 60 -
 1 file changed, 44 insertions(+), 16 deletions(-)

diff --git a/net/ipv6/ila/ila_xlat.c b/net/ipv6/ila/ila_xlat.c
index 44c39c5f0638..9fca75b9cab3 100644
--- a/net/ipv6/ila/ila_xlat.c
+++ b/net/ipv6/ila/ila_xlat.c
@@ -474,24 +474,31 @@ static int ila_nl_cmd_get_mapping(struct sk_buff *skb, 
struct genl_info *info)
 
 struct ila_dump_iter {
struct rhashtable_iter rhiter;
+   int skip;
 };
 
 static int ila_nl_dump_start(struct netlink_callback *cb)
 {
struct net *net = sock_net(cb->skb->sk);
struct ila_net *ilan = net_generic(net, ila_net_id);
-   struct ila_dump_iter *iter = (struct ila_dump_iter *)cb->args[0];
+   struct ila_dump_iter *iter;
+   int ret;
 
-   if (!iter) {
-   iter = kmalloc(sizeof(*iter), GFP_KERNEL);
-   if (!iter)
-   return -ENOMEM;
+   iter = kmalloc(sizeof(*iter), GFP_KERNEL);
+   if (!iter)
+   return -ENOMEM;
 
-   cb->args[0] = (long)iter;
+   ret = rhashtable_walk_init(&ilan->rhash_table, &iter->rhiter,
+  GFP_KERNEL);
+   if (ret) {
+   kfree(iter);
+   return ret;
}
 
-   return rhashtable_walk_init(&ilan->rhash_table, &iter->rhiter,
-   GFP_KERNEL);
+   iter->skip = 0;
+   cb->args[0] = (long)iter;
+
+   return ret;
 }
 
 static int ila_nl_dump_done(struct netlink_callback *cb)
@@ -509,37 +516,58 @@ static int ila_nl_dump(struct sk_buff *skb, struct 
netlink_callback *cb)
 {
struct ila_dump_iter *iter = (struct ila_dump_iter *)cb->args[0];
struct rhashtable_iter *rhiter = &iter->rhiter;
+   int skip = iter->skip;
struct ila_map *ila;
int ret;
 
rhashtable_walk_start(rhiter);
 
-   for (;;) {
-   ila = rhashtable_walk_next(rhiter);
+   /* Get first entry */
+   ila = rhashtable_walk_peek(rhiter);
 
+   for (;;) {
if (IS_ERR(ila)) {
-   if (PTR_ERR(ila) == -EAGAIN)
-   continue;
ret = PTR_ERR(ila);
-   goto done;
+   if (ret == -EAGAIN) {
+   /* Table has changed and iter has reset. Return
+* -EAGAIN to the application even if we have
+* written data to the skb. The application
+* needs to deal with this.
+*/
+
+   goto out_ret;
+   } else {
+   break;
+   }
} else if (!ila) {
+   ret = 0;
break;
}
 
+   while (ila && skip) {
+   /* Skip over any ila entries in this list that we
+* have already dumped.
+*/
+   ila = rcu_access_pointer(ila->next);
+   skip--;
+   }
while (ila) {
ret =  ila_dump_info(ila, NETLINK_CB(cb->skb).portid,
 cb->nlh->nlmsg_seq, NLM_F_MULTI,
 skb, ILA_CMD_GET);
if (ret)
-   goto done;
+   goto out;
 
ila = rcu_access_pointer(ila->next);
}
+   ila = rhashtable_walk_next(rhiter);
}
 
-   ret = skb->len;
+out:
+   iter->skip = skip;
+   ret = (skb->len ? : ret);
 
-done:
+out_ret:
rhashtable_walk_stop(rhiter);
return ret;
 }
-- 
2.11.0



Re: [PATCH next] ipvlan: add L2 check for packets arriving via virtual devices

2017-12-11 Thread David Miller
From: Mahesh Bandewar (महेश बंडेवार) 
Date: Mon, 11 Dec 2017 11:38:04 -0800

> On Mon, Dec 11, 2017 at 8:15 AM, David Miller  wrote:
>> From: Mahesh Bandewar 
>> Date: Thu,  7 Dec 2017 15:15:43 -0800
>>
>>> From: Mahesh Bandewar 
>>>
>>> Packets that don't have dest mac as the mac of the master device should
>>> not be entertained by the IPvlan rx-handler. This is mostly true as the
>>> packet path mostly takes care of that, except when the master device is
>>> a virtual device. As demonstrated in the following case -
>>  ...
>>> This patch adds that missing check in the IPvlan rx-handler.
>>>
>>> Reported-by: Amit Sikka 
>>> Signed-off-by: Mahesh Bandewar 
>>
>> Applied, but it's a shame that the data plane takes on this new MAC
>> compare operation.
> Your comment made me think a little more about this and a discussion
> with Eric kind of put things in perspective. eth_type_trans() does the
> right thing and sets the packet_type correctly (when .ndo_xmit of veth
> is called). However IPvlan is over-aggressive in packet scrubbing and
> that scrub changes packet type. This causes the actual problem. It's
> not clear to me why skb_scrub_packet() changes the packet type to
> PACKET_HOST unconditionally? But that's another issue.
> 
> I'll send another patch to remove excessive scrubbing in IPvlan and
> a revert of this patch so that this additional comparison (though not
> expensive!) can be avoided.

Thanks for looking more deeply into this.


[PATCH net,stable] net: qmi_wwan: add Sierra EM7565 1199:9091

2017-12-11 Thread ssjoholm
From: Sebastian Sjoholm 

From: Sebastian Sjoholm 

Sierra Wireless EM7565 is a Qualcomm MDM9x50 based M.2 modem.
The USB id is added to qmi_wwan.c to allow QMI communication with the EM7565.

Signed-off-by: Sebastian Sjoholm 
---
[The corresponding qcserial patch will be submitted by Reinhard Speyerer.]

---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 304ec6555cd8..3cebd6683938 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1204,6 +1204,7 @@ static const struct usb_device_id products[] = {
{QMI_FIXED_INTF(0x1199, 0x9079, 10)},   /* Sierra Wireless EM74xx */
{QMI_FIXED_INTF(0x1199, 0x907b, 8)},/* Sierra Wireless EM74xx */
{QMI_FIXED_INTF(0x1199, 0x907b, 10)},   /* Sierra Wireless EM74xx */
+   {QMI_FIXED_INTF(0x1199, 0x9091, 8)},/* Sierra Wireless EM7565 */
{QMI_FIXED_INTF(0x1bbb, 0x011e, 4)},/* Telekom Speedstick LTE II 
(Alcatel One Touch L100V LTE) */
{QMI_FIXED_INTF(0x1bbb, 0x0203, 2)},/* Alcatel L800MA */
{QMI_FIXED_INTF(0x2357, 0x0201, 4)},/* TP-LINK HSUPA Modem MA180 */
-- 
2.14.1



[PATCH net,stable] net: qmi_wwan: add Quectel BG96 2c7c:0296

2017-12-11 Thread ssjoholm
From: Sebastian Sjoholm 

Quectel BG96 is a Qualcomm MDM9206 based IoT modem, supporting both
CAT-M and NB-IoT. Tested hardware is BG96 mounted on Quectel development 
board (EVB). The USB id is added to qmi_wwan.c to allow QMI 
communication with the BG96.

Signed-off-by: Sebastian Sjoholm 

---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 720a3a248070..c750cf7c042b 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1239,6 +1239,7 @@ static const struct usb_device_id products[] = {
{QMI_FIXED_INTF(0x1e0e, 0x9001, 5)},/* SIMCom 7230E */
{QMI_QUIRK_SET_DTR(0x2c7c, 0x0125, 4)}, /* Quectel EC25, EC20 R2.0  
Mini PCIe */
{QMI_QUIRK_SET_DTR(0x2c7c, 0x0121, 4)}, /* Quectel EC21 Mini PCIe */
+   {QMI_FIXED_INTF(0x2c7c, 0x0296, 4)},/* Quectel BG96 */
 
/* 4. Gobi 1000 devices */
{QMI_GOBI1K_DEVICE(0x05c6, 0x9212)},/* Acer Gobi Modem Device */
-- 
2.11.0 (Apple Git-81)



Re: [PATCH net-next v4 1/2] bpf/tracing: allow user space to query prog array on the same tp

2017-12-11 Thread Alexei Starovoitov
On Mon, Dec 11, 2017 at 11:39:02AM -0800, Yonghong Song wrote:
> Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
> for a single perf event") added support to attach multiple
> bpf programs to a single perf event.
> Although this provides flexibility, users may want to know
> what other bpf programs are attached to the same tp interface.
> Besides getting visibility for the underlying bpf system,
> such information may also help consolidate multiple bpf programs,
> understand potential performance issues due to a large array,
> and debug (e.g., one bpf program which overwrites return code
> may impact subsequent program results).
> 
> Commit 2541517c32be ("tracing, perf: Implement BPF programs
> attached to kprobes") utilized the existing perf ioctl
> interface and added the command PERF_EVENT_IOC_SET_BPF
> to attach a bpf program to a tracepoint. This patch adds a new
> ioctl command, given a perf event fd, to query the bpf program
> array attached to the same perf tracepoint event.
> 
> The new uapi ioctl command:
>   PERF_EVENT_IOC_QUERY_BPF
> 
> The new uapi/linux/perf_event.h structure:
>   struct perf_event_query_bpf {
>__u32  ids_len;
>__u32  prog_cnt;
>__u32  ids[0];
>   };
> 
> User space provides buffer "ids" for kernel to copy to.
> When returning from the kernel, the number of available
> programs in the array is set in "prog_cnt".
> 
> The usage:
>   struct perf_event_query_bpf *query = malloc(...);
>   query->ids_len = ids_len;
>   err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
>   if (err == 0) {
>     /* query->prog_cnt is the number of available progs,
>      * number of progs in ids: (ids_len == 0) ? 0 : query->prog_cnt
>      */
>   } else if (errno == ENOSPC) {
>     /* query->ids_len number of progs copied,
>      * query->prog_cnt is the number of available progs
>      */
>   } else {
>     /* other errors */
>   }
> 
> Signed-off-by: Yonghong Song 
> Acked-by: Peter Zijlstra (Intel) 

Acked-by: Alexei Starovoitov 



Re: [PATCH net-next v2 5/6] net: qualcomm: rmnet: Allow to configure flags for new devices

2017-12-11 Thread Dan Williams
On Sat, 2017-12-09 at 13:58 -0700, Subash Abhinov Kasiviswanathan
wrote:
> Add an option to configure the rmnet aggregation and command features
> on device creation. This is achieved by using the vlan flags option.

Still seems kinda odd to overload IFLA_VLAN_FLAGS to carry
RMNET_INGRESS/EGRESS_FORMAT_* flags, but I'll leave that decision to
others...

Dan

> Signed-off-by: Subash Abhinov Kasiviswanathan
> 
> ---
>  drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 16
> +---
>  1 file changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> index 5e530db..2f5f661 100644
> --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> @@ -177,11 +177,20 @@ static int rmnet_newlink(struct net *src_net,
> struct net_device *dev,
>   if (err)
>   goto err2;
>  
> - netdev_dbg(dev, "data format [ingress 0x%08X]\n",
> ingress_format);
> - port->ingress_data_format = ingress_format;
>   port->rmnet_mode = mode;
>  
>   hlist_add_head_rcu(&ep->hlnode, &port->muxed_ep[mux_id]);
> +
> + if (data[IFLA_VLAN_FLAGS]) {
> + struct ifla_vlan_flags *flags;
> +
> + flags = nla_data(data[IFLA_VLAN_FLAGS]);
> + ingress_format = flags->flags & flags->mask;
> + }
> +
> + netdev_dbg(dev, "data format [ingress 0x%08X]\n",
> ingress_format);
> + port->ingress_data_format = ingress_format;
> +
>   return 0;
>  
>  err2:
> @@ -312,7 +321,8 @@ static int rmnet_rtnl_validate(struct nlattr
> *tb[], struct nlattr *data[],
>  
>  static size_t rmnet_get_size(const struct net_device *dev)
>  {
> - return nla_total_size(2); /* IFLA_VLAN_ID */
> + return nla_total_size(2) /* IFLA_VLAN_ID */ +
> + nla_total_size(sizeof(struct ifla_vlan_flags)); /*
> IFLA_VLAN_FLAGS */
>  }
>  
>  struct rtnl_link_ops rmnet_link_ops __read_mostly = {
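
The two computations in the quoted diff can be tried outside the kernel. This is a minimal userspace sketch, assuming the usual netlink constants (4-byte attribute header, 4-byte alignment); every `example_*` name is a hypothetical stand-in, not the kernel function it imitates:

```c
#include <stdint.h>

/* Userspace mirror of struct ifla_vlan_flags (two 32-bit fields). */
struct example_vlan_flags {
	uint32_t flags;
	uint32_t mask;
};

#define EXAMPLE_NLA_ALIGNTO 4u
#define EXAMPLE_NLA_HDRLEN  4u	/* assumed aligned size of struct nlattr */

/* Attribute header plus payload, rounded up to the netlink alignment. */
static unsigned int example_nla_total_size(unsigned int payload)
{
	unsigned int len = EXAMPLE_NLA_HDRLEN + payload;

	return (len + EXAMPLE_NLA_ALIGNTO - 1) & ~(EXAMPLE_NLA_ALIGNTO - 1);
}

/* Only the bits selected by the mask reach the ingress format. */
static uint32_t example_ingress_format(const struct example_vlan_flags *f)
{
	return f->flags & f->mask;
}

/* Size reported after the patch: IFLA_VLAN_ID plus IFLA_VLAN_FLAGS. */
static unsigned int example_get_size(void)
{
	return example_nla_total_size(2) +
	       example_nla_total_size(sizeof(struct example_vlan_flags));
}
```

Under these assumptions the 2-byte VLAN id attribute occupies 8 bytes and the 8-byte flags attribute occupies 12, so the reported size grows from 8 to 20 bytes.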


Re: [PATCH] selftests: bpf: Adding config fragment CONFIG_CGROUP_BPF=y

2017-12-11 Thread Roman Gushchin
Hi Naresh,

Looks good!

Thanks!

On Tue, Dec 12, 2017 at 12:55:23AM +0530, Naresh Kamboju wrote:
> CONFIG_CGROUP_BPF=y is required for test_dev_cgroup test case.
> 
> Signed-off-by: Naresh Kamboju 
> ---
>  tools/testing/selftests/bpf/config | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/testing/selftests/bpf/config 
> b/tools/testing/selftests/bpf/config
> index 52d53ed..9d48973 100644
> --- a/tools/testing/selftests/bpf/config
> +++ b/tools/testing/selftests/bpf/config
> @@ -3,3 +3,4 @@ CONFIG_BPF_SYSCALL=y
>  CONFIG_NET_CLS_BPF=m
>  CONFIG_BPF_EVENTS=y
>  CONFIG_TEST_BPF=m
> +CONFIG_CGROUP_BPF=y
> -- 
> 2.7.4
> 


Re: [PATCH] Revert "ravb: add workaround for clock when resuming with WoL enabled"

2017-12-11 Thread Sergei Shtylyov

Hello!

On 12/11/2017 11:54 AM, Geert Uytterhoeven wrote:


This reverts commit fbf3d034f2ff6264183cfa6845770e8cc2a986c8.

As of commit 560869100b99a3da ("clk: renesas: cpg-mssr: Restore module
clocks during resume"), the workaround is no longer needed.

Signed-off-by: Geert Uytterhoeven 


Acked-by: Sergei Shtylyov 

[...]

MBR, Sergei


[PATCH net-next v4 1/2] bpf/tracing: allow user space to query prog array on the same tp

2017-12-11 Thread Yonghong Song
Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event.
Although this provides flexibility, users may want to know
what other bpf programs are attached to the same tp interface.
Besides getting visibility for the underlying bpf system,
such information may also help consolidate multiple bpf programs,
understand potential performance issues due to a large array,
and debug (e.g., one bpf program which overwrites return code
may impact subsequent program results).

Commit 2541517c32be ("tracing, perf: Implement BPF programs
attached to kprobes") utilized the existing perf ioctl
interface and added the command PERF_EVENT_IOC_SET_BPF
to attach a bpf program to a tracepoint. This patch adds a new
ioctl command, given a perf event fd, to query the bpf program
array attached to the same perf tracepoint event.

The new uapi ioctl command:
  PERF_EVENT_IOC_QUERY_BPF

The new uapi/linux/perf_event.h structure:
  struct perf_event_query_bpf {
   __u32 ids_len;
   __u32 prog_cnt;
   __u32 ids[0];
  };

User space provides buffer "ids" for kernel to copy to.
When returning from the kernel, the number of available
programs in the array is set in "prog_cnt".

The usage:
  struct perf_event_query_bpf *query = malloc(...);
  query->ids_len = ids_len;
  err = ioctl(pmu_efd, PERF_EVENT_IOC_QUERY_BPF, query);
  if (err == 0) {
    /* query->prog_cnt is the number of available progs,
     * number of progs in ids: (ids_len == 0) ? 0 : query->prog_cnt
     */
  } else if (errno == ENOSPC) {
    /* query->ids_len number of progs copied,
     * query->prog_cnt is the number of available progs
     */
  } else {
    /* other errors */
  }
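
The usage outline above can be turned into a small compilable sketch. The helper names `example_query_alloc` and `example_query_progs` are hypothetical, and the fallback definitions are only an assumption for building against headers that predate this patch; the return-value handling follows the description in this message.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/* Fallback for uapi headers that do not yet carry the new command. */
#ifndef PERF_EVENT_IOC_QUERY_BPF
struct perf_event_query_bpf {
	__u32 ids_len;
	__u32 prog_cnt;
	__u32 ids[0];
};
#define PERF_EVENT_IOC_QUERY_BPF _IOWR('$', 10, struct perf_event_query_bpf *)
#endif

/* Allocate a zeroed query with room for ids_len program ids. */
static struct perf_event_query_bpf *example_query_alloc(__u32 ids_len)
{
	struct perf_event_query_bpf *q;

	q = calloc(1, sizeof(*q) + sizeof(__u32) * ids_len);
	if (q)
		q->ids_len = ids_len;
	return q;
}

/*
 * Returns how many entries of q->ids are valid, or -1 on other errors.
 * q->prog_cnt holds the number of attached programs in both success cases.
 */
static int example_query_progs(int perf_fd, struct perf_event_query_bpf *q)
{
	if (ioctl(perf_fd, PERF_EVENT_IOC_QUERY_BPF, q) == 0)
		return q->ids_len == 0 ? 0 : (int)q->prog_cnt;
	if (errno == ENOSPC)
		return (int)q->ids_len;	/* buffer too small, partially filled */
	return -1;
}
```

A caller could first pass ids_len == 0 to learn prog_cnt, then allocate a buffer of that size and query again.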

Signed-off-by: Yonghong Song 
Acked-by: Peter Zijlstra (Intel) 
---
 include/linux/bpf.h |  4 
 include/uapi/linux/perf_event.h | 22 ++
 kernel/bpf/core.c   | 21 +
 kernel/events/core.c|  3 +++
 kernel/trace/bpf_trace.c| 23 +++
 5 files changed, 73 insertions(+)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e55e425..f812ac5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -254,6 +254,7 @@ typedef unsigned long (*bpf_ctx_copy_t)(void *dst, const void *src,
 
 u64 bpf_event_output(struct bpf_map *map, u64 flags, void *meta, u64 meta_size,
 void *ctx, u64 ctx_size, bpf_ctx_copy_t ctx_copy);
+int bpf_event_query_prog_array(struct perf_event *event, void __user *info);
 
 int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
  union bpf_attr __user *uattr);
@@ -285,6 +286,9 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
 
 void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs,
struct bpf_prog *old_prog);
+int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
+__u32 __user *prog_ids, u32 request_cnt,
+__u32 __user *prog_cnt);
 int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
struct bpf_prog *exclude_prog,
struct bpf_prog *include_prog,
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index b9a4953..7695336 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -418,6 +418,27 @@ struct perf_event_attr {
__u16   __reserved_2;   /* align to __u64 */
 };
 
+/*
+ * Structure used by below PERF_EVENT_IOC_QUERY_BPF command
+ * to query bpf programs attached to the same perf tracepoint
+ * as the given perf event.
+ */
+struct perf_event_query_bpf {
+   /*
+* The below ids array length
+*/
+   __u32   ids_len;
+   /*
+* Set by the kernel to indicate the number of
+* available programs
+*/
+   __u32   prog_cnt;
+   /*
+* User provided buffer to store program ids
+*/
+   __u32   ids[0];
+};
+
 #define perf_flags(attr)   (*(&(attr)->read_format + 1))
 
 /*
@@ -433,6 +454,7 @@ struct perf_event_attr {
 #define PERF_EVENT_IOC_ID  _IOR('$', 7, __u64 *)
 #define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32)
 #define PERF_EVENT_IOC_PAUSE_OUTPUT    _IOW('$', 9, __u32)
+#define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct perf_event_query_bpf *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 86b50aa..b16c6f8 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1462,6 +1462,8 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
rcu_read_lock();
prog = rcu_dereference(progs)->progs;
for (; *prog; prog++) {
+   if (*prog == &dummy_bpf_prog.prog)
+   continue;

[PATCH net-next v4 2/2] bpf/tracing: add a bpf test for new ioctl query interface

2017-12-11 Thread Yonghong Song
Added a subtest in test_progs. The tracepoint is
sched/sched_switch. Multiple bpf programs are attached to
this tracepoint and the query interface is exercised.

Signed-off-by: Yonghong Song 
Acked-by: Alexei Starovoitov 
Acked-by: Peter Zijlstra (Intel) 
---
 tools/include/uapi/linux/perf_event.h |  22 +
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/test_progs.c  | 134 ++
 tools/testing/selftests/bpf/test_tracepoint.c |  26 +
 4 files changed, 183 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_tracepoint.c

diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index b9a4953..7695336 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -418,6 +418,27 @@ struct perf_event_attr {
__u16   __reserved_2;   /* align to __u64 */
 };
 
+/*
+ * Structure used by below PERF_EVENT_IOC_QUERY_BPF command
+ * to query bpf programs attached to the same perf tracepoint
+ * as the given perf event.
+ */
+struct perf_event_query_bpf {
+   /*
+* The below ids array length
+*/
+   __u32   ids_len;
+   /*
+* Set by the kernel to indicate the number of
+* available programs
+*/
+   __u32   prog_cnt;
+   /*
+* User provided buffer to store program ids
+*/
+   __u32   ids[0];
+};
+
 #define perf_flags(attr)   (*(&(attr)->read_format + 1))
 
 /*
@@ -433,6 +454,7 @@ struct perf_event_attr {
 #define PERF_EVENT_IOC_ID  _IOR('$', 7, __u64 *)
 #define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32)
 #define PERF_EVENT_IOC_PAUSE_OUTPUT    _IOW('$', 9, __u32)
+#define PERF_EVENT_IOC_QUERY_BPF   _IOWR('$', 10, struct perf_event_query_bpf *)
 
 enum perf_event_ioc_flags {
PERF_IOC_FLAG_GROUP = 1U << 0,
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index f309ab9..b177c55 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -29,7 +29,7 @@ TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map test
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o test_obj_id.o \
    test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o sockmap_parse_prog.o \
-   sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o
+   sockmap_verdict_prog.o dev_cgroup.o sample_ret0.o test_tracepoint.o
 
 TEST_PROGS := test_kmod.sh test_xdp_redirect.sh test_xdp_meta.sh \
test_offload.py
diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 6942753..1e0479a 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -21,8 +21,10 @@ typedef __u16 __sum16;
 #include 
 #include 
 #include 
+#include 
 #include 
 
+#include 
 #include 
 #include 
 #include 
@@ -617,6 +619,137 @@ static void test_obj_name(void)
}
 }
 
+static void test_tp_attach_query(void)
+{
+   const int num_progs = 3;
+   int i, j, bytes, efd, err, prog_fd[num_progs], pmu_fd[num_progs];
+   __u32 duration = 0, info_len, saved_prog_ids[num_progs];
+   const char *file = "./test_tracepoint.o";
+   struct perf_event_query_bpf *query;
+   struct perf_event_attr attr = {};
+   struct bpf_object *obj[num_progs];
+   struct bpf_prog_info prog_info;
+   char buf[256];
+
+   snprintf(buf, sizeof(buf),
+"/sys/kernel/debug/tracing/events/sched/sched_switch/id");
+   efd = open(buf, O_RDONLY, 0);
+   if (CHECK(efd < 0, "open", "err %d errno %d\n", efd, errno))
+   return;
+   bytes = read(efd, buf, sizeof(buf));
+   close(efd);
+   if (CHECK(bytes <= 0 || bytes >= sizeof(buf),
+ "read", "bytes %d errno %d\n", bytes, errno))
+   return;
+
+   attr.config = strtol(buf, NULL, 0);
+   attr.type = PERF_TYPE_TRACEPOINT;
+   attr.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_CALLCHAIN;
+   attr.sample_period = 1;
+   attr.wakeup_events = 1;
+
+   query = (struct perf_event_query_bpf *)malloc(sizeof(struct perf_event_query_bpf) +
+ sizeof(__u32) * num_progs);
+   for (i = 0; i < num_progs; i++) {
+   err = bpf_prog_load(file, BPF_PROG_TYPE_TRACEPOINT, &obj[i],
+   &prog_fd[i]);
+   if (CHECK(err, "prog_load", "err %d errno %d\n", err, errno))
+   goto cleanup1;
+
+   bzero(&prog_info, sizeof(prog_info));
+   prog_info.jited_prog_len = 0;
+   prog_info.xlated_prog_len = 0;
+   prog_info.nr_map_ids = 0;
+   info_len = sizeof(prog_info);
+   err = bpf_obj_get_info_by_fd(prog_fd[i], &prog_info, &info_len);
+ 

[PATCH net-next v4 0/2] bpf/tracing: allow user space to query prog array on the same tp

2017-12-11 Thread Yonghong Song
Commit e87c6bc3852b ("bpf: permit multiple bpf attachments
for a single perf event") added support to attach multiple
bpf programs to a single perf event. Given a perf event
(kprobe, uprobe, or kernel tracepoint), the perf ioctl interface
is used to query bpf programs attached to the same trace event.

There already exists a BPF_PROG_QUERY command for introspection
currently used by cgroup+bpf. We did have an implementation for
querying tracepoint+bpf through the same interface. However, it
looks cleaner to use ioctl() style of api here, since attaching
bpf prog to tracepoint/kuprobe is also done via ioctl.

Patch #1 had the core implementation and patch #2 added
a test case in tools bpf selftests suite.

Changelogs:
v3 -> v4:
  - Fix a compilation error with newer gcc like 6.3.1 while
old gcc 4.8.5 is okay. I was using &uquery->ids to represent
the address to the ids array to make it explicit that the
address is passed, and this syntax is rightly rejected
by gcc 6.3.1.
v2 -> v3:
  - Change uapi structure perf_event_query_bpf to be
    clearer based on Peter's suggestion, and adjust
    other code accordingly.
v1 -> v2:
  - Rebase on top of net-next.
  - Use existing bpf_prog_array_length function instead of
implementing the same functionality in function
bpf_prog_array_copy_info.
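
The v3 -> v4 item comes down to a type difference that can be shown in isolation. This is a minimal sketch, assuming the compiler complaint was an incompatible-pointer-type diagnostic (the changelog does not quote it); `struct example_query` and `example_same_address` are hypothetical names that mirror the uapi layout:

```c
#include <stdint.h>

/* Same layout idea as perf_event_query_bpf: header plus trailing array. */
struct example_query {
	uint32_t ids_len;
	uint32_t prog_cnt;
	uint32_t ids[];
};

/*
 * q->ids decays to a pointer to the first element (uint32_t *), which is
 * what a "__u32 *" parameter expects. &q->ids is a pointer to the whole
 * array, a different type, even though both hold the same address.
 */
static int example_same_address(struct example_query *q)
{
	uint32_t *as_element = q->ids;
	uint32_t (*as_array)[] = &q->ids;

	return (void *)as_element == (void *)as_array;
}
```

Both expressions compare equal as addresses, so the v4 change affects only the type seen by the compiler, not the pointer value passed.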

Yonghong Song (2):
  bpf/tracing: allow user space to query prog array on the same tp
  bpf/tracing: add a bpf test for new ioctl query interface

 include/linux/bpf.h   |   4 +
 include/uapi/linux/perf_event.h   |  22 +
 kernel/bpf/core.c |  21 
 kernel/events/core.c  |   3 +
 kernel/trace/bpf_trace.c  |  23 +
 tools/include/uapi/linux/perf_event.h |  22 +
 tools/testing/selftests/bpf/Makefile  |   2 +-
 tools/testing/selftests/bpf/test_progs.c  | 134 ++
 tools/testing/selftests/bpf/test_tracepoint.c |  26 +
 9 files changed, 256 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/test_tracepoint.c

-- 
2.9.5



Re: [PATCH next] ipvlan: add L2 check for packets arriving via virtual devices

2017-12-11 Thread महेश बंडेवार
On Mon, Dec 11, 2017 at 8:15 AM, David Miller  wrote:
> From: Mahesh Bandewar 
> Date: Thu,  7 Dec 2017 15:15:43 -0800
>
>> From: Mahesh Bandewar 
>>
>> Packets that don't have dest mac as the mac of the master device should
>> not be entertained by the IPvlan rx-handler. This is mostly true as the
>> packet path mostly takes care of that, except when the master device is
>> a virtual device. As demonstrated in the following case -
>  ...
>> This patch adds that missing check in the IPvlan rx-handler.
>>
>> Reported-by: Amit Sikka 
>> Signed-off-by: Mahesh Bandewar 
>
> Applied, but it's a shame that the data plane takes on this new MAC
> compare operation.
Your comment made me think a little more about this, and a discussion
with Eric kind of put things in perspective. eth_type_trans() does the
right thing and sets the packet_type correctly (when .ndo_xmit of veth
is called). However IPvlan is over-aggressive in packet scrubbing and
that scrub changes packet type. This causes the actual problem. It's
not clear to me why skb_scrub_packet() changes the packet type to
PACKET_HOST unconditionally? But that's another issue.

I'll send another patch to remove excessive scrubbing in IPvlan and
revert of this patch so that this additional comparison (though not
expensive!) can be avoided.

Thanks,
--mahesh..


Re: RFC(v2): Audit Kernel Container IDs

2017-12-11 Thread Steve Grubb
On Monday, December 11, 2017 11:30:57 AM EST Eric Paris wrote:
> > Because a container doesn't have to use namespaces to be a container
> > you still need a mechanism for a process to declare that it is in
> > fact
> > in a container, and to identify the container.
> 
> I like the idea but I'm still tossing it around in my head (and
> thinking about Casey's statement too). Let's say we have a 'docker-like'
> container with pid=100  netns=X,userns=Y,mountns=Z. If I'm on the host
> in all init namespaces and I run
>   nsenter -t 100 -n ip link set eth0 promisc on
> How should this be logged?

If it is a normal process, then everything would match the init name space and 
you wouldn't have entered a container. If it were a container, any generated 
event should have the container ID from registration attached to it.

> Did this command run in its own 'container' unrelated to the 'docker-like'
> container?

That should be determined by what's in the task struct.

-Steve


Re: [PATCH v2] vsock.7: document VSOCK socket address family

2017-12-11 Thread Michael Kerrisk (man-pages)
On 12/06/2017 03:06 PM, Jorgen S. Hansen wrote:
> 
>> On Dec 5, 2017, at 11:56 AM, Stefan Hajnoczi  wrote:
>>
>> The AF_VSOCK address family has been available since Linux 3.9 without a
>> corresponding man page.
>>
>> This patch adds vsock.7 and describes its use along the same lines as
>> existing ip.7, unix.7, and netlink.7 man pages.
>>
>> CC: Jorgen Hansen 
>> CC: Dexuan Cui 
>> Signed-off-by: Stefan Hajnoczi 
>> ---
>> man7/vsock.7 | 180 
>> +++
>> 1 file changed, 180 insertions(+)
>> create mode 100644 man7/vsock.7
>>
>> diff --git a/man7/vsock.7 b/man7/vsock.7
>> new file mode 100644
>> index 0..46dc561f5
>> --- /dev/null
>> +++ b/man7/vsock.7
>> @@ -0,0 +1,180 @@
>> +.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
>> +.SH NAME
>> +vsock \- Linux VSOCK address family
>> +.SH SYNOPSIS
>> +.B #include 
>> +.br
>> +.B #include 
>> +.PP
>> +.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
>> +.br
>> +.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
>> +.SH DESCRIPTION
>> +The VSOCK address family facilitates communication between virtual machines 
>> and
>> +the host they are running on.  This address family is used by guest agents 
>> and
>> +hypervisor services that need a communications channel that is independent 
>> of
>> +virtual machine network configuration.
>> +.PP
>> +Valid socket types are
>> +.B SOCK_STREAM
>> +and
>> +.BR SOCK_DGRAM .
>> +.B SOCK_STREAM
>> +provides connection-oriented byte streams with guaranteed, in-order 
>> delivery.
>> +.B SOCK_DGRAM
>> +provides a connectionless datagram packet service with best-effort delivery 
>> and
>> +best-effort ordering.  Availability of these socket types is dependent on 
>> the
>> +underlying hypervisor.
>> +.PP
>> +A new socket is created with
>> +.PP
>> +socket(AF_VSOCK, socket_type, 0);
>> +.PP
>> +When a process wants to establish a connection it calls
>> +.BR connect (2)
>> +with a given destination socket address.  The socket is automatically bound 
>> to
>> +a free port if unbound.
>> +.PP
>> +A process can listen for incoming connections by first binding to a socket
>> +address using
>> +.BR bind (2)
>> +and then calling
>> +.BR listen (2).
>> +.PP
>> +Data is transferred using the usual
>> +.BR send (2)
>> +and
>> +.BR recv (2)
>> +family of socket system calls.
>> +.SS Address format
>> +A socket address is defined as a combination of a 32-bit Context Identifier
>> +(CID) and a 32-bit port number.  The CID identifies the source or 
>> destination,
>> +which is either a virtual machine or the host.  The port number 
>> differentiates
>> +between multiple services running on a single machine.
>> +.PP
>> +.in +4n
>> +.EX
>> +struct sockaddr_vm {
>> +sa_family_t svm_family; /* address family: AF_VSOCK */
>> +unsigned short  svm_reserved1;
>> +unsigned intsvm_port;   /* port in native byte order */
>> +unsigned intsvm_cid;/* address in native byte order */
>> +};
>> +.EE
>> +.in
>> +.PP
>> +.I svm_family
>> +is always set to
>> +.BR AF_VSOCK .
>> +.I svm_reserved1
>> +is always set to 0.
>> +.I svm_port
>> +contains the port in native byte order.
>> +The port numbers below 1024 are called
>> +.IR "privileged ports" .
>> +Only a process with
>> +.B CAP_NET_BIND_SERVICE
>> +capability may
>> +.BR bind (2)
>> +to these port numbers.
>> +.PP
>> +There are several special addresses:
>> +.B VMADDR_CID_ANY
>> +(-1U)
>> +means any address for binding;
>> +.B VMADDR_CID_HYPERVISOR
>> +(0) is reserved for services built into the hypervisor;
>> +.B VMADDR_CID_RESERVED
>> +(1) must not be used;
>> +.B VMADDR_CID_HOST
>> +(2)
>> +is the well-known address of the host.
>> +.PP
>> +The special constant
>> +.B VMADDR_PORT_ANY
>> +(-1U)
>> +means any port number for binding.
>> +.SS Live migration
>> +Sockets are affected by live migration of virtual machines.  Connected
>> +.B SOCK_STREAM
>> +sockets become disconnected when the virtual machine migrates to a new host.
>> +Applications must reconnect when this happens.
>> +.PP
>> +The local CID may change across live migration if the old CID is not 
>> available
>> +on the new host.  Bound sockets are automatically updated to the new CID.
>> +.SS Ioctls
>> +.TP
>> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
>> +Get the CID of the local machine.  The argument is a pointer to an unsigned 
>> int.
>> +.IP
>> +.in +4n
>> +.EX
>> +.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", " &cid 
>> ");"
>> +.EE
>> +.in
>> +.IP
>> +Consider using
>> +.B VMADDR_CID_ANY
>> +when binding instead of getting the local CID with
>> +.BR IOCTL_VM_SOCKETS_GET_LOCAL_CID .
>> +.SH ERRORS
>> +.TP
>> +.B EACCES
>> +Unable to bind to a privileged port without the
>> +.B CAP_NET_BIND_SERVICE
>> +capability.
>> +.TP
>> +.B EINVAL
>> +Invalid parameters.  This includes:
>> +attempting to bind a socket that is already bound, providing an invalid 
>> struct
>> +.BR sockaddr_vm ,
>> +and oth

Re: [PATCH v2] vsock.7: document VSOCK socket address family

2017-12-11 Thread Michael Kerrisk (man-pages)
Hello Stefan,

Thanks for this page!

I have applied your patch, and made a few tweaks, but
I have some minor questions. Please see below.

On 12/05/2017 11:56 AM, Stefan Hajnoczi wrote:
> The AF_VSOCK address family has been available since Linux 3.9 without a
> corresponding man page.
> 
> This patch adds vsock.7 and describes its use along the same lines as
> existing ip.7, unix.7, and netlink.7 man pages.
> 
> CC: Jorgen Hansen 
> CC: Dexuan Cui 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  man7/vsock.7 | 180 
> +++
>  1 file changed, 180 insertions(+)
>  create mode 100644 man7/vsock.7
> 
> diff --git a/man7/vsock.7 b/man7/vsock.7
> new file mode 100644
> index 0..46dc561f5
> --- /dev/null
> +++ b/man7/vsock.7
> @@ -0,0 +1,180 @@
> +.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +vsock \- Linux VSOCK address family
> +.SH SYNOPSIS
> +.B #include 
> +.br
> +.B #include 
> +.PP
> +.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
> +.br
> +.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
> +.SH DESCRIPTION
> +The VSOCK address family facilitates communication between virtual machines 
> and
> +the host they are running on.  This address family is used by guest agents 
> and
> +hypervisor services that need a communications channel that is independent of
> +virtual machine network configuration.
> +.PP
> +Valid socket types are
> +.B SOCK_STREAM
> +and
> +.BR SOCK_DGRAM .
> +.B SOCK_STREAM
> +provides connection-oriented byte streams with guaranteed, in-order delivery.
> +.B SOCK_DGRAM
> +provides a connectionless datagram packet service with best-effort delivery 
> and
> +best-effort ordering.  Availability of these socket types is dependent on the
> +underlying hypervisor.
> +.PP
> +A new socket is created with
> +.PP
> +socket(AF_VSOCK, socket_type, 0);
> +.PP
> +When a process wants to establish a connection it calls
> +.BR connect (2)
> +with a given destination socket address.  The socket is automatically bound 
> to
> +a free port if unbound.
> +.PP
> +A process can listen for incoming connections by first binding to a socket
> +address using
> +.BR bind (2)
> +and then calling
> +.BR listen (2).
> +.PP
> +Data is transferred using the usual
> +.BR send (2)
> +and
> +.BR recv (2)

Or equally, write(2) and read(2), right? By failing to mention those, the
text subtly implies that send(2) and recv(2) are preferred, but I don't
suppose that is true.
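
For orientation, the socket(2) and connect(2) steps described in the quoted text so far fit together roughly as follows. This is a minimal sketch, assuming the `struct sockaddr_vm` fields shown in the page's Address format section and a system whose headers provide `<linux/vm_sockets.h>`; the `example_*` helpers are hypothetical. It does not settle the read(2)/write(2) question above.

```c
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

/* Build a VSOCK address; zeroing first also clears svm_reserved1. */
static struct sockaddr_vm example_vsock_addr(unsigned int cid,
					     unsigned int port)
{
	struct sockaddr_vm addr;

	memset(&addr, 0, sizeof(addr));
	addr.svm_family = AF_VSOCK;
	addr.svm_cid = cid;	/* native byte order */
	addr.svm_port = port;	/* native byte order */
	return addr;
}

/* Returns a connected stream socket, or -1 with errno set. */
static int example_vsock_connect(unsigned int cid, unsigned int port)
{
	struct sockaddr_vm addr = example_vsock_addr(cid, port);
	int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A guest agent might call example_vsock_connect(VMADDR_CID_HOST, port) with a port number agreed with the host service.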

> +family of socket system calls.
> +.SS Address format
> +A socket address is defined as a combination of a 32-bit Context Identifier
> +(CID) and a 32-bit port number.  The CID identifies the source or 
> destination,
> +which is either a virtual machine or the host.  The port number 
> differentiates
> +between multiple services running on a single machine.
> +.PP
> +.in +4n
> +.EX
> +struct sockaddr_vm {
> +sa_family_t svm_family; /* address family: AF_VSOCK */
> +unsigned short  svm_reserved1;
> +unsigned intsvm_port;   /* port in native byte order */
> +unsigned intsvm_cid;/* address in native byte order */
> +};
> +.EE
> +.in
> +.PP
> +.I svm_family
> +is always set to
> +.BR AF_VSOCK .
> +.I svm_reserved1
> +is always set to 0.
> +.I svm_port
> +contains the port in native byte order.
> +The port numbers below 1024 are called
> +.IR "privileged ports" .
> +Only a process with
> +.B CAP_NET_BIND_SERVICE
> +capability may
> +.BR bind (2)
> +to these port numbers.
> +.PP
> +There are several special addresses:
> +.B VMADDR_CID_ANY
> +(-1U)
> +means any address for binding;
> +.B VMADDR_CID_HYPERVISOR
> +(0) is reserved for services built into the hypervisor;
> +.B VMADDR_CID_RESERVED
> +(1) must not be used;
> +.B VMADDR_CID_HOST
> +(2)
> +is the well-known address of the host.
> +.PP
> +The special constant
> +.B VMADDR_PORT_ANY
> +(-1U)
> +means any port number for binding.
> +.SS Live migration
> +Sockets are affected by live migration of virtual machines.  Connected
> +.B SOCK_STREAM
> +sockets become disconnected when the virtual machine migrates to a new host.
> +Applications must reconnect when this happens.
> +.PP
> +The local CID may change across live migration if the old CID is not 
> available
> +on the new host.  Bound sockets are automatically updated to the new CID.
> +.SS Ioctls
> +.TP
> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
> +Get the CID of the local machine.  The argument is a pointer to an unsigned 
> int.
> +.IP
> +.in +4n
> +.EX
> +.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", " &cid 
> ");"
> +.EE
> +.in
> +.IP
> +Consider using
> +.B VMADDR_CID_ANY
> +when binding instead of getting the local CID with
> +.BR IOCTL_VM_SOCKETS_GET_LOCAL_CID .
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Unable to bind to a privileged port without the
> +.B CAP_NET_BIND_SERVICE
> +capability.
> +.TP
> +.B EINVAL
> +Invalid parameters.  This includes:
> +attempting to bind a socket that i
