[PATCH 3/3] dt-bindings: net: dsa: add new bindings MT7530

2018-11-29 Thread gerg
From: Greg Ungerer 

Add descriptive entries for the new bindings introduced to support the
MT7530 implementation in the MediaTek MT7621 SoC.

New bindings added for:

  mediatek,no-clock-regulator
  mediatek,mfc-has-cpuport

Signed-off-by: Greg Ungerer 
---
 Documentation/devicetree/bindings/net/dsa/mt7530.txt | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/Documentation/devicetree/bindings/net/dsa/mt7530.txt b/Documentation/devicetree/bindings/net/dsa/mt7530.txt
index aa3527f71fdc..549ad7c338ea 100644
--- a/Documentation/devicetree/bindings/net/dsa/mt7530.txt
+++ b/Documentation/devicetree/bindings/net/dsa/mt7530.txt
@@ -9,6 +9,11 @@ Required properties:
 - mediatek,mcm: Boolean; if defined, indicates that either MT7530 is the part
on multi-chip module belong to MT7623A has or the remotely standalone
chip as the function MT7623N reference board provided for.
+- mediatek,no-clock-regulator: Boolean; if defined, indicates that the MT7530
+   is in a system-on-chip that does not require clock or regulator
+   control setup (for example the MT7621).
+- mediatek,mfc-has-cpuport: Boolean; if defined, indicates that the MT7530
+   has an MFC register that has a CPU PORT field and ENABLE bit
 - core-supply: Phandle to the regulator node necessary for the core power.
 - io-supply: Phandle to the regulator node necessary for the I/O power.
See Documentation/devicetree/bindings/regulator/mt6323-regulator.txt
-- 
2.17.1



[PATCH 1/3] net: dsa: mt7530: make clock/regulator setup optional

2018-11-29 Thread gerg
From: Greg Ungerer 

At least one device that contains the MediaTek MT7530 switch module
does not need the clock and regulator setup parts of the MT7530 DSA
driver. That setup looks to be very specific to the MT7623.

The MediaTek MT7621 SoC device contains a 7530 switch, but its MIPS CPU
cores and overall architecture do not have any of the same clock or
regulator setup as the MT7623.

Create a new devicetree tag, mediatek,no-clock-regulator, that controls
whether we should setup and use the clocks and regulators specific to
some uses of the 7530.

Signed-off-by: Greg Ungerer 
---
 drivers/net/dsa/mt7530.c | 86 
 drivers/net/dsa/mt7530.h |  1 +
 2 files changed, 52 insertions(+), 35 deletions(-)

diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index a5de9bffe5be..532240c4bef9 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -625,14 +625,16 @@ static void mt7530_adjust_link(struct dsa_switch *ds, int port,
dev_dbg(priv->dev, "phy-mode for master device = %x\n",
phydev->interface);
 
-   /* Setup TX circuit incluing relevant PAD and driving */
-   mt7530_pad_clk_setup(ds, phydev->interface);
-
-   /* Setup RX circuit, relevant PAD and driving on the host
-* which must be placed after the setup on the device side is
-* all finished.
-*/
-   mt7623_pad_clk_setup(ds);
+   if (priv->clkreg) {
+   /* Setup TX circuit including relevant PAD and driving */
+   mt7530_pad_clk_setup(ds, phydev->interface);
+
+   /* Setup RX circuit, relevant PAD and driving on the
+* host which must be placed after the setup on the
+* device side is all finished.
+*/
+   mt7623_pad_clk_setup(ds);
+   }
} else {
u16 lcl_adv = 0, rmt_adv = 0;
u8 flowctrl;
@@ -1219,24 +1221,27 @@ mt7530_setup(struct dsa_switch *ds)
 * as two netdev instances.
 */
dn = ds->ports[MT7530_CPU_PORT].master->dev.of_node->parent;
-   priv->ethernet = syscon_node_to_regmap(dn);
-   if (IS_ERR(priv->ethernet))
-   return PTR_ERR(priv->ethernet);
 
-   regulator_set_voltage(priv->core_pwr, 1000000, 1000000);
-   ret = regulator_enable(priv->core_pwr);
-   if (ret < 0) {
-   dev_err(priv->dev,
-   "Failed to enable core power: %d\n", ret);
-   return ret;
-   }
+   if (priv->clkreg) {
+   priv->ethernet = syscon_node_to_regmap(dn);
+   if (IS_ERR(priv->ethernet))
+   return PTR_ERR(priv->ethernet);
+
+   regulator_set_voltage(priv->core_pwr, 1000000, 1000000);
+   ret = regulator_enable(priv->core_pwr);
+   if (ret < 0) {
+   dev_err(priv->dev,
+   "Failed to enable core power: %d\n", ret);
+   return ret;
+   }
 
-   regulator_set_voltage(priv->io_pwr, 3300000, 3300000);
-   ret = regulator_enable(priv->io_pwr);
-   if (ret < 0) {
-   dev_err(priv->dev, "Failed to enable io pwr: %d\n",
-   ret);
-   return ret;
+   regulator_set_voltage(priv->io_pwr, 3300000, 3300000);
+   ret = regulator_enable(priv->io_pwr);
+   if (ret < 0) {
+   dev_err(priv->dev, "Failed to enable io pwr: %d\n",
+   ret);
+   return ret;
+   }
}
 
/* Reset whole chip through gpio pin or memory-mapped registers for
@@ -1268,10 +1273,12 @@ mt7530_setup(struct dsa_switch *ds)
return -ENODEV;
}
 
-   /* Reset the switch through internal reset */
-   mt7530_write(priv, MT7530_SYS_CTRL,
-SYS_CTRL_PHY_RST | SYS_CTRL_SW_RST |
-SYS_CTRL_REG_RST);
+   if (priv->clkreg) {
+   /* Reset the switch through internal reset */
+   mt7530_write(priv, MT7530_SYS_CTRL,
+SYS_CTRL_PHY_RST | SYS_CTRL_SW_RST |
+SYS_CTRL_REG_RST);
+   }
 
/* Enable Port 6 only; P5 as GMAC5 which currently is not supported */
val = mt7530_read(priv, MT7530_MHWTRAP);
@@ -1356,13 +1363,22 @@ mt7530_probe(struct mdio_device *mdiodev)
}
}
 
-   priv->core_pwr = devm_regulator_get(&mdiodev->dev, "core");
-   if (IS_ERR(priv->core_pwr))
-   return PTR_ERR(priv->core_pwr);
-
-   priv->io_pwr = devm_regulator_get(&mdiodev->dev, "io");
-   if (IS_ERR(priv->io_pwr))
-   return PTR_ERR(priv->io_pwr);
+   /* Use mediatek,no-clock-regu

[PATCH 2/3] net: dsa: mt7530: optional setting CPU field in MFC register

2018-11-29 Thread gerg
From: Greg Ungerer 

Some versions of the MediaTek MT7530 switch have a CPU port number and
enable bit in their MFC register. The MT7530 instance on the MediaTek
MT7621 SoC is one example. On these devices the switch will not work
unless these fields are correctly set up.

Create a new devicetree tag, mediatek,mfc-has-cpuport, that signifies
that this device needs the CPU port field and enable bit set when the
port is enabled.

Signed-off-by: Greg Ungerer 
---
 drivers/net/dsa/mt7530.c | 9 +++++++++
 drivers/net/dsa/mt7530.h | 4 ++++
 2 files changed, 13 insertions(+)

diff --git a/drivers/net/dsa/mt7530.c b/drivers/net/dsa/mt7530.c
index 532240c4bef9..e41ada47233a 100644
--- a/drivers/net/dsa/mt7530.c
+++ b/drivers/net/dsa/mt7530.c
@@ -689,6 +689,10 @@ mt7530_cpu_port_enable(struct mt7530_priv *priv,
/* Unknown unicast frame fordwarding to the cpu port */
mt7530_set(priv, MT7530_MFC, UNU_FFP(BIT(port)));
 
+   /* Set CPU port number */
+   if (priv->mfccpu)
+   mt7530_rmw(priv, MT7530_MFC, CPU_MASK, CPU_EN | CPU_PORT(port));
+
/* CPU port gets connected to all user ports of
 * the switch
 */
@@ -1380,6 +1384,11 @@ mt7530_probe(struct mdio_device *mdiodev)
 "MT7530 with no CLOCK or REGULATOR control\n");
}
 
+   /* Use mediatek,mfc-has-cpuport to distinguish hardware where the MFC
+* register has a CPU port number field setting.
+*/
+   priv->mfccpu = of_property_read_bool(dn, "mediatek,mfc-has-cpuport");
+
/* Not MCM that indicates switch works as the remote standalone
 * integrated circuit so the GPIO pin would be used to complete
 * the reset, otherwise memory-mapped register accessing used
diff --git a/drivers/net/dsa/mt7530.h b/drivers/net/dsa/mt7530.h
index ee97944c4507..cbc0725c4258 100644
--- a/drivers/net/dsa/mt7530.h
+++ b/drivers/net/dsa/mt7530.h
@@ -36,6 +36,9 @@
 #define  UNM_FFP(x)	(((x) & 0xff) << 16)
 #define  UNU_FFP(x)	(((x) & 0xff) << 8)
 #define  UNU_FFP_MASK	UNU_FFP(~0)
+#define  CPU_EN	BIT(7)
+#define  CPU_PORT(x)	((x) << 4)
+#define  CPU_MASK	(0xf << 4)
 
 /* Registers for address table access */
 #define MT7530_ATA1	0x74
@@ -432,6 +435,7 @@ struct mt7530_priv {
	struct gpio_desc	*reset;
	bool			mcm;
	bool			clkreg;
+	bool			mfccpu;
 
struct mt7530_port  ports[MT7530_NUM_PORTS];
/* protect among processes for registers access*/
-- 
2.17.1



[PATCH 0/3] net: dsa: mt7530: support MT7530 in the MT7621 SoC

2018-11-29 Thread gerg
I have been working towards supporting the MT7530 switch as used in the
MediaTek MT7621 SoC. Unlike the MediaTek MT7623, the MT7621 is built around
a dual-core MIPS CPU architecture. But underneath it is what appears to
be the same 7530 switch.

The following 3 patches are more of an RFC than anything. They allow
use of the mt7530 dsa driver on this device - though with some issues
still to resolve. The primary change required is to not use the
MT7623-specific clock and regulator setup - none of that applies when
using the 7621 (and maybe other devices?). The other change required is
to set the 7530 MFC register CPU port number and enable bit.

The unresolved issues I still have appear to be more related to the
MT7621 ethernet driver (drivers/staging/mt7621-eth/*). I am hoping
someone might have some ideas on these. I don't really have any good
documentation on the ethernet devices on the 7621, so I am kind of
working in the dark here.

1. TX packets are not getting an IP header checksum via the normal
   off-loaded checksumming when in DSA mode. I have to switch off
   NETIF_F_IP_CSUM, so the software stack generates the checksum.
   That checksum offloading works ok when not using the 7530 DSA driver.

2. Maximal sized RX packets get silently dropped. So receive side packets
   that are large (a perfect case is the all-but-last packets of a
   fragmented larger packet) appear to be dropped at the MT7621 ethernet
   MAC level.
   The 7530 MIB switch register counters show receive packets at the physical
   switch port side and at the CPU switch port - but I get no packets
   received or errors in the 7621 ethernet MAC. If I set the mtu of the
   server at the other end a little smaller (a few bytes is enough) then
   I get all the packets through. It seems like the DSA/VLAN tag bytes
   are causing a too large packet to get silently dropped somewhere.

Any thoughts greatly appreciated...

 Documentation/devicetree/bindings/net/dsa/mt7530.txt |5 +
 drivers/net/dsa/mt7530.c |   95 ---
 drivers/net/dsa/mt7530.h |5 +
 3 files changed, 70 insertions(+), 35 deletions(-)




Re: [B.A.T.M.A.N.] [PATCH] batman-adv: fix null pointer dereference in batadv_gw_election

2018-11-29 Thread Sven Eckelmann
On Friday, 30 November 2018 15:00:02 CET Wen Yang wrote:
> This patch fixes a possible null pointer dereference in
> batadv_gw_election, detected by the semantic patch
> deref_null.cocci, with the following warning:
> 
> ./net/batman-adv/gateway_client.c:289:15-24: ERROR: next_gw is NULL but dereferenced.
> ./net/batman-adv/gateway_client.c:290:15-29: ERROR: next_gw is NULL but dereferenced.
> ./net/batman-adv/gateway_client.c:291:15-29: ERROR: next_gw is NULL but dereferenced.
> ./net/batman-adv/gateway_client.c:292:15-27: ERROR: next_gw is NULL but dereferenced.
> ./net/batman-adv/gateway_client.c:293:15-27: ERROR: next_gw is NULL but dereferenced.

This patch seems to be nonsensical. next_gw cannot be NULL at this point
(let us call this location in the code "4."). Let us go through the code:

// 1. when both are NULL then it would jump out of the function.
if (curr_gw == next_gw)
goto out;

[...]

if (curr_gw && !next_gw) {
[...]
// 2. this handles the only valid case when next_gw is NULL
}  else if (!curr_gw && next_gw) {
// 3. here we know that next_gw is not NULL and curr_gw is NULL
// we can therefore infer that 
[...]
} else {
// 4. here you try to add an ugly patch to handle a non-existing 
// next_gw == NULL case
[...]
}

Let us go through all possible combinations:

curr_gw  next_gw
I   00
II  x0
III 0y
IV  xy

For I:   we would leave the function even at 1. and never reach 4.
For II:  would be handled by 2. and thus never reach 4.
For III: would be handled by 3. and thus never reach 4.
For IV:  This can be handled by 1. (when x == y). Or otherwise it would be
 handled by 4. but is not the next_gw == NULL case.

Please correct me (in case I missed something) but it looks to me that this 
patch should be rejected.

Kind regards,
Sven



RE: Invalid transport_offset with AF_PACKET socket

2018-11-29 Thread Maxim Mikityanskiy
> > > So it should return at least 18 and not 14.
> >
> > Yes, the function does its best to return at least 18, but it silently
> > expects skb_transport_offset to exceed 18. In normal conditions, it
> > will be more than 18, because it will be at least 14 + 20. But in my
> > case, when I send a packet via an AF_PACKET socket, skb_transport_offset
> > returns 14 (which is nonsense), and the driver uses this value, causing
> > the hardware to fail, because it's less than 18.
> >
>
> Got it, so even if you copy 18 it is not sufficient ! if the packet is
> ipv4 or ipv6 and the inline mode is set to MLX5_INLINE_MODE_IP in the
> vport context you must copy the IP headers as well !

Yes, I know that. That's why the driver uses skb_transport_offset - to include
the network header - but as skb_transport_offset returns a wrong offset, the IP
header doesn't get included. The thing is that with 14 bytes the driver even
fails to send a packet, while with 18 bytes and a missing L3 header it seems to
send it. But it doesn't matter much; of course, we should pass the L3 header as
well.

> but what do you expect from AF_PACKET socket ? to parse each and every
> packet and set skb_transport_offset ?

Not exactly.

First, AF_PACKET already parses each packet, and it even sets the transport
offset correctly if I specify the protocol in the third parameter of the
socket() call:

socket(AF_PACKET, SOCK_RAW, htons(0x0800));

So, if called in a specific way, AF_PACKET does exactly that - it parses
each packet and sets the transport offset, but if the protocol is not
specified (0) in the third parameter, it fails to parse correctly.

Second, I don't expect that. It's absolutely OK if AF_PACKET tells the driver
that it failed to parse the packet, then the driver can guess the network
protocol and restart parsing itself. The driver knows the format of its L2
header and can look at the necessary field.

However, if AF_PACKET could ask the driver to guess the protocol, it would be a
little bit faster, because this way skb_probe_transport_header is only called
once, and the field lookup in the L2 header is O(1).


Re: Freeze when using ipheth+IPsec+IPv6

2018-11-29 Thread Yves-Alexis Perez
On Thu, 2018-11-29 at 15:52 -0800, David Miller wrote:
> From: Kees Cook 
> Date: Thu, 29 Nov 2018 15:31:25 -0800
> 
> > Did you ever solve this?
> 
> I think it was fixed by:
> 
> commit 45611c61dd503454b2edae00aabe1e429ec49ebe
> Author: Bernd Eckstein <3erndeckst...@gmail.com>
> Date:   Fri Nov 23 13:51:26 2018 +0100
> 
> usbnet: ipheth: fix potential recvmsg bug and recvmsg bug 2

Supposedly yes. Bernd Eckstein contacted me with his patch, unfortunately at
that time I wasn't able to reproduce the bug consistently so I wasn't able to
put a Tested-By tag on it.

Regards,
-- 
Yves-Alexis


[PATCH] batman-adv: fix null pointer dereference in batadv_gw_election

2018-11-29 Thread Wen Yang
This patch fixes a possible null pointer dereference in
batadv_gw_election, detected by the semantic patch
deref_null.cocci, with the following warning:

./net/batman-adv/gateway_client.c:289:15-24: ERROR: next_gw is NULL but dereferenced.
./net/batman-adv/gateway_client.c:290:15-29: ERROR: next_gw is NULL but dereferenced.
./net/batman-adv/gateway_client.c:291:15-29: ERROR: next_gw is NULL but dereferenced.
./net/batman-adv/gateway_client.c:292:15-27: ERROR: next_gw is NULL but dereferenced.
./net/batman-adv/gateway_client.c:293:15-27: ERROR: next_gw is NULL but dereferenced.

Signed-off-by: Wen Yang 
Reviewed-by: Tan Hu 
---
 net/batman-adv/gateway_client.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/net/batman-adv/gateway_client.c b/net/batman-adv/gateway_client.c
index 140c61a..d80ef1c 100644
--- a/net/batman-adv/gateway_client.c
+++ b/net/batman-adv/gateway_client.c
@@ -284,14 +284,16 @@ void batadv_gw_election(struct batadv_priv *bat_priv)
batadv_throw_uevent(bat_priv, BATADV_UEV_GW, BATADV_UEV_ADD,
gw_addr);
} else {
-   batadv_dbg(BATADV_DBG_BATMAN, bat_priv,
-  "Changing route to gateway %pM (bandwidth: %u.%u/%u.%u MBit, tq: %i)\n",
-  next_gw->orig_node->orig,
-  next_gw->bandwidth_down / 10,
-  next_gw->bandwidth_down % 10,
-  next_gw->bandwidth_up / 10,
-  next_gw->bandwidth_up % 10,
-  router_ifinfo->bat_iv.tq_avg);
+   if (next_gw) {
+   batadv_dbg(BATADV_DBG_BATMAN, bat_priv,
+  "Changing route to gateway %pM (bandwidth: %u.%u/%u.%u MBit, tq: %i)\n",
+   next_gw->orig_node->orig,
+   next_gw->bandwidth_down / 10,
+   next_gw->bandwidth_down % 10,
+   next_gw->bandwidth_up / 10,
+   next_gw->bandwidth_up % 10,
+   router_ifinfo->bat_iv.tq_avg);
+   }
batadv_throw_uevent(bat_priv, BATADV_UEV_GW, BATADV_UEV_CHANGE,
gw_addr);
}
-- 
2.9.5



Re: [RFC PATCH net] net: phy: fix the issue that netif always links up after resuming

2018-11-29 Thread Kunihiko Hayashi
Hi Heiner,

On Fri, 30 Nov 2018 07:20:27 +0100  wrote:

> On 30.11.2018 05:37, Kunihiko Hayashi wrote:
> > Hi Heiner Florian,
> > 
> > Thank you for your comments.
> > 
> > On Thu, 29 Nov 2018 16:37:48 -0800  wrote:
> > 
> >>
> >>
> >> On 11/29/2018 2:47 PM, Heiner Kallweit wrote:
> >>> On 29.11.2018 09:12, Kunihiko Hayashi wrote:
>  Even though the link is down before entering hibernation,
>  there is an issue that the network interface always links up after 
>  resuming
>  from hibernation.
> 
>  The phydev->state is PHY_READY before enabling the network interface, so
>  the link is down. After resuming from hibernation, the phydev->state is
>  forcibly set to PHY_UP in mdio_bus_phy_restore(), and the link becomes 
>  up.
> 
>  This patch expects to solve the issue by changing phydev->state to PHY_UP
>  only when the link is up.
> 
>  Signed-off-by: Kunihiko Hayashi 
>  ---
>   drivers/net/phy/phy_device.c | 6 ++++--
>   1 file changed, 4 insertions(+), 2 deletions(-)
> 
>  diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>  index ab33d17..d5bba0f 100644
>  --- a/drivers/net/phy/phy_device.c
>  +++ b/drivers/net/phy/phy_device.c
>  @@ -309,8 +309,10 @@ static int mdio_bus_phy_restore(struct device *dev)
>   return ret;
>   
>   /* The PHY needs to renegotiate. */
>  -phydev->link = 0;
>  -phydev->state = PHY_UP;
>  +if (phydev->link) {
>  +phydev->link = 0;
>  +phydev->state = PHY_UP;
>  +}
>   
> >>> Thanks for reporting. I agree that it isn't right to unconditionally set
> >>> PHY_UP, because we don't know whether the PHY was started before
> >>> hibernation. However I don't think using phydev->link as criteria is
> >>> right. Example would be: PHY was started before hibernation, but w/o link.
> >>> In this case we want to set PHY_UP to start an aneg, because a cable may
> >>> have been plugged in whilst system was sleeping.
> > 
> > Indeed. I didn't consider the case that the PHY was started but a cable was
> > unplugged before hibernation.
> > 
> >>> So I think, similar to phy_stop_machine, we should use state >= UP and
> >>> state != HALTED as criteria, and also phy_start_machine() would need to
> >>> be called only if this criteria is met.
> >>>
> >>> It may make sense to add a helper for checking whether PHY is in a
> >>> started state (>=UP && !=HALTED), because we need this in more than
> >>> one place.
> >>
> >> Agreed, that would make sense.
> > 
> > I agree, too.
> > I'll try this in v2 patch that changes the PHY state to PHY_UP and calls
> > phy_start_machine(), only when the PHY was started before hibernation.
> > If I understand correctly, it will be like that:
> > 
> > phydev->link = 0;
> Even this may go into the if clause. If PHY isn't started then
> phydev->link should be 0 anyway.

Yes. To clarify more, this will go into the if clause. 

> 
> > if (phy_is_started(phydev)) {
> > phydev->state = PHY_UP;
> > phy_start_machine(phydev);
> > }
> > 
> Yes, this is what was meant. Thanks.

Thank you,

---
Best Regards,
Kunihiko Hayashi




Re: [PATCHv2 net] sctp: check and update stream->out_curr when allocating stream_out

2018-11-29 Thread Xin Long
On Thu, Nov 29, 2018 at 11:39 PM Neil Horman  wrote:
>
> On Thu, Nov 29, 2018 at 02:42:56PM +0800, Xin Long wrote:
> > Now when using stream reconfig to add out streams, stream->out
> > will get re-allocated, and all old streams' information will
> > be copied to the new ones and the old ones will be freed.
> >
> > So without stream->out_curr updated, next time when trying to
> > send from stream->out_curr stream, a panic would be caused.
> >
> > This patch is to check and update stream->out_curr when
> > allocating stream_out.
> >
> > v1->v2:
> >   - define fa_index() to get elem index from stream->out_curr.
> >
> > Fixes: 5e32a431 ("sctp: introduce stream scheduler foundations")
> > Reported-by: Ying Xu 
> > Reported-by: syzbot+e33a3a138267ca119...@syzkaller.appspotmail.com
> > Signed-off-by: Xin Long 
> > ---
> >  net/sctp/stream.c | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> >
> > diff --git a/net/sctp/stream.c b/net/sctp/stream.c
> > index 3892e76..30e7809 100644
> > --- a/net/sctp/stream.c
> > +++ b/net/sctp/stream.c
> > @@ -84,6 +84,19 @@ static void fa_zero(struct flex_array *fa, size_t index, size_t count)
> >   }
> >  }
> >
> > +static size_t fa_index(struct flex_array *fa, void *elem, size_t count)
> > +{
> > + size_t index = 0;
> > +
> > + while (count--) {
> > + if (elem == flex_array_get(fa, index))
> > + break;
> > + index++;
> > + }
> > +
> > + return index;
> > +}
> > +
> >  /* Migrates chunks from stream queues to new stream queues if needed,
> >   * but not across associations. Also, removes those chunks to streams
> >   * higher than the new max.
> > @@ -147,6 +160,13 @@ static int sctp_stream_alloc_out(struct sctp_stream *stream, __u16 outcnt,
> >
> >   if (stream->out) {
> >   fa_copy(out, stream->out, 0, min(outcnt, stream->outcnt));
> > + if (stream->out_curr) {
> > + size_t index = fa_index(stream->out, stream->out_curr,
> > + stream->outcnt);
> > +
> > + BUG_ON(index == stream->outcnt);
> > + stream->out_curr = flex_array_get(out, index);
> > + }
> >   fa_free(stream->out);
> >   }
> >
> > --
> > 2.1.0
> >
> >
>
> This is the sort of thing I'm talking about. Its a little more code, but
> if you augment the flex_array api like this, you can perform a resize
> operation on your existing flex array, and you can avoid all the copying
> and the need to update pointers maintained outside the array.  Note this
> code isn't tested at all, but its close to what I think should work.
>
>
> diff --git a/include/linux/flex_array.h b/include/linux/flex_array.h
> index b94fa61b51fb..7fa1f27a91b5 100644
> --- a/include/linux/flex_array.h
> +++ b/include/linux/flex_array.h
> @@ -73,6 +73,8 @@ struct flex_array {
>  struct flex_array *flex_array_alloc(int element_size, unsigned int total,
> gfp_t flags);
>
> +struct flex_array *flex_array_resize(struct flex_array *fa, unsigned int total, gfp_t flags);
> +
>  /**
>   * flex_array_prealloc() - Ensures that memory for the elements indexed
>   * in the range defined by start and nr_elements has been allocated.
> diff --git a/lib/flex_array.c b/lib/flex_array.c
> index 2eed22fa507c..f8d54af3891b 100644
> --- a/lib/flex_array.c
> +++ b/lib/flex_array.c
> @@ -109,6 +109,7 @@ struct flex_array *flex_array_alloc(int element_size, unsigned int total,
> ret->total_nr_elements = total;
> ret->elems_per_part = elems_per_part;
> ret->reciprocal_elems = reciprocal_elems;
> +   ret->elements_used = 0;
> if (elements_fit_in_base(ret) && !(flags & __GFP_ZERO))
> memset(&ret->parts[0], FLEX_ARRAY_FREE,
> FLEX_ARRAY_BASE_BYTES_LEFT);
> @@ -116,6 +117,53 @@ struct flex_array *flex_array_alloc(int element_size, unsigned int total,
>  }
>  EXPORT_SYMBOL(flex_array_alloc);
>
> +static int flex_array_last_element_index(struct flex_array *fa)
> +{
> +   struct flex_array_part *part;
> +   int part_nr;
> +   int i,j;
> +
> +   if (elements_fit_in_base(fa)) {
> +   part = (struct flex_array_part *)&fa->parts[0];
> +   for (i = fa->elems_per_part; i >= 0; i--)
> +   if (part->elements[i] != FLEX_ARRAY_FREE)
> +   return i;
> +   }
> +
> +   i = fa->total_nr_elements;
> +   for (part_nr = 0; part_nr < FLEX_ARRAY_NR_BASE_PTRS; part_nr++) {
> +   part = fa->parts[part_nr];
> +   if (!part) {
> +   i -= fa->elems_per_part;
> +   continue;
> +   }
> +   for (j = fa->elems_per_part; j >= 0; j--, i--)
> +   if (part->elements[j] != FLEX_ARRAY_FREE)
> +   goto out;
> 

Re: [RFC PATCH net] net: phy: fix the issue that netif always links up after resuming

2018-11-29 Thread Heiner Kallweit
On 30.11.2018 05:37, Kunihiko Hayashi wrote:
> Hi Heiner Florian,
> 
> Thank you for your comments.
> 
> On Thu, 29 Nov 2018 16:37:48 -0800  wrote:
> 
>>
>>
>> On 11/29/2018 2:47 PM, Heiner Kallweit wrote:
>>> On 29.11.2018 09:12, Kunihiko Hayashi wrote:
 Even though the link is down before entering hibernation,
 there is an issue that the network interface always links up after resuming
 from hibernation.

 The phydev->state is PHY_READY before enabling the network interface, so
 the link is down. After resuming from hibernation, the phydev->state is
 forcibly set to PHY_UP in mdio_bus_phy_restore(), and the link becomes up.

 This patch expects to solve the issue by changing phydev->state to PHY_UP
 only when the link is up.

 Signed-off-by: Kunihiko Hayashi 
 ---
  drivers/net/phy/phy_device.c | 6 ++++--
  1 file changed, 4 insertions(+), 2 deletions(-)

 diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
 index ab33d17..d5bba0f 100644
 --- a/drivers/net/phy/phy_device.c
 +++ b/drivers/net/phy/phy_device.c
 @@ -309,8 +309,10 @@ static int mdio_bus_phy_restore(struct device *dev)
return ret;
  
/* The PHY needs to renegotiate. */
 -  phydev->link = 0;
 -  phydev->state = PHY_UP;
 +  if (phydev->link) {
 +  phydev->link = 0;
 +  phydev->state = PHY_UP;
 +  }
  
>>> Thanks for reporting. I agree that it isn't right to unconditionally set
>>> PHY_UP, because we don't know whether the PHY was started before
>>> hibernation. However I don't think using phydev->link as criteria is
>>> right. Example would be: PHY was started before hibernation, but w/o link.
>>> In this case we want to set PHY_UP to start an aneg, because a cable may
>>> have been plugged in whilst system was sleeping.
> 
> Indeed. I didn't consider the case that the PHY was started but a cable was
> unplugged before hibernation.
> 
>>> So I think, similar to phy_stop_machine, we should use state >= UP and
>>> state != HALTED as criteria, and also phy_start_machine() would need to
>>> be called only if this criteria is met.
>>>
>>> It may make sense to add a helper for checking whether PHY is in a
>>> started state (>=UP && !=HALTED), because we need this in more than
>>> one place.
>>
>> Agreed, that would make sense.
> 
> I agree, too.
> I'll try this in v2 patch that changes the PHY state to PHY_UP and calls
> phy_start_machine(), only when the PHY was started before hibernation.
> If I understand correctly, it will be like that:
> 
>   phydev->link = 0;
Even this may go into the if clause. If PHY isn't started then
phydev->link should be 0 anyway.

>   if (phy_is_started(phydev)) {
>   phydev->state = PHY_UP;
>   phy_start_machine(phydev);
>   }
> 
Yes, this is what was meant. Thanks.

> Thank you,
> 
> ---
> Best Regards,
> Kunihiko Hayashi
> 
> 
> 



Re: [PATCH net] tun: remove skb access after netif_receive_skb

2018-11-29 Thread Toshiaki Makita
On 2018/11/30 11:40, Jason Wang wrote:
> On 2018/11/30 10:30 AM, Prashant Bhole wrote:
>> In tun.c skb->len was accessed while doing stats accounting after a
>> call to netif_receive_skb. We can not access skb after this call
>> because buffers may be dropped.
>>
>> The fix for this bug would be to store skb->len in local variable and
>> then use it after netif_receive_skb(). IMO using xdp data size for
>> accounting bytes will be better because input for tun_xdp_one() is
>> xdp_buff.
>>
>> Hence this patch:
>> - fixes a bug by removing skb access after netif_receive_skb()
>> - uses xdp data size for accounting bytes
...
>> Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
>> Reviewed-by: Toshiaki Makita 
>> Signed-off-by: Prashant Bhole 
>> ---
>>   drivers/net/tun.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>> index e244f5d7512a..6e388792c0a8 100644
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -2385,6 +2385,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>  struct tun_file *tfile,
>>  struct xdp_buff *xdp, int *flush)
>>   {
>> +    unsigned int datasize = xdp->data_end - xdp->data;
>>   struct tun_xdp_hdr *hdr = xdp->data_hard_start;
>>   struct virtio_net_hdr *gso = &hdr->gso;
>>   struct tun_pcpu_stats *stats;
>> @@ -2461,7 +2462,7 @@ static int tun_xdp_one(struct tun_struct *tun,
>>   stats = get_cpu_ptr(tun->pcpu_stats);
>>   u64_stats_update_begin(&stats->syncp);
>>   stats->rx_packets++;
>> -    stats->rx_bytes += skb->len;
>> +    stats->rx_bytes += datasize;
>>   u64_stats_update_end(&stats->syncp);
>>   put_cpu_ptr(stats);
>>   
> 
> 
> Good catch, but you probably need to calculate the datasize after XDP
> processing since it may modify the packet length.

(+CC David Ahern who may be interested in this area.)

I'd rather think we should calculate it before XDP.
I checked several drivers behavior. mlx5, bnxt and qede use hardware
counters for rx bytes, which means the size is calculated before XDP.
nfp calculate it by software but before XDP. On the other hand, intel
drivers use skb->len. So currently bytes counters do not look
consistent, but I think calculation before XDP is more common.

-- 
Toshiaki Makita



Re: [PATCHv2 net] sctp: hold transport before accessing its asoc in sctp_epaddr_lookup_transport

2018-11-29 Thread Xin Long
On Fri, Nov 30, 2018 at 5:52 AM Neil Horman  wrote:
>
> On Thu, Nov 29, 2018 at 02:44:07PM +0800, Xin Long wrote:
> > Without holding transport to dereference its asoc, a use after
> > free panic can be caused in sctp_epaddr_lookup_transport. Note
> > that a sock lock can't protect these transports that belong to
> > other socks.
> >
> > A similar fix as Commit bab1be79a516 ("sctp: hold transport
> > before accessing its asoc in sctp_transport_get_next") is
> > needed to hold the transport before accessing its asoc in
> > sctp_epaddr_lookup_transport.
> >
> > Note that this extra atomic operation is on the datapath,
> > but as rhlist keeps the lists to a small size, it won't
> > see a noticeable performance hurt.
> >
> > v1->v2:
> >   - improve the changelog.
> >
> > Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable")
> > Reported-by: syzbot+aad231d51b1923158...@syzkaller.appspotmail.com
> > Signed-off-by: Xin Long 
> > ---
> >  net/sctp/input.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/net/sctp/input.c b/net/sctp/input.c
> > index 5c36a99..ce7351c 100644
> > --- a/net/sctp/input.c
> > +++ b/net/sctp/input.c
> > @@ -967,9 +967,15 @@ struct sctp_transport *sctp_epaddr_lookup_transport(
> >   list = rhltable_lookup(&sctp_transport_hashtable, &arg,
> >  sctp_hash_params);
> >
> > - rhl_for_each_entry_rcu(t, tmp, list, node)
> > - if (ep == t->asoc->ep)
> > + rhl_for_each_entry_rcu(t, tmp, list, node) {
> > + if (!sctp_transport_hold(t))
> > + continue;
> > + if (ep == t->asoc->ep) {
> > + sctp_transport_put(t);
> >   return t;
> > + }
> > + sctp_transport_put(t);
> > + }
> >
> >   return NULL;
> >  }
>
> Wait a second, what if we just added an rcu_head to the association structure
> and changed the kfree call in sctp_association_destroy to a kfree_rcu call
> instead?  That would force the actual freeing of the association to pass
> through a grace period, during which any in-flight list traversal in
> sctp_epaddr_lookup_transport could complete safely.  It's another two pointers
We discussed this in the last thread:
https://www.spinics.net/lists/netdev/msg535191.html

It will cause closed sk to linger longer.

> worth of space in the association, but I think that would be a worthwhile
> tradeoff for not having to do N atomic adds/puts every time you wanted to
> receive or send a frame.
N is not a big value, as rhlist itself keeps the lists small.
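
For readers following along, the hold-before-dereference pattern being
debated can be sketched with hypothetical types (a generic model, not
the actual sctp structures; the single-threaded refcount below only
stands in for a primitive like refcount_inc_not_zero()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted entry; obj_hold() models "increment unless
 * already zero". These names are illustrative only. */
struct obj {
	int refcnt;
	int key;
};

static bool obj_hold(struct obj *o)
{
	if (o->refcnt == 0)	/* entry is being freed: skip it */
		return false;
	o->refcnt++;
	return true;
}

static void obj_put(struct obj *o)
{
	o->refcnt--;
}

/* Walk a list and return a matching entry.  A reference is held only
 * across the dereference of the entry's fields; as in the posted patch,
 * the caller is assumed to still be protected (by RCU) after return. */
static struct obj *lookup(struct obj **list, size_t n, int key)
{
	for (size_t i = 0; i < n; i++) {
		struct obj *o = list[i];

		if (!obj_hold(o))
			continue;	/* dying entry: don't touch fields */
		if (o->key == key) {
			obj_put(o);
			return o;
		}
		obj_put(o);
	}
	return NULL;
}
```

As in the posted v2 patch, the reference is dropped before returning:
the hold only guards the dereference of fields that a concurrent free
could invalidate, while RCU is assumed to protect the caller.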

>
> Neil
>
> > --
> > 2.1.0
> >
> >


Re: [RFC PATCH net] net: phy: fix the issue that netif always links up after resuming

2018-11-29 Thread Kunihiko Hayashi
Hi Heiner, Florian,

Thank you for your comments.

On Thu, 29 Nov 2018 16:37:48 -0800  wrote:

> 
> 
> On 11/29/2018 2:47 PM, Heiner Kallweit wrote:
> > On 29.11.2018 09:12, Kunihiko Hayashi wrote:
> >> Even though the link is down before entering hibernation,
> >> there is an issue that the network interface always links up after resuming
> >> from hibernation.
> >>
> >> The phydev->state is PHY_READY before enabling the network interface, so
> >> the link is down. After resuming from hibernation, the phydev->state is
> >> forcibly set to PHY_UP in mdio_bus_phy_restore(), and the link becomes up.
> >>
> >> This patch expects to solve the issue by changing phydev->state to PHY_UP
> >> only when the link is up.
> >>
> >> Signed-off-by: Kunihiko Hayashi 
> >> ---
> >>  drivers/net/phy/phy_device.c | 6 --
> >>  1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> >> index ab33d17..d5bba0f 100644
> >> --- a/drivers/net/phy/phy_device.c
> >> +++ b/drivers/net/phy/phy_device.c
> >> @@ -309,8 +309,10 @@ static int mdio_bus_phy_restore(struct device *dev)
> >>return ret;
> >>  
> >>/* The PHY needs to renegotiate. */
> >> -  phydev->link = 0;
> >> -  phydev->state = PHY_UP;
> >> +  if (phydev->link) {
> >> +  phydev->link = 0;
> >> +  phydev->state = PHY_UP;
> >> +  }
> >>  
> > Thanks for reporting. I agree that it isn't right to unconditionally set
> > PHY_UP, because we don't know whether the PHY was started before
> > hibernation. However I don't think using phydev->link as criteria is
> > right. Example would be: PHY was started before hibernation, but w/o link.
> > In this case we want to set PHY_UP to start an aneg, because a cable may
> > have been plugged in whilst system was sleeping.

Indeed. I didn't consider the case where the PHY was started but the
cable was unplugged before hibernation.

> > So I think, similar to phy_stop_machine, we should use state >= UP and
> > state != HALTED as criteria, and also phy_start_machine() would need to
> > be called only if this criteria is met.
> > 
> > It may make sense to add a helper for checking whether PHY is in a
> > started state (>=UP && !=HALTED), because we need this in more than
> > one place.
> 
> Agreed, that would make sense.

I agree, too.
I'll try this in a v2 patch that changes the PHY state to PHY_UP and calls
phy_start_machine() only when the PHY was started before hibernation.
If I understand correctly, it will look like this:

phydev->link = 0;
if (phy_is_started(phydev)) {
phydev->state = PHY_UP;
phy_start_machine(phydev);
}
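
A sketch of the helper being proposed, using a reduced, hypothetical
subset of the PHY state enum (the real enum phy_state in
include/linux/phy.h has more states, and the real helper would take a
struct phy_device *):

```c
#include <assert.h>
#include <stdbool.h>

/* Reduced, hypothetical subset of enum phy_state; the ordering is
 * chosen so the ">= PHY_UP && != PHY_HALTED" criterion from the thread
 * can be applied directly. */
enum phy_state {
	PHY_DOWN = 0,
	PHY_READY,
	PHY_UP,
	PHY_RUNNING,
	PHY_NOLINK,
	PHY_HALTED,
};

/* "Started" means the state machine was brought up and not halted. */
static bool phy_is_started(enum phy_state state)
{
	return state >= PHY_UP && state != PHY_HALTED;
}
```

With such a helper, the restore path only re-arms autonegotiation for
PHYs that were actually running before hibernation, whether or not they
had link at the time.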

Thank you,

---
Best Regards,
Kunihiko Hayashi




[PATCH] dma-debug: hns_enet_drv could use more DMA entries

2018-11-29 Thread Qian Cai
The number of DMA mappings from Hisilicon HNS ethernet devices is huge,
so it can trigger "DMA-API: debugging out of memory - disabling".

hnae_get_handle [1]
  hnae_init_queue
hnae_init_ring
  hnae_alloc_buffers [2]
debug_dma_map_page
  dma_entry_alloc

[1] for (i = 0; i < handle->q_num; i++)
[2] for (i = 0; i < ring->desc_num; i++)

On this Huawei TaiShan 2280 aarch64 server, it has already reached the
limit:

4 (ports) x 16 (handles) x 1024 (rings) = 65536
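
An alternative worth noting (a deployment assumption, not part of this
patch): the preallocated pool can also be raised at boot time with the
documented dma_debug_entries= kernel parameter, avoiding a
driver-specific #ifdef:

```shell
# Hypothetical kernel command line raising the DMA-API debug pool to
# 2^17 entries; see Documentation/admin-guide/kernel-parameters.txt.
linux ... root=/dev/sda2 dma_debug_entries=131072
```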

Signed-off-by: Qian Cai 
---
 kernel/dma/debug.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c
index 231ca4628062..ae91689cc9ad 100644
--- a/kernel/dma/debug.c
+++ b/kernel/dma/debug.c
@@ -43,8 +43,13 @@
 
 /* allow architectures to override this if absolutely required */
 #ifndef PREALLOC_DMA_DEBUG_ENTRIES
+/* amount of DMA mappings on this driver is huge. */
+#ifdef HNS_ENET
+#define PREALLOC_DMA_DEBUG_ENTRIES (1 << 17)
+#else
 #define PREALLOC_DMA_DEBUG_ENTRIES (1 << 16)
 #endif
+#endif
 
 enum {
dma_debug_single,
-- 
2.17.2 (Apple Git-113)



[PATCH bpf-next 2/4] bpf: Adjust F_NEEDS_EFFICIENT_UNALIGNED_ACCESS handling in test_verifier.c

2018-11-29 Thread David Miller


Make it set the flag argument to bpf_verify_program(), which will relax
the alignment restrictions.

Now all such test cases will go properly through the verifier even on
inefficient unaligned access architectures.

On architectures with inefficient unaligned access, do not try to run
such programs; instead, mark the test case as passing but annotate the
result similarly to how it is done now in the presence of this flag.

So we get complete coverage for all REJECT test cases, and at least
verifier-level coverage for ACCEPT test cases.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 34 +++--
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index c3de824..af145f5 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -14200,7 +14200,7 @@ static int set_admin(bool admin)
 static void do_test_single(struct bpf_test *test, bool unpriv,
   int *passes, int *errors)
 {
-   int fd_prog, expected_ret, reject_from_alignment;
+   int fd_prog, expected_ret, alignment_prevented_execution;
int prog_len, prog_type = test->prog_type;
struct bpf_insn *prog = test->insns;
int map_fds[MAX_NR_MAPS];
@@ -14219,7 +14219,7 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
 
fd_prog = bpf_verify_program(prog_type, prog, prog_len,
 test->flags & F_LOAD_WITH_STRICT_ALIGNMENT,
-0,
+test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
 "GPL", 0, bpf_vlog, sizeof(bpf_vlog), 1);
 
expected_ret = unpriv && test->result_unpriv != UNDEF ?
@@ -14229,28 +14229,27 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
expected_val = unpriv && test->retval_unpriv ?
   test->retval_unpriv : test->retval;
 
-   reject_from_alignment = fd_prog < 0 &&
-   (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS) &&
-   strstr(bpf_vlog, "misaligned");
-#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
-   if (reject_from_alignment) {
-   printf("FAIL\nFailed due to alignment despite having efficient unaligned access: '%s'!\n",
-  strerror(errno));
-   goto fail_log;
-   }
-#endif
+   alignment_prevented_execution = 0;
+
if (expected_ret == ACCEPT) {
-   if (fd_prog < 0 && !reject_from_alignment) {
+   if (fd_prog < 0) {
printf("FAIL\nFailed to load prog '%s'!\n",
   strerror(errno));
goto fail_log;
}
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+   if (fd_prog >= 0 &&
+   (test->flags & F_NEEDS_EFFICIENT_UNALIGNED_ACCESS)) {
+   alignment_prevented_execution = 1;
+   goto test_ok;
+   }
+#endif
} else {
if (fd_prog >= 0) {
printf("FAIL\nUnexpected success to load!\n");
goto fail_log;
}
-   if (!strstr(bpf_vlog, expected_err) && !reject_from_alignment) {
+   if (!strstr(bpf_vlog, expected_err)) {
printf("FAIL\nUnexpected error message!\n\tEXP: %s\n\tRES: %s\n",
  expected_err, bpf_vlog);
goto fail_log;
@@ -14278,9 +14277,12 @@ static void do_test_single(struct bpf_test *test, bool unpriv,
goto fail_log;
}
}
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+test_ok:
+#endif
(*passes)++;
-   printf("OK%s\n", reject_from_alignment ?
-  " (NOTE: reject due to unknown alignment)" : "");
+   printf("OK%s\n", alignment_prevented_execution ?
+  " (NOTE: not executed due to unknown alignment)" : "");
 close_fds:
close(fd_prog);
for (i = 0; i < MAX_NR_MAPS; i++)
-- 
2.1.2.532.g19b5d50



[PATCH bpf-next 3/4] bpf: Make more use of 'any' alignment in test_verifier.c

2018-11-29 Thread David Miller


Use F_NEEDS_EFFICIENT_UNALIGNED_ACCESS in more tests where the
expected result is REJECT.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 44 +
 1 file changed, 44 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index af145f5..d8dda16 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -1823,6 +1823,7 @@ static struct bpf_test tests[] = {
.errstr = "invalid bpf_context access",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SK_MSG,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet read for SK_MSG",
@@ -2187,6 +2188,8 @@ static struct bpf_test tests[] = {
},
.errstr = "invalid bpf_context access",
.result = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"check cb access: half, wrong type",
@@ -3249,6 +3252,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "R0 invalid mem access 'inv'",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"raw_stack: skb_load_bytes, spilled regs corruption 2",
@@ -3279,6 +3283,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "R3 invalid mem access 'inv'",
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"raw_stack: skb_load_bytes, spilled regs + data",
@@ -3778,6 +3783,7 @@ static struct bpf_test tests[] = {
.errstr = "R2 invalid mem access 'inv'",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test16 (arith on data_end)",
@@ -3961,6 +3967,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = REJECT,
.errstr = "invalid access to packet, off=0 size=8, R5(id=1,off=0,r=0)",
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test24 (x += pkt_ptr, 5)",
@@ -5117,6 +5124,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "invalid access to map value, value_size=64 off=-2 size=4",
.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"invalid cgroup storage access 5",
@@ -5233,6 +5241,7 @@ static struct bpf_test tests[] = {
.result = REJECT,
.errstr = "invalid access to map value, value_size=64 off=-2 size=4",
.prog_type = BPF_PROG_TYPE_CGROUP_SKB,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"invalid per-cpu cgroup storage access 5",
@@ -7149,6 +7158,7 @@ static struct bpf_test tests[] = {
.errstr = "invalid mem access 'inv'",
.result = REJECT,
.result_unpriv = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"map element value illegal alu op, 5",
@@ -7171,6 +7181,7 @@ static struct bpf_test tests[] = {
.fixup_map_hash_48b = { 3 },
.errstr = "R0 invalid mem access 'inv'",
.result = REJECT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"map element value is preserved across register spilling",
@@ -9663,6 +9674,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end > pkt_data', good access",
@@ -9701,6 +9713,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end > pkt_data', bad access 2",
@@ -9719,6 +9732,7 @@ static struct bpf_test tests[] = {
.errstr = "R1 offset is outside of the packet",
.result = REJECT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},

[PATCH bpf-next 4/4] bpf: Apply F_NEEDS_EFFICIENT_UNALIGNED_ACCESS to more ACCEPT test cases.

2018-11-29 Thread David Miller


If a testcase has alignment problems but is expected to be ACCEPT,
verify it using F_NEEDS_EFFICIENT_UNALIGNED_ACCESS too.

Maybe in the future if we add some architecture specific code to elide
the unaligned memory access warnings during the test, we can execute
these as well.

Signed-off-by: David S. Miller 
---
 tools/testing/selftests/bpf/test_verifier.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index d8dda16..683bf0c 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -3886,6 +3886,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test21 (x += pkt_ptr, 2)",
@@ -3911,6 +3912,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test22 (x += pkt_ptr, 3)",
@@ -3941,6 +3943,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test23 (x += pkt_ptr, 4)",
@@ -3993,6 +3996,7 @@ static struct bpf_test tests[] = {
},
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"direct packet access: test25 (marking on <, good access)",
@@ -7675,6 +7679,7 @@ static struct bpf_test tests[] = {
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.retval = 0 /* csum_diff of 64-byte packet */,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"helper access to variable memory: size = 0 not allowed on NULL (!ARG_PTR_TO_MEM_OR_NULL)",
@@ -9637,6 +9642,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data' > pkt_end, bad access 1",
@@ -9808,6 +9814,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end < pkt_data', bad access 1",
@@ -9920,6 +9927,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_end >= pkt_data', bad access 1",
@@ -9977,6 +9985,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data' <= pkt_end, bad access 1",
@@ -10089,6 +10098,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_meta' > pkt_data, bad access 1",
@@ -10260,6 +10270,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data < pkt_meta', bad access 1",
@@ -10372,6 +10383,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_data >= pkt_meta', bad access 1",
@@ -10429,6 +10441,7 @@ static struct bpf_test tests[] = {
},
.result = ACCEPT,
.prog_type = BPF_PROG_TYPE_XDP,
+   .flags = F_NEEDS_EFFICIENT_UNALIGNED_ACCESS,
},
{
"XDP pkt read, pkt_meta' <= pkt_data, bad access 1",
@@ -12348,6 +12361,7 @@ static struct bpf_test tests[] = {
.prog_type = BPF_PROG_TYPE_SCHED_CLS,
.result = ACCEPT,
.retval = 1,
+   .flags = F_NEEDS_EFFICI

[PATCH bpf-next 0/4] bpf: Improve verifier test coverage on sparc64 et al.

2018-11-29 Thread David Miller


On sparc64 a ton of test cases in test_verifier.c fail because
the memory accesses in the test case are unaligned (or cannot
be proven to be aligned by the verifier).

Perhaps we can eventually try to (carefully) modify each test case
which has this problem to not use unaligned accesses but:

1) That is delicate work.

2) The changes might not fully respect the original
   intention of the testcase.

3) In some cases, such a transformation might not even
   be feasible at all.

So add an "any alignment" flag to tell the verifier to forcefully
disable its alignment checks completely.

test_verifier.c is then annotated to use this flag when necessary.

The presence of the flag in each test case is good documentation to
anyone who wants to actually tackle the job of eliminating the
unaligned memory accesses in the test cases.

I've also seen several weird things in test cases, like trying to
access __skb->mark in a packet buffer.

This gets rid of 104 test_verifier.c failures on sparc64.

Changes since RFC:

1) Only the admin can allow the relaxation of alignment restrictions
   on inefficient unaligned access architectures.

2) Use F_NEEDS_EFFICIENT_UNALIGNED_ACCESS instead of making a new
   flag.

3) Annotate in the output, when we have a test case that the verifier
   accepted but we did not try to execute because we are on an
   inefficient unaligned access platform.  Maybe with some arch
   machinery we can avoid this in the future.

Signed-off-by: David S. Miller 


[PATCH bpf-next 1/4] bpf: Add BPF_F_ANY_ALIGNMENT.

2018-11-29 Thread David Miller


Often we want to write test cases that check things like bad context
offset accesses.  And one way to do this is to use an odd offset on,
for example, a 32-bit load.

This unfortunately triggers the alignment checks first on platforms
that do not set CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS. So the test
case sees the alignment failure rather than what it was testing for.

It is often not completely possible to respect the original intention
of the test, or even test the same exact thing, while solving the
alignment issue.

Another option could have been to check the alignment after the
context and other validations are performed by the verifier, but
that is a non-trivial change to the verifier.
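
The interaction of the two flags this patch creates can be modeled in a
few lines (a behavioral sketch of the bpf_check() hunk below, with the
bit values taken from the uapi additions; not the verifier itself):

```c
#include <assert.h>
#include <stdbool.h>

/* Bit values as added to the uapi header by this patch. */
#define BPF_F_STRICT_ALIGNMENT	(1U << 0)
#define BPF_F_ANY_ALIGNMENT	(1U << 1)

/* Model of the strict_alignment computation in bpf_check(): the config
 * default applies first, then BPF_F_ANY_ALIGNMENT overrides it. */
static bool strict_alignment(unsigned int prog_flags,
			     bool have_efficient_unaligned)
{
	bool strict = prog_flags & BPF_F_STRICT_ALIGNMENT;

	if (!have_efficient_unaligned)
		strict = true;		/* forced on weak-unaligned arches */
	if (prog_flags & BPF_F_ANY_ALIGNMENT)
		strict = false;		/* the new flag wins */
	return strict;
}
```

So on an architecture without efficient unaligned access, where strict
checking is otherwise forced, BPF_F_ANY_ALIGNMENT is the only way to
get the checks out of the way of the condition actually under test.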

Signed-off-by: David S. Miller 
---
 include/uapi/linux/bpf.h| 10 ++
 kernel/bpf/syscall.c|  7 ++-
 kernel/bpf/verifier.c   |  2 ++
 tools/include/uapi/linux/bpf.h  | 10 ++
 tools/lib/bpf/bpf.c | 12 +---
 tools/lib/bpf/bpf.h |  6 +++---
 tools/testing/selftests/bpf/test_align.c|  2 +-
 tools/testing/selftests/bpf/test_verifier.c |  1 +
 8 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 426b5c8..c9647ea 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -232,6 +232,16 @@ enum bpf_attach_type {
  */
 #define BPF_F_STRICT_ALIGNMENT (1U << 0)
 
+/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the
+ * verifier will allow any alignment whatsoever.  This bypasses
+ * what CONFIG_EFFICIENT_UNALIGNED_ACCESS would cause it to do.
+ * It is mostly used for testing when we want to validate the
+ * context and memory access aspects of the validator, but because
+ * of an unaligned access the alignment check would trigger before
+ * the one we are interested in.
+ */
+#define BPF_F_ANY_ALIGNMENT(1U << 1)
+
 /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
 #define BPF_PSEUDO_MAP_FD  1
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cf5040f..cae65bb 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1450,9 +1450,14 @@ static int bpf_prog_load(union bpf_attr *attr)
if (CHECK_ATTR(BPF_PROG_LOAD))
return -EINVAL;
 
-   if (attr->prog_flags & ~BPF_F_STRICT_ALIGNMENT)
+   if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT))
return -EINVAL;
 
+   if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
+   (attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
+   !capable(CAP_SYS_ADMIN))
+   return -EPERM;
+
/* copy eBPF program license from user space */
if (strncpy_from_user(license, u64_to_user_ptr(attr->license),
  sizeof(license) - 1) < 0)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 6dd4195..2fefeae 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -6350,6 +6350,8 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr)
env->strict_alignment = !!(attr->prog_flags & BPF_F_STRICT_ALIGNMENT);
if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS))
env->strict_alignment = true;
+   if (attr->prog_flags & BPF_F_ANY_ALIGNMENT)
+   env->strict_alignment = false;
 
ret = replace_map_fd_with_map_ptr(env);
if (ret < 0)
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 426b5c8..c9647ea 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -232,6 +232,16 @@ enum bpf_attach_type {
  */
 #define BPF_F_STRICT_ALIGNMENT (1U << 0)
 
+/* If BPF_F_ANY_ALIGNMENT is used in BPF_PROF_LOAD command, the
+ * verifier will allow any alignment whatsoever.  This bypasses
+ * what CONFIG_EFFICIENT_UNALIGNED_ACCESS would cause it to do.
+ * It is mostly used for testing when we want to validate the
+ * context and memory access aspects of the validator, but because
+ * of an unaligned access the alignment check would trigger before
+ * the one we are interested in.
+ */
+#define BPF_F_ANY_ALIGNMENT(1U << 1)
+
 /* when bpf_ldimm64->src_reg == BPF_PSEUDO_MAP_FD, bpf_ldimm64->imm == fd */
 #define BPF_PSEUDO_MAP_FD  1
 
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 03f9bcc..d4f76d2 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -232,9 +232,11 @@ int bpf_load_program(enum bpf_prog_type type, const struct bpf_insn *insns,
 
 int bpf_verify_program(enum bpf_prog_type type, const struct bpf_insn *insns,
   size_t insns_cnt, int strict_alignment,
-  const char *license, __u32 kern_version,
-  char *log_buf, size_t log_buf_sz, int log_level)
+  int any_alignment, const char *license,
+  __u32 kern_version, char *l

RE: [Intel-wired-lan] [next-queue PATCH v1 2/2] Documentation: igb: Add a section about CBS

2018-11-29 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Vinicius Costa Gomes
> Sent: Friday, November 16, 2018 4:19 PM
> To: intel-wired-...@lists.osuosl.org
> Cc: netdev@vger.kernel.org; jesus.s.palen...@gmail.com
> Subject: [Intel-wired-lan] [next-queue PATCH v1 2/2] Documentation: igb:
> Add a section about CBS
> 
> Add some pointers to the definition of the CBS algorithm, and some
> notes about the limits of its implementation in the i210 family of
> controllers.
> 
> Signed-off-by: Vinicius Costa Gomes 
> ---
>  Documentation/networking/igb.rst | 19 +++
>  1 file changed, 19 insertions(+)
> 

Tested-by: Aaron Brown 


RE: [next-queue PATCH v1 1/2] igb: Change RXPBSIZE size when setting Qav mode

2018-11-29 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Vinicius Costa Gomes
> Sent: Friday, November 16, 2018 4:19 PM
> To: intel-wired-...@lists.osuosl.org
> Cc: Sanchez-Palencia, Jesus ; Kirsher,
> Jeffrey T ; netdev@vger.kernel.org;
> jesus.s.palen...@gmail.com
> Subject: [next-queue PATCH v1 1/2] igb: Change RXPBSIZE size when setting
> Qav mode
> 
> From: Jesus Sanchez-Palencia 
> 
> Section 4.5.9 of the datasheet says that the total size of all packet
> buffers combined (TxPB 0 + 1 + 2 + 3 + RxPB + BMC2OS + OS2BMC) must not
> exceed 60KB. Today we are configuring a total of 62KB, so reduce the
> RxPB from 32KB to 30KB in order to respect that.
> 
> The choice of changing RxPBSIZE here is mainly because it seems more
> correct to give more priority to the transmit packet buffers over the
> receiver ones when running in Qav mode. Also, the BMC2OS and OS2BMC
> sizes are already too short.
> 
> Signed-off-by: Jesus Sanchez-Palencia 
> ---
>  drivers/net/ethernet/intel/igb/e1000_defines.h | 1 +
>  drivers/net/ethernet/intel/igb/igb_main.c  | 2 +-
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 

Tested-by: Aaron Brown 



Re: [PATCH net] tun: remove skb access after netif_receive_skb

2018-11-29 Thread Jason Wang



On 2018/11/30 10:30 AM, Prashant Bhole wrote:

In tun.c, skb->len was accessed while doing stats accounting after a
call to netif_receive_skb(). We cannot access the skb after this call
because the buffer may already have been freed.

The fix for this bug is to store skb->len in a local variable and
then use it after netif_receive_skb(). IMO using the xdp data size for
accounting bytes is better, because the input to tun_xdp_one() is an
xdp_buff.

Hence this patch:
- fixes a bug by removing skb access after netif_receive_skb()
- uses xdp data size for accounting bytes

[613.019057] BUG: KASAN: use-after-free in tun_sendmsg+0x77c/0xc50 [tun]
[613.021062] Read of size 4 at addr 8881da9ab7c0 by task vhost-1115/1155
[613.023073]
[613.024003] CPU: 0 PID: 1155 Comm: vhost-1115 Not tainted 4.20.0-rc3-vm+ #232
[613.026029] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[613.029116] Call Trace:
[613.031145]  dump_stack+0x5b/0x90
[613.032219]  print_address_description+0x6c/0x23c
[613.034156]  ? tun_sendmsg+0x77c/0xc50 [tun]
[613.036141]  kasan_report.cold.5+0x241/0x308
[613.038125]  tun_sendmsg+0x77c/0xc50 [tun]
[613.040109]  ? tun_get_user+0x1960/0x1960 [tun]
[613.042094]  ? __isolate_free_page+0x270/0x270
[613.045173]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.047127]  ? peek_head_len.part.13+0x90/0x90 [vhost_net]
[613.049096]  ? get_tx_bufs+0x5a/0x2c0 [vhost_net]
[613.051106]  ? vhost_enable_notify+0x2d8/0x420 [vhost]
[613.053139]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.053139]  ? vhost_net_buf_peek+0x340/0x340 [vhost_net]
[613.053139]  ? __mutex_lock+0x8d9/0xb30
[613.053139]  ? finish_task_switch+0x8f/0x3f0
[613.053139]  ? handle_tx+0x32/0x120 [vhost_net]
[613.053139]  ? mutex_trylock+0x110/0x110
[613.053139]  ? finish_task_switch+0xcf/0x3f0
[613.053139]  ? finish_task_switch+0x240/0x3f0
[613.053139]  ? __switch_to_asm+0x34/0x70
[613.053139]  ? __switch_to_asm+0x40/0x70
[613.053139]  ? __schedule+0x506/0xf10
[613.053139]  handle_tx+0xc7/0x120 [vhost_net]
[613.053139]  vhost_worker+0x166/0x200 [vhost]
[613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
[613.053139]  ? __kthread_parkme+0x77/0x90
[613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
[613.053139]  kthread+0x1b1/0x1d0
[613.053139]  ? kthread_park+0xb0/0xb0
[613.053139]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] Allocated by task 1155:
[613.088705]  kasan_kmalloc+0xbf/0xe0
[613.088705]  kmem_cache_alloc+0xdc/0x220
[613.088705]  __build_skb+0x2a/0x160
[613.088705]  build_skb+0x14/0xc0
[613.088705]  tun_sendmsg+0x4f0/0xc50 [tun]
[613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.088705]  handle_tx+0xc7/0x120 [vhost_net]
[613.088705]  vhost_worker+0x166/0x200 [vhost]
[613.088705]  kthread+0x1b1/0x1d0
[613.088705]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] Freed by task 1155:
[613.088705]  __kasan_slab_free+0x12e/0x180
[613.088705]  kmem_cache_free+0xa0/0x230
[613.088705]  ip6_mc_input+0x40f/0x5a0
[613.088705]  ipv6_rcv+0xc9/0x1e0
[613.088705]  __netif_receive_skb_one_core+0xc1/0x100
[613.088705]  netif_receive_skb_internal+0xc4/0x270
[613.088705]  br_pass_frame_up+0x2b9/0x2e0
[613.088705]  br_handle_frame_finish+0x2fb/0x7a0
[613.088705]  br_handle_frame+0x30f/0x6c0
[613.088705]  __netif_receive_skb_core+0x61a/0x15b0
[613.088705]  __netif_receive_skb_one_core+0x8e/0x100
[613.088705]  netif_receive_skb_internal+0xc4/0x270
[613.088705]  tun_sendmsg+0x738/0xc50 [tun]
[613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.088705]  handle_tx+0xc7/0x120 [vhost_net]
[613.088705]  vhost_worker+0x166/0x200 [vhost]
[613.088705]  kthread+0x1b1/0x1d0
[613.088705]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] The buggy address belongs to the object at 8881da9ab740
[613.088705]  which belongs to the cache skbuff_head_cache of size 232

Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Reviewed-by: Toshiaki Makita 
Signed-off-by: Prashant Bhole 
---
  drivers/net/tun.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e244f5d7512a..6e388792c0a8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2385,6 +2385,7 @@ static int tun_xdp_one(struct tun_struct *tun,
   struct tun_file *tfile,
   struct xdp_buff *xdp, int *flush)
  {
+   unsigned int datasize = xdp->data_end - xdp->data;
struct tun_xdp_hdr *hdr = xdp->data_hard_start;
struct virtio_net_hdr *gso = &hdr->gso;
struct tun_pcpu_stats *stats;
@@ -2461,7 +2462,7 @@ static int tun_xdp_one(struct tun_struct *tun,
stats = get_cpu_ptr(tun->pcpu_stats);
u64_stats_update_begin(&stats->syncp);
stats->rx_packets++;
-   stats->rx_bytes += skb->len;
+   stats->rx_bytes += datasize;
u64_stats_update_end(&stats->syncp);
put_cpu_ptr

Re: [REBASE PATCH net-next v9 1/4] net: vhost: lock the vqs one by one

2018-11-29 Thread Jason Wang



On 2018/11/30 3:28 AM, Jean-Philippe Brucker wrote:

Hi,

On 25/09/2018 13:36, xiangxia.m@gmail.com wrote:

From: Tonghao Zhang

This patch changes the way all vqs are locked:
instead of locking them all at the same time, lock
them one by one. It will be used by the next patch
to avoid a deadlock.

Signed-off-by: Tonghao Zhang
Acked-by: Jason Wang
Signed-off-by: Jason Wang
---
  drivers/vhost/vhost.c | 24 +++-
  1 file changed, 7 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b13c6b4..f52008b 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -294,8 +294,11 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
  {
int i;
  
-	for (i = 0; i < d->nvqs; ++i)
+   for (i = 0; i < d->nvqs; ++i) {
+   mutex_lock(&d->vqs[i]->mutex);
__vhost_vq_meta_reset(d->vqs[i]);
+   mutex_unlock(&d->vqs[i]->mutex);
+   }
  }
  
  static void vhost_vq_reset(struct vhost_dev *dev,

@@ -891,20 +894,6 @@ static inline void __user *__vhost_get_user(struct 
vhost_virtqueue *vq,
  #define vhost_get_used(vq, x, ptr) \
vhost_get_user(vq, x, ptr, VHOST_ADDR_USED)
  
-static void vhost_dev_lock_vqs(struct vhost_dev *d)
-{
-   int i = 0;
-   for (i = 0; i < d->nvqs; ++i)
-   mutex_lock_nested(&d->vqs[i]->mutex, i);
-}
-
-static void vhost_dev_unlock_vqs(struct vhost_dev *d)
-{
-   int i = 0;
-   for (i = 0; i < d->nvqs; ++i)
-   mutex_unlock(&d->vqs[i]->mutex);
-}
-
  static int vhost_new_umem_range(struct vhost_umem *umem,
u64 start, u64 size, u64 end,
u64 userspace_addr, int perm)
@@ -954,7 +943,10 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
if (msg->iova <= vq_msg->iova &&
msg->iova + msg->size - 1 >= vq_msg->iova &&
vq_msg->type == VHOST_IOTLB_MISS) {
+   mutex_lock(&node->vq->mutex);

This seems to introduce a deadlock (and sleep-in-atomic): the vq->mutex
is taken while the IOTLB spinlock is held (taken earlier in
vhost_iotlb_notify_vq()). On the vhost_iotlb_miss() path, the IOTLB
spinlock is taken while the vq->mutex is held.



Good catch.



I'm not sure how to fix it. Given that we're holding dev->mutex, that
vq->poll only seems to be modified under dev->mutex, and assuming that
vhost_poll_queue(vq->poll) can be called concurrently, is it safe to
simply not take vq->mutex here?



Yes, I think it can be removed here.

Want to post a patch for this?

Thanks



Thanks,
Jean




[PATCH net] tun: remove skb access after netif_receive_skb

2018-11-29 Thread Prashant Bhole
In tun.c, skb->len was accessed while doing stats accounting after a
call to netif_receive_skb(). We cannot access the skb after this call
because it may already have been dropped and freed.

One fix would be to store skb->len in a local variable and then use it
after netif_receive_skb(). Using the xdp data size for accounting bytes
is better, though, because the input to tun_xdp_one() is an xdp_buff.

Hence this patch:
- fixes a bug by removing skb access after netif_receive_skb()
- uses xdp data size for accounting bytes

[613.019057] BUG: KASAN: use-after-free in tun_sendmsg+0x77c/0xc50 [tun]
[613.021062] Read of size 4 at addr 8881da9ab7c0 by task vhost-1115/1155
[613.023073]
[613.024003] CPU: 0 PID: 1155 Comm: vhost-1115 Not tainted 4.20.0-rc3-vm+ #232
[613.026029] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[613.029116] Call Trace:
[613.031145]  dump_stack+0x5b/0x90
[613.032219]  print_address_description+0x6c/0x23c
[613.034156]  ? tun_sendmsg+0x77c/0xc50 [tun]
[613.036141]  kasan_report.cold.5+0x241/0x308
[613.038125]  tun_sendmsg+0x77c/0xc50 [tun]
[613.040109]  ? tun_get_user+0x1960/0x1960 [tun]
[613.042094]  ? __isolate_free_page+0x270/0x270
[613.045173]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.047127]  ? peek_head_len.part.13+0x90/0x90 [vhost_net]
[613.049096]  ? get_tx_bufs+0x5a/0x2c0 [vhost_net]
[613.051106]  ? vhost_enable_notify+0x2d8/0x420 [vhost]
[613.053139]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.053139]  ? vhost_net_buf_peek+0x340/0x340 [vhost_net]
[613.053139]  ? __mutex_lock+0x8d9/0xb30
[613.053139]  ? finish_task_switch+0x8f/0x3f0
[613.053139]  ? handle_tx+0x32/0x120 [vhost_net]
[613.053139]  ? mutex_trylock+0x110/0x110
[613.053139]  ? finish_task_switch+0xcf/0x3f0
[613.053139]  ? finish_task_switch+0x240/0x3f0
[613.053139]  ? __switch_to_asm+0x34/0x70
[613.053139]  ? __switch_to_asm+0x40/0x70
[613.053139]  ? __schedule+0x506/0xf10
[613.053139]  handle_tx+0xc7/0x120 [vhost_net]
[613.053139]  vhost_worker+0x166/0x200 [vhost]
[613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
[613.053139]  ? __kthread_parkme+0x77/0x90
[613.053139]  ? vhost_dev_init+0x580/0x580 [vhost]
[613.053139]  kthread+0x1b1/0x1d0
[613.053139]  ? kthread_park+0xb0/0xb0
[613.053139]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] Allocated by task 1155:
[613.088705]  kasan_kmalloc+0xbf/0xe0
[613.088705]  kmem_cache_alloc+0xdc/0x220
[613.088705]  __build_skb+0x2a/0x160
[613.088705]  build_skb+0x14/0xc0
[613.088705]  tun_sendmsg+0x4f0/0xc50 [tun]
[613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.088705]  handle_tx+0xc7/0x120 [vhost_net]
[613.088705]  vhost_worker+0x166/0x200 [vhost]
[613.088705]  kthread+0x1b1/0x1d0
[613.088705]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] Freed by task 1155:
[613.088705]  __kasan_slab_free+0x12e/0x180
[613.088705]  kmem_cache_free+0xa0/0x230
[613.088705]  ip6_mc_input+0x40f/0x5a0
[613.088705]  ipv6_rcv+0xc9/0x1e0
[613.088705]  __netif_receive_skb_one_core+0xc1/0x100
[613.088705]  netif_receive_skb_internal+0xc4/0x270
[613.088705]  br_pass_frame_up+0x2b9/0x2e0
[613.088705]  br_handle_frame_finish+0x2fb/0x7a0
[613.088705]  br_handle_frame+0x30f/0x6c0
[613.088705]  __netif_receive_skb_core+0x61a/0x15b0
[613.088705]  __netif_receive_skb_one_core+0x8e/0x100
[613.088705]  netif_receive_skb_internal+0xc4/0x270
[613.088705]  tun_sendmsg+0x738/0xc50 [tun]
[613.088705]  vhost_tx_batch.isra.14+0xeb/0x1f0 [vhost_net]
[613.088705]  handle_tx_copy+0x2d0/0x8f0 [vhost_net]
[613.088705]  handle_tx+0xc7/0x120 [vhost_net]
[613.088705]  vhost_worker+0x166/0x200 [vhost]
[613.088705]  kthread+0x1b1/0x1d0
[613.088705]  ret_from_fork+0x35/0x40
[613.088705]
[613.088705] The buggy address belongs to the object at 8881da9ab740
[613.088705]  which belongs to the cache skbuff_head_cache of size 232

Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
Reviewed-by: Toshiaki Makita 
Signed-off-by: Prashant Bhole 
---
 drivers/net/tun.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e244f5d7512a..6e388792c0a8 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2385,6 +2385,7 @@ static int tun_xdp_one(struct tun_struct *tun,
   struct tun_file *tfile,
   struct xdp_buff *xdp, int *flush)
 {
+   unsigned int datasize = xdp->data_end - xdp->data;
struct tun_xdp_hdr *hdr = xdp->data_hard_start;
struct virtio_net_hdr *gso = &hdr->gso;
struct tun_pcpu_stats *stats;
@@ -2461,7 +2462,7 @@ static int tun_xdp_one(struct tun_struct *tun,
stats = get_cpu_ptr(tun->pcpu_stats);
u64_stats_update_begin(&stats->syncp);
stats->rx_packets++;
-   stats->rx_bytes += skb->len;
+   stats->rx_bytes += datasize;
u64_stats_update_end(&stats->syncp);
put_cpu_ptr(stats);
 
-- 
2.17.2




Re: [PATCH v4 00/13] TCP transport binding for NVMe over Fabrics

2018-11-29 Thread David Miller
From: Sagi Grimberg 
Date: Thu, 29 Nov 2018 17:24:09 -0800

> 
>> What is the plan ahead here?  I think the nvme code looks pretty
>> reasonable now (I'll do another pass at nitpicking), but we need the
>> networking stuff sorted out with at least ACKs, or a merge through
>> the networking tree and then a shared branch we can pull in.
> 
> I would think that having Dave ack patches 1-3 and taking it via the
> nvme tree should be easier..
> 
> Dave? What would you prefer?

No preference, if the nvme tree makes things easier for you then
please do it that way.

Acked-by: David S. Miller 


Re: pull-request: bpf-next 2018-11-30

2018-11-29 Thread David Miller
From: Daniel Borkmann 
Date: Fri, 30 Nov 2018 03:01:38 +0100

> The following pull-request contains BPF updates for your *net-next* tree.
> 
> (Getting this out a bit earlier this time to pull in a dependency from bpf.)
> 
> The main changes are:
> 
> 1) Add libbpf ABI versioning and document API naming conventions
>as well as ABI versioning process, from Andrey.
> 
> 2) Add a new sk_msg_pop_data() helper for sk_msg based BPF
>programs that is used in conjunction with sk_msg_push_data()
>for adding / removing meta data to the msg data, from John.
> 
> 3) Optimize convert_bpf_ld_abs() for 0 offset and fix various
>lib and testsuite build failures on 32 bit, from David.
> 
> 4) Make BPF prog dump for !JIT identical to how we dump subprogs
>when JIT is in use, from Yonghong.
> 
> 5) Rename btf_get_from_id() to make it more conform with libbpf
>API naming conventions, from Martin.
> 
> 6) Add a missing BPF kselftest config item, from Naresh.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Pulled, thanks Daniel.


Re: [PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Cong Wang
On Thu, Nov 29, 2018 at 1:46 PM Eric Dumazet  wrote:
>
>
>
> On 11/29/2018 01:13 PM, Duyck, Alexander H wrote:
>
> > Instead of just checking for the max it might make more sense to do a
> > check using skb_is_gso, and then if true use gso_segs, otherwise just
> > default to 1.
> >
> > Also your bytes are going to be totally messed up as well since the
> > headers are trimmed in the GSO frames. It might be worthwhile to just
> > have a branch based on skb_is_gso that sets the packets and bytes based
> > on the GSO values, and one that sets it for the default case.
> >
> Note that __dev_queue_xmit() is already doing that, no need
> to re-implement in each driver.
>
> It calls qdisc_pkt_len_init(), meaning that drivers can use
> qdisc_skb_cb(skb)->pkt_len instead of skb->len
>
> (Qdisc layers should not modify qdisc_skb_cb(skb)->pkt_len )

It should not be modified, but, on the other hand, it is not supposed
to be used by non-qdisc layers either. At the very least, you have to
audit each ->ndo_start_xmit() to see whether it uses its own skb->cb
first.


pull-request: bpf-next 2018-11-30

2018-11-29 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

(Getting this out a bit earlier this time to pull in a dependency from bpf.)

The main changes are:

1) Add libbpf ABI versioning and document API naming conventions
   as well as ABI versioning process, from Andrey.

2) Add a new sk_msg_pop_data() helper for sk_msg based BPF
   programs that is used in conjunction with sk_msg_push_data()
   for adding / removing meta data to the msg data, from John.

3) Optimize convert_bpf_ld_abs() for 0 offset and fix various
   lib and testsuite build failures on 32 bit, from David.

4) Make BPF prog dump for !JIT identical to how we dump subprogs
   when JIT is in use, from Yonghong.

5) Rename btf_get_from_id() to make it more conform with libbpf
   API naming conventions, from Martin.

6) Add a missing BPF kselftest config item, from Naresh.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!



The following changes since commit 4afe60a97ba6ffacc4d030b13653dc64099fea26:

  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next (2018-11-26 13:08:17 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to b42699547fc9fb1057795bccc21a6445743a7fde:

  tools/bpf: make libbpf _GNU_SOURCE friendly (2018-11-30 02:41:02 +0100)


Alexei Starovoitov (2):
  Merge branch 'non-jit-btf-func_info'
  Merge branch 'libbpf-versioning-doc'

Andrey Ignatov (3):
  libbpf: Add version script for DSO
  libbpf: Verify versioned symbols
  libbpf: Document API and ABI conventions

Daniel Borkmann (1):
  Merge branch 'bpf-sk-msg-pop-data'

David Miller (2):
  bpf: Avoid unnecessary instruction in convert_bpf_ld_abs()
  bpf: Fix various lib and testsuite build failures on 32-bit.

John Fastabend (3):
  bpf: helper to pop data from messages
  bpf: add msg_pop_data helper to tools
  bpf: test_sockmap, add options for msg_pop_data() helper

Martin KaFai Lau (1):
  libbpf: Name changing for btf_get_from_id

Naresh Kamboju (1):
  selftests/bpf: add config fragment CONFIG_FTRACE_SYSCALLS

Yonghong Song (3):
  bpf: btf: support proper non-jit func info
  tools/bpf: change selftest test_btf for both jit and non-jit
  tools/bpf: make libbpf _GNU_SOURCE friendly

 include/linux/bpf.h |   6 +-
 include/linux/bpf_verifier.h|   1 -
 include/uapi/linux/bpf.h|  16 ++-
 kernel/bpf/core.c   |   3 +-
 kernel/bpf/syscall.c|  33 ++---
 kernel/bpf/verifier.c   |  55 +---
 net/core/filter.c   | 174 +++-
 net/ipv4/tcp_bpf.c  |  17 ++-
 net/tls/tls_sw.c|  11 +-
 tools/bpf/bpftool/map.c |   4 +-
 tools/bpf/bpftool/prog.c|   2 +-
 tools/include/uapi/linux/bpf.h  |  16 ++-
 tools/lib/bpf/Makefile  |  23 +++-
 tools/lib/bpf/README.rst| 139 +++
 tools/lib/bpf/btf.c |   4 +-
 tools/lib/bpf/btf.h |   2 +-
 tools/lib/bpf/libbpf.c  |   2 +
 tools/lib/bpf/libbpf.map| 121 
 tools/lib/bpf/libbpf_errno.c|   1 +
 tools/testing/selftests/bpf/bpf_helpers.h   |   2 +
 tools/testing/selftests/bpf/config  |   1 +
 tools/testing/selftests/bpf/test_btf.c  |  35 +
 tools/testing/selftests/bpf/test_progs.c|  10 +-
 tools/testing/selftests/bpf/test_sockmap.c  | 127 -
 tools/testing/selftests/bpf/test_sockmap_kern.h |  70 --
 25 files changed, 760 insertions(+), 115 deletions(-)
 create mode 100644 tools/lib/bpf/README.rst
 create mode 100644 tools/lib/bpf/libbpf.map


Re: [PATCH bpf-next v2] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Daniel Borkmann
On 11/30/2018 12:47 AM, Jakub Kicinski wrote:
> On Thu, 29 Nov 2018 15:31:45 -0800, Yonghong Song wrote:
>> During porting libbpf to bcc, I got some warnings like below:
>>   ...
>>   [  2%] Building C object 
>> src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf.c.o
>>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf.c:12:0:
>>   warning: "_GNU_SOURCE" redefined [enabled by default]
>>#define _GNU_SOURCE
>>   ...
>>   [  3%] Building C object 
>> src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf_errno.c.o
>>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c: In function 
>> ‘libbpf_strerror’:
>>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c:45:7:
>>   warning: assignment makes integer from pointer without a cast [enabled by 
>> default]
>>  ret = strerror_r(err, buf, size);
>>   ...
>>
>> bcc is built with _GNU_SOURCE defined and this caused the above warning.
>> This patch intends to make libbpf _GNU_SOURCE friendly by
>>   . define _GNU_SOURCE in libbpf.c if it is not already defined
>>   . undefine _GNU_SOURCE where the non-GNU version of strerror_r is expected.
>>
>> Signed-off-by: Yonghong Song 
> 
> Acked-by: Jakub Kicinski 
> 
> FWIW :)
> 

Applied, thanks.


Re: [PATCH v4 00/13] TCP transport binding for NVMe over Fabrics

2018-11-29 Thread Sagi Grimberg




What is the plan ahead here?  I think the nvme code looks pretty
reasonable now (I'll do another pass at nitpicking), but we need the
networking stuff sorted out with at least ACKs, or a merge through
the networking tree and then a shared branch we can pull in.


I would think that having Dave ack patches 1-3 and taking it via the
nvme tree should be easier..

Dave? What would you prefer?


Re: [PATCH v4 11/13] nvmet-tcp: add NVMe over TCP target driver

2018-11-29 Thread Sagi Grimberg




+static inline void nvmet_tcp_put_cmd(struct nvmet_tcp_cmd *cmd)
+{
+    if (unlikely(cmd == &cmd->queue->connect))
+    return;


If you don't return the connect cmd to the list, please don't add it to the
list in the first place (during alloc_cmd). And if it is only used once, we
might think of a cleaner/more readable way to do it.


Why is there a difference between a regular cmd and the connect cmd? Can't
you increase nr_cmds by 1 and not distinguish between the two types?


We don't have the queue size before we connect, hence we allocate a
single command for nvmf connect and when we process it we allocate the
rest.

It's done this way so that the command processing code does not need a
dedicated condition for this case; hence I prefer to keep it in the
command list for the first pass, so that we don't have this check in the
normal command processing code.

The reason it doesn't go back to the list is because it does not
belong to the cmds array which we rely on to lookup our command context
based on the ttag.


+static int nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
+    struct socket *newsock)
+{
+    struct nvmet_tcp_queue *queue;
+    int ret;
+
+    queue = kzalloc(sizeof(*queue), GFP_KERNEL);
+    if (!queue)
+    return -ENOMEM;
+
+    INIT_WORK(&queue->release_work, nvmet_tcp_release_queue_work);
+    INIT_WORK(&queue->io_work, nvmet_tcp_io_work);
+    queue->sock = newsock;
+    queue->port = port;
+    queue->nr_cmds = 0;
+    spin_lock_init(&queue->state_lock);
+    queue->state = NVMET_TCP_Q_CONNECTING;
+    INIT_LIST_HEAD(&queue->free_list);
+    init_llist_head(&queue->resp_list);
+    INIT_LIST_HEAD(&queue->resp_send_list);
+
+    queue->idx = ida_simple_get(&nvmet_tcp_queue_ida, 0, 0, GFP_KERNEL);
+    if (queue->idx < 0) {
+    ret = queue->idx;
+    goto out_free_queue;
+    }
+
+    ret = nvmet_tcp_alloc_cmd(queue, &queue->connect);
+    if (ret)
+    goto out_ida_remove;
+
+    ret = nvmet_sq_init(&queue->nvme_sq);
+    if (ret)
+    goto out_ida_remove;


Please add a goto free_connect_cmd:

nvmet_tcp_free_cmd(&queue->connect);

to avoid a memory leak in the error flow.


Will do, thanks.


RE: [PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Banerjee, Debabrata
> From: Eric Dumazet 
> On 11/29/2018 01:13 PM, Duyck, Alexander H wrote:
> >
> > Also your bytes are going to be totally messed up as well since the
> > headers are trimmed in the GSO frames. It might be worthwhile to just
> > have a branch based on skb_is_gso that sets the packets and bytes
> > based on the GSO values, and one that sets it for the default case.
> >
> Note that __dev_queue_xmit() is already doing that, no need to re-
> implement in each driver.
> 
> It calls qdisc_pkt_len_init(), meaning that drivers can use qdisc_skb_cb(skb)-
> >pkt_len instead of skb->len
> 
> (Qdisc layers should not modify qdisc_skb_cb(skb)->pkt_len )

This works. But rx bytes are still a problem. I haven't spotted anything quick 
for this.


Re: Invalid transport_offset with AF_PACKET socket

2018-11-29 Thread Saeed Mahameed
On Wed, Nov 28, 2018 at 3:10 AM Maxim Mikityanskiy  wrote:
>
> Hi Saeed,
>
> > Can you elaborate more, what NIC? what configuration ? what do you mean
> > by confusion, anyway please see below
>
> ConnectX-4, after running `mlnx_qos -i eth1 --trust dscp`, which sets inline
> mode 2 (MLX5_INLINE_MODE_IP). I'll explain what I mean by confusion below.
>
> > in mlx5 with ConnectX-4 or ConnectX-4 Lx there is a requirement to copy at
> > least the ethernet header to the tx descriptor, otherwise the packet might
> > be dropped, and for RAW sockets the skb header offsets are not set, but the
> > latest mlx5 upstream driver knows how to handle this and copies the minimum
> > amount required, please see:
> > please see:
> >
> > static inline u16 mlx5e_calc_min_inline(enum mlx5_inline_modes mode,
> >   struct sk_buff *skb)
>
> Yes, I know that, and what I do is debugging an issue with this function.
>
> >
> > it should default to:
> >
> >
> > case MLX5_INLINE_MODE_L2:
> >   default:
> >   hlen = mlx5e_skb_l2_header_offset(skb);
>
> The issue appears in MLX5_INLINE_MODE_IP. I haven't tested
> MLX5_INLINE_MODE_TCP_UDP yet, though.
>
> > So it should return at least 18 and not 14.
>
> Yes, the function does its best to return at least 18, but it silently expects
> skb_transport_offset to exceed 18. In normal conditions, it will be more than
> 18, because it will be at least 14 + 20. But in my case, when I send a packet
> via an AF_PACKET socket, skb_transport_offset returns 14 (which is nonsense),
> and the driver uses this value, causing the hardware to fail, because it's
> less than 18.
>

Got it, so even if you copy 18 bytes it is not sufficient! If the packet is
IPv4 or IPv6 and the inline mode is set to MLX5_INLINE_MODE_IP in the vport
context, you must copy the IP headers as well!

But what do you expect from an AF_PACKET socket? To parse each and every
packet and set skb_transport_offset?

> > We had some issues with this in old driver such as kernels 4.14/15, and
> > it depends in the use case so i need some information first:
>
> No, it's not an old kernel. We actually have this bug in our internal bug
> tracking system, and I'm trying to resolve it.
>
> > 1. What Cards do you have ? (lspci)
>
> 03:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
> 03:00.1 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
> 81:00.0 Ethernet controller: Mellanox Technologies MT27520 Family [ConnectX-3 
> Pro]
>
> Testing with ConnectX-4.
>
> > 2. What kernel/driver version are you using ?
>
> I'm on net-next-mlx5, commit 66a4b5ef638a (the latest when I started the
> investigation).
>
> > 3. what is the current enum mlx5_inline_modes seen in
> > mlx5e_calc_min_inline or sq->min_inline_mode ?
>
> MLX5_INLINE_MODE_IP, as I said above.
>
> > 4. Firmware version ? (ethtool -i)
>
> 12.22.0238 (MT_2190110032)
>
> > can you share the packet format you are sending and seeing the bad
> > behavior with
>
> Here is the hexdump of the simplest packet that causes the problem when it's
> sent through AF_PACKET after `mlnx_qos -i eth1 --trust dscp`:
>
> : 11 22 33 44 55 66 77 88 99 aa bb cc 08 00 45 00
> 0010: 00 20 00 00 40 00 40 11 ae a5 c6 12 00 01 c6 12
> 0020: 00 02 00 00 4a 38 00 0c 29 82 61 62 63 64
>
> (Please ignore the wrong UDP checksum and non-existing MACs, it doesn't matter
> at all, I tested it with completely valid packets as well. The wrong UDP
> checksum is due to a bug in our internal pypacket utility).
>
> Thanks,
> Max


[PATCH RFC net-next v2] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Debabrata Banerjee
Fix packet count when using vlan/macvlan drivers with gso. Without this it
is not possible to reconcile packet counts between underlying devices and
these virtual devices. Additionally, the output looks wrong even on its own,
e.g. a device with an MTU of 1500 reporting 1 packet sent but 31856 bytes sent.

There are many other drivers that likely have a similar problem, although
it is not clear how many of those could be used with gso. Perhaps all
packet counting should be wrapped in a helper fn.

v2: bytes were also incorrect for gso skb's, fix tx as that is readily
available. Fix rx packets for macvlan.

Signed-off-by: Debabrata Banerjee 
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c |  4 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  4 ++--
 drivers/net/macvlan.c | 13 -
 drivers/net/macvtap.c |  2 +-
 include/linux/if_macvlan.h|  6 +++---
 net/8021q/vlan_core.c |  3 ++-
 net/8021q/vlan_dev.c  |  4 ++--
 7 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 6fd15a734324..e39fad2d888f 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -431,8 +431,8 @@ static void fm10k_type_trans(struct fm10k_ring *rx_ring,
if (!l2_accel)
skb_record_rx_queue(skb, rx_ring->queue_index);
else
-   macvlan_count_rx(netdev_priv(dev), skb->len + ETH_HLEN, true,
-false);
+   macvlan_count_rx(netdev_priv(dev), skb_shinfo(skb)->gso_segs ?: 1,
+skb->len + ETH_HLEN, true, false);
 
skb->protocol = eth_type_trans(skb, dev);
 }
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 49a4ea38eb07..474e72ec68b0 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1704,8 +1704,8 @@ void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
if (netif_is_ixgbe(dev))
skb_record_rx_queue(skb, rx_ring->queue_index);
else
-   macvlan_count_rx(netdev_priv(dev), skb->len + ETH_HLEN, true,
-false);
+   macvlan_count_rx(netdev_priv(dev), skb_shinfo(skb)->gso_segs ?: 1,
+skb->len + ETH_HLEN, true, false);
 
skb->protocol = eth_type_trans(skb, dev);
 }
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index fc8d5f1ee1ad..ab8743777897 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -289,7 +289,8 @@ static void macvlan_broadcast(struct sk_buff *skb,
nskb, vlan, eth,
mode == MACVLAN_MODE_BRIDGE) ?:
  netif_rx_ni(nskb);
-   macvlan_count_rx(vlan, skb->len + ETH_HLEN,
+   macvlan_count_rx(vlan, skb_shinfo(skb)->gso_segs ?: 1,
+skb->len + ETH_HLEN,
 err == NET_RX_SUCCESS, true);
}
}
@@ -418,7 +419,8 @@ static void macvlan_forward_source_one(struct sk_buff *skb,
nskb->pkt_type = PACKET_HOST;
 
ret = netif_rx(nskb);
-   macvlan_count_rx(vlan, len, ret == NET_RX_SUCCESS, false);
+   macvlan_count_rx(vlan, skb_shinfo(skb)->gso_segs ?: 1, len,
+ret == NET_RX_SUCCESS, false);
 }
 
 static void macvlan_forward_source(struct sk_buff *skb,
@@ -505,7 +507,8 @@ static rx_handler_result_t macvlan_handle_frame(struct sk_buff **pskb)
ret = NET_RX_SUCCESS;
handle_res = RX_HANDLER_ANOTHER;
 out:
-   macvlan_count_rx(vlan, len, ret == NET_RX_SUCCESS, false);
+   macvlan_count_rx(vlan, skb_shinfo(skb)->gso_segs ?: 1, len,
+ret == NET_RX_SUCCESS, false);
return handle_res;
 }
 
@@ -553,7 +556,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
  struct net_device *dev)
 {
struct macvlan_dev *vlan = netdev_priv(dev);
-   unsigned int len = skb->len;
+   unsigned int len = qdisc_skb_cb(skb)->pkt_len;
int ret;
 
if (unlikely(netpoll_tx_running(dev)))
@@ -566,7 +569,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 
pcpu_stats = this_cpu_ptr(vlan->pcpu_stats);
u64_stats_update_begin(&pcpu_stats->syncp);
-   pcpu_stats->tx_packets++;
+   pcpu_stats->tx_packets += skb_shinfo(skb)->gso_segs ?: 1;
pcpu_stats->tx_bytes += len;
u64_stats_update_end(&pcpu_stats->syncp);
} else {
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index 9a

Re: Freeze when using ipheth+IPsec+IPv6

2018-11-29 Thread Kees Cook
On Thu, Nov 29, 2018 at 3:52 PM David Miller  wrote:
>
> From: Kees Cook 
> Date: Thu, 29 Nov 2018 15:31:25 -0800
>
> > Did you ever solve this?
>
> I think it was fixed by:
>
> commit 45611c61dd503454b2edae00aabe1e429ec49ebe
> Author: Bernd Eckstein <3erndeckst...@gmail.com>
> Date:   Fri Nov 23 13:51:26 2018 +0100
>
> usbnet: ipheth: fix potential recvmsg bug and recvmsg bug 2

Ah-ha! Thanks. :)

-- 
Kees Cook


Re: [PATCH mlx5-next 00/13] Mellanox, mlx5 core driver events API

2018-11-29 Thread Saeed Mahameed
On Mon, Nov 26, 2018 at 2:39 PM Saeed Mahameed  wrote:
>
> Hi,
>
> This patchset is for mlx5-next shared branch, and will be applied there
> once the review is done.
>
> The main idea of this change is to define a flexible scalable and
> simpler high level mlx5 core APIs to forward device and driver events
> to upper level interface drivers e.g mlx5_ib and mlx5e netdevice driver.
>
> Patch #1, Driver events notifier API:
>
> Use atomic notifier chain to fire events to mlx5 core driver
> consumers (mlx5e/mlx5_ib) and provide dynamic mlx5 register/unregister
> notifier API to replace the internal mlx5_interface->event and totally
> remove it.
>
> Patch #2, Forward port events via the new notifier chain.
> Patch #3, Use the mlx5 events API in mlx5e netdevice.
> Patch #4, Forward all device events sent via mlx5_core_event (old API)
> to the new notifier chain, this will allow seamless transition to the
> new AP.
> Patch #5, Use the new events API in mlx5_IB
> Patch #6, remove old interface callback, mlx5_interface->event
> Patch #7,8,9, mlx5_ib to handle raw FW events as is rather than the
> software version of them, this will remove any need for unnecessary
> processing of FW events in the low level mlx5_core driver.
> Patch #10, Remove unnecessary processing of FW events in the low level
> mlx5_core driver, all events are handled on demand by mlx5_ib and/or
> mlx5e netdevice.
>
> patch #11,12, forward QP and SRQ events via the new notifier chain,
> will be needed by mlx5_ib driver.
>
> Patch #13, Debug patch for mlx5 events.
>
> Thanks,
> Saeed.
>
> ---
>
> Saeed Mahameed (13):
>   net/mlx5: Driver events notifier API
>   net/mlx5: Allow port change event to be forwarded to driver notifiers
> chain
>   net/mlx5e: Use the new mlx5 core notifier API
>   net/mlx5: Forward all mlx5 events to mlx5 notifiers chain
>   IB/mlx5: Use the new mlx5 core notifier API
>   net/mlx5: Remove unused events callback and logic
>   IB/mlx5: Handle raw port change event rather than the software version
>   net/mlx5: Allow forwarding event type general event as is
>   IB/mlx5: Handle raw delay drop general event
>   net/mlx5: Remove all deprecated software versions of FW events
>   net/mlx5: Forward QP/WorkQueues resource events
>   net/mlx5: Forward SRQ resource events
>   net/mlx5: Debug print for forwarded async events
>

Applied to mlx5-next.

Thanks,
Saeed.


Re: [RFC PATCH net] net: phy: fix the issue that netif always links up after resuming

2018-11-29 Thread Florian Fainelli



On 11/29/2018 2:47 PM, Heiner Kallweit wrote:
> On 29.11.2018 09:12, Kunihiko Hayashi wrote:
>> Even though the link is down before entering hibernation,
>> there is an issue that the network interface always links up after resuming
>> from hibernation.
>>
>> The phydev->state is PHY_READY before enabling the network interface, so
>> the link is down. After resuming from hibernation, the phydev->state is
>> forcibly set to PHY_UP in mdio_bus_phy_restore(), and the link becomes up.
>>
>> This patch expects to solve the issue by changing phydev->state to PHY_UP
>> only when the link is up.
>>
>> Signed-off-by: Kunihiko Hayashi 
>> ---
>>  drivers/net/phy/phy_device.c | 6 --
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index ab33d17..d5bba0f 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -309,8 +309,10 @@ static int mdio_bus_phy_restore(struct device *dev)
>>  return ret;
>>  
>>  /* The PHY needs to renegotiate. */
>> -phydev->link = 0;
>> -phydev->state = PHY_UP;
>> +if (phydev->link) {
>> +phydev->link = 0;
>> +phydev->state = PHY_UP;
>> +}
>>  
> Thanks for reporting. I agree that it isn't right to unconditionally set
> PHY_UP, because we don't know whether the PHY was started before
> hibernation. However, I don't think using phydev->link as the criterion is
> right. An example: the PHY was started before hibernation, but without link.
> In this case we want to set PHY_UP to start an aneg, because a cable may
> have been plugged in while the system was sleeping.
> 
> So I think, similar to phy_stop_machine, we should use state >= UP and
> state != HALTED as the criterion, and phy_start_machine() would also need
> to be called only if this criterion is met.
> 
> It may make sense to add a helper for checking whether PHY is in a
> started state (>=UP && !=HALTED), because we need this in more than
> one place.

Agreed, that would make sense.
-- 
Florian


[PATCHv2 bpf 2/2] bpf: Improve socket lookup reuseport documentation

2018-11-29 Thread Joe Stringer
Improve the wording around socket lookup for reuseport sockets, and
ensure that both bpf.h headers are in sync.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h   | 4 
 tools/include/uapi/linux/bpf.h | 8 
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 38924b306e9f..b73d574356f4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2203,6 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
 * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2237,6 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 465ad585c836..b73d574356f4 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -2203,8 +2203,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
 * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, u32 tuple_size, u64 netns, u64 flags)
  * Description
@@ -2239,8 +2239,8 @@ union bpf_attr {
  * **CONFIG_NET** configuration option.
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
- * For sockets with reuseport option, *struct bpf_sock*
- * return is from reuse->socks[] using hash of the packet.
+ * For sockets with reuseport option, the *struct bpf_sock*
+ * result is from reuse->socks[] using the hash of the tuple.
  *
  * int bpf_sk_release(struct bpf_sock *sk)
  * Description
-- 
2.17.1



[PATCHv2 bpf 1/2] bpf: Support sk lookup in netns with id 0

2018-11-29 Thread Joe Stringer
David Ahern and Nicolas Dichtel report that the handling of the netns id
0 is incorrect for the BPF socket lookup helpers: rather than finding
the netns with id 0, it is resolving to the current netns. This renders
the netns_id 0 inaccessible.

To fix this, adjust the netns API to treat all negative s32 values as a
lookup in the current netns, while any non-negative value in the signed
32-bit integer space results in a lookup for a socket in the netns
corresponding to that id. As before, if the netns with that ID does not
exist, no socket will be found. Furthermore, if any bits are set in the
upper 32 bits, then no socket will be found.

Signed-off-by: Joe Stringer 
---
 include/uapi/linux/bpf.h  | 35 ++---
 net/core/filter.c | 11 +++---
 tools/include/uapi/linux/bpf.h| 39 ---
 tools/testing/selftests/bpf/bpf_helpers.h |  4 +-
 .../selftests/bpf/test_sk_lookup_kern.c   | 18 -
 5 files changed, 63 insertions(+), 44 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 852dc17ab47a..38924b306e9f 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2170,7 +2170,7 @@ union bpf_attr {
  * Return
  * 0 on success, or a negative error in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_tcp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for TCP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2187,12 +2187,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx*
+ * will be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2202,7 +2204,7 @@ union bpf_attr {
  * Return
  * Pointer to *struct bpf_sock*, or NULL in case of failure.
  *
- * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u32 netns, u64 flags)
+ * struct bpf_sock *bpf_sk_lookup_udp(void *ctx, struct bpf_sock_tuple *tuple, 
u32 tuple_size, u64 netns, u64 flags)
  * Description
  * Look for UDP socket matching *tuple*, optionally in a child
  * network namespace *netns*. The return value must be checked,
@@ -2219,12 +2221,14 @@ union bpf_attr {
  * **sizeof**\ (*tuple*\ **->ipv6**)
  * Look for an IPv6 socket.
  *
- * If the *netns* is zero, then the socket lookup table in the
- * netns associated with the *ctx* will be used. For the TC hooks,
- * this in the netns of the device in the skb. For socket hooks,
- * this in the netns of the socket. If *netns* is non-zero, then
- * it specifies the ID of the netns relative to the netns
- * associated with the *ctx*.
+ * If the *netns* is a negative signed 32-bit integer, then the
+ * socket lookup table in the netns associated with the *ctx*
+ * will be used. For the TC hooks, this is the netns of the device
+ * in the skb. For socket hooks, this is the netns of the socket.
+ * If *netns* is any other signed 32-bit value greater than or
+ * equal to zero then it specifies the ID of the netns relative to
+ * the netns associated with the *ctx*. *netns* values beyond the
+ * range of 32-bit integers are reserved for future use.
  *
  * All values for *flags* are reserved for future usage, and must
  * be left at zero.
@@ -2405,6 +2409,9 @@ enum bpf_func_id {
 /* BPF_FUNC_perf_even

Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-29 Thread Saeed Mahameed
On Wed, Nov 28, 2018 at 8:09 PM Eric Dumazet  wrote:
>
> On Wed, Nov 28, 2018 at 7:53 PM Cong Wang  wrote:
> >
> > On Wed, Nov 28, 2018 at 7:50 PM Eric Dumazet  wrote:
> > >
> > > On Wed, Nov 28, 2018 at 7:40 PM Cong Wang  
> > > wrote:
> > > >
> > > > On Wed, Nov 28, 2018 at 4:07 PM Eric Dumazet  
> > > > wrote:
> > > > >
> > > > > A NIC is supposed to deliver frames, even the ones that 'seem' bad.
> > > >
> > > > A quick test shows this is not the case for mlx5.
> > > >
> > > > With the trafgen script you gave to me, with tot_len==40, the dest host
> > > > could receive all the packets. Changing tot_len to 80, tcpdump could no
> > > > longer see any packet. (Both sender and receiver are mlx5.)
> > > >
> > > > So, packets with tot_len > skb->len are clearly dropped before tcpdump
> > > > could see it, that is likely by mlx5 hardware.
> > >
> > > Or a router, or a switch.
> > >
> > > Are your two hosts connected back to back ?
> >
> > Both should be plugged into a same switch. I fail to see why a
> > switch could parse IP header as the packet is nothing of interest,
> > like a IGMP snooping.
>
> Well, _something_ is dropping the frames.
> It can be mlx5, or something else.
>

mlx5 HW should deliver such packets.
just tested with scapy :
sender:

#scapy
p = Ether(dst="24:8a:07:b4:24:6e")/IP(dst="1.2.3.4", len = 1000)
sendp(p, iface = "p5p1")

receiver:
tcpdump: listening on p5p1, link-type EN10MB (Ethernet), capture size
262144 bytes
16:24:26.427563 80:18:44:e5:2f:c4 > 24:8a:07:b4:24:6e, ethertype IPv4
(0x0800), length 60: truncated-ip - 954 bytes missing! (tos 0x0, ttl
64, id 1, offset 0, flags [none], proto Options (0), length 1000)
10.20.2.212 > 1.2.3.4:  ip-proto-0 980


> Does ethtool -S show any increasing counter ?


Re: [PATCH v2 net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread David Miller
From: Christoph Paasch 
Date: Thu, 29 Nov 2018 16:01:04 -0800

> __qdisc_drop_all() accesses skb->prev to get to the tail of the
> segment-list.
> 
> With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
> the skb-list handling has been changed to set skb->next to NULL and set
> the list-poison on skb->prev.
> 
> With that change, __qdisc_drop_all() will panic when it tries to
> dereference skb->prev.
> 
> Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
> __list_del_entry is used, leaving skb->prev unchanged (thus,
> pointing to the list-head if it's the first skb of the list).
> This will make __qdisc_drop_all modify the next-pointer of the list-head
> and result in a panic later on:
  ...
> This patch makes sure that skb->prev is set to NULL when entering
> netem_enqueue.
> 
> Cc: Prashant Bhole 
> Cc: Tyler Hicks 
> Cc: Eric Dumazet 
> Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
> Suggested-by: Eric Dumazet 
> Signed-off-by: Christoph Paasch 

Applied and queued up for -stable, thanks!


Re: [PATCH v2 net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Eric Dumazet



On 11/29/2018 04:01 PM, Christoph Paasch wrote:
> __qdisc_drop_all() accesses skb->prev to get to the tail of the
> segment-list.
> 
> With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
> the skb-list handling has been changed to set skb->next to NULL and set
> the list-poison on skb->prev.
> 
> With that change, __qdisc_drop_all() will panic when it tries to
> dereference skb->prev.
> 

Reviewed-by: Eric Dumazet 

Thanks !



Re: [Patch net v2] mlx5: fixup checksum for short ethernet frame padding

2018-11-29 Thread Saeed Mahameed
On Wed, Nov 28, 2018 at 4:05 PM Cong Wang  wrote:
>
> On Wed, Nov 28, 2018 at 3:57 PM Cong Wang  wrote:
> > But again, I kinda feel the hardware already does the sanity check,
> > otherwise we have much more serious trouble in mlx5e_lro_update_hdr()
> > which parses into TCP header.
> >
>
> While we are on this page, mlx5e_lro_update_hdr() incorrectly assumes
> TCP header is located right after struct iphdr, which is wrong if we could
> have IP options on this path.
>
> It could be the hardware which already verified this corner case, though.

Yes, HW will not do LRO in case of IP options, so there is no problem in
mlx5e_lro_update_hdr()


[PATCH v2 net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Christoph Paasch
__qdisc_drop_all() accesses skb->prev to get to the tail of the
segment-list.

With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
the skb-list handling has been changed to set skb->next to NULL and set
the list-poison on skb->prev.

With that change, __qdisc_drop_all() will panic when it tries to
dereference skb->prev.

Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
__list_del_entry is used, leaving skb->prev unchanged (thus,
pointing to the list-head if it's the first skb of the list).
This will make __qdisc_drop_all modify the next-pointer of the list-head
and result in a panic later on:

[   34.501053] general protection fault:  [#1] SMP KASAN PTI
[   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp #108
[   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
0.5.1 01/01/2011
[   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
[   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 24 
24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 <42> 0f b6 
04 30 84 c0 74 08 3c 04
[   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
[   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 111072e165a6
[   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 0038
[   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 0062
[   34.510346] R10: 0034 R11:  R12: 
[   34.511215] R13:  R14: dc00 R15: 8883ac338008
[   34.512082] FS:  () GS:8883af50() 
knlGS:
[   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
[   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 06e0
[   34.514593] Call Trace:
[   34.514893]  
[   34.515157]  napi_gro_receive+0x93/0x150
[   34.515632]  receive_buf+0x893/0x3700
[   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
[   34.516629]  ? virtnet_probe+0x1b40/0x1b40
[   34.517153]  ? __stable_node_chain+0x4d0/0x850
[   34.517684]  ? kfree+0x9a/0x180
[   34.518067]  ? __kasan_slab_free+0x171/0x190
[   34.518582]  ? detach_buf+0x1df/0x650
[   34.519061]  ? lapic_next_event+0x5a/0x90
[   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
[   34.520093]  virtnet_poll+0x2df/0xd60
[   34.520533]  ? receive_buf+0x3700/0x3700
[   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
[   34.521631]  ? htb_dequeue+0x1817/0x25f0
[   34.522107]  ? sch_direct_xmit+0x142/0xf30
[   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
[   34.523155]  net_rx_action+0x2f6/0xc50
[   34.523601]  ? napi_complete_done+0x2f0/0x2f0
[   34.524126]  ? kasan_check_read+0x11/0x20
[   34.524608]  ? _raw_spin_lock+0x7d/0xd0
[   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
[   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
[   34.526130]  ? apic_ack_irq+0x9e/0xe0
[   34.526567]  __do_softirq+0x188/0x4b5
[   34.527015]  irq_exit+0x151/0x180
[   34.527417]  do_IRQ+0xdb/0x150
[   34.527783]  common_interrupt+0xf/0xf
[   34.528223]  

This patch makes sure that skb->prev is set to NULL when entering
netem_enqueue.

Cc: Prashant Bhole 
Cc: Tyler Hicks 
Cc: Eric Dumazet 
Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
Suggested-by: Eric Dumazet 
Signed-off-by: Christoph Paasch 
---

Notes:
v2: - Set skb->prev to NULL in netem_enqueue instead.

 net/sched/sch_netem.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2c38e3d07924..22cd46a60057 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -431,6 +431,9 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc 
*sch,
int count = 1;
int rc = NET_XMIT_SUCCESS;
 
+   /* Do not fool qdisc_drop_all() */
+   skb->prev = NULL;
+
/* Random duplication */
if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor))
++count;
-- 
2.16.2



Re: default-y for Aquantia AQtion USB to 5GbE (in -next)

2018-11-29 Thread David Miller
From: Pavel Machek 
Date: Fri, 30 Nov 2018 00:44:32 +0100

> In -next we have a new driver, and it is default-y.
> 
> I'm pretty sure that's a no-no.
> 
> Can someone please fix it?

Done.


Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Christoph Paasch
On Thu, Nov 29, 2018 at 3:54 PM David Miller  wrote:
>
> From: Eric Dumazet 
> Date: Thu, 29 Nov 2018 15:09:18 -0800
>
> > diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> > index 
> > 2c38e3d0792468162ee0dc4137f1400160ab9276..22cd46a600576f286803536d45875cd9d537cdca
> >  100644
> > --- a/net/sched/sch_netem.c
> > +++ b/net/sched/sch_netem.c
> > @@ -431,6 +431,9 @@ static int netem_enqueue(struct sk_buff *skb, struct 
> > Qdisc *sch,
> > int count = 1;
> > int rc = NET_XMIT_SUCCESS;
> >
> > +   /* Do not fool qdisc_drop_all() */
> > +   skb->prev = NULL;
> > +
> > /* Random duplication */
> > if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor))
> > ++count;
>
> If this works I definitely prefer it to making the entire stack pay the
> price to fix this crash.

Yes, I tried it out and it works.


Christoph


Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread David Miller
From: Christoph Paasch 
Date: Thu, 29 Nov 2018 15:45:19 -0800

> I can resubmit a patch.

Please do after testing.


Re: [Patch net] mlx5: fix get_ip_proto()

2018-11-29 Thread Saeed Mahameed
On Thu, Nov 29, 2018 at 5:13 AM Tariq Toukan  wrote:
>
>
>
> On 29/11/2018 1:04 AM, Cong Wang wrote:
> > IP header is not necessarily located right after struct ethhdr,
> > there could be multiple 802.1Q headers in between, this is why
> > we call __vlan_get_protocol().
> >
> > Fixes: fe1dc069990c ("net/mlx5e: don't set CHECKSUM_COMPLETE on SCTP 
> > packets")
> > Cc: Alaa Hleihel 
> > Cc: Or Gerlitz 
> > Cc: Saeed Mahameed 
> > Signed-off-by: Cong Wang 

Acked-by: Saeed Mahameed 


Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread David Miller
From: Eric Dumazet 
Date: Thu, 29 Nov 2018 15:09:18 -0800

> diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
> index 
> 2c38e3d0792468162ee0dc4137f1400160ab9276..22cd46a600576f286803536d45875cd9d537cdca
>  100644
> --- a/net/sched/sch_netem.c
> +++ b/net/sched/sch_netem.c
> @@ -431,6 +431,9 @@ static int netem_enqueue(struct sk_buff *skb, struct 
> Qdisc *sch,
> int count = 1;
> int rc = NET_XMIT_SUCCESS;
>  
> +   /* Do not fool qdisc_drop_all() */
> +   skb->prev = NULL;
> +
> /* Random duplication */
> if (q->duplicate && q->duplicate >= get_crandom(&q->dup_cor))
> ++count;

If this works I definitely prefer it to making the entire stack pay the
price to fix this crash.


Re: Freeze when using ipheth+IPsec+IPv6

2018-11-29 Thread David Miller
From: Kees Cook 
Date: Thu, 29 Nov 2018 15:31:25 -0800

> Did you ever solve this?

I think it was fixed by:

commit 45611c61dd503454b2edae00aabe1e429ec49ebe
Author: Bernd Eckstein <3erndeckst...@gmail.com>
Date:   Fri Nov 23 13:51:26 2018 +0100

usbnet: ipheth: fix potential recvmsg bug and recvmsg bug 2


Re: [PATCH] net: ethernet: ti: cpsw: allow to configure min tx packet size

2018-11-29 Thread Grygorii Strashko
Hi All,

On 11/28/18 7:22 PM, David Miller wrote:
> From: Andrew Lunn 
> Date: Thu, 29 Nov 2018 01:28:35 +0100
> 
>> On Wed, Nov 28, 2018 at 04:43:40PM -0600, Grygorii Strashko wrote:
>>> Hi Andrew,
>>>
>>> On 11/28/18 4:02 PM, Andrew Lunn wrote:
> This is dynamic configuration related to ALE VLAN entries and
> I do not see the way to support its auto-detection with current driver,
> unfortunately.

 I think you can subscribe to switchdev events which will tell you.
>>>
>>> Thanks a lot for your review and comments.
>>>
>>> Unfortunately, we still do not have switchdev in place, but i need to 
>>> resolve
>>> existing issue somehow. This patch is small, local to CPSW fix which can be
>>> backported to stable. So, if possible, and no strong objection, could we 
>>> proceed with it?
>>
>> Hi Grygorii
>>
>> David is the person who generally says no to kernel module
>> parameters. You should ask him.
> 
> No module options.
> 
> You'll be stuck with it forever.

Thank you for your feedback.

-- 
regards,
-grygorii


Re: [PATCH bpf-next v2] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Jakub Kicinski
On Thu, 29 Nov 2018 15:31:45 -0800, Yonghong Song wrote:
> During porting libbpf to bcc, I got some warnings like below:
>   ...
>   [  2%] Building C object 
> src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf.c.o
>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf.c:12:0:
>   warning: "_GNU_SOURCE" redefined [enabled by default]
>#define _GNU_SOURCE
>   ...
>   [  3%] Building C object 
> src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf_errno.c.o
>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c: In function 
> ‘libbpf_strerror’:
>   /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c:45:7:
>   warning: assignment makes integer from pointer without a cast [enabled by 
> default]
>  ret = strerror_r(err, buf, size);
>   ...
> 
> bcc is built with _GNU_SOURCE defined and this caused the above warning.
> This patch intends to make libbpf _GNU_SOURCE friendly by
>   . defining _GNU_SOURCE in libbpf.c unless it is already defined
>   . undefining _GNU_SOURCE in libbpf_errno.c, where the non-GNU version of strerror_r is expected.
> 
> Signed-off-by: Yonghong Song 

Acked-by: Jakub Kicinski 

FWIW :)


Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Christoph Paasch
On 29/11/18 - 15:09:18, Eric Dumazet wrote:
> 
> 
> On 11/29/2018 02:55 PM, Christoph Paasch wrote:
> > On 29/11/18 - 14:44:44, Eric Dumazet wrote:
> >>
> >>
> >> On 11/29/2018 02:27 PM, Christoph Paasch wrote:
> >>> There are places in the stack, where we access skb->prev directly and
> >>> modify it. Namely, __qdisc_drop_all().
> >>>
> >>> With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
> >>> the skb-list handling has been changed to set skb->next to NULL and set
> >>> the list-poison on skb->prev.
> >>>
> >>> With that change, __qdisc_drop_all() will panic when it tries to
> >>> dereference skb->prev.
> >>>
> >>> Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
> >>> __list_del_entry is used, leaving skb->prev unchanged (thus,
> >>> pointing to the list-head if it's the first skb of the list).
> >>> This will make __qdisc_drop_all modify the next-pointer of the list-head
> >>> and result in a panic later on:
> >>>
> >>> [   34.501053] general protection fault:  [#1] SMP KASAN PTI
> >>> [   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp 
> >>> #108
> >>> [   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> >>> BIOS 0.5.1 01/01/2011
> >>> [   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
> >>> [   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 
> >>> 8b 24 24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 
> >>> 03 <42> 0f b6 04 30 84 c0 74 08 3c 04
> >>> [   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
> >>> [   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 
> >>> 111072e165a6
> >>> [   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 
> >>> 0038
> >>> [   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 
> >>> 0062
> >>> [   34.510346] R10: 0034 R11:  R12: 
> >>> 
> >>> [   34.511215] R13:  R14: dc00 R15: 
> >>> 8883ac338008
> >>> [   34.512082] FS:  () GS:8883af50() 
> >>> knlGS:
> >>> [   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
> >>> [   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 
> >>> 06e0
> >>> [   34.514593] Call Trace:
> >>> [   34.514893]  
> >>> [   34.515157]  napi_gro_receive+0x93/0x150
> >>> [   34.515632]  receive_buf+0x893/0x3700
> >>> [   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
> >>> [   34.516629]  ? virtnet_probe+0x1b40/0x1b40
> >>> [   34.517153]  ? __stable_node_chain+0x4d0/0x850
> >>> [   34.517684]  ? kfree+0x9a/0x180
> >>> [   34.518067]  ? __kasan_slab_free+0x171/0x190
> >>> [   34.518582]  ? detach_buf+0x1df/0x650
> >>> [   34.519061]  ? lapic_next_event+0x5a/0x90
> >>> [   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
> >>> [   34.520093]  virtnet_poll+0x2df/0xd60
> >>> [   34.520533]  ? receive_buf+0x3700/0x3700
> >>> [   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
> >>> [   34.521631]  ? htb_dequeue+0x1817/0x25f0
> >>> [   34.522107]  ? sch_direct_xmit+0x142/0xf30
> >>> [   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
> >>> [   34.523155]  net_rx_action+0x2f6/0xc50
> >>> [   34.523601]  ? napi_complete_done+0x2f0/0x2f0
> >>> [   34.524126]  ? kasan_check_read+0x11/0x20
> >>> [   34.524608]  ? _raw_spin_lock+0x7d/0xd0
> >>> [   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
> >>> [   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
> >>> [   34.526130]  ? apic_ack_irq+0x9e/0xe0
> >>> [   34.526567]  __do_softirq+0x188/0x4b5
> >>> [   34.527015]  irq_exit+0x151/0x180
> >>> [   34.527417]  do_IRQ+0xdb/0x150
> >>> [   34.527783]  common_interrupt+0xf/0xf
> >>> [   34.528223]  
> >>>
> >>> This patch makes sure that skb->prev is also set to NULL when removing
> >>> it from the list.
> >>>
> >>> The bug is in v4.19.x as well, but the patch can't be backported easily.
> >>> I can post a follow-up for that.
> >>>
> >>> Cc: Prashant Bhole 
> >>> Cc: Tyler Hicks 
> >>> Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
> >>> Signed-off-by: Christoph Paasch 
> >>> ---
> >>>  include/linux/skbuff.h | 1 +
> >>>  1 file changed, 1 insertion(+)
> >>>
> >>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >>> index 0d1b2c3f127b..3bb3bfd390eb 100644
> >>> --- a/include/linux/skbuff.h
> >>> +++ b/include/linux/skbuff.h
> >>> @@ -1373,6 +1373,7 @@ static inline void skb_zcopy_abort(struct sk_buff 
> >>> *skb)
> >>>  static inline void skb_mark_not_on_list(struct sk_buff *skb)
> >>>  {
> >>>   skb->next = NULL;
> >>> + skb->prev = NULL;
> >>>  }
> >>>  
> >>>  static inline void skb_list_del_init(struct sk_buff *skb)
> >>>
> >>
> >> skb_mark_not_on_list() is used in many place where we do not care of 
> >> skb->prev
> >>
> >> What about fixing netem instead ?
> > 
> > Yes, I have been looking at that and Alexey's patch which introduced the
> > access to skb->prev (cfr.: https://patchwork.

default-y for Aquantia AQtion USB to 5GbE (in -next)

2018-11-29 Thread Pavel Machek
Hi!

In -next we have a new driver, and it is default-y.

I'm pretty sure that's a no-no.

Can someone please fix it?

Thanks,

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




[PATCH bpf-next v2] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Yonghong Song
During porting libbpf to bcc, I got some warnings like below:
  ...
  [  2%] Building C object 
src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf.c.o
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf.c:12:0:
  warning: "_GNU_SOURCE" redefined [enabled by default]
   #define _GNU_SOURCE
  ...
  [  3%] Building C object 
src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf_errno.c.o
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c: In function 
‘libbpf_strerror’:
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c:45:7:
  warning: assignment makes integer from pointer without a cast [enabled by 
default]
 ret = strerror_r(err, buf, size);
  ...

bcc is built with _GNU_SOURCE defined and this caused the above warning.
This patch intends to make libbpf _GNU_SOURCE friendly by
  . defining _GNU_SOURCE in libbpf.c unless it is already defined
  . undefining _GNU_SOURCE in libbpf_errno.c, where the non-GNU version of strerror_r is expected.

Signed-off-by: Yonghong Song 
---
 tools/lib/bpf/libbpf.c   | 2 ++
 tools/lib/bpf/libbpf_errno.c | 1 +
 2 files changed, 3 insertions(+)

Changelog:
 v1 -> v2:
   . undef _GNU_SOURCE instead of multiversioning strerror_r,
 suggested by Jakub.

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ed4212a4c5f9..59b748ebd15f 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -9,7 +9,9 @@
  * Copyright (C) 2017 Nicira, Inc.
  */
 
+#ifndef _GNU_SOURCE
 #define _GNU_SOURCE
+#endif
 #include 
 #include 
 #include 
diff --git a/tools/lib/bpf/libbpf_errno.c b/tools/lib/bpf/libbpf_errno.c
index d83b17f8435c..4343e40588c6 100644
--- a/tools/lib/bpf/libbpf_errno.c
+++ b/tools/lib/bpf/libbpf_errno.c
@@ -7,6 +7,7 @@
  * Copyright (C) 2017 Nicira, Inc.
  */
 
+#undef _GNU_SOURCE
 #include 
 #include 
 
-- 
2.17.1



Re: Freeze when using ipheth+IPsec+IPv6

2018-11-29 Thread Kees Cook
On Wed, Jun 6, 2018 at 1:21 AM Yves-Alexis Perez  wrote:
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA512
>
> On Tue, Jun 05, 2018 at 10:54:51AM +0200, Yves-Alexis Perez wrote:
> > Hi,
> >
> > since some kernels releases (I didn't test thorougly but at least 4.16
> > and 4.17) I have regular freezes in certain situations on my laptop.
> >
> > It seems to happen when I:
> >
> > - tether using my iPhone (involving ipheth)
> > - mount an IPsec tunnel over IPv4
> > - run evolution to fetch my mail (IMAP traffic over IPv6 inside the IPv4
> >   IPsec tunnel)
> >
> > When I do that, the interface seems to freeze. Last time the mouse was
> > still moving so the kernel didn't completely crash, but the UI was
> > completely irresponsive. I managed to get the attached log from
> > /sys/fs/pstore with refcount_t stuff pointing to an underflow.
>
> Today I had a different behavior. Again same situation (ipheth, IPsec
> tunnel, refresh of the LKML folder in Evolution). The kernel didn't
> crash/freeze but I had multiple (33309 actually) "recvmsg bug:
> copied..." traces like this one:
>
>
> [ 1555.957599] [ cut here ]
> [ 1555.957619] recvmsg bug: copied ABEA08B2 seq 1 rcvnxt ABEA0DCE fl 0
> [ 1555.957805] WARNING: CPU: 3 PID: 2177 at 
> /home/corsac/projets/linux/linux/net/ipv4/tcp.c:1850 tcp_recvmsg+0x610/0xb40

(I'm going through ancient email while I try to catch up from travel...)

Did you ever solve this?

-Kees

> [ 1555.957813] Modules linked in: esp4 xfrm6_mode_tunnel xfrm4_mode_tunnel 
> bnep ipheth rtsx_pci_sdmmc snd_hda_codec_realtek iwlmvm snd_hda_codec_generic 
> snd_hda_codec_hdmi snd_hda_intel iwlwifi snd_hda_codec snd_hwdep rtsx_pci 
> snd_hda_core snd_pcm thinkpad_acpi efivarfs input_leds
> [ 1555.957895] CPU: 3 PID: 2177 Comm: pool Tainted: GT 4.17.0 
> #22
> [ 1555.957902] Hardware name: LENOVO 20CMCTO1WW/20CMCTO1WW, BIOS N10ET48W 
> (1.27 ) 09/12/2017
> [ 1555.957922] RIP: 0010:tcp_recvmsg+0x610/0xb40
> [ 1555.957927] RSP: 0018:b77e010f7cf8 EFLAGS: 00010282
> [ 1555.957932] RAX:  RBX: abea08b2 RCX: 
> 0006
> [ 1555.957935] RDX: 0007 RSI: 0086 RDI: 
> a37a8dd95610
> [ 1555.957939] RBP: b77e010f7db8 R08: 03b4 R09: 
> 0004
> [ 1555.957942] R10: a37a3b1180c8 R11: 0001 R12: 
> a37a81d40e00
> [ 1555.957945] R13: a37a3b118000 R14: a37a3b118524 R15: 
> 
> [ 1555.957951] FS:  738f795c0700() GS:a37a8dd8() 
> knlGS:
> [ 1555.957954] CS:  0010 DS:  ES:  CR0: 80050033
> [ 1555.957957] CR2: 738f0879a028 CR3: 00024200c006 CR4: 
> 003606e0
> [ 1555.957964] Call Trace:
> [ 1555.957996]  inet_recvmsg+0x5c/0x110
> [ 1555.958017]  __sys_recvfrom+0xf2/0x160
> [ 1555.958030]  __x64_sys_recvfrom+0x1f/0x30
> [ 1555.958039]  do_syscall_64+0x72/0x1c0
> [ 1555.958048]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 1555.958053] RIP: 0033:0x73901a71deae
> [ 1555.958056] RSP: 002b:738f795bee50 EFLAGS: 0246 ORIG_RAX: 
> 002d
> [ 1555.958060] RAX: ffda RBX: 0028 RCX: 
> 73901a71deae
> [ 1555.958063] RDX: 0404 RSI: 738f087955a7 RDI: 
> 0028
> [ 1555.958066] RBP:  R08:  R09: 
> 
> [ 1555.958068] R10:  R11: 0246 R12: 
> 738f087955a7
> [ 1555.958071] R13: 0404 R14:  R15: 
> 
> [ 1555.958075] Code: e9 33 fd ff ff 4c 89 e0 41 8b 8d 20 05 00 00 89 de 48 c7 
> c7 10 47 05 ae 48 89 85 48 ff ff ff 44 8b 85 70 ff ff ff e8 80 0d 93 ff <0f> 
> 0b 48 8b 85 48 ff ff ff e9 ed fd ff ff 41 8b 8d 20 05 00 00
> [ 1555.958180] ---[ end trace e7da03c87ec51f13 ]---
>
> (complete log available but it seems that only R08 is changing between
> these traces)
>
> Followed by a "recvmsg bug 2:":
>
> [ 1563.657991] [ cut here ]
> [ 1563.657992] recvmsg bug 2: copied ABEA08B2 seq 6A7E3970 rcvnxt ABECA5EE fl > 0
> [ 1563.658002] WARNING: CPU: 1 PID: 2177 at 
> /home/corsac/projets/linux/linux/net/ipv4/tcp.c:1864 tcp_recvmsg+0x647/0xb40
> [ 1563.658002] Modules linked in: esp4 xfrm6_mode_tunnel xfrm4_mode_tunnel 
> bnep ipheth rtsx_pci_sdmmc snd_hda_codec_realtek iwlmvm snd_hda_codec_generic 
> snd_hda_codec_hdmi snd_hda_intel iwlwifi snd_hda_codec snd_hwdep rtsx_pci 
> snd_hda_core snd_pcm thinkpad_acpi efivarfs input_leds
> [ 1563.658016] CPU: 1 PID: 2177 Comm: pool Tainted: GW   T 4.17.0 
> #22
> [ 1563.658017] Hardware name: LENOVO 20CMCTO1WW/20CMCTO1WW, BIOS N10ET48W 
> (1.27 ) 09/12/2017
> [ 1563.658019] RIP: 0010:tcp_recvmsg+0x647/0xb40
> [ 1563.658020] RSP: 0018:b77e010f7cf8 EFLAGS: 00010282
> [ 1563.658022] RAX:  RBX: 416bcf42 RCX: 
> 0006
> [ 1563.658023] RDX: 0007 RSI: 0086 RDI: 
> a37a8dc95610
> [ 1563.658

[PATCH v3] rhashtable: detect when object movement between tables might have invalidated a lookup

2018-11-29 Thread NeilBrown

Some users of rhashtables might need to move an object from one table
to another -  this appears to be the reason for the incomplete usage
of NULLS markers.

To support these, we store a unique NULLS_MARKER at the end of
each chain, and when a search fails to find a match, we check
if the NULLS marker found was the expected one.  If not, the search
may not have examined all objects in the target bucket, so it is
repeated.

The unique NULLS_MARKER is derived from the address of the
head of the chain.  As this cannot be derived at load-time the
static rhnull in rht_bucket_nested() needs to be initialised
at run time.

Any caller of a lookup function must still be prepared for the
possibility that the object returned is in a different table - it
might have been there for some time.

Note that this does NOT provide support for other uses of
NULLS_MARKERs such as allocating with SLAB_TYPESAFE_BY_RCU or changing
the key of an object and re-inserting it in the same table.
These could only be done safely if new objects were inserted
at the *start* of a hash chain, and that is not currently the case.

Signed-off-by: NeilBrown 
---

This version has an added comment to explain the ">>1" in
RHT_NULLS_MARKER().

Thanks,
NeilBrown

 include/linux/rhashtable.h | 34 ++
 lib/rhashtable.c   |  8 +---
 2 files changed, 31 insertions(+), 11 deletions(-)

diff --git a/include/linux/rhashtable.h b/include/linux/rhashtable.h
index eb7111039247..ae507af54800 100644
--- a/include/linux/rhashtable.h
+++ b/include/linux/rhashtable.h
@@ -75,8 +75,19 @@ struct bucket_table {
struct rhash_head __rcu *buckets[] cacheline_aligned_in_smp;
 };
 
+/*
+ * NULLS_MARKER() expects a hash value with the low
+ * bits most likely to be significant, and it discards
+ * the msb.
+ * We give it an address, in which the bottom 2 bits are
+ * always 0, and the msb might be significant.
+ * So we shift the address down one bit to align with
+ * expectations and avoid losing a significant bit.
+ */
+#define RHT_NULLS_MARKER(ptr)  \
+   ((void *)NULLS_MARKER(((unsigned long) (ptr)) >> 1))
 #define INIT_RHT_NULLS_HEAD(ptr)   \
-   ((ptr) = (typeof(ptr)) NULLS_MARKER(0))
+   ((ptr) = RHT_NULLS_MARKER(&(ptr)))
 
 static inline bool rht_is_a_nulls(const struct rhash_head *ptr)
 {
@@ -471,6 +482,7 @@ static inline struct rhash_head *__rhashtable_lookup(
.ht = ht,
.key = key,
};
+   struct rhash_head __rcu * const *head;
struct bucket_table *tbl;
struct rhash_head *he;
unsigned int hash;
@@ -478,13 +490,19 @@ static inline struct rhash_head *__rhashtable_lookup(
tbl = rht_dereference_rcu(ht->tbl, ht);
 restart:
hash = rht_key_hashfn(ht, tbl, key, params);
-   rht_for_each_rcu(he, tbl, hash) {
-   if (params.obj_cmpfn ?
-   params.obj_cmpfn(&arg, rht_obj(ht, he)) :
-   rhashtable_compare(&arg, rht_obj(ht, he)))
-   continue;
-   return he;
-   }
+   head = rht_bucket(tbl, hash);
+   do {
+   rht_for_each_rcu_continue(he, *head, tbl, hash) {
+   if (params.obj_cmpfn ?
+   params.obj_cmpfn(&arg, rht_obj(ht, he)) :
+   rhashtable_compare(&arg, rht_obj(ht, he)))
+   continue;
+   return he;
+   }
+   /* An object might have been moved to a different hash chain,
+* while we walk along it - better check and retry.
+*/
+   } while (he != RHT_NULLS_MARKER(head));
 
/* Ensure we see any new tables. */
smp_rmb();
diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 30526afa8343..852ffa5160f1 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -1179,8 +1179,7 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
unsigned int hash)
 {
const unsigned int shift = PAGE_SHIFT - ilog2(sizeof(void *));
-   static struct rhash_head __rcu *rhnull =
-   (struct rhash_head __rcu *)NULLS_MARKER(0);
+   static struct rhash_head __rcu *rhnull;
unsigned int index = hash & ((1 << tbl->nest) - 1);
unsigned int size = tbl->size >> tbl->nest;
unsigned int subhash = hash;
@@ -1198,8 +1197,11 @@ struct rhash_head __rcu **rht_bucket_nested(const struct bucket_table *tbl,
subhash >>= shift;
}
 
-   if (!ntbl)
+   if (!ntbl) {
+   if (!rhnull)
+   INIT_RHT_NULLS_HEAD(rhnull);
return &rhnull;
+   }
 
return &ntbl[subhash].bucket;
 
-- 
2.14.0.rc0.dirty



signature.asc
Description: PGP signature


Re: [PATCH bpf-next] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Yonghong Song


On 11/29/18 1:00 PM, Jakub Kicinski wrote:
> On Thu, 29 Nov 2018 12:38:03 -0800, Yonghong Song wrote:
>> diff --git a/tools/lib/bpf/libbpf_errno.c b/tools/lib/bpf/libbpf_errno.c
>> index d83b17f8435c..286e497c50ec 100644
>> --- a/tools/lib/bpf/libbpf_errno.c
>> +++ b/tools/lib/bpf/libbpf_errno.c
>> @@ -40,9 +40,19 @@ int libbpf_strerror(int err, char *buf, size_t size)
>>  err = err > 0 ? err : -err;
>>   
>>  if (err < __LIBBPF_ERRNO__START) {
>> +#ifdef _GNU_SOURCE
>> +const char *ret_buf;
>> +#endif
>>  int ret;
>>   
>> +#ifdef _GNU_SOURCE
>> +ret_buf = strerror_r(err, buf, size);
>> +if (ret_buf != buf)
>> +snprintf(buf, size, "%s", ret_buf);
>> +ret = 0;
>> +#else
>>  ret = strerror_r(err, buf, size);
>> +#endif
>>  buf[size - 1] = '\0';
>>  return ret;
>>  }
> 
> That is kinda strange, the whole point for this file was to have
> non-GNU strerror_r, would doing #undef _GNU_SOURCE at the top not work?

Thanks for the suggestion; I did not know the history of this file.
Yes, #undef _GNU_SOURCE works. Will send a revision shortly.


Re: [RFC PATCH 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-11-29 Thread Eric Dumazet



On 11/29/2018 03:00 PM, Paolo Abeni wrote:
> This header defines a bunch of helpers that allow avoiding the
> retpoline overhead when calling builtin functions via function pointers.
> It boils down to explicitly comparing the function pointers to
> known builtin functions and, when they match, invoking the latter directly.
> 
> The macros defined here implement the boilerplate for the above scheme
> and will be used by the next patches.
> 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Paolo Abeni 
> ---
>  include/linux/indirect_call_wrapper.h | 77 +++
>  1 file changed, 77 insertions(+)
>  create mode 100644 include/linux/indirect_call_wrapper.h
> 
> diff --git a/include/linux/indirect_call_wrapper.h 
> b/include/linux/indirect_call_wrapper.h
> new file mode 100644
> index ..57e82b4a166d
> --- /dev/null
> +++ b/include/linux/indirect_call_wrapper.h
> @@ -0,0 +1,77 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_INDIRECT_CALL_WRAPPER_H
> +#define _LINUX_INDIRECT_CALL_WRAPPER_H
> +
> +#ifdef CONFIG_RETPOLINE
> +
> +/*
> + * INDIRECT_CALL_$NR - wrapper for indirect calls with $NR known builtin
> + *  @f: function pointer
> + *  @name: base name for builtin functions, see INDIRECT_CALLABLE_DECLARE_$NR
> + *  @__VA_ARGS__: arguments for @f
> + *
> + * Avoid retpoline overhead for known builtins, checking @f vs each of them
> + * and, on a match, invoking the builtin function directly. Fall back to the
> + * indirect call otherwise.
> + */
> +#define INDIRECT_CALL_1(f, name, ...)
> \
> + ({  \
> + f == name ## 1 ? name ## 1(__VA_ARGS__) :   \

  likely(f == name ## 1) ? ...

> +  f(__VA_ARGS__);\
> + })
> +#define INDIRECT_CALL_2(f, name, ...)
> \
> + ({  \
> + f == name ## 2 ? name ## 2(__VA_ARGS__) :   \

 likely(f == name ## 2) ? ...


> +  INDIRECT_CALL_1(f, name, __VA_ARGS__); \
> + })
> +
>


Re: [PATCH net-next] net: ethernet: ti: cpsw: drop vid0 configuration in dual_mac modey

2018-11-29 Thread Grygorii Strashko



On 11/29/18 9:26 AM, Ivan Khoronzhuk wrote:
> On Wed, Nov 28, 2018 at 03:15:46PM -0600, Grygorii Strashko wrote:
>>
>>
>> On 11/26/18 2:07 PM, Ivan Khoronzhuk wrote:
>>> On Mon, Nov 26, 2018 at 12:57:20PM -0600, Grygorii Strashko wrote:


 On 11/26/18 10:26 AM, Ivan Khoronzhuk wrote:
> On Sun, Nov 25, 2018 at 05:46:26PM -0600, Grygorii Strashko wrote:
>> In dual_mac mode the CPSW driver uses vid1 and vid2 by default to implement
>> dual mac mode, which are used to configure pvids for each external port.
>> But, historically, it also adds vid0 to the ALE table and sets "untag" bits
>> for both ext. ports. As a result, it's impossible to use priority tagged
>> packets in dual mac mode.
>>
>> Hence, drop the vid0 configuration in dual mac mode as it's not required for
>> dual mac mode functionality and, this way, make it possible to use priority
>> tagged packets in dual mac mode.
> So, now it can be added via the regular ndo.
> I have a similar change in mind, but was going to send it after the
> mcast/ucast and same-vlans enabling patches...
>
> 2 things stopped me from adding this:
>
> 1) Moving it to be enabled via a regular call is OK, but in dual mac mode
> it causes overlaps, at least during vid deletion. So I decided to wait till
> the same-vlans series is applied.

 TI driver documentation mentions this restriction
 "While adding VLAN id to the eth interfaces,
 same VLAN id should not be added in both interfaces which will lead to VLAN
 forwarding and act as switch"
>>> It's not accurate now.
>>> This sw bug ("acting like a switch") was fixed indirectly in LKML.
>>> And at least for the upstream version (not TISDK) the desc should be updated,
>>> but better to do this when it is fixed completely and merged with TISDK.
>>>
>>> I know about this "written" restriction
>>> (for tiSDK, and it's not TRM after all ...),
>>> it can be avoided and it's partly avoided now ...
>>
>> I'd like to clarify the point about supporting the same VLANs in dual_mac
>> mode, to avoid future misunderstanding. Overall: it's *not* supported, as
>> adding the same VLAN to both netdevices will cause unknown unicast packets
>> to leak between interfaces, and it can't be avoided - hw limitation.
> 
> A simple test shows no issues with ucast leaking.
> But with the current buggy ucast vlan implementation it's not possible to
> verify; the leaking in your case is probably caused by a hidden toggling of
> the interface to promisc mode while adding ucasts to vlans, or some other
> reason. Anyway, I just decided to check ucasts specifically
> (mcasts, as you know, are not normal now).
> 
> For verification you need to apply ucast fix (including vlans) first:
> https://git.linaro.org/people/ivan.khoronzhuk/tsn_kernel.git/log/?h=vlan_addr_flt_fix
> 
> This is a generic fix (not sure it will be approved, needs to be tried as an
> RFC) but it implements the same thing as the local fix for vlan ucasts:
> https://git.linaro.org/people/ivan.khoronzhuk/tsn_kernel.git/log/?h=ucast_vlan_fix
> 
> Either of those is correct. I've used the generic one.
> Applied the following scheme:
> 
>      +-------------------------+
>      | host 74:DA:EA:47:7D:9C  |
>      +-------------------------+
> 
>      +---------------------+
>      |      am572 evm      |
>      |   eth0       eth1   |
>      +----------+----------+
>      | eth0.400 | eth1.400 |
>      +----------+----------+
>           ^          |
>           |          |          +-----------+
>   +-------+-------+  |          |    PC     |
>   | BBB eth0.400  |  +--------->| Wireshark |
>   +---------------+             +-----------+
> 
> 
> 1) Configure vlans on am572x evm
> ip link add link eth0 name eth0.400 type vlan id 400
> ip link add link eth1 name eth1.400 type vlan id 400
> 
> 2) On BBB side:
> # ip link add link eth0 name eth0.400 type vlan id 400
> Send ucast vlan traffic to the am572 evm, vlan ucast address is unreq on am572.
> # ./plget -i eth0.400 -t ptpl2 -m tx-lat -n 160 -s 10 -a 74:DA:EA:47:7D:66
> # ./plget -i eth0.400 -t ptpl2 -m tx-lat -n 160 -s 10 -a 18:03:73:66:87:42
> 
> 3) Observe silence on PC wireshark.
> 
> Thus, no see issues with this.
> 
> PS: I'm sure in plget tool, you can use your own.

I'm using packeth to generate udp packets (vlan), src=PC, dst=unknown.
If there is a record in the ALE table which looks like:
type: vlan, vid = 100, untag_force = 0x0, reg_mcast = 0x7, unreg_mcast = 0x0, member_list = 0x7
then the above udp packet will be forwarded to the BBB.



-- 
regards,
-grygorii


Re: [PATCH 0/2] Don’t leave executable TLB entries to freed pages

2018-11-29 Thread Masami Hiramatsu
On Thu, 29 Nov 2018 18:49:26 +
"Edgecombe, Rick P"  wrote:

> On Thu, 2018-11-29 at 23:06 +0900, Masami Hiramatsu wrote:
> > On Tue, 27 Nov 2018 16:07:52 -0800
> > Rick Edgecombe  wrote:
> > 
> > > Sometimes when memory is freed via the module subsystem, an executable
> > > permissioned TLB entry can remain to a freed page. If the page is re-used
> > > to back an address that will receive data from userspace, it can result
> > > in user data being mapped as executable in the kernel. The root of this
> > > behavior is vfree lazily flushing the TLB, but not lazily freeing the
> > > underlying pages.
> > 
> > Good catch!
> > 
> > > 
> > > There are sort of three categories of this which show up across modules,
> > > bpf,
> > > kprobes and ftrace:
> > 
> > For x86-64 kprobes, it sets the page NX and after that RW, and then releases
> > it via module_memfree. So I'm not sure it really happens on kprobes. (Of
> > course the default memory allocator is simpler, so it may happen on other
> > archs.) But interesting fixes.
> Yes, I think you are right, it should not leave an executable TLB entry in
> this case. Ftrace actually does this on x86 as well.
> 
> Is there some other reason for calling set_memory_nx that should apply
> elsewhere for module users? Or could it be removed in the case of this patch
> to centralize the behavior?

According to commit c93f5cf571e7 ("kprobes/x86: Fix to set RWX bits correctly
before releasing trampoline"), if we release a read-only page via
module_memfree(), it causes a kernel crash. And at the moment, x86-64 sets the
trampoline page read-only because it is an executable page. Setting the NX bit
is for security reasons and should be done before making the page writable.
So I think if you centralize setting the NX bit, it should be done before
setting the writable bit.

Thank you,


-- 
Masami Hiramatsu 


Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Eric Dumazet



On 11/29/2018 02:55 PM, Christoph Paasch wrote:
> On 29/11/18 - 14:44:44, Eric Dumazet wrote:
>>
>>
>> On 11/29/2018 02:27 PM, Christoph Paasch wrote:
>>> There are places in the stack, where we access skb->prev directly and
>>> modify it. Namely, __qdisc_drop_all().
>>>
>>> With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
>>> the skb-list handling has been changed to set skb->next to NULL and set
>>> the list-poison on skb->prev.
>>>
>>> With that change, __qdisc_drop_all() will panic when it tries to
>>> dereference skb->prev.
>>>
>>> Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
>>> __list_del_entry is used, leaving skb->prev unchanged (thus,
>>> pointing to the list-head if it's the first skb of the list).
>>> This will make __qdisc_drop_all modify the next-pointer of the list-head
>>> and result in a panic later on:
>>>
>>> [   34.501053] general protection fault:  [#1] SMP KASAN PTI
>>> [   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp 
>>> #108
>>> [   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>> 0.5.1 01/01/2011
>>> [   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
>>> [   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 
>>> 24 24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 
>>> <42> 0f b6 04 30 84 c0 74 08 3c 04
>>> [   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
>>> [   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 
>>> 111072e165a6
>>> [   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 
>>> 0038
>>> [   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 
>>> 0062
>>> [   34.510346] R10: 0034 R11:  R12: 
>>> 
>>> [   34.511215] R13:  R14: dc00 R15: 
>>> 8883ac338008
>>> [   34.512082] FS:  () GS:8883af50() 
>>> knlGS:
>>> [   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
>>> [   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 
>>> 06e0
>>> [   34.514593] Call Trace:
>>> [   34.514893]  
>>> [   34.515157]  napi_gro_receive+0x93/0x150
>>> [   34.515632]  receive_buf+0x893/0x3700
>>> [   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
>>> [   34.516629]  ? virtnet_probe+0x1b40/0x1b40
>>> [   34.517153]  ? __stable_node_chain+0x4d0/0x850
>>> [   34.517684]  ? kfree+0x9a/0x180
>>> [   34.518067]  ? __kasan_slab_free+0x171/0x190
>>> [   34.518582]  ? detach_buf+0x1df/0x650
>>> [   34.519061]  ? lapic_next_event+0x5a/0x90
>>> [   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
>>> [   34.520093]  virtnet_poll+0x2df/0xd60
>>> [   34.520533]  ? receive_buf+0x3700/0x3700
>>> [   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
>>> [   34.521631]  ? htb_dequeue+0x1817/0x25f0
>>> [   34.522107]  ? sch_direct_xmit+0x142/0xf30
>>> [   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
>>> [   34.523155]  net_rx_action+0x2f6/0xc50
>>> [   34.523601]  ? napi_complete_done+0x2f0/0x2f0
>>> [   34.524126]  ? kasan_check_read+0x11/0x20
>>> [   34.524608]  ? _raw_spin_lock+0x7d/0xd0
>>> [   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
>>> [   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
>>> [   34.526130]  ? apic_ack_irq+0x9e/0xe0
>>> [   34.526567]  __do_softirq+0x188/0x4b5
>>> [   34.527015]  irq_exit+0x151/0x180
>>> [   34.527417]  do_IRQ+0xdb/0x150
>>> [   34.527783]  common_interrupt+0xf/0xf
>>> [   34.528223]  
>>>
>>> This patch makes sure that skb->prev is also set to NULL when removing
>>> it from the list.
>>>
>>> The bug is in v4.19.x as well, but the patch can't be backported easily.
>>> I can post a follow-up for that.
>>>
>>> Cc: Prashant Bhole 
>>> Cc: Tyler Hicks 
>>> Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
>>> Signed-off-by: Christoph Paasch 
>>> ---
>>>  include/linux/skbuff.h | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index 0d1b2c3f127b..3bb3bfd390eb 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -1373,6 +1373,7 @@ static inline void skb_zcopy_abort(struct sk_buff 
>>> *skb)
>>>  static inline void skb_mark_not_on_list(struct sk_buff *skb)
>>>  {
>>> skb->next = NULL;
>>> +   skb->prev = NULL;
>>>  }
>>>  
>>>  static inline void skb_list_del_init(struct sk_buff *skb)
>>>
>>
>> skb_mark_not_on_list() is used in many places where we do not care about
>> skb->prev.
>>
>> What about fixing netem instead ?
> 
> Yes, I have been looking at that and Alexey's patch which introduced the
> access to skb->prev (cfr.: https://patchwork.ozlabs.org/patch/880717/).
> 
> But then I thought that setting skb->prev to NULL is a less risky approach for
> -stable.
> 
> 
> How would you go about fixing netem instead?
> 
> Because, from what I see we basically can enter netem_enqueue here with two
> different "

[RFC PATCH 2/4] net: use indirect call wrappers at GRO network layer

2018-11-29 Thread Paolo Abeni
This avoids an indirect call in the L3 GRO receive path, both
for ipv4 and ipv6, if the latter is not compiled as a module.

Note that when IPv6 is compiled as built-in, it will be checked first,
so we have a single additional compare for the more common path.

Signed-off-by: Paolo Abeni 
---
 include/net/inet_common.h |  2 ++
 net/core/dev.c| 10 --
 net/ipv4/af_inet.c|  4 
 net/ipv6/ip6_offload.c|  4 
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 3ca969cbd161..56e7592811ea 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -2,6 +2,8 @@
 #ifndef _INET_COMMON_H
 #define _INET_COMMON_H
 
+#include 
+
 extern const struct proto_ops inet_stream_ops;
 extern const struct proto_ops inet_dgram_ops;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index f69b2fcdee40..619f5902600f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -145,6 +145,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "net-sysfs.h"
 
@@ -5306,6 +5307,7 @@ static void flush_all_backlogs(void)
put_online_cpus();
 }
 
+INDIRECT_CALLABLE_DECLARE_2(int, network_gro_complete, struct sk_buff *, int);
 static int napi_gro_complete(struct sk_buff *skb)
 {
struct packet_offload *ptype;
@@ -5325,7 +5327,8 @@ static int napi_gro_complete(struct sk_buff *skb)
if (ptype->type != type || !ptype->callbacks.gro_complete)
continue;
 
-   err = ptype->callbacks.gro_complete(skb, 0);
+   err = INDIRECT_CALL_INET(ptype->callbacks.gro_complete,
+network_gro_complete, skb, 0);
break;
}
rcu_read_unlock();
@@ -5472,6 +5475,8 @@ static void gro_flush_oldest(struct list_head *head)
napi_gro_complete(oldest);
 }
 
+INDIRECT_CALLABLE_DECLARE_2(struct sk_buff *, network_gro_receive,
+   struct list_head *, struct sk_buff *);
static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 {
u32 hash = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
@@ -5521,7 +5526,8 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
NAPI_GRO_CB(skb)->csum_valid = 0;
}
 
-   pp = ptype->callbacks.gro_receive(gro_head, skb);
+   pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive,
+   network_gro_receive, gro_head, skb);
break;
}
rcu_read_unlock();
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 326c422c22f8..04ab7ebd6e9b 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1505,6 +1505,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
return pp;
 }
 EXPORT_SYMBOL(inet_gro_receive);
+INDIRECT_CALLABLE(inet_gro_receive, 1, struct sk_buff *, network_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static struct sk_buff *ipip_gro_receive(struct list_head *head,
struct sk_buff *skb)
@@ -1589,6 +1591,8 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
return err;
 }
 EXPORT_SYMBOL(inet_gro_complete);
+INDIRECT_CALLABLE(inet_gro_complete, 1, int, network_gro_complete,
+ struct sk_buff *, int);
 
 static int ipip_gro_complete(struct sk_buff *skb, int nhoff)
 {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 70f525c33cb6..a1c2bfb2ce0d 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -270,6 +270,8 @@ static struct sk_buff *ipv6_gro_receive(struct list_head *head,
 
return pp;
 }
+INDIRECT_CALLABLE(ipv6_gro_receive, 2, struct sk_buff *, network_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static struct sk_buff *sit_ip6ip6_gro_receive(struct list_head *head,
  struct sk_buff *skb)
@@ -327,6 +329,8 @@ static int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
 
return err;
 }
+INDIRECT_CALLABLE(ipv6_gro_complete, 2, int, network_gro_complete,
+ struct sk_buff *, int);
 
 static int sit_gro_complete(struct sk_buff *skb, int nhoff)
 {
-- 
2.19.2



[RFC PATCH 3/4] net: use indirect call wrapper at GRO transport layer

2018-11-29 Thread Paolo Abeni
This avoids an indirect call in the receive path for TCP and UDP
packets. TCP takes precedence over UDP, so that we have a single
additional conditional in the common case.

Signed-off-by: Paolo Abeni 
---
 include/net/inet_common.h |  7 +++
 net/ipv4/af_inet.c| 11 +--
 net/ipv4/tcp_offload.c|  5 +
 net/ipv4/udp_offload.c|  5 +
 net/ipv6/ip6_offload.c| 10 --
 net/ipv6/tcpv6_offload.c  |  5 +
 net/ipv6/udp_offload.c|  5 +
 7 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_common.h b/include/net/inet_common.h
index 56e7592811ea..667bb8247f9a 100644
--- a/include/net/inet_common.h
+++ b/include/net/inet_common.h
@@ -56,4 +56,11 @@ static inline void inet_ctl_sock_destroy(struct sock *sk)
sock_release(sk->sk_socket);
 }
 
+#define indirect_call_gro_receive(name, cb, head, skb) \
+({ \
+   unlikely(gro_recursion_inc_test(skb)) ? \
+   NAPI_GRO_CB(skb)->flush |= 1, NULL :\
+   INDIRECT_CALL_2(cb, name, head, skb);   \
+})
+
 #endif
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 04ab7ebd6e9b..774f183f56e3 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1385,6 +1385,8 @@ struct sk_buff *inet_gso_segment(struct sk_buff *skb,
 }
 EXPORT_SYMBOL(inet_gso_segment);
 
+INDIRECT_CALLABLE_DECLARE_2(struct sk_buff *, transport4_gro_receive,
+   struct list_head *head, struct sk_buff *skb);
 struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 {
const struct net_offload *ops;
@@ -1494,7 +1496,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
skb_gro_pull(skb, sizeof(*iph));
skb_set_transport_header(skb, skb_gro_offset(skb));
 
-   pp = call_gro_receive(ops->callbacks.gro_receive, head, skb);
+   pp = indirect_call_gro_receive(transport4_gro_receive,
+  ops->callbacks.gro_receive, head, skb);
 
 out_unlock:
rcu_read_unlock();
@@ -1558,6 +1561,8 @@ int inet_recv_error(struct sock *sk, struct msghdr *msg, int len, int *addr_len)
return -EINVAL;
 }
 
+INDIRECT_CALLABLE_DECLARE_2(int, transport4_gro_complete, struct sk_buff *skb,
+   int);
 int inet_gro_complete(struct sk_buff *skb, int nhoff)
 {
__be16 newlen = htons(skb->len - nhoff);
@@ -1583,7 +1588,9 @@ int inet_gro_complete(struct sk_buff *skb, int nhoff)
 * because any hdr with option will have been flushed in
 * inet_gro_receive().
 */
-   err = ops->callbacks.gro_complete(skb, nhoff + sizeof(*iph));
+   err = INDIRECT_CALL_2(ops->callbacks.gro_complete,
+ transport4_gro_complete, skb,
+ nhoff + sizeof(*iph));
 
 out_unlock:
rcu_read_unlock();
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 870b0a335061..3d5dfac4cd1b 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -10,6 +10,7 @@
  * TCPv4 GSO/GRO support
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -317,6 +318,8 @@ static struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *
 
return tcp_gro_receive(head, skb);
 }
+INDIRECT_CALLABLE(tcp4_gro_receive, 2, struct sk_buff *, transport4_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 {
@@ -332,6 +335,8 @@ static int tcp4_gro_complete(struct sk_buff *skb, int thoff)
 
return tcp_gro_complete(skb);
 }
+INDIRECT_CALLABLE(tcp4_gro_complete, 2, int, transport4_gro_complete,
+ struct sk_buff *, int);
 
 static const struct net_offload tcpv4_offload = {
.callbacks = {
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 0646d61f4fa8..c3c5b237c8e0 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static struct sk_buff *__skb_udp_tunnel_segment(struct sk_buff *skb,
netdev_features_t features,
@@ -477,6 +478,8 @@ static struct sk_buff *udp4_gro_receive(struct list_head *head,
NAPI_GRO_CB(skb)->flush = 1;
return NULL;
 }
+INDIRECT_CALLABLE(udp4_gro_receive, 1, struct sk_buff *, transport4_gro_receive,
+ struct list_head *, struct sk_buff *);
 
 static int udp_gro_complete_segment(struct sk_buff *skb)
 {
@@ -536,6 +539,8 @@ static int udp4_gro_complete(struct sk_buff *skb, int nhoff)
 
return udp_gro_complete(skb, nhoff, udp4_lib_lookup_skb);
 }
+INDIRECT_CALLABLE(udp4_gro_complete, 1, int, transport4_gro_complete,
+ struct sk_buff *, int);
 
 static const struct net_offload udpv4_offload = {
.callbacks = {
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index a1c2bfb2ce0d..eeca4164a155 100644
--- a/net/

[RFC PATCH 4/4] udp: use indirect call wrapper for GRO socket lookup

2018-11-29 Thread Paolo Abeni
This avoids another indirect call for UDP GRO. Again, the test
for the IPv6 variant is performed first.

Signed-off-by: Paolo Abeni 
---
 net/ipv4/udp.c | 2 ++
 net/ipv4/udp_offload.c | 6 --
 net/ipv6/udp.c | 2 ++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index aff2a8e99e01..9ea851f47598 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -544,6 +544,8 @@ struct sock *udp4_lib_lookup_skb(struct sk_buff *skb,
return __udp4_lib_lookup_skb(skb, sport, dport, &udp_table);
 }
 EXPORT_SYMBOL_GPL(udp4_lib_lookup_skb);
+INDIRECT_CALLABLE(udp4_lib_lookup_skb, 1, struct sock *, udp_lookup,
+ struct sk_buff *skb, __be16 sport, __be16 dport);
 
 /* Must be called under rcu_read_lock().
  * Does increment socket refcount.
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index c3c5b237c8e0..0ccd2aa1ab98 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -392,6 +392,8 @@ static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
return NULL;
 }
 
+INDIRECT_CALLABLE_DECLARE_2(struct sock *, udp_lookup, struct sk_buff *skb,
+   __be16 sport, __be16 dport);
 struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct udphdr *uh, udp_lookup_t lookup)
 {
@@ -403,7 +405,7 @@ struct sk_buff *udp_gro_receive(struct list_head *head, struct sk_buff *skb,
struct sock *sk;
 
rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
+   sk = INDIRECT_CALL_INET(lookup, udp_lookup, skb, uh->source, uh->dest);
if (!sk)
goto out_unlock;
 
@@ -505,7 +507,7 @@ int udp_gro_complete(struct sk_buff *skb, int nhoff,
uh->len = newlen;
 
rcu_read_lock();
-   sk = (*lookup)(skb, uh->source, uh->dest);
+   sk = INDIRECT_CALL_INET(lookup, udp_lookup, skb, uh->source, uh->dest);
if (sk && udp_sk(sk)->gro_enabled) {
err = udp_gro_complete_segment(skb);
} else if (sk && udp_sk(sk)->gro_complete) {
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 09cba4cfe31f..616f374760d1 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -282,6 +282,8 @@ struct sock *udp6_lib_lookup_skb(struct sk_buff *skb,
 inet6_sdif(skb), &udp_table, skb);
 }
 EXPORT_SYMBOL_GPL(udp6_lib_lookup_skb);
+INDIRECT_CALLABLE(udp6_lib_lookup_skb, 2, struct sock *, udp_lookup,
+ struct sk_buff *skb, __be16 sport, __be16 dport);
 
 /* Must be called under rcu_read_lock().
  * Does increment socket refcount.
-- 
2.19.2



[RFC PATCH 1/4] indirect call wrappers: helpers to speed-up indirect calls of builtin

2018-11-29 Thread Paolo Abeni
This header defines a bunch of helpers that allow avoiding the
retpoline overhead when calling builtin functions via function pointers.
It boils down to explicitly comparing the function pointers to
known builtin functions and, when they match, invoking the latter directly.

The macros defined here implement the boilerplate for the above scheme
and will be used by the next patches.

Suggested-by: Eric Dumazet 
Signed-off-by: Paolo Abeni 
---
 include/linux/indirect_call_wrapper.h | 77 +++
 1 file changed, 77 insertions(+)
 create mode 100644 include/linux/indirect_call_wrapper.h

diff --git a/include/linux/indirect_call_wrapper.h 
b/include/linux/indirect_call_wrapper.h
new file mode 100644
index ..57e82b4a166d
--- /dev/null
+++ b/include/linux/indirect_call_wrapper.h
@@ -0,0 +1,77 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_INDIRECT_CALL_WRAPPER_H
+#define _LINUX_INDIRECT_CALL_WRAPPER_H
+
+#ifdef CONFIG_RETPOLINE
+
+/*
+ * INDIRECT_CALL_$NR - wrapper for indirect calls with $NR known builtin
+ *  @f: function pointer
+ *  @name: base name for builtin functions, see INDIRECT_CALLABLE_DECLARE_$NR
+ *  @__VA_ARGS__: arguments for @f
+ *
+ * Avoid retpoline overhead for known builtins, checking @f vs each of them
+ * and, on a match, invoking the builtin function directly. Fall back to the
+ * indirect call otherwise.
+ */
+#define INDIRECT_CALL_1(f, name, ...)  \
+   ({  \
+   f == name ## 1 ? name ## 1(__VA_ARGS__) :   \
+f(__VA_ARGS__);\
+   })
+#define INDIRECT_CALL_2(f, name, ...)  \
+   ({  \
+   f == name ## 2 ? name ## 2(__VA_ARGS__) :   \
+INDIRECT_CALL_1(f, name, __VA_ARGS__); \
+   })
+
+/*
+ * INDIRECT_CALLABLE_DECLARE_$NR - declare $NR known builtin for
+ * INDIRECT_CALL_$NR usage
+ *  @type: return type for the builtin function
+ *  @name: base name for builtin functions, the full list is generated appending
+ *the numbers in the 1..@NR range
+ *  @__VA_ARGS__: arguments type list for the builtin function
+ *
+ * Builtin with higher $NR will be checked first by INDIRECT_CALL_$NR
+ */
+#define INDIRECT_CALLABLE_DECLARE_1(type, name, ...)   \
+   extern type name ## 1(__VA_ARGS__)
+#define INDIRECT_CALLABLE_DECLARE_2(type, name, ...)   \
+   extern type name ## 2(__VA_ARGS__); \
+   INDIRECT_CALLABLE_DECLARE_1(type, name, __VA_ARGS__)
+
+/*
+ * INDIRECT_CALLABLE - allow usage of a builtin function from INDIRECT_CALL_$NR
+ *  @f: builtin function name
+ *  @nr: id associated with this builtin, higher values will be checked first by
+ *  INDIRECT_CALL_$NR
+ *  @type: function return type
+ *  @name: base name used by INDIRECT_CALL_ to access the builtin list
+ *  @__VA_ARGS__: arguments type list for @f
+ */
+#define INDIRECT_CALLABLE(f, nr, type, name, ...)  \
+   __alias(f) type name ## nr(__VA_ARGS__)
+
+#else
+#define INDIRECT_CALL_1(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALL_2(f, name, ...) f(__VA_ARGS__)
+#define INDIRECT_CALLABLE_DECLARE_1(type, name, ...)
+#define INDIRECT_CALLABLE_DECLARE_2(type, name, ...)
+#define INDIRECT_CALLABLE(f, nr, type, name, ...)
+#endif
+
+/*
+ * We can use INDIRECT_CALL_$NR for ipv6 related functions only if ipv6 is
+ * builtin, this macro simplify dealing with indirect calls with only ipv4/ipv6
+ * alternatives
+ */
+#if IS_BUILTIN(CONFIG_IPV6)
+#define INDIRECT_CALL_INET INDIRECT_CALL_2
+#elif IS_ENABLED(CONFIG_INET)
+#define INDIRECT_CALL_INET INDIRECT_CALL_1
+#else
+#define INDIRECT_CALL_INET(...)
+#endif
+
+#endif
-- 
2.19.2
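The dispatch pattern implemented by the macros above can be illustrated with a small user-space sketch (hypothetical handler names; the kernel version additionally hides the declarations behind CONFIG_RETPOLINE and generates the builtin names with token pasting):

```c
#include <assert.h>

/* Two known "builtin" handlers the dispatcher can compare against. */
static int name1(int x) { return x + 1; }
static int name2(int x) { return x * 2; }

/*
 * Sketch of INDIRECT_CALL_1/INDIRECT_CALL_2: compare the function
 * pointer against each known builtin and call the builtin directly on
 * a match; fall back to the plain indirect call otherwise.
 */
#define INDIRECT_CALL_1(f, ...) \
	((f) == name1 ? name1(__VA_ARGS__) : (f)(__VA_ARGS__))
#define INDIRECT_CALL_2(f, ...) \
	((f) == name2 ? name2(__VA_ARGS__) : INDIRECT_CALL_1(f, __VA_ARGS__))
```

With retpolines enabled, the direct calls taken on a match avoid the retpoline thunk; only the final fallback still pays the indirect-call cost, which is why Eric suggests wrapping each comparison in likely().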



[RFC PATCH 0/4] net: mitigate retpoline overhead

2018-11-29 Thread Paolo Abeni
The spectre v2 counter-measures, aka retpolines, are a source of measurable
overhead[1]. We can partially address that when the function pointer refers to
a builtin symbol, resorting to a list of tests vs well-known builtin functions
and direct calls.

This may lead to some uglification around the indirect calls. In netconf 2018
Eric Dumazet described a technique to hide the most relevant part of the needed
boilerplate with some macro help.

This series is a [re-]implementation of such idea, exposing the introduced
helpers in a new header file. They are later leveraged to avoid the indirect
call overhead in the GRO path, when possible.

Overall this gives a > 10% performance improvement for a UDP GRO benchmark, and
a smaller but measurable one for a TCP syn flood.

The added infra can be used in follow-up patches to cope with retpoline overhead
in other points of the networking stack (e.g. at the qdisc layer) and possibly
even in other subsystems.

Paolo Abeni (4):
  indirect call wrappers: helpers to speed-up indirect calls of builtin
  net: use indirect call wrappers at GRO network layer
  net: use indirect call wrapper at GRO transport layer
  udp: use indirect call wrapper for GRO socket lookup

 include/linux/indirect_call_wrapper.h | 77 +++
 include/net/inet_common.h |  9 
 net/core/dev.c| 10 +++-
 net/ipv4/af_inet.c| 15 +-
 net/ipv4/tcp_offload.c|  5 ++
 net/ipv4/udp.c|  2 +
 net/ipv4/udp_offload.c| 11 +++-
 net/ipv6/ip6_offload.c| 14 -
 net/ipv6/tcpv6_offload.c  |  5 ++
 net/ipv6/udp.c|  2 +
 net/ipv6/udp_offload.c|  5 ++
 11 files changed, 147 insertions(+), 8 deletions(-)
 create mode 100644 include/linux/indirect_call_wrapper.h

-- 
2.19.2



Re: [PATCH] udp: Allow to defer reception until connect() happened

2018-11-29 Thread Eric Dumazet
On Thu, Nov 29, 2018 at 2:47 PM Christoph Paasch  wrote:

> Indeed, the UDP-stack is not fully 4-tuple ready.
>
>
> What are your thoughts on getting it there?

This would require an additional lookup, and heavy-duty servers using
non-connected sockets
would pay the price of an extra lookup for each incoming packet.

DNS servers and QUIC servers would not like that, since they make better
use of a single (unconnected) UDP socket per cpu/thread.


>
> Should be doable by simply using a similar approach as TCP, no? Any caveats
> you see with that?
>
> Then, when one sets the "wait-for-connect"-flag we would add the socket to
> the hash-table only at connect()-time also addressing the cache-line miss
> you mentioned above.

Sure.


Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Christoph Paasch
On 29/11/18 - 14:44:44, Eric Dumazet wrote:
> 
> 
> On 11/29/2018 02:27 PM, Christoph Paasch wrote:
> > There are places in the stack, where we access skb->prev directly and
> > modify it. Namely, __qdisc_drop_all().
> > 
> > With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
> > the skb-list handling has been changed to set skb->next to NULL and set
> > the list-poison on skb->prev.
> > 
> > With that change, __qdisc_drop_all() will panic when it tries to
> > dereference skb->prev.
> > 
> > Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
> > __list_del_entry is used, leaving skb->prev unchanged (thus,
> > pointing to the list-head if it's the first skb of the list).
> > This will make __qdisc_drop_all modify the next-pointer of the list-head
> > and result in a panic later on:
> > 
> > [   34.501053] general protection fault:  [#1] SMP KASAN PTI
> > [   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp 
> > #108
> > [   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 0.5.1 01/01/2011
> > [   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
> > [   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 
> > 24 24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 
> > <42> 0f b6 04 30 84 c0 74 08 3c 04
> > [   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
> > [   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 
> > 111072e165a6
> > [   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 
> > 0038
> > [   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 
> > 0062
> > [   34.510346] R10: 0034 R11:  R12: 
> > 
> > [   34.511215] R13:  R14: dc00 R15: 
> > 8883ac338008
> > [   34.512082] FS:  () GS:8883af50() 
> > knlGS:
> > [   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
> > [   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 
> > 06e0
> > [   34.514593] Call Trace:
> > [   34.514893]  
> > [   34.515157]  napi_gro_receive+0x93/0x150
> > [   34.515632]  receive_buf+0x893/0x3700
> > [   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
> > [   34.516629]  ? virtnet_probe+0x1b40/0x1b40
> > [   34.517153]  ? __stable_node_chain+0x4d0/0x850
> > [   34.517684]  ? kfree+0x9a/0x180
> > [   34.518067]  ? __kasan_slab_free+0x171/0x190
> > [   34.518582]  ? detach_buf+0x1df/0x650
> > [   34.519061]  ? lapic_next_event+0x5a/0x90
> > [   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
> > [   34.520093]  virtnet_poll+0x2df/0xd60
> > [   34.520533]  ? receive_buf+0x3700/0x3700
> > [   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
> > [   34.521631]  ? htb_dequeue+0x1817/0x25f0
> > [   34.522107]  ? sch_direct_xmit+0x142/0xf30
> > [   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
> > [   34.523155]  net_rx_action+0x2f6/0xc50
> > [   34.523601]  ? napi_complete_done+0x2f0/0x2f0
> > [   34.524126]  ? kasan_check_read+0x11/0x20
> > [   34.524608]  ? _raw_spin_lock+0x7d/0xd0
> > [   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
> > [   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
> > [   34.526130]  ? apic_ack_irq+0x9e/0xe0
> > [   34.526567]  __do_softirq+0x188/0x4b5
> > [   34.527015]  irq_exit+0x151/0x180
> > [   34.527417]  do_IRQ+0xdb/0x150
> > [   34.527783]  common_interrupt+0xf/0xf
> > [   34.528223]  
> > 
> > This patch makes sure that skb->prev is also set to NULL when removing
> > it from the list.
> > 
> > The bug is in v4.19.x as well, but the patch can't be backported easily.
> > I can post a follow-up for that.
> > 
> > Cc: Prashant Bhole 
> > Cc: Tyler Hicks 
> > Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
> > Signed-off-by: Christoph Paasch 
> > ---
> >  include/linux/skbuff.h | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 0d1b2c3f127b..3bb3bfd390eb 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -1373,6 +1373,7 @@ static inline void skb_zcopy_abort(struct sk_buff 
> > *skb)
> >  static inline void skb_mark_not_on_list(struct sk_buff *skb)
> >  {
> > skb->next = NULL;
> > +   skb->prev = NULL;
> >  }
> >  
> >  static inline void skb_list_del_init(struct sk_buff *skb)
> > 
> 
> skb_mark_not_on_list() is used in many place where we do not care of skb->prev
> 
> What about fixing netem instead ?

Yes, I have been looking at that and Alexey's patch which introduced the
access to skb->prev (cfr.: https://patchwork.ozlabs.org/patch/880717/).

But then I thought that setting skb->prev to NULL is a less risky approach for
-stable.


How would you go about fixing netem instead?

Because, from what I see we basically can enter netem_enqueue here with two
different "types" of skb's. The ones where skb->prev points to the tail of
the list of

Re: [PATCH net-next 0/2] Add mii_bus to ixgbe driver for dsa devs

2018-11-29 Thread Jeff Kirsher
On Thu, 2018-11-29 at 22:19 +, Steve Douthit wrote:
> On 11/29/18 2:03 PM, Jeff Kirsher wrote:
> > On Thu, 2018-11-29 at 18:54 +, Steve Douthit wrote:
> > > Most dsa devices currently expect to get a pointer to a mii_bus from
> > > platform data of device tree information.
> > > 
> > > The first patch exposes a mii_bus for ixgbe devices.
> > > 
> > > Currently the ixgbe driver only allows accesses to the probed PHY
> > > address via ioctls. The second patch changes that behavior to allow
> > > access to all addresses.  This should be useful for doing switch
> > > peeks &
> > > pokes from userspace for development, test, or if there's no dsa
> > > driver
> > > for a particular switch yet.
> > > 
> > > I've tested the clause 22 portion of this code on a board that has an
> > > Intel C3xxx series SoC connected to a Marvell Peridot switch.  I
> > > don't
> > > have any clause 45 devices or other ixgbe devices to test with.
> > > 
> > > Stephen Douthit (2):
> > >ixgbe: register a mdiobus
> > >ixgbe: use mii_bus to handle MII related ioctls
> > > 
> > >   drivers/net/ethernet/intel/ixgbe/ixgbe.h  |   2 +
> > >   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  35 ++--
> > >   drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c  | 192
> > > ++
> > >   drivers/net/ethernet/intel/ixgbe/ixgbe_phy.h  |   2 +
> > >   4 files changed, 216 insertions(+), 15 deletions(-)
> > > 
> > 
> > I will add these changes to my queue, please remember to send patches
> > against the drivers in drivers/net/ethernet/intel/* to
> > intel-wired-...@lists.osuosl.org  mailing list as well, so that our
> > patchwork project can track the patches.
> 
> Sorry about that, I'll add it to the CC list for the next rev.
> 
> I started adding another board to the platform driver that needs this
> code and found a bug.  The new board has more lanes enabled via soft
> straps.  That caused sysfs to complain because I ended up with duplicate
> bus IDs from this code:
> 
> + /* Use the position of the device in the PCI hierarchy as the id */
> + snprintf(bus->id, MII_BUS_ID_SIZE, "%s-mdio-%s", ixgbe_driver_name,
> +  pci_name(pdev->bus->self));
> 
> Please drop this series for now.

Ok, it has been dropped and I will wait for v2.

> Thinking about this some more I probably need to ensure that only a
> single mii_bus is registered out of all the discovered x550em_a devices.
> More generally there's not always going to be a 1:1 mapping between MACs
> and MIIs on these devices and the code should handle that.
> 
> I think for the next rev I'll only register a single mii_bus for the
> lowest numbered bus:device.function x500em_a device, and skip mii_bus
> setup entirely for other device IDs.
> 
> Let me know if there's another/better way to tackle that.

I will bring this up with our 10GbE driver lead, he may have ideas.  Any
patches sent to intel-wired-...@lists.osuosl.org mailing list will be seen
by our 10GbE driver developers, so they can provide feedback.




Re: [PATCH] udp: Allow to defer reception until connect() happened

2018-11-29 Thread Christoph Paasch
Hello,

On 28/11/18 - 19:15:12, Eric Dumazet wrote:
> On Wed, Nov 28, 2018 at 7:09 PM Jana Iyengar  wrote:
> >
> >
> > On Wed, Nov 28, 2018 at 6:19 PM Eric Dumazet  wrote:
> >>
> >> On Wed, Nov 28, 2018 at 5:57 PM Christoph Paasch  wrote:
> >> >
> >> > There are use-cases where a host wants to use a UDP socket with a
> >> > specific 4-tuple. The way to do this is to bind() and then connect() the
> >> > socket. However, after the bind(), the socket starts receiving data even
> >> > if it does not match the intended 4-tuple. That is because after the
> >> > bind() UDP-socket will match in the lookup for all incoming UDP-traffic
> >> > that has the specific IP/port.
> >> >
> >> > This patch prevents any incoming traffic until the connect() system-call
> >> > is called whenever the app sets the UDP socket-option
> >> > UDP_WAIT_FOR_CONNECT.
> >>
> >> Please do not add something that could mislead applications writers to
> >> think UDP stack can scale.
> >
> >
> >> UDP stack does not have a full hash on 4-tuples, it means that
> >> incoming traffic on a 'shared port' has
> >> to scan a list of XXX sockets to find the best match ...
> >
> >
> >> Also you add another cache line miss in UDP lookup to access
> >> udp_sk()->wait_for_connect.

Fair enough. We could add the socket later to the hash (see below).

> >>
> >> recvfrom() can be used to filter whatever frames came before the
> >> connect()
> >
> >
> > I don't think I understand that argument -- connect() is supported for UDP 
> > sockets, and UDP sockets are being used for serving QUIC traffic. Are you 
> > suggesting that connect() never be used?
> 
> If the source port is not shared, Christoph patch is not needed.
> 
> If it is shared, then a whole can of worms is opened.
> 
> Trying to hack UDP stack while it is not fully 4-tuple ready is not
> going to fly.

Indeed, the UDP-stack is not fully 4-tuple ready.


What are your thoughts on getting it there?

Should be doable by simply using a similar approach as TCP, no? Any caveats
you see with that?

Then, when one sets the "wait-for-connect"-flag we would add the socket to
the hash-table only at connect()-time also addressing the cache-line miss
you mentioned above.


Thanks,
Christoph




Re: [RFC PATCH net] net: phy: fix the issue that netif always links up after resuming

2018-11-29 Thread Heiner Kallweit
On 29.11.2018 09:12, Kunihiko Hayashi wrote:
> Even though the link is down before entering hibernation,
> there is an issue that the network interface always links up after resuming
> from hibernation.
> 
> The phydev->state is PHY_READY before enabling the network interface, so
> the link is down. After resuming from hibernation, the phydev->state is
> forcibly set to PHY_UP in mdio_bus_phy_restore(), and the link becomes up.
> 
> This patch expects to solve the issue by changing phydev->state to PHY_UP
> only when the link is up.
> 
> Signed-off-by: Kunihiko Hayashi 
> ---
>  drivers/net/phy/phy_device.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index ab33d17..d5bba0f 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -309,8 +309,10 @@ static int mdio_bus_phy_restore(struct device *dev)
>   return ret;
>  
>   /* The PHY needs to renegotiate. */
> - phydev->link = 0;
> - phydev->state = PHY_UP;
> + if (phydev->link) {
> + phydev->link = 0;
> + phydev->state = PHY_UP;
> + }
>  
Thanks for reporting. I agree that it isn't right to unconditionally set
PHY_UP, because we don't know whether the PHY was started before
hibernation. However I don't think using phydev->link as the criterion is
right. An example would be: the PHY was started before hibernation, but w/o link.
In this case we want to set PHY_UP to start an aneg, because a cable may
have been plugged in whilst system was sleeping.

So I think, similar to phy_stop_machine, we should use state >= UP and
state != HALTED as the criterion, and phy_start_machine() would also need
to be called only if this criterion is met.

It may make sense to add a helper for checking whether PHY is in a
started state (>=UP && !=HALTED), because we need this in more than
one place.

>   phy_start_machine(phydev);
>  
> 



Re: [PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Eric Dumazet



On 11/29/2018 02:27 PM, Christoph Paasch wrote:
> There are places in the stack, where we access skb->prev directly and
> modify it. Namely, __qdisc_drop_all().
> 
> With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
> the skb-list handling has been changed to set skb->next to NULL and set
> the list-poison on skb->prev.
> 
> With that change, __qdisc_drop_all() will panic when it tries to
> dereference skb->prev.
> 
> Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
> __list_del_entry is used, leaving skb->prev unchanged (thus,
> pointing to the list-head if it's the first skb of the list).
> This will make __qdisc_drop_all modify the next-pointer of the list-head
> and result in a panic later on:
> 
> [   34.501053] general protection fault:  [#1] SMP KASAN PTI
> [   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp #108
> [   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 0.5.1 01/01/2011
> [   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
> [   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 24 
> 24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 <42> 0f 
> b6 04 30 84 c0 74 08 3c 04
> [   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
> [   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 
> 111072e165a6
> [   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 
> 0038
> [   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 
> 0062
> [   34.510346] R10: 0034 R11:  R12: 
> 
> [   34.511215] R13:  R14: dc00 R15: 
> 8883ac338008
> [   34.512082] FS:  () GS:8883af50() 
> knlGS:
> [   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
> [   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 
> 06e0
> [   34.514593] Call Trace:
> [   34.514893]  
> [   34.515157]  napi_gro_receive+0x93/0x150
> [   34.515632]  receive_buf+0x893/0x3700
> [   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
> [   34.516629]  ? virtnet_probe+0x1b40/0x1b40
> [   34.517153]  ? __stable_node_chain+0x4d0/0x850
> [   34.517684]  ? kfree+0x9a/0x180
> [   34.518067]  ? __kasan_slab_free+0x171/0x190
> [   34.518582]  ? detach_buf+0x1df/0x650
> [   34.519061]  ? lapic_next_event+0x5a/0x90
> [   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
> [   34.520093]  virtnet_poll+0x2df/0xd60
> [   34.520533]  ? receive_buf+0x3700/0x3700
> [   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
> [   34.521631]  ? htb_dequeue+0x1817/0x25f0
> [   34.522107]  ? sch_direct_xmit+0x142/0xf30
> [   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
> [   34.523155]  net_rx_action+0x2f6/0xc50
> [   34.523601]  ? napi_complete_done+0x2f0/0x2f0
> [   34.524126]  ? kasan_check_read+0x11/0x20
> [   34.524608]  ? _raw_spin_lock+0x7d/0xd0
> [   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
> [   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
> [   34.526130]  ? apic_ack_irq+0x9e/0xe0
> [   34.526567]  __do_softirq+0x188/0x4b5
> [   34.527015]  irq_exit+0x151/0x180
> [   34.527417]  do_IRQ+0xdb/0x150
> [   34.527783]  common_interrupt+0xf/0xf
> [   34.528223]  
> 
> This patch makes sure that skb->prev is also set to NULL when removing
> it from the list.
> 
> The bug is in v4.19.x as well, but the patch can't be backported easily.
> I can post a follow-up for that.
> 
> Cc: Prashant Bhole 
> Cc: Tyler Hicks 
> Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
> Signed-off-by: Christoph Paasch 
> ---
>  include/linux/skbuff.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 0d1b2c3f127b..3bb3bfd390eb 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -1373,6 +1373,7 @@ static inline void skb_zcopy_abort(struct sk_buff *skb)
>  static inline void skb_mark_not_on_list(struct sk_buff *skb)
>  {
>   skb->next = NULL;
> + skb->prev = NULL;
>  }
>  
>  static inline void skb_list_del_init(struct sk_buff *skb)
> 

skb_mark_not_on_list() is used in many place where we do not care of skb->prev

What about fixing netem instead ?




Re: [PATCH net] net/sched: act_police: fix memory leak in case of invalid control action

2018-11-29 Thread Cong Wang
On Wed, Nov 28, 2018 at 9:44 AM Davide Caratti  wrote:
>
> when users set an invalid control action, kmemleak complains as follows:
...
> change tcf_police_init() to avoid leaking 'new' in case TCA_POLICE_RESULT
> contains TC_ACT_GOTO_CHAIN extended action.
>
> Fixes: c08f5ed5d625 ("net/sched: act_police: disallow 'goto chain' on 
> fallback control action")
> Reported-by: Dan Carpenter 
> Signed-off-by: Davide Caratti 

Acked-by: Cong Wang 


[PATCH net] net: Prevent invalid access to skb->prev in __qdisc_drop_all

2018-11-29 Thread Christoph Paasch
There are places in the stack, where we access skb->prev directly and
modify it. Namely, __qdisc_drop_all().

With commit 68d2f84a1368 ("net: gro: properly remove skb from list")
the skb-list handling has been changed to set skb->next to NULL and set
the list-poison on skb->prev.

With that change, __qdisc_drop_all() will panic when it tries to
dereference skb->prev.

Since commit 992cba7e276d ("net: Add and use skb_list_del_init().")
__list_del_entry is used, leaving skb->prev unchanged (thus,
pointing to the list-head if it's the first skb of the list).
This will make __qdisc_drop_all modify the next-pointer of the list-head
and result in a panic later on:

[   34.501053] general protection fault:  [#1] SMP KASAN PTI
[   34.501968] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.20.0-rc2.mptcp #108
[   34.502887] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
0.5.1 01/01/2011
[   34.504074] RIP: 0010:dev_gro_receive+0x343/0x1f90
[   34.504751] Code: e0 48 c1 e8 03 42 80 3c 30 00 0f 85 4a 1c 00 00 4d 8b 24 
24 4c 39 65 d0 0f 84 0a 04 00 00 49 8d 7c 24 38 48 89 f8 48 c1 e8 03 <42> 0f b6 
04 30 84 c0 74 08 3c 04
[   34.507060] RSP: 0018:8883af507930 EFLAGS: 00010202
[   34.507761] RAX: 0007 RBX: 8883970b2c80 RCX: 111072e165a6
[   34.508640] RDX: 111075867008 RSI: 8883ac338040 RDI: 0038
[   34.509493] RBP: 8883af5079d0 R08: 8883970b2d40 R09: 0062
[   34.510346] R10: 0034 R11:  R12: 
[   34.511215] R13:  R14: dc00 R15: 8883ac338008
[   34.512082] FS:  () GS:8883af50() 
knlGS:
[   34.513036] CS:  0010 DS:  ES:  CR0: 80050033
[   34.513741] CR2: 55ccc3e9d020 CR3: 0003abf32000 CR4: 06e0
[   34.514593] Call Trace:
[   34.514893]  
[   34.515157]  napi_gro_receive+0x93/0x150
[   34.515632]  receive_buf+0x893/0x3700
[   34.516094]  ? __netif_receive_skb+0x1f/0x1a0
[   34.516629]  ? virtnet_probe+0x1b40/0x1b40
[   34.517153]  ? __stable_node_chain+0x4d0/0x850
[   34.517684]  ? kfree+0x9a/0x180
[   34.518067]  ? __kasan_slab_free+0x171/0x190
[   34.518582]  ? detach_buf+0x1df/0x650
[   34.519061]  ? lapic_next_event+0x5a/0x90
[   34.519539]  ? virtqueue_get_buf_ctx+0x280/0x7f0
[   34.520093]  virtnet_poll+0x2df/0xd60
[   34.520533]  ? receive_buf+0x3700/0x3700
[   34.521027]  ? qdisc_watchdog_schedule_ns+0xd5/0x140
[   34.521631]  ? htb_dequeue+0x1817/0x25f0
[   34.522107]  ? sch_direct_xmit+0x142/0xf30
[   34.522595]  ? virtqueue_napi_schedule+0x26/0x30
[   34.523155]  net_rx_action+0x2f6/0xc50
[   34.523601]  ? napi_complete_done+0x2f0/0x2f0
[   34.524126]  ? kasan_check_read+0x11/0x20
[   34.524608]  ? _raw_spin_lock+0x7d/0xd0
[   34.525070]  ? _raw_spin_lock_bh+0xd0/0xd0
[   34.525563]  ? kvm_guest_apic_eoi_write+0x6b/0x80
[   34.526130]  ? apic_ack_irq+0x9e/0xe0
[   34.526567]  __do_softirq+0x188/0x4b5
[   34.527015]  irq_exit+0x151/0x180
[   34.527417]  do_IRQ+0xdb/0x150
[   34.527783]  common_interrupt+0xf/0xf
[   34.528223]  

This patch makes sure that skb->prev is also set to NULL when removing
it from the list.

The bug is in v4.19.x as well, but the patch can't be backported easily.
I can post a follow-up for that.

Cc: Prashant Bhole 
Cc: Tyler Hicks 
Fixes: 68d2f84a1368 ("net: gro: properly remove skb from list")
Signed-off-by: Christoph Paasch 
---
 include/linux/skbuff.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0d1b2c3f127b..3bb3bfd390eb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1373,6 +1373,7 @@ static inline void skb_zcopy_abort(struct sk_buff *skb)
 static inline void skb_mark_not_on_list(struct sk_buff *skb)
 {
skb->next = NULL;
+   skb->prev = NULL;
 }
 
 static inline void skb_list_del_init(struct sk_buff *skb)
-- 
2.16.2



Re: [PATCH 1/3] net/x25: fix called/calling length calculation in x25_parse_address_block

2018-11-29 Thread David Miller
From: Martin Schiller 
Date: Tue, 27 Nov 2018 09:50:27 +0100

> The length of the called and calling address was not calculated
> correctly (BCD encoding).
> 
> Signed-off-by: Martin Schiller 

Applied.


Re: [PATCH 3/3] net/x25: handle call collisions

2018-11-29 Thread David Miller
From: Martin Schiller 
Date: Tue, 27 Nov 2018 09:50:29 +0100

> If a session in X25_STATE_1 (Awaiting Call Accept) receives a call
> request, the session will be closed (x25_disconnect), cause=0x01
> (Number Busy) and diag=0x48 (Call Collision) will be set and a clear
> request will be send.
> 
> Signed-off-by: Martin Schiller 

Applied.


Re: [PATCH 2/3] net/x25: fix null_x25_address handling

2018-11-29 Thread David Miller
From: Martin Schiller 
Date: Tue, 27 Nov 2018 09:50:28 +0100

>  o x25_find_listener(): the compare for the null_x25_address was wrong.
>We have to check the x25_addr of the listener socket instead of the
>x25_addr of the incoming call.
> 
>  o x25_bind(): it was not possible to bind a socket to null_x25_address
> 
> Signed-off-by: Martin Schiller 

Applied.


Re: [PATCH net-next 0/2] Add mii_bus to ixgbe driver for dsa devs

2018-11-29 Thread Steve Douthit
On 11/29/18 2:03 PM, Jeff Kirsher wrote:
> On Thu, 2018-11-29 at 18:54 +, Steve Douthit wrote:
>> Most dsa devices currently expect to get a pointer to a mii_bus from
>> platform data of device tree information.
>>
>> The first patch exposes a mii_bus for ixgbe devices.
>>
>> Currently the ixgbe driver only allows accesses to the probed PHY
>> address via ioctls. The second patch changes that behavior to allow
>> access to all addresses.  This should be useful for doing switch peeks &
>> pokes from userspace for development, test, or if there's no dsa driver
>> for a particular switch yet.
>>
>> I've tested the clause 22 portion of this code on a board that has an
>> Intel C3xxx series SoC connected to a Marvell Peridot switch.  I don't
>> have any clause 45 devices or other ixgbe devices to test with.
>>
>> Stephen Douthit (2):
>>ixgbe: register a mdiobus
>>ixgbe: use mii_bus to handle MII related ioctls
>>
>>   drivers/net/ethernet/intel/ixgbe/ixgbe.h  |   2 +
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  35 ++--
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_phy.c  | 192 ++
>>   drivers/net/ethernet/intel/ixgbe/ixgbe_phy.h  |   2 +
>>   4 files changed, 216 insertions(+), 15 deletions(-)
>>
> 
> I will add these changes to my queue, please remember to send patches
> against the drivers in drivers/net/ethernet/intel/* to
> intel-wired-...@lists.osuosl.org  mailing list as well, so that our
> patchwork project can track the patches.

Sorry about that, I'll add it to the CC list for the next rev.

I started adding another board to the platform driver that needs this
code and found a bug.  The new board has more lanes enabled via soft
straps.  That caused sysfs to complain because I ended up with duplicate
bus IDs from this code:

+   /* Use the position of the device in the PCI hierarchy as the id */
+   snprintf(bus->id, MII_BUS_ID_SIZE, "%s-mdio-%s", ixgbe_driver_name,
+pci_name(pdev->bus->self));

Please drop this series for now.

Thinking about this some more I probably need to ensure that only a
single mii_bus is registered out of all the discovered x550em_a devices.
More generally there's not always going to be a 1:1 mapping between MACs
and MIIs on these devices and the code should handle that.

I think for the next rev I'll only register a single mii_bus for the
lowest numbered bus:device.function x500em_a device, and skip mii_bus
setup entirely for other device IDs.

Let me know if there's another/better way to tackle that.


Re: [PATCH] mv88e6060: disable hardware level MAC learning

2018-11-29 Thread Andrew Lunn
On Thu, Nov 29, 2018 at 07:18:50PM -0200, Alacn wrote:
> >From ecc3afc357aeece71842d2d9e3f7ec63e2b4ab67 Mon Sep 17 00:00:00 2001
> From: Anderson Luiz Alves 
> Date: Thu, 29 Nov 2018 19:02:03 -0200
> Subject: [PATCH] mv88e6060: disable hardware level MAC learning
> 
> Disable hardware level MAC learning because it breaks station roaming.
> When enabled it drops all frames that arrive from a MAC address
> that is on a different port at learning table.
> 
> Signed-off-by: Anderson Luiz Alves 

Reviewed-by: Andrew Lunn 

Hi Anderson

A few process things.  

In the subject line, you should put v2, since this is the second
version.

Also, in the subject line, indicate which tree it is for. I would say
in this case, it is net, since it is a fix. So the subject should be

[PATCH v2 net] mv88e6060: disable hardware level MAC learning

Under the --- you should indicate what changed between each version.

Andrew


Re: [Patch net-next] net: explain __skb_checksum_complete() with comments

2018-11-29 Thread David Miller
From: Cong Wang 
Date: Mon, 26 Nov 2018 09:31:26 -0800

> Cc: Herbert Xu 
> Signed-off-by: Cong Wang 

Applied, thanks Cong.


Re: [RFC PATCH] net: mvpp2: fix detection of 10G SFP modules

2018-11-29 Thread Russell King - ARM Linux
On Thu, Nov 29, 2018 at 11:31:23AM -0800, Florian Fainelli wrote:
> 
> 
> On 11/29/2018 4:49 AM, Baruch Siach wrote:
> > The mvpp2_phylink_validate() relies on the interface field of
> > phylink_link_state to determine valid link modes. However, when called
> > from phylink_sfp_module_insert() this field in not initialized. The
> > default switch case then excludes 10G link modes. As a result, 10G SFP
> > modules that are detected correctly end up limited to a max rate of
> > 2.5G.
> > 
> > Catch the uninitialized PHY mode case, and allow 10G rates.
> > 
> > Cc: Maxime Chevallier 
> > Cc: Antoine Tenart 
> > Signed-off-by: Baruch Siach 
> > ---
> > Is that the right fix?
> 
> It would be a bit surprising if this were the right fix; you would
> expect validate to be called once everything has been parsed
> successfully from the SFP. Is that not the case here? If not, can you
> find out what happens?

Two calls are made - the first with PHY_INTERFACE_MODE_NA to
determine what the advertising link mode may be, and then again
once the interface mode has been selected from the advertising mask.

Why?

Consider a 4.3Gbps Fibre Channel SFP plugged into a 1G-only MAC.
If we did it as a single pass, we would end up passing an
interface mode of 2500BASEX first time around which is illogical.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


Re: [Patch net] mlx5: fixup checksum for ethernet padding

2018-11-29 Thread kbuild test robot
Hi Cong,

I love your patch! Perhaps something to improve:

[auto build test WARNING on net/master]

url:
https://github.com/0day-ci/linux/commits/Cong-Wang/mlx5-fixup-checksum-for-ethernet-padding/20181130-014928
config: ia64-allmodconfig (attached as .config)
compiler: ia64-linux-gcc (GCC) 8.1.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=8.1.0 make.cross ARCH=ia64 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c: In function 
'mlx5e_csum_padding':
>> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:740:6: warning: 'pad_offset' 
>> may be used uninitialized in this function [-Wmaybe-uninitialized]
 u32 pad_offset, pad_len;
 ^~

vim +/pad_offset +740 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c

   734  
   735  static void mlx5e_csum_padding(struct sk_buff *skb, int network_depth,
   736 __be16 proto, bool has_fcs)
   737  {
   738  u32 frame_len = has_fcs ? skb->len - ETH_FCS_LEN : skb->len;
   739  void *ip_p = skb->data + network_depth;
 > 740  u32 pad_offset, pad_len;
   741  void *pad;
   742  
   743  if (likely(frame_len > ETH_ZLEN))
   744  return;
   745  
   746  if (proto == htons(ETH_P_IP)) {
   747  struct iphdr *ipv4 = ip_p;
   748  
   749  pad_offset =  network_depth + 
be16_to_cpu(ipv4->tot_len);
   750  } else if (proto == htons(ETH_P_IPV6)) {
   751  struct ipv6hdr *ipv6 = ip_p;
   752  
   753  pad_offset = network_depth + sizeof(struct ipv6hdr) +
   754   be16_to_cpu(ipv6->payload_len);
   755  }
   756  
   757  pad = skb->data + pad_offset;
   758  pad_len = frame_len - pad_offset;
   759  
   760  skb->csum = csum_block_add(skb->csum, csum_partial(pad, 
pad_len, 0),
   761 pad_offset);
   762  }
   763  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




Re: [PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Eric Dumazet



On 11/29/2018 01:13 PM, Duyck, Alexander H wrote:

> Instead of just checking for the max it might make more sense to do a
> check using skb_is_gso, and then if true use gso_segs, otherwise just
> default to 1.
> 
> Also your bytes are going to be totally messed up as well since the
> headers are trimmed in the GSO frames. It might be worthwhile to just
> have a branch based on skb_is_gso that sets the packets and bytes based
> on the GSO values, and one that sets it for the default case.
> 
Note that __dev_queue_xmit() is already doing that, no need
to re-implement in each driver.

It calls qdisc_pkt_len_init(), meaning that drivers can use 
qdisc_skb_cb(skb)->pkt_len instead of skb->len

(Qdisc layers should not modify qdisc_skb_cb(skb)->pkt_len )


Re: [PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Duyck, Alexander H
On Thu, 2018-11-29 at 15:58 -0500, Debabrata Banerjee wrote:
> Fix packet count when using vlan/macvlan drivers with gso. Without this it
> is not possible to reconcile packet counts between underlying devices and
> these virtual devices. Additionally, the output looks wrong in a standalone
> way i.e. device MTU of 1500, 1 packet sent, 31856 bytes sent.
> 
> There are many other drivers that likely have a similar problem, although
> it is not clear how many of those could be used with gso. Perhaps all
> packet counting should be wrapped in a helper fn.
> 
> Signed-off-by: Debabrata Banerjee 
> ---
>  drivers/net/macvlan.c | 2 +-
>  net/8021q/vlan_core.c | 2 +-
>  net/8021q/vlan_dev.c  | 2 +-
>  3 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
> index fc8d5f1ee1ad..15e67a87f202 100644
> --- a/drivers/net/macvlan.c
> +++ b/drivers/net/macvlan.c
> @@ -566,7 +566,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
>  
>   pcpu_stats = this_cpu_ptr(vlan->pcpu_stats);
>   u64_stats_update_begin(&pcpu_stats->syncp);
> - pcpu_stats->tx_packets++;
> + pcpu_stats->tx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
>   pcpu_stats->tx_bytes += len;
>   u64_stats_update_end(&pcpu_stats->syncp);
>   } else {
> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
> index a313165e7a67..e85f6665d0ed 100644
> --- a/net/8021q/vlan_core.c
> +++ b/net/8021q/vlan_core.c
> @@ -62,7 +62,7 @@ bool vlan_do_receive(struct sk_buff **skbp)
>   rx_stats = this_cpu_ptr(vlan_dev_priv(vlan_dev)->vlan_pcpu_stats);
>  
>   u64_stats_update_begin(&rx_stats->syncp);
> - rx_stats->rx_packets++;
> + rx_stats->rx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
>   rx_stats->rx_bytes += skb->len;
>   if (skb->pkt_type == PACKET_MULTICAST)
>   rx_stats->rx_multicast++;
> diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
> index b2d9c8f27cd7..b28e7535a0b9 100644
> --- a/net/8021q/vlan_dev.c
> +++ b/net/8021q/vlan_dev.c
> @@ -135,7 +135,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct 
> sk_buff *skb,
>  
>   stats = this_cpu_ptr(vlan->vlan_pcpu_stats);
>   u64_stats_update_begin(&stats->syncp);
> - stats->tx_packets++;
> + stats->tx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
>   stats->tx_bytes += len;
>   u64_stats_update_end(&stats->syncp);
>   } else {

Instead of just checking for the max it might make more sense to do a
check using skb_is_gso, and then if true use gso_segs, otherwise just
default to 1.

Also your bytes are going to be totally messed up as well since the
headers are trimmed in the GSO frames. It might be worthwhile to just
have a branch based on skb_is_gso that sets the packets and bytes based
on the GSO values, and one that sets it for the default case.
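
The suggested accounting branch can be sketched in userspace; `struct fake_skb`, `is_gso` and `gso_segs` here are stand-ins for skb_is_gso(skb) and skb_shinfo(skb)->gso_segs, for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the few skb fields the suggestion touches. */
struct fake_skb {
	int is_gso;		/* models skb_is_gso(skb) */
	uint16_t gso_segs;	/* models skb_shinfo(skb)->gso_segs */
};

/* Branch on GSO as suggested: a GSO frame accounts for each segment it
 * will become on the wire, a plain frame counts as one packet. */
static unsigned int tx_packet_increment(const struct fake_skb *skb)
{
	if (skb->is_gso)
		return skb->gso_segs;
	return 1;
}
```

The same branch would also pick the right byte count for GSO frames (adding back the replicated headers) rather than charging only skb->len.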




Re: [PATCH bpf-next] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Jakub Kicinski
On Thu, 29 Nov 2018 12:38:03 -0800, Yonghong Song wrote:
> diff --git a/tools/lib/bpf/libbpf_errno.c b/tools/lib/bpf/libbpf_errno.c
> index d83b17f8435c..286e497c50ec 100644
> --- a/tools/lib/bpf/libbpf_errno.c
> +++ b/tools/lib/bpf/libbpf_errno.c
> @@ -40,9 +40,19 @@ int libbpf_strerror(int err, char *buf, size_t size)
>   err = err > 0 ? err : -err;
>  
>   if (err < __LIBBPF_ERRNO__START) {
> +#ifdef _GNU_SOURCE
> + const char *ret_buf;
> +#endif
>   int ret;
>  
> +#ifdef _GNU_SOURCE
> + ret_buf = strerror_r(err, buf, size);
> + if (ret_buf != buf)
> + snprintf(buf, size, "%s", ret_buf);
> + ret = 0;
> +#else
>   ret = strerror_r(err, buf, size);
> +#endif
>   buf[size - 1] = '\0';
>   return ret;
>   }

That is kinda strange, the whole point for this file was to have
non-GNU strerror_r, would doing #undef _GNU_SOURCE at the top not work?
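
For reference, a minimal userspace sketch of the #undef approach — forcing the XSI strerror_r() regardless of what the surrounding build defines. This assumes glibc's feature-test-macro behavior, and the macro lines must come before any #include:

```c
/* Force the XSI (int-returning) strerror_r() even if the build system
 * passes -D_GNU_SOURCE; these lines must precede all includes. */
#undef _GNU_SOURCE
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <errno.h>
#include <string.h>

/* Illustrative wrapper, modeled on libbpf_strerror()'s error branch. */
static int xsi_strerror(int err, char *buf, size_t size)
{
	int ret = strerror_r(err, buf, size);	/* XSI: 0 on success */

	buf[size - 1] = '\0';
	return ret;
}
```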


[PATCH RFC net-next] net: vlan/macvlan: count packets properly with gso

2018-11-29 Thread Debabrata Banerjee
Fix packet count when using vlan/macvlan drivers with gso. Without this it
is not possible to reconcile packet counts between underlying devices and
these virtual devices. Additionally, the output looks wrong even on its own,
i.e. a device with an MTU of 1500 reporting 1 packet sent but 31856 bytes sent.

There are many other drivers that likely have a similar problem, although
it is not clear how many of those could be used with gso. Perhaps all
packet counting should be wrapped in a helper fn.

Signed-off-by: Debabrata Banerjee 
---
 drivers/net/macvlan.c | 2 +-
 net/8021q/vlan_core.c | 2 +-
 net/8021q/vlan_dev.c  | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index fc8d5f1ee1ad..15e67a87f202 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -566,7 +566,7 @@ static netdev_tx_t macvlan_start_xmit(struct sk_buff *skb,
 
pcpu_stats = this_cpu_ptr(vlan->pcpu_stats);
u64_stats_update_begin(&pcpu_stats->syncp);
-   pcpu_stats->tx_packets++;
+   pcpu_stats->tx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
pcpu_stats->tx_bytes += len;
u64_stats_update_end(&pcpu_stats->syncp);
} else {
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index a313165e7a67..e85f6665d0ed 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -62,7 +62,7 @@ bool vlan_do_receive(struct sk_buff **skbp)
rx_stats = this_cpu_ptr(vlan_dev_priv(vlan_dev)->vlan_pcpu_stats);
 
u64_stats_update_begin(&rx_stats->syncp);
-   rx_stats->rx_packets++;
+   rx_stats->rx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
rx_stats->rx_bytes += skb->len;
if (skb->pkt_type == PACKET_MULTICAST)
rx_stats->rx_multicast++;
diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c
index b2d9c8f27cd7..b28e7535a0b9 100644
--- a/net/8021q/vlan_dev.c
+++ b/net/8021q/vlan_dev.c
@@ -135,7 +135,7 @@ static netdev_tx_t vlan_dev_hard_start_xmit(struct sk_buff 
*skb,
 
stats = this_cpu_ptr(vlan->vlan_pcpu_stats);
u64_stats_update_begin(&stats->syncp);
-   stats->tx_packets++;
+   stats->tx_packets += max_t(u16, 1, skb_shinfo(skb)->gso_segs);
stats->tx_bytes += len;
u64_stats_update_end(&stats->syncp);
} else {
-- 
2.19.2



Re: [PATCHv2 net] sctp: hold transport before accessing its asoc in sctp_epaddr_lookup_transport

2018-11-29 Thread Neil Horman
On Thu, Nov 29, 2018 at 02:44:07PM +0800, Xin Long wrote:
> Without holding transport to dereference its asoc, a use after
> free panic can be caused in sctp_epaddr_lookup_transport. Note
> that a sock lock can't protect these transports that belong to
> other socks.
> 
> A similar fix as Commit bab1be79a516 ("sctp: hold transport
> before accessing its asoc in sctp_transport_get_next") is
> needed to hold the transport before accessing its asoc in
> sctp_epaddr_lookup_transport.
> 
> Note that this extra atomic operation is on the datapath,
> but as rhlist keeps the lists to a small size, it won't
> see a noticeable performance hit.
> 
> v1->v2:
>   - improve the changelog.
> 
> Fixes: 7fda702f9315 ("sctp: use new rhlist interface on sctp transport rhashtable")
> Reported-by: syzbot+aad231d51b1923158...@syzkaller.appspotmail.com
> Signed-off-by: Xin Long 
> ---
>  net/sctp/input.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/input.c b/net/sctp/input.c
> index 5c36a99..ce7351c 100644
> --- a/net/sctp/input.c
> +++ b/net/sctp/input.c
> @@ -967,9 +967,15 @@ struct sctp_transport *sctp_epaddr_lookup_transport(
>   list = rhltable_lookup(&sctp_transport_hashtable, &arg,
>  sctp_hash_params);
>  
> - rhl_for_each_entry_rcu(t, tmp, list, node)
> - if (ep == t->asoc->ep)
> + rhl_for_each_entry_rcu(t, tmp, list, node) {
> + if (!sctp_transport_hold(t))
> + continue;
> + if (ep == t->asoc->ep) {
> + sctp_transport_put(t);
>   return t;
> + }
> + sctp_transport_put(t);
> + }
>  
>   return NULL;
>  }

Wait a second, what if we just added an rcu_head to the association structure
and changed the kfree call in sctp_association_destroy to a kfree_rcu call
instead?  That would force the actual freeing of the association to pass through
a grace period, during which any in-flight list traversal in
sctp_epaddr_lookup_transport could complete safely.  It's another two pointers'
worth of space in the association, but I think that would be a worthwhile
tradeoff for not having to do N atomic adds/puts every time you wanted to
receive or send a frame.

Neil

> -- 
> 2.1.0
> 
> 


Re: [PATCH net-next,v4 05/12] flow_offload: add statistics retrieval infrastructure and use it

2018-11-29 Thread Jakub Kicinski
On Thu, 29 Nov 2018 03:22:24 +0100, Pablo Neira Ayuso wrote:
> This patch provides the flow_stats structure that acts as container for
> tc_cls_flower_offload, then we can use to restore the statistics on the
> existing TC actions. Hence, tcf_exts_stats_update() is not used from
> drivers anymore.
> 
> Signed-off-by: Pablo Neira Ayuso 

> diff --git a/include/net/flow_offload.h b/include/net/flow_offload.h
> index dabc819b6cc9..040c092c000a 100644
> --- a/include/net/flow_offload.h
> +++ b/include/net/flow_offload.h
> @@ -179,4 +179,18 @@ static inline bool flow_rule_match_key(const struct 
> flow_rule *rule,
>   return dissector_uses_key(rule->match.dissector, key);
>  }
>  
> +struct flow_stats {
> + u64 pkts;
> + u64 bytes;
> + u64 lastused;
> +};
> +
> +static inline void flow_stats_update(struct flow_stats *flow_stats,
> +  u64 pkts, u64 bytes, u64 lastused)
> +{
> + flow_stats->pkts= pkts;
> + flow_stats->bytes   = bytes;
> + flow_stats->lastused= lastused;
> +}
> +
>  #endif /* _NET_FLOW_OFFLOAD_H */
> diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
> index abb035f84321..a08c06e383db 100644
> --- a/include/net/pkt_cls.h
> +++ b/include/net/pkt_cls.h
> @@ -767,6 +767,7 @@ struct tc_cls_flower_offload {
>   enum tc_fl_command command;
>   unsigned long cookie;
>   struct flow_rule *rule;
> + struct flow_stats stats;
>   struct tcf_exts *exts;
>   u32 classid;
>  };
> diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
> index 8898943b8ee6..b88cf29aff7b 100644
> --- a/net/sched/cls_flower.c
> +++ b/net/sched/cls_flower.c
> @@ -432,6 +432,10 @@ static void fl_hw_update_stats(struct tcf_proto *tp, 
> struct cls_fl_filter *f)
>  
>   tc_setup_cb_call(block, &f->exts, TC_SETUP_CLSFLOWER,
>&cls_flower, false);
> +
> + tcf_exts_stats_update(&f->exts, cls_flower.stats.bytes,
> +   cls_flower.stats.pkts,
> +   cls_flower.stats.lastused);
>  }
>  
>  static bool __fl_delete(struct tcf_proto *tp, struct cls_fl_filter *f,

Apart from the bug Venkat mentioned, I think this patch exposes a
potentially strange and unexpected TC-ism through to the abstracted API.
The stats for TC store the increment per update cycle, so flow->stats
will be equal to the diff between successive TC_CLSFLOWER_STATS calls.

Is this behaviour going to be subsystem-specific?  Looks not-so-clean.
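
The concern is easy to model in userspace: flow_stats_update() as posted simply stores whatever values it is handed, so if TC hands over per-call deltas, any consumer wanting running totals must accumulate them itself. `struct flow_stats` and `flow_stats_update()` below are copied from the quoted patch; the `accumulate()` helper is a hypothetical consumer added for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* flow_stats and flow_stats_update() as in the quoted patch. */
struct flow_stats {
	uint64_t pkts;
	uint64_t bytes;
	uint64_t lastused;
};

static inline void flow_stats_update(struct flow_stats *fs,
				     uint64_t pkts, uint64_t bytes,
				     uint64_t lastused)
{
	fs->pkts	= pkts;
	fs->bytes	= bytes;
	fs->lastused	= lastused;
}

/* Hypothetical consumer: because TC semantics deliver a delta per
 * TC_CLSFLOWER_STATS call, a cumulative view must sum the updates. */
static void accumulate(struct flow_stats *total,
		       const struct flow_stats *delta)
{
	total->pkts  += delta->pkts;
	total->bytes += delta->bytes;
	if (delta->lastused > total->lastused)
		total->lastused = delta->lastused;
}
```

A subsystem with cumulative counters would instead overwrite rather than sum, which is exactly the per-subsystem divergence being questioned.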


[PATCH bpf-next] tools/bpf: make libbpf _GNU_SOURCE friendly

2018-11-29 Thread Yonghong Song
During porting libbpf to bcc, I got some warnings like below:
  ...
  [  2%] Building C object src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf.c.o
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf.c:12:0:
  warning: "_GNU_SOURCE" redefined [enabled by default]
   #define _GNU_SOURCE
  ...
  [  3%] Building C object src/cc/CMakeFiles/bpf-shared.dir/libbpf/src/libbpf_errno.c.o
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c: In function ‘libbpf_strerror’:
  /home/yhs/work/bcc2/src/cc/libbpf/src/libbpf_errno.c:45:7:
  warning: assignment makes integer from pointer without a cast [enabled by default]
 ret = strerror_r(err, buf, size);
  ...

bcc is built with _GNU_SOURCE defined, and this caused the above warnings.
This patch intends to make libbpf _GNU_SOURCE friendly by:
  . defining _GNU_SOURCE only if it is not already defined
  . handling the strerror_r return value differently when _GNU_SOURCE is
defined, since the GNU version of strerror_r has a different return value
than the non-GNU version.

Signed-off-by: Yonghong Song 
---
 tools/lib/bpf/libbpf.c   |  2 ++
 tools/lib/bpf/libbpf_errno.c | 10 ++
 2 files changed, 12 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index ed4212a4c5f9..59b748ebd15f 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -9,7 +9,9 @@
  * Copyright (C) 2017 Nicira, Inc.
  */
 
+#ifndef _GNU_SOURCE
 #define _GNU_SOURCE
+#endif
 #include 
 #include 
 #include 
diff --git a/tools/lib/bpf/libbpf_errno.c b/tools/lib/bpf/libbpf_errno.c
index d83b17f8435c..286e497c50ec 100644
--- a/tools/lib/bpf/libbpf_errno.c
+++ b/tools/lib/bpf/libbpf_errno.c
@@ -40,9 +40,19 @@ int libbpf_strerror(int err, char *buf, size_t size)
err = err > 0 ? err : -err;
 
if (err < __LIBBPF_ERRNO__START) {
+#ifdef _GNU_SOURCE
+   const char *ret_buf;
+#endif
int ret;
 
+#ifdef _GNU_SOURCE
+   ret_buf = strerror_r(err, buf, size);
+   if (ret_buf != buf)
+   snprintf(buf, size, "%s", ret_buf);
+   ret = 0;
+#else
ret = strerror_r(err, buf, size);
+#endif
buf[size - 1] = '\0';
return ret;
}
-- 
2.17.1



Re: [PATCH net-next v3 2/3] udp: elide zerocopy operation in hot path

2018-11-29 Thread Willem de Bruijn
On Thu, Nov 29, 2018 at 3:26 PM Willem de Bruijn
 wrote:
>
> From: Willem de Bruijn 
>
> With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
> Release of its last reference triggers a completion notification.
>
> The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
> the skbs, because it can build, send and free skbs within its loop,
> possibly reaching refcount zero and freeing the ubuf_info too soon.
>
> The UDP stack currently also takes this extra ref, but does not need
> it as all skbs are sent after return from __ip(6)_append_data.
>
> Avoid the extra refcount_inc and refcount_dec_and_test, and generally
> the sock_zerocopy_put in the common path, by passing the initial
> reference to the first skb.
>
> This approach is taken instead of initializing the refcount to 0, as
> that would generate error "refcount_t: increment on 0" on the
> next skb_zcopy_set.
>
> Signed-off-by: Willem de Bruijn 
> ---

> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 9602746d7175..08ff04d12642 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -881,8 +881,8 @@ static int __ip_append_data(struct sock *sk,
> int csummode = CHECKSUM_NONE;
> struct rtable *rt = (struct rtable *)cork->dst;
> unsigned int wmem_alloc_delta = 0;
> +   bool paged, extra_uref;
> u32 tskey = 0;
> -   bool paged;
>
> skb = skb_peek_tail(queue);
>
> @@ -921,12 +921,13 @@ static int __ip_append_data(struct sock *sk,
> uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
> if (!uarg)
> return -ENOBUFS;
> +   extra_uref = true;
> if (rt->dst.dev->features & NETIF_F_SG &&
> csummode == CHECKSUM_PARTIAL) {
> paged = true;
> } else {
> uarg->zerocopy = 0;
> -   skb_zcopy_set(skb, uarg);
> +   skb_zcopy_set(skb, uarg, &extra_uref);
> }
> }
>
> @@ -1019,7 +1020,7 @@ static int __ip_append_data(struct sock *sk,
> cork->tx_flags = 0;
> skb_shinfo(skb)->tskey = tskey;
> tskey = 0;
> -   skb_zcopy_set(skb, uarg);
> +   skb_zcopy_set(skb, uarg, &extra_uref);
>
> /*
>  *  Find where to start putting bytes.
> @@ -1123,13 +1124,12 @@ static int __ip_append_data(struct sock *sk,
>
> if (wmem_alloc_delta)
> refcount_add(wmem_alloc_delta, &sk->sk_wmem_alloc);
> -   sock_zerocopy_put(uarg);
> return 0;
>
>  error_efault:
> err = -EFAULT;
>  error:
> -   sock_zerocopy_put_abort(uarg);
> +   sock_zerocopy_put_abort(uarg, extra_uref);

I'll need another revision. Sorry for the spam.

In the draft patch I suggested that the skb_zcopy_set needs to be
moved below getfrag, so that the uarg is not freed on the only
error path that calls kfree_skb inside the main loop.

This is still needed. Else sock_zerocopy_put_abort here is
reached with a non-NULL, but already freed uarg.

Will send a v4. Will let this sit for at least a day in case anyone
else has comments.


Re: [PATCH net-next v3 1/3] udp: msg_zerocopy

2018-11-29 Thread Willem de Bruijn
On Thu, Nov 29, 2018 at 3:26 PM Willem de Bruijn
 wrote:
>
> From: Willem de Bruijn 
>
> Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
> interpret flag MSG_ZEROCOPY.
>
> This patch was previously part of the zerocopy RFC patchsets. Zerocopy
> is not effective at small MTU. With segmentation offload building
> larger datagrams, the benefit of page flipping outweighs the cost of
> generating a completion notification.
>
> tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
> test patch and making skb_orphan_frags_rx same as skb_orphan_frags:
>
> ipv4 udp -t 1
> tx=191312 (11938 MB) txc=0 zc=n
> rx=191312 (11938 MB)
> ipv4 udp -z -t 1
> tx=304507 (19002 MB) txc=304507 zc=y
> rx=304507 (19002 MB)
> ok
> ipv6 udp -t 1
> tx=174485 (10888 MB) txc=0 zc=n
> rx=174485 (10888 MB)
> ipv6 udp -z -t 1
> tx=294801 (18396 MB) txc=294801 zc=y
> rx=294801 (18396 MB)
> ok
>
> Changes
>   v1 -> v2
> - Fixup reverse christmas tree violation
>   v2 -> v3
> - Split refcount avoidance optimization into separate patch
>   - Fix refcount leak on error in fragmented case
> (thanks to Paolo Abeni for pointing this one out!)
>   - Fix refcount inc on zero

one additional change that I forgot to add to the changelog

  - test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
this is needed since commit 5cf4a8532c99 ("tcp: really ignore
MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.


[PATCH net-next v3 2/3] udp: elide zerocopy operation in hot path

2018-11-29 Thread Willem de Bruijn
From: Willem de Bruijn 

With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
Release of its last reference triggers a completion notification.

The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
the skbs, because it can build, send and free skbs within its loop,
possibly reaching refcount zero and freeing the ubuf_info too soon.

The UDP stack currently also takes this extra ref, but does not need
it as all skbs are sent after return from __ip(6)_append_data.

Avoid the extra refcount_inc and refcount_dec_and_test, and generally
the sock_zerocopy_put in the common path, by passing the initial
reference to the first skb.

This approach is taken instead of initializing the refcount to 0, as
that would generate error "refcount_t: increment on 0" on the
next skb_zcopy_set.

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h | 12 
 net/core/skbuff.c  |  9 +
 net/ipv4/ip_output.c   | 10 +-
 net/ipv4/tcp.c |  2 +-
 net/ipv6/ip6_output.c  | 10 +-
 5 files changed, 24 insertions(+), 19 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 04f52e719571..75d50ab7997c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -481,7 +481,7 @@ static inline void sock_zerocopy_get(struct ubuf_info *uarg)
 }
 
 void sock_zerocopy_put(struct ubuf_info *uarg);
-void sock_zerocopy_put_abort(struct ubuf_info *uarg);
+void sock_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
@@ -1326,10 +1326,14 @@ static inline struct ubuf_info *skb_zcopy(struct 
sk_buff *skb)
return is_zcopy ? skb_uarg(skb) : NULL;
 }
 
-static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg)
+static inline void skb_zcopy_set(struct sk_buff *skb, struct ubuf_info *uarg,
+bool *have_ref)
 {
if (skb && uarg && !skb_zcopy(skb)) {
-   sock_zerocopy_get(uarg);
+   if (unlikely(have_ref && *have_ref))
+   *have_ref = false;
+   else
+   sock_zerocopy_get(uarg);
skb_shinfo(skb)->destructor_arg = uarg;
skb_shinfo(skb)->tx_flags |= SKBTX_ZEROCOPY_FRAG;
}
@@ -1374,7 +1378,7 @@ static inline void skb_zcopy_abort(struct sk_buff *skb)
struct ubuf_info *uarg = skb_zcopy(skb);
 
if (uarg) {
-   sock_zerocopy_put_abort(uarg);
+   sock_zerocopy_put_abort(uarg, false);
skb_shinfo(skb)->tx_flags &= ~SKBTX_ZEROCOPY_FRAG;
}
 }
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 435bac91d293..7cfc2144228a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1089,7 +1089,7 @@ void sock_zerocopy_put(struct ubuf_info *uarg)
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put);
 
-void sock_zerocopy_put_abort(struct ubuf_info *uarg)
+void sock_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 {
if (uarg) {
struct sock *sk = skb_from_uarg(uarg)->sk;
@@ -1097,7 +1097,8 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg)
atomic_dec(&sk->sk_zckey);
uarg->len--;
 
-   sock_zerocopy_put(uarg);
+   if (have_uref)
+   sock_zerocopy_put(uarg);
}
 }
 EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
@@ -1137,7 +1138,7 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct 
sk_buff *skb,
return err;
}
 
-   skb_zcopy_set(skb, uarg);
+   skb_zcopy_set(skb, uarg, NULL);
return skb->len - orig_len;
 }
 EXPORT_SYMBOL_GPL(skb_zerocopy_iter_stream);
@@ -1157,7 +1158,7 @@ static int skb_zerocopy_clone(struct sk_buff *nskb, 
struct sk_buff *orig,
if (skb_copy_ubufs(nskb, GFP_ATOMIC))
return -EIO;
}
-   skb_zcopy_set(nskb, skb_uarg(orig));
+   skb_zcopy_set(nskb, skb_uarg(orig), NULL);
}
return 0;
 }
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 9602746d7175..08ff04d12642 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -881,8 +881,8 @@ static int __ip_append_data(struct sock *sk,
int csummode = CHECKSUM_NONE;
struct rtable *rt = (struct rtable *)cork->dst;
unsigned int wmem_alloc_delta = 0;
+   bool paged, extra_uref;
u32 tskey = 0;
-   bool paged;
 
skb = skb_peek_tail(queue);
 
@@ -921,12 +921,13 @@ static int __ip_append_data(struct sock *sk,
uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
if (!uarg)
return -ENOBUFS;
+   extra_uref = true;
if (rt->dst.dev->features & NETIF_F_SG &&
csummode == CHECKSUM_PARTIAL) {
paged = true;
} else {

[PATCH net-next v3 3/3] selftests: extend zerocopy tests to udp

2018-11-29 Thread Willem de Bruijn
From: Willem de Bruijn 

Both msg_zerocopy and udpgso_bench have udp zerocopy variants.
Exercise these as part of the standard kselftest run.

With udp, msg_zerocopy has no control channel. Ensure that the
receiver exits after the sender by accounting for the initial
delay in starting them (in msg_zerocopy.sh).

Signed-off-by: Willem de Bruijn 
---
 tools/testing/selftests/net/msg_zerocopy.c  | 3 ++-
 tools/testing/selftests/net/msg_zerocopy.sh | 2 ++
 tools/testing/selftests/net/udpgso_bench.sh | 3 +++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c 
b/tools/testing/selftests/net/msg_zerocopy.c
index 406cc70c571d..4b02933cab8a 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -651,12 +651,13 @@ static void do_flush_datagram(int fd, int type)
 
 static void do_rx(int domain, int type, int protocol)
 {
+   const int cfg_receiver_wait_ms = 400;
uint64_t tstop;
int fd;
 
fd = do_setup_rx(domain, type, protocol);
 
-   tstop = gettimeofday_ms() + cfg_runtime_ms;
+   tstop = gettimeofday_ms() + cfg_runtime_ms + cfg_receiver_wait_ms;
do {
if (type == SOCK_STREAM)
do_flush_tcp(fd);
diff --git a/tools/testing/selftests/net/msg_zerocopy.sh 
b/tools/testing/selftests/net/msg_zerocopy.sh
index c43c6debda06..825ffec85cea 100755
--- a/tools/testing/selftests/net/msg_zerocopy.sh
+++ b/tools/testing/selftests/net/msg_zerocopy.sh
@@ -25,6 +25,8 @@ readonly path_sysctl_mem="net.core.optmem_max"
 if [[ "$#" -eq "0" ]]; then
$0 4 tcp -t 1
$0 6 tcp -t 1
+   $0 4 udp -t 1
+   $0 6 udp -t 1
echo "OK. All tests passed"
exit 0
 fi
diff --git a/tools/testing/selftests/net/udpgso_bench.sh 
b/tools/testing/selftests/net/udpgso_bench.sh
index 0f0628613f81..5670a9ffd8eb 100755
--- a/tools/testing/selftests/net/udpgso_bench.sh
+++ b/tools/testing/selftests/net/udpgso_bench.sh
@@ -35,6 +35,9 @@ run_udp() {
 
echo "udp gso"
run_in_netns ${args} -S 0
+
+   echo "udp gso zerocopy"
+   run_in_netns ${args} -S 0 -z
 }
 
 run_tcp() {
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH net-next v3 1/3] udp: msg_zerocopy

2018-11-29 Thread Willem de Bruijn
From: Willem de Bruijn 

Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
interpret flag MSG_ZEROCOPY.

This patch was previously part of the zerocopy RFC patchsets. Zerocopy
is not effective at small MTU. With segmentation offload building
larger datagrams, the benefit of page flipping outweighs the cost of
generating a completion notification.

tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
test patch and making skb_orphan_frags_rx same as skb_orphan_frags:

ipv4 udp -t 1
tx=191312 (11938 MB) txc=0 zc=n
rx=191312 (11938 MB)
ipv4 udp -z -t 1
tx=304507 (19002 MB) txc=304507 zc=y
rx=304507 (19002 MB)
ok
ipv6 udp -t 1
tx=174485 (10888 MB) txc=0 zc=n
rx=174485 (10888 MB)
ipv6 udp -z -t 1
tx=294801 (18396 MB) txc=294801 zc=y
rx=294801 (18396 MB)
ok

Changes
  v1 -> v2
- Fixup reverse christmas tree violation
  v2 -> v3
- Split refcount avoidance optimization into separate patch
  - Fix refcount leak on error in fragmented case
(thanks to Paolo Abeni for pointing this one out!)
  - Fix refcount inc on zero

Signed-off-by: Willem de Bruijn 
---
 include/linux/skbuff.h |  1 +
 net/core/skbuff.c  |  6 ++
 net/core/sock.c|  5 -
 net/ipv4/ip_output.c   | 23 ++-
 net/ipv6/ip6_output.c  | 23 ++-
 5 files changed, 55 insertions(+), 3 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 73902acf2b71..04f52e719571 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -485,6 +485,7 @@ void sock_zerocopy_put_abort(struct ubuf_info *uarg);
 
 void sock_zerocopy_callback(struct ubuf_info *uarg, bool success);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len);
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 struct msghdr *msg, int len,
 struct ubuf_info *uarg);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 02cd7ae3d0fb..435bac91d293 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1105,6 +1105,12 @@ EXPORT_SYMBOL_GPL(sock_zerocopy_put_abort);
 extern int __zerocopy_sg_from_iter(struct sock *sk, struct sk_buff *skb,
   struct iov_iter *from, size_t length);
 
+int skb_zerocopy_iter_dgram(struct sk_buff *skb, struct msghdr *msg, int len)
+{
+   return __zerocopy_sg_from_iter(skb->sk, skb, &msg->msg_iter, len);
+}
+EXPORT_SYMBOL_GPL(skb_zerocopy_iter_dgram);
+
 int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
 struct msghdr *msg, int len,
 struct ubuf_info *uarg)
diff --git a/net/core/sock.c b/net/core/sock.c
index 6d7e189e3cd9..f5bb89785e47 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1018,7 +1018,10 @@ int sock_setsockopt(struct socket *sock, int level, int 
optname,
 
case SO_ZEROCOPY:
if (sk->sk_family == PF_INET || sk->sk_family == PF_INET6) {
-   if (sk->sk_protocol != IPPROTO_TCP)
+   if (!((sk->sk_type == SOCK_STREAM &&
+  sk->sk_protocol == IPPROTO_TCP) ||
+ (sk->sk_type == SOCK_DGRAM &&
+  sk->sk_protocol == IPPROTO_UDP)))
ret = -ENOTSUPP;
} else if (sk->sk_family != PF_RDS) {
ret = -ENOTSUPP;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index c09219e7f230..9602746d7175 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -867,6 +867,7 @@ static int __ip_append_data(struct sock *sk,
unsigned int flags)
 {
struct inet_sock *inet = inet_sk(sk);
+   struct ubuf_info *uarg = NULL;
struct sk_buff *skb;
 
struct ip_options *opt = cork->opt;
@@ -916,6 +917,19 @@ static int __ip_append_data(struct sock *sk,
(!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
csummode = CHECKSUM_PARTIAL;
 
+   if (flags & MSG_ZEROCOPY && length && sock_flag(sk, SOCK_ZEROCOPY)) {
+   uarg = sock_zerocopy_realloc(sk, length, skb_zcopy(skb));
+   if (!uarg)
+   return -ENOBUFS;
+   if (rt->dst.dev->features & NETIF_F_SG &&
+   csummode == CHECKSUM_PARTIAL) {
+   paged = true;
+   } else {
+   uarg->zerocopy = 0;
+   skb_zcopy_set(skb, uarg);
+   }
+   }
+
cork->length += length;
 
/* So, what's going on in the loop below?
@@ -1005,6 +1019,7 @@ static int __ip_append_data(struct sock *sk,
cork->tx_flags = 0;
skb_shinfo(skb)->tskey = tskey;
tskey = 0;
+   skb_zcopy_set(skb, uarg);
 

[PATCH net-next v3 0/3] udp msg_zerocopy

2018-11-29 Thread Willem de Bruijn
From: Willem de Bruijn 

Enable MSG_ZEROCOPY for udp sockets

Patch 1/3 is the main patch, a rework of RFC patch
  http://patchwork.ozlabs.org/patch/899630/
  more details in the patch commit message

Patch 2/3 is an optimization to remove a branch from the UDP hot path
  and refcount_inc/refcount_dec_and_test pair when zerocopy is used.
  This used to be included in the first patch in v2.

Patch 3/3 runs the already existing udp zerocopy tests
  as part of kselftest

See also recent Linux Plumbers presentation
  
https://linuxplumbersconf.org/event/2/contributions/106/attachments/104/128/willemdebruijn-lpc2018-udpgso-presentation-20181113.pdf

Changes:
  v1 -> v2
- Fixup reverse christmas tree violation
  v2 -> v3
- Split refcount avoidance optimization into separate patch
  - Fix refcount leak on error in fragmented case
(thanks to Paolo Abeni for pointing this one out!)
  - Fix refcount inc on zero

Willem de Bruijn (3):
  udp: msg_zerocopy
  udp: elide zerocopy operation in hot path
  selftests: extend zerocopy tests to udp

 include/linux/skbuff.h  | 13 +++
 net/core/skbuff.c   | 15 +
 net/core/sock.c |  5 -
 net/ipv4/ip_output.c| 25 +++--
 net/ipv4/tcp.c  |  2 +-
 net/ipv6/ip6_output.c   | 25 +++--
 tools/testing/selftests/net/msg_zerocopy.c  |  3 ++-
 tools/testing/selftests/net/msg_zerocopy.sh |  2 ++
 tools/testing/selftests/net/udpgso_bench.sh |  3 +++
 9 files changed, 78 insertions(+), 15 deletions(-)

-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



Re: [PATCH net-next 2/3] vxlan: extack support for some changelink cases

2018-11-29 Thread kbuild test robot
Hi Roopa,

I love your patch! Yet something to improve:

[auto build test ERROR on net-next/master]

url:
https://github.com/0day-ci/linux/commits/Roopa-Prabhu/vxlan-support-changelink-for-a-few-more-attributes/20181130-030315
config: x86_64-randconfig-x006-201847 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   drivers//net/vxlan.c: In function 'vxlan_nl2conf':
>> drivers//net/vxlan.c:3527:3: error: expected '}' before 'else'
  else {
  ^~~~

vim +3527 drivers//net/vxlan.c

  3376  
  3377  static int vxlan_nl2conf(struct nlattr *tb[], struct nlattr *data[],
  3378   struct net_device *dev, struct vxlan_config *conf,
  3379   bool changelink, struct netlink_ext_ack *extack)
  3380  {
  3381  struct vxlan_dev *vxlan = netdev_priv(dev);
  3382  
  3383  memset(conf, 0, sizeof(*conf));
  3384  
  3385  /* if changelink operation, start with old existing cfg */
  3386  if (changelink)
  3387  memcpy(conf, &vxlan->cfg, sizeof(*conf));
  3388  
  3389  if (data[IFLA_VXLAN_ID]) {
  3390  __be32 vni = cpu_to_be32(nla_get_u32(data[IFLA_VXLAN_ID]));
  3391  
  3392  if (changelink && (vni != conf->vni)) {
  3393  NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_ID],
  3394  "Cannot change vni");
  3395  return -EOPNOTSUPP;
  3396  }
  3397  conf->vni = cpu_to_be32(nla_get_u32(data[IFLA_VXLAN_ID]));
  3398  }
  3399  
  3400  if (data[IFLA_VXLAN_GROUP]) {
  3401  if (changelink && (conf->remote_ip.sa.sa_family != AF_INET)) {
  3402  NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP],
  3403  "New group addr family does not match old group");
  3404  return -EOPNOTSUPP;
  3405  }
  3406  conf->remote_ip.sin.sin_addr.s_addr = nla_get_in_addr(data[IFLA_VXLAN_GROUP]);
  3407  conf->remote_ip.sa.sa_family = AF_INET;
  3408  } else if (data[IFLA_VXLAN_GROUP6]) {
  3409  if (!IS_ENABLED(CONFIG_IPV6)) {
  3410  NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP6],
  3411  "IPv6 support not enabled in the kernel");
  3412  return -EPFNOSUPPORT;
  3413  }
  3414  
  3415  if (changelink && (conf->remote_ip.sa.sa_family != AF_INET6)) {
  3416  NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_GROUP6],
  3417  "New group addr family does not match old group");
  3418  return -EOPNOTSUPP;
  3419  }
  3420  
  3421  conf->remote_ip.sin6.sin6_addr = nla_get_in6_addr(data[IFLA_VXLAN_GROUP6]);
  3422  conf->remote_ip.sa.sa_family = AF_INET6;
  3423  }
  3424  
  3425  if (data[IFLA_VXLAN_LOCAL]) {
  3426  if (changelink && (conf->saddr.sa.sa_family != AF_INET)) {
  3427  NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL],
  3428  "New local addr family does not match old");
  3429  return -EOPNOTSUPP;
  3430  }
  3431  
  3432  conf->saddr.sin.sin_addr.s_addr = nla_get_in_addr(data[IFLA_VXLAN_LOCAL]);
  3433  conf->saddr.sa.sa_family = AF_INET;
  3434  } else if (data[IFLA_VXLAN_LOCAL6]) {
  3435  if (!IS_ENABLED(CONFIG_IPV6)) {
  3436  NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL6],
  3437  "IPv6 support not enabled in the kernel");
  3438  return -EPFNOSUPPORT;
  3439  }
  3440  
  3441  if (changelink && (conf->saddr.sa.sa_family != AF_INET6)) {
  3442  NL_SET_ERR_MSG_ATTR(extack, tb[IFLA_VXLAN_LOCAL6],
  3443  "New local6 addr family does not match old");
  3444  return -EOPNOTSUPP;
  3445  }
  3446  
  3447  /* TODO: respect scope id */
  3448  conf->saddr.sin6.sin6_addr = nla_get_in6_addr(data[IFLA_VXLAN_LOCAL6]);
  3449  conf->saddr.sa.sa_family = AF_INET6;
  3450  }
  3451  
  3452  if (data[IFLA_VXLAN_LINK])
  3453  conf->remote_ifindex = nla_get_u32(data[IFLA_VXLAN_LINK]);
  3454  
  3455  if (data[IFLA_VXLAN_TOS])
  3456  conf->tos

Re: [RFC PATCH] net: mvpp2: fix detection of 10G SFP modules

2018-11-29 Thread Baruch Siach
Hi Florian,

Florian Fainelli writes:
> On 11/29/2018 4:49 AM, Baruch Siach wrote:
>> The mvpp2_phylink_validate() relies on the interface field of
>> phylink_link_state to determine valid link modes. However, when called
>> from phylink_sfp_module_insert() this field is not initialized. The
>> default switch case then excludes 10G link modes. This allows 10G SFP
>> modules that are detected correctly to be configured at max rate of
>> 2.5G.
>>
>> Catch the uninitialized PHY mode case, and allow 10G rates.
>>
>> Cc: Maxime Chevallier 
>> Cc: Antoine Tenart 
>> Signed-off-by: Baruch Siach 
>> ---
>> Is that the right fix?
>
> It would be a bit surprising that this is the right fix, you would
> expect validate to be called once everything has been parsed
> successfully from the SFP, is not that the case here? If not, can you
> find out what happens?

This is the code from phylink_sfp_module_insert() at
drivers/net/phy/phylink.c:

__ETHTOOL_DECLARE_LINK_MODE_MASK(support) = { 0, };
struct phylink_link_state config;
...

sfp_parse_support(pl->sfp_bus, id, support);
port = sfp_parse_port(pl->sfp_bus, id, support);

memset(&config, 0, sizeof(config));
linkmode_copy(config.advertising, support);
config.interface = PHY_INTERFACE_MODE_NA;
/* ... more 'config' fields initialization ... */
ret = phylink_validate(pl, support, &config);

The 'interface' field is not determined at this stage. I think it is
meant to represent the actual configuration rather than the detected
information.

baruch

--
 http://baruch.siach.name/blog/  ~. .~   Tk Open Systems
=}ooO--U--Ooo{=
   - bar...@tkos.co.il - tel: +972.52.368.4656, http://www.tkos.co.il -


Re: [PATCH] net: phy: sfp: correct location of SFP standards

2018-11-29 Thread David Miller
From: Baruch Siach 
Date: Thu, 29 Nov 2018 12:00:05 +0200

> SFP standards are now available from the SNIA (Storage Networking
> Industry Association) website.
> 
> Cc: Andrew Lunn 
> Cc: Florian Fainelli 
> Signed-off-by: Baruch Siach 

Applied.

