[PATCH] mac80211: fix enumeration type

2017-04-20 Thread Stefan Agner
Use enum nl80211_channel_type for cfg80211_chandef_create function
calls. This does not have an effect in practice since the enum values
are equal. This fixes a warning when compiling with clang:
  warning: implicit conversion from enumeration type 'enum nl80211_chan_width'
  to different enumeration type 'enum nl80211_channel_type'

Signed-off-by: Stefan Agner 
---
 net/mac80211/ibss.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/mac80211/ibss.c b/net/mac80211/ibss.c
index 98999d3d5262..5b748025147d 100644
--- a/net/mac80211/ibss.c
+++ b/net/mac80211/ibss.c
@@ -425,7 +425,7 @@ static void ieee80211_sta_join_ibss(struct 
ieee80211_sub_if_data *sdata,
case NL80211_CHAN_WIDTH_5:
case NL80211_CHAN_WIDTH_10:
cfg80211_chandef_create(&chandef, cbss->channel,
-   NL80211_CHAN_WIDTH_20_NOHT);
+   NL80211_CHAN_NO_HT);
chandef.width = sdata->u.ibss.chandef.width;
break;
case NL80211_CHAN_WIDTH_80:
@@ -437,7 +437,7 @@ static void ieee80211_sta_join_ibss(struct 
ieee80211_sub_if_data *sdata,
default:
/* fall back to 20 MHz for unsupported modes */
cfg80211_chandef_create(&chandef, cbss->channel,
-   NL80211_CHAN_WIDTH_20_NOHT);
+   NL80211_CHAN_NO_HT);
break;
}
 
-- 
2.12.2



Re: mwifiex: MAC randomization should not be persistent

2017-04-20 Thread Kalle Valo
Brian Norris  wrote:
> nl80211 provides the NL80211_SCAN_FLAG_RANDOM_ADDR for every scan
> request that should be randomized; the absence of such a flag means we
> should not randomize. However, mwifiex was stashing the latest
> randomization request and *always* using it for future scans, even those
> that didn't set the flag.
> 
> Let's zero out the randomization info whenever we get a scan request
> without NL80211_SCAN_FLAG_RANDOM_ADDR. I'd prefer to remove
> priv->random_mac entirely (and plumb the randomization MAC properly
> through the call sequence), but the spaghetti is a little difficult to
> unravel here for me.
> 
> Fixes: c2a8f0ff9c6c ("mwifiex: support random MAC address for scanning")
> Cc:  # 4.9+
> Signed-off-by: Brian Norris 

Patch applied to wireless-drivers-next.git, thanks.

7e2f18f06408 mwifiex: MAC randomization should not be persistent

-- 
https://patchwork.kernel.org/patch/9665813/

https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches



Re: [PATCH] nl80211: fix enumeration type

2017-04-20 Thread Johannes Berg
On Wed, 2017-04-19 at 23:55 -0700, Stefan Agner wrote:
> Use type enum nl80211_rate_info for bitrate information. This fixes
> a warning when compiling with clang:
>   warning: implicit conversion from enumeration type 'enum
> nl80211_rate_info'
>   to different enumeration type 'enum nl80211_attrs'

I'm pretty sure I just applied exactly these two patches very recently.

What made *two* people suddenly see this?

johannes


[PATCH net] Fix net/hsr/hsr_device to check for freed skb buffer.

2017-04-20 Thread Peter Heise
Fix an unchecked call of skb_put_padto(). The return value was ignored
before; however, skb_put_padto() frees the skb in case of error.

As reported by Dan Carpenter on kernel-janitors.

Signed-off-by: Peter Heise 
---
 net/hsr/hsr_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/hsr/hsr_device.c b/net/hsr/hsr_device.c
index c73160fb11e7..22d693f213be 100644
--- a/net/hsr/hsr_device.c
+++ b/net/hsr/hsr_device.c
@@ -314,7 +314,8 @@ static void send_hsr_supervision_frame(struct hsr_port 
*master,
hsr_sp = (typeof(hsr_sp)) skb_put(skb, sizeof(struct hsr_sup_payload));
ether_addr_copy(hsr_sp->MacAddressA, master->dev->dev_addr);
 
-   skb_put_padto(skb, ETH_ZLEN + HSR_HLEN);
+   if (skb_put_padto(skb, ETH_ZLEN + HSR_HLEN))
+   return;
 
hsr_forward_skb(skb, master);
return;
-- 
2.11.0



Re: [PATCH] mdio_bus: Issue GPIO RESET to PHYs.

2017-04-20 Thread Roger Quadros
On 19/04/17 16:38, Andrew Lunn wrote:
> On Wed, Apr 19, 2017 at 02:56:48PM +0300, Roger Quadros wrote:
>> Hi,
>>
>> On 19/04/17 14:39, Andrew Lunn wrote:
>>> On Wed, Apr 19, 2017 at 12:24:26PM +0300, Roger Quadros wrote:
 Some boards [1] leave the PHYs at an invalid state
 during system power-up or reset thus causing unreliability
 issues with the PHY which manifests as PHY not being detected
 or link not functional. To fix this, these PHYs need to be RESET
 via a GPIO connected to the PHY's RESET pin.

 Some boards have a single GPIO controlling the PHY RESET pin of all
 PHYs on the bus whereas some others have separate GPIOs controlling
 individual PHY RESETs.

 In both cases, the RESET de-assertion cannot be done in the PHY driver
 as the PHY will not probe till its reset is de-asserted.
 So do the RESET de-assertion in the MDIO bus driver.

 [1] - am572x-idk, am571x-idk, a437x-idk

 Signed-off-by: Roger Quadros 
 ---
  drivers/net/phy/mdio_bus.c | 26 ++
  drivers/of/of_mdio.c   |  4 
  include/linux/phy.h|  5 +
  3 files changed, 35 insertions(+)
>>>
>>> Hi Roger
>>>
>>> Thanks for making this generic.
>>>
>>> Please add device tree binding documentation. I think that actually
>>> means you have to document MDIO in general, since there currently is
>>> not a binding document.
>>
>> OK.
>>
>>>
 diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
 index fa7d51f..25fda2b 100644
 --- a/drivers/net/phy/mdio_bus.c
 +++ b/drivers/net/phy/mdio_bus.c
 @@ -22,8 +22,11 @@
  #include 
  #include 
  #include 
 +#include 
 +#include 
  #include 
  #include 
 +#include 
  #include 
  #include 
  #include 
 @@ -43,6 +46,8 @@
  
  #include "mdio-boardinfo.h"
  
 +#define DEFAULT_GPIO_RESET_DELAY  10  /* in microseconds */
 +
  int mdiobus_register_device(struct mdio_device *mdiodev)
  {
if (mdiodev->bus->mdio_map[mdiodev->addr])
 @@ -307,6 +312,7 @@ int __mdiobus_register(struct mii_bus *bus, struct 
 module *owner)
  {
struct mdio_device *mdiodev;
int i, err;
 +  struct gpio_desc *gpiod;
  
if (NULL == bus || NULL == bus->name ||
NULL == bus->read || NULL == bus->write)
 @@ -333,6 +339,26 @@ int __mdiobus_register(struct mii_bus *bus, struct 
 module *owner)
if (bus->reset)
bus->reset(bus);
  
 +  /* de-assert bus level PHY GPIO resets */
 +  for (i = 0; i < bus->num_reset_gpios; i++) {
 +  gpiod = devm_gpiod_get_index(&bus->dev, "reset", i,
 +   GPIOD_OUT_LOW);
 +  if (IS_ERR(gpiod)) {
 +  err = PTR_ERR(gpiod);
 +  if (err != -ENOENT) {
 +  pr_err("mii_bus %s couldn't get reset GPIO\n",
 + bus->id);
 +  return err;
 +  }
 +  } else {
 +  gpiod_set_value_cansleep(gpiod, 1);
>>>
>>>
 +  if (!bus->reset_delay_us)
 +  bus->reset_delay_us = DEFAULT_GPIO_RESET_DELAY;
>>>
>>> Maybe do this once, where you read the device tree property.
>>
>> I was thinking from the point of view that the GPIO RESET code should
>> work even without device tree. Although I'm not sure if there would be
>> any users or not.
> 
> Hi Roger
> 
> I don't see how this would work. What would devm_gpiod_get_index()
> return? Something from ACPI? But then there would be something

Yes or just lookup tables based on platform data.

> equivalent for getting the delay.

I'm not sure about the delay part.

> 
> Lets keep it simple and OF only. If there is a real need for something
> in addition to OF, it can be added later.

OK. I'll revise the patch based on your comments.

cheers,
-roger


Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone

2017-04-20 Thread Dmitry Vyukov
On Thu, Apr 20, 2017 at 1:51 AM, David Ahern  wrote:
> On 4/19/17 5:47 PM, Cong Wang wrote:
>> On Wed, Apr 19, 2017 at 9:12 AM, Andrey Konovalov  
>> wrote:
>>>
>>> Anyway, I just finished simplifying the reproducer. Give this one a try.
>>
>> Thanks for providing such a minimal reproducer!
>>
>> The following patch could fix this crash, but I am not 100% sure if we should
>> just clear these bits or reject them with an errno.
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index 9db14189..cf524c2 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -2086,7 +2086,7 @@ static struct rt6_info
>> *ip6_route_info_create(struct fib6_config *cfg)
>> } else
>> rt->rt6i_prefsrc.plen = 0;
>>
>> -   rt->rt6i_flags = cfg->fc_flags;
>> +   rt->rt6i_flags = cfg->fc_flags & ~(RTF_PCPU | RTF_CACHE);
>>
>>  install_route:
>> rt->dst.dev = dev;
>>
>
> I sent a patch returning EINVAL if RTF_PCPU is set in fc_flags


Andrey, does it fix the other crashes?


[PATCH net] MAINTAINERS: Add new IPsec offloading files.

2017-04-20 Thread Steffen Klassert
This adds two new files to the IPsec maintenance scope:

net/ipv4/esp4_offload.c
net/ipv6/esp6_offload.c

Signed-off-by: Steffen Klassert 
---
 MAINTAINERS | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 5397f54..b01f923 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8801,12 +8801,12 @@ F:  net/core/flow.c
 F: net/xfrm/
 F: net/key/
 F: net/ipv4/xfrm*
-F: net/ipv4/esp4.c
+F: net/ipv4/esp4*
 F: net/ipv4/ah4.c
 F: net/ipv4/ipcomp.c
 F: net/ipv4/ip_vti.c
 F: net/ipv6/xfrm*
-F: net/ipv6/esp6.c
+F: net/ipv6/esp6*
 F: net/ipv6/ah6.c
 F: net/ipv6/ipcomp6.c
 F: net/ipv6/ip6_vti.c
-- 
2.7.4



[PATCH v2] mdio_bus: Issue GPIO RESET to PHYs.

2017-04-20 Thread Roger Quadros
Some boards [1] leave the PHYs at an invalid state
during system power-up or reset thus causing unreliability
issues with the PHY which manifests as PHY not being detected
or link not functional. To fix this, these PHYs need to be RESET
via a GPIO connected to the PHY's RESET pin.

Some boards have a single GPIO controlling the PHY RESET pin of all
PHYs on the bus whereas some others have separate GPIOs controlling
individual PHY RESETs.

In both cases, the RESET de-assertion cannot be done in the PHY driver
as the PHY will not probe till its reset is de-asserted.
So do the RESET de-assertion in the MDIO bus driver.

[1] - am572x-idk, am571x-idk, a437x-idk

Signed-off-by: Roger Quadros 
---
v2:
- add device tree binding document (mdio.txt)
- specify default reset delay in of_mdio.c instead of mdio_bus.c

 Documentation/devicetree/bindings/net/mdio.txt | 20 
 drivers/net/phy/mdio_bus.c | 22 ++
 drivers/of/of_mdio.c   |  7 +++
 include/linux/phy.h|  5 +
 4 files changed, 54 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/mdio.txt

diff --git a/Documentation/devicetree/bindings/net/mdio.txt 
b/Documentation/devicetree/bindings/net/mdio.txt
new file mode 100644
index 000..6e703d7
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/mdio.txt
@@ -0,0 +1,20 @@
+Common MDIO bus properties.
+
+These are generic properties that can apply to any MDIO bus.
+
+Optional properties:
+- reset-gpios: List of one or more GPIOs that control the RESET lines
+  of the PHYs on that MDIO bus.
+- reset-delay-us: RESET pulse width as per PHY datasheet.
+
+Example :
+
+   davinci_mdio: ethernet@0x5c03 {
+   compatible = "ti,davinci_mdio";
+   ti,hwmods = "davinci_mdio";
+   reg = <0x5c03 0x1000>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+   reset-gpios = <&gpio2 5 GPIO_ACTIVE_LOW>;
+   reset-delay-us = <2>;   /* PHY datasheet states 1uS min */
+   };
diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index fa7d51f..b353d99 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -22,8 +22,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -307,6 +310,7 @@ int __mdiobus_register(struct mii_bus *bus, struct module 
*owner)
 {
struct mdio_device *mdiodev;
int i, err;
+   struct gpio_desc *gpiod;
 
if (NULL == bus || NULL == bus->name ||
NULL == bus->read || NULL == bus->write)
@@ -333,6 +337,24 @@ int __mdiobus_register(struct mii_bus *bus, struct module 
*owner)
if (bus->reset)
bus->reset(bus);
 
+   /* de-assert bus level PHY GPIO resets */
+   for (i = 0; i < bus->num_reset_gpios; i++) {
+   gpiod = devm_gpiod_get_index(&bus->dev, "reset", i,
+GPIOD_OUT_LOW);
+   if (IS_ERR(gpiod)) {
+   err = PTR_ERR(gpiod);
+   if (err != -ENOENT) {
+   pr_err("mii_bus %s couldn't get reset GPIO\n",
+  bus->id);
+   return err;
+   }
+   } else {
+   gpiod_set_value_cansleep(gpiod, 1);
+   udelay(bus->reset_delay_us);
+   gpiod_set_value_cansleep(gpiod, 0);
+   }
+   }
+
for (i = 0; i < PHY_MAX_ADDR; i++) {
if ((bus->phy_mask & (1 << i)) == 0) {
struct phy_device *phydev;
diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index 0b29798..7e4c80f 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -22,6 +22,8 @@
 #include 
 #include 
 
+#define DEFAULT_GPIO_RESET_DELAY   10  /* in microseconds */
+
 MODULE_AUTHOR("Grant Likely ");
 MODULE_LICENSE("GPL");
 
@@ -221,6 +223,11 @@ int of_mdiobus_register(struct mii_bus *mdio, struct 
device_node *np)
 
mdio->dev.of_node = np;
 
+   /* Get bus level PHY reset GPIO details */
+   mdio->reset_delay_us = DEFAULT_GPIO_RESET_DELAY;
+   of_property_read_u32(np, "reset-delay-us", &mdio->reset_delay_us);
+   mdio->num_reset_gpios = of_gpio_named_count(np, "reset-gpios");
+
/* Register the MDIO bus */
rc = mdiobus_register(mdio);
if (rc)
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 43a7748..80a6574 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -217,6 +217,11 @@ struct mii_bus {
 * matching its address
 */
int irq[PHY_MAX_ADDR];
+
+   /* GPIO reset pulse width in uS */
+   int reset_delay_us;
+   /* Number of reset GPIOs */
+   int num_reset_gpios;
 };
 #define to_mii_bus(d) conta

Re: [net-next 00/15][pull request] 10GbE Intel Wired LAN Driver Updates 2017-04-18

2017-04-20 Thread Jeff Kirsher
On Wed, 2017-04-19 at 22:25 -0700, Alexei Starovoitov wrote:
> On Tue, Apr 18, 2017 at 04:01:50PM -0700, Jeff Kirsher wrote:
> > The following are changes since commit
> > 4116c97689b9b1732ac5b68afd922406f9fc842e:
> >    Merge branch 'ftgmac100-batch5-features'
> > and are available in the git repository at:
> >    git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-
> > queue 10GbE
> 
> Nothing against this set of commits,
> but I saw ixgbe xdp support in there pending for few months.
> Now it's gone. What happened?

XDP is in the next series of patches; per Dave, we need to keep patch
series smaller to help in review.  The next series will contain what you
are looking for, I just need to get this in before I can push the next
series.



[PATCH 04/16] xfrm: Add mode handlers for IPsec on layer 2

2017-04-20 Thread Steffen Klassert
This patch adds a gso_segment and xmit callback to the
xfrm_mode and implements these functions for tunnel and
transport mode.

Signed-off-by: Steffen Klassert 
---
 include/net/xfrm.h  | 10 ++
 net/ipv4/xfrm4_mode_transport.c | 32 
 net/ipv4/xfrm4_mode_tunnel.c| 25 +
 net/ipv6/xfrm6_mode_transport.c | 33 +
 net/ipv6/xfrm6_mode_tunnel.c| 24 
 5 files changed, 124 insertions(+)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index ac984da..54515d9 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -444,6 +444,16 @@ struct xfrm_mode {
 */
int (*output)(struct xfrm_state *x, struct sk_buff *skb);
 
+   /*
+* Adjust pointers into the packet and do GSO segmentation.
+*/
+   struct sk_buff *(*gso_segment)(struct xfrm_state *x, struct sk_buff 
*skb, netdev_features_t features);
+
+   /*
+* Adjust pointers into the packet when IPsec is done at layer2.
+*/
+   void (*xmit)(struct xfrm_state *x, struct sk_buff *skb);
+
struct xfrm_state_afinfo *afinfo;
struct module *owner;
unsigned int encap;
diff --git a/net/ipv4/xfrm4_mode_transport.c b/net/ipv4/xfrm4_mode_transport.c
index 4acc050..6c2411d 100644
--- a/net/ipv4/xfrm4_mode_transport.c
+++ b/net/ipv4/xfrm4_mode_transport.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Add encapsulation header.
  *
@@ -56,9 +57,40 @@ static int xfrm4_transport_input(struct xfrm_state *x, 
struct sk_buff *skb)
return 0;
 }
 
+static struct sk_buff *xfrm4_transport_gso_segment(struct xfrm_state *x,
+  struct sk_buff *skb,
+  netdev_features_t features)
+{
+   const struct net_offload *ops;
+   struct sk_buff *segs = ERR_PTR(-EINVAL);
+   struct xfrm_offload *xo = xfrm_offload(skb);
+
+   skb->transport_header += x->props.header_len;
+   ops = rcu_dereference(inet_offloads[xo->proto]);
+   if (likely(ops && ops->callbacks.gso_segment))
+   segs = ops->callbacks.gso_segment(skb, features);
+
+   return segs;
+}
+
+static void xfrm4_transport_xmit(struct xfrm_state *x, struct sk_buff *skb)
+{
+   struct xfrm_offload *xo = xfrm_offload(skb);
+
+   skb_reset_mac_len(skb);
+   pskb_pull(skb, skb->mac_len + sizeof(struct iphdr) + 
x->props.header_len);
+
+   if (xo->flags & XFRM_GSO_SEGMENT) {
+   skb_reset_transport_header(skb);
+   skb->transport_header -= x->props.header_len;
+   }
+}
+
 static struct xfrm_mode xfrm4_transport_mode = {
.input = xfrm4_transport_input,
.output = xfrm4_transport_output,
+   .gso_segment = xfrm4_transport_gso_segment,
+   .xmit = xfrm4_transport_xmit,
.owner = THIS_MODULE,
.encap = XFRM_MODE_TRANSPORT,
 };
diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c
index 35feda6..d3f2434 100644
--- a/net/ipv4/xfrm4_mode_tunnel.c
+++ b/net/ipv4/xfrm4_mode_tunnel.c
@@ -96,11 +96,36 @@ static int xfrm4_mode_tunnel_input(struct xfrm_state *x, 
struct sk_buff *skb)
return err;
 }
 
+static struct sk_buff *xfrm4_mode_tunnel_gso_segment(struct xfrm_state *x,
+struct sk_buff *skb,
+netdev_features_t features)
+{
+   __skb_push(skb, skb->mac_len);
+   return skb_mac_gso_segment(skb, features);
+
+}
+
+static void xfrm4_mode_tunnel_xmit(struct xfrm_state *x, struct sk_buff *skb)
+{
+   struct xfrm_offload *xo = xfrm_offload(skb);
+
+   if (xo->flags & XFRM_GSO_SEGMENT) {
+   skb->network_header = skb->network_header - x->props.header_len;
+   skb->transport_header = skb->network_header +
+   sizeof(struct iphdr);
+   }
+
+   skb_reset_mac_len(skb);
+   pskb_pull(skb, skb->mac_len + x->props.header_len);
+}
+
 static struct xfrm_mode xfrm4_tunnel_mode = {
.input2 = xfrm4_mode_tunnel_input,
.input = xfrm_prepare_input,
.output2 = xfrm4_mode_tunnel_output,
.output = xfrm4_prepare_output,
+   .gso_segment = xfrm4_mode_tunnel_gso_segment,
+   .xmit = xfrm4_mode_tunnel_xmit,
.owner = THIS_MODULE,
.encap = XFRM_MODE_TUNNEL,
.flags = XFRM_MODE_FLAG_TUNNEL,
diff --git a/net/ipv6/xfrm6_mode_transport.c b/net/ipv6/xfrm6_mode_transport.c
index 4439ee4..eb9b36b 100644
--- a/net/ipv6/xfrm6_mode_transport.c
+++ b/net/ipv6/xfrm6_mode_transport.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* Add encapsulation header.
  *
@@ -61,9 +62,41 @@ static int xfrm6_transport_input(struct xfrm_state *x, 
struct sk_buff *skb)
return 0;
 }
 
+static struct sk_buff *xfrm4_transport_gso_segment(struct xfrm_state

[PATCH 02/16] xfrm: Add a xfrm type offload.

2017-04-20 Thread Steffen Klassert
We add a struct xfrm_type_offload so that the offloaded
codepath is separated from the non-offloaded codepath. With
this the non-offloaded and the offloaded codepaths can coexist.

Signed-off-by: Steffen Klassert 
---
 include/net/xfrm.h| 28 +++-
 net/xfrm/xfrm_state.c | 73 +++
 2 files changed, 95 insertions(+), 6 deletions(-)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 9e3dc7b..159342f 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -222,6 +222,8 @@ struct xfrm_state {
struct xfrm_mode*inner_mode_iaf;
struct xfrm_mode*outer_mode;
 
+   const struct xfrm_type_offload  *type_offload;
+
/* Security context */
struct xfrm_sec_ctx *security;
 
@@ -314,12 +316,14 @@ void km_state_expired(struct xfrm_state *x, int hard, u32 
portid);
 int __xfrm_state_delete(struct xfrm_state *x);
 
 struct xfrm_state_afinfo {
-   unsigned intfamily;
-   unsigned intproto;
-   __be16  eth_proto;
-   struct module   *owner;
-   const struct xfrm_type  *type_map[IPPROTO_MAX];
-   struct xfrm_mode*mode_map[XFRM_MODE_MAX];
+   unsigned intfamily;
+   unsigned intproto;
+   __be16  eth_proto;
+   struct module   *owner;
+   const struct xfrm_type  *type_map[IPPROTO_MAX];
+   const struct xfrm_type_offload  *type_offload_map[IPPROTO_MAX];
+   struct xfrm_mode*mode_map[XFRM_MODE_MAX];
+
int (*init_flags)(struct xfrm_state *x);
void(*init_tempsel)(struct xfrm_selector *sel,
const struct flowi *fl);
@@ -380,6 +384,18 @@ struct xfrm_type {
 int xfrm_register_type(const struct xfrm_type *type, unsigned short family);
 int xfrm_unregister_type(const struct xfrm_type *type, unsigned short family);
 
+struct xfrm_type_offload {
+   char*description;
+   struct module   *owner;
+   u8  proto;
+   void(*encap)(struct xfrm_state *, struct sk_buff *pskb);
+   int (*input_tail)(struct xfrm_state *x, struct sk_buff 
*skb);
+   int (*xmit)(struct xfrm_state *, struct sk_buff *pskb, 
netdev_features_t features);
+};
+
+int xfrm_register_type_offload(const struct xfrm_type_offload *type, unsigned 
short family);
+int xfrm_unregister_type_offload(const struct xfrm_type_offload *type, 
unsigned short family);
+
 struct xfrm_mode {
/*
 * Remove encapsulation header.
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 5a597db..47fefe9 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -251,6 +251,75 @@ static void xfrm_put_type(const struct xfrm_type *type)
module_put(type->owner);
 }
 
+static DEFINE_SPINLOCK(xfrm_type_offload_lock);
+int xfrm_register_type_offload(const struct xfrm_type_offload *type,
+  unsigned short family)
+{
+   struct xfrm_state_afinfo *afinfo = xfrm_state_get_afinfo(family);
+   const struct xfrm_type_offload **typemap;
+   int err = 0;
+
+   if (unlikely(afinfo == NULL))
+   return -EAFNOSUPPORT;
+   typemap = afinfo->type_offload_map;
+   spin_lock_bh(&xfrm_type_offload_lock);
+
+   if (likely(typemap[type->proto] == NULL))
+   typemap[type->proto] = type;
+   else
+   err = -EEXIST;
+   spin_unlock_bh(&xfrm_type_offload_lock);
+   rcu_read_unlock();
+   return err;
+}
+EXPORT_SYMBOL(xfrm_register_type_offload);
+
+int xfrm_unregister_type_offload(const struct xfrm_type_offload *type,
+unsigned short family)
+{
+   struct xfrm_state_afinfo *afinfo = xfrm_state_get_afinfo(family);
+   const struct xfrm_type_offload **typemap;
+   int err = 0;
+
+   if (unlikely(afinfo == NULL))
+   return -EAFNOSUPPORT;
+   typemap = afinfo->type_offload_map;
+   spin_lock_bh(&xfrm_type_offload_lock);
+
+   if (unlikely(typemap[type->proto] != type))
+   err = -ENOENT;
+   else
+   typemap[type->proto] = NULL;
+   spin_unlock_bh(&xfrm_type_offload_lock);
+   rcu_read_unlock();
+   return err;
+}
+EXPORT_SYMBOL(xfrm_unregister_type_offload);
+
+static const struct xfrm_type_offload *xfrm_get_type_offload(u8 proto, 
unsigned short family)
+{
+   struct xfrm_state_afinfo *afinfo;
+   const struct xfrm_type_offload **typemap;
+   const struct xfrm_type_offload *type;
+
+   afinfo = xfrm_state_get_afinfo(family);
+   if (unlikely(afinfo == NULL))
+   return NULL;
+   typemap = afinfo->type_offload_map;
+
+   type = typemap[proto];
+   if ((type && !try_module_get(type->owner)))
+   ty

[PATCH 05/16] xfrm: Add an IPsec hardware offloading API

2017-04-20 Thread Steffen Klassert
This patch adds all the bits that are needed to do
IPsec hardware offload for IPsec states and ESP packets.
We add xfrmdev_ops to the net_device. xfrmdev_ops has
function pointers that are needed to manage the xfrm
states in the hardware and to do a per packet
offloading decision.

Joint work with:
Ilan Tayari 
Guy Shapiro 
Yossi Kuperman 

Signed-off-by: Guy Shapiro 
Signed-off-by: Ilan Tayari 
Signed-off-by: Yossi Kuperman 
Signed-off-by: Steffen Klassert 
---
 include/linux/netdevice.h |  14 +
 include/net/xfrm.h|  65 +-
 include/uapi/linux/xfrm.h |   8 +++
 net/ipv4/esp4.c   |   7 +--
 net/ipv4/xfrm4_output.c   |   3 +-
 net/ipv6/esp6.c   |   4 +-
 net/ipv6/xfrm6_output.c   |   9 ++-
 net/xfrm/Makefile |   3 +-
 net/xfrm/xfrm_device.c| 138 +-
 net/xfrm/xfrm_input.c |  41 +-
 net/xfrm/xfrm_output.c|  44 +--
 net/xfrm/xfrm_policy.c|  10 ++--
 net/xfrm/xfrm_state.c |  74 +
 net/xfrm/xfrm_user.c  |  28 ++
 14 files changed, 424 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5bb03d1..b3eb83d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -824,6 +824,16 @@ struct netdev_xdp {
};
 };
 
+#ifdef CONFIG_XFRM_OFFLOAD
+struct xfrmdev_ops {
+   int (*xdo_dev_state_add) (struct xfrm_state *x);
+   void(*xdo_dev_state_delete) (struct xfrm_state *x);
+   void(*xdo_dev_state_free) (struct xfrm_state *x);
+   bool(*xdo_dev_offload_ok) (struct sk_buff *skb,
+  struct xfrm_state *x);
+};
+#endif
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1697,6 +1707,10 @@ struct net_device {
const struct ndisc_ops *ndisc_ops;
 #endif
 
+#ifdef CONFIG_XFRM
+   const struct xfrmdev_ops *xfrmdev_ops;
+#endif
+
const struct header_ops *header_ops;
 
unsigned intflags;
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 54515d9..17603bf 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -120,6 +120,13 @@ struct xfrm_state_walk {
struct xfrm_address_filter *filter;
 };
 
+struct xfrm_state_offload {
+   struct net_device   *dev;
+   unsigned long   offload_handle;
+   unsigned intnum_exthdrs;
+   u8  flags;
+};
+
 /* Full description of state of transformer. */
 struct xfrm_state {
possible_net_t  xs_net;
@@ -207,6 +214,8 @@ struct xfrm_state {
struct xfrm_lifetime_cur curlft;
struct tasklet_hrtimer  mtimer;
 
+   struct xfrm_state_offload xso;
+
/* used to fix curlft->add_time when changing date */
longsaved_tmo;
 
@@ -1453,7 +1462,6 @@ struct xfrm6_tunnel {
 void xfrm_init(void);
 void xfrm4_init(void);
 int xfrm_state_init(struct net *net);
-void xfrm_dev_init(void);
 void xfrm_state_fini(struct net *net);
 void xfrm4_state_init(void);
 void xfrm4_protocol_init(void);
@@ -1559,6 +1567,7 @@ struct xfrmk_spdinfo {
 struct xfrm_state *xfrm_find_acq_byseq(struct net *net, u32 mark, u32 seq);
 int xfrm_state_delete(struct xfrm_state *x);
 int xfrm_state_flush(struct net *net, u8 proto, bool task_valid);
+int xfrm_dev_state_flush(struct net *net, struct net_device *dev, bool 
task_valid);
 void xfrm_sad_getinfo(struct net *net, struct xfrmk_sadinfo *si);
 void xfrm_spd_getinfo(struct net *net, struct xfrmk_spdinfo *si);
 u32 xfrm_replay_seqhi(struct xfrm_state *x, __be32 net_seq);
@@ -1641,6 +1650,11 @@ static inline int xfrm4_udp_encap_rcv(struct sock *sk, 
struct sk_buff *skb)
 }
 #endif
 
+struct dst_entry *__xfrm_dst_lookup(struct net *net, int tos, int oif,
+   const xfrm_address_t *saddr,
+   const xfrm_address_t *daddr,
+   int family);
+
 struct xfrm_policy *xfrm_policy_alloc(struct net *net, gfp_t gfp);
 
 void xfrm_policy_walk_init(struct xfrm_policy_walk *walk, u8 type);
@@ -1846,6 +1860,55 @@ static inline struct xfrm_offload *xfrm_offload(struct 
sk_buff *skb)
 }
 #endif
 
+#ifdef CONFIG_XFRM_OFFLOAD
+void __net_init xfrm_dev_init(void);
+int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
+  struct xfrm_user_offload *xuo);
+bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x);
+
+static inline void xfrm_dev_state_delete(struct xfrm_state *x)
+{
+   struct xfrm_state_offload *xso = &x->xso;
+
+   if (xso->dev)
+   xso->dev->xfrmdev_ops->xdo_dev_state_delete(x);
+}
+
+static inline void xfrm_dev_state_free(struct xfrm_state *x)
+{
+   struct xfrm_state_offload *xso = &x->xso;
+struct net_device *dev = xso->dev;
+
+   if (dev && dev->xf

pull request (net-next): ipsec-next 2017-04-20

2017-04-20 Thread Steffen Klassert
This adds the basic infrastructure for IPsec hardware
offloading: it creates a configuration API and adjusts
the packet path.

1) Add the needed netdev features to configure IPsec offloads.

2) Add the IPsec hardware offloading API.

3) Prepare the ESP packet path for hardware offloading.

4) Add gso handlers for esp4 and esp6, this implements
   the software fallback for GSO packets.

5) Add xfrm replay handler functions for offloading.

6) Change ESP to use a synchronous crypto algorithm on
   offloading, we don't have the option for asynchronous
   returns when we handle IPsec at layer2.

7) Add a xfrm validate function to validate_xmit_skb. This
   implements the software fallback for non GSO packets.

8) Set the inner_network and inner_transport members of
   the SKB, as well as encapsulation, to reflect the actual
   positions of these headers, and remove them only once
   encryption is done on the payload.
   From Ilan Tayari.

9) Prepare the ESP GRO codepath for hardware offloading.

10) Fix incorrect null pointer check in esp6.
From Colin Ian King.

11) Fix for the GSO software fallback path to detect the
fallback correctly.
From Ilan Tayari.

Please pull or let me know if there are problems.

Thanks!

The following changes since commit f221dcd91d20cdcb893cf6e9c8894b7d6c97d649:

  Merge branch 'net-smc-next' (2017-04-11 23:01:15 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next.git master

for you to fetch changes up to 8f92e03ecca390beed3d5ccc81023d050f0369fd:

  esp4/6: Fix GSO path for non-GSO SW-crypto packets (2017-04-19 07:48:57 +0200)


Colin Ian King (1):
  esp6: fix incorrect null pointer check on xo

Ilan Tayari (2):
  xfrm: Add encapsulation header offsets while SKB is not encrypted
  esp4/6: Fix GSO path for non-GSO SW-crypto packets

Steffen Klassert (13):
  net: Add ESP offload features
  xfrm: Add a xfrm type offload.
  xfrm: Move device notifications to a separate file
  xfrm: Add mode handlers for IPsec on layer 2
  xfrm: Add an IPsec hardware offloading API
  esp6: Rename esp_input_done2
  esp4: Reorganize esp_output
  esp6: Reorganize esp_output
  esp: Add gso handlers for esp4 and esp6
  xfrm: Add xfrm_replay_overflow functions for offloading
  esp: Use a synchronous crypto algorithm on offloading.
  net: Add a xfrm validate function to validate_xmit_skb
  xfrm: Prepare the GRO codepath for hardware offloading.

 include/linux/netdev_features.h |   8 +-
 include/linux/netdevice.h   |  15 ++
 include/linux/skbuff.h  |   2 +
 include/net/esp.h   |  19 +++
 include/net/xfrm.h  | 108 +++-
 include/uapi/linux/xfrm.h   |   8 +
 net/core/dev.c  |   3 +
 net/core/ethtool.c  |   3 +
 net/ipv4/esp4.c | 370 ++--
 net/ipv4/esp4_offload.c | 231 +++--
 net/ipv4/xfrm4_mode_transport.c |  34 
 net/ipv4/xfrm4_mode_tunnel.c|  28 +++
 net/ipv4/xfrm4_output.c |   3 +-
 net/ipv6/esp6.c | 292 +--
 net/ipv6/esp6_offload.c | 233 +++--
 net/ipv6/xfrm6_mode_transport.c |  34 
 net/ipv6/xfrm6_mode_tunnel.c|  27 +++
 net/ipv6/xfrm6_output.c |   9 +-
 net/xfrm/Makefile   |   1 +
 net/xfrm/xfrm_device.c  | 208 ++
 net/xfrm/xfrm_input.c   |  41 -
 net/xfrm/xfrm_output.c  |  46 -
 net/xfrm/xfrm_policy.c  |  27 +--
 net/xfrm/xfrm_replay.c  | 162 +-
 net/xfrm/xfrm_state.c   | 147 
 net/xfrm/xfrm_user.c|  28 +++
 26 files changed, 1717 insertions(+), 370 deletions(-)
 create mode 100644 net/xfrm/xfrm_device.c


[PATCH 03/16] xfrm: Move device notifications to a separate file

2017-04-20 Thread Steffen Klassert
This is needed for the upcoming IPsec device offloading.

Signed-off-by: Steffen Klassert 
---
 include/net/xfrm.h |  1 +
 net/xfrm/Makefile  |  2 +-
 net/xfrm/xfrm_device.c | 43 +++
 net/xfrm/xfrm_policy.c | 17 +
 4 files changed, 46 insertions(+), 17 deletions(-)
 create mode 100644 net/xfrm/xfrm_device.c

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 159342f..ac984da 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1443,6 +1443,7 @@ struct xfrm6_tunnel {
 void xfrm_init(void);
 void xfrm4_init(void);
 int xfrm_state_init(struct net *net);
+void xfrm_dev_init(void);
 void xfrm_state_fini(struct net *net);
 void xfrm4_state_init(void);
 void xfrm4_protocol_init(void);
diff --git a/net/xfrm/Makefile b/net/xfrm/Makefile
index c0e9619..55b2ac3 100644
--- a/net/xfrm/Makefile
+++ b/net/xfrm/Makefile
@@ -4,7 +4,7 @@
 
 obj-$(CONFIG_XFRM) := xfrm_policy.o xfrm_state.o xfrm_hash.o \
  xfrm_input.o xfrm_output.o \
- xfrm_sysctl.o xfrm_replay.o
+ xfrm_sysctl.o xfrm_replay.o xfrm_device.o
 obj-$(CONFIG_XFRM_STATISTICS) += xfrm_proc.o
 obj-$(CONFIG_XFRM_ALGO) += xfrm_algo.o
 obj-$(CONFIG_XFRM_USER) += xfrm_user.o
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
new file mode 100644
index 000..34a260a
--- /dev/null
+++ b/net/xfrm/xfrm_device.c
@@ -0,0 +1,43 @@
+/*
+ * xfrm_device.c - IPsec device offloading code.
+ *
+ * Copyright (c) 2015 secunet Security Networks AG
+ *
+ * Author:
+ * Steffen Klassert 
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static int xfrm_dev_event(struct notifier_block *this, unsigned long event, 
void *ptr)
+{
+   struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+
+   switch (event) {
+   case NETDEV_DOWN:
+   xfrm_garbage_collect(dev_net(dev));
+   }
+   return NOTIFY_DONE;
+}
+
+static struct notifier_block xfrm_dev_notifier = {
+   .notifier_call  = xfrm_dev_event,
+};
+
+void __net_init xfrm_dev_init(void)
+{
+   register_netdevice_notifier(&xfrm_dev_notifier);
+}
diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 236cbbc..7befca2 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -2929,21 +2929,6 @@ void xfrm_policy_unregister_afinfo(const struct 
xfrm_policy_afinfo *afinfo)
 }
 EXPORT_SYMBOL(xfrm_policy_unregister_afinfo);
 
-static int xfrm_dev_event(struct notifier_block *this, unsigned long event, 
void *ptr)
-{
-   struct net_device *dev = netdev_notifier_info_to_dev(ptr);
-
-   switch (event) {
-   case NETDEV_DOWN:
-   xfrm_garbage_collect(dev_net(dev));
-   }
-   return NOTIFY_DONE;
-}
-
-static struct notifier_block xfrm_dev_notifier = {
-   .notifier_call  = xfrm_dev_event,
-};
-
 #ifdef CONFIG_XFRM_STATISTICS
 static int __net_init xfrm_statistics_init(struct net *net)
 {
@@ -3020,7 +3005,7 @@ static int __net_init xfrm_policy_init(struct net *net)
INIT_WORK(&net->xfrm.policy_hash_work, xfrm_hash_resize);
INIT_WORK(&net->xfrm.policy_hthresh.work, xfrm_hash_rebuild);
if (net_eq(net, &init_net))
-   register_netdevice_notifier(&xfrm_dev_notifier);
+   xfrm_dev_init();
return 0;
 
 out_bydst:
-- 
2.7.4



[PATCH 01/16] net: Add ESP offload features

2017-04-20 Thread Steffen Klassert
This patch adds netdev features to configure IPsec offloads.

Signed-off-by: Steffen Klassert 
---
 include/linux/netdev_features.h | 8 +++-
 include/linux/netdevice.h   | 1 +
 include/linux/skbuff.h  | 2 ++
 net/core/ethtool.c  | 3 +++
 4 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
index 9a04195..1d4737c 100644
--- a/include/linux/netdev_features.h
+++ b/include/linux/netdev_features.h
@@ -54,8 +54,9 @@ enum {
 */
NETIF_F_GSO_TUNNEL_REMCSUM_BIT, /* ... TUNNEL with TSO & REMCSUM */
NETIF_F_GSO_SCTP_BIT,   /* ... SCTP fragmentation */
+   NETIF_F_GSO_ESP_BIT,/* ... ESP with TSO */
/**/NETIF_F_GSO_LAST =  /* last bit, see GSO_MASK */
-   NETIF_F_GSO_SCTP_BIT,
+   NETIF_F_GSO_ESP_BIT,
 
NETIF_F_FCOE_CRC_BIT,   /* FCoE CRC32 */
NETIF_F_SCTP_CRC_BIT,   /* SCTP checksum offload */
@@ -73,6 +74,8 @@ enum {
NETIF_F_HW_L2FW_DOFFLOAD_BIT,   /* Allow L2 Forwarding in Hardware */
 
NETIF_F_HW_TC_BIT,  /* Offload TC infrastructure */
+   NETIF_F_HW_ESP_BIT, /* Hardware ESP transformation offload 
*/
+   NETIF_F_HW_ESP_TX_CSUM_BIT, /* ESP with TX checksum offload */
 
/*
 * Add your fresh new feature above and remember to update
@@ -129,11 +132,14 @@ enum {
 #define NETIF_F_GSO_PARTIAL __NETIF_F(GSO_PARTIAL)
 #define NETIF_F_GSO_TUNNEL_REMCSUM __NETIF_F(GSO_TUNNEL_REMCSUM)
 #define NETIF_F_GSO_SCTP   __NETIF_F(GSO_SCTP)
+#define NETIF_F_GSO_ESP__NETIF_F(GSO_ESP)
 #define NETIF_F_HW_VLAN_STAG_FILTER __NETIF_F(HW_VLAN_STAG_FILTER)
 #define NETIF_F_HW_VLAN_STAG_RX__NETIF_F(HW_VLAN_STAG_RX)
 #define NETIF_F_HW_VLAN_STAG_TX__NETIF_F(HW_VLAN_STAG_TX)
 #define NETIF_F_HW_L2FW_DOFFLOAD   __NETIF_F(HW_L2FW_DOFFLOAD)
 #define NETIF_F_HW_TC  __NETIF_F(HW_TC)
+#define NETIF_F_HW_ESP __NETIF_F(HW_ESP)
+#define NETIF_F_HW_ESP_TX_CSUM __NETIF_F(HW_ESP_TX_CSUM)
 
 #define for_each_netdev_feature(mask_addr, bit)\
for_each_set_bit(bit, (unsigned long *)mask_addr, NETDEV_FEATURE_COUNT)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cc07c3b..5bb03d1 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4070,6 +4070,7 @@ static inline bool net_gso_ok(netdev_features_t features, 
int gso_type)
BUILD_BUG_ON(SKB_GSO_PARTIAL != (NETIF_F_GSO_PARTIAL >> 
NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_TUNNEL_REMCSUM != (NETIF_F_GSO_TUNNEL_REMCSUM >> 
NETIF_F_GSO_SHIFT));
BUILD_BUG_ON(SKB_GSO_SCTP!= (NETIF_F_GSO_SCTP >> 
NETIF_F_GSO_SHIFT));
+   BUILD_BUG_ON(SKB_GSO_ESP != (NETIF_F_GSO_ESP >> NETIF_F_GSO_SHIFT));
 
return (features & feature) == feature;
 }
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 741d75c..81ef53f 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -492,6 +492,8 @@ enum {
SKB_GSO_TUNNEL_REMCSUM = 1 << 14,
 
SKB_GSO_SCTP = 1 << 15,
+
+   SKB_GSO_ESP = 1 << 16,
 };
 
 #if BITS_PER_LONG > 32
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 905a88a..03111a2 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -90,6 +90,7 @@ static const char 
netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
[NETIF_F_GSO_UDP_TUNNEL_CSUM_BIT] = "tx-udp_tnl-csum-segmentation",
[NETIF_F_GSO_PARTIAL_BIT] =  "tx-gso-partial",
[NETIF_F_GSO_SCTP_BIT] = "tx-sctp-segmentation",
+   [NETIF_F_GSO_ESP_BIT] =  "tx-esp-segmentation",
 
[NETIF_F_FCOE_CRC_BIT] = "tx-checksum-fcoe-crc",
[NETIF_F_SCTP_CRC_BIT] ="tx-checksum-sctp",
@@ -103,6 +104,8 @@ static const char 
netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN]
[NETIF_F_RXALL_BIT] ="rx-all",
[NETIF_F_HW_L2FW_DOFFLOAD_BIT] = "l2-fwd-offload",
[NETIF_F_HW_TC_BIT] ="hw-tc-offload",
+   [NETIF_F_HW_ESP_BIT] =   "esp-hw-offload",
+   [NETIF_F_HW_ESP_TX_CSUM_BIT] =   "esp-tx-csum-hw-offload",
 };
 
 static const char
-- 
2.7.4



[PATCH 06/16] esp6: Rename esp_input_done2

2017-04-20 Thread Steffen Klassert
We are going to export both the ipv4 and the ipv6
versions of esp_input_done2. They will no longer be
static and so can't share a name. Rename the ipv6
version to esp6_input_done2.

Signed-off-by: Steffen Klassert 
---
 net/ipv6/esp6.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 3d3757d..5bd1dcc 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -447,7 +447,7 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff 
*skb)
return err;
 }
 
-static int esp_input_done2(struct sk_buff *skb, int err)
+static int esp6_input_done2(struct sk_buff *skb, int err)
 {
struct xfrm_state *x = xfrm_input_state(skb);
struct xfrm_offload *xo = xfrm_offload(skb);
@@ -499,7 +499,7 @@ static void esp_input_done(struct crypto_async_request 
*base, int err)
 {
struct sk_buff *skb = base->data;
 
-   xfrm_input_resume(skb, esp_input_done2(skb, err));
+   xfrm_input_resume(skb, esp6_input_done2(skb, err));
 }
 
 static void esp_input_restore_header(struct sk_buff *skb)
@@ -621,7 +621,7 @@ static int esp6_input(struct xfrm_state *x, struct sk_buff 
*skb)
if ((x->props.flags & XFRM_STATE_ESN))
esp_input_restore_header(skb);
 
-   ret = esp_input_done2(skb, ret);
+   ret = esp6_input_done2(skb, ret);
 
 out:
return ret;
-- 
2.7.4



[PATCH 07/16] esp4: Reorganize esp_output

2017-04-20 Thread Steffen Klassert
We need a fallback for ESP at layer 2, so split esp_output
into generic functions that can be used at layer 3 and layer 2
and use them in esp_output. We also add esp_xmit which is
used for the layer 2 fallback.

Signed-off-by: Steffen Klassert 
---
 include/net/esp.h   |  16 +++
 net/ipv4/esp4.c | 341 ++--
 net/ipv4/esp4_offload.c | 102 +++
 3 files changed, 301 insertions(+), 158 deletions(-)

diff --git a/include/net/esp.h b/include/net/esp.h
index a43be85..411a499 100644
--- a/include/net/esp.h
+++ b/include/net/esp.h
@@ -10,4 +10,20 @@ static inline struct ip_esp_hdr *ip_esp_hdr(const struct 
sk_buff *skb)
return (struct ip_esp_hdr *)skb_transport_header(skb);
 }
 
+struct esp_info {
+   struct  ip_esp_hdr *esph;
+   __be64  seqno;
+   int tfclen;
+   int tailen;
+   int plen;
+   int clen;
+   int len;
+   int nfrags;
+   __u8proto;
+   boolinplace;
+};
+
+int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info 
*esp);
+int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info 
*esp);
+int esp_input_done2(struct sk_buff *skb, int err);
 #endif
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index c6aba23..91e6a40 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -152,11 +152,10 @@ static void esp_output_restore_header(struct sk_buff *skb)
 }
 
 static struct ip_esp_hdr *esp_output_set_extra(struct sk_buff *skb,
+  struct xfrm_state *x,
   struct ip_esp_hdr *esph,
   struct esp_output_extra *extra)
 {
-   struct xfrm_state *x = skb_dst(skb)->xfrm;
-
/* For ESN we move the header forward by 4 bytes to
 * accomodate the high bits.  We will move it back after
 * encryption.
@@ -198,98 +197,56 @@ static void esp_output_fill_trailer(u8 *tail, int tfclen, 
int plen, __u8 proto)
tail[plen - 1] = proto;
 }
 
-static int esp_output(struct xfrm_state *x, struct sk_buff *skb)
+static void esp_output_udp_encap(struct xfrm_state *x, struct sk_buff *skb, 
struct esp_info *esp)
 {
-   struct esp_output_extra *extra;
-   int err = -ENOMEM;
-   struct ip_esp_hdr *esph;
-   struct crypto_aead *aead;
-   struct aead_request *req;
-   struct scatterlist *sg, *dsg;
-   struct sk_buff *trailer;
-   struct page *page;
-   void *tmp;
-   u8 *iv;
-   u8 *tail;
-   u8 *vaddr;
-   int blksize;
-   int clen;
-   int alen;
-   int plen;
-   int ivlen;
-   int tfclen;
-   int nfrags;
-   int assoclen;
-   int extralen;
-   int tailen;
-   __be64 seqno;
-   __u8 proto = *skb_mac_header(skb);
-
-   /* skb is pure payload to encrypt */
-
-   aead = x->data;
-   alen = crypto_aead_authsize(aead);
-   ivlen = crypto_aead_ivsize(aead);
-
-   tfclen = 0;
-   if (x->tfcpad) {
-   struct xfrm_dst *dst = (struct xfrm_dst *)skb_dst(skb);
-   u32 padto;
-
-   padto = min(x->tfcpad, esp4_get_mtu(x, dst->child_mtu_cached));
-   if (skb->len < padto)
-   tfclen = padto - skb->len;
+   int encap_type;
+   struct udphdr *uh;
+   __be32 *udpdata32;
+   __be16 sport, dport;
+   struct xfrm_encap_tmpl *encap = x->encap;
+   struct ip_esp_hdr *esph = esp->esph;
+
+   spin_lock_bh(&x->lock);
+   sport = encap->encap_sport;
+   dport = encap->encap_dport;
+   encap_type = encap->encap_type;
+   spin_unlock_bh(&x->lock);
+
+   uh = (struct udphdr *)esph;
+   uh->source = sport;
+   uh->dest = dport;
+   uh->len = htons(skb->len + esp->tailen
+ - skb_transport_offset(skb));
+   uh->check = 0;
+
+   switch (encap_type) {
+   default:
+   case UDP_ENCAP_ESPINUDP:
+   esph = (struct ip_esp_hdr *)(uh + 1);
+   break;
+   case UDP_ENCAP_ESPINUDP_NON_IKE:
+   udpdata32 = (__be32 *)(uh + 1);
+   udpdata32[0] = udpdata32[1] = 0;
+   esph = (struct ip_esp_hdr *)(udpdata32 + 2);
+   break;
}
-   blksize = ALIGN(crypto_aead_blocksize(aead), 4);
-   clen = ALIGN(skb->len + 2 + tfclen, blksize);
-   plen = clen - skb->len - tfclen;
-   tailen = tfclen + plen + alen;
-   assoclen = sizeof(*esph);
-   extralen = 0;
 
-   if (x->props.flags & XFRM_STATE_ESN) {
-   extralen += sizeof(*extra);
-   assoclen += sizeof(__be32);
-   }
+   *skb_mac_header(skb) = IPPROTO_UDP;
+   esp->esph = esph;
+}
 
-   *skb_mac_header(skb) = IPPROTO_ESP;
-   esph = ip_esp_hdr(skb);
+int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info 
*esp)
+{
+   u8 *tail;
+   u8 *vaddr;
+ 

[PATCH 09/16] esp: Add gso handlers for esp4 and esp6

2017-04-20 Thread Steffen Klassert
This patch extends xfrm_type by an encap function pointer
and implements esp4_gso_encap and esp6_gso_encap. These
functions do the basic ESP encapsulation for a GSO packet.
In case the GSO packet needs to be segmented in software,
we add gso_segment functions. This codepath is going to be
used for ESP hardware offloads.

Signed-off-by: Steffen Klassert 
---
 net/ipv4/esp4.c | 10 +-
 net/ipv4/esp4_offload.c | 93 +
 net/ipv6/esp6.c |  8 +++--
 net/ipv6/esp6_offload.c | 93 +
 net/xfrm/xfrm_replay.c  |  3 +-
 5 files changed, 203 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 91e6a40..4382f30 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -161,11 +161,19 @@ static struct ip_esp_hdr *esp_output_set_extra(struct 
sk_buff *skb,
 * encryption.
 */
if ((x->props.flags & XFRM_STATE_ESN)) {
+   __u32 seqhi;
+   struct xfrm_offload *xo = xfrm_offload(skb);
+
+   if (xo)
+   seqhi = xo->seq.hi;
+   else
+   seqhi = XFRM_SKB_CB(skb)->seq.output.hi;
+
extra->esphoff = (unsigned char *)esph -
 skb_transport_header(skb);
esph = (struct ip_esp_hdr *)((unsigned char *)esph - 4);
extra->seqhi = esph->spi;
-   esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.hi);
+   esph->seq_no = htonl(seqhi);
}
 
esph->spi = x->id.spi;
diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index efaaa44..1e39564 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -84,6 +84,97 @@ static struct sk_buff **esp4_gro_receive(struct sk_buff 
**head,
return NULL;
 }
 
+static void esp4_gso_encap(struct xfrm_state *x, struct sk_buff *skb)
+{
+   struct ip_esp_hdr *esph;
+   struct iphdr *iph = ip_hdr(skb);
+   struct xfrm_offload *xo = xfrm_offload(skb);
+   int proto = iph->protocol;
+
+   skb_push(skb, -skb_network_offset(skb));
+   esph = ip_esp_hdr(skb);
+   *skb_mac_header(skb) = IPPROTO_ESP;
+
+   esph->spi = x->id.spi;
+   esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
+
+   xo->proto = proto;
+}
+
+static struct sk_buff *esp4_gso_segment(struct sk_buff *skb,
+   netdev_features_t features)
+{
+   __u32 seq;
+   int err = 0;
+   struct sk_buff *skb2;
+   struct xfrm_state *x;
+   struct ip_esp_hdr *esph;
+   struct crypto_aead *aead;
+   struct sk_buff *segs = ERR_PTR(-EINVAL);
+   netdev_features_t esp_features = features;
+   struct xfrm_offload *xo = xfrm_offload(skb);
+
+   if (!xo)
+   goto out;
+
+   seq = xo->seq.low;
+
+   x = skb->sp->xvec[skb->sp->len - 1];
+   aead = x->data;
+   esph = ip_esp_hdr(skb);
+
+   if (esph->spi != x->id.spi)
+   goto out;
+
+   if (!pskb_may_pull(skb, sizeof(*esph) + crypto_aead_ivsize(aead)))
+   goto out;
+
+   __skb_pull(skb, sizeof(*esph) + crypto_aead_ivsize(aead));
+
+   skb->encap_hdr_csum = 1;
+
+   if (!(features & NETIF_F_HW_ESP))
+   esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK);
+
+   segs = x->outer_mode->gso_segment(x, skb, esp_features);
+   if (IS_ERR_OR_NULL(segs))
+   goto out;
+
+   __skb_pull(skb, skb->data - skb_mac_header(skb));
+
+   skb2 = segs;
+   do {
+   struct sk_buff *nskb = skb2->next;
+
+   xo = xfrm_offload(skb2);
+   xo->flags |= XFRM_GSO_SEGMENT;
+   xo->seq.low = seq;
+   xo->seq.hi = xfrm_replay_seqhi(x, seq);
+
+   if(!(features & NETIF_F_HW_ESP))
+   xo->flags |= CRYPTO_FALLBACK;
+
+   x->outer_mode->xmit(x, skb2);
+
+   err = x->type_offload->xmit(x, skb2, esp_features);
+   if (err) {
+   kfree_skb_list(segs);
+   return ERR_PTR(err);
+   }
+
+   if (!skb_is_gso(skb2))
+   seq++;
+   else
+   seq += skb_shinfo(skb2)->gso_segs;
+
+   skb_push(skb2, skb2->mac_len);
+   skb2 = nskb;
+   } while (skb2);
+
+out:
+   return segs;
+}
+
 static int esp_input_tail(struct xfrm_state *x, struct sk_buff *skb)
 {
struct crypto_aead *aead = x->data;
@@ -173,6 +264,7 @@ static int esp_xmit(struct xfrm_state *x, struct sk_buff 
*skb,  netdev_features_
 static const struct net_offload esp4_offload = {
.callbacks = {
.gro_receive = esp4_gro_receive,
+   .gso_segment = esp4_gso_segment,
},
 };
 
@@ -182,6 +274,7 @@ static const struct xfrm_type_offload esp_type_offload = {
.pro

[PATCH 08/16] esp6: Reorganize esp_output

2017-04-20 Thread Steffen Klassert
We need a fallback for ESP at layer 2, so split esp6_output
into generic functions that can be used at layer 3 and layer 2
and use them in esp_output. We also add esp6_xmit which is
used for the layer 2 fallback.

Signed-off-by: Steffen Klassert 
---
 include/net/esp.h   |   3 +
 net/ipv6/esp6.c | 264 +---
 net/ipv6/esp6_offload.c | 103 +++
 3 files changed, 246 insertions(+), 124 deletions(-)

diff --git a/include/net/esp.h b/include/net/esp.h
index 411a499..c41994d 100644
--- a/include/net/esp.h
+++ b/include/net/esp.h
@@ -26,4 +26,7 @@ struct esp_info {
 int esp_output_head(struct xfrm_state *x, struct sk_buff *skb, struct esp_info 
*esp);
 int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info 
*esp);
 int esp_input_done2(struct sk_buff *skb, int err);
+int esp6_output_head(struct xfrm_state *x, struct sk_buff *skb, struct 
esp_info *esp);
+int esp6_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct 
esp_info *esp);
+int esp6_input_done2(struct sk_buff *skb, int err);
 #endif
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 5bd1dcc..cc654a7 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -170,11 +170,10 @@ static void esp_output_restore_header(struct sk_buff *skb)
 }
 
 static struct ip_esp_hdr *esp_output_set_esn(struct sk_buff *skb,
+struct xfrm_state *x,
 struct ip_esp_hdr *esph,
 __be32 *seqhi)
 {
-   struct xfrm_state *x = skb_dst(skb)->xfrm;
-
/* For ESN we move the header forward by 4 bytes to
 * accomodate the high bits.  We will move it back after
 * encryption.
@@ -214,59 +213,15 @@ static void esp_output_fill_trailer(u8 *tail, int tfclen, 
int plen, __u8 proto)
tail[plen - 1] = proto;
 }
 
-static int esp6_output(struct xfrm_state *x, struct sk_buff *skb)
+int esp6_output_head(struct xfrm_state *x, struct sk_buff *skb, struct 
esp_info *esp)
 {
-   int err;
-   struct ip_esp_hdr *esph;
-   struct crypto_aead *aead;
-   struct aead_request *req;
-   struct scatterlist *sg, *dsg;
-   struct sk_buff *trailer;
-   struct page *page;
-   void *tmp;
-   int blksize;
-   int clen;
-   int alen;
-   int plen;
-   int ivlen;
-   int tfclen;
-   int nfrags;
-   int assoclen;
-   int seqhilen;
-   int tailen;
-   u8 *iv;
u8 *tail;
u8 *vaddr;
-   __be32 *seqhi;
-   __be64 seqno;
-   __u8 proto = *skb_mac_header(skb);
-
-   /* skb is pure payload to encrypt */
-   aead = x->data;
-   alen = crypto_aead_authsize(aead);
-   ivlen = crypto_aead_ivsize(aead);
-
-   tfclen = 0;
-   if (x->tfcpad) {
-   struct xfrm_dst *dst = (struct xfrm_dst *)skb_dst(skb);
-   u32 padto;
-
-   padto = min(x->tfcpad, esp6_get_mtu(x, dst->child_mtu_cached));
-   if (skb->len < padto)
-   tfclen = padto - skb->len;
-   }
-   blksize = ALIGN(crypto_aead_blocksize(aead), 4);
-   clen = ALIGN(skb->len + 2 + tfclen, blksize);
-   plen = clen - skb->len - tfclen;
-   tailen = tfclen + plen + alen;
-
-   assoclen = sizeof(*esph);
-   seqhilen = 0;
-
-   if (x->props.flags & XFRM_STATE_ESN) {
-   seqhilen += sizeof(__be32);
-   assoclen += seqhilen;
-   }
+   int nfrags;
+   struct page *page;
+   struct ip_esp_hdr *esph;
+   struct sk_buff *trailer;
+   int tailen = esp->tailen;
 
*skb_mac_header(skb) = IPPROTO_ESP;
esph = ip_esp_hdr(skb);
@@ -284,6 +239,8 @@ static int esp6_output(struct xfrm_state *x, struct sk_buff 
*skb)
struct sock *sk = skb->sk;
struct page_frag *pfrag = &x->xfrag;
 
+   esp->inplace = false;
+
allocsize = ALIGN(tailen, L1_CACHE_BYTES);
 
spin_lock_bh(&x->lock);
@@ -300,10 +257,12 @@ static int esp6_output(struct xfrm_state *x, struct 
sk_buff *skb)
 
tail = vaddr + pfrag->offset;
 
-   esp_output_fill_trailer(tail, tfclen, plen, proto);
+   esp_output_fill_trailer(tail, esp->tfclen, esp->plen, 
esp->proto);
 
kunmap_atomic(vaddr);
 
+   spin_unlock_bh(&x->lock);
+
nfrags = skb_shinfo(skb)->nr_frags;
 
__skb_fill_page_desc(skb, nfrags, page, pfrag->offset,
@@ -319,77 +278,56 @@ static int esp6_output(struct xfrm_state *x, struct 
sk_buff *skb)
if (sk)
atomic_add(tailen, &sk->sk_wmem_alloc);
 
-   skb_push(skb, -skb_network_offset(skb));
-
-   esph->seq_no = htonl(XFRM_SKB_CB(skb)->seq

[PATCH 15/16] esp6: fix incorrect null pointer check on xo

2017-04-20 Thread Steffen Klassert
From: Colin Ian King 

The check for xo being null is incorrect: it currently
checks for non-null when it should check for null.

Detected with CoverityScan, CID#1429349 ("Dereference after null check")

Fixes: 7862b4058b9f ("esp: Add gso handlers for esp4 and esp6")
Signed-off-by: Colin Ian King 
Signed-off-by: Steffen Klassert 
---
 net/ipv6/esp6_offload.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index 1cc..95f10728 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -120,7 +120,7 @@ static struct sk_buff *esp6_gso_segment(struct sk_buff *skb,
netdev_features_t esp_features = features;
struct xfrm_offload *xo = xfrm_offload(skb);
 
-   if (xo)
+   if (!xo)
goto out;
 
seq = xo->seq.low;
-- 
2.7.4



[PATCH 12/16] net: Add a xfrm validate function to validate_xmit_skb

2017-04-20 Thread Steffen Klassert
When we do IPsec offloading, we need a fallback for
packets that were targeted to be IPsec offloaded but
rerouted to a device that does not support IPsec offload.
For that we add a function that checks the offloading
features of the sending device and flags the
requirement of a fallback before it calls the IPsec
output function. The IPsec output function adds the IPsec
trailer and does encryption if needed.

Signed-off-by: Steffen Klassert 
---
 include/net/xfrm.h |  6 ++
 net/core/dev.c |  3 +++
 net/xfrm/xfrm_device.c | 29 +
 3 files changed, 38 insertions(+)

diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 17603bf..6793a30c 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1862,6 +1862,7 @@ static inline struct xfrm_offload *xfrm_offload(struct 
sk_buff *skb)
 
 #ifdef CONFIG_XFRM_OFFLOAD
 void __net_init xfrm_dev_init(void);
+int validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t features);
 int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
   struct xfrm_user_offload *xuo);
 bool xfrm_dev_offload_ok(struct sk_buff *skb, struct xfrm_state *x);
@@ -1890,6 +1891,11 @@ static inline void __net_init xfrm_dev_init(void)
 {
 }
 
+static inline int validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t 
features)
+{
+   return 0;
+}
+
 static inline int xfrm_dev_state_add(struct net *net, struct xfrm_state *x, 
struct xfrm_user_offload *xuo)
 {
return 0;
diff --git a/net/core/dev.c b/net/core/dev.c
index ef9fe60e..5f0a864 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2972,6 +2972,9 @@ static struct sk_buff *validate_xmit_skb(struct sk_buff 
*skb, struct net_device
__skb_linearize(skb))
goto out_kfree_skb;
 
+   if (validate_xmit_xfrm(skb, features))
+   goto out_kfree_skb;
+
/* If packet is not checksummed and device does not
 * support checksumming for this protocol, complete
 * checksumming here.
diff --git a/net/xfrm/xfrm_device.c b/net/xfrm/xfrm_device.c
index 9bac2ba..8ec8a3f 100644
--- a/net/xfrm/xfrm_device.c
+++ b/net/xfrm/xfrm_device.c
@@ -22,6 +22,35 @@
 #include 
 #include 
 
+int validate_xmit_xfrm(struct sk_buff *skb, netdev_features_t features)
+{
+   int err;
+   struct xfrm_state *x;
+   struct xfrm_offload *xo = xfrm_offload(skb);
+
+   if (skb_is_gso(skb))
+   return 0;
+
+   if (xo) {
+   x = skb->sp->xvec[skb->sp->len - 1];
+   if (xo->flags & XFRM_GRO || x->xso.flags & XFRM_OFFLOAD_INBOUND)
+   return 0;
+
+   x->outer_mode->xmit(x, skb);
+
+   err = x->type_offload->xmit(x, skb, features);
+   if (err) {
+   XFRM_INC_STATS(xs_net(x), 
LINUX_MIB_XFRMOUTSTATEPROTOERROR);
+   return err;
+   }
+
+   skb_push(skb, skb->data - skb_mac_header(skb));
+   }
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(validate_xmit_xfrm);
+
 int xfrm_dev_state_add(struct net *net, struct xfrm_state *x,
   struct xfrm_user_offload *xuo)
 {
-- 
2.7.4



[PATCH 16/16] esp4/6: Fix GSO path for non-GSO SW-crypto packets

2017-04-20 Thread Steffen Klassert
From: Ilan Tayari 

If esp*_offload module is loaded, outbound packets take the
GSO code path, being encapsulated at layer 3, but encrypted
in layer 2. validate_xmit_xfrm calls esp*_xmit for that.

esp*_xmit was wrongly detecting these packets as going
through hardware crypto offload, while in fact they should
be encrypted in software, causing plaintext to leak onto
the network and packets to be dropped at the receiver side.

Perform the encryption in esp*_xmit, if the SA doesn't have
a hardware offload_handle.

Also, align esp6 code to esp4 logic.

Fixes: fca11ebde3f0 ("esp4: Reorganize esp_output")
Fixes: 383d0350f2cc ("esp6: Reorganize esp_output")
Signed-off-by: Ilan Tayari 
Signed-off-by: Steffen Klassert 
---
 net/ipv4/esp4_offload.c | 4 ++--
 net/ipv6/esp6_offload.c | 7 ---
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index f3e33c2..e066601 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -209,8 +209,8 @@ static int esp_xmit(struct xfrm_state *x, struct sk_buff 
*skb,  netdev_features_
if (!xo)
return -EINVAL;
 
-   if (!(features & NETIF_F_HW_ESP) ||
-   (x->xso.offload_handle &&  x->xso.dev != skb->dev)) {
+   if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
+   (x->xso.dev != skb->dev)) {
xo->flags |= CRYPTO_FALLBACK;
hw_offload = false;
}
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index 95f10728..d950d43 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -211,9 +211,10 @@ static int esp6_xmit(struct xfrm_state *x, struct sk_buff 
*skb,  netdev_features
if (!xo)
return -EINVAL;
 
-   if (!(features & NETIF_F_HW_ESP) ||
-   (x->xso.offload_handle &&  x->xso.dev != skb->dev)) {
+   if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
+   (x->xso.dev != skb->dev)) {
xo->flags |= CRYPTO_FALLBACK;
+   hw_offload = false;
}
 
esp.proto = xo->proto;
@@ -254,7 +255,7 @@ static int esp6_xmit(struct xfrm_state *x, struct sk_buff 
*skb,  netdev_features
ipv6_hdr(skb)->payload_len = htons(len);
}
 
-   if (x->xso.offload_handle && !(xo->flags & CRYPTO_FALLBACK))
+   if (hw_offload)
return 0;
 
esp.seqno = cpu_to_be64(xo->seq.low + ((u64)xo->seq.hi << 32));
-- 
2.7.4



[PATCH 10/16] xfrm: Add xfrm_replay_overflow functions for offloading

2017-04-20 Thread Steffen Klassert
This patch adds functions that handle IPsec sequence
numbers for GSO segments and TSO offloading. We need
to calculate and update the sequence numbers based
on the segments that GSO/TSO will generate. We need
this to keep the software and hardware sequence number
counters in sync.

Signed-off-by: Steffen Klassert 
---
 net/xfrm/xfrm_replay.c | 159 -
 1 file changed, 157 insertions(+), 2 deletions(-)

diff --git a/net/xfrm/xfrm_replay.c b/net/xfrm/xfrm_replay.c
index 20e68a3..8b23c5b 100644
--- a/net/xfrm/xfrm_replay.c
+++ b/net/xfrm/xfrm_replay.c
@@ -559,6 +559,158 @@ static void xfrm_replay_advance_esn(struct xfrm_state *x, 
__be32 net_seq)
x->repl->notify(x, XFRM_REPLAY_UPDATE);
 }
 
+#ifdef CONFIG_XFRM_OFFLOAD
+static int xfrm_replay_overflow_offload(struct xfrm_state *x, struct sk_buff 
*skb)
+{
+   int err = 0;
+   struct net *net = xs_net(x);
+   struct xfrm_offload *xo = xfrm_offload(skb);
+   __u32 oseq = x->replay.oseq;
+
+   if (!xo)
+   return xfrm_replay_overflow(x, skb);
+
+   if (x->type->flags & XFRM_TYPE_REPLAY_PROT) {
+   if (!skb_is_gso(skb)) {
+   XFRM_SKB_CB(skb)->seq.output.low = ++oseq;
+   xo->seq.low = oseq;
+   } else {
+   XFRM_SKB_CB(skb)->seq.output.low = oseq + 1;
+   xo->seq.low = oseq + 1;
+   oseq += skb_shinfo(skb)->gso_segs;
+   }
+
+   XFRM_SKB_CB(skb)->seq.output.hi = 0;
+   xo->seq.hi = 0;
+   if (unlikely(oseq < x->replay.oseq)) {
+   xfrm_audit_state_replay_overflow(x, skb);
+   err = -EOVERFLOW;
+
+   return err;
+   }
+
+   x->replay.oseq = oseq;
+
+   if (xfrm_aevent_is_on(net))
+   x->repl->notify(x, XFRM_REPLAY_UPDATE);
+   }
+
+   return err;
+}
+
+static int xfrm_replay_overflow_offload_bmp(struct xfrm_state *x, struct 
sk_buff *skb)
+{
+   int err = 0;
+   struct xfrm_offload *xo = xfrm_offload(skb);
+   struct xfrm_replay_state_esn *replay_esn = x->replay_esn;
+   struct net *net = xs_net(x);
+   __u32 oseq = replay_esn->oseq;
+
+   if (!xo)
+   return xfrm_replay_overflow_bmp(x, skb);
+
+   if (x->type->flags & XFRM_TYPE_REPLAY_PROT) {
+   if (!skb_is_gso(skb)) {
+   XFRM_SKB_CB(skb)->seq.output.low = ++oseq;
+   xo->seq.low = oseq;
+   } else {
+   XFRM_SKB_CB(skb)->seq.output.low = oseq + 1;
+   xo->seq.low = oseq + 1;
+   oseq += skb_shinfo(skb)->gso_segs;
+   }
+
+   XFRM_SKB_CB(skb)->seq.output.hi = 0;
+   xo->seq.hi = 0;
+   if (unlikely(oseq < replay_esn->oseq)) {
+   xfrm_audit_state_replay_overflow(x, skb);
+   err = -EOVERFLOW;
+
+   return err;
+   } else {
+   replay_esn->oseq = oseq;
+   }
+
+   if (xfrm_aevent_is_on(net))
+   x->repl->notify(x, XFRM_REPLAY_UPDATE);
+   }
+
+   return err;
+}
+
+static int xfrm_replay_overflow_offload_esn(struct xfrm_state *x, struct 
sk_buff *skb)
+{
+   int err = 0;
+   struct xfrm_offload *xo = xfrm_offload(skb);
+   struct xfrm_replay_state_esn *replay_esn = x->replay_esn;
+   struct net *net = xs_net(x);
+   __u32 oseq = replay_esn->oseq;
+   __u32 oseq_hi = replay_esn->oseq_hi;
+
+   if (!xo)
+   return xfrm_replay_overflow_esn(x, skb);
+
+   if (x->type->flags & XFRM_TYPE_REPLAY_PROT) {
+   if (!skb_is_gso(skb)) {
+   XFRM_SKB_CB(skb)->seq.output.low = ++oseq;
+   XFRM_SKB_CB(skb)->seq.output.hi = oseq_hi;
+   xo->seq.low = oseq;
+   xo->seq.hi = oseq_hi;
+   } else {
+   XFRM_SKB_CB(skb)->seq.output.low = oseq + 1;
+   XFRM_SKB_CB(skb)->seq.output.hi = oseq_hi;
+   xo->seq.low = oseq = oseq + 1;
+   xo->seq.hi = oseq_hi;
+   oseq += skb_shinfo(skb)->gso_segs;
+   }
+
+   if (unlikely(oseq < replay_esn->oseq)) {
+   XFRM_SKB_CB(skb)->seq.output.hi = ++oseq_hi;
+   xo->seq.hi = oseq_hi;
+
+   if (replay_esn->oseq_hi == 0) {
+   replay_esn->oseq--;
+   replay_esn->oseq_hi--;
+   xfrm_audit_state_replay_overflow(x, skb);
+   err = -EOVERFLOW;
+
+   return err;
+   }
+   }
+
+ 

[PATCH 14/16] xfrm: Prepare the GRO codepath for hardware offloading.

2017-04-20 Thread Steffen Klassert
On IPsec hardware offloading, we already get a secpath with
valid state attached when the packet enters the GRO handlers.
So check for hardware offload and skip the state lookup in this
case.

Signed-off-by: Steffen Klassert 
---
 net/ipv4/esp4_offload.c | 42 ++---
 net/ipv6/esp6_offload.c | 42 ++---
 net/xfrm/xfrm_input.c   | 50 -
 3 files changed, 71 insertions(+), 63 deletions(-)

diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index 1e39564..f3e33c2 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -43,27 +43,31 @@ static struct sk_buff **esp4_gro_receive(struct sk_buff 
**head,
if ((err = xfrm_parse_spi(skb, IPPROTO_ESP, &spi, &seq)) != 0)
goto out;
 
-   err = secpath_set(skb);
-   if (err)
-   goto out;
-
-   if (skb->sp->len == XFRM_MAX_DEPTH)
-   goto out;
-
-   x = xfrm_state_lookup(dev_net(skb->dev), skb->mark,
- (xfrm_address_t *)&ip_hdr(skb)->daddr,
- spi, IPPROTO_ESP, AF_INET);
-   if (!x)
-   goto out;
-
-   skb->sp->xvec[skb->sp->len++] = x;
-   skb->sp->olen++;
-
xo = xfrm_offload(skb);
-   if (!xo) {
-   xfrm_state_put(x);
-   goto out;
+   if (!xo || !(xo->flags & CRYPTO_DONE)) {
+   err = secpath_set(skb);
+   if (err)
+   goto out;
+
+   if (skb->sp->len == XFRM_MAX_DEPTH)
+   goto out;
+
+   x = xfrm_state_lookup(dev_net(skb->dev), skb->mark,
+ (xfrm_address_t *)&ip_hdr(skb)->daddr,
+ spi, IPPROTO_ESP, AF_INET);
+   if (!x)
+   goto out;
+
+   skb->sp->xvec[skb->sp->len++] = x;
+   skb->sp->olen++;
+
+   xo = xfrm_offload(skb);
+   if (!xo) {
+   xfrm_state_put(x);
+   goto out;
+   }
}
+
xo->flags |= XFRM_GRO;
 
XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip4 = NULL;
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index 06e9721..1cc 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -45,27 +45,31 @@ static struct sk_buff **esp6_gro_receive(struct sk_buff 
**head,
if ((err = xfrm_parse_spi(skb, IPPROTO_ESP, &spi, &seq)) != 0)
goto out;
 
-   err = secpath_set(skb);
-   if (err)
-   goto out;
-
-   if (skb->sp->len == XFRM_MAX_DEPTH)
-   goto out;
-
-   x = xfrm_state_lookup(dev_net(skb->dev), skb->mark,
- (xfrm_address_t *)&ipv6_hdr(skb)->daddr,
- spi, IPPROTO_ESP, AF_INET6);
-   if (!x)
-   goto out;
-
-   skb->sp->xvec[skb->sp->len++] = x;
-   skb->sp->olen++;
-
xo = xfrm_offload(skb);
-   if (!xo) {
-   xfrm_state_put(x);
-   goto out;
+   if (!xo || !(xo->flags & CRYPTO_DONE)) {
+   err = secpath_set(skb);
+   if (err)
+   goto out;
+
+   if (skb->sp->len == XFRM_MAX_DEPTH)
+   goto out;
+
+   x = xfrm_state_lookup(dev_net(skb->dev), skb->mark,
+ (xfrm_address_t *)&ipv6_hdr(skb)->daddr,
+ spi, IPPROTO_ESP, AF_INET6);
+   if (!x)
+   goto out;
+
+   skb->sp->xvec[skb->sp->len++] = x;
+   skb->sp->olen++;
+
+   xo = xfrm_offload(skb);
+   if (!xo) {
+   xfrm_state_put(x);
+   goto out;
+   }
}
+
xo->flags |= XFRM_GRO;
 
XFRM_TUNNEL_SKB_CB(skb)->tunnel.ip6 = NULL;
diff --git a/net/xfrm/xfrm_input.c b/net/xfrm/xfrm_input.c
index 362d655..21c6cc9 100644
--- a/net/xfrm/xfrm_input.c
+++ b/net/xfrm/xfrm_input.c
@@ -223,38 +223,38 @@ int xfrm_input(struct sk_buff *skb, int nexthdr, __be32 
spi, int encap_type)
seq = XFRM_SKB_CB(skb)->seq.input.low;
goto resume;
}
+
/* encap_type < -1 indicates a GRO call. */
encap_type = 0;
seq = XFRM_SPI_SKB_CB(skb)->seq;
-   goto lock;
-   }
-
-   if (xo && (xo->flags & CRYPTO_DONE)) {
-   crypto_done = true;
-   x = xfrm_input_state(skb);
-   family = XFRM_SPI_SKB_CB(skb)->family;
 
-   if (!(xo->status & CRYPTO_SUCCESS)) {
-   if (xo->status &
-   (CRYPTO_TRANSPORT_AH_AUTH_FAILED |
-CRYPTO_TRANSPORT_ESP_AUTH_FAILED |
-C

[PATCH 13/16] xfrm: Add encapsulation header offsets while SKB is not encrypted

2017-04-20 Thread Steffen Klassert
From: Ilan Tayari 

Both esp4 and esp6 used to assume that the SKB payload is encrypted
and therefore the inner_network and inner_transport offsets are
not relevant.
When doing crypto offload in the NIC, this is no longer the case
and the NIC driver needs these offsets so it can do TX TCP checksum
offloading.
This patch sets the inner_network and inner_transport members of
the SKB, as well as encapsulation, to reflect the actual positions
of these headers, and removes them only once encryption is done
on the payload.

Signed-off-by: Ilan Tayari 
Signed-off-by: Steffen Klassert 
---
 net/ipv4/xfrm4_mode_transport.c | 2 ++
 net/ipv4/xfrm4_mode_tunnel.c| 3 +++
 net/ipv6/xfrm6_mode_transport.c | 1 +
 net/ipv6/xfrm6_mode_tunnel.c| 3 +++
 net/xfrm/xfrm_output.c  | 2 ++
 5 files changed, 11 insertions(+)

diff --git a/net/ipv4/xfrm4_mode_transport.c b/net/ipv4/xfrm4_mode_transport.c
index 6c2411d..3d36644 100644
--- a/net/ipv4/xfrm4_mode_transport.c
+++ b/net/ipv4/xfrm4_mode_transport.c
@@ -24,6 +24,8 @@ static int xfrm4_transport_output(struct xfrm_state *x, 
struct sk_buff *skb)
struct iphdr *iph = ip_hdr(skb);
int ihl = iph->ihl * 4;
 
+   skb_set_inner_transport_header(skb, skb_transport_offset(skb));
+
skb_set_network_header(skb, -x->props.header_len);
skb->mac_header = skb->network_header +
  offsetof(struct iphdr, protocol);
diff --git a/net/ipv4/xfrm4_mode_tunnel.c b/net/ipv4/xfrm4_mode_tunnel.c
index d3f2434..e6265e2 100644
--- a/net/ipv4/xfrm4_mode_tunnel.c
+++ b/net/ipv4/xfrm4_mode_tunnel.c
@@ -33,6 +33,9 @@ static int xfrm4_mode_tunnel_output(struct xfrm_state *x, 
struct sk_buff *skb)
struct iphdr *top_iph;
int flags;
 
+   skb_set_inner_network_header(skb, skb_network_offset(skb));
+   skb_set_inner_transport_header(skb, skb_transport_offset(skb));
+
skb_set_network_header(skb, -x->props.header_len);
skb->mac_header = skb->network_header +
  offsetof(struct iphdr, protocol);
diff --git a/net/ipv6/xfrm6_mode_transport.c b/net/ipv6/xfrm6_mode_transport.c
index eb9b36b..7a92c0f 100644
--- a/net/ipv6/xfrm6_mode_transport.c
+++ b/net/ipv6/xfrm6_mode_transport.c
@@ -27,6 +27,7 @@ static int xfrm6_transport_output(struct xfrm_state *x, 
struct sk_buff *skb)
int hdr_len;
 
iph = ipv6_hdr(skb);
+   skb_set_inner_transport_header(skb, skb_transport_offset(skb));
 
hdr_len = x->type->hdr_offset(x, skb, &prevhdr);
skb_set_mac_header(skb, (prevhdr - x->props.header_len) - skb->data);
diff --git a/net/ipv6/xfrm6_mode_tunnel.c b/net/ipv6/xfrm6_mode_tunnel.c
index 19a60fc..02556e3 100644
--- a/net/ipv6/xfrm6_mode_tunnel.c
+++ b/net/ipv6/xfrm6_mode_tunnel.c
@@ -36,6 +36,9 @@ static int xfrm6_mode_tunnel_output(struct xfrm_state *x, 
struct sk_buff *skb)
struct ipv6hdr *top_iph;
int dsfield;
 
+   skb_set_inner_network_header(skb, skb_network_offset(skb));
+   skb_set_inner_transport_header(skb, skb_transport_offset(skb));
+
skb_set_network_header(skb, -x->props.header_len);
skb->mac_header = skb->network_header +
  offsetof(struct ipv6hdr, nexthdr);
diff --git a/net/xfrm/xfrm_output.c b/net/xfrm/xfrm_output.c
index a150886..8c0b672 100644
--- a/net/xfrm/xfrm_output.c
+++ b/net/xfrm/xfrm_output.c
@@ -205,6 +205,7 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
int err;
 
secpath_reset(skb);
+   skb->encapsulation = 0;
 
if (xfrm_dev_offload_ok(skb, x)) {
struct sec_path *sp;
@@ -218,6 +219,7 @@ int xfrm_output(struct sock *sk, struct sk_buff *skb)
if (skb->sp)
secpath_put(skb->sp);
skb->sp = sp;
+   skb->encapsulation = 1;
 
sp->olen++;
sp->xvec[skb->sp->len++] = x;
-- 
2.7.4



[PATCH 11/16] esp: Use a synchronous crypto algorithm on offloading.

2017-04-20 Thread Steffen Klassert
We need a fallback algorithm for crypto offloading to a NIC.
This is because packets can be rerouted to other NICs that
don't support crypto offloading. The fallback is going to be
implemented at layer2 where we know the final output device
but can't handle asynchronous returns from the crypto layer.

Signed-off-by: Steffen Klassert 
---
 net/ipv4/esp4.c | 12 ++--
 net/ipv6/esp6.c | 12 ++--
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 4382f30..7e501ad 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -783,13 +783,17 @@ static int esp_init_aead(struct xfrm_state *x)
char aead_name[CRYPTO_MAX_ALG_NAME];
struct crypto_aead *aead;
int err;
+   u32 mask = 0;
 
err = -ENAMETOOLONG;
if (snprintf(aead_name, CRYPTO_MAX_ALG_NAME, "%s(%s)",
 x->geniv, x->aead->alg_name) >= CRYPTO_MAX_ALG_NAME)
goto error;
 
-   aead = crypto_alloc_aead(aead_name, 0, 0);
+   if (x->xso.offload_handle)
+   mask |= CRYPTO_ALG_ASYNC;
+
+   aead = crypto_alloc_aead(aead_name, 0, mask);
err = PTR_ERR(aead);
if (IS_ERR(aead))
goto error;
@@ -819,6 +823,7 @@ static int esp_init_authenc(struct xfrm_state *x)
char authenc_name[CRYPTO_MAX_ALG_NAME];
unsigned int keylen;
int err;
+   u32 mask = 0;
 
err = -EINVAL;
if (!x->ealg)
@@ -844,7 +849,10 @@ static int esp_init_authenc(struct xfrm_state *x)
goto error;
}
 
-   aead = crypto_alloc_aead(authenc_name, 0, 0);
+   if (x->xso.offload_handle)
+   mask |= CRYPTO_ALG_ASYNC;
+
+   aead = crypto_alloc_aead(authenc_name, 0, mask);
err = PTR_ERR(aead);
if (IS_ERR(aead))
goto error;
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 82d3da8..8b55abf 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -704,13 +704,17 @@ static int esp_init_aead(struct xfrm_state *x)
char aead_name[CRYPTO_MAX_ALG_NAME];
struct crypto_aead *aead;
int err;
+   u32 mask = 0;
 
err = -ENAMETOOLONG;
if (snprintf(aead_name, CRYPTO_MAX_ALG_NAME, "%s(%s)",
 x->geniv, x->aead->alg_name) >= CRYPTO_MAX_ALG_NAME)
goto error;
 
-   aead = crypto_alloc_aead(aead_name, 0, 0);
+   if (x->xso.offload_handle)
+   mask |= CRYPTO_ALG_ASYNC;
+
+   aead = crypto_alloc_aead(aead_name, 0, mask);
err = PTR_ERR(aead);
if (IS_ERR(aead))
goto error;
@@ -740,6 +744,7 @@ static int esp_init_authenc(struct xfrm_state *x)
char authenc_name[CRYPTO_MAX_ALG_NAME];
unsigned int keylen;
int err;
+   u32 mask = 0;
 
err = -EINVAL;
if (!x->ealg)
@@ -765,7 +770,10 @@ static int esp_init_authenc(struct xfrm_state *x)
goto error;
}
 
-   aead = crypto_alloc_aead(authenc_name, 0, 0);
+   if (x->xso.offload_handle)
+   mask |= CRYPTO_ALG_ASYNC;
+
+   aead = crypto_alloc_aead(authenc_name, 0, mask);
err = PTR_ERR(aead);
if (IS_ERR(aead))
goto error;
-- 
2.7.4



Heads-up: two regressions in v4.11-rc series

2017-04-20 Thread Jesper Dangaard Brouer
Hi Linus,

Just wanted to give a heads-up on two regressions in the 4.11-rc series.

(1) page allocator optimization revert

Mel Gorman and I have been playing with optimizing the page allocator,
but Tariq spotted that we caused a regression for (NIC) drivers that
refill DMA RX rings in softirq context.

The end result was a revert, and this is waiting in AKPM's quilt queue:
 
http://ozlabs.org/~akpm/mmots/broken-out/revert-mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch


(2) Busy softirq can cause userspace not to be scheduled

I bisected the problem to a499a5a14dbd ("sched/cputime: Increment
kcpustat directly on irqtime account"). See email thread with
 Subject: Bisected softirq accounting issue in v4.11-rc1~170^2~28
 http://lkml.kernel.org/r/20170328101403.34a82...@redhat.com

I don't know the scheduler code well enough to fix this, and will have
to rely on others to figure out this scheduler regression.

To make it clear: I'm only seeing this scheduler regression when a
remote host is sending many, many network packets towards the kernel,
which keeps NAPI/softirq busy all the time.  A possible hint: the tool
"top" only shows this in the "si" column, while on v4.10 "top" also blames
"ksoftirqd/N"; plus, the cputime (0:00) that "ps" reports for ksoftirqd seems wrong.


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH 1/1] drivers: net: usb: qmi_wwan: add QMI_QUIRK_SET_DTR for Telit PID 0x1201

2017-04-20 Thread Bjørn Mork
Aleksander Morgado  writes:

> I'm just running --dms-get-operating-mode multiple times, and getting
> errors frequently:

Could you retry that with cdc-wdm debugging enabled (e.g.
"echo 'module cdc_wdm +fp' > /sys/kernel/debug/dynamic_debug/control")
and something like this:


diff --git a/drivers/usb/class/cdc-wdm.c b/drivers/usb/class/cdc-wdm.c
index 8fda45a45bd3..c0f25ace9fc6 100644
--- a/drivers/usb/class/cdc-wdm.c
+++ b/drivers/usb/class/cdc-wdm.c
@@ -211,6 +211,8 @@ static void wdm_in_callback(struct urb *urb)
}
}
 
+   dev_dbg(&desc->intf->dev, "rerr=%d, status=%d, length=%d, 
resp_count=%d\n", desc->rerr, status, length, desc->resp_count);
+
/*
 * Handling devices with the WDM_DRAIN_ON_OPEN flag set:
 * If desc->resp_count is unset, then the urb was submitted
@@ -516,6 +518,9 @@ static ssize_t wdm_read
return -ERESTARTSYS;
 
cntr = ACCESS_ONCE(desc->length);
+
+   dev_dbg(&desc->intf->dev, "rerr=%d, length=%d\n", desc->rerr, cntr);
+
if (cntr == 0) {
desc->read = 0;
 retry:



And/or a usbmon dump?  That would probably show us the sequence of
events leading to this problem.

Desperately trying to figure out some other way to solve the issue that
commit 833415a3e781 ("cdc-wdm: fix "out-of-sync" due to missing
notifications") was supposed to fix.



Bjørn




[PATCH 1/1] bonding: use 'M' or 'G' based on the speed

2017-04-20 Thread Zhu Yanjun
If the speed of the slave netdev is more than 1000M,
it is better to use 'G' instead of 'M'.

Signed-off-by: Zhu Yanjun 
---
 drivers/net/bonding/bond_main.c   | 8 ++--
 drivers/net/bonding/bond_procfs.c | 6 +-
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index af9f0ce..1aad13d 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2147,9 +2147,13 @@ static void bond_miimon_commit(struct bonding *bond)
bond_set_backup_slave(slave);
}
 
-   netdev_info(bond->dev, "link status definitely up for 
interface %s, %u Mbps %s duplex\n",
+   netdev_info(bond->dev, "link status definitely up for 
interface %s, %u%sbps %s duplex\n",
slave->dev->name,
-   slave->speed == SPEED_UNKNOWN ? 0 : 
slave->speed,
+   slave->speed == SPEED_UNKNOWN ? 0 :
+   (slave->speed > 1000 ?
+   slave->speed / 1000 : slave->speed),
+   slave->speed > 1000 ?
+   slave->speed % 1000 ? ".5 G" : " G" : " M",
slave->duplex ? "full" : "half");
 
/* notify ad that the link status has changed */
diff --git a/drivers/net/bonding/bond_procfs.c 
b/drivers/net/bonding/bond_procfs.c
index f514fe5..4c31055 100644
--- a/drivers/net/bonding/bond_procfs.c
+++ b/drivers/net/bonding/bond_procfs.c
@@ -173,7 +173,11 @@ static void bond_info_show_slave(struct seq_file *seq,
if (slave->speed == SPEED_UNKNOWN)
seq_printf(seq, "Speed: %s\n", "Unknown");
else
-   seq_printf(seq, "Speed: %d Mbps\n", slave->speed);
+   seq_printf(seq, "Speed: %d%sbps\n",
+  slave->speed > 1000 ?
+  slave->speed / 1000 : slave->speed,
+  slave->speed > 1000 ?
+  (slave->speed % 1000 ? ".5 G" : " G") : " M");
 
if (slave->duplex == DUPLEX_UNKNOWN)
seq_printf(seq, "Duplex: %s\n", "Unknown");
-- 
2.7.4
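As a quick sanity check of the proposed format strings, here is a small user-space sketch (fmt_speed() is a hypothetical helper, not part of the patch, and SPEED_UNKNOWN is assumed to be -1 as in linux/ethtool.h):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SPEED_UNKNOWN -1	/* assumed value, matching linux/ethtool.h */

/* Illustrative stand-in for the formatting logic in this patch. */
void fmt_speed(char *buf, size_t len, int speed)
{
	if (speed == SPEED_UNKNOWN) {
		snprintf(buf, len, "Speed: Unknown");
		return;
	}
	/* same ternary chain as the patch: scale to G above 1000 Mbps,
	 * and print ".5" when a sub-gigabit remainder is present */
	snprintf(buf, len, "Speed: %d%sbps",
		 speed > 1000 ? speed / 1000 : speed,
		 speed > 1000 ? (speed % 1000 ? ".5 G" : " G") : " M");
}
```

For 100 this renders "Speed: 100 Mbps", for 10000 "Speed: 10 Gbps", and for 2500 "Speed: 2.5 Gbps". One caveat in the patch's logic: the ".5" suffix is emitted for any non-zero remainder, so a hypothetical 2300 Mbps link would also be shown as "2.5 Gbps".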



Re: [PATCH 11/16] esp: Use a synchronous crypto algorithm on offloading.

2017-04-20 Thread Herbert Xu
On Thu, Apr 20, 2017 at 10:55:10AM +0200, Steffen Klassert wrote:
> We need a fallback algorithm for crypto offloading to a NIC.
> This is because packets can be rerouted to other NICs that
> don't support crypto offloading. The fallback is going to be
> implemented at layer2 where we know the final output device
> but can't handle asynchronous returns from the crypto layer.

Can you explain why we can't handle async algorithms on the fallback
path?

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [Patch net-next v4 1/2] net_sched: move the empty tp check from ->destroy() to ->delete()

2017-04-20 Thread Daniel Borkmann

On 04/19/2017 11:21 PM, Cong Wang wrote:

We could have a race condition where in ->classify() path we
dereference tp->root and meanwhile a parallel ->destroy() makes it
a NULL. Daniel cured this bug in commit d936377414fa
("net, sched: respect rcu grace period on cls destruction").

This happens when ->destroy() is called for deleting a filter, to
check if we are the last one in tp; this tp is still linked and
visible at that time. The root cause of this problem is the semantics
of ->destroy(): it does two things (for the non-force case):

1) check if tp is empty
2) if tp is empty we could really destroy it

and its caller, if it cares, needs to check its return value to see if it
is really destroyed. Therefore we can't unlink tp unless we know it is
empty.

As suggested by Daniel, we could actually move the test logic to ->delete()
so that we can safely unlink tp after ->delete() tells us the last one is
just deleted and before ->destroy().

Fixes: 1e052be69d04 ("net_sched: destroy proto tp when all filters are gone")
Cc: Roi Dayan 
Cc: Daniel Borkmann 
Cc: John Fastabend 
Cc: Jamal Hadi Salim 
Signed-off-by: Cong Wang 


Acked-by: Daniel Borkmann 

Thanks!


Re: [PATCH 11/16] esp: Use a synchronous crypto algorithm on offloading.

2017-04-20 Thread Steffen Klassert
On Thu, Apr 20, 2017 at 05:06:17PM +0800, Herbert Xu wrote:
> On Thu, Apr 20, 2017 at 10:55:10AM +0200, Steffen Klassert wrote:
> > We need a fallback algorithm for crypto offloading to a NIC.
> > This is because packets can be rerouted to other NICs that
> > don't support crypto offloading. The fallback is going to be
> > implemented at layer2 where we know the final output device
> > but can't handle asynchronous returns from the crypto layer.
> 
> Can you explain why we can't handle async algorithms on the fallback
> path?

I tried to use async algorithms, but it led to several problems.
The GSO layer can't handle async returns, we'd need callbacks
for all the GSO handlers. Also we need something where we can
requeue packets if the driver is busy etc.

This would require a lot of changes in the generic code,
so I decided to use a synchronous fallback algorithm for
now.


Re: [PATCH] net: phy: fix auto-negotiation stall due to unavailable interrupt

2017-04-20 Thread Roger Quadros
Hi,

On 19/04/17 19:53, Alexander Kochetkov wrote:
> 
>> On 19 Apr 2017, at 19:32, Florian Fainelli wrote:
>>
>> http://patchwork.ozlabs.org/patch/743773/
>>
>> Roger can you also test Alexander's patch?

This patch fixes my problem and doesn't have the 1 second delay that my patch 
had.

So,

Tested-by: Roger Quadros 

> 
> If the MAC uses phy_start_aneg() instead of phy_start(), my patch will not
> work as expected. Roger, if the patch doesn't work for you, please check
> whether the MAC brings up the PHY using
> phy_start():

Our MAC driver is using phy_start() and I didn't face any issues with your 
patch.

> 
> http://patchwork.ozlabs.org/patch/752308/
> 
> Is it correct to start the PHY inside MAC probe using phy_start_aneg()? Or 
> must phy_start() be used?
> 
> And probably this tags should be added for my patch:
> Fixes: 3c293f4e08b5 ("net: phy: Trigger state machine on state change and not 
> polling.")
> Cc: stable  # v4.9+
> 
> Because I bisected to commit 529ed1275263 ("net: phy: phy drivers
> should not set SUPPORTED_[Asym_]Pause"), which looks pretty good.
> 
> Also, there is another issue I found: link_timeout doesn't work for an
> interrupt-driven PHY.
> It is possible to implement a timer to handle this case.
> Florian, what do you think? Should this be fixed?
> 
> Alexander.
> 

cheers,
-roger


[PATCH] ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled

2017-04-20 Thread Paolo Abeni
When creating a new ipvs service, ipv6 addresses are always accepted
if CONFIG_IP_VS_IPV6 is enabled. On dest creation the address family
is not explicitly checked.

This allows user space to configure ipvs services even if the
system is booted with ipv6.disable=1. In specific configurations, ipvs
can try to call ipv6 routing code at setup time, causing the kernel to
oops due to fib6_rules_ops being NULL.

This change addresses the issue adding a check for the ipv6
module being enabled while validating ipv6 service operations and
adding the same validation for dest operations.

According to git history, this issue is apparently present since
the introduction of ipv6 support, and the oops can be triggered
since commit 09571c7ae30865ad ("IPVS: Add function to determine
if IPv6 address is local")

Fixes: 09571c7ae30865ad ("IPVS: Add function to determine if IPv6 address is 
local")
Signed-off-by: Paolo Abeni 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 5aeb0dd..4d753be 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -3078,6 +3078,17 @@ static int ip_vs_genl_dump_services(struct sk_buff *skb,
return skb->len;
 }
 
+static bool ip_vs_is_af_valid(int af)
+{
+   if (af == AF_INET)
+   return true;
+#ifdef CONFIG_IP_VS_IPV6
+   if (af == AF_INET6 && ipv6_mod_enabled())
+   return true;
+#endif
+   return false;
+}
+
 static int ip_vs_genl_parse_service(struct netns_ipvs *ipvs,
struct ip_vs_service_user_kern *usvc,
struct nlattr *nla, int full_entry,
@@ -3104,11 +3115,7 @@ static int ip_vs_genl_parse_service(struct netns_ipvs 
*ipvs,
memset(usvc, 0, sizeof(*usvc));
 
usvc->af = nla_get_u16(nla_af);
-#ifdef CONFIG_IP_VS_IPV6
-   if (usvc->af != AF_INET && usvc->af != AF_INET6)
-#else
-   if (usvc->af != AF_INET)
-#endif
+   if (!ip_vs_is_af_valid(usvc->af))
return -EAFNOSUPPORT;
 
if (nla_fwmark) {
@@ -3610,6 +3617,11 @@ static int ip_vs_genl_set_cmd(struct sk_buff *skb, 
struct genl_info *info)
if (udest.af == 0)
udest.af = svc->af;
 
+   if (!ip_vs_is_af_valid(udest.af)) {
+   ret = -EAFNOSUPPORT;
+   goto out;
+   }
+
if (udest.af != svc->af && cmd != IPVS_CMD_DEL_DEST) {
/* The synchronization protocol is incompatible
 * with mixed family services
-- 
2.9.3



Re: [PATCH net] xfrm: calculate L4 checksums also for GSO case before encrypting packets

2017-04-20 Thread Steffen Klassert
On Tue, Apr 18, 2017 at 07:10:03PM -0700, Ansis Atteka wrote:
> 
> However, after taking pointers from your patch I came up with this one
> that may solve this problem once and for all (note that I was seeing
> this bug only with ixgbe NIC that supports tx csum offloads). I hope
> it does not break any other IPsec tests that you have.
> 
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index b2be1d9..7812501 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -29,6 +29,7 @@ static struct sk_buff
> *__skb_udp_tunnel_segment(struct sk_buff *skb,
> u16 mac_len = skb->mac_len;
> int udp_offset, outer_hlen;
> __wsum partial;
> +   bool need_ipsec;
> 
> if (unlikely(!pskb_may_pull(skb, tnl_hlen)))
> goto out;
> @@ -62,8 +63,10 @@ static struct sk_buff
> *__skb_udp_tunnel_segment(struct sk_buff *skb,
> 
> ufo = !!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP);
> 
> +   need_ipsec = skb_dst(skb) && dst_xfrm(skb_dst(skb));
> /* Try to offload checksum if possible */
> offload_csum = !!(need_csum &&
> + !need_ipsec &&
>   (skb->dev->features &
>(is_ipv6 ? (NETIF_F_HW_CSUM | NETIF_F_IPV6_CSUM) :
>   (NETIF_F_HW_CSUM | NETIF_F_IP_CSUM;

This looks good, but we should fix udp4_ufo_fragment() too.

Thanks!


Re: [PATCH 11/16] esp: Use a synchronous crypto algorithm on offloading.

2017-04-20 Thread Herbert Xu
On Thu, Apr 20, 2017 at 11:17:52AM +0200, Steffen Klassert wrote:
>
> I tried to use async algorithms, but it led to several problems.
> The GSO layer can't handle async returns, we'd need callbacks
> for all the GSO handlers. Also we need something where we can
> requeue packets if the driver is busy etc.

Why would we need to requeue? As it is if you get an EBUSY on
an IPsec packet it's simply dropped.

The main AES implementation on x86 is now async so I think it's
pretty important that we support it out-of-the-box.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


RE: [net-next 04/14] i40e: dump VF information in debugfs

2017-04-20 Thread Mintz, Yuval
> > Dump some internal state about VFs through debugfs. This provides
> > information not available with 'ip link show'.
> 
> such as?
> 
> donnwantobethedebugfspolice, but still, in the 2-3 times we tried to push
> debugfs to MLNX NIC drivers, Dave disallowed that, and lately the switch
> team even went further and deleted that portion of the mlxsw driver -- all in
> all, I don't see much point for these types of changes, thoughts?

Don't want to hijack your thread, but continuing this topic -
Is there some flat-out disapproval for debugfs in net-next now?

We're currently internally engaged with adding qed support for register dumps
[~equivalents for `ethtool -d' outputs] through debugfs, on behalf of storage
drivers [qedi/qedf] lacking the API for doing that.



[PATCH] usb: plusb: Add support for PL-27A1

2017-04-20 Thread Roman Spychała
From: Roman Spychała 

This patch adds support for the PL-27A1 by adding the appropriate
USB ID's. This chip is used in the goobay Active USB 3.0 Data Link
and Unitek Y-3501 cables.

Signed-off-by: Roman Spychała 
---
 drivers/net/usb/Kconfig |  2 +-
 drivers/net/usb/plusb.c | 15 +--
 2 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/drivers/net/usb/Kconfig b/drivers/net/usb/Kconfig
index 3dd490f53e48..f28bd74ac275 100644
--- a/drivers/net/usb/Kconfig
+++ b/drivers/net/usb/Kconfig
@@ -369,7 +369,7 @@ config USB_NET_NET1080
  optionally with LEDs that indicate traffic
 
 config USB_NET_PLUSB
-   tristate "Prolific PL-2301/2302/25A1 based cables"
+   tristate "Prolific PL-2301/2302/25A1/27A1 based cables"
# if the handshake/init/reset problems, from original 'plusb',
# are ever resolved ... then remove "experimental"
depends on USB_USBNET
diff --git a/drivers/net/usb/plusb.c b/drivers/net/usb/plusb.c
index 22e1a9a99a7d..6fe59373cba9 100644
--- a/drivers/net/usb/plusb.c
+++ b/drivers/net/usb/plusb.c
@@ -102,7 +102,7 @@ static int pl_reset(struct usbnet *dev)
 }
 
 static const struct driver_infoprolific_info = {
-   .description =  "Prolific PL-2301/PL-2302/PL-25A1",
+   .description =  "Prolific PL-2301/PL-2302/PL-25A1/PL-27A1",
.flags =FLAG_POINTTOPOINT | FLAG_NO_SETINT,
/* some PL-2302 versions seem to fail usb_set_interface() */
.reset =pl_reset,
@@ -139,6 +139,17 @@ static const struct usb_device_id  products [] = {
 * Host-to-Host Cable
 */
.driver_info =  (unsigned long) &prolific_info,
+
+},
+
+/* super speed cables */
+{
+   USB_DEVICE(0x067b, 0x27a1), /* PL-27A1, no eeprom
+* also: goobay Active USB 3.0
+* Data Link,
+* Unitek Y-3501
+*/
+   .driver_info =  (unsigned long) &prolific_info,
 },
 
{ },// END
@@ -158,5 +169,5 @@ static struct usb_driver plusb_driver = {
 module_usb_driver(plusb_driver);
 
 MODULE_AUTHOR("David Brownell");
-MODULE_DESCRIPTION("Prolific PL-2301/2302/25A1 USB Host to Host Link Driver");
+MODULE_DESCRIPTION("Prolific PL-2301/2302/25A1/27A1 USB Host to Host Link 
Driver");
 MODULE_LICENSE("GPL");
-- 
2.12.2



Re: [PATCH net] Fix net/hsr/hsr_device to check for freed skb buffer.

2017-04-20 Thread Sergei Shtylyov

Hello!

On 4/20/2017 10:28 AM, Peter Heise wrote:


Fixed an unchecked call to skb_put_padto(). The return value was ignored
before; however, skb_put_padto() frees the skb in case of error.

As reported by Dan Carpenter on kernel-janitors.

Signed-off-by: Peter Heise 
---
 net/hsr/hsr_device.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/hsr/hsr_device.c b/net/hsr/hsr_device.c
index c73160fb11e7..22d693f213be 100644
--- a/net/hsr/hsr_device.c
+++ b/net/hsr/hsr_device.c
@@ -314,7 +314,8 @@ static void send_hsr_supervision_frame(struct hsr_port 
*master,
hsr_sp = (typeof(hsr_sp)) skb_put(skb, sizeof(struct hsr_sup_payload));
ether_addr_copy(hsr_sp->MacAddressA, master->dev->dev_addr);

-   skb_put_padto(skb, ETH_ZLEN + HSR_HLEN);
+   if(skb_put_padto(skb, ETH_ZLEN + HSR_HLEN))


   Need a space after *if*. Please run your patches through scripts/checkpatch.pl 
before sending.



+   return;

hsr_forward_skb(skb, master);
return;


MBR, Sergei



Re: [PATCH block-tree] net: off by one in inet6_pton()

2017-04-20 Thread Sagi Grimberg

Thanks Dan,

Reviewed-by: Sagi Grimberg 


[PATCH] net: qrtr: potential use after free in qrtr_sendmsg()

2017-04-20 Thread Dan Carpenter
If skb_pad() fails then it frees the skb so we should check for errors.

Fixes: bdabad3e363d ("net: Add Qualcomm IPC router")
Signed-off-by: Dan Carpenter 

diff --git a/net/qrtr/qrtr.c b/net/qrtr/qrtr.c
index c36b0ec364a4..a9a8c7d5a4a9 100644
--- a/net/qrtr/qrtr.c
+++ b/net/qrtr/qrtr.c
@@ -658,7 +658,9 @@ static int qrtr_sendmsg(struct socket *sock, struct msghdr 
*msg, size_t len)
}
 
if (plen != len) {
-   skb_pad(skb, plen - len);
+   rc = skb_pad(skb, plen - len);
+   if (rc)
+   goto out_node;
skb_put(skb, plen - len);
}
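The hazard this patch closes can be illustrated outside the kernel with a hypothetical pad_buf() that mimics skb_pad()'s consume-on-error contract (everything below is an illustrative stand-in, not kernel code):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct buf {
	unsigned char *data;
	size_t len;
};

/* Hypothetical stand-in for skb_pad(): zero-pads the buffer, but on
 * allocation failure it frees the whole buffer and returns nonzero,
 * mirroring skb_pad()'s contract that the skb is consumed on error. */
int pad_buf(struct buf *b, size_t pad)
{
	unsigned char *p = realloc(b->data, b->len + pad);

	if (!p) {
		free(b->data);	/* buffer is consumed: caller must not touch b again */
		free(b);
		return -1;
	}
	memset(p + b->len, 0, pad);
	b->data = p;
	b->len += pad;
	return 0;
}
```

Ignoring pad_buf()'s return value and then touching b again, as the old code did with skb_put() after a failed skb_pad(), would be a use-after-free; checking the return and bailing out, as the patch does with `goto out_node`, is the safe pattern.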
 


Re: [PATCH v2 net 2/2] net sched actions: decrement module refcount earlier

2017-04-20 Thread Jamal Hadi Salim

On 17-04-19 01:25 PM, Cong Wang wrote:

On Wed, Apr 19, 2017 at 4:32 AM, Jamal Hadi Salim  wrote:

This solves one issue, but I am afraid the issue Cong mentioned is a
possibility still.
Let's say the user did a replace and tried to also replace the cookie
in that transaction. The init() succeeds but the cookie allocation
fails. To be correct we'll have to undo the replace, i.e. something
like uninit() which will restore the old values.
This is very complex and unnecessary.


It is not complex once we move to RCU completely, replacement
is merely a pointer assignment, rollback is same.



Nice to hear.


It is necessary too according to the rules of RCU, as I said before, the
current code is broken, we can't modify an existing action with RCU,
we have to make a copy. I do have a plan to make actions truly RCU,
but I have to redesign the action API's first.


Ok ;->
Lucas: Lets work to get those test suites in.

cheers,
jamal


Re: [PATCH v2 net 2/2] net sched actions: decrement module refcount earlier

2017-04-20 Thread Jamal Hadi Salim

On 17-04-19 11:03 AM, Wolfgang Bumiller wrote:

On April 19, 2017 at 1:32 PM Jamal Hadi Salim  wrote:


On 17-04-19 04:09 AM, Wolfgang Bumiller wrote:

This solves one issue, but I am afraid the issue Cong mentioned is a
possibility still.
Let's say the user did a replace and tried to also replace the cookie
in that transaction. The init() succeeds but the cookie allocation
fails. To be correct we'll have to undo the replace, i.e. something
like uninit() which will restore the old values.
This is very complex and unnecessary.

My suggestion:
If we move the cookie allocation before init we can save it and
only when init succeeds do we attach it to the action, otherwise
we free it on the error path.


Shouldn't the old code have freed an old a->act_cookie as well before
replacing it then? nla_memdup_cookie() starts off by assigning a->act_cookie.

(I've been running this loop for a while now:
# while : ; do tc actions change action ok index 500 cookie $i; let i++; done
and memory usage *seems* to be growing faster with the loop running - still
slow, but visible. (I stopped most but not all background processes in this
VM))

I don't see the growth with the change below (replacing both patches).
(although given the freeing of the old act_cookie pointer I wonder if
this needs additional locking?)



I think we are safe. The cookie should not be touched by any datapath
code.

Acked-by: Jamal Hadi Salim 

cheers,
jamal


Re: [PATCH v4 3/3] VSOCK: Add virtio vsock vsockmon hooks

2017-04-20 Thread Stefan Hajnoczi
On Thu, Apr 13, 2017 at 09:47:08PM +0300, Michael S. Tsirkin wrote:
> On Thu, Apr 13, 2017 at 05:18:11PM +0100, Stefan Hajnoczi wrote:
> > diff --git a/net/vmw_vsock/virtio_transport_common.c 
> > b/net/vmw_vsock/virtio_transport_common.c
> > index af087b4..aae60c1 100644
> > --- a/net/vmw_vsock/virtio_transport_common.c
> > +++ b/net/vmw_vsock/virtio_transport_common.c
> > @@ -16,6 +16,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  
> >  #include 
> >  #include 
> > @@ -85,6 +86,63 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info 
> > *info,
> > return NULL;
> >  }
> >  
> > +/* Packet capture */
> > +void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt)
> > +{
> > +   struct sk_buff *skb;
> > +   struct af_vsockmon_hdr *hdr;
> > +   unsigned char *t_hdr, *payload;
> > +
> > +   skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + pkt->len,
> > +   GFP_ATOMIC);
> > +   if (!skb)
> > +   return; /* nevermind if we cannot capture the packet */
> > +
> > +   hdr = (struct af_vsockmon_hdr *)skb_put(skb, sizeof(*hdr));
> > +
> > +   /* pkt->hdr is little-endian so no need to byteswap here */
> 
> Comment does not seem to make sense. Drop it?

All fields in pkt->hdr are little-endian.  All fields in the
af_vsockmon_hdr are little-endian too.

Normally code operating on either of these structs byteswaps the fields.
Therefore it seemed worth a comment to clarify that it's okay to omit
byteswaps.  The comment will help anyone auditing the code for
endianness bugs.

Why do you say it doesn't make sense?

> > +   hdr->src_cid = pkt->hdr.src_cid;
> > +   hdr->src_port = pkt->hdr.src_port;
> > +   hdr->dst_cid = pkt->hdr.dst_cid;
> > +   hdr->dst_port = pkt->hdr.dst_port;
> > +
> > +   hdr->transport = cpu_to_le16(AF_VSOCK_TRANSPORT_VIRTIO);
> > +   hdr->len = cpu_to_le16(sizeof(pkt->hdr));
> > +   hdr->reserved[0] = hdr->reserved[1] = 0;
> > +
> > +   switch(cpu_to_le16(pkt->hdr.op)) {
> 
> I'd expect this to be le16_to_cpu.
> Does this pass check build?

You are right, make C=2 warns about this and it should be le16_to_cpu().
I'll run check builds from now on.


signature.asc
Description: PGP signature


Re: [PATCH net-next v4 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-20 Thread Jamal Hadi Salim

On 17-04-19 12:17 PM, Jiri Pirko wrote:

Wed, Apr 19, 2017 at 06:07:48PM CEST, j...@mojatatu.com wrote:

On 17-04-19 11:54 AM, Jiri Pirko wrote:

Wed, Apr 19, 2017 at 05:37:15PM CEST, j...@mojatatu.com wrote:

On 17-04-19 09:13 AM, Jiri Pirko wrote:

Wed, Apr 19, 2017 at 03:03:59PM CEST, j...@mojatatu.com wrote:

On 17-04-19 08:36 AM, Jiri Pirko wrote:

Wed, Apr 19, 2017 at 01:57:29PM CEST, j...@mojatatu.com wrote:

From: Jamal Hadi Salim 



include/uapi/linux/rtnetlink.h | 21 +++--
net/sched/act_api.c| 43 --
3 files changed, 53 insertions(+), 12 deletions(-)



+#define TCAA_MAX (__TCAA_MAX - 1)
#define TA_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct 
tcamsg
#define TA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct tcamsg))
-#define TCA_ACT_TAB 1 /* attr type must be >=1 */   
-#define TCAA_MAX 1
+#define TCA_ACT_TAB TCAA_ACT_TAB


This is a mess. What does "TCAA" stand for?


TC Actions Attributes.  What would you call it? I could have
called it TCA_ROOT etc. But maybe a comment to just call it
TC Actions Attributes would be enough?


TCA_DUMP_X

it is only for dumping. Naming it "attribute" seems weird. Same as if
you have: int variable_something;



Jiri, this is not just for dumping. We are describing high level
attributes for tc actions.


This is already present:
enum {
TCA_ACT_UNSPEC,
TCA_ACT_KIND,
TCA_ACT_OPTIONS,
TCA_ACT_INDEX,
TCA_ACT_STATS,
TCA_ACT_PAD,
TCA_ACT_COOKIE,
__TCA_ACT_MAX
};

This is nested inside the dump message (TCAA_MAX, TCA_ACT_TAB)

Looks like you are mixing these 2.



No - That space is per action. The space I am defining is
above that in the hierarchy. There used to be only
one attribute there (TCA_ACT_TAB) and now we are making
it more generic.


Right. That's what I say. And that higher level is used only for
dumping. That's why I suggested TCA_ACT_DUMP prefix.



DUMP is not right. _TAB is used for SET/DEL as well.
Why don't we leave this alone so we can progress?
You can submit changes later if it bothers you still.


It is a _lot_ of code to change! Note:
This is all the UAPI visible code (the same coding style for 20 years).
I am worried about this part.


We'll see. Let's do it in a sensible way, in steps. But for new things,
I think it is good not to stick with old and outlived habits.




I know you have the muscle to get it done - so fine, I will start
with this one change.



Netlink is TLV, should be used as TLV. I don't see how you can run out
any space. You tend to use Netlink in some weird hybrid mode, with only
argument being space. I think that couple of bytes wasted is not
a problem at all...



You are not making sense to me still.
What you describe as "a couple of bytes" adds up when you have
a large number of objects. I am trying to cut down on data back
and forth from user/kernel and a bitmap is a well understood entity.


The attributes in question are per-message, not per-object




Even if I did use a TLV - when I am representing flags which require one
bit each - it makes sense to use a bitmask encapsulated in a TLV, not
to waste a TLV per bit.


That is the only correct way. For example, in future the kernel may
report in extended ack that it does not support some specific attribute
user passed. If you pack it all in one attr, that would not be possible.
Also, what prevents user from passing random data on bit flag positions
that you are not using? Current kernel would ignore it. Next kernel will
do something different according to the flag bit. That is UAPI break.
Essentially the same thing DaveM said about the struct. You have to
define this completely, at the beginning, not possible to use the unused
bits for other things in the future.




They are not the same issue, Jiri. We have used bitmasks fine on netlink
messages for millennia. Nobody sets garbage on a bitmask they are not
supposed to touch. The struct padding thing is a shame the way it
turned out - now netlink can no longer have a claim to be a (good)
wire protocol.
Other thing: I know you feel strongly about this, but I don't agree that
everything has to be a TLV, and in no way, as a matter of principle, are
you going to convince me (even when the cows come home) that I have to
use 64 bits to carry a single bit just so I can use TLVs.

cheers,
jamal



Re: [PATCH 11/16] esp: Use a synchronous crypto algorithm on offloading.

2017-04-20 Thread Steffen Klassert
On Thu, Apr 20, 2017 at 05:52:35PM +0800, Herbert Xu wrote:
> On Thu, Apr 20, 2017 at 11:17:52AM +0200, Steffen Klassert wrote:
> >
> > I tried to use async algorithms but it led to several problems.
> > The GSO layer can't handle async returns, we'd need callbacks
> > for all the GSO handlers. Also we need something where we can
> > requeue packets if the driver is busy etc.
> 
> Why would we need to requeue? As it is if you get an EBUSY on
> an IPsec packet it's simply dropped.

Yes, we could do this, but the GSO problem remains.

We discussed this last year at netdevconf but could not come
up with an acceptable solution.

> 
> The main AES implementation on x86 is now async so I think it's
> pretty important that we support it out-of-the-box.

For now this is just a fallback to make hardware offloading
possible at all, so this is a slowpath anyway. Allowing async
algorithms can (and should) be done in a second step once we
find a not-too-intrusive solution.


[PATCH v1] net: phy: fix auto-negotiation stall due to unavailable interrupt

2017-04-20 Thread Alexander Kochetkov
Hello Florian, Roger and Madalin!

This is a slightly modified version of the previous patch[1].

It takes into account that phy_start_aneg() may be called from
different places, not only from the PHY state machine. So the patch
uses phy_trigger_machine() to schedule the link state update
correctly.

I borrowed some text from Roger's commit message.

Added a 'Cc: stable ' tag
because the problem exists on the 4.9 and 4.10 branches.

I don't know which commit broke the code, so I didn't add a
Fixes tag.

Roger, could you please test this one patch?

Madalin, I found that you tested Roger's patch[2], and this
patch will go upstream instead of patch[2]. Could you
please test it?

Regards,
Alexander.

[1] http://patchwork.ozlabs.org/patch/743773/
[2] https://lkml.org/lkml/2017/3/30/517

Alexander Kochetkov (1):
  net: phy: fix auto-negotiation stall due to unavailable interrupt

 drivers/net/phy/phy.c |   40 
 include/linux/phy.h   |1 +
 2 files changed, 37 insertions(+), 4 deletions(-)

-- 
1.7.9.5



[PATCH v1] net: phy: fix auto-negotiation stall due to unavailable interrupt

2017-04-20 Thread Alexander Kochetkov
The Ethernet link on an interrupt-driven PHY was not coming up if the Ethernet
cable was plugged in before the Ethernet interface was brought up.

The patch triggers the PHY state machine to update the link state if the PHY
was requested to do auto-negotiation and the auto-negotiation complete flag is
already set.

During the power-up cycle the PHY does auto-negotiation, generates an interrupt
and sets the auto-negotiation complete flag. The interrupt is handled by the
PHY state machine but doesn't update the link state, because the PHY is in the
PHY_READY state. Some time later the MAC is brought up, started, and requests
the PHY to do auto-negotiation. If there are no new settings to advertise,
genphy_config_aneg() doesn't start PHY auto-negotiation. The PHY stays in the
auto-negotiation complete state and doesn't fire an interrupt. Meanwhile the
PHY state machine expects that the PHY has started auto-negotiation and waits
for an interrupt from the PHY, which it will never get.

Signed-off-by: Alexander Kochetkov 
Cc: stable  # v4.9+
---
 drivers/net/phy/phy.c |   40 
 include/linux/phy.h   |1 +
 2 files changed, 37 insertions(+), 4 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 7cc1b7d..2d9975b 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -591,16 +591,18 @@ int phy_mii_ioctl(struct phy_device *phydev, struct ifreq 
*ifr, int cmd)
 EXPORT_SYMBOL(phy_mii_ioctl);
 
 /**
- * phy_start_aneg - start auto-negotiation for this PHY device
+ * phy_start_aneg_priv - start auto-negotiation for this PHY device
  * @phydev: the phy_device struct
+ * @sync: indicate whether we should wait for the workqueue cancelation
  *
  * Description: Sanitizes the settings (if we're not autonegotiating
  *   them), and then calls the driver's config_aneg function.
  *   If the PHYCONTROL Layer is operating, we change the state to
  *   reflect the beginning of Auto-negotiation or forcing.
  */
-int phy_start_aneg(struct phy_device *phydev)
+static int phy_start_aneg_priv(struct phy_device *phydev, bool sync)
 {
+   bool trigger = false;
int err;
 
mutex_lock(&phydev->lock);
@@ -625,10 +627,40 @@ int phy_start_aneg(struct phy_device *phydev)
}
}
 
+   /* Re-schedule a PHY state machine to check PHY status because
+* negotiation may already be done and aneg interrupt may not be
+* generated.
+*/
+   if (phy_interrupt_is_valid(phydev) && (phydev->state == PHY_AN)) {
+   err = phy_aneg_done(phydev);
+   if (err > 0) {
+   trigger = true;
+   err = 0;
+   }
+   }
+
 out_unlock:
mutex_unlock(&phydev->lock);
+
+   if (trigger)
+   phy_trigger_machine(phydev, sync);
+
return err;
 }
+
+/**
+ * phy_start_aneg - start auto-negotiation for this PHY device
+ * @phydev: the phy_device struct
+ *
+ * Description: Sanitizes the settings (if we're not autonegotiating
+ *   them), and then calls the driver's config_aneg function.
+ *   If the PHYCONTROL Layer is operating, we change the state to
+ *   reflect the beginning of Auto-negotiation or forcing.
+ */
+int phy_start_aneg(struct phy_device *phydev)
+{
+   return phy_start_aneg_priv(phydev, true);
+}
 EXPORT_SYMBOL(phy_start_aneg);
 
 /**
@@ -656,7 +688,7 @@ void phy_start_machine(struct phy_device *phydev)
  *   state machine runs.
  */
 
-static void phy_trigger_machine(struct phy_device *phydev, bool sync)
+void phy_trigger_machine(struct phy_device *phydev, bool sync)
 {
if (sync)
cancel_delayed_work_sync(&phydev->state_queue);
@@ -1151,7 +1183,7 @@ void phy_state_machine(struct work_struct *work)
mutex_unlock(&phydev->lock);
 
if (needs_aneg)
-   err = phy_start_aneg(phydev);
+   err = phy_start_aneg_priv(phydev, false);
else if (do_suspend)
phy_suspend(phydev);
 
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 7fc1105..b19ae66 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -840,6 +840,7 @@ int phy_drivers_register(struct phy_driver *new_driver, int 
n,
 void phy_mac_interrupt(struct phy_device *phydev, int new_link);
 void phy_start_machine(struct phy_device *phydev);
 void phy_stop_machine(struct phy_device *phydev);
+void phy_trigger_machine(struct phy_device *phydev, bool sync);
 int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd);
 int phy_ethtool_gset(struct phy_device *phydev, struct ethtool_cmd *cmd);
 int phy_ethtool_ksettings_get(struct phy_device *phydev,
-- 
1.7.9.5



Re: [PATCH net-next 2/3] net: hns: support deferred probe when no mdio

2017-04-20 Thread Matthias Brugger

On 18/04/17 12:12, Yankejian wrote:

From: lipeng 

In the hip06 and hip07 SoCs, the PHY connects to the MDIO bus. The MDIO
module is probed with module_init and, as such,
is not guaranteed to probe before the HNS driver. So we need
to support deferred probe.

We check for probe deferral in the MAC init, so we do not init the DSAF
when there is no MDIO and then free all resources, only to learn later
that we need to defer the probe.

Signed-off-by: lipeng 


on which kernel version is this patch based?
I checked against next-20170420 and it does not apply.



---
 drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c | 34 +++
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c 
b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c
index bdd8cdd..284ebfe 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.c
@@ -735,6 +735,8 @@ static int hns_mac_init_ex(struct hns_mac_cb *mac_cb)
rc = phy_device_register(phy);
if (rc) {
phy_device_free(phy);
+   dev_err(&mdio->dev, "registered phy fail at address %i\n",
+   addr);
return -ENODEV;
}

@@ -745,7 +747,7 @@ static int hns_mac_init_ex(struct hns_mac_cb *mac_cb)
return 0;
 }

-static void hns_mac_register_phy(struct hns_mac_cb *mac_cb)
+static int hns_mac_register_phy(struct hns_mac_cb *mac_cb)
 {
struct acpi_reference_args args;
struct platform_device *pdev;
@@ -755,24 +757,39 @@ static void hns_mac_register_phy(struct hns_mac_cb 
*mac_cb)

/* Loop over the child nodes and register a phy_device for each one */
if (!to_acpi_device_node(mac_cb->fw_port))
-   return;
+   return 0;


Please return appropriate errno.



rc = acpi_node_get_property_reference(
mac_cb->fw_port, "mdio-node", 0, &args);
if (rc)
-   return;
+   return 0;


Please return appropriate errno.



addr = hns_mac_phy_parse_addr(mac_cb->dev, mac_cb->fw_port);
if (addr < 0)
-   return;
+   return 0;


Shouldn't we just error out here by returning addr?



/* dev address in adev */
pdev = hns_dsaf_find_platform_device(acpi_fwnode_handle(args.adev));
+   if (!pdev) {
+   dev_err(mac_cb->dev, "mac%d mdio pdev is NULL\n",
+   mac_cb->mac_id);
+   return 0;


Please return appropriate errno.

Regards,
Matthias


+   }
+
mii_bus = platform_get_drvdata(pdev);
+   if (!mii_bus) {
+   dev_err(mac_cb->dev,
+   "mac%d mdio is NULL, dsaf will probe again later\n",
+   mac_cb->mac_id);
+   return -EPROBE_DEFER;
+   }
+
rc = hns_mac_register_phydev(mii_bus, mac_cb, addr);
if (!rc)
dev_dbg(mac_cb->dev, "mac%d register phy addr:%d\n",
mac_cb->mac_id, addr);
+
+   return rc;
 }

 #define MAC_MEDIA_TYPE_MAX_LEN 16
@@ -793,7 +810,7 @@ static void hns_mac_register_phy(struct hns_mac_cb *mac_cb)
  *@np:device node
  * return: 0 --success, negative --fail
  */
-static int  hns_mac_get_info(struct hns_mac_cb *mac_cb)
+static int hns_mac_get_info(struct hns_mac_cb *mac_cb)
 {
struct device_node *np;
struct regmap *syscon;
@@ -903,7 +920,10 @@ static int  hns_mac_get_info(struct hns_mac_cb *mac_cb)
}
}
} else if (is_acpi_node(mac_cb->fw_port)) {
-   hns_mac_register_phy(mac_cb);
+   ret = hns_mac_register_phy(mac_cb);
+
+   if (ret)
+   return ret;
} else {
dev_err(mac_cb->dev, "mac%d cannot find phy node\n",
mac_cb->mac_id);
@@ -1065,6 +1085,7 @@ int hns_mac_init(struct dsaf_device *dsaf_dev)
dsaf_dev->mac_cb[port_id] = mac_cb;
}
}
+
/* init mac_cb for all port */
for (port_id = 0; port_id < max_port_num; port_id++) {
mac_cb = dsaf_dev->mac_cb[port_id];
@@ -1074,6 +1095,7 @@ int hns_mac_init(struct dsaf_device *dsaf_dev)
ret = hns_mac_get_cfg(dsaf_dev, mac_cb);
if (ret)
return ret;
+
ret = hns_mac_init_ex(mac_cb);
if (ret)
return ret;






[PATCH] brcm80211: brcmfmac: Ensure that incoming skb's are writable

2017-04-20 Thread James Hughes
The driver was adding header information to incoming skb
without ensuring the head was uncloned and hence writable.

skb_cow_head has been used to ensure they are writable, however,
this required some changes to error handling to ensure that
if skb_cow_head failed it was not ignored.

This really needs to be reviewed by someone who is more familiar
with this code base to ensure any deallocation of skb's is
still correct.

Signed-off-by: James Hughes 
---
 .../wireless/broadcom/brcm80211/brcmfmac/bcdc.c| 15 --
 .../wireless/broadcom/brcm80211/brcmfmac/core.c| 23 +---
 .../broadcom/brcm80211/brcmfmac/fwsignal.c | 32 +-
 .../wireless/broadcom/brcm80211/brcmfmac/sdio.c|  7 -
 4 files changed, 51 insertions(+), 26 deletions(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/bcdc.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/bcdc.c
index 038a960..b9d7d08 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/bcdc.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/bcdc.c
@@ -249,14 +249,19 @@ brcmf_proto_bcdc_set_dcmd(struct brcmf_pub *drvr, int 
ifidx, uint cmd,
return ret;
 }
 
-static void
+static int
 brcmf_proto_bcdc_hdrpush(struct brcmf_pub *drvr, int ifidx, u8 offset,
 struct sk_buff *pktbuf)
 {
struct brcmf_proto_bcdc_header *h;
+   int err;
 
brcmf_dbg(BCDC, "Enter\n");
 
+   err = skb_cow_head(pktbuf, BCDC_HEADER_LEN);
+   if (err)
+   return err;
+
/* Push BDC header used to convey priority for buses that don't */
skb_push(pktbuf, BCDC_HEADER_LEN);
 
@@ -271,6 +276,8 @@ brcmf_proto_bcdc_hdrpush(struct brcmf_pub *drvr, int ifidx, 
u8 offset,
h->data_offset = offset;
BCDC_SET_IF_IDX(h, ifidx);
trace_brcmf_bcdchdr(pktbuf->data);
+
+   return 0;
 }
 
 static int
@@ -330,7 +337,11 @@ static int
 brcmf_proto_bcdc_txdata(struct brcmf_pub *drvr, int ifidx, u8 offset,
struct sk_buff *pktbuf)
 {
-   brcmf_proto_bcdc_hdrpush(drvr, ifidx, offset, pktbuf);
+   int err = brcmf_proto_bcdc_hdrpush(drvr, ifidx, offset, pktbuf);
+
+   if (err)
+   return err;
+
return brcmf_bus_txdata(drvr->bus_if, pktbuf);
 }
 
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
index 5eaac13..08272e8 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
@@ -198,7 +198,7 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff 
*skb,
int ret;
struct brcmf_if *ifp = netdev_priv(ndev);
struct brcmf_pub *drvr = ifp->drvr;
-   struct ethhdr *eh = (struct ethhdr *)(skb->data);
+   struct ethhdr *eh;
 
brcmf_dbg(DATA, "Enter, bsscfgidx=%d\n", ifp->bsscfgidx);
 
@@ -212,23 +212,14 @@ static netdev_tx_t brcmf_netdev_start_xmit(struct sk_buff 
*skb,
}
 
/* Make sure there's enough room for any header */
-   if (skb_headroom(skb) < drvr->hdrlen) {
-   struct sk_buff *skb2;
-
-   brcmf_dbg(INFO, "%s: insufficient headroom\n",
- brcmf_ifname(ifp));
-   drvr->bus_if->tx_realloc++;
-   skb2 = skb_realloc_headroom(skb, drvr->hdrlen);
-   dev_kfree_skb(skb);
-   skb = skb2;
-   if (skb == NULL) {
-   brcmf_err("%s: skb_realloc_headroom failed\n",
- brcmf_ifname(ifp));
-   ret = -ENOMEM;
-   goto done;
-   }
+   ret = skb_cow_head(skb, drvr->hdrlen);
+   if (ret) {
+   dev_kfree_skb_any(skb);
+   goto done;
}
 
+   eh = (struct ethhdr *)(skb->data);
+
/* validate length for ether packet */
if (skb->len < sizeof(*eh)) {
ret = -EINVAL;
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c
index a190f53..2510408 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c
@@ -877,12 +877,15 @@ static void brcmf_fws_cleanup(struct brcmf_fws_info *fws, 
int ifidx)
brcmf_fws_hanger_cleanup(fws, matchfn, ifidx);
 }
 
-static u8 brcmf_fws_hdrpush(struct brcmf_fws_info *fws, struct sk_buff *skb)
+static int brcmf_fws_hdrpush(struct brcmf_fws_info *fws, struct sk_buff *skb,
+u8 *offset)
 {
struct brcmf_fws_mac_descriptor *entry = brcmf_skbcb(skb)->mac;
u8 *wlh;
u16 data_offset = 0;
u8 fillers;
+   int err;
+
__le32 pkttag = cpu_to_le32(brcmf_skbcb(skb)->htod);
__le16 pktseq = cpu_to_le16(brcmf_skbcb(skb)->htod_seq);
 
@@ -899,6 +902,11 @@ static u8 brcmf_fw

Re: [PATCH] brcm80211: brcmfmac: Ensure that incoming skb's are writable

2017-04-20 Thread Kalle Valo
+ linux-wireless

James Hughes  writes:

> The driver was adding header information to incoming skb
> without ensuring the head was uncloned and hence writable.
>
> skb_cow_head has been used to ensure they are writable, however,
> this required some changes to error handling to ensure that
> if skb_cow_head failed it was not ignored.
>
> This really needs to be reviewed by someone who is more familiar
> with this code base to ensure any deallocation of skb's is
> still correct.
>
> Signed-off-by: James Hughes 

You should also CC linux-wireless, otherwise patchwork won't see it.

-- 
Kalle Valo


Re: [PATCH net-next v4 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-20 Thread Jiri Pirko
Thu, Apr 20, 2017 at 12:42:55PM CEST, j...@mojatatu.com wrote:
>On 17-04-19 12:17 PM, Jiri Pirko wrote:
>> Wed, Apr 19, 2017 at 06:07:48PM CEST, j...@mojatatu.com wrote:
>> > On 17-04-19 11:54 AM, Jiri Pirko wrote:
>> > > Wed, Apr 19, 2017 at 05:37:15PM CEST, j...@mojatatu.com wrote:
>> > > > On 17-04-19 09:13 AM, Jiri Pirko wrote:
>> > > > > Wed, Apr 19, 2017 at 03:03:59PM CEST, j...@mojatatu.com wrote:
>> > > > > > On 17-04-19 08:36 AM, Jiri Pirko wrote:
>> > > > > > > Wed, Apr 19, 2017 at 01:57:29PM CEST, j...@mojatatu.com wrote:
>> > > > > > > > From: Jamal Hadi Salim 
>> > > > > > 
>> > > > > > > > include/uapi/linux/rtnetlink.h | 21 +++--
>> > > > > > > > net/sched/act_api.c| 43 
>> > > > > > > > --
>> > > > > > > > 3 files changed, 53 insertions(+), 12 deletions(-)
>> > > > 
>> > > > > > > > +#define TCAA_MAX (__TCAA_MAX - 1)
>> > > > > > > > #define TA_RTA(r)  ((struct rtattr*)(((char*)(r)) + 
>> > > > > > > > NLMSG_ALIGN(sizeof(struct tcamsg
>> > > > > > > > #define TA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct tcamsg))
>> > > > > > > > -#define TCA_ACT_TAB 1 /* attr type must be >=1 */ 
>> > > > > > > > -#define TCAA_MAX 1
>> > > > > > > > +#define TCA_ACT_TAB TCAA_ACT_TAB
>> > > > > > > 
>> > > > > > > This is mess. What does "TCAA" stand for?
>> > > > > > 
>> > > > > > TC Actions Attributes.  What would you call it? I could have
>> > > > > > called it TCA_ROOT etc. But maybe a comment to just call it
>> > > > > > TC Actions Attributes would be enough?
>> > > > > 
>> > > > > TCA_DUMP_X
>> > > > > 
>> > > > > it is only for dumping. Naming it "attribute" seems weird. Same as if
>> > > > > you have: int variable_something;
>> > > > > 
>> > > > 
>> > > > Jiri, this is not just for dumping. We are describing high level
>> > > > attributes for tc actions.
>> > > 
>> > > This is already present:
>> > > enum {
>> > > TCA_ACT_UNSPEC,
>> > > TCA_ACT_KIND,
>> > > TCA_ACT_OPTIONS,
>> > > TCA_ACT_INDEX,
>> > > TCA_ACT_STATS,
>> > > TCA_ACT_PAD,
>> > > TCA_ACT_COOKIE,
>> > > __TCA_ACT_MAX
>> > > };
>> > > 
>> > > This is nested inside the dump message (TCAA_MAX, TCA_ACT_TAB)
>> > > 
>> > > Looks like you are mixing these 2.
>> > > 
>> > 
>> > No - That space is per action. The space i am defining is
>> > above that in the the hierarchy. There used to be only
>> > one attribute there (TCA_ACT_TAB) and now we are making
>> > it more generic.
>> 
>> Right. That's what I say. And that higher level is used only for
>> dumping. That's why I suggested TCA_ACT_DUMP prefix.
>> 
>
>DUMP is not right. _TAB is used for SET/DEL as well.
>why dont we leave this alone so we can progress?
>You can submit changes later if it bothers you still.

Ha. So the current code is wrong, I see it now. Following is needed:

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 82b1d48..c432b22 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1005,7 +1005,7 @@ static int tc_ctl_action(struct sk_buff *skb, struct 
nlmsghdr *n,
!netlink_capable(skb, CAP_NET_ADMIN))
return -EPERM;
 
-   ret = nlmsg_parse(n, sizeof(struct tcamsg), tca, TCA_ACT_MAX, NULL,
+   ret = nlmsg_parse(n, sizeof(struct tcamsg), tca, TCAA_MAX, NULL,
  extack);
if (ret < 0)
return ret;

You confused me by squashing the change into this patch.

Ok, so the name could be:
TCA_ACTS_*
?
I believe it is crucial to figure this out right now. The TC UAPI is in a
deep mess already, no need to push it even more.


>
>> 
>> > 
>> > > 
>> > > 
>> > 
>> > > > It is a _lot_ of code to change! Note:
>> > > > This is all the UAPI visible code (the same coding style for 20 years).
>> > > > I am worried about this part.
>> > > 
>> > > We'll see. Lets do it in a sensitive way, in steps. But for new things,
>> > > I think it is good not to stick with old and outlived habits.
>> > > 
>> > > 
>> > 
>> > I know you have the muscle to get it done - so fine, i will start
>> > with this one change.
>> > 
>> > > 
>> > > Netlink is TLV, should be used as TLV. I don't see how you can run out
>> > > any space. You tend to use Netlink in some weird hybrid mode, with only
>> > > argument being space. I think that couple of bytes wasted is not
>> > > a problem at all...
>> > > 
>> > 
>> > You are not making sense to me still.
>> > What you describe as "a couple of bytes" adds up when you have
>> > a large number of objects. I am trying to cut down on data back
>> > and forth from user/kernel and a bitmap is a well understood entity.
>> 
>> The attributes in question are per-message, not per-object
>> 
>> 
>> > 
>> > Even if i did use a TLV - when i am representing flags which require one
>> > bit each - it makes sense to use a bitmask encapsulated in a TLV. Not
>> > to waste a TLV per bit.
>> 
>> That is the only correct way. For example, in future the kernel may
>> report 

Re: [net-next 04/14] i40e: dump VF information in debugfs

2017-04-20 Thread Jiri Pirko
Thu, Apr 20, 2017 at 12:09:28PM CEST, yuval.mi...@cavium.com wrote:
>> > Dump some internal state about VFs through debugfs. This provides
>> > information not available with 'ip link show'.
>> 
>> such as?
>> 
>> donnwantobethedebugfspolice, but still, in the 2-3 times we tried to push
>> debugfs to MLNX NIC drivers, Dave disallowed that, and lately the switch
>> team even went further and deleted that portion of the mlxsw driver -- all in
>> all, I don't see much point for this type of change, thoughts?
>
>Don't want to hijack your thread, but continuing this topic -
>Is there some flat-out disapproval for debugfs in net-next now?

It was mentioned by DaveM on the list multiple times.


>
>We're currently internally engaged with adding qed support for register dumps
>[~equivalents for `ethtool -d' outputs] through debugfs, on behalf of storage
>drivers [qedi/qedf] lacking the API for doing that.

That sounds wrong. Either introduce a generic infra to expose the info you
need, or don't do it at all. For your in-house debugging, you should have
an out-of-tree patch to expose whatever you need.


[PATCH v3 net] net sched actions: allocate act cookie early

2017-04-20 Thread Wolfgang Bumiller
Policing filters do not use the TCA_ACT_* enum and the tb[]
nlattr array in tcf_action_init_1() doesn't get filled for
them so we should not try to look for a TCA_ACT_COOKIE
attribute in the then uninitialized array.
The error handling in cookie allocation then calls
tcf_hash_release() leading to invalid memory access later
on.
Additionally, if cookie allocation fails after an already
existing non-policing filter has successfully been changed,
tcf_action_release() should not be called, also we would
have to roll back the changes in the error handling, so
instead we now allocate the cookie early and assign it on
success at the end.

CVE-2017-7979
Fixes: 1045ba77a596 ("net sched actions: Add support for user cookies")
Signed-off-by: Wolfgang Bumiller 
Acked-by: Jamal Hadi Salim 
---
This replaces both patches of the previous series:
  - 1/2 net sched actions: fix access to uninitialized data
   Changed: The affected code is now moved up into the first
(name==NULL) branch rather than adding this condition anew.
  - 2/2 net sched actions: decrement module refcount earlier
   Changed: Instead of moving the module_put above the cookie code,
the cookie code is moved as described in the commit
message.

 net/sched/act_api.c | 55 +++--
 1 file changed, 32 insertions(+), 23 deletions(-)

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index b70aa57319ea..e05b924618a0 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -529,20 +529,20 @@ int tcf_action_dump(struct sk_buff *skb, struct list_head 
*actions,
return err;
 }
 
-static int nla_memdup_cookie(struct tc_action *a, struct nlattr **tb)
+static struct tc_cookie *nla_memdup_cookie(struct nlattr **tb)
 {
-   a->act_cookie = kzalloc(sizeof(*a->act_cookie), GFP_KERNEL);
-   if (!a->act_cookie)
-   return -ENOMEM;
+   struct tc_cookie *c = kzalloc(sizeof(*c), GFP_KERNEL);
+   if (!c)
+   return NULL;
 
-   a->act_cookie->data = nla_memdup(tb[TCA_ACT_COOKIE], GFP_KERNEL);
-   if (!a->act_cookie->data) {
-   kfree(a->act_cookie);
-   return -ENOMEM;
+   c->data = nla_memdup(tb[TCA_ACT_COOKIE], GFP_KERNEL);
+   if (!c->data) {
+   kfree(c);
+   return NULL;
}
-   a->act_cookie->len = nla_len(tb[TCA_ACT_COOKIE]);
+   c->len = nla_len(tb[TCA_ACT_COOKIE]);
 
-   return 0;
+   return c;
 }
 
 struct tc_action *tcf_action_init_1(struct net *net, struct nlattr *nla,
@@ -551,6 +551,7 @@ struct tc_action *tcf_action_init_1(struct net *net, struct 
nlattr *nla,
 {
struct tc_action *a;
struct tc_action_ops *a_o;
+   struct tc_cookie *cookie = NULL;
char act_name[IFNAMSIZ];
struct nlattr *tb[TCA_ACT_MAX + 1];
struct nlattr *kind;
@@ -566,6 +567,18 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct nlattr *nla,
goto err_out;
if (nla_strlcpy(act_name, kind, IFNAMSIZ) >= IFNAMSIZ)
goto err_out;
+   if (tb[TCA_ACT_COOKIE]) {
+   int cklen = nla_len(tb[TCA_ACT_COOKIE]);
+
+   if (cklen > TC_COOKIE_MAX_SIZE)
+   goto err_out;
+
+   cookie = nla_memdup_cookie(tb);
+   if (!cookie) {
+   err = -ENOMEM;
+   goto err_out;
+   }
+   }
} else {
err = -EINVAL;
if (strlcpy(act_name, name, IFNAMSIZ) >= IFNAMSIZ)
@@ -604,20 +617,12 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct nlattr *nla,
if (err < 0)
goto err_mod;
 
-   if (tb[TCA_ACT_COOKIE]) {
-   int cklen = nla_len(tb[TCA_ACT_COOKIE]);
-
-   if (cklen > TC_COOKIE_MAX_SIZE) {
-   err = -EINVAL;
-   tcf_hash_release(a, bind);
-   goto err_mod;
-   }
-
-   if (nla_memdup_cookie(a, tb) < 0) {
-   err = -ENOMEM;
-   tcf_hash_release(a, bind);
-   goto err_mod;
+   if (name == NULL && tb[TCA_ACT_COOKIE]) {
+   if (a->act_cookie) {
+   kfree(a->act_cookie->data);
+   kfree(a->act_cookie);
}
+   a->act_cookie = cookie;
}
 
/* module count goes up only when brand new policy is created
@@ -632,6 +637,10 @@ struct tc_action *tcf_action_init_1(struct net *net, 
struct nlattr *nla,
 err_mod:
module_put(a_o->owner);
 err_out:
+   if (cookie) {
+   kfree(cookie->data);
+   kfree(cookie);
+   }
return ERR_PTR(err);
 }
 
-- 
2.11.0




Re: [PATCH net] net: ipv6: RTF_PCPU should not be settable from userspace

2017-04-20 Thread Andrey Konovalov
Thanks!

Tested-by: Andrey Konovalov 

On Wed, Apr 19, 2017 at 11:52 PM, Martin KaFai Lau  wrote:
> On Wed, Apr 19, 2017 at 02:19:43PM -0700, David Ahern wrote:
>> Andrey reported a fault in the IPv6 route code:
>>
>> kasan: GPF could be caused by NULL-ptr deref or user memory access
>> general protection fault:  [#1] SMP KASAN
>> Modules linked in:
>> CPU: 1 PID: 4035 Comm: a.out Not tainted 4.11.0-rc7+ #250
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> task: 880069809600 task.stack: 880062dc8000
>> RIP: 0010:ip6_rt_cache_alloc+0xa6/0x560 net/ipv6/route.c:975
>> RSP: 0018:880062dced30 EFLAGS: 00010206
>> RAX: dc00 RBX: 8800670561c0 RCX: 0006
>> RDX: 0003 RSI: 880062dcfb28 RDI: 0018
>> RBP: 880062dced68 R08: 0001 R09: 
>> R10:  R11:  R12: 
>> R13: 880062dcfb28 R14: dc00 R15: 
>> FS:  7feebe37e7c0() GS:88006cb0() knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2: 205a0fe4 CR3: 6b5c9000 CR4: 06e0
>> Call Trace:
>>  ip6_pol_route+0x1512/0x1f20 net/ipv6/route.c:1128
>>  ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
>> ...
>>
>> Andrey's syzkaller program passes rtmsg.rtmsg_flags with the RTF_PCPU bit
>> set. Flags passed to the kernel are blindly copied to the allocated
>> rt6_info by ip6_route_info_create making a newly inserted route appear
>> as though it is a per-cpu route. ip6_rt_cache_alloc sees the flag set
>> and expects rt->dst.from to be set - which it is not since it is not
>> really a per-cpu copy. The subsequent call to __ip6_dst_alloc then
>> generates the fault.
>>
>> Fix by checking for the flag and failing with EINVAL.
> Thanks for the fix.
>
> Acked-by: Martin KaFai Lau 
>
>>
>> Fixes: d52d3997f843f ("ipv6: Create percpu rt6_info")
>> Reported-by: Andrey Konovalov 
>> Signed-off-by: David Ahern 
>> ---
>>  include/uapi/linux/ipv6_route.h | 2 +-
>>  net/ipv6/route.c| 4 
>>  2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/uapi/linux/ipv6_route.h b/include/uapi/linux/ipv6_route.h
>> index 85bbb1799df3..d496c02e14bc 100644
>> --- a/include/uapi/linux/ipv6_route.h
>> +++ b/include/uapi/linux/ipv6_route.h
>> @@ -35,7 +35,7 @@
>>  #define RTF_PREF(pref)   ((pref) << 27)
>>  #define RTF_PREF_MASK0x1800
>>
>> -#define RTF_PCPU 0x4000
>> +#define RTF_PCPU 0x4000  /* read-only: can not be set by user */
>>  #define RTF_LOCAL0x8000
>>
>>
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index 4ba7c49872ff..a1bf426c959b 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -1854,6 +1854,10 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg)
>>   int addr_type;
>>   int err = -EINVAL;
>>
>> + /* RTF_PCPU is an internal flag; can not be set by userspace */
>> + if (cfg->fc_flags & RTF_PCPU)
>> + goto out;
>> +
>>   if (cfg->fc_dst_len > 128 || cfg->fc_src_len > 128)
>>   goto out;
>>  #ifndef CONFIG_IPV6_SUBTREES
>> --
>> 2.9.3
>>


Re: [PATCH] brcm80211: brcmfmac: Ensure that incoming skb's are writable

2017-04-20 Thread James Hughes
On 20 April 2017 at 12:31, Kalle Valo  wrote:
> + linux-wireless
>
> James Hughes  writes:
>
>> The driver was adding header information to incoming skb
>> without ensuring the head was uncloned and hence writable.
>>
>> skb_cow_head has been used to ensure they are writable, however,
>> this required some changes to error handling to ensure that
>> if skb_cow_head failed it was not ignored.
>>
>> This really needs to be reviewed by someone who is more familiar
>> with this code base to ensure any deallocation of skb's is
>> still correct.
>>
>> Signed-off-by: James Hughes 
>
> You should also CC linux-wireless, otherwise patchwork won't see it.
>
> --
> Kalle Valo

Thanks Kalle, I wasn't subscribed to wireless, but have now done so. I
also failed to read the MAINTAINERS list correctly..

With regard to this particular patch, this is related to the recent
patches to use skb_cow_head in a number of USB net drivers to ensure
they can write headers correctly, and I suspect the same fault is
possible/likely in other drivers outside the USB net realm, as this
patch shows.

I'm not overly happy with the error handling in this patch, but that
said, the error handling over this entire driver does strike me as
suspect. Quite a few places where return codes are ignored, just in my
quick examination. So not really sure how to proceed past this patch,
if at all.


Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone

2017-04-20 Thread Andrey Konovalov
On Thu, Apr 20, 2017 at 10:35 AM, Dmitry Vyukov  wrote:
> On Thu, Apr 20, 2017 at 1:51 AM, David Ahern  wrote:
>> On 4/19/17 5:47 PM, Cong Wang wrote:
>>> On Wed, Apr 19, 2017 at 9:12 AM, Andrey Konovalov  
>>> wrote:

 Anyway, I just finished simplifying the reproducer. Give this one a try.
>>>
>>> Thanks for providing such a minimal reproducer!
>>>
>>> The following patch could fix this crash, but I am not 100% sure if we 
>>> should
>>> just clear these bits or reject them with an errno.
>>>
>>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>>> index 9db14189..cf524c2 100644
>>> --- a/net/ipv6/route.c
>>> +++ b/net/ipv6/route.c
>>> @@ -2086,7 +2086,7 @@ static struct rt6_info
>>> *ip6_route_info_create(struct fib6_config *cfg)
>>> } else
>>> rt->rt6i_prefsrc.plen = 0;
>>>
>>> -   rt->rt6i_flags = cfg->fc_flags;
>>> +   rt->rt6i_flags = cfg->fc_flags & ~(RTF_PCPU | RTF_CACHE);
>>>
>>>  install_route:
>>> rt->dst.dev = dev;
>>>
>>
>> I sent a patch returning EINVAL if RTF_PCPU is set in fc_flags
>
>
> Andrey, does it fix the other crashes?

No, still see them.

I'm working on reproducing those.


Re: [PATCH net-next v4 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-20 Thread Eric Dumazet
On Thu, 2017-04-20 at 06:42 -0400, Jamal Hadi Salim wrote:

> 
> They are not the same issue, Jiri. We have used bitmasks fine in netlink
> messages for millennia. Nobody sets garbage on a bitmask they are not
> supposed to touch. The struct padding thing is a shame the way it
> turned out - now netlink can no longer have a claim to be a (good)
> wire protocol.

Except that users wrote programs, and these programs work today.

By changing the kernel and recognizing new flags in existing padding,
you might break the programs.

This is not acceptable. Period.

Had we checked the padding being 0 in old kernels, this change would
have been possible today.

But because old kernels did not care of the padding contents, then there
is no way new kernel can suddenly trust them at all.

Please Jamal, you have to forget this nonsense.




Re: [PATCH v4 1/3] VSOCK: Add vsockmon tap functions

2017-04-20 Thread Jorgen S. Hansen

> On Apr 13, 2017, at 6:18 PM, Stefan Hajnoczi  wrote:
> 
> +
> +static void __vsock_deliver_tap(struct sk_buff *skb)
> +{
> + int ret;
> + struct vsock_tap *tmp;
> +
> + list_for_each_entry_rcu(tmp, &vsock_tap_all, list) {
> + ret = __vsock_deliver_tap_skb(skb, tmp->dev);
> + if (unlikely(ret))
> + break;
> + }
> +
> + consume_skb(skb);

It looks like the caller will allocate the skb regardless of whether 
vsock_tap_all is empty, so shouldn’t the skb be consumed in vsock_deliver_tap?

> +}
> +
> +void vsock_deliver_tap(struct sk_buff *skb)
> +{
> + rcu_read_lock();
> +
> + if (unlikely(!list_empty(&vsock_tap_all)))
> + __vsock_deliver_tap(skb);
> +
> + rcu_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(vsock_deliver_tap);
> -- 
> 2.9.3
> 



Re: sctp: deny peeloff operation on asocs with threads sleeping on it

2017-04-20 Thread Salvatore Bonaccorso
Hi 

According to the documentation I should have sent this to
netdev@vger.kernel.org rather than sta...@vger.kernel.org.

Rationale: Whilst 00eff2ebbd229758e90659907724c14dd5a18339 went to
stable, the dfcb9f4f99f1e9a49e43398a7bfbf56927544af1 commit is
missing.

Quoting in full, my original misdirected mail:

On Thu, Apr 20, 2017 at 01:44:05PM +0200, Salvatore Bonaccorso wrote:
> Hi
> 
> Apparently the following commit
> 
> dfcb9f4f99f1e9a49e43398a7bfbf56927544af1 (sctp: deny peeloff operation
> on asocs with threads sleeping on it)
> 
> (was not CC'ed sta...@vger.kernel.org, but was already applied in
> 3.2.87 and 3.16.42) is missing from 4.9:
> 
> 2dcab598484185dea7ec22219c76dcdd59e3cb90 was included in 4.9.11 via
> 00eff2ebbd229758e90659907724c14dd5a18339 . But
> dfcb9f4f99f1e9a49e43398a7bfbf56927544af1 is missing from 4.9.
> 
> This was assigned CVE-2017-6353.
> 
> Reference:
> https://marc.info/?l=linux-netdev&m=148785309416337&w=2
> http://www.openwall.com/lists/oss-security/2017/02/27/2
> 
> Regards,
> Salvatore

Regards,
Salvatore
>From dfcb9f4f99f1e9a49e43398a7bfbf56927544af1 Mon Sep 17 00:00:00 2001
From: Marcelo Ricardo Leitner 
Date: Thu, 23 Feb 2017 09:31:18 -0300
Subject: [PATCH] sctp: deny peeloff operation on asocs with threads sleeping
 on it

commit 2dcab5984841 ("sctp: avoid BUG_ON on sctp_wait_for_sndbuf")
attempted to avoid a BUG_ON call when the association being used for a
sendmsg() is blocked waiting for more sndbuf and another thread did a
peeloff operation on such asoc, moving it to another socket.

As Ben Hutchings noticed, then in such case it would return without
locking back the socket and would cause two unlocks in a row.

Further analysis also revealed that it could allow a double free if the
application managed to peeloff the asoc that is created during the
sendmsg call, because then sctp_sendmsg() would try to free the asoc
that was created only for that call.

This patch takes another approach. It will deny the peeloff operation
if there is a thread sleeping on the asoc, so this situation doesn't
exist anymore. This avoids the issues described above and also honors
the syscalls that are already being handled (it can be multiple sendmsg
calls).

Joint work with Xin Long.

Fixes: 2dcab5984841 ("sctp: avoid BUG_ON on sctp_wait_for_sndbuf")
Cc: Alexander Popov 
Cc: Ben Hutchings 
Signed-off-by: Marcelo Ricardo Leitner 
Signed-off-by: Xin Long 
Signed-off-by: David S. Miller 
---
 net/sctp/socket.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index b532148..465a9c8 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -4862,6 +4862,12 @@ int sctp_do_peeloff(struct sock *sk, sctp_assoc_t id, struct socket **sockp)
 	if (!asoc)
 		return -EINVAL;
 
+	/* If there is a thread waiting on more sndbuf space for
+	 * sending on this asoc, it cannot be peeled.
+	 */
+	if (waitqueue_active(&asoc->wait))
+		return -EBUSY;
+
 	/* An association cannot be branched off from an already peeled-off
 	 * socket, nor is this supported for tcp style sockets.
 	 */
@@ -7599,8 +7605,6 @@ static int sctp_wait_for_sndbuf(struct sctp_association *asoc, long *timeo_p,
 		 */
 		release_sock(sk);
 		current_timeo = schedule_timeout(current_timeo);
-		if (sk != asoc->base.sk)
-			goto do_error;
 		lock_sock(sk);
 
 		*timeo_p = current_timeo;
-- 
2.1.4



Re: [PATCH v4 3/3] VSOCK: Add virtio vsock vsockmon hooks

2017-04-20 Thread Jorgen S. Hansen
> 
> +/* Packet capture */
> +void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt)
> +{
> + struct sk_buff *skb;
> + struct af_vsockmon_hdr *hdr;
> + unsigned char *t_hdr, *payload;
> +
> + skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + pkt->len,
> + GFP_ATOMIC);

So with the current API, an skb will always be allocated, even if there are no 
listeners? And we’ll copy the payload into the skb as well, if there is any. 
Would it make sense to introduce a check here (exposed as part of the vsock tap 
API), that skips all that in the case of no listeners? In the common case, 
there won’t be any listeners, so it would save some cycles.

> + if (!skb)
> + return; /* nevermind if we cannot capture the packet */
> +
> + hdr = (struct af_vsockmon_hdr *)skb_put(skb, sizeof(*hdr));
> +
> + /* pkt->hdr is little-endian so no need to byteswap here */
> + hdr->src_cid = pkt->hdr.src_cid;
> + hdr->src_port = pkt->hdr.src_port;
> + hdr->dst_cid = pkt->hdr.dst_cid;
> + hdr->dst_port = pkt->hdr.dst_port;
> +
> + hdr->transport = cpu_to_le16(AF_VSOCK_TRANSPORT_VIRTIO);
> + hdr->len = cpu_to_le16(sizeof(pkt->hdr));
> + hdr->reserved[0] = hdr->reserved[1] = 0;
> +
> + switch(cpu_to_le16(pkt->hdr.op)) {
> + case VIRTIO_VSOCK_OP_REQUEST:
> + case VIRTIO_VSOCK_OP_RESPONSE:
> + hdr->op = cpu_to_le16(AF_VSOCK_OP_CONNECT);
> + break;
> + case VIRTIO_VSOCK_OP_RST:
> + case VIRTIO_VSOCK_OP_SHUTDOWN:
> + hdr->op = cpu_to_le16(AF_VSOCK_OP_DISCONNECT);
> + break;
> + case VIRTIO_VSOCK_OP_RW:
> + hdr->op = cpu_to_le16(AF_VSOCK_OP_PAYLOAD);
> + break;
> + case VIRTIO_VSOCK_OP_CREDIT_UPDATE:
> + case VIRTIO_VSOCK_OP_CREDIT_REQUEST:
> + hdr->op = cpu_to_le16(AF_VSOCK_OP_CONTROL);
> + break;
> + default:
> + hdr->op = cpu_to_le16(AF_VSOCK_OP_UNKNOWN);
> + break;
> + }
> +
> + t_hdr = skb_put(skb, sizeof(pkt->hdr));
> + memcpy(t_hdr, &pkt->hdr, sizeof(pkt->hdr));
> +
> + if (pkt->len) {
> + payload = skb_put(skb, pkt->len);
> + memcpy(payload, pkt->buf, pkt->len);
> + }
> +
> + vsock_deliver_tap(skb);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> +
> static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> struct virtio_vsock_pkt_info *info)
> {
> -- 
> 2.9.3
> 



Re: [PATCH] brcm80211: brcmfmac: Ensure that incoming skb's are writable

2017-04-20 Thread Eric Dumazet
On Thu, 2017-04-20 at 12:16 +0100, James Hughes wrote:
> The driver was adding header information to incoming skb
> without ensuring the head was uncloned and hence writable.
> 
> skb_cow_head has been used to ensure they are writable, however,
> this required some changes to error handling to ensure that
> if skb_cow_head failed it was not ignored.
> 
> This really needs to be reviewed by someone who is more familiar
> with this code base to ensure any deallocation of skb's is
> still correct.
> 
> Signed-off-by: James Hughes 
> ---

You mention incoming skb in patch title, yet you change part of TX path.

Please split your changes in individual ones, to ease future bisections,
and ease backports as well.


Also, while it is known that TX path can deal with cloned skbs,
it is not clear why RX path has clones.

Most drivers allocate an skb, fill it, and pass it to network stack.

So please make sure all these changes are really needed.

> diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
> index d138260..0e53c8a 100644
> --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
> +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/sdio.c
> @@ -2719,7 +2719,7 @@ static bool brcmf_sdio_prec_enq(struct pktq *q, struct sk_buff *pkt, int prec)
>  
>  static int brcmf_sdio_bus_txdata(struct device *dev, struct sk_buff *pkt)
>  {
> - int ret = -EBADE;
> + int ret = -EBADE, err;
>   uint prec;
>   struct brcmf_bus *bus_if = dev_get_drvdata(dev);
>   struct brcmf_sdio_dev *sdiodev = bus_if->bus_priv.sdio;
> @@ -2729,6 +2729,11 @@ static int brcmf_sdio_bus_txdata(struct device *dev, struct sk_buff *pkt)
>   if (sdiodev->state != BRCMF_SDIOD_DATA)
>   return -EIO;
>  
> + err = skb_cow_head(pkt, bus->tx_hdrlen);
> +
> + if (err)
> + return err;
> +
>   /* Add space for the header */
>   skb_push(pkt, bus->tx_hdrlen);
>   /* precondition: IS_ALIGNED((unsigned long)(pkt->data), 2) */




macvlan: Fix device ref leak when purging bc_queue

2017-04-20 Thread Herbert Xu
When a parent macvlan device is destroyed we end up purging its
broadcast queue without dropping the device reference count on
the packet source device.  This causes the source device to linger.

This patch drops that reference count.

Fixes: 260916dfb48c ("macvlan: Fix potential use-after free for...")
Reported-by: Joe Ghalam 
Signed-off-by: Herbert Xu 

diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index 9261722..b34eaaa 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -1139,6 +1139,7 @@ static int macvlan_port_create(struct net_device *dev)
 static void macvlan_port_destroy(struct net_device *dev)
 {
struct macvlan_port *port = macvlan_port_get_rtnl(dev);
+   struct sk_buff *skb;
 
dev->priv_flags &= ~IFF_MACVLAN_PORT;
netdev_rx_handler_unregister(dev);
@@ -1147,7 +1148,15 @@ static void macvlan_port_destroy(struct net_device *dev)
 * but we need to cancel it and purge left skbs if any.
 */
cancel_work_sync(&port->bc_work);
-   __skb_queue_purge(&port->bc_queue);
+
+   while ((skb = __skb_dequeue(&port->bc_queue))) {
+   const struct macvlan_dev *src = MACVLAN_SKB_CB(skb)->src;
+
+   if (src)
+   dev_put(src->dev);
+
+   kfree_skb(skb);
+   }
 
kfree(port);
 }
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check

2017-04-20 Thread Robert Shearman
David reported that doing the following:

ip li add red type vrf table 10
ip link set dev eth1 vrf red
ip addr add 127.0.0.1/8 dev red
ip link set dev eth1 up
ip li set red up
ping -c1 -w1 -I red 127.0.0.1
ip li del red

results in a hang with this message:

unregister_netdevice: waiting for red to become free. Usage count = 1

The problem is caused by caching the dst used for sending the packet
out of the specified interface on the route that the lookup returned
from the local table. This happens when the rule for lookups in the
local table is ordered before the rule for lookups using l3mdevs. The
dst can then stay around until the route in the local table is deleted,
which may never happen.

Address the problem by not allocating a cacheable output dst if
FLOWI_FLAG_SKIP_NH_OIF is set and the nh device differs from the
device used for the dst.

Fixes: ebfc102c566d ("net: vrf: Flip IPv4 output path from FIB lookup hook to out hook")
Reported-by: David Ahern 
Signed-off-by: Robert Shearman 
---
 net/ipv4/route.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index acd69cfe2951..f667783ffd19 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2125,6 +2125,14 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
fi = NULL;
}
 
+   /* If the flag to skip the nh oif check is set then the output
+* device may not match the nh device, so cannot use or add to
+* cache in that case.
+*/
+   if (unlikely(fl4->flowi4_flags & FLOWI_FLAG_SKIP_NH_OIF &&
+FIB_RES_NH(*res).nh_dev != dev_out))
+   do_cache = false;
+
fnhe = NULL;
do_cache &= fi != NULL;
if (do_cache) {
-- 
2.1.4



Re: [PATCH net-next 0/3] l3mdev: Improve use with main table

2017-04-20 Thread Robert Shearman

On 13/04/17 15:36, David Ahern wrote:

On 4/13/17 6:48 AM, Robert Shearman wrote:

the patches look ok to me, but in testing them I see a refcnt problem.
simple reproducer:

ip li add red type vrf table 254
ip link set dev eth1 vrf red
ip addr add 127.0.0.1/8 dev red
ip link set dev eth1 up
ip li set red up
ping -c1 -w1 -I red 127.0.0.1
ip li del red
--> hangs with 'unregister_netdevice: waiting for red to become free.'


Right, I've reproduced the same and it occurs even without using the
main table and I believe it has been introduced within the last week.


The cached dst on sockets is one known place that causes a hang on a
delete. Basically the delete stalls until the sockets are closed. I have
a patch for sk_rx_dst which is the one I chased down last week.
sk_dst_cache is another possibility.

Neither of those should be at play with the above example because the
ping command runs and then exits.


Thanks for the pointers. My earlier assessment that this was something 
recent turned out to be wrong. I've sent a patch against net that fixes 
the issue.


Thanks,
Rob


[PATCH net-next v5 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-20 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

When you dump hundreds of thousands of actions, getting only 32 per
dump batch even when the socket buffer and memory allocations allow
is inefficient.

With this change, the user will get as many as possibly fitting
within the given constraints available to the kernel.

A new top level TLV space is introduced. An attribute
TCAA_ACT_FLAGS is used to carry the flags indicating the user
is capable of processing these large dumps. Older user space which
doesn't set this flag doesn't get the larger (than 32) batches.
The kernel uses the TCAA_ACT_COUNT attribute to tell the user how many
actions are put in a single batch. As such, the user space app knows how
long to iterate (independent of the type of action being dumped) instead
of relying on the hardcoded maximum of 32.

Some results dumping 1.5M actions, first unpatched tc which the
kernel doesn't help:

prompt$ time -p tc actions ls action gact | grep index | wc -l
150
real 1388.43
user 2.07
sys 1386.79

Now lets see a patched tc which sets the correct flags when requesting
a dump:

prompt$ time -p updatedtc actions ls action gact | grep index | wc -l
150
real 178.13
user 2.02
sys 176.96

Signed-off-by: Jamal Hadi Salim 
---
 include/uapi/linux/rtnetlink.h | 21 +--
 net/sched/act_api.c| 46 +-
 2 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index cce0613..d7d28ec 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -674,10 +674,27 @@ struct tcamsg {
unsigned char   tca__pad1;
unsigned short  tca__pad2;
 };
+
+enum {
+   TCAA_UNSPEC,
+   TCAA_ACT_TAB,
+#define TCA_ACT_TAB TCAA_ACT_TAB
+   TCAA_ACT_FLAGS,
+   TCAA_ACT_COUNT,
+   __TCAA_MAX,
+#defineTCAA_MAX (__TCAA_MAX - 1)
+};
+
 #define TA_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct tcamsg))))
 #define TA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct tcamsg))
-#define TCA_ACT_TAB 1 /* attr type must be >=1 */  
-#define TCAA_MAX 1
+/* tcamsg flags stored in attribute TCAA_ACT_FLAGS
+ *
+ * ACT_LARGE_DUMP_ON user->kernel to request for larger than TCA_ACT_MAX_PRIO
+ * actions in a dump. All dump responses will contain the number of actions
+ * being dumped stored in for user app's consumption in TCAA_ACT_COUNT
+ *
+ */
+#define ACT_LARGE_DUMP_ON  BIT(0)
 
 /* New extended info filters for IFLA_EXT_MASK */
 #define RTEXT_FILTER_VF(1 << 0)
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 82b1d48..f85b8fd 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -83,6 +83,7 @@ static int tcf_dump_walker(struct tcf_hashinfo *hinfo, struct sk_buff *skb,
   struct netlink_callback *cb)
 {
int err = 0, index = -1, i = 0, s_i = 0, n_i = 0;
+   unsigned short act_flags = cb->args[2];
struct nlattr *nest;
 
spin_lock_bh(&hinfo->lock);
@@ -111,14 +112,18 @@ static int tcf_dump_walker(struct tcf_hashinfo *hinfo, struct sk_buff *skb,
}
nla_nest_end(skb, nest);
n_i++;
-   if (n_i >= TCA_ACT_MAX_PRIO)
+   if (!(act_flags & ACT_LARGE_DUMP_ON) &&
+   n_i >= TCA_ACT_MAX_PRIO)
goto done;
}
}
 done:
spin_unlock_bh(&hinfo->lock);
-   if (n_i)
+   if (n_i) {
cb->args[0] += n_i;
+   if (act_flags & ACT_LARGE_DUMP_ON)
+   cb->args[1] = n_i;
+   }
return n_i;
 
 nla_put_failure:
@@ -993,11 +998,15 @@ static int tcf_action_add(struct net *net, struct nlattr *nla,
return tcf_add_notify(net, n, &actions, portid);
 }
 
+static const struct nla_policy tcaa_policy[TCAA_MAX + 1] = {
+   [TCAA_ACT_FLAGS]  = { .type = NLA_U32 },
+};
+
 static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
 struct netlink_ext_ack *extack)
 {
struct net *net = sock_net(skb->sk);
-   struct nlattr *tca[TCA_ACT_MAX + 1];
+   struct nlattr *tca[TCAA_MAX + 1];
u32 portid = skb ? NETLINK_CB(skb).portid : 0;
int ret = 0, ovr = 0;
 
@@ -1005,7 +1014,7 @@ static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
!netlink_capable(skb, CAP_NET_ADMIN))
return -EPERM;
 
-   ret = nlmsg_parse(n, sizeof(struct tcamsg), tca, TCA_ACT_MAX, NULL,
+   ret = nlmsg_parse(n, sizeof(struct tcamsg), tca, TCAA_MAX, tcaa_policy,
  extack);
if (ret < 0)
return ret;
@@ -1046,16 +1055,12 @@ static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
return ret;
 }
 
-static struct nlattr *find_dump_kind(const struct nlmsghdr *n)
+static struct nlattr *find_dump_kind(struct nlattr **nla

[PATCH net-next v5 2/2] net sched actions: add time filter for action dumping

2017-04-20 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

This adds support for filtering based on time since last used.
When we are dumping a large number of actions it is useful to
have the option of filtering based on when the action was last
used to reduce the amount of data crossing to user space.

With this patch the user space app sets the TCAA_ACT_TIME_DELTA
attribute with the value in milliseconds with "time of interest
since now".  The kernel converts this to jiffies and does the
filtering comparison matching entries that have seen activity
since then and returns them to user space.
Old kernels and old tc continue to work in legacy mode since
they don't specify this attribute.

Some example (we have 400 actions bound to 400 filters); at
installation time using  hacked tc which sets the time of
interest to 120 seconds:

prompt$ hackedtc actions ls action gact | grep index | wc -l
400

go get some coffee and wait for > 120 seconds, then try again:

prompt$ hackedtc actions ls action gact | grep index | wc -l
0

Lets see a filter bound to one of these actions:
..
filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 2 success 1)
  match 7f02/ at 12 (success 1 )
action order 1: gact action pass
 random type none pass val 0
 index 23 ref 2 bind 1 installed 1145 sec used 802 sec
Action statistics:
Sent 84 bytes 1 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0


that coffee took long, no? It was good.

Now lets ping -c 1 127.0.0.2, then run the actions again:

prompt$ hackedtc actions ls action gact | grep index | wc -l
1

More details please:

prompt$ hackedtc -s actions ls action gact

action order 0: gact action pass
 random type none pass val 0
 index 23 ref 2 bind 1 installed 1270 sec used 30 sec
Action statistics:
Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

And the filter?

filter pref 10 u32
filter pref 10 u32 fh 800: ht divisor 1
filter pref 10 u32 fh 800::800 order 2048 key ht 800 bkt 0 flowid 1:10  (rule hit 4 success 2)
  match 7f02/ at 12 (success 2 )
action order 1: gact action pass
 random type none pass val 0
 index 23 ref 2 bind 1 installed 1324 sec used 84 sec
Action statistics:
Sent 168 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
backlog 0b 0p requeues 0

Signed-off-by: Jamal Hadi Salim 
---
 include/uapi/linux/rtnetlink.h |  1 +
 net/sched/act_api.c| 20 +++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index d7d28ec..da8a5de 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -681,6 +681,7 @@ enum {
 #define TCA_ACT_TAB TCAA_ACT_TAB
TCAA_ACT_FLAGS,
TCAA_ACT_COUNT,
+   TCAA_ACT_TIME_DELTA,
__TCAA_MAX,
 #defineTCAA_MAX (__TCAA_MAX - 1)
 };
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index f85b8fd..d163ff4 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -84,6 +84,7 @@ static int tcf_dump_walker(struct tcf_hashinfo *hinfo, struct sk_buff *skb,
 {
int err = 0, index = -1, i = 0, s_i = 0, n_i = 0;
unsigned short act_flags = cb->args[2];
+   unsigned long jiffy_since = cb->args[3];
struct nlattr *nest;
 
spin_lock_bh(&hinfo->lock);
@@ -101,6 +102,11 @@ static int tcf_dump_walker(struct tcf_hashinfo *hinfo, struct sk_buff *skb,
if (index < s_i)
continue;
 
+   if (jiffy_since &&
+   time_after(jiffy_since,
+  (unsigned long)p->tcfa_tm.lastuse))
+   continue;
+
nest = nla_nest_start(skb, n_i);
if (nest == NULL)
goto nla_put_failure;
@@ -118,9 +124,11 @@ static int tcf_dump_walker(struct tcf_hashinfo *hinfo, struct sk_buff *skb,
}
}
 done:
+   if (index > 0)
+   cb->args[0] = index + 1;
+
spin_unlock_bh(&hinfo->lock);
if (n_i) {
-   cb->args[0] += n_i;
if (act_flags & ACT_LARGE_DUMP_ON)
cb->args[1] = n_i;
}
@@ -1000,6 +1008,7 @@ static int tcf_action_add(struct net *net, struct nlattr *nla,
 
 static const struct nla_policy tcaa_policy[TCAA_MAX + 1] = {
[TCAA_ACT_FLAGS]  = { .type = NLA_U32 },
+   [TCAA_ACT_TIME_DELTA]  = { .type = NLA_U32 },
 };
 
 static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
@@ -1091,6 +1100,8 @@ static int tc_dump_action(struct sk_buff *skb, struct netlink_callback *cb)
struct nlattr *kind = NULL;
struct nlattr *count_attr = NULL;
u32 act_flags = 0;
+   u32 msecs_since = 0;
+   unsigned long jiff

[PATCH net-next v5 0/2] net sched actions: improve action dump performance

2017-04-20 Thread Jamal Hadi Salim
From: Jamal Hadi Salim 

Changes since v4:

1) Eric D.

pointed out that when all skb space is used up by the dump
there will be no space to insert the TCAA_ACT_COUNT attribute.

2) Jiri:

i) Change:

enum {
TCAA_UNSPEC,
TCAA_ACT_TAB,
TCAA_ACT_FLAGS,
TCAA_ACT_COUNT,
TCAA_ACT_TIME_FILTER,
__TCAA_MAX
};

#define TCAA_MAX (__TCAA_MAX - 1)
#define ACT_LARGE_DUMP_ON   (1 << 0)

to:
enum {
   TCAA_UNSPEC,
   TCAA_ACT_TAB,
#define TCA_ACT_TAB TCAA_ACT_TAB
   TCAA_ACT_FLAGS,
   TCAA_ACT_COUNT,
   __TCAA_MAX,
#defineTCAA_MAX (__TCAA_MAX - 1)
};

#define ACT_LARGE_DUMP_ON  BIT(0)

Jiri plans to followup with the rest of the code to make the
style consistent.

ii) Rename attribute TCAA_ACT_TIME_FILTER --> TCAA_ACT_TIME_DELTA

iii) Rename variable jiffy_filter --> jiffy_since
iv) Rename msecs_filter --> msecs_since
v) get rid of unused cb->args[0] and rename cb->args[4] to cb->args[0]

Jamal Hadi Salim (2):
  net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch
  net sched actions: add time filter for action dumping

 include/uapi/linux/rtnetlink.h | 22 --
 net/sched/act_api.c| 66 +++---
 2 files changed, 75 insertions(+), 13 deletions(-)

-- 
1.9.1



Re: [PATCH v2] mdio_bus: Issue GPIO RESET to PHYs.

2017-04-20 Thread Andrew Lunn
On Thu, Apr 20, 2017 at 11:39:20AM +0300, Roger Quadros wrote:
> Some boards [1] leave the PHYs at an invalid state
> during system power-up or reset thus causing unreliability
> issues with the PHY which manifests as PHY not being detected
> or link not functional. To fix this, these PHYs need to be RESET
> via a GPIO connected to the PHY's RESET pin.
> 
> Some boards have a single GPIO controlling the PHY RESET pin of all
> PHYs on the bus whereas some others have separate GPIOs controlling
> individual PHY RESETs.
> 
> In both cases, the RESET de-assertion cannot be done in the PHY driver
> as the PHY will not probe till its reset is de-asserted.
> So do the RESET de-assertion in the MDIO bus driver.

Hi Roger

Networking patches should include in the subject line which tree they
are for. In this case, net-next. So please make the Subject

[patch v3 net-next] .

> [1] - am572x-idk, am571x-idk, a437x-idk
> 
> Signed-off-by: Roger Quadros 
> ---
> v2:
> - add device tree binding document (mdio.txt)
> - specify default reset delay in of_mdio.c instead of mdio_bus.c
> 
>  Documentation/devicetree/bindings/net/mdio.txt | 20 
>  drivers/net/phy/mdio_bus.c | 22 ++
>  drivers/of/of_mdio.c   |  7 +++
>  include/linux/phy.h|  5 +
>  4 files changed, 54 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/mdio.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/mdio.txt b/Documentation/devicetree/bindings/net/mdio.txt
> new file mode 100644
> index 000..6e703d7
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/mdio.txt
> @@ -0,0 +1,20 @@
> +Common MDIO bus properties.
> +
> +These are generic properties that can apply to any MDIO bus.
> +
> +Optional properties:
> +- reset-gpios: List of one or more GPIOs that control the RESET lines
> +  of the PHYs on that MDIO bus.
> +- reset-delay-us: RESET pulse width as per PHY datasheet.

It would be good to explicitly say that it is in uS as part of the
comment.

Also, please document that we expect a list of child nodes, one per
device on the bus. These should follow the generic phy.txt, or a
device specific binding document.

> +
> +Example :
> +

It would be good to say something like:

This example shows these optional properties, plus other properties
required for the TI Davinci MDIO driver.

Pointing this out may stop people cut/past the ti,hwmods property.

> + davinci_mdio: ethernet@0x5c03 {
> + compatible = "ti,davinci_mdio";
> + ti,hwmods = "davinci_mdio";
> + reg = <0x5c03 0x1000>;
> + #address-cells = <1>;
> + #size-cells = <0>;
> + reset-gpios = <&gpio2 5 GPIO_ACTIVE_LOW>;
> + reset-delay-us = <2>;   /* PHY datasheet states 1uS min */
> + };

And please include at least one PHY on the bus.

Sorry for asking for so much in the documentation. That is the problem
with the documentation being missing to start with.

The code looks good now.

Andrew


raw patch for iproute2 fast action dumping

2017-04-20 Thread Jamal Hadi Salim


Not to be used in production.

cheers,
jamal
diff --git a/tc/f_basic.c b/tc/f_basic.c
index d663668..8370ea6 100644
--- a/tc/f_basic.c
+++ b/tc/f_basic.c
@@ -135,7 +135,7 @@ static int basic_print_opt(struct filter_util *qu, FILE *f,
}
 
if (tb[TCA_BASIC_ACT]) {
-   tc_print_action(f, tb[TCA_BASIC_ACT]);
+   tc_print_action(f, tb[TCA_BASIC_ACT], 0);
}
 
return 0;
diff --git a/tc/f_bpf.c b/tc/f_bpf.c
index df8a259..52027af 100644
--- a/tc/f_bpf.c
+++ b/tc/f_bpf.c
@@ -231,7 +231,7 @@ static int bpf_print_opt(struct filter_util *qu, FILE *f,
}
 
if (tb[TCA_BPF_ACT])
-   tc_print_action(f, tb[TCA_BPF_ACT]);
+   tc_print_action(f, tb[TCA_BPF_ACT], 0);
 
return 0;
 }
diff --git a/tc/f_cgroup.c b/tc/f_cgroup.c
index ecf9909..633700e 100644
--- a/tc/f_cgroup.c
+++ b/tc/f_cgroup.c
@@ -102,7 +102,7 @@ static int cgroup_print_opt(struct filter_util *qu, FILE *f,
}
 
if (tb[TCA_CGROUP_ACT])
-   tc_print_action(f, tb[TCA_CGROUP_ACT]);
+   tc_print_action(f, tb[TCA_CGROUP_ACT], 0);
 
return 0;
 }
diff --git a/tc/f_flow.c b/tc/f_flow.c
index 09ddcaa..b157104 100644
--- a/tc/f_flow.c
+++ b/tc/f_flow.c
@@ -347,7 +347,7 @@ static int flow_print_opt(struct filter_util *fu, FILE *f, struct rtattr *opt,
tc_print_police(f, tb[TCA_FLOW_POLICE]);
if (tb[TCA_FLOW_ACT]) {
fprintf(f, "\n");
-   tc_print_action(f, tb[TCA_FLOW_ACT]);
+   tc_print_action(f, tb[TCA_FLOW_ACT], 0);
}
return 0;
 }
diff --git a/tc/f_flower.c b/tc/f_flower.c
index 5aac4a0..04733d9 100644
--- a/tc/f_flower.c
+++ b/tc/f_flower.c
@@ -1174,7 +1174,7 @@ static int flower_print_opt(struct filter_util *qu, FILE *f,
}
 
if (tb[TCA_FLOWER_ACT])
-   tc_print_action(f, tb[TCA_FLOWER_ACT]);
+   tc_print_action(f, tb[TCA_FLOWER_ACT], 0);
 
return 0;
 }
diff --git a/tc/f_fw.c b/tc/f_fw.c
index 790bef9..c39789b 100644
--- a/tc/f_fw.c
+++ b/tc/f_fw.c
@@ -160,7 +160,7 @@ static int fw_print_opt(struct filter_util *qu, FILE *f, struct rtattr *opt, __u
 
if (tb[TCA_FW_ACT]) {
fprintf(f, "\n");
-   tc_print_action(f, tb[TCA_FW_ACT]);
+   tc_print_action(f, tb[TCA_FW_ACT], 0);
}
return 0;
 }
diff --git a/tc/f_matchall.c b/tc/f_matchall.c
index ac48630..5c8fe2a 100644
--- a/tc/f_matchall.c
+++ b/tc/f_matchall.c
@@ -140,7 +140,7 @@ static int matchall_print_opt(struct filter_util *qu, FILE *f,
}
 
if (tb[TCA_MATCHALL_ACT])
-   tc_print_action(f, tb[TCA_MATCHALL_ACT]);
+   tc_print_action(f, tb[TCA_MATCHALL_ACT], 0);
 
return 0;
 }
diff --git a/tc/f_route.c b/tc/f_route.c
index 30514c4..e88313f 100644
--- a/tc/f_route.c
+++ b/tc/f_route.c
@@ -168,7 +168,7 @@ static int route_print_opt(struct filter_util *qu, FILE *f, struct rtattr *opt,
if (tb[TCA_ROUTE4_POLICE])
tc_print_police(f, tb[TCA_ROUTE4_POLICE]);
if (tb[TCA_ROUTE4_ACT])
-   tc_print_action(f, tb[TCA_ROUTE4_ACT]);
+   tc_print_action(f, tb[TCA_ROUTE4_ACT], 0);
return 0;
 }
 
diff --git a/tc/f_rsvp.c b/tc/f_rsvp.c
index 94bfbef..65caeb4 100644
--- a/tc/f_rsvp.c
+++ b/tc/f_rsvp.c
@@ -402,7 +402,7 @@ static int rsvp_print_opt(struct filter_util *qu, FILE *f, struct rtattr *opt, _
}
 
if (tb[TCA_RSVP_ACT]) {
-   tc_print_action(f, tb[TCA_RSVP_ACT]);
+   tc_print_action(f, tb[TCA_RSVP_ACT], 0);
}
if (tb[TCA_RSVP_POLICE])
tc_print_police(f, tb[TCA_RSVP_POLICE]);
diff --git a/tc/f_tcindex.c b/tc/f_tcindex.c
index 784c890..dd1cb47 100644
--- a/tc/f_tcindex.c
+++ b/tc/f_tcindex.c
@@ -173,7 +173,7 @@ static int tcindex_print_opt(struct filter_util *qu, FILE *f,
}
if (tb[TCA_TCINDEX_ACT]) {
fprintf(f, "\n");
-   tc_print_action(f, tb[TCA_TCINDEX_ACT]);
+   tc_print_action(f, tb[TCA_TCINDEX_ACT], 0);
}
return 0;
 }
diff --git a/tc/f_u32.c b/tc/f_u32.c
index 92c1fcd..dbc6939 100644
--- a/tc/f_u32.c
+++ b/tc/f_u32.c
@@ -1332,7 +1332,7 @@ static int u32_print_opt(struct filter_util *qu, FILE *f, struct rtattr *opt,
}
 
if (tb[TCA_U32_ACT])
-   tc_print_action(f, tb[TCA_U32_ACT]);
+   tc_print_action(f, tb[TCA_U32_ACT], 0);
 
return 0;
 }
diff --git a/tc/m_action.c b/tc/m_action.c
index 6ba7615..c5d390d 100644
--- a/tc/m_action.c
+++ b/tc/m_action.c
@@ -230,7 +230,7 @@ done0:
NEXT_ARG();
slen = strlen(*argv);
if (slen > TC_COOKIE_MAX_SIZE * 2)
-   invarg("cookie cannot exceed %d\n",
+   invarg("cookie exceeded len\n",
   

Re: [RFC PATCH net] net/mlx5e: Race between mlx5e_update_stats() and getting the stats

2017-04-20 Thread Saeed Mahameed
On Thu, Apr 20, 2017 at 2:35 AM, Eric Dumazet  wrote:
> On Wed, 2017-04-19 at 14:53 -0700, Martin KaFai Lau wrote:
>
>> Right, a temp and a memcpy should be enough to solve our spike problem.
>> It may be the right fix for net.
>>
>> Agree that using a spinlock is better (likely changing state_lock
>> to spinlock).  A quick grep shows 80 line changes.  Saeed, thoughts?

No, changing the current state_lock is a bad idea. As Eric said, it is
there for a reason: to synchronize arbitrary ring/device state changes
which might sleep.

The memcpy on its own would only hide or delay the issue :); a new
dedicated spinlock is the right way to go, as Eric suggested below.

BTW, very nice catch, Martin. I just got back from vacation to work on
this bug, and you had already root-caused it. Thanks!!

>
> I was not advising replacing the mutex (maybe it is a mutex for good
> reason), I simply suggested to use another spinlock only for this very
> specific section.
>
> Something like :
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index dc52053128bc752ccd398449330c24c0bdf8b3a1..9b2e1b79fded22d55e9409cb572308190679cfdd 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -722,6 +722,7 @@ struct mlx5e_priv {
> struct mlx5_core_dev  *mdev;
> struct net_device *netdev;
> struct mlx5e_stats stats;
> +   spinlock_t stats_lock;
> struct mlx5e_tstamptstamp;
> u16 q_counter;
>  #ifdef CONFIG_MLX5_CORE_EN_DCB
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> index a004a5a1a4c22a742ef3f9939769c6b5c9445f46..b4b7d43bf899cadca2c2a17151d35acac9773859 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_ethtool.c
> @@ -315,9 +315,11 @@ static void mlx5e_get_ethtool_stats(struct net_device *dev,
> mlx5e_update_stats(priv);
> mutex_unlock(&priv->state_lock);
>
> +   spin_lock(&priv->stats_lock);
> for (i = 0; i < NUM_SW_COUNTERS; i++)
> data[idx++] = MLX5E_READ_CTR64_CPU(&priv->stats.sw,
>sw_stats_desc, i);
> +   spin_unlock(&priv->stats_lock);
>
> for (i = 0; i < MLX5E_NUM_Q_CNTRS(priv); i++)
> data[idx++] = MLX5E_READ_CTR32_CPU(&priv->stats.qcnt,
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 66c133757a5ee8daae122e93322306b1c5c44336..4d6672045b1126a8bab4d6f2035e6a9b830560d2 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -174,7 +174,7 @@ static void mlx5e_tx_timeout_work(struct work_struct *work)
>
>  static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
>  {
> -   struct mlx5e_sw_stats *s = &priv->stats.sw;
> +   struct mlx5e_sw_stats temp, *s = &temp;
> struct mlx5e_rq_stats *rq_stats;
> struct mlx5e_sq_stats *sq_stats;
> u64 tx_offload_none = 0;
> @@ -229,6 +229,9 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
> s->link_down_events_phy = MLX5_GET(ppcnt_reg,
> priv->stats.pport.phy_counters,
> 
> counter_set.phys_layer_cntrs.link_down_events);
> +   spin_lock(&priv->stats_lock);
> +   memcpy(&priv->stats.sw, s, sizeof(*s));
> +   spin_unlock(&priv->stats_lock);

I like this! It minimizes the critical section with a temp buffer and a
memcpy... perfect.

>  }
>
>  static void mlx5e_update_vport_counters(struct mlx5e_priv *priv)
> @@ -2754,11 +2757,13 @@ mlx5e_get_stats(struct net_device *dev, struct rtnl_link_stats64 *stats)
> stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
> stats->tx_bytes   = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
> } else {
> +   spin_lock(&priv->stats_lock);
> stats->rx_packets = sstats->rx_packets;
> stats->rx_bytes   = sstats->rx_bytes;
> stats->tx_packets = sstats->tx_packets;
> stats->tx_bytes   = sstats->tx_bytes;
> stats->tx_dropped = sstats->tx_queue_dropped;
> +   spin_unlock(&priv->stats_lock);
> }
>
> stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
> @@ -3561,6 +3566,8 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
>
> mutex_init(&priv->state_lock);
>
> +   spin_lock_init(&priv->stats_lock);
> +
> INIT_WORK(&priv->update_carrier_work, mlx5e_update_carrier_work);
> INIT_WORK(&priv->set_rx_mode_work, mlx5e_set_rx_mode_work);
> INIT_WORK(
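
The pattern being agreed on above, a long lock-free aggregation into a
temporary followed by a short locked publish, can be sketched in plain
userspace C. This is a hedged illustration only: struct and function
names are made up, and a pthread mutex stands in for the kernel
spinlock.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-ins for mlx5e's software counters. */
struct sw_stats {
	uint64_t rx_packets;
	uint64_t rx_bytes;
	uint64_t tx_packets;
	uint64_t tx_bytes;
};

static struct sw_stats published;	/* what readers (e.g. ethtool) see */
static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer: do the long per-ring aggregation lock-free into a temporary,
 * then publish the finished snapshot under a short critical section. */
void stats_publish(const struct sw_stats *ring, int nrings)
{
	struct sw_stats tmp = {0};

	for (int i = 0; i < nrings; i++) {
		tmp.rx_packets += ring[i].rx_packets;
		tmp.rx_bytes   += ring[i].rx_bytes;
		tmp.tx_packets += ring[i].tx_packets;
		tmp.tx_bytes   += ring[i].tx_bytes;
	}

	pthread_mutex_lock(&stats_lock);
	memcpy(&published, &tmp, sizeof(tmp));	/* the only locked work */
	pthread_mutex_unlock(&stats_lock);
}

/* Reader: take a self-consistent copy, never a half-updated one. */
struct sw_stats stats_snapshot(void)
{
	struct sw_stats out;

	pthread_mutex_lock(&stats_lock);
	out = published;
	pthread_mutex_unlock(&stats_lock);
	return out;
}
```

Readers never observe a torn update, because the only writes to
`published` happen under the lock and are a single memcpy.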

Re: Heads-up: two regressions in v4.11-rc series

2017-04-20 Thread Frederic Weisbecker
On Thu, Apr 20, 2017 at 11:00:42AM +0200, Jesper Dangaard Brouer wrote:
> Hi Linus,
> 
> Just wanted to give a heads-up on two regressions in 4.11-rc series.
> 
> (1) page allocator optimization revert
> 
> Mel Gorman and I have been playing with optimizing the page allocator,
> but Tariq spotted that we caused a regression for (NIC) drivers that
> refill DMA RX rings in softirq context.
> 
> The end result was a revert, and this is waiting in AKPMs quilt queue:
>  
> http://ozlabs.org/~akpm/mmots/broken-out/revert-mm-page_alloc-only-use-per-cpu-allocator-for-irq-safe-requests.patch
> 
> 
> (2) Busy softirq can cause userspace not to be scheduled
> 
> I bisected the problem to a499a5a14dbd ("sched/cputime: Increment
> kcpustat directly on irqtime account"). See email thread with
>  Subject: Bisected softirq accounting issue in v4.11-rc1~170^2~28
>  http://lkml.kernel.org/r/20170328101403.34a82...@redhat.com
> 
> I don't know the scheduler code well enough to fix this, and will have
> to rely others to figure out this scheduler regression.
> 
> To make it clear: I'm only seeing this scheduler regression when a
> remote host is sending many many network packets, towards the kernel
> which keeps NAPI/softirq busy all the time.  A possible hint: tool
> "top" only shows this in "si" column, while on v4.10 "top" also blames
> "ksoftirqd/N", plus "ps" reported cputime (0:00) seems wrong for ksoftirqd.

(I'm currently working on reproducing that one.)


Re: [PATCH net-next v4 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-20 Thread Jamal Hadi Salim

On 17-04-20 08:18 AM, Eric Dumazet wrote:

On Thu, 2017-04-20 at 06:42 -0400, Jamal Hadi Salim wrote:



They are not the same issue, Jiri. We have used bitmasks fine in netlink
messages for ages. Nobody sets garbage in a bitmask they are not
supposed to touch. The struct padding thing is a shame the way it
turned out: now netlink can no longer claim to be a (good)
wire protocol.


Except that users wrote programs, and these programs work today.

By changing the kernel and recognizing new flags in existing padding,
you might break the programs.

This is not acceptable. Period.

Had we checked the padding being 0 in old kernels, this change would
have been possible today.

But because old kernels did not care of the padding contents, then there
is no way new kernel can suddenly trust them at all.

Please Jamal, you have to forget this nonsense.


That is fine. We can rule out netlink ever being able to work
across machines. That was the dream in the past. Let's close that
discussion.

The issue Jiri is bringing up is unrelated. He is talking about
a bitmap and conflating it with a data structure. They are not
the same issue.

cheers,
jamal
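
Eric's point, that old kernels never required padding to be zero, so
stale bytes may already be on the wire, can be demonstrated with a
small userspace sketch. The struct is hypothetical, and the result
assumes a typical ABI where it is padded to 8 bytes; mainstream
compilers leave the padding bytes untouched by field assignments.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical "old" netlink payload: 3 bytes of tail padding on
 * ABIs that align the struct to 4 bytes (assumed here). */
struct legacy_msg {
	uint32_t index;
	uint8_t  action;
};

/* Simulate what legacy userspace does: fill the struct field by field
 * and count how many bytes of the object were never written. */
int stale_padding_bytes(void)
{
	struct legacy_msg m;
	const unsigned char *raw = (const unsigned char *)&m;
	int stale = 0;

	memset(&m, 0xAA, sizeof(m));	/* stand-in for stale stack data */
	m.index = 1;			/* old code sets only the fields... */
	m.action = 2;

	for (size_t i = 0; i < sizeof(m); i++)
		stale += (raw[i] == 0xAA);	/* ...padding keeps old bytes */
	return stale;
}
```

Any new kernel that reinterprets those three bytes as flags would read
whatever happened to be on the caller's stack.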







[PATCH v2] net: arc_emac: switch to phy_start()/phy_stop()

2017-04-20 Thread Alexander Kochetkov
Currently the driver uses phy_start_aneg() in arc_emac_open() to bring
up the PHY, but the phy_start() function is more appropriate for this
purpose. Besides calling phy_start_aneg() as part of the PHY startup
sequence, it can also correctly bring the PHY up from error and
suspended states. So the patch replaces phy_start_aneg() with
phy_start().

The patch also adds a call to phy_stop() in arc_emac_stop() to allow
the PHY device to be fully suspended when the interface is unused.

Signed-off-by: Alexander Kochetkov 
---

Changes in v2:
- Updated commit message to clarify changes

 drivers/net/ethernet/arc/emac_main.c |4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/arc/emac_main.c b/drivers/net/ethernet/arc/emac_main.c
index abc9f2a..188676d 100644
--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -434,7 +434,7 @@ static int arc_emac_open(struct net_device *ndev)
/* Enable EMAC */
arc_reg_or(priv, R_CTRL, EN_MASK);
 
-   phy_start_aneg(ndev->phydev);
+   phy_start(ndev->phydev);
 
netif_start_queue(ndev);
 
@@ -556,6 +556,8 @@ static int arc_emac_stop(struct net_device *ndev)
napi_disable(&priv->napi);
netif_stop_queue(ndev);
 
+   phy_stop(ndev->phydev);
+
/* Disable interrupts */
arc_reg_clr(priv, R_ENABLE, RXINT_MASK | TXINT_MASK | ERR_MASK);
 
-- 
1.7.9.5



[PATCH RFC net-next v4 0/7] net: improve support for SCTP checksums

2017-04-20 Thread Davide Caratti
hello Tom,

On Fri, 2017-04-07 at 11:11 -0700, Tom Herbert wrote:
> maybe just call it csum_not_ip then. Then just do "if
> (unlikely(skb->csum_not_ip)) ..."

Ok, done. V4 uses this bit for SCTP only and leaves behavior unmodified
when offloaded FCoE frames are processed. Further work is still possible
to extend this fix to FCoE, if needed, either by using additional
sk_buff bits, or by using skb->csum_not_ip together with other data
(e.g. skb->csum_offset) to distinguish SCTP from FCoE.

> the only case where this new bit is relevant is when
> CHECKSUM_PARTIAL for a CRC is being done. When it's set for offloading
> sctp crc it must be set. When CRC is resolved, in the helper for
> instance, it must be cleared.

in V4 the bit is set when SCTP packets with offloaded checksum are
generated; the bit is cleared when CRC32c is resolved for such packets
(i.e. skb->ip_summed transitions from CHECKSUM_PARTIAL to CHECKSUM_NONE).

Any feedback is appreciated!
thank you in advance,
--
davide


Davide Caratti (7):
  skbuff: add stub to help computing crc32c on SCTP packets
  net: introduce skb_crc32c_csum_help
  sk_buff: remove support for csum_bad in sk_buff
  net: use skb->csum_not_inet to identify packets needing crc32c
  net: more accurate checksumming in validate_xmit_skb()
  openvswitch: more accurate checksumming in queue_userspace_packet()
  sk_buff.h: improve description of CHECKSUM_{COMPLETE,UNNECESSARY}

 Documentation/networking/checksum-offloads.txt   | 11 +++--
 drivers/net/ethernet/aquantia/atlantic/aq_ring.c |  2 +-
 include/linux/netdevice.h|  8 +--
 include/linux/skbuff.h   | 58 +-
 net/bridge/netfilter/nft_reject_bridge.c |  5 +-
 net/core/dev.c   | 63 +---
 net/core/skbuff.c| 24 +
 net/ipv4/netfilter/nf_reject_ipv4.c  |  2 +-
 net/ipv6/netfilter/nf_reject_ipv6.c  |  3 --
 net/openvswitch/datapath.c   |  2 +-
 net/sched/act_csum.c |  1 +
 net/sctp/offload.c   |  8 +++
 net/sctp/output.c|  1 +
 13 files changed, 128 insertions(+), 60 deletions(-)

-- 
2.7.4



[PATCH RFC net-next v4 2/7] net: introduce skb_crc32c_csum_help

2017-04-20 Thread Davide Caratti
skb_crc32c_csum_help is like skb_checksum_help, but it is designed for
checksumming SCTP packets using crc32c (see RFC3309), provided that
libcrc32c.ko has been loaded beforehand. In case libcrc32c is not loaded,
invoking skb_crc32c_csum_help on a skb results in one of the following
printouts:

warn_crc32c_csum_update: attempt to compute crc32c without libcrc32c.ko
warn_crc32c_csum_combine: attempt to compute crc32c without libcrc32c.ko

Signed-off-by: Davide Caratti 
---
 include/linux/netdevice.h |  1 +
 include/linux/skbuff.h|  3 ++-
 net/core/dev.c| 40 
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b0aa089..bf84a67 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3898,6 +3898,7 @@ void netdev_rss_key_fill(void *buffer, size_t len);
 
 int dev_get_nest_level(struct net_device *dev);
 int skb_checksum_help(struct sk_buff *skb);
+int skb_crc32c_csum_help(struct sk_buff *skb);
 struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
  netdev_features_t features, bool tx_path);
 struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ba3ae21..ec4551b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -193,7 +193,8 @@
  * accordingly. Note the there is no indication in the skbuff that the
  * CHECKSUM_PARTIAL refers to an SCTP checksum, a driver that supports
  * both IP checksum offload and SCTP CRC offload must verify which offload
- * is configured for a packet presumably by inspecting packet headers.
+ * is configured for a packet presumably by inspecting packet headers; in
+ * case, skb_crc32c_csum_help is provided to compute CRC on SCTP packets.
  *
  *   NETIF_F_FCOE_CRC - This feature indicates that a device is capable of
  * offloading the FCOE CRC in a packet. To perform this offload the stack
diff --git a/net/core/dev.c b/net/core/dev.c
index 5d33e2b..c7aec95 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -140,6 +140,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "net-sysfs.h"
 
@@ -2606,6 +2607,45 @@ int skb_checksum_help(struct sk_buff *skb)
 }
 EXPORT_SYMBOL(skb_checksum_help);
 
+int skb_crc32c_csum_help(struct sk_buff *skb)
+{
+   __le32 crc32c_csum;
+   int ret = 0, offset;
+
+   if (skb->ip_summed != CHECKSUM_PARTIAL)
+   goto out;
+
+   if (unlikely(skb_is_gso(skb)))
+   goto out;
+
+   /* Before computing a checksum, we should make sure no frag could
+* be modified by an external entity : checksum could be wrong.
+*/
+   if (unlikely(skb_has_shared_frag(skb))) {
+   ret = __skb_linearize(skb);
+   if (ret)
+   goto out;
+   }
+
+   offset = skb_checksum_start_offset(skb);
+   crc32c_csum = cpu_to_le32(~__skb_checksum(skb, offset,
+ skb->len - offset, ~(__u32)0,
+ crc32c_csum_stub));
+   offset += offsetof(struct sctphdr, checksum);
+   BUG_ON(offset >= skb_headlen(skb));
+
+   if (skb_cloned(skb) &&
+   !skb_clone_writable(skb, offset + sizeof(__le32))) {
+   ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
+   if (ret)
+   goto out;
+   }
+   *(__le32 *)(skb->data + offset) = crc32c_csum;
+   skb->ip_summed = CHECKSUM_NONE;
+out:
+   return ret;
+}
+
 __be16 skb_network_protocol(struct sk_buff *skb, int *depth)
 {
__be16 type = skb->protocol;
-- 
2.7.4
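
For readers unfamiliar with the checksum that libcrc32c provides, the
algorithm is CRC-32C (Castagnoli), the polynomial SCTP mandates in
RFC 3309. A minimal bit-at-a-time userspace sketch, not the kernel
implementation (which uses table-driven or instruction-accelerated
variants), looks like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time CRC-32C over a buffer, reflected polynomial 0x82F63B78.
 * Slow but easy to audit; crc starts at 0 for a fresh computation. */
uint32_t crc32c(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
	}
	return ~crc;
}
```

The standard check value applies: the CRC-32C of the ASCII string
"123456789" is 0xE3069283.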



[PATCH RFC net-next v4 1/7] skbuff: add stub to help computing crc32c on SCTP packets

2017-04-20 Thread Davide Caratti
sctp_compute_checksum requires crc32c symbol (provided by libcrc32c), so
it can't be used in net core. Like it has been done previously with other
symbols (e.g. ipv6_dst_lookup), introduce a stub struct skb_checksum_ops
to allow computation of crc32c checksum in net core after sctp.ko (and thus
libcrc32c) has been loaded.

Reviewed-by: Marcelo Ricardo Leitner 
Signed-off-by: Davide Caratti 
---
 include/linux/skbuff.h |  2 ++
 net/core/skbuff.c  | 24 
 net/sctp/offload.c |  7 +++
 3 files changed, 33 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 741d75c..ba3ae21 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3127,6 +3127,8 @@ struct skb_checksum_ops {
__wsum (*combine)(__wsum csum, __wsum csum2, int offset, int len);
 };
 
+extern const struct skb_checksum_ops *crc32c_csum_stub __read_mostly;
+
 __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len,
  __wsum csum, const struct skb_checksum_ops *ops);
 __wsum skb_checksum(const struct sk_buff *skb, int offset, int len,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ad2af56..182608b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2242,6 +2242,30 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,
 }
 EXPORT_SYMBOL(skb_copy_and_csum_bits);
 
+static __wsum warn_crc32c_csum_update(const void *buff, int len, __wsum sum)
+{
+   net_warn_ratelimited(
+   "%s: attempt to compute crc32c without libcrc32c.ko\n",
+   __func__);
+   return 0;
+}
+
+static __wsum warn_crc32c_csum_combine(__wsum csum, __wsum csum2,
+  int offset, int len)
+{
+   net_warn_ratelimited(
+   "%s: attempt to compute crc32c without libcrc32c.ko\n",
+   __func__);
+   return 0;
+}
+
+const struct skb_checksum_ops *crc32c_csum_stub __read_mostly =
+   &(struct skb_checksum_ops) {
+   .update  = warn_crc32c_csum_update,
+   .combine = warn_crc32c_csum_combine,
+};
+EXPORT_SYMBOL(crc32c_csum_stub);
+
  /**
  * skb_zerocopy_headlen - Calculate headroom needed for skb_zerocopy()
  * @from: source buffer
diff --git a/net/sctp/offload.c b/net/sctp/offload.c
index 4f5a2b5..378f462 100644
--- a/net/sctp/offload.c
+++ b/net/sctp/offload.c
@@ -98,6 +98,12 @@ static const struct net_offload sctp6_offload = {
},
 };
 
+static const struct skb_checksum_ops *crc32c_csum_ops __read_mostly =
+   &(struct skb_checksum_ops) {
+   .update  = sctp_csum_update,
+   .combine = sctp_csum_combine,
+};
+
 int __init sctp_offload_init(void)
 {
int ret;
@@ -110,6 +116,7 @@ int __init sctp_offload_init(void)
if (ret)
goto ipv4;
 
+   crc32c_csum_stub = crc32c_csum_ops;
return ret;
 
 ipv4:
-- 
2.7.4
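
The stub-override technique this patch uses, default ops that only warn,
swapped for the real ops once the provider module initializes, can be
sketched in isolation. All names below are made up for illustration; the
kernel equivalents are skb_checksum_ops, crc32c_csum_stub and
sctp_offload_init().

```c
#include <stdio.h>

struct csum_ops {
	unsigned int (*update)(const void *buf, unsigned int len);
};

/* Default implementation: complain loudly, return a harmless value. */
static unsigned int warn_update(const void *buf, unsigned int len)
{
	(void)buf; (void)len;
	fprintf(stderr, "crc32c requested before a provider was registered\n");
	return 0;
}

static const struct csum_ops warn_ops = { .update = warn_update };

/* Consumers always call through this pointer, never a stub directly. */
const struct csum_ops *crc32c_stub = &warn_ops;

/* A "real" provider, registered at init time (here it just returns len). */
static unsigned int real_update(const void *buf, unsigned int len)
{
	(void)buf;
	return len;
}

static const struct csum_ops real_ops = { .update = real_update };

void provider_init(void)	/* plays the role of sctp_offload_init() */
{
	crc32c_stub = &real_ops;
}
```

Callers that run before provider_init() still make progress, they just
get the warning path instead of a crash or an unresolved symbol.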



[PATCH RFC net-next v4 5/7] net: more accurate checksumming in validate_xmit_skb()

2017-04-20 Thread Davide Caratti
skb_csum_hwoffload_help() uses netdev features and skb->csum_not_inet to
determine if skb needs software computation of Internet Checksum or crc32c
(or nothing, if this computation can be done by the hardware). Use it in
place of skb_checksum_help() in validate_xmit_skb() to avoid corruption
of non-GSO SCTP packets having skb->ip_summed equal to CHECKSUM_PARTIAL.

While at it, remove references to skb_csum_off_chk* functions, since they
are not present anymore in Linux since commit cf53b1da73bd ('Revert "net:
Add driver helper functions to determine checksum"').

Signed-off-by: Davide Caratti 
---
 Documentation/networking/checksum-offloads.txt | 11 +++
 include/linux/netdevice.h  |  3 +++
 include/linux/skbuff.h | 13 +
 net/core/dev.c | 14 --
 4 files changed, 27 insertions(+), 14 deletions(-)

diff --git a/Documentation/networking/checksum-offloads.txt b/Documentation/networking/checksum-offloads.txt
index 56e3686..d52d191 100644
--- a/Documentation/networking/checksum-offloads.txt
+++ b/Documentation/networking/checksum-offloads.txt
@@ -35,6 +35,9 @@ This interface only allows a single checksum to be offloaded.  Where
  encapsulation is used, the packet may have multiple checksum fields in
  different header layers, and the rest will have to be handled by another
  mechanism such as LCO or RCO.
+CRC32c can also be offloaded using this interface, by means of filling
+ skb->csum_start and skb->csum_offset as described above, and setting
+ skb->csum_not_inet: see skbuff.h comment (section 'D') for more details.
 No offloading of the IP header checksum is performed; it is always done in
  software.  This is OK because when we build the IP header, we obviously
  have it in cache, so summing it isn't expensive.  It's also rather short.
@@ -49,9 +52,9 @@ A driver declares its offload capabilities in netdev->hw_features; see
  and csum_offset given in the SKB; if it tries to deduce these itself in
  hardware (as some NICs do) the driver should check that the values in the
  SKB match those which the hardware will deduce, and if not, fall back to
- checksumming in software instead (with skb_checksum_help or one of the
- skb_csum_off_chk* functions as mentioned in include/linux/skbuff.h).  This
- is a pain, but that's what you get when hardware tries to be clever.
+ checksumming in software instead (with skb_csum_hwoffload_help() or one of
+ the skb_checksum_help() / skb_crc32c_csum_help functions, as mentioned in
+ include/linux/skbuff.h).
 
 The stack should, for the most part, assume that checksum offload is
  supported by the underlying device.  The only place that should check is
@@ -60,7 +63,7 @@ The stack should, for the most part, assume that checksum offload is
  may include other offloads besides TX Checksum Offload) and, if they are
  not supported or enabled on the device (determined by netdev->features),
  performs the corresponding offload in software.  In the case of TX
- Checksum Offload, that means calling skb_checksum_help(skb).
+ Checksum Offload, that means calling skb_csum_hwoffload_help(skb, features).
 
 
 LCO: Local Checksum Offload
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ab9e3dc..45e8958 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3897,6 +3897,9 @@ void netdev_rss_key_fill(void *buffer, size_t len);
 int dev_get_nest_level(struct net_device *dev);
 int skb_checksum_help(struct sk_buff *skb);
 int skb_crc32c_csum_help(struct sk_buff *skb);
+int skb_csum_hwoffload_help(struct sk_buff *skb,
+   const netdev_features_t features);
+
 struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
  netdev_features_t features, bool tx_path);
 struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 419f4c8..4002c11 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -162,14 +162,11 @@
  *
  *   NETIF_F_IP_CSUM and NETIF_F_IPV6_CSUM are being deprecated in favor of
  *   NETIF_F_HW_CSUM. New devices should use NETIF_F_HW_CSUM to indicate
- *   checksum offload capability. If a device has limited checksum capabilities
- *   (for instance can only perform NETIF_F_IP_CSUM or NETIF_F_IPV6_CSUM as
- *   described above) a helper function can be called to resolve
- *   CHECKSUM_PARTIAL. The helper functions are skb_csum_off_chk*. The helper
- *   function takes a spec argument that describes the protocol layer that is
- *   supported for checksum offload and can be called for each packet. If a
- *   packet does not match the specification for offload, skb_checksum_help
- *   is called to resolve the checksum.
+ *   checksum offload capability.
+ *   skb_csum_hwoffload_help() can be called to resolve CHECKSUM_PARTIAL based
+ *   on network device checksumming capabilities: if a packet does not match
+ 
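
The decision skb_csum_hwoffload_help() makes can be modeled as follows.
This is a hedged userspace sketch: the types, names and flag parameters
are made up, and only the dispatch logic (which of the two software
helpers to fall back to, if any) mirrors the patch.

```c
/* What the caller should do with the packet's pending checksum. */
enum csum_action {
	CSUM_KEEP_OFFLOADED,	/* hardware will finish the checksum */
	CSUM_DO_CRC32C,		/* fall back to skb_crc32c_csum_help() */
	CSUM_DO_INET		/* fall back to skb_checksum_help() */
};

struct fake_skb {
	int ip_summed_partial;	/* models ip_summed == CHECKSUM_PARTIAL */
	int csum_not_inet;	/* models skb->csum_not_inet */
};

enum csum_action csum_hwoffload_decide(const struct fake_skb *skb,
				       unsigned int features,
				       unsigned int f_sctp_crc,
				       unsigned int f_hw_csum)
{
	if (!skb->ip_summed_partial)
		return CSUM_KEEP_OFFLOADED;	/* nothing left to resolve */
	if (skb->csum_not_inet)			/* SCTP wants CRC32c */
		return (features & f_sctp_crc) ? CSUM_KEEP_OFFLOADED
					       : CSUM_DO_CRC32C;
	return (features & f_hw_csum) ? CSUM_KEEP_OFFLOADED : CSUM_DO_INET;
}
```

Passing features == 0, as the openvswitch caller in patch 6/7 does,
always resolves the checksum in software, choosing the right algorithm.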

[PATCH RFC net-next v4 6/7] openvswitch: more accurate checksumming in queue_userspace_packet()

2017-04-20 Thread Davide Caratti
If skb carries an SCTP packet and ip_summed is CHECKSUM_PARTIAL, it needs
CRC32c in place of the Internet Checksum: use skb_csum_hwoffload_help to
avoid corrupting such packets while queueing them towards userspace.

Signed-off-by: Davide Caratti 
---
 net/openvswitch/datapath.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 7b17da9..9ddc9f8 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -453,7 +453,7 @@ static int queue_userspace_packet(struct datapath *dp, struct sk_buff *skb,
 
/* Complete checksum if needed */
if (skb->ip_summed == CHECKSUM_PARTIAL &&
-   (err = skb_checksum_help(skb)))
+   (err = skb_csum_hwoffload_help(skb, 0)))
goto out;
 
/* Older versions of OVS user space enforce alignment of the last
-- 
2.7.4



[PATCH RFC net-next v4 4/7] net: use skb->csum_not_inet to identify packets needing crc32c

2017-04-20 Thread Davide Caratti
skb->csum_not_inet indicates which algorithm is needed to compute the
checksum on skb in the transmit path, when skb->ip_summed is equal to
CHECKSUM_PARTIAL. If skb carries an SCTP packet and the crc32c hasn't
been written in the L4 header yet, skb->csum_not_inet is set to 1;
otherwise, the Internet Checksum is assumed to be needed and
skb->csum_not_inet is set to 0.

Suggested-by: Tom Herbert 
Signed-off-by: Davide Caratti 
---
 include/linux/skbuff.h | 16 +---
 net/core/dev.c |  1 +
 net/sched/act_csum.c   |  1 +
 net/sctp/offload.c |  1 +
 net/sctp/output.c  |  1 +
 5 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 927309e..419f4c8 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -189,12 +189,13 @@
  *
  *   NETIF_F_SCTP_CRC - This feature indicates that a device is capable of
  * offloading the SCTP CRC in a packet. To perform this offload the stack
- * will set ip_summed to CHECKSUM_PARTIAL and set csum_start and csum_offset
- * accordingly. Note the there is no indication in the skbuff that the
- * CHECKSUM_PARTIAL refers to an SCTP checksum, a driver that supports
- * both IP checksum offload and SCTP CRC offload must verify which offload
- * is configured for a packet presumably by inspecting packet headers; in
- * case, skb_crc32c_csum_help is provided to compute CRC on SCTP packets.
+ * will set set csum_start and csum_offset accordingly, set ip_summed to
+ * CHECKSUM_PARTIAL and set csum_not_inet to 1, to provide an indication in
+ * the skbuff that the CHECKSUM_PARTIAL refers to CRC32c.
+ * A driver that supports both IP checksum offload and SCTP CRC32c offload
+ * must verify which offload is configured for a packet by testing the
+ * value of skb->csum_not_inet; skb_crc32c_csum_help is provided to resolve
+ * CHECKSUM_PARTIAL on skbs where csum_not_inet is set to 1.
  *
  *   NETIF_F_FCOE_CRC - This feature indicates that a device is capable of
  * offloading the FCOE CRC in a packet. To perform this offload the stack
@@ -615,6 +616,7 @@ static inline bool skb_mstamp_after(const struct skb_mstamp *t1,
  * @wifi_acked_valid: wifi_acked was set
  * @wifi_acked: whether frame was acked on wifi or not
  * @no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
+ * @csum_not_inet: use CRC32c to resolve CHECKSUM_PARTIAL
  * @dst_pending_confirm: need to confirm neighbour
   *@napi_id: id of the NAPI struct this skb came from
  * @secmark: security marking
@@ -743,7 +745,7 @@ struct sk_buff {
__u8csum_valid:1;
__u8csum_complete_sw:1;
__u8csum_level:2;
-   __u8__csum_bad_unused:1; /* one bit hole */
+   __u8csum_not_inet:1;
 
__u8dst_pending_confirm:1;
 #ifdef CONFIG_IPV6_NDISC_NODETYPE
diff --git a/net/core/dev.c b/net/core/dev.c
index 77a2d73..9f56f87 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2642,6 +2642,7 @@ int skb_crc32c_csum_help(struct sk_buff *skb)
}
*(__le32 *)(skb->data + offset) = crc32c_csum;
skb->ip_summed = CHECKSUM_NONE;
+   skb->csum_not_inet = 0;
 out:
return ret;
 }
diff --git a/net/sched/act_csum.c b/net/sched/act_csum.c
index ab6fdbd..3317a2f 100644
--- a/net/sched/act_csum.c
+++ b/net/sched/act_csum.c
@@ -350,6 +350,7 @@ static int tcf_csum_sctp(struct sk_buff *skb, unsigned int ihl,
sctph->checksum = sctp_compute_cksum(skb,
 skb_network_offset(skb) + ihl);
skb->ip_summed = CHECKSUM_NONE;
+   skb->csum_not_inet = 0;
 
return 1;
 }
diff --git a/net/sctp/offload.c b/net/sctp/offload.c
index 378f462..ef156ac 100644
--- a/net/sctp/offload.c
+++ b/net/sctp/offload.c
@@ -35,6 +35,7 @@
 static __le32 sctp_gso_make_checksum(struct sk_buff *skb)
 {
skb->ip_summed = CHECKSUM_NONE;
+   skb->csum_not_inet = 0;
return sctp_compute_cksum(skb, skb_transport_offset(skb));
 }
 
diff --git a/net/sctp/output.c b/net/sctp/output.c
index 1409a87..e2edf2e 100644
--- a/net/sctp/output.c
+++ b/net/sctp/output.c
@@ -538,6 +538,7 @@ static int sctp_packet_pack(struct sctp_packet *packet,
} else {
 chksum:
head->ip_summed = CHECKSUM_PARTIAL;
+   head->csum_not_inet = 1;
head->csum_start = skb_transport_header(head) - head->head;
head->csum_offset = offsetof(struct sctphdr, checksum);
}
-- 
2.7.4



[PATCH RFC net-next v4 7/7] sk_buff.h: improve description of CHECKSUM_{COMPLETE,UNNECESSARY}

2017-04-20 Thread Davide Caratti
Add FCoE to the list of protocols that can set CHECKSUM_UNNECESSARY; add a
note to CHECKSUM_COMPLETE section to specify that it does not apply to SCTP
and FCoE protocols.

Suggested-by: Tom Herbert 
Signed-off-by: Davide Caratti 
---
 include/linux/skbuff.h | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4002c11..c902b77 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -109,6 +109,7 @@
  *   may perform further validation in this case.
  * GRE: only if the checksum is present in the header.
  * SCTP: indicates the CRC in SCTP header has been validated.
+ * FCOE: indicates the CRC in FC frame has been validated.
  *
  *   skb->csum_level indicates the number of consecutive checksums found in
  *   the packet minus one that have been verified as CHECKSUM_UNNECESSARY.
@@ -126,8 +127,10 @@
  *   packet as seen by netif_rx() and fills out in skb->csum. Meaning, the
  *   hardware doesn't need to parse L3/L4 headers to implement this.
  *
- *   Note: Even if device supports only some protocols, but is able to produce
- *   skb->csum, it MUST use CHECKSUM_COMPLETE, not CHECKSUM_UNNECESSARY.
+ *   Notes:
+ *   - Even if device supports only some protocols, but is able to produce
+ * skb->csum, it MUST use CHECKSUM_COMPLETE, not CHECKSUM_UNNECESSARY.
+ *   - CHECKSUM_COMPLETE is not applicable to SCTP and FCoE protocols.
  *
  * CHECKSUM_PARTIAL:
  *
-- 
2.7.4



[PATCH RFC net-next v4 3/7] sk_buff: remove support for csum_bad in sk_buff

2017-04-20 Thread Davide Caratti
This bit was introduced with 5a21232983aa ("net: Support for csum_bad in
skbuff") to reduce the stack workload when processing RX packets carrying
a wrong Internet Checksum. Up to now, only one driver (besides the GRO core)
is setting it.
The test on NAPI_GRO_CB(skb)->flush in dev_gro_receive() is now done
before the test on same_flow, to preserve behavior in case of wrong
checksum.

Suggested-by: Tom Herbert 
Signed-off-by: Davide Caratti 
---
 drivers/net/ethernet/aquantia/atlantic/aq_ring.c |  2 +-
 include/linux/netdevice.h|  4 +---
 include/linux/skbuff.h   | 23 ++-
 net/bridge/netfilter/nft_reject_bridge.c |  5 +
 net/core/dev.c   |  8 +++-
 net/ipv4/netfilter/nf_reject_ipv4.c  |  2 +-
 net/ipv6/netfilter/nf_reject_ipv6.c  |  3 ---
 7 files changed, 9 insertions(+), 38 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c 
b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
index 3a8a4aa..9a08179 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ring.c
@@ -223,7 +223,7 @@ int aq_ring_rx_clean(struct aq_ring_s *self, int 
*work_done, int budget)
skb->protocol = eth_type_trans(skb, ndev);
if (unlikely(buff->is_cso_err)) {
++self->stats.rx.errors;
-   __skb_mark_checksum_bad(skb);
+   skb->ip_summed = CHECKSUM_NONE;
} else {
if (buff->is_ip_cso) {
__skb_incr_checksum_unnecessary(skb);
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bf84a67..ab9e3dc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2546,9 +2546,7 @@ static inline void skb_gro_incr_csum_unnecessary(struct 
sk_buff *skb)
if (__skb_gro_checksum_validate_needed(skb, zero_okay, check))  \
__ret = __skb_gro_checksum_validate_complete(skb,   \
compute_pseudo(skb, proto));\
-   if (__ret)  \
-   __skb_mark_checksum_bad(skb);   \
-   else\
+   if (!__ret) \
skb_gro_incr_csum_unnecessary(skb); \
__ret;  \
 })
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ec4551b..927309e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -743,7 +743,7 @@ struct sk_buff {
__u8csum_valid:1;
__u8csum_complete_sw:1;
__u8csum_level:2;
-   __u8csum_bad:1;
+   __u8__csum_bad_unused:1; /* one bit hole */
 
__u8dst_pending_confirm:1;
 #ifdef CONFIG_IPV6_NDISC_NODETYPE
@@ -3387,21 +3387,6 @@ static inline void 
__skb_incr_checksum_unnecessary(struct sk_buff *skb)
}
 }
 
-static inline void __skb_mark_checksum_bad(struct sk_buff *skb)
-{
-   /* Mark current checksum as bad (typically called from GRO
-* path). In the case that ip_summed is CHECKSUM_NONE
-* this must be the first checksum encountered in the packet.
-* When ip_summed is CHECKSUM_UNNECESSARY, this is the first
-* checksum after the last one validated. For UDP, a zero
-* checksum can not be marked as bad.
-*/
-
-   if (skb->ip_summed == CHECKSUM_NONE ||
-   skb->ip_summed == CHECKSUM_UNNECESSARY)
-   skb->csum_bad = 1;
-}
-
 /* Check if we need to perform checksum complete validation.
  *
  * Returns true if checksum complete is needed, false otherwise
@@ -3455,9 +3440,6 @@ static inline __sum16 
__skb_checksum_validate_complete(struct sk_buff *skb,
skb->csum_valid = 1;
return 0;
}
-   } else if (skb->csum_bad) {
-   /* ip_summed == CHECKSUM_NONE in this case */
-   return (__force __sum16)1;
}
 
skb->csum = psum;
@@ -3517,8 +3499,7 @@ static inline __wsum null_compute_pseudo(struct sk_buff 
*skb, int proto)
 
 static inline bool __skb_checksum_convert_check(struct sk_buff *skb)
 {
-   return (skb->ip_summed == CHECKSUM_NONE &&
-   skb->csum_valid && !skb->csum_bad);
+   return (skb->ip_summed == CHECKSUM_NONE && skb->csum_valid);
 }
 
 static inline void __skb_checksum_convert(struct sk_buff *skb,
diff --git a/net/bridge/netfilter/nft_reject_bridge.c 
b/net/bridge/netfilter/nft_reject_bridge.c
index 346ef6b..c16dd3a 100644
--- a/net/bridge/netfilter/nft_reject_bridge.c
+++ b/net/bridge/netfilter/nft_reject

Re: sctp: deny peeloff operation on asocs with threads sleeping on it

2017-04-20 Thread Marcelo Ricardo Leitner

On 20-04-2017 09:32, Salvatore Bonaccorso wrote:

Hi

According to the documentation I should have sent this to
netdev@vger.kernel.org rather than sta...@vger.kernel.org.

Rationale: Whilst 00eff2ebbd229758e90659907724c14dd5a18339 went to
stable, the dfcb9f4f99f1e9a49e43398a7bfbf56927544af1 commit is
missing.


Cc'ing Greg too.

This is probably because I noticed this commit was missing when Greg 
posted the commits being considered for backports, then I raised the 
concern and Greg decided to pull it too before David replied.


Not sure if this caused some out-of-the-process stuff.



Quoting in full, my original misdirected mail:

On Thu, Apr 20, 2017 at 01:44:05PM +0200, Salvatore Bonaccorso wrote:

Hi

Apparently the following commit

dfcb9f4f99f1e9a49e43398a7bfbf56927544af1 (sctp: deny peeloff operation
on asocs with threads sleeping on it)

(was not CC'ed sta...@vger.kernel.org, but was already applied in
3.2.87 and 3.16.42) is missing from 4.9:

2dcab598484185dea7ec22219c76dcdd59e3cb90 was included in 4.9.11 via
00eff2ebbd229758e90659907724c14dd5a18339 . But
dfcb9f4f99f1e9a49e43398a7bfbf56927544af1 is missing from 4.9.

This was assigned CVE-2017-6353.

Reference:
https://marc.info/?l=linux-netdev&m=148785309416337&w=2
http://www.openwall.com/lists/oss-security/2017/02/27/2

Regards,
Salvatore


Regards,
Salvatore



RE: [net-next 04/14] i40e: dump VF information in debugfs

2017-04-20 Thread Mintz, Yuval
> >> > Dump some internal state about VFs through debugfs. This provides
> >> > information not available with 'ip link show'.
> >>
> >> such as?
> >>
> >> donnwantobethedebugfspolice, but still, in the 2-3 times we tried to
> >> push debugfs to MLNX NIC drivers, Dave disallowed that, and lately
> >> the switch team even went further and deleted that portion of the
> >> mlxsw driver -- all in all, I don't see much point for these types of 
> >> changes,
> thoughts?
> >
> >Don't want to hikjack your thread, but continuing this topic - Is there
> >some flat-out disapproval for debugfs in net-next now?
> 
> It was mentioned by DaveM on the list multiple times.

> >
> >We're currently internally engaged with adding qed support for register
> >dumps [~equivalents for `ethtool -d' outputs] through debugfs, on
> >behalf of storage drivers [qedi/qedf] lacking the API for doing that.
> 
> That sounds wrong. Either you introduce a generic infra to expose the info
> you need, or you don't do that.

I agree this surely isn't *our* convention, but for SCSI, using debugfs
[or sysfs] is a common practice for debug purposes.

Also, our HW debug capabilities are highly customizable, and I want to
have the ability to configure those on the fly [E.g., dynamically
configuring various HW events to be recorded].
Each such configuration involves multiple register writes and reads
according to user provided inputs.
I don't really see how to generalize the information collection in a way
that would benefit anyone else.

> For your in-house debugging, you should have an out-of-tree
> patch to expose whatever you need.

I don't want in-house debugging capabilities -
I want field debug capabilities.



Re: [PATCH v2] mdio_bus: Issue GPIO RESET to PHYs.

2017-04-20 Thread Roger Quadros
On 20/04/17 16:13, Andrew Lunn wrote:
> On Thu, Apr 20, 2017 at 11:39:20AM +0300, Roger Quadros wrote:
>> Some boards [1] leave the PHYs at an invalid state
>> during system power-up or reset thus causing unreliability
>> issues with the PHY which manifests as PHY not being detected
>> or link not functional. To fix this, these PHYs need to be RESET
>> via a GPIO connected to the PHY's RESET pin.
>>
>> Some boards have a single GPIO controlling the PHY RESET pin of all
>> PHYs on the bus whereas some others have separate GPIOs controlling
>> individual PHY RESETs.
>>
>> In both cases, the RESET de-assertion cannot be done in the PHY driver
>> as the PHY will not probe till its reset is de-asserted.
>> So do the RESET de-assertion in the MDIO bus driver.
> 
> Hi Roger
> 
> Networking patches should include in the subject line which tree they
> are for. In this case, net-next. So please make the Subject
> 
> [patch v3 net-next] .
> 
>> [1] - am572x-idk, am571x-idk, a437x-idk
>>
>> Signed-off-by: Roger Quadros 
>> ---
>> v2:
>> - add device tree binding document (mdio.txt)
>> - specify default reset delay in of_mdio.c instead of mdio_bus.c
>>
>>  Documentation/devicetree/bindings/net/mdio.txt | 20 
>>  drivers/net/phy/mdio_bus.c | 22 ++
>>  drivers/of/of_mdio.c   |  7 +++
>>  include/linux/phy.h|  5 +
>>  4 files changed, 54 insertions(+)
>>  create mode 100644 Documentation/devicetree/bindings/net/mdio.txt
>>
>> diff --git a/Documentation/devicetree/bindings/net/mdio.txt 
>> b/Documentation/devicetree/bindings/net/mdio.txt
>> new file mode 100644
>> index 000..6e703d7
>> --- /dev/null
>> +++ b/Documentation/devicetree/bindings/net/mdio.txt
>> @@ -0,0 +1,20 @@
>> +Common MDIO bus properties.
>> +
>> +These are generic properties that can apply to any MDIO bus.
>> +
>> +Optional properties:
>> +- reset-gpios: List of one or more GPIOs that control the RESET lines
>> +  of the PHYs on that MDIO bus.
>> +- reset-delay-us: RESET pulse width as per PHY datasheet.
> 
> It would be good to explicitly say that it is in uS as part of the
> comment.
> 
> Also, please document that we expect a list of child nodes, one per
> device on the bus. These should follow the generic phy.txt, or a
> device specific binding document.
> 
>> +
>> +Example :
>> +
> 
> It would be good to say something like:
> 
> This example shows these optional properties, plus other properties
> required for the TI Davinci MDIO driver.
> 
> Pointing this out may stop people cut/past the ti,hwmods property.
> 
>> +davinci_mdio: ethernet@0x5c03 {
>> +compatible = "ti,davinci_mdio";
>> +ti,hwmods = "davinci_mdio";
>> +reg = <0x5c03 0x1000>;
>> +#address-cells = <1>;
>> +#size-cells = <0>;
>> +reset-gpios = <&gpio2 5 GPIO_ACTIVE_LOW>;
>> +reset-delay-us = <2>;   /* PHY datasheet states 1uS min */
>> +};
> 
> And please include at least one PHY on the bus.
> 
> Sorry for asking for so much in the documentation. That is the problem
> with the documentation being missing to start with.
> 
> The code looks good now.
> 

OK Andrew. I'll post a v3 with your suggestions.

cheers,
-roger


Re: [PATCH net-next v2 2/5] virtio-net: transmit napi

2017-04-20 Thread Willem de Bruijn
On Thu, Apr 20, 2017 at 2:27 AM, Jason Wang  wrote:
>
>
> On 2017年04月19日 04:21, Willem de Bruijn wrote:
>>
>> +static void virtnet_napi_tx_enable(struct virtnet_info *vi,
>> +  struct virtqueue *vq,
>> +  struct napi_struct *napi)
>> +{
>> +   if (!napi->weight)
>> +   return;
>> +
>> +   if (!vi->affinity_hint_set) {
>> +   napi->weight = 0;
>> +   return;
>> +   }
>> +
>> +   return virtnet_napi_enable(vq, napi);
>> +}
>> +
>>   static void refill_work(struct work_struct *work)
>
>
> Maybe I was wrong, but according to Michael's comment it looks like he wants
> to check affinity_hint_set just for speculative tx polling on rx napi instead
> of disabling it entirely.
>
> And I'm not convinced this is really needed; the driver only provides an
> affinity hint instead of an affinity, so it's not guaranteed that tx and rx
> interrupts are on the same vcpus.

You're right. I made the restriction broader than the request, to really err
on the side of caution for the initial merge of napi tx. And enabling
the optimization is always a win over keeping it off, even without irq
affinity.

The cycle cost is significant without affinity regardless of whether the
optimization is used. Though this is not limited to napi-tx, it is more
pronounced in that mode than without napi.

1x TCP_RR for affinity configuration {process, rx_irq, tx_irq}:

upstream:

1,1,1: 28985 Mbps, 278 Gcyc
1,0,2: 30067 Mbps, 402 Gcyc

napi tx:

1,1,1: 34492 Mbps, 269 Gcyc
1,0,2: 36527 Mbps, 537 Gcyc (!)
1,0,1: 36269 Mbps, 394 Gcyc
1,0,0: 34674 Mbps, 402 Gcyc

This is a particularly strong example. It is also representative
of most RR tests. It is less pronounced in other streaming tests.
10x TCP_RR, for instance:

upstream:

1,1,1: 42267 Mbps, 301 Gcyc
1,0,2: 40663 Mbps, 445 Gcyc

napi tx:

1,1,1: 42420 Mbps, 303 Gcyc
1,0,2:  42267 Mbps, 431 Gcyc

These numbers were obtained with the virtqueue_enable_cb_delayed
optimization after xmit_skb, btw. It turns out that moving that before
increases 1x TCP_RR further to ~39 Gbps, at the cost of reducing
100x TCP_RR a bit.


Re: [PATCH net-next v5 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-20 Thread Jiri Pirko
Thu, Apr 20, 2017 at 03:06:21PM CEST, j...@mojatatu.com wrote:
>From: Jamal Hadi Salim 
>
>When you dump hundreds of thousands of actions, getting only 32 per
>dump batch even when the socket buffer and memory allocations allow
>is inefficient.
>
>With this change, the user will get as many as possibly fitting
>within the given constraints available to the kernel.
>
>A new top level TLV space is introduced. An attribute
>TCAA_ACT_FLAGS is used to carry the flags indicating the user
>is capable of processing these large dumps. Older user space which
>doesn't set this flag doesn't get the larger (than 32) batches.
>The kernel uses the TCAA_ACT_COUNT attribute to tell the user how many
>actions are put in a single batch. As such user space app knows how long
>to iterate (independent of the type of action being dumped)
>instead of hardcoded maximum of 32.
>
>Some results dumping 1.5M actions, first unpatched tc which the
>kernel doesn't help:
>
>prompt$ time -p tc actions ls action gact | grep index | wc -l
>150
>real 1388.43
>user 2.07
>sys 1386.79
>
>Now lets see a patched tc which sets the correct flags when requesting
>a dump:
>
>prompt$ time -p updatedtc actions ls action gact | grep index | wc -l
>150
>real 178.13
>user 2.02
>sys 176.96
>
>Signed-off-by: Jamal Hadi Salim 
>---
> include/uapi/linux/rtnetlink.h | 21 +--
> net/sched/act_api.c| 46 +-
> 2 files changed, 55 insertions(+), 12 deletions(-)
>
>diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
>index cce0613..d7d28ec 100644
>--- a/include/uapi/linux/rtnetlink.h
>+++ b/include/uapi/linux/rtnetlink.h
>@@ -674,10 +674,27 @@ struct tcamsg {
>   unsigned char   tca__pad1;
>   unsigned short  tca__pad2;
> };
>+
>+enum {
>+  TCAA_UNSPEC,

TCAA stands for "traffic control action action". I don't get it :( 
Prefix still sounds wrong to me, sorry :/
Should be:
TCA_SOMETHING_*


>+  TCAA_ACT_TAB,
>+#define TCA_ACT_TAB TCAA_ACT_TAB
>+  TCAA_ACT_FLAGS,
>+  TCAA_ACT_COUNT,
>+  __TCAA_MAX,
>+#define   TCAA_MAX (__TCAA_MAX - 1)
>+};
>+
> #define TA_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct 
> tcamsg
> #define TA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct tcamsg))
>-#define TCA_ACT_TAB 1 /* attr type must be >=1 */ 
>-#define TCAA_MAX 1
>+/* tcamsg flags stored in attribute TCAA_ACT_FLAGS
>+ *
>+ * ACT_LARGE_DUMP_ON user->kernel to request for larger than TCA_ACT_MAX_PRIO
>+ * actions in a dump. All dump responses will contain the number of actions
>+ * being dumped stored in for user app's consumption in TCAA_ACT_COUNT
>+ *
>+ */
>+#define ACT_LARGE_DUMP_ON BIT(0)

Please put some prefix to the name, as I asked for in the previous
version.


> 
> /* New extended info filters for IFLA_EXT_MASK */
> #define RTEXT_FILTER_VF   (1 << 0)
>diff --git a/net/sched/act_api.c b/net/sched/act_api.c
>index 82b1d48..f85b8fd 100644
>--- a/net/sched/act_api.c
>+++ b/net/sched/act_api.c
>@@ -83,6 +83,7 @@ static int tcf_dump_walker(struct tcf_hashinfo *hinfo, 
>struct sk_buff *skb,
>  struct netlink_callback *cb)
> {
>   int err = 0, index = -1, i = 0, s_i = 0, n_i = 0;
>+  unsigned short act_flags = cb->args[2];
>   struct nlattr *nest;
> 
>   spin_lock_bh(&hinfo->lock);
>@@ -111,14 +112,18 @@ static int tcf_dump_walker(struct tcf_hashinfo *hinfo, 
>struct sk_buff *skb,
>   }
>   nla_nest_end(skb, nest);
>   n_i++;
>-  if (n_i >= TCA_ACT_MAX_PRIO)
>+  if (!(act_flags & ACT_LARGE_DUMP_ON) &&
>+  n_i >= TCA_ACT_MAX_PRIO)
>   goto done;
>   }
>   }
> done:
>   spin_unlock_bh(&hinfo->lock);
>-  if (n_i)
>+  if (n_i) {
>   cb->args[0] += n_i;
>+  if (act_flags & ACT_LARGE_DUMP_ON)
>+  cb->args[1] = n_i;
>+  }
>   return n_i;
> 
> nla_put_failure:
>@@ -993,11 +998,15 @@ static int tcf_action_add(struct net *net, struct nlattr 
>*nla,
>   return tcf_add_notify(net, n, &actions, portid);
> }
> 
>+static const struct nla_policy tcaa_policy[TCAA_MAX + 1] = {
>+  [TCAA_ACT_FLAGS]  = { .type = NLA_U32 },
>+};
>+
> static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
>struct netlink_ext_ack *extack)
> {
>   struct net *net = sock_net(skb->sk);
>-  struct nlattr *tca[TCA_ACT_MAX + 1];
>+  struct nlattr *tca[TCAA_MAX + 1];
>   u32 portid = skb ? NETLINK_CB(skb).portid : 0;
>   int ret = 0, ovr = 0;
> 
>@@ -1005,7 +1014,7 @@ static int tc_ctl_action(struct sk_buff *skb, struct 
>nlmsghdr *n,
>   !netlink_capable(skb, CAP_NET_ADMIN))
>   return -EPERM;
> 
>-  ret = nlmsg_parse(n, sizeof(struct tcamsg), tca, TCA_ACT_MAX, NULL,
>+

Re: [PATCH net v2] net/mlx5e: Fix race in mlx5e_sw_stats and mlx5e_vport_stats

2017-04-20 Thread Saeed Mahameed
On Thu, Apr 20, 2017 at 2:32 AM, Martin KaFai Lau  wrote:
> We have observed a sudden spike in rx/tx_packets and rx/tx_bytes
> reported under /proc/net/dev.  There is a race in mlx5e_update_stats()
> and some of the get-stats functions (the one that we hit is the
> mlx5e_get_stats() which is called by ndo_get_stats64()).
>
> In particular, the very first thing mlx5e_update_sw_counters()
> does is 'memset(s, 0, sizeof(*s))'.  For example, if mlx5e_get_stats()
> is unlucky at one point, rx_bytes and rx_packets could be 0.  One second
> later, a normal (and much bigger than 0) value will be reported.
>
> This patch is to use a 'struct mlx5e_sw_stats temp' to avoid
> a direct memset zero on priv->stats.sw.
>
> mlx5e_update_vport_counters() has a similar race.  Hence, addressed
> together.
>
> I am lucky enough to catch this 0-reset in rx multicast:
> eth0: 41457665   76804   700070  0 47085 15586634   
> 87502300 0   3  0
> eth0: 41459860   76815   700070  0 47094 15588376   
> 87516300 0   3  0
> eth0: 41460577   76822   700070  0 0 15589083   
> 87521300 0   3  0
> eth0: 41463293   76838   700070  0 47108 15595872   
> 87538300 0   3  0
> eth0: 41463379   76839   700070  0 47116 15596138   
> 87539300 0   3  0
>
> Cc: Saeed Mahameed 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Martin KaFai Lau 
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 66c133757a5e..246786bb861b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -174,7 +174,7 @@ static void mlx5e_tx_timeout_work(struct work_struct 
> *work)
>
>  static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
>  {
> -   struct mlx5e_sw_stats *s = &priv->stats.sw;
> +   struct mlx5e_sw_stats temp, *s = &temp;
> struct mlx5e_rq_stats *rq_stats;
> struct mlx5e_sq_stats *sq_stats;
> u64 tx_offload_none = 0;
> @@ -229,12 +229,14 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv 
> *priv)
> s->link_down_events_phy = MLX5_GET(ppcnt_reg,
> priv->stats.pport.phy_counters,
> 
> counter_set.phys_layer_cntrs.link_down_events);
> +   memcpy(&priv->stats.sw, s, sizeof(*s));
>  }
>
>  static void mlx5e_update_vport_counters(struct mlx5e_priv *priv)
>  {
> +   struct mlx5e_vport_stats temp;
> int outlen = MLX5_ST_SZ_BYTES(query_vport_counter_out);
> -   u32 *out = (u32 *)priv->stats.vport.query_vport_out;
> +   u32 *out = (u32 *)temp.query_vport_out;
> u32 in[MLX5_ST_SZ_DW(query_vport_counter_in)] = {0};
> struct mlx5_core_dev *mdev = priv->mdev;
>
> @@ -245,6 +247,7 @@ static void mlx5e_update_vport_counters(struct mlx5e_priv 
> *priv)
>
> memset(out, 0, outlen);

Actually you don't need any temp here, it is safe to just remove this
redundant memset
and mlx5_cmd_exec will do the copy for you.

> mlx5_cmd_exec(mdev, in, sizeof(in), out, outlen);
> +   memcpy(priv->stats.vport.query_vport_out, out, outlen);
>  }

Anyway we still need a spin lock here, and also for all the counters
under priv->stats which are affected by this race as well.

If you want I can accept this as a temporary fix for net and Gal will
work on a spin lock based mechanism to fix the memcpy race for all the
counters.

>
>  static void mlx5e_update_pport_counters(struct mlx5e_priv *priv)
> --
> 2.9.3
>


Re: [PATCH net-next v4 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-20 Thread Jamal Hadi Salim

On 17-04-20 07:35 AM, Jiri Pirko wrote:

Thu, Apr 20, 2017 at 12:42:55PM CEST, j...@mojatatu.com wrote:

On 17-04-19 12:17 PM, Jiri Pirko wrote:


Ha. So the current code is wrong, I see it now. Following is needed:

diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 82b1d48..c432b22 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -1005,7 +1005,7 @@ static int tc_ctl_action(struct sk_buff *skb, struct 
nlmsghdr *n,
!netlink_capable(skb, CAP_NET_ADMIN))
return -EPERM;

-   ret = nlmsg_parse(n, sizeof(struct tcamsg), tca, TCA_ACT_MAX, NULL,
+   ret = nlmsg_parse(n, sizeof(struct tcamsg), tca, TCAA_MAX, NULL,
  extack);
if (ret < 0)
return ret;

You confused me by squashing the change into this patch.

Ok, so the name could be:
TCA_ACTS_*
?
I believe it is crucial to figure this out right now. TC UAPI is in a deep
mess already, no need to push it even more.



I could change that name. Look at the last series i sent, add any extra
comments to that  and i will get to it by tomorrow.


They are not the same issue Jiri. We have used bitmasks fine on netlink


Howcome they are not the same? I'm pretty certain they are.



They are not Jiri. I challenge you to point me to one netlink bitmask
representation used that has some bits that were not being used that
someone or some compiler is going to set some random values on.
Typically flags are u64/32/16




messages for millennia. Nobody sets garbage on a bitmask they are not
supposed to touch. The struct padding thing is a shame the way it
turned out - now netlink can no longer have a claim to be a (good)
wire protocol.
Other thing: I know you feel strongly about this but I don't agree that
everything has to be a TLV and in no way, as a matter of principle, you are
going to convince me  (even when the cows come home) that I have to
use 64 bits to carry a single bit just so I can use TLVs.


Then I guess we have to agree to disagree. I believe that your approach
is wrong and will break users in future when you add even a single flag.
Argument that "we are doing this for ages-therefore it is correct" has
simply 0 value.



"doing these for 80 years" is not the main driving point.
I appreciate that with data structures - even when fields are
annotated as "pads" - there is no guarantee that when I allocate one
the pads will be zeroed.

Bitmasks are atomic in nature and therefore used as a whole - not complex
like structs. That's the difference. I don't define a u32 bitmap that
has 8 bits that are "pads" because that concept doesn't make sense for
atomic fields.
This implies that 6 months down the road I don't regret it and say "shit, I
need 2 more bits for flags, but some idiot or compiler was setting
those fields, so I can't use them now because the kernel won't tell the
difference between what s/he is doing and real values".
Allocations of data structures I can understand, unless we enforce
always resetting all the struct members to zero as part of creation.

But lets agree to disagree if you are still not convinced.

On your "everything must be a TLV": the caveat is that there is a balance
between performance and flexibility. I should be able to pack aligned
complex data structures in a TLV for performance reasons.
One lesson is: in the future we can make these extra service headers, like
tcamsg, very small and always use all their fields instead of defining
pads.

cheers,
jamal


Re: [PATCH net-next v2 5/5] virtio-net: keep tx interrupts disabled unless kick

2017-04-20 Thread Willem de Bruijn
>> -   if (!use_napi)
>> +   if (use_napi) {
>> +   if (kick)
>> +   virtqueue_enable_cb_delayed(sq->vq);
>> +   else
>> +   virtqueue_disable_cb(sq->vq);
>
>
> Since virtqueue_disable_cb() does nothing for event idx, I wonder whether or
> not just calling enable_cb_delayed() is ok here.

Good point.

> Btw, it does not disable interrupt at all, I propose a patch in the past
> which can do more than this:
>
> https://patchwork.kernel.org/patch/6472601/

Interesting. Yes, let me evaluate that variant.

Thanks for reviewing,

  Willem


Re: __sk_buff.data_end

2017-04-20 Thread Daniel Borkmann

On 04/20/2017 08:01 AM, Johannes Berg wrote:

On Thu, 2017-04-20 at 02:01 +0200, Daniel Borkmann wrote:


Yeah, should work as well for the 32 bit archs, on 64 bit we
have this effectively already:


Right.

[...]


Can you elaborate on why this works for mac80211? It uses cb
only up to that point from where you invoke the prog?


No, it works because then I can move a u64 field to the same offset,
and save/restore it across the BPF call :)


Right.


But I don't have a *pointer* field to move there, and no space for the
alignment anyway (already using all 48 bytes).

Come to think of it - somebody had proposed extensions to this by
passing an on-stack pointer in addition to the data in the cb.

Perhaps we can extend BPF to have an optional second argument, and
track a second context around the verifier, if applicable? Then we can
solve all of this really easily, because it means we don't always have
to go from the SKB context but could go from the other one (which could
be that on-stack buffer).


I think this would be a rather more complex operation on the BPF side,
it would need changes from LLVM (which assumes initial ctx sits in r1),
verifier for tracking this ctx2, all the way down to JITs plus some way
to handle 1 and 2 argument program calls generically. Much easier to
pass additional meta data for the program via cb[], for example.


Alternatively I can clear another pointer (u64) in the CB, store a
pointer there, and always emit code following that pointer - should be
possible right?


What kind of pointer? If it's something like data_end as read-only, then
this needs to be tracked in the verifier in addition, of course. Other
option you could do (depending on what you want to achieve) is to have
a bpf_probe_read() version as a helper for your prog type that would
further walk that pointer/struct (similar to tracing) where this comes
w/o any backward compat guarantees, though.


[PATCH v3 net-next] mdio_bus: Issue GPIO RESET to PHYs.

2017-04-20 Thread Roger Quadros
Some boards [1] leave the PHYs in an invalid state
during system power-up or reset, thus causing unreliability
issues with the PHY, which manifest as the PHY not being detected
or the link not being functional. To fix this, these PHYs need to be RESET
via a GPIO connected to the PHY's RESET pin.

Some boards have a single GPIO controlling the PHY RESET pin of all
PHYs on the bus whereas some others have separate GPIOs controlling
individual PHY RESETs.

In both cases, the RESET de-assertion cannot be done in the PHY driver
as the PHY will not probe till its reset is de-asserted.
So do the RESET de-assertion in the MDIO bus driver.

[1] - am572x-idk, am571x-idk, a437x-idk

Signed-off-by: Roger Quadros 
---
v3:
- added more information in the DT binding document.

v2:
- add device tree binding document (mdio.txt)
- specify default reset delay in of_mdio.c instead of mdio_bus.c

 Documentation/devicetree/bindings/net/mdio.txt | 33 ++
 drivers/net/phy/mdio_bus.c | 22 +
 drivers/of/of_mdio.c   |  7 ++
 include/linux/phy.h|  5 
 4 files changed, 67 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/mdio.txt

diff --git a/Documentation/devicetree/bindings/net/mdio.txt 
b/Documentation/devicetree/bindings/net/mdio.txt
new file mode 100644
index 000..3517369
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/mdio.txt
@@ -0,0 +1,33 @@
+Common MDIO bus properties.
+
+These are generic properties that can apply to any MDIO bus.
+
+Optional properties:
+- reset-gpios: List of one or more GPIOs that control the RESET lines
+  of the PHYs on that MDIO bus.
+- reset-delay-us: RESET pulse width in microseconds as per PHY datasheet.
+
+A list of child nodes, one per device on the bus is expected. These
+should follow the generic phy.txt, or a device specific binding document.
+
+Example :
+This example shows these optional properties, plus other properties
+required for the TI Davinci MDIO driver.
+
+   davinci_mdio: ethernet@0x5c03 {
+   compatible = "ti,davinci_mdio";
+   reg = <0x5c03 0x1000>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   reset-gpios = <&gpio2 5 GPIO_ACTIVE_LOW>;
+   reset-delay-us = <2>;   /* PHY datasheet states 1uS min */
+
+   ethphy0: ethernet-phy@1 {
+   reg = <1>;
+   };
+
+   ethphy1: ethernet-phy@3 {
+   reg = <3>;
+   };
+   };
diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index fa7d51f..b353d99 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -22,8 +22,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -307,6 +310,7 @@ int __mdiobus_register(struct mii_bus *bus, struct module 
*owner)
 {
struct mdio_device *mdiodev;
int i, err;
+   struct gpio_desc *gpiod;
 
if (NULL == bus || NULL == bus->name ||
NULL == bus->read || NULL == bus->write)
@@ -333,6 +337,24 @@ int __mdiobus_register(struct mii_bus *bus, struct module 
*owner)
if (bus->reset)
bus->reset(bus);
 
+   /* de-assert bus level PHY GPIO resets */
+   for (i = 0; i < bus->num_reset_gpios; i++) {
+   gpiod = devm_gpiod_get_index(&bus->dev, "reset", i,
+GPIOD_OUT_LOW);
+   if (IS_ERR(gpiod)) {
+   err = PTR_ERR(gpiod);
+   if (err != -ENOENT) {
+   pr_err("mii_bus %s couldn't get reset GPIO\n",
+  bus->id);
+   return err;
+   }
+   } else {
+   gpiod_set_value_cansleep(gpiod, 1);
+   udelay(bus->reset_delay_us);
+   gpiod_set_value_cansleep(gpiod, 0);
+   }
+   }
+
for (i = 0; i < PHY_MAX_ADDR; i++) {
if ((bus->phy_mask & (1 << i)) == 0) {
struct phy_device *phydev;
diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
index 0b29798..7e4c80f 100644
--- a/drivers/of/of_mdio.c
+++ b/drivers/of/of_mdio.c
@@ -22,6 +22,8 @@
 #include 
 #include 
 
+#define DEFAULT_GPIO_RESET_DELAY   10  /* in microseconds */
+
 MODULE_AUTHOR("Grant Likely ");
 MODULE_LICENSE("GPL");
 
@@ -221,6 +223,11 @@ int of_mdiobus_register(struct mii_bus *mdio, struct 
device_node *np)
 
mdio->dev.of_node = np;
 
+   /* Get bus level PHY reset GPIO details */
+   mdio->reset_delay_us = DEFAULT_GPIO_RESET_DELAY;
+   of_property_read_u32(np, "reset-delay-us", &mdio->reset_delay_us);
+   mdio->num_reset_gpios = of_gpio_named_count(np, "reset-gpios");
+
/* Register the M

Re: [PATCH net v2] net/mlx5e: Fix race in mlx5e_sw_stats and mlx5e_vport_stats

2017-04-20 Thread Martin KaFai Lau
On Thu, Apr 20, 2017 at 05:00:13PM +0300, Saeed Mahameed wrote:
> On Thu, Apr 20, 2017 at 2:32 AM, Martin KaFai Lau  wrote:
> > We have observed a sudden spike in rx/tx_packets and rx/tx_bytes
> > reported under /proc/net/dev.  There is a race in mlx5e_update_stats()
> > and some of the get-stats functions (the one that we hit is the
> > mlx5e_get_stats() which is called by ndo_get_stats64()).
> >
> > In particular, the very first thing mlx5e_update_sw_counters()
> > does is 'memset(s, 0, sizeof(*s))'.  For example, if mlx5e_get_stats()
> > is unlucky at one point, rx_bytes and rx_packets could be 0.  One second
> > later, a normal (and much bigger than 0) value will be reported.
> >
> > This patch is to use a 'struct mlx5e_sw_stats temp' to avoid
> > a direct memset zero on priv->stats.sw.
> >
> > mlx5e_update_vport_counters() has a similar race.  Hence, addressed
> > together.
> >
> > I am lucky enough to catch this 0-reset in rx multicast:
> > eth0: 41457665   76804   700070  0 47085 15586634   
> > 87502300 0   3  0
> > eth0: 41459860   76815   700070  0 47094 15588376   
> > 87516300 0   3  0
> > eth0: 41460577   76822   700070  0 0 15589083   
> > 87521300 0   3  0
> > eth0: 41463293   76838   700070  0 47108 15595872   
> > 87538300 0   3  0
> > eth0: 41463379   76839   700070  0 47116 15596138   
> > 87539300 0   3  0
> >
> > Cc: Saeed Mahameed 
> > Suggested-by: Eric Dumazet 
> > Signed-off-by: Martin KaFai Lau 
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 +--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > index 66c133757a5e..246786bb861b 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > @@ -174,7 +174,7 @@ static void mlx5e_tx_timeout_work(struct work_struct 
> > *work)
> >
> >  static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
> >  {
> > -   struct mlx5e_sw_stats *s = &priv->stats.sw;
> > +   struct mlx5e_sw_stats temp, *s = &temp;
> > struct mlx5e_rq_stats *rq_stats;
> > struct mlx5e_sq_stats *sq_stats;
> > u64 tx_offload_none = 0;
> > @@ -229,12 +229,14 @@ static void mlx5e_update_sw_counters(struct 
> > mlx5e_priv *priv)
> > s->link_down_events_phy = MLX5_GET(ppcnt_reg,
> > priv->stats.pport.phy_counters,
> > 
> > counter_set.phys_layer_cntrs.link_down_events);
> > +   memcpy(&priv->stats.sw, s, sizeof(*s));
> >  }
> >
> >  static void mlx5e_update_vport_counters(struct mlx5e_priv *priv)
> >  {
> > +   struct mlx5e_vport_stats temp;
> > int outlen = MLX5_ST_SZ_BYTES(query_vport_counter_out);
> > -   u32 *out = (u32 *)priv->stats.vport.query_vport_out;
> > +   u32 *out = (u32 *)temp.query_vport_out;
> > u32 in[MLX5_ST_SZ_DW(query_vport_counter_in)] = {0};
> > struct mlx5_core_dev *mdev = priv->mdev;
> >
> > @@ -245,6 +247,7 @@ static void mlx5e_update_vport_counters(struct 
> > mlx5e_priv *priv)
> >
> > memset(out, 0, outlen);
>
> Actually you don't need any temp here, it is safe to just remove this
> redundant memset
> and mlx5_cmd_exec will do the copy for you.
>
> > mlx5_cmd_exec(mdev, in, sizeof(in), out, outlen);
> > +   memcpy(priv->stats.vport.query_vport_out, out, outlen);
> >  }
>
> Anyway we still need a spin lock here, and also for all the counters
> under priv->stats which are affected by this race as well.
>
> If you want I can accept this as a temporary  fix for net and Gal will
> work on a spin lock based mechanism to fix the memcpy race for all the
> counters.
A follow-up patch approach by Gal will be nice.


>
> >
> >  static void mlx5e_update_pport_counters(struct mlx5e_priv *priv)
> > --
> > 2.9.3
> >


Re: __sk_buff.data_end

2017-04-20 Thread Johannes Berg
On Thu, 2017-04-20 at 16:10 +0200, Daniel Borkmann wrote:
> 
> I think this would be a rather more complex operation on the BPF
> side, it would need changes from LLVM (which assumes initial ctx sits
> in r1), verifier for tracking this ctx2, all the way down to JITs
> plus some way to handle 1 and 2 argument program calls generically.
> Much easier to pass additional meta data for the program via cb[],
> for example.

Yeah, it did seem very complex :)

> > Alternatively I can clear another pointer (u64) in the CB, store a
> > pointer there, and always emit code following that pointer - should
> > be possible right?
> 
> What kind of pointer? If it's something like data_end as read-only,
> then this needs to be tracked in the verifier in addition, of course.
> Other option you could do (depending on what you want to achieve) is
> to have a bpf_probe_read() version as a helper for your prog type
> that would further walk that pointer/struct (similar to tracing)
> where this comes w/o any backward compat guarantees, though.

I meant something like this

struct wifi_cb {
struct wifi_data *wifi_data;
...
void *data_end; // with BUILD_BUG_ON to the right offset
};

Then struct wifi_data can contain extra data that doesn't fit into
wifi_cb, like the stuff I evicted for *data_end and *wifi_data. Let's
say one of those fields is "u64 boottime_ns;" (as I did in my patch
now), so we have

struct wifi_data {
u64 boottime_ns;
};

then I can still have

struct __wifi_sk_buff {
u32 len;
u32 data;
u32 data_end;
u32 boottime_ns; // this is strange but
 // seems to be done this way?
};

And then when boottime_ns is accessed, I can have:

case offsetof(struct __wifi_sk_buff, boottime_ns):
off  = si->off;
off -= offsetof(struct __wifi_sk_buff, boottime_ns);
off += offsetof(struct sk_buff, cb);
off += offsetof(struct wifi_cb, wifi_data);
*insn++ = BPF_LDX_MEM(BPF_SIZEOF(void *), si->dst_reg,
  si->src_reg, off);
off = offsetof(struct wifi_data, boottime_ns);
	*insn++ = BPF_LDX_MEM(BPF_SIZEOF(u64), si->dst_reg,
  si->src_reg, off);
break;

no?

It seems to me this should work, and essentially emit code to follow
the pointer to inside struct wifi_data. Assuming 

johannes


Re: [PATCH net-next v5 1/2] net sched actions: dump more than TCA_ACT_MAX_PRIO actions per batch

2017-04-20 Thread Jamal Hadi Salim

On 17-04-20 09:59 AM, Jiri Pirko wrote:

Thu, Apr 20, 2017 at 03:06:21PM CEST, j...@mojatatu.com wrote:

From: Jamal Hadi Salim 

When you dump hundreds of thousands of actions, getting only 32 per
dump batch even when the socket buffer and memory allocations allow
is inefficient.

With this change, the user will get as many as possibly fitting
within the given constraints available to the kernel.

A new top level TLV space is introduced. An attribute
TCAA_ACT_FLAGS is used to carry the flags indicating the user
is capable of processing these large dumps. Older user space which
doesn't set this flag doesn't get the large (than 32) batches.
The kernel uses the TCAA_ACT_COUNT attribute to tell the user how many
actions are put in a single batch. As such user space app knows how long
to iterate (independent of the type of action being dumped)
instead of hardcoded maximum of 32.

Some results dumping 1.5M actions, first unpatched tc which the
kernel doesn't help:

prompt$ time -p tc actions ls action gact | grep index | wc -l
150
real 1388.43
user 2.07
sys 1386.79

Now lets see a patched tc which sets the correct flags when requesting
a dump:

prompt$ time -p updatedtc actions ls action gact | grep index | wc -l
150
real 178.13
user 2.02
sys 176.96

Signed-off-by: Jamal Hadi Salim 
---
include/uapi/linux/rtnetlink.h | 21 +--
net/sched/act_api.c| 46 +-
2 files changed, 55 insertions(+), 12 deletions(-)

diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index cce0613..d7d28ec 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -674,10 +674,27 @@ struct tcamsg {
unsigned char   tca__pad1;
unsigned short  tca__pad2;
};
+
+enum {
+   TCAA_UNSPEC,


TCAA stands for "traffic control action action". I don't get it :(
Prefix still sounds wrong to me, sorry :/
Should be:
TCA_SOMETHING_*



+   TCAA_ACT_TAB,
+#define TCA_ACT_TAB TCAA_ACT_TAB
+   TCAA_ACT_FLAGS,
+   TCAA_ACT_COUNT,
+   __TCAA_MAX,
+#defineTCAA_MAX (__TCAA_MAX - 1)
+};
+
#define TA_RTA(r)  ((struct rtattr*)(((char*)(r)) + NLMSG_ALIGN(sizeof(struct tcamsg))))
#define TA_PAYLOAD(n) NLMSG_PAYLOAD(n,sizeof(struct tcamsg))
-#define TCA_ACT_TAB 1 /* attr type must be >=1 */   
-#define TCAA_MAX 1
+/* tcamsg flags stored in attribute TCAA_ACT_FLAGS
+ *
+ * ACT_LARGE_DUMP_ON: user->kernel request for more than TCA_ACT_MAX_PRIO
+ * actions in a dump. Every dump response then carries the number of actions
+ * dumped, for the user app's consumption, in TCAA_ACT_COUNT.
+ *
+ */
+#define ACT_LARGE_DUMP_ON  BIT(0)


Please put some prefix to the name, as I asked for in the previous
version.



Didn't mean to leave it out, but
I can't seem to find it. Do you recall what you said it should be?





/* New extended info filters for IFLA_EXT_MASK */
#define RTEXT_FILTER_VF (1 << 0)
diff --git a/net/sched/act_api.c b/net/sched/act_api.c
index 82b1d48..f85b8fd 100644
--- a/net/sched/act_api.c
+++ b/net/sched/act_api.c
@@ -83,6 +83,7 @@ static int tcf_dump_walker(struct tcf_hashinfo *hinfo, struct 
sk_buff *skb,
   struct netlink_callback *cb)
{
int err = 0, index = -1, i = 0, s_i = 0, n_i = 0;
+   unsigned short act_flags = cb->args[2];
struct nlattr *nest;

spin_lock_bh(&hinfo->lock);
@@ -111,14 +112,18 @@ static int tcf_dump_walker(struct tcf_hashinfo *hinfo, 
struct sk_buff *skb,
}
nla_nest_end(skb, nest);
n_i++;
-   if (n_i >= TCA_ACT_MAX_PRIO)
+   if (!(act_flags & ACT_LARGE_DUMP_ON) &&
+   n_i >= TCA_ACT_MAX_PRIO)
goto done;
}
}
done:
spin_unlock_bh(&hinfo->lock);
-   if (n_i)
+   if (n_i) {
cb->args[0] += n_i;
+   if (act_flags & ACT_LARGE_DUMP_ON)
+   cb->args[1] = n_i;
+   }
return n_i;

nla_put_failure:
@@ -993,11 +998,15 @@ static int tcf_action_add(struct net *net, struct nlattr 
*nla,
return tcf_add_notify(net, n, &actions, portid);
}

+static const struct nla_policy tcaa_policy[TCAA_MAX + 1] = {
+   [TCAA_ACT_FLAGS]  = { .type = NLA_U32 },
+};
+
static int tc_ctl_action(struct sk_buff *skb, struct nlmsghdr *n,
 struct netlink_ext_ack *extack)
{
struct net *net = sock_net(skb->sk);
-   struct nlattr *tca[TCA_ACT_MAX + 1];
+   struct nlattr *tca[TCAA_MAX + 1];
u32 portid = skb ? NETLINK_CB(skb).portid : 0;
int ret = 0, ovr = 0;

@@ -1005,7 +1014,7 @@ static int tc_ctl_action(struct sk_buff *skb, struct 
nlmsghdr *n,
!netlink_capable(skb, CAP_NET_ADMIN))
return -EPERM;

-   ret = nlmsg_parse(n, sizeof(struct tcamsg), tca,

Re: [PATCH net] ipv4: Avoid caching dsts when lookup skipped nh oif check

2017-04-20 Thread David Ahern
On 4/20/17 6:58 AM, Robert Shearman wrote:
> David reported that doing the following:
> 
> ip li add red type vrf table 10
> ip link set dev eth1 vrf red
> ip addr add 127.0.0.1/8 dev red
> ip link set dev eth1 up
> ip li set red up
> ping -c1 -w1 -I red 127.0.0.1
> ip li del red
> 
> results in a hang with this message:
> 
> unregister_netdevice: waiting for red to become free. Usage count = 1

I think you misunderstood my comment. The above works fine today. There
is no bug with refcnts.

It breaks with your patches wanting to use a VRF device with the main
table (254).


Re: [PATCH] nl80211: fix enumeration type

2017-04-20 Thread Stefan Agner
On 2017-04-20 00:22, Johannes Berg wrote:
> On Wed, 2017-04-19 at 23:55 -0700, Stefan Agner wrote:
>> Use type enum nl80211_rate_info for bitrate information. This fixes
>> a warning when compiling with clang:
>>   warning: implicit conversion from enumeration type 'enum
>> nl80211_rate_info'
>>   to different enumeration type 'enum nl80211_attrs'
> 
> I'm pretty sure I just applied exactly these two patches very recently.
> 
> What made *two* people suddenly see this?

There has been renewed work on compiling Linux using clang... Sorry for
the clash.

--
Stefan


Re: [PATCH net v2] net/mlx5e: Fix race in mlx5e_sw_stats and mlx5e_vport_stats

2017-04-20 Thread Saeed Mahameed
On Thu, Apr 20, 2017 at 5:15 PM, Martin KaFai Lau  wrote:
> On Thu, Apr 20, 2017 at 05:00:13PM +0300, Saeed Mahameed wrote:
>> On Thu, Apr 20, 2017 at 2:32 AM, Martin KaFai Lau  wrote:
>> > We have observed a sudden spike in rx/tx_packets and rx/tx_bytes
>> > reported under /proc/net/dev.  There is a race in mlx5e_update_stats()
>> > and some of the get-stats functions (the one that we hit is the
>> > mlx5e_get_stats() which is called by ndo_get_stats64()).
>> >
>> > In particular, the very first thing mlx5e_update_sw_counters()
>> > does is 'memset(s, 0, sizeof(*s))'.  For example, if mlx5e_get_stats()
>> > is unlucky at one point, rx_bytes and rx_packets could be 0.  One second
>> > later, a normal (and much bigger than 0) value will be reported.
>> >
>> > This patch is to use a 'struct mlx5e_sw_stats temp' to avoid
>> > a direct memset zero on priv->stats.sw.
>> >
>> > mlx5e_update_vport_counters() has a similar race.  Hence, addressed
>> > together.
>> >
>> > I am lucky enough to catch this 0-reset in rx multicast:
>> > eth0: 41457665   76804   700070  0 47085 15586634  
>> >  87502300 0   3  0
>> > eth0: 41459860   76815   700070  0 47094 15588376  
>> >  87516300 0   3  0
>> > eth0: 41460577   76822   700070  0 0 15589083  
>> >  87521300 0   3  0
>> > eth0: 41463293   76838   700070  0 47108 15595872  
>> >  87538300 0   3  0
>> > eth0: 41463379   76839   700070  0 47116 15596138  
>> >  87539300 0   3  0
>> >
>> > Cc: Saeed Mahameed 
>> > Suggested-by: Eric Dumazet 
>> > Signed-off-by: Martin KaFai Lau 
>> > ---
>> >  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 +--
>> >  1 file changed, 5 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
>> > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> > index 66c133757a5e..246786bb861b 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> > @@ -174,7 +174,7 @@ static void mlx5e_tx_timeout_work(struct work_struct 
>> > *work)
>> >
>> >  static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
>> >  {
>> > -   struct mlx5e_sw_stats *s = &priv->stats.sw;
>> > +   struct mlx5e_sw_stats temp, *s = &temp;
>> > struct mlx5e_rq_stats *rq_stats;
>> > struct mlx5e_sq_stats *sq_stats;
>> > u64 tx_offload_none = 0;
>> > @@ -229,12 +229,14 @@ static void mlx5e_update_sw_counters(struct 
>> > mlx5e_priv *priv)
>> > s->link_down_events_phy = MLX5_GET(ppcnt_reg,
>> > priv->stats.pport.phy_counters,
>> > 
>> > counter_set.phys_layer_cntrs.link_down_events);
>> > +   memcpy(&priv->stats.sw, s, sizeof(*s));
>> >  }
>> >
>> >  static void mlx5e_update_vport_counters(struct mlx5e_priv *priv)
>> >  {
>> > +   struct mlx5e_vport_stats temp;
>> > int outlen = MLX5_ST_SZ_BYTES(query_vport_counter_out);
>> > -   u32 *out = (u32 *)priv->stats.vport.query_vport_out;
>> > +   u32 *out = (u32 *)temp.query_vport_out;
>> > u32 in[MLX5_ST_SZ_DW(query_vport_counter_in)] = {0};
>> > struct mlx5_core_dev *mdev = priv->mdev;
>> >
>> > @@ -245,6 +247,7 @@ static void mlx5e_update_vport_counters(struct 
>> > mlx5e_priv *priv)
>> >
>> > memset(out, 0, outlen);
>>
>> Actually you don't need any temp here, it is safe to just remove this
>> redundant memset
>> and mlx5_cmd_exec will do the copy for you.
>>
>> > mlx5_cmd_exec(mdev, in, sizeof(in), out, outlen);
>> > +   memcpy(priv->stats.vport.query_vport_out, out, outlen);
>> >  }
>>
>> Anyway we still need a spin lock here, and also for all the counters
>> under priv->stats which are affected by this race as well.
>>
>> If you want I can accept this as a temporary  fix for net and Gal will
>> work on a spin lock based mechanism to fix the memcpy race for all the
>> counters.
> A follow-up patch approach by Gal will be nice.
>

OK, I will expect v3 with the removal of the memset from
mlx5e_update_vport_counters instead of the memcpy,
and Gal will work on the follow-up patch.


Re: [PATCH v4 1/3] VSOCK: Add vsockmon tap functions

2017-04-20 Thread Stefan Hajnoczi
On Thu, Apr 20, 2017 at 12:27:37PM +, Jorgen S. Hansen wrote:
> 
> > On Apr 13, 2017, at 6:18 PM, Stefan Hajnoczi  wrote:
> > 
> > +
> > +static void __vsock_deliver_tap(struct sk_buff *skb)
> > +{
> > +   int ret;
> > +   struct vsock_tap *tmp;
> > +
> > +   list_for_each_entry_rcu(tmp, &vsock_tap_all, list) {
> > +   ret = __vsock_deliver_tap_skb(skb, tmp->dev);
> > +   if (unlikely(ret))
> > +   break;
> > +   }
> > +
> > +   consume_skb(skb);
> 
> It looks like the caller will allocate the skb regardless of whether 
> vsock_tap_all is empty, so shouldn’t the skb be consumed in vsock_deliver_tap?

You are right.  Thanks!




Re: [PATCH v4 3/3] VSOCK: Add virtio vsock vsockmon hooks

2017-04-20 Thread Stefan Hajnoczi
On Thu, Apr 20, 2017 at 12:35:40PM +, Jorgen S. Hansen wrote:
> > 
> > +/* Packet capture */
> > +void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt)
> > +{
> > +   struct sk_buff *skb;
> > +   struct af_vsockmon_hdr *hdr;
> > +   unsigned char *t_hdr, *payload;
> > +
> > +   skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + pkt->len,
> > +   GFP_ATOMIC);
> 
> So with the current API, an skb will always be allocated, even if there are 
> no listeners? And we’ll copy the payload into the skb as well, if there is 
> any. Would it make sense to introduce a check here (exposed as part of the 
> vsock tap API), that skips all that in the case of no listeners? In the 
> common case, there won’t be any listeners, so it would save some cycles.

Good point, will fix.



