[PATCH net] ip6_tunnel: update mtu properly for ARPHRD_ETHER tunnel device in tx path

2017-09-27 Thread Xin Long
Currently, when updating the MTU in the tx path, the code does not consider
ARPHRD_ETHER tunnel devices, such as ip6gre_tap, for which the Ethernet
header must also be subtracted to get the correct MTU.

Signed-off-by: Xin Long 
---
 net/ipv6/ip6_tunnel.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index f2f21c2..a1c2444 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -1043,6 +1043,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
struct dst_entry *dst = NULL, *ndst = NULL;
struct net_device *tdev;
int mtu;
+   unsigned int eth_hlen = t->dev->type == ARPHRD_ETHER ? ETH_HLEN : 0;
unsigned int psh_hlen = sizeof(struct ipv6hdr) + t->encap_hlen;
unsigned int max_headroom = psh_hlen;
bool use_cache = false;
@@ -1124,7 +1125,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
 t->parms.name);
goto tx_err_dst_release;
}
-   mtu = dst_mtu(dst) - psh_hlen - t->tun_hlen;
+   mtu = dst_mtu(dst) - eth_hlen - psh_hlen - t->tun_hlen;
if (encap_limit >= 0) {
max_headroom += 8;
mtu -= 8;
@@ -1133,7 +1134,7 @@ int ip6_tnl_xmit(struct sk_buff *skb, struct net_device 
*dev, __u8 dsfield,
mtu = IPV6_MIN_MTU;
if (skb_dst(skb) && !t->parms.collect_md)
skb_dst(skb)->ops->update_pmtu(skb_dst(skb), NULL, skb, mtu);
-   if (skb->len - t->tun_hlen > mtu && !skb_is_gso(skb)) {
+   if (skb->len - t->tun_hlen - eth_hlen > mtu && !skb_is_gso(skb)) {
*pmtu = mtu;
err = -EMSGSIZE;
goto tx_err_dst_release;
-- 
2.1.0
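The MTU arithmetic this patch changes can be checked in isolation. The sketch below mirrors the fixed calculation with illustrative constants and a hypothetical function name (`tunnel_payload_mtu`); it is not the kernel's own code:

```c
#include <assert.h>

/* Illustrative constants (mirroring the kernel's values). */
#define ETH_HLEN     14   /* Ethernet header length */
#define IPV6_HDR_LEN 40   /* sizeof(struct ipv6hdr) */

/* Sketch of the fixed calculation in ip6_tnl_xmit(): for an
 * ARPHRD_ETHER tunnel device (e.g. ip6gre_tap), the Ethernet
 * header must be subtracted as well. */
static int tunnel_payload_mtu(int dst_mtu, int is_ether,
                              int encap_hlen, int tun_hlen)
{
    int eth_hlen = is_ether ? ETH_HLEN : 0;
    int psh_hlen = IPV6_HDR_LEN + encap_hlen;

    return dst_mtu - eth_hlen - psh_hlen - tun_hlen;
}
```

For a 1500-byte dst MTU and an 8-byte tunnel header, a non-Ethernet tunnel gets 1452 bytes of payload while an Ethernet (tap) tunnel must leave room for 14 more bytes, giving 1438.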



[PATCH net] ip6_gre: ip6gre_tap device should keep dst

2017-09-27 Thread Xin Long
The patch 'ip_gre: ipgre_tap device should keep dst' fixed
an issue where the ipgre_tap MTU could not be updated in the tx path.

The same fix is needed for ip6gre_tap as well.

Signed-off-by: Xin Long 
---
 net/ipv6/ip6_gre.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 20f66f4..1602b49 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1311,6 +1311,7 @@ static void ip6gre_tap_setup(struct net_device *dev)
dev->features |= NETIF_F_NETNS_LOCAL;
dev->priv_flags &= ~IFF_TX_SKB_SHARING;
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
+   netif_keep_dst(dev);
 }
 
 static bool ip6gre_netlink_encap_parms(struct nlattr *data[],
-- 
2.1.0



[PATCH net] ip_gre: ipgre_tap device should keep dst

2017-09-27 Thread Xin Long
Without keeping dst, the tunnel will not update any mtu/pmtu info,
since it does not have a dst on the skb.

Reproducer:
  client(ipgre_tap1 - eth1) <-> (eth1 - ipgre_tap1)server

After reducing eth1's MTU on the client, performance dropped to 0.

This patch calls netif_keep_dst() in gre_tap_init(), as ipgre does.

Reported-by: Jianlin Shi 
Signed-off-by: Xin Long 
---
 net/ipv4/ip_gre.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 0162fb9..8b837f6 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -1223,6 +1223,7 @@ static int gre_tap_init(struct net_device *dev)
 {
__gre_tunnel_init(dev);
dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
+   netif_keep_dst(dev);
 
return ip_tunnel_init(dev);
 }
-- 
2.1.0



Re: [PATCH net-next 0/3] support changing steering policies in tuntap

2017-09-27 Thread Tom Herbert
On Wed, Sep 27, 2017 at 4:25 PM, Willem de Bruijn
 wrote:
>>> In the future, both simple and sophisticated policy like RSS or other guest
>>> driven steering policies could be done on top.
>>
>> IMHO there should be a more practical example before adding all this
>> indirection. And it would be nice to understand why this queue selection
>> needs to be tun specific.
>
> I was thinking the same and this reminds me of the various strategies
> implemented in packet fanout. tun_cpu_select_queue is analogous to
> fanout_demux_cpu though it is tun-specific in that it requires tun->numqueues.
>
> Fanout accrued various strategies until it gained an eBPF variant. Just
> supporting BPF is probably sufficient here, too.

+1, in addition to packet fanout, we have SO_REUSEPORT with BPF, RPS,
RFS, etc. It would be nice if existing packet steering mechanisms
could be leveraged for tun.


Re: [PATCH v6 05/11] dt-bindings: net: dwmac-sun8i: update documentation about integrated PHY

2017-09-27 Thread Florian Fainelli


On 09/27/2017 12:34 AM, Corentin Labbe wrote:
> This patch add documentation about the MDIO switch used on sun8i-h3-emac
> for integrated PHY.
> 
> Signed-off-by: Corentin Labbe 
> ---
>  .../devicetree/bindings/net/dwmac-sun8i.txt| 138 
> +++--
>  1 file changed, 126 insertions(+), 12 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt 
> b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
> index 725f3b187886..e2ef4683df08 100644
> --- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
> +++ b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
> @@ -4,18 +4,18 @@ This device is a platform glue layer for stmmac.
>  Please see stmmac.txt for the other unchanged properties.
>  
>  Required properties:
> -- compatible: should be one of the following string:
> +- compatible: must be one of the following string:
>   "allwinner,sun8i-a83t-emac"
>   "allwinner,sun8i-h3-emac"
>   "allwinner,sun8i-v3s-emac"
>   "allwinner,sun50i-a64-emac"
>  - reg: address and length of the register for the device.
>  - interrupts: interrupt for the device
> -- interrupt-names: should be "macirq"
> +- interrupt-names: must be "macirq"
>  - clocks: A phandle to the reference clock for this device
> -- clock-names: should be "stmmaceth"
> +- clock-names: must be "stmmaceth"
>  - resets: A phandle to the reset control for this device
> -- reset-names: should be "stmmaceth"
> +- reset-names: must be "stmmaceth"
>  - phy-mode: See ethernet.txt
>  - phy-handle: See ethernet.txt
>  - #address-cells: shall be 1
> @@ -39,23 +39,38 @@ Optional properties for the following compatibles:
>  - allwinner,leds-active-low: EPHY LEDs are active low
>  
>  Required child node of emac:
> -- mdio bus node: should be named mdio
> +- mdio bus node: with compatible "snps,dwmac-mdio"
>  
>  Required properties of the mdio node:
>  - #address-cells: shall be 1
>  - #size-cells: shall be 0
>  
> -The device node referenced by "phy" or "phy-handle" should be a child node
> +The device node referenced by "phy" or "phy-handle" must be a child node
>  of the mdio node. See phy.txt for the generic PHY bindings.
>  
> -Required properties of the phy node with the following compatibles:
> +The following compatibles require that the mdio node have a mdio-mux child
> +node called "mdio-mux":
> +  - "allwinner,sun8i-h3-emac"
> +  - "allwinner,sun8i-v3s-emac":
> +Required properties for the mdio-mux node:
> +  - compatible = "mdio-mux"
> +  - one child mdio for the integrated mdio
> +  - one child mdio for the external mdio if present (V3s have none)
> +Required properties for the mdio-mux children node:
> +  - reg: 1 for internal MDIO bus, 2 for external MDIO bus
> +
> +The following compatibles require a PHY node representing the integrated
> +PHY, under the integrated MDIO bus node if an mdio-mux node is used:
>- "allwinner,sun8i-h3-emac",
>- "allwinner,sun8i-v3s-emac":
> +
> +Required properties of the integrated phy node:
>  - clocks: a phandle to the reference clock for the EPHY
>  - resets: a phandle to the reset control for the EPHY
> +- phy-is-integrated
> +- Must be a child of the integrated mdio
>  
> -Example:
> -
> +Example with integrated PHY:
>  emac: ethernet@1c0b000 {
>   compatible = "allwinner,sun8i-h3-emac";
>   syscon = <&syscon>;
> @@ -72,13 +87,112 @@ emac: ethernet@1c0b000 {
>   phy-handle = <_mii_phy>;
>   phy-mode = "mii";
>   allwinner,leds-active-low;
> +
> + mdio0: mdio {
> + #address-cells = <1>;
> + #size-cells = <0>;
> + compatible = "snps,dwmac-mdio";
> +
> + mdio-mux {
> + compatible = "mdio-mux";
> + #address-cells = <1>;
> + #size-cells = <0>;

Sorry for chiming in so late, but why don't we have the mdio-mux be the
root node here in the mdio bus hierarchy? I understand that with this
binding proposed here, we need to have patch 11 included (which btw,
should come before any DTS change), but this does not seem to accurately
model the HW.

The mux itself is not a child node of the MDIO bus controller, it does
not really belong in that address space although it does mangle the MDIO
bus controller address space between the two ends of the mux.

If this has been debated before, apologies for missing that part of the
discussion.

> +
> + int_mdio: mdio@1 {
> + reg = <1>;
> + #address-cells = <1>;
> + #size-cells = <0>;
> + int_mii_phy: ethernet-phy@1 {
> + reg = <1>;
> + clocks = <&ccu CLK_BUS_EPHY>;
> + resets = <&ccu RST_BUS_EPHY>;
> + phy-is-integrated;
> + };
> +

Re: [PATCH net-next] net: ipv4: remove fib_weight

2017-09-27 Thread David Miller
From: David Ahern 
Date: Wed, 27 Sep 2017 19:08:00 -0700

> fib_weight in fib_info is set but not used. Remove it and the
> helpers for setting it.
> 
> Signed-off-by: David Ahern 

Hmmm, I wonder then what Peter intended in commit
0e884c78ee19e902f300ed147083c28a0c6302f0 ("ipv4: L3 hash-based
multipath") because that's where this came from.


Re: [PATCH v6 11/11] of: mdio: Prevent of_mdiobus_register from scanning mdio-mux nodes

2017-09-27 Thread Florian Fainelli


On 09/27/2017 07:12 AM, Andrew Lunn wrote:
> On Wed, Sep 27, 2017 at 09:34:14AM +0200, Corentin Labbe wrote:
>> Each child node of an MDIO node is scanned as a PHY when calling
>> of_mdiobus_register() giving the following result:
>> [   18.175379] mdio_bus stmmac-0: /soc/ethernet@1c3/mdio/mdio-mux has 
>> invalid PHY address
>> [   18.175408] mdio_bus stmmac-0: scan phy mdio-mux at address 0
>> [   18.175450] mdio_bus stmmac-0: scan phy mdio-mux at address 1
>> [...]
>> [   18.176420] mdio_bus stmmac-0: scan phy mdio-mux at address 30
>> [   18.176452] mdio_bus stmmac-0: scan phy mdio-mux at address 31
>>
>> Since mdio-mux nodes are not PHYs, this patch adds a way to not scan
>> them.
> 
> Hi Corentin
> 
> I still don't like this, but ...

Me neither, even more so as I don't understand the reasoning behind
putting the mux as a child node of the MDIO bus controller in the first
place.

Also, you need to re-order patches such that this patch comes before the
DTS changes.

> 
>>
>> Signed-off-by: Corentin Labbe 
>> ---
>>  drivers/of/of_mdio.c | 10 ++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/of/of_mdio.c b/drivers/of/of_mdio.c
>> index d94dd8b77abd..d90ddb0d90f2 100644
>> --- a/drivers/of/of_mdio.c
>> +++ b/drivers/of/of_mdio.c
>> @@ -190,6 +190,10 @@ int of_mdiobus_register(struct mii_bus *mdio, struct 
>> device_node *np)
>>  struct device_node *child;
>>  bool scanphys = false;
>>  int addr, rc;
>> +static const struct of_device_id do_not_scan[] = {
>> +{ .compatible = "mdio-mux" },
>> +{}
>> +};
> 
> Please rename this to some less generic. What i don't want is other
> compatible strings added here. We want to make the exception for
> muxes, but nothing else. So something like compatible_muxes?
> 
>Andrew
> 

-- 
Florian


Re: [PATCH net] ipv6: fix net.ipv6.conf.all interface DAD handlers

2017-09-27 Thread Erik Kline
> Erik, please review.

I apologize for the delay. I see that you've already applied this, and
it's mostly LGTM except I have one thing I'm not seeing clearly.

The documentation for accept_dad now claims:

DAD operation and mode on a given interface will be selected according
to the maximum value of conf/{all,interface}/accept_dad.

but I'm trying to square this with my reading of the changes to
addrconf_dad_begin(). I think setting all.accept_dad to 0 but
ifname.accept_dad to non-0 still results in the short-circuit call to
addrconf_dad_completed().

Am I just not seeing (thinking) straight?




[pull request][net 00/11] Mellanox, mlx5 fixes 2017-09-28

2017-09-27 Thread Saeed Mahameed
Hi Dave,

This series provides misc fixes for the mlx5 driver.

Please pull and let me know if there's any problem.

for -stable:
  net/mlx5e: IPoIB, Fix access to invalid memory address (Kernels >= 4.12)

Thanks,
Saeed.

-- 

The following changes since commit c2cc187e53011c1c4931055984657da9085c763b:

  sctp: Fix a big endian bug in sctp_diag_dump() (2017-09-26 21:16:29 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git 
tags/mlx5-fixes-2017-09-28

for you to fetch changes up to 353f59f4d41e9c5798a15c5c52958f25b579a3d5:

  net/mlx5: Fix wrong indentation in enable SRIOV code (2017-09-28 07:23:10 
+0300)


mlx5-fixes-2017-09-28

Misc. fixes for mlx5 drivers.


Gal Pressman (3):
  net/mlx5e: Print netdev features correctly in error message
  net/mlx5e: Don't add/remove 802.1ad rules when changing 802.1Q VLAN filter
  net/mlx5e: Fix calculated checksum offloads counters

Inbar Karmy (1):
  net/mlx5: Fix FPGA capability location

Matan Barak (1):
  net/mlx5: Fix static checker warning on steering tracepoints code

Or Gerlitz (2):
  net/mlx5e: Disallow TC offloading of unsupported match/action combinations
  net/mlx5: Fix wrong indentation in enable SRIOV code

Paul Blakey (1):
  net/mlx5e: Fix erroneous freeing of encap header buffer

Raed Salem (1):
  net/mlx5: Check device capability for maximum flow counters

Roi Dayan (1):
  net/mlx5e: IPoIB, Fix access to invalid memory address

Vlad Buslov (1):
  net/mlx5e: Check encap entry state when offloading tunneled flows

 .../mellanox/mlx5/core/diag/fs_tracepoint.h|  4 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c|  4 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 13 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c|  3 +
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h |  6 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 91 --
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c|  1 +
 drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c |  4 +-
 drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.h |  2 +-
 .../net/ethernet/mellanox/mlx5/core/fpga/core.c|  3 +-
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c   |  8 ++
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h  | 11 +++
 .../net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c  |  3 +-
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c|  2 +-
 include/linux/mlx5/device.h|  5 +-
 include/linux/mlx5/driver.h|  1 +
 include/linux/mlx5/mlx5_ifc.h  |  3 +-
 17 files changed, 133 insertions(+), 31 deletions(-)


[net 04/11] net/mlx5e: Fix erroneous freeing of encap header buffer

2017-09-27 Thread Saeed Mahameed
From: Paul Blakey 

In case the neighbour for the tunnel destination isn't valid,
we send a neighbour update request but we also free the encap
header buffer. This is wrong, because we still need it for
allocating a HW encap entry once the neighbour is available.

Fix that by skipping the free if we are waiting for the neighbour.

Fixes: 232c001398ae ('net/mlx5e: Add support to neighbour update flow')
Signed-off-by: Paul Blakey 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index da503e6411da..4e2fc016bdd6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1564,7 +1564,7 @@ static int mlx5e_create_encap_header_ipv4(struct 
mlx5e_priv *priv,
break;
default:
err = -EOPNOTSUPP;
-   goto out;
+   goto free_encap;
}
fl4.flowi4_tos = tun_key->tos;
fl4.daddr = tun_key->u.ipv4.dst;
@@ -1573,7 +1573,7 @@ static int mlx5e_create_encap_header_ipv4(struct 
mlx5e_priv *priv,
	err = mlx5e_route_lookup_ipv4(priv, mirred_dev, &out_dev,
				      &fl4, &n, &ttl);
if (err)
-   goto out;
+   goto free_encap;
 
/* used by mlx5e_detach_encap to lookup a neigh hash table
 * entry in the neigh hash table when a user deletes a rule
@@ -1590,7 +1590,7 @@ static int mlx5e_create_encap_header_ipv4(struct 
mlx5e_priv *priv,
 */
err = mlx5e_rep_encap_entry_attach(netdev_priv(out_dev), e);
if (err)
-   goto out;
+   goto free_encap;
 
	read_lock_bh(&n->lock);
nud_state = n->nud_state;
@@ -1630,8 +1630,9 @@ static int mlx5e_create_encap_header_ipv4(struct 
mlx5e_priv *priv,
 
 destroy_neigh_entry:
mlx5e_rep_encap_entry_detach(netdev_priv(e->out_dev), e);
-out:
+free_encap:
kfree(encap_header);
+out:
if (n)
neigh_release(n);
return err;
@@ -1668,7 +1669,7 @@ static int mlx5e_create_encap_header_ipv6(struct 
mlx5e_priv *priv,
break;
default:
err = -EOPNOTSUPP;
-   goto out;
+   goto free_encap;
}
 
fl6.flowlabel = ip6_make_flowinfo(RT_TOS(tun_key->tos), tun_key->label);
@@ -1678,7 +1679,7 @@ static int mlx5e_create_encap_header_ipv6(struct 
mlx5e_priv *priv,
	err = mlx5e_route_lookup_ipv6(priv, mirred_dev, &out_dev,
				      &fl6, &n, &ttl);
if (err)
-   goto out;
+   goto free_encap;
 
/* used by mlx5e_detach_encap to lookup a neigh hash table
 * entry in the neigh hash table when a user deletes a rule
@@ -1695,7 +1696,7 @@ static int mlx5e_create_encap_header_ipv6(struct 
mlx5e_priv *priv,
 */
err = mlx5e_rep_encap_entry_attach(netdev_priv(out_dev), e);
if (err)
-   goto out;
+   goto free_encap;
 
	read_lock_bh(&n->lock);
nud_state = n->nud_state;
@@ -1736,8 +1737,9 @@ static int mlx5e_create_encap_header_ipv6(struct 
mlx5e_priv *priv,
 
 destroy_neigh_entry:
mlx5e_rep_encap_entry_detach(netdev_priv(e->out_dev), e);
-out:
+free_encap:
kfree(encap_header);
+out:
if (n)
neigh_release(n);
return err;
-- 
2.13.0
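The fix above relies on the common kernel pattern of ordered error labels, where each label releases only what must be released on that particular failure path. A minimal standalone sketch with hypothetical names (`create_encap`, `free_encap`, `out`), simplifying the wait-for-neighbour case to "keep the buffer":

```c
#include <stdlib.h>
#include <string.h>

/* Two-label error path: jump to free_encap when the header buffer
 * is no longer needed, and to out when it must be kept for a later
 * retry (as when waiting for a neighbour update). */
static int create_encap(int route_ok, int keep_for_neigh_update,
                        char **saved_header)
{
    int err = 0;
    char *encap_header = malloc(64);

    if (!encap_header)
        return -1;
    memset(encap_header, 0, 64);

    if (!route_ok) {
        err = -1;
        goto free_encap;   /* hard failure: buffer not needed anymore */
    }

    if (keep_for_neigh_update) {
        *saved_header = encap_header;   /* buffer kept for later use */
        goto out;
    }

    *saved_header = encap_header;
    return 0;

free_encap:
    free(encap_header);
out:
    return err;
}
```

The bug the patch fixes corresponds to jumping to `free_encap` on the keep-for-later path, which would free a buffer that is still needed.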



[net 01/11] net/mlx5e: IPoIB, Fix access to invalid memory address

2017-09-27 Thread Saeed Mahameed
From: Roi Dayan 

When cleaning rdma netdevice we need to save the mdev pointer
because priv is released when we release netdev.

This bug was found using the kernel address sanitizer (KASAN).
use-after-free in mlx5_rdma_netdev_free+0xe3/0x100 [mlx5_core]

Fixes: 48935bbb7ae8 ("net/mlx5e: IPoIB, Add netdevice profile skeleton")
Signed-off-by: Roi Dayan 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c 
b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
index 85298051a3e4..145e392ab849 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/ipoib/ipoib.c
@@ -572,12 +572,13 @@ void mlx5_rdma_netdev_free(struct net_device *netdev)
 {
struct mlx5e_priv  *priv= mlx5i_epriv(netdev);
const struct mlx5e_profile *profile = priv->profile;
+   struct mlx5_core_dev   *mdev= priv->mdev;
 
mlx5e_detach_netdev(priv);
profile->cleanup(priv);
destroy_workqueue(priv->wq);
free_netdev(netdev);
 
-   mlx5e_destroy_mdev_resources(priv->mdev);
+   mlx5e_destroy_mdev_resources(mdev);
 }
 EXPORT_SYMBOL(mlx5_rdma_netdev_free);
-- 
2.13.0
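The save-before-free pattern applied in the hunk above can be sketched generically. The struct and function names below (`priv`, `mdev`, `netdev_free`) are hypothetical stand-ins for the mlx5 types, illustrating why the inner pointer must be copied out before its container is freed:

```c
#include <stdlib.h>

/* priv lives inside the netdev allocation, so priv->mdev must be
 * saved before the netdev (and with it priv) is freed. */
struct mdev { int cleaned; };
struct priv { struct mdev *mdev; };

static void netdev_free(struct priv **netdev)
{
    struct priv *priv = *netdev;
    struct mdev *mdev = priv->mdev;   /* save before free */

    free(priv);                       /* priv is now invalid */
    *netdev = NULL;

    mdev->cleaned = 1;                /* safe: uses the saved pointer */
}
```

Dereferencing `priv->mdev` after the `free()` instead is exactly the use-after-free that KASAN flagged.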



[net 05/11] net/mlx5e: Disallow TC offloading of unsupported match/action combinations

2017-09-27 Thread Saeed Mahameed
From: Or Gerlitz 

When offloading header re-writes, the HW may need to adjust checksums along
the packet. For IP traffic, in a case where we are asked to modify fields in
the IP header, current HW supports that only for TCP and UDP. Enforce this:
fail the offloading attempt for non-TCP/UDP packets.

Fixes: d7e75a325cb2 ('net/mlx5e: Add offloading of E-Switch TC pedit (header 
re-write) actions')
Fixes: 2f4fe4cab073 ('net/mlx5e: Add offloading of NIC TC pedit (header 
re-write) actions')
Signed-off-by: Or Gerlitz 
Reviewed-by: Paul Blakey 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 70 +
 1 file changed, 70 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 4e2fc016bdd6..d3786005fba7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1317,6 +1317,69 @@ static bool csum_offload_supported(struct mlx5e_priv 
*priv, u32 action, u32 upda
return true;
 }
 
+static bool modify_header_match_supported(struct mlx5_flow_spec *spec,
+ struct tcf_exts *exts)
+{
+   const struct tc_action *a;
+   bool modify_ip_header;
+   LIST_HEAD(actions);
+   u8 htype, ip_proto;
+   void *headers_v;
+   u16 ethertype;
+   int nkeys, i;
+
+   headers_v = MLX5_ADDR_OF(fte_match_param, spec->match_value, 
outer_headers);
+   ethertype = MLX5_GET(fte_match_set_lyr_2_4, headers_v, ethertype);
+
+   /* for non-IP we only re-write MACs, so we're okay */
+   if (ethertype != ETH_P_IP && ethertype != ETH_P_IPV6)
+   goto out_ok;
+
+   modify_ip_header = false;
+   tcf_exts_to_list(exts, &actions);
+   list_for_each_entry(a, , list) {
+   if (!is_tcf_pedit(a))
+   continue;
+
+   nkeys = tcf_pedit_nkeys(a);
+   for (i = 0; i < nkeys; i++) {
+   htype = tcf_pedit_htype(a, i);
+   if (htype == TCA_PEDIT_KEY_EX_HDR_TYPE_IP4 ||
+   htype == TCA_PEDIT_KEY_EX_HDR_TYPE_IP6) {
+   modify_ip_header = true;
+   break;
+   }
+   }
+   }
+
+   ip_proto = MLX5_GET(fte_match_set_lyr_2_4, headers_v, ip_protocol);
+   if (modify_ip_header && ip_proto != IPPROTO_TCP && ip_proto != 
IPPROTO_UDP) {
+   pr_info("can't offload re-write of ip proto %d\n", ip_proto);
+   return false;
+   }
+
+out_ok:
+   return true;
+}
+
+static bool actions_match_supported(struct mlx5e_priv *priv,
+   struct tcf_exts *exts,
+   struct mlx5e_tc_flow_parse_attr *parse_attr,
+   struct mlx5e_tc_flow *flow)
+{
+   u32 actions;
+
+   if (flow->flags & MLX5E_TC_FLOW_ESWITCH)
+   actions = flow->esw_attr->action;
+   else
+   actions = flow->nic_attr->action;
+
+   if (actions & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
+   return modify_header_match_supported(&parse_attr->spec, exts);
+
+   return true;
+}
+
 static int parse_tc_nic_actions(struct mlx5e_priv *priv, struct tcf_exts *exts,
struct mlx5e_tc_flow_parse_attr *parse_attr,
struct mlx5e_tc_flow *flow)
@@ -1378,6 +1441,9 @@ static int parse_tc_nic_actions(struct mlx5e_priv *priv, 
struct tcf_exts *exts,
return -EINVAL;
}
 
+   if (!actions_match_supported(priv, exts, parse_attr, flow))
+   return -EOPNOTSUPP;
+
return 0;
 }
 
@@ -1936,6 +2002,10 @@ static int parse_tc_fdb_actions(struct mlx5e_priv *priv, 
struct tcf_exts *exts,
 
return -EINVAL;
}
+
+   if (!actions_match_supported(priv, exts, parse_attr, flow))
+   return -EOPNOTSUPP;
+
return err;
 }
 
-- 
2.13.0



[net 03/11] net/mlx5: Check device capability for maximum flow counters

2017-09-27 Thread Saeed Mahameed
From: Raed Salem 

Add a check for the maximal number of flow counters attached
to a rule (FTE).

Fixes: bd5251dbf156b ('net/mlx5_core: Introduce flow steering destination of 
type counter')
Signed-off-by: Raed Salem 
Reviewed-by: Maor Gottlieb 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c  |  8 
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.h | 11 +++
 include/linux/mlx5/mlx5_ifc.h |  3 ++-
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
index e0d0efd903bc..36ecc2b2e187 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
@@ -293,6 +293,9 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
}
 
if (fte->action & MLX5_FLOW_CONTEXT_ACTION_COUNT) {
+   int max_list_size = BIT(MLX5_CAP_FLOWTABLE_TYPE(dev,
+   log_max_flow_counter,
+   ft->type));
int list_size = 0;
 
list_for_each_entry(dst, &fte->node.children, node.list) {
@@ -305,12 +308,17 @@ static int mlx5_cmd_set_fte(struct mlx5_core_dev *dev,
in_dests += MLX5_ST_SZ_BYTES(dest_format_struct);
list_size++;
}
+   if (list_size > max_list_size) {
+   err = -EINVAL;
+   goto err_out;
+   }
 
MLX5_SET(flow_context, in_flow_context, flow_counter_list_size,
 list_size);
}
 
err = mlx5_cmd_exec(dev, in, inlen, out, sizeof(out));
+err_out:
kvfree(in);
return err;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h 
b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
index 5509a752f98e..48dd78975062 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_core.h
@@ -52,6 +52,7 @@ enum fs_flow_table_type {
FS_FT_FDB = 0X4,
FS_FT_SNIFFER_RX= 0X5,
FS_FT_SNIFFER_TX= 0X6,
+   FS_FT_MAX_TYPE = FS_FT_SNIFFER_TX,
 };
 
 enum fs_flow_table_op_mod {
@@ -260,4 +261,14 @@ void mlx5_cleanup_fs(struct mlx5_core_dev *dev);
 #define fs_for_each_dst(pos, fte)  \
fs_list_for_each_entry(pos, &(fte)->node.children)
 
+#define MLX5_CAP_FLOWTABLE_TYPE(mdev, cap, type) ( \
+   (type == FS_FT_NIC_RX) ? MLX5_CAP_FLOWTABLE_NIC_RX(mdev, cap) : 
\
+   (type == FS_FT_ESW_EGRESS_ACL) ? MLX5_CAP_ESW_EGRESS_ACL(mdev, cap) :   
\
+   (type == FS_FT_ESW_INGRESS_ACL) ? MLX5_CAP_ESW_INGRESS_ACL(mdev, cap) : 
\
+   (type == FS_FT_FDB) ? MLX5_CAP_ESW_FLOWTABLE_FDB(mdev, cap) :   
\
+   (type == FS_FT_SNIFFER_RX) ? MLX5_CAP_FLOWTABLE_SNIFFER_RX(mdev, cap) : 
\
+   (type == FS_FT_SNIFFER_TX) ? MLX5_CAP_FLOWTABLE_SNIFFER_TX(mdev, cap) : 
\
+   (BUILD_BUG_ON_ZERO(FS_FT_SNIFFER_TX != FS_FT_MAX_TYPE))\
+   )
+
 #endif
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index a528b35a022e..69772347f866 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -327,7 +327,8 @@ struct mlx5_ifc_flow_table_prop_layout_bits {
u8 reserved_at_80[0x18];
u8 log_max_destination[0x8];
 
-   u8 reserved_at_a0[0x18];
+   u8 log_max_flow_counter[0x8];
+   u8 reserved_at_a8[0x10];
u8 log_max_flow[0x8];
 
u8 reserved_at_c0[0x40];
-- 
2.13.0
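The added check bounds the counter list by a capability expressed as a log2 value. A small sketch of the same comparison, with an illustrative function name and -EINVAL spelled out as its numeric value:

```c
/* Sketch of the capability check added in mlx5_cmd_set_fte():
 * a rule may carry at most 2^log_max_flow_counter counter
 * destinations for its flow table type. */
static int check_counter_list(int list_size, int log_max_flow_counter)
{
    int max_list_size = 1 << log_max_flow_counter;

    return list_size > max_list_size ? -22 /* -EINVAL */ : 0;
}
```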



[net 11/11] net/mlx5: Fix wrong indentation in enable SRIOV code

2017-09-27 Thread Saeed Mahameed
From: Or Gerlitz 

Smatch is screaming:

drivers/net/ethernet/mellanox/mlx5/core/sriov.c:112
mlx5_device_enable_sriov() warn: inconsistent indenting

fix that.

Fixes: 7ecf6d8ff154 ('IB/mlx5: Restore IB guid/policy for virtual functions')
Signed-off-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/sriov.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c 
b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
index 6c48e9959b65..2a8b529ce6dd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/sriov.c
@@ -109,7 +109,7 @@ static int mlx5_device_enable_sriov(struct mlx5_core_dev 
*dev, int num_vfs)
mlx5_core_warn(dev,
   "failed to restore VF %d 
settings, err %d\n",
   vf, err);
-   continue;
+   continue;
}
}
mlx5_core_dbg(dev, "successfully enabled VF* %d\n", vf);
-- 
2.13.0



[net 09/11] net/mlx5e: Fix calculated checksum offloads counters

2017-09-27 Thread Saeed Mahameed
From: Gal Pressman 

Instead of calculating the offloads counters, count them explicitly.
The calculations done for these counters would result in bugs in some
cases, for example:
When running TCP traffic over a VXLAN tunnel with TSO enabled the following
counters would increase:
   tx_csum_partial: 1,333,284
   tx_csum_partial_inner: 29,286
   tx4_csum_partial_inner: 384
   tx7_csum_partial_inner: 8
   tx9_csum_partial_inner: 34
   tx10_csum_partial_inner: 26,807
   tx11_csum_partial_inner: 287
   tx12_csum_partial_inner: 27
   tx16_csum_partial_inner: 6
   tx25_csum_partial_inner: 1,733

Seems like tx_csum_partial increased out of nowhere.
The issue is in the following calculation in mlx5e_update_sw_counters:
s->tx_csum_partial = s->tx_packets - tx_offload_none - s->tx_csum_partial_inner;

While tx_packets increases by the number of GSO segments for each SKB,
tx_csum_partial_inner will only increase by one, resulting in wrong
tx_csum_partial counter.
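The over-count can be reproduced with a toy model. The names below are illustrative, not the driver's real statistics structures; the point is that per-SKB counters and per-segment counters cannot be combined by subtraction:

```c
/* Toy model of the bug: tx_packets grows by the number of GSO
 * segments per SKB, while csum_partial_inner grows by one per SKB,
 * so deriving tx_csum_partial by subtraction over-counts. */
struct stats { long tx_packets, csum_none, csum_partial_inner, csum_partial; };

/* One tunneled TSO SKB: inner checksum offload, many wire segments. */
static void send_tunneled_tso(struct stats *s, int gso_segs)
{
    s->tx_packets += gso_segs;
    s->csum_partial_inner += 1;   /* counted once per SKB */
    s->csum_partial += 1;         /* explicit counting (the fix) */
}

static long derived_csum_partial(const struct stats *s)
{
    /* the calculation this patch removes */
    return s->tx_packets - s->csum_none - s->csum_partial_inner;
}
```

For one 10-segment tunneled TSO SKB, the explicit counter reads 1 while the derived value reads 9, which is the "out of nowhere" growth seen above.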

Fixes: bfe6d8d1d433 ("net/mlx5e: Reorganize ethtool statistics")
Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 9 +++--
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c| 3 +++
 drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 6 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c| 1 +
 4 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 84b013dc62e9..cc11bbbd0309 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -184,7 +184,6 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv 
*priv)
struct mlx5e_sw_stats temp, *s = &temp;
struct mlx5e_rq_stats *rq_stats;
struct mlx5e_sq_stats *sq_stats;
-   u64 tx_offload_none = 0;
int i, j;
 
memset(s, 0, sizeof(*s));
@@ -199,6 +198,7 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv 
*priv)
s->rx_lro_bytes += rq_stats->lro_bytes;
s->rx_csum_none += rq_stats->csum_none;
s->rx_csum_complete += rq_stats->csum_complete;
+   s->rx_csum_unnecessary += rq_stats->csum_unnecessary;
s->rx_csum_unnecessary_inner += 
rq_stats->csum_unnecessary_inner;
s->rx_xdp_drop += rq_stats->xdp_drop;
s->rx_xdp_tx += rq_stats->xdp_tx;
@@ -229,14 +229,11 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv 
*priv)
s->tx_queue_dropped += sq_stats->dropped;
s->tx_xmit_more += sq_stats->xmit_more;
s->tx_csum_partial_inner += 
sq_stats->csum_partial_inner;
-   tx_offload_none += sq_stats->csum_none;
+   s->tx_csum_none += sq_stats->csum_none;
+   s->tx_csum_partial  += sq_stats->csum_partial;
}
}
 
-   /* Update calculated offload counters */
-   s->tx_csum_partial = s->tx_packets - tx_offload_none - 
s->tx_csum_partial_inner;
-   s->rx_csum_unnecessary = s->rx_packets - s->rx_csum_none - 
s->rx_csum_complete;
-
s->link_down_events_phy = MLX5_GET(ppcnt_reg,
priv->stats.pport.phy_counters,
counter_set.phys_layer_cntrs.link_down_events);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index f1dd638384d3..15a1687483cc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -627,6 +627,7 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
 
if (lro) {
skb->ip_summed = CHECKSUM_UNNECESSARY;
+   rq->stats.csum_unnecessary++;
return;
}
 
@@ -644,7 +645,9 @@ static inline void mlx5e_handle_csum(struct net_device 
*netdev,
skb->csum_level = 1;
skb->encapsulation = 1;
rq->stats.csum_unnecessary_inner++;
+   return;
}
+   rq->stats.csum_unnecessary++;
return;
}
 csum_none:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h 
b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 6d199ffb1c0b..f8637213afc0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -68,6 +68,7 @@ struct mlx5e_sw_stats {
u64 rx_xdp_drop;
u64 rx_xdp_tx;
u64 rx_xdp_tx_full;
+   u64 tx_csum_none;
u64 tx_csum_partial;
u64 tx_csum_partial_inner;
u64 tx_queue_stopped;
@@ -108,6 +109,7 @@ static const struct 

[net 10/11] net/mlx5: Fix static checker warning on steering tracepoints code

2017-09-27 Thread Saeed Mahameed
From: Matan Barak 

Fix this sparse complaint:

drivers/net/ethernet/mellanox/mlx5/core/./diag/fs_tracepoint.h:172:1:
warning: odd constant _Bool cast ( becomes 1)

Fixes: d9fea79171ee ('net/mlx5: Add tracepoints')
Signed-off-by: Matan Barak 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h 
b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
index 1e3a6c3e4132..80eef4163f52 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/diag/fs_tracepoint.h
@@ -139,7 +139,7 @@ TRACE_EVENT(mlx5_fs_del_fg,
{MLX5_FLOW_CONTEXT_ACTION_FWD_NEXT_PRIO, "NEXT_PRIO"}
 
 TRACE_EVENT(mlx5_fs_set_fte,
-   TP_PROTO(const struct fs_fte *fte, bool new_fte),
+   TP_PROTO(const struct fs_fte *fte, int new_fte),
TP_ARGS(fte, new_fte),
TP_STRUCT__entry(
__field(const struct fs_fte *, fte)
@@ -149,7 +149,7 @@ TRACE_EVENT(mlx5_fs_set_fte,
__field(u32, action)
__field(u32, flow_tag)
__field(u8,  mask_enable)
-   __field(bool, new_fte)
+   __field(int, new_fte)
__array(u32, mask_outer, MLX5_ST_SZ_DW(fte_match_set_lyr_2_4))
__array(u32, mask_inner, MLX5_ST_SZ_DW(fte_match_set_lyr_2_4))
__array(u32, mask_misc, MLX5_ST_SZ_DW(fte_match_set_misc))
-- 
2.13.0



[net 02/11] net/mlx5: Fix FPGA capability location

2017-09-27 Thread Saeed Mahameed
From: Inbar Karmy 

Currently, FPGA capability is located in (mdev)->caps.hca_cur,
change the location to be (mdev)->caps.fpga,
since hca_cur is reserved for HCA device capabilities.

Fixes: e29341fb3a5b ("net/mlx5: FPGA, Add basic support for Innova")
Signed-off-by: Inbar Karmy 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c  | 4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.h  | 2 +-
 drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c | 3 +--
 include/linux/mlx5/device.h | 5 ++---
 include/linux/mlx5/driver.h | 1 +
 5 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c
index e37453d838db..c0fd2212e890 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.c
@@ -71,11 +71,11 @@ int mlx5_fpga_access_reg(struct mlx5_core_dev *dev, u8 size, u64 addr,
return 0;
 }
 
-int mlx5_fpga_caps(struct mlx5_core_dev *dev, u32 *caps)
+int mlx5_fpga_caps(struct mlx5_core_dev *dev)
 {
u32 in[MLX5_ST_SZ_DW(fpga_cap)] = {0};
 
-   return mlx5_core_access_reg(dev, in, sizeof(in), caps,
+   return mlx5_core_access_reg(dev, in, sizeof(in), dev->caps.fpga,
MLX5_ST_SZ_BYTES(fpga_cap),
MLX5_REG_FPGA_CAP, 0, 0);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.h b/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.h
index 94bdfd47c3f0..d05233c9b4f6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/cmd.h
@@ -65,7 +65,7 @@ struct mlx5_fpga_qp_counters {
u64 rx_total_drop;
 };
 
-int mlx5_fpga_caps(struct mlx5_core_dev *dev, u32 *caps);
+int mlx5_fpga_caps(struct mlx5_core_dev *dev);
 int mlx5_fpga_query(struct mlx5_core_dev *dev, struct mlx5_fpga_query *query);
 int mlx5_fpga_ctrl_op(struct mlx5_core_dev *dev, u8 op);
 int mlx5_fpga_access_reg(struct mlx5_core_dev *dev, u8 size, u64 addr,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c b/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c
index 9034e9960a76..dc8970346521 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/fpga/core.c
@@ -139,8 +139,7 @@ int mlx5_fpga_device_start(struct mlx5_core_dev *mdev)
if (err)
goto out;
 
-   err = mlx5_fpga_caps(fdev->mdev,
-fdev->mdev->caps.hca_cur[MLX5_CAP_FPGA]);
+   err = mlx5_fpga_caps(fdev->mdev);
if (err)
goto out;
 
diff --git a/include/linux/mlx5/device.h b/include/linux/mlx5/device.h
index eaf4ad209c8f..e32dbc4934db 100644
--- a/include/linux/mlx5/device.h
+++ b/include/linux/mlx5/device.h
@@ -980,7 +980,6 @@ enum mlx5_cap_type {
MLX5_CAP_RESERVED,
MLX5_CAP_VECTOR_CALC,
MLX5_CAP_QOS,
-   MLX5_CAP_FPGA,
/* NUM OF CAP Types */
MLX5_CAP_NUM
 };
@@ -1110,10 +1109,10 @@ enum mlx5_mcam_feature_groups {
MLX5_GET(mcam_reg, (mdev)->caps.mcam, mng_feature_cap_mask.enhanced_features.fld)
 
 #define MLX5_CAP_FPGA(mdev, cap) \
-   MLX5_GET(fpga_cap, (mdev)->caps.hca_cur[MLX5_CAP_FPGA], cap)
+   MLX5_GET(fpga_cap, (mdev)->caps.fpga, cap)
 
 #define MLX5_CAP64_FPGA(mdev, cap) \
-   MLX5_GET64(fpga_cap, (mdev)->caps.hca_cur[MLX5_CAP_FPGA], cap)
+   MLX5_GET64(fpga_cap, (mdev)->caps.fpga, cap)
 
 enum {
MLX5_CMD_STAT_OK= 0x0,
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 02ff700e4f30..401c8972cc3a 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -774,6 +774,7 @@ struct mlx5_core_dev {
u32 hca_max[MLX5_CAP_NUM][MLX5_UN_SZ_DW(hca_cap_union)];
u32 pcam[MLX5_ST_SZ_DW(pcam_reg)];
u32 mcam[MLX5_ST_SZ_DW(mcam_reg)];
+   u32 fpga[MLX5_ST_SZ_DW(fpga_cap)];
} caps;
phys_addr_t iseg_base;
struct mlx5_init_seg __iomem *iseg;
-- 
2.13.0



[net 06/11] net/mlx5e: Check encap entry state when offloading tunneled flows

2017-09-27 Thread Saeed Mahameed
From: Vlad Buslov 

Encap entries cached by the driver could be invalidated due to
tunnel destination neighbour state changes.
When attempting to offload a flow that uses a cached encap entry,
we must check the entry validity and defer the offloading
if the entry exists but is not valid.

When EAGAIN is returned, the flow offloading to hardware takes place
by the neigh update code when the tunnel destination neighbour
becomes connected.

Fixes: 232c001398ae ("net/mlx5e: Add support to neighbour update flow")
Signed-off-by: Vlad Buslov 
Reviewed-by: Or Gerlitz 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index d3786005fba7..1aa2028ed995 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1859,6 +1859,7 @@ static int mlx5e_attach_encap(struct mlx5e_priv *priv,
}
}
 
+   /* must verify if encap is valid or not */
if (found)
goto attach_flow;
 
@@ -1885,6 +1886,8 @@ static int mlx5e_attach_encap(struct mlx5e_priv *priv,
*encap_dev = e->out_dev;
if (e->flags & MLX5_ENCAP_ENTRY_VALID)
attr->encap_id = e->encap_id;
+   else
+   err = -EAGAIN;
 
return err;
 
-- 
2.13.0



[net 07/11] net/mlx5e: Print netdev features correctly in error message

2017-09-27 Thread Saeed Mahameed
From: Gal Pressman 

Use the correct formatting for netdev features.

Fixes: 0e405443e803 ("net/mlx5e: Improve set features ndo resiliency")
Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index dfc29720ab77..84b013dc62e9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -,8 +,8 @@ static int mlx5e_handle_feature(struct net_device *netdev,
 
err = feature_handler(netdev, enable);
if (err) {
-   netdev_err(netdev, "%s feature 0x%llx failed err %d\n",
-  enable ? "Enable" : "Disable", feature, err);
+   netdev_err(netdev, "%s feature %pNF failed, err %d\n",
+  enable ? "Enable" : "Disable", &feature, err);
return err;
}
 
-- 
2.13.0



[net 08/11] net/mlx5e: Don't add/remove 802.1ad rules when changing 802.1Q VLAN filter

2017-09-27 Thread Saeed Mahameed
From: Gal Pressman 

Toggling of C-tag VLAN filter should not affect the "any S-tag" steering rule.

Fixes: 8a271746a264 ("net/mlx5e: Receive s-tagged packets in promiscuous mode")
Signed-off-by: Gal Pressman 
Signed-off-by: Saeed Mahameed 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
index f11fd07ac4dd..850cdc980ab5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_fs.c
@@ -291,7 +291,7 @@ void mlx5e_enable_vlan_filter(struct mlx5e_priv *priv)
priv->fs.vlan.filter_disabled = false;
if (priv->netdev->flags & IFF_PROMISC)
return;
-   mlx5e_del_any_vid_rules(priv);
+   mlx5e_del_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_ANY_CTAG_VID, 0);
 }
 
 void mlx5e_disable_vlan_filter(struct mlx5e_priv *priv)
@@ -302,7 +302,7 @@ void mlx5e_disable_vlan_filter(struct mlx5e_priv *priv)
priv->fs.vlan.filter_disabled = true;
if (priv->netdev->flags & IFF_PROMISC)
return;
-   mlx5e_add_any_vid_rules(priv);
+   mlx5e_add_vlan_rule(priv, MLX5E_VLAN_RULE_TYPE_ANY_CTAG_VID, 0);
 }
 
 int mlx5e_vlan_rx_add_vid(struct net_device *dev, __always_unused __be16 proto,
-- 
2.13.0



Re: [PATCH iproute2] tc: fix ipv6 filter selector attribute for some prefix lengths

2017-09-27 Thread Yulia Kartseva
Hello Stephen,
Sending as an attachment.
Thank you!

On Wed, Sep 27, 2017 at 1:26 AM, Stephen Hemminger
 wrote:
> On Mon, 25 Sep 2017 11:12:38 -0700
> Yulia Kartseva  wrote:
>
>> Wrong TCA_U32_SEL attribute packing if prefixLen AND 0x1f equals 0x1f.
>> These are  /31, /63, /95 and /127 prefix lengths.
>>
>> Example:
>> # tc filter add dev eth0 protocol ipv6 parent b: prio 2307 u32 match
>> ip6 dst face:b00f::/31
>> # tc filter show dev eth0
>> filter parent b: protocol ipv6 pref 2307 u32
>> filter parent b: protocol ipv6 pref 2307 u32 fh 800: ht divisor 1
>> filter parent b: protocol ipv6 pref 2307 u32 fh 800::800 order 2048
>> key ht 800 bkt 0
>>   match faceb00f/ffffffff at 24
>>
>>
>> The correct match would be "faceb00e/fffffffe": don't count the last
>> bit of the 4th byte as the network prefix. With fix:
>>
>> # tc filter show dev eth0
>> filter parent b: protocol ipv6 pref 2307 u32
>> filter parent b: protocol ipv6 pref 2307 u32 fh 800: ht divisor 1
>> filter parent b: protocol ipv6 pref 2307 u32 fh 800::800 order 2048
>> key ht 800 bkt 0
>>   match faceb00e/fffffffe at 24
>>
>>  tc/f_u32.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/tc/f_u32.c b/tc/f_u32.c
>> index 5815be9..14b9588 100644
>> --- a/tc/f_u32.c
>> +++ b/tc/f_u32.c
>> @@ -385,8 +385,7 @@ static int parse_ip6_addr(int *argc_p, char ***argv_p,
>>
>>   plen = addr.bitlen;
>>   for (i = 0; i < plen; i += 32) {
>> - /* if (((i + 31) & ~0x1F) <= plen) { */
>> - if (i + 31 <= plen) {
>> + if (i + 31 < plen) {
>>   res = pack_key(sel, addr.data[i / 32],
>> 0xffffffff, off + 4 * (i / 32), offmask);
>>   if (res < 0)
>
> This patch looks correct, but will not apply cleanly because
> the mail system that you submitted it with is removing whitespace.
> If possible use a different client, or send as an attachment.
>



-- 
Best regards, Yulia


f_u32.c.patch
Description: Binary data


[PATCH net-next] net: ipv4: remove fib_info arg to fib_check_nh

2017-09-27 Thread David Ahern
fib_check_nh does not use the fib_info arg; remove it.

Signed-off-by: David Ahern 
---
 net/ipv4/fib_semantics.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 57a5d48acee8..79989124607e 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -774,8 +774,8 @@ bool fib_metrics_match(struct fib_config *cfg, struct fib_info *fi)
  * |
  * |-> {local prefix} (terminal node)
  */
-static int fib_check_nh(struct fib_config *cfg, struct fib_info *fi,
-   struct fib_nh *nh, struct netlink_ext_ack *extack)
+static int fib_check_nh(struct fib_config *cfg, struct fib_nh *nh,
+   struct netlink_ext_ack *extack)
 {
int err = 0;
struct net *net;
@@ -1258,7 +1258,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg,
int linkdown = 0;
 
change_nexthops(fi) {
-   err = fib_check_nh(cfg, fi, nexthop_nh, extack);
+   err = fib_check_nh(cfg, nexthop_nh, extack);
if (err != 0)
goto failure;
if (nexthop_nh->nh_flags & RTNH_F_LINKDOWN)
-- 
2.1.4



[PATCH net-next] net: ipv4: remove fib_weight

2017-09-27 Thread David Ahern
fib_weight in fib_info is set but not used. Remove it and the
helpers for setting it.

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h | 3 ---
 net/ipv4/fib_semantics.c | 9 -
 2 files changed, 12 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 1a7f7e424320..f80524396c06 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -122,9 +122,6 @@ struct fib_info {
 #define fib_rtt fib_metrics->metrics[RTAX_RTT-1]
 #define fib_advmss fib_metrics->metrics[RTAX_ADVMSS-1]
int fib_nhs;
-#ifdef CONFIG_IP_ROUTE_MULTIPATH
-   int fib_weight;
-#endif
struct rcu_head rcu;
struct fib_nh   fib_nh[0];
#define fib_dev fib_nh[0].nh_dev
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 57a5d48acee8..be0874620ecc 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -601,17 +601,9 @@ static void fib_rebalance(struct fib_info *fi)
atomic_set(&nexthop_nh->nh_upper_bound, upper_bound);
} endfor_nexthops(fi);
 }
-
-static inline void fib_add_weight(struct fib_info *fi,
- const struct fib_nh *nh)
-{
-   fi->fib_weight += nh->nh_weight;
-}
-
 #else /* CONFIG_IP_ROUTE_MULTIPATH */
 
 #define fib_rebalance(fi) do { } while (0)
-#define fib_add_weight(fi, nh) do { } while (0)
 
 #endif /* CONFIG_IP_ROUTE_MULTIPATH */
 
@@ -1275,7 +1267,6 @@ struct fib_info *fib_create_info(struct fib_config *cfg,
 
change_nexthops(fi) {
fib_info_update_nh_saddr(net, nexthop_nh);
-   fib_add_weight(fi, nexthop_nh);
} endfor_nexthops(fi)
 
fib_rebalance(fi);
-- 
2.1.4



[PATCH net v2] i40e: Fix limit imprecise of the number of MAC/VLAN that can be added for VFs

2017-09-27 Thread w00273186
From: Yunjian Wang 

Now it doesn't strictly limit the number of MAC/VLAN filters. When there
are more elements in the virtchnl MAC/VLAN list, they can still be added
successfully.

Signed-off-by: Yunjian Wang 
---
 drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c | 27 +-
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c 
b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 4d1e670..285b96a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -2065,11 +2065,6 @@ static inline int i40e_check_vf_permission(struct i40e_vf *vf, u8 *macaddr)
dev_err(&pf->pdev->dev,
"VF attempting to override administratively set MAC address, reload the VF driver to resume normal operation\n");
ret = -EPERM;
-   } else if ((vf->num_mac >= I40E_VC_MAX_MAC_ADDR_PER_VF) &&
-  !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
-   dev_err(&pf->pdev->dev,
-   "VF is not trusted, switch the VF to trusted to add more functionality\n");
-   ret = -EPERM;
}
return ret;
 }
@@ -2128,6 +2123,15 @@ static int i40e_vc_add_mac_addr_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
} else {
vf->num_mac++;
}
+
+   if ((vf->num_mac >= I40E_VC_MAX_MAC_ADDR_PER_VF) &&
+   !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
+   dev_err(&pf->pdev->dev,
+   "VF is not trusted, switch the VF to trusted to add more functionality\n");
+   ret = -EPERM;
+   spin_unlock_bh(&vsi->mac_filter_hash_lock);
+   goto error_param;
+   }
}
spin_unlock_bh(&vsi->mac_filter_hash_lock);
 
@@ -2221,12 +2225,6 @@ static int i40e_vc_add_vlan_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
i40e_status aq_ret = 0;
int i;

-   if ((vf->num_vlan >= I40E_VC_MAX_VLAN_PER_VF) &&
-   !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
-   dev_err(&pf->pdev->dev,
-   "VF is not trusted, switch the VF to trusted to add more VLAN addresses\n");
-   goto error_param;
-   }
if (!test_bit(I40E_VF_STATE_ACTIVE, &vf->vf_states) ||
!i40e_vc_isvalid_vsi_id(vf, vsi_id)) {
aq_ret = I40E_ERR_PARAM;
@@ -2269,6 +2267,13 @@ static int i40e_vc_add_vlan_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
dev_err(&pf->pdev->dev,
"Unable to add VLAN filter %d for VF %d, error %d\n",
vfl->vlan_id[i], vf->vf_id, ret);
+   if ((vf->num_vlan >= I40E_VC_MAX_VLAN_PER_VF) &&
+   !test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
+   dev_err(&pf->pdev->dev,
+   "VF is not trusted, switch the VF to trusted to add more VLAN addresses\n");
+   aq_ret = -EPERM;
+   goto error_param;
+   }
}
 
 error_param:
-- 
1.8.3.1




RE: [PATCH net] i40e: Fix limit imprecise of the number of MAC/VLAN that can be added for VFs

2017-09-27 Thread wangyunjian
Thanks, I will send the v2 later.

> -Original Message-
> From: Sergei Shtylyov [mailto:sergei.shtyl...@cogentembedded.com]
> Sent: Wednesday, September 27, 2017 7:34 PM
> To: wangyunjian ; da...@davemloft.net;
> jeffrey.t.kirs...@intel.com
> Cc: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; caihe
> 
> Subject: Re: [PATCH net] i40e: Fix limit imprecise of the number of
> MAC/VLAN that can be added for VFs
> 
> Hello!
> 
> On 9/27/2017 9:58 AM, w00273186 wrote:
> 
> > From: Yunjian Wang 
> >
> > Now it don't limit the number of MAC/VLAN strictly. When there is more
> 
> Doesn't.
> 
> > elements in the virtchnl MAC/VLAN list, it can still add successfully.
> >
> > Signed-off-by: Yunjian Wang 
> 
> [...]
> 
> MBR, Sergei


[PATCH next] bonding: speed/duplex update at NETDEV_UP event

2017-09-27 Thread Mahesh Bandewar
From: Mahesh Bandewar 

Some NIC drivers don't have correct speed/duplex settings at the
time they send NETDEV_UP notification and that messes up the
bonding state. Especially 802.3ad mode which is very sensitive
to these settings. In the current implementation we invoke
bond_update_speed_duplex() when we receive NETDEV_UP, however,
ignore the return value. If the values we get are invalid
(UNKNOWN), then slave gets removed from the aggregator with
speed and duplex set to UNKNOWN while link is still marked as UP.

This patch fixes this scenario. Also, 802.3ad mode is sensitive to
these conditions while other modes are not, so we make sure the
change doesn't alter the behavior of the other modes.

Signed-off-by: Mahesh Bandewar 
---
 drivers/net/bonding/bond_main.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b7313c1d9dcd..177be373966b 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3076,7 +3076,16 @@ static int bond_slave_netdev_event(unsigned long event,
break;
case NETDEV_UP:
case NETDEV_CHANGE:
-   bond_update_speed_duplex(slave);
+   /* For 802.3ad mode only:
+* Getting invalid Speed/Duplex values here will put slave
+* in weird state. So mark it as link-down for the time
+* being and let link-monitoring (miimon) set it right when
+* correct speeds/duplex are available.
+*/
+   if (bond_update_speed_duplex(slave) &&
+   BOND_MODE(bond) == BOND_MODE_8023AD)
+   slave->link = BOND_LINK_DOWN;
+
if (BOND_MODE(bond) == BOND_MODE_8023AD)
bond_3ad_adapter_speed_duplex_changed(slave);
/* Fallthrough */
-- 
2.14.2.822.g60be5d43e6-goog



Re: [PATCH net-next RFC 3/5] vhost: introduce vhost_add_used_idx()

2017-09-27 Thread Willem de Bruijn
>> > > @@ -199,6 +199,7 @@ int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
>> > >   void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
>> > >   int vhost_vq_init_access(struct vhost_virtqueue *);
>> > > +int vhost_add_used_idx(struct vhost_virtqueue *vq, int n);
>> > >   int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
>> > >   int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem *heads,
>> > >unsigned count);
>> > Please change the API to hide the fact that there's an index that needs
>> > to be updated.
>>
>> In fact, an interesting optimization on top is just call
>> vhost_add_used_idx(vq, n) instead of n vhost_add_used_idx(vq, 1). That's the
>> reason I leave n in the API.
>>
>> Thanks
>
> Right but you could increment some internal counter in the vq
> structure then update the used index using some api
> with a generic name, e.g.  add_used_complete or something like this.

That adds a layer of information hiding. If the same variable can be
kept close to the computation in a local variable and passed directly
to vhost_add_used_idx_n that is easier to follow.


Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing

2017-09-27 Thread Willem de Bruijn
> @@ -461,6 +460,7 @@ static void handle_tx(struct vhost_net *net)
> struct socket *sock;
> struct vhost_net_ubuf_ref *uninitialized_var(ubufs);
> bool zcopy, zcopy_used;
> +   int i, batched = VHOST_NET_BATCH;
>
> mutex_lock(&vq->mutex);
> sock = vq->private_data;
> @@ -475,6 +475,12 @@ static void handle_tx(struct vhost_net *net)
> hdr_size = nvq->vhost_hlen;
> zcopy = nvq->ubufs;
>
> +   /* Disable zerocopy batched fetching for simplicity */

This special case can perhaps be avoided if we no longer block
on vhost_exceeds_maxpend, but revert to copying.

> +   if (zcopy) {
> +   heads = 

Can this special case of batchsize 1 not use vq->heads?

> +   batched = 1;
> +   }
> +
> for (;;) {
> /* Release DMAs done buffers first */
> if (zcopy)
> @@ -486,95 +492,114 @@ static void handle_tx(struct vhost_net *net)
> if (unlikely(vhost_exceeds_maxpend(net)))
> break;

> +   /* TODO: Check specific error and bomb out
> +* unless ENOBUFS?
> +*/
> +   err = sock->ops->sendmsg(sock, &msg, len);
> +   if (unlikely(err < 0)) {
> +   if (zcopy_used) {
> +   vhost_net_ubuf_put(ubufs);
> +   nvq->upend_idx =
> +  ((unsigned)nvq->upend_idx - 1) % UIO_MAXIOV;
> +   }
> +   vhost_discard_vq_desc(vq, 1);
> +   goto out;
> +   }
> +   if (err != len)
> +   pr_debug("Truncated TX packet: "
> +   " len %d != %zd\n", err, len);
> +   if (!zcopy) {
> +   vhost_add_used_idx(vq, 1);
> +   vhost_signal(&net->dev, vq);
> +   } else if (!zcopy_used) {
> +   vhost_add_used_and_signal(&net->dev,
> + vq, head, 0);

While batching, perhaps can also move this producer index update
out of the loop and using vhost_add_used_and_signal_n.

> +   } else
> +   vhost_zerocopy_signal_used(net, vq);
> +   vhost_net_tx_packet(net);
> +   if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> +   vhost_poll_queue(&vq->poll);
> +   goto out;
> }
> -   vhost_discard_vq_desc(vq, 1);
> -   break;
> -   }
> -   if (err != len)
> -   pr_debug("Truncated TX packet: "
> -" len %d != %zd\n", err, len);
> -   if (!zcopy_used)
> -   vhost_add_used_and_signal(&net->dev, vq, head, 0);
> -   else
> -   vhost_zerocopy_signal_used(net, vq);
> -   vhost_net_tx_packet(net);
> -   if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> -   vhost_poll_queue(&vq->poll);
> -   break;

This patch touches many lines just for indentation. If having to touch
these lines anyway (dirtying git blame), it may be a good time to move
the processing of a single descriptor code into a separate helper function.
And while breaking up, perhaps another helper for setting up ubuf_info.
If you agree, preferably in a separate noop refactor patch that precedes
the functional changes.


Re: [PATCH net-next RFC 2/5] vhost: introduce helper to prefetch desc index

2017-09-27 Thread Willem de Bruijn
On Fri, Sep 22, 2017 at 4:02 AM, Jason Wang  wrote:
> This patch introduces vhost_prefetch_desc_indices() which could batch
> descriptor indices fetching and used ring updating. This intends to
> reduce the cache misses of indices fetching and updating and reduce
> cache line bounce when virtqueue is almost full. copy_to_user() was
> used in order to benefit from modern cpus that support fast string
> copy. Batched virtqueue processing will be the first user.
>
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/vhost.c | 55 +++
>  drivers/vhost/vhost.h |  3 +++
>  2 files changed, 58 insertions(+)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f87ec75..8424166d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2437,6 +2437,61 @@ struct vhost_msg_node *vhost_dequeue_msg(struct vhost_dev *dev,
>  }
>  EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
>
> +int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
> +   struct vring_used_elem *heads,
> +   u16 num, bool used_update)
> +{
> +   int ret, ret2;
> +   u16 last_avail_idx, last_used_idx, total, copied;
> +   __virtio16 avail_idx;
> +   struct vring_used_elem __user *used;
> +   int i;
> +
> +   if (unlikely(vhost_get_avail(vq, avail_idx, &vq->avail->idx))) {
> +   vq_err(vq, "Failed to access avail idx at %p\n",
> +  &vq->avail->idx);
> +   return -EFAULT;
> +   }
> +   last_avail_idx = vq->last_avail_idx & (vq->num - 1);
> +   vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> +   total = vq->avail_idx - vq->last_avail_idx;
> +   ret = total = min(total, num);
> +
> +   for (i = 0; i < ret; i++) {
> +   ret2 = vhost_get_avail(vq, heads[i].id,
> + &vq->avail->ring[last_avail_idx]);
> +   if (unlikely(ret2)) {
> +   vq_err(vq, "Failed to get descriptors\n");
> +   return -EFAULT;
> +   }
> +   last_avail_idx = (last_avail_idx + 1) & (vq->num - 1);
> +   }

This is understandably very similar to the existing logic in vhost_get_vq_desc.
Can that be extracted to a helper to avoid code duplication?

Perhaps one helper to update vq->avail_idx and return num, and
another to call vhost_get_avail one or more times.

> +
> +   if (!used_update)
> +   return ret;
> +
> +   last_used_idx = vq->last_used_idx & (vq->num - 1);
> +   while (total) {
> +   copied = min((u16)(vq->num - last_used_idx), total);
> +   ret2 = vhost_copy_to_user(vq,
> + &vq->used->ring[last_used_idx],
> + [ret - total],
> + copied * sizeof(*used));
> +
> +   if (unlikely(ret2)) {
> +   vq_err(vq, "Failed to update used ring!\n");
> +   return -EFAULT;
> +   }
> +
> +   last_used_idx = 0;
> +   total -= copied;
> +   }

This second part seems unrelated and could be a separate function?

Also, no need for ret2 and double assignment "ret = total =" if not
modifying total
in the the second loop:

  for (i = 0; i < total; ) {
...
i += copied;
  }


Re: [PATCH net-next] vhost_net: do not stall on zerocopy depletion

2017-09-27 Thread Willem de Bruijn
On Wed, Sep 27, 2017 at 8:25 PM, Willem de Bruijn
 wrote:
> From: Willem de Bruijn 
>
> Vhost-net has a hard limit on the number of zerocopy skbs in flight.
> When reached, transmission stalls. Stalls cause latency, as well as
> head-of-line blocking of other flows that do not use zerocopy.
>
> Instead of stalling, revert to copy-based transmission.
>
> Tested by sending two udp flows from guest to host, one with payload
> of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The
> large flow is redirected to a netem instance with 1MBps rate limit
> and deep 1000 entry queue.
>
>   modprobe ifb
>   ip link set dev ifb0 up
>   tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit
>
>   tc qdisc add dev tap0 ingress
>   tc filter add dev tap0 parent ffff: protocol ip \
>   u32 match ip dport 8000 0xffff \
>   action mirred egress redirect dev ifb0
>
> Before the delay, both flows process around 80K pps. With the delay,
> before this patch, both process around 400. After this patch, the
> large flow is still rate limited, while the small reverts to its
> original rate. See also discussion in the first link, below.
>
> The limit in vhost_exceeds_maxpend must be carefully chosen. When
> vq->num >> 1, the flows remain correlated. This value happens to
> correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller
> fractions and ensure correctness also for much smaller values of
> vq->num, by testing the min() of both explicitly. See also the
> discussion in the second link below.
>
> Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3thtxa6bnzkygp3tgvpjbp...@mail.gmail.com

From the same discussion thread: it would be good to expose stats
on the number of zerocopy skb sent and number completed without
copy.

To test this patch, I also added ethtool stats to tun and extended them
with two zerocopy counters. Then had tun override the uarg->callback
with its own and update the counters before calling the original callback.

The one useful datapoint I did not get out of that is why skbs would
revert to non-zerocopy: because of size, vhost_exceeds_maxpend
or vhost_net_tx_select_zcopy. The simplistic implementation with an
extra indirect function call and without percpu counters is also not
suitable for submission as is.


[PATCH net-next] vhost_net: do not stall on zerocopy depletion

2017-09-27 Thread Willem de Bruijn
From: Willem de Bruijn 

Vhost-net has a hard limit on the number of zerocopy skbs in flight.
When reached, transmission stalls. Stalls cause latency, as well as
head-of-line blocking of other flows that do not use zerocopy.

Instead of stalling, revert to copy-based transmission.

Tested by sending two udp flows from guest to host, one with payload
of VHOST_GOODCOPY_LEN, the other too small for zerocopy (1B). The
large flow is redirected to a netem instance with 1MBps rate limit
and deep 1000 entry queue.

  modprobe ifb
  ip link set dev ifb0 up
  tc qdisc add dev ifb0 root netem limit 1000 rate 1MBit

  tc qdisc add dev tap0 ingress
  tc filter add dev tap0 parent ffff: protocol ip \
  u32 match ip dport 8000 0xffff \
  action mirred egress redirect dev ifb0

Before the delay, both flows process around 80K pps. With the delay,
before this patch, both process around 400. After this patch, the
large flow is still rate limited, while the small reverts to its
original rate. See also discussion in the first link, below.

The limit in vhost_exceeds_maxpend must be carefully chosen. When
vq->num >> 1, the flows remain correlated. This value happens to
correspond to VHOST_MAX_PENDING for vq->num == 256. Allow smaller
fractions and ensure correctness also for much smaller values of
vq->num, by testing the min() of both explicitly. See also the
discussion in the second link below.

Link:http://lkml.kernel.org/r/CAF=yD-+Wk9sc9dXMUq1+x_hh=3thtxa6bnzkygp3tgvpjbp...@mail.gmail.com
Link:http://lkml.kernel.org/r/20170819064129.27272-1-...@klaipeden.com
Signed-off-by: Willem de Bruijn 
---
 drivers/vhost/net.c | 14 --
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 58585ec8699e..50758602ae9d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -436,8 +436,8 @@ static bool vhost_exceeds_maxpend(struct vhost_net *net)
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
 
-   return (nvq->upend_idx + vq->num - VHOST_MAX_PEND) % UIO_MAXIOV
-   == nvq->done_idx;
+   return (nvq->upend_idx + UIO_MAXIOV - nvq->done_idx) % UIO_MAXIOV >
+  min(VHOST_MAX_PEND, vq->num >> 2);
 }
 
 /* Expects to be always run from workqueue - which acts as
@@ -480,12 +480,6 @@ static void handle_tx(struct vhost_net *net)
if (zcopy)
vhost_zerocopy_signal_used(net, vq);
 
-   /* If more outstanding DMAs, queue the work.
-* Handle upend_idx wrap around
-*/
-   if (unlikely(vhost_exceeds_maxpend(net)))
-   break;
-
head = vhost_net_tx_get_vq_desc(net, vq, vq->iov,
ARRAY_SIZE(vq->iov),
&out, &in);
@@ -509,6 +503,7 @@ static void handle_tx(struct vhost_net *net)
len = iov_length(vq->iov, out);
iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
iov_iter_advance(&msg.msg_iter, hdr_size);
+
/* Sanity check */
if (!msg_data_left(&msg)) {
vq_err(vq, "Unexpected header len for TX: "
@@ -519,8 +514,7 @@ static void handle_tx(struct vhost_net *net)
len = msg_data_left(&msg);
 
zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
-  && (nvq->upend_idx + 1) % UIO_MAXIOV !=
- nvq->done_idx
+  && !vhost_exceeds_maxpend(net)
   && vhost_net_tx_select_zcopy(net);
 
/* use msg_control to pass vhost zerocopy ubuf info to skb */
-- 
2.14.2.822.g60be5d43e6-goog



Re: [PATCH V3] r8152: add Linksys USB3GIGV1 id

2017-09-27 Thread Grant Grundler
Hi Doug!

On Wed, Sep 27, 2017 at 4:47 PM, Doug Anderson  wrote:
> Hi,
>
> On Wed, Sep 27, 2017 at 10:28 AM, Grant Grundler  
> wrote:
>> This linksys dongle by default comes up in cdc_ether mode.
>> This patch allows r8152 to claim the device:
>>Bus 002 Device 002: ID 13b1:0041 Linksys
>>
>> Signed-off-by: Grant Grundler 
>> ---
>>  drivers/net/usb/cdc_ether.c | 10 ++
>>  drivers/net/usb/r8152.c |  2 ++
>>  2 files changed, 12 insertions(+)
>>
>> V3: for backwards compat, add #ifdef CONFIG_USB_RTL8152 around
>> the cdc_ether blacklist entry so the cdc_ether driver can
>> still claim the device if r8152 driver isn't configured.
>>
>> V2: add LINKSYS_VENDOR_ID to cdc_ether blacklist
>>
>> diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
>> index 8ab281b478f2..446dcc0f1f70 100644
>> --- a/drivers/net/usb/cdc_ether.c
>> +++ b/drivers/net/usb/cdc_ether.c
>> @@ -546,6 +546,7 @@ static const struct driver_info wwan_info = {
>>  #define DELL_VENDOR_ID 0x413C
>>  #define REALTEK_VENDOR_ID  0x0bda
>>  #define SAMSUNG_VENDOR_ID  0x04e8
>> +#define LINKSYS_VENDOR_ID  0x13b1
>>  #define LENOVO_VENDOR_ID   0x17ef
>
> Slight nit that "LI" sorts after "LE".  You got it right in the other case...

The list isn't sorted by any rational thing I can see.  I managed to
check my OCD reaction to sort the list numerically. :)

>>  #define NVIDIA_VENDOR_ID   0x0955
>>  #define HP_VENDOR_ID   0x03f0
>> @@ -737,6 +738,15 @@ static const struct usb_device_id  products[] = {
>> .driver_info = 0,
>>  },
>>
>> +#ifdef CONFIG_USB_RTL8152
>> +/* Linksys USB3GIGV1 Ethernet Adapter */
>> +{
>> +   USB_DEVICE_AND_INTERFACE_INFO(LINKSYS_VENDOR_ID, 0x0041, 
>> USB_CLASS_COMM,
>> +   USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
>> +   .driver_info = 0,
>> +},
>> +#endif
>
> I believe you want to use IS_ENABLED(), don't you?

Ah yes - I wasn't aware IS_ENABLED existed.  Will respin V4 with this
if there isn't any other feedback.


> There's still a weird esoteric side case where kernel modules don't
> all need to be included in the filesystem just because they were built
> at the same time.  ...but IMHO that seems like enough of a nit that we
> can probably ignore it unless someone has a better idea.

I think that would require a run time check. I'm perfectly willing to
ignore that case. :)

thanks!
grant

>
>
> -Doug


Re: [PATCH V3] r8152: add Linksys USB3GIGV1 id

2017-09-27 Thread Doug Anderson
Hi,

On Wed, Sep 27, 2017 at 10:28 AM, Grant Grundler  wrote:
> This linksys dongle by default comes up in cdc_ether mode.
> This patch allows r8152 to claim the device:
>Bus 002 Device 002: ID 13b1:0041 Linksys
>
> Signed-off-by: Grant Grundler 
> ---
>  drivers/net/usb/cdc_ether.c | 10 ++
>  drivers/net/usb/r8152.c |  2 ++
>  2 files changed, 12 insertions(+)
>
> V3: for backwards compat, add #ifdef CONFIG_USB_RTL8152 around
> the cdc_ether blacklist entry so the cdc_ether driver can
> still claim the device if r8152 driver isn't configured.
>
> V2: add LINKSYS_VENDOR_ID to cdc_ether blacklist
>
> diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
> index 8ab281b478f2..446dcc0f1f70 100644
> --- a/drivers/net/usb/cdc_ether.c
> +++ b/drivers/net/usb/cdc_ether.c
> @@ -546,6 +546,7 @@ static const struct driver_info wwan_info = {
>  #define DELL_VENDOR_ID 0x413C
>  #define REALTEK_VENDOR_ID  0x0bda
>  #define SAMSUNG_VENDOR_ID  0x04e8
> +#define LINKSYS_VENDOR_ID  0x13b1
>  #define LENOVO_VENDOR_ID   0x17ef

Slight nit that "LI" sorts after "LE".  You got it right in the other case...


>  #define NVIDIA_VENDOR_ID   0x0955
>  #define HP_VENDOR_ID   0x03f0
> @@ -737,6 +738,15 @@ static const struct usb_device_id  products[] = {
> .driver_info = 0,
>  },
>
> +#ifdef CONFIG_USB_RTL8152
> +/* Linksys USB3GIGV1 Ethernet Adapter */
> +{
> +   USB_DEVICE_AND_INTERFACE_INFO(LINKSYS_VENDOR_ID, 0x0041, 
> USB_CLASS_COMM,
> +   USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
> +   .driver_info = 0,
> +},
> +#endif

I believe you want to use IS_ENABLED(), don't you?

There's still a weird esoteric side case where kernel modules don't
all need to be included in the filesystem just because they were built
at the same time.  ...but IMHO that seems like enough of a nit that we
can probably ignore it unless someone has a better idea.


-Doug


Re: [PATCH net-next 0/3] support changing steering policies in tuntap

2017-09-27 Thread Willem de Bruijn
>> In the future, both simple and sophisticated policy like RSS or other guest
>> driven steering policies could be done on top.
>
> IMHO there should be a more practical example before adding all this
> indirection. And it would be nice to understand why this queue selection
> needs to be tun specific.

I was thinking the same and this reminds me of the various strategies
implemented in packet fanout. tun_cpu_select_queue is analogous to
fanout_demux_cpu though it is tun-specific in that it requires tun->numqueues.

Fanout accrued various strategies until it gained an eBPF variant. Just
supporting BPF is probably sufficient here, too.


Re: [next-queue PATCH 2/3] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-09-27 Thread Guedes, Andre
On Wed, 2017-09-27 at 15:57 -0700, Jesus Sanchez-Palencia wrote:
> Hi,
> 
> 
> On 09/27/2017 02:14 PM, Vinicius Costa Gomes wrote:
> > Hi,
> > 
> > Cong Wang  writes:
> > 
> > > On Tue, Sep 26, 2017 at 4:39 PM, Vinicius Costa Gomes
> > >  wrote:
> > > > +static int cbs_init(struct Qdisc *sch, struct nlattr *opt)
> > > > +{
> > > > +   struct cbs_sched_data *q = qdisc_priv(sch);
> > > > +   struct net_device *dev = qdisc_dev(sch);
> > > > +
> > > > +   if (!opt)
> > > > +   return -EINVAL;
> > > > +
> > > > +   /* FIXME: this means that we can only install this qdisc
> > > > +* "under" mqprio. Do we need a more generic way to retrieve
> > > > +* the queue, or do we pass the netdev_queue to the driver?
> > > > +*/
> > > > +   q->queue = TC_H_MIN(sch->parent) - 1 - netdev_get_num_tc(dev);
> > > > +
> > > > +   return cbs_change(sch, opt);
> > > > +}
> > > 
> > > Yeah it is ugly to assume its parent is mqprio, at least you should
> > > error out if it is not the case.
> > 
> > Will add an error for this, for now.
> > 
> > > 
> > > I am not sure how we can solve this elegantly, perhaps you should
> > > extend mqprio rather than add a new one?
> > 
> > Is the alternative hinted in the FIXME worse? Instead of passing the
> > index of the hardware queue to the driver we pass the pointer to a
> > netdev_queue to the driver and it "discovers" the HW queue from that.

I don't see why we should move the queue index "discovery" into the driver. The
driver layer should be dead simple and getting the queue index from the upper
layer (qdisc) looks right to me.

> What if we keep passing the index, but calculate it from the netdev_queue
> pointer instead?
> 
> i.e.:  q->queue = sch->dev_queue - netdev_get_tx_queue(dev, 0);
> 
> At least it wouldn't rely on the root qdisc being of any specific type.

+1

- Andre

smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH v4 net-next 00/12] gtp: Additional feature support - Part I

2017-09-27 Thread Tom Herbert
On Wed, Sep 27, 2017 at 5:24 AM, Harald Welte  wrote:
> Hi Tom,
>
> thanks for your updated series!
>
> I'll review as soon as I can, but that will likely not be before next
> week.  As indicated before, I'm on a motorbike roadtrip on vacation,
> with very limited connectivity and even more limited time for any
> technical work.  Thanks for your understanding.
>
Harald,

No problem. Enjoy your vacation!

Tom

> --
> - Harald Welte    http://laforge.gnumonks.org/
> 
> "Privacy in residential applications is a desirable marketing option."
>   (ETSI EN 300 175-7 Ch. A6)


Re: [next-queue PATCH 2/3] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-09-27 Thread Jesus Sanchez-Palencia
Hi,


On 09/27/2017 02:14 PM, Vinicius Costa Gomes wrote:
> Hi,
> 
> Cong Wang  writes:
> 
>> On Tue, Sep 26, 2017 at 4:39 PM, Vinicius Costa Gomes
>>  wrote:
>>> +static int cbs_init(struct Qdisc *sch, struct nlattr *opt)
>>> +{
>>> +   struct cbs_sched_data *q = qdisc_priv(sch);
>>> +   struct net_device *dev = qdisc_dev(sch);
>>> +
>>> +   if (!opt)
>>> +   return -EINVAL;
>>> +
>>> +   /* FIXME: this means that we can only install this qdisc
>>> +* "under" mqprio. Do we need a more generic way to retrieve
>>> +* the queue, or do we pass the netdev_queue to the driver?
>>> +*/
>>> +   q->queue = TC_H_MIN(sch->parent) - 1 - netdev_get_num_tc(dev);
>>> +
>>> +   return cbs_change(sch, opt);
>>> +}
>>
>> Yeah it is ugly to assume its parent is mqprio, at least you should
>> error out if it is not the case.
> 
> Will add an error for this, for now.
> 
>>
>> I am not sure how we can solve this elegantly, perhaps you should
>> extend mqprio rather than add a new one?
> 
> Is the alternative hinted in the FIXME worse? Instead of passing the
> index of the hardware queue to the driver we pass the pointer to a
> netdev_queue to the driver and it "discovers" the HW queue from that.

What if we keep passing the index, but calculate it from the netdev_queue
pointer instead?

i.e.:  q->queue = sch->dev_queue - netdev_get_tx_queue(dev, 0);

At least it wouldn't rely on the root qdisc being of any specific type.

Regards,
Jesus


Re: [PATCH net-next RFC 3/5] vhost: introduce vhost_add_used_idx()

2017-09-27 Thread Michael S. Tsirkin
On Wed, Sep 27, 2017 at 08:38:24AM +0800, Jason Wang wrote:
> 
> 
> On 2017年09月27日 03:13, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2017 at 04:02:33PM +0800, Jason Wang wrote:
> > > This patch introduces a helper which just increases the used idx. This
> > > will be used in pair with vhost_prefetch_desc_indices() by batching
> > > code.
> > > 
> > > Signed-off-by: Jason Wang 
> > > ---
> > >   drivers/vhost/vhost.c | 33 +
> > >   drivers/vhost/vhost.h |  1 +
> > >   2 files changed, 34 insertions(+)
> > > 
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 8424166d..6532cda 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -2178,6 +2178,39 @@ int vhost_add_used(struct vhost_virtqueue *vq, 
> > > unsigned int head, int len)
> > >   }
> > >   EXPORT_SYMBOL_GPL(vhost_add_used);
> > > +int vhost_add_used_idx(struct vhost_virtqueue *vq, int n)
> > > +{
> > > + u16 old, new;
> > > +
> > > + old = vq->last_used_idx;
> > > + new = (vq->last_used_idx += n);
> > > + /* If the driver never bothers to signal in a very long while,
> > > +  * used index might wrap around. If that happens, invalidate
> > > +  * signalled_used index we stored. TODO: make sure driver
> > > +  * signals at least once in 2^16 and remove this.
> > > +  */
> > > + if (unlikely((u16)(new - vq->signalled_used) < (u16)(new - old)))
> > > + vq->signalled_used_valid = false;
> > > +
> > > + /* Make sure buffer is written before we update index. */
> > > + smp_wmb();
> > > + if (vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> > > +>used->idx)) {
> > > + vq_err(vq, "Failed to increment used idx");
> > > + return -EFAULT;
> > > + }
> > > + if (unlikely(vq->log_used)) {
> > > + /* Log used index update. */
> > > + log_write(vq->log_base,
> > > +   vq->log_addr + offsetof(struct vring_used, idx),
> > > +   sizeof(vq->used->idx));
> > > + if (vq->log_ctx)
> > > + eventfd_signal(vq->log_ctx, 1);
> > > + }
> > > + return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(vhost_add_used_idx);
> > > +
> > >   static int __vhost_add_used_n(struct vhost_virtqueue *vq,
> > >   struct vring_used_elem *heads,
> > >   unsigned count)
> > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > index 16c2cb6..5dd6c05 100644
> > > --- a/drivers/vhost/vhost.h
> > > +++ b/drivers/vhost/vhost.h
> > > @@ -199,6 +199,7 @@ int __vhost_get_vq_desc(struct vhost_virtqueue *vq,
> > >   void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
> > >   int vhost_vq_init_access(struct vhost_virtqueue *);
> > > +int vhost_add_used_idx(struct vhost_virtqueue *vq, int n);
> > >   int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int 
> > > len);
> > >   int vhost_add_used_n(struct vhost_virtqueue *, struct vring_used_elem 
> > > *heads,
> > >unsigned count);
> > Please change the API to hide the fact that there's an index that needs
> > to be updated.
> 
> In fact, an interesting optimization on top is to just call
> vhost_add_used_idx(vq, n) instead of n calls to vhost_add_used_idx(vq, 1). That's the
> reason I leave n in the API.
> 
> Thanks

Right but you could increment some internal counter in the vq
structure then update the used index using some api
with a generic name, e.g.  add_used_complete or something like this.

> > 
> > > -- 
> > > 2.7.4


Re: [PATCH net-next RFC 2/5] vhost: introduce helper to prefetch desc index

2017-09-27 Thread Michael S. Tsirkin
On Wed, Sep 27, 2017 at 08:35:47AM +0800, Jason Wang wrote:
> 
> 
> On 2017年09月27日 03:19, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2017 at 04:02:32PM +0800, Jason Wang wrote:
> > > This patch introduces vhost_prefetch_desc_indices() which could batch
> > > descriptor indices fetching and used ring updating. This intends to
> > > reduce the cache misses of indices fetching and updating and reduce
> > > cache line bounce when virtqueue is almost full. copy_to_user() was
> > > used in order to benefit from modern cpus that support fast string
> > > copy. Batched virtqueue processing will be the first user.
> > > 
> > > Signed-off-by: Jason Wang 
> > > ---
> > >   drivers/vhost/vhost.c | 55 
> > > +++
> > >   drivers/vhost/vhost.h |  3 +++
> > >   2 files changed, 58 insertions(+)
> > > 
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index f87ec75..8424166d 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -2437,6 +2437,61 @@ struct vhost_msg_node *vhost_dequeue_msg(struct 
> > > vhost_dev *dev,
> > >   }
> > >   EXPORT_SYMBOL_GPL(vhost_dequeue_msg);
> > > +int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
> > > + struct vring_used_elem *heads,
> > > + u16 num, bool used_update)
> > why do you need to combine used update with prefetch?
> 
> For better performance


Why is sticking a branch in there better than requesting the update
conditionally from the caller?



> and I believe we don't care about the overhead when
> we meet errors in tx.

That's a separate question, I do not really understand how
you can fetch a descriptor and update the used ring at the same
time. This allows the guest to overwrite the buffer.
I might be misunderstanding what is going on here though.


> > 
> > > +{
> > > + int ret, ret2;
> > > + u16 last_avail_idx, last_used_idx, total, copied;
> > > + __virtio16 avail_idx;
> > > + struct vring_used_elem __user *used;
> > > + int i;
> > > +
> > > + if (unlikely(vhost_get_avail(vq, avail_idx, >avail->idx))) {
> > > + vq_err(vq, "Failed to access avail idx at %p\n",
> > > +>avail->idx);
> > > + return -EFAULT;
> > > + }
> > > + last_avail_idx = vq->last_avail_idx & (vq->num - 1);
> > > + vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> > > + total = vq->avail_idx - vq->last_avail_idx;
> > > + ret = total = min(total, num);
> > > +
> > > + for (i = 0; i < ret; i++) {
> > > + ret2 = vhost_get_avail(vq, heads[i].id,
> > > +   >avail->ring[last_avail_idx]);
> > > + if (unlikely(ret2)) {
> > > + vq_err(vq, "Failed to get descriptors\n");
> > > + return -EFAULT;
> > > + }
> > > + last_avail_idx = (last_avail_idx + 1) & (vq->num - 1);
> > > + }
> > > +
> > > + if (!used_update)
> > > + return ret;
> > > +
> > > + last_used_idx = vq->last_used_idx & (vq->num - 1);
> > > + while (total) {
> > > + copied = min((u16)(vq->num - last_used_idx), total);
> > > + ret2 = vhost_copy_to_user(vq,
> > > +   >used->ring[last_used_idx],
> > > +   [ret - total],
> > > +   copied * sizeof(*used));
> > > +
> > > + if (unlikely(ret2)) {
> > > + vq_err(vq, "Failed to update used ring!\n");
> > > + return -EFAULT;
> > > + }
> > > +
> > > + last_used_idx = 0;
> > > + total -= copied;
> > > + }
> > > +
> > > + /* Only get avail ring entries after they have been exposed by guest. */
> > > + smp_rmb();
> > Barrier before return is a very confusing API. I guess it's designed to
> > be used in a specific way to make it necessary - but what is it?
> 
> Looks like so, and we need to do this after reading avail_idx.
> 
> Thanks
> 
> > 
> > 
> > > + return ret;
> > > +}
> > > +EXPORT_SYMBOL(vhost_prefetch_desc_indices);
> > >   static int __init vhost_init(void)
> > >   {
> > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > index 39ff897..16c2cb6 100644
> > > --- a/drivers/vhost/vhost.h
> > > +++ b/drivers/vhost/vhost.h
> > > @@ -228,6 +228,9 @@ ssize_t vhost_chr_read_iter(struct vhost_dev *dev, 
> > > struct iov_iter *to,
> > >   ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
> > >struct iov_iter *from);
> > >   int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled);
> > > +int vhost_prefetch_desc_indices(struct vhost_virtqueue *vq,
> > > + struct vring_used_elem *heads,
> > > + u16 num, bool used_update);
> > >   #define vq_err(vq, fmt, ...) do {  \
> > >   pr_debug(pr_fmt(fmt), ##__VA_ARGS__);   \
> > > -- 
> > > 2.7.4


[PATCH v2] netlink: do not proceed if dump's start() errs

2017-09-27 Thread Jason A. Donenfeld
Drivers that use the start method for netlink dumping rely on dumpit not
being called if start fails. For example, ila_xlat.c allocates memory
and assigns it to cb->args[0] in its start() function. It might fail to
do that and return -ENOMEM instead. However, even when returning an
error, dumpit will be called, which, in the example above, quickly
dereferences the memory in cb->args[0], which will OOPS the kernel. This
is but one example of how this goes wrong.

Since start() has always been a function with an int return type, it
therefore makes sense to use it properly, rather than ignoring it. This
patch thus returns early and does not call dumpit() when start() fails.

Signed-off-by: Jason A. Donenfeld 
Cc: Johannes Berg 
---
 net/netlink/af_netlink.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 327807731b44..94c11cf0459d 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2270,10 +2270,13 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 
mutex_unlock(nlk->cb_mutex);
 
+   ret = 0;
if (cb->start)
-   cb->start(cb);
+   ret = cb->start(cb);
+
+   if (!ret)
+   ret = netlink_dump(sk);
 
-   ret = netlink_dump(sk);
sock_put(sk);
 
if (ret)
-- 
2.14.1



Re: [PATCH net-next RFC 0/5] batched tx processing in vhost_net

2017-09-27 Thread Michael S. Tsirkin
On Wed, Sep 27, 2017 at 08:27:37AM +0800, Jason Wang wrote:
> 
> 
> On 2017年09月26日 21:45, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2017 at 04:02:30PM +0800, Jason Wang wrote:
> > > Hi:
> > > 
> > > This series tries to implement basic tx batched processing. This is
> > > done by prefetching descriptor indices and update used ring in a
> > > batch. This intends to speed up used ring updating and improve the
> > > cache utilization.
> > Interesting, thanks for the patches. So IIUC most of the gain is really
> > overcoming some of the shortcomings of virtio 1.0 wrt cache utilization?
> 
> Yes.
> 
> Actually, looks like batching in 1.1 is not as easy as in 1.0.
> 
> In 1.0, we could do something like:
> 
> batch update used ring by user copy_to_user()
> smp_wmb()
> update used_idx
>
> In 1.1, we need more memory barriers, can't benefit from fast copy helpers?
> 
> for () {
>     update desc.addr
>     smp_wmb()
>     update desc.flag
> }

Yes but smp_wmb is a NOP on e.g. x86. We can switch to other types of
barriers as well.  We do need to do the updates in order, so we might
need new APIs for that to avoid re-doing the translation all the time.

In 1.0 the last update is a cache miss always. You need batching to get
less misses. In 1.1 you don't have it so fundamentally there is less
need for batching. But batching does not always work.  DPDK guys (who
batch things aggressively) already tried 1.1 and saw performance gains
so we do not need to argue theoretically.



> > 
> > Which is fair enough (1.0 is already deployed) but I would like to avoid
> > making 1.1 support harder, and this patchset does this unfortunately,
> 
> I think the new APIs do not expose more internal data structure of virtio
> than before? (vq->heads has already been used by vhost_net for years).

For sure we might need to change vring_used_elem.

> Considering the layout is re-designed completely, I don't see an easy method to
> reuse the current 1.0 API for 1.1.

Current API just says you get buffers then you use them. It is not tied
to actual separate used ring.


> > see comments on individual patches. I'm sure it can be addressed though.
> > 
> > > Test shows about ~22% improvement in tx pps.
> > Is this with or without tx napi in guest?
> 
> MoonGen is used in guest for better numbers.
> 
> Thanks

Not sure I understand. Did you set napi_tx to true or false?

> > 
> > > Please review.
> > > 
> > > Jason Wang (5):
> > >vhost: split out ring head fetching logic
> > >vhost: introduce helper to prefetch desc index
> > >vhost: introduce vhost_add_used_idx()
> > >vhost_net: rename VHOST_RX_BATCH to VHOST_NET_BATCH
> > >vhost_net: basic tx virtqueue batched processing
> > > 
> > >   drivers/vhost/net.c   | 221 
> > > --
> > >   drivers/vhost/vhost.c | 165 +++--
> > >   drivers/vhost/vhost.h |   9 ++
> > >   3 files changed, 270 insertions(+), 125 deletions(-)
> > > 
> > > -- 
> > > 2.7.4


Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing

2017-09-27 Thread Michael S. Tsirkin
On Wed, Sep 27, 2017 at 10:04:18AM +0800, Jason Wang wrote:
> 
> 
> On 2017年09月27日 03:25, Michael S. Tsirkin wrote:
> > On Fri, Sep 22, 2017 at 04:02:35PM +0800, Jason Wang wrote:
> > > This patch implements basic batched processing of tx virtqueue by
> > > prefetching desc indices and updating used ring in a batch. For
> > > non-zerocopy case, vq->heads were used for storing the prefetched
> > > indices and updating used ring. It is also a requirement for doing
> > > more batching on top. For zerocopy case and for simplicity, batched
> > > processing were simply disabled by only fetching and processing one
> > > descriptor at a time, this could be optimized in the future.
> > > 
> > > XDP_DROP (without touching skb) on tun (with Moongen in guest) with
> > > zercopy disabled:
> > > 
> > > Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz:
> > > Before: 3.20Mpps
> > > After:  3.90Mpps (+22%)
> > > 
> > > No differences were seen with zerocopy enabled.
> > > 
> > > Signed-off-by: Jason Wang 
> > So where is the speedup coming from? I'd guess the ring is
> > hot in cache, it's faster to access it in one go, then
> > pass many packets to net stack. Is that right?
> > 
> > Another possibility is better code cache locality.
> 
> Yes, I think the speed up comes from:
> 
> - less cache misses
> - less cache line bounce when virtqueue is about to be full (guest is faster
> than host which is the case of MoonGen)
> - less memory barriers
> - possible faster copy speed by using copy_to_user() on modern CPUs
> 
> > 
> > So how about this patchset is refactored:
> > 
> > 1. use existing APIs just first get packets then
> > transmit them all then use them all
> 
> Looks like the current API cannot get packets first; it only supports getting
> packets one by one (if you mean vhost_get_vq_desc()). And used ring updating may get
> more misses in this case.

Right. So if you do

for (...)
vhost_get_vq_desc


then later

for (...)
vhost_add_used


then you get most of the benefits except maybe code cache misses
and copy_to_user.







> > 2. add new APIs and move the loop into vhost core
> > for more speedups
> 
> I don't see any advantages; it looks like we would just need some callbacks in this
> case.
> 
> Thanks

IIUC callbacks pretty much destroy the code cache locality advantages,
IP is jumping around too much.


-- 
MST


Re: [PATCH net-next 0/3] support changing steering policies in tuntap

2017-09-27 Thread Michael S. Tsirkin
On Wed, Sep 27, 2017 at 04:23:54PM +0800, Jason Wang wrote:
> Hi all:
> 
> We use flow caches based flow steering policy now. This is good for
> connection-oriented communication such as TCP but not for the others
> e.g connectionless unidirectional workload which cares only about
> pps. This calls the ability of supporting changing steering policies
> in tuntap which was done by this series.
> 
> Flow steering policy was abstracted into tun_steering_ops in the first
> patch. Then new ioctls to set or query current policy were introduced,
> and the last patch introduces a very simple policy that select txq
> based on processor id as an example.
> 
> Test was done by using xdp_redirect to redirect traffic generated from
> MoonGen that was running on a remote machine. And I see 37%
> improvement for processor id policy compared to automatic flow
> steering policy.

For sure, if you don't need to figure out the flow hash then you can
save a bunch of cycles.  But I don't think the cpu policy is too
practical outside of a benchmark.

Did you generate packets and just send them to tun? If so, this is not a
typical configuration, is it? With packets coming e.g.  from a real nic
they might already have the hash pre-calculated, and you won't
see the benefit.

> In the future, both simple and sophisticated policy like RSS or other guest
> driven steering policies could be done on top.

IMHO there should be a more practical example before adding all this
indirection. And it would be nice to understand why this queue selection
needs to be tun specific.

> Thanks
> 
> Jason Wang (3):
>   tun: abstract flow steering logic
>   tun: introduce ioctls to set and get steering policies
>   tun: introduce cpu id based steering policy
> 
>  drivers/net/tun.c   | 151 
> +---
>  include/uapi/linux/if_tun.h |   8 +++
>  2 files changed, 136 insertions(+), 23 deletions(-)
> 
> -- 
> 2.7.4


Re: [PATCH net-next] libbpf: use map_flags when creating maps

2017-09-27 Thread Daniel Borkmann

On 09/27/2017 06:29 PM, Alexei Starovoitov wrote:

On 9/27/17 7:04 AM, Craig Gallek wrote:

From: Craig Gallek 

This extends struct bpf_map_def to include a flags field.  Note that
this has the potential to break the validation logic in
bpf_object__validate_maps and bpf_object__init_maps as they use
sizeof(struct bpf_map_def) as a minimal allowable size of a map section.
Any bpf program compiled with a smaller struct bpf_map_def will fail this
check.

I don't believe this will be an issue in practice as both compile-time
definitions of struct bpf_map_def (in samples/bpf/bpf_load.h and
tools/testing/selftests/bpf/bpf_helpers.h) have always been larger
than this newly updated version in libbpf.h.

Signed-off-by: Craig Gallek 
---
 tools/lib/bpf/libbpf.c | 2 +-
 tools/lib/bpf/libbpf.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 35f6dfcdc565..6bea85f260a3 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -874,7 +874,7 @@ bpf_object__create_maps(struct bpf_object *obj)
   def->key_size,
   def->value_size,
   def->max_entries,
-  0);
+  def->map_flags);
 if (*pfd < 0) {
 size_t j;
 int err = *pfd;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 7959086eb9c9..6e20003109e0 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -207,6 +207,7 @@ struct bpf_map_def {
 unsigned int key_size;
 unsigned int value_size;
 unsigned int max_entries;
+unsigned int map_flags;
 };


yes it will break loading of pre-compiled .o
Instead of breaking, let's fix the loader to do it the way
samples/bpf/bpf_load.c does.
See commit 156450d9d964 ("samples/bpf: make bpf_load.c code compatible with ELF maps section changes")


+1, iproute2 loader also does map spec fixup

For libbpf it would be good also such that it reduces the diff
further between the libbpf and bpf_load so that it allows move
to libbpf for samples in future.


[PATCH] mkiss: remove redundant check on len being zero

2017-09-27 Thread Colin King
From: Colin Ian King 

The check on len is redundant as it is always greater than 1,
so just remove it and make the printk less complex.

Detected by CoverityScan, CID#1226729 ("Logically dead code")

Signed-off-by: Colin Ian King 
---
 drivers/net/hamradio/mkiss.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hamradio/mkiss.c b/drivers/net/hamradio/mkiss.c
index aec6c26563cf..54bf8e6e4a09 100644
--- a/drivers/net/hamradio/mkiss.c
+++ b/drivers/net/hamradio/mkiss.c
@@ -477,7 +477,8 @@ static void ax_encaps(struct net_device *dev, unsigned char *icp, int len)
  cmd = 0;
}
ax->crcauto = (cmd ? 0 : 1);
-   printk(KERN_INFO "mkiss: %s: crc mode %s %d\n", ax->dev->name, (len) ? "set to" : "is", cmd);
+   printk(KERN_INFO "mkiss: %s: crc mode set to %d\n",
+  ax->dev->name, cmd);
}
spin_unlock_bh(>buflock);
netif_start_queue(dev);
-- 
2.14.1



[PATCH net-next 3/5] bpf: libbpf: Provide basic API support to specify BPF obj name

2017-09-27 Thread Martin KaFai Lau
This patch extends the libbpf to provide API support to
allow specifying BPF object name.

In tools/lib/bpf/libbpf, the C symbols of the function
and the map are used.  Regarding section name, all maps are
under the same section named "maps".  Hence, section name
is not a good choice for a map's name.  To be consistent with
maps, bpf_prog follows suit and uses its function symbol as
the prog's name.

This patch adds logic to collect function symbols in libbpf.
There is existing code to collect the map symbols, so no change
is needed.

The bpf_load_program_name() and bpf_map_create_name() are
added to take the name argument.  For the other bpf_map_create_xxx()
variants, a name argument is directly added to them.

In samples/bpf, bpf_load.c in particular, the symbol is also
used as the map's name and the map symbols has already been
collected in the existing code.  For bpf_prog, bpf_load.c does
not collect the function symbol name.  We can consider collecting
them later if there is a need to continue supporting bpf_load.c.

Signed-off-by: Martin KaFai Lau 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 samples/bpf/bpf_load.c  |   2 +
 samples/bpf/map_perf_test_user.c|   1 +
 tools/include/uapi/linux/bpf.h  |  10 +++
 tools/lib/bpf/bpf.c |  57 +++
 tools/lib/bpf/bpf.h |  23 --
 tools/lib/bpf/libbpf.c  | 109 +---
 tools/testing/selftests/bpf/test_verifier.c |   2 +-
 7 files changed, 157 insertions(+), 47 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 6aa50098dfb8..18b1c8dd0391 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -221,6 +221,7 @@ static int load_maps(struct bpf_map_data *maps, int nr_maps,
int inner_map_fd = map_fd[maps[i].def.inner_map_idx];
 
map_fd[i] = bpf_create_map_in_map_node(maps[i].def.type,
+   maps[i].name,
maps[i].def.key_size,
inner_map_fd,
maps[i].def.max_entries,
@@ -228,6 +229,7 @@ static int load_maps(struct bpf_map_data *maps, int nr_maps,
numa_node);
} else {
map_fd[i] = bpf_create_map_node(maps[i].def.type,
+   maps[i].name,
maps[i].def.key_size,
maps[i].def.value_size,
maps[i].def.max_entries,
diff --git a/samples/bpf/map_perf_test_user.c b/samples/bpf/map_perf_test_user.c
index a0310fc70057..519d9af4b04a 100644
--- a/samples/bpf/map_perf_test_user.c
+++ b/samples/bpf/map_perf_test_user.c
@@ -137,6 +137,7 @@ static void do_test_lru(enum test_type test, int cpu)
 
inner_lru_map_fds[cpu] =
bpf_create_map_node(BPF_MAP_TYPE_LRU_HASH,
+   test_map_names[INNER_LRU_HASH_PREALLOC],
sizeof(uint32_t),
sizeof(long),
inner_lru_hash_size, 0,
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e43491ac4823..6d2137b4cf38 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -175,6 +175,8 @@ enum bpf_attach_type {
 /* Specify numa node during map creation */
 #define BPF_F_NUMA_NODE(1U << 2)
 
+#define BPF_OBJ_NAME_LEN 16U
+
 union bpf_attr {
struct { /* anonymous struct used by BPF_MAP_CREATE command */
__u32   map_type;   /* one of enum bpf_map_type */
@@ -188,6 +190,7 @@ union bpf_attr {
__u32   numa_node;  /* numa node (effective only if
 * BPF_F_NUMA_NODE is set).
 */
+   __u8map_name[BPF_OBJ_NAME_LEN];
};
 
struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -210,6 +213,7 @@ union bpf_attr {
__aligned_u64   log_buf;/* user supplied buffer */
__u32   kern_version;   /* checked when prog_type=kprobe */
__u32   prog_flags;
+   __u8prog_name[BPF_OBJ_NAME_LEN];
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -812,6 +816,11 @@ struct bpf_prog_info {
__u32 xlated_prog_len;
__aligned_u64 jited_prog_insns;

[PATCH net-next 2/5] bpf: Add map_name to bpf_map_info

2017-09-27 Thread Martin KaFai Lau
This patch allows userspace to specify a name for a map
during BPF_MAP_CREATE.

The map's name can later be exported to user space
via BPF_OBJ_GET_INFO_BY_FD.

Signed-off-by: Martin KaFai Lau 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 include/linux/bpf.h  | 1 +
 include/uapi/linux/bpf.h | 2 ++
 kernel/bpf/syscall.c | 7 ++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 33ccc474fb04..252f4bc9eb25 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -56,6 +56,7 @@ struct bpf_map {
struct work_struct work;
atomic_t usercnt;
struct bpf_map *inner_map_meta;
+   u8 name[BPF_OBJ_NAME_LEN];
 };
 
 /* function argument constraints */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bd6348269bf5..6d2137b4cf38 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -190,6 +190,7 @@ union bpf_attr {
__u32   numa_node;  /* numa node (effective only if
 * BPF_F_NUMA_NODE is set).
 */
+   __u8map_name[BPF_OBJ_NAME_LEN];
};
 
struct { /* anonymous struct used by BPF_MAP_*_ELEM commands */
@@ -829,6 +830,7 @@ struct bpf_map_info {
__u32 value_size;
__u32 max_entries;
__u32 map_flags;
+   __u8  name[BPF_OBJ_NAME_LEN];
 } __attribute__((aligned(8)));
 
 /* User bpf_sock_ops struct to access socket values and specify request ops
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 45970df3f820..11a7f82a55d1 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -339,7 +339,7 @@ static int bpf_obj_name_cpy(char *dst, const char *src)
return 0;
 }
 
-#define BPF_MAP_CREATE_LAST_FIELD numa_node
+#define BPF_MAP_CREATE_LAST_FIELD map_name
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
 {
@@ -361,6 +361,10 @@ static int map_create(union bpf_attr *attr)
if (IS_ERR(map))
return PTR_ERR(map);
 
+   err = bpf_obj_name_cpy(map->name, attr->map_name);
+   if (err)
+   goto free_map_nouncharge;
+
	atomic_set(&map->refcnt, 1);
	atomic_set(&map->usercnt, 1);
 
@@ -1462,6 +1466,7 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map,
info.value_size = map->value_size;
info.max_entries = map->max_entries;
info.map_flags = map->map_flags;
+   memcpy(info.name, map->name, sizeof(map->name));
 
	if (copy_to_user(uinfo, &info, info_len) ||
	    put_user(info_len, &uattr->info.info_len))
-- 
2.9.5



[PATCH net-next 1/5] bpf: Add name, load_time, uid and map_ids to bpf_prog_info

2017-09-27 Thread Martin KaFai Lau
The patch adds name and load_time to struct bpf_prog_aux.  They
are also exported to bpf_prog_info.

The bpf_prog's name is passed by userspace during BPF_PROG_LOAD.
The kernel only stores the first (BPF_OBJ_NAME_LEN - 1) bytes,
and the name stored in the kernel is always \0 terminated.

The kernel will reject a name that contains characters other than
isalnum() characters and '_'.  It will also reject a name that is
not null terminated.

The existing 'user->uid' of the bpf_prog_aux is also exported to
the bpf_prog_info as created_by_uid.

The existing 'used_maps' of the bpf_prog_aux is exported to
the newly added members 'nr_map_ids' and 'map_ids' of
the bpf_prog_info.  On input, nr_map_ids tells how big
the userspace's map_ids buffer is.  On output, nr_map_ids
reports the exact user_map_cnt, and the kernel copies only
as many ids as the userspace's map_ids buffer allows.

Signed-off-by: Martin KaFai Lau 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 include/linux/bpf.h  |  2 ++
 include/uapi/linux/bpf.h |  8 
 kernel/bpf/syscall.c | 51 +++-
 3 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 2b672c50f160..33ccc474fb04 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -187,6 +187,8 @@ struct bpf_prog_aux {
struct bpf_map **used_maps;
struct bpf_prog *prog;
struct user_struct *user;
+   u64 load_time; /* ns since boottime */
+   u8 name[BPF_OBJ_NAME_LEN];
union {
struct work_struct work;
struct rcu_head rcu;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e43491ac4823..bd6348269bf5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -175,6 +175,8 @@ enum bpf_attach_type {
 /* Specify numa node during map creation */
#define BPF_F_NUMA_NODE	(1U << 2)
 
+#define BPF_OBJ_NAME_LEN 16U
+
 union bpf_attr {
struct { /* anonymous struct used by BPF_MAP_CREATE command */
__u32   map_type;   /* one of enum bpf_map_type */
@@ -210,6 +212,7 @@ union bpf_attr {
__aligned_u64   log_buf;/* user supplied buffer */
__u32   kern_version;   /* checked when prog_type=kprobe */
__u32   prog_flags;
+   __u8prog_name[BPF_OBJ_NAME_LEN];
};
 
struct { /* anonymous struct used by BPF_OBJ_* commands */
@@ -812,6 +815,11 @@ struct bpf_prog_info {
__u32 xlated_prog_len;
__aligned_u64 jited_prog_insns;
__aligned_u64 xlated_prog_insns;
+   __u64 load_time;/* ns since boottime */
+   __u32 created_by_uid;
+   __u32 nr_map_ids;
+   __aligned_u64 map_ids;
+   __u8  name[BPF_OBJ_NAME_LEN];
 } __attribute__((aligned(8)));
 
 struct bpf_map_info {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 25d074920a00..45970df3f820 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -23,6 +23,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #define IS_FD_ARRAY(map) ((map)->map_type == BPF_MAP_TYPE_PROG_ARRAY || \
   (map)->map_type == BPF_MAP_TYPE_PERF_EVENT_ARRAY || \
@@ -312,6 +315,30 @@ int bpf_map_new_fd(struct bpf_map *map)
   offsetof(union bpf_attr, CMD##_LAST_FIELD) - \
   sizeof(attr->CMD##_LAST_FIELD)) != NULL
 
+/* dst and src must have at least BPF_OBJ_NAME_LEN number of bytes.
+ * Return 0 on success and < 0 on error.
+ */
+static int bpf_obj_name_cpy(char *dst, const char *src)
+{
+   const char *end = src + BPF_OBJ_NAME_LEN;
+
+   /* Copy all isalnum() and '_' char */
+   while (src < end && *src) {
+   if (!isalnum(*src) && *src != '_')
+   return -EINVAL;
+   *dst++ = *src++;
+   }
+
+   /* No '\0' found in BPF_OBJ_NAME_LEN number of bytes */
+   if (src == end)
+   return -EINVAL;
+
+   /* '\0' terminates dst */
+   *dst = 0;
+
+   return 0;
+}
+
 #define BPF_MAP_CREATE_LAST_FIELD numa_node
 /* called via syscall */
 static int map_create(union bpf_attr *attr)
@@ -973,7 +1000,7 @@ struct bpf_prog *bpf_prog_get_type(u32 ufd, enum bpf_prog_type type)
 EXPORT_SYMBOL_GPL(bpf_prog_get_type);
 
 /* last field in 'union bpf_attr' used by this command */
-#define	BPF_PROG_LOAD_LAST_FIELD prog_flags
+#define	BPF_PROG_LOAD_LAST_FIELD prog_name
 
 static int bpf_prog_load(union bpf_attr *attr)
 {
@@ -1037,6 +1064,11 @@ static int bpf_prog_load(union bpf_attr *attr)
if (err < 0)
goto free_prog;
 
+   prog->aux->load_time = ktime_get_boot_ns();
+   err = bpf_obj_name_cpy(prog->aux->name, attr->prog_name);
+   if (err)
+   goto free_prog;
+
/* run eBPF verifier */
	err = bpf_check(&prog, attr);

[PATCH net-next 5/5] bpf: Test new fields in bpf_attr and bpf_{prog,map}_info

2017-09-27 Thread Martin KaFai Lau
This patch tests newly added fields of the bpf_attr,
bpf_prog_info and bpf_map_info.

Signed-off-by: Martin KaFai Lau 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 tools/testing/selftests/bpf/test_progs.c | 143 ---
 1 file changed, 132 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 31ae27dc8d04..69427531408d 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 typedef __u16 __sum16;
@@ -19,6 +20,8 @@ typedef __u16 __sum16;
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -273,16 +276,26 @@ static void test_bpf_obj_id(void)
const int nr_iters = 2;
const char *file = "./test_obj_id.o";
const char *jit_sysctl = "/proc/sys/net/core/bpf_jit_enable";
+   const char *expected_prog_name = "test_obj_id";
+   const char *expected_map_name = "test_map_id";
+   const __u64 nsec_per_sec = 1000000000;
 
struct bpf_object *objs[nr_iters];
int prog_fds[nr_iters], map_fds[nr_iters];
/* +1 to test for the info_len returned by kernel */
struct bpf_prog_info prog_infos[nr_iters + 1];
struct bpf_map_info map_infos[nr_iters + 1];
+   /* Each prog only uses one map. +1 to test nr_map_ids
+* returned by kernel.
+*/
+   __u32 map_ids[nr_iters + 1];
char jited_insns[128], xlated_insns[128], zeros[128];
__u32 i, next_id, info_len, nr_id_found, duration = 0;
+   struct timespec real_time_ts, boot_time_ts;
int sysctl_fd, jit_enabled = 0, err = 0;
__u64 array_value;
+   uid_t my_uid = getuid();
+   time_t now, load_time;
 
sysctl_fd = open(jit_sysctl, 0, O_RDONLY);
if (sysctl_fd != -1) {
@@ -307,6 +320,7 @@ static void test_bpf_obj_id(void)
/* Check bpf_obj_get_info_by_fd() */
bzero(zeros, sizeof(zeros));
for (i = 0; i < nr_iters; i++) {
+   now = time(NULL);
err = bpf_prog_load(file, BPF_PROG_TYPE_SOCKET_FILTER,
&objs[i], &prog_fds[i]);
/* test_obj_id.o is a dumb prog. It should never fail
@@ -334,16 +348,18 @@ static void test_bpf_obj_id(void)
  map_infos[i].value_size != sizeof(__u64) ||
  map_infos[i].max_entries != 1 ||
  map_infos[i].map_flags != 0 ||
- info_len != sizeof(struct bpf_map_info),
+ info_len != sizeof(struct bpf_map_info) ||
+ strcmp((char *)map_infos[i].name, expected_map_name),
  "get-map-info(fd)",
- "err %d errno %d type %d(%d) info_len %u(%lu) key_size %u value_size %u max_entries %u map_flags %X\n",
+ "err %d errno %d type %d(%d) info_len %u(%lu) key_size %u value_size %u max_entries %u map_flags %X name %s(%s)\n",
  err, errno,
  map_infos[i].type, BPF_MAP_TYPE_ARRAY,
  info_len, sizeof(struct bpf_map_info),
  map_infos[i].key_size,
  map_infos[i].value_size,
  map_infos[i].max_entries,
- map_infos[i].map_flags))
+ map_infos[i].map_flags,
+ map_infos[i].name, expected_map_name))
goto done;
 
/* Check getting prog info */
@@ -355,8 +371,16 @@ static void test_bpf_obj_id(void)
prog_infos[i].jited_prog_len = sizeof(jited_insns);
prog_infos[i].xlated_prog_insns = ptr_to_u64(xlated_insns);
prog_infos[i].xlated_prog_len = sizeof(xlated_insns);
+   prog_infos[i].map_ids = ptr_to_u64(map_ids + i);
+   prog_infos[i].nr_map_ids = 2;
+   err = clock_gettime(CLOCK_REALTIME, &real_time_ts);
+   assert(!err);
+   err = clock_gettime(CLOCK_BOOTTIME, &boot_time_ts);
+   assert(!err);
err = bpf_obj_get_info_by_fd(prog_fds[i], &prog_infos[i],
 &info_len);
+   load_time = (real_time_ts.tv_sec - boot_time_ts.tv_sec)
+   + (prog_infos[i].load_time / nsec_per_sec);
if (CHECK(err ||
  prog_infos[i].type != BPF_PROG_TYPE_SOCKET_FILTER ||
  info_len != sizeof(struct bpf_prog_info) ||
@@ -364,9 +388,14 @@ static void test_bpf_obj_id(void)
  (jit_enabled &&
   !memcmp(jited_insns, zeros, sizeof(zeros))) ||
  !prog_infos[i].xlated_prog_len ||
- 

[PATCH net-next 0/5] bpf: Extend bpf_{prog,map}_info

2017-09-27 Thread Martin KaFai Lau
This patch series adds more fields to bpf_prog_info and bpf_map_info.
Please see individual patch for details.

Martin KaFai Lau (5):
  bpf: Add name, load_time, uid and map_ids to bpf_prog_info
  bpf: Add map_name to bpf_map_info
  bpf: libbpf: Provide basic API support to specify BPF obj name
  bpf: Swap the order of checking prog_info and map_info
  bpf: Test new fields in bpf_attr and bpf_{prog,map}_info

 include/linux/bpf.h |   3 +
 include/uapi/linux/bpf.h|  10 ++
 kernel/bpf/syscall.c|  58 -
 samples/bpf/bpf_load.c  |   2 +
 samples/bpf/map_perf_test_user.c|   1 +
 tools/include/uapi/linux/bpf.h  |  10 ++
 tools/lib/bpf/bpf.c |  57 ++--
 tools/lib/bpf/bpf.h |  23 +++-
 tools/lib/bpf/libbpf.c  | 109 
 tools/testing/selftests/bpf/test_progs.c| 195 +++-
 tools/testing/selftests/bpf/test_verifier.c |   2 +-
 11 files changed, 385 insertions(+), 85 deletions(-)

-- 
2.9.5



[PATCH net-next 4/5] bpf: Swap the order of checking prog_info and map_info

2017-09-27 Thread Martin KaFai Lau
This patch swaps the checking order.  It now checks the map_info
first and then the prog_info.  It is prep work for adding tests
for the newly added fields (the map_ids field of prog_info in
particular).

Signed-off-by: Martin KaFai Lau 
Acked-by: Alexei Starovoitov 
Acked-by: Daniel Borkmann 
---
 tools/testing/selftests/bpf/test_progs.c | 58 +---
 1 file changed, 30 insertions(+), 28 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_progs.c b/tools/testing/selftests/bpf/test_progs.c
index 11ee25cea227..31ae27dc8d04 100644
--- a/tools/testing/selftests/bpf/test_progs.c
+++ b/tools/testing/selftests/bpf/test_progs.c
@@ -316,6 +316,36 @@ static void test_bpf_obj_id(void)
error_cnt++;
assert(!err);
 
+   /* Insert a magic value to the map */
+   map_fds[i] = bpf_find_map(__func__, objs[i], "test_map_id");
+   assert(map_fds[i] >= 0);
+   err = bpf_map_update_elem(map_fds[i], &array_key,
+ &array_magic_value, 0);
+   assert(!err);
+
+   /* Check getting map info */
+   info_len = sizeof(struct bpf_map_info) * 2;
+   bzero(&map_infos[i], info_len);
+   err = bpf_obj_get_info_by_fd(map_fds[i], &map_infos[i],
+&info_len);
+   if (CHECK(err ||
+ map_infos[i].type != BPF_MAP_TYPE_ARRAY ||
+ map_infos[i].key_size != sizeof(__u32) ||
+ map_infos[i].value_size != sizeof(__u64) ||
+ map_infos[i].max_entries != 1 ||
+ map_infos[i].map_flags != 0 ||
+ info_len != sizeof(struct bpf_map_info),
+ "get-map-info(fd)",
+ "err %d errno %d type %d(%d) info_len %u(%lu) key_size %u value_size %u max_entries %u map_flags %X\n",
+ err, errno,
+ map_infos[i].type, BPF_MAP_TYPE_ARRAY,
+ info_len, sizeof(struct bpf_map_info),
+ map_infos[i].key_size,
+ map_infos[i].value_size,
+ map_infos[i].max_entries,
+ map_infos[i].map_flags))
+   goto done;
+
/* Check getting prog info */
info_len = sizeof(struct bpf_prog_info) * 2;
bzero(&prog_infos[i], info_len);
@@ -347,34 +377,6 @@ static void test_bpf_obj_id(void)
  !!memcmp(xlated_insns, zeros, sizeof(zeros))))
goto done;
 
-   map_fds[i] = bpf_find_map(__func__, objs[i], "test_map_id");
-   assert(map_fds[i] >= 0);
-   err = bpf_map_update_elem(map_fds[i], &array_key,
- &array_magic_value, 0);
-   assert(!err);
-
-   /* Check getting map info */
-   info_len = sizeof(struct bpf_map_info) * 2;
-   bzero(&map_infos[i], info_len);
-   err = bpf_obj_get_info_by_fd(map_fds[i], &map_infos[i],
--&info_len);
-   if (CHECK(err ||
- map_infos[i].type != BPF_MAP_TYPE_ARRAY ||
- map_infos[i].key_size != sizeof(__u32) ||
- map_infos[i].value_size != sizeof(__u64) ||
- map_infos[i].max_entries != 1 ||
- map_infos[i].map_flags != 0 ||
- info_len != sizeof(struct bpf_map_info),
- "get-map-info(fd)",
- "err %d errno %d type %d(%d) info_len %u(%lu) key_size %u value_size %u max_entries %u map_flags %X\n",
- err, errno,
- map_infos[i].type, BPF_MAP_TYPE_ARRAY,
- info_len, sizeof(struct bpf_map_info),
- map_infos[i].key_size,
- map_infos[i].value_size,
- map_infos[i].max_entries,
- map_infos[i].map_flags))
-   goto done;
}
 
/* Check bpf_prog_get_next_id() */
-- 
2.9.5



[PATCH] rndis_host: support Novatel Verizon USB730L

2017-09-27 Thread Aleksander Morgado
Treat the ef/04/01 interface class/subclass/protocol combination used
by the Novatel Verizon USB730L (1410:9030) as a possible RNDIS
interface.

 T:  Bus=01 Lev=02 Prnt=02 Port=01 Cnt=02 Dev#= 17 Spd=480 MxCh= 0
 D:  Ver= 2.00 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  3
 P:  Vendor=1410 ProdID=9030 Rev=03.10
 S:  Manufacturer=Novatel Wireless
 S:  Product=MiFi USB730L
 S:  SerialNumber=0123456789ABCDEF
 C:  #Ifs= 3 Cfg#= 1 Atr=80 MxPwr=500mA
 I:  If#= 0 Alt= 0 #EPs= 1 Cls=ef(misc ) Sub=04 Prot=01 Driver=rndis_host
 I:  If#= 1 Alt= 0 #EPs= 2 Cls=0a(data ) Sub=00 Prot=00 Driver=rndis_host
 I:  If#= 2 Alt= 0 #EPs= 1 Cls=03(HID  ) Sub=00 Prot=00 Driver=usbhid

Once the network interface is brought up, the user just needs to run a
DHCP client to get the IP address and routing set up.

As a side note, other Novatel Verizon USB730L models with the same
vid:pid end up exposing a standard ECM interface which doesn't require
any other kernel update to make it work.

Signed-off-by: Aleksander Morgado 
---

Hey,

I'm not sure if binding this logic to a specific vid:pid (1410:9030) would be 
more appropriate here, or if it's ok to just bind class/subclass/protocol (as 
in the activesync case).
Let me know what you think.

---
 drivers/net/usb/cdc_ether.c  | 11 ++-
 drivers/net/usb/rndis_host.c |  4 
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index 8ab281b478f2..2df0bcc6d30b 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -54,11 +54,19 @@ static int is_wireless_rndis(struct usb_interface_descriptor *desc)
desc->bInterfaceProtocol == 3);
 }

+static int is_novatel_rndis(struct usb_interface_descriptor *desc)
+{
+   return (desc->bInterfaceClass == USB_CLASS_MISC &&
+   desc->bInterfaceSubClass == 4 &&
+   desc->bInterfaceProtocol == 1);
+}
+
 #else

 #define is_rndis(desc) 0
 #define is_activesync(desc)0
 #define is_wireless_rndis(desc)0
+#define is_novatel_rndis(desc) 0

 #endif

@@ -150,7 +158,8 @@ int usbnet_generic_cdc_bind(struct usbnet *dev, struct usb_interface *intf)
 */
	rndis = (is_rndis(&intf->cur_altsetting->desc) ||
		 is_activesync(&intf->cur_altsetting->desc) ||
-		 is_wireless_rndis(&intf->cur_altsetting->desc));
+		 is_wireless_rndis(&intf->cur_altsetting->desc) ||
+		 is_novatel_rndis(&intf->cur_altsetting->desc));

memset(info, 0, sizeof(*info));
info->control = intf;
diff --git a/drivers/net/usb/rndis_host.c b/drivers/net/usb/rndis_host.c
index a151f267aebb..b807c91abe1d 100644
--- a/drivers/net/usb/rndis_host.c
+++ b/drivers/net/usb/rndis_host.c
@@ -632,6 +632,10 @@ static const struct usb_device_id  products [] = {
/* RNDIS for tethering */
USB_INTERFACE_INFO(USB_CLASS_WIRELESS_CONTROLLER, 1, 3),
	.driver_info = (unsigned long) &rndis_info,
+}, {
+   /* Novatel Verizon USB730L */
+   USB_INTERFACE_INFO(USB_CLASS_MISC, 4, 1),
+   .driver_info = (unsigned long) &rndis_info,
 },
{ },// END
 };
--
2.14.1


Re: [next-queue PATCH 2/3] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-09-27 Thread Vinicius Costa Gomes
Hi,

Cong Wang  writes:

> On Tue, Sep 26, 2017 at 4:39 PM, Vinicius Costa Gomes
>  wrote:
>> +static int cbs_init(struct Qdisc *sch, struct nlattr *opt)
>> +{
>> +   struct cbs_sched_data *q = qdisc_priv(sch);
>> +   struct net_device *dev = qdisc_dev(sch);
>> +
>> +   if (!opt)
>> +   return -EINVAL;
>> +
>> +   /* FIXME: this means that we can only install this qdisc
>> +* "under" mqprio. Do we need a more generic way to retrieve
>> +* the queue, or do we pass the netdev_queue to the driver?
>> +*/
>> +   q->queue = TC_H_MIN(sch->parent) - 1 - netdev_get_num_tc(dev);
>> +
>> +   return cbs_change(sch, opt);
>> +}
>
> Yeah it is ugly to assume its parent is mqprio, at least you should
> error out if it is not the case.

Will add an error for this, for now.

>
> I am not sure how we can solve this elegantly, perhaps you should
> extend mqprio rather than add a new one?

Is the alternative hinted in the FIXME worse? Instead of passing the
index of the hardware queue to the driver we pass the pointer to a
netdev_queue to the driver and it "discovers" the HW queue from that.


Cheers,
--
Vinicius


Re: [PATCH net] ipv6: remove incorrect WARN_ON() in fib6_del()

2017-09-27 Thread Wei Wang
On Tue, Sep 26, 2017 at 6:20 AM, Eric Dumazet  wrote:
> On Mon, Sep 25, 2017 at 10:52 PM, Wei Wang  wrote:
>> On Mon, Sep 25, 2017 at 7:23 PM, Eric Dumazet  wrote:
>>> On Mon, Sep 25, 2017 at 7:07 PM, Martin KaFai Lau  wrote:
>>>
 I am probably still missing something.

 Considering the del operation should be under the writer lock,
 if rt->rt6i_node should be NULL (for rt that has already been
 removed from fib6), why this WARN_ON() is triggered?

 An example may help.

>>>
>>> Look at the stack trace, you'll find the answers...
>>>
>>> ip6_link_failure() -> ip6_del_rt()
>>>
>>> Note that rt might have been deleted from the _tree_ already.
>>
>> Had a brief talk with Martin.
>> He has a valid point.
>> The current WARN_ON() code is as follows:
>> #if RT6_DEBUG >= 2
>>if (rt->dst.obsolete > 0) {
>>WARN_ON(fn);
>>return -ENOENT;
>>}
>> #endif
>>
>> The WARN_ON() only triggers when fn is not NULL. (I missed it before.)
>> In theory, fib6_del() calls fib6_del_route() which should set
>> rt->rt6i_node to NULL and rt->dst.obsolete to DST_OBSOLETE_DEAD within
>> the same write_lock session.
>> If those 2 values are inconsistent, it indicates something is wrong.
>> Will need more time to root cause the issue.
>>
>> Please ignore this patch. Sorry about the confusion.
>
> Oh well, for some reason I was seeing WARN_ON(1)  here, since this is
> a construct I often add in my tests ...

Just an update on this issue:
This WARNING issue should already be fixed by commit
7483cea79957312e9f8e9cf760a1bc5d6c507113:
Author: Ido Schimmel 
Date:   Thu Aug 3 13:28:22 2017 +0200

ipv6: fib: Unlink replaced routes from their nodes

When a route is deleted its node pointer is set to NULL to indicate it's
no longer linked to its node. Do the same for routes that are replaced.

This will later allow us to test if a route is still in the FIB by
checking its node pointer instead of its reference count.

Signed-off-by: Ido Schimmel 
Signed-off-by: Jiri Pirko 
Signed-off-by: David S. Miller 

So no further action is needed on this.

Thanks.
Wei


Re: [PATCH v2 2/2] ip_tunnel: add mpls over gre encapsulation

2017-09-27 Thread kbuild test robot
Hi Amine,

[auto build test ERROR on net/master]
[also build test ERROR on v4.14-rc2 next-20170927]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Amine-Kherbouche/mpls-expose-stack-entry-function/20170928-012842
config: x86_64-rhel (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

>> ERROR: "mpls_forward" [net/ipv4/gre.ko] undefined!

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




RTL8169 vs low-latency (was: Re: Re: RTL 8169 linux driver question)

2017-09-27 Thread Kirill Smelkov
+ klaus, ivan, ...

On Mon, Nov 26, 2012 at 09:15:19AM -, David Laight wrote:
> On Fri, Nov 23, 2012 at 08:14:37PM +0100, Stéphane ANCELOT wrote:
> > I had problem with it, my application sends a frame that is immediately
> > transmitted back by some slaves, there was abnormally 100us  lost
> > between the send and receive call.
> > 
> > Finally I found it was coming from the following register setup in the
> > driver :
> > 
> > RTL_W16(IntrMitigate, 0x5151);
> > 
> > Can you give me some details about it, since I do not have the RTL8169
> > programming guide.
> 
> That sounds like an 'interrupt mitigation' setting - which will cause
> RX interrupts to be delayed a short time in order to reduce the
> interrupt load on the kernel.
> 
> There is usually an 'ethtool' setting to disable interrupt mitigation.

On Wed, Nov 28, 2012 at 10:32:12AM +0800, hayeswang wrote:
> Francois Romieu [mailto:rom...@fr.zoreil.com] 
> [...]
> > Something like the patch below against net-next could help once I will
> > have tested it.
> > 
> > I completely guessed the Tx usec scale factor at gigabit 
> > speed (125 us, 
> > 100 us, disabled, who knows ?) and I have no idea which 
> > specific chipsets
> > it should work with.
> > 
> > Hayes, may I expect some hindsight regarding:
> > 1 - the availability of the IntrMitigate (0xe2) register through the
> > 8169, 8168 and 810x line of chipsets
> 
> 8169, 8168, and 8136(810x) serial chipsets support it.
> 
> > 2 - the Tx timer unit at gigabit speed
> 
> The unit of the timer depneds on both the speed and the setting of CPlusCmd
> (0xe0) bit 1 and bit 0.
> 
> For 8169
> bit[1:0] \ speed  1000M   100M10M
> 0 0   320ns   2.56us  40.96us
> 0 1   2.56us  20.48us 327.7us
> 1 0   5.12us  40.96us 655.4us
> 1 1   10.24us 81.92us 1.31ms
> 
> For the other
> bit[1:0] \ speed  1000M   100M10M
> 0 0   5us 2.56us  40.96us
> 0 1   40us20.48us 327.7us
> 1 0   80us40.96us 655.4us
> 1 1   160us   81.92us 1.31ms

On Thu, Nov 29, 2012 at 12:18:22AM +0100, Francois Romieu wrote:
> David Laight  :
> [David's life]
> 
> 
> The version below fixes several bugs and refuses the frame or timing
> values it can't set. Hayes's Tx parameters still need to be pluged
> into rtl_coalesce_scale.
> 
> Rx delays seem lower than what I had expected when testing with a 8168b
> (XID 1800).
> 
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 248f883..d2594b1 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -349,6 +349,12 @@ enum rtl_registers {
>   RxMaxSize   = 0xda,
>   CPlusCmd= 0xe0,
>   IntrMitigate= 0xe2,
> +
> +#define RTL_COALESCE_MASK	0x0f
> +#define RTL_COALESCE_SHIFT	4
> +#define RTL_COALESCE_T_MAX	(RTL_COALESCE_MASK)
> +#define RTL_COALESCE_FRAME_MAX	(RTL_COALESCE_MASK << 2)
> +
>
> [...]

Hello up there. Let me chime in on this somewhat old thread.

Like Stéphane I care about timings. It is not real-time but in my case network
round-trip latency almost directly translates into client-server
database request/response time and thus it significantly affects
throughput for workloads with many serially performed requests.

We have many computers with gigabit Realtek NICs. For 2 such computers
connected to a gigabit store-and-forward switch the minimum round-trip
time for small pings (`ping -i 0 -w 3 -s 56 -q peer`) is ~ 30μs.

However it turned out that when Ethernet frame length transitions 127 ->
128 bytes (`ping -i 0 -w 3 -s {81 -> 82} -q peer`) the lowest RTT
transitions step-wise to ~ 270μs.

As David said, this is RX interrupt mitigation done by the NIC, which
creates the latency. For workloads where low latency is required, with
e.g. Intel, BCM etc. NIC drivers one just uses `ethtool -C rx-usecs ...`
to reduce the time the NIC delays before interrupting the CPU, but it
turned out `ethtool -C` is not supported by the r8169 driver.

Like Stéphane I've traced the problem down to IntrMitigate being
hardcoded to != 0 for our chips (we have 8168 based NICs):

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/realtek/r8169.c#n5460
static void rtl_hw_start_8169(struct net_device *dev) {
...
/*
 * Undocumented corner. Supposedly:
 * (TxTimer << 12) | (TxPackets << 8) | (RxTimer << 4) | RxPackets
 */ 
	RTL_W16(IntrMitigate, 0x0000);

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/ethernet/realtek/r8169.c#n6346
static void rtl_hw_start_8168(struct net_device *dev) {
...
RTL_W16(IntrMitigate, 0x5151);

and then I've also found this thread.

So could we please finally get 

Re: [next-queue PATCH 2/3] net/sched: Introduce Credit Based Shaper (CBS) qdisc

2017-09-27 Thread Cong Wang
On Tue, Sep 26, 2017 at 4:39 PM, Vinicius Costa Gomes
 wrote:
> +static int cbs_init(struct Qdisc *sch, struct nlattr *opt)
> +{
> +   struct cbs_sched_data *q = qdisc_priv(sch);
> +   struct net_device *dev = qdisc_dev(sch);
> +
> +   if (!opt)
> +   return -EINVAL;
> +
> +   /* FIXME: this means that we can only install this qdisc
> +* "under" mqprio. Do we need a more generic way to retrieve
> +* the queue, or do we pass the netdev_queue to the driver?
> +*/
> +   q->queue = TC_H_MIN(sch->parent) - 1 - netdev_get_num_tc(dev);
> +
> +   return cbs_change(sch, opt);
> +}

Yeah it is ugly to assume its parent is mqprio, at least you should
error out if it is not the case.

I am not sure how we can solve this elegantly, perhaps you should
extend mqprio rather than add a new one?


Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware

2017-09-27 Thread Andrew Lunn
On Wed, Sep 27, 2017 at 12:33:29PM -0700, Florian Fainelli wrote:
> On 09/27/2017 12:19 PM, Vivien Didelot wrote:
> > Hi Andrew, Florian,
> > 
> > Andrew Lunn  writes:
> > 
> >> It took me a while to make this work with CONFIG_BRIDGE_VLAN_FILTERING
> >> enabled. Any change to enable hardware flooding needs careful testing
> >> for lots of different configurations. This is another reason i don't
> >> want to do it at the DSA level, until we have a good understanding
> >> what it means in each individual driver.
> > 
> > Then if we are worried about how broadcast flooding is handled on
> > different switches, we can provide a new .flood_broadcast(ds, vid)
> > switch operation for the drivers to implement.
> 
> We don't really have a good visibility on the number of switches
> requiring special configuration for broadcast addresses nor how this
> would have to happen so it would be a tad difficult to define an
> appropriate API with a single user.

Yes, i agree with this. We should wait before adding a generic
solution. I want to wait until a few drivers do whatever is needed for
hardware broadcast. We can then see what is common, and what is
different, find an API to suit and do some refactoring.

   Andrew


Re: [PATCH net-next 2/6] net: dsa: {e}dsa: set offload_fwd_mark on received packets

2017-09-27 Thread Andrew Lunn
On Wed, Sep 27, 2017 at 12:46:35PM -0700, Florian Fainelli wrote:
> On 09/26/2017 03:25 PM, Andrew Lunn wrote:
> > The software bridge needs to know if a packet has already been bridged
> > by hardware offload to ports in the same hardware offload, in order
> > that it does not re-flood them, causing duplicates. This is
> > particularly true for broadcast and multicast traffic which the host
> > has requested.
> > 
> > By setting offload_fwd_mark in the skb the bridge will only flood to
> > ports in other offloads and other netifs. Set this flag in the DSA and
> > EDSA tag driver.
> 
> Is not there some kind of forwarding code/reason code being provided in
> the EDSA/DSA tag that tell you why this packet was sent to the CPU in
> the first place?

Hi Florian

There are some codes, but nothing specific to broadcast, or ATU
misses. I'm also trying to keep the code generic so it could be a
template for other drivers. Many of the tagging schemes don't provide
a reason code. So i want any frame that comes from the switch to have
no need to go back to the switch. KISS.
 
> What is the impact on non-broadcast traffic, e.g: multicast and unicast?

The bridge uses this flag when flooding. unicast traffic from the
switch should not need flooding. Either it is known in the switch and
hence won't be forwarded to the host, or it is unknown in the switch,
so it probably is on some other interface.

My testing with multicast has not shown issues. The switch pushes down
mdb entries, which causes frames to be replicated out ports. So again,
there should not be a need to pass the frame back to the switch. But
it is possible i missed a corner case or two...
 
> Nit: I don't really have a solution on how to order patches, but until
> the next 4 patches get in, I suppose we temporarily have broadcast
> flooding by the bridge "broken"? Ordering in the opposite way would
> probably result in an equally bad situation so...

Yes, it is an issue. I could put this patch last. We then get
duplication of broadcast...

Which is the lesser of two evils?

  Andrew


Re: [PATCH net-next 2/6] net: dsa: {e}dsa: set offload_fwd_mark on received packets

2017-09-27 Thread Florian Fainelli
On 09/26/2017 03:25 PM, Andrew Lunn wrote:
> The software bridge needs to know if a packet has already been bridged
> by hardware offload to ports in the same hardware offload, in order
> that it does not re-flood them, causing duplicates. This is
> particularly true for broadcast and multicast traffic which the host
> has requested.
> 
> By setting offload_fwd_mark in the skb the bridge will only flood to
> ports in other offloads and other netifs. Set this flag in the DSA and
> EDSA tag driver.

Is not there some kind of forwarding code/reason code being provided in
the EDSA/DSA tag that tell you why this packet was sent to the CPU in
the first place?

What is the impact on non-broadcast traffic, e.g: multicast and unicast?

Nit: I don't really have a solution on how to order patches, but until
the next 4 patches get in, I suppose we temporarily have broadcast
flooding by the bridge "broken"? Ordering in the opposite way would
probably result in an equally bad situation so...

> 
> Signed-off-by: Andrew Lunn 
> ---
> 
> v2
> --
> For the moment, do this in the tag drivers, not the generic code.
> Once we get more test results from other switches, maybe move it back
> again.
> ---
>  net/dsa/tag_dsa.c  | 1 +
>  net/dsa/tag_edsa.c | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/net/dsa/tag_dsa.c b/net/dsa/tag_dsa.c
> index fbf9ca954773..ea6ada9d5016 100644
> --- a/net/dsa/tag_dsa.c
> +++ b/net/dsa/tag_dsa.c
> @@ -154,6 +154,7 @@ static struct sk_buff *dsa_rcv(struct sk_buff *skb, 
> struct net_device *dev,
>   }
>  
>   skb->dev = ds->ports[source_port].netdev;
> + skb->offload_fwd_mark = 1;
>  
>   return skb;
>  }
> diff --git a/net/dsa/tag_edsa.c b/net/dsa/tag_edsa.c
> index 76367ba1b2e2..a961b22a7018 100644
> --- a/net/dsa/tag_edsa.c
> +++ b/net/dsa/tag_edsa.c
> @@ -173,6 +173,7 @@ static struct sk_buff *edsa_rcv(struct sk_buff *skb, 
> struct net_device *dev,
>   }
>  
>   skb->dev = ds->ports[source_port].netdev;
> + skb->offload_fwd_mark = 1;
>  
>   return skb;
>  }
> 


-- 
Florian


Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware

2017-09-27 Thread Florian Fainelli
On 09/27/2017 12:19 PM, Vivien Didelot wrote:
> Hi Andrew, Florian,
> 
> Andrew Lunn  writes:
> 
>> It took me a while to make this work with CONFIG_BRIDGE_VLAN_FILTERING
>> enabled. Any change to enable hardware flooding needs careful testing
>> for lots of different configurations. This is another reason I don't
>> want to do it at the DSA level, until we have a good understanding
>> what it means in each individual driver.
> 
> Then if we are worried about how broadcast flooding is handled on
> different switches, we can provide a new .flood_broadcast(ds, vid)
> switch operation for the drivers to implement.

We don't really have a good visibility on the number of switches
requiring special configuration for broadcast addresses nor how this
would have to happen, so it would be a tad difficult to define an
appropriate API with a single user.

In general, single user "generic" facilities tend to be biased towards
their particular problem space (cf. devlink) so a generic interface to
call into HW specific details does not usually sell well...
-- 
Florian


Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware

2017-09-27 Thread Vivien Didelot
Hi Andrew, Florian,

Andrew Lunn  writes:

> It took me a while to make this work with CONFIG_BRIDGE_VLAN_FILTERING
> enabled. Any change to enable hardware flooding needs careful testing
> for lots of different configurations. This is another reason I don't
> want to do it at the DSA level, until we have a good understanding
> what it means in each individual driver.

Then if we are worried about how broadcast flooding is handled on
different switches, we can provide a new .flood_broadcast(ds, vid)
switch operation for the drivers to implement.


Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware

2017-09-27 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

>> Adding the broadcast address to an Ethernet switch's FDB is pretty
>> generic and mv88e6xxx mustn't be the only driver doing this.
>
> Actually, it is. All the others seem to do this in hardware without
> needing an FDB. Since mv88e6xxx is the only one requiring it, it has
> to be done in the mv88e6xxx driver.

Adding the broadcast address from the DSA layer wouldn't hurt and would
make things pretty obvious. It would also avoid drivers getting
unnecessarily complex. A .port_vlan_add implementation must remain
simple and mustn't do more than adding a VLAN entry.

Don't forget that we want the DSA drivers to be dumb and have the core
logic of Ethernet switch handling reside in the DSA core itself.

If some switch chips can flood broadcast without an FDB entry, good for
them, they can skip it. We will have the same issue for special L2
multicast destination addresses: some switches have special bits to
treat them as management, while others don't and require loading the
ATU with them.

Regarding Marvell, what value do you have for the global FloodBC bit
(Global 2, offset 0x05)?


Thanks,

Vivien


Re: [PATCH net-next] ipv6: do lazy dst->__use update when per cpu dst is available

2017-09-27 Thread Martin KaFai Lau
On Wed, Sep 27, 2017 at 06:03:33PM +, Eric Dumazet wrote:
> >> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> 
> >
> > Hi Paolo,
> >
> > Eric and I discussed about this issue recently as well :).
> >
> > What about the following change:
> >
> > diff --git a/include/net/dst.h b/include/net/dst.h
> > index 93568bd0a352..33e1d86bcef6 100644
> > --- a/include/net/dst.h
> > +++ b/include/net/dst.h
> > @@ -258,14 +258,18 @@ static inline void dst_hold(struct dst_entry *dst)
> >  static inline void dst_use(struct dst_entry *dst, unsigned long time)
> >  {
> > dst_hold(dst);
> > -   dst->__use++;
> > -   dst->lastuse = time;
> > +   if (dst->lastuse != time) {
> > +   dst->__use++;
> > +   dst->lastuse = time;
> > +   }
> >  }
> >
> >  static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
> >  {
> > -   dst->__use++;
> > -   dst->lastuse = time;
> > +   if (dst->lastuse != time) {
> > +   dst->__use++;
> > +   dst->lastuse = time;
> > +   }
> >  }
> >
> >  static inline struct dst_entry *dst_clone(struct dst_entry *dst)
> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> > index 26cc9f483b6d..e195f093add3 100644
> > --- a/net/ipv6/route.c
> > +++ b/net/ipv6/route.c
> > @@ -1170,8 +1170,7 @@ struct rt6_info *ip6_pol_route(struct net *net,
> > struct fib6_table *table,
> >
> > struct rt6_info *pcpu_rt;
> >
> > -   rt->dst.lastuse = jiffies;
> > -   rt->dst.__use++;
> > +   dst_use_noref(rt, jiffies);
> > pcpu_rt = rt6_get_pcpu_route(rt);
> >
> > if (pcpu_rt) {
> >
> >
> > This way we always only update dst->__use and dst->lastuse at most
> > once per jiffy. And we don't really need to update pcpu and then do
> > the copy over from pcpu_rt to rt operation.
> >
> > Another thing is that I don't really see any places making use of
> > dst->__use. So maybe we can also get rid of this dst->__use field?
> >
> > Thanks.
> > Wei
> 
> Paolo, given we are very close to sending Wei's awesome work about IPv6
> routing cache,
> could we ask you to wait a few days before doing the same work from your side?
> 
> Main issue is the rwlock, and we are converting it to full RCU.
+1

We can get a better picture of other optimizations once
the rwlock is removed.

> 
> You are sending patches that are making our job very difficult IMO.
> 
> We chose to mimic the change done in neighbour code years ago
> ( 0ed8ddf4045fcfcac36bad753dc4046118c603ec )
> 
> Thanks.


Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware

2017-09-27 Thread Andrew Lunn
> What if I don't have CONFIG_BRIDGE_VLAN_FILTERING enabled, what happens
> in that case, would not this result in not programming the broadcast
> address?

Hi Florian

It took me a while to make this work with CONFIG_BRIDGE_VLAN_FILTERING
enabled. Any change to enable hardware flooding needs careful testing
for lots of different configurations. This is another reason I don't
want to do it at the DSA level, until we have a good understanding
what it means in each individual driver.

 Andrew


Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware

2017-09-27 Thread Andrew Lunn
> Adding the broadcast address to an Ethernet switch's FDB is pretty
> generic and mv88e6xxx mustn't be the only driver doing this.

Actually, it is. All the others seem to do this in hardware without
needing an FDB. Since mv88e6xxx is the only one requiring it, it has
to be done in the mv88e6xxx driver.

   Andrew


Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware

2017-09-27 Thread Florian Fainelli
On 09/27/2017 11:24 AM, Vivien Didelot wrote:
> Hi Andrew,
> 
> Andrew Lunn  writes:
> 
>> +static int mv88e6xxx_port_add_broadcast(struct mv88e6xxx_chip *chip, int 
>> port,
>> +u16 vid)
>> +{
>> +const char broadcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
>> +
>> +return mv88e6xxx_port_db_load_purge(
>> +chip, port, broadcast, vid,
>> +MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC);
> 
> Please don't do this. This is not a valid coding style and has already
> shown to be a bad example for other DSA drivers copying mv88e6xxx.
> 
> Simply declare u8 state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC above...
> 
>> +}
>> +
>> +static int mv88e6xxx_broadcast_setup(struct mv88e6xxx_chip *chip, u16 vid)
>> +{
>> +int port;
>> +int err;
>> +
>> +for (port = 0; port < mv88e6xxx_num_ports(chip); ++port) {
>> +err = mv88e6xxx_port_add_broadcast(chip, port, vid);
>> +if (err)
>> +return err;
>> +}
>> +
>> +return 0;
>> +}
>> +
>>  static int _mv88e6xxx_port_vlan_add(struct mv88e6xxx_chip *chip, int port,
>>  u16 vid, u8 member)
>>  {
>> @@ -1247,7 +1271,11 @@ static int _mv88e6xxx_port_vlan_add(struct 
>> mv88e6xxx_chip *chip, int port,
>>  
>>  vlan.member[port] = member;
>>  
>> -return mv88e6xxx_vtu_loadpurge(chip, &vlan);
>> +err = mv88e6xxx_vtu_loadpurge(chip, &vlan);
>> +if (err)
>> +return err;
>> +
>> +return mv88e6xxx_broadcast_setup(chip, vid);
>>  }
>>  
>>  static void mv88e6xxx_port_vlan_add(struct dsa_switch *ds, int port,
>> @@ -2025,6 +2053,10 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
>>  if (err)
>>  goto unlock;
>>  
>> +err = mv88e6xxx_broadcast_setup(chip, 0);
>> +if (err)
>> +goto unlock;
>> +
>>  err = mv88e6xxx_pot_setup(chip);
>>  if (err)
>>  goto unlock;
> 
> 
> Adding the broadcast address to an Ethernet switch's FDB is pretty
> generic and mv88e6xxx mustn't be the only driver doing this.

I have not had time to test Andrew's IGMP patch set on bcm_sf2/b53, but
while I agree that adding a broadcast address using an FDB entry is
generic in principle, we don't know yet whether other switch drivers need
that or not, so until we do, it seems like Andrew's approach is
appropriate here by keeping this local to mv88e6xxx.

> 
> They do not have to care about the broadcast address, this is just a
> standard MDB entry for them. This must be moved up to the DSA layer.
> 
> Adding the broadcast address in dsa_port_vlan_add and dsa_port_enable
> like this should be enough: http://ix.io/AmS
> 

What if I don't have CONFIG_BRIDGE_VLAN_FILTERING enabled, what happens
in that case, would not this result in not programming the broadcast
address?
-- 
Florian


Re: [patch net-next v3 00/12] mlxsw: Add support for offloading IPv4 multicast routes

2017-09-27 Thread David Miller
From: Jiri Pirko 
Date: Wed, 27 Sep 2017 08:23:10 +0200

> This patch-set introduces offloading of the kernel IPv4 multicast router
> logic in the Spectrum driver.

Series applied, thank you.


Re: [PATCH net-next 0/6] mv88e6xxx broadcast flooding in hardware

2017-09-27 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

> This patchset makes the mv88e6xxx driver perform flooding in hardware,
> rather than let the software bridge perform the flooding. This is a
> prerequisite for IGMP snooping on the bridge interface.
>
> In order to make hardware broadcasting work, a few other issues need
> fixing or improving. SWITCHDEV_ATTR_ID_PORT_PARENT_ID is broken, which
> is apparent when testing on the ZII devel board with multiple
> switches.
>
> Some of these patches are taken from a previous RFC patchset of IGMP
> support. Hence the v2 comments...

mlxsw and others return BR_FLOOD and BR_MCAST_FLOOD in
SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT; is this something DSA needs
here?


Thanks,

Vivien


Re: [PATCH net-next 6/6] net: dsa: mv88e6xxx: Flood broadcast frames in hardware

2017-09-27 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

> +static int mv88e6xxx_port_add_broadcast(struct mv88e6xxx_chip *chip, int 
> port,
> + u16 vid)
> +{
> + const char broadcast[6] = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff };
> +
> + return mv88e6xxx_port_db_load_purge(
> + chip, port, broadcast, vid,
> + MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC);

Please don't do this. This is not a valid coding style and has already
shown to be a bad example for other DSA drivers copying mv88e6xxx.

Simply declare u8 state = MV88E6XXX_G1_ATU_DATA_STATE_MC_STATIC above...

> +}
> +
> +static int mv88e6xxx_broadcast_setup(struct mv88e6xxx_chip *chip, u16 vid)
> +{
> + int port;
> + int err;
> +
> + for (port = 0; port < mv88e6xxx_num_ports(chip); ++port) {
> + err = mv88e6xxx_port_add_broadcast(chip, port, vid);
> + if (err)
> + return err;
> + }
> +
> + return 0;
> +}
> +
>  static int _mv88e6xxx_port_vlan_add(struct mv88e6xxx_chip *chip, int port,
>   u16 vid, u8 member)
>  {
> @@ -1247,7 +1271,11 @@ static int _mv88e6xxx_port_vlan_add(struct 
> mv88e6xxx_chip *chip, int port,
>  
>   vlan.member[port] = member;
>  
> - return mv88e6xxx_vtu_loadpurge(chip, &vlan);
> + err = mv88e6xxx_vtu_loadpurge(chip, &vlan);
> + if (err)
> + return err;
> +
> + return mv88e6xxx_broadcast_setup(chip, vid);
>  }
>  
>  static void mv88e6xxx_port_vlan_add(struct dsa_switch *ds, int port,
> @@ -2025,6 +2053,10 @@ static int mv88e6xxx_setup(struct dsa_switch *ds)
>   if (err)
>   goto unlock;
>  
> + err = mv88e6xxx_broadcast_setup(chip, 0);
> + if (err)
> + goto unlock;
> +
>   err = mv88e6xxx_pot_setup(chip);
>   if (err)
>   goto unlock;


Adding the broadcast address to an Ethernet switch's FDB is pretty
generic and mv88e6xxx mustn't be the only driver doing this.

They do not have to care about the broadcast address, this is just a
standard MDB entry for them. This must be moved up to the DSA layer.

Adding the broadcast address in dsa_port_vlan_add and dsa_port_enable
like this should be enough: http://ix.io/AmS


Thanks,

Vivien


Re: [PATCH net-next] ipv6: do lazy dst->__use update when per cpu dst is available

2017-09-27 Thread Cong Wang
On Wed, Sep 27, 2017 at 10:44 AM, Wei Wang  wrote:
> Another thing is that I don't really see any places making use of
> dst->__use. So maybe we can also get rid of this dst->__use field?

It is used in rtnl_put_cacheinfo():

	struct rta_cacheinfo ci = {
		.rta_lastuse = jiffies_delta_to_clock_t(jiffies - dst->lastuse),
		.rta_used = dst->__use,
		.rta_clntref = atomic_read(&(dst->__refcnt)),
		.rta_error = error,
		.rta_id = id,
	};


Re: [PATCH v2 16/16] net: Add support for networking over Thunderbolt cable

2017-09-27 Thread David Miller
From: Mika Westerberg 
Date: Wed, 27 Sep 2017 20:27:02 +0300

> On Wed, Sep 27, 2017 at 09:27:09AM -0700, David Miller wrote:
>> From: Mika Westerberg 
>> Date: Wed, 27 Sep 2017 16:42:38 +0300
>> 
>> > Using build_skb() then would require to allocate larger buffer, that
>> > includes NET_SKB_PAD + SKB_DATA_ALIGN(skb_shared_info) and that exceeds
>> > page size. Is this something supported by build_skb()? It was not clear
>> > to me based on the code and other users of build_skb() but I may be
>> > missing something.
>> 
>> You need NET_SKB_PAD before and SKB_DATA_ALIGN(skb_shared_info) afterwards.
>> An order 1 page, if that's what you need, should work just fine.
> 
> I mean in order to fit a single ThunderboltIP frame, I would need to
> allocate NET_SKB_PAD+4096+SKB_DATA_ALIGN(skb_shared_info) size buffer.

Which would be an order 1 page or 8192 bytes.

> Is that still fine for build_skb()? Also can I use that with
> skb_add_rx_frag() which seem to take single page?

Again, an order 1 page should work fine.


Re: [PATCH net-next] ipv6: do lazy dst->__use update when per cpu dst is available

2017-09-27 Thread Eric Dumazet
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c

>
> Hi Paolo,
>
> Eric and I discussed about this issue recently as well :).
>
> What about the following change:
>
> diff --git a/include/net/dst.h b/include/net/dst.h
> index 93568bd0a352..33e1d86bcef6 100644
> --- a/include/net/dst.h
> +++ b/include/net/dst.h
> @@ -258,14 +258,18 @@ static inline void dst_hold(struct dst_entry *dst)
>  static inline void dst_use(struct dst_entry *dst, unsigned long time)
>  {
> dst_hold(dst);
> -   dst->__use++;
> -   dst->lastuse = time;
> +   if (dst->lastuse != time) {
> +   dst->__use++;
> +   dst->lastuse = time;
> +   }
>  }
>
>  static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
>  {
> -   dst->__use++;
> -   dst->lastuse = time;
> +   if (dst->lastuse != time) {
> +   dst->__use++;
> +   dst->lastuse = time;
> +   }
>  }
>
>  static inline struct dst_entry *dst_clone(struct dst_entry *dst)
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 26cc9f483b6d..e195f093add3 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1170,8 +1170,7 @@ struct rt6_info *ip6_pol_route(struct net *net,
> struct fib6_table *table,
>
> struct rt6_info *pcpu_rt;
>
> -   rt->dst.lastuse = jiffies;
> -   rt->dst.__use++;
> +   dst_use_noref(rt, jiffies);
> pcpu_rt = rt6_get_pcpu_route(rt);
>
> if (pcpu_rt) {
>
>
> This way we always only update dst->__use and dst->lastuse at most
> once per jiffy. And we don't really need to update pcpu and then do
> the copy over from pcpu_rt to rt operation.
>
> Another thing is that I don't really see any places making use of
> dst->__use. So maybe we can also get rid of this dst->__use field?
>
> Thanks.
> Wei

Paolo, given we are very close to sending Wei's awesome work about IPv6
routing cache,
could we ask you to wait a few days before doing the same work from your side?

Main issue is the rwlock, and we are converting it to full RCU.

You are sending patches that are making our job very difficult IMO.

We chose to mimic the change done in neighbour code years ago
( 0ed8ddf4045fcfcac36bad753dc4046118c603ec )

Thanks.


Re: [PATCH net-next] ipv6: do lazy dst->__use update when per cpu dst is available

2017-09-27 Thread Wei Wang
On Wed, Sep 27, 2017 at 10:14 AM, Paolo Abeni  wrote:
> When a host is under high ipv6 load, the updates of the ingress
> route '__use' field are a source of relevant contention: such
> field is updated for each packet and several cores can access
> the dst concurrently, even if percpu dst entries are available
> and used.
>
> The __use value is just a rough indication of the dst usage: it is
> already updated concurrently from multiple CPUs without any lock,
> so we can decrease the contention by leveraging the percpu dst to perform
> __use bulk updates: if a per cpu dst entry is found, we account on
> such entry and we flush the percpu counter once per jiffy.
>
> Performance gain under UDP flood is as follows:
>
> nr RX queues  before  after  delta
>               kpps    kpps   (%)
> 2             2316    2688   16
> 3             3033    3605   18
> 4             3963    4328    9
> 5             4379    5253   19
> 6             5137    6000   16
>
> Performance gain under TCP syn flood should be measurable as well.
>
> Signed-off-by: Paolo Abeni 
> ---
>  net/ipv6/route.c | 18 --
>  1 file changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 26cc9f483b6d..e69f304de950 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1170,12 +1170,24 @@ struct rt6_info *ip6_pol_route(struct net *net, 
> struct fib6_table *table,
>
> struct rt6_info *pcpu_rt;
>
> -   rt->dst.lastuse = jiffies;
> -   rt->dst.__use++;
> pcpu_rt = rt6_get_pcpu_route(rt);
>
> if (pcpu_rt) {
> +   unsigned long ts;
> +
> read_unlock_bh(&table->tb6_lock);
> +
> +   /* do lazy updates of rt->dst->__use, at most once
> +* per jiffy, to avoid contention on such cacheline.
> +*/
> +   ts = jiffies;
> +   pcpu_rt->dst.__use++;
> +   if (pcpu_rt->dst.lastuse != ts) {
> +   rt->dst.__use += pcpu_rt->dst.__use;
> +   rt->dst.lastuse = ts;
> +   pcpu_rt->dst.__use = 0;
> +   pcpu_rt->dst.lastuse = ts;
> +   }
> } else {
> /* We have to do the read_unlock first
>  * because rt6_make_pcpu_route() may trigger
> @@ -1185,6 +1197,8 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
> fib6_table *table,
> read_unlock_bh(&table->tb6_lock);
> pcpu_rt = rt6_make_pcpu_route(rt);
> dst_release(&rt->dst);
> +   rt->dst.lastuse = jiffies;
> +   rt->dst.__use++;
> }
>
> trace_fib6_table_lookup(net, pcpu_rt, table->tb6_id, fl6);
> --
> 2.13.5
>

Hi Paolo,

Eric and I discussed about this issue recently as well :).

What about the following change:

diff --git a/include/net/dst.h b/include/net/dst.h
index 93568bd0a352..33e1d86bcef6 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -258,14 +258,18 @@ static inline void dst_hold(struct dst_entry *dst)
 static inline void dst_use(struct dst_entry *dst, unsigned long time)
 {
dst_hold(dst);
-   dst->__use++;
-   dst->lastuse = time;
+   if (dst->lastuse != time) {
+   dst->__use++;
+   dst->lastuse = time;
+   }
 }

 static inline void dst_use_noref(struct dst_entry *dst, unsigned long time)
 {
-   dst->__use++;
-   dst->lastuse = time;
+   if (dst->lastuse != time) {
+   dst->__use++;
+   dst->lastuse = time;
+   }
 }

 static inline struct dst_entry *dst_clone(struct dst_entry *dst)
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 26cc9f483b6d..e195f093add3 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1170,8 +1170,7 @@ struct rt6_info *ip6_pol_route(struct net *net,
struct fib6_table *table,

struct rt6_info *pcpu_rt;

-   rt->dst.lastuse = jiffies;
-   rt->dst.__use++;
+   dst_use_noref(rt, jiffies);
pcpu_rt = rt6_get_pcpu_route(rt);

if (pcpu_rt) {


This way we always only update dst->__use and dst->lastuse at most
once per jiffy. And we don't really need to update pcpu and then do
the copy over from pcpu_rt to rt operation.

Another thing is that I don't really see any places making use of
dst->__use. So maybe we can also get rid of this dst->__use field?

Thanks.
Wei


Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access

2017-09-27 Thread Alexei Starovoitov
On Wed, Sep 27, 2017 at 04:54:57PM +0200, Jesper Dangaard Brouer wrote:
> On Wed, 27 Sep 2017 06:35:40 -0700
> John Fastabend  wrote:
> 
> > On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:
> > > On Tue, 26 Sep 2017 21:58:53 +0200
> > > Daniel Borkmann  wrote:
> > >   
> > >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> > >> [...]  
> > >>> I'm currently implementing a cpumap type, that transfers raw XDP frames
> > >>> to another CPU, and the SKB is allocated on the remote CPU.  (It
> > >>> actually works extremely well).
> > >>
> > >> Meaning you let all the XDP_PASS packets get processed on a
> > >> different CPU, so you can reserve the whole CPU just for
> > >> prefiltering, right?   
> > > 
> > > Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
> > > The trick is using the map-flush point, to transfer packets in bulk to
> > > the remote CPU (single call IPC is too slow), but at the same time
> > > flush single packets if NAPI didn't see a bulk.
> > >   
> > >> Do you have some numbers to share at this point, just curious when
> > >> you mention it works extremely well.  
> > > 
> > > Sure... I've done a lot of benchmarking on this patchset ;-)
> > > I have a benchmark program called xdp_redirect_cpu [1][2], that collect
> > > stats via tracepoints (atm I'm limiting bulking 8 packets, and have
> > > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)
> > > 
> > >  [1] 
> > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
> > >  [2] 
> > > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
> > > 
> > > Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> > > packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU, that
> > > the 11.9Mpps DDoS attack is hitting.
> > > 
> > > Running XDP/eBPF prog_num:4
> > > XDP-cpumap      CPU:to  pps         drop-pps    extra-info
> > > XDP-RX          0       12,030,471  11,966,982  0
> > > XDP-RX          total   12,030,471  11,966,982
> > > cpumap-enqueue  0:2     63,488      0           0
> > > cpumap-enqueue  sum:2   63,488      0           0
> > > cpumap_kthread  2       63,488      0           3  time_exceed
> > > cpumap_kthread  total   63,488      0           0
> > > redirect_err    total   0           0
> > > 
> > > $ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
> > > Local /Remote
> > > Socket Size   Request  Resp.   Elapsed  Trans.
> > > Send   Recv   Size     Size    Time     Rate
> > > bytes  Bytes  bytes    bytes   secs.    per sec
> > > 
> > > 16384  87380  1024     1024    10.00    12735.97
> > > 16384  87380 
> > > 
> > > The netperf TCP_CRR performance is the same, without XDP loaded.
> > >   
> > 
> > Just curious could you also try this with RPS enabled (or does this have
> > RPS enabled). RPS should effectively do the same thing but higher in the
> > stack. I'm curious what the delta would be. Might be another interesting
> > case and fairly easy to setup if you already have the above scripts.
> 
> Yes, I'm essentially competing with RPS, thus such a comparison is very
> relevant...
> 
> This is only a 6-CPU system. Allocate 2 CPUs to RPS receive and let the
> other 4 CPUs process packets.
> 
> Summary of RPS (Receive Packet Steering) performance:
>  * End result is 6.3 Mpps max performance
>  * netperf TCP_CRR is 1 trans/sec.
>  * Each RX-RPS CPU stalls at ~3.2Mpps.
> 
> The full test report below with setup:
> 
> The mask needed::
> 
>  perl -e 'printf "%b\n",0x3C'
>  111100
> 
> RPS setup::
> 
>  sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries'
> 
>  for N in $(seq 0 5) ; do \
>sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \
>sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \
>grep -H . /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \
>  done
> 
> Reduce RX queues to two ::
> 
>  ethtool -L ixgbe1 combined 2
> 
> IRQ align to CPU numbers::
> 
>  $ ~/setup01.sh
>  Not root, running with sudo
>   --- Disable Ethernet flow-control ---
>  rx unmodified, ignoring
>  tx unmodified, ignoring
>  no pause parameters changed, aborting
>  rx unmodified, ignoring
>  tx unmodified, ignoring
>  no pause parameters changed, aborting
>   --- Align IRQs ---
>  /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0
>  /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1
>  /proc/irq/56/ixgbe1/../smp_affinity_list:0-5
> 
> $ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus
> /sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c
> /sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c
> 
> Generator is sending: 12,715,782 tx_packets /sec
> 
>  ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \
> -d 172.16.0.2 -t8
> 
> $ nstat > /dev/null 

[PATCH V3] r8152: add Linksys USB3GIGV1 id

2017-09-27 Thread Grant Grundler
This linksys dongle by default comes up in cdc_ether mode.
This patch allows r8152 to claim the device:
   Bus 002 Device 002: ID 13b1:0041 Linksys

Signed-off-by: Grant Grundler 
---
 drivers/net/usb/cdc_ether.c | 10 ++
 drivers/net/usb/r8152.c |  2 ++
 2 files changed, 12 insertions(+)

V3: for backwards compat, add #ifdef CONFIG_USB_RTL8152 around
the cdc_ether blacklist entry so the cdc_ether driver can
still claim the device if r8152 driver isn't configured.

V2: add LINKSYS_VENDOR_ID to cdc_ether blacklist

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index 8ab281b478f2..446dcc0f1f70 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -546,6 +546,7 @@ static const struct driver_info wwan_info = {
 #define DELL_VENDOR_ID 0x413C
 #define REALTEK_VENDOR_ID  0x0bda
 #define SAMSUNG_VENDOR_ID  0x04e8
+#define LINKSYS_VENDOR_ID  0x13b1
 #define LENOVO_VENDOR_ID   0x17ef
 #define NVIDIA_VENDOR_ID   0x0955
 #define HP_VENDOR_ID   0x03f0
@@ -737,6 +738,15 @@ static const struct usb_device_id  products[] = {
.driver_info = 0,
 },
 
+#ifdef CONFIG_USB_RTL8152
+/* Linksys USB3GIGV1 Ethernet Adapter */
+{
+   USB_DEVICE_AND_INTERFACE_INFO(LINKSYS_VENDOR_ID, 0x0041, USB_CLASS_COMM,
+   USB_CDC_SUBCLASS_ETHERNET, USB_CDC_PROTO_NONE),
+   .driver_info = 0,
+},
+#endif
+
 /* ThinkPad USB-C Dock (based on Realtek RTL8153) */
 {
USB_DEVICE_AND_INTERFACE_INFO(LENOVO_VENDOR_ID, 0x3062, USB_CLASS_COMM,
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index ceb78e2ea4f0..941ece08ba78 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -613,6 +613,7 @@ enum rtl8152_flags {
 #define VENDOR_ID_MICROSOFT0x045e
 #define VENDOR_ID_SAMSUNG  0x04e8
 #define VENDOR_ID_LENOVO   0x17ef
+#define VENDOR_ID_LINKSYS  0x13b1
 #define VENDOR_ID_NVIDIA   0x0955
 
 #define MCU_TYPE_PLA   0x0100
@@ -5316,6 +5317,7 @@ static const struct usb_device_id rtl8152_table[] = {
{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x7205)},
{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x720c)},
{REALTEK_USB_DEVICE(VENDOR_ID_LENOVO,  0x7214)},
+   {REALTEK_USB_DEVICE(VENDOR_ID_LINKSYS, 0x0041)},
{REALTEK_USB_DEVICE(VENDOR_ID_NVIDIA,  0x09ff)},
{}
 };
-- 
2.14.2.822.g60be5d43e6-goog



Re: [PATCH v2 16/16] net: Add support for networking over Thunderbolt cable

2017-09-27 Thread Mika Westerberg
On Wed, Sep 27, 2017 at 09:27:09AM -0700, David Miller wrote:
> From: Mika Westerberg 
> Date: Wed, 27 Sep 2017 16:42:38 +0300
> 
> > Using build_skb() then would require to allocate larger buffer, that
> > includes NET_SKB_PAD + SKB_DATA_ALIGN(skb_shared_info) and that exceeds
> > page size. Is this something supported by build_skb()? It was not clear
> > to me based on the code and other users of build_skb() but I may be
> > missing something.
> 
> You need NET_SKB_PAD before and SKB_DATA_ALIGN(skb_shared_info) afterwards.
> An order 1 page, if that's what you need, should work just fine.

I mean in order to fit a single ThunderboltIP frame, I would need to
allocate NET_SKB_PAD+4096+SKB_DATA_ALIGN(skb_shared_info) size buffer.
Is that still fine for build_skb()? Also can I use that with
skb_add_rx_frag() which seem to take single page?

ThunderboltIP protocol basically takes advantage of TSO/LRO but it
actually does not do any segmentation. Instead it just splits the 64kB
packet into smaller 4k frames (each including a 12-byte header)
and pushes those over the Thunderbolt medium. The receiver side then
does the opposite.

Thanks and sorry for dummy questions. I'm just not too familiar with
the networking subsystem (yet).


[PATCH net-next] ipv6: do lazy dst->__use update when per cpu dst is available

2017-09-27 Thread Paolo Abeni
When a host is under high ipv6 load, the updates of the ingress
route '__use' field are a source of relevant contention: such
field is updated for each packet and several cores can access
the dst concurrently, even if percpu dst entries are available
and used.

The __use value is just a rough indication of the dst usage: it is
already updated concurrently from multiple CPUs without any lock,
so we can decrease the contention by leveraging the percpu dst to perform
__use bulk updates: if a per cpu dst entry is found, we account on
such entry and we flush the percpu counter once per jiffy.

Performance gain under UDP flood is as follows:

nr RX queues  before  after  delta
              kpps    kpps   (%)
2             2316    2688   16
3             3033    3605   18
4             3963    4328    9
5             4379    5253   19
6             5137    6000   16

Performance gain under TCP syn flood should be measurable as well.

Signed-off-by: Paolo Abeni 
---
 net/ipv6/route.c | 18 --
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 26cc9f483b6d..e69f304de950 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1170,12 +1170,24 @@ struct rt6_info *ip6_pol_route(struct net *net, struct 
fib6_table *table,
 
struct rt6_info *pcpu_rt;
 
-   rt->dst.lastuse = jiffies;
-   rt->dst.__use++;
pcpu_rt = rt6_get_pcpu_route(rt);
 
if (pcpu_rt) {
+   unsigned long ts;
+
read_unlock_bh(&table->tb6_lock);
+
+   /* do lazy updates of rt->dst->__use, at most once
+* per jiffy, to avoid contention on such cacheline.
+*/
+   ts = jiffies;
+   pcpu_rt->dst.__use++;
+   if (pcpu_rt->dst.lastuse != ts) {
+   rt->dst.__use += pcpu_rt->dst.__use;
+   rt->dst.lastuse = ts;
+   pcpu_rt->dst.__use = 0;
+   pcpu_rt->dst.lastuse = ts;
+   }
} else {
/* We have to do the read_unlock first
 * because rt6_make_pcpu_route() may trigger
@@ -1185,6 +1197,8 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
read_unlock_bh(&table->tb6_lock);
pcpu_rt = rt6_make_pcpu_route(rt);
dst_release(&rt->dst);
+   rt->dst.lastuse = jiffies;
+   rt->dst.__use++;
}
 
trace_fib6_table_lookup(net, pcpu_rt, table->tb6_id, fl6);
-- 
2.13.5



Re: [PATCH 2/2] uapi: add a compatibility layer between linux/uio.h and glibc

2017-09-27 Thread Lee Duncan
Ping? I never saw any response to this.

On Wed, 22 Feb 2017 05:29:47 +0300, Dmitry V Levin wrote:
> Do not define struct iovec in linux/uio.h when <sys/uio.h> or <fcntl.h>
> is already included and provides these definitions.
> 
> This fixes the following compilation error when <sys/uio.h> or <fcntl.h>
> is included before <linux/uio.h>:
> 
> /usr/include/linux/uio.h:16:8: error: redefinition of 'struct iovec'
> 
> Signed-off-by: Dmitry V. Levin 
> ---
>  include/uapi/linux/libc-compat.h | 10 ++
>  include/uapi/linux/uio.h |  3 +++
>  2 files changed, 13 insertions(+)
> 
> diff --git a/include/uapi/linux/libc-compat.h b/include/uapi/linux/libc-compat.h
> index 481e3b1..9b88586 100644
> --- a/include/uapi/linux/libc-compat.h
> +++ b/include/uapi/linux/libc-compat.h
> @@ -205,6 +205,13 @@
>  #define __UAPI_DEF_TIMEZONE  1
>  #endif
>  
> +/* Coordinate with glibc bits/uio.h header. */
> +#if defined(_SYS_UIO_H) || defined(_FCNTL_H)
> +#define __UAPI_DEF_IOVEC 0
> +#else
> +#define __UAPI_DEF_IOVEC 1
> +#endif
> +
>  /* Definitions for xattr.h */
>  #if defined(_SYS_XATTR_H)
>  #define __UAPI_DEF_XATTR 0
> @@ -261,6 +268,9 @@
>  #define __UAPI_DEF_TIMEVAL   1
>  #define __UAPI_DEF_TIMEZONE  1
>  
> +/* Definitions for uio.h */
> +#define __UAPI_DEF_IOVEC 1
> +
>  /* Definitions for xattr.h */
>  #define __UAPI_DEF_XATTR 1
>  
> diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h
> index 2731d56..e6e12cf 100644
> --- a/include/uapi/linux/uio.h
> +++ b/include/uapi/linux/uio.h
> @@ -9,15 +9,18 @@
>  #ifndef _UAPI__LINUX_UIO_H
>  #define _UAPI__LINUX_UIO_H
>  
> +#include <linux/libc-compat.h>
>  #include <linux/compiler.h>
>  #include <linux/types.h>
>  
>  
> +#if __UAPI_DEF_IOVEC
>  struct iovec
>  {
>   void __user *iov_base;  /* BSD uses caddr_t (1003.1g requires void *) */
>   __kernel_size_t iov_len; /* Must be size_t (1003.1g) */
>  };
> +#endif /* __UAPI_DEF_IOVEC */
>  
>  /*
>   *   UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1)
> -- 
> ldv

-- 
Lee Duncan
SUSE Labs


RE: [PATCH net v2] net: dsa: mv88e6xxx: lock mutex when freeing IRQs

2017-09-27 Thread David Laight
From: Andrew Lunn
> Sent: 27 September 2017 14:07
> To: David Laight
> > On Wed, Sep 27, 2017 at 09:06:01AM +0000, David Laight wrote:
> > From: Vivien Didelot
> > > Sent: 26 September 2017 19:57
> > > mv88e6xxx_g2_irq_free locks the registers mutex, but not
> > > mv88e6xxx_g1_irq_free, which results in a stack trace from
> > > assert_reg_lock when unloading the mv88e6xxx module. Fix this.
> > >
> > > Fixes: 3460a5770ce9 ("net: dsa: mv88e6xxx: Mask g1 interrupts and free interrupt")
> > > Signed-off-by: Vivien Didelot 
> > > ---
> > >  drivers/net/dsa/mv88e6xxx/chip.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
> > > index c6678aa9b4ef..e7ff7483d2fb 100644
> > > --- a/drivers/net/dsa/mv88e6xxx/chip.c
> > > +++ b/drivers/net/dsa/mv88e6xxx/chip.c
> > > @@ -3947,7 +3947,9 @@ static void mv88e6xxx_remove(struct mdio_device *mdiodev)
> > >   if (chip->irq > 0) {
> > >   if (chip->info->g2_irqs > 0)
> > >   mv88e6xxx_g2_irq_free(chip);
> > > + mutex_lock(&chip->reg_lock);
> > >   mv88e6xxx_g1_irq_free(chip);
> > > + mutex_unlock(&chip->reg_lock);
> >
> > Isn't the irq_free code likely to have to sleep waiting for any
> > ISR to complete??
> 
> Hi David
> 
> Possibly. But this is a mutex, not a spinlock. So sleeping is O.K.
> Or am i missing something?

Looks like I was missing some coffee :-)

David



Re: [PATCH V2] r8152: add Linksys USB3GIGV1 id

2017-09-27 Thread Grant Grundler
On Wed, Sep 27, 2017 at 12:15 AM, Oliver Neukum  wrote:
> Am Dienstag, den 26.09.2017, 08:19 -0700 schrieb Doug Anderson:
>>
>> I know that for at least some of the adapters in the CDC Ethernet
>> blacklist it was claimed that the CDC Ethernet support in the adapter
>> was kinda broken anyway so the blacklist made sense.  ...but for the
>> Linksys Gigabit adapter the CDC Ethernet driver seems to work OK, it's
>> just not quite as full featured / efficient as the R8152 driver.
>>
>> Is that not a concern?  I guess you could tell people in this
>> situation that they simply need to enable the R8152 driver to get
>> continued support for their Ethernet adapter?
>
> Hi,
>
> yes, it is a valid concern. An #ifdef will be needed.

Good idea - I will post V3 shortly.

I'm assuming you mean to add #ifdef CONFIG_USB_RTL8152 around the
blacklist entry in cdc_ether driver.

cheers,
grant


[PATCH net-next] tcp: fix under-evaluated ssthresh in TCP Vegas

2017-09-27 Thread Hoang Tran
With the commit 76174004a0f19785 (tcp: do not slow start when cwnd equals
ssthresh), the comparison to the reduced cwnd in tcp_vegas_ssthresh() would
under-evaluate the ssthresh.

Signed-off-by: Hoang Tran 
---
 net/ipv4/tcp_vegas.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_vegas.c b/net/ipv4/tcp_vegas.c
index 218cfcc..ee113ff 100644
--- a/net/ipv4/tcp_vegas.c
+++ b/net/ipv4/tcp_vegas.c
@@ -158,7 +158,7 @@ EXPORT_SYMBOL_GPL(tcp_vegas_cwnd_event);
 
 static inline u32 tcp_vegas_ssthresh(struct tcp_sock *tp)
 {
-   return  min(tp->snd_ssthresh, tp->snd_cwnd-1);
+   return  min(tp->snd_ssthresh, tp->snd_cwnd);
 }
 
 static void tcp_vegas_cong_avoid(struct sock *sk, u32 ack, u32 acked)
-- 
2.7.4
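A quick numeric check of what the one-character change does: with the old code, when cwnd equals ssthresh the returned ssthresh drops below cwnd, which is the under-evaluation the commit message describes. A plain-C sketch (not the kernel code itself):

```c
#include <assert.h>

typedef unsigned int u32;

static u32 min_u32(u32 a, u32 b) { return a < b ? a : b; }

/* Before the fix: clamped to cwnd - 1. */
static u32 vegas_ssthresh_old(u32 snd_ssthresh, u32 snd_cwnd)
{
	return min_u32(snd_ssthresh, snd_cwnd - 1);
}

/* After the fix: clamped to cwnd, consistent with "do not slow start
 * when cwnd equals ssthresh" (commit 76174004a0f1). */
static u32 vegas_ssthresh_new(u32 snd_ssthresh, u32 snd_cwnd)
{
	return min_u32(snd_ssthresh, snd_cwnd);
}
```

When cwnd is strictly above ssthresh the two versions agree; only the cwnd == ssthresh boundary changes.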



Re: [PATCH net-next] libbpf: use map_flags when creating maps

2017-09-27 Thread Alexei Starovoitov

On 9/27/17 7:04 AM, Craig Gallek wrote:

From: Craig Gallek 

This extends struct bpf_map_def to include a flags field.  Note that
this has the potential to break the validation logic in
bpf_object__validate_maps and bpf_object__init_maps as they use
sizeof(struct bpf_map_def) as a minimal allowable size of a map section.
Any bpf program compiled with a smaller struct bpf_map_def will fail this
check.

I don't believe this will be an issue in practice as both compile-time
definitions of struct bpf_map_def (in samples/bpf/bpf_load.h and
tools/testing/selftests/bpf/bpf_helpers.h) have always been larger
than this newly updated version in libbpf.h.

Signed-off-by: Craig Gallek 
---
 tools/lib/bpf/libbpf.c | 2 +-
 tools/lib/bpf/libbpf.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 35f6dfcdc565..6bea85f260a3 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -874,7 +874,7 @@ bpf_object__create_maps(struct bpf_object *obj)
  def->key_size,
  def->value_size,
  def->max_entries,
- 0);
+ def->map_flags);
if (*pfd < 0) {
size_t j;
int err = *pfd;
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 7959086eb9c9..6e20003109e0 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -207,6 +207,7 @@ struct bpf_map_def {
unsigned int key_size;
unsigned int value_size;
unsigned int max_entries;
+   unsigned int map_flags;
 };


yes it will break loading of pre-compiled .o
Instead of breaking, let's fix the loader to do it the way
samples/bpf/bpf_load.c does.
See commit 156450d9d964 ("samples/bpf: make bpf_load.c code compatible 
with ELF maps section changes")





Re: [PATCH v2 16/16] net: Add support for networking over Thunderbolt cable

2017-09-27 Thread David Miller
From: Mika Westerberg 
Date: Wed, 27 Sep 2017 16:42:38 +0300

> Using build_skb() then would require to allocate larger buffer, that
> includes NET_SKB_PAD + SKB_DATA_ALIGN(skb_shared_info) and that exceeds
> page size. Is this something supported by build_skb()? It was not clear
> to me based on the code and other users of build_skb() but I may be
> missing something.

You need NET_SKB_PAD before and SKB_DATA_ALIGN(skb_shared_info) afterwards.
An order 1 page, if that's what you need, should work just fine.
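The buffer layout David describes can be sketched as arithmetic: headroom before the frame, aligned shared info after it. The constants below (NET_SKB_PAD, cache line size, sizeof(struct skb_shared_info)) are config- and arch-dependent stand-ins, not the values of any particular kernel build:

```c
#include <assert.h>
#include <stddef.h>

/* Rough stand-ins; real values depend on kernel config and arch. */
#define NET_SKB_PAD       64
#define SMP_CACHE_BYTES   64
#define SKB_DATA_ALIGN(x) (((x) + SMP_CACHE_BYTES - 1) & ~(size_t)(SMP_CACHE_BYTES - 1))
#define SKB_SHINFO_SIZE   320   /* assumed sizeof(struct skb_shared_info) */
#define PAGE_SIZE         4096

/* Total buffer build_skb() needs for a given frame size. */
static size_t build_skb_bufsize(size_t frame_size)
{
	return NET_SKB_PAD + frame_size + SKB_DATA_ALIGN(SKB_SHINFO_SIZE);
}

/* Smallest page order whose allocation fits the buffer. */
static unsigned int bufsize_order(size_t bufsize)
{
	unsigned int order = 0;
	size_t span = PAGE_SIZE;

	while (span < bufsize) {
		span <<= 1;
		order++;
	}
	return order;
}
```

With these numbers a 4k frame needs 4480 bytes in total, which no longer fits one page but easily fits an order-1 (8kB) allocation, matching the advice above.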


Re: [PATCH v3 1/1] ip_tunnel: add mpls over gre encapsulation

2017-09-27 Thread Amine Kherbouche



On 09/27/2017 06:20 PM, Roopa Prabhu wrote:

I think it's better to bring the patch back in.


Sounds good, ok


Re: [PATCH] net/ipv4: Update sk_for_each_entry_offset_rcu macro to utilize rcu methods hlist_next_rcu. This fixes the warnings thrown by sparse regarding net/ipv4/udp.c on line 1974.

2017-09-27 Thread Tim Hansen
> Tue, Sep 26, 2017 at 08:03:39PM -0700, David Miller wrote:
> From: Tim Hansen 
> Date: Tue, 26 Sep 2017 20:54:05 -0400
>
> > Signed-off-by: Tim Hansen 
>
> This is a poor patch submission on many levels.
>

Apologies Dave, this is my first patch. I appreciate the quick review and 
helpful feedback.

> But the main problem, is that there is no use of
> sk_for_each_entry_offset_rcu() in any of my networking kernel trees.
>

Using the get_maintainers.pl on include/net/sock.h brings up your name and the 
netdev mailing list.  Forgive me if I'm misunderstanding what you mean by this?

> Referencing code by line number never works, you have to mention
> what version of the kernel, what tree, and where in what fucntion
> the problem is occurring.
>

I am using your net tree for now:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git on HEAD. HEAD is 
c2cc187e53011c1c4931055984657da9085c763b for me currently on your tree.

Before I was on the 4.13 tag pulled from linus' tree.

The line number is indeed useless in hindsight since there are many different 
trees. I won't do that again.

Using sparse 0.5.0 on HEAD of your net tree, I run make C=1 net/ipv4/. It 
throws the error:

"net/ipv4/udp.c:1981:9: error: incompatible types in comparison expression (different address spaces)
net/ipv4/udp.c:1981:9: error: incompatible types in comparison expression (different address spaces)"

This points to the function sk_for_each_entry_offset_rcu() in 
__udp4_lib_mcast_deliver in net/ipv4/udp.c.

Inspecting this macro in include/net/sock.h is what lead to this patch.

Applying the patch silences those warnings but clearly this is -not- a proper 
way of fixing the error. Any guidance would be greatly appreciated.

> Secondly, sk_for_each_entry_offset_rcu() is not meant to be used
> in _raw() contexts.  This is why it's not called
> sk_for_each_entry_offset_rcu_raw().

Absolutely makes sense. I am not yet fully familiar with the kernel naming
standards, but this is obvious in hindsight.

>
> The sparse warning is probably legitimate, and points to a bug.
>
> But nobody can tell where because you haven't told us what tree
> and where this happens.

Hopefully my reply has enough detail for reproduction and further debugging. 
Please let me know if I should supply any additional information.


Re: [PATCH v2 02/16] thunderbolt: Add support for XDomain properties

2017-09-27 Thread David Miller
From: Mika Westerberg 
Date: Wed, 27 Sep 2017 14:32:41 +0300

> Just for my education, is there some rule which tells when __packed is
> to be used? For example the above structures are all 32-bit aligned but
> how about something like:
> 
> struct foo {
>   u32 value1;
>   u8 value2;
> };
> 
> If the on-wire format requires such structures I assume __packed
> is needed here?

Usually header elements are 32-bit aligned in a protocol, so it wouldn't
be specified like that.

The only legitimate case I've seen is where things are purposefully
misaligned within the header, like this:

struct foo {
u16 x;
u64 y;
u16 z;
};

Where the 'y' element is 2-byte aligned.

Fortunately, those situations are extremely rare.
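The effect of packing such a deliberately misaligned layout can be checked with offsetof/sizeof. A sketch using the GCC/Clang attribute that the kernel's __packed macro expands to:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The on-wire layout described above: y lands at offset 2, so without
 * packing the compiler would insert padding to naturally align it. */
struct foo_packed {
	uint16_t x;
	uint64_t y;
	uint16_t z;
} __attribute__((packed));

/* Same fields with natural alignment: padding appears around y. */
struct foo_natural {
	uint16_t x;
	uint64_t y;
	uint16_t z;
};
```

The packed variant matches the wire format exactly (12 bytes, y at offset 2); the natural variant is larger on any ABI that aligns u64 past 2 bytes, which is why the attribute is needed for this kind of header.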


Re: [PATCH v3 1/1] ip_tunnel: add mpls over gre encapsulation

2017-09-27 Thread Roopa Prabhu
On Wed, Sep 27, 2017 at 9:08 AM, Amine Kherbouche
 wrote:
>
>
> On 09/27/2017 05:36 PM, Roopa Prabhu wrote:
>>
>> Amine, one small nit here.., if you define mpls_gre_rcv in gre header
>> (like you had initially), you could do the below...
>>
>> #if IS_ENABLED(CONFIG_MPLS)
>> mpls_gre_rcv()
>> {
>>  /* real func */
>> }
>> #else
>> mpls_gre_rcv()
>> {
>> kfree_skb(skb)
>> return NET_RX_DROP
>> }
>> #endif
>>
>> and the check in gre_rcv() reduces to
>>
>> if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC)))
>> return mpls_gre_rcv(skb, hdr_len);
>>
>> Which looks much cleaner.
>
>
> If I do that, do I have to add back the patch that export mpls_forward() or
> just merge it with this one ?

I think it's better to bring the patch back in.


RE: [PATCH v2 02/16] thunderbolt: Add support for XDomain properties

2017-09-27 Thread David Laight
From: Mika Westerberg
> Sent: 27 September 2017 12:33
...
> Just for my education, is there some rule which tells when __packed is
> to be used? For example the above structures are all 32-bit aligned but
> how about something like:
> 
> struct foo {
>   u32 value1;
>   u8 value2;
> };
> 
> If the on-wire format requires such structures I assume __packed
> is needed here?

You've endianness considerations as well with on-wire formats.

__packed indicates two things:
1) There will be no padding bytes between fields.
2) The structure itself might appear on any byte boundary.

The latter causes the compiler to do byte memory accesses and
shifts to load/store the data on some architectures.
So only mark things __packed when they might be misaligned in
memory.

David
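Point (2) above is why unaligned accesses need care: a wire header may start at any byte offset within a receive buffer. A common portable idiom is to go through memcpy and let the compiler pick whatever access sequence is safe for the architecture (sketch; the decoded value assumes a little-endian host):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit value from a possibly misaligned buffer position.
 * memcpy (like a pointer to a __packed struct) tells the compiler it
 * must not assume alignment, so it emits byte loads and shifts where
 * the hardware requires them. */
static uint32_t get_unaligned_32(const unsigned char *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;   /* interpreted in host byte order */
}
```

Dereferencing a plain `uint32_t *` at such an offset would instead be undefined behaviour and can fault on strict-alignment architectures.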



Re: [PATCH v3 1/1] ip_tunnel: add mpls over gre encapsulation

2017-09-27 Thread Amine Kherbouche



On 09/27/2017 05:36 PM, Roopa Prabhu wrote:

Amine, one small nit here.., if you define mpls_gre_rcv in gre header
(like you had initially), you could do the below...

#if IS_ENABLED(CONFIG_MPLS)
mpls_gre_rcv()
{
 /* real func */
}
#else
mpls_gre_rcv()
{
kfree_skb(skb)
return NET_RX_DROP
}
#endif

and the check in gre_rcv() reduces to

if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC)))
return mpls_gre_rcv(skb, hdr_len);

Which looks much cleaner.


If I do that, do I have to add back the patch that export mpls_forward() 
or just merge it with this one ?


Re: [iproute PATCH v2 0/3] Check user supplied interface name lengths

2017-09-27 Thread Phil Sutter
On Wed, Sep 27, 2017 at 08:42:49AM +0100, Stephen Hemminger wrote:
> On Tue, 26 Sep 2017 18:35:45 +0200
> Phil Sutter  wrote:
> 
> > This series adds explicit checks for user-supplied interface names to
> > make sure their length fits Linux's requirements.
> > 
> > The first two patches simplify interface name parsing in some places -
> > these are side-effects of working on the actual implementation provided
> > in patch three.
> > 
> > Changes since v1:
> > - Patches 1 and 2 introduced.
> > - Changes to patch 3 are listed in there.
> > 
> > Phil Sutter (3):
> >   ip{6,}tunnel: Avoid copying user-supplied interface name around
> >   tc: flower: No need to cache indev arg
> >   Check user supplied interface name lengths
> > 
> >  include/utils.h |  1 +
> >  ip/ip6tunnel.c  |  9 +
> >  ip/ipl2tp.c |  3 ++-
> >  ip/iplink.c | 27 ---
> >  ip/ipmaddr.c|  1 +
> >  ip/iprule.c |  4 
> >  ip/iptunnel.c   | 27 +--
> >  ip/iptuntap.c   |  4 +++-
> >  lib/utils.c | 10 ++
> >  misc/arpd.c |  1 +
> >  tc/f_flower.c   |  6 ++
> >  11 files changed, 50 insertions(+), 43 deletions(-)
> > 
> 
> I like the idea, and checking arguments is good.

Cool!

> Why not merge the check and copy and put in lib/utils.c
> 
> int get_ifname(char *name, const char *arg)
> {
> ...

What do you have in mind exactly? There are basically three situations
to which check_ifname() is added:

1) Simple pointer caching:

   | check_ifname("name", *argv);
   | name = *argv;

2) Value caching:

   | check_ifname("name", *argv);
   | strncpy(name, *argv, IFNAMSIZ);

3) Direct netlink attribute creation:

   | check_ifname("name", *argv);
   | addattr_l(&req.n, sizeof(req), IFNAME, *argv, strlen(*argv) + 1);

To cover them all, I could introduce the following:

| char *check_ifname(const char *name, const char *argv)
| {
|   /* check *arg, call invarg() if invalid */
|   return *argv;
| }
| 
| void copy_ifname(char *dst, const char *name, const char *argv)
| {
|   strncpy(dst, check_ifname(name, argv), IFNAMSIZ);
| }
| 
| void addattr_ifname(struct nlmsghdr *n, int maxlen, int type,
|   const char *name, const char *argv)
| {
|   addattr_l(n, maxlen, type, check_ifname(name, argv),
| strlen(*argv) + 1);
| }

What do you think?

Cheers, Phil
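For reference, the length check at the core of this discussion is small. A standalone sketch (iproute2's real helper would call invarg() and exit rather than return an error code; IFNAMSIZ comes from <net/if.h> and includes the terminating NUL):

```c
#include <assert.h>
#include <string.h>

#define IFNAMSIZ 16   /* kernel limit, including the NUL terminator */

/* A user-supplied interface name is valid only if it is non-empty and
 * fits in IFNAMSIZ including the terminator. Returns 0 if valid. */
static int check_ifname(const char *name)
{
	size_t len = strlen(name);

	if (len == 0 || len >= IFNAMSIZ)
		return -1;
	return 0;
}
```

The copy_ifname/addattr_ifname wrappers proposed above would simply call this first, so the strncpy and netlink-attribute paths can never truncate silently.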


Re: [PATCH net-next 5/6] net: dsa: mv88e6xxx: Move mv88e6xxx_port_db_load_purge()

2017-09-27 Thread Vivien Didelot
Andrew Lunn  writes:

> This function is going to be needed by a soon to be added new
> function. Move it earlier so we can avoid a forward declaration.
> No code changes.

s/code/functional/

> Signed-off-by: Andrew Lunn 

Reviewed-by: Vivien Didelot 


Re: [PATCH net-next 4/6] net: dsa: mv88e6xxx: Print offending port when vlan check fails

2017-09-27 Thread Vivien Didelot
Andrew Lunn  writes:

> When testing if a VLAN is used in more than one bridge, we print an error
> message that the VLAN is already in use somewhere else. Print both the
> new port which would like the VLAN, and the port which already has it,
> to aid debugging.
>
> Signed-off-by: Andrew Lunn 

Reviewed-by: Vivien Didelot 


Re: [PATCH v2 net-next 0/2] net/sched: support tunnel options in cls_flower and act_tunnel_key

2017-09-27 Thread Jiri Benc
On Wed, 27 Sep 2017 10:16:32 +0200, Simon Horman wrote:
> * Geneve
> 
>   In the case of Geneve options are TLVs[1]. My reading is that in collect
>   metadata mode the kernel does not appear to do anything other than pass
>   them around as a bytestring.
> 
>   [1] https://tools.ietf.org/html/draft-ietf-nvo3-geneve-05#section-3.5
[...]
> 
> Neither of the above appear to assume any structure for the data.

But that's not true. Geneve uses TLVs, you even mentioned that
yourself. Matching on a block of TLVs as a bytestring doesn't make
sense. The TLV fields may be in any order.

We need better matching here. Bytestring is useless for Geneve.

NACK for this direction of option matching. We'd need to introduce
matching on TLVs sooner or later anyway and this would be just a never
used compat cruft that we need to keep around forever.

 Jiri
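Jiri's point can be made concrete: the same Geneve option can sit at a different byte offset in different packets, so a matcher has to walk the TLV chain looking for (class, type) rather than comparing the whole option block as one bytestring. A sketch (header fields per the Geneve draft, length counted in 4-byte words; real packets carry opt_class in network byte order, host order is used here for brevity):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct geneve_opt_hdr {
	uint16_t opt_class;
	uint8_t  type;
	uint8_t  length;   /* payload length in 4-byte multiples (5 bits) */
};

/* Return 1 if an option with the given class/type appears anywhere in
 * the options block, regardless of its position in the chain. */
static int geneve_find_opt(const uint8_t *opts, size_t opts_len,
			   uint16_t opt_class, uint8_t type)
{
	size_t off = 0;

	while (off + sizeof(struct geneve_opt_hdr) <= opts_len) {
		struct geneve_opt_hdr h;

		memcpy(&h, opts + off, sizeof(h));
		if (h.opt_class == opt_class && h.type == type)
			return 1;
		off += sizeof(h) + (size_t)(h.length & 0x1f) * 4;
	}
	return 0;
}
```

Two packets carrying the same two options in opposite order match the TLV walk identically, yet differ as bytestrings, which is exactly why bytestring matching fails for Geneve.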


Re: [PATCH v3 1/1] ip_tunnel: add mpls over gre encapsulation

2017-09-27 Thread Roopa Prabhu
On Wed, Sep 27, 2017 at 2:37 AM, Amine Kherbouche
 wrote:
> This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel
> API.
>
> Encap:
>   - Add a new iptunnel type mpls.
>   - Share tx path: gre type mpls loaded from skb->protocol.
>
> Decap:
>   - pull gre hdr and call mpls_forward().
>
> Signed-off-by: Amine Kherbouche 
> ---
>  include/linux/mpls.h   |  2 ++
>  include/uapi/linux/if_tunnel.h |  1 +
>  net/ipv4/ip_gre.c  | 11 +
>  net/ipv6/ip6_gre.c | 11 +
>  net/mpls/af_mpls.c | 52 
> ++
>  5 files changed, 77 insertions(+)
>
> diff --git a/include/linux/mpls.h b/include/linux/mpls.h
> index 384fb22..57203c1 100644
> --- a/include/linux/mpls.h
> +++ b/include/linux/mpls.h
> @@ -8,4 +8,6 @@
>  #define MPLS_TC_MASK   (MPLS_LS_TC_MASK >> MPLS_LS_TC_SHIFT)
>  #define MPLS_LABEL_MASK        (MPLS_LS_LABEL_MASK >> MPLS_LS_LABEL_SHIFT)
>
> +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len);
> +
>  #endif  /* _LINUX_MPLS_H */
> diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
> index 2e52088..a2f48c0 100644
> --- a/include/uapi/linux/if_tunnel.h
> +++ b/include/uapi/linux/if_tunnel.h
> @@ -84,6 +84,7 @@ enum tunnel_encap_types {
> TUNNEL_ENCAP_NONE,
> TUNNEL_ENCAP_FOU,
> TUNNEL_ENCAP_GUE,
> +   TUNNEL_ENCAP_MPLS,
>  };
>
>  #define TUNNEL_ENCAP_FLAG_CSUM (1<<0)
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 9cee986..0a898f4 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -32,6 +32,9 @@
>  #include 
>  #include 
>  #include 
> +#if IS_ENABLED(CONFIG_MPLS)
> +#include 
> +#endif
>
>  #include 
>  #include 
> @@ -412,6 +415,14 @@ static int gre_rcv(struct sk_buff *skb)
> return 0;
> }
>
> +   if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC))) {
> +#if IS_ENABLED(CONFIG_MPLS)
> +   return mpls_gre_rcv(skb, hdr_len);
> +#else
> +   goto drop;
> +#endif
> +   }
> +

Amine, one small nit here.., if you define mpls_gre_rcv in gre header
(like you had initially), you could do the below...

#if IS_ENABLED(CONFIG_MPLS)
mpls_gre_rcv()
{
 /* real func */
}
#else
mpls_gre_rcv()
{
kfree_skb(skb)
return NET_RX_DROP
}
#endif

and the check in gre_rcv() reduces to

if (unlikely(tpi.proto == htons(ETH_P_MPLS_UC)))
return mpls_gre_rcv(skb, hdr_len);

Which looks much cleaner.

Other than that, looks great. pls add my Acked-by: Roopa Prabhu
 to your next version.

thanks!


Re: tc H/W offload issue with vxlan tunnels [was: nfp: flower vxlan tunnel offload]

2017-09-27 Thread Jiri Pirko
Wed, Sep 27, 2017 at 05:27:29PM CEST, jb...@redhat.com wrote:
>On Wed, 27 Sep 2017 14:55:09 +0200, Jiri Pirko wrote:
>> So where do you attach the tc filter instead of eth0? vxlan0?
>
>Yes, vxlan0. I'm pasting the example from earlier in this thread again:
>
>This will match:
>
>ip link add vxlan0 type vxlan dstport 4789 dev eth0 external
>ip link set dev vxlan0 up
>tc qdisc add dev vxlan0 ingress
>ethtool -K eth0 hw-tc-offload on
>tc filter add dev vxlan0 protocol ip parent ffff: flower enc_key_id 102 \
>   enc_dst_port 4789 src_ip 3.4.5.6 skip_sw action [...]
>
>while this must NOT match:
>
>ip link add vxlan0 type vxlan dstport 4789 dev eth0 external
>ip link set dev vxlan0 up
>tc qdisc add dev eth0 ingress
>ethtool -K eth0 hw-tc-offload on
>tc filter add dev eth0 protocol ip parent ffff: flower enc_key_id 102 \
>   enc_dst_port 4789 src_ip 3.4.5.6 skip_sw action [...]

Right. Driver of eth0 should get ndo_setup_tc calls for vxlan0. Similar
thing will be needed for bonding/team. I'm taking this into
consideration for my sharedblock patchset. Work in progress:
https://github.com/jpirko/linux_mlxsw/commits/jiri_devel_shblock
Basically the eth0 driver will register a callback function that would
be called whenever filter is added/deleted on vxlan0


Re: [PATCH net-next 1/6] net: dsa: Fix SWITCHDEV_ATTR_ID_PORT_PARENT_ID

2017-09-27 Thread Vivien Didelot
Hi Andrew,

Andrew Lunn  writes:

> @@ -354,13 +354,16 @@ static int dsa_slave_port_attr_get(struct net_device *dev,
>  struct switchdev_attr *attr)
>  {
>   struct dsa_slave_priv *p = netdev_priv(dev);
> - struct dsa_switch *ds = p->dp->ds;
>  
>   switch (attr->id) {
> - case SWITCHDEV_ATTR_ID_PORT_PARENT_ID:
> - attr->u.ppid.id_len = sizeof(ds->index);
> - memcpy(&attr->u.ppid.id, &ds->index, attr->u.ppid.id_len);
> + case SWITCHDEV_ATTR_ID_PORT_PARENT_ID: {
> + struct dsa_switch *ds = p->dp->ds;
> + struct dsa_switch_tree *dst = ds->dst;
> +
> + attr->u.ppid.id_len = sizeof(dst->tree);
> + memcpy(&attr->u.ppid.id, &dst->tree, sizeof(dst->tree));
>   break;
> + }
>   case SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT:
>   attr->u.brport_flags_support = 0;
>   break;

I am pretty sure the kernel folks won't like blocks within case
statements. Simply declare dst = p->dp->ds->dst at the beginning.

Also keeping attr->u.ppid.id_len as memcpy's third argument like other
switchdev users do would be prefered.

Otherwise using the tree index is indeed the way to go:

Reviewed-by: Vivien Didelot 


Thanks,

Vivien


Re: tc H/W offload issue with vxlan tunnels [was: nfp: flower vxlan tunnel offload]

2017-09-27 Thread Jiri Benc
On Wed, 27 Sep 2017 14:55:09 +0200, Jiri Pirko wrote:
> So where do you attach the tc filter instead of eth0? vxlan0?

Yes, vxlan0. I'm pasting the example from earlier in this thread again:

This will match:

ip link add vxlan0 type vxlan dstport 4789 dev eth0 external
ip link set dev vxlan0 up
tc qdisc add dev vxlan0 ingress
ethtool -K eth0 hw-tc-offload on
tc filter add dev vxlan0 protocol ip parent ffff: flower enc_key_id 102 \
   enc_dst_port 4789 src_ip 3.4.5.6 skip_sw action [...]

while this must NOT match:

ip link add vxlan0 type vxlan dstport 4789 dev eth0 external
ip link set dev vxlan0 up
tc qdisc add dev eth0 ingress
ethtool -K eth0 hw-tc-offload on
tc filter add dev eth0 protocol ip parent ffff: flower enc_key_id 102 \
   enc_dst_port 4789 src_ip 3.4.5.6 skip_sw action [...]

 Jiri


Re: [PATCH net-next 2/6] bpf: add meta pointer for direct access

2017-09-27 Thread Jesper Dangaard Brouer
On Wed, 27 Sep 2017 06:35:40 -0700
John Fastabend  wrote:

> On 09/27/2017 02:26 AM, Jesper Dangaard Brouer wrote:
> > On Tue, 26 Sep 2017 21:58:53 +0200
> > Daniel Borkmann  wrote:
> >   
> >> On 09/26/2017 09:13 PM, Jesper Dangaard Brouer wrote:
> >> [...]  
> >>> I'm currently implementing a cpumap type, that transfers raw XDP frames
> >>> to another CPU, and the SKB is allocated on the remote CPU.  (It
> >>> actually works extremely well).
> >>
> >> Meaning you let all the XDP_PASS packets get processed on a
> >> different CPU, so you can reserve the whole CPU just for
> >> prefiltering, right?   
> > 
> > Yes, exactly.  Except I use the XDP_REDIRECT action to steer packets.
> > The trick is using the map-flush point, to transfer packets in bulk to
> > the remote CPU (single call IPC is too slow), but at the same time
> > flush single packets if NAPI didn't see a bulk.
> >   
> >> Do you have some numbers to share at this point, just curious when
> >> you mention it works extremely well.  
> > 
> > Sure... I've done a lot of benchmarking on this patchset ;-)
> > I have a benchmark program called xdp_redirect_cpu [1][2], that collect
> > stats via tracepoints (atm I'm limiting bulking to 8 packets, and have
> > tracepoints at bulk spots, to amortize tracepoint cost 25ns/8=3.125ns)
> > 
> >  [1] 
> > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_kern.c
> >  [2] 
> > https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/samples/bpf/xdp_redirect_cpu_user.c
> > 
> > Here I'm installing a DDoS program that drops UDP port 9 (pktgen
> > packets) on RX CPU=0.  I'm forcing my netperf to hit the same CPU, that
> > the 11.9Mpps DDoS attack is hitting.
> > 
> > Running XDP/eBPF prog_num:4
> > XDP-cpumap  CPU:to  ppsdrop-ppsextra-info
> > XDP-RX  0   12,030,471 11,966,982  0  
> > XDP-RX  total   12,030,471 11,966,982 
> > cpumap-enqueue0:2   63,488 0   0  
> > cpumap-enqueue  sum:2   63,488 0   0  
> > cpumap_kthread  2   63,488 0   3  time_exceed
> > cpumap_kthread  total   63,488 0   0  
> > redirect_errtotal   0  0  
> > 
> > $ netperf -H 172.16.0.2 -t TCP_CRR  -l 10 -D1 -T5,5 -- -r 1024,1024
> > Local /Remote
> > Socket Size   Request  Resp.   Elapsed  Trans.
> > Send   Recv   Size SizeTime Rate 
> > bytes  Bytes  bytesbytes   secs.per sec   
> > 
> > 16384  87380  1024 102410.0012735.97   
> > 16384  87380 
> > 
> > The netperf TCP_CRR performance is the same, without XDP loaded.
> >   
> 
> Just curious could you also try this with RPS enabled (or does this have
> RPS enabled). RPS should effectively do the same thing but higher in the
> stack. I'm curious what the delta would be. Might be another interesting
> case and fairly easy to setup if you already have the above scripts.

Yes, I'm essentially competing with RPS, thus such a comparison is very
relevant...

This is only a 6 CPU system. Allocate 2 CPUs to RPS receive and let the
other 4 CPUs process packets.

Summary of RPS (Receive Packet Steering) performance:
 * End result is 6.3 Mpps max performance
 * netperf TCP_CRR is 1 trans/sec.
 * Each RX-RPS CPU stall at ~3.2Mpps.

The full test report below with setup:

The mask needed::

 perl -e 'printf "%b\n",0x3C'
 111100

RPS setup::

 sudo sh -c 'echo 32768 > /proc/sys/net/core/rps_sock_flow_entries'

 for N in $(seq 0 5) ; do \
   sudo sh -c "echo 8192 > /sys/class/net/ixgbe1/queues/rx-$N/rps_flow_cnt" ; \
   sudo sh -c "echo 3c > /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus" ; \
   grep -H . /sys/class/net/ixgbe1/queues/rx-$N/rps_cpus ; \
 done

Reduce RX queues to two ::

 ethtool -L ixgbe1 combined 2

IRQ align to CPU numbers::

 $ ~/setup01.sh
 Not root, running with sudo
  --- Disable Ethernet flow-control ---
 rx unmodified, ignoring
 tx unmodified, ignoring
 no pause parameters changed, aborting
 rx unmodified, ignoring
 tx unmodified, ignoring
 no pause parameters changed, aborting
  --- Align IRQs ---
 /proc/irq/54/ixgbe1-TxRx-0/../smp_affinity_list:0
 /proc/irq/55/ixgbe1-TxRx-1/../smp_affinity_list:1
 /proc/irq/56/ixgbe1/../smp_affinity_list:0-5

$ grep -H . /sys/class/net/ixgbe1/queues/rx-*/rps_cpus
/sys/class/net/ixgbe1/queues/rx-0/rps_cpus:3c
/sys/class/net/ixgbe1/queues/rx-1/rps_cpus:3c

Generator is sending: 12,715,782 tx_packets /sec

 ./pktgen_sample04_many_flows.sh -vi ixgbe2 -m 00:1b:21:bb:9a:84 \
-d 172.16.0.2 -t8

$ nstat > /dev/null && sleep 1 && nstat
#kernel
IpInReceives                    6346544            0.0
IpInDelivers                    6346544            0.0
IpOutRequests                   1020               0.0
IcmpOutMsgs                     1020               0.0
IcmpOutDestUnreachs             1020               0.0
IcmpMsgOutType3

Re: [patch net-next v3 06/12] net: mroute: Check if rule is a default rule

2017-09-27 Thread Nikolay Aleksandrov
On 27/09/17 09:23, Jiri Pirko wrote:
> From: Yotam Gigi 
> 
> When the ipmr starts, it adds one default FIB rule that matches all packets
> and sends them to the DEFAULT (multicast) FIB table. A more complex rule
> can be added by the user to specify that for a specific interface, a packet
> should be looked up in either an arbitrary table or according to the l3mdev
> of the interface.
> 
> For drivers willing to offload the ipmr logic into hardware but don't
> want to offload all the FIB rules functionality, provide a function that
> can indicate whether the FIB rule is the default multicast rule, thus only
> one routing table is needed.
> 
> This way, a driver can register to the FIB notification chain, get
> notifications about FIB rules added and trigger some kind of an internal
> abort mechanism when a non default rule is added by the user.
> 
> Signed-off-by: Yotam Gigi 
> Reviewed-by: Ido Schimmel 
> Signed-off-by: Jiri Pirko 
> ---
> v2->v3:
>  - Use the already existing ifdef CONFIG_IP_MROUTE_MULTIPLE_TABLES
> v1->v2:
>  - Update the lastuse MFC entry field too, in addition to packets an bytes.
> ---
>  include/linux/mroute.h |  7 +++
>  net/ipv4/ipmr.c| 12 
>  2 files changed, 19 insertions(+)
> 

Reviewed-by: Nikolay Aleksandrov 



